Topic Modeling with NLTK

Last Updated: 23 Jul, 2025

Topic modeling is a way to automatically find hidden themes, or topics, in a large collection of text. NLTK handles the first important step: cleaning and preparing the text. It lets you tokenize words, remove stopwords, and lemmatize or stem words to reduce them to a base form. Once the text is ready, you can pass it to a topic modeling tool such as Gensim's LDA to discover the main topics.

[Figure: Topic modeling]

Implementation

Step 1: Install and Download Necessary Libraries

This step imports the libraries needed for text processing and topic modeling and downloads the NLTK resources used for text cleaning and normalization: tokenizer models, stopword lists and the WordNet data behind the lemmatizer. The pandas, nltk and gensim packages must already be installed in your environment.

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models

# One-time downloads of the NLTK data used in the later steps
nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # tokenizer tables used by newer NLTK releases
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # WordNet data for the lemmatizer
```

Step 2: Load Dataset

You can download the Sentiment140 dataset, which contains 1.6 million tweets, from Kaggle. This step loads the CSV file, keeps only the text column and takes a random sample of 5,000 tweets. Sampling makes processing and testing faster, and a fixed random_state makes the sample reproducible.

```python
# The Sentiment140 CSV has no header row, so column names are supplied here
df = pd.read_csv('training.1600000.processed.noemoticon.csv.zip',
                 encoding='latin-1',
                 names=['target', 'ids', 'date', 'flag', 'user', 'text'])

# Keep only the tweet text and draw a reproducible random sample
df = df[['text']].sample(5000, random_state=42)
print(df.head(3))
```

Step 3: Preprocess the Data

This step defines a function that lowercases and tokenizes the text, drops non-alphabetic tokens and stopwords, and lemmatizes each remaining word. It then applies the function to every tweet, creating a new column of clean token lists for further analysis.
```python
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase first so stopword matching is case-insensitive
    tokens = word_tokenize(text.lower())
    # Drop punctuation, numbers and mixed tokens such as URLs fragments
    tokens = [w for w in tokens if w.isalpha()]
    # Remove common English stopwords
    tokens = [w for w in tokens if w not in stop_words]
    # Reduce each word to its lemma (treated as a noun by default)
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return tokens

df['tokens'] = df['text'].apply(preprocess)
print(df['tokens'].head(3))
```

Step 4: Create a Dictionary and Corpus for Gensim LDA

This step builds a Gensim dictionary that maps each unique token to an integer ID and converts each tweet into a bag-of-words representation. This is the input format Gensim needs to train topic models such as LDA.

```python
# Map every unique token to an integer ID
dictionary = corpora.Dictionary(df['tokens'])
# Represent each tweet as a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]
print("Sample bag-of-words for first doc:", corpus[0])
```

Output:

Sample bag-of-words for first doc: [(0, 1), (1, 1), (2, 1), (3, 1)]

Each pair is (token ID, count), so the first tweet contains four distinct tokens that each appear once.

Step 5: Train LDA Model

This step trains a Gensim LDA topic model on the prepared corpus and dictionary, specifying the number of topics and the number of training passes. It then prints the top words for each topic, which helps you interpret the themes found in the tweets.

```python
lda_model = models.LdaModel(corpus=corpus,
                            id2word=dictionary,
                            num_topics=5,      # number of topics to extract
                            passes=10,         # full passes over the corpus
                            random_state=42)   # fixed seed for reproducibility

# Show the five highest-weighted words of each topic
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
```

Step 6: Get Dominant Topic for Each Tweet

This step defines a function that finds the most likely topic for a tweet from its bag-of-words representation. It applies the function to all tweets and adds a new column with the dominant topic label, which lets you group the data by theme.
```python
def get_topic(doc_bow):
    # List of (topic_id, probability) pairs for this document
    topics = lda_model.get_document_topics(doc_bow)
    # Sort by probability, highest first
    topics = sorted(topics, key=lambda x: -x[1])
    # Return the top topic ID, or None if no topic was returned
    return topics[0][0] if topics else None

df['topic'] = [get_topic(bow) for bow in corpus]
print(df[['text', 'topic']].head(10))
```
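The dominant-topic labels from Step 6 are easier to interpret once they are tallied per topic. The following is a minimal, dependency-free sketch of that idea. The `doc_topics` values are made-up distributions shaped like the (topic_id, probability) pairs that `get_document_topics` returns, and `dominant_topic` is a hypothetical helper name, not part of Gensim.

```python
from collections import Counter

# Hypothetical per-tweet topic distributions (illustrative values only),
# shaped like the (topic_id, probability) pairs from get_document_topics.
doc_topics = [
    [(0, 0.10), (2, 0.85), (4, 0.05)],
    [(1, 0.60), (3, 0.40)],
    [(2, 0.55), (0, 0.45)],
    [],  # an empty distribution, mirroring the `else None` branch in Step 6
]

def dominant_topic(topics):
    """Return the topic id with the highest probability, or None if empty."""
    if not topics:
        return None
    return max(topics, key=lambda pair: pair[1])[0]

labels = [dominant_topic(t) for t in doc_topics]

# Count how many documents fall under each topic, skipping unlabeled ones
topic_counts = Counter(label for label in labels if label is not None)

print(labels)
print(topic_counts.most_common())
```

Using `max` avoids sorting the whole list when only the top entry is needed. On the real DataFrame, `df['topic'].value_counts()` should give the equivalent tally.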
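The (token ID, count) pairs printed in Step 4 can look opaque at first. The sketch below imitates the idea in plain Python on two toy token lists (not real tweets) so the mapping is visible. It is an illustration under simplifying assumptions, not Gensim's implementation: `token2id` and `to_bow` are hypothetical names, and Gensim may assign IDs in a different order.

```python
from collections import Counter

# Toy token lists standing in for df['tokens']; not taken from the dataset
docs = [
    ["love", "new", "phone"],
    ["phone", "battery", "die", "phone"],
]

# Assign each unique token an integer id in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]

print(token2id)
print(bows)
```

In the second toy document, "phone" appears twice, so its pair carries a count of 2 while the other tokens have a count of 1. Word order is discarded, which is why this representation is called a bag of words.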