
Amal Shaji

Originally published at amalshaji.wtf

Building the classifier - Part I (Live tweet sentiment analysis)

In this series, I will show how to build a sentiment analysis app that can analyze tweets for any hashtag. The series is divided into three parts:

  • Building the classifier
  • Building the backend
  • Building the frontend

In the final post, we'll bring everything together to make the app. We'll use tools like NLTK, Docker, Streamlit, and FastAPI, and a link to the code will be provided.

Final product

sentwitter

Building the classifier

We'll be using a pre-trained model that I trained using open-source code. This is not a SOTA (state-of-the-art) model, but it should be fine for our task.

Let's begin by making a project directory.

    mkdir sentwitter && cd sentwitter
    mkdir backend
    mkdir frontend

Download the trained model into the backend/models directory:

    mkdir -p backend/models
    wget https://raw.githubusercontent.com/amalshaji/sentwitter/master/backend/models/sentiment_model.pickle
    mv sentiment_model.pickle backend/models/

Install the required libraries:

    python3 -m pip install nltk
    python3 -m nltk.downloader punkt
    python3 -m nltk.downloader wordnet
    python3 -m nltk.downloader stopwords
    python3 -m nltk.downloader averaged_perceptron_tagger

Let's write a function to preprocess the input (a tweet). It strips links and @mentions, lemmatizes each token according to its part-of-speech tag, and drops punctuation and stop words.

    # backend/classify.py
    import re
    import string

    from nltk.stem import WordNetLemmatizer
    from nltk.tag import pos_tag


    def remove_noise(tweet_tokens, stop_words=()):
        lemmatizer = WordNetLemmatizer()
        cleaned_tokens = []
        for token, tag in pos_tag(tweet_tokens):
            # Strip links and @mentions
            token = re.sub(
                r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
                r"(?:%[0-9a-fA-F][0-9a-fA-F]))+",
                "",
                token,
            )
            token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

            # Map the POS tag to the part of speech the lemmatizer expects
            if tag.startswith("NN"):
                pos = "n"
            elif tag.startswith("VB"):
                pos = "v"
            else:
                pos = "a"
            token = lemmatizer.lemmatize(token, pos)

            # Drop empty tokens, punctuation, and stop words
            if (
                len(token) > 0
                and token not in string.punctuation
                and token.lower() not in stop_words
            ):
                cleaned_tokens.append(token.lower())
        return cleaned_tokens
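To see what the two regular expressions in remove_noise do on their own, here is a minimal sketch that uses only the standard library. The helper name strip_links_and_mentions and the sample tokens are made up for illustration. Note that the patterns run on individual tokens, so they only fire when the tokenizer keeps a URL or mention intact as a single token.

```python
import re

# Same patterns as in remove_noise, written as raw strings.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
    r"(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
MENTION_PATTERN = re.compile(r"(@[A-Za-z0-9_]+)")


def strip_links_and_mentions(token):
    """Return the token with any URL or @mention removed (hypothetical helper)."""
    token = URL_PATTERN.sub("", token)
    return MENTION_PATTERN.sub("", token)


# Hypothetical tokens; a real tokenizer may split URLs and mentions differently.
sample_tokens = ["@amal", "check", "https://example.com/post", "awesome"]
cleaned = [strip_links_and_mentions(t) for t in sample_tokens]
print(cleaned)  # ['', 'check', '', 'awesome']
```

The empty strings left behind are what the `len(token) > 0` check in remove_noise filters out.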

Write a helper function to load the model.

    # backend/utils.py
    import pickle


    def load_model():
        # The path is relative to the working directory, so run this from backend/
        with open("./models/sentiment_model.pickle", "rb") as f:
            classifier = pickle.load(f)
        return classifier
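Because load_model opens a path relative to the current working directory, it only works when the script is started from backend/. One possible variation, sketched below, takes the path as an argument. The name load_pickled_model and the dummy object are assumptions for illustration; the dummy stands in for the real classifier so the round trip can run without the model file.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled_model(path):
    """Load a pickled object from `path`, closing the file even on errors."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Stand-in for the real classifier: any picklable object round-trips the same way.
dummy_model = {"labels": ["Positive", "Negative"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "sentiment_model.pickle"
    with model_path.open("wb") as f:
        pickle.dump(dummy_model, f)
    restored = load_pickled_model(model_path)

print(restored == dummy_model)  # True
```

In the project itself, the call would presumably look like load_pickled_model("backend/models/sentiment_model.pickle") when run from the sentwitter directory.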

Test the classifier:

    # backend/test.py
    from nltk.tokenize import word_tokenize

    from classify import remove_noise
    from utils import load_model

    model = load_model()

    while True:
        _input = input("Enter a sample sentence: ")
        custom_tokens = remove_noise(word_tokenize(_input))
        # The classifier expects a dict that maps each token to True
        result = model.classify({token: True for token in custom_tokens})
        print(f"{_input}: {result}")
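The test script passes the classifier a dict that maps every cleaned token to True. The sketch below isolates that step. KeywordModel is a hypothetical stand-in with the same classify(features) method as the pickled model, so the example runs without the model file; it is not how the real classifier decides.

```python
def tokens_to_features(tokens):
    """Map each token to True, the bag-of-words dict passed to model.classify."""
    return {token: True for token in tokens}


def classify_tokens(model, tokens):
    """Classify cleaned tokens with any object exposing classify(featureset)."""
    return model.classify(tokens_to_features(tokens))


class KeywordModel:
    """Hypothetical stand-in for the pickled classifier, for illustration only."""

    def classify(self, features):
        return "Negative" if "hate" in features else "Positive"


features = tokens_to_features(["i", "hate", "you"])
print(features)  # {'i': True, 'hate': True, 'you': True}
print(classify_tokens(KeywordModel(), ["i", "hate", "you"]))  # Negative
```

Since the features are dict keys, repeated tokens collapse into one entry and word order is lost, which is one reason a model like this misses context.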

Output

    ❯ python .\test.py
    Enter a sample sentence: I am awesome
    I am awesome: Positive
    Enter a sample sentence: I hate you
    I hate you: Negative
    Enter a sample sentence: I have a gun
    I have a gun: Positive
    Enter a sample sentence: I like your hair
    I like your hair: Negative

This isn't the best model 😂: the gun sentence is labeled Positive and the hair sentence Negative. Feel free to build your own model or try a heavier one using transformers (Hugging Face).

In the next post, we'll be building the backend to serve predictions through an API.

References

  • I used the open-source training code a long time ago for another project, and I can no longer find the source. If you find it, or you are the author, please comment with the link.
  • Article Cover by MorningBrew
  • Series Cover by Ravi Sharma
