DEV Community

Giuseppe Schillaci

Building a Review Scraper with Python using BeautifulSoup and Sentiment Analysis with NLTK

Introduction

In recent years, analyzing online reviews has become an important practice for many businesses. Understanding customer sentiment helps identify areas for improvement and gauge overall customer satisfaction. In this article, we'll explore how to use Python to build a review scraper with BeautifulSoup and analyze the sentiment of the scraped reviews with NLTK.

Creating the Review Scraper with BeautifulSoup

To begin, we used Python with the BeautifulSoup library to extract reviews of a leading Italian company from its Trustpilot review pages. BeautifulSoup parses the HTML markup of a web page and lets us locate the elements of interest by tag or attribute. We looped over the paginated review listing, extracted the title and text of each review, and stored them in a pandas DataFrame for further analysis.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Range of pages to scrape (inclusive)
page_start = 1
page_end = 49

# Collect one dict per review and build the DataFrame once at the end;
# DataFrame.append was removed in pandas 2.0
rows = []

# Loop through the pages
for page_num in range(page_start, page_end + 1):
    # Construct the URL for the current page
    url = f'https://it.trustpilot.com/review/www.companyname.it?page={page_num}'

    # Make an HTTP request to fetch the page content
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        # Use BeautifulSoup to parse the HTML of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all review elements
        reviews = soup.find_all(attrs={"data-review-content": True})

        # Extract the title and text of each review
        for review in reviews:
            title_element = review.find(attrs={"data-service-review-title-typography": True})
            content_element = review.find(attrs={"data-service-review-text-typography": True})

            if title_element and content_element:
                rows.append({"title": title_element.text, "text": content_element.text})
            else:
                print("Title or text element not found.")

# Build the DataFrame with all review data
df = pd.DataFrame(rows, columns=["title", "text"])

# Display the DataFrame (in a notebook)
df
```
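
The attribute-based lookup used by the scraper can be tried offline before fetching any real page. The following is a minimal sketch, assuming a small inline HTML snippet: `SAMPLE_HTML` and the `extract_reviews` helper are hypothetical names introduced here for illustration, and the markup only mimics the attribute names from the scraper, so the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attribute names used in the scraper;
# the structure of the real page may differ and can change over time.
SAMPLE_HTML = """
<article data-review-content="true">
  <h2 data-service-review-title-typography="true">Great service</h2>
  <p data-service-review-text-typography="true">Fast delivery and friendly support.</p>
</article>
<article data-review-content="true">
  <h2 data-service-review-title-typography="true">Title only</h2>
</article>
"""


def extract_reviews(html):
    """Return a list of {"title", "text"} dicts for reviews that have both parts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # attrs={name: True} matches any element that carries the attribute
    for review in soup.find_all(attrs={"data-review-content": True}):
        title = review.find(attrs={"data-service-review-title-typography": True})
        text = review.find(attrs={"data-service-review-text-typography": True})
        # Skip reviews that are missing either the title or the text
        if title is not None and text is not None:
            rows.append({
                "title": title.get_text(strip=True),
                "text": text.get_text(strip=True),
            })
    return rows


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

The second sample review has no text element, so it is skipped, which mirrors the "Title or text element not found." branch of the scraper.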

Review Analysis with NLTK

Once the reviews were extracted, we used the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP). NLTK provides a range of tools for text analysis, including sentiment analysis.

We used NLTK's SentimentIntensityAnalyzer, which implements the VADER model, to assess the sentiment of each review. The analyzer returns a compound score between -1 (most negative) and +1 (most positive); we label a review positive when the score is at least 0.05, negative when it is at most -0.05, and neutral otherwise. This gave us a clear picture of customer sentiment towards the company. Because VADER's lexicon is built for English, scores for reviews written in other languages should be interpreted with caution.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()


def get_sentiment(text):
    """Classify a text as 'positive', 'negative', or 'neutral'."""
    # polarity_scores returns 'neg', 'neu', 'pos', and 'compound' scores
    scores = sid.polarity_scores(text)

    # Determine the sentiment based on the compound score
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'


# Label every review; the 'sentiment' column is used for the charts
df['sentiment'] = df['text'].apply(get_sentiment)
```

Visualizing the Results

Finally, we used the labeled data to create bar and pie charts showing the percentages of positive, negative, and neutral reviews. These charts give a visual summary of the overall sentiment and make it easy to see which category dominates.
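
The percentages behind such charts can also be checked numerically. This is a minimal sketch using only the standard library; `sentiment_percentages` is a hypothetical helper, and `sample_labels` is made-up data standing in for the values of the `sentiment` column.

```python
from collections import Counter


def sentiment_percentages(labels):
    """Map each sentiment label to its share of all labels, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No labels: nothing to report
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical labels for illustration only
sample_labels = ["positive", "positive", "negative", "neutral"]
print(sentiment_percentages(sample_labels))
# {'positive': 50.0, 'negative': 25.0, 'neutral': 25.0}
```

With a DataFrame, the same function could be called as `sentiment_percentages(df['sentiment'])`.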

```python
import matplotlib.pyplot as plt

# Count unique values in the 'sentiment' column
value_counts = df['sentiment'].value_counts()

# Define colors for each category
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'blue'}

# Create a pie chart using the defined colors
plt.pie(
    value_counts,
    labels=value_counts.index,
    colors=[colors[value] for value in value_counts.index],
    autopct='%1.1f%%',
)

# Add title
plt.title('Sentiment Analysis of Reviews for Company XYZ')

# Show the chart
plt.show()
```
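
The same counts can also be drawn as a bar chart. The sketch below is one possible way to do it rather than the code used for the original charts: `plot_sentiment_bar` is a hypothetical helper, the axis label and default title are assumptions, and the counts passed to it are made-up numbers for illustration.

```python
import matplotlib.pyplot as plt

# Same color mapping as the pie chart
COLORS = {"positive": "green", "negative": "red", "neutral": "blue"}


def plot_sentiment_bar(counts, title="Sentiment Analysis of Reviews"):
    """Draw one bar per sentiment label and return the Axes.

    `counts` maps a label to its number of reviews, for example
    df['sentiment'].value_counts().to_dict().
    """
    labels = list(counts)
    values = [counts[label] for label in labels]
    fig, ax = plt.subplots()
    # Fall back to gray for any label without a defined color
    ax.bar(labels, values, color=[COLORS.get(label, "gray") for label in labels])
    ax.set_ylabel("Number of reviews")
    ax.set_title(title)
    return ax


# Hypothetical counts for illustration only
ax = plot_sentiment_bar({"positive": 30, "negative": 12, "neutral": 8})
```

Calling `plt.show()` afterwards displays the figure. Returning the Axes keeps the helper easy to restyle or reuse in a larger figure.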

Conclusion

In this article, we've seen how to use Python with the BeautifulSoup and NLTK libraries to build a review scraper and analyze the sentiment of online reviews. Together, these libraries let us collect the reviews, label each one as positive, negative, or neutral, and visualize the results clearly.

By applying similar techniques, businesses can monitor customer feedback on an ongoing basis and make informed decisions to improve the overall customer experience. Web scraping combined with sentiment analysis is a practical tool for online reputation monitoring and customer relationship management.
