Python Program to crawl a web page and get most frequent words

Web crawling, combined with text processing, can be used to derive insights or analyze a web page's content. Here's a simple tutorial on creating a Python program to crawl a web page and retrieve the most frequent words.

1. Required Libraries

  • BeautifulSoup4: For parsing HTML and extracting necessary data.
  • requests: To send HTTP requests and fetch webpage content.
  • collections: For the Counter class, which will count word occurrences (part of the Python standard library, so no installation is needed).

You can install the necessary libraries using pip:

pip install beautifulsoup4 requests 

2. The Program

import requests
from bs4 import BeautifulSoup
from collections import Counter
import re


def get_webpage_content(url):
    """Retrieve the content of a webpage."""
    response = requests.get(url)
    response.raise_for_status()  # Raise an error for bad responses
    return response.text


def extract_text_from_html(html_content):
    """Extract textual content from an HTML page."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    return " ".join(re.split(r'\W+', soup.get_text().lower()))


def get_most_frequent_words(text, top_n=10):
    """Get the most frequent words from a block of text."""
    words = text.split()
    counter = Counter(words)
    return counter.most_common(top_n)


if __name__ == "__main__":
    url = input("Enter the URL of the webpage: ")
    content = get_webpage_content(url)
    text = extract_text_from_html(content)
    common_words = get_most_frequent_words(text)
    print(f"\nThe {len(common_words)} most common words in the webpage are:")
    for word, freq in common_words:
        print(f"{word}: {freq}")

3. How the Program Works

  1. get_webpage_content: Fetches the HTML content of a webpage.
  2. extract_text_from_html: Uses BeautifulSoup to remove any script or style elements and then extracts the visible text. The text is converted to lowercase, and runs of non-word characters are split out and rejoined with single spaces, ensuring uniform tokens.
  3. get_most_frequent_words: Takes the cleaned text, splits it into words, and counts the occurrences of each word using the Counter class. It then returns the top n words.
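To see the cleaning and counting steps in isolation, you can run them on an inline string rather than a live page (the sample sentence below is made up purely for illustration):

```python
import re
from collections import Counter

# Stand-in for text already pulled out of a page with BeautifulSoup
raw = "Crawl the Web. Count the words; crawl the Web again!"

# Same normalization as in the program: lowercase, split on non-word runs
text = " ".join(re.split(r"\W+", raw.lower()))

def get_most_frequent_words(text, top_n=10):
    """Count words and return the top_n most common ones."""
    counter = Counter(text.split())
    return counter.most_common(top_n)

print(get_most_frequent_words(text, top_n=3))
# [('the', 3), ('crawl', 2), ('web', 2)]
```

Note that Counter.most_common breaks ties by insertion order, which is why "crawl" appears before "web" even though both occur twice.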

4. Running the Program

  • When you run the program, you'll be prompted to enter a webpage URL.
  • The program will fetch the content, process the text, and then display the most frequent words.
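When you point the program at real sites, two common snags are requests hanging indefinitely and servers rejecting the default User-Agent. A variant of get_webpage_content that addresses both might look like this (the header string and the 10-second timeout are arbitrary illustrative choices, not requirements):

```python
import requests

def get_webpage_content(url, timeout=10):
    """Fetch a page with a timeout and a browser-like User-Agent.

    Both the User-Agent value and the default timeout below are
    illustrative; tune them for your use case.
    """
    headers = {"User-Agent": "Mozilla/5.0 (compatible; word-counter/0.1)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # Raise an error for bad responses
    return response.text
```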

Conclusion

This is a basic web crawling and text analysis program. There are many potential enhancements for a more robust solution, including handling non-English web pages, filtering out stop words, or diving deeper into Natural Language Processing (NLP). As a starting point, though, this tutorial offers an introduction to some foundational concepts.
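As a sketch of one of the enhancements mentioned above, stop-word filtering can be folded into get_most_frequent_words. The stop-word set below is a tiny hand-picked sample for illustration; a real project would use a fuller list (for example, one shipped with NLTK):

```python
from collections import Counter

# Tiny illustrative English stop-word set -- not exhaustive
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def get_most_frequent_words(text, top_n=10, stop_words=STOP_WORDS):
    """Count words, ignoring any that appear in stop_words."""
    words = [w for w in text.split() if w not in stop_words]
    return Counter(words).most_common(top_n)

print(get_most_frequent_words("the cat and the dog chase the cat", top_n=2))
# [('cat', 2), ('dog', 1)]
```

With filtering in place, high-frequency function words no longer crowd out the content words that actually characterize the page.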

