Newspaper: Article scraping & curation (Python)

newspaper is a popular Python library for scraping and curating articles. Given an article URL or a news website's main URL, it downloads the pages and extracts details such as the title, text, authors, publishing date, and images.

Here's a step-by-step guide on how to use the newspaper library:

  1. Installation: First, install the newspaper3k package, which is the Python 3 release of the library and is imported as newspaper. You can do this using pip:

    pip install newspaper3k 
  2. Basic Usage:

    a. Scraping an Article:

    from newspaper import Article

    # Specify the article URL
    url = 'https://example.com/some-news-article'

    # Create an Article object
    article = Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Print the article's title and text
    print(article.title)
    print(article.text)

    b. Extracting Additional Information:

    # Print the article's authors
    print(article.authors)

    # Print the article's publishing date
    print(article.publish_date)

    # Print the top image of the article
    print(article.top_image)

    # Print all the images within the article
    print(article.images)

    c. Natural Language Processing (NLP):

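    # Apply NLP on the article. This must come after download() and parse(),
    # and it needs NLTK's 'punkt' tokenizer data to be installed
    article.nlp()

    # Print the article's summary
    print(article.summary)

    # Print the article's keywords
    print(article.keywords)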
  3. Scraping a Newspaper:

    If you wish to extract articles from an entire website or newspaper, you can do so with the following:

    from newspaper import build

    # Specify the newspaper's main URL
    paper_url = 'https://example-news-website.com'

    # Create a newspaper (source) object
    paper = build(paper_url)

    # Print the URLs of all the categories in the newspaper
    for category in paper.category_urls():
        print(category)

    # Articles found by build() have not been downloaded yet, so download
    # and parse each one before printing its title
    for article in paper.articles:
        article.download()
        article.parse()
        print(article.title)
  4. Language Support:

    newspaper supports various languages. If you do not specify one, it tries to detect the language itself; to set it explicitly, pass it when creating the Article object:

    article = Article(url, language='en') # 'en' for English 
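
The steps above can be combined into a small helper that processes a handful of articles and turns each into a plain dictionary. This is only a sketch under stated assumptions: the function names `article_to_record` and `collect_records`, the `limit` and `delay_seconds` defaults, and the injectable `sleep` parameter are all hypothetical and not part of newspaper. The code relies only on the attributes and methods shown in this guide (`download()`, `parse()`, `title`, `authors`, `publish_date`, `top_image`, `images`), and error handling for failed downloads is left out.

```python
import itertools
import time


def article_to_record(article):
    """Flatten a parsed article-like object into a plain dictionary."""
    published = article.publish_date
    return {
        "title": article.title,
        "authors": list(article.authors),
        # publish_date can be None when no date could be extracted
        "published": published.isoformat() if published else None,
        "top_image": article.top_image,
        # sorted() accepts a list or a set and gives a stable order
        "images": sorted(article.images),
    }


def collect_records(articles, limit=5, delay_seconds=1.0, sleep=time.sleep):
    """Download, parse and flatten at most `limit` articles.

    A pause of `delay_seconds` is inserted between articles so the
    target site is not hit with back-to-back requests.
    """
    records = []
    for index, article in enumerate(itertools.islice(articles, limit)):
        if index > 0:
            sleep(delay_seconds)
        article.download()
        article.parse()
        records.append(article_to_record(article))
    return records
```

To use it, pass the `articles` list of the object returned by `build()`, for example `collect_records(source.articles, limit=3)`, where `source` stands for that object. The resulting dictionaries contain only strings, lists, and `None`, so they can be written out with `json.dumps` directly.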

Remember that while newspaper is a powerful tool for web scraping, always ensure you're respecting the website's robots.txt and terms of service. Additionally, heavy scraping can cause IP bans, so use the library responsibly.
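
One way to act on the robots.txt advice is to filter URLs through Python's standard `urllib.robotparser` before handing them to newspaper. The sketch below is an illustration, not a feature of newspaper itself: the helper names `load_robots`, `parse_robots`, and `allowed_urls` are hypothetical, and the `"*"` user agent is an assumption you should replace with the agent string your scraper actually sends.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def load_robots(site_url):
    """Fetch and parse the site's robots.txt (performs a network request)."""
    parser = RobotFileParser()
    parser.set_url(urljoin(site_url, "/robots.txt"))
    parser.read()
    return parser


def parse_robots(robots_text):
    """Build a parser from robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that robots.txt permits for the given user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A typical flow would be to call `load_robots()` once per site and pass article URLs through `allowed_urls()` before downloading them. If the site declares a crawl delay, `parser.crawl_delay(user_agent)` returns it (or `None` when none is declared), which can guide how long to pause between requests.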

