Today is day 77 of my #100DaysOfCode and #Python learning journey. As usual, I spent some hours learning about pandas data visualization on DataCamp.
For the rest of the time, I kept working on my first project (news scraping). Today I scraped news from Gorkha Patra Online, and I was able to scrape a few different pages. I need to write different code for each news section, such as national, economics, business, and province, so scraping a single news portal takes a lot of time. Below is the code I used to scrape news from the national section.
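Since each section lives under its own URL slug, one small helper can at least build the section URLs in one place instead of hard-coding them per script. This is a minimal sketch; the slugs below are assumptions based on the site's URL pattern, and each section may still need its own parsing code if the page layouts differ.

```python
# Build section URLs from category slugs (slugs are assumptions
# based on the pattern https://gorkhapatraonline.com/<section>).
BASE = "https://gorkhapatraonline.com"

def category_url(slug):
    """Return the listing-page URL for a given news section."""
    return f"{BASE}/{slug}"

for slug in ["national", "economy", "business", "province"]:
    print(category_url(slug))
```

With this, the scraping loop below could be wrapped in a function that takes the section URL as a parameter instead of being duplicated per section.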
Python code with BeautifulSoup
First, I import the required dependencies:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

import requests
import urllib3
from bs4 import BeautifulSoup as BS

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```
The URL of the required section is given below:
```python
url = "https://gorkhapatraonline.com/national"
```
Parse the title, author, date, description, and content of each news item:
```python
# Set up an HTTP client and parse the section listing page
http = urllib3.PoolManager()
http.addheaders = [('User-agent', 'Mozilla/61.0')]
page = http.request('GET', url)
soup = BS(page.data, 'html5lib')

ndict = {"Title": [], "URL": [], "Date": [], "Author": [], "Author URL": [],
         "Content": [], "Category": [], "Description": []}

category = url.split("/")[-1]

for content in soup.select(".business"):
    newsurl = content.find('a').get('href')
    trend2 = content.select_one(".trending2")
    title = trend2.find("p").text.strip()

    # Author and date share a single <small> tag,
    # separated by a run of non-breaking spaces and a newline
    byline = trend2.find('small').text.strip()
    author = byline.split('\xa0\xa0\xa0\xa0\n')[0].strip()
    date = byline.split('\xa0\xa0\xa0\xa0\n')[1].strip()

    description = trend2.select_one(".description").text.strip()

    # Follow the article link to get the author URL and full text
    web_page = http.request('GET', newsurl)
    news_soup = BS(web_page.data, 'html5lib')
    author_url = news_soup.select_one(".post-author-name").find("a").get("href")

    news_content = ""
    for p in news_soup.select_one(".newstext").findAll("p"):
        news_content += "\n" + p.text

    ndict["Title"].append(title)
    ndict["URL"].append(newsurl)
    ndict["Date"].append(date)
    ndict["Author"].append(author)
    ndict["Author URL"].append(author_url)
    ndict["Content"].append(news_content)
    ndict["Category"].append(category)
    ndict["Description"].append(description)

    print(f"""
    Title: {title},
    URL: {newsurl}
    Date: {date},
    Author: {author},
    Category: {category},
    Author URL: {author_url},
    Description: {description},
    Content: {news_content}
    """)
```
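Once `ndict` is filled, the column lists all have the same length, so pandas (already imported above) can turn it straight into a DataFrame for saving or analysis. This is a minimal sketch; the single row below is hypothetical stand-in data, not real scraped output, and the CSV filename is my own choice.

```python
import pandas as pd

# A hypothetical single scraped row standing in for real data;
# in the scraper, ndict is filled by the loop over soup.select(".business")
ndict = {
    "Title": ["Sample headline"],
    "URL": ["https://gorkhapatraonline.com/national/sample"],
    "Date": ["2021-03-16"],
    "Author": ["Reporter"],
    "Author URL": ["https://gorkhapatraonline.com/author/reporter"],
    "Content": ["Body text of the article."],
    "Category": ["national"],
    "Description": ["Short summary."],
}

df = pd.DataFrame(ndict)
df.to_csv("national_news.csv", index=False)
print(df.shape)  # one row, eight columns
```

Saving each section to its own CSV keeps the per-section scripts independent while still letting pandas concatenate them later.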
Day 77 Of #100DaysOfCode and #Python
Worked On My First Project (Scrapping news of gorkhapatraonline using beautifulSoup) #WomenWhoCode #CodeNewbie #100DaysOfCode #DEVCommunity
— Durga Pokharel (@mathdurga) March 16, 2021