Python Forum
Same Data Showing Several Times With Beautifulsoup Query
#1
Hi there,

I have the following Python code:

import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import xlrd
import re

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content, 'lxml')

BBMF_2022 = []
#BBMF_elem = soup3.find_all('a', string=re.compile(r'between|Flypast'))
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    print(li1)
    #print(li2)
    #BBMF_2022.append(li1)

#check if links are in dataframe
#df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
#df
The issue I have is that when I run the code, the data for the 15 entries from May 28th to May 29th is printed several times. I am not sure why that is the case. Could someone suggest the reason why, and tell me what I need to change in the code so that the data is printed only once and not several times? I am trying to scrape data from the website, keeping the entries that contain the word 'between' or 'Flypast'.
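A likely explanation (an assumption based on the code above, not verified against the page markup) is that several of the matching <a> tags share the same parent element, so item.find_parent().text returns the same combined block of text once per matching link. A minimal sketch that keeps only the first occurrence of each parent block:

import re
import requests
from bs4 import BeautifulSoup

res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content, 'lxml')

seen = set()  # parent texts that have already been printed
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    if li1 not in seen:  # only print a parent block the first time it appears
        seen.add(li1)
        print(li1)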

When I use the following piece of code instead:

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    #print(li1)
    #print(li2)
    BBMF_2022.append(li1)

df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
df


The first entry, for the 28th May, is printed in the DataFrame 15 times, instead of the 15 separate entries I mentioned before.
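If the repeated rows really are exact copies of the same parent text (again an assumption), one minimal sketch is to drop the duplicates before displaying the DataFrame; the list below is a hypothetical stand-in for the BBMF_2022 list built above:

import pandas as pd

# hypothetical stand-in for the BBMF_2022 list filled in the loop above
BBMF_2022 = ["28th May - Flypast somewhere between two venues"] * 3

df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
df = df.drop_duplicates().reset_index(drop=True)  # keep each scraped block once
print(df)  # one row instead of three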

Any help would be much appreciated.

Best Regards

Eddie Winch ))
#2
You are using a redirected URL; instead use: https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm

This code will get all of the data and save it as a JSON file, without any filtering. You can add filters, and any other data you need.
import requests
from bs4 import BeautifulSoup
import os
import json
import sys


class airshowdata:
    def __init__(self):
        self.airshow_details = {}
        self.cd = CreateDict()
        self.jsonfile = 'airshow.json'

    def get_links(self):
        url = 'https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm'
        res3 = requests.get(url)
        if res3.status_code == 200:
            soup3 = BeautifulSoup(res3.content, 'lxml')
        else:
            print(f"Cannot load page {url}")
            sys.exit(-1)
        links = soup3.find_all('a')
        for link in links:
            anode = self.cd.add_node(self.airshow_details, link.text.strip())
            self.cd.add_cell(anode, 'url', link.get('href'))
        with open(self.jsonfile, 'w') as fp:
            json.dump(self.airshow_details, fp)
        # following not needed and can be removed (displays dictionary contents)
        self.cd.display_dict(self.airshow_details)


class CreateDict:
    """
    CreateDict.py - Contains methods to simplify node and cell creation
    within a dictionary

    Usage:
        new_dict(dictname) - Creates a new dictionary instance with the name
            contained in dictname
        add_node(parent, nodename) - Creates a new node (nested dictionary)
            named in nodename, in parent dictionary.
        add_cell(nodename, cellname, value) - Creates a leaf node within node
            named in nodename, with a cell name of cellname, and value of value.
        display_dict(dictname) - Recursively displays a nested dictionary.

    Requirements:
        Python standard library: os

    Author: Larz60+ -- May 2019.
    """
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

    def new_dict(self, dictname):
        setattr(self, dictname, {})

    def add_node(self, parent, nodename):
        node = parent[nodename] = {}
        return node

    def add_cell(self, nodename, cellname, value):
        cell = nodename[cellname] = value
        return cell

    def display_dict(self, dictname, level=0):
        indent = " " * (4 * level)
        for key, value in dictname.items():
            if isinstance(value, dict):
                print(f'\n{indent}{key}')
                level += 1
                self.display_dict(value, level)
            else:
                print(f'{indent}{key}: {value}')
        if level > 0:
            level -= 1


def main():
    airs = airshowdata()
    airs.get_links()


if __name__ == '__main__':
    main()
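As an illustration of the filtering mentioned above, here is a hedged sketch, assuming airshow.json has been written by the script with each link's text as the key and its href stored under 'url', that keeps only the entries containing 'between' or 'Flypast':

import json
import re

# a sketch: filter the saved link dictionary for the entries Eddie is after,
# assuming airshow.json was produced by the script above
with open('airshow.json') as fp:
    airshow_details = json.load(fp)

pattern = re.compile(r'between|Flypast')
filtered = {text: info for text, info in airshow_details.items() if pattern.search(text)}

for text, info in filtered.items():
    print(text, info.get('url'))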
#3
Many thanks for that code, Larz60+, it is very much appreciated; thank you for taking the time to type it. I chose the web.archive link because the data is from a week ago; the 21st May data was removed from the website the other day.

Does anyone have any idea how I can change my code to solve the issue I am having with it?

Any help would be very much appreciated.

Regards

Eddie Winch ))

