Fetching text from Wikipedia's Infobox in Python

Fetching text from Wikipedia's Infobox requires a mix of web scraping and parsing techniques. In this tutorial, we'll use the Python libraries requests and BeautifulSoup4 to fetch and extract information from a Wikipedia page's Infobox.

Prerequisites:

  • Install the necessary packages:

    pip install requests beautifulsoup4 

Steps:

  • Send a GET Request: Use the requests library to send a GET request to the desired Wikipedia page.

  • Parse the Response: Use BeautifulSoup4 to parse the returned HTML.

  • Locate the Infobox: Identify the table element containing the Infobox.

  • Extract the Desired Data: Extract the data from the Infobox.

Example Code:

Let's fetch the basic information from the Infobox of Python's Wikipedia page:

import requests
from bs4 import BeautifulSoup

def fetch_infobox(wiki_url):
    response = requests.get(wiki_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Wikipedia infoboxes are typically tables with the class 'infobox'
    infobox = soup.find('table', {'class': 'infobox'})
    if not infobox:
        # No infobox on this page; return an empty dict so the caller can still iterate safely
        return {}

    data = {}
    for row in infobox.find_all('tr'):
        # Fetch the header
        header = row.find('th')
        # Fetch the data
        content = row.find('td')
        if header and content:
            # Clean up the text, remove newlines, and strip whitespace
            header_text = header.get_text(separator=" ").replace("\n", "").strip()
            content_text = content.get_text(separator=" ").replace("\n", "").strip()
            data[header_text] = content_text
    return data

wiki_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
infobox_data = fetch_infobox(wiki_url)

for key, value in infobox_data.items():
    print(f"{key}: {value}")
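
Once the dictionary is built, individual fields can be looked up by their header text. The labels depend on the page, so the short snippet below assumes, purely for illustration, that a row labeled "Developer" exists and falls back to a default if it does not:

# "Developer" is only an illustrative label; real header text varies by article,
# so use .get() with a default rather than indexing directly
developer = infobox_data.get("Developer", "not listed")
print("Developer:", developer)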

Notes:

  • The above approach works for a standard Wikipedia Infobox; however, infobox markup varies across pages, so you may need to adjust the selectors for specific articles.
  • Always respect Wikipedia's robots.txt file and usage terms when web scraping. If you're fetching data at a large scale or on a regular basis, consider using the Wikipedia API instead (see the sketch after these notes).
  • Heavy scraping can get your IP temporarily blocked, so scrape judiciously and consider adding delays between requests when fetching multiple pages.
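
If you prefer the API route mentioned above, here is a minimal sketch that requests the rendered HTML of a page through the MediaWiki Action API (action=parse with prop=text) and then reuses the same BeautifulSoup extraction from the main example. The User-Agent string is a placeholder; replace it with your own project name and contact details:

import requests
from bs4 import BeautifulSoup

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_infobox_via_api(page_title):
    # Ask the MediaWiki Action API for the rendered HTML of the page
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }
    # Placeholder User-Agent -- replace with your own project name and contact info
    headers = {"User-Agent": "InfoboxTutorial/0.1 (example script)"}
    response = requests.get(API_URL, params=params, headers=headers)
    response.raise_for_status()
    html = response.json()["parse"]["text"]

    # Reuse the same BeautifulSoup extraction as in the main example
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})
    if not infobox:
        return {}

    data = {}
    for row in infobox.find_all("tr"):
        header = row.find("th")
        content = row.find("td")
        if header and content:
            key = header.get_text(separator=" ").strip()
            value = content.get_text(separator=" ").strip()
            data[key] = value
    return data

print(fetch_infobox_via_api("Python (programming language)"))

Going through the API endpoint avoids downloading the full desktop page chrome and is the access method Wikipedia recommends for automated clients.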
