Vikas Gulia
🕸️ Web Scraping in Python: A Practical Guide for Data Scientists

"Data is the new oil, and web scraping is one of the drills."

Whether you’re gathering financial data, tracking competitor prices, or building datasets for machine learning projects, web scraping is a powerful tool to extract information from websites automatically.

In this blog post, we’ll explore:

  • What web scraping is
  • How it works
  • Legal and ethical considerations
  • Key Python tools for scraping
  • A complete scraping project using requests, BeautifulSoup, and pandas
  • Bonus: Scraping dynamic websites using Selenium

✅ What is Web Scraping?

Web scraping is the automated process of extracting data from websites. Think of it as teaching Python to browse the web, read pages, and pick out the data you're interested in.


⚖️ Is Web Scraping Legal?

Scraping publicly available data for personal, educational, or research purposes is usually okay. However:

  • Always check the website’s robots.txt file (www.example.com/robots.txt)
  • Read the Terms of Service
  • Avoid overloading servers with too many requests (use time delays)
  • Never scrape private or paywalled content without permission
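
Python's standard library can even help with the first point: `urllib.robotparser` reads a robots.txt policy and tells you whether a given URL may be fetched. Here's a minimal sketch; the rules string below is an invented example, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt (invented for illustration)
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```

For a live site you would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`, then check `can_fetch()` before each request.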

🧰 Popular Python Libraries for Web Scraping

| Library | Purpose |
| --- | --- |
| requests | Send HTTP requests |
| BeautifulSoup | Parse and extract data from HTML |
| lxml | A fast HTML/XML parser |
| pandas | Organize and analyze scraped data |
| Selenium | Drive a real browser for dynamic, JavaScript-heavy sites |
| playwright | A modern alternative to Selenium |

🧪 Step-by-Step Web Scraping Example

Let’s scrape quotes from http://quotes.toscrape.com — a beginner-friendly practice site.

🛠️ Step 1: Install Required Libraries

```bash
pip install requests beautifulsoup4 pandas
```

🧾 Step 2: Send a Request and Parse HTML

```python
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/page/1/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)  # Output: Quotes to Scrape
```

🧮 Step 3: Extract the Quotes and Authors

```python
quotes = []
authors = []

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text.strip()
    author = quote.find("small", class_="author").text.strip()
    quotes.append(text)
    authors.append(author)

# Print a sample
for i in range(3):
    print(f"{quotes[i]} - {authors[i]}")
```

📊 Step 4: Store Data Using pandas

```python
import pandas as pd

df = pd.DataFrame({
    "Quote": quotes,
    "Author": authors
})
print(df.head())

# Optional: save to CSV
df.to_csv("quotes.csv", index=False)
```

🔁 Scrape Multiple Pages

```python
all_quotes = []
all_authors = []

for page in range(1, 6):  # First 5 pages
    url = f"http://quotes.toscrape.com/page/{page}/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())

df = pd.DataFrame({"Quote": all_quotes, "Author": all_authors})
df.to_csv("all_quotes.csv", index=False)
```

🔄 Bonus: Scraping JavaScript-Rendered Sites using Selenium

Some sites render their content with JavaScript after the initial page load. Since requests only fetches the raw HTML, it never sees that data — you need a real browser, which Selenium can drive for you.

🛠️ Install Selenium & WebDriver

```bash
pip install selenium
```

Download the appropriate ChromeDriver from https://chromedriver.chromium.org/downloads and add it to your system PATH. (With recent Selenium releases, 4.6 and later, the bundled Selenium Manager downloads a matching driver automatically, so this manual step is often unnecessary.)

🌐 Selenium Example

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

service = Service("chromedriver")  # Path to your ChromeDriver
driver = webdriver.Chrome(service=service)
driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # Wait for JS to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for quote in soup.find_all("div", class_="quote"):
    print(quote.find("span", class_="text").text.strip())
```

🧠 Best Practices for Web Scraping

  • ✅ Use headers to mimic a browser:

```python
headers = {"User-Agent": "Mozilla/5.0"}
requests.get(url, headers=headers)
```
  • ✅ Add delays between requests using time.sleep()
  • ✅ Handle exceptions and errors gracefully
  • ✅ Respect robots.txt and terms of use
  • ✅ Use proxies or rotate IPs for large-scale scraping
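
Putting a few of these practices together, here is a minimal sketch of a fetch helper with retries and exponential backoff. The name `fetch_with_retries` and the injected `getter` callable are my own inventions for illustration (passing the getter in keeps the sketch testable without a network); in real code `getter` would be something like `lambda u: requests.get(u, headers=headers, timeout=10)`.

```python
import time

def fetch_with_retries(url, getter, max_retries=3, base_delay=1.0):
    """Try getter(url) up to max_retries times, sleeping longer after each failure."""
    for attempt in range(max_retries):
        try:
            return getter(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Combined with a polite per-request delay and a check of robots.txt, this covers most of the checklist above.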

📦 Real-World Use Cases

  • 📰 News Monitoring (e.g., scraping articles for sentiment analysis)
  • 🛒 E-commerce Price Tracking
  • 📊 Competitor Research
  • 🧠 Training Datasets for NLP/ML projects
  • 🏢 Job Listings and Market Analysis

📌 Final Thoughts

Web scraping is a foundational tool in a data scientist’s arsenal. Mastering it opens up endless possibilities — from building custom datasets to powering AI models with real-world information.

“If data is fuel, then web scraping is how you build your own pipeline.”
