Building a Concurrent Web Scraper with Python and Selenium

Last updated December 22nd, 2021

This article looks at how to speed up a Python web scraping and crawling script with multithreading via the concurrent.futures module. We'll also break down the script itself and show how to test the parsing functionality with pytest.

After completing this article, you will be able to:

  1. Scrape and crawl websites with Selenium and parse HTML with Beautiful Soup
  2. Set up pytest to test the scraping and parsing functionalities
  3. Execute a web scraper concurrently with the concurrent.futures module
  4. Configure headless mode for ChromeDriver with Selenium

Contents

  Project Setup
  Script Overview
  Testing
  Configure Multithreading
  Conclusion

Project Setup

Clone down the repo if you'd like to follow along. From the command line run the following commands:

$ git clone git@github.com:testdrivenio/concurrent-web-scraping.git
$ cd concurrent-web-scraping
$ python -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt

The above commands may differ depending on your environment.

Install ChromeDriver globally. (We're using version 96.0.4664.45).
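If you don't already have ChromeDriver installed, one way to grab it -- a sketch assuming a Linux machine, the version noted above, and that /usr/local/bin is on your PATH (pre-115 ChromeDriver releases were distributed from chromedriver.storage.googleapis.com) -- looks like this:

$ wget https://chromedriver.storage.googleapis.com/96.0.4664.45/chromedriver_linux64.zip
$ unzip chromedriver_linux64.zip
$ sudo mv chromedriver /usr/local/bin/chromedriver
$ chromedriver --version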

Script Overview

The script makes 20 requests to Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- for information about each article, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.

script.py:

import datetime
import sys

from time import sleep, time

from scrapers.scraper import connect_to_base, get_driver, parse_html, write_to_file


def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")


if __name__ == "__main__":

    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")

Let's start with the main block. After determining whether Chrome should run in headless mode and defining a few variables, the browser is initialized via get_driver() from scrapers/scraper.py:

if __name__ == "__main__":

    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    ########
    # here #
    ########
    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
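get_driver() itself lives in scrapers/scraper.py and isn't reproduced in this article. A minimal sketch of what it might look like, assuming it only wires up ChromeDriver with an optional headless flag:

# scrapers/scraper.py (sketch) -- an assumption of what get_driver() might do
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def get_driver(headless):
    options = Options()
    if headless:
        options.add_argument("--headless")
    # assumes chromedriver is installed globally and available on the PATH
    return webdriver.Chrome(options=options)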

A while loop is then configured to control the flow of the overall scraper.

if __name__ == "__main__":

    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    ########
    # here #
    ########
    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")

Within the loop, run_process() is called, which manages the WebDriver connection and scraping functions.

def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")

In run_process(), the browser instance is passed to connect_to_base().

def run_process(filename, browser):
    ########
    # here #
    ########
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")

This function attempts to connect to Wikipedia and then uses Selenium's explicit wait functionality to ensure the element with id='content' has loaded before continuing.

def connect_to_base(browser):
    base_url = "https://en.wikipedia.org/wiki/Special:Random"
    connection_attempts = 0
    while connection_attempts < 3:
        try:
            browser.get(base_url)
            # wait for table element with id = 'content' to load
            # before returning True
            WebDriverWait(browser, 5).until(
                EC.presence_of_element_located((By.ID, "content"))
            )
            return True
        except Exception as e:
            print(e)
            connection_attempts += 1
            print(f"Error connecting to {base_url}.")
            print(f"Attempt #{connection_attempts}.")
    return False

Review the Selenium docs for more information on explicit wait.

To emulate a human user, sleep(2) is called after the browser has connected to Wikipedia.

def run_process(filename, browser):
    if connect_to_base(browser):
        ########
        # here #
        ########
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")

Once the page has loaded and sleep(2) has executed, the browser grabs the HTML source, which is then passed to parse_html().

def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        ########
        # here #
        ########
        html = browser.page_source
        ########
        # here #
        ########
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")

parse_html() uses Beautiful Soup to parse the HTML, generating a list of dicts with the appropriate data.

def parse_html(html):
    # create soup object
    soup = BeautifulSoup(html, "html.parser")
    output_list = []
    # parse soup object to get wikipedia article url, title, and last modified date
    article_url = soup.find("link", {"rel": "canonical"})["href"]
    article_title = soup.find("h1", {"id": "firstHeading"}).text
    article_last_modified = soup.find("li", {"id": "footer-info-lastmod"}).text
    article_info = {
        "url": article_url,
        "title": article_title,
        "last_modified": article_last_modified,
    }
    output_list.append(article_info)
    return output_list

This function also passes the article URL to get_load_time(), which loads the URL and records the subsequent load time.

def get_load_time(article_url):
    try:
        # set headers
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
        }
        # make get request to article_url
        response = requests.get(
            article_url, headers=headers, stream=True, timeout=3.000
        )
        # get page load time
        load_time = response.elapsed.total_seconds()
    except Exception as e:
        print(e)
        load_time = "Loading Error"
    return load_time
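Note that the parse_html() shown above doesn't include this call. If you do want the load time in the output (as the mocked test later in this article assumes), one way to wire it in -- a sketch, not the repo's exact code -- is to add a key to article_info:

# inside parse_html(), a hedged variant of article_info that records the load time;
# "load_time" would also need to be added to the fieldnames list in write_to_file()
article_info = {
    "url": article_url,
    "title": article_title,
    "last_modified": article_last_modified,
    "load_time": get_load_time(article_url),
}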

The output is added to a CSV file.

def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        ########
        # here #
        ########
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")

write_to_file():

def write_to_file(output_list, filename):
    for row in output_list:
        with open(Path(BASE_DIR).joinpath(filename), "a") as csvfile:
            fieldnames = ["url", "title", "last_modified"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(row)
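For reference, a hypothetical call with a single parsed row appends one line to the CSV. Since only writerow() is used (not writeheader()), no header row is written:

# hypothetical usage -- appends one row to output_example.csv
sample_output = [
    {
        "url": "https://en.wikipedia.org/wiki/Example",
        "title": "Example",
        "last_modified": "This page was last edited on ...",
    }
]
write_to_file(sample_output, "output_example.csv")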

Finally, back in the while loop, the current_attempt is incremented and the process starts over again.

if __name__ == "__main__":

    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        ########
        # here #
        ########
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")

Want to test this out? Grab the full script here.

It took about 57 seconds to run:

(env)$ python script.py

Scraping Wikipedia #1 time(s)...
Scraping Wikipedia #2 time(s)...
Scraping Wikipedia #3 time(s)...
Scraping Wikipedia #4 time(s)...
Scraping Wikipedia #5 time(s)...
Scraping Wikipedia #6 time(s)...
Scraping Wikipedia #7 time(s)...
Scraping Wikipedia #8 time(s)...
Scraping Wikipedia #9 time(s)...
Scraping Wikipedia #10 time(s)...
Scraping Wikipedia #11 time(s)...
Scraping Wikipedia #12 time(s)...
Scraping Wikipedia #13 time(s)...
Scraping Wikipedia #14 time(s)...
Scraping Wikipedia #15 time(s)...
Scraping Wikipedia #16 time(s)...
Scraping Wikipedia #17 time(s)...
Scraping Wikipedia #18 time(s)...
Scraping Wikipedia #19 time(s)...
Scraping Wikipedia #20 time(s)...
Elapsed run time: 57.36561393737793 seconds

Got it? Great! Let's add some basic testing.

Testing

To test the parsing functionality without launching a browser and, thus, making repeated GET requests to Wikipedia, you can download the page's HTML (test/test.html) and parse it locally. This helps you avoid getting your IP blocked for making too many requests too quickly while writing and testing your parsing functions, and it saves you time by not having to fire up a browser every time you run the script.
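If you need to (re)generate test/test.html yourself, one way -- a one-off helper that isn't part of the repo -- is to reuse the scraper's own functions to save a single page:

# hypothetical helper (not part of the repo) to capture a page for local testing;
# assumes it's run from the project root so test/ exists
from scrapers.scraper import connect_to_base, get_driver

browser = get_driver(headless=True)
if connect_to_base(browser):
    with open("test/test.html", "w", encoding="utf-8") as f:
        f.write(browser.page_source)
browser.quit()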

test/test_scraper.py:

from pathlib import Path

import pytest

from scrapers import scraper

BASE_DIR = Path(__file__).resolve(strict=True).parent


@pytest.fixture(scope="module")
def html_output():
    with open(Path(BASE_DIR).joinpath("test.html"), encoding="utf-8") as f:
        html = f.read()
    yield scraper.parse_html(html)


def test_output_is_not_none(html_output):
    assert html_output


def test_output_is_a_list(html_output):
    assert isinstance(html_output, list)


def test_output_is_a_list_of_dicts(html_output):
    assert all(isinstance(elem, dict) for elem in html_output)

Ensure all is well:

(env)$ python -m pytest test/test_scraper.py

================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items

test/test_scraper.py ...                                                        [100%]

================================= 3 passed in 0.19s ==================================

Want to mock get_load_time() to bypass the GET request?

test/test_scraper_mock.py:

from pathlib import Path

import pytest

from scrapers import scraper

BASE_DIR = Path(__file__).resolve(strict=True).parent


@pytest.fixture(scope="function")
def html_output(monkeypatch):
    def mock_get_load_time(url):
        return "mocked!"

    monkeypatch.setattr(scraper, "get_load_time", mock_get_load_time)
    with open(Path(BASE_DIR).joinpath("test.html"), encoding="utf-8") as f:
        html = f.read()
    yield scraper.parse_html(html)


def test_output_is_not_none(html_output):
    assert html_output


def test_output_is_a_list(html_output):
    assert isinstance(html_output, list)


def test_output_is_a_list_of_dicts(html_output):
    assert all(isinstance(elem, dict) for elem in html_output)

Test:

(env)$ python -m pytest test/test_scraper_mock.py

================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items

test/test_scraper_mock.py ...                                                   [100%]

================================= 3 passed in 0.27s ==================================

Configure Multithreading

Now comes the fun part! By making just a few changes to the script, we can speed things up:

import datetime
import sys

from concurrent.futures import ThreadPoolExecutor, wait
from time import sleep, time

from scrapers.scraper import connect_to_base, get_driver, parse_html, write_to_file


def run_process(filename, headless):

    # init browser
    browser = get_driver(headless)

    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)

        # exit
        browser.quit()
    else:
        print("Error connecting to Wikipedia")
        browser.quit()


if __name__ == "__main__":

    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"
    futures = []

    # scrape and crawl
    with ThreadPoolExecutor() as executor:
        for number in range(1, 21):
            futures.append(
                executor.submit(run_process, output_filename, headless)
            )

    wait(futures)
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")

With the concurrent.futures module, ThreadPoolExecutor is used to spawn a pool of threads for executing the run_process calls asynchronously. The submit method takes the function along with its parameters and returns a future object. wait is then used to block execution until all tasks are complete.
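wait() also returns the futures as a named tuple of done and not-done sets, so if you wanted to surface any exceptions raised inside run_process, a small optional addition might look like this:

# optional sketch: inspect the finished futures for exceptions
done, not_done = wait(futures)
for future in done:
    if future.exception() is not None:
        print(f"run_process raised: {future.exception()}")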

It's worth noting that you can easily switch to multiprocessing via ProcessPoolExecutor since both ProcessPoolExecutor and ThreadPoolExecutor implement the same interface:

# scrape and crawl
with ProcessPoolExecutor() as executor:
    for number in range(1, 21):
        futures.append(
            executor.submit(run_process, output_filename, headless)
        )

Why multithreading instead of multiprocessing?

Web scraping is I/O bound since retrieving the HTML (I/O) is slower than parsing it (CPU). For more on this, along with the difference between parallelism (multiprocessing) and concurrency (multithreading), review the Speeding Up Python with Concurrency, Parallelism, and asyncio article.

Run:

(env)$ python script_concurrent.py

Elapsed run time: 11.831077098846436 seconds

Check out the completed script here.

To speed things up even further, we can run Chrome in headless mode by passing in the headless command-line argument:

(env)$ python script_concurrent.py headless

Running in headless mode
Elapsed run time: 6.222846269607544 seconds

Conclusion

With a small amount of variation from the original code, we were able to execute the web scraper concurrently to take the script's run time from around 57 seconds to just over 6 seconds. In this specific scenario that's just about 90% faster, which is a huge improvement.

I hope this helps your scripts. You can find the code in the repo. Cheers!

Caleb Pollman

Caleb is a software developer with a background in fine art and design. He's excited to learn new things and is most comfortable in challenging environments. In his free time he creates art and hangs out with random cats.
