DEV Community

Huxley

Python Selenium Infinite Scrolling

Scraping web pages with infinite scrolling using Python, bs4 and Selenium

Scroll function
This function takes two arguments: the driver being used and a timeout. The driver is used to scroll the page, and the timeout is the number of seconds to wait for new content to load after each scroll.

def scroll(driver, timeout):
    scroll_pause_time = timeout

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height

Here is an example using the function

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

# Your options may be different
options = Options()
options.set_preference('permissions.default.image', 2)
options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)

def all_links(url):
    # Setup the driver. This one uses firefox with some options and a path to the geckodriver
    driver = webdriver.Firefox(options=options, executable_path='./geckodriver')

    # implicitly_wait tells the driver to wait before throwing an exception
    driver.implicitly_wait(30)

    # driver.get(url) opens the page
    driver.get(url)

    # This starts the scrolling by passing the driver and a timeout
    scroll(driver, 5)

    # Once scroll returns, bs4 parses the page_source
    soup_a = BeautifulSoup(driver.page_source, 'lxml')

    # Then we close the driver as soup_a is storing the page source
    driver.close()

    # Empty list to store the links
    links = []

    # Looping through all the a elements in the page source
    for link in soup_a.find_all('a'):
        # link.get('href') gets the href/url out of the a element
        links.append(link.get('href'))

    return links

And that's how you scrape a page with infinite scrolling.
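For example, you could call all_links on whatever page you want to scrape; the URL below is only a placeholder, and the if check simply skips anchors that have no href:

links = all_links('https://example.com/feed')
for link in links:
    # Some <a> elements have no href attribute, so link can be None
    if link:
        print(link)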

Top comments (8)

Milos-Blagojevic

Hi, thanks so much for the post, it really helped me a lot.
Do you by any chance know why, when scrolling through a page that has a lot of content, I get different results, in the sense that the page doesn't always end with the same content, even though it clearly reaches the end of the page?
For instance, I have been trying to scrape posts from an Instagram page that has more than 50,000 posts, and almost every time I get different results, never anywhere near 50,000. The closest I got was around 20,000, but most of the time it is between 5 and 10 thousand.
Do you think this is Instagram related, or does it have to do with my code?
Any thoughts will be appreciated.
Thanks in advance :)

Huxley

Could be Instagram trying to stop scraping, or it could be an issue with your code. It could also be an issue with the page not loading in time.
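If it is the page not loading in time, one thing to try is a variant of the scroll function that retries a few times before assuming it has hit the bottom. This is only a sketch; scroll_with_retries and max_retries are names made up here for illustration, not part of the original post:

import time

def scroll_with_retries(driver, timeout, max_retries=3):
    # Same idea as scroll(), but slow pages get a few extra chances to load
    last_height = driver.execute_script("return document.body.scrollHeight")
    retries = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(timeout)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height didn't change; retry a few times before treating it as the end
            retries += 1
            if retries >= max_retries:
                break
        else:
            # New content appeared, so reset the retry counter
            retries = 0
        last_height = new_height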

suren40

Thanks

shin • Edited

I get a NameError from 'time' in the scroll function:
NameError: name 'time' is not defined
Does anyone have any idea how to fix this?

Huxley

The scroll function uses the time package for sleeping.

At the top of your project, add:

import time
pabloyzm

Infinite thanks <3

Luiz Eduardo Amorim

Man, this script has helped me a lot at work. Thank you for this!

Joyce Cheung

Thank you sososososo much!!!! you helped us solve a great great problem :D