Posted on Jun 20, 2021 • Edited on Aug 28, 2021

Scrape YouTube Search with Python (part 1)

#python #tutorial #webscraping #datascience

Contents: intro, imports, what will be scraped, code, fuckit, links, outro.

Intro

This blog post will show how to scrape YouTube organic search, ad and channel results.

Each section will be represented with the screenshot that will show which part is being scraped.

I decided to use not the fastest solution Selenium but I wanted to scrape everything to the bottom of the search results page, which could be done by calling DOM directly like so:

driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;") # https://stackoverflow.com/a/57076690/15164646 (contains several references for a better understanding)

Imports

from selenium import webdriver from serpapi import GoogleSearch import json, time # this two could be skipped (prettier output/time buffer)

What will be scraped

Video Search Results

Code

from selenium import webdriver import json, time def get_video_results(): driver = webdriver.Chrome() driver.get('https://www.youtube.com/results?search_query=minecraft') youtube_data = [] # scrolling to the end of the page  # https://stackoverflow.com/a/57076690/15164646  while True: # end_result = "No more results" string at the bottom of the page  # this will be used to break out of the while loop  end_result = driver.find_element_by_css_selector('#message').is_displayed() driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;") # time.sleep(1) # could be removed  print(end_result) # once element is located, break out of the loop  if end_result == True: break print('Extracting results. It might take a while...') for result in driver.find_elements_by_css_selector('.text-wrapper.style-scope.ytd-video-renderer'): title = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer').text link = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer a').get_attribute('href') channel_name = result.find_element_by_css_selector('.long-byline').text channel_link = result.find_element_by_css_selector('#text > a').get_attribute('href') views = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[0] try: time_published = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[1] except: time_published = None try: snippet = result.find_element_by_css_selector('.metadata-snippet-container').text except: snippet = None try: if result.find_element_by_css_selector('#channel-name .ytd-badge-supported-renderer') is not None: verified_badge = True else: verified_badge = False except: verified_badge = None try: extensions = result.find_element_by_css_selector('#badges .ytd-badge-supported-renderer').text except: extensions = None print(verified_badge) youtube_data.append({ 'title': title, 'link': link, 'channel': {'channel_name': channel_name, 'channel_link': channel_link}, 'views': views, 'time_published': time_published, 'snippet': snippet, 'verified_badge': verified_badge, 'extensions': extensions, }) print(json.dumps(youtube_data, indent=2, ensure_ascii=False)) driver.quit() get_video_results() # part of the output: ''' [ { "title": "I Survived 100 Days in Ancient Greece on Minecraft.. Here's What Happened..", "link": "https://www.youtube.com/watch?v=hUAjdnhpTXU", "channel": { "channel_name": "Forrestbono", "channel_link": "https://www.youtube.com/user/ForrestboneMC" }, "views": "2.6M views", "time_published": "5 days ago", "snippet": "I had to survive for 100 Days of Hardcore Minecraft in Ancient Greece and battle Poseidon, God of the Sea, and Cronos, the God ...", "verified_badge": true, "extensions": "New" } ] '''

Using YouTube Video Results API

SerpApi is paid API with a free plan.

from serpapi import GoogleSearch def get_video_results(): params = { "api_key": "YOUR_API_KEY", "engine": "youtube", "search_query": "minecraft" } search = GoogleSearch(params) results = search.get_dict() for results in results['video_results']: title = results['title'] link = results['link'] channel = results['channel'] try: published_date = results['published_date'] except: published_date = None try: views = results['views'] except: views = None try: video_length = results['length'] except: video_length = None try: extensions = results['extensions'] except: extensions = None print(f'{title}\n{link}\n{channel}\n{published_date}\n{views}\n{video_length}\n{extensions}\n') get_video_results() # part of the output: ''' I Spent 100 Days in Medieval Times in Minecraft... Here's What Happened https://www.youtube.com/watch?v=hjV30hf6yEM {'name': 'Forge Labs', 'link': 'https://www.youtube.com/user/AirsoftXX', 'verified': True, 'thumbnail': 'https://yt3.ggpht.com/ytc/AAUvwnjgpo-Pvk7jrXkd4HFErsnrLr2Nwru5f8TgtWGJ7w=s68-c-k-c0x00ffffff-no-rj'} 6 days ago 10136089 1:49:28 ['New'] '''

Ad results

from selenium import webdriver from selenium.webdriver.chrome.options import Options def get_video_ad_results(): options = Options() options.headless = True driver = webdriver.Chrome(options=options) driver.get('https://www.youtube.com/results?search_query=how to tie a tie') for result in driver.find_elements_by_css_selector('.style-scope ytd-search-pyv-renderer'): title = result.find_element_by_css_selector('#video-title').text channel_name = result.find_element_by_css_selector('#channel-name').text channel_link = result.find_element_by_css_selector('#text a').get_attribute('href') video_link = result.find_element_by_css_selector('#endpoint').get_attribute('href') views = result.find_element_by_css_selector('#metadata-line').text desc = result.find_element_by_css_selector('#description-text').text print(f'{title}\n{channel_name}\n{channel_link}{video_link}\n{views}\n{desc}\n') get_video_ad_results() # output: ''' How to tie a tie EASY WAY How to tie a tie https://www.youtube.com/channel/UC4UuK5vs0b8HhqLDE6ssWOA https://www.googleadservices.com/pagead/aclk?sa=L&ai=Cw7aOxrTOYIL4LdqJ9u8PgtWd-AWTucasY9mO756NDsCNtwEQASAAYKWWo4b0IoIBF2NhLXB1Yi02MjE5ODExNzQ3MDQ5MzcxoAGP3d7QA6kC1tyMLsuvYz6oAwSqBIgCT9Cnglg6NKBsd-PDBGHllhIo3j6gxAjkcwDoAkp7nHUsJEW7DGH5yhXLGFX1ZUysJkVvRncH4iJh7A9q1X-LRwJD1cSE8ZODyrNzKmP3YswA23bToV2p5yCKzb3SJJw7pZnp6HBJFQy3_bV4ZZbR5YU7txo9LNOqyCzXHB0zKe8HIRgLCYwz8_lQJwdjzYvtEfQn84kRsvGs646kym5AM7AuK7ZkzYZs68dxtuZU4EV64-8mG4_0kuyKt6GXcFHxydZSYSqQUBm5N8WBFmVYqTTX2MZs6uv7JL_T2ilO_GSvWAXSm_TeJcvwdI0zQlPNvqIF-8kBfRZzx5xLfimBFeJF4hdIS_cqkgUMCBIw4qn4ztD9_P5fkgUHCBN4qZauMKAGVYAH2aKhL5AHBKgHhAioB6jSG6gHtgeoB-DPG6gH6dQbqAeMzRuoB7HcG6gH8NkbqAekmrECqAeBxhuoB9XOG6gHq8UbqAfezhuoB5zcG5IIC1hfM3o3UW5lRk9JqAgB0ggFCIBBEAGxCXHg2OWyEZICyAkXyAmPAZgLAboLHggDEAUYBiAGKAEwBUABSABYC2AAaABwAYgBAJgBAdALE7gMAbgT____________AbAUA8AVgYCAQNAVAdgVAYAXAaAXAQ&num=1&cid=CAASFeRoSrmeG6BSAe4hx5xjr7z2wLbhwQ&sig=AOD64_2-SpesMfmcgSQlQ9oXqQ3KeRo52g&adurl=https://www.youtube.com/watch%3Fv%3DX_3z7QneFOI&ctype=21&video_id=X_3z7QneFOI&client=ca-pub-6219811747049371 43K views How to tie a tie quick and easy Best tutorial '''

Using YouTube Ad Results API

from serpapi import GoogleSearch def get_video_ad_results(): params = { "api_key": "YOUR_API_KEY", "engine": "youtube", "search_query": "how to tie a tie" } search = GoogleSearch(params) results = search.get_dict() for result in results['ads_results']: title = result['title'] link = result['link'] channel = result['channel'] description = result['description'] print(f'{title}\n{link}\n{channel}\n{description}\n') get_video_ad_results() # output: ''' How to tie a tie EASY WAY https://www.youtube.com/watch?v=X_3z7QneFOI {'name': 'How to tie a tie', 'link': 'https://www.youtube.com/channel/UC4UuK5vs0b8HhqLDE6ssWOA'} How to tie a tie quick and easy Best tutorial '''

Channel results

from selenium import webdriver def get_channel_results(): driver = webdriver.Chrome() driver.get('https://www.youtube.com/results?search_query=mojang') title = driver.find_element_by_css_selector('#info #text').text link = driver.find_element_by_css_selector('#main-link').get_attribute('href') subs = driver.find_element_by_css_selector('#subscribers').text video_count = driver.find_element_by_css_selector('#video-count').text desc = driver.find_element_by_css_selector('#description').text print(f'{title}\n{link}\n{subs}\n{video_count}\n{desc}') get_channel_results() # output: ''' Minecraft https://www.youtube.com/user/TeamMojang 7.4M subscribers 542 videos This is the official YouTube channel of Minecraft. We tell stories about the Minecraft Universe. ESRB Rating: Everyone 10+ with ... '''

Using YouTube Channel Results API

from serpapi import GoogleSearch def get_channel_results(): params = { "api_key": "YOUR_API_KEY", "engine": "youtube", "search_query": "mojang" } search = GoogleSearch(params) results = search.get_dict() for result in results['channel_results']: title = result['title'] link = result['link'] verified = result['verified'] subs = result['subscribers'] video_count = result['video_count'] desc = result['description'] print(f'{title}\n{link}\n{verified}\n{subs}\n{video_count}\n{desc}\n') get_channel_results() # output: ''' Minecraft https://www.youtube.com/user/TeamMojang True 7400000.0 542 This is the official YouTube channel of Minecraft. We tell stories about the Minecraft Universe. ESRB Rating: Everyone 10+ with ... '''

Fuckit module

If you don't like too many try/except blocks, then you can use context manager from fuckit module that will continue to run, skipping the statements that cause errors.

# pip install fuckit import fuckit with fuckit: title = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer').text link = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer a').get_attribute('href') channel_name = result.find_element_by_css_selector('.long-byline').text channel_link = result.find_element_by_css_selector('#text > a').get_attribute('href') views = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[0] time_published = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[1] snippet = result.find_element_by_css_selector('.metadata-snippet-container').text extensions = result.find_element_by_css_selector('#badges .ytd-badge-supported-renderer').text

Links

Code in the online IDE • YouTube Search Engine Results API

Outro

You can also scrape YouTube by using requests-html library where you still have to render the page by calling html.render(), I'm not tested how much quicker it compare to selenium.

Selenium could be also run headless mode in Firefox. If the first solution didn't work for you, check out this or this answer from stackoverflow. Firefox webdriver download.

Or if you're using selenium with Chrome, you can do it like so. Chrome webdriver download.

If you have any questions or something isn't working correctly or you want to write something else, feel free to drop a comment in the comment section or via Twitter at @serp_api.

Yours,
Dimitry, and the rest of SerpApi Team.

DEV Community