Tutorial on Web Scraping in Python

Scraping Data from the Web using Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017

About Me
● MSc Informatics Student at the Technical University of Munich
  ○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
  ○ nithishr1
  ○ @nithishr

What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
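The two steps above (extract, then store in a structured format) can be sketched with Beautiful Soup and the standard `csv` module. This is only an illustration: the inline HTML, the `listing` and `name` class names, and the helper functions `extract_rows` and `to_csv` are assumptions made up for the example, not taken from the talk's code.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a downloaded listing page.
HTML = """
<ul>
  <li class="listing"><span class="name">Cafe Alpha</span>
      <a href="mailto:info@alpha.example">mail</a></li>
  <li class="listing"><span class="name">Bakery Beta</span>
      <a href="mailto:hello@beta.example">mail</a></li>
</ul>
"""


def extract_rows(html):
    """Extract one dict per listing from the HTML text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.listing"):
        name = item.select_one("span.name").get_text(strip=True)
        link = item.select_one('a[href^="mailto:"]')
        # Strip the "mailto:" scheme; leave the field empty if no link exists.
        email = link["href"][len("mailto:"):] if link else ""
        rows.append({"name": name, "email": email})
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text (a structured format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
print(to_csv(rows))
```

In a real crawl the HTML would come from a downloaded page rather than a string literal, and the selectors would have to match that site's markup.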

Tools for Scraping
● Scrapy
  ○ Python framework to extract data from web pages
● Beautiful Soup
  ○ Python library to parse HTML/XML documents
● Alternatives
  ○ Selenium
  ○ Requests
  ○ Octoparse

Scraping 101
● Spider
  ○ A bot that downloads web pages
● robots.txt
  ○ File present on the server specifying access limits to bots
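To make the robots.txt idea concrete, the sketch below uses Python's standard `urllib.robotparser` to check whether a bot may fetch a URL. The robots.txt content, the `demo-bot` user agent, and the URLs are invented for the example; a real spider would load the file from the target server instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a server might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this path open to our (made-up) user agent?
allowed_public = parser.can_fetch("demo-bot", "https://example.com/listings")
allowed_private = parser.can_fetch("demo-bot", "https://example.com/private/data")

# Requested pause between requests, or None if the file does not set one.
delay = parser.crawl_delay("demo-bot")

print(allowed_public, allowed_private, delay)
```

Scrapy can perform this check itself when the `ROBOTSTXT_OBEY` setting is enabled, so the manual version is mainly useful for scripts built on other tools.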

Pitfalls in Crawling
● JavaScript-heavy websites
  ○ Splash plugin
  ○ Selenium
● Scrapy's default settings are not too friendly to website owners
  ○ Built-in AutoThrottle extension
● Captchas
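As one way to address the "default settings" pitfall, the settings below turn on Scrapy's AutoThrottle extension and add a few other politeness options. The setting names are real Scrapy settings, but the specific values and the user-agent string are assumptions chosen for illustration, not recommendations from the talk.

```python
# A sketch of polite crawl settings; the values are illustrative assumptions.
POLITE_SETTINGS = {
    # Respect the access limits published in robots.txt.
    "ROBOTSTXT_OBEY": True,
    # Let AutoThrottle adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5,    # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 60,     # upper bound on the delay in seconds
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Baseline delay and a cap on parallel requests per domain.
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    # Identify the bot and give site owners a way to reach you (placeholder).
    "USER_AGENT": "demo-bot (+https://example.com/contact)",
}
```

These keys can go into a project's `settings.py` as module-level variables, or be assigned to a spider's `custom_settings` attribute to apply them to that spider only.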

Why Yellow Pages? Email Marketing for Customer Acquisition

Email Marketing for Customer Acquisition

Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
  ○ Non-transparent
  ○ Generic emails
● Expensive

Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
  ○ Categorized into segments
  ○ Targeted emails
● Cheap

Connect
Nithish Raghunandanan
● nithishr1
● @nithishr
● nithishr@gmail.com
● www.ki-labs.com

Resources
● Scrapy Guide
  ○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
  ○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
  ○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
  ○ https://github.com/nithishr/meetup_scraping
