Scraping Data from the Web using Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017

About Me
● MSc. Informatics Student at the Technical University of Munich
  ○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
  ■ nithishr1 @nithishr

What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
Use Cases

Tools for Scraping
● Scrapy
  ○ Python framework to extract data from web pages
● Beautiful Soup
  ○ Python library to parse HTML/XML documents
● Alternatives
  ○ Selenium
  ○ Requests
  ○ Octoparse
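
As a minimal sketch of what "parse HTML/XML documents" looks like with Beautiful Soup, the snippet below parses a small inline HTML string and pulls out names and links. The HTML, the class names (`listing`, `name`) and the paths are made up for illustration; downloading the page (with Scrapy or Requests) is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
HTML = """
<html><body>
  <div class="listing"><a class="name" href="/biz/1">Cafe Anna</a></div>
  <div class="listing"><a class="name" href="/biz/2">Bike Shop Bert</a></div>
</body></html>
"""


def parse_listings(html):
    """Return a list of {'name', 'url'} dicts, one per matching link."""
    # "html.parser" is the parser from the standard library, so lxml is not needed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", class_="name"):
        rows.append({"name": link.get_text(strip=True), "url": link["href"]})
    return rows


listings = parse_listings(HTML)
for row in listings:
    print(row)
```

Returning plain dicts keeps the result easy to write out as CSV or JSON, which is one way to get the "structured formats" mentioned earlier.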

Scraping 101
● Spider
  ○ A bot that downloads web pages
● robots.txt
  ○ File present on the server specifying access limits to bots
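
To illustrate how a bot can honour robots.txt, here is a small sketch using `urllib.robotparser` from the Python standard library. The robots.txt content, the `example-bot` user agent and the example.com URLs are assumptions for the demo; a real crawler would download the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/listings/cafes"))  # True
print(allowed("https://example.com/private/admin"))   # False
print(parser.crawl_delay("example-bot"))              # 5
```

Scrapy can apply the same check automatically through its `ROBOTSTXT_OBEY` setting, so a hand-written check like this is mainly useful outside of Scrapy.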

Pitfalls in Crawling
● JavaScript-heavy websites
  ○ Splash plugin
  ○ Selenium
● Default settings not too friendly to website owners
  ○ Built-in AutoThrottle extension
● CAPTCHAs
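
The point about default settings can be made concrete with a fragment of a Scrapy project's `settings.py`. The setting names are Scrapy's own; the values and the user agent string are illustrative assumptions, not recommendations from the talk, and should be tuned per site.

```python
# settings.py (fragment) for a Scrapy project; values are illustrative only.

ROBOTSTXT_OBEY = True  # skip URLs that robots.txt disallows

# AutoThrottle adjusts the download delay based on how fast the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound for the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server

DOWNLOAD_DELAY = 1                     # base delay between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap on simultaneous requests per domain

# Hypothetical identity so site owners can see who is crawling and how to reach them.
USER_AGENT = "example-bot (+https://example.com/contact)"
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound for the adjusted delay, so the crawler never goes faster than that base rate.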

Why Yellow Pages?
Email Marketing for Customer Acquisition

Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
  ○ Non-transparent
  ○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
  ○ Categorized into segments
  ○ Targeted emails
● Cheap
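
The crawling approach can be sketched roughly as follows: extract email addresses from each page with Beautiful Soup, then group them by category. This is not the speaker's code (that is linked under Resources); the regular expression, the `extract_emails` helper, the categories and the `.example` addresses are all assumptions made for illustration.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# Deliberately simple pattern; it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Collect unique, lower-cased emails from mailto: links and visible text."""
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part.
            found.add(href[len("mailto:"):].split("?")[0].strip().lower())
    found.update(m.lower() for m in EMAIL_RE.findall(soup.get_text(" ")))
    return sorted(found)


# Hypothetical crawled pages as (category, HTML) pairs.
PAGES = [
    ("restaurants", '<a href="mailto:info@cafe-anna.example">Mail us</a>'),
    ("restaurants", "<p>Contact: Info@Cafe-Anna.example</p>"),
    ("bike shops", "<p>Write to bert@bikes.example</p>"),
]

# Segment the addresses by category; sets remove duplicates across pages.
segments = defaultdict(set)
for category, html in PAGES:
    segments[category].update(extract_emails(html))

for category, emails in sorted(segments.items()):
    print(category, sorted(emails))
```

In a Scrapy spider, a helper like this could be called from the `parse` callback with `response.text`, leaving the downloading and scheduling to Scrapy.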

Connect
Nithish Raghunandanan
nithishr1 @nithishr
nithishr@gmail.com
www.ki-labs.com

Resources
● Scrapy Guide
  ○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
  ○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
  ○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
  ○ https://github.com/nithishr/meetup_scraping

Tutorial on Web Scraping in Python
