Web Scraping Technologies

1.
Web Scraping Technologies
KUIT MEET, Dhulikhel

2.
Krishna Sunuwar
Founder & CEO, ontreat & Jaljale
@s2krish

3.
Demo
• Google translation

4.
What is scraping?
• Extract data from the web
• Why extract?
  – To structure the data
  – For mining
  – No APIs available
  – Business intelligence
  – Abuse/spam

5.
Components
• Network
• HTML parser
• Data saving
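
The three components can be sketched as three small functions chained together. This is a minimal illustration using only the Python 3 standard library, not the speaker's demo code; the names `fetch`, `LinkParser`, `parse_links`, `save_csv` and the inline `SAMPLE` page are assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Network stage: download a page and return it as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Parsing stage: collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save_csv(rows, out):
    """Saving stage: write rows to any file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["href", "text"])
    writer.writerows(rows)


# Demo on an inline page (hypothetical markup) so the sketch runs offline.
SAMPLE = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
buffer = io.StringIO()
save_csv(parse_links(SAMPLE), buffer)
```

For a live page, the same chain would be `save_csv(parse_links(fetch(url)), open_file)`; keeping the stages separate lets each one be swapped for another library later.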

6.
HTTP/Network Libraries
• urllib2 – library for opening URLs
• requests – HTTP library for humans
• mechanize – stateful programmatic web browsing
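
As a rough illustration of the network step: `urllib2` is the Python 2 module, and in Python 3 the same functionality lives in `urllib.request`, which the sketch below uses. The helper names `build_request` and `fetch`, the User-Agent string, and the example URL are assumptions for this example only.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-scraper/0.1"):
    """Prepare a GET request with an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status code, body text)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.status, response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch() does.
request = build_request("https://example.com/")
```

With the third-party `requests` library, the equivalent call would be roughly `requests.get(url, headers={"User-Agent": ...}, timeout=10)`.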

7.
Parsing HTML
• BeautifulSoup
• lxml (xpath, cssselect)
• HTMLParser (html.parser)
• regex
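
To make the difference between a real parser and a regex concrete, here is a small sketch with the standard-library `html.parser`. The class `TitleParser` and the `SAMPLE` markup are hypothetical; BeautifulSoup or lxml would express the same selection more compactly.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <h2> whose class list contains "title"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "title" in classes:
            self._inside = True
            self.titles.append("")

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False


# Hypothetical markup: quoting and attribute order vary between items.
SAMPLE = """
<div><h2 class="title">Scrapy</h2>
<h2 class='big title' id=x>lxml</h2>
<h2 class="other">skip me</h2></div>
"""

parser = TitleParser()
parser.feed(SAMPLE)
titles = [t.strip() for t in parser.titles]

# A naive regex only matches the exact spelling it was written for,
# so it misses the second heading.
regex_titles = re.findall(r'<h2 class="title">(.*?)</h2>', SAMPLE)
```

The parser finds both matching headings, while the regex finds only the one written exactly as the pattern expects; this is the usual reason to keep regex for small, very regular fragments.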

8.
The Challenges
• At Server
  – Throttle limit
  – IP ban
  – Authentication required
  – CAPTCHA
• At Client
  – Broken HTML
  – JavaScript
  – Badly structured pages

9.
Tips
• Rotate User-Agent
• Use different IPs (use a proxy)
• Don’t go fast
• Break CAPTCHA – DeathByCaptcha
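
The first and third tips can be sketched in a few lines. This is one possible arrangement, not a prescribed one: the User-Agent strings are placeholders, the 2 to 3 second delay is an arbitrary assumption, and `fetch` is a hypothetical callable `fetch(url, headers)` supplied by the caller.

```python
import itertools
import random
import time

# Placeholder User-Agent strings; substitute real, current ones.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


def rotating_headers(user_agents):
    """Yield request headers, cycling through the User-Agent list forever."""
    for agent in itertools.cycle(user_agents):
        yield {"User-Agent": agent}


def polite_delay(base=2.0, jitter=1.0):
    """Return a wait time in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL with the next header set, pausing between requests."""
    headers = rotating_headers(USER_AGENTS)
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(polite_delay())
        results.append(fetch(url, next(headers)))
    return results
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting; the random jitter makes the request timing less regular than a fixed interval.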

10.
Applications
• Scrapy framework

11.
Thank you
Q & A
