Web Scraping Technologies KU IT MEET Dhulikhel
Krishna Sunuwar Founder & CEO, ontreat & Jaljale @s2krish
Demo • Google translation
What is scraping? • Extract data from the web • Why extract? – To structure data – For data mining – No APIs available – Business intelligence – Abuse/spam
Components • Network • HTML Parser • Data Saving
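The three components above can be sketched as one small pipeline. This is a minimal illustration using only the Python 3 standard library, not a reference design: the function names (`fetch`, `parse_title`, `save_rows`, `scrape`) and the choice of extracting the page title are assumptions made for the example.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Network step: download a page and return it as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Parser step: collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save_rows(rows, out_file):
    """Saving step: write (url, title) rows as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "title"])
    writer.writerows(rows)


def scrape(urls, out_file, fetch_page=fetch):
    """Run the three steps; fetch_page is injectable (hypothetical hook)."""
    rows = [(url, parse_title(fetch_page(url))) for url in urls]
    save_rows(rows, out_file)
    return rows
```

Keeping the network step injectable means the parser and the saving step can be exercised against stored HTML without touching a live site.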
HTTP/Network Libraries • urllib2 – library for opening URLs (urllib.request in Python 3) • requests – HTTP library for humans • mechanize – stateful programmatic web browsing
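To make "stateful" concrete, here is a rough sketch of a cookie-keeping client built from the Python 3 standard library (`urllib.request` plus `http.cookiejar`). It stands in for what requests or mechanize would do for you; `make_session` and the user-agent string are made-up names for illustration.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session(user_agent="example-scraper/0.1"):
    """Return an opener that remembers cookies between requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Headers listed here are sent with every request made by this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

Calling `opener.open(url)` repeatedly reuses the same cookie jar, so a login cookie set by one response is sent back with the next request.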
Parsing HTML • BeautifulSoup • lxml (xpath, cssselect) • HTMLParser (html.parser) • regex
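As a small illustration of the parsing step, the sketch below pulls link targets out of a page with the standard library's `html.parser`. BeautifulSoup or lxml would do the same job with less code; `LinkParser` and `extract_links` are names invented for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

An event-based parser like this tolerates unclosed tags and unquoted attributes, which is where a plain regex tends to break first.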
The Challenges • At Server – Rate limits (throttling) – IP ban – Authentication required – CAPTCHA • At Client – Broken HTML – JavaScript-generated content – Badly structured pages
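One common way to cope with server-side throttling is to retry with an increasing delay. The sketch below assumes the server signals throttling with HTTP 429 or 503, which is typical but not guaranteed; `fetch_with_backoff` and its parameters are hypothetical names, and `fetch_page` is any callable that raises `urllib.error.HTTPError` on failure.

```python
import time
from urllib.error import HTTPError

# Status codes treated as "slow down and try again" (an assumption).
RETRYABLE = {429, 503}


def fetch_with_backoff(fetch_page, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page(url), doubling the wait after each throttled attempt."""
    for attempt in range(retries + 1):
        try:
            return fetch_page(url)
        except HTTPError as err:
            # Give up on non-throttling errors or when retries are exhausted.
            if err.code not in RETRYABLE or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```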
Tips • Rotate the User-Agent • Use different IPs (use a proxy) • Don’t go too fast • Break CAPTCHAs – Deathbycaptcha
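The first and third tips can be combined in a small helper that hands out request headers no faster than a fixed pace. This is only a sketch: `PoliteScheduler`, the 2-second default delay, and the injectable `clock` and `sleep` arguments are assumptions for illustration, and the user-agent strings you pass in are up to you.

```python
import itertools
import time


class PoliteScheduler:
    """Cycle through user agents and enforce a minimum gap between requests."""

    def __init__(self, user_agents, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self._agents = itertools.cycle(user_agents)
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def next_headers(self):
        # Wait out whatever remains of the minimum delay since the last call.
        if self._last is not None:
            wait = self._min_delay - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()
        return {"User-Agent": next(self._agents)}
```

Call `next_headers()` right before each request and pass the result to whichever HTTP library is in use.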
Applications • Scrapy Framework
Thank you Q & A
