Scraping the web with python

Scraping the Web the workshop José Manuel Ortega @jmortegac

Agenda Librerías python BeautifulSoup Scrapy / Proyectos Mechanize / Selenium Herramientas web / plugins

GitHub repository: https://github.com/jmortega/codemotion_scraping_the_web

Scraping techniques
- Screen scraping
- Web scraping
- Report mining
- Spiders

Web scraping
- The process of automatically collecting or extracting data from web pages.
- A technique for extracting data with software tools; it is related to indexing the information on the web with a robot (crawler).
- A universal methodology adopted by most search engines.

Python
- http://www.python.org
- An interpreted, multi-paradigm programming language: it supports object-oriented and imperative programming and, to a lesser extent, functional programming.
- Dynamically typed and cross-platform.

Python libraries
- Requests
- lxml
- Regular expressions
- Beautiful Soup 4
- PyQuery
- Webscraping
- Scrapy
- Mechanize
- Selenium

Request libraries
- urllib2 (urllib.request in Python 3)
- Python requests: HTTP for Humans
- $ pip install requests

Requests http://docs.python-requests.org/en/latest

Web scraping with Python
1. Download the web page with urllib2 or requests
2. Parse the page with BeautifulSoup or lxml
3. Select elements with XPath or CSS selectors
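The three steps can be sketched roughly as follows. This is a minimal illustration, not the workshop's own code: the function names `download` and `extract_headings`, the `"h1, h2"` selector and the sample HTML are assumptions made for the example, and Python's built-in `html.parser` is used in place of lxml so that nothing beyond requests and beautifulsoup4 is needed.

```python
import requests
from bs4 import BeautifulSoup


def download(url):
    """Step 1: fetch the page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_headings(html):
    """Steps 2 and 3: parse the HTML, then select elements with a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("h1, h2")]


# Offline check with an inline document (invented for this sketch).
sample = "<html><body><h1> Agenda </h1><h2>Day 1</h2><p>text</p></body></html>"
headings = extract_headings(sample)
```

Against a live site the two functions chain together as `extract_headings(download(url))`.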

Web scraping with Python Regular expressions <h1>(.*?)</h1> Xpath //h1 Generar un objeto del HTML (tipo DOM) page.h1

Regular expressions
- [A-Z] matches an uppercase letter
- [0-9] matches a digit
- [a-z][0-9] matches a lowercase letter followed by a digit
- star * matches the previous item 0 or more times
- plus + matches the previous item 1 or more times
- dot . matches any character except a line break (in Python's re module, any character except \n unless re.DOTALL is set)
- question mark ? makes the preceding item optional
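A short sketch of these building blocks using Python's standard `re` module. The sample strings are invented for illustration; only the patterns come from the list above.

```python
import re

html = "<html><body><h1>Scraping the web</h1></body></html>"
text = "Room B2, slot a7, row C12"

# Non-greedy group: capture the text between <h1> and the nearest </h1>.
title = re.search(r"<h1>(.*?)</h1>", html).group(1)

# [a-z][0-9]: a lowercase letter followed by a digit.
lower_digit = re.findall(r"[a-z][0-9]", text)

# [A-Z][0-9]+: an uppercase letter followed by one or more digits.
upper_digits = re.findall(r"[A-Z][0-9]+", text)

# ? makes the preceding item optional: matches both spellings.
spellings = re.findall(r"colou?r", "color colour")

# . does not match a newline by default, so only the first candidate matches.
dot_matches = re.findall(r"a.c", "abc a\nc")
```

Regular expressions work well for small, predictable fragments like these; for whole documents a real parser is usually more robust.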

BeautifulSoup
- A library for parsing web pages
- Supports parsers such as lxml and html5lib
- Installation:
    pip install lxml
    pip install html5lib
    pip install beautifulsoup4
- http://www.crummy.com/software/BeautifulSoup

BeautifulSoup
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'lxml')
- Print the whole document: print(soup.prettify())
- Print only the text: print(soup.get_text())

BeautifulSoup functions
- find_all('a'): returns a list of all links (<a> elements)
- find('title'): returns the first <title> element
- get('href'): returns the value of an element's href attribute
- (element).text: returns the text of the element
    for link in soup.find_all('a'):
        print(link.get('href'))
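The four calls can be tried on a small inline document. The HTML below is invented for the example, and `html.parser` (bundled with Python) is assumed instead of lxml so that the snippet runs with only beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Invented sample document, so the example runs without network access.
html_doc = """
<html>
  <head><title>Schedule</title></head>
  <body>
    <a href="/talks/1">Scraping the web</a>
    <a href="/talks/2">Testing Android security</a>
  </body>
</html>
"""

# 'html.parser' ships with Python; swap in 'lxml' if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match; .text gives the text inside the element.
page_title = soup.find("title").text

# find_all() returns every match; get() reads an attribute value.
links = [(a.text, a.get("href")) for a in soup.find_all("a")]
```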

Extracting links with bs4 https://news.ycombinator.com

Extracting LinkedIn info with bs4

Extracting data from the PyConES schedule http://2015.es.pycon.org/es/schedule

Extracting data from the PyConES schedule with Beautiful Soup 4

Webscraping library pip install webscraping  https://bitbucket.org/richardpenman/webscraping/overview  http://docs.webscraping.com  https://pypi.python.org/pypi/webscraping

Extracting data from the PyConES schedule with webscraping

Scrapy open-source Framework que permite crear spiders para ejecutar procesos de crawling de pag web Permite la definición de reglas Xpath mediante expresiones regulares para la extracción de contenidos Basada en la librería twisted

Scrapy
- Simple and concise
- Extensible: signals, middlewares
- Fast: asynchronous IO (Twisted), parsing in C (libxml2)
- Portable: Linux, Windows, Mac
- Well tested: 778 unit tests, 80% coverage
- Clean (PEP 8) and decoupled code
- Zen-friendly / Pythonic

Scrapy Utiliza un mecanismo basado en expresiones XPath llamado Xpath Selectors. Utiliza LXML XPath para encontrar elementos Utiliza Twisted para el operaciones asíncronas

Advantages of Scrapy
- Faster than Mechanize because it uses asynchronous operations (Twisted).
- Better support for parsing HTML.
- Better handling of Unicode characters, redirects, gzipped responses and encodings.
- Built-in HTTP cache.
- Extracted data can be exported directly to CSV or JSON.

XPath selectors
Expression              Meaning
name                    all nodes on the current level with the specified name
name[n]                 the nth element on the current level with the specified name
/                       select from the root
//                      select matching nodes at any depth (from the document root when the expression starts with //)
*                       all nodes on the current level
. or ..                 the current / parent node
@name                   the attribute with the specified name
[@key='value']          all elements with an attribute that matches the specified key/value pair
name[@key='value']      all elements with the specified name and an attribute that matches the specified key/value pair
[text()='value']        all elements with the specified text
name[text()='value']    all elements with the specified name and text
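Several of these expressions can be tried with the standard library alone. The sketch below uses `xml.etree.ElementTree`, which supports only a subset of XPath (it is used here purely so the example has no dependencies; lxml provides full XPath 1.0). The sample document and its tag names are invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample document.
doc = ET.fromstring(
    "<schedule>"
    "<day name='friday'>"
    "<talk room='A1'>Scraping the web</talk>"
    "<talk room='B2'>Testing Android security</talk>"
    "</day>"
    "<day name='saturday'>"
    "<talk room='A1'>Scrapy in practice</talk>"
    "</day>"
    "</schedule>"
)

# name: direct children of the current node with that tag.
days = [d.get("name") for d in doc.findall("day")]

# //name: matching nodes at any depth ('.' anchors the search at the current node).
all_talks = [t.text for t in doc.findall(".//talk")]

# name[n]: positions are 1-based, so talk[2] is the second talk of each day.
second_talks = [t.text for t in doc.findall("day/talk[2]")]

# name[@key='value']: filter on an attribute value.
room_a1 = [t.text for t in doc.findall(".//talk[@room='A1']")]

# ..: step up to the parent of a matched node.
days_with_b2 = [d.get("name") for d in doc.findall(".//talk[@room='B2']/..")]
```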

Scrapy
- To use Scrapy we create a project; each project is made up of:
- Items: define the elements to extract.
- Spiders: the heart of the project; they define the extraction procedure.
- Pipelines: process the extracted items: data validation, cleaning of HTML.

Instalación de scrapy Python 2.6 / 2.7 Lxml openSSL pip / easy_install $ pip install scrapy $ easy_install scrapy

Instalación de scrapy pip install scrapy

Scrapy shell (no project needed)
    scrapy shell <url>
    from scrapy.selector import Selector
    sel = Selector(response)
    info = sel.xpath('//div[@class="slot-inner"]')

Scrapy Shell scrapy shell http://scrapy.org

Scrapy project
    $ scrapy startproject <project_name>
- scrapy.cfg: the project configuration file.
- <project_name>/: the project's Python module.
- items.py: the project's items file.
- pipelines.py: the project's pipelines file.
- settings.py: the project's settings file.
- spiders/: a directory where you'll later put your spiders.

Scrapy europython http://ep2015.europython.eu/en/events/sessions

Creating a spider
    $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
    $ scrapy list    (lists the spiders in a project)

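Pipeline
- Enable it in settings.py (the number sets the run order; lower values run first):
    ITEM_PIPELINES = {'<your_project_name>.pipelines.<your_pipeline_classname>': 300}
- Implement the class in pipelines.py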

SQLite pipeline: EuropythonSQLitePipeline
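The workshop's pipeline class is not reproduced in the slides, so the following is only a hypothetical sketch of what an SQLite pipeline might look like. The class name, table name and columns are assumptions. It relies on the hooks Scrapy calls on a pipeline object (`open_spider`, `process_item`, `close_spider`) and on the standard `sqlite3` module, so it can be exercised without Scrapy installed.

```python
import sqlite3


class SQLiteTalksPipeline:
    """Hypothetical pipeline: stores each scraped item as a row in SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS talks (title TEXT, speaker TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines still see it.
        self.conn.execute(
            "INSERT INTO talks (title, speaker) VALUES (?, ?)",
            (item.get("title"), item.get("speaker")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Offline check with a plain dict standing in for a scraped item.
pipeline = SQLiteTalksPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Scraping the web", "speaker": "J. M. Ortega"}, spider=None)
rows = pipeline.conn.execute("SELECT title, speaker FROM talks").fetchall()
pipeline.close_spider(spider=None)
```

Parameterised queries (`?` placeholders) are used so that scraped text is never interpolated into SQL.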

Running a spider
    $ scrapy crawl <spider_name>
    $ scrapy crawl <spider_name> -o items.json -t json
    $ scrapy crawl <spider_name> -o items.csv -t csv
    $ scrapy crawl <spider_name> -o items.xml -t xml

Slidebot $ scrapy crawl -a url="" slideshare $ scrapy crawl -a url="" speakerdeck

Slidebot $ scrapy crawl -a url="http://www.slideshare.net/jmoc25/testing-android-security" slideshare

Write CSV / JSON
    import csv
    # newline='' lets the csv module control line endings (Python 3)
    with open('file.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for line in rows:
            writer.writerow(line)

    import json
    with open('file.json', 'w') as jsonfile:
        json.dump(results, jsonfile)

Fixing encoding errors: myvar.encode("utf-8")
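A small sketch of why encoding to UTF-8 avoids these errors, assuming Python 3 and an invented sample string:

```python
text = "José Manuel Ortega: 20 €"  # sample string with non-ASCII characters

# Encoding to ASCII fails because 'é' and '€' have no ASCII representation.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every Unicode character, so this succeeds.
data = text.encode("utf-8")

# Decoding with the same codec round-trips back to the original text.
round_trip = data.decode("utf-8")
```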

Scrapyd  Scrapy web service daemon $ pip install scrapyd  Web API with simple Web UI: http://localhost:6800  Web API Documentation:  http://scrapyd.readthedocs.org/en/latest/api.html

Mechanize
- https://pypi.python.org/pypi/mechanize
- pip install mechanize
- Mechanize lets you navigate links programmatically

Mechanize
    import mechanize

    # service url
    URL = ''

    def main():
        # Create a Browser instance
        b = mechanize.Browser()
        # Load the page
        b.open(URL)
        # Select the first form on the page
        b.select_form(nr=0)
        # Fill out the form (key and value stand for a field name and its value)
        b[key] = value
        # Submit!
        return b.submit()

Mechanize
- Error: mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
- Fix: browser.set_handle_robots(False)

Mechanize search in duckduckgo

Mechanize: extracting links
    import mechanize
    br = mechanize.Browser()
    response = br.open(url)
    for link in br.links():
        print(link)

Alternatives for mechanize  RoboBrowser  https://github.com/jmcarp/robobrowser  MechanicalSoup  https://github.com/hickford/MechanicalSoup

RoboBrowser
- Based on BeautifulSoup
- Uses the requests library
- Compatible with Python 3

Selenium  Open Source framework for automating browsers  Python-Module http://pypi.python.org/pypi/selenium  pip install selenium  Firefox-Driver

Selenium  Open a browser  Open a Page

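Selenium: find_element_by_* locators
- find_element_by_link_text('text'): find a link by its text
- find_element_by_css_selector: CSS selectors, as in lxml
- find_element_by_tag_name: 'a' for the first link (find_elements_by_tag_name for all links)
- find_element_by_xpath: XPath expressions
- find_element_by_class_name: matches elements of any tag type that have the given class
- Selenium 4 replaces these helpers with find_element(By.<STRATEGY>, value)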

Selenium
    <div id="myid">...</div>
    browser.find_element_by_id("myid")

    <input type="text" name="example" />
    browser.find_elements_by_xpath("//input")

    <input type="text" name="example" />
    browser.find_element_by_name("example")

Selenium
    <div id="myid"> <span class="myclass">content</span> </div>
    browser.find_element_by_css_selector("#myid span.myclass")

    <a href="">content</a>
    browser.find_element_by_link_text("content")

Selenium element.click() element.submit()

Extracting data from the Codemotion schedule

Web Scraper plugin http://webscraper.io

XPath expressions
- Firefox plugins
- FireFinder for Firebug
- FirePath

XPath expressions
- XPath Helper
- Move the mouse while holding the Shift key
- Get the XPath expression of a given HTML element

Scrapinghub
- Scrapy Cloud is a platform for deploying, running and monitoring Scrapy spiders, with a viewer for the scraped data.
- It lets you control spiders through scheduled jobs, check which processes are running, and retrieve the scraped data.
- Projects can be managed through the API or through its web dashboard.

Scrapy Cloud
- http://doc.scrapinghub.com/scrapy-cloud.html
- https://dash.scrapinghub.com
    $ pip install shub
    $ shub login
    Insert your ScrapingHub API Key:

Scrapy Cloud: /scrapy.cfg
    # Project: demo
    [deploy]
    url = https://dash.scrapinghub.com/api/scrapyd/
    # API_KEY
    username = ec6334d7375845fdb876c1d10b2b1622
    password =
    project = 25767

Scrapy Cloud scheduling
    curl -u APIKEY: https://dash.scrapinghub.com/api/schedule.json -d project=PROJECT -d spider=SPIDER
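The same call could be made from Python with the requests library. The sketch below only builds the request so that it can be inspected without contacting the service; `APIKEY`, `PROJECT` and `SPIDER` are the placeholders from the curl command, and the endpoint is taken from the slide as-is.

```python
import requests

API_KEY = "APIKEY"  # placeholder, as in the curl command

# Build (but do not send) the same POST that the curl command issues:
#   -u APIKEY:  -> HTTP basic auth with the key as user and an empty password
#   -d k=v      -> form-encoded body fields
request = requests.Request(
    "POST",
    "https://dash.scrapinghub.com/api/schedule.json",
    auth=(API_KEY, ""),
    data={"project": "PROJECT", "spider": "SPIDER"},
)
prepared = request.prepare()

# To actually schedule the job, send it through a session:
# response = requests.Session().send(prepared, timeout=10)
```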

References
- http://www.crummy.com/software/BeautifulSoup
- http://scrapy.org
- https://pypi.python.org/pypi/mechanize
- http://docs.python-requests.org/en/latest
- http://selenium-python.readthedocs.org/index.html
- https://github.com/REMitchell/python-scraping
