Scrapy is an open-source framework written in Python for extracting data from structured content such as HTML and XML. It can crawl and scrape websites quickly. Before you begin, make sure pip, the Python package manager, is installed.
Installing with pip:
pip install scrapy
Starting a new project:
scrapy startproject tutorial
Creating a Scrapy project generates the following file and directory structure:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Example spider (save it as a Python file under tutorial/spiders/, for example quotes_spider.py):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
A Scrapy spider can crawl one or more URLs and extract data from each of them. Scrapy selectors let you select and filter the fields you want. Selectors support both CSS and XPath expressions.
To run the spider, from the project's top-level directory:
scrapy crawl quotes
To write the output to a JSON file:
scrapy crawl quotes -o quotes.json
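Once the crawl has finished, the exported file is an ordinary JSON array with one object per scraped item, so it can be read back with the standard library. The sketch below assumes the file is named quotes.json (adjust it to whatever you passed to -o) and that each item has the 'author' key the spider yields; load_items and quotes_per_author are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_items(path):
    """Read the JSON array written by the crawl's -o option."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def quotes_per_author(items):
    """Count how many scraped items belong to each author."""
    return Counter(item["author"] for item in items)


if __name__ == "__main__":
    output = Path("quotes.json")  # assumed file name; match your -o argument
    if output.exists():
        # Print the five authors with the most scraped quotes.
        for author, count in quotes_per_author(load_items(output)).most_common(5):
            print(f"{author}: {count}")
    else:
        print("Run the crawl first to create", output)
```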