Windows: crawl freezes when multiple schemas are set in DOWNLOAD_HANDLERS #290

@resse42

Description

Please forgive me if I have said anything impolite; I am using a translator. My situation is basically the same as in the issue "Spiders hangs". I have traced the problem to this setting:

"DOWNLOAD_HANDLERS": { "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", },

Code

When I comment out DOWNLOAD_HANDLERS, the spider runs normally. Here is my code; you can run it directly to reproduce the issue.

from pathlib import Path

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            # Route the request through Playwright.
            yield scrapy.Request(url=url, callback=self.parse, meta={"playwright": True})

    def parse(self, response):
        # Save each page to quotes-<n>.html.
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
        }
    )
    process.crawl(QuotesSpider)
    process.start()

Result

(env4gpt) D:\Repo\crawl>D:/anaconda3/envs/env4gpt/python.exe d:/Repo/crawl/main.py
2024-07-05 19:36:00 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-07-05 19:36:00 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.0, Twisted 24.3.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-05 19:36:00 [scrapy.addons] INFO: Enabled addons: []
2024-07-05 19:36:00 [py.warnings] WARNING: D:\anaconda3\envs\env4gpt\lib\site-packages\scrapy\utils\request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy. See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)
2024-07-05 19:36:00 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-05 19:36:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-05 19:36:00 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-05 19:36:00 [scrapy.extensions.telnet] INFO: Telnet Password: 6b77837564f0d1cc
2024-07-05 19:36:00 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2024-07-05 19:36:00 [scrapy.crawler] INFO: Overridden settings: {'CONCURRENT_REQUESTS': 32, 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-05 19:36:01 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-05 19:36:01 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-05 19:36:01 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-05 19:36:01 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-07-05 19:36:01 [scrapy.core.engine] INFO: Spider opened
2024-07-05 19:36:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 19:36:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-05 19:37:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 19:38:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

From then on, the log repeats "[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" every minute indefinitely (it was still printing at 19:52); no pages are ever crawled.
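
To help see where the process is stuck while it hangs, the standard library's faulthandler can print the stack of every thread. This is a generic diagnostic sketch, not anything specific to scrapy-playwright:

import faulthandler
import sys

# Arm a watchdog before process.start(): if the process is still running
# after 120 seconds, dump every thread's stack to stderr so we can see
# what the Twisted reactor and the Playwright loop threads are blocked on.
faulthandler.dump_traceback_later(120, exit=False, file=sys.stderr)

With this in place, the dump appears once the spider has been hanging for two minutes.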

Conda Environment

openpyxl=3.1.4
scrapy==2.11.2
lxml==5.2.2
python-dotenv==1.0.1
scrapy-playwright==0.0.37

System

Windows 10

Guess

Could this be related to the operating system? I do not currently have a Linux machine, so I cannot verify that. Also, I am in China; could it be related to my IP address? However, I read through the ScrapyPlaywrightDownloadHandler class and did not spot any bugs. Thank you for your help.
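
One more observation from the log above: "Started loop on separate thread" is printed twice, so each entry in DOWNLOAD_HANDLERS appears to start its own handler with its own ProactorEventLoop thread. A small sketch to check how many threads are alive while the crawl hangs (again only a guess at a useful check, not a fix):

import threading

# List every live thread while the spider is stuck; with both schemas set
# I would expect two extra event-loop threads besides the main thread.
for thread in threading.enumerate():
    print(thread.name, thread.daemon, thread.is_alive())

This could be pasted into the Telnet console that the log shows listening on 127.0.0.1:6023.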

Labels

Windows, bug (Something isn't working)
