Scrapy hangs with no exception raised #266

@Zhou-Haowei

Configurations

Configuration 1:
Model Name: MacBook Pro
Chip: Apple M1 Max
Cores: 10 (8 performance and 2 efficiency)
Memory: 32GB
System Version: macOS Sonoma 14.2.1
Playwright Version: 1.42.0
Python Version: Python 3.10.14
Browser: Chromium version 123.0.6312.4
Browser: Firefox version 123.0

Configuration 2:
Model Name:
Chip: Intel Xeon
Cores: 4 Logical Cores
Memory: 8GB
System Version: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-76-generic x86_64)
Docker Version: Docker version 25.0.4
Playwright Version: 1.42.0
Python Version: Python 3.10.13
Browser: Chromium version 123.0.6312.4
Browser: Firefox version 123.0

Description

I'm experiencing a problem where my Scrapy spider hangs indefinitely on a random URL during the crawl, without throwing any exceptions. The spider uses scrapy-playwright for page rendering. I have run the project on both of the configurations above, and the same problem occurs on each.

Actual Behavior

Both Chromium and Firefox hang on scraping tasks, each time on different URLs.

2024-03-23 10:13:07 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: webspider)
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform macOS-14.2.1-arm64-arm-64bit
2024-03-23 10:13:07 [scrapy.addons] INFO: Enabled addons: []
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet Password: 231864415aa317a2
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2024-03-23 10:13:07 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'webspider', 'CONCURRENT_REQUESTS': 32, 'DNS_TIMEOUT': 6, 'DOWNLOAD_DELAY': 0.25, 'DOWNLOAD_TIMEOUT': 30, 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'webspider.spiders', 'REACTOR_THREADPOOL_MAXSIZE': 16, 'REDIRECT_MAX_TIMES': 5, 'REFERER_ENABLED': False, 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7', 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['webspider.spiders'], 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'webspider.middlewares.RandomUserAgentMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-03-23 10:13:07 [scrapy.core.engine] INFO: Spider opened
2024-03-23 10:13:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:12 [scrapy-playwright] INFO: Launching browser chromium
2024-03-23 10:13:13 [scrapy-playwright] INFO: Browser chromium launched
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: Browser context started: 'm-3baa5cd4-1614-4a6a-af55-5dbbd9682f29' (persistent=False, remote=False)
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] New page created, page count is 1 (1 for all contexts)
......
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/view/view_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/share/api_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Request: <GET http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css> (resource type: stylesheet)
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://nsclick.baidu.com/v.gif?pid=307&type=3071&sign=&desturl=https%253A%252F%252Fikikiuu.online%252F&linkid=lu3gj3suojw&apitype=0>
2024-03-23 10:14:41 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://api.share.baidu.com/v.gif>
2024-03-23 10:15:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:16:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:18:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:19:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:20:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Code

# SETTINGS
@classmethod
def update_settings(cls, settings: BaseSettings) -> None:
    super().update_settings(settings)
    # spider
    settings.set("REFERER_ENABLED", False)
    settings.set("TWISTED_REACTOR", "twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    settings.set("DOWNLOAD_HANDLERS", {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    })
    settings.set("SPIDER_MIDDLEWARES", {
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    })
    settings.set("DOWNLOADER_MIDDLEWARES", {
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        "webspider.middlewares.RandomUserAgentMiddleware": 300,
    })
    settings.set("ROBOTSTXT_OBEY", False)
    settings.set("RETRY_ENABLED", False)
    settings.set("REDIRECT_MAX_TIMES", 5)
    settings.set("DOWNLOAD_TIMEOUT", 30)
    settings.set("DOWNLOAD_DELAY", 0.25)
    # playwright
    settings.set("PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT", "30000")
    settings.set("PLAYWRIGHT_BROWSER_TYPE", "chromium")
    settings.set("PLAYWRIGHT_MAX_CONTEXTS", 1)
    settings.set("PLAYWRIGHT_MAX_PAGES_PER_CONTEXT", 1)
    settings.set("PLAYWRIGHT_DEFAULT_HEADERS", {"Cache-Control": "no-cache"})
    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "firefox":
        prefs = {
            "security.mixed_content.block_active_content": False,
            "webgl.disabled": True,
            "services.sync.engine.addons": False,
            "services.sync.engine.bookmarks": False,
            "services.sync.engine.history": False,
            "services.sync.engine.passwords": False,
            "services.sync.engine.prefs": False,
            "dom.webaudio.enabled": False,
            "extensions.abuseReport.enabled": False,
            "dom.webnotifications.enabled": False,
            "browser.cache.disk.enable": False,
        }
        browser_args = [
            "-private",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
            "firefox_user_prefs": prefs,
        })
    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "chromium":
        browser_args = [
            "--incognito",
            "--disable-gpu",
            "--disable-sync",
            "--disable-apps",
            "--disable-audio",
            "--disable-plugins",
            "--disable-infobars",
            "--disable-extensions",
            "--disable-translate",
            "--disable-geolocation",
            "--disable-notifications",
            "--disable-winsta",
            "--disable-dev-shm-usage",
            "--disable-webgl",
            "--disable-cache",
            "--disable-popup-blocking",
            "--disable-back-forward-cache",
            "--arc-disable-gms-core-cache",
            "--process-per-site",
            "--disable-offline-load-stale-cache",
            "--disk-cache-size=0",
            "--no-sandbox",
            "--disable-client-side-phishing-detection",
            "--disable-breakpad",
            "--ignore-certificate-errors",
            "--ignore-urlfetcher-cert-requests",
            "--disable-blink-features=AutomationControlled",
            "--disable-web-security",
            "--allow-running-insecure-content",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
        })
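
As a sanity check that these values are actually live once the crawler starts, the effective settings can be logged from the running spider via Scrapy's standard signals API. A minimal sketch; the from_crawler/spider_opened wiring below is illustrative, not part of the project code:

from scrapy import signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # Standard Scrapy pattern: hook spider_opened to the crawler's signals.
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
    return spider

def spider_opened(self):
    # Log the effective timeout-related settings once the crawler is live.
    for key in ("DOWNLOAD_TIMEOUT", "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT"):
        self.logger.info("effective %s = %r", key, self.crawler.settings.get(key))

The "Overridden settings" line in the log above already confirms that at least DOWNLOAD_TIMEOUT: 30 took effect.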
def start_requests(self) -> Iterable[Request]:
    logging.info("start url: %s" % self.start_urls[0])
    task_item = WebTaskItem(
        task_id="m-" + str(uuid.uuid4()),
        url=self.start_urls[0],
    )
    yield Request(
        url=self.start_urls[0],
        meta={
            TASK_ITEM_META_KEY: task_item,
            "spider": self.name,
            RandomUserAgentMiddleware.use_mobile_ua_flag: task_item.is_mobile,
            "playwright": True,
            "playwright_include_page": True,
            "playwright_context": task_item.task_id,
            "playwright_context_kwargs": {
                "ignore_https_errors": True,
                "is_mobile": task_item.is_mobile,
                "bypass_csp": True,
            },
            "playwright_page_init_callback": self.init_page,
            "playwright_page_goto_kwargs": {
                "wait_until": "load",
                "timeout": 30000,
            },
            "handle_httpstatus_all": True,
        },
        dont_filter=False,
        callback=self.response_callback,
        errback=self.failure_callback,
    )

async def response_callback(self, response: Response, **kwargs) -> WebTaskResultItem:
    page: Page = response.meta.get("playwright_page", None)
    if not page:
        raise ValueError("PlaywrightSpider.response_callback: page is None")
    try:
        info = PageInfoItem()
        for processor in self.processors:
            # just fetch html / get title / take screenshots from page object
            await processor.process(response=response, info=info)
        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            page_info=info,
            error_message="",
        )
        logging.info("result: %s" % result.model_dump_json())
        yield result
    except Exception as e:
        logging.warning(f"PlaywrightSpider.response_callback: ex={e}")
        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            error_message=str(e),
        )
        logging.info("result: %s" % result.model_dump_json())
    finally:
        await page.close()
        await page.context.close()

async def failure_callback(self, failure, **kwargs) -> WebTaskResultItem:
    page: Page = failure.request.meta.get("playwright_page", None)
    if page:
        try:
            await page.close()
            await page.context.close()
        except Exception as e:
            self.logger.warning(f"stop error:{e}")
    result = WebTaskResultItem(
        task_id=failure.request.meta[TASK_ITEM_META_KEY].task_id,
        error_message=failure.getErrorMessage(),
    )
    logging.info("result: %s" % result.model_dump_json())
    yield result
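
Because the hang ignores every configured timeout, one way to at least turn it into a visible failure is a hard asyncio deadline around the per-page work. A minimal sketch of how response_callback's processor loop could be wrapped; the 60-second budget and the helper names are illustrative, not part of the project:

import asyncio
import logging

PAGE_BUDGET_SECONDS = 60  # illustrative hard cap, not a project setting

async def _run_processors(self, response, info):
    # The same loop response_callback runs today, factored out so it can
    # be awaited under a deadline.
    for processor in self.processors:
        await processor.process(response=response, info=info)

async def process_with_deadline(self, response, info):
    try:
        await asyncio.wait_for(self._run_processors(response, info),
                               timeout=PAGE_BUDGET_SECONDS)
    except asyncio.TimeoutError:
        # A stuck Playwright call now surfaces as an ordinary exception
        # instead of freezing the whole crawl.
        logging.warning("page processing exceeded %ss", PAGE_BUDGET_SECONDS)
        raise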

Troubleshooting Steps Taken

  • Increased the logging level to DEBUG, but found no clues in the logs.
  • Ensured that resources like Playwright pages and contexts are being closed correctly.
  • Checked for potential deadlocks in my asynchronous code (a task-dump sketch follows this list).
  • Ran the spider under cProfile, but no obvious bottlenecks were identified.
  • Removed all Chromium args and Firefox user preferences; the results were the same.
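
For the deadlock check mentioned in the list, the most telling probe is to print the coroutine stack of every pending task while the spider is stuck. A sketch; dump_asyncio_tasks is my own helper name, and it must execute inside the crawler's asyncio loop:

import asyncio
import sys

def dump_asyncio_tasks() -> None:
    # Must run inside the event loop; from another thread, schedule it with
    # loop.call_soon_threadsafe(dump_asyncio_tasks).
    for task in asyncio.all_tasks():
        # The await that never returns shows up at the top of one of these stacks.
        task.print_stack(file=sys.stderr)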

Debugging

I used the Python debugger to step through the code and found the following:

  1. The requests were added to the task queue.
  2. The debugger shows that the request was processed by scrapy-playwright.
  3. The browser started correctly.
  4. The browser rendered the page correctly; the goto function was called.
  5. Then the spider hangs for no apparent reason, ignoring every configured timeout.
  6. No further tasks are processed; the log only keeps printing the crawl stats.
  7. Neither the callback=self.response_callback nor the errback=self.failure_callback function is ever called.
  8. The main thread just loops in the event selector (see the Telnet console probe below).
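
Since the startup log shows the Telnet console listening on 127.0.0.1:6023, the engine's own view of the stall can also be inspected while it hangs (the probe referenced in point 8). After connecting with telnet 127.0.0.1 6023, the console accepts Python; est() is its built-in engine-status report, while engine.downloader.active is Scrapy-internal, so treat the second line as a best-effort check:

est()                              # built-in engine status report: slots, in-flight requests
len(engine.downloader.active)      # requests the downloader still considers in flight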

Huge thanks for any ideas!
