Configurations
Configuration 1:
Model Name: MacBook Pro
Chip: Apple M1 Max
Cores: 10 (8 performance and 2 efficiency)
Memory: 32GB
System Version: macOS Sonoma 14.2.1
Playwright Version: 1.42.0
Python Version: Python 3.10.14
Browser: Chromium version 123.0.6312.4
Browser: Firefox version 123.0
Configuration 2:
Model Name:
Chip: Intel Xeon
Cores: 4 Logical Cores
Memory: 8GB
System Version: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-76-generic x86_64)
Docker Version: Docker version 25.0.4
Playwright Version: 1.42.0
Python Version: Python 3.10.13
Browser: Chromium version 123.0.6312.4
Browser: Firefox version 123.0
Description
I'm experiencing a problem where my Scrapy spider hangs indefinitely on a random URL during the crawl, without throwing any exception. The spider uses scrapy-playwright for page rendering. I have tried running the project on both of the configurations above, and the same problem occurs.
Actual Behavior
Both Chromium and Firefox hang on scraping tasks, on different URLs each time.
```
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: webspider)
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform macOS-14.2.1-arm64-arm-64bit
2024-03-23 10:13:07 [scrapy.addons] INFO: Enabled addons: []
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet Password: 231864415aa317a2
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2024-03-23 10:13:07 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'webspider', 'CONCURRENT_REQUESTS': 32, 'DNS_TIMEOUT': 6, 'DOWNLOAD_DELAY': 0.25, 'DOWNLOAD_TIMEOUT': 30, 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'webspider.spiders', 'REACTOR_THREADPOOL_MAXSIZE': 16, 'REDIRECT_MAX_TIMES': 5, 'REFERER_ENABLED': False, 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7', 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['webspider.spiders'], 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'webspider.middlewares.RandomUserAgentMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-03-23 10:13:07 [scrapy.core.engine] INFO: Spider opened
2024-03-23 10:13:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:12 [scrapy-playwright] INFO: Launching browser chromium
2024-03-23 10:13:13 [scrapy-playwright] INFO: Browser chromium launched
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: Browser context started: 'm-3baa5cd4-1614-4a6a-af55-5dbbd9682f29' (persistent=False, remote=False)
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] New page created, page count is 1 (1 for all contexts)
......
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/view/view_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/share/api_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Request: <GET http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css> (resource type: stylesheet)
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://nsclick.baidu.com/v.gif?pid=307&type=3071&sign=&desturl=https%253A%252F%252Fikikiuu.online%252F&linkid=lu3gj3suojw&apitype=0>
2024-03-23 10:14:41 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://api.share.baidu.com/v.gif>
2024-03-23 10:15:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:16:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:18:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:19:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:20:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```
Code
```python
# SETTINGS
@classmethod
def update_settings(cls, settings: BaseSettings) -> None:
    super().update_settings(settings)
    # spider
    settings.set("REFERER_ENABLED", False)
    settings.set("TWISTED_REACTOR", "twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    settings.set("DOWNLOAD_HANDLERS", {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    })
    settings.set("SPIDER_MIDDLEWARES", {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    })
    settings.set("DOWNLOADER_MIDDLEWARES", {
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        "webspider.middlewares.RandomUserAgentMiddleware": 300,
    })
    settings.set("ROBOTSTXT_OBEY", False)
    settings.set("RETRY_ENABLED", False)
    settings.set("REDIRECT_MAX_TIMES", 5)
    settings.set("DOWNLOAD_TIMEOUT", 30)
    settings.set("DOWNLOAD_DELAY", 0.25)
    # playwright
    settings.set("PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT", "30000")
    settings.set("PLAYWRIGHT_BROWSER_TYPE", "chromium")
    settings.set("PLAYWRIGHT_MAX_CONTEXTS", 1)
    settings.set("PLAYWRIGHT_MAX_PAGES_PER_CONTEXT", 1)
    settings.set("PLAYWRIGHT_DEFAULT_HEADERS", {'Cache-Control': 'no-cache'})
    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "firefox":
        prefs = {
            "security.mixed_content.block_active_content": False,
            "webgl.disabled": True,
            "services.sync.engine.addons": False,
            "services.sync.engine.bookmarks": False,
            "services.sync.engine.history": False,
            "services.sync.engine.passwords": False,
            "services.sync.engine.prefs": False,
            "dom.webaudio.enabled": False,
            "extensions.abuseReport.enabled": False,
            "dom.webnotifications.enabled": False,
            "browser.cache.disk.enable": False,
        }
        browser_args = [
            "-private",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
            "firefox_user_prefs": prefs,
        })
    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "chromium":
        browser_args = [
            "--incognito",
            "--disable-gpu",
            "--disable-sync",
            "--disable-apps",
            "--disable-audio",
            "--disable-plugins",
            "--disable-infobars",
            "--disable-extensions",
            "--disable-translate",
            "--disable-geolocation",
            "--disable-notifications",
            "--disable-winsta",
            "--disable-dev-shm-usage",
            "--disable-webgl",
            "--disable-cache",
            "--disable-popup-blocking",
            "--disable-back-forward-cache",
            "--arc-disable-gms-core-cache",
            "--process-per-site",
            "--disable-offline-load-stale-cache",
            "--disk-cache-size=0",
            "--no-sandbox",
            "--disable-client-side-phishing-detection",
            "--disable-breakpad",
            "--ignore-certificate-errors",
            "--ignore-urlfetcher-cert-requests",
            "--disable-blink-features=AutomationControlled",
            "--disable-web-security",
            "--allow-running-insecure-content",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
        })


def start_requests(self) -> Iterable[Request]:
    logging.info("start url: %s" % self.start_urls[0])
    task_item = WebTaskItem(
        task_id="m-" + str(uuid.uuid4()),
        url=self.start_urls[0],
    )
    yield Request(
        url=self.start_urls[0],
        meta={
            TASK_ITEM_META_KEY: task_item,
            "spider": self.name,
            RandomUserAgentMiddleware.use_mobile_ua_flag: task_item.is_mobile,
            "playwright": True,
            "playwright_include_page": True,
            "playwright_context": task_item.task_id,
            "playwright_context_kwargs": {
                "ignore_https_errors": True,
                "is_mobile": task_item.is_mobile,
                "bypass_csp": True,
            },
            "playwright_page_init_callback": self.init_page,
            "playwright_page_goto_kwargs": {
                "wait_until": "load",
                "timeout": 30000
            },
            "handle_httpstatus_all": True,
        },
        dont_filter=False,
        callback=self.response_callback,
        errback=self.failure_callback
    )


async def response_callback(self, response: Response, **kwargs) -> WebTaskResultItem:
    page: Page = response.meta.get("playwright_page", None)
    if not page:
        raise ValueError("PlaywrightSpider.response_callback: page is None")
    try:
        info = PageInfoItem()
        for processor in self.processors:
            # just fetch html / get title / take screenshots from page object
            await processor.process(response=response, info=info)
        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            page_info=info,
            error_message=""
        )
        logging.info("result: %s" % result.model_dump_json())
        yield result
    except Exception as e:
        logging.warning(f"PlaywrightSpider.response_callback: ex={e}")
        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            error_message=str(e)
        )
        logging.info("result: %s" % result.model_dump_json())
    finally:
        await page.close()
        await page.context.close()


async def failure_callback(self, failure, **kwargs) -> WebTaskResultItem:
    page: Page = failure.request.meta.get("playwright_page", None)
    if page:
        try:
            await page.close()
            await page.context.close()
        except Exception as e:
            self.logger.warning(f"stop error:{e}")
    result = WebTaskResultItem(
        task_id=failure.request.meta[TASK_ITEM_META_KEY].task_id,
        error_message=failure.getErrorMessage()
    )
    logging.info("result: %s" % result.model_dump_json())
    yield result
```
Troubleshooting Steps Taken
- Increased logging level to DEBUG, but no clues were found in the logs.
- Ensured that resources like Playwright pages and contexts are being closed correctly.
- Checked for potential deadlocks in my asynchronous code (see the sketch after this list).
- Ran the spider with `cProfile` to profile the application, but no obvious bottlenecks were identified.
- Removed all Chromium args and Firefox user preferences; the results are the same.
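
For the deadlock check mentioned above, this is a minimal sketch of the approach: periodically log every asyncio task the event loop still considers pending, together with the top of its stack. The helper name and the 60-second interval are arbitrary choices for illustration.

```python
import asyncio
import logging


async def dump_pending_tasks(interval: float = 60.0) -> None:
    # Diagnostic sketch: every `interval` seconds, log each asyncio task that
    # is still pending along with a short stack trace, to spot coroutines
    # that never complete.
    while True:
        await asyncio.sleep(interval)
        for task in asyncio.all_tasks():
            if not task.done():
                logging.warning("pending task: %r", task)
                task.print_stack(limit=5)
```

It can be scheduled once the crawl is running, e.g. with `asyncio.ensure_future(dump_pending_tasks())` from any coroutine.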
Debugging
I tried to use the Python debugger to step through the code, and I found the following:
- The requests were added to the task queue.
- The debugger shows that the request was processed by scrapy-playwright.
- The browser started correctly.
- The browser rendered the page correctly.
- The `goto` function was called.
- Then the spider hangs without any apparent reason, not following any timeout rules.
- No more tasks can be processed; the spider just keeps printing the scraping stats.
- Neither the `callback=self.response_callback` nor the `errback=self.failure_callback` function was called.
- The main thread loops in the event loop's select() call (see the sketch below).
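
The thread stacks can also be captured without attaching a debugger, for example with the standard-library `faulthandler` module (a minimal sketch; the SIGUSR1 trigger assumes a POSIX system):

```python
import faulthandler
import signal

# Dump every thread's stack to stderr when the process receives SIGUSR1
# (trigger with `kill -USR1 <pid>` from another shell while the crawl is stuck).
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump all thread stacks repeatedly once the spider has stalled.
faulthandler.dump_traceback_later(timeout=300, repeat=True)
```

When the hang happens, the dump shows the exact frame each Python thread is blocked in.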
Huge thanks for any ideas you can provide!
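
For comparison, a minimal standalone Playwright script (a sketch; `https://example.com` stands in for one of the URLs that hangs) can be used to check whether the same `goto` options time out as expected outside Scrapy:

```python
# Standalone sketch (not part of the spider above): drive Playwright directly
# with the same wait_until/timeout options to compare behaviour outside Scrapy.
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False, timeout=30_000)
        page = await browser.new_page()
        try:
            # Placeholder URL; replace with one of the URLs the spider hangs on.
            await page.goto("https://example.com", wait_until="load", timeout=30_000)
            print(await page.title())
        finally:
            await browser.close()


asyncio.run(main())
```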