Apify & Crawlee

This is the official developer community of Apify and Crawlee.

Limit request queue

I have some crawlers consuming from RabbitMQ, but they obviously take all the messages from Rabbit and move them to the internal queue. Can I somehow cap the requestQueue so it only accepts a finite number of requests?
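Crawlee doesn't expose a hard cap on RequestQueue size, but one pattern is backpressure: only pull more RabbitMQ messages when the queue's pending count is below a threshold. A minimal, library-free sketch of the math (in practice the pending count would come from something like `requestQueue.getInfo()`, and the RabbitMQ channel wiring is not shown):

```javascript
// Decide how many RabbitMQ messages to pull so the request queue
// never holds more than `cap` pending requests.
function messagesToPull(cap, pending, batchSize) {
  const headroom = Math.max(0, cap - pending);
  return Math.min(headroom, batchSize);
}

// Example: cap of 500 pending requests, 480 already enqueued,
// normally pulling batches of 50 -> only pull 20 more.
console.log(messagesToPull(500, 480, 50)); // 20
console.log(messagesToPull(500, 510, 50)); // 0
```

RabbitMQ's own `channel.prefetch(n)` gives a similar effect on the consumer side by limiting unacked messages in flight.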

RabbitMQ and RequestQueue or List

Has anybody managed to connect the RequestQueue or RequestList directly to a continuous consumption process from a RabbitMQ queue?

Initializing CloudFlare cookies with Crawlee

Hi, I am currently using a Playwright scraper to initialize Cloudflare cookies and then send requests to the website programmatically. My problem is that the target website does multiple redirects to itself before the CF cookie is ready, which I have not managed to handle in my code. You know the cookie is ready when you get a 200 status code....

Safe to parallelize Dataset writes across processes?

Context:
- Crawlee v3.13.10, Node 22
- Linux (ext4), using storage-local
- Multiple forked workers share one RequestQueueV2 (with request locking)
- Each worker does:...

Crawlee Hybrid Crawler?

I notice a lot of the time I end up writing the exact same type of crawler, where it first uses CheerioCrawler and then falls back to PlaywrightCrawler for failed requests. The only annoying thing is the obviously different syntax between Cheerio and Playwright ($ and load for Cheerio vs. page for Playwright). For code reuse purposes I end up writing a lot of code that looks like this:
...(crawlerType === 'playwright' ? { launchContext: getLaunchContext() } : {}),
...
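One way to tame the repeated spreads is to centralize the per-engine differences in a single options builder. A sketch, where `getLaunchContext` stands in for the hypothetical helper in the snippet above:

```javascript
// Build crawler options once, varying only the engine-specific parts.
function buildCrawlerOptions(crawlerType, shared, getLaunchContext) {
  return {
    ...shared,
    ...(crawlerType === 'playwright'
      ? { launchContext: getLaunchContext() }
      : {}),
  };
}

const shared = { maxRequestRetries: 3 };
const pw = buildCrawlerOptions('playwright', shared, () => ({
  launchOptions: { headless: true },
}));
console.log('launchContext' in pw); // true
console.log('launchContext' in buildCrawlerOptions('cheerio', shared, () => ({}))); // false
```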

How to combine the scraping results for Crawlee Playwright actor?

Hello folks! I'm building an actor with Crawlee and Playwright for BBB scraping. I have completed the "detail" content parsing, and I added two boolean inputs to let users decide whether or not to scrape reviews and complaints. Then I realized that if they choose both, I'll need to send two more requests, and I don't know how to use the router to combine the results (the reviews, complaints, and detail results) before calling Dataset.pushData....
Solution:
You need a way to persist state per entity (e.g., one business from BBB) across multiple requests, and only call pushData() once all requested pieces (detail + reviews + complaints) are collected. In Crawlee, you do this by: 1. Storing a "partial result" in request.userData....
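The "partial result" idea reduces to a pure merge-and-check step: accumulate each piece as its request finishes, and only emit once every requested piece is present. A library-free sketch (field names are illustrative; in Crawlee this state would ride along in request.userData or a keyed store):

```javascript
// Merge one finished piece into the accumulated partial result and
// report whether all requested pieces have now arrived.
function mergePiece(partial, pieceName, data, requestedPieces) {
  const next = { ...partial, [pieceName]: data };
  const complete = requestedPieces.every((p) => p in next);
  return { result: next, complete };
}

const requested = ['detail', 'reviews', 'complaints'];
let state = { detail: { name: 'Acme Corp' } };

let step = mergePiece(state, 'reviews', [{ rating: 5 }], requested);
console.log(step.complete); // false - complaints still missing

step = mergePiece(step.result, 'complaints', [], requested);
console.log(step.complete); // true - safe to call pushData() now
```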

Crawl sitemap

Hi guys, I am trying to extract links from a sitemap. It works fine when I call it the first time, but if I call it again with the same sitemap URL it does not extract links from the sitemap again. What is the problem? This is my code: ```js async crawlSitemap({ url,...
Solution:
nvm, I added
const config = new Configuration({ persistStorage: false });
and passed it to PlaywrightCrawler, which fixed it...

Facebook Ads Library src video/images

Hi everyone, I'm building a scraper using Apify for the Facebook Ads Library. I'm fetching ad data via the Ads Library API, which provides details including an ad_snapshot_url. The issue is that the direct URL for the ad creative (the image or video file) is not included in the API response. My approach is to open the ad_snapshot_url with Playwright and attempt to extract the <img> or <video> element from the DOM....

Skip request in preNavigationHooks

Is it possible to skip the request for a URL in preNavigationHooks? I don't want to make the request at all in the request handler if something occurs in preNavigationHooks. The only thing that worked for me is throwing a NonRetryableError, but I think this is not ideal. request.skipNavigation is not ideal either, because the request itself still occurs. ATM I'm using NonRetryableError, but my logs are ugly. How do I suppress the logs?...
Solution:
hmmm I like the idea with SKIP label.. I'll try that. Thanks
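The SKIP-label idea amounts to routing: requests you don't want handled get a label whose handler is a no-op, so nothing throws and logs stay clean. A plain-object sketch of that control flow (Crawlee's real router comes from `createPlaywrightRouter()`; in practice you'd assign the label when enqueueing, before routing happens):

```javascript
// Minimal dispatch sketch: a request labeled 'SKIP' gets a no-op
// handler instead of a thrown NonRetryableError.
const handlers = {
  DEFAULT: (req) => `scraped ${req.url}`,
  SKIP: () => 'skipped quietly',
};

function dispatch(req) {
  return (handlers[req.label] ?? handlers.DEFAULT)(req);
}

// A hypothetical pre-enqueue check relabels unwanted requests.
function labelRequest(req) {
  if (req.url.endsWith('.pdf')) req.label = 'SKIP';
  return req;
}

const req = labelRequest({ url: 'https://example.com/file.pdf', label: 'DEFAULT' });
console.log(dispatch(req)); // skipped quietly
```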

postNavigationHooks timeout

I'm using Camoufox and the handleCloudflareChallenge utility in postNavigationHooks, and requests time out after 100 seconds. Is it possible to lower the timeout limit from 100 s in postNavigationHooks? It seems like it doesn't respect requestHandlerTimeoutSecs or navigationTimeoutSecs.
Solution:
requestHandlerTimeoutSecs is enforced by Apify/Crawlee’s overall request handler, but once inside a postNavigationHook, you're in user-defined logic. If handleCloudflareChallenge doesn't internally support a timeout (or ignores one), it might block longer than desired. navigationTimeoutSecs applies to page.goto() and similar calls — not necessarily to post-navigation scripts....
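Since navigationTimeoutSecs doesn't cover user code in hooks, you can impose your own deadline by racing the slow step against a timer. A self-contained sketch (the simulated promise stands in for handleCloudflareChallenge or any slow async step):

```javascript
// Race an async step against a deadline; resolve with a fallback
// when the deadline wins, instead of hanging for 100 s.
function withTimeout(promise, ms, fallback) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage inside a postNavigationHook (the slow step is simulated here):
(async () => {
  const slowChallenge = new Promise((resolve) =>
    setTimeout(() => resolve('solved'), 200),
  );
  const outcome = await withTimeout(slowChallenge, 50, 'timed-out');
  console.log(outcome); // timed-out
})();
```

On timeout you could then mark the request for retry yourself rather than letting the 100-second handler limit fire.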

Rotate country in proxy for each request

Can we rotate the proxy country without relaunching Crawlee? I need to use a specific country for every URL, without relaunching Crawlee every time.
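With most residential providers the country is encoded in the proxy username, so you can pick it per request by building the proxy URL on the fly (for example via Crawlee's ProxyConfiguration with a newUrlFunction) rather than relaunching. The `country-XX` username convention below is an assumption; check your provider's docs:

```javascript
// Build a proxy URL whose country is chosen per request.
// The `-country-xx` username suffix is provider-specific (an
// assumption here) - adjust to your provider's format.
function proxyUrlFor(country, { user, pass, host, port }) {
  return `http://${user}-country-${country.toLowerCase()}:${pass}@${host}:${port}`;
}

const creds = { user: 'me', pass: 'secret', host: 'proxy.example.com', port: 8000 };
console.log(proxyUrlFor('DE', creds));
// http://me-country-de:secret@proxy.example.com:8000
```

You could store the desired country in request.userData and call a helper like this when each request's proxy URL is resolved.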

Crawlee JS vs Crawlee Python

I've only used Crawlee JS, and I'm wondering: does Crawlee JS have the same features as Crawlee Python? Is one better than the other in some cases?
Solution:
Hi, the JS version is much older, hence more battle-tested, but we are getting close to feature parity with the upcoming v1 release of Crawlee for Python.

Managing duplicate queries using RequestQueue but it seems off.

Description: It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though my RequestQueue list has many more job IDs. ``` import { RequestQueue } from "crawlee";...
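A common cause of "very few jobs processed" is that RequestQueue deduplicates by uniqueKey, which defaults to the normalized URL, so requests with colliding keys are silently dropped. The core behavior can be reproduced with a Set:

```javascript
// RequestQueue-style dedup: a request is only accepted if its
// uniqueKey (falling back to the URL) hasn't been seen before.
// If many job URLs are identical, set an explicit uniqueKey per job.
function enqueueAll(requests) {
  const seen = new Set();
  const accepted = [];
  for (const req of requests) {
    const key = req.uniqueKey ?? req.url;
    if (seen.has(key)) continue; // dropped as a duplicate
    seen.add(key);
    accepted.push(req);
  }
  return accepted;
}

const jobs = [
  { url: 'https://example.com/job', uniqueKey: 'job-1' },
  { url: 'https://example.com/job', uniqueKey: 'job-2' },
  { url: 'https://example.com/job' }, // no key: first use of this URL as key
  { url: 'https://example.com/job' }, // dropped: same URL, no key
];
console.log(enqueueAll(jobs).length); // 3
```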

re-enqueue request without throwing error

Is there any method of retrying a request without throwing an error, while still respecting the maximum retries?
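One library-free way to express "retry without an error" is to track your own attempt counter in userData and enqueue a fresh copy while attempts remain. A sketch (giving each copy a new uniqueKey so the queue accepts it is an assumption about your setup):

```javascript
// Decide whether to re-enqueue a request without throwing, while
// still honoring a maximum number of attempts.
// Returns the request to enqueue next, or null when retries are spent.
function nextAttempt(request, maxRetries) {
  const attempts = (request.userData?.attempts ?? 0) + 1;
  if (attempts > maxRetries) return null;
  return {
    ...request,
    uniqueKey: `${request.url}#attempt-${attempts}`, // new key so the queue accepts it
    userData: { ...request.userData, attempts },
  };
}

let req = { url: 'https://example.com', userData: {} };
req = nextAttempt(req, 2); // attempt 1
req = nextAttempt(req, 2); // attempt 2
console.log(nextAttempt(req, 2)); // null - retries exhausted
```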

Configuring playwright + crawlee js to bypass certain sites

I have noticed some pages that appear completely normal are sometimes hard to fetch content from. After some investigation, it might have something to do with the site being behind cloudflare. Do you have any suggestions on how to get past this? I believe in certain cases, it's simply a matter of popups and accepting some cookies. I do have stealth plugin added, but it still does not pierce through.

Target business owners #crawlee-js

Business Owners: Automate the Impossible — Before Your Competitors Do From securing high-demand tickets to automating bulk product checkouts, online reservations, and real-time data scraping Whether you're in:...

Pure LLM approach

How would you go about this problem? Given topic x, you want to extract data y from a list of website base URLs. Is there any built-in functionality for this? If not, how do you solve it? I have attempted crawling entire sites and one-shot prompting the entire aggregated content to an LLM, given a context window of 1M tokens or higher. It seems to work okay, but I'm positive there are techniques to strip tags / unrelated metadata from each URL scraped within every site....
Solution:
Yeah, Crawlee doesn’t have a built-in way to strip irrelevant stuff like headers or ads automatically. You’re not missing anything — cleanup is still a manual step. You can use libraries like readability or unfluff to extract the main content, or filter DOM sections manually (like removing .footer, .nav, etc.). For trickier cases, you can even use the LLM to clean up pages before extraction. Embedding-based filtering is also a nice option if you want to skip irrelevant pages before sending to the LLM, but it adds complexity. You're on the right track — it's just about fine-tuning the cleanup now....
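A very rough, dependency-free version of that cleanup: drop script/style/nav/footer blocks, then strip remaining tags before handing text to the LLM. Real pages need a proper parser (cheerio, readability); the regexes here are purely illustrative:

```javascript
// Naive pre-LLM cleanup: drop boilerplate elements, then all tags.
// Regex-on-HTML is fragile; treat this as a sketch, not production code.
function roughClean(html) {
  const withoutBlocks = html.replace(
    /<(script|style|nav|footer|header)\b[\s\S]*?<\/\1>/gi,
    ' ',
  );
  return withoutBlocks
    .replace(/<[^>]+>/g, ' ') // strip remaining tags
    .replace(/\s+/g, ' ')     // collapse whitespace
    .trim();
}

const page = `
  <nav>Home | About</nav>
  <article><h1>Topic</h1><p>The data you care about.</p></article>
  <footer>© 2024</footer>
  <script>trackUser()</script>`;
console.log(roughClean(page)); // Topic The data you care about.
```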

Anyone here automated LinkedIn profile analytics before?

Trying to build a dashboard that fetches data like impressions, followers, views, etc. Using Playwright with saved cookies, realistic headers, delays, etc., but still running into issues:
- Getting blocked by bot detection...

X's terms of service

Hello @Kacka H., do Crawlee and the Apify service abide by X's terms of service when I use them to collect tweets for academic purposes? Thanks in advance....

Invalidate request queue after some time

Hello! I would like to know if there's a built-in feature to invalidate (purge) the request queue after some time? Thanks!...