
Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the web pages of the internet by following the URL links contained within each page. The crawler is usually given an initial seed of URLs from which to start its crawl.
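
As a rough illustration of that seed-and-follow behaviour, here is a minimal breadth-first crawler sketch in Python using only the standard library. The seed URL https://example.com/ and the page limit are placeholders, and a real crawler would also honour robots.txt and rate limits.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href target of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=50):
        """Breadth-first crawl: start from the seed URLs and keep following links."""
        queue = deque(seed_urls)
        seen = set(seed_urls)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    charset = resp.headers.get_content_charset() or "utf-8"
                    html = resp.read().decode(charset, "replace")
            except OSError:
                continue  # unreachable page or network error: skip it
            fetched += 1
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            print(f"{url}: {len(extractor.links)} links found")

    if __name__ == "__main__":
        crawl(["https://example.com/"])  # placeholder seed

The queue plus the seen-set is what keeps the crawl from revisiting pages; swapping the deque for a priority queue would give a politeness- or relevance-ordered crawl instead of plain breadth-first.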

0 votes
1 answer
118 views

There have been lots of reports of aggressive Web crawling lately, which people speculate may be linked to the development of LLMs and the incentive to collect Web content to train them. In fact, I ...
a3nm's user avatar
  • 939
-2 votes
2 answers
276 views

I found these entries in the Authelia logs: authelia | time="2025-08-20T20:32:25Z" level=error msg="Target URL does not appear to have a relevant session cookies configuration" ...
Peter Harmann's user avatar
0 votes
0 answers
165 views

My web server has been recently hit with mass visits by web scraper bots. Scraping bots have always been there, but munin plots show that since June 2nd around 8:00 UTC the traffic has increased ...
Nop's user avatar
  • 1
0 votes
1 answer
111 views

I have a robots.txt file that looks like this: User-agent: * Disallow: /account/ Disallow: /captcha/ Disallow: /checkout/ User-agent: DataForSeoBot Disallow: /p- User-...
David Christian's user avatar
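
The question above revolves around per-user-agent Disallow groups in robots.txt. As a hedged aside, Python's built-in urllib.robotparser is a quick way to check how a given bot should interpret such rules; the rules and bot names below are illustrative, modelled on the excerpt rather than copied from the asker's actual file.

    from urllib.robotparser import RobotFileParser

    # Illustrative rules modelled on the kind of file described above,
    # not the asker's actual robots.txt.
    rules = """\
    User-agent: *
    Disallow: /account/
    Disallow: /captcha/
    Disallow: /checkout/

    User-agent: DataForSeoBot
    Disallow: /p-
    """

    parser = RobotFileParser()
    parser.parse(rules.splitlines())

    # A generic crawler falls under the * group, so /account/ is off limits
    # but product pages are fine.
    print(parser.can_fetch("SomeBot", "https://example.com/account/login"))   # False
    print(parser.can_fetch("SomeBot", "https://example.com/p-12345"))         # True

    # DataForSeoBot matches its own group, which replaces the * rules for it:
    # only the /p- prefix is disallowed.
    print(parser.can_fetch("DataForSeoBot", "https://example.com/p-12345"))   # False
    print(parser.can_fetch("DataForSeoBot", "https://example.com/account/"))  # True

Note that once a specific group such as DataForSeoBot matches, the * group no longer applies to that bot, which is a common source of confusion with files like this.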
1 vote
1 answer
589 views

I'm running a PHP vBulletin forum and it's being spammed by a very large number of requests from Facebook crawl servers with hostnames like fwdproxy-cco-031.fbsv.net and fwdproxy-prn-050.fbsv.net. Does anyone know a ...
kungfooman's user avatar
1 vote
1 answer
297 views

G'day folks. Recently we discovered a significant spike in outgoing data on our web server. It turns out Amazon bots are downloading our web imagery, a lot. We set a disallow in our robots.txt, over a ...
Sami.C's user avatar
  • 111
1 vote
1 answer
861 views

I am using an NGINX server to host a static website exposed to the open internet. While glancing through the access logs I came across a cluster of requests for resources ending with .env, e.g.: "...
Rexxyboy's user avatar
0 votes
1 answer
90 views

top - 19:51:36 up 1 day, 12:27, 1 user, load average: 19.14, 11.33, 4.74
Tasks: 172 total, 18 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 90.0 us, 10.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0....
Crypto Coupons's user avatar
1 vote
1 answer
379 views

// Not sure if this question is best fit for serverfault or webmasters stack exchange... I am thinking of rate limiting access to my sites because identifying and blocking bad bots takes most of my time. ...
adrianTNT's user avatar
  • 1,262
0 votes
1 answer
465 views

We are planning maintenance that could take down the services for a whole day. I would therefore like to show a maintenance page, explaining the issue and providing additional info/links. During ...
jacopo3001's user avatar
3 votes
3 answers
4k views

Original question title: "Allow only cloudflare access to my website and block all visits, bots or crawlers to my IP address" I have a question, I use cloudflare DNS on my domain. My VPS 30....
Razyit's user avatar
  • 31
0 votes
0 answers
116 views

I'm running crawlers on my company's internet connection: 10 Raspberry Pis * 45 crawlers each, and 2 desktops * 70 crawlers each. These processes are sending requests 24/7. 3~5% of packets are getting lost. This is ...
startergate's user avatar
1 vote
1 answer
545 views

I am getting weird GET requests on my (non-PHP-supporting) web server for some curious-looking PHP files. I was just wondering whether these are harmless requests from certain browser tools or attempts ...
Luftbaum's user avatar
  • 111
1 vote
1 answer
648 views

I'm a Java engineer with zero DevOps experience. Lately I was playing around with a Linux Ubuntu server for the first time and used Docker with my Selenium project, and faced this problem: I try to scrape HTML ...
Vytautas Šerėnas's user avatar
0 votes
1 answer
687 views

My website has an area restricted to users who sign up with a valid email. I have got requests with bogus emails, and I want to avoid sending emails to non-existent addresses lest they increase the ...
ginjaemocoes's user avatar
1 vote
0 answers
94 views

I have a couple of podcasts I host on my site and I've noticed a disturbing trend the last couple of months: my site's bandwidth usage has gone up by 10x, but it appears most of it was a series of ...
Timothy R. Butler's user avatar
0 votes
1 answer
460 views

I am currently trying to analyze the traffic of a website. Besides specifics regarding the requested resource and timestamps, the tracking system only provides the request's HTTP referrer. In most ...
user avatar
0 votes
2 answers
283 views

I have a Nextcloud server running on Apache, and disabled my firewall for about 5 minutes while I ran an apt-update. I decided to check the logs after, and found this from an unknown IP. It looks like ...
user3207650's user avatar
1 vote
1 answer
966 views

I noticed a couple (ostensibly-)harmless log entries, and--I'm admittedly overthinking this by a mile--got curious about Apache2 response sizes. This Ukrainian crawler † hit my web daemon, two seconds ...
zedmelon's user avatar
  • 113
0 votes
1 answer
123 views

I converted my website from ASP.NET to .NET Core and host it on the same server. Now the website gets hundreds of hits daily from different IPs trying to access paths like the ones below: /php-myadmin/ /wp-content/ /mysql/ ...
Bunty Choudhary's user avatar
-3 votes
1 answer
242 views

When no radioactive decay is available and good entropy is strongly advised for security reasons, you have a real problem. HTTPS connections consume a lot of entropy. If you have thousands of ...
Andreas Karatassios-Peios's user avatar
-4 votes
2 answers
107 views

Suppose http://example2.com makes a cURL connection to a website called http://example1.com. If I access http://example2.com from my PC to see the content of http://example1.com, then would http://example1....
Suraj Neupane's user avatar
1 vote
0 answers
166 views

I'd like to mirror an old site of mine to local files. I've used httrack for this in the past, but I'm having a problem this time that I really thought I figured out before, but can't seem to now. ...
boomhauer's user avatar
  • 151
-1 votes
2 answers
2k views

I have multiple physical sub-domains and I don't want to change the robots.txt file of any of those sub-domains. Is there any way to disallow all the sub-domains from my main domain's physical robots....
Aditya Shah's user avatar
0 votes
0 answers
103 views

Here is a strange one for you. We have a server with multiple VHOSTS that include both SSL and Non-SSL domains. Domain1 is SSL enabled, while Domain2 doesn't have SSL. Since all these domains are ...
mamad's user avatar
  • 1
0 votes
1 answer
680 views

I recently logged into a VPS I have (with Vultr, if that is of any concern) to find a large amount of nginx logs and a higher than expected load average. This server is doing effectively nothing, and ...
dukky's user avatar
  • 1
-1 votes
1 answer
87 views

Something/someone from 40.96.18.165 has been hitting my web server exactly eight times a day, every day, since Feb 5, 2017. The user agent used is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0). ...
u936293's user avatar
  • 407
1 vote
1 answer
2k views

I made a script to scan a file which contains a portion of the IPv4 address space (about 50 million addresses). It attempts to connect to each website using OpenSSL, extract a small piece of it, and write it into a ...
user153882's user avatar
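
The script itself isn't shown in the question above, so purely as a sketch of the general technique (open a TLS connection to each address, grab a small piece of the response, append it to an output file), something like the following Python would do. The file names, port, and HEAD request are assumptions, not the asker's code.

    import socket
    import ssl

    # Hypothetical file names; the asker's actual script and file layout are not shown.
    INPUT_FILE = "addresses.txt"    # one IPv4 address per line
    OUTPUT_FILE = "results.txt"

    context = ssl.create_default_context()
    context.check_hostname = False          # connecting by bare IP, so nothing to verify
    context.verify_mode = ssl.CERT_NONE

    with open(INPUT_FILE) as addresses, open(OUTPUT_FILE, "a") as results:
        for line in addresses:
            ip = line.strip()
            if not ip:
                continue
            try:
                with socket.create_connection((ip, 443), timeout=3) as sock:
                    with context.wrap_socket(sock) as tls:
                        # Ask for the front page headers and keep only the status line.
                        tls.sendall(b"HEAD / HTTP/1.0\r\nHost: " + ip.encode() + b"\r\n\r\n")
                        status_line = tls.recv(256).split(b"\r\n", 1)[0]
                results.write(ip + "\t" + status_line.decode("latin-1") + "\n")
            except OSError:
                continue  # unreachable host or TLS failure: move on to the next address

A single-threaded loop like this is far too slow for 50 million addresses; the asker's actual script presumably parallelises the connections, which is where rate limits and open-file limits start to matter.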
0 votes
1 answer
272 views

I've got some errors showing up in my site logs where some bots are trying to access URLs with strange GET params. # normal url example.com?foo=123456 # odd url triggering integer error by bots ...
Pete's user avatar
  • 303
2 votes
0 answers
61 views

I recently received a large number of hits on my home page from 64.235.153.8. It resolves to barracuda.com. I know Barracuda as an enterprise-class spam detection/prevention solution. Do they also ...
Luke G's user avatar
  • 151
1 vote
1 answer
3k views

I have written a small bash script for crawling an XML sitemap of URLs. It retrieves 5 URLs in parallel using xargs. Now I want an email to be sent when all URLs have been crawled, so it has to wait ...
Alex's user avatar
  • 322
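
The asker's bash/xargs script isn't shown, but the "fetch 5 URLs in parallel, then notify only after everything has finished" pattern can be sketched in Python with the standard library. The sitemap URL and worker count below are placeholders, and the final print stands in for whatever mail command is actually used.

    import concurrent.futures
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder sitemap location

    def fetch(url):
        """Request one URL and report its HTTP status."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.status

    # Pull the <loc> entries out of the sitemap (standard sitemap namespace).
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    with urllib.request.urlopen(SITEMAP_URL, timeout=30) as resp:
        urls = [loc.text for loc in ET.parse(resp).getroot().findall("sm:url/sm:loc", ns)]

    # Five workers, mirroring xargs -P 5. The with-block only exits once every
    # submitted future has finished, so anything after it runs strictly
    # "when all URLs have been crawled".
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in concurrent.futures.as_completed(futures):
            try:
                url, status = future.result()
                print(status, url)
            except OSError as exc:
                print("failed:", exc)

    print("All URLs crawled; send the notification email here (e.g. via smtplib).")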
3 votes
0 answers
3k views

I need to block a bunch of robots from crawling a few hundred sites hosted on an Nginx web server running on an Ubuntu 16.04 machine. I've found a fairly simple example here (important part of the ...
Sledge Hammer's user avatar
-1 votes
0 answers
51 views

I have developed a nice little app that crawls a bunch of newspaper web sites and makes their latest content available on my phone offline. It's basically a Pocket app that saves contents ...
user221200's user avatar
12 votes
3 answers
3k views

How do large sites (e.g. Wikipedia) deal with bots that sit behind an IP masquerade? For instance, at my university, everybody searches Wikipedia, giving it a significant load. But, as far as I know, ...
user4052054's user avatar
1 vote
1 answer
86 views

In the course of about 2 hours, a logged in user on my website accessed roughly 1,600 pages in a way that looks suspiciously similar to a bot. I am concerned because users must purchase access to the ...
Nick S.'s user avatar
  • 131
2 votes
1 answer
555 views

I recently noticed some strange traffic in my nginx access logs. I'm not sure if these indicate an attack, a mistake, or something else. I've started sending these to HTTP 444, so these logs will ...
user153775's user avatar
3 votes
1 answer
885 views

I'm in a difficult situation: the Baidu spider is hitting my site, using about 3 GB a day worth of bandwidth. At the same time I do business in China, so I don't want to just block it. Has anyone else ...
d.lanza38's user avatar
  • 407
0 votes
1 answer
543 views

About ten days ago I moved a site - mostly a Joomla discussion board - to a new server at a different IP address. During a brief scheduled downtime I replicated the content over and completed DNS ...
Ryan's user avatar
  • 81
-3 votes
2 answers
180 views

I would like to protect my server from too many hits from bots. Consider a scenario where a (physical) server located in a private network is hitting my server continuously. Do I have a ...
kris123456's user avatar
0 votes
1 answer
1k views

In my Nginx log I have recently noticed hundreds of entries like this, where a directory search was executed with an error because those directories do not exist on my web server. Now, how can I block them once ...
Tapash's user avatar
  • 153
2 votes
2 answers
81 views

In the logs of my website, there are a lot of visits with an HTTP referer set to spam-like websites (usually Russian sites, I've noticed). I assume what they're doing is just using a web crawler to visit ...
user avatar
1 vote
0 answers
731 views

I'm maintaining some web crawlers. I want to improve our load/throttling system to be more intelligent. Of course I look at response codes, and throttle up or down based on that. I would, though, like ...
Niels Kristian's user avatar
1 vote
0 answers
821 views

My domain name has both IPv4 and IPv6 addresses assigned. An IPv4 connection to Google isn't available all the time due to restrictions of my campus network, but IPv6 is available all the time. ...
ReeseWang's user avatar
0 votes
1 answer
4k views

I have some web crawlers, and a specific website seems to be blocking traffic temporarily after some time. The thing is, even though all clients have the same external IP address (they access the ...
Doug's user avatar
  • 239
0 votes
2 answers
2k views

I have a secure SSO site that uses Shibboleth authentication and a SAML identity provider. I need to allow a Google Search Appliance crawler to come and index the URLs. I have a requirement to change on ...
chowmojo's user avatar
30 votes
4 answers
9k views

I have found out that McAfee SiteAdvisor has reported my website as "may be having security issues". I care little about whatever McAfee thinks of my website (I can secure it myself and if not, ...
kralyk's user avatar
  • 497
-8 votes
2 answers
3k views

I want to build a tool that scans a website for all URLs (not the URLs within each page, but those of the site itself), but I don't know how. Could anyone give me an example of how I can start? Example: www....
chunk0r's user avatar
  • 11
0 votes
1 answer
614 views

We supply Magento and Typo3 installations to customers. To improve QA we wanted to use an automatic link checker to check for broken and/or outdated links - automatically. We want to check all links ...
Dabu's user avatar
  • 359
3 votes
1 answer
351 views

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version ...
Brian's user avatar
  • 796
-4 votes
1 answer
518 views

I am using HTTrack as a web crawler; can it use my credentials to access the members area and download the zip files, since they are restricted from public access? Thank you in advance. ...
M. A.'s user avatar
  • 97