
Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the web pages of the internet by following the URL links contained within each page. The crawler is usually given an initial seed of URLs from which to start its crawl.
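
As a rough illustration of that seed-and-follow behaviour, here is a minimal breadth-first crawler sketch in Python using only the standard library. The seed URL https://example.com/ and the page limit are placeholders, and a real crawler would also honour robots.txt and rate limits.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href target of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=50):
        """Breadth-first crawl: start from the seed URLs and keep following links."""
        queue = deque(seed_urls)
        seen = set(seed_urls)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    charset = resp.headers.get_content_charset() or "utf-8"
                    html = resp.read().decode(charset, "replace")
            except OSError:
                continue  # unreachable page or network error: skip it
            fetched += 1
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            print(f"{url}: {len(extractor.links)} links found")

    if __name__ == "__main__":
        crawl(["https://example.com/"])  # placeholder seed

The queue plus the seen-set is what keeps the crawl from revisiting pages; swapping the deque for a priority queue would give a politeness- or relevance-ordered crawl instead of plain breadth-first.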

0 votes
1 answer
118 views

There have been lots of reports of aggressive Web crawling lately, which people speculate may be linked to the development of LLMs and the incentive to collect Web content to train them. In fact, I ...
a3nm's user avatar
  • 939
-2 votes
2 answers
276 views

I found these entries in the Authelia logs: authelia | time="2025-08-20T20:32:25Z" level=error msg="Target URL does not appear to have a relevant session cookies configuration" ...
Peter Harmann's user avatar
0 votes
0 answers
165 views

My web server has been recently hit with mass visits by web scraper bots. Scraping bots have always been there, but munin plots show that since June 2nd around 8:00 UTC the traffic has increased ...
Nop's user avatar
  • 1
0 votes
1 answer
111 views

I have a robots.txt file that looks like this: User-agent: * Disallow: /account/ Disallow: /captcha/ Disallow: /checkout/ User-agent: DataForSeoBot Disallow: /p- User-...
David Christian's user avatar
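
The question above revolves around per-user-agent Disallow groups in robots.txt. As a hedged aside, Python's built-in urllib.robotparser is a quick way to check how a given bot should interpret such rules; the rules and bot names below are illustrative, modelled on the excerpt rather than copied from the asker's actual file.

    from urllib.robotparser import RobotFileParser

    # Illustrative rules modelled on the kind of file described above,
    # not the asker's actual robots.txt.
    rules = """\
    User-agent: *
    Disallow: /account/
    Disallow: /captcha/
    Disallow: /checkout/

    User-agent: DataForSeoBot
    Disallow: /p-
    """

    parser = RobotFileParser()
    parser.parse(rules.splitlines())

    # A generic crawler falls under the * group, so /account/ is off limits
    # but product pages are fine.
    print(parser.can_fetch("SomeBot", "https://example.com/account/login"))   # False
    print(parser.can_fetch("SomeBot", "https://example.com/p-12345"))         # True

    # DataForSeoBot matches its own group, which replaces the * rules for it:
    # only the /p- prefix is disallowed.
    print(parser.can_fetch("DataForSeoBot", "https://example.com/p-12345"))   # False
    print(parser.can_fetch("DataForSeoBot", "https://example.com/account/"))  # True

Note that once a specific group such as DataForSeoBot matches, the * group no longer applies to that bot, which is a common source of confusion with files like this.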
1 vote
1 answer
589 views

I'm running a PHP vBulletin forum and it's being spammed by a very large number of requests from Facebook crawl servers with hostnames like fwdproxy-cco-031.fbsv.net and fwdproxy-prn-050.fbsv.net. Does anyone know a ...
kungfooman's user avatar
1 vote
1 answer
297 views

G'day folks. Recently we discovered a significant spike in outgoing data on our web server. It turns out Amazon bots are downloading our web imagery, a lot. We set a disallow in our robots.txt, over a ...
Sami.C's user avatar
  • 111
1 vote
1 answer
861 views

I am using an NGINX server to host a static website exposed to the open internet. While glancing through the access logs I came across a cluster of requests for resources ending with .env, e.g.: "...
Rexxyboy's user avatar
0 votes
1 answer
90 views

top - 19:51:36 up 1 day, 12:27, 1 user, load average: 19.14, 11.33, 4.74
Tasks: 172 total, 18 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 90.0 us, 10.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0....
Crypto Coupons's user avatar
1 vote
1 answer
379 views

// Not sure if this question is best fit for serverfault or webmasters stack exchange... I am thinking of rate limiting access to my sites because identifying and blocking bad bots takes most of my time. ...
adrianTNT's user avatar
  • 1,262
0 votes
1 answer
465 views

We are planning maintenance that could take down the services for a whole day. I would therefore like to show a maintenance page, explaining the issue and providing additional info/links. During ...
jacopo3001's user avatar
3 votes
3 answers
4k views

Original question title: "Allow only cloudflare access to my website and block all visits, bots or crawlers to my IP address" I have a question, I use cloudflare DNS on my domain. My VPS 30....
Razyit's user avatar
  • 31
0 votes
0 answers
116 views

I'm running crawlers on my company's internet connection: 10 Raspberry Pis * 45 crawlers each, and 2 desktops * 70 crawlers each. These processes are sending requests 24/7. 3~5% of packets are getting lost. This is ...
startergate's user avatar
1 vote
1 answer
545 views

I am getting weird GET requests on my (non-PHP-supporting) web server for some curious-looking PHP files. I was just wondering whether these are harmless requests from certain browser tools or attempts ...
Luftbaum's user avatar
  • 111
1 vote
1 answer
648 views

I'm a Java engineer with zero DevOps experience. Lately I was playing around with a Linux Ubuntu server for the first time and used Docker with my Selenium project, and faced this problem: I try to scrape HTML ...
Vytautas Šerėnas's user avatar
0 votes
1 answer
687 views

My website has an area restricted to users who sign up with a valid email. I have got requests with bogus emails, and I want to avoid sending emails to non-existent addresses lest they increase the ...
ginjaemocoes's user avatar
1 vote
0 answers
94 views

I have a couple of podcasts I host on my site and I've noticed a disturbing trend the last couple of months: my site's bandwidth usage has gone up by 10x, but it appears most of it was a series of ...
Timothy R. Butler's user avatar
0 votes
1 answer
460 views

I am currently trying to analyze the traffic of a website. Besides specifics regarding the requested resource and timestamps, the tracking system only provides the request's HTTP referrer. In most ...
user avatar
0 votes
2 answers
283 views

I have a Nextcloud server running on Apache, and disabled my firewall for about 5 minutes while I ran an apt-update. I decided to check the logs after, and found this from an unknown IP. It looks like ...
user3207650's user avatar
1 vote
1 answer
966 views

I noticed a couple (ostensibly-)harmless log entries, and--I'm admittedly overthinking this by a mile--got curious about Apache2 response sizes. This Ukrainian crawler † hit my web daemon, two seconds ...
zedmelon's user avatar
  • 113
0 votes
1 answer
123 views

I converted my website from ASP.NET to .NET Core and host it on the same server. Now the website gets hundreds of hits daily from different IPs trying to access paths like the ones below: /php-myadmin/ /wp-content/ /mysql/ ...
Bunty Choudhary's user avatar
-3 votes
1 answer
242 views

When no radioactive decay is available and good entropy is strongly advised for security reasons, you have a real problem. HTTPS connections consume a lot of entropy. If you have thousands of ...
Andreas Karatassios-Peios's user avatar
-4 votes
2 answers
107 views

Suppose http://example2.com makes a cURL connection to a website called http://example1.com. If I access http://example2.com from my PC to see the content of http://example1.com, then would http://example1....
Suraj Neupane's user avatar
1 vote
0 answers
166 views

I'd like to mirror an old site of mine to local files. I've used httrack for this in the past, but I'm having a problem this time that I really thought I figured out before, but can't seem to now. ...
boomhauer's user avatar
  • 151
-1 votes
2 answers
2k views

I have multiple physical sub-domains and I don't want to change the robots.txt file of any of those sub-domains. Is there any way to disallow all the sub-domains from my main domain's physical robots....
Aditya Shah's user avatar
0 votes
0 answers
103 views

Here is a strange one for you. We have a server with multiple VHOSTS that include both SSL and Non-SSL domains. Domain1 is SSL enabled, while Domain2 doesn't have SSL. Since all these domains are ...
mamad's user avatar
  • 1
0 votes
1 answer
680 views

I recently logged into a VPS I have (with Vultr, if that is of any concern) to find a large amount of nginx logs and a higher than expected load average. This server is doing effectively nothing, and ...
dukky's user avatar
  • 1
-1 votes
1 answer
87 views

Something/someone from 40.96.18.165 has been hitting my web server exactly eight times a day, every day, since Feb 5, 2017. The user agent used is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0). ...
u936293's user avatar
  • 407
1 vote
1 answer
2k views

I made a script to scan a file which contains a portion of the IPv4 address space (about 50 million addresses). It attempts to connect to each website using OpenSSL, extract a small piece of it, and write it into a ...
user153882's user avatar
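
The script itself isn't shown in the question above, so purely as a sketch of the general technique (open a TLS connection to each address, grab a small piece of the response, append it to an output file), something like the following Python would do. The file names, port, and HEAD request are assumptions, not the asker's code.

    import socket
    import ssl

    # Hypothetical file names; the asker's actual script and file layout are not shown.
    INPUT_FILE = "addresses.txt"    # one IPv4 address per line
    OUTPUT_FILE = "results.txt"

    context = ssl.create_default_context()
    context.check_hostname = False          # connecting by bare IP, so nothing to verify
    context.verify_mode = ssl.CERT_NONE

    with open(INPUT_FILE) as addresses, open(OUTPUT_FILE, "a") as results:
        for line in addresses:
            ip = line.strip()
            if not ip:
                continue
            try:
                with socket.create_connection((ip, 443), timeout=3) as sock:
                    with context.wrap_socket(sock) as tls:
                        # Ask for the front page headers and keep only the status line.
                        tls.sendall(b"HEAD / HTTP/1.0\r\nHost: " + ip.encode() + b"\r\n\r\n")
                        status_line = tls.recv(256).split(b"\r\n", 1)[0]
                results.write(ip + "\t" + status_line.decode("latin-1") + "\n")
            except OSError:
                continue  # unreachable host or TLS failure: move on to the next address

A single-threaded loop like this is far too slow for 50 million addresses; the asker's actual script presumably parallelises the connections, which is where rate limits and open-file limits start to matter.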
0 votes
1 answer
272 views

I've got some errors showing up in my site logs where some bots are trying to access URLs with strange GET params. # normal url example.com?foo=123456 # odd url triggering integer error by bots ...
Pete's user avatar
  • 303
2 votes
0 answers
61 views

I recently received a large number of hits on my home page from 64.235.153.8. It resolves to barracuda.com. I know Barracuda as an enterprise-class spam detection/prevention solution. Do they also ...
Luke G's user avatar
  • 151
1 vote
1 answer
3k views

I have written a small bash script for crawling an XML sitemap of URLs. It retrieves 5 URLs in parallel using xargs. Now I want an email to be sent when all URLs have been crawled, so it has to wait ...
Alex's user avatar
  • 322
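
The asker's bash/xargs script isn't shown, but the "fetch 5 URLs in parallel, then notify only after everything has finished" pattern can be sketched in Python with the standard library. The sitemap URL and worker count below are placeholders, and the final print stands in for whatever mail command is actually used.

    import concurrent.futures
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder sitemap location

    def fetch(url):
        """Request one URL and report its HTTP status."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.status

    # Pull the <loc> entries out of the sitemap (standard sitemap namespace).
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    with urllib.request.urlopen(SITEMAP_URL, timeout=30) as resp:
        urls = [loc.text for loc in ET.parse(resp).getroot().findall("sm:url/sm:loc", ns)]

    # Five workers, mirroring xargs -P 5. The with-block only exits once every
    # submitted future has finished, so anything after it runs strictly
    # "when all URLs have been crawled".
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in concurrent.futures.as_completed(futures):
            try:
                url, status = future.result()
                print(status, url)
            except OSError as exc:
                print("failed:", exc)

    print("All URLs crawled; send the notification email here (e.g. via smtplib).")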
3 votes
0 answers
3k views

I need to block a bunch of robots from crawling a few hundred sites hosted on an Nginx web server running on an Ubuntu 16.04 machine. I've found a fairly simple example here (important part of the ...
Sledge Hammer's user avatar
-1 votes
0 answers
51 views

I have developed a nice little app that crawls a bunch of newspaper web sites and makes their latest content available on my phone offline. It's basically a Pocket app that saves contents ...
user221200's user avatar
12 votes
3 answers
3k views

How do large sites (e.g. Wikipedia) deal with bots that sit behind an IP masquerade? For instance, at my university, everybody searches Wikipedia, giving it a significant load. But, as far as I know, ...
user4052054's user avatar
1 vote
1 answer
86 views

In the course of about 2 hours, a logged in user on my website accessed roughly 1,600 pages in a way that looks suspiciously similar to a bot. I am concerned because users must purchase access to the ...
Nick S.'s user avatar
  • 131
2 votes
1 answer
555 views

I recently noticed some strange traffic in my nginx access logs. I'm not sure if these indicate an attack, a mistake, or something else. I've started sending these to HTTP 444, so these logs will ...
user153775's user avatar
3 votes
1 answer
885 views

I'm in a difficult situation: the Baidu spider is hitting my site, using about 3 GB a day worth of bandwidth. At the same time I do business in China, so I don't want to just block it. Has anyone else ...
d.lanza38's user avatar
  • 407
0 votes
1 answer
543 views

About ten days ago I moved a site - mostly a Joomla discussion board - to a new server at a different IP address. During a brief scheduled downtime I replicated the content over and completed DNS ...
Ryan's user avatar
  • 81
-3 votes
2 answers
180 views

I would like to protect my server from too many hits from bots. Consider a scenario where a (physical) server located in a private network is hitting my server continuously. Do I have a ...
kris123456's user avatar
0 votes
1 answer
1k views

In my Nginx log I have recently noticed hundreds of entries like this, where a directory search was executed with an error because those directories do not exist on my web server. Now, how can I block them once ...
Tapash's user avatar
  • 153
2 votes
2 answers
81 views

In the logs of my website, there are a lot of visits with an HTTP referer set to spam-like websites (usually Russian sites, I've noticed). I assume what they're doing is just using a web crawler to visit ...
user avatar
1 vote
0 answers
731 views

I'm maintaining some web crawlers. I want to improve our load/throttling system to be more intelligent. Of course I look at response codes, and throttle up or down based on that. I would, though, like ...
Niels Kristian's user avatar
1 vote
0 answers
821 views

My domain name has both IPv4 and IPv6 addresses assigned. An IPv4 connection to Google isn't available all the time due to restrictions of my campus network, but IPv6 is available all the time. ...
ReeseWang's user avatar
0 votes
1 answer
4k views

I have some web crawlers, and a specific website seems to be blocking traffic temporarily after some time. The thing is, even though all clients have the same external IP address (they access the ...
Doug's user avatar
  • 239
0 votes
2 answers
2k views

I have a secure SSO site that uses Shibboleth authentication and a SAML identity provider. I need to allow a Google Search Appliance crawler to come and index the URLs. I have a requirement to change on ...
chowmojo's user avatar
30 votes
4 answers
9k views

I have found out that McAfee SiteAdvisor has reported my website as "may be having security issues". I care little about whatever McAfee thinks of my website (I can secure it myself and if not, ...
kralyk's user avatar
  • 497
-8 votes
2 answers
3k views

I want to build a tool that scans a website for all URLs (not the URLs within each page, but those of the site itself), but I don't know how. Could anyone give me an example of how I can start? Example: www....
chunk0r's user avatar
  • 11
0 votes
1 answer
614 views

We supply Magento and Typo3 installations to customers. To improve QA we wanted to use an automatic link checker to check for broken and/or outdated links - automatically. We want to check all links ...
Dabu's user avatar
  • 359
3 votes
1 answer
351 views

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version ...
Brian's user avatar
  • 796
-4 votes
1 answer
518 views

I am using HTTrack as a web crawler; can it use my credentials to access the members area and download the zip files, since they are restricted from public access? Thank you in advance. ...
M. A.'s user avatar
  • 97