
I'm a Java engineer with zero DevOps experience. I recently set up an Ubuntu Linux server for the first time and used Docker with my Selenium project, and I ran into this problem:

I'm trying to scrape HTML from a website, but my requests are blocked with a 403 Forbidden response. I tried to curl the same website and got the same response.

Furthermore, I only get blocked on my Linux server; everything works in my local dev environment with the same Docker image, which is why I think it's a "server fault".

Any idea what my Linux server is missing here? Maybe I'm missing some sort of certificate, or have a CORS problem? What can I try? (For learning purposes only.)

curl call here

  • Pass the web browser and your curl and Java apps through a proxy like mitmproxy and check the requests, especially the headers. I am sure you will see the differences that cause the web server to send different responses. Commented Jan 31, 2022 at 20:31
  • Not really on topic for Server Fault; getting Selenium and curl commands to work is more Stack Overflow. But most likely the site tries to detect scrapers and uses mechanisms like cookies and sessions to identify real interactive users/browsers. Commented Jan 31, 2022 at 20:36
  • @Bob I would say it's Server Fault, because it works on my local machine with the same Docker image. Commented Feb 1, 2022 at 6:28
  • @Robert appreciate your suggestion, I'm going to investigate and update this question. Commented Feb 1, 2022 at 6:30
  • Just being the server's fault doesn't make it on-topic for Server Fault. If this is your server you are trying to scrape, provide your server configuration and log files and we can try to help. If it is not your server, it's off-topic here, and in that case I'd stop doing what you are doing: right now you are just getting a 403, but the next notice might be from a lawyer. Commented Feb 4, 2022 at 9:14
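The mitmproxy suggestion in the first comment could be tried roughly like this (a sketch; `example.com` stands in for the real site, and 8080 is mitmproxy's default listen port):

```shell
# Terminal 1: start mitmproxy (install with: pip install mitmproxy)
#   mitmproxy --listen-port 8080

# Terminal 2: send the same request through the proxy from curl;
# repeat from the Java/Selenium app and from a real browser, then
# compare the captured requests (especially the headers) in the
# mitmproxy UI.
curl --proxy http://127.0.0.1:8080 -k -v https://example.com/
```

The `-k` flag skips certificate verification, which is needed here because mitmproxy re-signs TLS traffic with its own CA unless you install that CA on the client.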

1 Answer


I believe you're getting rate-limited or blocked by the website. If I run the same curl command from my laptop, I get the webpage back.
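One of the most common reasons the same request succeeds on one machine and fails on another is a header difference, especially User-Agent: curl identifies itself as `curl/7.x` by default, and many sites block that outright. A hedged sketch of retrying with browser-like headers (`example.com` is a placeholder):

```shell
# Send the same request, but with headers that resemble a desktop
# browser; if this returns 200 where plain curl returned 403, the
# block is based on request headers rather than on the client's IP.
curl -v \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  https://example.com/
```

If the headers are identical on both machines and the server still responds differently, the block is more likely keyed to the source IP (cloud-provider address ranges are frequently blocklisted) than to the request itself.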

Remember to respect robots.txt if you're doing web scraping.
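robots.txt is a plain-text file at the site root that tells crawlers which paths they may fetch; a sketch of how to check it and what a restrictive one looks like (`example.com` is a placeholder):

```shell
# Fetch the file before scraping:
#   curl -s https://example.com/robots.txt
# "Disallow" lines list paths the named user-agent should not
# request; "Crawl-delay" asks crawlers to pause between requests.
cat <<'EOF'
User-agent: *
Disallow: /private/
Crawl-delay: 10
EOF
```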

  • Did not know about robots.txt, great finding, thanks. I had no idea about rate limiting, but I don't think that's the case here, because the very first call after deployment was blocked. Commented Feb 4, 2022 at 9:16
