
This repository shows how to build a Gemini-powered web scraper in Python that uses an LLM to extract structured data from complex web pages, without writing custom parsing logic.
📖 Read the full tutorial → How to Leverage Gemini AI for Web Scraping
- Fetches HTML from any public webpage
- Converts HTML to Markdown using markdownify
- Sends it to Gemini AI with a natural language prompt
- Extracts structured data in JSON format
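
The four steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual script: the function names (`build_prompt`, `parse_json_reply`, `scrape`), the model name `gemini-1.5-flash`, and the `GEMINI_API_KEY` environment variable are assumptions made for illustration.

```python
import json
import os

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def build_prompt(markdown: str, instruction: str) -> str:
    """Combine a natural-language instruction with the page content."""
    return (
        f"{instruction}\n"
        "Respond with valid JSON only.\n\n"
        f"Page content (Markdown):\n{markdown}"
    )


def parse_json_reply(reply: str):
    """Parse the model reply as JSON, tolerating a code-fence wrapper."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) ...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ... and everything from the closing fence onward.
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def scrape(url: str, instruction: str, model_name: str = "gemini-1.5-flash"):
    """Fetch a page, convert it to Markdown, and ask Gemini for JSON."""
    # Third-party imports live here so the helpers above work without them.
    import requests
    import google.generativeai as genai
    from markdownify import markdownify

    html = requests.get(url, timeout=30).text
    markdown = markdownify(html)
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed variable name
    model = genai.GenerativeModel(model_name)  # assumed model name
    response = model.generate_content(build_prompt(markdown, instruction))
    return parse_json_reply(response.text)
```

A call might look like `scrape("https://example.com", "Extract the page title as {\"title\": ...}")`. Stripping the code fence before `json.loads` matters because models frequently wrap JSON answers in Markdown fences even when asked not to.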
- google-generativeai – Gemini API for LLM-powered parsing
- requests – For basic HTTP requests (if not using a proxy)
- beautifulsoup4 – For basic HTML parsing (optional)
- markdownify – Converts HTML into cleaner Markdown
- python-dotenv – For managing API keys and environment variables
- Clone this repo:

      git clone https://github.com/yourusername/gemini-ai-web-scraper.git
      cd gemini-ai-web-scraper

- Install dependencies:

      pip install google-generativeai python-dotenv requests beautifulsoup4 markdownify

- Add your Gemini API key in the script or as an environment variable.
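
One way to handle the environment-variable option is sketched below. The variable name `GEMINI_API_KEY` and the helper names are assumptions, not something this repository defines; `load_dotenv` comes from the `python-dotenv` package listed above and reads a local `.env` file.

```python
import os


def get_gemini_api_key(env=None, var_name="GEMINI_API_KEY"):
    """Return the API key from the given mapping or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} in your environment or .env file")
    return key


def load_key():
    """Load a local .env file if python-dotenv is installed, then read the key."""
    try:
        from dotenv import load_dotenv  # provided by python-dotenv
        load_dotenv()  # copies entries from .env into os.environ
    except ImportError:
        pass  # fall back to variables already set in the shell
    return get_gemini_api_key()
```

With a `.env` file containing a line such as `GEMINI_API_KEY=your-key-here`, `load_key()` returns the key without it ever appearing in the script.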
Web scraping with Gemini AI can hit blocks, CAPTCHAs, and anti-bot systems. Crawlbase Smart Proxy solves that.
- Avoid IP blocks with automatic rotation
- Bypass CAPTCHAs seamlessly
- Skip proxy management
- Get clean, parsed HTML for better AI input

    import requests
    import time

    proxy_url = "http://_USER_TOKEN_@smartproxy.crawlbase.com:8012"
    proxies = {"http": proxy_url, "https": proxy_url}
    url = "https://example.com/protected-page"

    time.sleep(2)  # Mimic human behavior
    # verify=False disables TLS certificate verification for requests routed through the proxy
    response = requests.get(url, proxies=proxies, verify=False)
    print(response.text)

Replace _USER_TOKEN_ with your Crawlbase Smart Proxy token, which you get after signing up on Crawlbase.
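
To keep the token out of source control, the proxy request above could be wrapped as in the following sketch. The helper names and the `CRAWLBASE_TOKEN` environment variable are assumptions for illustration; the proxy host and port are taken from the example above.

```python
import os
import time

PROXY_HOST = "smartproxy.crawlbase.com:8012"  # host and port from the example above


def build_proxies(token: str) -> dict:
    """Build the proxies mapping that requests expects."""
    if not token:
        raise ValueError("Crawlbase Smart Proxy token is empty")
    proxy_url = f"http://{token}@{PROXY_HOST}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url: str, delay: float = 2.0, timeout: float = 60.0) -> str:
    """Fetch a page through the proxy and return its HTML."""
    import requests  # imported here so build_proxies works without it

    token = os.environ.get("CRAWLBASE_TOKEN", "")  # assumed variable name
    time.sleep(delay)  # same pause as in the example above
    response = requests.get(
        url, proxies=build_proxies(token), verify=False, timeout=timeout
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.text
```

The returned HTML can then go through the same Markdown conversion and Gemini prompt as a page fetched directly.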