Ever faced a project that felt like hitting a brick wall? This one looked straightforward, until it wasn't: an internal Google Sites instance, access-protected and tied to a workspace domain. That meant standard scraping tools stopped dead at the login page.
After countless hours of trial, error, and debugging, I finally cracked it. The solution? A hybrid approach that combines manual intervention with automation, resulting in a seamless, robust system.
In this post, I'm sharing the full story: the roadblocks I faced, the strategies I tried (and why they failed), and the final working Python script, step by step.
The Core Problem: Why It's "Impossible"
Modern web applications, especially from Google, are designed to prevent basic scraping. The primary roadblock is authentication. You can't just send a username and password anymore; you need to handle potential 2-Factor Authentication (2FA), captchas, and complex JavaScript-driven login flows. A purely automated script running on a server can't do this.
The Breakthrough: A Hybrid "Human-in-the-Loop" Architecture
The solution was to stop thinking about it as a single, fully-automated task. We broke it down into a hybrid system where a human and a robot collaborate:
- Manual Authentication (The Human Part): A script opens a browser for a human to perform the complex login. It then saves the session "key" (the cookies).
- Automated Scraping (The Robot Part): A separate, powerful headless script uses that session key to do the heavy lifting—visiting every page, downloading all content, and saving it in an organized way.
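The only thing the two halves share is that cookie list. If you'd rather not repeat the manual login on every run, the captured cookies can be stashed on disk between runs. Here's a minimal sketch of that idea (the `save_cookies`/`load_cookies` helpers and the `cookies.json` path are my own additions, not part of the final script):

```python
import json

def save_cookies(cookies, path="cookies.json"):
    # context.cookies() returns plain dicts, so they serialize straight to JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)

def load_cookies(path="cookies.json"):
    # Returns the saved cookie list, or None if no session has been captured yet
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

Keep in mind that session cookies expire, so the manual login step will still come back around periodically.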
Toolkit
This solution relies on a few key Python libraries. You'll want a requirements.txt file with the following:
```
# requirements.txt
playwright
httpx
beautifulsoup4
lxml
```

The Rocky Road: Our Initial Failures
Before arriving at the final script, we hit several walls. Our first attempt was to use a higher-level scraping library (like Crawl4AI), but it didn't offer the granular control needed for the interactive login. This forced us to use Playwright directly.
This led to our first major bug: a NotImplementedError on Windows. It turns out the asyncio event loop in play on Windows didn't support the subprocess calls Playwright uses to launch its browser.
Lesson Learned: Always account for platform differences. The fix was to explicitly set a compatible event loop policy for Windows right at the start of the script. This was a critical lesson in writing robust, cross-platform code.
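As a rough sketch of that fix (assuming Python 3.8+ and Playwright's async API, which launches its browser driver as a subprocess and therefore needs the Proactor loop on Windows):

```python
import asyncio
import sys

# Windows: make sure the Proactor event loop is used, since Playwright's async API
# spawns a browser subprocess. Set this before asyncio.run() creates the loop.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
```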
The Implementation: Building the Scraper
Let's build the script, function by function.
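The snippets below share a few imports and two constants: START_URL (the page the login starts from) and BASE_DOMAIN (the domain the browser should land back on after login). Roughly something like this, with placeholder values you'd swap for your own site:

```python
import asyncio
import os
import re
import sys

from bs4 import BeautifulSoup
# TimeoutError here is Playwright's, so the except clause in get_auth_cookies
# catches login timeouts rather than the built-in exception.
from playwright.async_api import TimeoutError, async_playwright

# Placeholder values: point these at your own access-protected site.
BASE_DOMAIN = "sites.google.com"
START_URL = "https://sites.google.com/a/your-domain.com/your-internal-site/"
```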
Step 1: Handling Authentication
First, we need a way to perform the manual login and get the session cookies. This function opens a visible browser, lets you log in, and then saves the cookies for the automated part to use.
```python
async def get_auth_cookies():
    """Launches a browser for login to get authentication cookies."""
    if sys.platform == "win32":
        # Playwright's async API needs the Proactor event loop on Windows for subprocess support
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

    print("--- 👤 YOUR TURN: AUTHENTICATION ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(START_URL)
        print("Please complete the login process in the browser window...")
        try:
            # We wait until the URL is back on the Google Site domain
            await page.wait_for_url(f"**/{BASE_DOMAIN}/**", timeout=300000)
            print("✅ Login successful! Extracting session cookies...")
            cookies = await context.cookies()
            await browser.close()
            print("🔒 Headed browser closed. Authentication complete.")
            return cookies
        except TimeoutError:
            print("❌ Login timed out.")
            await browser.close()
            return None
```

Step 2: The Scraper Engine
Now for the main part. This function, scrape_site_headless, takes the cookies and the list of pages to visit, then iterates through them in a headless browser.
The key part here is how we wait for each page to load. We use wait_until="networkidle" with a generous 90-second timeout. This is the most reliable way to ensure complex pages with lots of embedded iframes are fully loaded before we try to read them.
Failure Note: Initially, I tried using `wait_until="load"` with a short, fixed `wait_for_timeout()`. This failed constantly on the heavier pages like the homepage, resulting in a `TimeoutError`.
Lesson Learned: For complex, dynamic sites, a patient `networkidle` wait is far more reliable than a blind, fixed delay.
```python
async def scrape_site_headless(cookies, initial_links):
    """Launches a headless browser to scrape all pages."""
    print("\n--- 🤖 MY TURN: HEADLESS SCRAPING ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state={"cookies": cookies})

        for i, link_info in enumerate(initial_links):
            url_to_scrape = link_info['url']
            print(f"({i+1}/{len(initial_links)}) Scraping: {url_to_scrape}")
            page = await context.new_page()
            try:
                # This is the patient waiting strategy that solved the timeout errors
                await page.goto(url_to_scrape, wait_until="networkidle", timeout=90000)
                # ... The content extraction logic will go here ...
            except Exception as e:
                print(f" ❌ Failed to scrape {url_to_scrape}. Error: {e}")
            finally:
                if not page.is_closed():
                    await page.close()

        await browser.close()
```

Step 3: Extracting Content with Context
This is where we solve the context problem. Inside the scraping loop, we'll get the full HTML, parse it with BeautifulSoup, find and download images/docs, and replace them with placeholders before saving the final text.
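Before wiring this into the scraper, here's a tiny, self-contained illustration of the core trick: BeautifulSoup's replace_with() swaps a tag for a plain-text marker, so the final get_text() output keeps the content's position without any markup. (The HTML string and URL are just an example.)

```python
from bs4 import BeautifulSoup

# Toy demonstration of the placeholder technique used in the scraper
html = '<p>Team photo: <img src="https://example.com/team.png"> Enjoy!</p>'
soup = BeautifulSoup(html, "lxml")
for img in soup.find_all("img"):
    # Replace the tag with a text marker that records where the image was
    img.replace_with(f"\n[IMAGE: {img['src']}]\n")

print(soup.get_text(separator="\n", strip=True))
# Team photo:
# [IMAGE: https://example.com/team.png]
# Enjoy!
```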
Failure Note: Finding the embedded document links was the hardest part. My first attempts to find them by looking for "Pop-out" buttons or simple link tags failed because the links are hidden deep inside iframes with non-obvious selectors. I even tried a brute-force regex search on the page's internal JavaScript variables, which also proved unreliable.
The breakthrough came when we created a debug script to save the page's full HTML and manually inspected it. We discovered that Google embeds the direct download and open links in special `data-` attributes (like `data-embed-download-url`).
Lesson Learned: When you're stuck, stop guessing and find a way to look at the raw source your script is seeing.
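A helper along these lines (my own sketch, not part of the final script) is enough to capture what the headless browser actually renders, frames included:

```python
async def dump_rendered_html(page, out_path="debug_page.html"):
    """Debug sketch: save the rendered HTML of a page and all of its frames
    so you can inspect the real markup your script sees."""
    html = await page.content()
    for frame in page.frames:
        try:
            html += await frame.content()
        except Exception:
            pass  # some cross-origin frames refuse to hand over their content
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"🔍 Saved rendered HTML to {out_path}")
```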
Here's the logic that goes inside the try block of the scrape_site_headless function:
```python
# This code goes inside the `try` block in the function above

# Get HTML from the main page and all its frames
full_html = await page.content()
for frame in page.frames:
    try:
        full_html += await frame.content()
    except Exception:
        pass

soup = BeautifulSoup(full_html, "lxml")

# Prepare directories and lists
page_output_dir = create_page_folder(page)
images_dir = os.path.join(page_output_dir, "images")
os.makedirs(images_dir, exist_ok=True)
doc_links_to_save = set()
image_counter = 0

# Find all <img> tags, download the image, and replace with a placeholder
print(f" 🖼️ Finding and downloading images...")
for img_tag in soup.find_all('img'):
    src = img_tag.get('src')
    if src and src.startswith('http'):
        image_counter += 1
        saved_filename = await download_file(cookies, src, images_dir, f"image_{image_counter}")
        if saved_filename:
            placeholder = f"\n[IMAGE: {os.path.join('images', saved_filename)}]\n"
            img_tag.replace_with(placeholder)

# Find all embedded documents, download or link them, and replace with a placeholder
print(f" 📎 Finding and downloading documents...")
for embed_div in soup.find_all('div', attrs={'data-embed-doc-id': True}):
    download_url = embed_div.get('data-embed-download-url')
    if download_url:
        # This is a downloadable file like a PDF
        saved_filename = await download_file(cookies, download_url, page_output_dir, "document")
        if saved_filename:
            placeholder = f"\n[DOWNLOADED_DOCUMENT: {saved_filename}]\n"
            embed_div.replace_with(placeholder)
    else:
        # This is an interactive doc, so we save the link
        open_url = embed_div.get('data-embed-open-url')
        if open_url:
            doc_links_to_save.add(open_url)
            placeholder = f"\n[DOCUMENT_LINK: {open_url}]\n"
            embed_div.replace_with(placeholder)

# Finally, get the clean text from our modified HTML
page_text = soup.get_text(separator='\n', strip=True)

# Save the final results to files
save_final_content(page_output_dir, page_text, list(doc_links_to_save))
```

Step 4: The Reliable File Downloader
We learned that using the browser to navigate to download links can fail. The robust solution is to use a direct HTTP client (httpx) with our session cookies. This function handles that for both images and documents.
Failure Note: Before switching to `httpx`, my first attempt was to use Playwright's `page.goto()` to download the files. This resulted in a cryptic `net::ERR_ABORTED` error. This happens because `page.goto()` expects a page to navigate to; when the server responds with a file download instead, the navigation is aborted.
Lesson Learned: Use the right tool for the job. A direct HTTP client is the correct and robust way to handle file downloads.
```python
import httpx
from urllib.parse import unquote
import mimetypes
import os
import re

async def download_file(session_cookies, file_url, save_dir, file_prefix):
    """Downloads a file directly using an HTTP client."""
    if not file_url or file_url.startswith('data:image'):
        return None

    cookie_jar = httpx.Cookies()
    for cookie in session_cookies:
        cookie_jar.set(cookie['name'], cookie['value'], domain=cookie['domain'])

    try:
        async with httpx.AsyncClient(cookies=cookie_jar, follow_redirects=True, timeout=120.0) as client:
            response = await client.get(file_url)
            response.raise_for_status()

            # Try to get the real filename from the server
            filename = file_prefix
            if 'content-disposition' in response.headers:
                fn_match = re.search(r'filename="([^"]+)"', response.headers['content-disposition'], re.IGNORECASE)
                if fn_match:
                    filename = unquote(fn_match.group(1))
            else:
                # Fallback to guessing the extension
                ext = mimetypes.guess_extension(response.headers.get("content-type", "")) or ""
                filename = f"{file_prefix}{ext}"

            filepath = os.path.join(save_dir, filename)
            with open(filepath, "wb") as f:
                f.write(response.content)
            return filename
    except Exception as e:
        print(f" - Could not download {file_url}. Error: {e}")
        return None
```

Step 5: Putting It All Together
Finally, we need a main function to orchestrate the entire process: get the cookies, find all the pages to scrape, and then kick off the headless scraper.
```python
# The functions to create folders and save text go here...
# def create_page_folder(page): ...
# def save_final_content(page_dir, text, docs): ...

async def main():
    # 1. Authenticate and get cookies
    cookies = await get_auth_cookies()
    if not cookies:
        return

    # 2. Get the list of all internal pages to scrape
    print("\n--- 🤖 Getting initial links to scrape ---")
    internal_links = []
    # ... (this part uses a temporary headless browser to get the nav links
    #      and fill internal_links; it's complex, so for the article we'll
    #      just summarize it) ...

    # 3. Start the main scraping job with the cookies and link list
    if internal_links:
        await scrape_site_headless(cookies, internal_links)

    print("\n🎉 All tasks complete.")

if __name__ == "__main__":
    # Set the Windows event loop policy before asyncio.run() creates the loop
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
    asyncio.run(main())
```

Conclusion
Scraping modern web apps is a battle of persistence. The task might seem "impossible" at first, but breaking the problem down and using the right tools for each part of the job makes it achievable. By combining a manual login with an automated scraper and using direct HTTP requests for downloads, we were able to build a robust and reliable solution. The key was to inspect the target, understand its behavior, and adapt our strategy.
Acknowledgment
I'd like to note that this project was developed in close collaboration with an AI assistant.