DEV Community

Crawlbase

Posted on • Originally published at crawlbase.com

How to Scrape Data Behind Login Pages Using Python

This post was originally published on the Crawlbase Blog.

In this article, we’ll show you how to extract the session cookies from a logged-in browser session and pass them to an API, which can then access the website as a logged-in user and extract the data you need. The process may sound complicated, but Crawlbase simplifies most of it, as the walkthrough below shows.

Authenticated Scraping with Python Requests Library

Let’s put our extracted cookies into action. First, make sure your Python environment is set up: install the latest Python version, open your preferred IDE, and install the Python Requests module. Once that is done, we can proceed with the exercise.

Say we want to scrape this Facebook Hashtag Music page, which sits behind a login wall. If you try to open it in Chrome Incognito mode (without logging in to your Facebook account), you’ll get the sign-in page:

An image displaying the Facebook login page.

We can try to scrape this page manually by using Python alone to see what will happen. Create a file and name it scraping_with_crawlbase.py, then copy and paste the code below.

import requests
from requests.exceptions import RequestException

TARGET_URL = "https://www.facebook.com/hashtag/music"
HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36',
    'sec-fetch-mode': 'navigate',
    'Cookie': '<cookies-goes-here>'
}
OUTPUT_FILE_NAME = "output.html"

try:
    response = requests.get(TARGET_URL, headers=HEADERS)
    response.raise_for_status()

    html_content = response.text

    with open(OUTPUT_FILE_NAME, "w", encoding="utf-8") as file:
        file.write(html_content)

    print(f"\nPage successfully saved to '{OUTPUT_FILE_NAME}'\n")

except RequestException as error:
    print(f"\n Failed to fetch the page: {error}\n")

Make sure to replace <cookies-goes-here> with the actual cookies you extracted from your Facebook account earlier and run the code using the command below.

python scraping_with_crawlbase.py 

After running the script, open the output.html file. You’ll notice that the content looks blank or incomplete. If you inspect it, you’ll see that it's mostly unexecuted JavaScript.

Why? Because the data you’re looking for is loaded dynamically with JavaScript, and requests alone can’t execute JavaScript like a browser does.
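One rough way to confirm this for a saved page is to measure how much of its text sits inside script tags. The sketch below uses only the standard library; the `script_ratio` helper and the sample HTML are illustrative assumptions, not part of the original script, and the ratio is a heuristic rather than a definitive test.

```python
from html.parser import HTMLParser


class TextVsScriptCounter(HTMLParser):
    """Counts characters inside <script> tags versus all other text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the count.
        length = len(data.strip())
        if self.in_script:
            self.script_chars += length
        else:
            self.text_chars += length


def script_ratio(html: str) -> float:
    """Return the share of text that sits inside <script> tags (0.0 to 1.0)."""
    counter = TextVsScriptCounter()
    counter.feed(html)
    total = counter.script_chars + counter.text_chars
    return counter.script_chars / total if total else 0.0


# Hypothetical sample; in practice, read the contents of output.html instead.
sample = "<html><body><p>Hi</p><script>var a = 1; var b = 2;</script></body></html>"
print(f"Share of text inside <script> tags: {script_ratio(sample):.2f}")
```

A ratio close to 1.0 suggests the page is mostly unexecuted JavaScript, which matches what you see when opening output.html.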

So, how do we resolve this issue? That’s what we’ll cover in the next section.

Scraping Behind Login Using Crawlbase

Now that we’ve seen the limitations of using Python’s requests library alone, let’s use Crawlbase to handle problems like rendering JavaScript and working behind login walls. Here’s how you can do it:

  • Step 1: Prepare Your Script. Create or update your scraping_with_crawlbase.py file with the following code:
import json
import requests
from requests.exceptions import RequestException

API_TOKEN = "<Javascript requests token>"
TARGET_URL = "https://www.facebook.com/hashtag/music"
SCRAPER = "facebook-hashtag"
COOKIES = """
<cookies-goes-here>
"""
COUNTRY = "US"

API_ENDPOINT = "https://api.crawlbase.com/"

params = {
    "token": API_TOKEN,
    "url": TARGET_URL,
    "scraper": SCRAPER,
    "cookies": COOKIES,
    "country": COUNTRY
}

try:
    response = requests.get(API_ENDPOINT, params=params)
    response.raise_for_status()

    json_string_content = response.text
    json_data = json.loads(json_string_content)
    pretty_json = json.dumps(json_data, indent=2)

    print(pretty_json)

except RequestException as error:
    print(f"\nFailed to fetch the page: {error}\n")
  • Step 2: Replace <Javascript requests token> with your Crawlbase JavaScript token. If you don’t have an account yet, sign up for Crawlbase to claim your free API requests.

  • Step 3: Replace <cookies-goes-here> with the same cookies you extracted earlier from your logged-in Facebook session.

Make sure the cookies are properly formatted. Otherwise, Crawlbase might reject them. According to the cookies documentation, the correct format should look like this:

cookies: key1=value1; key2=value2; key3=value3 
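If your cookies were copied with stray whitespace or line breaks, a small helper can normalize them into this format. The sketch below is a minimal example using only plain Python; `parse_cookie_string` and `format_cookies` are hypothetical helper names, and the key/value pairs are placeholders, not real Facebook cookies.

```python
def parse_cookie_string(raw: str) -> dict[str, str]:
    """Split a 'key1=value1; key2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ';'
        name, separator, value = pair.partition("=")
        if not separator or not name.strip():
            raise ValueError(f"Malformed cookie pair: {pair!r}")
        cookies[name.strip()] = value.strip()
    return cookies


def format_cookies(cookies: dict[str, str]) -> str:
    """Join cookie name/value pairs as 'key1=value1; key2=value2'."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Placeholder input with messy spacing and a line break.
raw_cookies = """
key1=value1;   key2=value2 ;
key3=value3;
"""
print(format_cookies(parse_cookie_string(raw_cookies)))
```

The cleaned string can then be assigned to COOKIES in the script. A pair without an `=` sign raises a ValueError, so formatting problems surface before the request is sent.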
  • Step 4: Now run the script using:
python scraping_with_crawlbase.py 

If everything's set up correctly, you’ll see clean JSON output printed in your terminal. This is the actual content scraped from the Facebook hashtag page.

{
  "original_status": 200,
  "pc_status": 200,
  "url": "https://www.facebook.com/hashtag/music",
  "domain_complexity": "standard",
  "body": {
    "hashtag": "",
    "posts": [
      {
        "userName": "Dave Moffatt Music",
        "text": "You\u2019ll get by with a smileYou can\u2019t win at everything but you can try! @eraserheads_official #nevada #music #withasmile #song",
        "url": "https://www.facebook.com/hashtag/music?__cft__[0]=AZWbgQE-_wYwW47AUbqqhzfqC6moiJrxFQs7glnpepq5ibId2fvbkZe1E3UoNwI-Ywj4gaQp3qbQjOMGmNVD1fu4Ofx-uPcDfWPJGhRCtKrHKV1G-rXqg2mxRSzd93AL281FwDSfjERvTMkdWK6bZI_cJC_CxDD63x_K5WycyUe1lnt5kBwyBOdIk4z2jfeFeRCZASbYvSLGQS9eQ4GQh-c2&__tn__=%2CO%2CP-R#?bee",
        "dateTime": "oSspoenrdt0iS27g8ie7lm4c2gt19779f1mpraaec87et108um8 b3,7 56g",
        "likesCount": "",
        "sharesCount": "",
        "commentsCount": "",
        "links": [
          {
            "link": "https://www.facebook.com/hashtag/nevada?__eep__=6&__cft__[0]=AZWbgQE-_wYwW47AUbqqhzfqC6moiJrxFQs7glnpepq5ibId2fvbkZe1E3UoNwI-Ywj4gaQp3qbQjOMGmNVD1fu4Ofx-uPcDfWPJGhRCtKrHKV1G-rXqg2mxRSzd93AL281FwDSfjERvTMkdWK6bZI_cJC_CxDD63x_K5WycyUe1lnt5kBwyBOdIk4z2jfeFeRCZASbYvSLGQS9eQ4GQh-c2&__tn__=*NK-R",
            "text": "#nevada"
          },
          {
            "link": "https://www.facebook.com/hashtag/music?__eep__=6&__cft__[0]=AZWbgQE-_wYwW47AUbqqhzfqC6moiJrxFQs7glnpepq5ibId2fvbkZe1E3UoNwI-Ywj4gaQp3qbQjOMGmNVD1fu4Ofx-uPcDfWPJGhRCtKrHKV1G-rXqg2mxRSzd93AL281FwDSfjERvTMkdWK6bZI_cJC_CxDD63x_K5WycyUe1lnt5kBwyBOdIk4z2jfeFeRCZASbYvSLGQS9eQ4GQh-c2&__tn__=*NK-R",
            "text": "#music"
          },
          {
            "link": "https://www.facebook.com/hashtag/withasmile?__eep__=6&__cft__[0]=AZWbgQE-_wYwW47AUbqqhzfqC6moiJrxFQs7glnpepq5ibId2fvbkZe1E3UoNwI-Ywj4gaQp3qbQjOMGmNVD1fu4Ofx-uPcDfWPJGhRCtKrHKV1G-rXqg2mxRSzd93AL281FwDSfjERvTMkdWK6bZI_cJC_CxDD63x_K5WycyUe1lnt5kBwyBOdIk4z2jfeFeRCZASbYvSLGQS9eQ4GQh-c2&__tn__=*NK-R",
            "text": "#withasmile"
          },
          {
            "link": "https://www.facebook.com/hashtag/song?__eep__=6&__cft__[0]=AZWbgQE-_wYwW47AUbqqhzfqC6moiJrxFQs7glnpepq5ibId2fvbkZe1E3UoNwI-Ywj4gaQp3qbQjOMGmNVD1fu4Ofx-uPcDfWPJGhRCtKrHKV1G-rXqg2mxRSzd93AL281FwDSfjERvTMkdWK6bZI_cJC_CxDD63x_K5WycyUe1lnt5kBwyBOdIk4z2jfeFeRCZASbYvSLGQS9eQ4GQh-c2&__tn__=*NK-R",
            "text": "#song"
          }
        ]
      }
      // Note: some results have been omitted for brevity.
    ]
  }
}
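Once you have the parsed JSON, you will usually want only a few fields from each post. The sketch below walks the structure shown in the output above (`body`, `posts`, `links`) and collects the user name, text, and hashtags. The `summarize_posts` helper is a hypothetical name, and the sample dictionary is a trimmed stand-in for the real response, with shortened text and URLs.

```python
import json


def summarize_posts(api_response: dict) -> list[dict]:
    """Collect the user name, text, and hashtags of each post in the response."""
    body = api_response.get("body") or {}
    summaries = []
    for post in body.get("posts", []):
        # Keep only link labels that look like hashtags, e.g. "#music".
        hashtags = [
            link.get("text", "")
            for link in post.get("links", [])
            if link.get("text", "").startswith("#")
        ]
        summaries.append({
            "userName": post.get("userName", ""),
            "text": post.get("text", ""),
            "hashtags": hashtags,
        })
    return summaries


# Trimmed stand-in for the JSON shown above (text and URLs shortened).
sample_response = {
    "original_status": 200,
    "pc_status": 200,
    "url": "https://www.facebook.com/hashtag/music",
    "body": {
        "hashtag": "",
        "posts": [
            {
                "userName": "Dave Moffatt Music",
                "text": "#nevada #music #withasmile #song",
                "links": [
                    {"link": "https://www.facebook.com/hashtag/nevada", "text": "#nevada"},
                    {"link": "https://www.facebook.com/hashtag/music", "text": "#music"},
                ],
            }
        ],
    },
}

print(json.dumps(summarize_posts(sample_response), indent=2))
```

In the script from Step 1, you could pass `json_data` to this function instead of the sample dictionary. Using `.get()` with defaults keeps the loop from failing when a post is missing a field.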
  • Bonus Step: The Crawlbase Facebook data scraper isn’t limited to hashtag pages. It also supports other types of Facebook content. So, if your target page falls into one of the categories below, you’re in luck:

    • facebook-group
    • facebook-page
    • facebook-profile
    • facebook-event

All you have to do is update two lines in your script to match the type of page you want to scrape:

TARGET_URL = "https://www.facebook.com/hashtag/music"
SCRAPER = "facebook-hashtag"

For example, if you want to scrape a private Facebook group, change it to something like:

TARGET_URL = "https://www.facebook.com/groups/examplegroup"
SCRAPER = "facebook-group"

Just swap in the correct URL and the corresponding scraper name, and Crawlbase will take care of the rest.
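If you switch between page types often, it may help to wrap the parameters in a small function that rejects scraper names outside the list above. This is a minimal sketch: `build_params` and `SUPPORTED_SCRAPERS` are hypothetical names, the set contains only the scraper names mentioned in this article, and the token and cookie values are placeholders.

```python
# Scraper names mentioned in this article; Crawlbase may support others.
SUPPORTED_SCRAPERS = {
    "facebook-hashtag",
    "facebook-group",
    "facebook-page",
    "facebook-profile",
    "facebook-event",
}


def build_params(token: str, target_url: str, scraper: str,
                 cookies: str, country: str = "US") -> dict[str, str]:
    """Build the query parameters used in the script, validating the scraper name."""
    if scraper not in SUPPORTED_SCRAPERS:
        raise ValueError(
            f"Unknown scraper {scraper!r}; expected one of {sorted(SUPPORTED_SCRAPERS)}"
        )
    return {
        "token": token,
        "url": target_url,
        "scraper": scraper,
        "cookies": cookies.strip(),  # drop stray newlines around the cookie string
        "country": country,
    }


params = build_params(
    token="<Javascript requests token>",
    target_url="https://www.facebook.com/groups/examplegroup",
    scraper="facebook-group",
    cookies="<cookies-goes-here>",
)
print(params["url"], params["scraper"])
```

The returned dictionary has the same keys as the `params` dictionary in the script, so it can be passed to `requests.get(API_ENDPOINT, params=params)` unchanged.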
