Scraper and Flask Web Server

This project demonstrates the use of a multi-stage Docker build to scrape data from a specified URL using Node.js with Puppeteer and Chromium, and serve the scraped data via a simple Python Flask web server.

Project Structure

/scraper-flask-app
β”‚
β”œβ”€β”€ app.py              # Flask web server for serving scraped data
β”œβ”€β”€ scraper.js          # Node.js script to scrape the provided URL
β”œβ”€β”€ Dockerfile          # Multi-stage Dockerfile for building the image
β”œβ”€β”€ scraped_data.json   # Output file containing the scraped data (generated by the scraper)
└── README.md           # Project documentation

Requirements

Before you begin, ensure that you have the following installed:

  • Docker (used to build and run the multi-stage image)
  • Git (used to clone the repository)

Project Description

The project consists of two main parts:

  1. Scraper (Node.js with Puppeteer): A Node.js script (scraper.js) that uses Puppeteer to scrape content from a specified URL and stores the output in a JSON file.
  2. Web Server (Flask): A simple Flask web server (app.py) that reads the scraped JSON data and serves it via an HTTP endpoint.

Scraping Flow:

  • The scraper script will accept a URL as an environment variable.
  • It will use Puppeteer to load the page and scrape content (e.g., the title of the page).
  • The scraped data will be stored as a JSON file (scraped_data.json).
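
A minimal sketch of what scraper.js could look like is shown below; the exact fields scraped and the Puppeteer launch flags are assumptions for illustration, not the repository's actual code:

// scraper.js: illustrative sketch under the assumptions above
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  // The URL is injected at build time via the SCRAPE_URL environment variable
  const url = process.env.SCRAPE_URL || 'http://example.com';

  // --no-sandbox is commonly needed when Chromium runs as root inside a container
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Scrape a simple piece of content: the page title
  const data = { url, title: await page.title() };

  // Persist the result for the Flask stage to serve
  fs.writeFileSync('scraped_data.json', JSON.stringify(data, null, 2));
  await browser.close();
})();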

Web Server Flow:

  • The Flask web server will read the scraped_data.json file.
  • It will serve the data through an endpoint (/scraped_data) that returns the content as JSON when accessed.
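
A minimal sketch of app.py, with the endpoint name and file location taken from this README (the actual implementation may differ):

# app.py: illustrative sketch under the assumptions above
import json

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/scraped_data')
def scraped_data():
    # Read the JSON file produced by the scraper stage
    with open('scraped_data.json') as f:
        return jsonify(json.load(f))

if __name__ == '__main__':
    # Bind to 0.0.0.0 so the server is reachable from outside the container
    app.run(host='0.0.0.0', port=5000)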

Docker Setup:

The Dockerfile includes two stages:

  1. Scraper Stage: Uses a Node.js image to install Puppeteer and Chromium, and then runs the scraper script.
  2. Server Stage: Uses a Python image with Flask to serve the scraped content.

Setup and Usage

Step 1: Clone the Repository

Start by cloning this repository to your local machine:

git clone https://github.com/sanjaykadavarath/puppeteer-scraper-flask-app.git
cd puppeteer-scraper-flask-app

Step 2: Build the Docker Image

Next, build the Docker image, specifying the URL you want to scrape via a build argument:

docker build --build-arg SCRAPE_URL=http://example.com -t scraper-flask-app .
  • Replace http://example.com with the URL you want to scrape.

Step 3: Run the Docker Container

Once the image is built, run the container on your local machine or a server:

docker run -p 5000:5000 scraper-flask-app
  • This command runs the container and maps port 5000 on the host machine to port 5000 inside the container.

Step 4: Access the Web Server

After the container starts, you can access the Flask web server by opening a browser and navigating to:

http://localhost:5000/scraped_data 

If you're running it on a remote server, replace localhost with the server's IP address.
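
You can also verify the endpoint from the command line with curl:

curl http://localhost:5000/scraped_data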

Step 5: Pushing to Docker Hub

To push the Docker image to Docker Hub, follow these steps:

  1. Tag the image with your Docker Hub username and repository name:

    docker tag scraper-flask-app sanjaykadavarath/scraper-flask-app:latest
  2. Push the image to Docker Hub:

    docker push sanjaykadavarath/scraper-flask-app:latest

Step 6: Running on Another Machine

To run this project on another machine, follow these steps:

  1. Install Docker on the other machine.

  2. Login to Docker Hub on the new machine:

    docker login
  3. Pull the image from Docker Hub:

    docker pull sanjaykadavarath/scraper-flask-app:latest
  4. Run the container:

    docker run -p 5000:5000 sanjaykadavarath/scraper-flask-app:latest
  5. Access the Flask server at http://<machine-ip>:5000/scraped_data.

Dockerfile Breakdown

Scraper Stage

FROM node:16 AS scraper

# Install dependencies
RUN apt-get update && apt-get install -y wget ca-certificates --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# Install Puppeteer and Chromium
RUN npm install puppeteer --save

# Set working directory
WORKDIR /app

# Copy the scraper script
COPY scraper.js .

# Set the environment variable for the URL to scrape
ARG SCRAPE_URL
ENV SCRAPE_URL=$SCRAPE_URL

# Run the scraper
RUN node scraper.js
  • This stage installs necessary dependencies, installs Puppeteer, and runs the scraper.js script.

Server Stage

FROM python:3.9-slim AS server

# Install Flask
RUN pip install flask

# Set working directory
WORKDIR /app

# Copy the scraped data and Flask app
COPY --from=scraper /app/scraped_data.json .
COPY app.py .

# Expose the port
EXPOSE 5000

# Run Flask app
CMD ["python", "app.py"]
  • This stage copies the scraped_data.json file from the first stage and sets up the Flask web server.

Notes

  • Environment Variable: The scraper script uses the SCRAPE_URL environment variable to specify the URL to scrape. You must pass this as a build argument when building the Docker image.
  • Dynamic Scraping: The scraper can be easily adapted to scrape different data by modifying the scraper.js script.
  • Flask Web Server: The Flask app serves the scraped data as a JSON response at the /scraped_data endpoint.

License

This project is licensed under the MIT License - see the LICENSE file for details.
