Hey Devs 👋,
If you’re starting out in data engineering or curious how real-world data pipelines work, this post is for you.
As an Associate Data Engineer Intern, I wanted to go beyond watching tutorials and actually build a working pipeline — one that pulls real-world data daily, processes it, stores it, and is fully containerized.
So I picked something simple but meaningful: global COVID-19 stats.
Here’s a breakdown of what I built, how it works, and what I learned.
📊 What This Pipeline Does
This mini-project automates the following:
✅ Pulls daily global COVID-19 stats from a public API
✅ Uses Airflow to schedule and monitor the task
✅ Stores the results in a PostgreSQL database
✅ Runs everything inside Docker containers
It's a beginner-friendly, end-to-end project to get your hands dirty with core data engineering tools.
🧰 The Tech Stack
- Python — for the main fetch/store logic
- Airflow — to orchestrate and schedule tasks
- PostgreSQL — for storing daily data
- Docker — to containerize and simplify setup
- disease.sh API — open-source COVID-19 stats API
⚙️ How It Works (Behind the Scenes)
- An Airflow DAG triggers once per day
- A Python task sends a request to the COVID-19 API
- The script parses the JSON response
- It inserts the cleaned data into a PostgreSQL table
- Airflow logs every run (success/failure) in its UI
Everything runs locally via docker-compose — one command and you're up and running.
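To make the orchestration part concrete, here's a minimal sketch of what such a DAG can look like. It's an illustration rather than the exact code in the repo: the DAG id, task id, and fetch function are placeholders, and I'm assuming the global-stats endpoint of disease.sh.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_and_store_covid_stats():
    """Fetch the latest global stats; the insert step is shown later in the post."""
    resp = requests.get("https://disease.sh/v3/covid-19/all", timeout=30)
    resp.raise_for_status()
    data = resp.json()
    print(data["cases"], data["deaths"])  # placeholder for the Postgres insert


with DAG(
    dag_id="covid_daily_pipeline",      # placeholder name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",                  # "schedule_interval" on Airflow < 2.4
    catchup=False,                      # don't backfill past days
) as dag:
    fetch_and_store = PythonOperator(
        task_id="fetch_and_store",
        python_callable=fetch_and_store_covid_stats,
    )
```

Once a file like this sits in `dags/`, Airflow picks it up, runs it once a day, and surfaces the logs and retries in its UI.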
🗂️ Project Structure
```
airflow-docker/
├── dags/                  # Airflow DAG (main logic)
├── scripts/               # Python file to fetch + insert data
├── docker-compose.yaml    # Setup for Airflow + Postgres
├── logs/                  # Logs generated by Airflow
└── plugins/               # (Optional) Airflow plugins
```
You can check the full repo here:
👉 GitHub: mohhddhassan/covid-data-pipeline
🧠 Key Learnings
✅ How to build and run a simple Airflow DAG
✅ Using Docker to spin up services like Postgres & Airflow
✅ How Python connects to a DB and inserts structured data
✅ Observing how tasks are logged, retried, and managed in Airflow
This small project gave me confidence in how the core parts of a pipeline talk to each other.
🔍 Sample Output from API
Here’s a snippet of the JSON response from the API:

```json
{
  "cases": 708128930,
  "deaths": 7138904,
  "recovered": 0,
  "updated": 1717689600000
}
```
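One small detail worth noting: `updated` is a Unix timestamp in milliseconds, so it has to be converted before it can land in a `date` column. Here's a sketch of how that mapping might look (the `parse_global_stats` helper is hypothetical, but the field and column names match the samples in this post):

```python
from datetime import datetime, timezone


def parse_global_stats(payload: dict) -> dict:
    """Map the raw disease.sh payload to the columns of the covid_stats table."""
    # "updated" is a Unix epoch in milliseconds -> convert to a plain UTC date
    updated = datetime.fromtimestamp(payload["updated"] / 1000, tz=timezone.utc)
    return {
        "date": updated.date().isoformat(),
        "total_cases": payload["cases"],
        "total_deaths": payload["deaths"],
        "recovered": payload["recovered"],
    }
```

The resulting dict lines up with the columns used in the insert below.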
And here’s a sample SQL insert triggered via Python:

```sql
INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
VALUES ('2025-06-06', 708128930, 7138904, 0);
```
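Under the hood, that insert can be done with psycopg2 and a parameterized query. Again, this is a rough sketch rather than the repo's exact code; the connection details (host, database name, credentials) are assumptions based on a typical docker-compose setup:

```python
import psycopg2


def insert_stats(row: dict) -> None:
    """Insert one day's stats into Postgres using a parameterized query."""
    conn = psycopg2.connect(
        host="postgres",    # the Postgres service name in docker-compose (assumed)
        dbname="covid",     # placeholder database name
        user="airflow",     # placeholder credentials
        password="airflow",
    )
    try:
        # "with conn" commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
                VALUES (%(date)s, %(total_cases)s, %(total_deaths)s, %(recovered)s);
                """,
                row,
            )
    finally:
        conn.close()
```

Passing the values as query parameters (instead of formatting them into the SQL string) keeps the insert safe and lets the driver handle type conversion.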
🔧 What’s Next?
I’m planning to:
🚧 Add deduplication logic so the same day's data isn't inserted twice (see the sketch after this list)
📊 Maybe create a Streamlit dashboard on top of the database
⚙️ Play with sensors, templates, and XComs in Airflow
⚡ Extend the pipeline with ClickHouse for OLAP-style analytics
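For the deduplication item, one simple option would be Postgres's `ON CONFLICT` clause, shown here as an illustration and assuming a unique constraint on `date`, so that re-running the DAG for the same day becomes a no-op:

```python
# Same insert as above, but idempotent: duplicate dates are silently skipped.
# Requires: ALTER TABLE covid_stats ADD CONSTRAINT covid_stats_date_key UNIQUE (date);
UPSERT_SQL = """
INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
VALUES (%(date)s, %(total_cases)s, %(total_deaths)s, %(recovered)s)
ON CONFLICT (date) DO NOTHING;
"""
```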
📌 Why You Should Try Something Like This
If you're learning data engineering:
- Start small, but make it real
- Use public APIs to practice fetching and storing data
- Wrap it with orchestration + containerization — it’s closer to the real thing
This project taught me way more than passively following courses ever could.
🙋‍♂️ About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
🚀 Learning in public, one pipeline at a time.