
Design a workflow management platform like Apache Airflow

Introduction:

Airflow is a platform that lets us build and run workflows. A workflow is represented as a DAG
(a Directed Acyclic Graph), and contains individual pieces of work called Tasks, arranged with
dependencies and data flows taken into account.
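
To make this concrete, here is a minimal sketch of declaring such a workflow, assuming Airflow 2.4+ and its Python API; the task names and callables are illustrative placeholders, not part of this design.

```python
# Minimal two-task workflow sketch (assumes Airflow 2.4+ is installed).
# Task names and callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from upstream sources")


def transform():
    print("apply the developer-provided transformation")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator encodes a DAG edge: transform runs after extract succeeds.
    extract_task >> transform_task
```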

Functional requirements:

1. Serve downstream services: analytics, alerts and monitoring, and on-demand query dashboards.
2. Allow both real-time and batch processing, with batch processing providing high accuracy and real-time processing providing 1-2 minute latency.
3. Take data from multiple sources and transform it as required by developer scripts. The developers provide the scripts; we provide the infrastructure and fault tolerance to run them.
4. Be cheap and generic, so most engineers can use it (for example, for ML alerts).

Design Process:

Microservices like the Profile, Session, and Payment services push data into their respective data sources. So we might have different types of databases, such as Cassandra, MySQL, and graph databases.

Why can’t we directly query the data from the microservices?

If we derived insights directly from these databases, we would be reading from them constantly. This would put load on our production systems, using I/O operations that we could otherwise spend on our main service.

For example, the payments database may need a high isolation level, which requires taking locks on rows. Running an analytics read query in parallel would lock rows unnecessarily.

Hence, we make a copy of these databases and pool them into something called a data lake. This lake is heterogeneous (it holds data from different sources).

How do we link the different types of data into a single record in this data lake?

Depending on the use case, we join data on identifiers. For example, a food delivery app
will have a trip, delivery, rating and payment corresponding to a single order. This order ID
would help merge data across databases.
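
As a minimal sketch of that merge step in plain Python (all field names and sample records below are hypothetical):

```python
# Merging per-service records into a single row keyed on order_id.
# Field names and sample data are made up for illustration.
from collections import defaultdict

trips = [{"order_id": 42, "distance_km": 3.2}]
ratings = [{"order_id": 42, "stars": 5}]
payments = [{"order_id": 42, "amount": 12.50}]

merged = defaultdict(dict)
for source in (trips, ratings, payments):
    for record in source:
        # Records from different databases land in the same row
        # because they share the order_id identifier.
        merged[record["order_id"]].update(record)

print(merged[42])
# {'order_id': 42, 'distance_km': 3.2, 'stars': 5, 'amount': 12.5}
```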

This data can be pulled from the upstream databases using cron jobs, which run periodically. We should use a built-in Change Data Capture (CDC) solution when available (popular databases like MySQL and Cassandra provide one).
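
A sketch of the cron-style pull, assuming the source table carries an updated_at watermark column; sqlite3 stands in for the real upstream database, and the table and column names are hypothetical:

```python
# Periodic (cron-triggered) incremental pull using a watermark column.
# sqlite3 stands in for the upstream database; table/column names are assumed.
import sqlite3


def pull_new_rows(conn, last_seen_ts):
    """Fetch only the rows written since the previous run."""
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ?",
        (last_seen_ts,),
    )
    rows = cur.fetchall()
    # The highest timestamp seen becomes the watermark for the next run.
    new_ts = max((r[2] for r in rows), default=last_seen_ts)
    return rows, new_ts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'a', 100), (2, 'b', 200)")

rows, watermark = pull_new_rows(conn, last_seen_ts=150)
print(rows, watermark)  # [(2, 'b', 200)] 200
```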

Once pulled, the data in the lake is immutable: we can add records, but we cannot update or delete them. A distributed file system, such as HDFS, would be ideal here, since it's cheap.
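
A sketch of the append-only convention, with the local filesystem standing in for HDFS; the date-partitioned path layout is an assumption, not part of the original design:

```python
# Append-only writes into a date-partitioned layout; the local filesystem
# stands in for HDFS, and the path scheme is assumed for illustration.
import json
from datetime import date, datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("/tmp/data_lake")


def append_records(source, records):
    # New data goes into a fresh file under today's partition;
    # existing files are never updated or deleted.
    partition = LAKE_ROOT / source / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / f"part-{datetime.now(timezone.utc).timestamp()}.jsonl"
    with out.open("x") as f:  # "x" mode refuses to overwrite an existing file
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out


print(append_records("payments", [{"order_id": 42, "amount": 12.5}]))
```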
Processing:

We use a MapReduce architecture. Here are some of its advantages (a toy example follows the list):


1. Horizontal scaling is easy.
2. Tasks are broken into stages and run in parallel.
3. There is no single point of failure.
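
Here is a toy map/shuffle/reduce pass in plain Python to make the stages concrete; in a real deployment, each map and reduce call would run on a separate worker, and the event data here is made up:

```python
# Toy MapReduce pass counting events per type. On a cluster, the map and
# reduce stages run in parallel across machines; here they run sequentially.
from collections import defaultdict

events = ["payment", "trip", "payment", "rating"]

# Map stage: emit (key, 1) pairs; each input split can be mapped independently.
mapped = [(event, 1) for event in events]

# Shuffle stage: group the pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce stage: each key's values can be reduced on a separate worker.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'payment': 2, 'trip': 1, 'rating': 1}
```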
Real-Time Analytics for Alerts and Monitoring:

If we need alerts, we need them immediately, and querying the data lake is time consuming. To handle this, we ask services to send data on an event bus, like a message queue with multiple publishers and subscribers, which condenses the data into single events that are then pulled by the analytics engine. This is streaming data.
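
A sketch of this event-bus path using an in-process queue; in production, the bus would be a broker such as Kafka, and the service names and payloads here are made up:

```python
# In-process stand-in for the event bus: multiple publishers, one subscriber
# that condenses events into an alert decision. All payloads are hypothetical.
import queue

bus = queue.Queue()

# Multiple services publish events onto the bus.
bus.put({"service": "payment", "order_id": 42, "status": "failed"})
bus.put({"service": "payment", "order_id": 43, "status": "ok"})

# The analytics engine subscribes and condenses events into an alert signal.
failures = 0
while not bus.empty():
    event = bus.get()
    if event["status"] == "failed":
        failures += 1

if failures > 0:
    print(f"ALERT: {failures} failed payment(s) in the last window")
```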

Advantages of Data Lake over Streaming Data:

1. It can be used for audit processes.
2. It can perform complex processing through batch jobs.
3. Perfect consistency is maintained, since no data is lost.

The architecture that combines both MapReduce (batch) and streaming is called the Lambda architecture.
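
A sketch of the serving step in a Lambda architecture, where a query combines the high-accuracy batch view with the low-latency real-time view; the views and numbers below are hypothetical:

```python
# Lambda-architecture serving: merge the batch view (recomputed from the
# data lake) with the real-time view (from the stream since the last batch).
batch_view = {"payments_total": 10_000}  # accurate up to the last batch run
realtime_view = {"payments_total": 37}   # covers the last 1-2 minutes


def query(metric):
    # The batch result covers history; the streaming delta covers recent events.
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)


print(query("payments_total"))  # 10037
```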

That's it for now!


You can find more designs at InterviewReady.
