Real-Time Cryptocurrency Data Pipeline
Streaming Analytics Platform for Digital Asset Monitoring
Enterprise Data Engineering Solution • Kafka • MongoDB • Docker • Python
Executive Summary
This project implements a robust, scalable real-time data pipeline designed to continuously monitor and store cryptocurrency price data from leading digital assets including Bitcoin (BTC), Ethereum (ETH), and Solana (SOL).
Key Objective
Enable real-time financial data ingestion, processing, and persistence for analytics, monitoring, and decision-making applications in the cryptocurrency market.
Core Capabilities
- Real-time data acquisition from Binance API
- Event-driven architecture using Apache Kafka
- Scalable NoSQL data persistence with MongoDB Atlas
- Containerized deployment for portability and consistency
- Monitoring and observability with Kafdrop and Grafana
System Architecture
```
┌─────────────┐     ┌──────────┐     ┌────────────┐     ┌──────────┐     ┌──────────────┐
│   Binance   │────▶│ Producer │────▶│   Kafka    │────▶│ Consumer │────▶│   MongoDB    │
│     API     │     │ (Python) │     │   Broker   │     │ (Python) │     │    Atlas     │
└─────────────┘     └──────────┘     └────────────┘     └──────────┘     └──────────────┘
   BTC                                 Topics:                             Collections:
   ETH                                 - btc                               - btc
   SOL                                 - eth                               - eth
                                       - sol                               - sol
```
Architecture Highlights
- Decoupled Design: Producer and consumer operate independently, ensuring fault tolerance and scalability
- Topic-Based Routing: Separate Kafka topics for each cryptocurrency enable parallel processing (the naming convention is sketched after this list)
- Cloud-Native Storage: MongoDB Atlas provides global accessibility and automated backups
- Containerization: Docker Compose orchestrates all infrastructure components
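The routing convention is simple: each Binance symbol maps to one Kafka topic, and the topic name doubles as the MongoDB collection name. A minimal sketch of that mapping is shown below; the `route()` helper is illustrative only and is not part of the project code.

```python
# Minimal sketch of the symbol → topic → collection convention used in this design.
# The route() helper is hypothetical; the names mirror the pipeline described above.

SYMBOLS = {
    "BTCUSDT": "btc",   # Binance symbol → Kafka topic (also the MongoDB collection name)
    "ETHUSDT": "eth",
    "SOLUSDT": "sol",
}

def route(symbol: str) -> dict:
    """Return the Kafka topic and MongoDB collection for a given Binance symbol."""
    topic = SYMBOLS[symbol]
    return {"topic": topic, "collection": topic}

print(route("BTCUSDT"))  # {'topic': 'btc', 'collection': 'btc'}
```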
Technology Stack
| Technology | Purpose | Version |
|---|---|---|
| Apache Kafka | Distributed streaming platform for high-throughput message brokering | 7.6.1 (Confluent Platform image) |
| MongoDB Atlas | Cloud-hosted NoSQL database with automatic scaling | Latest |
| Python | Core application logic with kafka-python and pymongo | 3.8+ |
| Docker & Compose | Container orchestration for consistent deployment | 20.10+ |
| Kafdrop | Web UI for monitoring Kafka topics and consumers | Latest |
| Grafana | Observability platform for metrics visualization | Latest |
Development Process
Step 1: Infrastructure Setup with Docker Compose
First, we set up the complete infrastructure using Docker Compose. This creates isolated, reproducible environments for Kafka, Zookeeper, and monitoring tools.
File: docker-compose.yml
```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.6.1
    depends_on: [zookeeper]
    ports:
      - "9092:9092"   # Internal listener
      - "9094:9094"   # External listener for producers/consumers
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,EXTERNAL://localhost:9094
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  kafdrop:
    image: obsidiandynamics/kafdrop:latest
    ports:
      - "9000:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka:9092"
    depends_on: [kafka]

  grafana:
    image: grafana/grafana-enterprise
    container_name: grafana
    restart: unless-stopped
    ports:
      - '3000:3000'
```
Start the infrastructure:
```bash
# Start all services
docker-compose up -d

# Verify services are running
docker-compose ps

# Check Kafka logs
docker-compose logs kafka

# Access Kafdrop at http://localhost:9000
# Access Grafana at http://localhost:3000
```
Step 2: Data Generation Module
We created modular generator functions that fetch real-time cryptocurrency prices from the Binance API. Each function is designed as a generator for potential future expansion (e.g., historical data, multiple exchanges).
File: data_gen.py
```python
import requests
import time

def btc():
    """Fetch current Bitcoin (BTC) price from Binance API"""
    url = "https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT"
    response = requests.get(url)
    data = response.json()
    yield {
        "symbol": data['symbol'],
        "price": float(data['price'])
    }

def eth():
    """Fetch current Ethereum (ETH) price from Binance API"""
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    response = requests.get(url)
    data = response.json()
    yield {
        "symbol": data['symbol'],
        "price": float(data['price'])
    }

def sol():
    """Fetch current Solana (SOL) price from Binance API"""
    url = "https://api.binance.com/api/v3/ticker/price?symbol=SOLUSDT"
    response = requests.get(url)
    data = response.json()
    yield {
        "symbol": data['symbol'],
        "price": float(data['price'])
    }
```
Design Decisions:
- Generator Pattern: Using `yield` allows for easy extension to streaming multiple prices or batch fetching (a parameterized variant is sketched after this list)
- Modular Functions: Separate functions for each cryptocurrency enable independent testing and maintenance
- Type Conversion: Converting price to `float` ensures consistent numeric operations downstream
- Direct API Call: Simple REST API approach suitable for 3-second polling intervals
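To make the generator-pattern point concrete, here is a hypothetical `ticker()` helper that generalizes `btc()`/`eth()`/`sol()` into a single parameterized, continuously yielding generator. It is a sketch of the stated future direction, not part of `data_gen.py`.

```python
import time
import requests

def ticker(symbol: str, interval: float = 3.0):
    """Hypothetical generalization of btc()/eth()/sol(): stream prices for any
    Binance symbol indefinitely instead of yielding a single reading."""
    url = f"https://api.binance.com/api/v3/ticker/price?symbol={symbol}"
    while True:
        data = requests.get(url, timeout=5).json()
        yield {"symbol": data["symbol"], "price": float(data["price"])}
        time.sleep(interval)

# Usage sketch: take the first three BTC readings
for i, reading in enumerate(ticker("BTCUSDT")):
    print(reading)
    if i == 2:
        break
```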
Testing the data generators:
```python
# Test individual generators
from data_gen import btc, eth, sol

# Test BTC price fetch
for price_data in btc():
    print(f"BTC Price: ${price_data['price']:,.2f}")
    # Output: BTC Price: $67,234.50
```
Step 3: Kafka Producer Implementation
The producer fetches cryptocurrency prices and publishes them to dedicated Kafka topics. This component demonstrates the "fire-and-forget" pattern with JSON serialization.
File: producer.py
```python
from kafka import KafkaProducer
from data_gen import btc, eth, sol
import json
import time

# Initialize Kafka Producer with JSON serialization
producer = KafkaProducer(
    bootstrap_servers=['localhost:9094'],              # Connect to external listener
    value_serializer=lambda v: json.dumps(v).encode()  # Serialize dict to JSON bytes
)

try:
    while True:
        # Fetch and publish BTC price
        for r in btc():
            producer.send("btc", r)
            print(f"✓ Sent BTC: {r}")

        # Fetch and publish ETH price
        for r in eth():
            producer.send("eth", r)
            print(f"✓ Sent ETH: {r}")

        # Fetch and publish SOL price
        for r in sol():
            producer.send("sol", r)
            print(f"✓ Sent SOL: {r}")

        time.sleep(3)  # Poll every 3 seconds

except KeyboardInterrupt:
    print("\n⚠ Producer stopped by user")
    producer.close()
```
Key Implementation Details:
- Bootstrap Servers: Connect to port 9094 (external listener) from host machine
- Value Serializer: Lambda function converts Python dictionaries to JSON-encoded bytes
- Topic Routing: Each cryptocurrency has its own topic for independent consumption
- Polling Interval: 3-second sleep balances API rate limits with real-time requirements
- Graceful Shutdown: KeyboardInterrupt handler ensures proper connection closure
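The producer above uses fire-and-forget sends, so delivery is never confirmed. For visibility into delivery results, `send()` in kafka-python returns a future that accepts success and error callbacks, and `flush()` blocks until buffered messages are delivered. A minimal sketch follows; the callback names and the sample record are illustrative only.

```python
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9094'],
    value_serializer=lambda v: json.dumps(v).encode()
)

def on_success(metadata):
    # RecordMetadata exposes the topic, partition, and offset of the appended record
    print(f"✓ {metadata.topic}[{metadata.partition}] @ offset {metadata.offset}")

def on_error(exc):
    print(f"✗ Send failed: {exc}")

# Attach callbacks to the future returned by send(); the send itself stays asynchronous
producer.send("btc", {"symbol": "BTCUSDT", "price": 67234.5}) \
        .add_callback(on_success) \
        .add_errback(on_error)

producer.flush()  # Block until buffered messages are delivered (useful before shutdown)
```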
Running the producer:
```bash
# Install required packages
pip install kafka-python requests

# Run the producer
python producer.py

# Expected output:
# ✓ Sent BTC: {'symbol': 'BTCUSDT', 'price': 67234.5}
# ✓ Sent ETH: {'symbol': 'ETHUSDT', 'price': 3456.78}
# ✓ Sent SOL: {'symbol': 'SOLUSDT', 'price': 145.23}
```
Verify messages in Kafdrop:
- Open http://localhost:9000
- Click on the `btc`, `eth`, or `sol` topics
- View messages in real-time
Step 4: Kafka Consumer with MongoDB Persistence
The consumer subscribes to all cryptocurrency topics, processes incoming messages, and persists them to MongoDB collections based on the topic name.
File: consumer.py
```python
from kafka import KafkaConsumer
from pymongo import MongoClient
import json

# Initialize Kafka Consumer
consumer = KafkaConsumer(
    "btc", "eth", "sol",                                  # Subscribe to multiple topics
    bootstrap_servers=['localhost:9094'],
    value_deserializer=lambda m: json.loads(m.decode()),  # Deserialize JSON bytes to dict
    enable_auto_commit=True,                              # Automatically commit offsets
    auto_offset_reset="earliest"                          # Start from beginning if no offset exists
)

# Connect to MongoDB Atlas
mongo = MongoClient(
    "mongodb+srv://kimtryx_db_user:3100@cluster0.nds95yl.mongodb.net/?appname=Cluster0"
)

# Select database
col = mongo.fx

# Consume messages indefinitely
for msg in consumer:
    topic = msg.topic
    value = msg.value

    # Route to appropriate collection based on topic
    if topic == 'btc':
        col.btc.insert_one(value)
    elif topic == 'eth':
        col.eth.insert_one(value)
    elif topic == 'sol':
        col.sol.insert_one(value)

    print(f"💾 Saved to {topic}: {value}")
```
Implementation Highlights:
- Multi-Topic Subscription: Single consumer handles all three cryptocurrencies
- Automatic Deserialization: Lambda function converts JSON bytes back to Python dictionaries
- Auto-Commit: Kafka automatically tracks consumption progress (a manual-commit alternative is sketched after this list)
- Offset Reset Strategy: `earliest` ensures no messages are missed on first run
- Topic-Based Routing: Clean if-elif chain maps topics to MongoDB collections
- MongoDB Atlas: Cloud database eliminates local database management
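With auto-commit, a crash between receiving a message and writing it to MongoDB can lose that message. A manual-commit variant, where the offset advances only after a successful insert, gives at-least-once delivery. The sketch below assumes the same topics and a placeholder connection string; it is not part of `consumer.py`.

```python
from kafka import KafkaConsumer
from pymongo import MongoClient
import json

consumer = KafkaConsumer(
    "btc", "eth", "sol",
    bootstrap_servers=['localhost:9094'],
    value_deserializer=lambda m: json.loads(m.decode()),
    enable_auto_commit=False,         # commit manually, only after a successful insert
    auto_offset_reset="earliest"
)
db = MongoClient("your_connection_string").fx

for msg in consumer:
    db[msg.topic].insert_one(msg.value)  # persist first...
    consumer.commit()                    # ...then advance the committed offset
```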
Running the consumer:
```bash
# Install MongoDB driver
pip install pymongo

# Run the consumer (in a separate terminal from producer)
python consumer.py

# Expected output:
# 💾 Saved to btc: {'symbol': 'BTCUSDT', 'price': 67234.5}
# 💾 Saved to eth: {'symbol': 'ETHUSDT', 'price': 3456.78}
# 💾 Saved to sol: {'symbol': 'SOLUSDT', 'price': 145.23}
```
Verify data in MongoDB:
```python
from pymongo import MongoClient

client = MongoClient("your_connection_string")
db = client.fx

# Count documents in each collection
print(f"BTC documents: {db.btc.count_documents({})}")
print(f"ETH documents: {db.eth.count_documents({})}")
print(f"SOL documents: {db.sol.count_documents({})}")

# Fetch latest BTC price
latest_btc = db.btc.find_one(sort=[('_id', -1)])
print(f"Latest BTC: ${latest_btc['price']:,.2f}")
```
Step 5: Enhanced Consumer with Error Handling
For production readiness, we add comprehensive error handling, logging, and retry mechanisms.
File: consumer_enhanced.py
```python
from kafka import KafkaConsumer
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, OperationFailure
import json
import logging
import time

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def get_mongo_client(max_retries=3):
    """Establish MongoDB connection with retry logic"""
    for attempt in range(max_retries):
        try:
            client = MongoClient(
                "mongodb+srv://user:pass@cluster.mongodb.net/",
                serverSelectionTimeoutMS=5000
            )
            # Verify connection
            client.admin.command('ping')
            logger.info("✓ MongoDB connection established")
            return client
        except ConnectionFailure as e:
            logger.error(f"MongoDB connection attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
    raise Exception("Failed to connect to MongoDB after retries")

def main():
    # Initialize connections
    try:
        mongo = get_mongo_client()
        db = mongo.fx

        consumer = KafkaConsumer(
            "btc", "eth", "sol",
            bootstrap_servers=['localhost:9094'],
            value_deserializer=lambda m: json.loads(m.decode()),
            enable_auto_commit=True,
            auto_offset_reset="earliest",
            max_poll_interval_ms=300000  # 5 minutes
        )
        logger.info("✓ Kafka consumer initialized")
        logger.info("Listening for messages...")
    except Exception as e:
        logger.error(f"Initialization failed: {e}")
        return

    # Process messages
    try:
        for msg in consumer:
            try:
                topic = msg.topic
                value = msg.value

                # Validate message
                if not value or 'symbol' not in value or 'price' not in value:
                    logger.warning(f"Invalid message format: {value}")
                    continue

                # Insert into appropriate collection
                collection = getattr(db, topic)
                result = collection.insert_one(value)

                logger.info(
                    f"💾 {topic.upper()}: ${value['price']:,.2f} "
                    f"(ID: {result.inserted_id})"
                )

            except OperationFailure as e:
                logger.error(f"MongoDB operation failed: {e}")
            except Exception as e:
                logger.error(f"Error processing message: {e}")

    except KeyboardInterrupt:
        logger.info("Shutting down consumer...")
    finally:
        consumer.close()
        mongo.close()
        logger.info("✓ Connections closed")

if __name__ == "__main__":
    main()
```
Production Features:
- Connection Retry Logic: Exponential backoff for MongoDB connection failures
- Message Validation: Ensures required fields exist before processing
- Structured Logging: Timestamp and severity level for each log entry
- Graceful Shutdown: Properly closes connections on exit
- Error Isolation: Continues processing even if individual messages fail (a dead-letter extension is sketched after this list)
- Health Checks: Verifies MongoDB connection on startup
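One further pattern not implemented above: instead of only logging a failed message, the consumer could forward it to a dead-letter topic for later inspection or replay. A hypothetical sketch follows; the `crypto.dlq` topic name and error format are assumptions, not part of `consumer_enhanced.py`.

```python
from kafka import KafkaProducer
import json

# Hypothetical helper: forward a message that could not be processed to a
# dead-letter topic instead of dropping it.
dlq_producer = KafkaProducer(
    bootstrap_servers=['localhost:9094'],
    value_serializer=lambda v: json.dumps(v).encode()
)

def send_to_dlq(original_topic, value, error):
    dlq_producer.send("crypto.dlq", {
        "source_topic": original_topic,
        "payload": value,
        "error": str(error),
    })

# Inside the consumer's per-message except block (sketch):
#     except OperationFailure as e:
#         logger.error(f"MongoDB operation failed: {e}")
#         send_to_dlq(topic, value, e)
```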
Data Flow Process
Producer Flow
```
1. Fetch Data (data_gen.py)
   ├─ HTTP GET → Binance API
   ├─ Parse JSON response
   └─ Extract symbol & price

2. Serialize (producer.py)
   ├─ Convert dict → JSON string
   ├─ Encode string → bytes
   └─ Return serialized message

3. Publish to Kafka
   ├─ Route to topic (btc/eth/sol)
   ├─ Kafka appends to partition
   └─ Return acknowledgment

4. Repeat every 3 seconds
```
Message Format:
{ "symbol": "BTCUSDT", "price": 67234.50 } Consumer Flow
```
1. Poll Kafka Topics
   ├─ Fetch batch of messages
   ├─ Deserialize bytes → JSON → dict
   └─ Extract topic metadata

2. Route by Topic
   ├─ btc → db.fx.btc
   ├─ eth → db.fx.eth
   └─ sol → db.fx.sol

3. Persist to MongoDB
   ├─ Insert document
   ├─ Auto-generate _id
   └─ Return insert result

4. Commit Offset
   └─ Update consumer position
```
MongoDB Document Structure:
{ "_id": ObjectId("507f1f77bcf86cd799439011"), "symbol": "BTCUSDT", "price": 67234.50 } Performance Metrics
Throughput Analysis
```python
# Calculate message throughput
messages_per_cycle = 3   # BTC, ETH, SOL
cycle_duration = 3       # seconds

messages_per_minute = (messages_per_cycle / cycle_duration) * 60
messages_per_hour = messages_per_minute * 60

print(f"Messages per minute: {messages_per_minute}")  # 60
print(f"Messages per hour: {messages_per_hour}")      # 3,600
print(f"Messages per day: {messages_per_hour * 24}")  # 86,400
```
Storage Growth Estimation
```python
import json

# Sample message size
sample_message = {"symbol": "BTCUSDT", "price": 67234.50}
message_size_bytes = len(json.dumps(sample_message).encode())

# Add MongoDB overhead (_id field + document structure)
document_size = message_size_bytes + 50  # ~50 bytes overhead

# Daily storage per cryptocurrency
messages_per_day = 28800  # 24 hours * 60 min * 20 messages/min
daily_storage_mb = (document_size * messages_per_day) / (1024 * 1024)

print(f"Message size: {message_size_bytes} bytes")
print(f"Document size: {document_size} bytes")
print(f"Daily storage per crypto: {daily_storage_mb:.2f} MB")
print(f"Monthly storage (all 3): {daily_storage_mb * 30 * 3:.2f} MB")
```
Expected Output:
```
Message size: 39 bytes
Document size: 89 bytes
Daily storage per crypto: 2.44 MB
Monthly storage (all 3): 220.00 MB
```
Testing & Validation
Unit Testing the Data Generator
```python
# test_data_gen.py
import unittest
from data_gen import btc, eth, sol

class TestDataGenerators(unittest.TestCase):

    def test_btc_returns_valid_data(self):
        """Test BTC generator returns correct structure"""
        for data in btc():
            self.assertIn('symbol', data)
            self.assertIn('price', data)
            self.assertEqual(data['symbol'], 'BTCUSDT')
            self.assertIsInstance(data['price'], float)
            self.assertGreater(data['price'], 0)
            break  # Test only first yield

    def test_eth_returns_valid_data(self):
        """Test ETH generator returns correct structure"""
        for data in eth():
            self.assertEqual(data['symbol'], 'ETHUSDT')
            self.assertIsInstance(data['price'], float)
            break

    def test_price_precision(self):
        """Verify price has proper decimal precision"""
        for data in btc():
            # Bitcoin prices should have 2 decimal places
            price_str = f"{data['price']:.2f}"
            self.assertEqual(len(price_str.split('.')[1]), 2)
            break

if __name__ == '__main__':
    unittest.main()
```
Integration Testing
```python
# test_integration.py
from kafka import KafkaProducer, KafkaConsumer
import json
import time

def test_end_to_end():
    """Test complete producer-consumer flow"""
    # Setup
    producer = KafkaProducer(
        bootstrap_servers=['localhost:9094'],
        value_serializer=lambda v: json.dumps(v).encode()
    )

    consumer = KafkaConsumer(
        'btc',
        bootstrap_servers=['localhost:9094'],
        value_deserializer=lambda m: json.loads(m.decode()),
        auto_offset_reset='latest',
        consumer_timeout_ms=5000
    )

    # Test data
    test_message = {"symbol": "BTCUSDT", "price": 12345.67}

    # Send message
    producer.send('btc', test_message)
    producer.flush()
    print("✓ Message sent")

    # Consume message
    time.sleep(1)  # Allow propagation
    for msg in consumer:
        received = msg.value
        assert received['symbol'] == test_message['symbol']
        assert received['price'] == test_message['price']
        print("✓ Message received and validated")
        break

    # Cleanup
    producer.close()
    consumer.close()
    print("✓ Test passed")

if __name__ == '__main__':
    test_end_to_end()
```
Monitoring & Observability
Kafka Topic Monitoring
```bash
# Create topics manually (optional - auto-created by producer)
docker exec -it <kafka_container> kafka-topics \
  --create --topic btc \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

# List all topics
docker exec -it <kafka_container> kafka-topics \
  --list --bootstrap-server localhost:9092

# Describe topic details
docker exec -it <kafka_container> kafka-topics \
  --describe --topic btc \
  --bootstrap-server localhost:9092

# Check consumer group lag
docker exec -it <kafka_container> kafka-consumer-groups \
  --describe --group my-consumer-group \
  --bootstrap-server localhost:9092
```
MongoDB Query Examples
```python
from pymongo import MongoClient
from datetime import datetime, timedelta

client = MongoClient("your_connection_string")
db = client.fx

# Get latest prices
def get_latest_prices():
    btc_latest = db.btc.find_one(sort=[('_id', -1)])
    eth_latest = db.eth.find_one(sort=[('_id', -1)])
    sol_latest = db.sol.find_one(sort=[('_id', -1)])

    return {
        'BTC': btc_latest['price'],
        'ETH': eth_latest['price'],
        'SOL': sol_latest['price']
    }

# Calculate price statistics
def get_price_stats(symbol='btc', limit=100):
    collection = getattr(db, symbol)
    prices = [doc['price'] for doc in collection.find().limit(limit)]

    return {
        'min': min(prices),
        'max': max(prices),
        'avg': sum(prices) / len(prices),
        'count': len(prices)
    }

# Find price movements
def detect_price_jumps(symbol='btc', threshold=0.02):
    """Find price changes greater than threshold (2%)"""
    collection = getattr(db, symbol)
    docs = list(collection.find().sort('_id', -1).limit(100))

    jumps = []
    for i in range(len(docs) - 1):
        current = docs[i]['price']
        previous = docs[i + 1]['price']
        change = abs(current - previous) / previous

        if change > threshold:
            jumps.append({
                'from': previous,
                'to': current,
                'change_pct': change * 100
            })

    return jumps

# Usage
print(get_latest_prices())
print(get_price_stats('btc'))
print(detect_price_jumps('btc', threshold=0.01))
```
Deployment Guide
Prerequisites
```bash
# Install Docker and Docker Compose
sudo apt-get update
sudo apt-get install docker.io docker-compose

# Install Python dependencies
pip install kafka-python pymongo requests

# Verify installations
docker --version
docker-compose --version
python --version
```
Step-by-Step Deployment
```bash
# 1. Clone repository
git clone https://github.com/yourusername/crypto-pipeline.git
cd crypto-pipeline

# 2. Start infrastructure
docker-compose up -d

# 3. Wait for Kafka to be ready (30-60 seconds)
docker-compose logs -f kafka | grep "started"

# 4. Run producer (in terminal 1)
python producer.py

# 5. Run consumer (in terminal 2)
python consumer.py

# 6. Monitor with Kafdrop
# Open browser: http://localhost:9000

# 7. View metrics in Grafana
# Open browser: http://localhost:3000
# Default credentials: admin/admin
```
Troubleshooting Common Issues
```bash
# Issue: Kafka not starting
docker-compose logs kafka
# Solution: Ensure ports 9092, 9094 are not in use

# Issue: Consumer not receiving messages
docker exec -it <kafka_container> kafka-topics --list --bootstrap-server localhost:9092
# Solution: Verify topics exist

# Issue: MongoDB connection timeout
# Solution: Check connection string, network access in Atlas

# Issue: Producer connection refused
# Solution: Wait for Kafka startup (~60 seconds)

# Check container health
docker-compose ps
docker stats

# Restart services
docker-compose restart kafka
docker-compose down && docker-compose up -d
```
Security Considerations
Current Implementation (Development)
- MongoDB credentials hardcoded (for demo purposes)
- Kafka running without authentication
- No TLS/SSL encryption
- Open ports on localhost
Production Hardening
```python
# Use environment variables for credentials
import os
import json
from dotenv import load_dotenv

load_dotenv()

MONGO_URI = os.getenv('MONGO_CONNECTION_STRING')
KAFKA_SERVERS = os.getenv('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9094')
KAFKA_USERNAME = os.getenv('KAFKA_USERNAME')
KAFKA_PASSWORD = os.getenv('KAFKA_PASSWORD')

# Secure MongoDB connection
from pymongo import MongoClient

client = MongoClient(
    MONGO_URI,
    tls=True,
    tlsAllowInvalidCertificates=False
)

# Kafka with SASL authentication
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=[KAFKA_SERVERS],
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username=KAFKA_USERNAME,
    sasl_plain_password=KAFKA_PASSWORD,
    value_serializer=lambda v: json.dumps(v).encode()
)
```
Environment Variables (.env file):
```bash
# .env (DO NOT COMMIT TO GIT)
MONGO_CONNECTION_STRING=mongodb+srv://user:pass@cluster.mongodb.net/
KAFKA_BOOTSTRAP_SERVERS=kafka.example.com:9094
KAFKA_USERNAME=your_username
KAFKA_PASSWORD=your_password
```
Future Enhancements
Phase 1: Reliability & Resilience
```python
# Add circuit breaker for API calls
import requests
from pybreaker import CircuitBreaker

api_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@api_breaker
def fetch_price(url):
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

# Implement retry logic with exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def publish_to_kafka(producer, topic, data):
    future = producer.send(topic, data)
    return future.get(timeout=10)  # Block until send completes
```
Phase 2: Advanced Analytics
```python
# Real-time moving average calculation
from collections import deque

class PriceAnalyzer:
    def __init__(self, window_size=20):
        self.prices = deque(maxlen=window_size)

    def add_price(self, price):
        self.prices.append(price)

    def get_moving_average(self):
        return sum(self.prices) / len(self.prices) if self.prices else 0

    def get_volatility(self):
        if len(self.prices) < 2:
            return 0
        avg = self.get_moving_average()
        variance = sum((p - avg) ** 2 for p in self.prices) / len(self.prices)
        return variance ** 0.5

# Usage in consumer
analyzer = PriceAnalyzer()

for msg in consumer:
    price = msg.value['price']
    analyzer.add_price(price)

    if len(analyzer.prices) == 20:
        ma = analyzer.get_moving_average()
        vol = analyzer.get_volatility()
        print(f"MA: ${ma:.2f}, Volatility: ${vol:.2f}")
```
Phase 3: Scalability
```python
# Multi-threaded producer for higher throughput
from concurrent.futures import ThreadPoolExecutor

def produce_crypto_data(crypto_func, topic, producer):
    for data in crypto_func():
        producer.send(topic, data)

with ThreadPoolExecutor(max_workers=3) as executor:
    executor.submit(produce_crypto_data, btc, 'btc', producer)
    executor.submit(produce_crypto_data, eth, 'eth', producer)
    executor.submit(produce_crypto_data, sol, 'sol', producer)
```
Conclusion
This cryptocurrency data pipeline demonstrates a production-ready approach to real-time financial data engineering, combining industry-standard technologies with cloud-native practices to deliver a scalable, maintainable, and extensible solution.
Key Achievements
✅ Real-time data ingestion from external APIs
✅ Event-driven architecture with Kafka
✅ Persistent storage in cloud database
✅ Containerized deployment
✅ Monitoring and observability
✅ Modular, testable codebase
Lessons Learned
- Decoupling is Critical: Separating producers and consumers enables independent scaling and fault tolerance
- Message Ordering: Kafka partitions guarantee ordering within a partition, which is essential for time-series data (see the keyed-send sketch after this list)
- Schema Evolution: JSON flexibility allows easy field additions without breaking consumers
- Monitoring is Essential: Kafdrop and logging provide crucial visibility into system behavior
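To make the ordering lesson concrete: the producer in this project sends records without keys, so if a topic had more than one partition, records would round-robin across partitions and lose global ordering. Keying each record by its symbol pins all records for that symbol to one partition. The snippet below is an illustration only, not part of `producer.py`.

```python
from kafka import KafkaProducer
import json

# Keyed send: records sharing a key always land in the same partition,
# so their relative order is preserved even on multi-partition topics.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9094'],
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode()
)

record = {"symbol": "BTCUSDT", "price": 67234.5}
producer.send("btc", key=record["symbol"], value=record)
producer.flush()
```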
Project Metrics
- Lines of Code: ~150 (excluding configuration)
- Response Time: <1 second from API to database
- Throughput: 3,600 messages/hour
- Storage Efficiency: ~220 MB/month for 3 cryptocurrencies
- Uptime Target: 99.9% (with proper error handling)
Repository Structure
crypto-pipeline/