Nimai Charan

Designing a Lightweight Data Pipeline for Analysis Without Stressing Production Databases

Analytics should never compete with production workloads.

In many systems, analytics begin as convenience queries against the production database. Over time, these queries grow heavier, scan larger time ranges, and slowly degrade system reliability.

This post describes a lightweight, production-safe data pipeline for analytical workloads, built with:

  • Delta Lake on S3 for durable analytical storage
  • DuckDB as an embedded analytics engine
  • Tenant-isolated batch processing
  • Strict resource limits
  • Local caching to reduce cloud IO

This is not a streaming system, and it is not an anomaly engine (yet).

It is a clean analytical data pipeline designed to protect production databases.

The Problem With Analytics on Production Databases

Production databases are optimized for:

  • Fast point reads
  • High write concurrency
  • Transactional guarantees

They are not optimized for:

  • Large historical scans
  • Aggregations over months or years
  • Repeated analytical jobs
  • Heavy read amplification

Running analytics on production systems leads to:

  • Query latency spikes
  • Lock contention
  • CPU starvation
  • Unpredictable failures during peak hours

The primary architectural goal was simple:

Never run analytical queries on the production database.

Architectural Principles

This pipeline is built on four non-negotiable principles:

  1. Production isolation – analytics never touch OLTP systems
  2. Batch-first design – no streaming complexity
  3. Bounded resources – strict memory and thread limits
  4. Operational simplicity – no clusters to manage

High-Level Architecture

Analytical data lives in Delta Lake tables on S3, separate from the production database. Scheduled batch jobs read those tables with DuckDB and cache results on local disk.

DuckDB runs inside the application process, not as a service.

Why Delta Lake on S3?

Delta Lake provides ACID transactions on object storage, schema enforcement, safe concurrent writes, and features like time travel for reprocessing.

Amazon S3, on the other hand, offers low-cost storage and high durability.

This combination creates a stable analytical backbone that is fully decoupled from production systems.
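
To make the write side concrete, here is a rough sketch using the delta-rs Python bindings (the deltalake package). The helper name, schema, and ingestion step are placeholders, not something prescribed by this pipeline:

# Sketch: appending one bounded batch of rows to a tenant's Delta table on S3.
# Assumes the `deltalake` and `pandas` packages and AWS credentials available
# in the environment; column names and the table path are illustrative.
import pandas as pd
from deltalake import write_deltalake

def export_batch(rows: list[dict], tenant_id: str) -> None:
    # Rows are expected to carry `year` and `month` columns for partitioning.
    df = pd.DataFrame(rows)
    write_deltalake(
        f"s3://data-lake/tenant_id={tenant_id}/energy_reading",
        df,
        mode="append",                    # Delta's transaction log keeps appends safe
        partition_by=["year", "month"],   # enables partition pruning at read time
    )

Partitioning by year and month here matches the tenant-isolated layout described later in the post.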

Why DuckDB?

DuckDB is an embedded OLAP engine, not a distributed system.

Why that matters:

  • No JVM
  • No cluster
  • No coordinator
  • No operational overhead

Beyond that, DuckDB excels at columnar scans, vectorized execution, fast Parquet / Delta reads, and batch analytical workloads.

This makes it ideal for scheduled jobs, tenant-isolated analytics, and resource-constrained environments.

Explicit Resource Control

DuckDB is configured with hard limits:

SET memory_limit = '1GB';        -- adjust per environment
SET threads = 2;                 -- based on available CPU cores
SET enable_object_cache = true;

This guarantees no memory explosions and predictable performance.

Each analytical run behaves like a bounded analytical task, not a long-running service.
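
The same limits can be applied from application code when DuckDB is embedded. A minimal sketch, assuming the duckdb Python package (the helper name is illustrative):

# Sketch: open an in-process DuckDB connection with hard resource limits
# applied before each analytical run.
import duckdb

def open_bounded_connection() -> duckdb.DuckDBPyConnection:
    con = duckdb.connect()                        # embedded, no server process
    con.execute("SET memory_limit = '1GB'")       # hard memory ceiling per job
    con.execute("SET threads = 2")                # bounded CPU usage
    con.execute("SET enable_object_cache = true")
    return con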

Tenant-Isolated Data Layout

Data is stored per tenant:

s3://data-lake/
  tenant_id=tenant_123/
    energy_reading/
      year=2025/
        month=03/

Benefits:

  • Partition pruning
  • Reduced S3 reads
  • Strong tenant isolation
  • Linear horizontal scaling by tenant

Each job processes one tenant at a time.
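
As an illustration of how this layout is consumed, here is a sketch of a tenant-scoped read. It assumes DuckDB's httpfs and delta extensions are available and that S3 credentials are configured in the environment; the path, columns, and filters are placeholders:

# Sketch: query one tenant's Delta table for a bounded year/month window.
# Partition columns in the path let DuckDB prune everything outside the
# requested range, keeping S3 reads small.
import duckdb

def read_tenant_month(con: duckdb.DuckDBPyConnection,
                      tenant_id: str, year: int, month: int):
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("INSTALL delta")
    con.execute("LOAD delta")
    return con.execute(
        f"""
        SELECT *
        FROM delta_scan('s3://data-lake/tenant_id={tenant_id}/energy_reading')
        WHERE year = ? AND month = ?
        """,
        [year, month],
    ).fetch_df()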

Scheduler-Based Processing (Not Streaming)

This system intentionally avoids streaming frameworks.

Why:

  • Analytics are not latency-critical
  • Batch jobs are deterministic
  • Failures are easier to retry
  • Infrastructure cost is dramatically lower

Jobs:

  • Run on a scheduler
  • Read bounded time ranges
  • Exit cleanly after execution
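
A rough sketch of such a job entry point, reusing the helpers sketched earlier. The tenant list, time window, and the store_results step are placeholders:

# Sketch: a scheduled entry point that processes tenants sequentially over a
# bounded time window and exits when done. Any scheduler (cron, a scheduled
# container task, APScheduler) can invoke run_batch().
from datetime import date

def run_batch(tenant_ids: list[str]) -> None:
    today = date.today()
    year, month = today.year, today.month    # bounded time range: current month

    con = open_bounded_connection()           # hard memory/thread limits applied
    try:
        for tenant_id in tenant_ids:          # one tenant at a time, in isolation
            df = read_tenant_month(con, tenant_id, year, month)
            store_results(tenant_id, df)      # hypothetical downstream step
    finally:
        con.close()                           # exit cleanly after execution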

Local Disk Cache to Reduce S3 Reads

Repeated S3 scans are expensive and slow.

A local disk cache:

  • Stores tenant-scoped results
  • Uses TTL-based eviction
  • Eliminates redundant reads

This significantly improves latency, cost efficiency, and overall system stability.
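
A cache like this needs nothing heavier than the local filesystem. A minimal sketch, with the directory, TTL, and file format as placeholders:

# Sketch: tenant-scoped result cache on local disk with TTL-based eviction.
import os
import time
import pandas as pd

CACHE_DIR = "/var/cache/analytics"
TTL_SECONDS = 6 * 3600            # evict results older than six hours

def _path(tenant_id: str, key: str) -> str:
    return os.path.join(CACHE_DIR, tenant_id, f"{key}.parquet")

def load_cached(tenant_id: str, key: str) -> pd.DataFrame | None:
    path = _path(tenant_id, key)
    if not os.path.exists(path):
        return None
    if time.time() - os.path.getmtime(path) > TTL_SECONDS:
        os.remove(path)            # TTL expired: evict and fall back to S3
        return None
    return pd.read_parquet(path)   # warm read: no S3 traffic

def store_cached(tenant_id: str, key: str, df: pd.DataFrame) -> None:
    os.makedirs(os.path.dirname(_path(tenant_id, key)), exist_ok=True)
    df.to_parquet(_path(tenant_id, key))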

Performance Benchmarks (DuckDB + Delta Lake on S3)

Environment

  • Instance: 4 vCPU / 8 GB RAM
  • DuckDB memory limit: 1 GB
  • DuckDB threads: 2
  • Storage: Amazon S3 (Delta Lake)
  • Cache: Local disk (result-level cache)

Metric               | Cold Read (S3 + Delta Scan) | Warm Read (Local Cache)
Rows scanned         | ~2.1 million                | 0
Rows returned        | ~120,000                    | ~120,000
Data read from S3    | ~180 MB                     | 0
Query execution time | 4.8 – 6.2 seconds           | 40 – 90 ms
Peak memory usage    | ~620 MB                     | ~120 MB
CPU utilization      | ~1.5 – 1.8 cores            | < 0.3 core
Network I/O          | High (S3 reads)             | None
Production DB load   | 0                           | 0

Multi-Tenant Sequential Jobs

Number of Tenants | Total Processing Time
10 tenants        | ~45 seconds
25 tenants        | ~2 minutes
50 tenants        | ~4 minutes

Observation:

No cross-tenant interference was observed. Each tenant job executed in isolation with predictable and linear scaling characteristics.

Comparison With ClickHouse and Spark

Engine                 | Best For                      | Strengths                                                                                        | Tradeoffs / Downsides                                                         | Operational Complexity
DuckDB (This Approach) | Embedded, batch analytics     | Runs in-process; no cluster management; strong S3 + Parquet support; predictable resource usage | Single-node execution; not designed for high query concurrency                | Low
ClickHouse             | Real-time analytical services | Extremely fast OLAP queries; high query concurrency; mature production deployments              | Requires dedicated servers; stateful storage management; higher ops overhead  | Medium–High
Apache Spark           | Large-scale data processing   | Horizontally scalable; excellent for ETL and ML workloads; rich ecosystem                       | Heavy memory footprint; slow startup time; significant operational complexity | High

Takeaway:

For bounded, tenant-isolated analytical jobs running on a schedule, DuckDB provides the best balance of performance, cost, and operational simplicity.

What This Pipeline Is (and What It Is Not)

This pipeline is designed to solve a very specific problem: running analytical workloads without putting any pressure on production systems.

It acts as a production-safe analytical layer, allowing historical data to be scanned, aggregated, and analyzed in isolation. It also serves as a clean foundation for downstream logic such as anomaly detection or reporting, without forcing those concerns into the storage or query layer. Most importantly, it achieves this without requiring a dedicated analytics cluster, making it a cost-efficient alternative for teams operating under infrastructure constraints.

At the same time, this pipeline is intentionally narrow in scope.

It is not a streaming system, and it does not aim to deliver real-time insights. It is also not an OLTP replacement, nor does it embed anomaly detection logic directly. The analytical engine focuses purely on reliable, bounded data access — any anomaly detection or intelligence logic is built on top of it, not baked into it.

This separation keeps the system easier to reason about, test, and evolve over time.

Final Thoughts

Not every analytical problem requires a cluster, a streaming framework, or a heavyweight OLAP database.

In many cases, especially for scheduled or batch-driven analysis, those tools introduce more operational complexity than value. A simpler system — one that is bounded, predictable, and easy to operate — often performs better in practice.

By combining DuckDB with Delta Lake on S3, this pipeline behaves like an analytical sidecar: it stays out of the way of production workloads while still delivering fast, reliable analysis when needed.

The key takeaway is not about the tools themselves, but about architectural restraint. Choosing the simplest system that meets your requirements often leads to more stable, maintainable, and cost-effective outcomes.
