Analytics should never compete with production workloads.
In many systems, analytics begin as a convenience query against the production database. Over time, these queries grow heavier, scan larger time ranges, and slowly degrade system reliability.
This post describes a lightweight, production-safe data pipeline for analytical workloads, built with:
- Delta Lake on S3 for durable analytical storage
- DuckDB as an embedded analytics engine
- Tenant-isolated batch processing
- Strict resource limits
- Local caching to reduce cloud IO
This is not a streaming system, and it is not an anomaly engine yet.
It is a clean analytical data pipeline designed to protect production databases.
The Problem With Analytics on Production Databases
Production databases are optimized for:
- Fast point reads
- High write concurrency
- Transactional guarantees
They are not, however, optimized for:
- Large historical scans
- Aggregations over months or years
- Repeated analytical jobs
- Heavy read amplification
Running analytics on production systems leads to:
- Query latency spikes
- Lock contention
- CPU starvation
- Unpredictable failures during peak hours
The primary architectural goal was simple:
Never run analytical queries on the production database.
Architectural Principles
This pipeline is built on four non-negotiable principles:
- Production isolation – analytics never touch OLTP systems
- Batch-first design – no streaming complexity
- Bounded resources – strict memory and thread limits
- Operational simplicity – no clusters to manage
High-Level Architecture

DuckDB runs inside the application process, not as a service.
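As a rough sketch of what "embedded" means here (assuming a Python host application; the table path, extension setup, and credentials handling are illustrative, not taken from the original system), analytics boil down to opening an in-process connection and scanning the Delta table directly from S3:

```python
import duckdb

# In-process engine: a library call, not a service or cluster.
con = duckdb.connect()

# httpfs gives DuckDB S3 access; the delta extension reads Delta Lake tables.
for stmt in ("INSTALL httpfs", "LOAD httpfs", "INSTALL delta", "LOAD delta"):
    con.execute(stmt)

# Hypothetical table URI; S3 credentials come from the environment / AWS config.
rows = con.execute("""
    SELECT count(*) AS readings
    FROM delta_scan('s3://data-lake/tenant_id=tenant_123/energy_reading/')
""").fetchall()
print(rows)
```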
Why Delta Lake on S3?
Delta Lake provides ACID transactions on object storage, schema enforcement, safe concurrent writes, and features like time travel for reprocessing.
Amazon S3, in turn, offers low-cost storage and high durability.
This combination creates a stable analytical backbone that is fully decoupled from production systems.
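On the write side, a minimal sketch of how data could land in that backbone (assuming Python and the `deltalake` package; the schema, path, and batch are illustrative) is a partitioned append that goes through the Delta transaction log:

```python
import pandas as pd
from deltalake import write_deltalake

# Illustrative batch of readings for a single tenant.
batch = pd.DataFrame({
    "device_id": ["d-1", "d-2"],
    "reading":   [12.4, 9.8],
    "ts":        pd.to_datetime(["2025-03-01 00:00", "2025-03-01 00:15"]),
    "year":      [2025, 2025],
    "month":     [3, 3],
})

# The append is recorded in the Delta log, so concurrent readers and writers
# see a consistent snapshot. Credentials and S3 settings come from the environment.
write_deltalake(
    "s3://data-lake/tenant_id=tenant_123/energy_reading/",
    batch,
    mode="append",
    partition_by=["year", "month"],
)
```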
Why DuckDB?
DuckDB is an embedded OLAP engine, not a distributed system.
Why that matters:
- No JVM
- No cluster
- No coordinator
- No operational overhead
Beyond that, DuckDB excels at columnar scans and vectorized execution, and it is known for fast Parquet / Delta reads and strong batch analytical performance.
This makes it ideal for scheduled jobs, tenant-isolated analytics, and resource-constrained environments.
Explicit Resource Control
DuckDB is configured with hard limits:
```sql
SET memory_limit = '1GB';        -- as per environment
SET threads = 2;                 -- based on available CPU cores
SET enable_object_cache = true;
```

This prevents memory blow-ups and keeps performance predictable.
Each analytical run behaves like a bounded analytical task, not a long-running service.
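A small helper (a sketch, assuming a Python host; the exact limits would be tuned per environment) applies these settings right after the connection is opened, so every job starts bounded:

```python
import duckdb

def open_bounded_connection(memory_limit: str = "1GB",
                            threads: int = 2) -> duckdb.DuckDBPyConnection:
    """Open an in-process DuckDB connection with hard resource caps."""
    con = duckdb.connect()
    con.execute(f"SET memory_limit = '{memory_limit}';")
    con.execute(f"SET threads = {threads};")
    con.execute("SET enable_object_cache = true;")
    return con
```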
Tenant-Isolated Data Layout
Data is stored per tenant:
```
s3://data-lake/
  tenant_id=tenant_123/
    energy_reading/
      year=2025/
        month=03/
```

Benefits:
- Partition pruning
- Reduced S3 reads
- Strong tenant isolation
- Linear horizontal scaling by tenant
Each job processes one tenant at a time.
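Because each tenant owns its own table prefix and year/month partitions, a job only ever touches its own slice of the lake. A sketch of such a bounded, tenant-scoped query (column names and the path scheme are illustrative; the connection is assumed to have the delta extension loaded, as in the earlier sketch):

```python
def monthly_aggregate(con, tenant_id: str, year: int, month: int):
    """Aggregate one tenant's readings for a single month.

    The year/month predicates map onto Hive-style partitions, so DuckDB
    prunes every other partition before reading anything from S3.
    """
    table_uri = f"s3://data-lake/tenant_id={tenant_id}/energy_reading/"
    query = f"""
        SELECT device_id, avg(reading) AS avg_reading, count(*) AS samples
        FROM delta_scan('{table_uri}')
        WHERE year = ? AND month = ?
        GROUP BY device_id
    """
    return con.execute(query, [year, month]).fetch_df()
```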
Scheduler-Based Processing (Not Streaming)
This system intentionally avoids streaming frameworks.
Why:
- Analytics are not latency-critical
- Batch jobs are deterministic
- Failures are easier to retry
- Infrastructure cost is dramatically lower
Jobs:
- Run on a scheduler
- Read bounded time ranges
- Exit cleanly after execution
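The shape of one such run, reusing the helpers sketched above (the scheduler itself could be cron or any job runner, and `store_result` is a hypothetical sink, not part of the original design):

```python
from datetime import date

def run_monthly_jobs(tenants: list[str]) -> None:
    """Process each tenant sequentially for the current month, then exit."""
    today = date.today()
    con = open_bounded_connection()           # hard memory / thread caps
    try:
        for tenant_id in tenants:
            result = monthly_aggregate(con, tenant_id, today.year, today.month)
            store_result(tenant_id, result)   # hypothetical sink: cache, report table, ...
    finally:
        con.close()                           # clean exit, nothing long-running left behind

if __name__ == "__main__":
    run_monthly_jobs(["tenant_123", "tenant_456"])
```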
Local Disk Cache to Reduce S3 Reads
Repeated S3 scans can be expensive and slow.
A local disk cache:
- Stores tenant-scoped results
- Uses TTL-based eviction
- Eliminates redundant reads
This significantly improves latency, cost efficiency, and overall system stability.
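A minimal version of that cache (a sketch; the cache directory, TTL, and key scheme are assumptions, not the exact implementation) keys results by tenant and time range and evicts purely on age:

```python
import time
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("/var/cache/analytics")   # assumption: local scratch disk
TTL_SECONDS = 6 * 3600                     # assumption: 6-hour freshness window

def cached_result(tenant_id: str, year: int, month: int, compute) -> pd.DataFrame:
    """Return a fresh cached result if one exists, otherwise recompute and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{tenant_id}_{year}_{month:02d}.parquet"

    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return pd.read_parquet(path)       # warm read: no S3 traffic at all

    df = compute()                         # cold read: scan Delta on S3
    df.to_parquet(path, index=False)
    return df
```

A warm hit is served entirely from local disk, which is what produces the millisecond-range warm-read numbers in the benchmarks below.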
Performance Benchmarks (DuckDB + Delta Lake on S3)
Environment
- Instance: 4 vCPU / 8 GB RAM
- DuckDB memory limit: 1 GB
- DuckDB threads: 2
- Storage: Amazon S3 (Delta Lake)
- Cache: Local disk (result-level cache)
| Metric | Cold Read (S3 + Delta Scan) | Warm Read (Local Cache) |
|---|---|---|
| Rows scanned | ~2.1 million | 0 |
| Rows returned | ~120,000 | ~120,000 |
| Data read from S3 | ~180 MB | 0 |
| Query execution time | 4.8 – 6.2 seconds | 40 – 90 ms |
| Peak memory usage | ~620 MB | ~120 MB |
| CPU utilization | ~1.5 – 1.8 cores | < 0.3 core |
| Network I/O | High (S3 reads) | None |
| Production DB load | 0 | 0 |
Multi-Tenant Sequential Jobs
| Number of Tenants | Total Processing Time |
|---|---|
| 10 tenants | ~45 seconds |
| 25 tenants | ~2 minutes |
| 50 tenants | ~4 minutes |
Observation:
No cross-tenant interference was observed. Each tenant job executed in isolation with predictable and linear scaling characteristics.
Comparison With ClickHouse and Spark
| Engine | Best For | Strengths | Tradeoffs / Downsides | Operational Complexity |
|---|---|---|---|---|
| DuckDB (This Approach) | Embedded, batch analytics | - Runs in-process - No cluster management - Strong S3 + Parquet support - Predictable resource usage | - Single-node execution - Not designed for high concurrent queries | Low |
| ClickHouse | Real-time analytical services | - Extremely fast OLAP queries - High query concurrency - Mature production deployments | - Requires dedicated servers - Stateful storage management - Higher ops overhead | Medium–High |
| Apache Spark | Large-scale data processing | - Horizontally scalable - Excellent for ETL and ML workloads - Rich ecosystem | - Heavy memory footprint - Slow startup time - Significant operational complexity | High |
Takeaway:
For bounded, tenant-isolated analytical jobs running on a schedule, DuckDB provides the best balance of performance, cost, and operational simplicity.
What This Pipeline Is (and What It Is Not)
This pipeline is designed to solve a very specific problem: running analytical workloads without putting any pressure on production systems.
It acts as a production-safe analytical layer, allowing historical data to be scanned, aggregated, and analyzed in isolation. It also serves as a clean foundation for downstream logic such as anomaly detection or reporting, without forcing those concerns into the storage or query layer. Most importantly, it achieves this without requiring a dedicated analytics cluster, making it a cost-efficient alternative for teams operating under infrastructure constraints.
At the same time, this pipeline is intentionally narrow in scope.
It is not a streaming system, and it does not aim to deliver real-time insights. It is also not an OLTP replacement, nor does it embed anomaly detection logic directly. The analytical engine focuses purely on reliable, bounded data access — any anomaly detection or intelligence logic is built on top of it, not baked into it.
This separation keeps the system easier to reason about, test, and evolve over time.
Final Thoughts
Not every analytical problem requires a cluster, a streaming framework, or a heavyweight OLAP database.
In many cases, especially for scheduled or batch-driven analysis, those tools introduce more operational complexity than value. A simpler system — one that is bounded, predictable, and easy to operate — often performs better in practice.
By combining DuckDB with Delta Lake on S3, this pipeline behaves like an analytical sidecar: it stays out of the way of production workloads while still delivering fast, reliable analysis when needed.
The key takeaway is not about the tools themselves, but about architectural restraint. Choosing the simplest system that meets your requirements often leads to more stable, maintainable, and cost-effective outcomes.