Top 23 Scala Big Data Projects
- Project mention: 15 AWS EMR Cost Optimization Tips to Slash Your EMR Spending (2025) | dev.to | 2025-12-16
AWS EMR (Elastic MapReduce) is a fully managed big data platform. It manages the setup, configuration, and tuning of open source frameworks like Apache Hadoop, Apache Spark, Apache Hive, Presto, and more at scale on AWS infrastructure. EMR handles cluster scaling, resource allocation, and lifecycle management. This allows you to work with large datasets for various use cases, from ETL pipelines to ML workloads. EMR uses a pay-as-you-go pricing model. Costs for compute, storage, and other AWS services can add up quickly as your data grows, clusters get bigger, and jobs become more complex. If you're not careful, costs can skyrocket due to inefficient resource use, poor instance choices, and misconfigured storage. That's why AWS EMR Cost Optimization is key. It helps you get the best performance per dollar while maintaining data processing speed, reliability, and scalability.
- delta
  An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
  Project mention: Top Open-Source Data Engineering Tools- Unravelling the Best in 2026 | dev.to | 2025-12-10
  Delta Lake
-
> to see how they ended up in that situation
The "how" is almost always lack of discipline (or as I sometimes couch it, "imagination") but usually shit like https://github.com/microsoft/SynapseML/issues/405#:~:text=cl...
- Reactive-kafka
  Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
- adam
  ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
- Project mention: Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL | news.ycombinator.com | 2025-05-12
- nussknacker
  Low-code tool for automating actions on real-time data | Stream processing for the users.
- qbeast-spark
  Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
- Clustering4Ever
  C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.
- Schemer
  Schema registry for CSV, TSV, JSON, Avro, and Parquet schemas. Supports schema inference and a GraphQL API.
- Spark Utils
  Basic framework utilities to quickly start writing production-ready Apache Spark applications
Scala Big Data related posts
- Delta Lake 4.0.0
- Apache Iceberg V3 Spec new features for more efficient and flexible data lakes
- Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub
- Engenharia de Dados com Scala: masterizando o processamento de dados em tempo real com Apache Flink e Google Pub/Sub
- Make Rust Object Oriented with the dual-trait pattern
- Ask AN: Anyone using Delta Sharing in production?
- Azure data lake - Data Share
- A note from our sponsor - Stream getstream.io | 21 Dec 2025
Index
What are some of the best open-source Big Data projects in Scala? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Apache Spark | 42,518 |
| 2 | kafka-manager | 11,937 |
| 3 | delta | 8,472 |
| 4 | SynapseML | 5,191 |
| 5 | Scalding | 3,520 |
| 6 | Scio | 2,613 |
| 7 | Jupyter Scala | 1,617 |
| 8 | Reactive-kafka | 1,419 |
| 9 | adam | 1,041 |
| 10 | H2O | 977 |
| 11 | spark-rapids | 951 |
| 12 | BIDMach | 916 |
| 13 | delta-sharing | 907 |
| 14 | Gearpump | 758 |
| 15 | Vegas | 729 |
| 16 | nussknacker | 703 |
| 17 | Sparkta | 528 |
| 18 | Scoobi | 482 |
| 19 | qbeast-spark | 235 |
| 20 | Clustering4Ever | 130 |
| 21 | Schemer | 112 |
| 22 | spark-deployer | 75 |
| 23 | Spark Utils | 36 |