Top 23 Scala Big Data Projects

Apache Spark

1 128 42,518 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Project mention: 15 AWS EMR Cost Optimization Tips to Slash Your EMR Spending (2025) | dev.to | 2025-12-16

AWS EMR (Elastic MapReduce) is a fully managed big data platform. It manages the setup, configuration, and tuning of open source frameworks like Apache Hadoop, Apache Spark, Apache Hive, Presto, and more at scale on AWS infrastructure. EMR handles cluster scaling, resource allocation, and lifecycle management. This allows you to work with large datasets for various use cases, from ETL pipelines to ML workloads. EMR uses a pay-as-you-go pricing model. Costs for compute, storage, and other AWS services can add up quickly as your data grows, clusters get bigger, and jobs become more complex. If you're not careful, costs can skyrocket due to inefficient resource use, poor instance choices, and misconfigured storage. That's why AWS EMR Cost Optimization is key. It helps you get the best performance per dollar while maintaining data processing speed, reliability, and scalability.
kafka-manager

2 13 11,937 0.0 Scala

CMAK is a tool for managing Apache Kafka clusters
delta

3 80 8,472 9.9 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

Project mention: Top Open-Source Data Engineering Tools- Unravelling the Best in 2026 | dev.to | 2025-12-10

Delta Lake
SynapseML

4 19 5,191 8.6 Scala

Simple and Distributed Machine Learning

Project mention: The Grug Brained Developer | news.ycombinator.com | 2025-06-17

> to see how they ended up in that situation
The "how" is almost always lack of discipline (or as I sometimes couch it, "imagination") but usually shit like https://github.com/microsoft/SynapseML/issues/405#:~:text=cl...
Scalding

5 0 3,520 2.5 Scala

A Scala API for Cascading
Scio

6 7 2,613 9.2 Scala

A Scala API for Apache Beam and Google Cloud Dataflow.
Jupyter Scala

7 7 1,617 9.1 Scala

A Scala kernel for Jupyter
Reactive-kafka

8 0 1,419 8.1 Scala

Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
adam

9 3 1,041 3.7 Scala

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
H2O

10 0 977 3.4 Scala

Sparkling Water provides H2O functionality inside a Spark cluster
spark-rapids

11 6 951 9.8 Scala

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

Project mention: Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL | news.ycombinator.com | 2025-05-12
BIDMach

12 0 916 0.0 Scala

CPU and GPU-accelerated Machine Learning Library
delta-sharing

13 5 907 8.5 Scala

An open protocol for secure data sharing
Gearpump

14 0 758 0.0 Scala

Lightweight real-time big data streaming engine over Akka
Vegas

15 0 729 0.0 Scala

The missing MatPlotLib for Scala + Spark (by vegas-viz)
nussknacker

16 1 703 9.7 Scala

Low-code tool for automating actions on real time data | Stream processing for the users.
Sparkta

17 0 528 0.0 Scala

Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)
Scoobi

18 0 482 0.0 Scala

A Scala productivity framework for Hadoop. (by NICTA)
qbeast-spark

19 12 235 8.6 Scala

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Clustering4Ever

20 0 130 0.0 Scala

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Schemer

21 0 112 0.0 Scala

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
spark-deployer

22 0 75 0.0 Scala

Deploy a Spark cluster in an easy way.
Spark Utils

23 0 36 4.6 Scala

Basic framework utilities to quickly start writing production ready Apache Spark applications

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Big Data related posts

Delta Lake 4.0.0

1 project | news.ycombinator.com | 5 Nov 2025
Apache Iceberg V3 Spec new features for more efficient and flexible data lakes

2 projects | news.ycombinator.com | 11 Aug 2025
Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub

3 projects | dev.to | 17 Oct 2024
Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub (Portuguese-language version)

3 projects | dev.to | 8 Aug 2024
Make Rust Object Oriented with the dual-trait pattern

2 projects | dev.to | 8 Jul 2024
Ask HN: Anyone using Delta Sharing in production?

1 project | news.ycombinator.com | 1 Jul 2024
Azure data lake - Data Share

1 project | /r/dataengineering | 29 Jun 2023

Index

What are some of the best open-source Big Data projects in Scala? This list will help you:

#	Project	Stars
1	Apache Spark	42,518
2	kafka-manager	11,937
3	delta	8,472
4	SynapseML	5,191
5	Scalding	3,520
6	Scio	2,613
7	Jupyter Scala	1,617
8	Reactive-kafka	1,419
9	adam	1,041
10	H2O	977
11	spark-rapids	951
12	BIDMach	916
13	delta-sharing	907
14	Gearpump	758
15	Vegas	729
16	nussknacker	703
17	Sparkta	528
18	Scoobi	482
19	qbeast-spark	235
20	Clustering4Ever	130
21	Schemer	112
22	spark-deployer	75
23	Spark Utils	36

Did you know that Scala is the 32nd most popular programming language based on number of references?
