Scala Big Data

Open-source Scala projects categorized as Big Data

Top 23 Scala Big Data Projects

  1. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: 15 AWS EMR Cost Optimization Tips to Slash Your EMR Spending (2025) | dev.to | 2025-12-16

    AWS EMR (Elastic MapReduce) is a fully managed big data platform. It manages the setup, configuration, and tuning of open source frameworks like Apache Hadoop, Apache Spark, Apache Hive, Presto, and more at scale on AWS infrastructure. EMR handles cluster scaling, resource allocation, and lifecycle management. This allows you to work with large datasets for various use cases, from ETL pipelines to ML workloads. EMR uses a pay-as-you-go pricing model. Costs for compute, storage, and other AWS services can add up quickly as your data grows, clusters get bigger, and jobs become more complex. If you're not careful, costs can skyrocket due to inefficient resource use, poor instance choices, and misconfigured storage. That's why AWS EMR Cost Optimization is key. It helps you get the best performance per dollar while maintaining data processing speed, reliability, and scalability.

  2. InfluxDB

    InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.

  3. kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  4. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: Top Open-Source Data Engineering Tools- Unravelling the Best in 2026 | dev.to | 2025-12-10

    Delta Lake

  5. SynapseML

    Simple and Distributed Machine Learning

    Project mention: The Grug Brained Developer | news.ycombinator.com | 2025-06-17

    > to see how they ended up in that situation

    The "how" is almost always lack of discipline (or as I sometimes couch it, "imagination") but usually shit like https://github.com/microsoft/SynapseML/issues/405#:~:text=cl...

  6. Scalding

    A Scala API for Cascading

  7. Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

  8. Jupyter Scala

    A Scala kernel for Jupyter

  9. Stream

    Stream - Scalable APIs for Chat, Feeds, Moderation, & Video. Stream helps developers build engaging apps that scale to millions with performant and flexible Chat, Feeds, Moderation, and Video APIs and SDKs powered by a global edge network and enterprise-grade infrastructure.

  10. Reactive-kafka

    Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.

  11. adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  12. H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  13. spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

    Project mention: Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL | news.ycombinator.com | 2025-05-12
  14. BIDMach

    CPU and GPU-accelerated Machine Learning Library

  15. delta-sharing

    An open protocol for secure data sharing

  16. Gearpump

    Lightweight real-time big data streaming engine over Akka

  17. Vegas

    The missing MatPlotLib for Scala + Spark (by vegas-viz)

  18. nussknacker

    Low-code tool for automating actions on real time data | Stream processing for the users.

  19. Sparkta

    Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)

  20. Scoobi

    A Scala productivity framework for Hadoop. (by NICTA)

  21. qbeast-spark

    Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

  22. Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  23. Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  24. spark-deployer

    Deploy Spark cluster in an easy way.

  25. Spark Utils

    Basic framework utilities to quickly start writing production ready Apache Spark applications

  26. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Big Data related posts

  • Delta Lake 4.0.0

    1 project | news.ycombinator.com | 5 Nov 2025
  • Apache Iceberg V3 Spec new features for more efficient and flexible data lakes

    2 projects | news.ycombinator.com | 11 Aug 2025
  • Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub

    3 projects | dev.to | 17 Oct 2024
  • Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub (Portuguese-language version)

    3 projects | dev.to | 8 Aug 2024
  • Make Rust Object Oriented with the dual-trait pattern

    2 projects | dev.to | 8 Jul 2024
  • Ask HN: Anyone using Delta Sharing in production?

    1 project | news.ycombinator.com | 1 Jul 2024
  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023

Index

What are some of the best open-source Big Data projects in Scala? This list will help you:

#    Project            Stars
1    Apache Spark       42,518
2    kafka-manager      11,937
3    delta               8,472
4    SynapseML           5,191
5    Scalding            3,520
6    Scio                2,613
7    Jupyter Scala       1,617
8    Reactive-kafka      1,419
9    adam                1,041
10   H2O                   977
11   spark-rapids          951
12   BIDMach               916
13   delta-sharing         907
14   Gearpump              758
15   Vegas                 729
16   nussknacker           703
17   Sparkta               528
18   Scoobi                482
19   qbeast-spark          235
20   Clustering4Ever       130
21   Schemer               112
22   spark-deployer         75
23   Spark Utils            36

Did you know that Scala is the 32nd most popular programming language based on number of references?