Michael Rys & Rahul Potharaju, Microsoft Corp. Big Data Team
@MikeDoesBigData, @RahulPotharaju | #DotNETForSpark #vslive
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing in Apache Spark
• Introducing .NET for Apache® Spark™ for building data pipelines
  – Why do we need .NET for Apache Spark?
  – What is .NET for Apache Spark?
  – Can I use .NET for Apache Spark with Azure HDInsight Spark, Azure Databricks, etc.?
  – Show me some examples!
Azure modern data warehouse architecture (INGEST → STORE → PREP & TRAIN → MODEL & SERVE):
• INGEST: Azure Data Factory, pulling from logs, files and media (unstructured) and business/custom apps (structured)
• STORE: Azure Data Lake Storage
• PREP & TRAIN: Azure Databricks and Azure HDInsight Spark (Python, Scala, Spark SQL, .NET for Apache Spark)
• MODEL & SERVE: Azure SQL Data Warehouse (via PolyBase), Azure Analysis Services, Power BI
• ORCHESTRATION & DATA FLOW (ETL): Azure Data Factory
Azure also supports other Big Data services like Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.
• Apache Spark is a fast OSS analytics engine for big data and machine learning
• Improves efficiency through:
  – General computation graphs beyond map/reduce
  – In-memory computing primitives
• Allows developers to scale out their user code & write in their language of choice
  – Rich APIs in Java, Scala, Python, R, Spark SQL, etc.
  – Batch processing, streaming and an interactive shell
• Available on Azure via:
  – Azure Databricks
  – Azure HDInsight
  – IaaS/Kubernetes
.NET Developers 💖 Apache Spark…
• A lot of business logic usable for big data (millions of lines of code) is written in .NET, and it is expensive and difficult to translate into Python/Scala/Java!
• .NET developers are locked out of big data processing due to the lack of .NET support in OSS big data solutions
• In a recently conducted .NET developer survey (> 1,000 developers), more than 70% expressed interest in Apache Spark!
• They would like to tap into the OSS ecosystem for code libraries, support and hiring
Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark. Non-Goal: Converting existing Scala/Python/Java Spark developers.
Microsoft is committed to:
• An interop layer for .NET (Scala-side)
• Potentially optimizing the Python and R interop layers
• Technical documentation, blogs and articles
• End-to-end scenarios
• Performance benchmarking (cluster)
• Production workloads
• Out-of-the-box support on Azure HDInsight, easy to use with Azure Databricks
• C# (and F#) language extensions using .NET
• Performance benchmarking (interop)
• Portability aspects (e.g., cross-platform .NET Standard)
• Tooling (e.g., Jupyter, Visual Studio, Visual Studio Code)
… and developing in the open!
.NET for Apache Spark was open sourced at Spark+AI Summit 2019:
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Version 0.4 released end of July 2019
Contributions to foundational OSS projects:
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435
• Pyrolite (pickling library): improved pickling/unpickling performance, added a strong name to Pyrolite
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
Journey since //Build 2019 (~3 months):
• ~1k unique GitHub visitors/week
• ~7k GitHub page views/week
• 63 GitHub issues closed
• 86 GitHub PRs merged
• ~1.9k NuGet downloads
.NET provides full-spectrum Spark support:
• Spark DataFrames with Spark SQL: works with Spark v2.3.x/v2.4.[0/1] and includes ~300 Spark SQL functions
• .NET Spark UDFs: batch & streaming, including Spark Structured Streaming and all Spark-supported data sources; Grouped Map (reducer, v0.4)
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1+ and includes C#/F# support
• Machine learning: including access to ML.NET
• Speed & productivity: performance-optimized interop, as fast as or faster than PySpark; support for HW vectorization (v0.4)
Examples: https://github.com/dotnet/spark/examples
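The Structured Streaming support listed above can be sketched as follows. This is a minimal sketch, not a definitive sample: it assumes the Microsoft.Spark NuGet package, a reachable socket source (the host/port below are placeholders), and a Spark cluster with the .NET worker installed in order to actually run.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class StreamingWordCount
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("StreamingWordCount")
            .GetOrCreate();

        // Read a stream of text lines from a socket source
        // (host/port are placeholders for this sketch).
        DataFrame lines = spark.ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", 9999)
            .Load();

        // Split each line into words and keep a running count per word.
        DataFrame words = lines.Select(
            Explode(Split(lines["value"], " ")).Alias("word"));
        DataFrame counts = words.GroupBy("word").Count();

        // Print the running counts to the console until terminated.
        counts.WriteStream()
            .OutputMode("complete")
            .Format("console")
            .Start()
            .AwaitTermination();
    }
}
```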
Introduction to Spark Programming: DataFrame

UserId   State  Salary
Terry    WA     XX
Rahul    WA     XX
Dan      WA     YY
Tyson    CA     ZZ
Ankit    WA     YY
Michael  WA     YY
.NET for Apache Spark programmability:

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

var concat = Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
  .Select(concat(df["age"], df["name"]))
  .Show();
Language comparison: TPC-H Query 2

Scala:

val europe = region.filter($"r_name" === "EUROPE")
  .join(nation, $"r_regionkey" === nation("n_regionkey"))
  .join(supplier, $"n_nationkey" === supplier("s_nationkey"))
  .join(partsupp, supplier("s_suppkey") === partsupp("ps_suppkey"))

val brass = part.filter(part("p_size") === 15 && part("p_type").endsWith("BRASS"))
  .join(europe, europe("ps_partkey") === $"p_partkey")

val minCost = brass.groupBy(brass("ps_partkey"))
  .agg(min("ps_supplycost").as("min"))

brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
  .filter(brass("ps_supplycost") === minCost("min"))
  .select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment")
  .sort($"s_acctbal".desc, $"n_name", $"s_name", $"p_partkey")
  .limit(100)
  .show()

C#:

var europe = region.Filter(Col("r_name") == "EUROPE")
  .Join(nation, Col("r_regionkey") == nation["n_regionkey"])
  .Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
  .Join(partsupp, supplier["s_suppkey"] == partsupp["ps_suppkey"]);

var brass = part.Filter(part["p_size"] == 15 & part["p_type"].EndsWith("BRASS"))
  .Join(europe, europe["ps_partkey"] == Col("p_partkey"));

var minCost = brass.GroupBy(brass["ps_partkey"])
  .Agg(Min("ps_supplycost").As("min"));

brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
  .Filter(brass["ps_supplycost"] == minCost["min"])
  .Select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment")
  .Sort(Col("s_acctbal").Desc(), Col("n_name"), Col("s_name"), Col("p_partkey"))
  .Limit(100)
  .Show();

Similar syntax – dangerously copy/paste friendly! Watch for:
• $"col_name" vs. Col("col_name")
• Capitalization (filter vs. Filter)
• Operators (e.g., Scala === vs. C# ==)
Submitting a Spark application

spark-submit (Scala) – the jar is provided by the user and contains the business logic:

spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> <argument(s)-to-your-app>

spark-submit (.NET) – the microsoft-spark jar is provided by the .NET for Apache Spark library; the app executable is provided by the user and contains the business logic:

spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>
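As a concrete sketch of the .NET variant above (the jar and app names below are illustrative only; the microsoft-spark jar version must match your installed Spark version, and a local Spark installation is assumed):

```shell
# Submit a .NET for Spark app in local mode (Linux/macOS shell syntax;
# use backtick line continuations instead of backslashes in PowerShell).
spark-submit \
  --class org.apache.spark.deploy.DotnetRunner \
  --master local \
  microsoft-spark-2.4.x-0.4.0.jar \
  dotnet MySparkApp.dll input.json
```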
Demo 2: Locally debugging a .NET for Spark App
Demo 3: GitHub analysis on the Cloud
Revisiting the question… What does the OSS developer commit pattern look like over a week? Do people work more on weekdays or weekends?
Microsoft, as a workplace, has a great work-life balance…
… that, or this is proof that I am not a data scientist!
Y-axis: % of total time spent on commits that day
X-axis: Top-10 GitHub projects
What is happening when you write .NET Spark code?
Your .NET program uses the DataFrame and Spark SQL APIs; .NET for Apache Spark translates them into a Spark operation tree. Did you define a .NET UDF?
• No: regular execution path (no .NET runtime is involved during execution)
• Yes: execution uses interop between Spark and .NET
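The two paths can be illustrated with a small sketch (assuming the Microsoft.Spark package; the `shout` UDF and `people.json` file are hypothetical, and running it requires a Spark cluster with the .NET worker):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class ExecutionPaths
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();
        DataFrame df = spark.Read().Json("people.json");

        // No .NET UDF: the whole operation tree is handed to Spark's
        // JVM engine; no .NET runtime runs while the query executes.
        df.Filter(df["age"] > 21).Show();

        // With a .NET UDF: executors ship row data to a .NET worker
        // process to evaluate the lambda (Spark <-> .NET interop).
        var shout = Udf<string, string>(s => s.ToUpper());
        df.Select(shout(df["name"])).Show();
    }
}
```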
Performance: warm-cluster runs with pickling serialization (Arrow will be tested in the future)
• Takeaway 1: where UDF performance does not matter, .NET is on par with Python
• Takeaway 2: where UDF performance is critical, .NET is ~2x faster than Python!
Works everywhere!
• Cross-platform: Windows, Ubuntu, macOS
• Cross-cloud: Azure HDInsight Spark, Azure Databricks, AWS EMR Spark, AWS Databricks
VS Code extension for Spark .NET
• Author: Spark .NET project creation, dependency packaging, language service, sample code, reference management
• Run: Spark local run, Spark cluster run (e.g., HDInsight)
• Debug/Fix
Extension to VS Code:
• Taps into VS Code for C# programming
• Automates Maven and Spark dependencies for environment setup
• Facilitates first-project success through a project template and sample code
• Supports Spark local run and cluster run
• Integrates with Azure for HDInsight cluster navigation
• Azure Databricks integration planned
What's next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, type providers)
• Out-of-box experiences (Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!
Call to action: engage, use & guide us! You & .NET
• GitHub: http://github.com/dotnet/spark
• Getting started: https://aka.ms/GoDotNetForSpark
• Website: https://dot.net/spark
• Available out-of-box on Azure HDInsight Spark
• Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
.NET for Apache Spark GitHub repo: https://github.com/dotnet/spark
Microsoft resources and blog posts:
• https://dot.net/spark
• https://docs.microsoft.com/dotnet/spark
• https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/
• Build BRK3011 demo video: https://www.youtube.com/watch?v=ZlO1utbB2GQ&t=356s
• https://www.slideshare.net/MichaelRys
Apache Spark project proposals:
• Spark Language Interop Spark Proposal (Jira SPARK-26257)
• ".NET for Spark" Spark Project Proposal (Jira SPARK-27006)