Self-Service Apache Spark Structured Streaming Applications and Analytics

WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

Jayant Shekhar Self-Service Apache Spark Structured Streaming Applications & Analytics #UnifiedAnalytics #SparkAISummit

Use Cases • IoT analytics • Real-time fraud detection • Anomaly detection • Analyzing streaming data from devices, turbines etc. 3#UnifiedAnalytics #SparkAISummit

Why Self-Serve 4#UnifiedAnalytics #SparkAISummit Easy Debugging Ability to handle error cases View what is happening inside the jobs Easy Operationalization Deploy immediately Track status of jobs, jobs history Failure Notifications etc. Build Quickly Build in minutes instead of hours and days Visualizations View the streaming data visually

Workflow – Find delayed Flights 5#UnifiedAnalytics #SparkAISummit Read from Kafka Parse Fields Find delayed Flights (SQL) Display Delayed Flights Aggregate by Airports Display Delays by Airport Save to Hbase/MongoDB

Workflows stored as JSON 6#UnifiedAnalytics #SparkAISummit

Spark Submit spark2-submit --class fire.execute.WorkflowExecuteFromFile --master yarn --deploy-mode client --proxy-user sparkflows /home/sparkflows/fire-3.1.0/fire-core-lib/fire-spark_2_1-core-3.1.0-jar-with-dependencies.jar --postback-url http://demo50:8080/messageFromSparkJob --sql-context HiveContext --job-id 1aa66851-0e69-4fc8-b525-539839f72046 --workflow-file /tmp/fire/workflows/workflow-4994719000673479151.json 7#UnifiedAnalytics #SparkAISummit

Processor Types • Connectors • Transforms • Aggregations • Languages (SQL, Scala, Python) • Visualizations • ML Scoring • More – Sessionization – Dedup 8#UnifiedAnalytics #SparkAISummit

Processors Details 9#UnifiedAnalytics #SparkAISummit Geo • IP2Geo • Spatial Joins • Map lat/lon to Zipcode ETL • Join, Union • Filter • Data Validation • Math/String/Date Functions • Data Cleanup Data Profiling • Correlation • Data Summary • Histograms File Formats Scalability is an attribute that describes the ability of a process, network, software or organization to grow and manage increased demand. A system, business or software that is described as scalable has an advantage because it is more adaptable to the File Formats • CSV / TSV • Avro • Parquet • JSON • XML • PDF • Binary Visualizations • Graphs • Maps • Heatmaps • Barchart • Piechart Machine Learning • Classification • Regression • Clustering • Collaborative Filtering • Save / Load Model • Cross-Validation • Predict Languages • Scala • SQL • Jython • Java • Python Data Sources • HDFS / S3 • HIVE • HBase • Cassandra • Elastic Search • Kafka • Salesforce • Marketo Feature Generation • Tokenization • Stop Words Remover • Imputer • Locality Sensitive Hashing • One Hot Encoder NLP/OCR • Named Entity Recognition • Sentiment Analysis • Document Categorizer • OCR RDBMS • MySQL • Oracle • Postgres • Teradata • Etc. Streaming • Kafka • Kinesis • Files • Flume • Sockets

Components 10#UnifiedAnalytics #SparkAISummit Browser Web Server Running Apache Spark Instances Apache Spark Clusters HDFS S3 ADLS Kafka Kinesis HBase Mongo DB Cassa ndra ES

Connectors • Cannot easily start/stop streaming jobs when designing. • Build connectors for reading from the stores and converting to DataFrames 11#UnifiedAnalytics #SparkAISummit

Analytics • Slice & Dice data • Aggregations • Streaming Visualizations 12#UnifiedAnalytics #SparkAISummit

Analytics • SQL • Scala 13#UnifiedAnalytics #SparkAISummit

NLP • Integrated Apache OpenNLP & StanfordNLP • Processors ensure serialization of objects etc. and making things parallizable. • So users can easily start applying NLP to millions and millions of records. 14#UnifiedAnalytics #SparkAISummit

Streaming Charts • Results produced by the Spark Streaming jobs are streamed back to the browser. • Displayed on streaming charts. 15#UnifiedAnalytics #SparkAISummit

Streaming Connectors • Specialized code to run in the Workflow Designer reading from Streaming Sources. • We cannot run a full streaming job for interactive execution. 16#UnifiedAnalytics #SparkAISummit

ML Scoring • ML Pipelines include featurization significantly simplifying things. • Ability of Processors to read models and pass them to the next Processors • VectorAssembler, VectorIndexer, StringIndexer, OneHotEncoder, Bucketizer 17#UnifiedAnalytics #SparkAISummit

Storing Results • Ability to store in Hbase, ElasticSearch, HDFS etc. • Do not allow running at design mode so as not to mess up the stores. 18#UnifiedAnalytics #SparkAISummit

Large Scale Deployment & Monitoring 19#UnifiedAnalytics #SparkAISummit

Deployment - Executions 20#UnifiedAnalytics #SparkAISummit

Track Job Status • Status : STARTING / RUNNING / COMPLETED / FAILED / KILLED • Jobs post back their status to the server • Poll the jobs in various ways – logs, YARN etc. 21#UnifiedAnalytics #SparkAISummit

Scheduling & Triggering • Schedule by Time • Poll Kafka topic for events – Workflow ID – Workflow Parameters – Spark Submit Configurations 22#UnifiedAnalytics #SparkAISummit

Notification & Alerts • When jobs complete / fail send email alerts etc. 23#UnifiedAnalytics #SparkAISummit

Some more interesting things… • When no events received for defined time period, stop the Streaming Job. 24#UnifiedAnalytics #SparkAISummit

Execution Results 25#UnifiedAnalytics #SparkAISummit Execution Results stored in an RDBMS and tracked

View and compare various runs of a Workflow 26#UnifiedAnalytics #SparkAISummit

Performance 27#UnifiedAnalytics #SparkAISummit

Performance • Continue to update each process for best performance. Write once run many times… • Allow user control – Ability to control the Persistence Level of the DataFrames at any step • Focused on steps which took longer than expected, analyzed the code and updated it. • Ran load tests to compare various runs. 28#UnifiedAnalytics #SparkAISummit

Learnings… • Many more users able to get value from data. • Reduced time from idea to deployment. • Performance become easier. • Deployment becomes one click. • Easier to write complex modules like Dedup, CDC, ML etc. and use them at many places. 29#UnifiedAnalytics #SparkAISummit

SQL & Schema Inference 30#UnifiedAnalytics #SparkAISummit

DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

Self-Service Apache Spark Structured Streaming Applications and Analytics

More Related Content

What's hot

Similar to Self-Service Apache Spark Structured Streaming Applications and Analytics

More from Databricks

Recently uploaded

Self-Service Apache Spark Structured Streaming Applications and Analytics