Jayant Shekhar Self-Service Apache Spark Structured Streaming Applications & Analytics #UnifiedAnalytics #SparkAISummit
Use Cases
• IoT analytics
• Real-time fraud detection
• Anomaly detection
• Analyzing streaming data from devices, turbines etc.

Why Self-Serve
• Easy Debugging
  – Ability to handle error cases
  – View what is happening inside the jobs
• Easy Operationalization
  – Deploy immediately
  – Track status of jobs, job history
  – Failure Notifications etc.
• Build Quickly
  – Build in minutes instead of hours and days
• Visualizations
  – View the streaming data visually

Workflow – Find Delayed Flights
• Read from Kafka
• Parse Fields
• Find delayed Flights (SQL)
• Display Delayed Flights
• Aggregate by Airports
• Display Delays by Airport
• Save to HBase/MongoDB

Workflows stored as JSON

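A rough sketch of what storing a workflow as JSON can look like. The schema below (`nodes`, `edges`, `id`, `name`, `source`, `target`) is hypothetical and is not the product's actual format. The node names reuse the delayed-flights workflow, and the branching between steps is an assumption. The helper `execution_order` is also illustrative: it derives a valid run order from the edges with a topological sort.

```python
import json
from collections import deque

# Hypothetical workflow document: the field names and the edges are assumptions.
WORKFLOW_JSON = """
{
  "name": "Find delayed flights",
  "nodes": [
    {"id": "1", "name": "Read from Kafka"},
    {"id": "2", "name": "Parse Fields"},
    {"id": "3", "name": "Find delayed Flights (SQL)"},
    {"id": "4", "name": "Display Delayed Flights"},
    {"id": "5", "name": "Aggregate by Airports"},
    {"id": "6", "name": "Display Delays by Airport"},
    {"id": "7", "name": "Save to HBase/MongoDB"}
  ],
  "edges": [
    {"source": "1", "target": "2"},
    {"source": "2", "target": "3"},
    {"source": "3", "target": "4"},
    {"source": "3", "target": "5"},
    {"source": "5", "target": "6"},
    {"source": "5", "target": "7"}
  ]
}
"""


def execution_order(workflow):
    """Return node names in an order that respects every edge (Kahn's algorithm)."""
    names = {node["id"]: node["name"] for node in workflow["nodes"]}
    indegree = {node_id: 0 for node_id in names}
    downstream = {node_id: [] for node_id in names}
    for edge in workflow["edges"]:
        downstream[edge["source"]].append(edge["target"])
        indegree[edge["target"]] += 1

    # Start with the nodes that have no upstream dependency.
    ready = deque(node_id for node_id in names if indegree[node_id] == 0)
    order = []
    while ready:
        node_id = ready.popleft()
        order.append(names[node_id])
        for target in downstream[node_id]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(names):
        raise ValueError("workflow contains a cycle")
    return order


workflow = json.loads(WORKFLOW_JSON)
order = execution_order(workflow)
for step in order:
    print(step)
```

Keeping the graph as plain data means the same file can be edited in a designer, versioned, and handed to a job for execution.
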
Spark Submit
spark2-submit --class fire.execute.WorkflowExecuteFromFile \
  --master yarn --deploy-mode client --proxy-user sparkflows \
  /home/sparkflows/fire-3.1.0/fire-core-lib/fire-spark_2_1-core-3.1.0-jar-with-dependencies.jar \
  --postback-url http://demo50:8080/messageFromSparkJob \
  --sql-context HiveContext \
  --job-id 1aa66851-0e69-4fc8-b525-539839f72046 \
  --workflow-file /tmp/fire/workflows/workflow-4994719000673479151.json

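A server that launches workflows would need to assemble a command like this for each run. The following is a minimal sketch of that step, not the product's code: the function `build_submit_command` and its parameter names are hypothetical, while the flags, class name, and paths are copied from the slide.

```python
import shlex


def build_submit_command(jar, workflow_file, job_id, postback_url,
                         master="yarn", deploy_mode="client",
                         proxy_user=None, sql_context="HiveContext"):
    """Return the spark2-submit argument list for one workflow run.

    Options for spark2-submit itself go before the jar; everything after
    the jar is passed to the application's main class.
    """
    cmd = [
        "spark2-submit",
        "--class", "fire.execute.WorkflowExecuteFromFile",
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    if proxy_user:
        cmd += ["--proxy-user", proxy_user]
    cmd.append(jar)
    cmd += [
        "--postback-url", postback_url,
        "--sql-context", sql_context,
        "--job-id", job_id,
        "--workflow-file", workflow_file,
    ]
    return cmd


JAR = ("/home/sparkflows/fire-3.1.0/fire-core-lib/"
       "fire-spark_2_1-core-3.1.0-jar-with-dependencies.jar")

cmd = build_submit_command(
    jar=JAR,
    workflow_file="/tmp/fire/workflows/workflow-4994719000673479151.json",
    job_id="1aa66851-0e69-4fc8-b525-539839f72046",
    postback_url="http://demo50:8080/messageFromSparkJob",
    proxy_user="sparkflows",
)

# Print a shell-safe rendering; the list form can be passed to subprocess directly.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a string avoids quoting problems when a path or parameter contains spaces.
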
Processor Types
• Connectors
• Transforms
• Aggregations
• Languages (SQL, Scala, Python)
• Visualizations
• ML Scoring
• More
  – Sessionization
  – Dedup

Processor Details
• Geo: IP2Geo; Spatial Joins; Map lat/lon to Zipcode
• ETL: Join, Union; Filter; Data Validation; Math/String/Date Functions; Data Cleanup
• Data Profiling: Correlation; Data Summary; Histograms
• File Formats: CSV / TSV; Avro; Parquet; JSON; XML; PDF; Binary
• Visualizations: Graphs; Maps; Heatmaps; Barchart; Piechart
• Machine Learning: Classification; Regression; Clustering; Collaborative Filtering; Save / Load Model; Cross-Validation; Predict
• Languages: Scala; SQL; Jython; Java; Python
• Data Sources: HDFS / S3; Hive; HBase; Cassandra; Elasticsearch; Kafka; Salesforce; Marketo
• Feature Generation: Tokenization; Stop Words Remover; Imputer; Locality Sensitive Hashing; One Hot Encoder
• NLP/OCR: Named Entity Recognition; Sentiment Analysis; Document Categorizer; OCR
• RDBMS: MySQL; Oracle; Postgres; Teradata; etc.
• Streaming: Kafka; Kinesis; Files; Flume; Sockets

Components
• Browser
• Web Server
• Running Apache Spark Instances
• Apache Spark Clusters
• HDFS, S3, ADLS, Kafka, Kinesis, HBase, MongoDB, Cassandra, ES

Connectors
• Cannot easily start/stop streaming jobs when designing.
• Build connectors for reading from the stores and converting to DataFrames.

Analytics
• Slice & Dice data
• Aggregations
• Streaming Visualizations

Analytics
• SQL
• Scala

NLP
• Integrated Apache OpenNLP & StanfordNLP.
• Processors handle serialization of objects etc. and make the work parallelizable.
• So users can easily start applying NLP to millions and millions of records.

Streaming Charts
• Results produced by the Spark Streaming jobs are streamed back to the browser.
• Displayed on streaming charts.

Streaming Connectors
• Specialized code runs in the Workflow Designer to read from streaming sources.
• We cannot run a full streaming job for interactive execution.

ML Scoring
• ML Pipelines include featurization, which significantly simplifies things.
• Ability of Processors to read models and pass them to the next Processors.
• VectorAssembler, VectorIndexer, StringIndexer, OneHotEncoder, Bucketizer

Storing Results
• Ability to store in HBase, Elasticsearch, HDFS etc.
• Storing is not allowed to run in design mode, so as not to mess up the stores.

Large Scale Deployment & Monitoring

Deployment - Executions

Track Job Status
• Status: STARTING / RUNNING / COMPLETED / FAILED / KILLED
• Jobs post back their status to the server
• Poll the jobs in various ways (logs, YARN etc.)

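The server side of the post-back can be sketched as a small in-memory tracker. The five status names come from the slide. The class `JobTracker`, its methods, and the rule that a finished job ignores later updates are assumptions made for illustration.

```python
from enum import Enum


class JobStatus(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    KILLED = "KILLED"


# Assumption: once a job reaches one of these states, its status is final.
TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.KILLED}


class JobTracker:
    """Keeps the latest status and the status history of each job."""

    def __init__(self):
        self._status = {}    # job_id -> JobStatus
        self._history = {}   # job_id -> list of JobStatus, oldest first

    def record_postback(self, job_id, status_name):
        """Apply a status posted back by a job and return the effective status."""
        status = JobStatus[status_name]  # raises KeyError for an unknown status
        current = self._status.get(job_id)
        if current in TERMINAL:
            # Late or duplicate message for a finished job: keep the final status.
            return current
        self._status[job_id] = status
        self._history.setdefault(job_id, []).append(status)
        return status

    def status(self, job_id):
        return self._status.get(job_id)

    def history(self, job_id):
        return list(self._history.get(job_id, []))


tracker = JobTracker()
job_id = "1aa66851-0e69-4fc8-b525-539839f72046"
for name in ["STARTING", "RUNNING", "FAILED", "RUNNING"]:
    tracker.record_postback(job_id, name)
print(tracker.status(job_id).value)
```

A production version would persist these records and reconcile them with what polling the logs or YARN reports, since a job that crashes may never post its final status.
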
Scheduling & Triggering
• Schedule by Time
• Poll Kafka topic for events
  – Workflow ID
  – Workflow Parameters
  – Spark Submit Configurations

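A trigger event read from the Kafka topic carries the three pieces listed above. The sketch below parses one such message. The JSON field names (`workflowId`, `parameters`, `sparkSubmitConf`) and the parameter values in the example are hypothetical, since the slides do not show the message format, and the Kafka consumer itself is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TriggerEvent:
    workflow_id: str
    parameters: dict = field(default_factory=dict)
    spark_submit_conf: dict = field(default_factory=dict)


def parse_trigger_event(raw):
    """Parse one trigger message (str or bytes) into a TriggerEvent.

    The field names used here are illustrative assumptions.
    """
    doc = json.loads(raw)
    if "workflowId" not in doc:
        raise ValueError("trigger event is missing workflowId")
    return TriggerEvent(
        workflow_id=str(doc["workflowId"]),
        parameters=dict(doc.get("parameters", {})),
        spark_submit_conf=dict(doc.get("sparkSubmitConf", {})),
    )


message = b"""
{
  "workflowId": "4994719000673479151",
  "parameters": {"inputTopic": "flights"},
  "sparkSubmitConf": {"spark.executor.memory": "4g"}
}
"""

event = parse_trigger_event(message)
print(event.workflow_id, event.parameters, event.spark_submit_conf)
```

Rejecting messages without a workflow ID up front keeps a malformed event from being turned into a job submission.
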
Notification & Alerts
• When jobs complete or fail, send email alerts etc.

Some more interesting things…
• When no events are received for a defined time period, stop the Streaming Job.

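The idle-stop rule can be sketched independently of Spark as a small watchdog. The class `IdleWatchdog` and its method names are hypothetical. How the per-batch event count is obtained and how the job is actually stopped depend on the streaming job and are not shown.

```python
import time


class IdleWatchdog:
    """Signals that a streaming job should stop after a period with no events."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self._timeout = idle_timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_event = clock()   # treat start-up as the last activity

    def on_batch(self, num_events):
        """Call once per processed batch with the number of events it held."""
        if num_events > 0:
            self._last_event = self._clock()

    def should_stop(self):
        """True once the idle period has reached the configured timeout."""
        return self._clock() - self._last_event >= self._timeout


# Simulated timeline with a fake clock instead of real waiting.
now = [0.0]
watchdog = IdleWatchdog(idle_timeout_s=10, clock=lambda: now[0])

now[0] = 5.0
watchdog.on_batch(3)            # events arrived at t=5
now[0] = 14.0
watchdog.on_batch(0)            # empty batch does not reset the timer
early = watchdog.should_stop()  # 9 seconds idle
now[0] = 15.0
late = watchdog.should_stop()   # 10 seconds idle
print(early, late)
```

Using a monotonic clock by default keeps the timeout from misfiring when the system time is adjusted.
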
Execution Results
• Execution results are stored in an RDBMS and tracked.

View and compare various runs of a Workflow

Performance

Performance
• Continue to update each process for best performance. Write once, run many times…
• Allow user control
  – Ability to control the Persistence Level of the DataFrames at any step
• Focused on steps which took longer than expected, analyzed the code and updated it.
• Ran load tests to compare various runs.

Learnings…
• Many more users are able to get value from data.
• Reduced time from idea to deployment.
• Performance tuning becomes easier.
• Deployment becomes one click.
• Easier to write complex modules like Dedup, CDC, ML etc. and use them in many places.

SQL & Schema Inference
