Software Architecture for Data Applications Ding Li 2021.6
2
3 A Sample Data System
Building blocks in the example architecture: caches (Memcached, Redis, ElastiCache), full-text search (Elasticsearch, Solr), message queues/streams (Kinesis, Kafka, RabbitMQ)
Service Level Objectives (SLOs) and Service Level Agreements (SLAs): a service may be required to be up at least 99.9% of the time, and may be considered up if:
• Median response time < 200 ms
• 99th percentile < 1 s
Elastic service: automatically add computing resources when a load increase is detected
Operability: making life easy for operations
Simplicity: managing complexity
Evolvability: making change easy
4 Data Models
Relational Model (SQL): schema-on-write; pro: good support for joins; con: poor locality
Document Model (NoSQL): tree structure; schema-on-read; pro: good locality; con: weak support for joins
JSON support in relational databases: PostgreSQL, MySQL, SQL Server, Oracle
5 Query Languages
Imperative language: tells the computer to perform certain operations in a certain order.
Declarative language (SQL): specify only the pattern of the data you want, not how to achieve that goal; the query optimizer (e.g., PostgreSQL's) automatically decides which parts of the query to execute in which order and which indexes to use. Declarative queries often lend themselves to parallel execution.
Declarative queries on the web: a CSS selector picks all <p> elements whose direct parent is an <li> element with a CSS class of selected; XSL (XPath) is likewise declarative, in contrast to an imperative approach in JavaScript.
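A small illustrative contrast (the animals data and the sharks filter are hypothetical, not from the slides): the imperative version spells out the iteration order, while the declarative SQL string states only the desired pattern and leaves the execution strategy to the optimizer.

    # Imperative: spell out how to walk the list and build the result.
    def sharks_imperative(animals):
        sharks = []
        for animal in animals:
            if animal["family"] == "Sharks":
                sharks.append(animal)
        return sharks

    # Declarative (SQL): state only the pattern of data you want;
    # the optimizer chooses the access path and execution order.
    SHARKS_SQL = "SELECT * FROM animals WHERE family = 'Sharks'"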
6 Graph Models & Query Language Find people who emigrated from the US to Europe: Cypher Query Language (Neo4j) Using Recursive Common Table Expressions in SQL: follow a chain of outgoing WITHIN edges until eventually you reach a vertex of type Location, whose name property is equal to "United States".
7 Data Warehouse and Star Schema OLTP: Online Transaction Processing OLAP: Online Analytic Processing
8 Column-Oriented Storage
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses a few of them at a time.
Column compression works well when the values in a column are not unique: if the fact table has 10M rows but only 100 distinct products, the product column compresses at a very good ratio.
Column-oriented systems and formats: Amazon Redshift, Apache Parquet
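A hedged PySpark sketch (the path and columns are made up): writing a fact table as Parquet and then aggregating over just two of its columns, which a columnar format can serve without reading the rest.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("columnar-demo").getOrCreate()

    # Write a (hypothetical) fact table in a column-oriented format.
    fact_df = spark.createDataFrame(
        [(1, "prod_001", 3), (2, "prod_002", 1)],
        "order_id INT, product_id STRING, quantity INT")
    fact_df.write.mode("overwrite").parquet("/tmp/fact_sales")

    # A typical analytic query touches only a couple of columns;
    # Parquet lets Spark read just those column chunks.
    (spark.read.parquet("/tmp/fact_sales")
          .groupBy("product_id")
          .sum("quantity")
          .show())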
9 Partitioning (Sharding)
Combining replication and partitioning
Partitioning by key range. If using (update_time) as the key:
• Range scan is easy
• Can lead to hot spots
If using (user_id, update_time) as the key:
• Different users may be stored on different partitions
• Within each user, the updates are stored ordered by timestamp on a single partition
Request routing. ZooKeeper: keeps track of cluster metadata (used by Kafka, SolrCloud, HBase)
10 Transactions and ACID
Transaction: group several reads and writes together into a logical unit. All the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). To be reliable, a system must deal with various faults and ensure that they don't cause catastrophic failure of the entire system.
ATOMICITY: if the transaction cannot be completed (committed) due to a fault, the database must discard or undo any writes it has made so far.
CONSISTENCY: certain statements about your data (invariants) must always be true; for example, in an accounting system, credits and debits across all accounts must always be balanced.
ISOLATION: concurrently executing transactions are isolated from each other; they cannot step on each other's toes.
DURABILITY: once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
Serializable isolation has a performance cost, so many systems use weaker isolation levels; Read Committed is a basic one.
Dirty read: reading an uncommitted write. Example: user 2's mailbox listing shows an unread message, but the counter shows zero unread messages.
Dirty write: overwriting an uncommitted value. Example: the sale's listing is assigned to Bob, but the invoice recipient is set to Alice.
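A minimal sketch of atomicity using Python's built-in sqlite3 (the table and amounts are made up): either both account updates commit together, or neither becomes visible.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 500), ('bob', 500)")
    conn.commit()

    try:
        # Both writes belong to one transaction: move $100 from alice to bob.
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")
        conn.commit()        # commit: both updates become durable together
    except sqlite3.Error:
        conn.rollback()      # abort: neither update is visible to other readers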
11 Read Committed & Repeatable Read
Read Committed
No dirty reads: only committed data can be read; user 2 sees the new value for x only after user 1's transaction has committed.
No dirty writes: only committed data can be overwritten.
Nonrepeatable read issue: Alice sees a total of $900 in her accounts instead of $1,000.
Repeatable Read (Snapshot Isolation)
Each transaction reads from a consistent snapshot of the database; that is, the transaction sees all the data that was committed in the database at the start of the transaction.
12 Distributed Systems: Locks and Fencing Tokens
Lock: only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrent writes from corrupting it.
Issue: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.
Fencing token: a number that increases every time a lock is granted. Every time a client sends a write request to the storage service, it must include its current fencing token. Client 1 comes back to life and sends its write to the storage service, including its token value 33; however, the storage server remembers that it has already processed a write with a higher token number (34), so it rejects the request with token 33.
Packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.
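A minimal sketch of the storage-side check, with a hypothetical FencedStorage class standing in for the storage service:

    class FencedStorage:
        """Illustrative fencing check on the storage-service side
        (class and method names are made up)."""

        def __init__(self):
            self.highest_token_seen = 0
            self.data = {}

        def write(self, token: int, key: str, value: bytes) -> bool:
            # Reject writes whose fencing token is older than one already seen:
            # a client holding an expired lease (token 33 after 34) is fenced off.
            if token < self.highest_token_seen:
                return False
            self.highest_token_seen = token
            self.data[key] = value
            return True

    storage = FencedStorage()
    storage.write(34, "file", b"from client 2")   # accepted
    storage.write(33, "file", b"from client 1")   # rejected: stale token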
13 Batch Processing with Unix Tools, MapReduce, and Spark
• Services [HTTP/REST-based APIs] (online systems)
• Batch processing systems (offline systems)
• Stream processing systems (near-real-time systems)
Batch processing web logs with Unix tools: read the log file; split each line and take the URL ($7 token); sort alphabetically by URL; count occurrences of each distinct URL; sort by count in reverse; output the first 5 lines.
MapReduce and distributed filesystems:
• Read a set of input files and break them up into records.
• Call the mapper function to extract a key and value from each input record (like awk '{print $7}': it extracts the URL ($7) as the key and leaves the value empty).
• Sort all the key-value pairs by key (the first sort command).
• Call the reducer function to iterate over the sorted key-value pairs (like uniq -c, which counts the number of adjacent records with the same key).
Downsides of MapReduce:
• A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed.
• Mappers are often redundant: they just read back the same file that was just written by a reducer and prepare it for the next stage of partitioning and sorting.
• Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
Solution: dataflow engines such as Spark (a sketch of the same log-counting job in Spark follows below).
• Explicitly model the flow of data through several processing stages.
• Parallelize work by partitioning inputs; the output of one function is copied over the network to become the input to another.
• All joins and data dependencies in a workflow are explicitly declared, so the scheduler has an overview of what data is required where and can make locality optimizations.
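A hedged PySpark sketch of the same top-5-URLs job (the log path and the assumption that the URL is the 7th whitespace-separated field mirror the Unix example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("top-urls").getOrCreate()

    # Same dataflow as the Unix pipeline: read, extract URL, group, count, sort.
    logs = spark.read.text("/var/log/nginx/access.log")
    urls = logs.select(split(col("value"), " ").getItem(6).alias("url"))  # 0-indexed: 7th field

    (urls.groupBy("url")
         .count()
         .orderBy(col("count").desc())
         .show(5, truncate=False))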
14 Stream Processing
Load balancing (deliver each message to one consumer) & fan-out (deliver each message to all consumers)
Acknowledgments and redelivery
Partition (can be on different machines) & topic (same type of messages)
Uses of stream processing:
• Complex Event Processing (CEP): describe the patterns of events that should be detected.
• Stream analytics: aggregations and statistical metrics over events (rate, rolling average).
• Search on streams: search for individual events (mentions of companies, products, topics).
15 Keeping Systems in Sync & Stream Joins
Dual writes are error-prone; Change Data Capture (CDC) keeps systems in sync instead.
Log compaction: the storage engine periodically looks for log records with the same key, throws away any duplicates, and keeps only the most recent update for each key.
Stream joins:
Stream-stream join (window join): capture the click events for corresponding onsite search events to compute CTR; matched by session id.
Stream-table join (stream enrichment): join user activity stream events with user profiles in a database; a copy of the database can be loaded into the stream processor.
Table-table join (materialized view maintenance): Twitter timeline cache.
• When user u sends a new tweet, it is added to the timeline of every user who is following u.
• When a user deletes a tweet, it is removed from all users' timelines.
• When user u1 starts following user u2, recent tweets by u2 are added to u1's timeline.
• When user u1 unfollows user u2, tweets by u2 are removed from u1's timeline.
Microbatching: break the stream into small blocks and treat each block like a miniature batch process; the batch size is typically around one second.
16
17 Apache Spark: A Unified Engine for Large-Scale Distributed Data Processing
The central thrust of the Spark project was to take the ideas of Hadoop MapReduce and enhance the system: make it highly fault tolerant and parallel, support in-memory storage for intermediate results between iterative and interactive map and reduce computations, offer easy and composable APIs in multiple languages as a programming model, and support other workloads in a unified manner.
Speed
• Takes advantage of efficient multithreading and parallel processing
• Directed acyclic graph (DAG): an efficient computational graph scheduled across workers on the cluster
• Generates compact code for execution
Ease of use
• Resilient Distributed Dataset (RDD): a fundamental abstraction of a simple logical data structure, upon which DataFrames and Datasets are constructed
Modularity
• Components: Spark SQL, Structured Streaming, MLlib, GraphX
• APIs: Scala, SQL, Python, Java, R
Extensibility
• Decouples storage and compute
• Spark's DataFrameReaders and DataFrameWriters can be extended to read data from other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its logical data abstraction, on which it can operate.
Distributed data and partitions: breaking data up into chunks or partitions allows Spark executors to process only data that is close to them, minimizing network bandwidth; each executor's core is assigned its own data partition to work on.
Apache Spark's distributed execution: the Spark driver requests resources (CPU, memory, etc.) from the cluster manager for Spark's executors (JVMs); it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
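A minimal sketch of the driver-side entry point described above (the application name and master URL are placeholders): the SparkSession created on the driver is what requests executors and plans DAGs behind the scenes.

    from pyspark.sql import SparkSession

    # The driver-side entry point; the master URL is a placeholder.
    spark = (SparkSession.builder
             .appName("unified-engine-demo")
             .master("local[*]")          # or a YARN/Kubernetes/standalone master
             .getOrCreate())

    df = spark.range(1_000_000)           # a simple distributed dataset
    print(df.rdd.getNumPartitions())      # data is split into partitions across executor cores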
18 Spark Jobs/Stages/Tasks, Transformations/Actions
The Spark driver converts a Spark application into one or more Spark jobs. It then transforms each job into a DAG, Spark's execution plan. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Each task maps to a single core and works on a single partition of data.
Transformations, actions, and lazy evaluation:
• All transformations are evaluated lazily.
• This lets Spark optimize transformations into stages for more efficient execution.
• Nothing in a query plan is executed until an action is invoked.
Narrow transformations: a single output partition can be computed from a single input partition, e.g., filter(), contains().
Wide transformations: data from other partitions is read in, combined, and written to disk, e.g., groupBy(), orderBy().
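A small sketch with made-up data: the narrow filter and the wide groupBy only build up the DAG; nothing runs until the final action.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 3), ("bob", 5), ("alice", 7)], "name STRING, score INT")

    filtered = df.filter(col("score") > 4)        # narrow transformation: nothing runs yet
    grouped = filtered.groupBy("name").count()    # wide transformation: adds a shuffle stage

    grouped.show()                                # action: triggers the whole DAG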
19 Sample Code: Counting M&Ms for the Cookie Monster
A Spark program reads a file with over 100,000 entries (where each row or line has the form <state, mnm_color, count>) and computes the aggregated counts for each color and state. Spark distributes the work across partitions and returns the final aggregated counts.
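A hedged sketch along the lines described (the file path and the exact header names State/Color/Count are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count

    spark = SparkSession.builder.appName("MnMCount").getOrCreate()

    mnm_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .load("data/mnm_dataset.csv"))

    # Aggregate the counts for each color and state, highest totals first.
    count_mnm_df = (mnm_df
                    .select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .agg(count("Count").alias("Total"))
                    .orderBy("Total", ascending=False))

    count_mnm_df.show(10, truncate=False)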
20 Apache Spark's Structured APIs
The RDD is the most basic abstraction in Spark:
• Dependencies: instruct Spark how an RDD is constructed from its inputs
• Partitions (with some locality information): the ability to parallelize computation on partitions across executors
• Compute function: Partition => Iterator[T], producing an Iterator[T] for the data stored in the RDD
Basic Python data types in Spark; Python structured data types in Spark
Schemas and creating DataFrames: a schema defines the column names and associated data types for a DataFrame.
schema = "author STRING, title STRING, pages INT"
blogs_df = spark.createDataFrame(data, schema)
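To make the snippet above runnable, a minimal sketch with made-up rows:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    # DDL-style schema string, as on the slide; the sample rows are made up.
    schema = "author STRING, title STRING, pages INT"
    data = [("Jules", "Learning Spark", 300),
            ("Brooke", "Spark SQL Internals", 250)]

    blogs_df = spark.createDataFrame(data, schema)
    blogs_df.printSchema()
    blogs_df.show(truncate=False)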
21 Spark SQL and the Underlying Engine
• Unifies Spark components and permits abstraction to DataFrames/Datasets in Java, Scala, Python, and R, which simplifies working with structured data sets.
• Connects to the Apache Hive metastore and tables.
• Reads and writes structured data with a specific schema from structured file formats (JSON, CSV, text, Avro, Parquet, ORC, etc.) and converts the data into temporary tables.
• Offers an interactive Spark SQL shell for quick data exploration.
• Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors.
• Generates optimized query plans and compact code for the JVM for final execution.
22 Spark SQL and DataFrames
Create a SQL database; create a table; create a view.
Views disappear after the Spark application terminates.
• Global: visible across all SparkSessions on a given cluster
• Session-scoped: visible only to a single SparkSession
Viewing the metadata; reading tables into DataFrames; reading from a database using JDBC.
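A hedged sketch of these operations (the database, table, and view names and the JDBC connection details are placeholders; the JDBC read also needs the matching driver on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
    spark.sql("USE demo_db")
    spark.sql("CREATE TABLE IF NOT EXISTS flights (origin STRING, delay INT) USING parquet")

    # Session-scoped temporary view: gone when this SparkSession ends.
    spark.table("flights").createOrReplaceTempView("flights_view")

    # Viewing the metadata.
    print(spark.catalog.listTables("demo_db"))

    # Reading a table into a DataFrame, and reading from a database over JDBC.
    flights_df = spark.table("demo_db.flights")
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://dbhost:5432/mydb")
               .option("dbtable", "public.users")
               .option("user", "reader").option("password", "secret")
               .load())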
23 DataFrameReader and DataFrameWriter
DataFrameWriter.format(args).option(args).bucketBy(args).partitionBy(args).save(path)
DataFrameWriter.format(args).option(args).sortBy(args).saveAsTable(table)
DataFrameReader.format(args).option("key", "value").schema(args).load()
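A hedged end-to-end sketch of the reader/writer patterns above (the file paths, schema, and partition column are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("io-demo").getOrCreate()

    # Reader: load a CSV file with an explicit schema.
    csv_df = (spark.read.format("csv")
              .option("header", "true")
              .schema("date STRING, origin STRING, delay INT")
              .load("data/flights.csv"))

    # Writer: save as Parquet, partitioned by a column.
    (csv_df.write.format("parquet")
           .mode("overwrite")
           .partitionBy("origin")
           .save("/tmp/flights_parquet"))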
24 Higher-Order Functions in Spark SQL and DataFrames
transform(), filter(), exists()
union(): combine two DataFrames that have the same schema
join(), pivot()
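A hedged sketch of the array higher-order functions in Spark SQL, using a made-up temperature table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hof-demo").getOrCreate()

    # One array of Celsius readings per row (made-up data).
    t_df = spark.createDataFrame([([35, 36, 32, 30],)], "celsius ARRAY<INT>")
    t_df.createOrReplaceTempView("tC")

    spark.sql("""
      SELECT celsius,
             transform(celsius, t -> ((t * 9) div 5) + 32) AS fahrenheit,  -- map each element
             filter(celsius, t -> t > 32)                  AS above_32,    -- keep matching elements
             exists(celsius, t -> t = 30)                  AS has_30       -- true if any element matches
      FROM tC
    """).show(truncate=False)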
25 Structured Streaming
A data stream is treated as an unbounded table.
Output modes:
• Append mode: only append new rows
• Update mode: write only rows changed since the last trigger
• Complete mode: write the entire result table
A Structured Streaming query: define the input source; transform the data; define the output sink & mode; specify processing details.
Triggering details:
• Default: the next micro-batch is triggered as soon as the previous one has completed
• Processing time with trigger interval: micro-batches are triggered at a fixed interval
• Once: start the query on a custom schedule, process what is available, and stop
• Continuous: process data continuously instead of in micro-batches (experimental)
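A minimal sketch of those steps as a streaming word count (the socket host and port are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # 1. Define the input source.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # 2. Transform the data.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # 3-4. Define the output sink, output mode, and processing details.
    query = (counts.writeStream
             .format("console")
             .outputMode("complete")
             .trigger(processingTime="10 seconds")
             .start())
    query.awaitTermination()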
26 Streaming Data Sources and Sinks
Files: each file must appear in the directory listing atomically; that is, the whole file must be available at once for reading, and once it is available it cannot be updated or modified. Structured Streaming processes a file when the engine finds it (using directory listing) and internally marks it as processed; any later changes to that file will not be processed.
Apache Kafka: fields of the returned DataFrame are key, value, topic, partition, offset, timestamp, timestampType.
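A hedged sketch of reading from Kafka (the broker address and topic are placeholders, and the spark-sql-kafka connector package must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

    kafka_df = (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "host1:9092")
                .option("subscribe", "events")
                .load())

    # The returned DataFrame carries key, value, topic, partition, offset,
    # timestamp, and timestampType; key and value arrive as binary.
    parsed = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)",
                                 "topic", "offset")

    query = parsed.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()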
27 Stateful Streaming Aggregations
Stream-stream joins; aggregations with event-time windows.
A watermark defines how long the engine will wait for late data to arrive. With these time constraints for each event, the processing engine can automatically calculate how long events need to be buffered to generate correct results, and when events can be dropped from the state.
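A small sketch of a watermarked event-time window aggregation, using the built-in rate source as stand-in event data:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

    # The rate source emits (timestamp, value) rows.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    windowed_counts = (events
        .withWatermark("timestamp", "10 minutes")                      # wait up to 10 min for late events
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))  # sliding event-time windows
        .count())

    query = (windowed_counts.writeStream
             .format("console")
             .outputMode("update")
             .start())
    query.awaitTermination()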
28 Machine Learning Pipeline
29 Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Joblib: can be used for hyperparameter tuning; it automatically broadcasts a copy of your data to all workers, which then build their own models with different hyperparameters on their copies of the data.
Koalas: pandas is a very popular data analysis and manipulation library in Python, but it is limited to running on a single machine. Koalas is an open-source library that implements the pandas DataFrame API on top of Apache Spark, easing the transition from pandas to Spark.
AWS SageMaker, Azure ML
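A tiny hedged Koalas sketch (the column names are made up; in later Spark releases the same API ships as pyspark.pandas):

    import pandas as pd
    import databricks.koalas as ks   # in newer Spark: import pyspark.pandas as ps

    pdf = pd.DataFrame({"state": ["CA", "TX", "CA"], "count": [10, 20, 30]})

    # Same pandas-style API, but execution is distributed by Spark.
    kdf = ks.from_pandas(pdf)
    print(kdf.groupby("state")["count"].sum())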
30 Further Reading for Software Architecture
