Serverless Big Data Architectures: Serverless Data Analytics. Radhika Ravirala, Solutions Architect, AWS. August 17, 2017. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda • Cloud Architecture Evolution – Why Serverless • Data and Analytics Flow • Key Services Overview • Design Patterns • Call to Action
Cloud Architecture Evolution: Virtualized (Virtualized Servers), Managed (Managed Platforms), Serverless (Serverless Analytics)
Serverless characteristics • No servers to provision or manage • Scales with usage • Never pay for idle • Availability and fault tolerance built in
Data and Analytics Flow: Ingest/Collect, Store, Analyze/Process, Visualization/Consume, Orchestrate/Transform
What Is the Temperature of Your Data / Access?
AWS Big Data Services (diagram). Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume, Orchestration/Transform. Capabilities shown: Batch ETL/ELT, Realtime ETL/ELT, Transactional / CDC, B.I. Tools, Data Science Notebooks, Bulk Transport, File/Object Upload, Streaming Ingest, Commits, Transactional, NoSQL, Data Lake, Streaming Storage, Dashboards, Batch Analytics, Interactive Querying, Machine Learning/Deep Learning, Realtime Analytics. Legend: Serverless, Managed, Virtualized.
AWS Big Data Services (diagram). Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume, Orchestration/Transform. Services shown: EMR, EC2, S3, Redshift, DynamoDB, AWS DMS (CDC), AWS Lambda, Kinesis Analytics, Amazon Athena, Amazon QuickSight, Aurora, AWS Glue, AWS Step Functions, Kinesis Streams, AWS Snowball, ISV Connectors, Kinesis Firehose, S3 Transfer Acceleration, Amazon Elasticsearch. Legend: a marker in the diagram denotes serverless services.
Key Services Overview
Amazon S3: Big Data Storage for Virtually All AWS Services • Store anything • Object storage • Scalable • 99.999999999% durability • Extremely low cost
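
To give the durability figure above some scale, here is a small worked calculation. It is only a sketch: it assumes the 99.999999999% figure is an annual, per-object value and that losses are independent, and the object count of 10,000,000 is an illustrative input, not a number from the slide.

```python
from fractions import Fraction


def years_per_lost_object(durability_percent: str, object_count: int) -> Fraction:
    """Average number of years between single-object losses.

    Assumes durability is an annual per-object figure and that
    losses are independent (a simplification for illustration).
    """
    # Exact arithmetic avoids floating-point noise at eleven nines.
    annual_loss_probability = 1 - Fraction(durability_percent) / 100
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    years = years_per_lost_object("99.999999999", 10_000_000)
    print(f"On average, one lost object every {years} years")
```

Under these assumptions, storing ten million objects works out to an expected loss of one object roughly every 10,000 years.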
Amazon DynamoDB: Fast & Flexible NoSQL Database Service • NoSQL database • Seamless scalability • Zero admin • Single-digit millisecond latency
Amazon Kinesis Real-time Streaming Platform • Streams, Firehose, Analytics • Real-time processing • High throughput; elastic • Easy to use • Integration with S3, EMR, Redshift, DynamoDB
Amazon Kinesis: Streaming Data Made Easy. Services make it easy to capture, deliver, and process streams on AWS.
Amazon Kinesis Streams • For technical developers • Build your own custom applications that process or analyze streaming data
Amazon Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch
Amazon Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries
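
As a rough illustration of the producer side, the sketch below builds the request arguments for the Kinesis Streams and Kinesis Firehose `PutRecord` calls. The stream names, the event fields, and the helper function names are hypothetical; the actual send (shown only in comments) assumes boto3 is installed and credentials are configured.

```python
import json


def build_stream_record(stream_name: str, event: dict, partition_key: str) -> dict:
    """Keyword arguments for the Kinesis Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        "PartitionKey": partition_key,
    }


def build_firehose_record(delivery_stream_name: str, event: dict) -> dict:
    """Keyword arguments for the Kinesis Firehose PutRecord call."""
    # A trailing newline keeps records separable once Firehose
    # concatenates them into objects at the destination.
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return {"DeliveryStreamName": delivery_stream_name, "Record": {"Data": payload}}


if __name__ == "__main__":
    event = {"sensor": "a", "value": 42}  # hypothetical payload
    print(build_stream_record("example-stream", event, partition_key="a"))
    print(build_firehose_record("example-delivery-stream", event))
    # To actually send (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("kinesis").put_record(**build_stream_record(...))
    #   boto3.client("firehose").put_record(**build_firehose_record(...))
```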
AWS Lambda: Serverless Compute • Run your code in the cloud, fully managed and highly available • Triggered through API or state changes in your setup • Scales automatically to match the incoming event rate • Node.js (JavaScript), Python, Java, and C# • Charged per 100 ms of execution time
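
The billing granularity mentioned above can be sketched as simple arithmetic. This example assumes the 100 ms increment stated on the slide and deliberately leaves out prices, which the slide does not give; the function names and sample durations are illustrative.

```python
import math

BILLING_INCREMENT_MS = 100  # granularity stated on the slide


def billed_duration_ms(actual_ms: float) -> int:
    """Round one invocation's duration up to the next billing increment."""
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


def total_billed_ms(durations_ms) -> int:
    """Total billed time across invocations; idle time contributes nothing."""
    return sum(billed_duration_ms(d) for d in durations_ms)


if __name__ == "__main__":
    durations = [12, 100, 101, 250.5]  # hypothetical invocation durations in ms
    print([billed_duration_ms(d) for d in durations])
    print(total_billed_ms(durations))
```

Only time spent executing is counted, which is the sense in which the serverless model never charges for idle capacity.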
Amazon Athena Interactive Query Service • Query directly from Amazon S3 • Use ANSI SQL • Serverless • Multiple Data Formats • Pay per query
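
A minimal sketch of what querying S3 data with SQL might look like from code. The helper only assembles the arguments for Athena's `StartQueryExecution` call; the database, table, and bucket names are hypothetical, and the call itself (shown in comments) assumes boto3 and AWS credentials.

```python
def build_athena_query(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    params = build_athena_query(
        sql="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
        database="example_db",                          # hypothetical database
        output_location="s3://example-bucket/athena/",  # hypothetical bucket
    )
    print(params)
    # To actually run the query (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("athena").start_query_execution(**params)
```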
AWS Glue (New!): Fully Managed ETL Service • Catalog data sources • Identify data formats & data types • Error handling • Manage and scale resources • Generate ETL code • Schedule and execute ETL jobs
AWS Glue: Services
Data Catalog • Hive metastore compatible metadata repository of data sources • Crawls data sources to infer table, data type, and partition format
Job Execution • Runs jobs in Spark containers – automatic scaling based on SLA • Glue is serverless: only pay for the resources you consume
Job Authoring • Generates Python code to move data from source to destination • Edit with your favorite IDE; share code snippets using Git
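
To make the idea of inferring partitions and data types concrete, here is a toy sketch that reads Hive-style `name=value` path segments from an S3 key and guesses a column type for each value. It is not Glue's crawler logic, only an illustration of the kind of inference a catalog performs; the key layout, function names, and type names are assumptions.

```python
def parse_hive_partitions(s3_key: str) -> dict:
    """Extract Hive-style partition columns (name=value) from an object key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in s3_key.split("/")[:-1]:
        if "=" in segment:
            name, _, value = segment.partition("=")
            partitions[name] = value
    return partitions


def infer_type(value: str) -> str:
    """Guess a column type from a string value (illustrative rules only)."""
    for caster, type_name in ((int, "bigint"), (float, "double")):
        try:
            caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


if __name__ == "__main__":
    key = "logs/year=2017/month=08/region=us-east-1/part-0000.json"  # hypothetical key
    for column, value in parse_hive_partitions(key).items():
        print(column, value, infer_type(value))
```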
Amazon QuickSight: Business Intelligence • Fast and cloud-powered • Easy to use, no infrastructure to manage • Scales to hundreds of thousands of users • Quick calculations with SPICE • 1/10th the cost of legacy BI software
Serverless Design Patterns
Real-time Analytics (diagram). Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume. Components shown: Producer, Apache Kafka, Kinesis Streams, DynamoDB Streams, SQS, KCL, AWS Lambda, Kinesis Analytics, Spark Streaming, Apache Storm, Apache Flink, Amazon SNS, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES. Other labels: Notifications, Alert, Analytics, Output, KPI. Legend: Serverless, Managed, Virtualized.
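
One serverless path through this pattern is Kinesis Streams feeding an AWS Lambda function that raises alerts. The handler below is a minimal sketch of that step: it decodes the base64 record data that Lambda receives from a Kinesis event source and flags readings above a threshold. The payload fields, the threshold, and the return shape are hypothetical, and publishing to a service such as Amazon SNS is left out.

```python
import base64
import json

ALERT_THRESHOLD = 100.0  # hypothetical threshold for illustration


def handler(event, context):
    """Lambda entry point for a Kinesis Streams event source."""
    alerts = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)
        if reading["value"] > ALERT_THRESHOLD:
            alerts.append(reading)
    # A real function might publish alerts to a notification service here.
    return {"processed": len(event["Records"]), "alerts": alerts}


if __name__ == "__main__":
    def make_record(value):
        body = json.dumps({"sensor": "a", "value": value}).encode("utf-8")
        return {"kinesis": {"data": base64.b64encode(body).decode("ascii")}}

    print(handler({"Records": [make_record(50), make_record(150)]}, None))
```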
Interactive Queries (diagram). Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume. Components shown: Producer, Amazon S3, Amazon Redshift, Amazon EMR, Presto, Impala, Spark, Amazon Athena, QuickSight. Other labels: Interactive. Legend: Serverless, Managed, Virtualized.
Data Lake Reference Architecture (diagram; a marker denotes serverless services)
Catalog & Search: Access and search metadata • DynamoDB • Elasticsearch
Access & User Interface: Give your users easy and secure access • API Gateway • Identity & Access Management • Cognito
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding • QuickSight • Amazon AI • EMR • Redshift • Athena • Kinesis • RDS
Central Storage: Secure, cost-effective storage in Amazon S3 • S3
Data Ingestion: Get your data into S3 quickly and securely • Snowball • Database Migration Service • Kinesis Firehose • Direct Connect
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified • Security Token Service • CloudWatch • CloudTrail • Key Management Service
Data Lake and Real-time Analytics (diagram). Labels shown: Amazon S3 (Data Lake), Amazon Kinesis Streams & Firehose, Hadoop / Spark, Streaming Analytics Tools, Amazon Redshift (Data Warehouse), Amazon DynamoDB (NoSQL Database), AWS Lambda, Spark Streaming on EMR, Amazon Elasticsearch Service, Relational Database, Amazon EMR, Amazon Aurora, Amazon Machine Learning (Predictive Analytics), Any Open Source Tool of Choice on EC2, Data Science Sandbox, Visualization / Reporting, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics, Serving Tier, Amazon Athena (Clusterless SQL Query), Data Sources, Transactional Data, AWS Glue (Clusterless ETL), Amazon ElastiCache (Redis).
Serverless ETL (diagram). Stages: Store, Transform, Store, Analyze/Process, Visualize/Consume. Components shown: Amazon S3, Apache Kafka, Kinesis Streams, Amazon EMR, Spark, Flink, AWS Glue, AWS Lambda, ISV, Amazon S3, Apache Kafka, Redshift, Kinesis Streams, Data Catalog, AWS Glue, DynamoDB Streams, DynamoDB, Hive M/D.
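
The Store, Transform, Store portion of this pattern could be sketched with AWS Lambda reacting to an S3 upload. The example below shows only the pure parts: reading the bucket and key from an S3 event notification and converting CSV text to JSON lines. The file layout and function names are hypothetical, and the S3 read and write calls are omitted.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys are URL-encoded in S3 event notifications.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event["Records"]
    ]


def csv_to_json_lines(csv_text: str) -> str:
    """Transform step: CSV with a header row to newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


if __name__ == "__main__":
    event = {  # hypothetical, trimmed-down event
        "Records": [
            {"s3": {"bucket": {"name": "example-raw"},
                    "object": {"key": "raw/2017/file+one.csv"}}}
        ]
    }
    print(objects_from_s3_event(event))
    print(csv_to_json_lines("id,name\n1,ann\n2,bob\n"))
```

Keeping the transform as a pure function makes it easy to test locally before wiring it to S3 reads and writes inside the Lambda handler.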
Serverless fits nicely into big data platforms • AWS Serverless Big Data Services • Complements existing big data flows • Focus on the analytics, not on infrastructure or servers • Leave scaling, availability, and undifferentiated heavy lifting to the platform • Pay only for what you use • Easily try out different tools, analytics, and solutions
DEMO