Azure Data Lake Kenneth M. Nielsen
About me Kenneth M. Nielsen Worked with SQL Server since 1999 Data Solution Architect at Microsoft Kenneth.Nielsen@microsoft.com @doktorkermit Linkedin.com/in/KennethMNielsen www.funkylab.com
Agenda • Azure Data Lake Store • Azure Data Lake Analytics • Azure Data Lake Analytics – Using Visual Studio • Azure Data Lake Analytics – Using PowerShell • Q & A
Data Lake Store
Azure Data Lake Store A hyper scale repository for big data analytics workloads No limits to SCALE Store ANY DATA in its native format HADOOP FILE SYSTEM (HDFS) for the cloud ENTERPRISE READY access control, encryption at rest Optimized for analytic workload PERFORMANCE
Azure Data Lake Store Any Data • Unstructured • Semi-structured • Structured
Azure Data Lake Store
Azure Data Lake Store HDFS for the cloud New file system built from the ground up, based on the Hadoop file system • Integrates with HDInsight, Hortonworks and Cloudera • Supports file and folder objects and operations
Azure Data Lake Store Unlimited storage • File sizes can range from gigabytes to petabytes • No limits to scale
Azure Data Lake Store Security • Integrates with Azure Active Directory • Audit logs for all operations* • Server side Encryption* • ACL on files and folders* • Enterprise ready security when in GA
Data Lake Analytics
Azure Data Lake Analytics An elastic analytics service built on Apache YARN that processes all data, at any size • No limits to SCALE • Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C# • Optimized to work with ADL STORE • FEDERATED QUERY across Azure data sources • ENTERPRISE READY role-based access control & auditing • Pay PER JOB & scale PER JOB
U-SQL A new language for Big Data • Familiar syntax to millions of SQL & .NET developers • Unifies declarative nature of SQL with the imperative power of C# • Unifies structured, semi-structured and unstructured data • Distributed query support over all data
Language Overview U-SQL Fundamentals • All the familiar SQL clauses SELECT | FROM | WHERE GROUP BY | JOIN | OVER • Operate on unstructured and structured data • Relational metadata objects .NET integration and extensibility • U-SQL expressions are full C# expressions • Reuse .NET code in your own assemblies • Use C# to define your own: Types | Functions | Joins | Aggregators | I/O (Extractors, Outputters)
U-SQL Capabilities • Interactive: IN PROGRESS • Batch: AVAILABLE NOW • Streaming: FUTURE • Machine Learning: FUTURE
U-SQL Distributed Query • Azure Storage Blobs: READ, WRITE • Azure Data Lake Store: READ, WRITE • Azure SQL Database: READ, WRITE • Azure SQL Data Warehouse: READ, WRITE • Azure SQL DB in Azure VM: READ, WRITE
@orders =
    EXTRACT OrderId int,
            Customer string,
            Date DateTime,
            Amount float
    FROM "/input/orders.txt"
    USING Extractors.Tsv();
OUTPUT @orders
    TO "/output/orders_copy.txt"
    USING Outputters.Tsv();
• @orders: rowset • EXTRACT: apply schema on read • FROM: from a file in a Data Lake • Extractors.Tsv(): easy delimited text handling • OUTPUT: write out • Read the input, write it directly to output (just a simple copy)
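To make "schema on read" concrete outside Azure, here is a minimal local sketch in Python of what the copy script above does conceptually: the file is plain tab-separated text, and the column names and types are applied only at read time. This is an illustration, not how ADLA implements `Extractors.Tsv()`; the sample rows and the ISO 8601 date format are assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Column names and types mirror the EXTRACT clause on the slide.
# Assumption: dates are stored as ISO 8601 text.
SCHEMA = [
    ("OrderId", int),
    ("Customer", str),
    ("Date", datetime.fromisoformat),
    ("Amount", float),
]


def extract_tsv(text, schema=SCHEMA):
    """Schema on read: convert each tab-separated field while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) != len(schema):
            raise ValueError(f"expected {len(schema)} columns, got {len(fields)}")
        rows.append(
            {name: convert(value) for (name, convert), value in zip(schema, fields)}
        )
    return rows


def _format(value):
    """Render a typed value back to text for the output file."""
    return value.isoformat() if isinstance(value, datetime) else str(value)


def output_tsv(rows, schema=SCHEMA):
    """Write the rowset back out as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([_format(row[name]) for name, _ in schema])
    return buffer.getvalue()


# Hypothetical sample data standing in for /input/orders.txt.
SAMPLE = (
    "1\tContoso\t2016-02-23T00:00:00\t19.5\n"
    "2\tFabrikam\t2016-02-24T00:00:00\t7.25\n"
)

orders = extract_tsv(SAMPLE)   # typed rowset, like @orders
copy = output_tsv(orders)      # text that would land in the output file
```

The stored file never carries type information; a different script could read the same bytes with a different schema, which is the point of the pattern.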
Azure Data Lake Pattern ADL Storage Visual Studio ADL Power BI Desktop Get Data From CSV Where CAQS Files are stored, but would load into ADLS directly if ingesting from scratch Upload Dataset ADL Analytics AML Experiment ADL Storage Data Analyst Data Scientist Data Engineer
Execution with Requested Parallelism Requested Parallelism = 1 (reserve enough to do 1 vertex at a time) Requested Parallelism = 4 (reserve enough to do 4 vertices at a time)
Stage Details 252 Pieces of work AVG Vertex execution time 4.3 Billion rows Data Read & Written
ADLAUs (Azure Data Lake Analytics Units) • Parallelism N = N ADLAUs • 1 ADLAU ~= a VM with 2 cores and 6 GB of memory
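The effect of requested parallelism on a stage can be estimated with a small simulation. The Python sketch below assumes every vertex in a stage is independent and that each ADLAU runs one vertex at a time; real jobs have dependencies between stages, so treat the result as a rough lower bound. The 10-second vertex duration is a made-up figure; the 2 cores and 6 GB per ADLAU come from the slide.

```python
import heapq

CORES_PER_ADLAU = 2       # from the slide: 1 ADLAU ~= 2 cores
MEMORY_GB_PER_ADLAU = 6   # from the slide: 1 ADLAU ~= 6 GB of memory


def reserved_resources(parallelism):
    """Approximate resources reserved for a given requested parallelism."""
    return {
        "cores": parallelism * CORES_PER_ADLAU,
        "memory_gb": parallelism * MEMORY_GB_PER_ADLAU,
    }


def stage_duration(vertex_seconds, parallelism):
    """Greedy simulation: each vertex starts on whichever ADLAU frees up first."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # One entry per ADLAU, holding the time at which it becomes free.
    slots = [0.0] * parallelism
    heapq.heapify(slots)
    for seconds in vertex_seconds:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + seconds)
    return max(slots)


# Hypothetical stage: 252 vertices of 10 seconds each.
vertices = [10.0] * 252
serial = stage_duration(vertices, parallelism=1)
parallel = stage_duration(vertices, parallelism=4)
```

Under these assumptions, going from parallelism 1 to 4 cuts the stage time to a quarter while quadrupling the reserved cores and memory, which is the trade-off behind paying and scaling per job.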
Data Lake Analytics Visual Studio
Azure Data Lake – Visual Studio Available project types
Azure Data Lake – Visual Studio Fully integrates with Solution Explorer
Azure Data Lake – Visual Studio • Monitor and manage jobs • Browse and manage storage • Browse U-SQL catalog
Creating U-SQL
Creating U-SQL IntelliSense Supported
Creating U-SQL Code-behind enhances your code
Installing Azure PowerShell • PowerShell Gallery • Recommended approach • PowerShell 5.0 supports PowerShell Gallery • Windows 10 ships with PowerShell 5.0 • Web Platform Installation (WebPI)
Installing from the PowerShell Gallery • Launch Windows PowerShell ISE as Administrator • Install-Module AzureRM • Install-AzureRM
Finding the ADL cmdlets • Option 1 • Get-Command -Module AzureRM.DataLakeStore • Get-Command -Module AzureRM.DataLakeAnalytics • Option 2 • Get-Command *DataLake*
Logging in to Azure • $subname = "BDHadoopTeamPMTestDemo" • Login-AzureRmAccount -SubscriptionName $subname
ADLS: Listing files in a store • $adls = "sqlkonferenz" • Get-AzureRmDataLakeStoreChildItem -Account $adls -Path /
ADLS: Upload and download • $adls = "sqlkonferenz" • Import-AzureRmDataLakeStoreItem -Account $adls -Path d:\somefile.txt -Destination /somefile.txt • Export-AzureRmDataLakeStoreItem -Account $adls -Path /somefile.txt -Destination d:\somefile_copy.txt
ADLA: List and submit jobs • $adla = "sqlkonferenz" • Get-AzureRmDataLakeAnalyticsJob -Account $adla • Submit-AzureRmDataLakeAnalyticsJob -Account $adla -Name myjob -Script "…" # U-SQL text • Submit-AzureRmDataLakeAnalyticsJob -Account $adla -Name myjob -ScriptPath D:\test.script
ADL Store (ADLS) feature set Account Management Create new account List accounts Update account properties Delete account Transferring Data Upload into store from local disk Download from store to local disk Files and Folders List contents of folder Create Move Delete Does file exist Security Get ACLs Update ACLs Get Owner Set Owner File Content Set file content Append file content Get file content Merge files
ADL Analytics (ADLA) feature set Account Management Create new account List accounts Update account properties Delete account Data Sources Add a data source List data sources Update data source Delete data source Compute List jobs Submit job Cancel job Catalog Items List items in U-SQL catalog Update item Catalog Secrets Create catalog secret List catalog secrets Delete catalog secrets
Questions
Azure data lake sql konf 2016


Editor's Notes

  • #6 Data Lake Store is your new friend for storing data, in almost unlimited quantities, and it costs next to nothing to store data on Azure. Any file format is supported and data is stored in its native format, meaning that you can store images, JSON tables, CSV, TSV, blobs, etc. It is built on HDFS; in effect it is HDFS for the cloud.
  • #9 Support for renaming, creating and deleting files and folders. The file system is built from scratch, based on the Hadoop file system. Microsoft Azure Data Lake Store is a Hadoop file system that’s compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Data Lake Store is integrated with Azure Data Lake Analytics and Azure HDInsight and will be integrated with Microsoft offerings like Revolution-R Enterprise; industry-standard distributions like Hortonworks, Cloudera, and MapR; and individual Hadoop projects like Spark, Storm, Flume, Sqoop, and Kafka.
  • #10 Data Lake Store has no fixed limits on account size or file size. While other cloud storage offerings might restrict individual file sizes to a few terabytes, Data Lake Store can store very large files that are hundreds of times larger. At the same time, it provides very low latency read/write access and high throughput for scenarios like high-resolution video, scientific, medical, large backup data, event streams, web logs, and Internet of Things (IoT). Collect and store everything in Data Lake Store without restriction or prior understanding of business requirements.
  • #11 Access control lists currently apply only at the root level, meaning that a user is granted access to a root folder and then has access to everything in that root. This will change when the service goes into GA.
  • #24 U-SQL project: where you write your statements. U-SQL sample project: a really extensive project that you can work with on your own account; it will give you a head start on getting up to speed on the topic. U-SQL unit testing project.
  • #26 Integrates seamlessly with Server Explorer
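
Note #11 describes the preview access model: a grant on a root folder covers everything beneath it. The Python sketch below models only that rule, to show what root-level granularity implies; it does not call any ADLS API, and the user names and grants are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical grants: user -> root folders the user has been given access to.
GRANTS = {
    "analyst@contoso.com": {"/input"},
    "engineer@contoso.com": {"/input", "/output"},
}


def root_folder(path):
    """Return the top-level folder of an absolute store path, e.g. '/input'."""
    parts = PurePosixPath(path).parts  # e.g. ('/', 'input', 'orders.txt')
    if len(parts) < 2 or parts[0] != "/":
        raise ValueError(f"expected an absolute path below the root: {path!r}")
    if ".." in parts:
        raise ValueError(f"parent references are not allowed: {path!r}")
    return "/" + parts[1]


def can_access(user, path, grants=GRANTS):
    """Root-level rule: access to a root folder implies access to all of it."""
    return root_folder(path) in grants.get(user, set())


deep_file_allowed = can_access("analyst@contoso.com", "/input/2016/02/orders.txt")
other_root_allowed = can_access("analyst@contoso.com", "/output/orders_copy.txt")
```

With this model there is no way to hide a single subfolder from a user who holds the root grant, which is the limitation the note says is lifted at GA with ACLs on files and folders.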