SQL Server and Big Data Projects in the Real World Mark Kromer Pentaho Big Data Analytics Product Manager @mssqldude @kromerbigdata http://www.kromerbigdata.com
What we’ll (try) to cover today 1. The Big Data Technology Landscape 2. Big Data Analytics 3. 3 Big Data Analytics Scenarios: ❯ Digital Marketing Analytics • Hadoop, Aster Data, SQL Server ❯ Sentiment Analysis • MongoDB, SQL Server ❯ Data Refinery • Hadoop, MPP, SQL Server, Pentaho 4. SQL Server in the Big Data world
Big Data 101 3 V’s ❯ Volume – Terabyte records, transactions, tables, files ❯ Velocity – Batch, near-time, real-time (analytics), streams. ❯ Variety – Structures, unstructured, semi-structured, and all the above in a mix Text Processing ❯ Techniques for processing and analyzing unstructured (and structured) LARGE files Analytics & Insights Distributed File System & Programming
• Batch Processing • Commodity Hardware • Data Locality, no shared storage • Scales linearly • Great for large text file processing, not so great on small files • Distributed programming paradigm Hadoop 1.x
Hadoop 1 vs Hadoop 2 HADOOP 1.0 MapReduce (cluster resource management & data processing) © Hortonworks Inc. 2014 HDFS (redundant, reliable storage) HADOOP 2.0 YARN MapReduce (data processing) Others (cluster resource management) HDFS2 (redundant, highly-available & reliable storage) Single Use System Batch Apps Multi Purpose Platform Batch, Interactive, Online, Streaming, …
YARN: Taking Hadoop Beyond Batch © Hortonworks Inc. 2014 Applications Run Natively in Hadoop YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) ONLINE (HBase) OTHER (Search) (Weave…) Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
YARN Eco-system © Hortonworks Inc. 2014 Page 7 Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama - BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2
Apache Spark High-Speed In-Memory Analytics over Hadoop ● Open Source ● Alternative to Map Reduce for certain applications ● A low latency cluster computing system ● For very large data sets ● May be 100 times faster than Map Reduce for – Iterative algorithms – Interactive data mining ● Used with Hadoop / HDFS ● Released under BSD License
Popular Hadoop Distributions Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
Popular NoSQL Distributions Transactional-based, not analytics schemas
Popular MPP Distributions Big Data as distributed, scale-out, sharded data stores
Big Data Analytics Web Platform – RA 1
Sentiment Analysis Reference Architecture 2 MongoDB Hadoop PDW Big Data Platforms Social Media Sources Data Orchestration Data Mining OLAP Cubes Data Models Analytical Models OLAP Analytics Tools, Reporting Tools, Dashboards
Streamlined Data Refinery Reference Architecture 3 Transactions – Batch & Real-time Enrollments & Redemptions Location, Email, Other Data Hadoop Cluster Analytics Reports Data Orchestration
Big Data Analytics
Big Data Analytics Core Tenets • Distributed Data (Data Locality) ❯ HDFS / MapReduce ❯ YARN / TEZ ❯ Replicated / Sharded Data • MPP Databases ❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale-out with distributed processing across nodes • Distributed Analytics ❯ SAS: Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” http://www.sas.com/resources/whitepaper/wp_46345.pdf • In-memory Analytics ❯ Microsoft PowerPivot (Tabular models) ❯ SAP HANA ❯ Tableau
SQL on Hadoop Hortonworks and Cloudera DW Engine Approaches
SQL on Hadoop Landscape Gartner Research on SQL on Hadoop Not Quite Real Time Many vendors market their SQL interfaces to Hadoop as providing so called "real-time access" to data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data manipulation experience — faster than batch, but not truly real time. In the case of Hadoop and the types of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes. If your usage truly needs realtime, a different set of technologies and vendors may be required.
SQL on Hadoop Vendor Perspective: MapR Batch SQL Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere between a few minutes to several hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities in future. MapR ships and supports Apache Hive today. Interactive SQL Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. The response times vary between milliseconds to minutes depending on the query complexity. Users expect SQL-on- Hadoop technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.
MapReduce Framework (Map) using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, "s+"); if (parts.Length != expected) //only take records with all values { return; } context.EmitKeyValue(parts[pagePos], hit); } }
MapReduce Framework (Reduce & Job) public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } }
Get Data into Hadoop Linux shell commands to access data in HDFS Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv List files in HDFS: c:Hadoop>hadoop fs -ls /import Found 1 items -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv View file in HDFS: c:Hadoop>hadoop fs -cat /import/sales.csv Kromer,123,5,55 Smith,567,1,25 Jones,123,9,99 James,11,12,1 Johnson,456,2,2.5 Singh,456,1,3.25 Yu,123,1,11 Now, we can work on the data with MapReduce, Hive, Pig, etc.
Use Hive for Data Schema and Analysis create external table ext_sales ( lastname string, productid int, quantity int, sales_amount float ) row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input'; LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Sqoop Data transfer to & from Hadoop & SQL Server sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1 > hadoop fs -cat /user/mark/customers/part-m-00000 > 5,Bob Smith sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Role of NoSQL in a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of RDBMS ‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of “Big Data” ‣ “Schema later” ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
MongoDB and Enterprise IT Stack Applications CRM, ERP, Collaboration, Mobile, BI Data Management Online Data Offline Data Hadoop EDW Management & Monitoring Security & Auditing RDBMS RDBMS Infrastructure OS & Virtualization, Compute, Storage, Network
General document per customer per account { _id : ObjectId("4e2e3f92268cdda473b628f6"), sourceIDs: { ABCSystemIDPart1: 8397897, ABCSystemIDPart2: 2937430, ABCSystemIDPart3: 932018 } accountType: “Checking”, accountOwners: [ { firstName : ”John", lastName: “Smith”, contactMethods: [ { type: “phone”, subtype: “mobile”, number: 8743927394}, { type: “mail”, address: “58 3rd St.”, city: …} ] possibleMatchCriteria: { govtID: 2938932432, fullName: “johnsmith”, dob: … } }, { firstName : ”Anne", maidenName: “Collins”, lastName: “Smith”, …} ], openDate: ISODate("2013-02-15 10:00”), accountFeatures { Overdraft: true, APR: 20, … } } OR creditCardNumber: 8392384938391293 OR mortgageID: 2374389 OR policyID: 18374923
Text Search Example (e.g. address typo so do fuzzy match) // Text search for address filtered by first name and NY > db.ticks.runCommand( “text”, { search: “vanderbilt ave. vander bilt”, filter: {name: “Smith”, city: “New York”} })
Aggregate: Total Value of Accounts //Find total value of each customer’s accounts for a given RM (or Agent) sorted by value db.accts.aggregate( { $match: {relationshipManager: “Smith”}}, { $group : { _id : “$ssn”, totalValue: {$sum: ”$value”} }}, { $sort: { totalValue: -1}} )
SQL Server Big Data – Data Loading Amazon HDFS & EMR Data Loading Amazon S3 Bucket
SQL Server Big Data Environment SQL Server Database ❯ SQL 2012 Enterprise Edition ❯ Page Compression ❯ 2012 Columnar Compression on Fact Tables ❯ Clustered Index on all tables ❯ Auto-update Stats Asynch ❯ Partition Fact Tables by month and archive data with sliding window technique ❯ Drop all indexes before nightly ETL load jobs ❯ Rebuild all indexes when ETL completes SQL Server Analysis Services ❯ SSAS 2012 Enterprise Edition ❯ 2008 R2 OLAP cubes partition-aligned with DW ❯ 2012 cubes in-memory tabular cubes ❯ All access through MSMDPUMP or SharePoint
SQL Server Big Data Analytics Features Columnstore Sqoop adapter PolyBase Hive In-memory analytics Scale-out MPP SQL Server APS
Pentaho Big Data Analytics DBA ETL/BI Developer Business Users & Executives Analysts & Data Scientists Enterprise & Interactive Reporting Pentaho Business Analytics Interactive Analysis Dashboards Predictive Analytics DIRECT ACCESS Data Integration Instaview | Visual Map Reduce OPERATIONAL DATA BIG DATA PUBLIC/PRIVATE CLOUDS DATA STREAM
Pentaho Big Data Analytics Accelerate the time to big data value • Full continuity from data access to decisions – complete data integration & analytics for any big data store • Faster development, faster runtime – visual development, distributed execution • Instant and interactive analysis – no coding and no ETL required
Product Components • Visual data exploration • Ad hoc analysis • Interactive charts & visualizations Pentaho Data Integration • Visual development for big data • Broad connectivity • Data quality & enrichment • Integrated scheduling • Security integration Pentaho Dashboards • Self-service dashboard builder • Content linking & drill through • Highly customized mash-ups Pentaho Data Mining & Predictive Analytics • Model construction & evaluation • Learning schemes • Integration with 3rd part models using PMML Pentaho Enterprise & Interactive Reports • Both ad hoc & distributed reporting • Drag & drop interactive reporting • Pixel-perfect enterprise reports Pentaho for Big Data MapReduce & Instaview • Visual Interface for Developing MR • Self-service big data discovery • Big data access to Data Analysts Pentaho Analyzer
Pentaho Interactive Analysis & Data Discovery Highly Flexible Advanced Visualizations ❯ Simple, easy-to-use visual data exploration ❯ Web-based thin client; in-memory caching ❯ Rich library of interactive visualizations • Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more • Pluggable visualizations ❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMs ❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached

Big Data Analytics with Hadoop, MongoDB and SQL Server

  • 1.
    SQL Server andBig Data Projects in the Real World Mark Kromer Pentaho Big Data Analytics Product Manager @mssqldude @kromerbigdata http://www.kromerbigdata.com
  • 2.
    What we’ll (try)to cover today 1. The Big Data Technology Landscape 2. Big Data Analytics 3. 3 Big Data Analytics Scenarios: ❯ Digital Marketing Analytics • Hadoop, Aster Data, SQL Server ❯ Sentiment Analysis • MongoDB, SQL Server ❯ Data Refinery • Hadoop, MPP, SQL Server, Pentaho 4. SQL Server in the Big Data world
  • 3.
    Big Data 101 3 V’s ❯ Volume – Terabyte records, transactions, tables, files ❯ Velocity – Batch, near-time, real-time (analytics), streams. ❯ Variety – Structures, unstructured, semi-structured, and all the above in a mix Text Processing ❯ Techniques for processing and analyzing unstructured (and structured) LARGE files Analytics & Insights Distributed File System & Programming
  • 4.
    • Batch Processing • Commodity Hardware • Data Locality, no shared storage • Scales linearly • Great for large text file processing, not so great on small files • Distributed programming paradigm Hadoop 1.x
  • 5.
    Hadoop 1 vsHadoop 2 HADOOP 1.0 MapReduce (cluster resource management & data processing) © Hortonworks Inc. 2014 HDFS (redundant, reliable storage) HADOOP 2.0 YARN MapReduce (data processing) Others (cluster resource management) HDFS2 (redundant, highly-available & reliable storage) Single Use System Batch Apps Multi Purpose Platform Batch, Interactive, Online, Streaming, …
  • 6.
    YARN: Taking HadoopBeyond Batch © Hortonworks Inc. 2014 Applications Run Natively in Hadoop YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) ONLINE (HBase) OTHER (Search) (Weave…) Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
  • 7.
    YARN Eco-system ©Hortonworks Inc. 2014 Page 7 Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama - BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2
  • 8.
    Apache Spark High-SpeedIn-Memory Analytics over Hadoop ● Open Source ● Alternative to Map Reduce for certain applications ● A low latency cluster computing system ● For very large data sets ● May be 100 times faster than Map Reduce for – Iterative algorithms – Interactive data mining ● Used with Hadoop / HDFS ● Released under BSD License
  • 9.
    Popular Hadoop Distributions Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
  • 10.
    Popular NoSQL Distributions Transactional-based, not analytics schemas
  • 11.
    Popular MPP Distributions Big Data as distributed, scale-out, sharded data stores
  • 12.
    Big Data AnalyticsWeb Platform – RA 1
  • 13.
    Sentiment Analysis ReferenceArchitecture 2 MongoDB Hadoop PDW Big Data Platforms Social Media Sources Data Orchestration Data Mining OLAP Cubes Data Models Analytical Models OLAP Analytics Tools, Reporting Tools, Dashboards
  • 14.
    Streamlined Data Refinery Reference Architecture 3 Transactions – Batch & Real-time Enrollments & Redemptions Location, Email, Other Data Hadoop Cluster Analytics Reports Data Orchestration
  • 15.
  • 16.
    Big Data Analytics Core Tenets • Distributed Data (Data Locality) ❯ HDFS / MapReduce ❯ YARN / TEZ ❯ Replicated / Sharded Data • MPP Databases ❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale-out with distributed processing across nodes • Distributed Analytics ❯ SAS: Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” http://www.sas.com/resources/whitepaper/wp_46345.pdf • In-memory Analytics ❯ Microsoft PowerPivot (Tabular models) ❯ SAP HANA ❯ Tableau
  • 17.
    SQL on Hadoop Hortonworks and Cloudera DW Engine Approaches
  • 18.
    SQL on HadoopLandscape Gartner Research on SQL on Hadoop Not Quite Real Time Many vendors market their SQL interfaces to Hadoop as providing so called "real-time access" to data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data manipulation experience — faster than batch, but not truly real time. In the case of Hadoop and the types of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes. If your usage truly needs realtime, a different set of technologies and vendors may be required.
  • 19.
    SQL on Hadoop Vendor Perspective: MapR Batch SQL Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere between a few minutes to several hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities in future. MapR ships and supports Apache Hive today. Interactive SQL Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. The response times vary between milliseconds to minutes depending on the query complexity. Users expect SQL-on- Hadoop technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.
  • 20.
    MapReduce Framework (Map) using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, "s+"); if (parts.Length != expected) //only take records with all values { return; } context.EmitKeyValue(parts[pagePos], hit); } }
  • 21.
    MapReduce Framework (Reduce& Job) public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } }
  • 22.
    Get Data intoHadoop Linux shell commands to access data in HDFS Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv List files in HDFS: c:Hadoop>hadoop fs -ls /import Found 1 items -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv View file in HDFS: c:Hadoop>hadoop fs -cat /import/sales.csv Kromer,123,5,55 Smith,567,1,25 Jones,123,9,99 James,11,12,1 Johnson,456,2,2.5 Singh,456,1,3.25 Yu,123,1,11 Now, we can work on the data with MapReduce, Hive, Pig, etc.
  • 23.
    Use Hive forData Schema and Analysis create external table ext_sales ( lastname string, productid int, quantity int, sales_amount float ) row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input'; LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
  • 24.
    Sqoop Data transferto & from Hadoop & SQL Server sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1 > hadoop fs -cat /user/mark/customers/part-m-00000 > 5,Bob Smith sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
  • 25.
    Role of NoSQLin a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of RDBMS ‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of “Big Data” ‣ “Schema later” ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
  • 26.
    MongoDB and EnterpriseIT Stack Applications CRM, ERP, Collaboration, Mobile, BI Data Management Online Data Offline Data Hadoop EDW Management & Monitoring Security & Auditing RDBMS RDBMS Infrastructure OS & Virtualization, Compute, Storage, Network
  • 27.
    General document percustomer per account { _id : ObjectId("4e2e3f92268cdda473b628f6"), sourceIDs: { ABCSystemIDPart1: 8397897, ABCSystemIDPart2: 2937430, ABCSystemIDPart3: 932018 } accountType: “Checking”, accountOwners: [ { firstName : ”John", lastName: “Smith”, contactMethods: [ { type: “phone”, subtype: “mobile”, number: 8743927394}, { type: “mail”, address: “58 3rd St.”, city: …} ] possibleMatchCriteria: { govtID: 2938932432, fullName: “johnsmith”, dob: … } }, { firstName : ”Anne", maidenName: “Collins”, lastName: “Smith”, …} ], openDate: ISODate("2013-02-15 10:00”), accountFeatures { Overdraft: true, APR: 20, … } } OR creditCardNumber: 8392384938391293 OR mortgageID: 2374389 OR policyID: 18374923
  • 28.
    Text Search Example (e.g. address typo so do fuzzy match) // Text search for address filtered by first name and NY > db.ticks.runCommand( “text”, { search: “vanderbilt ave. vander bilt”, filter: {name: “Smith”, city: “New York”} })
  • 29.
    Aggregate: Total Valueof Accounts //Find total value of each customer’s accounts for a given RM (or Agent) sorted by value db.accts.aggregate( { $match: {relationshipManager: “Smith”}}, { $group : { _id : “$ssn”, totalValue: {$sum: ”$value”} }}, { $sort: { totalValue: -1}} )
  • 30.
    SQL Server BigData – Data Loading Amazon HDFS & EMR Data Loading Amazon S3 Bucket
  • 31.
    SQL Server BigData Environment SQL Server Database ❯ SQL 2012 Enterprise Edition ❯ Page Compression ❯ 2012 Columnar Compression on Fact Tables ❯ Clustered Index on all tables ❯ Auto-update Stats Asynch ❯ Partition Fact Tables by month and archive data with sliding window technique ❯ Drop all indexes before nightly ETL load jobs ❯ Rebuild all indexes when ETL completes SQL Server Analysis Services ❯ SSAS 2012 Enterprise Edition ❯ 2008 R2 OLAP cubes partition-aligned with DW ❯ 2012 cubes in-memory tabular cubes ❯ All access through MSMDPUMP or SharePoint
  • 32.
    SQL Server BigData Analytics Features Columnstore Sqoop adapter PolyBase Hive In-memory analytics Scale-out MPP SQL Server APS
  • 33.
    Pentaho Big DataAnalytics DBA ETL/BI Developer Business Users & Executives Analysts & Data Scientists Enterprise & Interactive Reporting Pentaho Business Analytics Interactive Analysis Dashboards Predictive Analytics DIRECT ACCESS Data Integration Instaview | Visual Map Reduce OPERATIONAL DATA BIG DATA PUBLIC/PRIVATE CLOUDS DATA STREAM
  • 34.
    Pentaho Big DataAnalytics Accelerate the time to big data value • Full continuity from data access to decisions – complete data integration & analytics for any big data store • Faster development, faster runtime – visual development, distributed execution • Instant and interactive analysis – no coding and no ETL required
  • 35.
    Product Components •Visual data exploration • Ad hoc analysis • Interactive charts & visualizations Pentaho Data Integration • Visual development for big data • Broad connectivity • Data quality & enrichment • Integrated scheduling • Security integration Pentaho Dashboards • Self-service dashboard builder • Content linking & drill through • Highly customized mash-ups Pentaho Data Mining & Predictive Analytics • Model construction & evaluation • Learning schemes • Integration with 3rd part models using PMML Pentaho Enterprise & Interactive Reports • Both ad hoc & distributed reporting • Drag & drop interactive reporting • Pixel-perfect enterprise reports Pentaho for Big Data MapReduce & Instaview • Visual Interface for Developing MR • Self-service big data discovery • Big data access to Data Analysts Pentaho Analyzer
  • 36.
    Pentaho Interactive Analysis& Data Discovery Highly Flexible Advanced Visualizations ❯ Simple, easy-to-use visual data exploration ❯ Web-based thin client; in-memory caching ❯ Rich library of interactive visualizations • Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more • Pluggable visualizations ❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMs ❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached