DATA VIRTUALIZATION PACKED LUNCH WEBINAR SERIES Sessions Covering Key Data Integration Challenges Solved with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization Pablo Alvarez-Yanez Director of Product Management, Denodo
3 Chikio Hayashi, 1998: "What is Data Science? Fundamental Concepts and a Heuristic Example" Data science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data
4 Data Science – Brief History
Data Science is an umbrella term that has recently received a lot of media attention. However, making sense of data in some way has been the job of scientists, statisticians, computer scientists and business analysts for years. The term "data science" was first used in 1996, at a conference of the International Federation of Classification Societies (IFCS) held in Japan. For a good review of the history of the term, see the Forbes article "A Very Short History of Data Science"
• https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#53641eb955cf
5 The Tools of Data Science
When thinking about data science, most minds immediately go to languages like Python and R, or tools like Spark and TensorFlow. There is a myriad of projects that currently serve the needs of the data scientist.
6 The Data Scientist Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify useful data
▪ Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
▪ Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
7 Where does your time go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data may be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technology (noSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
8 Reference Architecture
[Architecture diagram showing: ETL, Data Warehouse, Kafka, Spark Streaming, Files, a Physical Data Lake with Distributed Storage (HDFS, S3), SparkML, and a Logical Data Lake exposing a SQL interface]
Denodo for a Data Scientist 9
10 Data Scientist Flow
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)
11 Identify useful data
If the company has a virtual layer with good coverage of its data sources, this task is greatly simplified:
• A data virtualization tool like Denodo can offer unified access to all data available in the company
• It abstracts the technologies underneath, offering a standard SQL interface to query and manipulate the data (see the sketch below)
To further simplify the challenge, Denodo offers a Data Catalog to search, find and explore your data assets.
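As a rough illustration of that SQL interface, the sketch below connects to a Denodo server from Python and peeks at one of the exposed views. It assumes the server's PostgreSQL-compatible ODBC endpoint is enabled; the host, port, database, view name and credentials are placeholders, not values from the slides.

# Minimal sketch: querying a Denodo virtual database from Python.
# Assumption: Denodo's PostgreSQL-compatible ODBC endpoint is reachable;
# connection details and the view name below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="denodo-server.example.com",
    port=9996,                      # typical ODBC port; may differ per installation
    dbname="virtual_datalake",      # hypothetical virtual database
    user="data_scientist",
    password="********",
)

with conn.cursor() as cur:
    cur.execute("SELECT * FROM citibike_trips")   # hypothetical view exposed by the layer
    for row in cur.fetchmany(10):                 # peek at a few rows
        print(row)
conn.close()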
12 Search & Explore: Metadata
Search the catalog and refine your results using descriptions, tags and business categories
13 Search & Explore: Content
Integration with Lucene and ElasticSearch for indexing and performing keyword-based searches on the content
14 Document your models
Rich HTML descriptions, editable directly from the catalog
Extended metadata support to enrich the catalog with custom fields and details
15 Data Scientist Flow
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)
16 Ingestion and Data Manipulation tasks
• Typically, scientists get data from a variety of places through various formats and protocols: from relational databases to REST web services or noSQL engines
• Data is often exported into CSV files or loaded into Spark
• Later, that data is manipulated in scripts (e.g. Pandas and Python)
• However, data virtualization offers the unique opportunity of using standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data
• Cleansing and transformation steps can be easily accomplished in SQL, as in the sketch below
• Its modeling capabilities enable the definition of views that embed this logic to foster reusability
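To make the idea concrete, here is a minimal sketch that pushes a join and an aggregation into the virtual layer and reads only the result into Pandas. The view names (bike_trips, daily_weather), their columns and the connection details are hypothetical assumptions, not objects defined in the slides.

# Minimal sketch: let the virtual layer do the join + aggregation in SQL,
# then pull only the aggregated result into a Pandas DataFrame.
# View/column names and connection details are placeholders.
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="denodo-server.example.com", port=9996,
                        dbname="virtual_datalake", user="data_scientist",
                        password="********")

query = """
    SELECT t.trip_date,
           COUNT(*)           AS num_trips,
           AVG(w.temperature) AS avg_temp
    FROM   bike_trips t
    JOIN   daily_weather w ON t.trip_date = w.obs_date
    GROUP  BY t.trip_date
"""
df = pd.read_sql(query, conn)   # cleansing/combination happens in Denodo, not in the script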
17 Denodo Administration Tool
18 Notebooks: Apache Zeppelin
19 Denodo Test Drive for Data Science
Launched in December 2018 to promote the use of Denodo in the data science ecosystem
20 Can we predict the usage of the NYC bike system based on data from previous years? (photo: https://flic.kr/p/x8HgrF)
21 NYC Citi bike data
22 NOAA Weather
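As a rough sketch of what this Test Drive scenario can look like in code, the example below trains a regression model on daily Citi Bike trip counts joined with NOAA weather observations through the virtual layer. The joined view (daily_trips_weather), its columns and the choice of a random-forest regressor are illustrative assumptions, not the actual Test Drive content.

# Minimal sketch: predict daily Citi Bike usage from weather features.
# The view, column names and model choice are assumptions for illustration.
import pandas as pd
import psycopg2
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

conn = psycopg2.connect(host="denodo-server.example.com", port=9996,
                        dbname="virtual_datalake", user="data_scientist",
                        password="********")

# Hypothetical view joining daily Citi Bike trip counts with NOAA weather.
df = pd.read_sql("SELECT trip_date, num_trips, avg_temp, precipitation, wind_speed "
                 "FROM daily_trips_weather", conn)

X = df[["avg_temp", "precipitation", "wind_speed"]]
y = df["num_trips"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))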
23 Test Drive Tools
Denodo and Spark: Working with Larger Datasets 24
25 Denodo and Spark: data science with large volumes
✓ Spark as a source
▪ Spark, as well as many other Hadoop systems (Hive, Presto, Impala, HBase, etc.), can be used by Denodo as a data source to read data
✓ Spark as the processing engine
▪ In cases where Denodo needs to post-process data, for example in multi-source queries, Denodo can lift and shift that processing to Spark's engine for execution
✓ Spark as the data target
▪ Denodo can automatically save the data from any execution in a target Spark cluster when your processing needs (e.g. SparkML) require local data (see the sketch below)
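For the "Spark as the data target" case, the sketch below shows what a data scientist might do once Denodo has materialized a dataset in the cluster, here assumed to be Parquet files on S3. The storage path, columns and the linear-regression model are illustrative assumptions.

# Minimal sketch: train a SparkML model on data Denodo has materialized
# into the cluster. The Parquet path and column names are placeholders.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bike-usage").getOrCreate()

# Dataset previously written to the cluster's storage by Denodo (hypothetical path)
df = spark.read.parquet("s3a://datalake/denodo_cache/daily_trips_weather/")

assembler = VectorAssembler(inputCols=["avg_temp", "precipitation", "wind_speed"],
                            outputCol="features")
train = assembler.transform(df).select("features", "num_trips")

model = LinearRegression(labelCol="num_trips").fit(train)
print(model.summary.rootMeanSquaredError)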
26 Access to Big Data Sources
A single point of access to all data assets, internal and external:
▪ Physical Data Lake, usually based on SQL-on-Hadoop systems:
▪ SparkSQL (on-prem, Databricks)
▪ Presto
▪ Impala
▪ Hive
▪ Other relational databases (EDW, ODS, applications, etc.)
▪ NoSQL (MongoDB, HBase, etc.)
▪ Indexes (ElasticSearch)
▪ Files (local, S3, Azure, etc.)
▪ SaaS APIs (Salesforce, Google, social media, etc.)
27 Using Spark's Processing Engine
The Denodo optimizer provides native integration with MPP systems, adding one extra key capability: query acceleration
Denodo can move processing to the MPP, on demand, during the execution of a query
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when processing buffers don't fit into Denodo's memory (swapped data)
28 Ingesting and Caching
Denodo's integration with SQL-on-Hadoop systems is bi-directional: remote tables and caching enable Denodo to create tables and load them with data
This makes it possible to quickly load any data accessible by Denodo into the Hadoop cluster:
• It's significantly faster than tools like Sqoop, making this approach an alternative to ingestion and ELT processes
• At the same time, it preserves lineage and governance
The load process is based on direct load to HDFS/S3/ADLS (illustrated in the sketch below):
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
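Steps 2 and 3 of that load process can be illustrated with the sketch below. This is not Denodo's internal code, just the same pattern reproduced with common Python libraries; the bucket, prefix and chunk size are placeholders, and S3 stands in for HDFS/ADLS.

# Illustration of the bulk-load pattern described on the slide (not Denodo internals):
# write Parquet chunks with Snappy compression locally, then upload them in parallel
# to object storage.
import concurrent.futures
import pathlib

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_chunks(df: pd.DataFrame, out_dir: pathlib.Path, rows_per_chunk: int = 500_000):
    """Step 2: generate Snappy-compressed Parquet files in chunks on the local machine."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, start in enumerate(range(0, len(df), rows_per_chunk)):
        chunk = pa.Table.from_pandas(df.iloc[start:start + rows_per_chunk])
        path = out_dir / f"part-{i:05d}.parquet"
        pq.write_table(chunk, path, compression="snappy")
        paths.append(path)
    return paths

def upload_parallel(paths, bucket: str, prefix: str):
    """Step 3: upload the Parquet chunks in parallel (placeholder bucket/prefix)."""
    s3 = boto3.client("s3")
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for p in paths:
            pool.submit(s3.upload_file, str(p), bucket, f"{prefix}/{p.name}")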
Key Takeaways 29
30 Key Takeaways
✓ Denodo can play a key role in the data science ecosystem to reduce data exploration and analysis timeframes
✓ It extends and integrates with the capabilities of notebooks, Python, R, etc. to improve the toolset of the data scientist
✓ It provides a modern "SQL-on-Anything" engine
✓ It can leverage Big Data technologies like Spark (as a data source, an ingestion tool and for external processing) to efficiently work with large data volumes
✓ It helps productionize data science
Q&A
32 Next Steps
Access Denodo Platform in the Cloud! Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY
Thank you!
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.
