Building a Hybrid Data Pipeline for Salesforce and Hadoop Sumit Sarkar, Chief Data Evangelist for Progress @SAsInSumit www.linkedin.com/in/meetsumit
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.2 Agenda  Overview of project  Hybrid pipeline for ingestion of SaaS sources into Hadoop  Hybrid pipeline to access Hadoop from Salesforce Cloud  Best practices and lessons learned
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.3 Overview of project
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.4 Need answers at marketing speed.. 1. Invite contacts to our webinar who identified specific interests with active opportunities and include related contacts to that same project? 2. Identify common ProductA pages that convert to ProductB evaluations (i.e. measure cross sell potential) 3. What web content did the 1100 webinar attendees view following the webinar? 4. Analyze lead histories to track accuracy in lead routing assignment (Salesforce limits values that can be reported against) 5. Identify which of our web content is most visited for sales opportunities that were closed/won? 6. Create a list based on content consumption and 2017 survey answers? 7. What content was consumed across our strategic accounts? 8. Who complained about a broken link in the survey? 9. …
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.5 Analytics Today Marketing Data Management Platform Embedded Insights Operational Insights Information Technology Data Warehouse Enterprise Reporting/Analytics Enterprise Data Integration LoB Desktop Analytics / Spreadmarts Sumologic Analytics Mixed access to Martech/IT analytics
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.6 Thus, we embarked on a Data Lake for Progress  Detailed data such as activities in Eloqua were not suitable to store in Corporate Data Warehouse or Marketing DMP  Data fragmented across sales CRM and service CX; marketing automation; web analytics; usage for cloud apps; survey platforms; webinar data, etc  Did not know what questions to ask in advance – how to define star schema?  Many analytics tools across Progress  Emerging data science expertise  Started with Pilot to start experimenting rather than wait to build a business case for funding
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.7 Overview of Progress Marketing Data Lake Progress Corporate Firewall
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.8 Hybrid pipeline for ingestion of SaaS sources into Hadoop
Data Collection Process
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.10 Sample Data: Oracle Eloqua Profiler data
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.11 Sample Data: CRM
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.12 Sample Data: Web traffic
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.13 Sample Data: Survey Other popular responses Linode Pironet Redhat OpenShift OpenStack Cloud Share Thomson Reuters Elektron SAP HANA Claro Cloud
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.14 Hybrid Environments can Limit Access to Data Locked behind the firewall Locked behind other clouds
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.15 Not trivial to get data across different marketing data sources Data Source API Eloqua Web Services API (REST/SOAP) Bulk and non-Bulk APIs No query language Oracle Service Cloud Web Services APIs (REST/SOAP) ROQL Google Analytics Hypercube (query limits of 10 metrics grouped by max of 7 dimensions) Veeva CRM SOAP, BULK, Metadata APIs SOQL
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.16 REST API for bulk export to support analytics
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.17 Hybrid pipeline for ingestion of cloud data sources to ground Hadoop system Define SaaS data model for data integration Optimize SQL request against SaaS APIs Mapping JDBC user auth to SaaS APIs Standard JDBC Client for Apache Sqoop
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.18 Sample Insights: Trends in 2016 for product usage following related events March & September June
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.19 Hybrid pipeline to access Hadoop from Salesforce Cloud
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.20 Connect Salesforce to Big Data Success Scoring Personalization Archived Insight 360 Reporting Corporate Firewall ?
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.21  Salesforce Connect maps Salesforce external objects to data tables in external systems. Instead of copying the data into your organization, Salesforce Connect accesses the data on demand and in real time. The data is never stale, and we access only what you need.  Recommended when: • You have a large amount of data that you don’t want to copy into your Salesforce organization. • You need small amounts of data at any one time. • You want real-time access to the latest data. Salesforce Connect integration for Big Data
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.22 An open protocol to allow the creation and consumption of queryable and interoperable RESTful APIs in a simple and standard way. OASIS Standard REST API (“SQL for the web”) Ratified as an OASIS standard February, 2014 Operations built on REST principles Uniform URL conventions Surface metadata in standard way Access requires OData endpoint First member to join OData Technical Committee
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.23 For Data Behind a Firewall, there is no Common Access Approach for Clouds 1. Network Based VPN 2. SSH Tunneling 3. Reverse Proxy Servers
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.24 Firewall Becoming Barrier for Hybrid Data Tech Adoption Source: The 2017 State of the Firewall” produced by Firemon
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.25 Hybrid pipeline for on-demand access to Hadoop system on the ground from Salesforce Cloud Publish OData endpoint for on-demand access On-premises data gateway for firewall friendly connection Mapping user auth to Hadoop ecosystem Reverse Engineer OData REST API entity data model
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.26 Firewall Friendly Architecture for On-premises Data Gateway On-premises data gateway for firewall friendly connection
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.27 Getting Started with Salesforce External Object Reports Report with data blended from Standard and External Objects (pulled on-demand from on-premises Data Lake)
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.28 Best practices and lessons learned
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.29 What worked in building the Data Lake Building advocacy for marketing data Brought together several teams across Product, Sales, Marketing Operations, Engineering, and others interested in the data available for analysis. Revenue attribution to content using activity data correlated across opportunities and content consumed. Analyze detailed CRM lead histories to measure lead routing effectiveness. Salesforce reports do not support analysis on values of detailed lead activity fields. Able to leverage SMEs to identify laser focused targets. I.e. which leads have the specific tech stack that the next webinar is targeting. Trends and Research Supplement existing CRM analytics Identify new and highly focused segments
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.30 Lessons Learned ingesting SaaS data into Hadoop Infrastructure On-premises Hadoop cluster was not ideal for LoB to manage and tune for the pilot, but we did not have approval to land LOB data in the cloud. Need additional support to go live. Data lake dumps raw data from source systems, so we end up with activity from automated tests that can skew data, for example. Schema changes to source objects impacted ingestion so that needs to be planned for with SaaS applications. Even with a standard pipeline in place, there are still limits that apply for initial load on certain objects as many APIs are primarily designed for application integration. Data Quality Metadata Data Integration
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.31 Lessons Learned accessing Hadoop from Salesforce 1. Mapping OData entities to Big Data objects 2. Primary keys for Big Data entities 3. HiveServer1 vs HiverServer2 for concurrency 4. External Objects have limits and 2 minute max timeout 5. Native Reporting support was added in Winter ‘17 6. Search considerations 7. Need agile OData service with Data Lake 8. Data Governance and Masking 9. CRM User Experience (strategies to improve performance)  Accessing external Big Data objects
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.32 Design Patterns for external objects  Enable Separate Loading of Related Lists of External Objects Performance Tuning Tips for Related Lists in Account
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.33 Decrease latency accessing Big Data over Hive  Tuning details 1. Use Apache Tez as execution engine for Hive 2. Use ORCfile, new storage format 3. Use vectorization query execution (Hive 0.13) 4. Performance Tuning (Partitions, Indexes, Buckets, Block Sizes, etc) 5. Consider another query interface (i.e. Apache Hawq)
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.34 Roadmap Sumit Sarkar Product Marketing @SAsInSumit linkedin.com/in/meetsumit
Building a Hybrid Data Pipeline for Salesforce and Hadoop

Building a Hybrid Data Pipeline for Salesforce and Hadoop

  • 1.
    Building a HybridData Pipeline for Salesforce and Hadoop Sumit Sarkar, Chief Data Evangelist for Progress @SAsInSumit www.linkedin.com/in/meetsumit
  • 2.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.2 Agenda  Overview of project  Hybrid pipeline for ingestion of SaaS sources into Hadoop  Hybrid pipeline to access Hadoop from Salesforce Cloud  Best practices and lessons learned
  • 3.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.3 Overview of project
  • 4.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.4 Need answers at marketing speed.. 1. Invite contacts to our webinar who identified specific interests with active opportunities and include related contacts to that same project? 2. Identify common ProductA pages that convert to ProductB evaluations (i.e. measure cross sell potential) 3. What web content did the 1100 webinar attendees view following the webinar? 4. Analyze lead histories to track accuracy in lead routing assignment (Salesforce limits values that can be reported against) 5. Identify which of our web content is most visited for sales opportunities that were closed/won? 6. Create a list based on content consumption and 2017 survey answers? 7. What content was consumed across our strategic accounts? 8. Who complained about a broken link in the survey? 9. …
  • 5.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.5 Analytics Today Marketing Data Management Platform Embedded Insights Operational Insights Information Technology Data Warehouse Enterprise Reporting/Analytics Enterprise Data Integration LoB Desktop Analytics / Spreadmarts Sumologic Analytics Mixed access to Martech/IT analytics
  • 6.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.6 Thus, we embarked on a Data Lake for Progress  Detailed data such as activities in Eloqua were not suitable to store in Corporate Data Warehouse or Marketing DMP  Data fragmented across sales CRM and service CX; marketing automation; web analytics; usage for cloud apps; survey platforms; webinar data, etc  Did not know what questions to ask in advance – how to define star schema?  Many analytics tools across Progress  Emerging data science expertise  Started with Pilot to start experimenting rather than wait to build a business case for funding
  • 7.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.7 Overview of Progress Marketing Data Lake Progress Corporate Firewall
  • 8.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.8 Hybrid pipeline for ingestion of SaaS sources into Hadoop
  • 9.
  • 10.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.10 Sample Data: Oracle Eloqua Profiler data
  • 11.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.11 Sample Data: CRM
  • 12.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.12 Sample Data: Web traffic
  • 13.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.13 Sample Data: Survey Other popular responses Linode Pironet Redhat OpenShift OpenStack Cloud Share Thomson Reuters Elektron SAP HANA Claro Cloud
  • 14.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.14 Hybrid Environments can Limit Access to Data Locked behind the firewall Locked behind other clouds
  • 15.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.15 Not trivial to get data across different marketing data sources Data Source API Eloqua Web Services API (REST/SOAP) Bulk and non-Bulk APIs No query language Oracle Service Cloud Web Services APIs (REST/SOAP) ROQL Google Analytics Hypercube (query limits of 10 metrics grouped by max of 7 dimensions) Veeva CRM SOAP, BULK, Metadata APIs SOQL
  • 16.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.16 REST API for bulk export to support analytics
  • 17.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.17 Hybrid pipeline for ingestion of cloud data sources to ground Hadoop system Define SaaS data model for data integration Optimize SQL request against SaaS APIs Mapping JDBC user auth to SaaS APIs Standard JDBC Client for Apache Sqoop
  • 18.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.18 Sample Insights: Trends in 2016 for product usage following related events March & September June
  • 19.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.19 Hybrid pipeline to access Hadoop from Salesforce Cloud
  • 20.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.20 Connect Salesforce to Big Data Success Scoring Personalization Archived Insight 360 Reporting Corporate Firewall ?
  • 21.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.21  Salesforce Connect maps Salesforce external objects to data tables in external systems. Instead of copying the data into your organization, Salesforce Connect accesses the data on demand and in real time. The data is never stale, and we access only what you need.  Recommended when: • You have a large amount of data that you don’t want to copy into your Salesforce organization. • You need small amounts of data at any one time. • You want real-time access to the latest data. Salesforce Connect integration for Big Data
  • 22.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.22 An open protocol to allow the creation and consumption of queryable and interoperable RESTful APIs in a simple and standard way. OASIS Standard REST API (“SQL for the web”) Ratified as an OASIS standard February, 2014 Operations built on REST principles Uniform URL conventions Surface metadata in standard way Access requires OData endpoint First member to join OData Technical Committee
  • 23.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.23 For Data Behind a Firewall, there is no Common Access Approach for Clouds 1. Network Based VPN 2. SSH Tunneling 3. Reverse Proxy Servers
  • 24.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.24 Firewall Becoming Barrier for Hybrid Data Tech Adoption Source: The 2017 State of the Firewall” produced by Firemon
  • 25.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.25 Hybrid pipeline for on-demand access to Hadoop system on the ground from Salesforce Cloud Publish OData endpoint for on-demand access On-premises data gateway for firewall friendly connection Mapping user auth to Hadoop ecosystem Reverse Engineer OData REST API entity data model
  • 26.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.26 Firewall Friendly Architecture for On-premises Data Gateway On-premises data gateway for firewall friendly connection
  • 27.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.27 Getting Started with Salesforce External Object Reports Report with data blended from Standard and External Objects (pulled on-demand from on-premises Data Lake)
  • 28.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.28 Best practices and lessons learned
  • 29.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.29 What worked in building the Data Lake Building advocacy for marketing data Brought together several teams across Product, Sales, Marketing Operations, Engineering, and others interested in the data available for analysis. Revenue attribution to content using activity data correlated across opportunities and content consumed. Analyze detailed CRM lead histories to measure lead routing effectiveness. Salesforce reports do not support analysis on values of detailed lead activity fields. Able to leverage SMEs to identify laser focused targets. I.e. which leads have the specific tech stack that the next webinar is targeting. Trends and Research Supplement existing CRM analytics Identify new and highly focused segments
  • 30.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.30 Lessons Learned ingesting SaaS data into Hadoop Infrastructure On-premises Hadoop cluster was not ideal for LoB to manage and tune for the pilot, but we did not have approval to land LOB data in the cloud. Need additional support to go live. Data lake dumps raw data from source systems, so we end up with activity from automated tests that can skew data, for example. Schema changes to source objects impacted ingestion so that needs to be planned for with SaaS applications. Even with a standard pipeline in place, there are still limits that apply for initial load on certain objects as many APIs are primarily designed for application integration. Data Quality Metadata Data Integration
  • 31.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.31 Lessons Learned accessing Hadoop from Salesforce 1. Mapping OData entities to Big Data objects 2. Primary keys for Big Data entities 3. HiveServer1 vs HiverServer2 for concurrency 4. External Objects have limits and 2 minute max timeout 5. Native Reporting support was added in Winter ‘17 6. Search considerations 7. Need agile OData service with Data Lake 8. Data Governance and Masking 9. CRM User Experience (strategies to improve performance)  Accessing external Big Data objects
  • 32.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.32 Design Patterns for external objects  Enable Separate Loading of Related Lists of External Objects Performance Tuning Tips for Related Lists in Account
  • 33.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.33 Decrease latency accessing Big Data over Hive  Tuning details 1. Use Apache Tez as execution engine for Hive 2. Use ORCfile, new storage format 3. Use vectorization query execution (Hive 0.13) 4. Performance Tuning (Partitions, Indexes, Buckets, Block Sizes, etc) 5. Consider another query interface (i.e. Apache Hawq)
  • 34.
    © 2016 ProgressSoftware Corporation and/or its subsidiaries or affiliates. All rights reserved.34 Roadmap Sumit Sarkar Product Marketing @SAsInSumit linkedin.com/in/meetsumit

Editor's Notes

  • #2 My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sumit Sarkar, Product Marketing Engineer at Progress, will discuss how the key to delivering on this was using standard interfaces using a bi-directional data pipeline to connect the systems. On the Salesforce side, we were able to get frictionless access to the data lake using clicks-not-code via OData. On the Hadoop side, we were able to ingest data from Salesforce using JDBC for Apache Sqoop. Join us to hear best practices and lessons learned.
  • #8 Database: Apache Hadoop (cluster has 4 nodes with 16TB of disk space) Interfaces: Hive 1.2.1 and Spark 2.0.0 Tuning: Apache Tez Execution Engine with ORC file storage format Data Loader: Apache Sqoop Data Access: Oracle Eloqua, Google Analytics and Salesforce JDBC connectors with the DataDirect Cloud service for use with Apache Sqoop Initial load size: 96 GB Sample Sqoop Scripts located here.
  • #14 1200 respondents telling us what they use!
  • #18 Optimization: Transparently leverage bulk/non-bulk APIs based on request Normalize complex data models Optimize push down to back end API – i.e. relationships, query language, etc. Approximate Object sizes in April 2017 Emailsend: 63 GB* Pageview: 42 GB* Salesforce Lead history: 21 GB * Growing at ~20-25 GB / yr
  • #21 We’ve got this cool big data lake doing really cool stuff … But how do we take this to the next level?? Decided to expose the insight from the lake for Sales and Service Org which effectively transformed Salesforce into a system of engagement for Big Data Sets.
  • #22 Don’t need to deal with storage or synchornization
  • #23 Bridge between External Objects and Big Data REST VERBS: GET, PUT, DELETE
  • #24 SSH: Free for developers to quickly get connected from PuTTY Not scalable and has security vulnerabilities with ssh keys. Requires SSH client and server must be configured to allow SSH connections. VPN: Trusted support from cloud provider Requires IT administration and does not work as well in SaaS model for deployed applications Reverse Proxy Servers: Implemented by networking professionals and helps achieve regulatory compliance Requires IT expertise and maintenance
  • #28 https://releasenotes.docs.salesforce.com/en-us/winter17/release-notes/rn_forcecom_external_data_reports.htm Report types aren’t created for external lookup or indirect lookup relationships, but you can create custom report types for these relationships.
  • #32 Limit JOIN across 4 objects 50,000 rows/hour by default – but can be configured for “High Volume External Data Source” which has other limits like not available via Salesforce1 https://help.salesforce.com/apex/HTViewHelpDoc?id=limits_external_data.htm Reporting: Each block in JOIN is separate call out Performance Tuning Tips for Related Lists: https://help.salesforce.com/HTViewSolution?id=000148978 5 Ways to make hive queries run faster: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ Use Apache Tez as execution engine for Hive Use ORCfile, new storage format Use vectorization query execution (Hive 0.13) Performance Tuning (Partitions, Indexes, Buckets, Block Sizes, etc) Consider another query interface (i.e. Apache Hawq)
  • #34 Stinger: http://hortonworks.com/blog/100x-faster-hive/
  • #35 The project is still run by marketing engineers, so we don’t get the perks of an IT driven project. We expect to continue expanding use and value to then elevate this to production and start looking to do more buying than building. Statistical analysis of detailed data Log data to study application characteristics Performance lab data