How to Build An AI Based Customer Data Platform: Learn the design patterns for Real Time Use Cases
Kumar Ramanathan, Gautam Gupta
September 2020

©2020 Intuit Inc. All rights reserved.
Introduction
Kumar Ramanathan, Gautam Gupta
Group Manager @ Intuit, Director Engineering @ Intuit

Intuit’s mission: Powering Prosperity around the World
● AI-driven expert platform
● 4.5M QBO customers

Personalize the journey for everyone in the ecosystem
1. What do we know about our customer?
2. What do our customers need?

High Level Architecture
● Read TPS: 20K
● FCI: 0.0003%
● TP99: 60 ms
● Ingest TPS: 60K
● Ingest latency TP99: 1 s (2 KB payloads)

Unlock Relationships and ML-Driven Insights
● Identity Resolution: Who is this user or visitor? What is their intent?
● Financial Ownership: What is the financial ownership of this user?
● Customer Journey: Where is this user in their journey with us/life?
Graph entities: visitor, user, company, financial attributes, financial ownership, offerings, journey stage, intent, persona, transactions
Capabilities: Graph Queries, Analytics & ML, Graph Mining

Why did we switch to TigerGraph?
● Developer-friendly platform
● Better 1-hop query performance & efficient multi-hop queries
● 77% reduction in AWS infrastructure costs; 10x fewer IOPS
● Excellent, responsive customer support
Summary: Improved OPEX Savings, Developer Friendly, Excellent Support

Use Case: Increase Sign-in Success Rate for Tax Prep
● If we leverage the Identity Graph in the Risk Scoring Service sign-in flow for TurboTax Online,
● then we can recognize more unknown visitors,
● so that we can provide a lower-friction sign-in experience for those visitors.
Goal: Increase sign-in success rate

Identity Graph: Stitching anonymous visitor to known user
Applications: Returning Customer Recognition, Frictionless Sign In/Up, Personalization
Graph: visitor, user; user <> visitor :: stitch
Input
● Clickstream: 159 columns x ∞ rows
● Users: 142M nodes
Model
● Pairwise binary classification
● Let: [equation not preserved in the transcript]
● Learn if a pair (IVID, UID) is “matched” to each other, where Θ is the parameter vector of the learned model
● Optimize the resulting quadratic complexity by selecting a subset of candidate pairs
● Final prediction function: [equation not preserved in the transcript]
  ● Choose the unique UID, if it exists: 99.9982%
  ● Rank multiple UID candidates: 98.8609%
Results
● Identity graph able to recognize ~4% more visitors
● Sign-in success rate for the unrecognized cohort went from 89% to 94%
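
The pairwise matching idea on this slide can be illustrated with a minimal sketch. It is not the model described in the talk: the features (`same_device`, `same_ip`), the weights in `THETA`, the logistic scoring function, and the candidate-selection rule are all hypothetical stand-ins, chosen only to show how restricting scoring to a candidate subset avoids comparing every visitor with every user.

```python
import math

# Hypothetical parameter vector; a real model would learn these from labeled pairs.
THETA = {"bias": -2.0, "same_device": 3.0, "same_ip": 1.5}


def features(visitor, user):
    """Build a feature vector for one (IVID, UID) pair (illustrative features only)."""
    return {
        "bias": 1.0,
        "same_device": 1.0 if visitor["device"] in user["devices"] else 0.0,
        "same_ip": 1.0 if visitor["ip"] in user["ips"] else 0.0,
    }


def match_probability(visitor, user, theta=THETA):
    """Logistic score that the pair is 'matched' (an assumed model form)."""
    z = sum(theta[name] * value for name, value in features(visitor, user).items())
    return 1.0 / (1.0 + math.exp(-z))


def candidates(visitor, users):
    """Select a subset of users to score: only those sharing a device or an IP."""
    return [
        u for u in users
        if visitor["device"] in u["devices"] or visitor["ip"] in u["ips"]
    ]


def stitch(visitor, users, threshold=0.5):
    """Return (uid, probability) pairs above the threshold, best match first."""
    scored = sorted(
        ((match_probability(visitor, u), u["uid"]) for u in candidates(visitor, users)),
        reverse=True,
    )
    return [(uid, p) for p, uid in scored if p >= threshold]


visitor = {"ivid": "v1", "device": "d1", "ip": "1.1.1.1"}
users = [
    {"uid": "u1", "devices": {"d1"}, "ips": {"1.1.1.1"}},
    {"uid": "u2", "devices": {"d2"}, "ips": {"1.1.1.1"}},
    {"uid": "u3", "devices": {"d3"}, "ips": {"9.9.9.9"}},
]
matches = stitch(visitor, users)
```

With these made-up inputs, `u3` is never scored because it shares nothing with the visitor, and `u2` is scored but falls below the threshold, so `matches` holds a single candidate. When several candidates pass the threshold, the list is already ranked by probability.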

Top 3 challenges for creating an Identity Graph
1. Readily Accessible Data: Publishing data from source systems through batch and eventing to stream processing infrastructure & data lake
2. Lack of Universal First-class Entities: Creating a universal definition of key entities like User, Visitor, Account, etc. across product lines
3. Entity Resolution & Attribute Normalization: Data across multiple sources is not resolved or normalized; doing so requires deterministic & predictive algorithms
Design Patterns

1. Data Movement: Platform not Pipelines
Why?
● Rapid increase of new data sources, and existing ones changing fast
● Domain ownership for publishing data
● High-quality, large-scale, domain-agnostic data infrastructure
How?
● Generic event processing pipeline with built-in configurable stages
● Standardized implementation of domain-agnostic stages
  ● Sessionization, geo-coding, entity standardization & resolution, encryption/decryption, compression, schema validation, authentication/security controls, governance & compliance checks, and many more
  ● Metadata repository with a beautiful UI for discoverability, lineage tracking & data trust
  ● Scalable operational platform: ability to consume data from batch and stream sources with built-in auto scaling, monitoring, alerting, error handling, etc.
● Support for adding custom stream computation stages for the unique needs of specialized pipelines, such as ML feature computation and dynamic traits computation
Example pipeline: Step 1 Authentication -> Step 2 Deduplication -> Step 3 Data Validation -> Step 4 Notification
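
A generic pipeline with configurable stages might look like the sketch below. It mirrors the four example steps on the slide (authentication, deduplication, data validation, notification), but the stage implementations, the event shape (`id`, `token`, `payload`), and the registry are assumptions made for illustration, not the platform's actual API.

```python
from typing import Callable, Optional

Event = dict
Stage = Callable[[Event], Optional[Event]]  # a stage returns None to drop the event


def authenticate(event: Event) -> Optional[Event]:
    # Hypothetical check; a real stage would verify credentials or signatures.
    return event if event.get("token") == "valid" else None


def make_deduplicator() -> Stage:
    seen_ids = set()

    def deduplicate(event: Event) -> Optional[Event]:
        if event["id"] in seen_ids:
            return None
        seen_ids.add(event["id"])
        return event

    return deduplicate


def validate(event: Event) -> Optional[Event]:
    # Minimal schema check: the payload must be a dict.
    return event if isinstance(event.get("payload"), dict) else None


def make_notifier(sink: list) -> Stage:
    def notify(event: Event) -> Optional[Event]:
        sink.append(event["id"])  # stand-in for publishing a notification
        return event

    return notify


def build_pipeline(stage_names, registry) -> Stage:
    """Assemble a pipeline from configuration: an ordered list of stage names."""
    stages = [registry[name] for name in stage_names]

    def run(event: Event) -> Optional[Event]:
        for stage in stages:
            event = stage(event)
            if event is None:  # a stage rejected the event; stop processing
                return None
        return event

    return run


delivered = []
registry = {
    "authenticate": authenticate,
    "deduplicate": make_deduplicator(),
    "validate": validate,
    "notify": make_notifier(delivered),
}
pipeline = build_pipeline(["authenticate", "deduplicate", "validate", "notify"], registry)

events = [
    {"id": "e1", "token": "valid", "payload": {"page": "home"}},
    {"id": "e1", "token": "valid", "payload": {"page": "home"}},   # duplicate
    {"id": "e2", "token": "bad", "payload": {"page": "pricing"}},  # fails authentication
    {"id": "e3", "token": "valid", "payload": "not-a-dict"},       # fails validation
]
results = [pipeline(e) for e in events]
```

Because the pipeline is built from a list of stage names, a new data source can reuse the standardized stages by supplying a different list, and a custom stage only needs to be added to the registry.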

2. Data Storage: Polyglot not Monolithic
Why?
● Performance at scale is critical in a real-time data platform serving customer experiences
● Shaping data to match access patterns allows for optimal and efficient access
● Tools for operating distributed systems and native NoSQL (KV, Search, Graph) DBs have matured
How?
Use the following patterns to create a data store that handles all the information about the customer:
● Leverage a KV data store for entities and attributes
● Use search-based persistence for search queries on the attributes
● Store relationships using a graph database
This solves a wide variety of use cases with low latency. Using the same database for entities, relationships, and search leads to performance bottlenecks and higher latencies.
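
The polyglot idea can be sketched with three in-memory structures standing in for the three kinds of store. This is only an illustration of shaping data per access pattern: the `CustomerStore` class, its method names, and the sample entity IDs are hypothetical, and a production system would back each structure with a dedicated KV, search, and graph database.

```python
from collections import defaultdict


class CustomerStore:
    """Facade over three stores, each shaped for one access pattern."""

    def __init__(self):
        self.kv = {}                           # entity id -> attributes (KV lookups)
        self.search_index = defaultdict(set)   # (attribute, value) -> entity ids (search)
        self.graph = defaultdict(set)          # entity id -> {(relation, neighbor id)}

    def put_entity(self, entity_id, attributes):
        # Remove stale index entries before writing the new attributes.
        for item in self.kv.get(entity_id, {}).items():
            self.search_index[item].discard(entity_id)
        self.kv[entity_id] = dict(attributes)
        for item in attributes.items():  # attribute values must be hashable
            self.search_index[item].add(entity_id)

    def relate(self, src, relation, dst):
        self.graph[src].add((relation, dst))

    def get(self, entity_id):
        return self.kv.get(entity_id)

    def search(self, attribute, value):
        return sorted(self.search_index.get((attribute, value), set()))

    def neighbors(self, entity_id, relation):
        return sorted(
            dst for rel, dst in self.graph.get(entity_id, set()) if rel == relation
        )


store = CustomerStore()
store.put_entity("user:1", {"state": "CA", "product": "QBO"})
store.put_entity("user:2", {"state": "CA"})
store.relate("visitor:9", "stitched_to", "user:1")
```

Each read path touches only the structure built for it: `get` hits the KV map, `search` hits the inverted index, and `neighbors` walks the adjacency sets.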

3. Data Access: Right-for-me not One-size-fits-all
Why?
● Data products are used in very different contexts
● Each context needs the right interface
How?
Support as many of the patterns below as possible:
● UI Widgets: Ability to data-enable products/features by embedding winning experiences quickly
● Request/Response: Provide direct API access for synchronous communication
● Pub/Sub: Publish CDC notifications/messages to consumers who store data for a specific use case
● Data Lake: To support offline model training or historical bootstrap for new pub/sub consumers
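
Three of these access patterns can be sketched over a single entity service. The `EntityService` class, its in-memory change log (a stand-in for the data lake), and the shape of the change records are assumptions for illustration; a real deployment would use an API layer, a message broker, and durable storage.

```python
class EntityService:
    """One data product exposed through request/response, pub/sub, and replay."""

    def __init__(self):
        self._data = {}
        self._subscribers = []
        self.change_log = []  # stand-in for a data lake of historical changes

    # Request/Response: synchronous read.
    def get(self, entity_id):
        return self._data.get(entity_id)

    # Pub/Sub: consumers register a callback for change notifications.
    def subscribe(self, callback):
        self._subscribers.append(callback)

    def upsert(self, entity_id, attributes):
        before = self._data.get(entity_id)
        after = {**(before or {}), **attributes}
        if after == before:
            return  # nothing changed, so no CDC message is published
        self._data[entity_id] = after
        change = {"entity_id": entity_id, "before": before, "after": after}
        self.change_log.append(change)
        for callback in self._subscribers:
            callback(change)

    # Data Lake: a new consumer bootstraps from history before going live.
    def replay(self, callback):
        for change in self.change_log:
            callback(change)


service = EntityService()
live_consumer = []
service.subscribe(live_consumer.append)

service.upsert("user:1", {"plan": "free"})
service.upsert("user:1", {"plan": "free"})   # no-op, not published
service.upsert("user:1", {"plan": "paid"})

late_consumer = []
service.replay(late_consumer.append)          # historical bootstrap
service.subscribe(late_consumer.append)       # then receive live changes
```

The late consumer ends up with the same two change records as the live one, which is the point of pairing a replayable history with the live stream.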

4. AI Toolchain: Deeply integrated not Bolted on
Why?
● AI models evolve over time and need access to new data
● AI data needs grow with the evolution of models
● AI models need real-time access to data in production
How?
Develop a self-serve mechanism for onboarding new AI models and the features they require:
● Create an input for a rich feature store from entities, attributes, and relationships
● Provide aggregation and functional formulation on attributes
● Capture feedback from AI models to provide a 360 view of model performance
● Solve for real-time availability of data to AI models
● Data Exploration -> Featurization -> Training -> Model Optimization -> Model Deployment -> Model Execution
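
Aggregations on attributes can be made self-serve by declaring features as data rather than code. The sketch below assumes a hypothetical definition format (`name`, `attribute`, `agg`) and a small set of aggregation functions; neither comes from the talk.

```python
from statistics import mean

# Aggregations a feature definition may refer to by name (illustrative set).
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}

# Hypothetical declarative feature definitions, the kind a model owner could
# register without writing new platform code.
FEATURE_DEFINITIONS = [
    {"name": "txn_amount_sum", "attribute": "amount", "agg": "sum"},
    {"name": "txn_amount_mean", "attribute": "amount", "agg": "mean"},
    {"name": "txn_count", "attribute": "amount", "agg": "count"},
    {"name": "max_items", "attribute": "items", "agg": "max"},
]


def compute_features(records, definitions=FEATURE_DEFINITIONS):
    """Apply each declared aggregation to the matching attribute of the records."""
    features = {}
    for definition in definitions:
        values = [
            r[definition["attribute"]] for r in records if definition["attribute"] in r
        ]
        aggregate = AGGREGATIONS[definition["agg"]]
        # Features with no input values are left empty instead of raising.
        features[definition["name"]] = aggregate(values) if values else None
    return features


transactions = [{"amount": 10}, {"amount": 20}, {"amount": 30}]
feature_vector = compute_features(transactions)
```

Adding a feature then means appending one definition; the same function can serve both offline training data and real-time requests, which keeps the two paths consistent.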

5. Data Entities: Self-serve not Product Backlog
Why?
● Time to market for new data products, or for expanding the features of existing ones, is an important driver of growth
How?
● Mindset: Data is a product. Domain teams think of data consumers as their customers, just like the end users of their products through the UI and other developers through their APIs.
● Producer and consumer work directly with each other to create value quickly using shared domain knowledge
● No new engineering work is needed in the domain-agnostic part of the platform to add new data entities or attributes, or to infer new relationships
● Self-describing semantics for data: a set of robust metadata capabilities that configure the platform behavior for specific use cases. Metadata determines which stages of the pipeline are executed and which version of business logic is run in a particular stage.
● You’re successful when non-technical business users like product managers are able to discover data, understand its business meaning, and expand the data set in a self-serve form
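
The statement that metadata determines which stages run, and which version of the business logic runs in each, can be sketched as follows. The entity types, stage names, versions, and transformation logic are all hypothetical examples, not the platform's real metadata schema.

```python
# Versioned stage implementations, keyed by (stage name, version).
STAGE_VERSIONS = {
    ("normalize_email", "v1"): lambda r: {**r, "email": r["email"].strip()},
    ("normalize_email", "v2"): lambda r: {**r, "email": r["email"].strip().lower()},
    ("mask_ssn", "v1"): lambda r: {**r, "ssn": "***-**-" + r["ssn"][-4:]},
}

# Self-describing metadata: each entity declares its meaning and its processing.
ENTITY_METADATA = {
    "user": {
        "description": "A signed-in customer",
        "stages": [
            {"stage": "normalize_email", "version": "v2"},
            {"stage": "mask_ssn", "version": "v1"},
        ],
    },
    "visitor": {
        "description": "An anonymous site visitor",
        "stages": [{"stage": "normalize_email", "version": "v1"}],
    },
}


def process(entity_type, record, metadata=ENTITY_METADATA):
    """Run only the stages, at the versions, that the entity's metadata declares."""
    for step in metadata[entity_type]["stages"]:
        record = STAGE_VERSIONS[(step["stage"], step["version"])](record)
    return record


user = process("user", {"email": " A@B.com ", "ssn": "123-45-6789"})
visitor = process("visitor", {"email": " A@B.com "})
```

Onboarding a new entity here is a metadata change only: a new key in `ENTITY_METADATA` that reuses existing stages, with no change to `process`.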
Thank You!
