Mastering Customer Data on Apache Spark BIG DATA WAREHOUSING MEETUP AWS LOFT APRIL 7, 2016
Agenda
6:30 Networking. Grab some food and drink... make some friends.
6:50 Joe Caserta, President, Caserta Concepts: Welcome + Intro to the BDW Meetup. About the Meetup; why MDM needs Graph now.
7:15 Kevin Rasmussen, Big Data Engineer, Caserta Concepts: Deeper dive into Spark/MDM/Graph. Deep dive into the Dunbar technology and how we came up with it for Customer Data Integration.
7:45 Vida Ha, Lead Solutions Engineer, Databricks: Using Apache Spark. Intro to the different components of Spark: MLlib, GraphX, SQL, Streaming, Python, ETL.
8:30 Q&A. Ask questions, share your experience.
About Caserta Concepts
• Consulting: data innovation and modern data engineering
• Award-winning company with an internationally recognized workforce
• Strategy, Architecture, Implementation, Governance
• Innovation Partner: Strategic Consulting, Advanced Architecture, Build & Deploy
• Leader in Enterprise Data Solutions: Big Data Analytics, Data Warehousing, Business Intelligence, Data Science, Cloud Computing, Data Governance
Client Portfolio: Retail/eCommerce & Manufacturing, Digital Media/AdTech, Education & Services, Finance, Healthcare & Insurance
Partners
Awards & Recognition
This Meetup
CIL - Caserta Innovations Lab Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meets monthly to share data best practices and experiences
• 3,700+ Members
http://www.meetup.com/Big-Data-Warehousing/
http://www.slideshare.net/CasertaConcepts/
Examples of Topics
• Data Governance, Compliance & Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics Techniques w/Datameer
• Using a Graph DB for MDM & Relationship Mgmt
• Data Science w/Revolution Analytics
• Processing 1.4 Trillion Events in Hadoop
• Building a Relevance Engine using Hadoop, Mahout & Pig
• Big Data 2.0 - YARN Distributed ETL & SQL w/Hadoop
• Intro to NoSQL w/10gen
MDM Information Ecosystem (diagram): Operational Master Data and Informational Master Data feed a Holistic Master Data Service; the surrounding systems include Leads, Policies, Claims, Enrolls, Sales, Finance, DW Dimensions & Cross-References, and Marketing Insights.
What is wrong with the traditional approach to MDM
• Conceptual problems with the "enterprise" approach
• Long, complex implementations → low ROI
• Complex data model
• Too much human interaction
• Deliverable???
• Challenges with big data:
• Data volumes
• Evolving data sources
• Need to further remove humans from the process
Graph Databases to the Rescue
• Hierarchical relationships are never rigid
• Relational models with tables and columns are not flexible enough
• Neo4j is the leading graph database
• Many MDM systems are going graph:
• Pitney Bowes - Spectrum MDM
• Reltio - Worry-Free Data for Life Sciences
How does a Graph DB help MDM
• Data is stored in its natural form → no mismatch between requirements and data model
• Both nodes and relationships can have properties → supports sparse and evolving data
• MDM for analytics → your MDM solution now delivers new enablement, not just a back-office system
• Relationship science
Open source and commercial tools: Gephi, Tom Sawyer, linkurio.us
Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit / Data Quality Sub-System
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end Streaming
• Recommendation Engine Optimization
• Relationship Intelligence / Spark Graph / CDI (Dunbar)
• CIL is hosted on
Introducing Dunbar (Relationship Intelligence / CDI)
Kevin Rasmussen, Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com
How many people do you know??
Anthropologists say it’s 150… MAX
What if we could increase this number?
Opportunities. Closing. Revenue.
Yeah, but that's CRM 101… How can we improve this?
Expand Data Sources. Explore Relationships. Enhanced Insight.
Project Dunbar - Internal
Developed in:
• Python
• Neo4j Database
• Build a social graph based on internal and external data
• Run pathing algorithms to understand strategic opportunity advantages
Whoa… not so FAST!
Throwing a bunch of unrelated points in a graph will not give us a usable solution. We need to MASTER our contact data…
• We need to clean and normalize our incoming interaction and relationship data (edges)
• Clean, normalize, and match our entities (vertices)
Mastering Customer Data
Customer Data Integration (CDI) is the process of consolidating and managing customer information from all available sources.
In other words… we need to figure out how to LINK people across systems!
Steps required:
• Standardization
• Matching
• Survivorship
• Validation
Traditional Standardization and Matching
Matching: join based on combinations of cleansed and standardized data to create match results.
Cleanse and Parse:
• Names: resolve nicknames; create a deterministic hash and a phonetic representation
• Addresses
• Emails
• Phone Numbers
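The deterministic-hash idea can be sketched in a few lines of plain Python. The tiny nickname map and the classic Soundex phonetic key below are illustrative stand-ins for the real parsing and hashing libraries a production pipeline would use:

```python
# Minimal sketch: nickname resolution + a classic Soundex phonetic key.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth", "kate": "katherine"}

def soundex(name):
    """Classic Soundex: first letter + up to three digits from consonant classes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    digits, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            digits.append(d)
        if ch not in "hw":          # h/w do not reset the previous code
            prev = d
    return (name[0].upper() + "".join(digits) + "000")[:4]

def match_key(raw_name):
    """Deterministic key for a 'first last' name: canonical first name + phonetic last name."""
    first, _, last = raw_name.strip().lower().partition(" ")
    first = NICKNAMES.get(first, first)
    return (first, soundex(last or first))
```

With this key, "Bill Smith" and "William Smyth" both resolve to `("william", "S530")`, so a later join can match them despite the spelling drift.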
Great – but the NEW data is different
Reveal
• Wait for the customer to "reveal" themselves
• Create a link between the anonymous self and the known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the graph
…and then things got bigger
One of our customers wanted us to build it for them:
• A much larger dataset: 6 million customers with a >30% duplication rate
• Hundreds of millions of customer interactions
• Few direct links across channels
Why Spark
"Big Box" MDM tools vs. ROI?
• Prohibitively expensive → limited by licensing $$$
• Typically limited to the scalability of a single server
We ♥ Spark!
• Local and distributed development are similar
• Beautiful high-level APIs
• Databricks cloud is so easy
• Full universe of Python modules
• Open source and free**
• Blazing fast!
Spark has become our default processing engine for a myriad of engineering problems.
Spark map operations
Cleansing, transformation, and standardization of both interaction and customer data.
Amazing universe of Python modules:
• Address parsing: usaddress, postal-address, etc.
• Name hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
And all the goodies of the standard library!
We can now parallelize our workload across a large number of machines.
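A minimal sketch of such a map operation, with hypothetical field names; the point is that the cleansing function is ordinary Python, and Spark just distributes it:

```python
import re

def standardize(record):
    """Cleanse one raw contact record into a linkable form.
    (Illustrative fields only; real records would also go through the
    address/name/gender parsing modules listed above.)"""
    return {
        "name":  " ".join(record.get("name", "").split()).lower(),   # collapse whitespace
        "email": record.get("email", "").strip().lower(),
        "phone": re.sub(r"\D", "", record.get("phone", ""))[-10:],   # keep last 10 digits
    }

# On the cluster, the same function is simply handed to Spark:
#   clean = sc.parallelize(raw_records).map(standardize)
```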
Matching process
• We now have clean, standardized, linkable data
• We need to resolve the links between our customers
• Large table self-joins
• We can even use SQL
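For illustration, a toy version of the self-join matching step, run here on sqlite3 so it is self-contained; on the cluster the same query shape would run through Spark SQL (table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, email TEXT, phone TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1234, "kev@example.com",   "2125550100"),
    (4849, "kev@example.com",   "9175550111"),
    (5499, "other@example.com", "9175550111"),
])

# Self-join on any shared cleansed attribute to emit match pairs.
matches = con.execute("""
    SELECT a.id, b.id,
           CASE WHEN a.email = b.email THEN 'email' ELSE 'phone' END AS match_type
    FROM customers a
    JOIN customers b
      ON a.id < b.id                      -- avoid self- and mirror-pairs
     AND (a.email = b.email OR a.phone = b.phone)
""").fetchall()
# sorted(matches) → [(1234, 4849, 'email'), (4849, 5499, 'phone')]
```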
Matching process
The matching process output gives us the relationships between customers:

xid   yid   match_type
1234  4849  phone
4849  5499  email
5499  1235  address
4849  7788  cookie
5499  7788  cookie
4849  1234  phone

Great, but it's not very usable: you need to traverse the dataset to find out that 1234 and 1235 are the same person (and this is a trivial case). And we still need to cluster and identify our survivors (vertices).
GraphX to the rescue
Don't think table… think graph! These matches are actually communities (in the example above, 1234, 4849, 5499, 7788, and 1235 form one community).
We just need to import our edges into a graph and "dump" out communities.
Connected components
The Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. This lowest-numbered vertex can serve as our "survivor" (not field survivorship).
Is it possible to write less code?
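GraphX runs connected components at cluster scale; the same idea can be shown stand-alone with a small union-find in plain Python, applied to the match pairs from the earlier slide:

```python
def connected_components(edges):
    """Label every vertex with the smallest id in its component
    (the same lowest-vertex convention described above)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # union by id so the root is always the smallest vertex
            parent[max(ra, rb)] = min(ra, rb)
    return {v: find(v) for v in parent}

# The match pairs from the matching-output slide:
edges = [(1234, 4849), (4849, 5499), (5499, 1235), (4849, 7788), (5499, 7788)]
labels = connected_components(edges)
# every record collapses to survivor 1234:
# labels → {1234: 1234, 4849: 1234, 5499: 1234, 1235: 1234, 7788: 1234}
```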
Field-level survivorship rules
We now need to survive fields onto our survivor to make it a "best record". Depending on the attribution we choose:
• Latest reported
• Most frequently used
We then do some simple ranking (thanks, window functions).
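A sketch of the latest-reported rule in plain Python, mimicking a ROW_NUMBER() OVER (PARTITION BY survivor ORDER BY reported DESC) window; the field names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def best_records(rows):
    """Per survivor, keep the latest-reported row as the 'best record'."""
    rows = sorted(rows, key=itemgetter("survivor", "reported"))
    # groupby needs the sort above; the last row in each group is the latest
    return {s: list(grp)[-1] for s, grp in groupby(rows, key=itemgetter("survivor"))}

rows = [
    {"survivor": 1234, "email": "old@example.com",  "reported": "2015-06-01"},
    {"survivor": 1234, "email": "new@example.com",  "reported": "2016-03-15"},
    {"survivor": 9001, "email": "solo@example.com", "reported": "2016-01-02"},
]
best = best_records(rows)
# best[1234]["email"] → 'new@example.com'
```

The most-frequently-used rule would swap the date ordering for a count per field value, but the partition-then-rank shape stays the same.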
What you really want to know...
With a customer dataset of approximately 6 million customers and hundreds of millions of data points, directly compared to traditional "big box" enterprise MDM software…
Was Spark faster?
Map operations like cleansing and standardization saw a 10x improvement on a 4-node r3.2xlarge cluster.
But the real winner was the "graph trick"…
The survivorship step was reduced from 2 hours to 5 minutes, and connected components ran in under 1 minute** (**including loading the RDDs from S3, building the graph, and running connected components).
Next Steps…
• GraphFrames
• In development by the Berkeley AMPLab
• Combines the full graph-algorithm nature of GraphX with the relationship mining of Neo4j
Thank You / Q&A
Kevin Rasmussen, Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com
