Max Melnick, Deloitte Consulting LLP Massive-Scale Entity Resolution Using Spark + Graph #UnifiedAnalytics #SparkAISummit
About Me
• Passion for building tech products
• Engineering Lead / Architect / Developer
• Spark Certified Developer
• Based in Washington, DC
• UVA Systems Engineering
• Love sports, travel, cooking/eating, and listening to podcasts
maxmelnick.com | maxmelnick@gmail.com | linkedin.com/in/maxmelnick
MissionGraph™ is an open-architecture data integration, enhancement, and exploration platform that powers massive-scale analysis.
Agenda
• Entity Resolution (ER) Overview
• Spark + Graph ER Solution Walkthrough
  – Technical Architecture
  – Example Patterns
• Graph gotchas and tips
ER enables analytics
ER Use-Cases
• Customer 360
• Fraud Detection
• Network Analysis
• Recommendation Engines
Logical ER Flow
Simple ER Example
Simple ER Example (cont.)
Simple ER Example (cont.)
Simple ER Example (cont.)
ER is hard
• Difficult to scale algorithms vertically (more of the same data) or horizontally (new types of data)
• Prohibitively expensive to compare each record with every other record
• Heterogeneous datasets
• Data lacks strong keys
• Difficult to manage changes over time
• Similarity varies significantly across types of entities, languages, etc.
• Data quality issues
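To make the cost of all-pairs comparison concrete, here is a minimal plain-Python sketch (not code from the talk). It counts the comparisons needed for n records and contrasts that with comparing only records that share a blocking key. The record fields and the helper names `all_pairs_count` and `blocked_pairs` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons if every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Return only the pairs of record ids that share the same value for `key`."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        value = rec.get(key)
        if value is not None:  # records without the key cannot be blocked on it
            blocks[value].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample records (illustrative data only).
records = {
    "r1": {"name": "Jon Smith", "phone": "555-0100"},
    "r2": {"name": "John Smith", "phone": "555-0100"},
    "r3": {"name": "Ann Lee", "phone": "555-0199"},
    "r4": {"name": "A. Lee", "phone": None},
}

print(all_pairs_count(len(records)))          # comparisons without blocking
print(sorted(blocked_pairs(records, "phone")))  # comparisons with blocking on phone
print(all_pairs_count(1_000_000))             # growth is quadratic in record count
```

Blocking on a single key also shows the weakness the slide hints at: r3 and r4 are never compared because r4 lacks a phone number, which is why candidate selection usually needs more than one attribute.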
Improve ER with Spark + Graph
Spark + Graph = Better ER
Technical Architecture
Flexible graph candidate selection
The flexibility of a graph lets you easily add new attributes to your candidate selection query.
Flexible graph candidate selection – Spark GraphFrames query
Flexible graph candidate selection – query by phone
Flexible graph candidate selection – query by phone
[Side-by-side panels: GraphFrames | SparkSQL]
Flexible graph candidate selection – query by phone or address
Flexible graph candidate selection – query by phone or address
• GraphFrames: same candidate selection query
• SparkSQL: candidate selection query changes
Flexible graph candidate selection – query by phone or address or email
Flexible graph candidate selection – query by phone or address or email
• GraphFrames: same candidate selection query
• SparkSQL: candidate selection query changes
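The point of the preceding slides can be sketched without Spark. This is a plain-Python illustration, not the GraphFrames query from the talk. Records are linked to attribute nodes, and the candidate selection step asks only "which records share a neighbor?". Adding address or email changes which edges are loaded, not the selection logic. The helper names `build_edges` and `candidate_pairs` and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(records, attribute_types):
    """Create (record_id, attribute_node) edges for the chosen attribute types."""
    edges = []
    for rec_id, rec in records.items():
        for attr in attribute_types:
            value = rec.get(attr)
            if value:
                # The attribute node is identified by its type and value.
                edges.append((rec_id, (attr, value)))
    return edges


def candidate_pairs(edges):
    """Pairs of records that share any attribute node.

    Note that this function never mentions phone, address, or email.
    """
    by_attr = defaultdict(set)
    for rec_id, attr_node in edges:
        by_attr[attr_node].add(rec_id)
    pairs = set()
    for rec_ids in by_attr.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


# Hypothetical sample records (illustrative data only).
records = {
    "r1": {"phone": "555-0100", "address": "1 Main St", "email": "a@example.com"},
    "r2": {"phone": "555-0100"},
    "r3": {"address": "1 Main St"},
    "r4": {"phone": "555-0199", "email": "a@example.com"},
}

by_phone = candidate_pairs(build_edges(records, ["phone"]))
by_phone_address = candidate_pairs(build_edges(records, ["phone", "address"]))
by_all = candidate_pairs(build_edges(records, ["phone", "address", "email"]))

print(sorted(by_phone))
print(sorted(by_phone_address))
print(sorted(by_all))
```

In a SQL formulation, each new attribute typically means another join condition or another UNION branch, which is the "candidate selection query changes" side of the comparison.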
Simplify entity canonicalization
Simplify entity canonicalization (cont.)
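The canonicalization slides are image-only in this transcript, so the following is an assumption about one common approach rather than the speaker's method: treat confirmed matches as edges, find connected components, and use one identifier per component as the canonical entity id. The sketch uses a small union-find in plain Python; the function name `canonicalize` and the choice of the smallest record id as the canonical id are illustrative.

```python
def canonicalize(record_ids, matches):
    """Map every record id to a canonical id shared by its connected component.

    `matches` is an iterable of (record_id, record_id) pairs judged to be
    the same entity. The canonical id is the smallest record id in the group.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always keep the smaller id as the root, so the root of a
            # component is its minimum record id.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {r: find(r) for r in record_ids}


# Hypothetical match results (illustrative data only).
record_ids = ["r1", "r2", "r3", "r4", "r5", "r6"]
matches = [("r2", "r3"), ("r1", "r2"), ("r5", "r4")]

canonical = canonicalize(record_ids, matches)
print(canonical)
```

Transitive matches fall out naturally: r1 and r3 were never matched directly, yet both resolve to the same canonical id through r2.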
Graph context helps when data is limited
Graph context helps when data is limited (cont.)
Graph context helps when data is limited (cont.)
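These slides are also image-only here, so what follows is a hedged guess at the kind of signal meant by "graph context": when two records have few attributes of their own, the overlap between their graph neighborhoods can still serve as evidence. The sketch below scores that overlap with Jaccard similarity. The function `neighbor_jaccard`, the node labels, and the data are all hypothetical.

```python
def neighbor_jaccard(adjacency, a, b):
    """Jaccard similarity of the neighbor sets of nodes a and b.

    Returns 0.0 when neither node has any neighbors.
    """
    neighbors_a = adjacency.get(a, set()) - {b}
    neighbors_b = adjacency.get(b, set()) - {a}
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical graph: records linked to addresses, phones, and related people.
adjacency = {
    "r1": {"addr:1 Main St", "person:Ann Lee"},
    "r2": {"addr:1 Main St", "person:Ann Lee", "phone:555-0100"},
    "r3": {"addr:9 Oak Ave"},
}

print(neighbor_jaccard(adjacency, "r1", "r2"))  # two of three neighbors shared
print(neighbor_jaccard(adjacency, "r1", "r3"))  # nothing shared
```

A score like this would normally be one feature among several in the match decision, not a decision on its own.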
Graph gotchas
• Supernodes
• Graph adoption learning curve
• Not a silver bullet
• Less streaming support than traditional SQL-based workflows
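Supernodes, the first gotcha above, are nodes with a very large number of edges, such as a placeholder phone number shared by thousands of records. The slide does not say how to handle them. One common mitigation, shown here as an assumption in plain Python, is to drop attribute nodes whose degree exceeds a threshold before generating candidate pairs. The function name `drop_supernodes` and the threshold value are illustrative.

```python
from collections import Counter


def drop_supernodes(edges, max_degree):
    """Remove edges that point at attribute nodes with more than max_degree edges."""
    degree = Counter(attr_node for _, attr_node in edges)
    return [(rec_id, attr_node) for rec_id, attr_node in edges
            if degree[attr_node] <= max_degree]


# Hypothetical edges: five records share a placeholder phone number,
# while two records share a real email address.
edges = [
    ("r1", ("phone", "000-0000")),
    ("r2", ("phone", "000-0000")),
    ("r3", ("phone", "000-0000")),
    ("r4", ("phone", "000-0000")),
    ("r5", ("phone", "000-0000")),
    ("r1", ("email", "a@example.com")),
    ("r2", ("email", "a@example.com")),
]

filtered = drop_supernodes(edges, max_degree=3)
print(filtered)
```

Without the filter, the placeholder phone alone would generate 10 candidate pairs from 5 records, and the count grows quadratically with the supernode's degree.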
Graph tip #1: Persist graph at scale
Graph tip #2: Debug visually
• .show() on GraphFrame vertex and edge DataFrames
vs.
• View in DSE Studio (graph must be persisted in DSE Graph): easier to understand visually
Graph tip #3: Is it a graph problem?
Graph is great for…
• Connecting many different types of data
• Performing analysis over an indeterminate number of hops
Alternatives to consider
• Fuzzy search / programmable indexes -> search engine
• Simple, static joins on homogeneous data -> SQL
• Hybrid (graph + SQL/search/etc.)
Code for this presentation: https://github.com/maxmelnick/spark-graph-er
Recap
• ER enables many analytics use-cases
• ER is hard, but Spark + Graph = Improved ER
Thank You!
maxmelnick.com | maxmelnick@gmail.com | linkedin.com/in/maxmelnick

This publication contains general information only, and none of the member firms of Deloitte Touche Tohmatsu Limited, its member firms, or their related entities (collective, the “Deloitte Network”) is, by means of this publication, rendering professional advice or services. Before making any decision or taking any action that may affect your business, you should consult a qualified professional adviser. No entity in the Deloitte Network shall be responsible for any loss whatsoever sustained by any person who relies on this publication.

As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of the legal structure of Deloitte USA LLP, Deloitte LLP and their respective subsidiaries. Certain services may not be available to attest clients under the rules and regulations of public accounting.

Copyright © 2019 Deloitte Development LLC. All rights reserved. Member of Deloitte Touche Tohmatsu Limited

Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
