DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Slides @ www.jakequist.com/go/dataengconf

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

Talk Structure Layer 1: Naive ER Layer 2: Graphical ER Layer 3: Big Data ER Layer 4: Temporal ER Layer 5: Learned ER

Entity Resolution ID Name Website Geo A Facebook facebook.com Menlo Park, CA B FB facebook.com CA C Joe's Cookies joescookies.com San Francisco, CA Suppose we have the following data:

Entity Resolution Suppose we have the following data: ID Name Website Geo A Facebook facebook.com Menlo Park, CA B FB facebook.com CA C Joe's Cookies joescookies.com San Francisco, CA D Joes Cookies facebook.com San Francisco, CA

Entity Resolution Suppose we have the following data: ID Name Website Geo A Facebook facebook.com Menlo Park, CA B FB facebook.com CA C Joe's Cookies joescookies.com San Francisco, CA D Joes Cookies facebook.com San Francisco, CA E Joes Cookies NULL New York, NY

Fundamental Concept Match entities on the similarity of their properties

Problems • What about when match arity != 2 • Entities can’t duplicate across matches • O(N^2) isn’t great either

Think Like a Graph A B EC D ID Name Website Geo A Facebook facebook.com Menlo Park, CA B FB facebook.com CA C Joe's Cookies joescookies.com San Francisco, CA D Joes Cookies facebook.com San Francisco, CA E Joes Cookies NULL New York, NY

Think Like a Graph A B EC D 150 50 -100 -100 50 50 50 50 -150-150

Think Like a Clique A B EC D 150 50 -100 -100 50 50 50 50 -150-150 {A} {B} {C} {D} {E} {E, A} {E, B} {E, C} {E, D} {A, B} {A, C} {A, D} {B, C} {B, D} {C, D} {E, A, B} {E, A, C} {E, A, D} {E, B, C} {E, B, D} {E, C, D} {A, B, C} {A, B, D} {A, C, D} {B, C, D} {E, A, B, C} {E, A, B, D} {E, A, C, D} {E, B, C, D} {A, B, C, D} {E, A, B, C, D} possible cliques =>

Overlapping Cliques A B EC D A B EC D A = 0.75 B = 0.55

Overlapping Cliques An entity can’t belong to more than one clique. When we choose a clique, we must ensure no other cliques use any of those entities

Recap • Given a dataset of entities… • Take the powerset of those entities => every possible clique • Score all the cliques • In sorted order, choose the best cliques when no elements have been touched

• Get potential matches on the same machine • Avoid using powerset(n) for large n Challenges

Locality-Sensitive Hashing (LSH) Basic Idea: Use Map Reduce to get likely matches onto the same machines “Johnathon” “Sequoia Capital, LLC” [37.773972, -122.431297] “John” “Sequoia” [37.73, -122.43] “app.example.com” “example.com”

Problems • What if our entities have missing properties?

Locality-Sensitive Hashing Joe’s CookiesJoe’s Cookie’s joescookies.com joescookies.com A B C “Joe Cookie” “Joe Cookie” “” LSH on “name”

Multilevel LSH • Basic Idea: Use LSH multiple times on converging cliques

Joe’s CookiesJoe’s Cookie’s joescookies.com joescookies.com A B C “Joe Cookie” “Joe Cookie” “” LSN on “name” Joe’s Cookie’s joescookies.com joescookies.com Clique #3 Clique #2 “joescookies.com” “joescookies.com” LSN on “website” Clique #1

Clique Choosing • We now have all potential cliques, spread across the cluster • We now need to choose the best cliques? • Remember: But choosing one clique invalidates others • Fundamentally a Serial Algorithm!

Clique Choosing RDD[T].toLocalIterator() : Iterator[T] • Produces an iterator on the Driver that seamlessly iterates every partition

Challenge • We need to keep track of which entities we’ve “touched” • But using a HashSet means we will start eating a lot memory

Primer: Bloom Filters BloomFilter { def mightContain(T obj) def put(T obj) } example: 1 MB @ 0.5% error => 130 KB

Clique Choosing w/ Bloom Filters

Recap • Challenge: Get data to the right machine. Solution: Use Locality-Sensitive-Hashing • Challenge: Choose the best cliques. Solution: Use serial iterator and bloom-ﬁlters to keep memory low

Temporal Entity Resolution T1 T2 Ms Sally Smith Mrs Sally Doe thefacebook.com facebook.com Zen Payroll Gusto

Temporal Entity Resolution A B Zen Payroll zenpayroll.com Gusto gusto.com -1000

Temporal Entity Resolution A B Zen Payroll zenpayroll.com +100 C Zen Payroll <=> Gusto zenpayroll.com <=> gusto.com Gusto gusto.com +100 -1000

Iterative Poison Pills • Basic Idea: Use ER techniques we’ve already established • Introduce “poison pills” that can break up cliques if temporal properties don’t match • Iteratively use the poison pills to match on increasingly temporally-aware entities

gusto.com (Payroll) 2016 Perform Regular ER gusto.com (Travel) 2010 gusto.com < 2015 gusto.com zenpayroll.com > 2015 zenpayroll.com (Payroll) 2014 A B C D E A, C, D, E B, E Kick Out Entities That Don’t Match Temporal Requirements A, D gusto.com < 2015 B, E gusto.com > 2015 zenpayroll < 2014 C, E gusto,2016 Perform Regular ER (now with more temporal fields available) A, C, D B, C, E Temporal Poison Pills

Temporal Entity Resolution • Very Computational Expensive • Requires Signiﬁcant Tuning & Tweaking to Keep Tractable • Considered one of the Holy Grails of ER

Recap • Gorilla in the room: All of our scoring has been manual

Supervised Learning ER • Basic Idea: Use a training set to learn the weights in our scoring functions • Disclaimer: Only proceed with this if you have very complex scoring properties

More Learning Opts • Gradient Descent: What if we viewed the system as having overall “error”? We can then use Gradient Descent to ﬁnd optimal solution. • Very very computationally intense

Questions? Thanks! jakequist@gmail.com

DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

More Related Content

Viewers also liked

Similar to DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

More from Hakka Labs

Recently uploaded

DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark