Slides @ www.jakequist.com/go/dataengconf
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
Entity Resolution
Talk Structure Layer 1: Naive ER Layer 2: Graphical ER Layer 3: Big Data ER Layer 4: Temporal ER Layer 5: Learned ER
Naive ER
Entity Resolution ID Name Website Geo A Facebook facebook.com Menlo	Park,	CA B FB facebook.com CA C Joe's	Cookies joescookies.com San	Francisco,	CA Suppose we have the following data:
Entity Resolution Suppose we have the following data: ID Name Website Geo A Facebook facebook.com Menlo	Park,	CA B FB facebook.com CA C Joe's	Cookies joescookies.com San	Francisco,	CA D Joes	Cookies facebook.com San	Francisco,	CA
Entity Resolution Suppose we have the following data: ID Name Website Geo A Facebook facebook.com Menlo	Park,	CA B FB facebook.com CA C Joe's	Cookies joescookies.com San	Francisco,	CA D Joes	Cookies facebook.com San	Francisco,	CA E Joes	Cookies NULL New	York,	NY
Fundamental Concept Match entities on the similarity of their properties
Example: Company Similarity
Example: Company Similarity
Problems • What about when match arity != 2 • Entities can’t duplicate across matches • O(N^2) isn’t great either
Graphical ER
Think Like a Graph A B EC D ID Name Website Geo A Facebook facebook.com Menlo	Park, CA B FB facebook.com CA C Joe's	Cookies joescookies.com San	Francisco, CA D Joes	Cookies facebook.com San	Francisco, CA E Joes	Cookies NULL New	York,	NY
Think Like a Graph A B EC D 150 50 -100 -100 50 50 50 50 -150-150
Think Like a Graph A B EC D 150 50 -100 -100 50 50 50 50 -150-150
Key Concept: Cliques
Think Like a Clique A B EC D 150 50 -100 -100 50 50 50 50 -150-150 {A} {B} {C} {D} {E} {E, A} {E, B} {E, C} {E, D} {A, B} {A, C} {A, D} {B, C} {B, D} {C, D} {E, A, B} {E, A, C} {E, A, D} {E, B, C} {E, B, D} {E, C, D} {A, B, C} {A, B, D} {A, C, D} {B, C, D} {E, A, B, C} {E, A, B, D} {E, A, C, D} {E, B, C, D} {A, B, C, D} {E, A, B, C, D} possible cliques =>
Recurring Theme: Powerset
Scoring Cliques from above
Overlapping Cliques A B EC D A B EC D A = 0.75 B = 0.55
Overlapping Cliques An entity can’t belong to more than one clique. When we choose a clique, we must ensure no other cliques use any of those entities
Clique Choosing
Clique Choosing
Recap • Given a dataset of entities… • Take the powerset of those entities => every possible clique • Score all the cliques • In sorted order, choose the best cliques when no elements have been touched
ER on Bigger Data
• Get potential matches on the same machine • Avoid using powerset(n) for large n Challenges
Locality-Sensitive Hashing (LSH) Basic Idea: Use Map Reduce to get likely matches onto the same machines “Johnathon” “Sequoia Capital, LLC” [37.773972, -122.431297] “John” “Sequoia” [37.73, -122.43] “app.example.com” “example.com”
Locality-Sensitive Hashing
Locality-Sensitive Hashing
Problems • What if our entities have missing properties?
Locality-Sensitive Hashing Joe’s CookiesJoe’s Cookie’s joescookies.com joescookies.com A B C “Joe Cookie” “Joe Cookie” “” LSH on “name”
Multilevel LSH • Basic Idea: Use LSH multiple times on converging cliques
Joe’s CookiesJoe’s Cookie’s joescookies.com joescookies.com A B C “Joe Cookie” “Joe Cookie” “” LSN on “name” Joe’s Cookie’s joescookies.com joescookies.com Clique #3 Clique #2 “joescookies.com” “joescookies.com” LSN on “website” Clique #1
Clique Choosing • We now have all potential cliques, spread across the cluster • We now need to choose the best cliques? • Remember: But choosing one clique invalidates others • Fundamentally a Serial Algorithm!
Clique Choosing RDD[T].toLocalIterator() : Iterator[T] • Produces an iterator on the Driver that seamlessly iterates every partition
Clique Choosing
Clique Choosing uh oh
Challenge • We need to keep track of which entities we’ve “touched” • But using a HashSet means we will start eating a lot memory
Primer: Bloom Filters BloomFilter { def mightContain(T obj) def put(T obj) } example: 1 MB @ 0.5% error => 130 KB
Clique Choosing w/ Bloom Filters
Clique Choosing w/ Bloom Filters
Recap • Challenge: Get data to the right machine. Solution: Use Locality-Sensitive-Hashing • Challenge: Choose the best cliques. Solution: Use serial iterator and bloom-filters to keep memory low
Temporal ER
Temporal Entity Resolution T1 T2 Ms Sally Smith Mrs Sally Doe thefacebook.com facebook.com Zen Payroll Gusto
Temporal Entity Resolution A B Zen Payroll zenpayroll.com Gusto gusto.com -1000
Temporal Entity Resolution A B Zen Payroll zenpayroll.com +100 C Zen Payroll <=> Gusto zenpayroll.com <=> gusto.com Gusto gusto.com +100 -1000
Iterative Poison Pills • Basic Idea: Use ER techniques we’ve already established • Introduce “poison pills” that can break up cliques if temporal properties don’t match • Iteratively use the poison pills to match on increasingly temporally-aware entities
gusto.com (Payroll) 2016 Perform Regular ER gusto.com (Travel) 2010 gusto.com < 2015 gusto.com zenpayroll.com > 2015 zenpayroll.com (Payroll) 2014 A B C D E A, C, D, E B, E Kick Out Entities That Don’t Match Temporal Requirements A, D gusto.com < 2015 B, E gusto.com > 2015 zenpayroll < 2014 C, E gusto,2016 Perform Regular ER (now with more temporal fields available) A, C, D B, C, E Temporal Poison Pills
Temporal Entity Resolution • Very Computational Expensive • Requires Significant Tuning & Tweaking to Keep Tractable • Considered one of the Holy Grails of ER
Learned ER
Recap • Gorilla in the room: All of our scoring has been manual
Supervised Learning ER • Basic Idea: Use a training set to learn the weights in our scoring functions • Disclaimer: Only proceed with this if you have very complex scoring properties
Supervised Learning ER
Supervised Learning ER
More Learning Opts • Gradient Descent: What if we viewed the system as having overall “error”? We can then use Gradient Descent to find optimal solution. • Very very computationally intense
Questions? Thanks! jakequist@gmail.com
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark