How We Use FP to Find the Bad Guys
Richard Minerich, Director of R&D at Bayard Rock (@Rickasaurus)
Anatomy of Anti-Money Laundering
[Diagram: bank branches and correspondent banks check customer data against a bad-guy database built from news, sanctions, and watch lists, via onboarding, real-time lookup (efficient search), transaction monitoring (sparse information), batch scanning (O(N×M) result space), and risk calculation.]
Citation Network (Safe View)
Relationship Network (Safe View)
Entity Resolution in Theory
Given bank records $R_1, \dots, R_4$ and bad-guy records $T_1, \dots, T_4$, estimate $P(R_n \text{ represents the same entity as } T_m)$.
Example variations:
- Aggregating Products
- Finding Medical Records
- Resolving Paper Authors
- Census
- Finding Bad Guys
- Database Deduping
Different tradeoffs per domain.
The Pairwise Entity Resolution Process
Blocking: two datasets (customer data and bad-guys data) → pairs of somehow-similar records
Scoring: pairs of records → probability of representing the same entity
Review: records, probability, similarity features → true/false labels (mostly by hand)
Why Blocking?
▪ 100 million × 100 million = 10 quadrillion pairs
▪ 86,400,000 milliseconds per day
▪ At one pair per ms: ~116,000,000 days to compute (~317K years)
How do we beat N×M? Blocking algorithms.
"Blocks" are candidate pairs or clusters.
Input: source records R, target records T
Output: blocks of similar records, $B_i \subseteq R \times T$
Onboarding
▪ When a customer opens an account at a bank, an agent does a search
▪ As it is done by a human, errors and missing information are common
▪ Low-risk process, as bad guys may be caught in batch scanning later
▪ Blocking data structure is kept loaded in memory and queried against
▪ Results above some probability threshold are returned to the user, ordered by probability and risk
Transaction Monitoring
▪ SWIFT messages are passed on the internet of money
▪ Banks must process huge numbers of these
▪ Account information is often not accessible
▪ Messages are low-information compared to accounts
▪ Messages must leave within 24h of being received
▪ Similar to Onboarding (but with huge volumes and time constraints):
  ▪ Blocking data structure is kept loaded in memory and queried against
  ▪ Results below some probability/risk threshold are discarded
  ▪ Hits are manually reviewed in order of probability, risk, and timeliness
Batch Scanning
▪ Initially, all customer records vs. all bad-guy records
▪ Often hundreds of millions of customer records vs. ~3 million bad-guy records
▪ Incrementally, what we call Diff-Diff:
  ▪ All customer records vs. changed bad-guy records
  ▪ Changed customer records vs. all bad-guy records
▪ Computation is distributed across many beefy machines (~1 TB of RAM, 32 cores)
▪ Results are viewed in order of probability and risk, with some thresholding on very low probability or very low risk
Overall Model: Risk vs. Probability
[Chart: money-laundering risk on one axis, same-person probability on the other.]
Batch Scanning Result Chart
Typed Functional Programming is a natural fit
▪ Small components (i.e. functions) can be independently tested and reused
▪ Changes are unlikely to break other parts of the system (~3 bugs in production over 5 years)
▪ Code locality eases understanding of complex components
▪ Huge code reduction over standard object-oriented approaches
▪ Math reads like math, with proper operators and order of operations
Pipeline: Cleaning → Blocking → Featurization → Scoring → Slicing
“OLS” Regression via Gradient Descent in F#
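The slide's code is not reproduced in this transcript; here is a minimal F# sketch of the idea, fitting least-squares weights by batch gradient descent. The function names and learning-rate choice are illustrative, not the original code:

```fsharp
// Minimal batch gradient descent for least squares; illustrative only.
let dot (xs: float[]) (ys: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 xs ys

// One descent step: w <- w - rate * X^T (Xw - y) / n
let step rate (xs: float[][]) (ys: float[]) (w: float[]) =
    let n = float xs.Length
    let residuals = Array.map2 (fun x y -> dot x w - y) xs ys
    w |> Array.mapi (fun j wj ->
        let grad = Array.fold2 (fun acc (x: float[]) r -> acc + x.[j] * r) 0.0 xs residuals
        wj - rate * grad / n)

// Iterate from a zero weight vector for a fixed number of steps.
let fitOls rate iters (xs: float[][]) (ys: float[]) =
    Seq.fold (fun w _ -> step rate xs ys w)
             (Array.zeroCreate xs.[0].Length)
             (seq { 1 .. iters })
```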
Stages as (simplified) functions
  Cleaning:       Rec -> Rec
  Blocking:       PRecs -> CRecs -> (CRec, PRec) Sequence
  Featurization:  (CRec, PRec) -> float Vector
  Scoring:        float Vector -> Probability
  Slicing:        (PRec.Risk, Probability) -> Class Label
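Spelled out as F# signatures; the record and label types here are hypothetical stand-ins for the production ones:

```fsharp
// Illustrative types only; the real records are richer.
type Rec = Map<string, string>
type CRec = Rec                                // customer record
type PRec = { Fields: Rec; Risk: float }       // bad-guy profile record
type Probability = float
type ClassLabel = Match | Review | Discard

type Cleaning      = Rec -> Rec
type Blocking      = PRec seq -> CRec seq -> (CRec * PRec) seq
type Featurization = CRec * PRec -> float[]
type Scoring       = float[] -> Probability
type Slicing       = float * Probability -> ClassLabel
```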
Disgustingly Bad but Fairly Large Datasets
▪ Both wide (many fields) and tall (many records)
▪ From different systems (with different encodings)
▪ Missing data
▪ Poorly merged data
▪ Extra data
▪ Non-unique IDs
Every client is awful in a completely different way. Example record:
  NAME:    LARRY O BRIAN
  STATE:   CANADA
  CITY:    121 Buffalo Drive, Montreal, Quebec H3G 1Z2
  ADDRESS: NULL
  ZIP:     12345
  DOB:     10/24/80; 1/1/1979
Fighting Bad Data with Configurable Functional Subsystems
[Diagram: a Blocked Pair holds a CustRecord and a ListRecord, each a tree of Names, DOBs?, Countries?, States?, Cities?, …]
Functions on Record Tree Structure
Cleaning combinators operate on fields of the blocked-pair record tree:
• stripAccents
• stripCharacters
• replaceSubstring
• oneToManyFromFile
• isLocalCountry
Example pipeline: Hit.Cust.Names => stripAccents => oneToMany "nicknames.csv"
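A sketch of what such combinators might look like in F#. stripAccents here uses standard Unicode normalization; the real configurable subsystem is more involved, and cleanNames is a made-up example:

```fsharp
open System
open System.Globalization
open System.Text

// Remove diacritics by decomposing and dropping combining marks.
let stripAccents (s: string) =
    s.Normalize(NormalizationForm.FormD)
    |> Seq.filter (fun c ->
        CharUnicodeInfo.GetUnicodeCategory c <> UnicodeCategory.NonSpacingMark)
    |> Seq.toArray
    |> String

let replaceSubstring (pat: string) (rep: string) (s: string) = s.Replace(pat, rep)

// Field-level pipelines then compose with ordinary >>, e.g. over all names:
let cleanNames (names: string list) =
    names |> List.map (stripAccents >> replaceSubstring "." " ")
```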
Rebuilding with Quotations
[Diagram: the same blocked-pair record tree, rebuilt via F# code quotations.]
Barb, a simple .NET record query language (we use it for data cleaning and features)
  Name.Contains "John" and (Age > 20 or Weight > 200)
https://github.com/Rickasaurus/Barb
Barb for Cleaning, Queries, and Features on the Fly
The Pairwise Entity Resolution Process (recap)
Blocking: two datasets (customer data and bad-guys data) → pairs of somehow-similar records
Scoring: pairs of records → probability of representing the same entity
Review: records, probability, similarity features → true/false labels (mostly by hand)
Simplest: Key-based Blocking
• Nothing's easier than a table lookup!
• Many ways to key; choosing is hard
• Small errors can cause misses
• What about missing data?
Table: Peter Christen, Data Matching (2012)
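For instance, a minimal key-based blocker in F#, bucketing both sides by a cheap derived key; the surname-prefix key here is just an example choice, and it shows the weakness above (a typo in the first four letters of the surname means the pair is never generated):

```fsharp
// Compare only records whose blocking keys collide.
let blockingKey (name: string) =
    let surname = (name.Split ' ' |> Array.last).ToUpperInvariant()
    surname.Substring(0, min 4 surname.Length)

let keyBlock (customers: string seq) (badGuys: string seq) =
    let index = badGuys |> Seq.groupBy blockingKey |> Map.ofSeq
    customers
    |> Seq.collect (fun c ->
        match Map.tryFind (blockingKey c) index with
        | Some hits -> hits |> Seq.map (fun b -> c, b)
        | None -> Seq.empty)

// keyBlock ["Rich Minerich"] ["Richard MINERICH"; "John Smith"]
// yields only the pair ("Rich Minerich", "Richard MINERICH").
```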
Suffix Trees/Arrays
Input length n, search length m:
• Construction in O(n) [KS 2003]
• Search in O(m) [AKO 2004]
• Space is 4n bytes naively
• Compressed space O(n·H(T)) + o(n), where T is the input text
Newer: Compressed Compact Suffix Arrays
How do we introduce fuzziness?
Images care of: http://alexbowe.com/fm-index/
Canopy Clustering
1. Start with a set S of all records,
   - some cheap similarity metric f(r1, r2) : [0, 1]
   - some upper bound u < 1
   - some lower bound l < 1 (with l < u)
2. Take one record c out of S and put it in a new cluster C
3. For each record still in S, compare it to c via f:
   - if the score is higher than u, add it to C and remove it from S
   - if higher than l, add it to C and leave it in S
4. If S is not empty, go to 2
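The steps above translate almost directly to F#, treating f as a cheap similarity in [0, 1] (the metric and thresholds are left to the caller):

```fsharp
// Build canopies; records above u-similarity to the seed stop seeding
// new canopies, records between l and u may appear in several canopies.
let canopies (f: 'a -> 'a -> float) (u: float) (l: float) (records: 'a list) =
    let rec loop s acc =
        match s with
        | [] -> acc
        | c :: rest ->
            let canopy = c :: List.filter (fun r -> f c r > l) rest
            let remaining = List.filter (fun r -> f c r <= u) rest
            loop remaining (canopy :: acc)
    loop records []
```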
Canopy Clustering (cont.)
- Wait, isn't this O(n²)?
- How do we pick the thresholds?
- What might we miss if the upper threshold is less than 1?
Other approaches:
- Different functions for u and l
- Inverted indices
Blocking: Industry Concerns
▪ Can we predict what will and won't block with absolute certainty?
▪ Can it find matches across different fields, or in blobs of text?
▪ Can we improve the process if we find counterexamples?
▪ Will standard human errors mess up blocking?
▪ Does it scale to large data sizes on reasonable local hardware?
Many more ways to Block
▪ Sorted Neighborhood
▪ Various kinds of Q-gram indices
▪ Metric space embedding
▪ Semantic hashing
▪ Cluster-based approaches like Swoosh
Most are terrible in their own special way.
Pairwise Probability Distribution
[Chart: histogram of pairwise probabilities; a tiny bump of high-probability pairs sits above the upper threshold.]
The Basics of Pairwise Probability
[Diagram: institutional knowledge and data analysis drive model generation and model evaluation, fed by features, labels, past decisions, and the current observation.]
• Smart features and clean labels are most important
• Understandability is key
• The inference algorithm is secondary
Simplest: Empirical Summed Similarity
▪ F: feature functions $f_0 \dots f_m$, each $(a, b) \to [0, 1]$
▪ W: feature weights $w_0 \dots w_m$, each $\geq 0$
▪ $\mathrm{SimSum}(a, b) = \sum_{i=0}^{m} w_i \, f_i(a, b)$
Thresholds such that:
  Match:   SimSum(a,b) >= Upper
  Review:  Lower <= SimSum(a,b) < Upper
  Discard: SimSum(a,b) < Lower
Image via: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf
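As a hedged F# sketch (the feature functions, weights, and string labels are placeholders):

```fsharp
// Weighted sum of per-feature similarities, then threshold slicing.
let simSum (features: (('a * 'a) -> float)[]) (weights: float[]) pair =
    Array.fold2 (fun acc f w -> acc + w * f pair) 0.0 features weights

let slice upper lower score =
    if   score >= upper then "Match"
    elif score >= lower then "Review"
    else                     "Discard"
```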
Thresholds in Context
[Chart: score ranges for Discard, Review, and Automatic Match.]
Expectation Maximization for log odds via Fellegi-Sunter
Pros:
▪ Robust to missing data
▪ Easy enough to understand and well known
Cons:
▪ Needs careful sampling due to class imbalance
▪ Starting probabilities need to be chosen carefully (local optimization)
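A compressed sketch of one EM iteration for Fellegi-Sunter under conditional independence, over binary field-agreement vectors. The data layout and parameter names are assumptions for illustration, and as the slide warns, real use needs careful sampling and starting values:

```fsharp
// pairs.[k].[j] = whether pair k agrees on field j.
// p = match prevalence, m.[j] = P(agree on j | match), u.[j] = P(agree on j | non-match).
let emStep (pairs: bool[][]) (p: float) (m: float[]) (u: float[]) =
    let lik (probs: float[]) (g: bool[]) =
        Array.fold2 (fun acc pj gj -> acc * (if gj then pj else 1.0 - pj)) 1.0 probs g
    // E-step: posterior probability that each pair is a true match.
    let w =
        pairs |> Array.map (fun g ->
            let a = p * lik m g
            a / (a + (1.0 - p) * lik u g))
    // M-step: reestimate prevalence and per-field agreement rates.
    let reest (ws: float[]) =
        Array.init m.Length (fun j ->
            Array.fold2 (fun acc wi (g: bool[]) ->
                acc + (if g.[j] then wi else 0.0)) 0.0 ws pairs
            / Array.sum ws)
    Array.average w, reest w, reest (Array.map (fun wi -> 1.0 - wi) w)
```

Iterating emStep to convergence yields the m and u probabilities that Fellegi-Sunter turns into per-field log-odds weights.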
Other kinds of applicable inference
▪ Logistic Regression
▪ Support Vector Machines
▪ Bayesian Models
▪ Neural Networks
▪ Random Forests
Complex models are harder to explain than complex features.
Page Rank: The Easy Parts
Page Rank: The Hard Parts
▪ Domains, websites, pages in context
▪ Determining initial risk for sources
▪ 27 pages of data transformation code
▪ Fluctuation with no changes
▪ Prediction and explainability
Not hard: the algorithm.
Normalizing Page Rank for Humans
[Chart: PageRank values follow a power law. Picture: Donato et al., 2004]
Combining Ranking and Probability: Big Picture
Algorithms for Awful Data: String Matching
▪ Goal: robust and forgiving, with the fewest possible assumptions
Somewhat reasonable data: Rotational Alignment
Extremely awful data: Gale-Shapley
Baseline Function: Jaro-Winkler
  m = number of matching characters
  t = number of character transpositions
  |s1| = length of the first string
  |s2| = length of the second string
  l = number of characters that match at the start of the string, over the number considered
  p = proportion of the score given to the initial character matching
We use further tweaks on top of this for improved effectiveness.
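The formula itself did not survive this transcript; for reference, the standard Jaro and Jaro-Winkler definitions in terms of the quantities above are (with $t$ counted as half the transposed characters, $l$ capped at 4, and $p$ commonly 0.1 in the usual formulation):

$$\mathrm{Jaro}(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$
$$\mathrm{JW}(s_1, s_2) = \mathrm{Jaro}(s_1, s_2) + l \, p \, \bigl(1 - \mathrm{Jaro}(s_1, s_2)\bigr)$$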
Gale-Shapley for Stable Marriages: O(n²)
Input: beau tokens m in M, belle tokens w in W, comparison function f.
Let UM be the unattached beaux, UW the unattached belles, and P the pair set (initially empty).
1) Select a beau m from UM
2) m selects the w in W such that f(m, w) is maximized and w has not previously been selected by m
3) If w is in UW, remove m from UM and w from UW and add (m, w) to P;
   if a pair (m', w) exists and f(m, w) > f(m', w), then remove (m', w) from P, add m' to UM, and add (m, w) to P
4) If UM is not empty, go to 1
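A sketch of these steps in F# for token matching; the names and the list-based bookkeeping are illustrative, not the production code:

```fsharp
// Beaux propose to their best not-yet-tried belle; a belle trades up
// whenever a better-scoring beau proposes. Returns (beau, belle) pairs.
let stableMatch (f: string -> string -> float) (ms: string list) (ws: string list) =
    let rec go unattached (tried: Map<string, Set<string>>) (pairs: Map<string, string>) =
        match unattached with
        | [] -> pairs
        | m :: rest ->
            let seen = tried |> Map.tryFind m |> Option.defaultValue Set.empty
            match ws |> List.filter (fun w -> not (Set.contains w seen)) with
            | [] -> go rest tried pairs                      // m has run out of belles
            | candidates ->
                let w = candidates |> List.maxBy (f m)
                let tried = tried |> Map.add m (Set.add w seen)
                match pairs |> Map.tryFind w with
                | None -> go rest tried (Map.add w m pairs)  // w was unattached
                | Some m' when f m w > f m' w ->             // w trades up; m' re-enters
                    go (m' :: rest) tried (Map.add w m pairs)
                | Some _ -> go (m :: rest) tried pairs       // rejected; m tries again
    go ms Map.empty Map.empty |> Map.toList |> List.map (fun (w, m) -> m, w)
```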
Rotational Token Alignment
▪ Less forgiving than Gale-Shapley, but also less prone to egregious errors
▪ Also known as cyclic suborders of size k of a cyclic order of size n
▪ O(k·(n choose k))
Example rotations:
  Richard Thomas Minerich
  Minerich Richard Thomas
  Thomas Minerich Richard
Rotational Alignment (cont.)
▪ Pre-calculate the matrix of f(x, y) values
▪ Gosper's Hack for fast rotations
Gosper's Hack via: http://programmers.stackexchange.com/questions/67065/whats-your-favorite-bit-wise-technique
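Gosper's hack in F#: given a bitmask, it produces the next larger mask with the same popcount, which lets you walk all size-k token subsets cheaply. The enumeration example below is illustrative:

```fsharp
// Next integer with the same number of set bits (Gosper's hack).
let nextSubset (x: int) =
    let c = x &&& (-x)                 // lowest set bit
    let r = x + c                      // propagate the carry
    (((r ^^^ x) >>> 2) / c) ||| r      // shuffle trailing ones back down

// Example: all 2-of-4 selections, starting from 0b0011.
let all2of4 =
    0b0011
    |> Seq.unfold (fun x -> if x >= (1 <<< 4) then None else Some (x, nextSubset x))
    |> Seq.toList                      // [3; 5; 6; 9; 10; 12]
```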
Directions for Future Research
▪ Record pair population estimation
▪ Safe partial inference for tuning
▪ Prediction of future risk
▪ Collective entity resolution
▪ Mixed entity resolution-fraud detection models
Thank You! Questions?
Read more on my blog: http://richardminerich.com
Twitter: @Rickasaurus
Email: rick@bayardrock.com
NYC F# User Group: http://www.meetup.com/nyc-fsharp
Code on GitHub: http://github.com/BayardRock and http://github.com/Rickasaurus