Predictive Model and Record Description Using Segmented Sensitivity Analysis
By Greg Makowski – http://www.LinkedIn.com/in/GregMakowski
Global Big Data Conference, Santa Clara – http://globalbigdataconference.com/santa-clara/global-software-architecture-conference/schedule-86.html
Friday, June 9, 2017
Benefits
Describe the most important data inputs to a model:
• What is driving the forecast?
• Good communication is a competitive advantage
During model building – use it to improve the model.
Use it to detect data drift – when a model refresh is needed.
For each record, what are the reasons for the forecast?
“3 Reasons Why Data Scientist Remains the Top Job in America” – Infoworld, 4/14/17
In 2015: 11k to 19k data scientists existed. Now: on LinkedIn, 13.7k OPEN POSITIONS (89% more positions in 2 years).
Reason #1: There’s a shortage of talent
• “Business leaders are after professionals who can not only understand the numbers, but also communicate their findings effectively.”
Reason #2: Organizations face challenges in organizing data
• “Data preparation accounts for 80% of the work of Data Scientists”
Reason #3: The need for data scientists is no longer restricted to tech giants
http://www.infoworld.com/article/3190008/big-data/3-reasons-why-data-scientist-remains-the-top-job-in-america.html#tk.drr_mlt
Algorithm Design Objectives
1. Describe the model in terms of variables understandable to the target audience
2. Be independent of the algorithm (e.g. Neural Net, SVM, Extreme Gradient Boosting, Random Forests…)
3. Support describing an arbitrary ensemble of models
4. Pick up non-linearities in the variables
5. Pick up interaction effects
6. Understand the model system in a very local way
Set Client Expectations
I understand completely how a bicycle works… however, I still drive a car to work.
A certain level of detail is NOT needed. Do you find out why the automotive engineer picked X mm for the diameter of the cylinders?
You can learn enough detail to let the model drive your business.
Sensitivity Analysis (OAT – One At a Time)
https://en.wikipedia.org/wiki/Sensitivity_analysis
An arbitrarily complex data mining system maps S source fields to a target field. To measure the delta in the forecast:
• Present record N, S times – each time with one input 5% bigger (a fixed input delta)
• Record the delta change in the output, S times per record
• Aggregate: average(abs(delta)) – the target change per input field delta
For source fields with binned ranges, sensitivity tells you the importance of each range, i.e. “low”, …, “high”. Sensitivity values can be put in pivot tables or clustered. Record-level “reason codes” can be extracted from the most important bins that apply to the given record.
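The OAT procedure above can be sketched as a short function. This is a minimal sketch, not from the slides: `model_fn` is a stand-in for any trained black-box scorer (neural net, SVM, ensemble…), and the 5% perturbation matches the fixed input delta described above.

```python
import numpy as np

def oat_sensitivity(model_fn, X, delta=0.05):
    """One-At-a-Time sensitivity: perturb each input column by +5%,
    re-score, and record the change in the forecast.
    Returns an (n_records, n_fields) matrix of output deltas, plus the
    per-field importance: average(abs(delta)) over all records."""
    base = model_fn(X)                       # baseline forecasts
    n, p = X.shape
    deltas = np.empty((n, p))
    for j in range(p):
        X_pert = X.copy()
        X_pert[:, j] *= (1.0 + delta)        # fixed relative input delta
        deltas[:, j] = model_fn(X_pert) - base
    importance = np.abs(deltas).mean(axis=0)
    return deltas, importance
```

Because only the outputs of `model_fn` are used, the same loop works unchanged for an arbitrary ensemble of models, satisfying design objectives 2 and 3.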
5 Example Sensitivity Records
The intermediate table of sensitivities has one row per record and one column per variable (Delta 1, Delta 2, …, Delta N): the changes in the forecasted target variable after multiplying each input by 1.05, One At a Time (OAT).
Both Positive and Negative Effects
Changes within the variable range (Neural Net model 3) – example raw values for the top 12 variables.
• Standard deviation can be another ranking metric
• Abs = the total width over negative and positive changes
Both Positive and Negative Effects
Changes within the variable range (Neural Net model 3):
• Avg(negative values) by variable
• Avg(positive values) by variable
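The slide's signed summaries can be computed directly from an OAT delta matrix (records × variables). A minimal sketch, assuming the matrix comes from a prior sensitivity pass:

```python
import numpy as np

def signed_effect_summary(deltas):
    """Summarize an OAT delta matrix: avg of negative deltas, avg of
    positive deltas, standard deviation (an alternative ranking metric),
    and Abs = total width over negative and positive changes."""
    deltas = np.asarray(deltas, dtype=float)
    neg = np.where(deltas < 0, deltas, 0.0)
    pos = np.where(deltas > 0, deltas, 0.0)
    n_neg = np.maximum((deltas < 0).sum(axis=0), 1)  # avoid divide-by-zero
    n_pos = np.maximum((deltas > 0).sum(axis=0), 1)
    return {
        "avg_negative": neg.sum(axis=0) / n_neg,  # avg(negative values) by variable
        "avg_positive": pos.sum(axis=0) / n_pos,  # avg(positive values) by variable
        "std_dev": deltas.std(axis=0),            # another ranking metric
        "abs_width": deltas.max(axis=0) - deltas.min(axis=0),
    }
```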
Data Mining Project Overview
Define business objectives and project plan during the Knowledge Discovery Workshop.
Select the “Analysis Universe” data; include holdout verification data.
Repeat through the model loop (1–3 times, ~2 weeks each):
• Exploratory Data Analysis (EDA)
• Transformation (preprocessing)
• Build model – dozens or hundreds of models (data mining)
• Evaluate and explain the model – use a business metric
Score or deploy the model on the “Forecast Universe”.
Track results; refresh or rebuild the model, subdividing or refining as needed.
(Timeline figure: the analysis past vs. the forecasted future around a Reference Date, with days per sprint for each step.)
From Data Mining to Knowledge Discovery in Databases, 1996 – https://www.csd.uwo.ca/faculty/ling/cs435/fayyad.pdf
During the Data Mining Project, at the End of the First Sprint
Sprint 1: basic data preprocessing and cleanup. At the end (before Sprint 2):
• Perform sensitivity analysis to rank variables
Sprint 2, start:
• Now you have quantitative feedback on the most important variables
• Start working on more detailed knowledge representation
• Check variable interactions
“More data beats clever algorithms, but BETTER DATA beats more data”
– Peter Norvig, Director of Research at Google, Fellow of the Association for the Advancement of Artificial Intelligence
Higher Level Detectors
Illustrated as rules, but typically implemented as functions producing a continuous score. “Higher level” or compound detectors:
– Group one of many signals into an overall behavior issue (using NLP tags):
    if (hide communications identity with email alias) or
       (hide communication subject with code phrase)
    then hiding_comm on date_time X = 0.2
– Group many low-level alerts in a short time:
    if (5 <= failed login attempts) and (3 minutes <= time window)
    then possible password guessing = 0.3
    else if (20 <= failed login attempts) and (5 minutes <= time window)
    then possible password guessing = 0.7
– Compare different levels of context (possibly from different source systems):
    if (4 <= sum(over=week, event=hiding_comm)) and   # sum smaller detector over time
       (3 <= comm network size(hiding_comm)) and      # network analysis
       (manager not in(network(hiding_comm)))         # reporting hierarchy
    then escalating comm secrecy = 0.8                # threshold distance increases the score
Analogy:
• A defense attorney debates plausible innocence; a prosecuting attorney debates guilt
• Some detectors see the plausible “best case” (to reduce false alerts); other detectors see the “worst case” in each record
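The "many low-level alerts" rule can be sketched as a tiny scoring function. Two labeled assumptions: the slide writes "3 minutes <= time window", but the usual intent of a password-guessing rule is "attempts within a short window", so this sketch reads it that way; and the stronger rule is checked first so the higher score wins. The 0.3/0.7 scores are the slide's illustrative values, not tuned parameters.

```python
def password_guessing_score(failed_attempts: int, window_minutes: float) -> float:
    """Compound detector returning a continuous risk score in [0..1].
    Reads the slide's rule as 'N failed attempts within M minutes'."""
    # Check the stronger rule first so the higher score wins.
    if failed_attempts >= 20 and window_minutes <= 5:
        return 0.7
    if failed_attempts >= 5 and window_minutes <= 3:
        return 0.3
    return 0.0
```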
Want to Capture COMPLEX Interactions
All this complex variation is incredibly helpful!
Capture “Data Drift” Over Time – Behavior Changes (pricing, competition)
Training data vs. current scoring data.
Think about what you want the model to generalize over; capture behavior VARIETY:
• satellite images only during the afternoon
• Christmas or vacation spending spikes
The best model is limited by fitting the TRAINING data surface. Do you have a large enough sample per behavior pocket?
“Non-stationary data” DOES change between training time and scoring time.
Tracking Model Drift
MODEL DRIFT DETECTOR in N dimensions:
• Change in the distribution of the most important input fields
• Diagnose CAUSES – what is changing, and how much
• Out of the top 25% of the most important input fields, which had the largest change?
(Figure: the distribution of important variable X, where Y=15, changes from one peak in the training data to two peaks in the scoring data.)
Capture “Data Drift” Over Time – Behavior Changes (pricing, competition)
Use the training data as the baseline:
• Create 20 equal-frequency bins of the forecast variable (5.0% per bin)
• Save the original training bin thresholds
Check the scored data over time (i.e. daily, monthly) with a Chi-Square or KS statistic to measure the slow changes.
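The baseline-vs-scored comparison above can be sketched end to end. A minimal sketch using only numpy; the chi-square statistic is computed by hand against the expected 5%-per-bin baseline (in practice you might use scipy's chisquare or KS test instead).

```python
import numpy as np

def drift_statistic(train_scores, new_scores, n_bins=20):
    """Data-drift check: build 20 equal-frequency bins (5% each) on the
    TRAINING forecasts, freeze those thresholds, then compare a new
    scoring batch against the expected 5%-per-bin baseline with a
    chi-square statistic. Larger values mean more drift."""
    # Equal-frequency thresholds from the training forecasts.
    edges = np.quantile(train_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf    # catch out-of-range scores
    observed = np.histogram(new_scores, bins=edges)[0].astype(float)
    expected = np.full(n_bins, len(new_scores) / n_bins)
    return ((observed - expected) ** 2 / expected).sum()
```

Saving `edges` once at training time and re-running only the histogram step daily or monthly matches the slide's "save the original training bin thresholds" advice.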
Description Per Record: Use Segments of Variable Ranges
• Reason codes are specific to the model and record
• Ranked predictive fields:

  Field                    record 1 (Mr. Smith)   record 2 (Mrs. Jones)
  max_late_payment_120d    0                      0
  max_late_payment_90d     1                      0
  bankrupt_in_last_5_yrs   1                      0
  max_late_payment_60d     0                      1

• Mr. Smith’s reason codes include: max_late_payment_90d = 1, bankrupt_in_last_5_yrs = 1
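Per-record reason codes like Mr. Smith's can be extracted by ranking one record's row of the sensitivity table. A minimal sketch; the field names and `top_k` cutoff are illustrative, not from the slides:

```python
def record_reason_codes(field_names, record_deltas, top_k=2):
    """Rank this record's OAT sensitivity deltas by absolute size and
    report the top fields as the record's reason codes."""
    order = sorted(range(len(field_names)),
                   key=lambda j: abs(record_deltas[j]), reverse=True)
    return [field_names[j] for j in order[:top_k]]
```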
Description Per Record
We need “reasons” that apply to some people (records) but not others. A given variable has some value for everybody, so we need sub-ranges that only apply to some people, e.g.:
• Very Low, Low, Medium, High, Very High
• Create 5 “bins”, with a roughly equal number of records per bin
• Focus on the sub-ranges or bins that have the highest sensitivity
Questions? Greg_Makowski@yahoo.com
5. Model Training Demo/Lab with HMEQ (Home Equity) Data
Line of credit loan application, using the existing home as loan equity. 5,960 records.

  COLUMN    DATA ROLE    DESCRIPTION
  rec_ID    Key          Record ID or key field, for each line of credit loan or person
  BAD       Target       After 1 year, loan went into default (=1, 20%) vs. still being paid (=0)
  CLAGE     Applicant    Credit line age, in months (for another credit line)
  CLNO      Applicant    Credit line number
  DEBTINC   Applicant    Debt-to-income ratio
  DELINQ    Applicant    Number of delinquent credit lines
  DEROG     Applicant    Number of major derogatory reports
  JOB       Applicant    Job, 6 occupation categories
  LOAN      Loan applic  Requested loan amount
  MORTDUE   Property     Amount due on existing mortgage
  NINQ      Applicant    Number of recent credit inquiries
  REASON    Loan applic  “DebtCon” = debt consolidation, “HomeImp” = home improvement
  VALUE     Property     Value of current property
  YOJ       Applicant    Years on present job

https://inclass.kaggle.com/c/pred-411-2016-04-u2-bonus-hmeq/data?heloc.csv
Rules or Queries to Detectors – Simple Example
Hard detector (result field has 0 or 1 values):
    Select 1 as detect_prospect
    where (.6 < recency) and (.7 < frequency) and (.3 < time)
Fuzzy detector (result has 100’s of values in the [0..1] range):
    Select recency + frequency + time as detect_prospect
    where (.6 < recency) and (.7 < frequency) and (.3 < time)
Develop “fuzzy” detectors, with results in [0..1].
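The fuzzy detector can be sketched as a plain function. One labeled assumption: the slide's raw sum recency + frequency + time can exceed 1, so this sketch divides by 3 to keep the score in [0..1]; the .6/.7/.3 thresholds are the slide's illustrative cut points, and inputs are assumed already scaled to [0..1].

```python
def fuzzy_prospect_score(recency, frequency, time,
                         thresholds=(0.6, 0.7, 0.3)):
    """Fuzzy version of the prospect query: instead of a hard 0/1,
    return a graded score in [0..1] when all thresholds are passed."""
    r_t, f_t, t_t = thresholds
    if recency > r_t and frequency > f_t and time > t_t:
        return (recency + frequency + time) / 3.0  # rescale sum into [0..1]
    return 0.0
```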
Compound Detectors
Implemented as a lookup table (in this case, the same for all people):
• This illustrates the process of creating a detector
• Let’s not debate specific values now – we don’t need perfection; dozens of reasonable detectors are powerful
• If a user is failing login attempts over more applications, that is more suspicious (virus intrusion?)
• Joe failed logging in over 3 applications, 8 times in 5 minutes → failed_log_risk = 0.6


Editor's Notes
• Slide 15 (1 min; running total 7:32): There are many other details around this – e.g. how to handle sparse data?