Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring
(#SecureBecauseMath)
Alex Pinto, Chief Data Scientist | MLSec Project
@alexcpsec @MLSecProject
whoami
• Alex Pinto, Chief Data Scientist at MLSec Project
• Machine Learning Researcher and Trainer
• Network security and incident response aficionado
• Tortured by SIEMs as a child
• Hacker Spirit Animal™: CAFFEINATED CAPYBARA
(https://secure.flickr.com/photos/kobashi_san/)
Agenda
• Security Singularity
• Some History
• TLA
• ML Marketing Patterns
• Anomaly Detection
• Classification
• Buyer's Guide
• MLSec Project
Security Singularity Approaches
(Side Note)
First hit on Google Images for "Network Security Solved" is a picture of Jack Daniel.
Security Singularity Approaches
• "Machine learning / math / algorithms… these terms are used interchangeably quite frequently."
• "Is behavioral baselining and anomaly detection part of this?"
• "What about Big Data Security Analytics?"
(http://bigdatapix.tumblr.com/)
Are we even trying?
• "Hyper-dimensional security analytics"
• "3rd generation Artificial Intelligence"
• "Secure because Math"
• Lack of ability to differentiate hurts buyers and investors.
• Are we even funding the right things?
Is this a communication issue?
Guess the Year!
• "(…) behavior analysis system that enhances your network intelligence and security by auditing network flow data from existing infrastructure devices"
• "Mathematical models (…) that determine baseline behavior across users and machines, detecting (…) anomalous and risky activities (…)"
• "(…) maintains historical profiles of usage per user and raises an alarm when observed activity departs from established patterns of usage for an individual."
A little history
• Dorothy E. Denning (professor at the Department of Defense Analysis at the Naval Postgraduate School)
• 1986 (SRI): first research that led to IDS
  • Intrusion Detection Expert System (IDES)
  • Already had statistical anomaly detection built in
• 1993: her colleagues release the Next Generation (!) IDES
Three Letter Acronyms - KDD
• After the release of Bro (1998) and Snort (1999), DARPA thought we were covered for this signature thing
• DARPA released datasets for user anomaly detection in 1998 and 1999
• And then came the KDD-99 dataset: over 6,200 citations on Google Scholar
Three Letter Acronyms
Three Letter Acronyms - KDD
Trolling, maybe?
Not here to bash academia
A Probable Outcome
[Meme: a grad-school freshman goes from "ZOMG RESULTS!!11!1!" to "ZOMG! RESULTS???" to "MATH, STAHP!" to "MATH IS HARD, LET'S GO SHOPPING"]
ML Marketing Patterns
• The "Has-beens"
  • Name is a bit harsh, but hey, you hardly use ML anymore, let us try it
• The "Machine Learning ¯\_(ツ)_/¯"
  • Hey, that sounds cool, let's put that in our brochure
• The "Sweet Spot"
  • People who are actually trying to do something
  • Anomaly Detection vs. Classification
Anomaly Detection
Anomaly Detection
• Works wonders for well-defined, "industrial-like" processes
• Looking at single, consistently measured variables (see the sketch below)
• Historical usage in financial fraud prevention
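To make the single-variable case concrete, here is a minimal sketch of a baseline-plus-threshold detector. The data, z-score approach, and threshold are illustrative assumptions I am adding, not something prescribed in the deck.

```python
# Minimal single-variable anomaly detection: learn a baseline, flag
# readings that sit too many standard deviations away from it.
# All numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
baseline = rng.normal(loc=100.0, scale=5.0, size=1000)  # historical "normal" readings
mu, sigma = baseline.mean(), baseline.std()

def is_anomalous(reading, z_threshold=3.0):
    """Flag a reading whose z-score against the baseline exceeds the threshold."""
    return abs(reading - mu) / sigma > z_threshold

print(is_anomalous(103.0))  # False: within normal variation
print(is_anomalous(140.0))  # True: far outside the baseline distribution
```

This works precisely because the process is consistent: one variable, one stable notion of "normal".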
Anomaly Detection
Anomaly Detection
• What fits this mold?
  • Network/NetFlow behavior analysis
  • User behavior analysis
• What are the challenges?
  • Curse of Dimensionality
  • Lack of ground truth and normality poisoning
  • Hanlon's Razor
AD: Curse of Dimensionality
• We need "distances" to measure the features/variables
  • Usually Manhattan or Euclidean
• For high-dimensional data, the distribution of distances between all pairwise points in the space becomes concentrated around an average distance (see the experiment below).
AD: Curse of Dimensionality
• The volume of the high-dimensional sphere becomes negligible in relation to the volume of the high-dimensional cube.
• The practical result is that everything just seems too far away, and at similar distances.
(http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A175670)
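A quick way to see this concentration effect is to sample random points and watch the spread of pairwise distances collapse as the dimension grows. This experiment is an illustration I am adding (it assumes scipy is available for the pairwise-distance computation):

```python
# Distance concentration: as dimensionality grows, the gap between the
# nearest and farthest pair of points shrinks relative to the average
# distance, so "nearest neighbor" stops meaning much.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    points = rng.uniform(size=(200, d))   # 200 random points in [0,1]^d
    dists = pdist(points)                 # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:>5}: relative spread of pairwise distances = {spread:.3f}")
```

The printed spread shrinks steadily with d: in two dimensions the farthest pair is several times the average distance away; in ten thousand dimensions every pair sits at nearly the same distance.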
A Practical Example
• NetFlow data, company with n internal nodes
• 2(n² − n) communication directions
• 2 × 2 × 2 × 65535 × (n² − n) measures of network activity
• 1000 nodes → half a trillion possible dimensions
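Checking the slide's arithmetic (the totals below are just the slide's own formulas evaluated; my reading of what the 2 × 2 × 2 × 65535 multiplier stands for is an assumption):

```python
# Back-of-the-envelope count of NetFlow dimensions for n = 1000
# internal nodes, straight from the slide's formulas. One plausible
# reading (an assumption) of the 2*2*2*65535 multiplier is
# protocol x bytes/packets x in/out x destination port.
n = 1000
directions = 2 * (n**2 - n)
dimensions = 2 * 2 * 2 * 65535 * (n**2 - n)
print(f"{directions:,} communication directions")   # 1,998,000
print(f"{dimensions:,} possible dimensions")        # 523,755,720,000 -- about half a trillion
```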
Breaking the Curse
• Different / creative distance metrics
• Organizing the space into sub-manifolds where Euclidean distances make more sense
• Aggressive feature removal (see the sketch below)
• A few interesting results available
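As a hedged illustration of two of these ideas, here is what aggressive feature removal and a low-dimensional projection can look like with scikit-learn; the synthetic data and the variance threshold are assumptions for the sake of the example.

```python
# Two ways to fight the curse: drop features that carry almost no
# signal, or project onto a low-dimensional subspace where Euclidean
# distances behave better. Synthetic data; thresholds are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2000))   # 2000 features, mostly uninformative
X[:, :10] *= 20                    # only 10 features carry real variance

X_kept = VarianceThreshold(threshold=50.0).fit_transform(X)
print(X_kept.shape)                # (500, 10): aggressive feature removal

X_proj = PCA(n_components=10).fit_transform(X)
print(X_proj.shape)                # (500, 10): low-dimensional projection
```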
Breaking the Curse
AD: Normality-Poisoning Attacks
• Ground Truth (labels) >> Features >> Algorithms
• There is little to no ground truth in AD (a toy poisoning simulation follows below)
  • What is "normal" in your environment?
• Problem asymmetry
  • Solutions are biased toward the prevalent class
  • Very hard to fine-tune; becomes prone to a lot of false negatives or false positives
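To show why the lack of ground truth is exploitable, here is a toy normality-poisoning simulation I am adding (all numbers are illustrative): a detector that keeps re-learning "normal" from recent traffic lets a slowly ramping attacker drag the baseline along, while a frozen baseline alerts early.

```python
# Toy normality-poisoning simulation: an attacker ramps up activity
# slowly enough that an adaptive baseline absorbs each step as
# "normal". Illustrative numbers only.
import numpy as np

rng = np.random.default_rng(7)
window = list(rng.normal(100, 5, size=50))      # recent "normal" readings
mu0, sigma0 = np.mean(window), np.std(window)   # frozen baseline, for contrast

adaptive_day = frozen_day = None
for day in range(120):
    reading = 100 + 0.5 * day + rng.normal(0, 5)   # +0.5/day attack ramp
    mu, sigma = np.mean(window), np.std(window)
    if adaptive_day is None and abs(reading - mu) > 3 * sigma:
        adaptive_day = day
    if frozen_day is None and abs(reading - mu0) > 3 * sigma0:
        frozen_day = day
    window.pop(0)
    window.append(reading)   # the poisoned reading becomes part of "normal"

print("frozen baseline first alert:   day", frozen_day)    # early
print("adaptive baseline first alert: day", adaptive_day)  # much later, if ever
```

The adaptive detector can never alert before the frozen one here, because every poisoned reading it accepts pushes its own mean toward the attacker.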
AD: Normality-Poisoning Attacks
AD: Hanlon's Razor
"Never attribute to malice that which is adequately explained by stupidity."
AD: Hanlon's Razor
Evil Hacker vs. Hipster Developer (a.k.a. Matt Johansen)
What about User Behavior?
• Surprise, it kinda works! (as supervised learning, that is)
  • As specific implementations for specific solutions
  • Good stuff from Square, AirBnB
  • Well-defined scope and labeling
• Can it be general enough?
  • File exfiltration example (are roles / information classification mandatory?)
  • Can I "average out" user behaviors across different applications?
Classification
Lots of Malware Activity
• Lots of available academic research around this
  • Classification and clustering of malware samples
• More success in classifying artifacts you already know to be malware than in actually detecting it (lineage)
• State of the art? My guess is the AV companies
  • All of them have an absurd number of samples
  • They have been researching and consolidating data on them for decades
Lots of Malware Activity
• Can we do better than "AV heuristics"?
• Lots and lots of available data that has been made public
• Some of the papers also suffer from potentially bad ground truth
Lots of Malware Activity
Everyone makes mistakes!
How is it going then, Alex?
• Private beta of our Threat Intelligence-based models:
  • Some use TI indicator feeds as blocklists
  • More mature companies use the feeds to learn about the threats (trained professionals only)
• Our models extrapolate the knowledge of existing threat intelligence feeds the way those experienced analysts would:
  • A supervised model with the same data an analyst has
  • Seeded labeling from TI feeds (see the sketch below)
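For flavor, here is a minimal sketch of what "seeded labeling from TI feeds" can look like. This is an assumption about the general shape of such a pipeline, not MLSec Project's actual implementation; the feature values are random stand-ins.

```python
# Supervised model whose labels are seeded from threat-intelligence
# feeds: feed indicators become positive examples, known-benign
# domains become negatives, and the model extrapolates from structural
# features instead of doing exact blocklist matches.
# Feature values are random stand-ins for real GeoIP/ASN/pDNS/WHOIS data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X_malicious = rng.normal(loc=1.0, size=(500, 4))   # indicators seen on TI feeds
X_benign = rng.normal(loc=-1.0, size=(500, 4))     # assumed-benign seeds

X = np.vstack([X_malicious, X_benign])
y = np.array([1] * 500 + [0] * 500)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a never-before-seen indicator: no feed lists it, but its
# structural features resemble the malicious class.
print(model.predict_proba(rng.normal(loc=0.8, size=(1, 4))))
```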
Yeah, but why should I care?
• Very effective first triage for SOCs and incident responders
  • Send us: log data from firewalls, DNS, web proxies
  • Receive: a report with a short list of potentially compromised machines
• Would you rather download all the feeds and integrate them yourself?
  • MLSecProject/Combine
  • MLSecProject/TIQ-test
What about the Ground Truth (labels)?
• Huge amounts of TI feeds available now (open/commercial)
• Non-malicious samples are still challenging, but we have expanded to many collection techniques from different sources:
  • Very high-ranked Alexa / Quantcast / OpenDNS domains (see the sketch below)
  • Random domains as seeds for a search of trust
  • Helped by the customer logs as well, in a semi-supervised fashion
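As a small illustration of the popularity-list idea: take the highest-ranked domains from a list like the Alexa top sites and treat them as benign seeds. The file name and the rank,domain CSV layout below are assumptions about how such lists were commonly distributed.

```python
# Seeding benign labels from a site-popularity list (e.g. an Alexa-style
# top-1m CSV). Highly ranked domains are assumed benign and become
# negative-class seeds. File name and "rank,domain" layout are
# assumptions about the list format.
import csv

def benign_seed_domains(csv_path="top-1m.csv", top_k=10000):
    """Return the top_k most popular domains as assumed-benign seeds."""
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)   # rows look like: rank,domain
        return [row[1] for _, row in zip(range(top_k), reader)]
```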
But what about data tampering?
• The vast majority of features are derived from structural/intrinsic data:
  • GeoIP, ASN information, BGP prefixes
  • pDNS information for the IP addresses and hostnames
  • WHOIS information
• An attacker can't change those things without cost.
• Log data from the customer can, of course, be tampered with. But this does not make the model any worse off than a human specialist (a feature sketch follows below).
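Here is a hedged sketch of what "structural features" for an IP indicator might look like; the lookup helpers are hypothetical placeholders for whatever GeoIP/BGP/pDNS/WHOIS sources a real pipeline would query.

```python
# Structural features for an IP indicator. The asn_lookup, pdns_count,
# and whois_age_days callables are hypothetical placeholders -- the
# point is that these attributes cost an attacker real money and time
# to change, unlike the contents of a log line.
def structural_features(ip, asn_lookup, pdns_count, whois_age_days):
    asn, bgp_prefix, country = asn_lookup(ip)   # who announces the address, and from where
    return {
        "asn": asn,
        "bgp_prefix": bgp_prefix,
        "geoip_country": country,
        "pdns_hostnames": pdns_count(ip),       # distinct hostnames seen resolving to this IP
        "whois_age_days": whois_age_days(ip),   # age of the registration/allocation
    }
```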
And what about false positives?
• False positives / false negatives are an intrinsic part of ML.
• "False positives are very good, and would have fooled our human analysts at first."
• Their feedback helps us improve the models for everyone.
• Remember, it is about initial triage. A Tier-2/Tier-3 analyst must investigate and provide feedback to the model.
Buyer's Guide
• 1) What are you trying to achieve by adding Machine Learning to the solution?
• 2) What are the sources of Ground Truth for your models?
• 3) How can you protect the features / ground truth from adversaries?
• 4) How do the solution, and the processes around it, handle false positives?
Buyer's Guide
#NotAllAlgorithms
MLSec Project
• Don't take my word for it! Try it out!
• Help us test and improve the models!
• Looking for participants and data sharing agreements
• Limited capacity at the moment, so be patient. :)
• Visit https://www.mlsecproject.org, message @MLSecProject, or just e-mail me.
Thanks!
• Q&A?
• Don't forget the feedback!
Alex Pinto
@alexcpsec
@MLSecProject
"We are drowning in information and starved for knowledge" - John Naisbitt