Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring
(#SecureBecauseMath)
Alex Pinto, Chief Data Scientist | MLSec Project
@alexcpsec @MLSecProject
whoami
• Alex Pinto, Chief Data Scientist at MLSec Project
• Machine Learning Researcher and Trainer
• Network security and incident response aficionado
• Tortured by SIEMs as a child
• Hacker Spirit Animal™: CAFFEINATED CAPYBARA
(https://secure.flickr.com/photos/kobashi_san/)
Agenda
• Security Singularity
• Some History
• TLA
• ML Marketing Patterns
• Anomaly Detection
• Classification
• Buyer's Guide
• MLSec Project
Security Singularity Approaches
(Side Note)
First hit on Google Images for "Network Security Solved" is a picture of Jack Daniel.
Security Singularity Approaches
• "Machine learning / math / algorithms… these terms are used interchangeably quite frequently."
• "Is behavioral baselining and anomaly detection part of this?"
• "What about Big Data Security Analytics?"
(http://bigdatapix.tumblr.com/)
Are we even trying?
• "Hyper-dimensional security analytics"
• "3rd generation Artificial Intelligence"
• "Secure because Math"
• Lack of ability to differentiate hurts buyers and investors.
• Are we even funding the right things?
Is this a communication issue?
Guess the Year!
• "(…) behavior analysis system that enhances your network intelligence and security by auditing network flow data from existing infrastructure devices"
• "Mathematical models (…) that determine baseline behavior across users and machines, detecting (…) anomalous and risky activities (…)"
• "(…) maintains historical profiles of usage per user and raises an alarm when observed activity departs from established patterns of usage for an individual."
A little history
• Dorothy E. Denning (professor at the Department of Defense Analysis at the Naval Postgraduate School)
• 1986 (SRI): first research that led to IDS
  • Intrusion Detection Expert System (IDES)
  • Already had statistical anomaly detection built in
• 1993: her colleagues release the Next Generation (!) IDES
Three Letter Acronyms - KDD
• After the release of Bro (1998) and Snort (1999), DARPA thought we were covered for this signature thing
• DARPA released datasets for user anomaly detection in 1998 and 1999
• And then came the KDD-99 dataset: over 6,200 citations on Google Scholar
Three Letter Acronyms
Three Letter Acronyms - KDD
Trolling, maybe?
Not here to bash academia
A Probable Outcome
[Meme: a grad-school freshman goes from "ZOMG RESULTS!!11!1!" to "ZOMG! RESULTS???" to "MATH, STAHP!" to "MATH IS HARD, LET'S GO SHOPPING"]
ML Marketing Patterns
• The "Has-beens"
  • Name is a bit harsh, but hey, you hardly use ML anymore, let us try it
• The "Machine Learning ¯\_(ツ)_/¯"
  • Hey, that sounds cool, let's put that in our brochure
• The "Sweet Spot"
  • People who are actually trying to do something
  • Anomaly Detection vs. Classification
Anomaly Detection
Anomaly Detection
• Works wonders for well-defined, "industrial-like" processes
• Looking at single, consistently measured variables (see the sketch below)
• Historical usage in financial fraud prevention
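To make the single-variable case concrete, here is a minimal sketch of a baseline-plus-threshold detector. The data, z-score approach, and threshold are illustrative assumptions I am adding, not something prescribed in the deck.

```python
# Minimal single-variable anomaly detection: learn a baseline, flag
# readings that sit too many standard deviations away from it.
# All numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
baseline = rng.normal(loc=100.0, scale=5.0, size=1000)  # historical "normal" readings
mu, sigma = baseline.mean(), baseline.std()

def is_anomalous(reading, z_threshold=3.0):
    """Flag a reading whose z-score against the baseline exceeds the threshold."""
    return abs(reading - mu) / sigma > z_threshold

print(is_anomalous(103.0))  # False: within normal variation
print(is_anomalous(140.0))  # True: far outside the baseline distribution
```

This works precisely because the process is consistent: one variable, one stable notion of "normal".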
Anomaly Detection
Anomaly Detection
• What fits this mold?
  • Network/NetFlow behavior analysis
  • User behavior analysis
• What are the challenges?
  • Curse of Dimensionality
  • Lack of ground truth and normality poisoning
  • Hanlon's Razor
AD: Curse of Dimensionality
• We need "distances" to measure the features/variables
  • Usually Manhattan or Euclidean
• For high-dimensional data, the distribution of distances between all pairwise points in the space becomes concentrated around an average distance (see the experiment below).
AD: Curse of Dimensionality
• The volume of the high-dimensional sphere becomes negligible in relation to the volume of the high-dimensional cube.
• The practical result is that everything just seems too far away, and at similar distances.
(http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A175670)
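A quick way to see this concentration effect is to sample random points and watch the spread of pairwise distances collapse as the dimension grows. This experiment is an illustration I am adding (it assumes scipy is available for the pairwise-distance computation):

```python
# Distance concentration: as dimensionality grows, the gap between the
# nearest and farthest pair of points shrinks relative to the average
# distance, so "nearest neighbor" stops meaning much.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    points = rng.uniform(size=(200, d))   # 200 random points in [0,1]^d
    dists = pdist(points)                 # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:>5}: relative spread of pairwise distances = {spread:.3f}")
```

The printed spread shrinks steadily with d: in two dimensions the farthest pair is several times the average distance away; in ten thousand dimensions every pair sits at nearly the same distance.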
A Practical Example
• NetFlow data, company with n internal nodes
• 2(n² − n) communication directions
• 2 × 2 × 2 × 65535 × (n² − n) measures of network activity
• 1000 nodes → half a trillion possible dimensions
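Checking the slide's arithmetic (the totals below are just the slide's own formulas evaluated; my reading of what the 2 × 2 × 2 × 65535 multiplier stands for is an assumption):

```python
# Back-of-the-envelope count of NetFlow dimensions for n = 1000
# internal nodes, straight from the slide's formulas. One plausible
# reading (an assumption) of the 2*2*2*65535 multiplier is
# protocol x bytes/packets x in/out x destination port.
n = 1000
directions = 2 * (n**2 - n)
dimensions = 2 * 2 * 2 * 65535 * (n**2 - n)
print(f"{directions:,} communication directions")   # 1,998,000
print(f"{dimensions:,} possible dimensions")        # 523,755,720,000 -- about half a trillion
```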
Breaking the Curse
• Different / creative distance metrics
• Organizing the space into sub-manifolds where Euclidean distances make more sense
• Aggressive feature removal (see the sketch below)
• A few interesting results available
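As a hedged illustration of two of these ideas, here is what aggressive feature removal and a low-dimensional projection can look like with scikit-learn; the synthetic data and the variance threshold are assumptions for the sake of the example.

```python
# Two ways to fight the curse: drop features that carry almost no
# signal, or project onto a low-dimensional subspace where Euclidean
# distances behave better. Synthetic data; thresholds are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2000))   # 2000 features, mostly uninformative
X[:, :10] *= 20                    # only 10 features carry real variance

X_kept = VarianceThreshold(threshold=50.0).fit_transform(X)
print(X_kept.shape)                # (500, 10): aggressive feature removal

X_proj = PCA(n_components=10).fit_transform(X)
print(X_proj.shape)                # (500, 10): low-dimensional projection
```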
Breaking the Curse
AD: Normality-Poisoning Attacks
• Ground Truth (labels) >> Features >> Algorithms
• There is little to no ground truth in AD (a toy poisoning simulation follows below)
  • What is "normal" in your environment?
• Problem asymmetry
  • Solutions are biased toward the prevalent class
  • Very hard to fine-tune; becomes prone to a lot of false negatives or false positives
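To show why the lack of ground truth is exploitable, here is a toy normality-poisoning simulation I am adding (all numbers are illustrative): a detector that keeps re-learning "normal" from recent traffic lets a slowly ramping attacker drag the baseline along, while a frozen baseline alerts early.

```python
# Toy normality-poisoning simulation: an attacker ramps up activity
# slowly enough that an adaptive baseline absorbs each step as
# "normal". Illustrative numbers only.
import numpy as np

rng = np.random.default_rng(7)
window = list(rng.normal(100, 5, size=50))      # recent "normal" readings
mu0, sigma0 = np.mean(window), np.std(window)   # frozen baseline, for contrast

adaptive_day = frozen_day = None
for day in range(120):
    reading = 100 + 0.5 * day + rng.normal(0, 5)   # +0.5/day attack ramp
    mu, sigma = np.mean(window), np.std(window)
    if adaptive_day is None and abs(reading - mu) > 3 * sigma:
        adaptive_day = day
    if frozen_day is None and abs(reading - mu0) > 3 * sigma0:
        frozen_day = day
    window.pop(0)
    window.append(reading)   # the poisoned reading becomes part of "normal"

print("frozen baseline first alert:   day", frozen_day)    # early
print("adaptive baseline first alert: day", adaptive_day)  # much later, if ever
```

The adaptive detector can never alert before the frozen one here, because every poisoned reading it accepts pushes its own mean toward the attacker.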
AD: Normality-Poisoning Attacks
AD: Hanlon's Razor
"Never attribute to malice that which is adequately explained by stupidity."
AD: Hanlon's Razor
Evil Hacker vs. Hipster Developer (a.k.a. Matt Johansen)
What about User Behavior?
• Surprise, it kinda works! (as supervised learning, that is)
  • As specific implementations for specific solutions
  • Good stuff from Square, AirBnB
  • Well-defined scope and labeling
• Can it be general enough?
  • File exfiltration example (are roles / information classification mandatory?)
  • Can I "average out" user behaviors across different applications?
Classification
Lots of Malware Activity
• Lots of available academic research around this
  • Classification and clustering of malware samples
• More success in classifying artifacts you already know to be malware than in actually detecting it (lineage)
• State of the art? My guess is the AV companies
  • All of them have an absurd number of samples
  • They have been researching and consolidating data on them for decades
Lots of Malware Activity
• Can we do better than "AV heuristics"?
• Lots and lots of available data that has been made public
• Some of the papers also suffer from potentially bad ground truth
Lots of Malware Activity
Everyone makes mistakes!
How is it going then, Alex?
• Private beta of our Threat Intelligence-based models:
  • Some use TI indicator feeds as blocklists
  • More mature companies use the feeds to learn about the threats (trained professionals only)
• Our models extrapolate the knowledge of existing threat intelligence feeds the way those experienced analysts would:
  • A supervised model with the same data an analyst has
  • Seeded labeling from TI feeds (see the sketch below)
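For flavor, here is a minimal sketch of what "seeded labeling from TI feeds" can look like. This is an assumption about the general shape of such a pipeline, not MLSec Project's actual implementation; the feature values are random stand-ins.

```python
# Supervised model whose labels are seeded from threat-intelligence
# feeds: feed indicators become positive examples, known-benign
# domains become negatives, and the model extrapolates from structural
# features instead of doing exact blocklist matches.
# Feature values are random stand-ins for real GeoIP/ASN/pDNS/WHOIS data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X_malicious = rng.normal(loc=1.0, size=(500, 4))   # indicators seen on TI feeds
X_benign = rng.normal(loc=-1.0, size=(500, 4))     # assumed-benign seeds

X = np.vstack([X_malicious, X_benign])
y = np.array([1] * 500 + [0] * 500)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a never-before-seen indicator: no feed lists it, but its
# structural features resemble the malicious class.
print(model.predict_proba(rng.normal(loc=0.8, size=(1, 4))))
```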
Yeah, but why should I care?
• Very effective first triage for SOCs and incident responders
  • Send us: log data from firewalls, DNS, web proxies
  • Receive: a report with a short list of potentially compromised machines
• Would you rather download all the feeds and integrate them yourself?
  • MLSecProject/Combine
  • MLSecProject/TIQ-test
What about the Ground Truth (labels)?
• Huge amounts of TI feeds available now (open/commercial)
• Non-malicious samples are still challenging, but we have expanded to many collection techniques from different sources:
  • Very high-ranked Alexa / Quantcast / OpenDNS domains (see the sketch below)
  • Random domains as seeds for a search of trust
  • Helped by the customer logs as well, in a semi-supervised fashion
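As a small illustration of the popularity-list idea: take the highest-ranked domains from a list like the Alexa top sites and treat them as benign seeds. The file name and the rank,domain CSV layout below are assumptions about how such lists were commonly distributed.

```python
# Seeding benign labels from a site-popularity list (e.g. an Alexa-style
# top-1m CSV). Highly ranked domains are assumed benign and become
# negative-class seeds. File name and "rank,domain" layout are
# assumptions about the list format.
import csv

def benign_seed_domains(csv_path="top-1m.csv", top_k=10000):
    """Return the top_k most popular domains as assumed-benign seeds."""
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)   # rows look like: rank,domain
        return [row[1] for _, row in zip(range(top_k), reader)]
```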
But what about data tampering?
• The vast majority of features are derived from structural/intrinsic data:
  • GeoIP, ASN information, BGP prefixes
  • pDNS information for the IP addresses and hostnames
  • WHOIS information
• An attacker can't change those things without cost.
• Log data from the customer can, of course, be tampered with. But this does not make the model any worse off than a human specialist (a feature sketch follows below).
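Here is a hedged sketch of what "structural features" for an IP indicator might look like; the lookup helpers are hypothetical placeholders for whatever GeoIP/BGP/pDNS/WHOIS sources a real pipeline would query.

```python
# Structural features for an IP indicator. The asn_lookup, pdns_count,
# and whois_age_days callables are hypothetical placeholders -- the
# point is that these attributes cost an attacker real money and time
# to change, unlike the contents of a log line.
def structural_features(ip, asn_lookup, pdns_count, whois_age_days):
    asn, bgp_prefix, country = asn_lookup(ip)   # who announces the address, and from where
    return {
        "asn": asn,
        "bgp_prefix": bgp_prefix,
        "geoip_country": country,
        "pdns_hostnames": pdns_count(ip),       # distinct hostnames seen resolving to this IP
        "whois_age_days": whois_age_days(ip),   # age of the registration/allocation
    }
```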
And what about false positives?
• False positives / false negatives are an intrinsic part of ML.
• "False positives are very good, and would have fooled our human analysts at first."
• Their feedback helps us improve the models for everyone.
• Remember, it is about initial triage. A Tier-2/Tier-3 analyst must investigate and provide feedback to the model.
Buyer's Guide
• 1) What are you trying to achieve by adding Machine Learning to the solution?
• 2) What are the sources of Ground Truth for your models?
• 3) How can you protect the features / ground truth from adversaries?
• 4) How do the solution, and the processes around it, handle false positives?
Buyer's Guide
#NotAllAlgorithms
MLSec Project
• Don't take my word for it! Try it out!
• Help us test and improve the models!
• Looking for participants and data sharing agreements
• Limited capacity at the moment, so be patient. :)
• Visit https://www.mlsecproject.org, message @MLSecProject, or just e-mail me.
Thanks!
• Q&A?
• Don't forget the feedback!
Alex Pinto
@alexcpsec
@MLSecProject
"We are drowning in information and starved for knowledge" - John Naisbitt