Big Data Security
Joey Echeverria | Principal Solutions Architect
joey@cloudera.com | @fwiffo
©2013 Cloudera, Inc.

EARLY DAYS

Hadoop File Permissions
• Added in HADOOP-1298
• Hadoop 0.16
• Early 2008
• Authorization without authentication
• POSIX-like RWX bits
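
To make "POSIX-like RWX bits" and "authorization without authentication" concrete, here is a minimal sketch of an owner/group/other permission check. It is not Hadoop code: `FileMeta`, `is_allowed`, and the sample users are hypothetical names chosen for illustration. The point to notice is that the check trusts whatever user name the caller supplies.

```python
from dataclasses import dataclass

# Permission bits, as in POSIX: read = 4, write = 2, execute = 1.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for a file's ownership and mode (e.g. 0o640)."""
    owner: str
    group: str
    mode: int


def is_allowed(meta, user, groups, wanted):
    """Return True if `user` may perform the `wanted` access on the file.

    Exactly one class of bits applies: owner, else group, else other.
    The user name is taken at face value; nothing here proves identity.
    """
    if user == meta.owner:
        bits = (meta.mode >> 6) & 7
    elif meta.group in groups:
        bits = (meta.mode >> 3) & 7
    else:
        bits = meta.mode & 7
    return bits & wanted == wanted


# Example file: owner may read and write, group may read, others get nothing.
report = FileMeta(owner="alice", group="analysts", mode=0o640)
```

With this file, `is_allowed(report, "bob", {"analysts"}, READ)` is true and the same call with `WRITE` is false. Because nothing verifies the caller, a client that simply claims to be `alice` gets the owner's access, which is the gap the later authentication work closes.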

MapReduce ACLs
• Added in HADOOP-3698
• Hadoop 0.19
• Late 2008
• ACLs per job queue
• Set a list of allowed users or groups per operation
  • Job submission
  • Job administration
• No authentication
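
The per-queue, per-operation ACLs described above can be sketched as a lookup table. This is an illustration only: the `QUEUE_ACLS` structure, the queue name, and the operation names are assumptions, not Hadoop's configuration format.

```python
# Hypothetical ACL table: queue -> operation -> allowed users and groups.
QUEUE_ACLS = {
    "research": {
        "submit": {"users": {"alice"}, "groups": {"analysts"}},
        "administer": {"users": {"ops"}, "groups": set()},
    },
}


def queue_allows(queue, operation, user, groups):
    """Return True if the user, or any of the user's groups, is listed
    for this operation on this queue. Unknown queues or operations deny."""
    acl = QUEUE_ACLS.get(queue, {}).get(operation)
    if acl is None:
        return False
    return user in acl["users"] or bool(acl["groups"] & set(groups))
```

For example, `queue_allows("research", "submit", "bob", ["analysts"])` is true through group membership, while the same user is denied `"administer"`. As on the slide, the check authorizes a claimed identity; it does not authenticate it.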

Securing a Cluster Through a Gateway
• Hadoop cluster runs on a private network
• Gateway server dual-homed (Hadoop network and public network)
• Users SSH onto gateway
• Optionally can create an SSH proxy for jobs to be submitted from the client machine
• Provides minimum level of protection

WHY SECURITY MATTERS

Prevent Accidental Access
• Don’t let users shoot themselves in the foot
• Main driver for early features
• Not security per se, but a critical first step
• Doesn’t require strong authentication

Stop Malicious Users
• Early features were necessary, but not sufficient
• Security has to get real
• Hadoop runs arbitrary code
• Implicit trust doesn’t prevent the insider threat

Co-mingle All Your Data
• Often overlooked
• Big data means getting rid of stovepipes
• Scalability and flexibility are only 50% of the problem
• Trust your data in a multi-tenant environment
• Most critical driver

AN EVOLVING STORY

Authorization
• Files
• MapReduce/YARN job queues
• Service-level authorization
• Whitelists and blacklists of hosts and users
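
A whitelist/blacklist check of the kind listed above can be sketched in a few lines. The semantics here are assumptions made for illustration (the blacklist always wins, and an empty whitelist means "no restriction"); `admitted` and `connection_allowed` are hypothetical names, not Hadoop APIs.

```python
def admitted(name, whitelist, blacklist):
    """Return True if `name` passes the lists.

    Assumed semantics: a blacklisted name is always rejected; an empty
    whitelist admits everyone who is not blacklisted.
    """
    if name in blacklist:
        return False
    return not whitelist or name in whitelist


def connection_allowed(host, user, host_lists, user_lists):
    """Apply the same rule to both the connecting host and the user.

    Each *_lists argument is a (whitelist, blacklist) pair of sets.
    """
    return admitted(host, *host_lists) and admitted(user, *user_lists)
```

A call such as `connection_allowed("node7", "alice", ({"node7"}, set()), (set(), {"mallory"}))` passes, while the same call for user `mallory` or from an unlisted host is rejected.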

Authentication
• HADOOP-4487
• Hadoop 0.22 and 0.20.205
• Late 2010
• Based on Kerberos and internal delegation tokens
• Provides strong user authentication
• Also used for service-to-service authentication
[Figure: document excerpt, section 2.2 "High Level Use Cases", including "Figure 1: HDFS High-level Dataflow" (application, Name Node, Data Node, and MapReduce task, with Kerberos, delegation token, and block token flows)]
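
The delegation-token idea can be illustrated with a keyed hash: after a user authenticates once (with Kerberos, in Hadoop's case), the service hands back a token that later requests can present instead. The sketch below is a simplified model, not Hadoop's implementation. The function names, the JSON identifier layout, and the use of HMAC-SHA256 are assumptions, and the verifier receives the token password directly, whereas a real protocol would use it as a shared secret without sending it over the wire.

```python
import hashlib
import hmac
import json
import secrets
import time

# Secret known only to the issuing service (hypothetical; generated per run).
MASTER_KEY = secrets.token_bytes(32)


def issue_token(owner, renewer, lifetime_s=3600, now=None):
    """Issue a token for an already-authenticated user.

    Returns (identifier, password): the identifier is public, and the
    password is an HMAC of the identifier under the master key.
    """
    now = time.time() if now is None else now
    identifier = json.dumps(
        {"owner": owner, "renewer": renewer, "expires": now + lifetime_s},
        sort_keys=True,
    ).encode()
    password = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    return identifier, password


def verify_token(identifier, password, now=None):
    """Return the token owner if the token is authentic and unexpired,
    otherwise None. Only the master key is needed to verify."""
    now = time.time() if now is None else now
    expected = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, password):
        return None
    claims = json.loads(identifier)
    if claims["expires"] < now:
        return None
    return claims["owner"]
```

The design point is that tasks running on behalf of a user can carry a token rather than the user's Kerberos credentials, and a forged or altered identifier fails verification because the attacker cannot recompute the HMAC.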

Encryption
• Over-the-wire encryption for some socket connections
  • RPC encryption added soon after Kerberos
  • Shuffle encryption (HTTPS) added in Hadoop 2.0.2-alpha, backported to CDH4 MR1
  • HDFS block streamer encryption added in Hadoop 2.0.2-alpha
• Volume-level encryption for data at rest

SECURITY FOR KEY VALUE STORES

Apache Accumulo
• Robust, scalable, high performance data storage and retrieval system
• Built by NSA, now an Apache project
• Based on Google’s BigTable
• Built on top of HDFS, ZooKeeper and Thrift
• Iterators for server-side extensions
• Cell labels for flexible security models

Data Model
• Multi-dimensional, persistent, sorted map
• Key/Value store with a twist
• A single primary key (Row ID)
• Secondary key (Column) internal to a row
  • Family
  • Qualifier
• Per-cell timestamp
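
The "multi-dimensional, sorted map" can be pictured as an ordered list of compound keys. The sketch below is an in-memory toy, not Accumulo's storage engine: `Key`, `SortedTable`, and the sample rows are hypothetical, and the newest-first ordering of timestamps within a column is modeled by storing the timestamp negated.

```python
import bisect
from typing import NamedTuple


class Key(NamedTuple):
    """Compound key: row ID, then column family and qualifier, then time."""
    row: str
    family: str
    qualifier: str
    neg_timestamp: int  # negated so that newer versions sort first


class SortedTable:
    """Toy sorted key/value map kept in two parallel lists."""

    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, row, family, qualifier, timestamp, value):
        key = Key(row, family, qualifier, -timestamp)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # same cell and timestamp: overwrite
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def scan_row(self, row):
        """Return (family, qualifier, timestamp, value) for one row, in key order."""
        # (row,) sorts before every full key that starts with this row.
        i = bisect.bisect_left(self._keys, (row,))
        out = []
        while i < len(self._keys) and self._keys[i].row == row:
            k = self._keys[i]
            out.append((k.family, k.qualifier, -k.neg_timestamp, self._values[i]))
            i += 1
        return out
```

Because the keys are kept sorted, reading one row (or a range of rows) is a seek followed by a sequential scan, and all versions of a cell sit next to each other.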

Cell-Level Security
• Labels stored per cell
• Labels consist of Boolean expressions (AND, OR, nesting)
• Labels associated with each user
• Cell labels checked against user’s labels with a built-in iterator
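
The label check above can be sketched as a small expression evaluator. This is an illustrative re-implementation, not Accumulo's iterator: it assumes `&` for AND, `|` for OR, parentheses for nesting, that mixing `&` and `|` at one level requires parentheses, and that an empty label is visible to everyone. `tokenize` and `is_visible` are hypothetical names.

```python
import re

# A token is a label name, an operator, or a parenthesis.
_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label, authorizations):
    """Return True if a user holding `authorizations` may see a cell
    whose label is the Boolean expression `label`."""
    tokens = tokenize(label)
    if not tokens:
        return True  # assumed: unlabeled cells are visible to all
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expression ends unexpectedly")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()  # always parse, then combine
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

A cell labeled `admin|(finance&audit)` is visible to a user holding `{"admin"}` or `{"finance", "audit"}`, but not to one holding only `{"finance"}`.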

Pluggable Authentication
• Currently supports username/password authentication backed by ZooKeeper
• ACCUMULO-259
  • Targeted for Accumulo 1.5.0
  • Authentication info replaced with generic tokens
  • Supports multiple implementations (e.g. Kerberos)

Application Level
• Accumulo often paired with application level authentication/authorization
• Accumulo users created per application
• Each application granted access level of most permitted user
• Application authenticates users, grabs user authorizations, passes user labels with requests
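
The application-level pattern above can be sketched as follows. Everything here is hypothetical illustration: the tables, names, and data are made up, user authentication is reduced to a lookup, and cell labels are simplified to a set of labels that must all be held (a pure AND) rather than full Boolean expressions.

```python
# Labels granted to the application's own store account: at least as broad
# as its most permitted user.
APP_AUTHORIZATIONS = frozenset({"public", "finance", "hr"})

# Per-user labels, as the application would fetch them after authenticating
# the user against its own directory.
USER_AUTHORIZATIONS = {
    "alice": frozenset({"public", "finance"}),
    "bob": frozenset({"public"}),
}

# (row, column, labels required to see the cell, value)
CELLS = [
    ("emp1", "name", frozenset({"public"}), "Ann"),
    ("emp1", "salary", frozenset({"finance"}), "90000"),
    ("emp1", "review", frozenset({"hr"}), "exceeds"),
]


def request_authorizations(user):
    """Labels to pass with a request: the user's labels, capped by the
    application's own. Unknown users are rejected outright."""
    if user not in USER_AUTHORIZATIONS:
        raise PermissionError(f"unknown user: {user}")
    return USER_AUTHORIZATIONS[user] & APP_AUTHORIZATIONS


def scan_as(user):
    """Return the (row, column, value) cells visible with the user's labels."""
    auths = request_authorizations(user)
    return [
        (row, column, value)
        for row, column, required, value in CELLS
        if required <= auths
    ]
```

The store only ever sees the application's account, so the application is trusted to pass each user's real labels; a bug or compromise there exposes everything the application account can read.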

Apache HBase
• Also based on Google’s BigTable
• Started as a Hadoop contrib project
• Supports column-level ACLs
• Kerberos for authentication
• Discussion and early prototypes of cell-level security ongoing

FUTURE

Encryption for Data at Rest
• Need multiple levels of granularity
• Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs)
• APIs for file-level, block-level, or record-level encryption

Hive Security
• Column-level ACLs
• Kerberos authentication
• AccessServer

Big Data Security with Hadoop
