Hadoop Architecture Approaches
Miraj Godha
June 5, 2015
Table of Contents

EXECUTIVE SUMMARY
Big data Classification
Hadoop-based architecture approaches
    Data Lake
    Lambda
    Choosing the correct architecture
Data Lake Architecture
    Generic Data lake Architecture
        Steps Involved
Lambda Architecture
    Batch Layer
    Serving layer
    Speed layer
    Generic Lambda Architecture
References
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Soon after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a Data Warehouse that serves to model and capture the essence of the business from its enterprise systems. The explosion of new types of data in recent years – from inputs such as the web and connected devices, or just sheer volumes of records – has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data whilst maintaining the coherence of the Data Warehouse. This POV discusses Apache Hadoop and its capabilities as a data platform and data-processing engine: how the core of Hadoop and its surrounding ecosystem meets the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture – a step on the journey toward delivering an enterprise 'Data Lake' or a Lambda Architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise: new efficiencies for data architecture through a significantly lower cost of storage and through optimization of data-processing workloads such as data transformation and integration; and new opportunities for the business through flexible 'schema-on-read' access to all enterprise data, and through multi-use, multi-workload data processing on the same sets of data, from batch to real time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to that problem is the Lambda Architecture.
Big data Classification

Big data workloads can be classified along six dimensions:

• Processing Type: Batch; Near Real Time; Real Time + Batch
• Processing Methodology: Descriptive; Diagnostic; Predictive; Prescriptive
• Data Frequency: On demand; Continuous; Real Time; Batch
• Data Type: Transactional; Historical; Master data; Metadata
• Content Format: Structured; Unstructured (images, text, videos, documents, emails, etc.); Semi-structured (XML, JSON)
• Data Sources: Machine generated; Web & social media; IoT; Human generated; Transactional data; Via other data providers
It's helpful to look at the characteristics of big data along certain lines — for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:

• Processing type - Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types ('near real time' or 'micro batch') may also be required by the use case.
• Processing methodology - The type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.
• Data frequency and size - How much data is expected and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources:
  • On demand, as with social media data
  • Continuous feed, real-time (weather data, transactional data)
  • Time series (time-based data)
• Data type - Type of data to be processed — transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.
• Content format - Format of incoming data — structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and defining a solution from a business perspective.
• Data source - Sources of data (where the data is generated) — web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
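The classification is easiest to apply when it is written down explicitly per workload. Below is a minimal sketch in Python, with entirely hypothetical names and a deliberately crude rule, of capturing these six dimensions for a workload and deriving a first-cut architecture suggestion; the rule encoded here (simultaneous real-time + batch views point to Lambda) anticipates the comparison in the next section.

```python
# Hypothetical sketch: capture the six classification dimensions per workload.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    processing_type: str   # "batch", "near-real-time", "real-time + batch"
    methodology: str       # "descriptive", "diagnostic", "predictive", "prescriptive"
    data_frequency: str    # "on-demand", "continuous", "time-series"
    data_type: str         # "transactional", "historical", "master", "metadata"
    content_format: str    # "structured", "semi-structured", "unstructured"
    data_source: str       # "machine", "web/social", "IoT", "human", "provider"

def suggest_architecture(profile: WorkloadProfile) -> str:
    # Crude rule of thumb: needing real-time and batch views at once
    # points to Lambda; everything else is well served by a data lake.
    if profile.processing_type == "real-time + batch":
        return "Lambda"
    return "Data Lake"

clickstream = WorkloadProfile("real-time + batch", "descriptive", "continuous",
                              "transactional", "semi-structured", "web/social")
print(suggest_architecture(clickstream))  # -> Lambda
```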
Hadoop-based architecture approaches

Data Lake
A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.

Lambda
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latency of MapReduce.
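To make "the two view outputs may be joined before presentation" concrete, here is a minimal, technology-agnostic sketch in plain Python (all names and numbers are illustrative): a precomputed batch view plus a small realtime delta, merged at query time.

```python
# The batch view holds results computed over the master dataset up to the
# last batch run; the realtime view holds only the delta since then.
batch_view = {"page_a": 10_000, "page_b": 7_500}   # from the batch layer
realtime_view = {"page_a": 42, "page_c": 7}        # delta from the speed layer

def query(page: str) -> int:
    # Join the two view outputs at presentation time.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 10042: historical total plus the recent delta
```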
Choosing the correct architecture

Simultaneous access to real-time and batch data
  Data Lake: A data lake can use real-time processing technologies like Storm to return real-time results, but in such a scenario historical results cannot be made available. If technologies like Spark are used to process real-time and historical data together on request, response times to clients can be significantly longer than with a Lambda architecture.
  Lambda: The Lambda architecture's serving layer merges the output of the batch layer and the speed layer before returning the results of user queries. As data has already been processed into views at both layers, the response time is significantly lower.

Latency
  Data Lake: Latency is high compared to Lambda, as real-time data needs to be processed together with historical data on demand or as part of a batch.
  Lambda: Low-latency real-time results are produced by the speed layer, and batch results are pre-processed in the batch layer. On request the two result sets are simply merged, resulting in low latency for real-time processing.

Ease of data governance
  Data Lake: The term data lake was coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
  Lambda: The Lambda architecture's serving layer gives access to processed and analyzed data. As users get access to processed data directly, this can lead to top-down data governance issues.

Updates in source data
  Data Lake: As a data lake stores only raw data, updates are simply appended to the raw data, which makes it difficult for business users to write business logic that ensures the latest updated records are considered in calculations.
  Lambda: Batch views are always computed from scratch in the Lambda architecture. As a result, updates are easily incorporated into the computed views on each batch reprocessing cycle.

Fault tolerance against human errors
  Data Lake: Data scientists or business users running business logic on raw data in the data lake may make human errors. Recovering from those errors is not difficult, as it is just a matter of re-running the logic; however, the reprocessing time for large datasets can cause delays.
  Lambda: The Lambda architecture assures fault tolerance not only against hardware failures but also against human errors. Recomputing the views from the raw data in the batch layer on every cycle ensures that human errors in business logic are never cascaded to a point where they are unrecoverable.

Ease of use for business users
  Data Lake: Data is stored in raw format, with data definitions, and is sometimes groomed to make it digestible by data management tools. At times it is difficult for business users to use the data as-is.
  Lambda: Data is processed and served ready-made from the serving layer, which makes life easy for business users.

Accuracy of real-time results
  Data Lake: In every scenario, users accessing data from the data lake have access to the immutable raw data, so they can perform exact computations and always get accurate results.
  Lambda: In scenarios where real-time calculations need access to historical data, which the speed layer does not have, the Lambda architecture returns estimated results. For example, a mean value cannot be computed exactly unless the whole of the historical data and the real-time data are referenced in one pass; in such a scenario the serving layer returns an estimate.

Infrastructure cost
  Data Lake: A data lake processes data as and when needed, so the cluster cost can be much lower than with Lambda. Moreover, it persists only the raw data, whereas the Lambda architecture persists both the raw data and the processed data, which adds storage cost.
  Lambda: The Lambda data-processing life cycle is designed so that as soon as one batch cycle finishes, a new cycle starts that includes the recently ingested data. Simultaneously, the speed layer is continuously processing the real-time data, implying continuously high cluster utilization.

OLAP
  Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
  Lambda: As Lambda exposes processed views from the serving layer, not all attributes of the data are necessarily available to data scientists for analytical queries.

Historical data reference for processing
  Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to reference historical data while processing data for a given time interval.
  Lambda: The speed layer has no reference to the historical data stored in the batch layer, which makes it difficult to run queries that refer to historical data. For example, 'unique count' queries cannot return correct results from the speed layer alone. 'Average'-style calculations, however, can be done easily at the serving layer by combining the results returned from the speed and batch layers on the fly (see the sketch after this comparison).

Slowly changing dimensions
  Data Lake: Although the data lake has records of changed dimension attributes, extra business logic needs to be written by business users to cater for them.
  Lambda: The Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys alongside natural keys whenever a change in dimension attributes is detected during the batch layer's processing cycle.

Slowly changing facts
  Data Lake: In a data lake both versions of a fact are available for users to look at, which leads to good analytical results when the fact's life cycle is an attribute in the business logic for data analytics.
  Lambda: Although it is easy to change facts in the Lambda architecture, doing so loses the fact's life-cycle information. As the previous state of a slowly changing fact is not available to data scientists, analytical queries may not give the desired results on the views exposed by the serving layer.

Frequently changing business logic
  Data Lake: Changes must be made in the processing code, but there is no clear solution for how historically processed data should be handled.
  Lambda: As data is reprocessed from scratch every cycle, the historical-data problem is resolved automatically even if the business logic changes frequently.

Implementation life cycle
  Data Lake: A data lake is fast to implement, as it eliminates the dependency on upfront data modeling.
  Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significantly longer implementation time compared to a data lake.

Adding new data sources
  Data Lake: Very easy to add.
  Lambda: New sources need to be incorporated into the processing layers and require code changes.
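The 'historical data reference' comparison above hinges on which aggregates can be merged across layers. The sketch below (plain Python, illustrative values) shows why an average merges exactly when each layer keeps partial sums and counts, while a unique count cannot be merged from two plain counts, because the layers may have seen overlapping users.

```python
# Partial state kept by each layer (illustrative numbers).
batch = {"sum": 9_000.0, "count": 900, "uniques": {"u1", "u2", "u3"}}
speed = {"sum": 110.0, "count": 10, "uniques": {"u3", "u4"}}

# Average: merging (sum, count) pairs gives the exact global mean.
mean = (batch["sum"] + speed["sum"]) / (batch["count"] + speed["count"])

# Unique count: adding two plain counts double-counts u3; an exact answer
# needs the underlying sets (or an approximate sketch such as HyperLogLog).
wrong = len(batch["uniques"]) + len(speed["uniques"])   # 5 (overcounts)
exact = len(batch["uniques"] | speed["uniques"])        # 4 (set union)

print(mean, wrong, exact)
```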
Data Lake Architecture

"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
— James Dixon (Pentaho CTO)

Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives. A minimal schema-on-read sketch appears after the lists below.

Key Features
• Stores raw data – single source of truth
• Data accessible to anyone authorized
• Polyglot persistence
• Supports multiple applications & workloads
• Low-cost, high-performance storage
• Flexible, easy-to-use data organization
• Self-service for end users
• More flexible to answer new questions
• Easy to add new data sources
• Loosely coupled architecture – enables flexibility of analysis
• Eliminates the dependency on upfront data modeling – thereby fast to implement
• Storage is highly optimized, as raw data is stored

Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data lacks relational structure, which is unfriendly for on-the-fly business analytics
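As a concrete illustration of schema-on-read, here is a minimal sketch in plain Python (the file layout, field names, and helper are hypothetical): the raw JSON-lines data stays untouched in the lake, and each consumer applies its own schema only at the point of use.

```python
import json

def read_with_schema(path, schema):
    """Parse raw JSON-lines records, projecting and casting only the fields
    this particular consumer cares about (schema applied at read time,
    not at ingest time)."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            # Fields absent from a raw record simply come back as None.
            yield {field: (cast(raw[field]) if field in raw else None)
                   for field, cast in schema.items()}

# Two consumers, two schemas, one copy of the raw data in the lake.
clicks_schema = {"user_id": str, "ts": float}            # clickstream analysis
audit_schema = {"user_id": str, "ip": str, "ts": float}  # security audit

# for record in read_with_schema("lake/raw/clicks.jsonl", clicks_schema):
#     ...
```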
In a practical sense, a data lake is characterized by three key attributes:

• Collect everything: A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data lake Architecture

[Figure: generic data lake architecture. Data sources (desktop & mobile, social media and cloud, operational systems, Internet of Things) feed an ingestion tier at real-time, micro-batch, and mega-batch frequencies. A unified data management tier (data management, data access, schematic metadata, data grooming) sits over HDFS storage holding raw and processed data, both structured and unstructured. A processing tier provides workflow management, in-memory processing, and MapReduce/Hive/MPP engines. A query interface exposes SQL, NoSQL, and external storage; centralized management covers system monitoring and system management. Outputs: real-time insights, interactive insights, batch insights, and flexible actions.]
Steps Involved
• Procuring data – The process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – Physically transferring the data from the source to the data lake.
• Describing data – A data scientist searching a data lake for useful data must be able to find the data relevant to his or her need, which requires metadata about the data. Schematic metadata for a data set includes information about how the data is formatted and about its schema (a sketch follows this list).
• Grooming data – Usually the raw data itself is made consumable by analytics applications; in some scenarios, however, a grooming process uses the schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – The authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissioning and renewal.
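Below is a minimal sketch of what the schematic metadata recorded in the 'describing data' step might look like (all field names and values are illustrative, not a standard): enough for a data scientist to find the data set and parse it without opening the raw files.

```python
# Hypothetical metadata record for one data set in the lake.
dataset_metadata = {
    "name": "web_clickstream_2015_06",
    "source": "web servers",            # provenance, from the procurement step
    "format": "json-lines, gzip",       # how the data is formatted
    "schema": {                         # what the fields mean
        "user_id": "string",
        "url": "string",
        "ts": "unix epoch seconds (float)",
    },
    "groomed": False,                   # raw; no grooming applied yet
    "expires": "2017-06-05",            # input to the preservation step
}
```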
Lambda Architecture

The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.

1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)

Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computations
• Storage is highly optimized, as raw data is stored
Batch Layer

The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation: when new data arrives, it will be aggregated into the views when they are recomputed during the next MapReduce iteration.

The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.

Serving layer

The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. However, the batch and serving layers alone do not satisfy any realtime requirement, because MapReduce is (by design) high latency and it can take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.

Speed layer

In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results that supplement the batch views.

Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received. What's clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary. A conceptual sketch of the two update models appears at the end of this section.

[Figure: timeline of batch and realtime views. Realtime views are discarded once the data they contain is represented in a batch view.]

Disadvantages
• Maintaining two copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster up-time, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the volume of data accumulated in the master dataset
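The following is a conceptual sketch in plain Python (illustrative names; a real system would use HDFS, MapReduce and Storm as above) of the two update models: the batch layer recomputes its view from the entire immutable master dataset each cycle, while the speed layer increments a transient realtime view per event and discards it once a batch cycle has absorbed that data.

```python
from collections import defaultdict

master_dataset = []               # immutable, append-only (HDFS in practice)
batch_view = {}                   # recomputed from scratch each batch cycle
realtime_view = defaultdict(int)  # incremental and transient

def new_event(page):
    master_dataset.append(page)   # every event lands in the master dataset
    realtime_view[page] += 1      # and increments the realtime view

def run_batch_cycle():
    global batch_view
    # Recompute from the whole dataset: a human error in this logic is fixed
    # by correcting the code and letting the next cycle rerun from raw data.
    batch_view = {p: master_dataset.count(p) for p in set(master_dataset)}
    # Complexity isolation: deltas now covered by the batch view are dropped.
    # (Real systems keep overlapping realtime views to cover the batch run's
    # own duration; this sketch ignores that subtlety.)
    realtime_view.clear()

def query(page):
    return batch_view.get(page, 0) + realtime_view[page]

new_event("page_a"); new_event("page_a"); new_event("page_b")
run_batch_cycle()
new_event("page_a")               # arrives after the batch cycle
print(query("page_a"))            # 3: batch view (2) + realtime delta (1)
```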
Generic Lambda Architecture

[Figure: generic Lambda architecture. Incoming data streams are dispatched to both the batch layer and the speed layer. The batch layer stores all data (HDFS) and precomputes batch views and summarized data with MR/Hive/Pig; the speed layer (Storm or Spark) processes the streams and maintains incremental, near-real-time views via stream summarization. The serving layer (data management & access) holds the precomputed batch views alongside the real-time views, and queries merge the two at request time.]
References

• http://www.ibm.com/developerworks/library/bd-archpatterns1/
• http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf
• https://en.wikipedia.org/wiki/Lambda_architecture
• http://voltdb.com/blog/simplifying-complex-lambda-architecture
• http://en.wiktionary.org/wiki/data_lake
