Big data beyond Hadoop – How to integrate ALL your data
Kai Wähner, kwaehner@talend.com, @KaiWaehner, www.kai-waehner.de, 4/26/13
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner

Kai Wähner
  • Consulting, Developing, Coaching, Speaking, Writing
  • Main tasks: Requirements Engineering, Enterprise Architecture Management, Business Process Management, Architecture and Development of Applications, Service-oriented Architecture, Integration of Legacy Applications, Cloud Computing, Big Data
  • Contact: Email kontakt@kai-waehner.de, Blog www.kai-waehner.de/blog, Twitter @KaiWaehner, Social networks Xing and LinkedIn

Key messages
  • You have to care about big data to be competitive in the future!
  • You have to integrate different sources to get the most value out of them!
  • Big data integration is no (longer) rocket science!

Agenda
  • Big data paradigm shift
  • Challenges of big data
  • Big data from a technology perspective
  • Integration with an open source framework
  • Integration with an open source suite

Agenda, part 1: Big data paradigm shift

Why should you care about big data?
  • “If you can't measure it, you can't manage it.”
  • William Edwards Deming (1900–1993), American statistician, professor, author, lecturer and consultant

Why should you care about big data?
  ➜ “Silence the HiPPOs” (highest-paid person's opinion)
  ➜ Once you can interpret unimaginably large data streams, gut feeling alone is no longer justified!

What is big data? The Vs of big data
  • Volume (terabytes, petabytes)
  • Variety (social networks, blog posts, logs, sensors, etc.)
  • Velocity (realtime or near-realtime)
  • Value

Big data tasks to solve - before analysis
  • Big Data Integration
    – Land data in a Big Data cluster
    – Implement or generate parallel processes
  • Big Data Manipulation
    – Simplify manipulation, such as sort and filter
    – Computationally expensive functions
  • Big Data Quality & Governance
    – Identify linkages and duplicates, validate big data
    – Match component, execute basic quality features
  • Big Data Project Management
    – Place frameworks around big data projects
    – Common repository, scheduling, monitoring

Use case: Replacing ETL jobs
  “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want:
  ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
  ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.”
  Source: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Use case: Risk management
  • Deduce customer defections
  Source: http://hkotadia.com/archives/5021

Use case: Flexible pricing
  ➜ With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA
  ➜ Daily price check analysis of its 10,000 articles in less than two hours
  ➜ Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example
  ➜ If there is no market competitor, the prices remain unchanged
  Source: http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702

Agenda, part 2: Challenges of big data

Limited big data experts
  • This is your company
  • Big Data Geek

Big data tool selection (business perspective)
  ➜ Want to buy a big data solution for your industry?
  ➜ Maybe a competitor has a big data solution which adds business value?
  ➜ The competitor will never publish it (rat race)!

Big data tool selection (technical perspective)
  • Looking for “your” required big data product?
  • Support your data from scratch?
  • Good luck! :-)

How to solve these big data challenges?

Be no expert! Be simple!
  ➜ “[Often] simple models and big data trump more-elaborate [and complex] analytics approaches”
  ➜ “Often someone coming from outside an industry can spot a better way to use big data than an insider”
  Erik Brynjolfsson / Lynn Wu
  Source: http://alfredopassos.tumblr.com/post/32461599327/big-data-the-management-revolution-by-andrew-mcafee

Be creative!
  ➜ Look at use cases of others (SMEs, but also large companies)
  ➜ How can you do something similar with your data?
  ➜ You have different data sources? Use them! Combine them! Play with them!

What is your Big Data process?
  1) Do not begin with the data, think about business opportunities
  2) Choose the right data (combine different data sources)
  3) Use easy tooling
  Source: http://hbr.org/2012/10/making-advanced-analytics-work-for-you

Agenda, part 3: Big data from a technology perspective

Technology perspective
  • How to process big data?

How to process big data?
  The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain.
  Source: http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead
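
To make the partition-key problem concrete, here is a minimal Java sketch. It assumes a hypothetical `Order` record that is partitioned by `customerId` on the source and by `regionId` on the target, with simple modulo hashing over three nodes. None of these names come from the quoted article; they only illustrate how a key mismatch forces records across the network.

```java
import java.util.List;

/** Illustration only: counts records that must change nodes when partition keys differ. */
final class PartitionMismatch {

    /** Hypothetical record with two candidate partition keys. */
    static final class Order {
        final int customerId; // assumed partition key of the source
        final int regionId;   // assumed partition key of the target

        Order(int customerId, int regionId) {
            this.customerId = customerId;
            this.regionId = regionId;
        }
    }

    /** Simple hash partitioning: key modulo number of nodes. */
    static int nodeFor(int key, int nodes) {
        return Math.floorMod(key, nodes);
    }

    /** Counts the records whose source node differs from their target node. */
    static long recordsToMove(List<Order> orders, int nodes) {
        return orders.stream()
                .filter(o -> nodeFor(o.customerId, nodes) != nodeFor(o.regionId, nodes))
                .count();
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order(1, 1), new Order(2, 5), new Order(3, 0),
                new Order(4, 2), new Order(5, 3));
        // Prints: 2 of 5 records cross the network
        System.out.println(recordsToMove(orders, 3) + " of " + orders.size()
                + " records cross the network");
    }
}
```

If source and target shared the same key, the count would be zero. Every mismatching record adds network traffic, which is the cost the quote describes.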

How to process big data?
  • Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java
  • Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java

How to process big data?
  • The de facto standard for big data processing

How to process big data?
  • Even Microsoft (the .NET house) has relied on Hadoop since 2011
  • “A big part of [the company’s strategy] includes wiring SQL Server 2012 (formerly known by the codename “Denali”) to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure”

What is Hadoop?
  Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Map (Shuffle) Reduce
  Simple example
  • Input: (very large) text files with lists of strings, such as: “318, 0043012650999991949032412004...0500001N9+01111+99999999999...”
  • We are interested in only some of the content: year and temperature (marked in red on the slide)
  • The MapReduce function has to compute the maximum temperature for every year
  Example from the book “Hadoop: The Definitive Guide, 3rd Edition”
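
The three phases can be sketched in plain Java without a Hadoop cluster. This is an illustration of the idea, not the book's code and not the Hadoop API: the input is assumed to be simplified, hypothetical lines of the form `year,temperature` rather than the fixed-width weather records shown on the slide, and all class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of map, shuffle and reduce (no Hadoop involved). */
final class MaxTemperature {

    /** Map phase: turn one input line "year,temperature" into a (year, temperature) pair. */
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    /** Shuffle phase: group all temperatures by year. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: keep only the maximum temperature of one year. */
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    /** Runs the three phases one after another on a small in-memory input. */
    static Map<String, Integer> maxByYear(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((year, temps) -> result.put(year, reduce(temps)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1949,111", "1949,78", "1950,0", "1950,22", "1950,-11");
        System.out.println(maxByYear(lines)); // prints {1949=111, 1950=22}
    }
}
```

In a real Hadoop job the map and reduce functions run in parallel on many nodes and the framework performs the shuffle; here everything runs in one process to keep the data flow visible.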

How to process big data?

Hadoop alternatives (features included: few → many)
  • Apache Hadoop: MapReduce, HDFS, Ecosystem
  • Hadoop Distribution: + Packaging, Deployment-Tooling, Support
  • Big Data Suite: + Tooling / Modeling, Code Generation, Scheduling, Integration

Agenda, part 4: Integration with an open source framework

Alternatives for systems integration (complexity of integration: low → high)
  • Integration Framework: Connectivity, Routing, Transformation
  • Enterprise Service Bus: + Tooling, Monitoring, Support
  • Integration Suite: Integration + Business Process Mgt., Big Data / MDM, Registry / Repository, Rules Engine, “you name it”

Alternatives for systems integration
  • Integration Framework, Enterprise Service Bus, Integration Suite (complexity of integration: low → high)

More details about integration frameworks...
  http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-framework-spring-integration-apache-camel-vs-enterprise-service-bus-esb/

Enterprise Integration Patterns (EIP)
  • Apache Camel implements the EIPs
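
As one example of an EIP, the Content-Based Router sends each message to a channel chosen by inspecting its content. The sketch below is plain Java and deliberately does not use Camel; the class, method and channel names are hypothetical. Camel itself expresses this pattern in its DSL with `choice()`, `when()` and `otherwise()`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Plain-Java illustration of the Content-Based Router pattern (hypothetical names). */
final class ContentBasedRouter {

    /** One routing rule: if the condition matches, deliver to the channel. */
    private static final class Rule {
        final Predicate<String> condition;
        final String channel;

        Rule(Predicate<String> condition, String channel) {
            this.condition = condition;
            this.channel = channel;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Map<String, List<String>> channels = new LinkedHashMap<>();
    private final String defaultChannel;

    ContentBasedRouter(String defaultChannel) {
        this.defaultChannel = defaultChannel;
    }

    /** Registers a rule; rules are evaluated in registration order. */
    ContentBasedRouter when(Predicate<String> condition, String channel) {
        rules.add(new Rule(condition, channel));
        return this;
    }

    /** Delivers the message to the first matching channel, else to the default channel. */
    String route(String message) {
        String target = defaultChannel;
        for (Rule rule : rules) {
            if (rule.condition.test(message)) {
                target = rule.channel;
                break;
            }
        }
        channels.computeIfAbsent(target, k -> new ArrayList<>()).add(message);
        return target;
    }

    /** Returns the messages delivered to a channel so far. */
    List<String> received(String channel) {
        return channels.getOrDefault(channel, List.of());
    }

    public static void main(String[] args) {
        ContentBasedRouter router = new ContentBasedRouter("other")
                .when(m -> m.startsWith("<order"), "orders")
                .when(m -> m.contains("ERROR"), "alerts");

        System.out.println(router.route("<order id=\"1\"/>"));     // orders
        System.out.println(router.route("disk ERROR on node 7")); // alerts
        System.out.println(router.route("hello"));                // other
    }
}
```

An integration framework provides patterns like this ready-made, so a route only has to declare the conditions and endpoints instead of implementing the dispatch logic.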

Architecture
  Source: http://java.dzone.com/articles/apache-camel-integration

Choose your required components
  • HTTP, FTP, File, XSLT, MQ, JDBC, Akka, TCP, SMTP, RSS, Quartz, Log, LDAP, JMS, EJB, AMQP, Atom, AWS-S3, Bean-Validation, CXF, IRC, Jetty, JMX, Lucene, Netty, RMI, SQL, many many more
  • Custom Components

Choose your favorite DSL
  • XML (not production-ready yet)

Deploy it wherever you need
  • Standalone, OSGi, Application Server, Web Container, Spring Container, Cloud

Enterprise-ready
  • Open Source
  • Scalability
  • Error Handling
  • Transaction
  • Monitoring
  • Tooling
  • Commercial Support

Example: Camel integration route

Example: camel-hdfs component
  // Producer: route messages from a JMS queue into HDFS
  from("jms:MyQueue")
      .to("hdfs:///myDirectory/myFile.txt?valueType=TEXT");

  // Consumer: read from HDFS and hand the content to a file endpoint
  from("hdfs:///myDirectory/myFile.txt")
      .to("file:target/reports/report.txt");

Live demo
  • Apache Camel in action...

Agenda, part 5: Integration with an open source suite

More details about ESBs and suites...
  http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choice-how-to-choose-the-right-enterprise-service-bus-esb/

Vision: Democratize big data
  Talend Open Studio for Big Data
  • Improves efficiency of big data job design with a graphical interface
  • Generates Hadoop code and runs transforms inside Hadoop
  • Native support for HDFS, Pig, HBase, HCatalog, Sqoop and Hive
  • 100% open source under an Apache License
  • Standards based
  …an open source ecosystem

Vision: Democratize big data
  Talend Platform for Big Data
  • Builds on Talend Open Studio for Big Data
  • Adds data quality, advanced scalability and management functions
  • MapReduce massively parallel data processing
  • Shared repository and remote deployment
  • Data quality and profiling
  • Data cleansing
  • Reporting and dashboards
  • Commercial support, warranty/IP indemnity under a subscription license
  …an open source ecosystem

Talend Open Studio for Big Data

Live demo
  • “Talend Open Studio for Big Data” in action...

Did you get the key message?

Key messages
  • You have to care about big data to be competitive in the future!
  • You have to integrate different sources to get the most value out of them!
  • Big data integration is no (longer) rocket science!

Thank you for your attention. Questions?
  • kwaehner@talend.com, www.kai-waehner.de, LinkedIn / Xing, @KaiWaehner

Big Data beyond Apache Hadoop - How to integrate ALL your Data

  • 1.
    Big data beyondHadoop – How to integrate ALL your data Kai  Wähner   kwaehner@talend.com   @KaiWaehner   www.kai-­‐waehner.de   4/26/13  
  • 2.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Consulting Developing Coaching Speaking Writing Main Tasks Requirements Engineering Enterprise Architecture Management Business Process Management Architecture and Development of Applications Service-oriented Architecture Integration of Legacy Applications Cloud Computing Big Data Contact Email: kontakt@kai-waehner.de Blog: www.kai-waehner.de/blog Twitter: @KaiWaehner Social Networks: Xing, LinkedIn Kai Wähner
  • 3.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Key messages You have to care about big data to be competitive in the future! You have to integrate different sources to get most value out of it! Big data integration is no (longer) rocket science!
  • 4.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     • Big  data  paradigm  shiM     • Challenges  of  big  data   • Big  data  from  a  technology  perspecPve   • IntegraPon  with  an  open  source  framework   • IntegraPon  with  an  open  source  suite   Agenda
  • 5.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     • Big  data  paradigm  shiM     • Challenges  of  big  data   • Big  data  from  a  technology  perspecPve   • IntegraPon  with  an  open  source  framework   • IntegraPon  with  an  open  source  suite   Agenda
  • 6.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     William  Edwards  Deming     (1900  –1993)     American  staPsPcian,  professor,     author,  lecturer  and  consultant   “If  you  can't  measure  it,     you  can't  manage  it.”   Why should you care about big data?
  • 7.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     è  „Silence  the  HiPPOs“  (highest-­‐paid  person‘s  opinion)   è  Being  able  to  interpret  unimaginable  large  data   stream,  the  gut  feeling  is  no  longer  jusPfied!     Why should you care about big data?
  • 8.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     What is big data? The Vs of big data Volume     (terabytes,   petabytes)                     Variety     (social  networks,   blog  posts,  logs,   sensors,  etc.)            Velocity                (realPme  or  near-­‐ realPme)           Value  
  • 9.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Big  Data  Integra3on   –  Land  data  in  a  Big  Data  cluster   –  Implement  or  generate  parallel  processes      Big  Data  Manipula3on   –  Simplify  manipulaPon,  such  as  sort  and  filter   –  ComputaPonal  expensive  funcPons     Big  Data  Quality  &  Governance   –  IdenPfy  linkages  and  duplicates,  validate  big  data   –  Match  component,  execute  basic  quality  features     Big  Data  Project  Management   –  Place  frameworks  around  big  data  projects   –  Common  Repository,  scheduling,  monitoring     Big data tasks to solve - before analysis
  • 10.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     “The  advantage  of  their  new  system  is  that  they  can  now  look  at  their  data   [from  their  log  processing  system]  in  anyway  they  want:   ➜  Nightly  MapReduce  jobs  collect  staPsPcs  about  their  mail  system  such  as   spam  counts  by  domain,  bytes  transferred  and  number  of  logins.     ➜  When  they  wanted  to  find  out  which  part  of  the  world  their  customers   logged  in  from,  a  quick  [ad  hoc]  MapReduce  job  was  created  and  they  had   the  answer  within  a  few  hours.  Not  really  possible  in  your  typical  ETL   system.”   hjp://highscalability.com/how-­‐rackspace-­‐now-­‐uses-­‐mapreduce-­‐and-­‐hadoop-­‐query-­‐terabytes-­‐data   Use case: Replacing ETL jobs
  • 11.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     hjp://hkotadia.com/archives/5021   Deduce   Customer     DefecPons   Use case: Risk management
  • 12.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     ➜  With  revenue  of  almost  USD  30  billion  and  a  network  of   800  locaPons,  Macy's  is  considered  the  largest  store  operator  in  the   USA   ➜  Daily  price  check  analysis  of  its  10,000  arPcles  in  less  than  two  hours   ➜  Whenever  a  neighboring  compePtor  anywhere  between  New  York   and  Los  Angeles  goes  for  aggressive  price  reducPons,  Macy's  follows   its  example   ➜  If  there  is  no  market  compePtor,  the  prices  remain  unchanged   hjp://www.t-­‐systems.com/about-­‐t-­‐systems/examples-­‐of-­‐successes-­‐companies-­‐analyze-­‐big-­‐data-­‐in-­‐record-­‐Pme-­‐l-­‐t-­‐systems/1029702   Use case: Flexible pricing
  • 13.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     • Big  data  paradigm  shiM     • Challenges  of  big  data   • Big  data  from  a  technology  perspecPve   • IntegraPon  with  an  open  source  framework   • IntegraPon  with  an  open  source  suite   Agenda
  • 14.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     This is your company Big Data Geek Limited big data experts
  • 15.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     ➜  Wanna  buy  a  big  data  soluPon  for  your  industry?     ➜  Maybe  a  compePtor  has  a  big  data  soluPon  which   adds  business  value?   ➜  The  compePtor  will  never  publish  it  (rat-­‐race)!   Big data tool selection (business perspective)
  • 16.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Looking  for  ‚your‘  required  big  data  product?   Support  your  data  from  scratch?     Good  luck!  J       Big data tool selection (technical perspective)
  • 17.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     How to solve these big data challenges?
  • 18.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     à  “[OMen]  simple  models  and      big  data  trump  more-­‐elaborate      [and  complex]  analyPcs  approaches”     à  “OMen  someone  coming  from      outside  an  industry  can  spot      a  bejer  way  to  use  big  data      than  an  insider”         Erik  Brynjolfsson  /  Lynn  Wu     hjp://alfredopassos.tumblr.com/post/32461599327/big-­‐data-­‐the-­‐management-­‐revoluPon-­‐by-­‐andrew-­‐mcafee   Be no expert! Be simple!
  • 19.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     à  Look  at  use  cases  of  others      (SMU,  but  also  large  companies)   à  How  can  you  do  something  similar    with  your  data?     à  You  have  different  data  sources?      Use  it!  Combine  it!  Play  with  it!   Be creative!
  • 20.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     1)  Do  not  begin  with  the  data,  think  about  business  opportuniPes   2)  Choose  the  right  data  (combine  different  data  sources)   3)  Use  easy  tooling      hjp://hbr.org/2012/10/making-­‐advanced-­‐analyPcs-­‐work-­‐for-­‐you     What is your Big Data process? Step  1   Step  2   Step  3  
  • 21.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     • Big  data  paradigm  shiM     • Challenges  of  big  data   • Big  data  from  a  technology  perspecPve   • IntegraPon  with  an  open  source  framework   • IntegraPon  with  an  open  source  suite   Agenda
  • 22.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Technology perspective How  to  process  big  data?  
  • 23.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner       The  criPcal  flaw  in  parallel  ETL  tools  is  the  fact  that  the  data  is  almost  never  local  to  the  processing   nodes.  This  means  that  every  Pme  a  large  job  is  run,  the  data  has  to  first  be  read  from  the  source,   split  N  ways  and  then  delivered  to  the  individual  nodes.    Worse,  if  the  parPPon  key  of  the  source   doesn’t  match  the  parPPon  key  of  the  target,  data  has  to  be  constantly  exchanged  among  the   nodes.  In  essence,  parallel  ETL  treats  the  network  as  if  it  were  a  physical  I/O  subsystem.    The   network,  which  is  always  the  slowest  part  of  the  process,  becomes  the  weakest  link  in  the   performance  chain.     hjp://blog.syncsort.com/2012/08/parallel-­‐etl-­‐tools-­‐are-­‐dead   How to process big data?
  • 24.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Slides:  hjp://www.slideshare.net/pavlobaron/100-­‐big-­‐data-­‐0-­‐hadoop-­‐0-­‐java     Video:  hjp://www.infoq.com/presentaPons/Big-­‐Data-­‐Hadoop-­‐Java   How to process big data?
  • 25.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     The  defacto  standard  for  big  data  processing   How to process big data?
  • 26.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Even  MicrosoM  (the  .NET  house)  relies  on  Hadoop  since  2011   How to process big data? “A  big  part  of  [the   company’s  strategy]   includes  wiring  SQL  Server   2012  (formerly  known  by   the  codename  “Denali”)  to   the  Hadoop  distributed   compuPng  playorm,  and   bringing  Hadoop  to   Windows  Server  and  Azure”  
  • 27.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Apache  Hadoop,  an  open-­‐source  soMware  library,  is  a   framework  that  allows  for  the  distributed  processing  of   large  data  sets  across  clusters  of  commodity  hardware   using  simple  programming  models.  It  is  designed  to  scale   up  from  single  servers  to  thousands  of  machines,  each   offering  local  computaPon  and  storage.       What is Hadoop?
  • 28.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Simple  example   •  Input:  (very  large)  text  files  with  lists  of  strings,  such  as:      „318,  0043012650999991949032412004...0500001N9+01111+99999999999...“   •  We  are  interested  just  in  some  content:  year  and  temperate  (marked  in  red)   •  The  Map  Reduce  funcPon  has  to  compute  the  maximum  temperature  for  every  year   Example  from  the  book  “Hadoop:  The  DefiniPve  Guide,  3rd  EdiPon”   Map (Shuffle) Reduce
  • 29.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     How to process big data?
  • 30.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     MapReduce HDFS Ecosystem Features included Hadoop   DistribuPon   Big  Data  Suite   few many Apache Hadoop Packaging Deployment-Tooling Support + Tooling / Modeling Code Generation Scheduling Integration + Hadoop alternatives
  • 31.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     • Big  data  paradigm  shiM     • Challenges  of  big  data   • Big  data  from  a  technology  perspecPve   • IntegraPon  with  an  open  source  framework   • IntegraPon  with  an  open  source  suite   Agenda
  • 32.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Connectivity Routing Transformation Complexity of Integration Enterprise   Service  Bus   IntegraPon  Suite   Low High Integration Framework INTEGRATION Tooling Monitoring Support+ BUSINESS PROCESS MGT. BIG DATA / MDM REGISTRY / REPOSITORY RULES ENGINE „YOU NAME IT“ + Alternatives for systems integration
  • 33.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     Complexity of Integration Enterprise   Service  Bus   IntegraPon  Suite   Low High Integration Framework Alternatives for systems integration
  • 34.
    ©  Talend  2013        “Big  Data  beyond  Hadoop  –  How  to  integrate  ALL  your  Data”  by  Kai  Wähner     More details about integration frameworks... hjp://www.kai-­‐waehner.de/blog/2012/12/20/showdown-­‐integraPon-­‐framework-­‐ spring-­‐integraPon-­‐apache-­‐camel-­‐vs-­‐enterprise-­‐service-­‐bus-­‐esb/  
  • 35.
    Enterprise Integration Patterns (EIP)
    Apache Camel implements the EIPs
  • 36.
    Enterprise Integration Patterns (EIP)
  • 37.
    Enterprise Integration Patterns (EIP)
  • 38.
    Architecture
    http://java.dzone.com/articles/apache-camel-integration
  • 39.
    Choose your required components
    HTTP, FTP, File, XSLT, MQ, JDBC, Akka, TCP, SMTP, RSS, Quartz, Log, LDAP, JMS, EJB, AMQP, Atom, AWS-S3, Bean-Validation, CXF, IRC, Jetty, JMX, Lucene, Netty, RMI, SQL, many many more
    Custom Components
  • 40.
    Choose your favorite DSL XML (not production-ready yet)
  • 41.
    Deploy it wherever you need
    Standalone, OSGi, Application Server, Web Container, Spring Container, Cloud
  • 42.
    Enterprise-ready
    • Open Source
    • Scalability
    • Error Handling
    • Transaction
    • Monitoring
    • Tooling
    • Commercial Support
  • 43.
    Example: Camel integration route
  • 44.
    Example: camel-hdfs component
    // Producer: write messages from a JMS queue to a file in HDFS
    from("jms:MyQueue")
        .to("hdfs:///myDirectory/myFile.txt?valueType=TEXT");
    // Consumer: read the file from HDFS and write it to the local file system
    from("hdfs:///myDirectory/myFile.txt")
        .to("file:target/reports/report.txt");
  • 45.
    Live demo: Apache Camel in action...
  • 46.
    Agenda
    • Big data paradigm shift
    • Challenges of big data
    • Big data from a technology perspective
    • Integration with an open source framework
    • Integration with an open source suite
  • 47.
    Alternatives for systems integration
    (Diagram: complexity of integration, from low to high)
    • Integration Framework: Connectivity, Routing, Transformation
    • Enterprise Service Bus: + Integration, Tooling, Monitoring, Support
    • Integration Suite: + Business Process Mgt., Big Data / MDM, Registry / Repository, Rules Engine, "You name it"
  • 48.
    Alternatives for systems integration
    (Diagram: complexity of integration, from low to high)
    • Integration Framework
    • Enterprise Service Bus
    • Integration Suite
  • 49.
    More details about ESBs and suites...
    http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choice-how-to-choose-the-right-enterprise-service-bus-esb/
  • 50.
    Vision: Democratize big data
    Talend Open Studio for Big Data ...an open source ecosystem
    • Improves efficiency of big data job design with a graphical interface
    • Generates Hadoop code and runs transforms inside Hadoop
    • Native support for HDFS, Pig, HBase, HCatalog, Sqoop and Hive
    • 100% open source under an Apache License
    • Standards based
  • 51.
    Vision: Democratize big data
    Talend Platform for Big Data ...an open source ecosystem
    • Builds on Talend Open Studio for Big Data
    • Adds data quality, advanced scalability and management functions
    • MapReduce massively parallel data processing
    • Shared Repository and remote deployment
    • Data quality and profiling
    • Data cleansing
    • Reporting and dashboards
    • Commercial support, warranty/IP indemnity under a subscription license
  • 52.
    Talend Open Studio for Big Data
  • 53.
    Live demo: "Talend Open Studio for Big Data" in action...
  • 54.
    Did you get the key message?
  • 55.
    Key messages
    • You have to care about big data to be competitive in the future!
    • You have to integrate different sources to get the most value out of it!
    • Big data integration is no (longer) rocket science!
  • 56.
    Did you get the key message?
  • 57.
    Thank you for your attention. Questions?
    kwaehner@talend.com
    www.kai-waehner.de
    LinkedIn / Xing
    @KaiWaehner