Bigdata and Hadoop Haridas Narayanaswamy https://haridas.in 06/Mar/2019
Quick reminder 1. HYD and BLR: Online Attendance registration link https://docs.google.com/spreadsheets/d/1DAuclSqS6xdv4QtixUQIs6 OjpxXRLVEEmnCeAkgD9z8/edit#gid=0 2. docker pull haridasn/hadoop-2.8.5 docker pull haridasn/hadoop-cli
Agenda • Introduction to big-data problems • How we can scale systems • Hadoop introduction • Set up a Hadoop cluster on your laptop
IO bound vs CPU bound problems • Downloading a file • Bitcoin mining ( proof of work ) • Watching a movie • ETL jobs • ML training
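A minimal Python sketch of the distinction above (function names and the workloads are illustrative, not from the slides): a CPU-bound task keeps the processor busy computing, while an IO-bound task spends its time waiting, here modeled with a sleep standing in for a network or disk wait.

```python
import hashlib
import time

def cpu_bound(n: int) -> str:
    """CPU-bound: repeatedly hash a buffer; time goes to computation
    (the shape of proof-of-work style mining)."""
    data = b"block"
    for _ in range(n):
        data = hashlib.sha256(data).digest()
    return data.hex()

def io_bound(delay: float) -> str:
    """IO-bound stand-in: the sleep models waiting on a download;
    the CPU is idle the whole time."""
    time.sleep(delay)
    return "downloaded"

start = time.perf_counter()
digest = cpu_bound(100_000)
cpu_time = time.perf_counter() - start

start = time.perf_counter()
result = io_bound(0.1)
io_time = time.perf_counter() - start
```

The practical consequence: IO-bound work scales by overlapping many waits (threads, async, more connections), while CPU-bound work scales only by adding compute, which is what pushes jobs like ETL and ML training onto clusters.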
Latency Numbers • nanosecond - the base unit for comparing these speeds
Application Architectures • Online • Standalone • Distributed
Monolithic • All on a single machine • Every application starts here
N-tier systems • Multiple services in your application • Separation of concerns • Master-slave database systems • Load balancer • Caching layer • Pretty good for standard application workloads.
SOA and Microservices • Service Oriented Architecture • Each service does only one task – fine-grained separation, hence the name micro-services • API Gateways • Messaging systems • Scalable databases accessible to the micro-services • Caching services • Etc.
CAP Theorem • Consistency - reads from all machines in the cluster return the same data. • Availability - every operation produces some result, even if the results aren’t consistent. • Partition tolerance - the system keeps operating even after some nodes are disconnected from the network. • Eventually consistent systems • Split-brain problem
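A toy sketch of eventual consistency (everything here is hypothetical illustration, not a real database): two replicas accept writes independently during a network partition, then reconcile with a last-write-wins merge once the partition heals — availability is preserved at the cost of temporary inconsistency.

```python
# Each replica maps key -> (value, timestamp). During a partition the
# replicas diverge; merge() reconciles them by keeping, for every key,
# the write with the highest timestamp (last-write-wins).

def write(replica, key, value, ts):
    replica[key] = (value, ts)

def merge(a, b):
    """Reconcile two diverged replicas into one converged state."""
    merged = dict(a)
    for key, (value, ts) in b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

r1, r2 = {}, {}
write(r1, "x", "v1", ts=1)     # write before the partition...
r2.update(r1)                  # ...replicated to both nodes
write(r1, "x", "v2", ts=2)     # partition: each side accepts writes alone
write(r2, "x", "v3", ts=3)     # (this is the split-brain window)
converged = merge(r1, r2)      # network heals; replicas converge
r1 = r2 = converged
```

Last-write-wins is only one reconciliation policy; it silently discards the losing write (`v2` above), which is exactly the trade-off the CAP theorem forces you to choose consciously.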
Bigdata platforms • Where do they fit in? • Offline vs Online systems
Count the number of unique hits from India • Consider this scenario for Facebook • The daily logs add up to petabytes • One machine can’t store them • Other scenarios • User click tracking • Tracking the effectiveness of an ad • All flavours of ETL jobs
Hadoop • Every node can store data - HDFS • Every node can process the data it holds locally • Easily scalable • Redundant and fault tolerant • Main components • Name Node • Data Node • Resource Manager • Node Manager
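As a quick way to see HDFS in action, a few standard `hdfs` shell commands — this assumes the Docker cluster from the setup step is running and `hdfs` is on the PATH (e.g. inside the `haridasn/hadoop-cli` container); the paths and the `access.log` file are placeholders:

```shell
hdfs dfs -mkdir -p /user/demo          # create a directory in HDFS
hdfs dfs -put access.log /user/demo/   # copy a local file into HDFS
hdfs dfs -ls /user/demo                # list it back
hdfs dfsadmin -report                  # Name Node's view of the Data Nodes
```

The `-put` is what makes every node a storage node: the file is split into blocks, and each block is replicated across Data Nodes, which is where the redundancy and fault tolerance come from.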
Map-Reduce compute model • Main stages of map-reduce • Copy code, not data • Data locality • How can we use map-reduce to count unique hits from a region?
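The unique-hits question above can be sketched as a toy in-process map-reduce (the log format and names are invented for illustration): each mapper emits a `(region, user_id)` pair, the shuffle groups pairs by region, and each reducer counts the distinct users for its region.

```python
from collections import defaultdict

# Fake log lines: "<region> <user_id> <path>"
logs = [
    "IN user1 /home", "IN user2 /home", "IN user1 /about",
    "US user3 /home", "US user3 /pricing",
]

def mapper(line):
    """Map stage: emit (region, user_id) for each log line."""
    region, user, _path = line.split()
    yield region, user

def shuffle(pairs):
    """Shuffle stage: group all values by key (region)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(region, users):
    """Reduce stage: count distinct users for one region."""
    return region, len(set(users))

mapped = (pair for line in logs for pair in mapper(line))
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# result == {"IN": 2, "US": 1}
```

On a real cluster the same three stages run distributed: mappers execute on the nodes that already hold the log blocks (data locality, "copy code, not data"), and only the small intermediate pairs travel over the network to the reducers.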
Yarn and HDFS • Main stages of map-reduce • Dynamic programming? • Bring the code to the data’s location • Data locality • How can we use map-reduce to count unique hits from a region?
Yarn Framework • Client submits jobs to the cluster • The AM handles application orchestration once it has been started by the RM • Containers are the actual resources (CPU/RAM/IO) allocated to the AM • Job progress tracking
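To see that client-to-RM flow end to end, one way is to submit the word-count example that ships with the Hadoop distribution — a sketch assuming the cluster from the setup step is running, with placeholder input/output paths (the jar version should match your install):

```shell
# Client submits the job; the RM starts an AM, which requests containers.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar \
    wordcount /user/demo/access.log /user/demo/out

yarn application -list                       # RM's view of running applications
hdfs dfs -cat /user/demo/out/part-r-00000    # reducer output, once finished
```

Watching `yarn application -list` while the job runs shows the lifecycle from the slide: the application moves through ACCEPTED (waiting for the AM container) to RUNNING (AM holding containers) to FINISHED.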
Hybrid environments • Storage on cloud object stores (S3 / Azure Storage) • Execution engine on our own cloud or on Kubernetes
Workshop
Next: Spark on Hadoop Yarn and PySpark on 13-March-2019
Thank you Haridas N <hn@haridas.in>
