© 2011 IBM Corporation Hadoop architecture IBM Information Management – Cloud Computing Center of Competence IBM Canada Labs
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 2
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 3
© 2011 IBM Corporation 4 Node 1 Terminology review
© 2011 IBM Corporation 5 Node 2 Node 1 Terminology review
© 2011 IBM Corporation 6 Node 2 Node n … Node 1 Terminology review
© 2011 IBM Corporation 7 Rack 1 Node 2 Node n … Node 1 Terminology review
© 2011 IBM Corporation 8 Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Terminology review
© 2011 IBM Corporation 9 Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Node 2 Node n … Rack n Node 1 … Terminology review
© 2011 IBM Corporation 10 Hadoop cluster Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Node 2 Node n … Rack n Node 1 … Terminology review
© 2011 IBM Corporation Hadoop architecture 11 • Two main components: – Hadoop Distributed File System (HDFS) – MapReduce Engine
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 12
© 2011 IBM Corporation Hadoop distributed file system (HDFS) 13 • Hadoop file system that runs on top of the existing native file system • Designed to handle very large files with streaming data access patterns • Uses blocks to store a file or parts of a file
© 2011 IBM Corporation HDFS - Blocks 14 • File Blocks – 64 MB (default), 128 MB (recommended) – compared to 4 KB in UNIX – Behind the scenes, 1 HDFS block is backed by multiple operating system (OS) blocks (diagram: one 128 MB HDFS block made up of many smaller OS blocks)
© 2011 IBM Corporation HDFS - Blocks 15 • Fits well with replication to provide fault tolerance and availability • Advantages of blocks: – Fixed size – easy to calculate how many fit on a disk – A file can be larger than any single disk in the network – If a file or a chunk of the file is smaller than the block size, only the needed space is used. E.g., a 420 MB file is split as: 128 MB + 128 MB + 128 MB + 36 MB
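A quick way to see this split on a live cluster is the file system checker, which lists the blocks behind a file (a minimal sketch; the file path is only an example):
hadoop fsck /user/keith/myfile.txt -files -blocks -locations
The -locations flag also shows which DataNodes hold each replica, which connects to the replication slide that follows.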
© 2011 IBM Corporation HDFS - Replication 16 • Blocks with data are replicated to multiple nodes • Allows for node failure without data loss (diagram: blocks replicated across Node 1, Node 2 and Node 3)
© 2011 IBM Corporation Writing a file to HDFS 17–27 (a sequence of diagram-only slides stepping through the write path; the images are not reproduced here. In brief: the client asks the NameNode for target DataNodes for each block, streams the block to the first DataNode, which forwards it along a pipeline to the other replicas, and once the replicas are acknowledged the NameNode records the block locations.)
© 2011 IBM Corporation HDFS Command line interface 28 • File System Shell (fs) • Invoked as follows: hadoop fs <args> • Example: • Listing the current directory in HDFS: hadoop fs -ls .
© 2011 IBM Corporation HDFS Command line interface 29 • FS shell commands take path URIs as arguments • URI format: scheme://authority/path • Scheme: • For the local file system, the scheme is file • For HDFS, the scheme is hdfs hadoop fs -copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt • Scheme and authority are optional • Defaults are taken from the configuration file core-site.xml
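Because the defaults come from core-site.xml (the fs.default.name property in Hadoop 1.x-era configurations), the two commands below are equivalent when that property points at hdfs://localhost; the host name and path are only illustrative:
hadoop fs -ls hdfs://localhost/user/keith
hadoop fs -ls /user/keith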
© 2011 IBM Corporation HDFS Command line interface 30 • Many POSIX-like commands • cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail • Some HDFS-specific commands • copyFromLocal, copyToLocal, get, getmerge, put, setrep
© 2011 IBM Corporation HDFS – Specific commands 31 • copyFromLocal / put • Copy files from the local file system into fs hadoop fs -copyFromLocal <localsrc> ... <dst> or hadoop fs -put <localsrc> ... <dst>
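For example, with hypothetical paths, copying a local log file into a user's HDFS directory:
hadoop fs -put access.log /user/keith/logs/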
© 2011 IBM Corporation HDFS – Specific commands 32 • copyToLocal / get • Copy files from fs into the local file system hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst> or hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
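For example, again with hypothetical paths, pulling that file back into the local working directory:
hadoop fs -get /user/keith/logs/access.log .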
© 2011 IBM Corporation HDFS – Specific commands 33 • getmerge • Get all the files in the directories that match the source file pattern • Merges and sorts them into a single file on the local file system • <src> is kept hadoop fs -getmerge <src> <localdst>
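A typical use, with an assumed output path, is collapsing the part-* files produced by a MapReduce job into one local file:
hadoop fs -getmerge /user/keith/output results.txt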
© 2011 IBM Corporation HDFS – Specific commands 34 • setrep • Set the replication level of a file. • The -R flag requests a recursive change of replication level for an entire tree. • If -w is specified, waits until the new replication level is achieved. hadoop fs -setrep [-R] [-w] <rep> <path/file>
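For example, with an illustrative path, raising a file's replication factor to 3 and waiting until the extra replicas exist:
hadoop fs -setrep -w 3 /user/keith/myfile.txt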
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 35
© 2011 IBM Corporation MapReduce engine 36 • Technology from Google • A MapReduce program consists of map and reduce functions • A MapReduce job is broken into tasks that run in parallel
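As a hedged sketch of how such a job is launched (the examples jar name and the input/output paths vary by distribution and are assumptions here), a MapReduce program such as word count is packaged in a jar and submitted to the cluster:
hadoop jar hadoop-examples.jar wordcount /user/keith/input /user/keith/output
The job is then split into map and reduce tasks that run in parallel on the worker nodes, as described in the next section.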
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 37
© 2011 IBM Corporation Types of nodes - Overview 38 • HDFS nodes – NameNode – DataNode • MapReduce nodes – JobTracker – TaskTracker • There are other nodes not discussed in this course
© 2011 IBM Corporation Types of nodes - Overview 39–42 (a sequence of diagram-only slides; the images are not reproduced here. In the usual layout, the NameNode and JobTracker act as master nodes, while each worker node runs both a DataNode and a TaskTracker so tasks can read blocks locally.)
© 2011 IBM Corporation Types of nodes - NameNode 43 • NameNode – Only one per Hadoop cluster – Manages the filesystem namespace and metadata – Single point of failure, mitigated by writing its state to multiple filesystems – Because it is a single point of failure with large memory requirements, don't use inexpensive commodity hardware for this node
© 2011 IBM Corporation Types of nodes - DataNode 44 • DataNode – Many per Hadoop cluster – Manages blocks with data and serves them to clients – Periodically reports to the NameNode the list of blocks it stores – Use inexpensive commodity hardware for this node
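A quick way to see which DataNodes the NameNode currently knows about, along with per-node capacity and usage, is the admin report command (Hadoop 1.x-era syntax):
hadoop dfsadmin -report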
© 2011 IBM Corporation Types of nodes - JobTracker 45 • JobTracker node – One per Hadoop cluster – Receives job requests submitted by client – Schedules and monitors MapReduce jobs on task trackers
© 2011 IBM Corporation Types of nodes - TaskTracker 46 • TaskTracker node – Many per Hadoop cluster – Executes MapReduce operations – Reads blocks from DataNodes
© 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 47
© 2011 IBM Corporation Topology awareness (or Rack awareness) 48 Bandwidth becomes progressively smaller in the following scenarios:
© 2011 IBM Corporation Topology awareness 49 Bandwidth becomes progressively smaller in the following scenarios: 1. Processes on the same node
© 2011 IBM Corporation Topology awareness 50 Bandwidth becomes progressively smaller in the following scenarios: 1. Processes on the same node 2. Different nodes on the same rack
© 2011 IBM Corporation Topology awareness 51 Bandwidth becomes progressively smaller in the following scenarios: 1. Processes on the same node 2. Different nodes on the same rack 3. Nodes on different racks in the same data center
© 2011 IBM Corporation Topology awareness 52 Bandwidth becomes progressively smaller in the following scenarios: 1. Processes on the same node 2. Different nodes on the same rack 3. Nodes on different racks in the same data center 4. Nodes in different data centers
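Hadoop learns this topology from a user-supplied rack-mapping script registered in core-site.xml (topology.script.file.name in Hadoop 1.x, renamed net.topology.script.file.name in later releases). A minimal sketch follows; the subnets and rack names are assumptions. Hadoop passes host names or IP addresses as arguments and expects one rack path per line on standard output:
#!/bin/bash
# Hypothetical rack-mapping script: map each argument (host name or IP) to a rack path.
# The subnet-to-rack assignments below are purely illustrative.
for host in "$@"; do
  case "$host" in
    10.1.1.*) echo "/datacenter1/rack1" ;;
    10.1.2.*) echo "/datacenter1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done
With this mapping in place, HDFS spreads block replicas across racks and MapReduce prefers scheduling tasks on nodes, or at least racks, that already hold the data.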
© 2011 IBM Corporation Thank you!
