MapReduce and Hadoop Distributed File System

The document discusses MapReduce and Hadoop Distributed File System. It provides an introduction to "big data" computing and the growing amounts of data being collected. It then describes MapReduce as a programming model for processing large data sets across clusters of machines, and the Hadoop Distributed File System (HDFS) as a supporting file system. The outline presented covers MapReduce programming model, HDFS, relevance to undergraduate curriculum, and a demonstration.

MapReduce and Hadoop Distributed File System

K. MADURAI AND B. RAMAMURTHY

Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
Partially Supported by
NSF DUE Grant: 0737243

CCSCNE 2009, Plattsburgh, April 24, 2009. B. Ramamurthy & K. Madurai

The Context: Big-data

• Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
• Google collected 270PB of data in a month (2007) and processed about 20PB a day (2008)
• 2010 census data is expected to be a huge gold mine of information
• Data mining huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
• We are in a knowledge economy.
• Data is an important asset to any organization
• Discovery of knowledge; enabling discovery; annotation of data
• We are looking at newer
  ◦ programming models, and
  ◦ supporting algorithms and data structures.
• NSF refers to it as “data-intensive computing” and industry calls it “big-data” and “cloud computing”

Purpose of this talk

To provide a simple introduction to:

• “Big-data computing”: an important advancement that has the potential to significantly impact CS and the undergraduate curriculum.
• A programming model called MapReduce for processing “big-data”
• A supporting file system called Hadoop Distributed File System (HDFS)

To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum.

The Outline

Introduction to MapReduce
From CS Foundation to MapReduce
MapReduce programming model
Hadoop Distributed File System
Relevance to Undergraduate Curriculum
Demo (Internet access needed)
Our experience with the framework
Summary
References

MapReduce

What is MapReduce?

• MapReduce is a programming model Google has used successfully in processing its “big-data” sets (more than 20 petabytes per day, according to the reference below).
• Users specify the computation in terms of a map and a reduce function.
• The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient communications, and performance issues.

Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

From CS Foundations to MapReduce

Consider a large data collection:

{web, weed, green, sun, moon, land, part, web, green, …}

Problem: Count the occurrences of the different words in the collection.

Let's design a solution for this problem:

• We will start from scratch
• We will add and relax constraints
• We will do incremental design, improving the solution for performance and scalability

Word Counter and Result Table

[Figure: class diagram. Main uses a WordCounter, with operations parse( ) and count( ), that reads a DataCollection and fills a ResultTable.]

Data collection: {web, weed, green, sun, moon, land, part, web, green, …}

Result table:
web    2
weed   1
green  2
sun    1
moon   1
land   1
part   1
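The single WordCounter design can be sketched in a few lines of Python. This is only an illustration of the diagram, not the authors' code: the function names `parse` and `count` mirror the operations in the figure, and the whitespace tokenization is an assumption.

```python
from collections import Counter

def parse(data_collection):
    """Split the raw text into words (the diagram's parse step).
    Splitting on whitespace is an assumption made for this sketch."""
    return data_collection.split()

def count(words):
    """Tally occurrences of each word (the diagram's count step)."""
    result_table = Counter()
    for word in words:
        result_table[word] += 1
    return result_table

# Main: one WordCounter, one DataCollection, one ResultTable.
data_collection = "web weed green sun moon land part web green"
result_table = count(parse(data_collection))
for word, n in result_table.items():
    print(word, n)
```

Everything runs in one thread on one machine, which is the starting point that the following slides relax.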

Multiple Instances of Word Counter

[Figure: the same class diagram, now with Main starting 1..* Threads. Each thread runs a WordCounter (parse( ), count( )) over the DataCollection and updates one shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]

Observe:
• Multi-thread
• Lock on shared data
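A minimal sketch of the multi-threaded variant, assuming the data collection has already been cut into three chunks (the chunking and the names `word_counter` and `table_lock` are illustrative, not from the slides). Every thread writes to the same result table, so each update takes the lock.

```python
import threading
from collections import Counter

result_table = Counter()        # shared by all WordCounter threads
table_lock = threading.Lock()   # guards every update to result_table

def word_counter(chunk):
    """Parse one chunk and count its words into the shared table."""
    for word in chunk.split():
        with table_lock:        # lock on shared data
            result_table[word] += 1

# Hypothetical chunks of the data collection, one per thread.
chunks = ["web weed green", "sun moon land", "part web green"]
threads = [threading.Thread(target=word_counter, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(dict(result_table))
```

The lock keeps the counts correct, but every thread contends for it on every word, which is the performance problem the next design removes.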

Improve Word Counter for Performance

[Figure: Main starts 1..* Parser threads and 1..* Counter threads. Parsers turn the DataCollection into a WordList of KEY/VALUE entries; Counters turn the WordList into the ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]

Observe:
• Separate counters
• No need for lock

WordList keys: web, weed, green, sun, moon, land, part, web, green, …

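One way to read "separate counters, no need for lock" is sketched below. It is an assumption-laden illustration: each parser emits a private word list of (key, 1) pairs, each counter tallies into its own table, and the partial tables are merged only after the workers finish. The value 1 per key and the helper name `run` are choices made for this sketch.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def parse(chunk):
    """Parser: turn a chunk of text into a word list of (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def count(word_list):
    """Counter: tally one word list into a private table (no shared state)."""
    local = Counter()
    for key, value in word_list:
        local[key] += value
    return local

def run(chunks):
    """Run parsers and counters in threads, then merge the partial tables."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        word_lists = list(pool.map(parse, chunks))
        partials = list(pool.map(count, word_lists))
    result_table = Counter()
    for partial in partials:    # merge happens in one thread, so no lock
        result_table.update(partial)
    return result_table

print(dict(run(["web weed green", "sun moon land", "part web green"])))
```

Because no worker ever touches another worker's table, the only sequential step left is the final merge.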
Peta-scale Data

[Figure: the same Parser/Counter diagram (DataCollection, WordList, ResultTable), now applied to a peta-scale data collection.]

Addressing the Scale Issue

• A single machine cannot serve all the data: you need a special distributed (file) system
• Large number of commodity hardware disks: say, 1000 disks of 1TB each
• Issue: with a failure rate of 1/1000 per disk, on average about 1 of the above 1000 disks would be down at a given time.
• Thus failure is the norm and not an exception.
• File system has to be fault-tolerant: replication, checksum
• Data transfer bandwidth is critical (location of data)
• Critical aspects: fault tolerance + replication + load balancing, monitoring
• Exploit parallelism afforded by splitting parsing and counting
• Provision and locate computing at data locations
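The failure argument on this slide is simple arithmetic, worked through below. The sketch assumes each disk is down independently with probability 1/1000 at any given moment; the independence assumption is ours, not the slide's.

```python
disks = 1000
p_down = 1 / 1000   # assumed independent chance that a given disk is down

# Expected number of disks down at a given time.
expected_down = disks * p_down

# Probability that at least one of the disks is down.
p_at_least_one_down = 1 - (1 - p_down) ** disks

print(f"expected disks down: {expected_down:.1f}")          # 1.0
print(f"P(at least one down): {p_at_least_one_down:.3f}")   # 0.632
```

So on average one disk is down, and at any instant there is roughly a 63% chance that some disk is unavailable, which is why the file system has to treat failure as routine.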

Peta Scale Data is Commonly Distributed

[Figure: the same Parser/Counter diagram (DataCollection, WordList, ResultTable), now fed by several separate data collections instead of one.]

Issue: managing the large-scale data

Write Once Read Many (WORM) data

[Figure: the same Parser/Counter diagram fed by several separate data collections.]

WORM Data is Amenable to Parallelism

1. Data with WORM characteristics: yields to parallel processing
2. Data without dependencies: yields to out-of-order processing

[Figure: the same Parser/Counter diagram fed by several separate data collections.]

Divide and Conquer: Provision Computing at Data Location

[Figure: four nodes. Each node holds its own data collection and runs its own Main with 1..* Parser and Counter threads (DataCollection, WordList, ResultTable).]

For our example,
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks

This is a particular solution; let's generalize it:

Our parse is a mapping operation:
MAP: input → <key, value> pairs

Our count is a reduce operation:
REDUCE: <key, value> pairs reduced

Map/Reduce originated from Lisp, but they have a different meaning here.

Runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
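The generalization from parse/count to map/reduce can be sketched as two small functions plus a grouping step. This is a single-process illustration under our own naming (`map_fn`, `shuffle`, `reduce_fn`); a real runtime would run the map and reduce calls on different nodes and add the fault tolerance listed above.

```python
from collections import defaultdict

def map_fn(split):
    """MAP: input -> <key, value> pairs. Each word becomes <word, 1>."""
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    """Group values by key so that each reduce call sees a single key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """REDUCE: the <key, value> pairs for one key are reduced to one result."""
    return key, sum(values)

def word_count(splits):
    """Run map over every split, group by key, then reduce each group."""
    pairs = [pair for split in splits for pair in map_fn(split)]
    return dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())

print(word_count(["web weed green", "sun moon land", "part web green"]))
```

Only `map_fn` and `reduce_fn` are specific to word counting; swapping them out changes the application while the surrounding plumbing stays the same.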

Mapper and Reducer

Remember: MapReduce is simplified processing for larger data sets.

MapReduce Version of WordCount Source code

Map Operation

MAP: Input data → <key, value> pair

[Figure: the data collection is divided into split 1, split 2, …, split n. The data is split to supply multiple processors, and each split feeds its own Map.]

Map output for one split:
KEY    VALUE
web    1
weed   1
green  1
sun    1
moon   1
land   1
part   1
web    1
green  1
…      1

Reduce Operation

MAP: Input data → <key, value> pair

REDUCE: <key, value> pair → <result>

[Figure: the data collection is divided into split 1, split 2, …, split n to supply multiple processors. Each split feeds a Map, and the Map outputs feed the Reduce tasks.]

[Figure: large-scale data splits each go through a Parse-hash step that emits Map <key, 1> pairs. The pairs are routed to the Reducers (say, Count), which write the output partitions P-0000, P-0001 and P-0002 with their counts (count1, count2, count3).]
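The parse-hash step in the figure can be sketched as follows: each mapper hashes every key to decide which reducer receives it, so all pairs for one key end up in the same output partition. The CRC32 hash, the function names, and the choice of three reducers are assumptions for this sketch; the slides do not say which hash is used.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Pick a reducer for a key. CRC32 is used because it is stable across
    runs (Python's built-in hash() of strings is randomized per process)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def parse_hash(split, num_reducers):
    """Map side: emit <key, 1> pairs, bucketed by destination reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for word in split.split():
        buckets[partition(word, num_reducers)].append((word, 1))
    return buckets

def run(splits, num_reducers=3):
    """Route every bucket to its reducer, then count within each reducer."""
    reducer_inputs = [[] for _ in range(num_reducers)]
    for split in splits:
        for r, bucket in enumerate(parse_hash(split, num_reducers)):
            reducer_inputs[r].extend(bucket)
    parts = []
    for pairs in reducer_inputs:       # one output partition per reducer
        counts = Counter()
        for key, value in pairs:
            counts[key] += value
        parts.append(dict(counts))
    return parts

for i, part in enumerate(run(["web weed green", "sun moon land", "part web green"])):
    print(f"P-{i:04d}", part)
```

Because the partition depends only on the key, no two reducers ever count the same word, and the union of the partitions is the full result table.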

MapReduce Example in my operating systems class

[Figure: a terabyte-size input of words (Cat, Bat, Dog, other words) is divided into splits. Each split goes through map and then combine; the combined outputs feed three reduce tasks, which write part0, part1 and part2.]

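The combine stage in this pipeline can be illustrated with a small sketch. It assumes the combiner does local pre-aggregation of each mapper's output before anything is sent to the reducers; the sample splits and function names are made up for the example.

```python
from collections import Counter

def map_fn(split):
    """Emit <word, 1> for every word in the split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Pre-aggregate one mapper's output locally, before it is shipped."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_fn(all_pairs):
    """Sum the (already partially summed) values for each key."""
    totals = Counter()
    for key, value in all_pairs:
        totals[key] += value
    return dict(totals)

splits = ["cat bat cat", "dog cat bat", "dog dog cat"]
combined = [combine(map_fn(s)) for s in splits]
totals = reduce_fn(pair for pairs in combined for pair in pairs)
print(totals)
```

For the first split, map emits three pairs but combine forwards only two (`cat` with 2 and `bat` with 1), which is the point of the stage: less data crosses the network to the reducers.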
MapReduce Programming Model

MapReduce programming model

• Determine if the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?).
• Design and implement the solution as Mapper classes and a Reducer class.
• Compile the source code with Hadoop core.
• Package the code as an executable jar.
• Configure the application (job) as to the number of mappers and reducers (tasks), input and output streams.
• Load the data (or use it on previously available data).
• Launch the job and monitor.
• Study the result.
• Detailed steps.

MapReduce Characteristics

• Very large scale data: petabytes, exabytes
• Write once and read many data: allows for parallelism without mutexes
• Map and Reduce are the main operations: simple code
• There are other supporting operations such as combine and partition (out of the scope of this talk).
• All the map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• The numbers of map tasks and reduce tasks are configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
• Runtime takes care of splitting and moving data for operations.
• Special distributed file system. Example: Hadoop Distributed File System and Hadoop Runtime.

Classes of problems “mapreducable”

• Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
• Google uses it (we think) for wordcount, adwords, pagerank, indexing data.
• Simple algorithms such as grep, text-indexing, reverse indexing
• Bayesian classification: data mining domain
• Facebook uses it for various operations: demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extra-terrestrial objects.
• Expected to play a critical role in the semantic web and Web 3.0
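As one example from this list, reverse indexing (an inverted index from words to the documents that contain them) fits the map/reduce shape directly. The sketch below is a single-process illustration; the document ids, sample texts, and function names are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """MAP: emit <word, doc_id> once for each distinct word in the document."""
    return [(word, doc_id) for word in set(text.lower().split())]

def reduce_fn(word, doc_ids):
    """REDUCE: collect the sorted list of documents containing the word."""
    return word, sorted(doc_ids)

def inverted_index(docs):
    """Group map output by word, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    return dict(reduce_fn(w, ids) for w, ids in groups.items())

docs = {"d1": "web weed green", "d2": "sun moon web", "d3": "green land"}
index = inverted_index(docs)
print(index["web"])    # ['d1', 'd2']
```

Compared with word count, only the emitted value (a document id instead of 1) and the reduce step (collect instead of sum) change.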

Scope of MapReduce

Data size: small
Pipelined     Instruction level
Concurrent    Thread level
Service       Object level
Indexed       File level
Mega          Block level
Virtual       System level
Data size: large

Hadoop

What is Hadoop?

At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and Yahoo! built an open-source counterpart, modeled on the published GFS design, and called it Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.
This is open source and distributed by Apache.

Basic Features: HDFS

Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware

Hadoop Distributed File System

[Figure: an Application uses the HDFS Client, which talks to the HDFS Server on the master node (Name Nodes). Local file system: block size 2K. HDFS: block size 128M, replicated.]

More details: We discuss this in great detail in my Operating Systems course.

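The two block sizes on this slide can be compared with a little arithmetic. The sketch below counts how many blocks a 1 GB file occupies at 2K and at 128M. The file size is an arbitrary example, and the replication factor of 3 is an assumption for illustration; the slide only says that blocks are replicated.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

def num_blocks(file_size, block_size):
    """Number of blocks needed to hold the file (last block may be partial)."""
    return (file_size + block_size - 1) // block_size

file_size = 1 * GB                             # example file, chosen arbitrarily
local_blocks = num_blocks(file_size, 2 * KB)   # local file system, 2K blocks
hdfs_blocks = num_blocks(file_size, 128 * MB)  # HDFS, 128M blocks

replication = 3                                # assumed replication factor
stored_copies = hdfs_blocks * replication

print(local_blocks)    # 524288
print(hdfs_blocks)     # 8
print(stored_copies)   # 24
```

With large blocks the master node tracks 8 entries for this file instead of more than half a million, which keeps the block map small enough to manage centrally.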
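Hadoop Distributed File System

[Figure: the same HDFS diagram (Application, HDFS Client, HDFS Server on the master node), now annotated with the blockmap and heartbeat exchanged with the master node.]
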
Relevance to Undergraduate Curriculum

Data structures and algorithms: a new look at traditional algorithms such as sort: Quicksort may not be your choice! It is not easily parallelizable. Merge sort is better.
You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant to your applications.
Large-scale data and analytics are indeed concepts to reckon with, similar to how we addressed “programming in the large” with OO concepts.
While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.
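The remark about merge sort can be made concrete with a short sketch: sorting each chunk is independent work that plays the role of a mapper, and merging the sorted runs plays the role of a reducer. The chunks and function names are illustrative, and the example runs in one process rather than on a cluster.

```python
import heapq

def map_sort(chunk):
    """Mapper role: sort one chunk independently of all the others."""
    return sorted(chunk)

def reduce_merge(sorted_runs):
    """Reducer role: merge already-sorted runs into one sorted list."""
    return list(heapq.merge(*sorted_runs))

chunks = [[42, 7, 19], [3, 88, 21], [5, 64, 1]]
runs = [map_sort(c) for c in chunks]   # each call could run on its own node
result = reduce_merge(runs)
print(result)   # [1, 3, 5, 7, 19, 21, 42, 64, 88]
```

The merge step only ever compares the heads of the runs, so it streams through the data without needing all of it in memory at once.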

Demo

VMware simulated Hadoop and MapReduce demo
Remote access to NEXOS system at my Buffalo office
5-node cluster running HDFS on Ubuntu 8.04
1 name node and 4 data nodes
Each is an old commodity PC with 512 MB RAM, 120GB – 160GB external memory
Zeus (namenode), datanodes: hermes, dionysus, aphrodite, athena

Summary

We introduced the MapReduce programming model for processing large-scale data
We discussed the supporting Hadoop Distributed File System
The concepts were illustrated using a simple example
We reviewed some important parts of the source code for the example.
Relationship to Cloud Computing

References

1. Apache Hadoop Tutorial: http://hadoop.apache.org
   http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html