Unit 6 - Compression and Serialization in Hadoop

The document discusses compression and serialization techniques in Hadoop, emphasizing their importance for efficient data storage, transfer, and processing. It outlines various compression formats, codecs, and the role of serialization in the Hadoop I/O workflow. Additionally, it compares built-in and third-party serialization frameworks, highlighting their characteristics and use cases.

MCS7101 - Big Data Analytics

Unit Six – Compression and Serialization in Hadoop



Instructor

Tamale Micheal
Assistant Lecturer - Computer Science (PhD - Student)
Department of Computer Science
Faculty of Computing, Library and Information Sciences
Kabale University



Introduction

▪ In Hadoop, compression is an essential technique used in I/O (input/output) operations to
reduce the size of data, which improves storage efficiency and speeds up data transfer
across the network.
▪ Since Hadoop deals with very large datasets, compressing data helps reduce the amount
of space required to store it and decreases the time required for data processing by
lowering disk I/O and network traffic.

Why Compression Is Important in Hadoop

i. Reduced Storage Costs


ii. Improved Data Transfer Speed
iii. Faster Processing
iv. Optimized Resource Usage

Types of Compression in Hadoop

i. Gzip (GNU zip)

ii. Bzip2

iii. Snappy

iv. LZO (Lempel-Ziv-Oberhumer)

Splittable vs. Non-Splittable Compression Formats

▪ Splittability refers to the ability to split a compressed file into chunks for parallel
processing, which is a key aspect of how Hadoop processes large datasets in a distributed
manner.
▪ Splittable Compression Formats (e.g., Bzip2, LZO with indexing) allow large files to be
processed in parallel, making Hadoop’s MapReduce framework more efficient.
▪ Non-Splittable Compression Formats (e.g., Gzip, Snappy) don’t allow splitting, which
can be a bottleneck for large files: the entire compressed file has to be handled by a
single mapper, reducing parallelism. A quick programmatic splittability check is sketched below.
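As a rough sketch, the snippet below uses Hadoop's CompressionCodecFactory to resolve a codec from a file name and then checks whether it implements the SplittableCompressionCodec interface (BZip2Codec does; GzipCodec does not). The file names here are made up purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Resolve the codec from the file extension (.gz, .bz2, ...) and report splittability.
        for (String name : new String[] {"logs.gz", "logs.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> " + codec.getClass().getSimpleName()
                    + ", splittable: " + splittable);
        }
    }
}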

Hadoop Compression Codecs

▪ In Hadoop, a codec is the implementation of a compression/decompression algorithm.

▪ Hadoop provides several built-in compression codecs.

▪ Some popular codecs include the following (a short usage sketch follows the list):

i. GzipCodec: Handles Gzip compression and decompression.

ii. BZip2Codec: Handles Bzip2 compression and decompression.

iii. SnappyCodec: Handles Snappy compression and decompression.

iv. Lz4Codec: Provides a balance between compression speed and ratio, often used in
high-performance environments.

v. LzoCodec: Used for LZO compression; requires installation of additional libraries for
Hadoop to support it.
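As a minimal sketch of working with a codec directly, the example below instantiates GzipCodec through ReflectionUtils and wraps a plain output stream so that everything written through it is compressed. The input and output file names are assumptions for illustration; any other codec class name could be substituted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Instantiate the Gzip codec; other codec class names could be used here.
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Wrap a plain output stream so everything written through it is compressed.
        try (FileInputStream in = new FileInputStream("input.txt");
             FileOutputStream rawOut = new FileOutputStream("input.txt.gz");
             CompressionOutputStream out = codec.createOutputStream(rawOut)) {
            IOUtils.copyBytes(in, out, 4096);
            out.finish();  // flush any buffered compressed data
        }
    }
}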
Compression in Hadoop MapReduce

▪ In Hadoop MapReduce, compression can be applied at different stages; a configuration
sketch follows the list below.

1. Input Compression

2. Intermediate (Shuffle) Compression

3. Output Compression
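The sketch below shows one common way these stages are configured. The property names and the FileOutputFormat calls are standard Hadoop MapReduce APIs; the job name and the codec choices are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // 2. Intermediate (shuffle) compression: compress map output before it is
        //    sent across the network to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-job");

        // 3. Output compression: compress the final job output written back to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // 1. Input compression needs no explicit setting here: Hadoop infers the codec
        //    from the input file's extension (e.g. .gz, .bz2).
        return job;
    }
}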

Trade-offs in Compression

1. Compression Ratio vs. Speed


▪ High compression ratio algorithms (e.g., Bzip2) result in smaller files but tend to be
slower. They are suitable for scenarios where storage space is a bigger concern than
processing speed.
▪ Fast compression algorithms (e.g., Snappy, LZO) focus on speed but may not
compress as well. They are more suitable for real-time data processing or situations
where speed is a priority.

Cont...

2. CPU Overhead
▪ Compression and decompression require CPU resources. For large clusters, the CPU
overhead may be offset by the benefits of reduced I/O and network usage.
▪ However, in smaller environments, the CPU cost may become a bottleneck if
compression algorithms are too slow.

Serialization

▪ Serialization in Hadoop I/O refers to the process of converting data objects (like records,
values, or structures) into a stream of bytes that can be efficiently stored or transmitted
over a network.
▪ In Hadoop, serialization is critical because it allows data to be written to and read from the
Hadoop Distributed File System (HDFS) and enables communication between nodes
during distributed processing tasks like MapReduce.

Why Is Serialization Important?

▪ Efficient Storage

▪ Efficient Transmission

▪ Interoperability

Hadoop's Default Serialization Frameworks

▪ Hadoop provides several serialization mechanisms, each designed for specific use cases.

▪ The most commonly used serialization frameworks in Hadoop are:

1. Writable Interface (Hadoop’s Native Serialization)


▪ Writable is the default serialization mechanism in Hadoop.

▪ Any object in Hadoop that needs to be serialized must implement the Writable interface.

▪ It’s highly optimized for Hadoop’s I/O operations and is lightweight compared to other
serialization mechanisms.

Cont...

Key Characteristics:
• Compact: Data is serialized in a binary format, resulting in minimal storage
overhead.
• Fast: Writable is designed for high-performance I/O.
• Customizable: Users can implement custom Writable objects for their specific
data types.
Writable Examples:
• IntWritable, LongWritable, Text, and DoubleWritable are examples of built-in Writable
types that correspond to common Java data types. A custom Writable is sketched below.
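A minimal custom Writable might look like the sketch below; the record and its fields (a station name and a temperature reading) are invented purely to show the write()/readFields() contract.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical record type: a weather-station name plus a temperature reading.
public class StationReading implements Writable {
    private final Text station = new Text();
    private final IntWritable temperature = new IntWritable();

    public void set(String name, int temp) {
        station.set(name);
        temperature.set(temp);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        station.write(out);
        temperature.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in exactly the same order as write().
        station.readFields(in);
        temperature.readFields(in);
    }
}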
Cont...

2. WritableComparable Interface
▪ This is an extension of Writable that adds comparison functionality, often used when key
objects need to be compared (such as in sorting tasks).
▪ Any object that is used as a key in Hadoop MapReduce must implement
WritableComparable, as in the sketch below.
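A sketch of such a key type follows; the class and its single field are hypothetical, but the pattern (the Writable methods plus compareTo, hashCode, and equals) is what MapReduce expects of key types.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type that sorts records by year during the shuffle/sort phase.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public void set(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeInt(year); }

    @Override
    public void readFields(DataInput in) throws IOException { year = in.readInt(); }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);  // defines the sort order of keys
    }

    @Override
    public int hashCode() { return year; }  // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}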

Cont...

3. Text (Writable for String data)


▪ In Hadoop, the Text class is a specialized Writable for handling UTF-8 encoded strings.

▪ Because it stores its contents as UTF-8 bytes and can be reused across records (it is
mutable), it is more efficient for Hadoop’s internal data processing than Java’s immutable
String; a short usage example follows.
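A short usage example, with made-up values:

import org.apache.hadoop.io.Text;

public class TextDemo {
    public static void main(String[] args) {
        // Text stores its contents as UTF-8 bytes and can be reused across records.
        Text t = new Text("Kabale");
        System.out.println(t.getLength());  // length in UTF-8 bytes
        System.out.println(t.toString());   // convert back to a java.lang.String

        t.set("Hadoop");  // reuse the same object instead of allocating a new one
        System.out.println(t);
    }
}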

Third-Party Serialization Frameworks in Hadoop

▪ In addition to Hadoop's built-in Writable system, Hadoop can integrate with third-party
serialization frameworks that are more flexible or efficient for specific use cases, especially
when interoperability with other systems is required.

1. Apache Avro
▪ Avro is a popular serialization framework used in Hadoop for working with complex
data types.
▪ It stores data in a compact binary format and also includes a schema with the data,
which makes it self-describing.

Cont...

Key Characteristics
• Schema-based: Avro uses a schema to describe the structure of the data,
enabling both serialization and deserialization to be flexible across different
languages.
• Interoperability: Avro is language-neutral, meaning data serialized with Avro can
be deserialized in any language that has an Avro library (e.g., Java, Python, C++).
• Efficient: Avro is more compact than Hadoop’s Writable in many cases, and its schema
support makes it well suited to scenarios requiring schema evolution and cross-language
communication. A minimal serialization sketch follows.
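The sketch below serializes one record with Avro's generic API. The schema (a User record with a name and an age) and the field values are invented for illustration, while Schema, GenericRecord, and EncoderFactory are part of the standard Avro Java library.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // The schema describes the record structure and makes the data self-describing.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Arthur");
        user.put("age", 30);

        // Serialize the record to a compact binary stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized size: " + out.size() + " bytes");
    }
}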

Cont...

2. Protocol Buffers (Protobuf)


▪ Google Protocol Buffers is another serialization framework used in Hadoop for
structured data. Like Avro, Protobuf is schema-based and language-neutral.
Key Characteristics
• Compact binary format: Data serialized with Protobuf is extremely compact.
• Schema-based: Like Avro, Protobuf uses a schema to describe the structure of
the data.
• Language-neutral: Supports multiple programming languages (e.g., Java, C++,
Python). A small Java sketch follows.
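The pattern below is how protoc-generated Java classes are typically used. UserProtos.User is a hypothetical class that would come from compiling a user.proto file, so this sketch only compiles once such a class has been generated.

public class ProtobufSketch {
    public static void main(String[] args) throws Exception {
        // Build and serialize a message to a compact binary byte array
        // (newBuilder, toByteArray, and parseFrom are produced by protoc).
        UserProtos.User user = UserProtos.User.newBuilder()
                .setName("Arthur")
                .setAge(30)
                .build();
        byte[] bytes = user.toByteArray();

        // Deserialize the same bytes back into a message object.
        UserProtos.User parsed = UserProtos.User.parseFrom(bytes);
        System.out.println(parsed.getName() + " (" + bytes.length + " bytes)");
    }
}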
Cont...

3. Thrift
▪ Apache Thrift is a serialization and RPC (Remote Procedure Call) framework
developed by Facebook.
▪ It allows efficient data serialization and is used in Hadoop when cross-language data
exchange and high-performance network communication are needed.

Cont...

Key Characteristics
• Schema-based: Thrift, like Avro and Protobuf, relies on schema definitions.
• RPC support: In addition to serialization, Thrift supports RPC, making it more
suitable for distributed applications that need both data serialization and service
communication.

Key Considerations for Choosing a Serialization Framework

i. Efficiency

ii. Schema Evolution

iii. Interoperability

iv. Speed vs. Size

Serialization in Hadoop I/O Workflow

▪ Serialization plays a critical role throughout the entire Hadoop I/O workflow; a small
round-trip sketch follows the list below.

1. Input to MapReduce: When data is read from HDFS or other sources, it is deserialized
from a byte stream into objects that MapReduce jobs can process.

2. Intermediate Data (Shuffle Phase): During the shuffle phase in MapReduce, the key-
value pairs generated by mappers are serialized before being sent to reducers over the
network.

3. Output to HDFS: Once the data is processed, it is serialized again before being written
back to HDFS for storage.
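A minimal round trip through these steps, using a built-in Writable and in-memory streams in place of HDFS and the network:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize: object -> byte stream (as happens before data is written to
        // HDFS or shuffled across the network).
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            original.write(out);
        }

        // Deserialize: byte stream -> object (as happens when a mapper or reducer
        // reads its input back).
        IntWritable restored = new IntWritable();
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            restored.readFields(in);
        }

        System.out.println("Restored value: " + restored.get());  // prints 163
    }
}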

Thank you | Asante | Mwebare

Questions?

