Unit 6 - Compression and Serialization in Hadoop

The document discusses compression and serialization techniques in Hadoop, emphasizing their importance for efficient data storage, transfer, and processing. It outlines various compression formats, codecs, and the role of serialization in the Hadoop I/O workflow. Additionally, it compares built-in and third-party serialization frameworks, highlighting their characteristics and use cases.

MCS7101 - Big Data Analytics

Unit Six – Compression and Serialization in Hadoop



Instructor

Tamale Micheal
Assistant Lecturer - Computer Science (PhD - Student)
Department of Computer Science
Faculty of Computing, Library and Information Sciences
Kabale University



Introduction

▪ In Hadoop, compression is an essential technique used in I/O (input/output) operations to
reduce the size of data, which improves storage efficiency and speeds up data transfer
across the network.
▪ Since Hadoop deals with very large datasets, compressing data helps reduce the amount
of space required to store it and decreases the time required for data processing by
lowering disk I/O and network traffic.

Why Compression Is Important in Hadoop

i. Reduced Storage Costs


ii. Improved Data Transfer Speed
iii. Faster Processing
iv. Optimized Resource Usage

Types of Compression in Hadoop

i. Gzip (GNU zip)

ii. Bzip2

iii. Snappy

iv. LZO (Lempel-Ziv-Oberhumer)

Splittable vs. Non-Splittable Compression Formats

▪ Splittability refers to the ability to split a compressed file into chunks for parallel
processing, which is a key aspect of how Hadoop processes large datasets in a distributed
manner.
▪ Splittable Compression Formats (e.g., Bzip2, LZO with indexing) allow large files to be
processed in parallel, making Hadoop’s MapReduce framework more efficient.
▪ Non-Splittable Compression Formats (e.g., Gzip, Snappy) don’t allow splitting, which
can be a bottleneck for large files: the entire compressed file has to be handled by a
single mapper, reducing parallelism. A quick programmatic splittability check is sketched below.
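As a rough sketch, the snippet below uses Hadoop's CompressionCodecFactory to resolve a codec from a file name and then checks whether it implements the SplittableCompressionCodec interface (BZip2Codec does; GzipCodec does not). The file names here are made up purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Resolve the codec from the file extension (.gz, .bz2, ...) and report splittability.
        for (String name : new String[] {"logs.gz", "logs.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> " + codec.getClass().getSimpleName()
                    + ", splittable: " + splittable);
        }
    }
}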

Hadoop Compression Codecs

▪ In Hadoop, a codec is the implementation of a compression/decompression algorithm.

▪ Hadoop provides several built-in compression codecs.

▪ Some popular codecs include the following (a short usage sketch follows the list):

i. GzipCodec: Handles Gzip compression and decompression.

ii. BZip2Codec: Handles Bzip2 compression and decompression.

iii. SnappyCodec: Handles Snappy compression and decompression.

iv. Lz4Codec: Provides a balance between compression speed and ratio, often used in
high-performance environments.

v. LzoCodec: Used for LZO compression; requires installation of additional libraries for
Hadoop to support it.
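As a minimal sketch of working with a codec directly, the example below instantiates GzipCodec through ReflectionUtils and wraps a plain output stream so that everything written through it is compressed. The input and output file names are assumptions for illustration; any other codec class name could be substituted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Instantiate the Gzip codec; other codec class names could be used here.
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Wrap a plain output stream so everything written through it is compressed.
        try (FileInputStream in = new FileInputStream("input.txt");
             FileOutputStream rawOut = new FileOutputStream("input.txt.gz");
             CompressionOutputStream out = codec.createOutputStream(rawOut)) {
            IOUtils.copyBytes(in, out, 4096);
            out.finish();  // flush any buffered compressed data
        }
    }
}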
Compression in Hadoop MapReduce

▪ In Hadoop MapReduce, compression can be applied at different stages; a configuration
sketch follows the list below.

1. Input Compression

2. Intermediate (Shuffle) Compression

3. Output Compression
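The sketch below shows one common way these stages are configured. The property names and the FileOutputFormat calls are standard Hadoop MapReduce APIs; the job name and the codec choices are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // 2. Intermediate (shuffle) compression: compress map output before it is
        //    sent across the network to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-job");

        // 3. Output compression: compress the final job output written back to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // 1. Input compression needs no explicit setting here: Hadoop infers the codec
        //    from the input file's extension (e.g. .gz, .bz2).
        return job;
    }
}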

Trade-offs in Compression

1. Compression Ratio vs. Speed


▪ High compression ratio algorithms (e.g., Bzip2) result in smaller files but tend to be
slower. They are suitable for scenarios where storage space is a bigger concern than
processing speed.
▪ Fast compression algorithms (e.g., Snappy, LZO) focus on speed but may not
compress as well. They are more suitable for real-time data processing or situations
where speed is a priority.

Cont...

2. CPU Overhead
▪ Compression and decompression require CPU resources. For large clusters, the CPU
overhead may be offset by the benefits of reduced I/O and network usage.
▪ However, in smaller environments, the CPU cost may become a bottleneck if
compression algorithms are too slow.

Serialization

▪ Serialization in Hadoop I/O refers to the process of converting data objects (like records,
values, or structures) into a stream of bytes that can be efficiently stored or transmitted
over a network.
▪ In Hadoop, serialization is critical because it allows data to be written to and read from the
Hadoop Distributed File System (HDFS) and enables communication between nodes
during distributed processing tasks like MapReduce.

Why Is Serialization Important?

▪ Efficient Storage

▪ Efficient Transmission

▪ Interoperability

Hadoop's Default Serialization Frameworks

▪ Hadoop provides several serialization mechanisms, each designed for specific use cases.

▪ The most commonly used serialization frameworks in Hadoop are:

1. Writable Interface (Hadoop’s Native Serialization)


▪ Writable is the default serialization mechanism in Hadoop.

▪ Any object in Hadoop that needs to be serialized must implement the Writable interface.

▪ It’s highly optimized for Hadoop’s I/O operations and is lightweight compared to other
serialization mechanisms.

Cont...

Key Characteristics:
• Compact: Data is serialized in a binary format, resulting in minimal storage
overhead.
• Fast: Writable is designed for high-performance I/O.
• Customizable: Users can implement custom Writable objects for their specific
data types.
Writable Examples:
• IntWritable, LongWritable, Text, and DoubleWritable are examples of built-in Writable
types that correspond to common Java data types. A custom Writable is sketched below.
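A minimal custom Writable might look like the sketch below; the record and its fields (a station name and a temperature reading) are invented purely to show the write()/readFields() contract.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical record type: a weather-station name plus a temperature reading.
public class StationReading implements Writable {
    private final Text station = new Text();
    private final IntWritable temperature = new IntWritable();

    public void set(String name, int temp) {
        station.set(name);
        temperature.set(temp);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        station.write(out);
        temperature.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in exactly the same order as write().
        station.readFields(in);
        temperature.readFields(in);
    }
}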
Cont...

2. WritableComparable Interface
▪ This is an extension of Writable that adds comparison functionality, often used when key
objects need to be compared (such as in sorting tasks).
▪ Any object that is used as a key in Hadoop MapReduce must implement
WritableComparable, as in the sketch below.
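A sketch of such a key type follows; the class and its single field are hypothetical, but the pattern (the Writable methods plus compareTo, hashCode, and equals) is what MapReduce expects of key types.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type that sorts records by year during the shuffle/sort phase.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public void set(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeInt(year); }

    @Override
    public void readFields(DataInput in) throws IOException { year = in.readInt(); }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);  // defines the sort order of keys
    }

    @Override
    public int hashCode() { return year; }  // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}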

Cont...

3. Text (Writable for String data)


▪ In Hadoop, the Text class is a specialized Writable for handling UTF-8 encoded strings.

▪ Because it stores its contents as UTF-8 bytes and can be reused across records (it is
mutable), it is more efficient for Hadoop’s internal data processing than Java’s immutable
String; a short usage example follows.
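A short usage example, with made-up values:

import org.apache.hadoop.io.Text;

public class TextDemo {
    public static void main(String[] args) {
        // Text stores its contents as UTF-8 bytes and can be reused across records.
        Text t = new Text("Kabale");
        System.out.println(t.getLength());  // length in UTF-8 bytes
        System.out.println(t.toString());   // convert back to a java.lang.String

        t.set("Hadoop");  // reuse the same object instead of allocating a new one
        System.out.println(t);
    }
}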

Third-Party Serialization Frameworks in Hadoop

▪ In addition to Hadoop's built-in Writable system, Hadoop can integrate with third-party
serialization frameworks that are more flexible or efficient for specific use cases, especially
when interoperability with other systems is required.

1. Apache Avro
▪ Avro is a popular serialization framework used in Hadoop for working with complex
data types.
▪ It stores data in a compact binary format and also includes a schema with the data,
which makes it self-describing.

Cont...

Key Characteristics
• Schema-based: Avro uses a schema to describe the structure of the data,
enabling both serialization and deserialization to be flexible across different
languages.
• Interoperability: Avro is language-neutral, meaning data serialized with Avro can
be deserialized in any language that has an Avro library (e.g., Java, Python, C++).
• Efficient: Avro is more compact than Hadoop’s Writable in many cases, and its schema
support makes it well suited to scenarios requiring schema evolution and cross-language
communication. A minimal serialization sketch follows.
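The sketch below serializes one record with Avro's generic API. The schema (a User record with a name and an age) and the field values are invented for illustration, while Schema, GenericRecord, and EncoderFactory are part of the standard Avro Java library.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // The schema describes the record structure and makes the data self-describing.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Arthur");
        user.put("age", 30);

        // Serialize the record to a compact binary stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized size: " + out.size() + " bytes");
    }
}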

Cont...

2. Protocol Buffers (Protobuf)


▪ Google Protocol Buffers is another serialization framework used in Hadoop for
structured data. Like Avro, Protobuf is schema-based and language-neutral.
Key Characteristics
• Compact binary format: Data serialized with Protobuf is extremely compact.
• Schema-based: Like Avro, Protobuf uses a schema to describe the structure of
the data.
• Language-neutral: Supports multiple programming languages (e.g., Java, C++,
Python). A small Java sketch follows.
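The pattern below is how protoc-generated Java classes are typically used. UserProtos.User is a hypothetical class that would come from compiling a user.proto file, so this sketch only compiles once such a class has been generated.

public class ProtobufSketch {
    public static void main(String[] args) throws Exception {
        // Build and serialize a message to a compact binary byte array
        // (newBuilder, toByteArray, and parseFrom are produced by protoc).
        UserProtos.User user = UserProtos.User.newBuilder()
                .setName("Arthur")
                .setAge(30)
                .build();
        byte[] bytes = user.toByteArray();

        // Deserialize the same bytes back into a message object.
        UserProtos.User parsed = UserProtos.User.parseFrom(bytes);
        System.out.println(parsed.getName() + " (" + bytes.length + " bytes)");
    }
}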
Cont...

3. Thrift
▪ Apache Thrift is a serialization and RPC (Remote Procedure Call) framework
developed by Facebook.
▪ It allows efficient data serialization and is used in Hadoop when cross-language data
exchange and high-performance network communication are needed.

Cont...

Key Characteristics
• Schema-based: Thrift, like Avro and Protobuf, relies on schema definitions.
• RPC support: In addition to serialization, Thrift supports RPC, making it more
suitable for distributed applications that need both data serialization and service
communication.

Key Considerations for Choosing a Serialization Framework

i. Efficiency

ii. Schema Evolution

iii. Interoperability

iv. Speed vs. Size

Serialization in Hadoop I/O Workflow

▪ Serialization plays a critical role throughout the entire Hadoop I/O workflow; a small
round-trip sketch follows the list below.

1. Input to MapReduce: When data is read from HDFS or other sources, it is deserialized
from a byte stream into objects that MapReduce jobs can process.

2. Intermediate Data (Shuffle Phase): During the shuffle phase in MapReduce, the key-
value pairs generated by mappers are serialized before being sent to reducers over the
network.

3. Output to HDFS: Once the data is processed, it is serialized again before being written
back to HDFS for storage.
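A minimal round trip through these steps, using a built-in Writable and in-memory streams in place of HDFS and the network:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize: object -> byte stream (as happens before data is written to
        // HDFS or shuffled across the network).
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            original.write(out);
        }

        // Deserialize: byte stream -> object (as happens when a mapper or reducer
        // reads its input back).
        IntWritable restored = new IntWritable();
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            restored.readFields(in);
        }

        System.out.println("Restored value: " + restored.get());  // prints 163
    }
}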

Thank you | Asante | Mwebare

Questions?

