Author: Sunil Kuruba
Instructor: Mark Grechanik
YouTube video: https://youtu.be/TylcxCHhEik
This project focuses on implementing a distributed solution for processing large-scale text data using Hadoop on AWS EMR. The system leverages custom MapReduce jobs to tokenize large corpora and generate vector embeddings using a custom Token Generator. The goal is to enable efficient parallel processing for tasks like natural language understanding, large language model (LLM) training, and text analysis by harnessing the scalable resources of AWS Elastic MapReduce (EMR).
The solution is built using Scala for the MapReduce logic, with two main classes handling different stages of the text processing pipeline. This project also aims to demonstrate the deployment of MapReduce jobs in AWS EMR clusters with multiple steps, using a single JAR file for the entire workflow.
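As a rough illustration of the tokenization stage, a minimal Hadoop mapper in Scala could look like the sketch below. The class name and the whitespace-based splitting are illustrative assumptions; the actual `Tokenizer.scala` may use a different tokenization scheme (for example, BPE token IDs).

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Illustrative sketch only: emits (token, 1) pairs so a reducer can sum token frequencies.
// The project's real Tokenizer.scala may encode tokens differently (e.g., BPE token IDs).
class TokenizerMapperSketch extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}
```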
```
├── src
│   ├── main
│   │   ├── scala
│   │   │   ├── Tokenizer.scala      # Tokenization logic for MapReduce
│   │   │   ├── Embedding.scala      # MapReduce job for generating embeddings
│   │   │   ├── JobConfig.scala      # Job configuration
│   │   │   ├── Utility.scala        # Utility class for text parsing
│   │   ├── resources
│   │   │   ├── application.conf     # Configuration file for Hadoop and job settings
├── README.md                        # Project documentation
├── build.sbt                        # Build configuration file
└── project
    ├── dependencies.sbt             # Project dependencies
```

To run this project, you will need the following:
- Scala 3.5: Ensure Scala is installed on your machine.
- Java JDK 8: Scala runs on the JVM, so a compatible JDK is required.
- sbt 1.10.2: For building the Scala project.
- Hadoop 3.3.3: A working Hadoop setup (either local or AWS EMR).
- AWS Account: Access to an AWS account to create and manage an EMR cluster.
- EMR Cluster: Set up an EMR cluster on AWS with sufficient nodes for distributed processing.
- S3 Bucket: Create an S3 bucket to store input data and retrieve output.
- Git: Version control for managing and deploying the project.
- Clone the Repository
  git clone https://github.com/SunilKuruba/CS441-HW1.git
  cd hadoop-llm-aws-emr
- Prepare AWS
- Create an S3 bucket to store the JARs and input data. Make a note of the S3 paths.
- Upload the input training files of your choice to the S3 input folder.
```
├── jar
│   └── common.jar        # Common jar file
├── input
│   ├── novel.txt         # Input text file
│   ├── subtitles.csv     # Input CSV file
├── output                # Output folder (initially empty)
```
- Select Environment
Update the `JobConfig.environment` variable according to your use case, then follow these environment-specific steps (a configuration-loading sketch follows the table below):
| Environment | Configuration Requirements |
|---|---|
| Production | Update the application configuration file: `io.inputdir.prod` (S3 input path) and `io.outputdir.prod` (S3 output path). |
| Local Development | Ensure Hadoop processes are running on your machine and that HDFS (`hadoop fs`) contains the input and output paths specified in `application.conf`. |
| Test | Ensure Hadoop processes are running on your machine. No configuration updates are required; see `test/resources` for input and output. |
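The table above references keys in `application.conf`. A minimal sketch of how such an environment switch could be wired with Typesafe Config is shown below; the `Environment` values, the helper names, and the local/test key names (mirroring the prod keys) are assumptions for illustration, not the project's actual `JobConfig`.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Illustrative sketch: resolve input/output paths from application.conf
// based on the selected environment (prod, local, or test).
object JobConfigSketch {
  sealed trait Environment { def suffix: String }
  case object Prod  extends Environment { val suffix = "prod" }
  case object Local extends Environment { val suffix = "local" }
  case object Test  extends Environment { val suffix = "test" }

  // Change this value before building, as described above.
  val environment: Environment = Prod

  private val conf: Config = ConfigFactory.load() // reads src/main/resources/application.conf

  def inputDir: String  = conf.getString(s"io.inputdir.${environment.suffix}")
  def outputDir: String = conf.getString(s"io.outputdir.${environment.suffix}")
}
```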
- Build the Project
  Use sbt to compile the project and produce the necessary JAR file:
  sbt clean compile assembly
  The assembled JAR will be available under the relative path `target/scala-3.5.0/`.
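If the assembly setup needs to be recreated, a typical sbt-assembly configuration looks roughly like the following. The plugin and library versions, and the merge strategy, are common defaults rather than values taken from this repository's build files.

```scala
// project/plugins.sbt (assumed): enables the sbt-assembly plugin behind `sbt assembly`
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.2.0")
```

```scala
// build.sbt (sketch): Scala 3 build producing a single fat JAR named common.jar
ThisBuild / scalaVersion := "3.5.0"

lazy val root = (project in file("."))
  .settings(
    name := "CS441-HW1",
    assembly / assemblyJarName := "common.jar",
    libraryDependencies ++= Seq(
      // Hadoop client libraries; often marked % Provided when the EMR cluster supplies them
      "org.apache.hadoop" % "hadoop-client" % "3.3.3",
      "com.typesafe"      % "config"        % "1.4.3"
    ),
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", _*) => MergeStrategy.discard
      case _                        => MergeStrategy.first
    }
  )
```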
- Upload JAR to S3
  Upload your generated JAR file and text corpus to S3:
  aws s3 cp <target/scala-3.5.0/common.jar> s3://your-bucket-name/
  aws s3 cp input-data.txt s3://your-bucket-name/   # Optional step
- Create an AWS EMR Cluster
  Launch an EMR cluster with the following configurations:
- Instance type: Select an instance type based on your specific processing requirements (e.g., m5.xlarge).
- Application: Install Hadoop only.
- Add steps (see the entry-point sketch after this list):
  - Tokenizer step:
    - Jar location: [input S3 common JAR path]
    - Jar arguments: tokenizerMain
  - Embedding step:
    - Jar location: [input S3 common JAR path]
    - Jar arguments: embeddingMain
- Submit the job to EMR for processing.
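Both steps run the same `common.jar`; the first JAR argument selects which job to launch. A possible shape for such a dispatching entry point is sketched below; the object name and the job-runner helpers are assumptions about how the actual driver in this project is organized.

```scala
// Illustrative sketch of a single entry point that dispatches to the right
// MapReduce job based on the first argument passed to the EMR step.
object MainSketch {
  def main(args: Array[String]): Unit =
    args.headOption match {
      case Some("tokenizerMain") => runTokenizer(args.tail)   // Tokenizer step
      case Some("embeddingMain") => runEmbedding(args.tail)   // Embedding step
      case other =>
        System.err.println(s"Unknown or missing job name: $other")
        sys.exit(1)
    }

  // Hypothetical stand-ins for the real job launchers in Tokenizer.scala and Embedding.scala.
  private def runTokenizer(stepArgs: Array[String]): Unit =
    println(s"Launching Tokenizer job with args: ${stepArgs.mkString(" ")}")

  private def runEmbedding(stepArgs: Array[String]): Unit =
    println(s"Launching Embedding job with args: ${stepArgs.mkString(" ")}")
}
```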
- Check Results
  Monitor the job for completion. Once the steps complete, retrieve the results from the S3 output path and inspect the generated embeddings:
aws s3 cp s3://your-bucket-name/output/ ./output --recursive
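The exact output layout depends on the writers in `Embedding.scala`; assuming plain-text part files with one token and its vector per line, a quick local inspection could look like the sketch below (the default file name and line format are assumptions).

```scala
import scala.io.Source

// Illustrative sketch: print the first few lines of a downloaded part file.
// Assumes plain-text output such as "token<TAB>v1,v2,...,vN"; adjust to the real format.
object InspectEmbeddings {
  def main(args: Array[String]): Unit = {
    val path   = args.headOption.getOrElse("output/part-r-00000") // hypothetical default path
    val source = Source.fromFile(path)
    try source.getLines().take(10).foreach(println)
    finally source.close()
  }
}
```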
