Apache Hadoop and AWS EMR: Distributed LLM Text Processing and Embeddings

A scalable NLP pipeline for processing and embedding large text corpora using Hadoop MapReduce on AWS EMR. Implements custom tokenization and embedding generation using Scala, deployed via a single JAR across multiple EMR steps.

Author: Sunil Kuruba
Instructor: Mark Grechanik
YouTube video: https://youtu.be/TylcxCHhEik

Project Summary

This project implements a distributed solution for processing large-scale text data using Hadoop on AWS EMR. The system uses custom MapReduce jobs to tokenize large corpora and to generate vector embeddings with a custom Token Generator. The goal is to enable efficient parallel processing for tasks such as natural language understanding, large language model (LLM) training, and text analysis by harnessing the scalable resources of AWS Elastic MapReduce (EMR).

The MapReduce logic is written in Scala, with two main classes handling the two stages of the text-processing pipeline. The project also demonstrates deploying MapReduce jobs to AWS EMR clusters as multiple steps, all driven by a single JAR file for the entire workflow.
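Concretely, because every stage ships in one JAR, each EMR step can select its stage through the first program argument (tokenizerMain or embeddingMain, as used in the execution steps below). The following is only a rough sketch of such a dispatcher; the object and helper names are placeholders, not the project's actual entry points.

// Illustrative sketch only: a single-JAR entry point that routes each EMR step
// to the right MapReduce stage via its first program argument.
object Main:
  def main(args: Array[String]): Unit =
    args.headOption match
      case Some("tokenizerMain") => runTokenizer(args.drop(1))
      case Some("embeddingMain") => runEmbedding(args.drop(1))
      case other =>
        System.err.println(s"Unknown or missing step argument: $other")
        sys.exit(1)

  // Placeholders standing in for the actual drivers (Tokenizer.scala / Embedding.scala).
  private def runTokenizer(args: Array[String]): Unit = println(s"tokenizer step: ${args.mkString(" ")}")
  private def runEmbedding(args: Array[String]): Unit = println(s"embedding step: ${args.mkString(" ")}")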


Project Structure

├── src
│   ├── main
│   │   ├── scala
│   │   │   ├── Tokenizer.scala      # Tokenization logic for MapReduce
│   │   │   ├── Embedding.scala      # Generating embeddings MapReduce
│   │   │   ├── JobConfig.scala      # Setting Job configuration
│   │   │   └── Utility.scala        # Utility class for text parsing
│   │   └── resources
│   │       └── application.conf     # Configuration file for Hadoop and job settings
├── README.md                        # Project documentation
├── build.sbt                        # Build configuration file
└── project
    └── dependencies.sbt             # Project dependencies
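To give a feel for the Tokenizer stage, a Hadoop mapper for this kind of job typically emits one record per token. The sketch below is illustrative only: it uses plain whitespace splitting and a made-up class name, whereas the actual Tokenizer.scala relies on the project's custom token generator.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Illustrative sketch: a mapper that emits (token, 1) pairs using plain
// whitespace splitting; the real Tokenizer.scala uses the custom token generator.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable]:
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.toLowerCase.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }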

Prerequisites

To run this project, you will need the following:

  1. Scala 3.5: Ensure Scala is installed on your machine.
  2. Java JDK 8: Scala runs on the Java platform.
  3. sbt 1.10.2: For building the Scala project (a sketch of an assembly-oriented build.sbt follows this list).
  4. Hadoop 3.3.3: A working Hadoop setup (either local or on AWS EMR).
  5. AWS Account: Access to an AWS account to create and manage an EMR cluster.
  6. EMR Cluster: An EMR cluster on AWS with sufficient nodes for distributed processing.
  7. S3 Bucket: An S3 bucket to store input data and retrieve output.
  8. Git: Version control for managing and deploying the project.
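For orientation, an assembly-oriented build.sbt for a setup like this could look roughly as follows. This is a sketch only: the dependency list, JAR name, and version numbers are assumptions, and build.sbt plus project/dependencies.sbt in the repository remain the authoritative settings.

// Illustrative build.sbt sketch; names and versions are assumptions.
// Requires the sbt-assembly plugin declared in project/plugins.sbt.
ThisBuild / scalaVersion := "3.5.0"

lazy val root = (project in file("."))
  .settings(
    name := "hadoop-llm-text-processing",
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % "3.3.3",   // Hadoop MapReduce API
      "com.typesafe"      % "config"        % "1.4.3"    // loads application.conf
    ),
    assembly / assemblyJarName := "common.jar",          // single fat JAR used by every EMR step
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", _*) => MergeStrategy.discard
      case _                        => MergeStrategy.first
    }
  )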

Steps to Execute the Project

  1. Clone the Repository

     git clone https://github.com/SunilKuruba/CS441-HW1.git
     cd hadoop-llm-aws-emr
  2. Prepare AWS
  • Create an S3 bucket to store the JARs and input data. Make a note of the S3 paths.
  • Load the input training files of your choice into the S3 folder:

     ├── jar
     │   └── common.jar        # Common JAR file
     ├── input
     │   ├── novel.txt         # Input text file
     │   └── subtitles.csv     # Input CSV file
     └── output                # Output folder (initially empty)
  3. Select Environment
     Update the JobConfig.environment variable according to your use case, then follow the environment-specific steps below (a sketch of the configuration-driven job wiring appears after these steps):
     • Production: Update the application configuration file:
       - io.inputdir.prod (S3 input path)
       - io.outputdir.prod (S3 output path)
     • Local Development: Ensure Hadoop processes are running on your machine and that hadoop fs contains the input and output paths specified in application.conf.
     • Test: Ensure Hadoop processes are running on your machine. No configuration updates are required; see test/resources for the input and output.
  4. Build the Project
     Use sbt to compile the project and produce the necessary JAR file. You should find the JAR under the relative path target/scala-3.5.0/..

      sbt clean compile assembly
  5. Upload JAR to S3
     Upload your generated JAR file and text corpus to S3:

     aws s3 cp <target/scala-3.5.0/common.jar> s3://your-bucket-name/
     aws s3 cp input-data.txt s3://your-bucket-name/    # Optional step
  6. Create an AWS EMR Cluster
     Launch an EMR cluster with the following configuration:

     • Instance type: Select an instance type based on your specific processing requirements (e.g., m5.xlarge).
     • Application: Install Hadoop only.
     • Add steps:
       • Tokenizer step:
         • JAR location: [input S3 common JAR path]
         • JAR arguments: tokenizerMain
       • Embedding step:
         • JAR location: [input S3 common JAR path]
         • JAR arguments: embeddingMain
     • Submit the job to EMR for processing.
  7. Check Results
     Monitor the job until it completes. Once the steps finish, retrieve the results from the S3 output path and inspect the generated embeddings:

      aws s3 cp s3://your-bucket-name/output/ ./output --recursive
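For reference, the kind of job wiring implied by JobConfig.scala and application.conf could look roughly like the sketch below, which reads the environment-specific S3 paths with Typesafe Config and submits a single MapReduce job. Class names and helper structure are assumptions based on the project layout above, not the project's exact code.

import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Illustrative sketch: wiring one MapReduce stage from application.conf.
// RunTokenizerJob and TokenizerMapper are hypothetical names; the io.* keys
// mirror the production settings listed in the Select Environment step.
object RunTokenizerJob:
  def main(args: Array[String]): Unit =
    val conf   = ConfigFactory.load()                          // reads application.conf from the classpath
    val input  = new Path(conf.getString("io.inputdir.prod"))  // S3 input path
    val output = new Path(conf.getString("io.outputdir.prod")) // S3 output path

    val job = Job.getInstance(new Configuration(), "tokenizer")
    job.setJarByClass(classOf[TokenizerMapper])                // mapper sketched under Project Structure
    job.setMapperClass(classOf[TokenizerMapper])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, input)
    FileOutputFormat.setOutputPath(job, output)

    sys.exit(if job.waitForCompletion(true) then 0 else 1)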
