This project develops a versatile data pipeline for processing large datasets, using Docker containers together with Hadoop and Spark.
For our practical implementation, we selected the May 2015 Reddit Comments dataset available on Kaggle. However, the pipeline is flexible enough to accommodate other datasets: adjust the NAMENODE_DATA_DIR variable in ./hadoop-spark-cluster/Makefile and set the namenode HDFS URL in scripts/spark/config.json.
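As a rough illustration, a Spark script might read the namenode URL from that config file along the lines of the sketch below; the key name `hdfs_namenode_url` and the example URL are assumptions, not the repository's actual schema:

```python
import json

# Load the Spark job configuration; the key name "hdfs_namenode_url" is an
# assumed example and may differ from the actual config.json schema.
with open("scripts/spark/config.json") as f:
    config = json.load(f)

namenode_url = config["hdfs_namenode_url"]  # e.g. "hdfs://namenode:9000"
parquet_path = f"{namenode_url}/data/comments.parquet"
print(f"Writing Parquet parts to {parquet_path}")
```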
The pipeline uses Apache Spark for data processing and HDFS on a Hadoop cluster for data storage, with each node running in its own Docker container.
The pipeline works from an output.csv file located in the /data directory at the project's root, which it uploads to the HDFS container as Parquet parts. If you use the SQLite database from the provided link, the conversion script scripts/utils/csv_converter.py converts the data from SQLite to CSV format before you run the initialization script.
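Conceptually, that conversion step resembles the following sketch. The table name `May2015` and the file paths are assumptions based on the Kaggle dump; the actual csv_converter.py may work differently:

```python
import csv
import sqlite3

# Illustrative sketch only: the table name "May2015" and paths are assumptions;
# the repository's csv_converter.py may use different names or options.
conn = sqlite3.connect("data/database.sqlite")
cursor = conn.execute("SELECT * FROM May2015")

with open("data/output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    for row in cursor:
        writer.writerow(row)

conn.close()
```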
- An output.csv file under /data/ (not included in the repository due to its size). Download the May 2015 Reddit Comments dataset and convert it to CSV with the helper script csv_converter.py under scripts/utils/.
- Pipenv (for installing dependencies)
- Docker
- Docker Compose
- Create a Python virtual environment and install dependencies: `pipenv install`
- Activate the virtual environment: `pipenv shell`
- Make the `init.sh` script executable and run it to move the output.csv file to HDFS as Parquet parts (see the sketch after this list for what the upload step roughly does): `chmod +x init.sh && ./init.sh`
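Conceptually, the upload performed by the Spark job resembles the PySpark sketch below. The namenode URL and output path are illustrative assumptions; the actual job and its configuration live under scripts/spark/:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: the HDFS URL and target path are assumptions,
# not the values used by the repository's actual Spark job.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the local CSV produced by the conversion step.
df = spark.read.csv("data/output.csv", header=True, inferSchema=True)

# Write to HDFS as Parquet; Spark splits the output into part files automatically.
df.write.mode("overwrite").parquet("hdfs://namenode:9000/data/comments.parquet")

spark.stop()
```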