This project leverages multiple GPUs to reduce the training time of complex models through data parallelism, using two approaches (minimal setup sketches for both follow the list below):
- Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers, connected via:
  - Local area network (LAN).
  - VPN tunnel using OpenVPN (not included in the demo).
- Parameter Server Training using 5 machines on the LAN:
  - 2 laptops as Parameter Servers, connected via 5 GHz Wi-Fi.
  - 2 PCs with GeForce RTX GPUs as Workers.
  - 1 CPU-only PC as the Coordinator.
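The multi-worker setup boils down to exporting a `TF_CONFIG` on each worker and building `tf.distribute.MultiWorkerMirroredStrategy`. A minimal sketch is below; the IP addresses and ports are placeholders, not the project's actual LAN/VPN hosts:

```python
import json
import os

import tensorflow as tf

# Each worker exports TF_CONFIG before building the strategy.
# Addresses below are placeholders, not the project's real hosts.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["192.168.1.10:12345", "192.168.1.11:12345"]},
    "task": {"type": "worker", "index": 0},  # "index": 1 on the second PC
})

# Synchronous data parallelism: gradients are all-reduced between the
# two RTX workers over the LAN (or the OpenVPN tunnel) at every step.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
# Model building and compiling then happen inside strategy.scope().
```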
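For the parameter server topology, the coordinator describes the whole cluster and creates the strategy, while the workers and parameter servers each run a blocking `tf.distribute.Server`. Again a sketch with placeholder addresses, not the project's real configuration:

```python
import json
import os

import tensorflow as tf

# On the Coordinator (chief): describe the cluster. Addresses are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["192.168.1.10:12345", "192.168.1.11:12345"],  # 2 GPU PCs
        "ps": ["192.168.1.20:12345", "192.168.1.21:12345"],      # 2 laptops
        "chief": ["192.168.1.30:12345"],                         # CPU-only PC
    },
    "task": {"type": "chief", "index": 0},
})

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# Each worker / ps machine sets its own "task" entry in TF_CONFIG and runs:
#   server = tf.distribute.Server(cluster_resolver.cluster_spec(),
#                                 job_name=cluster_resolver.task_type,
#                                 task_index=cluster_resolver.task_id)
#   server.join()
# Only the Coordinator builds the strategy and drives training.
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
```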
We used our self-built 30VNFoods dataset, which contains images of 30 famous Vietnamese dishes that we collected and labeled. The dataset is split into:
- 17,581 images for training.
- 2,515 images for validation.
- 5,040 images for testing.
We also used the small TensorFlow flowers dataset of about 3,700 flower images, organized into 5 folders corresponding to 5 flower types (daisy, dandelion, roses, sunflowers, tulips).
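Both datasets are folder-per-class image sets, so they can be loaded the same way. A sketch, assuming the standard flowers download URL from the TensorFlow tutorials and an assumed `30VNFoods/train` directory layout (not necessarily this repository's actual paths):

```python
import tensorflow as tf

# Download and extract the TensorFlow flowers dataset (~3,700 images, 5 classes).
flowers_dir = tf.keras.utils.get_file(
    "flower_photos",
    origin="https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True)

flowers_train_ds = tf.keras.utils.image_dataset_from_directory(
    flowers_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(224, 224),
    batch_size=32)

# 30VNFoods is assumed to be laid out as train/validation/test folders,
# with one sub-folder per dish class.
food_train_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/train", image_size=(224, 224), batch_size=32)
```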
| Hyperparameter | Value |
|---|---|
| Image size | (224, 224) |
| Batch size per worker | 32 |
| Optimizer | Adam |
| Learning rate | 0.001 |
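These settings map onto Keras roughly as follows. The backbone below is a hypothetical stand-in (see Report.pdf for the model actually used), the data path is assumed, and `TF_CONFIG` is expected to be set as in the multi-worker sketch above; note that with 2 workers the global batch size is 2 × 32 = 64:

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)
PER_WORKER_BATCH_SIZE = 32
NUM_WORKERS = 2
GLOBAL_BATCH_SIZE = PER_WORKER_BATCH_SIZE * NUM_WORKERS  # 64 for 2 workers

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Hypothetical backbone with 30 output classes for 30VNFoods.
    model = tf.keras.applications.MobileNetV2(
        input_shape=IMAGE_SIZE + (3,), weights=None, classes=30)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/train",              # assumed path
    image_size=IMAGE_SIZE,
    batch_size=GLOBAL_BATCH_SIZE)   # Keras auto-shards batches across workers

model.fit(train_ds, epochs=10)      # epoch count chosen only for illustration
```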
The iperf3 tool was used to measure the network bandwidth between the machines.
| Training method | Dataset | Connection | Avg. time per epoch (s) |
|---|---|---|---|
| Single-worker | flowers | LAN | 14 |
| Multi-worker | flowers | LAN | 18 |
| Multi-worker | flowers | VPN Tunnel | 635 |
| Multi-worker | 30VNFoods | LAN | 184 |
| Parameter Server | 30VNFoods | LAN | 115 |
⇒ For more information, see Report.pdf.
- Distributed training with Keras
- A friendly introduction to distributed training (ML Tech Talks)
- Distributed TensorFlow training (Google I/O '18)
- Inside TensorFlow: Parameter server training
- Performance issue for Distributed TF
- When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

