This project leverages multiple GPUs to reduce the training time of complex models through data parallelism, using two approaches (minimal setup sketches for both follow the list below):
- Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers, connected via:
  - Local area network (LAN).
  - VPN tunnel using OpenVPN (not included in the demo).
- Parameter Server Training using 5 machines on the LAN:
  - 2 laptops as Parameter Servers, connected via 5 GHz Wi-Fi.
  - 2 PCs with GeForce RTX GPUs as Workers.
  - 1 CPU-only PC as the Coordinator.
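The multi-worker setup boils down to exporting a `TF_CONFIG` on each worker and building `tf.distribute.MultiWorkerMirroredStrategy`. A minimal sketch is below; the IP addresses and ports are placeholders, not the project's actual LAN/VPN hosts:

```python
import json
import os

import tensorflow as tf

# Each worker exports TF_CONFIG before building the strategy.
# Addresses below are placeholders, not the project's real hosts.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["192.168.1.10:12345", "192.168.1.11:12345"]},
    "task": {"type": "worker", "index": 0},  # "index": 1 on the second PC
})

# Synchronous data parallelism: gradients are all-reduced between the
# two RTX workers over the LAN (or the OpenVPN tunnel) at every step.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
# Model building and compiling then happen inside strategy.scope().
```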
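For the parameter server topology, the coordinator describes the whole cluster and creates the strategy, while the workers and parameter servers each run a blocking `tf.distribute.Server`. Again a sketch with placeholder addresses, not the project's real configuration:

```python
import json
import os

import tensorflow as tf

# On the Coordinator (chief): describe the cluster. Addresses are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["192.168.1.10:12345", "192.168.1.11:12345"],  # 2 GPU PCs
        "ps": ["192.168.1.20:12345", "192.168.1.21:12345"],      # 2 laptops
        "chief": ["192.168.1.30:12345"],                         # CPU-only PC
    },
    "task": {"type": "chief", "index": 0},
})

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# Each worker / ps machine sets its own "task" entry in TF_CONFIG and runs:
#   server = tf.distribute.Server(cluster_resolver.cluster_spec(),
#                                 job_name=cluster_resolver.task_type,
#                                 task_index=cluster_resolver.task_id)
#   server.join()
# Only the Coordinator builds the strategy and drives training.
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
```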
We used our self-built 30VNFoods dataset, which contains images of 30 famous Vietnamese dishes that we collected and labeled. The dataset is split into:
- 17,581 images for training.
- 2,515 images for validation.
- 5,040 images for testing.
We also used the small TensorFlow flowers dataset of about 3,700 flower images, organized into 5 folders corresponding to 5 flower types (daisy, dandelion, roses, sunflowers, tulips).
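Both datasets are folder-per-class image sets, so they can be loaded the same way. A sketch, assuming the standard flowers download URL from the TensorFlow tutorials and an assumed `30VNFoods/train` directory layout (not necessarily this repository's actual paths):

```python
import tensorflow as tf

# Download and extract the TensorFlow flowers dataset (~3,700 images, 5 classes).
flowers_dir = tf.keras.utils.get_file(
    "flower_photos",
    origin="https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True)

flowers_train_ds = tf.keras.utils.image_dataset_from_directory(
    flowers_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(224, 224),
    batch_size=32)

# 30VNFoods is assumed to be laid out as train/validation/test folders,
# with one sub-folder per dish class.
food_train_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/train", image_size=(224, 224), batch_size=32)
```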
| Hyperparameter | Value |
|---|---|
| Image size | (224, 224) |
| Batch size per worker | 32 |
| Optimizer | Adam |
| Learning rate | 0.001 |
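These settings map onto Keras roughly as follows. The backbone below is a hypothetical stand-in (see Report.pdf for the model actually used), the data path is assumed, and `TF_CONFIG` is expected to be set as in the multi-worker sketch above; note that with 2 workers the global batch size is 2 × 32 = 64:

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)
PER_WORKER_BATCH_SIZE = 32
NUM_WORKERS = 2
GLOBAL_BATCH_SIZE = PER_WORKER_BATCH_SIZE * NUM_WORKERS  # 64 for 2 workers

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Hypothetical backbone with 30 output classes for 30VNFoods.
    model = tf.keras.applications.MobileNetV2(
        input_shape=IMAGE_SIZE + (3,), weights=None, classes=30)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/train",              # assumed path
    image_size=IMAGE_SIZE,
    batch_size=GLOBAL_BATCH_SIZE)   # Keras auto-shards batches across workers

model.fit(train_ds, epochs=10)      # epoch count chosen only for illustration
```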
The iperf3 tool was used to measure the network bandwidth between the machines.
| Training method | Dataset | Connection | Avg. time per epoch (s) |
|---|---|---|---|
| Single-worker | flowers | LAN | 14 |
| Multi-worker | flowers | LAN | 18 |
| Multi-worker | flowers | VPN Tunnel | 635 |
| Multi-worker | 30VNFoods | LAN | 184 |
| Parameter Server | 30VNFoods | LAN | 115 |
⇒ For more information, see Report.pdf.
- Distributed training with Keras
- A friendly introduction to distributed training (ML Tech Talks)
- Distributed TensorFlow training (Google I/O '18)
- Inside TensorFlow: Parameter server training
- Performance issue for Distributed TF
- When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

