EDL is an Elastic Deep Learning framework that helps deep learning cloud service providers build cluster cloud services on top of the deep learning framework PaddlePaddle.
EDL includes two parts:
- A Kubernetes controller for the elastic scheduling of distributed deep learning jobs, plus tools for manual adjustment.
- Changes that make PaddlePaddle a fault-tolerant deep learning framework with an easy-to-use API for job management.
EDL is an incubation-stage project of the LF AI Foundation.
While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes
- the global utilization of the cluster, and
- the waiting time of job submitters.

Key features:
- Efficiency: Provides parallelism strategies that minimize adjustment overhead.
- Consistency: Accuracy is verified on multiple models against runs without scaling.
- Flexibility: Any component can be killed or can join at any time.
- Easy to use: Only a few lines of code need to be added to support EDL.
We highly recommend running it in our Docker image:
- Start a Jobserver on one node.

```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
cd example/demo/collective
./start_job_server.sh
```

- Start a Jobclient, which controls the worker process.

```
# Set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# Set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>

mkdir -p resnet50_pod
./start_job_client.sh
```

- Experiments result
| total batch size | acc1 | acc5 |
|---|---|---|
| 1024 | 76.0 | 75.8 |
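
The Jobclient step above depends on two environment variables, so a small pre-flight check can catch a forgotten `export` before the job starts. The following is only a sketch: `check_edl_env` is a hypothetical helper that is not part of EDL, and it only tests that each variable is non-empty. It makes no assumption about what kind of path EDL expects.

```shell
# Hypothetical pre-flight helper; not shipped with EDL.
# Returns 0 if every named environment variable is set and non-empty.
# Otherwise prints each missing name to stderr and returns 1.
check_edl_env() {
    local missing=0 name value
    for name in "$@"; do
        # Indirect lookup: read the variable whose name is stored in $name.
        # Only pass trusted variable names, since this uses eval.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing environment variable: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage (assumes the current directory is example/demo/collective):
#   check_edl_env PADDLE_EDL_IMAGENET_PATH PADDLE_EDL_FLEET_CHECKPOINT_PATH \
#       && ./start_job_client.sh
```

Chaining with `&&` means `start_job_client.sh` runs only when both variables are set.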
- A scheduler on Kubernetes:
- EDL framework on PaddlePaddle:
- EDL Distillation:
- EDL CTR
TBD
EDL is provided under the Apache-2.0 license.


