EDL: Elastic Deep Learning

EDL is an elastic deep learning framework that helps deep learning cloud service providers build cluster cloud services on top of the PaddlePaddle deep learning framework.

EDL includes two parts:

  1. A Kubernetes controller for the elastic scheduling of distributed deep learning jobs, plus tools for manual adjustment.

  2. Extensions that make PaddlePaddle a fault-tolerant deep learning framework, with an easy-to-use API for job management.

EDL is an incubation-stage project of the LF AI Foundation.

While many hardware and software manufacturers work on reducing the running time of deep learning jobs, EDL optimizes

  1. the global utilization of the cluster, and
  2. the waiting time of job submitters.

Key Features:

  • Efficiency: Provides parallelism strategies to minimize adjustment overheads.
  • Consistency: Accuracy is verified on multiple models against runs without scaling.
  • Flexibility: Any component can be killed or can join at any time.
  • Easy to use: Only a few lines of code need to be added to support EDL.

Quick start demo: EDL ResNet50 experiments on a single machine

We highly recommend running it in our Docker image:

  1. Start a Jobserver on one node.

        docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
        cd example/demo/collective
        ./start_job_server.sh

  2. Start a Jobclient, which controls the worker process.

        # Set the ImageNet data path
        export PADDLE_EDL_IMAGENET_PATH=<your path>
        # Set the checkpoint path
        export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>
        mkdir -p resnet50_pod
        ./start_job_client.sh

  3. Experiment results

        total batch size    acc1    acc5
        1024                76.0    75.8
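
Before starting the Jobclient, it can help to confirm that both paths from the quick start are actually exported, since the client script presumably reads them from the environment. The following is a minimal sketch in POSIX shell, not part of EDL: `edl_preflight` is a hypothetical helper name, and the check that the ImageNet path is an existing local directory is an assumption about your setup.

```shell
# Hypothetical helper, not shipped with EDL: verifies the two environment
# variables used in the quick start before the Jobclient is launched.
edl_preflight() {
    edl_preflight_status=0

    # Assumption: the ImageNet data lives in a local directory.
    if [ -z "${PADDLE_EDL_IMAGENET_PATH:-}" ]; then
        echo "PADDLE_EDL_IMAGENET_PATH is not set" >&2
        edl_preflight_status=1
    elif [ ! -d "$PADDLE_EDL_IMAGENET_PATH" ]; then
        echo "PADDLE_EDL_IMAGENET_PATH is not a directory: $PADDLE_EDL_IMAGENET_PATH" >&2
        edl_preflight_status=1
    fi

    # The checkpoint path may not exist yet, so only require that it is set.
    if [ -z "${PADDLE_EDL_FLEET_CHECKPOINT_PATH:-}" ]; then
        echo "PADDLE_EDL_FLEET_CHECKPOINT_PATH is not set" >&2
        edl_preflight_status=1
    fi

    return "$edl_preflight_status"
}
```

With the function defined in the current shell, `edl_preflight && ./start_job_client.sh` would launch the client only when both variables pass the check.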

Design Docs

Applications:

FAQ

TBD

License

EDL is provided under the Apache-2.0 license.
