Computing resources on public clouds such as Amazon AWS and Baidu Cloud are multi-tenant. Training and inference of deep learning models with elastic resources will become common on the cloud. We propose Elastic Deep Learning (EDL), which makes training and inference of deep learning models on the cloud easier and more efficient.
Now EDL is an incubation-stage project of the LF AI Foundation.
You can install EDL with `pip install paddle_edl`, but we highly recommend using it in our Docker image:
```bash
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7 /bin/bash
```
- Support elastic training with inference-type services during training, such as knowledge distillation
- Inference-type services are automatically registered in EDL through service discovery (a conceptual sketch follows this feature list)
- Knowledge distillation examples in computer vision and natural language processing
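The registration flow can be pictured as a key-value registry that teacher instances write themselves into and trainers read from. The sketch below is only a conceptual illustration using etcd; the key layout, TTL, and function names are assumptions, not EDL's actual discovery API.

```python
# Conceptual sketch only: EDL ships its own discovery components, so the etcd
# usage, key prefix, and TTL below are illustrative assumptions, not EDL's API.
import etcd3

JOB_KEY = "/distill/teachers/mnist"  # hypothetical key prefix for one job


def register_teacher(endpoint, ttl=10):
    """A teacher instance advertises its serving endpoint under a lease,
    so crashed or preempted instances disappear from the registry automatically."""
    client = etcd3.client(host="127.0.0.1", port=2379)
    lease = client.lease(ttl)
    client.put("{}/{}".format(JOB_KEY, endpoint), endpoint, lease=lease)
    return client, lease  # the caller keeps the lease alive with lease.refresh()


def discover_teachers():
    """A trainer (student) fetches the current list of live teacher endpoints."""
    client = etcd3.client(host="127.0.0.1", port=2379)
    return [value.decode() for value, _ in client.get_prefix(JOB_KEY)]
```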
- Theory: Distilling the Knowledge in a Neural Network
- In general, knowledge distillation involves two parts: strong teachers and weak students.
- A student model learns from the feed-forward results of a teacher model (or a mixture of teachers) to achieve better accuracy (a minimal loss sketch follows the solution list below).
- Application scenarios of EDL knowledge distillation
- Teacher models and student models run on the same GPU devices, so training throughput is not maximized.
- The offline GPU cluster has limited resources, but some online GPU resources can be used during training.
- Heterogeneous teacher models can improve a student model's performance but are hard to deploy on a single GPU card due to memory limitations.
- The computation burden of teacher models and student models is hard to balance in a way that maximizes training throughput.
- Solution:
- Deploy teacher models as online inference service through Paddle Serving
- Online inference services are elastic and are registered to EDL service management modules.
- Dynamically adapt the number of online teacher-model instances to maximize students' training throughput and resource utilization (a sketch of the student-side data flow appears at the end of this distillation section).
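To make the teacher/student relation above concrete, here is a minimal sketch of a soft-label distillation loss written against the Paddle 2.x functional API; the temperature `T`, mixing weight `alpha`, and function name are illustrative assumptions, not the exact loss used in the example below.

```python
import paddle.nn.functional as F


def soft_label_distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix the teacher's softened predictions with the ground-truth labels."""
    # Soft targets come from the teacher's feed-forward results; no gradient flows back.
    soft_targets = F.softmax(teacher_logits.detach() / T, axis=-1)
    log_probs = F.log_softmax(student_logits / T, axis=-1)
    # The KL term is scaled by T*T to keep gradient magnitudes comparable (Hinton et al.).
    kd_loss = F.kl_div(log_probs, soft_targets, reduction="mean") * (T * T)
    # Standard cross-entropy on the hard labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```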
- The Teacher Model: mnist_cnn_model
```bash
cd example/distill/mnist_distill
python -m paddle_serving_server_gpu.serve \
    --model mnist_cnn_model \
    --port 9292 \
    --gpu_ids 0
```
- The Student Model: xx
```bash
python train_with_fleet.py \
    --use_distill_service True \
    --distill_teachers 127.0.0.1:9292
```
- Performance comparison
| total batch size | top-1 acc (%) | top-5 acc (%) |
|---|---|---|
| 1024 | 75.5 | 92.8 |
- To run distillation on a cluster, please refer to Run EDL distillation training
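Putting the demo together conceptually: during training, the student's data reader asks the discovered teacher services for their feed-forward results and attaches them to each sample. The wrapper below is a hypothetical illustration of that data flow, not paddle_edl's actual reader; `query_teacher` stands in for a real Paddle Serving client call and is an assumption.

```python
import random


def with_teacher_targets(base_reader, teacher_endpoints, query_teacher):
    """Wrap a plain (image, label) sample reader so each sample also carries
    the soft targets produced by one of the live teacher services."""
    def reader():
        for image, label in base_reader():
            endpoint = random.choice(teacher_endpoints)   # naive load balancing
            soft_target = query_teacher(endpoint, image)  # teacher feed-forward result
            yield image, label, soft_target
    return reader
```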
The main changes are that you should call load_checkpoint at the beginning of training and save_checkpoint at the end of every epoch, and that the checkpoint should be stored on a distributed file system such as HDFS so all trainers can download it. A complete example is here.
```python
# Resume from the latest checkpoint on HDFS if one exists.
fs = HDFSClient(args.hdfs_name, args.hdfs_ugi, 20 * 60 * 1000, 3 * 1000)

train_status = TrainStatus()
tmp_s = fleet.load_checkpoint(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
if tmp_s is not None:
    train_status = tmp_s

for pass_id in range(train_status.next(), params["num_epochs"]):
    train()

    # Only trainer 0 saves a checkpoint at the end of every epoch.
    if trainer_id == 0:
        saved_status = TrainStatus(pass_id)
        fleet.save_checkpoint(exe, train_status=saved_status,
                              path=args.checkpoint, fs=fs)
```
- Start a JobServer on one node; it generates the node-changing scripts.
```bash
cd example/demo/collective
./start_job_server.sh
```
- Start a JobClient, which controls the worker processes.
```bash
# Set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# Set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>
mkdir -p resnet50_pod
./start_job_client.sh
```
- Experiment results
| total batch size | top-1 acc (%) | top-5 acc (%) |
|---|---|---|
| 1024 | 75.5 | 92.8 |
TBD
EDL is provided under the Apache-2.0 license.


