Cpp parallel executor #9080
Conversation
doc/design/parallel_executor.md Outdated
```python
opt = fluid.optimizer.SGDOptimizer()
opt.minimize(avg_cost)

# change Executor -> ParallelExecutor
```
Maybe we can let the user still use the Executor interface and add an optional argument "gpu_list"; under the hood, if there are multiple GPUs available (either len(gpu_list) > 0, or gpu_list == None and multiple GPUs initialized), create and return the parallel executor instance.
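A rough sketch of what this suggestion could look like; `Executor`, `ParallelExecutor`, `gpu_list`, and `available_gpus` are hypothetical names for illustration, not the actual fluid API:

```python
# Hypothetical sketch: a single Executor entry point that transparently
# dispatches to a parallel implementation when several GPUs are requested
# or detected.
def available_gpus():
    # Placeholder for querying initialized devices; assume 4 GPUs here.
    return [0, 1, 2, 3]

class Executor:
    def __new__(cls, gpu_list=None):
        # An explicit gpu_list wins; otherwise fall back to whatever
        # devices are initialized.
        devices = gpu_list if gpu_list else available_gpus()
        if len(devices) > 1:
            # Multiple GPUs -> return the parallel variant transparently.
            inst = object.__new__(ParallelExecutor)
        else:
            inst = object.__new__(cls)
        inst.devices = devices
        return inst

class ParallelExecutor(Executor):
    pass

exe = Executor(gpu_list=[0, 1])
assert isinstance(exe, ParallelExecutor)   # multi-GPU -> parallel variant
single = Executor(gpu_list=[0])
assert type(single) is Executor            # single GPU -> plain executor
```

The point of the sketch is only the dispatch decision: user code keeps constructing `Executor` and never needs to know which implementation it got back.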
doc/design/parallel_executor.md Outdated
```cpp
// e.g. sgd should wait for allreduce to be finished
CallBack->BeforeOp(op);

op->Run(*local_scope, place_);
```
From my understanding, the reason we need the callback is that ParallelExecutor calls Executor::Run but needs to be notified before and after each Op::Run. Do we even need the separate Executor implementation anymore? Maybe we can consolidate them into a single executor, so that we don't need the callback.
It would also be easier for the Python side: Python would always create the same executor.
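One way to read this consolidation idea is an executor that takes optional before/after hooks directly, instead of a separate executor plus a callback object. A toy sketch (all names here are illustrative, not the actual framework classes):

```python
# Hypothetical sketch: one executor with optional before/after hooks,
# so no separate callback interface is needed.
class UnifiedExecutor:
    def __init__(self, before_op=None, after_op=None):
        self.before_op = before_op or (lambda op: None)
        self.after_op = after_op or (lambda op: None)

    def run(self, ops):
        for op in ops:
            self.before_op(op)   # e.g. wait for allreduce to finish
            # ... op->Run(*local_scope, place_) would happen here ...
            self.after_op(op)    # e.g. mark the op's outputs as ready

events = []
exe = UnifiedExecutor(before_op=lambda op: events.append("before:" + op),
                      after_op=lambda op: events.append("after:" + op))
exe.run(["allreduce", "sgd"])
assert events == ["before:allreduce", "after:allreduce",
                  "before:sgd", "after:sgd"]
```

A plain single-device run would simply pass no hooks and get the old behaviour.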
```cpp
std::vector<OpHandle *> to_run;
for (auto *var : to_remove) {
  for (auto *op : var->pending_ops_) {
    if (var->name_ == "mean_0.tmp_0@GRAD") {
```
what is the purpose of this special case?
Just debug code... Sorry
```cpp
struct OpHandle {
  std::vector<VarHandle *> inputs_;
  std::vector<VarHandle *> outputs_;
  platform::DeviceContext *dev_ctx_;
```
add framework::Scope* scope_?
```cpp
}

std::vector<LoDTensor> ParallelExecutor::Run(
    const std::vector<std::string> &fetch_tensors) {
```
Instantiate Variables here?
Paddle/paddle/fluid/framework/executor.cc
Lines 276 to 305 in 41894da
```cpp
Scope* local_scope = scope;
if (create_vars) {
  if (create_local_scope) {
    local_scope = &scope->NewScope();
    for (auto& var : block.AllVars()) {
      if (var->Name() == framework::kEmptyVarName) {
        continue;
      }
      if (var->Persistable()) {
        auto* ptr = scope->Var(var->Name());
        CreateTensor(ptr, var->GetType());
        VLOG(3) << "Create Variable " << var->Name()
                << " global, which pointer is " << ptr;
      } else {
        auto* ptr = local_scope->Var(var->Name());
        CreateTensor(ptr, var->GetType());
        VLOG(3) << "Create Variable " << var->Name()
                << " locally, which pointer is " << ptr;
      }
    }
  } else {
    for (auto& var : block.AllVars()) {
      auto* ptr = local_scope->Var(var->Name());
      CreateTensor(ptr, var->GetType());
      VLOG(3) << "Create variable " << var->Name() << ", which pointer is "
              << ptr;
    }
  }  // if (create_local_scope)
}  // if (create_vars)
```
Yes. We need to instantiate variables here. We might extract this routine into a global function.
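The quoted routine's core logic (persistable variables go into the outer scope, everything else into a fresh local scope, with lookup falling back to the parent) can be mimicked in a small sketch; `Scope` and the variable names here are simplified stand-ins, not the actual framework classes:

```python
# Illustrative sketch of the scope logic in the quoted executor code:
# persistable variables live in the parent scope, temporaries in a
# per-run local scope that can be dropped afterwards.
class Scope:
    def __init__(self, parent=None):
        self.vars = {}
        self.parent = parent

    def new_scope(self):
        return Scope(parent=self)

    def var(self, name):
        # Create the variable in this scope if absent (like Scope::Var).
        return self.vars.setdefault(name, object())

    def find(self, name):
        # Lookup walks up through parent scopes (like Scope::FindVar).
        if name in self.vars:
            return self.vars[name]
        return self.parent.find(name) if self.parent else None

def create_vars(scope, block_vars):
    local_scope = scope.new_scope()
    for name, persistable in block_vars:
        target = scope if persistable else local_scope
        target.var(name)
    return local_scope

global_scope = Scope()
local = create_vars(global_scope, [("fc_0.w", True), ("fc_0.tmp", False)])
assert "fc_0.w" in global_scope.vars    # persistable -> parent scope
assert "fc_0.tmp" in local.vars         # temporary -> local scope
assert local.find("fc_0.w") is not None # visible through parent lookup
```

Dropping the local scope after a run discards the temporaries while the parameters in the parent scope survive.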
```cpp
  }
};

member_->pool_.Run(op_run);
```
Actually, we should add a callback after we push the operator-run job to the thread pool. In this callback, we change the pending_var state.
The pending_var state has already been changed at L404-L406.
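The pattern under discussion — push each op to a thread pool and flip its output variables from "pending" to "ready" when the job completes — can be sketched as follows (a toy model with assumed names, not the PR's C++ code):

```python
# Minimal sketch: each op is pushed to a thread pool, and a completion
# callback marks the op's output variables as ready so dependent ops
# can be scheduled.
from concurrent.futures import ThreadPoolExecutor
import threading

ready = set()
lock = threading.Lock()

def run_op(pool, op_name, outputs, on_ready):
    def job():
        # ... op->Run(*local_scope, place_) would happen here ...
        with lock:
            for var in outputs:
                ready.add(var)  # pending_var -> ready
        on_ready(outputs)
    return pool.submit(job)

done = threading.Event()
with ThreadPoolExecutor(max_workers=2) as pool:
    f = run_op(pool, "elementwise_add", ["sum_out"], lambda outs: done.set())
    f.result()  # wait for the job to finish

assert done.is_set()
assert "sum_out" in ready
```

The lock matters because multiple pool workers may finish concurrently and update the shared ready-set.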
```cpp
    member_->local_scopes_.size() != 1) {  // Is CUDA
  BuildNCCLCommunicator();
  BCastParamsToGPUs(startup_program);
}
```
Why not initialize the parameters on the respective devices?
Because the random initialization results might not be the same across devices when seed = 0.
```cpp
void RunOp(std::unordered_map<VarHandleBase *, bool> &pending_vars,
           OpHandle *op) const;

void PolishGraphToSupportDataHarzaeds() const;
```
DataHarzaeds --> DataHazards
Yes. Since the operators in a sub-block will be executed by a control flow operator (e.g., While), the behaviour of control flow operators and computational operators should be the same.
I have finished my review. @chengduoZH Let's verify the correctness and speed of Transformer and ResNeXt. If they are OK, let's merge it soon so that everyone can start improving it.
```cpp
namespace details {

struct FetchOpHandle : public OpHandleBase {
  FeedFetchList *data_;
```
Let's add a "private:" then?
I see. Can you add your reply as comments in the code?
```cpp
// if size is 0. We just make sure it does.
if (size <= 0) return nullptr;
void *p;
int prev_id;
```
I would suggest making this a PADDLE_ENFORCE, if the current behavior works. Otherwise, readers will think the Allocator currently works on multiple GPUs.
```cpp
  PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error");
}

void FetchOpHandle::WaitAndMergeCPUTensors() const {
```
Does this actually wait on anything? Maybe just name it MergeCPUTensors?
```cpp
// FIXME: Currently ScaleLossGradOp only use device_count as scale
// factor. So it does not depend on any other operators.
// VarHandle *loss = GetVarHandle(loss_var_name, place);
// loss->pending_ops_.emplace_back(op_handle);
```
if this op doesn't depend on anything, when will it be scheduled?
It will be run at the very beginning, since an op with no pending inputs is immediately ready.
```cpp
 *
 * https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Write_after_read_(WAR)
 */
static void PolishGraphToSupportDataHazards(SSAGraph *graph);
```
Add more comments describing when data hazards happen?
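For context: a write-after-read (WAR) hazard arises when one op reads a variable and a later op overwrites the same variable in-place; the write must not start before the read finishes. The dummy-dependency trick used in this PR can be sketched like this (the classes below are simplified stand-ins for the PR's VarHandle/OpHandle):

```python
# Sketch of resolving a WAR hazard: insert a dummy variable so the
# writing op depends on the reading op having finished.
class Var:
    def __init__(self, name):
        self.name = name
        self.pending_ops = []   # ops that consume this var

class Op:
    def __init__(self, name):
        self.name = name
        self.inputs, self.outputs = [], []

    def add_input(self, var):
        self.inputs.append(var)
        var.pending_ops.append(self)

    def add_output(self, var):
        self.outputs.append(var)

def add_war_dependency(read_op, write_op):
    # read_op reads x; write_op overwrites x in place.  A dummy var
    # forces write_op to wait until read_op completes.
    dep = Var("dummy_dep")
    read_op.add_output(dep)
    write_op.add_input(dep)
    return dep

reader, writer = Op("sum"), Op("sgd")
dep = add_war_dependency(reader, writer)
assert dep in reader.outputs     # reader produces the dummy var
assert dep in writer.inputs      # writer now waits on it
assert writer in dep.pending_ops
```

The dummy variable carries no data; it exists purely to order the two ops in the dependency graph.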
```cpp
std::unordered_set<OpHandleBase *> ready_ops;

auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) {
  pending_vars.insert(&var);
```
should this be skipped if generated_op_ is nullptr?
```cpp
PADDLE_ENFORCE_EQ(actual.size(), expect.size());
for (int j = 0; j < actual.size(); ++j) {
  PADDLE_ENFORCE(actual[i] == expect[i] || expect[i] == -1);
  // PADDLE_ENFORCE(actual[i] == expect[i] || expect[i] == -1);
```
update this line?
```cpp
// Create local scopes.
for (auto &scope : local_scopes_) {
  auto &local_scope = scope->NewScope();
  *scope->Var("@TMP_SCOPE@")->GetMutable<Scope *>() = &local_scope;
```
Why are all the scopes using the same var name?
When Op::Run executes, some temporary variables are created in the local scopes. So the variable @TMP_SCOPE@ is just used to hold these local scopes; the temporaries will be destroyed after a period.
```cpp
auto *dep_var = new DummyVarHandle();
read_op->AddOutput(dep_var);
write_op->AddInput(dep_var);
graph->dep_vars_.emplace(dep_var);
```
should this be called data_hazard_vars_?
Comments have been added.
I just added a dependency engine to parse the dependencies of operators. There are still a lot of jobs that need to be done:

- Complete broadcasting of parameters.
- Use a thread pool to invoke operators in parallel.
- Complete the NCCL AllReduce OpHandle.

In this implementation, I just use `VarHandle` and `OpHandle` to parse the `Program` as an SSA-form graph. A variable is assigned by only one `OpHandle`. When all inputs of an `OpHandle` are ready, the `OpHandle` can be run. The speed of ResNeXt152 is
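The execution model described here — every variable is produced by exactly one op, and an op becomes runnable once all of its input variables are ready — can be sketched as a simple worklist loop (a toy model with made-up op names, not the PR's actual C++ scheduler):

```python
# Toy worklist scheduler over an SSA-style op graph: an op runs once all
# of its input variables are ready; running it makes its outputs ready.
def schedule(ops, ready_vars):
    """ops: list of (name, inputs, outputs); returns execution order."""
    ready_vars = set(ready_vars)
    order, pending = [], list(ops)
    while pending:
        progressed = False
        for op in list(pending):
            name, ins, outs = op
            if all(v in ready_vars for v in ins):
                order.append(name)
                ready_vars.update(outs)  # outputs become ready
                pending.remove(op)
                progressed = True
        if not progressed:
            raise RuntimeError("cycle or unproduced input in graph")
    return order

# A tiny training step: parameters ("w") are ready at the start.
ops = [
    ("sgd", ["w", "w@GRAD"], ["w_new"]),          # waits for allreduce
    ("allreduce", ["w@GRAD.local"], ["w@GRAD"]),
    ("backward", ["loss"], ["w@GRAD.local"]),
    ("forward", ["w"], ["loss"]),
]
order = schedule(ops, ready_vars={"w"})
assert order == ["forward", "backward", "allreduce", "sgd"]
```

Note how the declaration order of the ops does not matter; only the variable dependencies drive the schedule, which is exactly what makes the SSA form convenient for parallel execution.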