Description
We want to implement a new language which is differentiable. Please refer to https://medium.com/@maxbendick/designing-a-differentiable-language-for-deep-learning-1812ee480ff1.
Using a deep learning framework like TensorFlow requires users to create a graph of symbolic tensors connected by operations such as layers. This feels strange because you are not writing the program itself; you are writing a program that constructs a program (you write Python code that builds the computation graph, which is then interpreted by TensorFlow).
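For concreteness, here is a minimal TensorFlow 1.x-style sketch of that two-step pattern: the Python code only assembles a symbolic graph, and a separate Session then interprets it (the shapes and values are purely illustrative).

```python
import tensorflow as tf  # TensorFlow 1.x API

# Building the graph: these lines compute nothing, they only add nodes.
x = tf.placeholder(tf.float32, shape=[None, 3])   # symbolic input tensor
w = tf.Variable(tf.zeros([3, 1]))                 # symbolic parameter
y = tf.matmul(x, w)                               # symbolic op node

# Running the graph: a separate interpreter executes the constructed program.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
    print(out)
```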
In differentiable programming, computation graphs are the implicit substrate of the language. You could read a lot of differentiable code before even realizing it’s differentiable.
Because the graphs are implicit, differentiable languages are much more expressive for building complex models. First-class lists are used for variable-length data. First-class conditionals make control flow easy. Likewise, deep learning models, such as stacked convolutional layers, are first-class functions. Any function in a differentiable language is a model, because backprop can necessarily be run through any of them.
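As a hedged illustration (not part of the original proposal), an ordinary Python function with lists and control flow can already behave this way under PyTorch autograd: backprop runs through the whole function, so the function itself is a model.

```python
import torch

def stacked(x, weights):
    # first-class list of parameters and a first-class conditional
    for w in weights:
        x = torch.tanh(x @ w)
        if x.mean() < 0:      # data-dependent control flow
            x = -x
    return x.sum()

weights = [torch.randn(4, 4, requires_grad=True) for _ in range(3)]
x = torch.randn(2, 4)
loss = stacked(x, weights)
loss.backward()               # backprop runs through the whole function
print(weights[0].grad.shape)  # torch.Size([4, 4])
```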
A differentiable programming language should focus on expressing data and functions. We should separate expression from execution.
At the operator level, it has become a trend that kernel code is generated automatically. Please refer to TVM from the MXNet community and TensorComprehensions from Facebook Research. Users only need to write what an operator does; how the operator runs on particular hardware is generated automatically.
Both of these works learn a lot from the Halide language, which aims to separate algorithm description from schedule.
At the graph/program level, I think we should also separate algorithm description from schedule.
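For example, here is a hedged sketch of the algorithm/schedule split in TVM's early top-level API (recent TVM releases expose the same calls under tvm.te): only the compute expression describes what the operator does, while the schedule decides how it runs.

```python
import tvm  # early TVM API; newer versions use tvm.te for these calls

n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")

# Algorithm: what the operator computes.
C = tvm.compute(A.shape, lambda i: A[i] + B[i], name="C")

# Schedule: how it is executed on a particular backend.
s = tvm.create_schedule(C.op)
s[C].parallel(C.op.axis[0])   # e.g. parallelize the outer loop on CPU

f = tvm.build(s, [A, B, C], target="llvm")
```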
Currently, in our design, the parallel_do operator and the go/select operators actually deal with the execution of a program. These operators execute a block, and it is hard to say what the backward of a parallel_do/go/select operator is.
I propose that we separate our model expression/algorithm description from its execution/schedule, and keep our ProgramDesc differentiable. Otherwise, we have to write a lot of if/else code in our transpilers to distinguish which operators are for scheduling and which are for describing the algorithm.
This problem has already shown up in the memory optimization transpiler: it analyzes neural network operators and their related variables, not the parallel_do operator, so it has to skip parallel_do, as the sketch below illustrates.
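The sketch below shows the shape of the problem. It is purely illustrative: the Op class and the op list stand in for Paddle's real OpDesc/BlockDesc structures.

```python
from dataclasses import dataclass

SCHEDULING_OPS = {"parallel_do", "go", "select"}

@dataclass
class Op:
    type: str

def memory_optimize(ops):
    analyzed = []
    for op in ops:
        if op.type in SCHEDULING_OPS:
            continue  # scheduling op: the pass cannot reason about its memory
        analyzed.append(op.type)  # stand-in for real variable/buffer analysis
    return analyzed

print(memory_optimize([Op("conv2d"), Op("parallel_do"), Op("sgd")]))
# ['conv2d', 'sgd']
```

Every new scheduling operator would force another special case like this into each transpiler.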
There are several ways to separate the description and schedule:
- Keep the ProgramDesc differentiable and let the schedule be handled by an efficient runtime Executor (discussed with @reyoung @jacquesqiao @dzhwinter ).
- ProgramDesc + ExecutionDesc; please refer to @helinwang 's design: Add ExecutionPlan design. #6078
- Block 0 of the ProgramDesc describes execution, and the algorithm is described starting from block 1 (discussed with @tonyyang-svail ).
In the third approach, the parallel_do and nccl_init operators are inserted into block 0, while block 1 contains the forward/backward/nccl allreduce/sgd operators. The overall logic becomes very clean: the parallel_do operator in block 0 launches four threads bound to four GPU cards, and each thread runs the same block 1. The backward/optimize transpiler and the memory optimize transpiler then only need to focus on block 1. This avoids excessive if/else logic for the parallel_do operator.
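A rough, framework-agnostic sketch of that block layout (the dict structure and op names are illustrative, not Paddle's real ProgramDesc API):

```python
program = {
    "block0": [  # execution / schedule only
        {"type": "nccl_init", "num_gpus": 4},
        {"type": "parallel_do", "sub_block": "block1", "num_threads": 4},
    ],
    "block1": [  # algorithm only: forward / backward / allreduce / sgd
        {"type": "conv2d"},
        {"type": "conv2d_grad"},
        {"type": "ncclAllReduce"},
        {"type": "sgd"},
    ],
}
```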
Users prefer to focus on algorithm description and would like to write code that only takes a single GPU into consideration. We can write a parallel_do transpiler to transpile a user's original ProgramDesc into a parallel_do ProgramDesc.
The logic of this parallel_do transpiler will be very simple: all OpDesc and VarDesc entries in the original ProgramDesc are moved to block 1, and an additional parallel_do OpDesc is inserted into block 0, as sketched below.
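A hedged sketch of that transpiler, reusing the dict-based representation from the layout sketch above (the function name and fields are hypothetical, not an existing Paddle API):

```python
def transpile_parallel_do(program, num_gpus=4):
    # Move the user's single-GPU ops into block 1 and schedule them from block 0.
    single_gpu_ops = list(program["block0"])  # user's original algorithm ops
    return {
        "block0": [
            {"type": "nccl_init", "num_gpus": num_gpus},
            {"type": "parallel_do", "sub_block": "block1",
             "num_threads": num_gpus},
        ],
        "block1": single_gpu_ops,  # moved unchanged, still differentiable
    }

single_gpu_program = {"block0": [{"type": "conv2d"}, {"type": "sgd"}],
                      "block1": []}
print(transpile_parallel_do(single_gpu_program))
```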