CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA in coordination with Johns Hopkins.
This work is intended to demonstrate efficient GPU utilization across a range
of NVIDIA hardware from SM_35 onward. The following guide describes how to
use and tune the decoder for your models.

A single speech-to-text decode is not enough work to fully saturate any NVIDIA
GPU. To fully saturate a GPU we need to decode many audio files concurrently.
The solution provided here does this through a combination of batching many
audio files into a single speech pipeline, running multiple pipelines in
parallel on the device, and using multiple CPU threads to perform feature
extraction and determinization. Users of the decoder will need a high-level
understanding of the underlying implementation to know how to tune the
decoder.

The interface to the decoder is defined in "batched-threaded-cuda-decoder.h".
A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc".
Below is a simple usage example.
/*
 * BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 * batchedDecoderConfig.Register(&po);
 * po.Read(argc, argv);
 * ...
 * BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 * CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 * ...
 *
 * // Enqueue every utterance; each key must be unique per decode.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   std::string key = wav_reader.Key();
 *   CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *   ...
 * }
 *
 * // 'processed' holds the keys opened above (bookkeeping elided) and 'key'
 * // is the next one to collect; GetLattice waits for that decode to finish.
 * while (!processed.empty()) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 *   ...
 * }
 *
 * CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig
and register its options with the option parser. This allows the
configuration to be tuned from the command line. Next we declare the
CudaDecoder with that configuration. Before we can use the CudaDecoder
we need to initialize it with an FST, AmNnetSimple, and TransitionModel.

Next we iterate through the wave files and enqueue them into the decoder by
calling OpenDecodeHandle. Note that the key must be unique for each decode.
Once we have enqueued work we can query the results by calling GetLattice
on the same key we opened the handle on. This will automatically wait for
processing to complete before returning.

The key to getting good performance is to have many decodes active at the
same time by opening many decode handles before querying for the lattices.
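
A minimal sketch of that pattern, reusing the wav_reader and CudaDecoder
objects from the example above; the 'keys' vector is hypothetical bookkeeping
standing in for the 'processed' container the example elides:
/*
 * std::vector<std::string> keys;
 *
 * // Open a handle for every utterance up front so that many decodes are in
 * // flight at the same time.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   keys.push_back(wav_reader.Key());
 *   CudaDecoder.OpenDecodeHandle(keys.back(), wav_reader.Value());
 * }
 *
 * // Collect results; GetLattice blocks until the given decode has finished.
 * for (const std::string &key : keys) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 * }
 */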


PERFORMANCE TUNING:

The CudaDecoder has a number of tuning parameters which can be used to
increase performance on various models and hardware. Note that the optimal
parameters are expected to vary according to the hardware, the model, and
the data being decoded.

The following briefly describes each parameter; a sample invocation using
these options appears after the lists below.

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work
    to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks like determinization and
    feature extraction. For best performance this should use all spare CPU
    threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as
    large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work.
    Draining the batch allows nnet3 to be better batched. Testing has
    indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice. If
    this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be
    queued and not assigned before OpenDecodeHandle will automatically stall
    the submitting thread. Raising this increases CPU resource usage. This
    should be set to at least a few thousand.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens allocated in host buffers. If this
    size is exceeded the buffer will reallocate a larger one, consuming more
    resources.
  max-tokens-per-frame: Maximum tokens in GPU memory per frame. If this
    value is exceeded the beam will tighten and accuracy may decrease.
  max-active: At the end of each frame's computation, keep only the best
    max-active tokens (arc instantiations).

Device Options:
  use-tensor-cores: Enables tensor cores (fp16 math) for GEMMs. This is
    faster but less accurate. For inference the loss of accuracy is marginal.
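
As a concrete starting point, an invocation of the example binary could look
like the sketch below. The flag spellings follow the option names listed
above, the numeric values are illustrative rather than recommended, and the
model, graph, and wav/lattice arguments are placeholders; check the binary's
--help output for the exact flags and argument order in your build.

  batched-wav-nnet3-cuda \
    --cuda-control-threads=3 --cuda-worker-threads=20 \
    --max-batch-size=100 --batch-drain-size=20 \
    --determinize-lattice=true --max-outstanding-queue-length=4000 \
    --beam=15.0 --lattice-beam=8.0 --max-active=10000 \
    --use-tensor-cores=true \
    final.mdl HCLG.fst scp:wav.scp "ark:|gzip -c > lat.gz"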

GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory;
consumer GPUs have much less. For best performance users should run as many
concurrent decodes as possible, so they should purchase GPUs with as much
memory as possible. GPUs with less memory may have to sacrifice either
performance or accuracy. On 16 GB GPUs, for example, we are able to support
around 200 concurrent decodes at a time. This translates into 4
cuda-control-threads and a max-batch-size of 50 (4x50). If your model is
larger or smaller than the models we used when testing, you may have to
lower or raise these values accordingly.

There are a number of parameters which can be used to control GPU memory
usage. How they impact memory usage and accuracy is discussed below; a
reduced-memory sizing sketch follows the list:

  max-tokens-per-frame: Controls how many tokens can be stored on the GPU for
    each frame. This buffer cannot be exceeded or reallocated. As the buffer
    gets closer to being exhausted the beam is reduced, possibly reducing
    quality. This should be tuned according to the model and data. For
    example, a highly accurate model could set this value smaller to enable
    more concurrent decodes.

  cuda-control-threads: Each control thread runs a concurrent pipeline, so
    GPU memory usage scales linearly with this parameter. This should always
    be at least 2, but probably no higher than 4, as more concurrent
    pipelines lead to more driver contention, reducing performance.

  max-batch-size: The number of concurrent decodes in each pipeline. Memory
    usage also scales linearly with this parameter. Setting this smaller will
    reduce kernel runtime while increasing launch latency overhead. Ideally
    this should be as large as possible while still fitting into memory.
    Note that currently the maximum allowed is 200.
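
To make the scaling concrete: the 16 GB baseline above corresponds to 4
cuda-control-threads times a max-batch-size of 50, i.e. roughly 200
concurrent decodes. Since memory scales roughly linearly with both
parameters, halving the pipeline count is a reasonable first step on a
smaller device. The settings below are illustrative assumptions, not
measured recommendations:

  # Smaller device (assumption): half the pipelines, same per-pipeline batch.
  #   concurrent decodes ~= cuda-control-threads * max-batch-size = 2 * 50 = 100
  --cuda-control-threads=2 --max-batch-size=50 --batch-drain-size=10

  # If per-frame token buffers still do not fit, shrink them and accept that
  # the beam may tighten sooner, possibly reducing accuracy.
  --max-tokens-per-frame=<value tuned on your model and data>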

ACKNOWLEDGEMENTS:

We would like to thank Daniel Povey, Zhehuai Chen, and Daniel Galvez for
their help and expertise during the review process.