Commit b8a35fd

[src] Adding GPU/CUDA lattice batched decoder + binary (#3114)

hugovbraun authored and danpovey committed
1 parent f8cb5cc · commit b8a35fd

20 files changed: +7153 -3 lines

src/Makefile

Lines changed: 5 additions & 3 deletions
@@ -4,12 +4,12 @@
 
 SHELL := /bin/bash
 
-
 SUBDIRS = base matrix util feat tree gmm transform \
           fstext hmm lm decoder lat kws cudamatrix nnet \
           bin fstbin gmmbin fgmmbin featbin \
           nnetbin latbin sgmm2 sgmm2bin nnet2 nnet3 rnnlm chain nnet3bin nnet2bin kwsbin \
-          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin
+          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin \
+          cudadecoder cudadecoderbin
 
 MEMTESTDIRS = base matrix util feat tree gmm transform \
           fstext hmm lm decoder lat nnet kws chain \
@@ -143,7 +143,7 @@ $(EXT_SUBDIRS) : checkversion kaldi.mk mklibdir ext_depend
 ### Dependency list ###
 # this is necessary for correct parallel compilation
 #1)The tools depend on all the libraries
-bin fstbin gmmbin fgmmbin sgmm2bin featbin nnetbin nnet2bin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin: \
+bin fstbin gmmbin fgmmbin sgmm2bin featbin nnetbin nnet2bin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin cudadecoderbin: \
     base matrix util feat tree gmm transform sgmm2 fstext hmm \
     lm decoder lat cudamatrix nnet nnet2 nnet3 ivector chain kws online2 rnnlm
 
@@ -174,3 +174,5 @@ onlinebin: base matrix util feat tree gmm transform sgmm2 fstext hmm lm decoder
 online: decoder gmm transform feat matrix util base lat hmm tree
 online2: decoder gmm transform feat matrix util base lat hmm tree ivector cudamatrix nnet2 nnet3 chain
 kws: base util hmm tree matrix lat
+cudadecoder: cudamatrix online2 nnet3 ivector feat fstext lat chain transform
+cudadecoderbin: cudadecoder cudamatrix online2 nnet3 ivector feat fstext lat chain transform

src/cudadecoder/Makefile

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
all:

EXTRA_CXXFLAGS = -Wno-sign-compare
include ../kaldi.mk

ifeq ($(CUDA), true)

# Make sure we have CUDA_ARCH from kaldi.mk,
ifndef CUDA_ARCH
  $(error CUDA_ARCH is undefined, run 'src/configure')
endif

TESTFILES =

OBJFILES = batched-threaded-nnet3-cuda-pipeline.o decodable-cumatrix.o \
           cuda-decoder.o cuda-decoder-kernels.o cuda-fst.o

LDFLAGS += $(CUDA_LDFLAGS)
LDLIBS += $(CUDA_LDLIBS)

LIBNAME = kaldi-cudadecoder

ADDLIBS = ../cudamatrix/kaldi-cudamatrix.a ../base/kaldi-base.a ../matrix/kaldi-matrix.a \
          ../lat/kaldi-lat.a ../util/kaldi-util.a ../matrix/kaldi-matrix.a ../gmm/kaldi-gmm.a \
          ../fstext/kaldi-fstext.a ../hmm/kaldi-hmm.a ../gmm/kaldi-gmm.a ../transform/kaldi-transform.a \
          ../tree/kaldi-tree.a ../online2/kaldi-online2.a ../nnet3/kaldi-nnet3.a

# Implicit rule for kernel compilation
%.o : %.cu
	$(CUDATKDIR)/bin/nvcc -c $< -o $@ $(CUDA_INCLUDE) $(CUDA_FLAGS) $(CUDA_ARCH) -I../ -I$(OPENFSTINC)
endif

include ../makefiles/default_rules.mk

src/cudadecoder/README

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA in coordination with Johns Hopkins.
This work is intended to demonstrate efficient GPU utilization across a range
of NVIDIA hardware from SM_35 onward. The following guide describes how to
use and tune the decoder for your models.

A single speech-to-text decode is not enough work to fully saturate any NVIDIA
GPU. To fully saturate a GPU we need to decode many audio files concurrently.
The solution provided does this through a combination of batching many audio
files into a single speech pipeline, running multiple pipelines in parallel
on the device, and using multiple CPU threads to perform feature extraction
and determinization. Users of the decoder need a high-level understanding of
the underlying implementation to know how to tune it.

The interface to the decoder is defined in "batched-threaded-cuda-decoder.h".
A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc".
Below is a simple usage example.
/*
 * BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 * batchedDecoderConfig.Register(&po);
 * po.Read(argc, argv);
 * ...
 * BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 * CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 * ...
 *
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   std::string key = wav_reader.Key();
 *   CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *   ...
 * }
 *
 * while (!processed.empty()) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 *   ...
 * }
 *
 * CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig
and register its options, which lets the configuration be tuned from the
command line. Next we construct the CudaDecoder with that configuration.
Before we can use the CudaDecoder we need to initialize it with an
FST, an AmNnetSimple, and a TransitionModel.

Next we iterate through the waveforms and enqueue them into the decoder by
calling OpenDecodeHandle. Note that the key must be unique for each
decode. Once we have enqueued work we can query the results by calling
GetLattice with the same key we opened the handle on. GetLattice
automatically waits for processing to complete before returning.

The key to getting good performance is to have many decodes active at the
same time by opening many decode handles before querying for the lattices,
for example as in the sketch below.

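As a concrete illustration of that pattern, the minimal sketch below enqueues
every utterance before collecting any lattice. It is a sketch only: it assumes
the class and method names from the usage example above (including the
include path for the decoder header), plus Kaldi's standard table readers and
writers. Option registration, model loading, and error handling are omitted.

  #include <string>
  #include <vector>
  #include "feat/wave-reader.h"
  #include "lat/kaldi-lattice.h"
  #include "util/common-utils.h"
  // Header name as given above; the exact include path is an assumption.
  #include "cudadecoder/batched-threaded-cuda-decoder.h"

  // Enqueue all utterances first, then collect results. GetLattice() blocks
  // until the decode for that key has finished, so retrieving lattices only
  // after all handles are open keeps the GPU pipelines full.
  void DecodeAll(kaldi::BatchedThreadedCudaDecoder &cuda_decoder,
                 kaldi::SequentialTableReader<kaldi::WaveHolder> &wav_reader,
                 kaldi::CompactLatticeWriter &clat_writer) {
    std::vector<std::string> keys;
    for (; !wav_reader.Done(); wav_reader.Next()) {
      std::string key = wav_reader.Key();
      // May stall once max-outstanding-queue-length decodes are pending.
      cuda_decoder.OpenDecodeHandle(key, wav_reader.Value());
      keys.push_back(key);
    }
    for (const std::string &key : keys) {
      kaldi::CompactLattice clat;
      cuda_decoder.GetLattice(key, &clat);
      cuda_decoder.CloseDecodeHandle(key);
      clat_writer.Write(key, clat);
    }
  }
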
PERFORMANCE TUNING:

The CudaDecoder has many tuning parameters that can be used to increase
performance across different models and hardware. Note that the optimal
parameters are expected to vary with the hardware, model, and data being
decoded.

The following briefly describes each parameter (a configuration sketch
follows the lists):

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work
    to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks like determinization and
    feature extraction. For best performance this should take up all spare
    CPU threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as
    large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work.
    Draining the batch allows nnet3 to be better batched. Testing has
    indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice.
    If this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be
    queued and not assigned before OpenDecodeHandle will stall the
    submitting thread. Raising this increases CPU resource usage. This
    should be set to at least a few thousand.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens allocated in host buffers. If this
    size is exceeded the buffer is reallocated to a larger size, consuming
    more resources.
  max-tokens-per-frame: Maximum tokens in GPU memory per frame. If this
    value is exceeded the beam will tighten and accuracy may decrease.
  max-active: At the end of each frame's computation, we keep only its best
    max-active tokens (arc instantiations).

Device Options:
  use-tensor-cores: Enables tensor cores (fp16 math) for GEMMs. This is
    faster but less accurate. For inference the loss of accuracy is marginal.

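Putting these together, the sketch below shows one hypothetical starting
configuration expressed as command-line options consumed through the config's
Register() call, as in the usage example earlier. The option names follow the
lists above; the values are illustrative starting points rather than measured
optima, the include path is an assumption, and it is assumed that all three
option groups are registered through this config.

  #include "util/parse-options.h"
  // Header name as given above; the exact include path is an assumption.
  #include "cudadecoder/batched-threaded-cuda-decoder.h"

  int main(int argc, char *argv[]) {
    using namespace kaldi;
    ParseOptions po("Tuning sketch for the batched CUDA decoder.");
    BatchedThreadedCudaDecoderConfig batched_decoder_config;
    batched_decoder_config.Register(&po);
    // Illustrative invocation (starting points to tune from), e.g.:
    //   --cuda-control-threads=4 --cuda-worker-threads=20 \
    //   --max-batch-size=100 --batch-drain-size=20 \
    //   --max-outstanding-queue-length=4000 \
    //   --beam=15.0 --lattice-beam=7.0 --max-active=10000
    po.Read(argc, argv);
    // ... construct a BatchedThreadedCudaDecoder with batched_decoder_config,
    // call Initialize(), then open decode handles as in the usage example ...
    return 0;
  }
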
GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory;
consumer GPUs have much less. For best performance users should run as many
concurrent decodes as possible, so users should purchase GPUs with as much
memory as possible. GPUs with less memory may have to sacrifice either
performance or accuracy. On 16 GB GPUs, for example, we are able to support
around 200 concurrent decodes at a time. This translates into 4
cuda-control-threads and a max-batch-size of 50 (4x50). If your model is
larger or smaller than the models we used when testing, you may have to
lower or raise this.

There are a number of parameters which can be used to control GPU memory
usage. How they impact memory usage and accuracy is discussed below:

max-tokens-per-frame: Controls how many tokens can be stored on the GPU for
  each frame. This buffer cannot be exceeded or reallocated. As it gets
  closer to being exhausted the beam is reduced, possibly reducing quality.
  This should be tuned according to the model and data. For example, a
  highly accurate model could set this value smaller to enable more
  concurrent decodes.

cuda-control-threads: Each control thread is a concurrent pipeline, so GPU
  memory scales linearly with this parameter. This should always be at least
  2 but should probably not be higher than 4, as more concurrent pipelines
  lead to more driver contention, reducing performance.

max-batch-size: The number of concurrent decodes in each pipeline. Memory
  usage also scales linearly with this parameter. Setting this smaller will
  reduce kernel runtime while increasing launch latency overhead. Ideally
  this should be as large as possible while still fitting into memory.
  Note that currently the maximum allowed is 200.

== Acknowledgement ==

We would like to thank Daniel Povey, Zhehuai Chen, and Daniel Galvez for
their help and expertise during the review process.
