CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA in coordination with Johns Hopkins.
This work is intended to demonstrate efficient GPU utilization across a range
of NVIDIA hardware from SM_35 onward. The following guide describes how to
use and tune the decoder for your models.

A single speech-to-text decode is not enough work to fully saturate any NVIDIA
GPU. To fully saturate a GPU we need to decode many audio files concurrently.
The solution provided here does this through a combination of batching many
audio files into a single speech pipeline, running multiple pipelines in
parallel on the device, and using multiple CPU threads to perform feature
extraction and determinization. Users of the decoder will need a high-level
understanding of the underlying implementation to know how to tune the
decoder.

The interface to the decoder is defined in "batched-threaded-cuda-decoder.h".
A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc".
Below is a simple usage example.
/*
 * BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 * batchedDecoderConfig.Register(&po);
 * po.Read(argc, argv);
 * ...
 * BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 * CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 * ...
 *
 * // Enqueue every utterance; each key must be unique per decode.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   std::string key = wav_reader.Key();
 *   CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *   ...
 * }
 *
 * // 'processed' holds the keys opened above (bookkeeping elided) and 'key'
 * // is the next one to collect; GetLattice waits for that decode to finish.
 * while (!processed.empty()) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 *   ...
 * }
 *
 * CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig
and register its options with the option parser. This allows the
configuration to be tuned from the command line. Next we declare the
CudaDecoder with that configuration. Before we can use the CudaDecoder
we need to initialize it with an FST, AmNnetSimple, and TransitionModel.

Next we iterate through the wave files and enqueue them into the decoder by
calling OpenDecodeHandle. Note that the key must be unique for each decode.
Once we have enqueued work we can query the results by calling GetLattice
on the same key we opened the handle on. This will automatically wait for
processing to complete before returning.

The key to getting good performance is to have many decodes active at the
same time by opening many decode handles before querying for the lattices.
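
A minimal sketch of that pattern, reusing the wav_reader and CudaDecoder
objects from the example above; the 'keys' vector is hypothetical bookkeeping
standing in for the 'processed' container the example elides:
/*
 * std::vector<std::string> keys;
 *
 * // Open a handle for every utterance up front so that many decodes are in
 * // flight at the same time.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   keys.push_back(wav_reader.Key());
 *   CudaDecoder.OpenDecodeHandle(keys.back(), wav_reader.Value());
 * }
 *
 * // Collect results; GetLattice blocks until the given decode has finished.
 * for (const std::string &key : keys) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 * }
 */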


PERFORMANCE TUNING:

The CudaDecoder has a number of tuning parameters which can be used to
increase performance on various models and hardware. Note that the optimal
parameters are expected to vary according to the hardware, the model, and
the data being decoded.

The following briefly describes each parameter; a sample invocation using
these options appears after the lists below.

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work
    to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks like determinization and
    feature extraction. For best performance this should use all spare CPU
    threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as
    large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work.
    Draining the batch allows nnet3 to be better batched. Testing has
    indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice. If
    this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be
    queued and not assigned before OpenDecodeHandle will automatically stall
    the submitting thread. Raising this increases CPU resource usage. This
    should be set to at least a few thousand.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens allocated in host buffers. If this
    size is exceeded the buffer will reallocate a larger one, consuming more
    resources.
  max-tokens-per-frame: Maximum tokens in GPU memory per frame. If this
    value is exceeded the beam will tighten and accuracy may decrease.
  max-active: At the end of each frame's computation, keep only the best
    max-active tokens (arc instantiations).

Device Options:
  use-tensor-cores: Enables tensor cores (fp16 math) for GEMMs. This is
    faster but less accurate. For inference the loss of accuracy is marginal.
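
As a concrete starting point, an invocation of the example binary could look
like the sketch below. The flag spellings follow the option names listed
above, the numeric values are illustrative rather than recommended, and the
model, graph, and wav/lattice arguments are placeholders; check the binary's
--help output for the exact flags and argument order in your build.

  batched-wav-nnet3-cuda \
    --cuda-control-threads=3 --cuda-worker-threads=20 \
    --max-batch-size=100 --batch-drain-size=20 \
    --determinize-lattice=true --max-outstanding-queue-length=4000 \
    --beam=15.0 --lattice-beam=8.0 --max-active=10000 \
    --use-tensor-cores=true \
    final.mdl HCLG.fst scp:wav.scp "ark:|gzip -c > lat.gz"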

GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory;
consumer GPUs have much less. For best performance users should run as many
concurrent decodes as possible, so they should purchase GPUs with as much
memory as possible. GPUs with less memory may have to sacrifice either
performance or accuracy. On 16 GB GPUs, for example, we are able to support
around 200 concurrent decodes at a time. This translates into 4
cuda-control-threads and a max-batch-size of 50 (4x50). If your model is
larger or smaller than the models we used when testing, you may have to
lower or raise these values accordingly.

There are a number of parameters which can be used to control GPU memory
usage. How they impact memory usage and accuracy is discussed below; a
reduced-memory sizing sketch follows the list:

  max-tokens-per-frame: Controls how many tokens can be stored on the GPU for
    each frame. This buffer cannot be exceeded or reallocated. As the buffer
    gets closer to being exhausted the beam is reduced, possibly reducing
    quality. This should be tuned according to the model and data. For
    example, a highly accurate model could set this value smaller to enable
    more concurrent decodes.

  cuda-control-threads: Each control thread runs a concurrent pipeline, so
    GPU memory usage scales linearly with this parameter. This should always
    be at least 2, but probably no higher than 4, as more concurrent
    pipelines lead to more driver contention, reducing performance.

  max-batch-size: The number of concurrent decodes in each pipeline. Memory
    usage also scales linearly with this parameter. Setting this smaller will
    reduce kernel runtime while increasing launch latency overhead. Ideally
    this should be as large as possible while still fitting into memory.
    Note that currently the maximum allowed is 200.
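
To make the scaling concrete: the 16 GB baseline above corresponds to 4
cuda-control-threads times a max-batch-size of 50, i.e. roughly 200
concurrent decodes. Since memory scales roughly linearly with both
parameters, halving the pipeline count is a reasonable first step on a
smaller device. The settings below are illustrative assumptions, not
measured recommendations:

  # Smaller device (assumption): half the pipelines, same per-pipeline batch.
  #   concurrent decodes ~= cuda-control-threads * max-batch-size = 2 * 50 = 100
  --cuda-control-threads=2 --max-batch-size=50 --batch-drain-size=10

  # If per-frame token buffers still do not fit, shrink them and accept that
  # the beam may tighten sooner, possibly reducing accuracy.
  --max-tokens-per-frame=<value tuned on your model and data>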

ACKNOWLEDGEMENTS:

We would like to thank Daniel Povey, Zhehuai Chen, and Daniel Galvez for
their help and expertise during the review process.