Daniel Guerrero

Nvidia NeMo Speech Recognition Starting Guide

After reading an article saying that NVIDIA had several models at the top of the HF ASR Leaderboard, I wanted to test them on my local computer.
Even though the code on HF looks pretty simple, it turned out it didn't work for nvidia/canary-qwen-2.5b, so I started to dig a bit deeper and test several features.

Base Setup

To test this you will need a base setup. I'm using Docker, so the options are:

  1. Using the Nvidia PyTorch container
  2. Using a Python image with CUDA-enabled libraries

Nvidia PyTorch container

This is the simplest option, but of course it may contain a lot of libraries you don't need; the size of the container is 12.78GB.

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  nvcr.io/nvidia/pytorch:25.06-py3

Python image with CUDA libraries

docker run \
  --gpus all \
  -it \
  --rm \
  python:3.12-bookworm \
  /bin/bash

Set up the CUDA libraries:

apt update && \
  apt install -y wget && \
  wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb && \
  dpkg -i cuda-keyring_1.1-1_all.deb && \
  apt-get update && \
  apt-get -y install --no-install-recommends cuda-toolkit-12-9

Note: the resulting image is around 11.5GB, so it is probably not much different in size from the Nvidia container image.

Setup NeMo libraries

This is pretty simple:

pip install nemo-toolkit[asr] 
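Before moving on, you can run a quick sanity check that the install worked and that the GPU is visible. This is just a minimal sketch, assuming PyTorch was pulled in as a NeMo dependency (it normally is):

import torch
import nemo.collections.asr as nemo_asr

# Confirm the GPU is visible to PyTorch and the ASR collection imports cleanly.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("NeMo ASR collection loaded:", nemo_asr.__name__)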

Setup ffmpeg

It should not be needed for these examples, but the libraries will report it as missing, and it is good to have for creating proper input files.

cd /tmp
wget https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
tar xvf ffmpeg-master-latest-linux64-gpl.tar.xz
cp ffmpeg-master-latest-linux64-gpl/bin/* /usr/bin/
rm -rf ffmpeg-master-latest-linux64-gpl ffmpeg-master-latest-linux64-gpl.tar.xz

Get audio samples

You need audio samples that are:

  • a single channel (mono)
  • a 16kHz sample rate
  • less than 20 seconds long
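If you are not sure whether a file meets these requirements, here is a minimal sketch that checks them using only the Python standard library (the file path is taken from the command line):

import sys
import wave

# Inspect channel count, sample rate and duration of a wav file.
with wave.open(sys.argv[1], "rb") as wav:
    channels = wav.getnchannels()
    rate = wav.getframerate()
    duration = wav.getnframes() / rate

print(f"channels={channels} rate={rate}Hz duration={duration:.1f}s")
if channels != 1 or rate != 16000 or duration >= 20:
    print("Does not meet the requirements; convert it with ffmpeg (see below)")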

A sample file used in several NeMo examples is this one:
https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

so you can download it:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav 

This sample already meets the previous requirements, so no processing is needed; but if you want to create a proper sample from any file, you can do it with ffmpeg:

ffmpeg -i INPUT_FILE -ac 1 -ar 16000 example.wav 
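If you have many files to prepare, a small Python wrapper around the same ffmpeg call can batch-convert them. This is just a sketch: the raw_audio folder and the *.mp3 pattern are placeholders for your own data.

import pathlib
import subprocess

# Convert every mp3 in raw_audio/ to a 16kHz mono wav next to the original.
for src in pathlib.Path("raw_audio").glob("*.mp3"):
    dst = src.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ac", "1", "-ar", "16000", str(dst)],
        check=True,
    )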

A more complex case is extracting just a part of the audio. The following takes 10 seconds starting at second 29, so it will extract the audio from 29s to 39s:

ffmpeg -i INPUT_FILE -ss 29 -t 10 -ac 1 -ar 16000 example.wav

Code example

The following file (asr_example.py) will help you to test the different models:

import nemo.collections.asr as nemo_asr
import argparse

parser = argparse.ArgumentParser(prog="ASR NeMo Example")
parser.add_argument(
    "--enable-timestamps",
    help="Enable timestamps",
    action=argparse.BooleanOptionalAction,
)
parser.add_argument(
    "model_name",
    help="Name of the model like 'nvidia/canary-1b-flash'",
)
parser.add_argument(
    "input_file",
    help="Path of the wav file, must be 16000Hz and 1 channel",
)
args = parser.parse_args()

asr_model = nemo_asr.models.ASRModel.from_pretrained(args.model_name)
transcriptions = asr_model.transcribe(
    args.input_file,
    timestamps=args.enable_timestamps,
)

for idx, transcript in enumerate(transcriptions):
    print(f"[{idx}] {transcript.text}")
    if args.enable_timestamps:
        for stamp in transcript.timestamp["word"]:
            word = stamp['word']
            output_line = f"{stamp['start']:0>5.2f}"
            output_line += f"-{stamp['end']:0>5.2f}"
            output_line += f": {word}"
            print(output_line)

Here is a list of models: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html
Note that the list is not fully up to date; the most recent model is nvidia/canary-qwen-2.5b, but it won't work with the current code.
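If you prefer to query the catalog from code, the NeMo model classes expose a list_available_models() class method. A small sketch follows; the exact fields returned can vary between NeMo versions, so treat it as an approximation:

import nemo.collections.asr as nemo_asr

# Print the pretrained ASR models this NeMo version knows about.
for info in nemo_asr.models.ASRModel.list_available_models():
    print(info.pretrained_model_name)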

Testing the code

You need to provide the model name and the input file, so you can call it like this:

python3 asr_example.py \
  nvidia/canary-1b-flash \
  2086-149220-0033.wav

This will output:

[0] Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. 

You can enable per-word timestamps (not all the models support timestamps; the call will raise an error if the model doesn't support them):

python3 asr_example.py \
  --enable-timestamps \
  nvidia/canary-1b-flash \
  2086-149220-0033.wav

This will output:

[0] Well I don't wish to see it any more observed Phoebe turning away her eyes it is certainly very like the old portrait
00.32-00.40: Well
00.56-00.72: I
00.72-01.04: don't
01.04-01.28: wish
01.28-01.36: to
01.44-01.52: see
01.60-01.68: it
01.76-01.84: any
01.92-02.00: more
02.24-02.64: observed
02.64-03.12: Phoebe
03.36-03.68: turning
03.76-03.84: away
04.08-04.16: her
04.24-04.48: eyes
04.96-05.04: it
05.12-05.20: is
05.36-05.76: certainly
05.84-05.92: very
06.08-06.16: like
06.24-06.32: the
06.40-06.48: old
06.64-07.12: portrait
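As noted above, a model without timestamp support will raise an error. If you would rather fall back to a plain transcription than fail, one possible guard (a standalone sketch, not part of the original script; the exact exception type depends on the model) is:

import nemo.collections.asr as nemo_asr

# Fall back to a plain transcription if the model rejects the timestamps option.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-flash")
try:
    transcriptions = asr_model.transcribe("2086-149220-0033.wav", timestamps=True)
except Exception as err:
    print(f"Timestamps not supported, retrying without them: {err}")
    transcriptions = asr_model.transcribe("2086-149220-0033.wav")
print(transcriptions[0].text)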

Multilanguage

One of the cool features of the Canary family is the support for multiple input languages (English, German, French, Spanish), and these models can even translate the output.
I will use one file from this dataset: https://www.kaggle.com/datasets/carlfm01/120h-spanish-speech

To set the language, instead of passing the wav file you need to create an input manifest JSON.
The format looks like this:

{ "audio_filepath": "FILE.wav", "duration": 10, "source_lang": "es", "target_lang": "en" } 

But the trick is that the input file is actually a text file where each line is a JSON entry, so input-spanish.json must be:

{"audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav","duration": 9, "source_lang": "es","target_lang": "es"} 
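If you prefer to generate the manifest from code, here is a minimal sketch that reads the duration from the wav header and writes one JSON object per line; the file name and language codes are just the ones used in this example:

import json
import wave

entry = {
    "audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav",
    "source_lang": "es",
    "target_lang": "es",  # use "en" to translate instead of transcribe
}

# Duration in seconds, read from the wav header.
with wave.open(entry["audio_filepath"], "rb") as wav:
    entry["duration"] = wav.getnframes() / wav.getframerate()

# One JSON object per line; append more lines for more files.
with open("input-spanish.json", "w") as manifest:
    manifest.write(json.dumps(entry) + "\n")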

And run it with:

python3 asr_example.py \
  --enable-timestamps \
  nvidia/canary-1b-flash \
  input-spanish.json

The output will be:

[0] con efeto, su lenguaje y singulares maneras me divertían extraordinariamente, porque nuestro hombre era un verdadero andaluz,
00.00-00.08: con
00.48-01.04: efeto,
01.12-01.20: su
01.36-01.84: lenguaje
01.92-02.00: y
02.08-02.64: singulares
02.72-03.04: maneras
03.20-03.28: me
03.36-03.92: divertían
04.08-05.92: extraordinariamente,
05.92-06.00: porque
06.40-06.48: nuestro
06.88-06.96: hombre
07.28-07.36: era
07.52-07.60: un
07.68-08.16: verdadero
08.24-08.96: andaluz,

And if you want to translate into English, input-spanish.json must be:

{"audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav","duration": 9, "source_lang": "es","target_lang": "en"} 

In this case the output of the same command will be:

[0] with effect his language and singular manners amused me extraordinarily because our man was a true Andalusian
00.00-00.08: with
00.48-00.56: effect
01.12-01.20: his
01.36-01.76: language
01.84-01.92: and
02.00-02.56: singular
02.64-03.04: manners
03.20-03.84: amused
04.08-04.16: me
04.24-05.28: extraordinarily
05.92-06.00: because
06.40-06.48: our
06.80-06.88: man
07.20-07.28: was
07.44-07.52: a
07.60-07.68: true
08.16-08.80: Andalusian
