torchaudio.transforms¶
torchaudio.transforms
module contains common audio processings and feature extractions. The following diagram shows the relationship between some of the available transforms.

Transforms are implemented using torch.nn.Module
. Common ways to build a processing pipeline are to define custom Module class or chain Modules together using torch.nn.Sequential
, then move it to a target device and data type.
# Define custom feature extraction pipeline. # # 1. Resample audio # 2. Convert to power spectrogram # 3. Apply augmentations # 4. Convert to mel-scale # class MyPipeline(torch.nn.Module): def __init__( self, input_freq=16000, resample_freq=8000, n_fft=1024, n_mel=256, stretch_factor=0.8, ): super().__init__() self.resample = Resample(orig_freq=input_freq, new_freq=resample_freq) self.spec = Spectrogram(n_fft=n_fft, power=2) self.spec_aug = torch.nn.Sequential( TimeStretch(stretch_factor, fixed_rate=True), FrequencyMasking(freq_mask_param=80), TimeMasking(time_mask_param=80), ) self.mel_scale = MelScale( n_mels=n_mel, sample_rate=resample_freq, n_stft=n_fft // 2 + 1) def forward(self, waveform: torch.Tensor) -> torch.Tensor: # Resample the input resampled = self.resample(waveform) # Convert to power spectrogram spec = self.spec(resampled) # Apply SpecAugment spec = self.spec_aug(spec) # Convert to mel-scale mel = self.mel_scale(spec) return mel
# Instantiate a pipeline pipeline = MyPipeline() # Move the computation graph to CUDA pipeline.to(device=torch.device("cuda"), dtype=torch.float32) # Perform the transform features = pipeline(waveform)
Please check out tutorials that cover in-depth usage of trasforms.
Utility¶
Turn a tensor from the power/amplitude scale to the decibel scale. | |
Encode signal based on mu-law companding. | |
Decode mu-law encoded signal. | |
Resample a signal from one frequency to another. | |
Add a fade in and/or fade out to an waveform. | |
Adjust volume of waveform. | |
Measure audio loudness according to the ITU-R BS.1770-4 recommendation. | |
Scales and adds noise to waveform per signal-to-noise ratio. | |
Convolves inputs along their last dimension using the direct method. | |
Convolves inputs along their last dimension using FFT. | |
Adjusts waveform speed. | |
Applies the speed perturbation augmentation introduced in Audio augmentation for speech recognition [Ko et al., 2015]. | |
De-emphasizes a waveform along its last dimension. | |
Pre-emphasizes a waveform along its last dimension. |
Feature Extractions¶
Create a spectrogram from a audio signal. | |
Create an inverse spectrogram to recover an audio signal from a spectrogram. | |
Turn a normal STFT into a mel frequency STFT with triangular filter banks. | |
Estimate a STFT in normal frequency domain from mel frequency domain. | |
Create MelSpectrogram for a raw audio signal. | |
Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation. | |
Create the Mel-frequency cepstrum coefficients from an audio signal. | |
Create the linear-frequency cepstrum coefficients from an audio signal. | |
Compute delta coefficients of a tensor, usually a spectrogram. | |
Shift the pitch of a waveform by | |
Apply sliding-window cepstral mean (and optionally variance) normalization per utterance. | |
Compute the spectral centroid for each channel along the time axis. | |
Voice Activity Detector. |
Augmentations¶
The following transforms implement popular augmentation techniques known as SpecAugment [Park et al., 2019].
Apply masking to a spectrogram in the frequency domain. | |
Apply masking to a spectrogram in the time domain. | |
Stretch stft in time without modifying pitch for a given rate. |
Loss¶
DEPRECATED |
Multi-channel¶
Compute cross-channel power spectral density (PSD) matrix. | |
Minimum Variance Distortionless Response (MVDR) module that performs MVDR beamforming with Time-Frequency masks. | |
Minimum Variance Distortionless Response (MVDR [Capon, 1969]) module based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise. | |
Minimum Variance Distortionless Response (MVDR [Capon, 1969]) module based on the method proposed by Souden et, al. [Souden et al., 2009]. |