STFTSpectrogram classkeras.layers.STFTSpectrogram( mode="log", frame_length=256, frame_step=None, fft_length=None, window="hann", periodic=False, scaling="density", padding="valid", expand_dims=False, data_format=None, **kwargs ) Layer to compute the Short-Time Fourier Transform (STFT) on a 1D signal.
A layer that computes Spectrograms of the input signal to produce a spectrogram. This layers utilizes Short-Time Fourier Transform (STFT) by The layer computes Spectrograms based on STFT by utilizing convolution kernels, which allows parallelization on GPUs and trainable kernels for fine-tuning support. This layer allows different modes of output (e.g., log-scaled magnitude, phase, power spectral density, etc.) and provides flexibility in windowing, padding, and scaling options for the STFT calculation.
Examples
Apply it as a non-trainable preprocessing layer on 3 audio tracks of 1 channel, 10 seconds and sampled at 16 kHz.
>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, # 50% overlap ... fft_length=512, ... window="hann", ... padding="valid", ... trainable=False, # non-trainable, preprocessing only ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 1))).shape (3, 1249, 257) Apply it as a trainable processing layer on 3 stereo audio tracks of 2 channels, 10 seconds and sampled at 16 kHz. This is initialized as the non-trainable layer, but then can be trained jointly within a model.
>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, # 50% overlap ... fft_length=512, ... window="hamming", # hamming windowing function ... padding="same", # padding to preserve the time dimension ... trainable=True, # trainable, this is the default in keras ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 2))).shape (3, 1250, 514) Similar to the last example, but add an extra dimension so the output is an image to be used with image models. We apply this here on a signal of 3 input channels to output an image tensor, hence is directly applicable with an image model.
>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, ... fft_length=512, ... padding="same", ... expand_dims=True, # this adds the extra dimension ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 3))).shape (3, 1250, 257, 3) Arguments
"log", "magnitude", "psd", "real", "imag", "angle", "stft". Defaults to "log".frame_length // 2.frame_length. Recommended to be a power of two. Defaults to the smallest power of two that is greater than or equal to frame_length."hann" (default), "hamming", or a custom window provided as an array_like.False."density", "spectrum", or None. Default is "density"."valid" or "same". Defaults to "valid".False."channels_last" or "channels_first". The ordering of the dimensions in the inputs. "channels_last" corresponds to inputs with shape (batch, height, width, channels) while "channels_first" corresponds to inputs with shape (batch, channels, height, weight). Defaults to "channels_last".Raises
"mode", "scaling", "padding", or other input arguments."float16", "float32", or "float64".Input shape
A 3D tensor of shape (batch_size, time_length, input_channels), if data_format=="channels_last", and of shape (batch_size, input_channels, time_length) if data_format=="channels_first", where time_length is the length of the input signal, and input_channels is the number of input channels. The same kernels are applied to each channel independently.
Output shape
If data_format=="channels_first" and not expand_dims, a 3D tensor: (batch_size, input_channels * freq_channels, new_time_length) If data_format=="channels_last" and not expand_dims, a 3D tensor: (batch_size, new_time_length, input_channels * freq_channels) If data_format=="channels_first" and expand_dims, a 4D tensor: (batch_size, input_channels, new_time_length, freq_channels) If data_format=="channels_last" and expand_dims, a 4D tensor: (batch_size, new_time_length, freq_channels, input_channels)
where new_time_length depends on the padding, and freq_channels is the number of FFT bins (fft_length // 2 + 1).