STFTSpectrogram layer

[source]

STFTSpectrogram class

keras.layers.STFTSpectrogram( mode="log", frame_length=256, frame_step=None, fft_length=None, window="hann", periodic=False, scaling="density", padding="valid", expand_dims=False, data_format=None, **kwargs ) 

Layer to compute the Short-Time Fourier Transform (STFT) on a 1D signal.

A layer that computes Spectrograms of the input signal to produce a spectrogram. This layers utilizes Short-Time Fourier Transform (STFT) by The layer computes Spectrograms based on STFT by utilizing convolution kernels, which allows parallelization on GPUs and trainable kernels for fine-tuning support. This layer allows different modes of output (e.g., log-scaled magnitude, phase, power spectral density, etc.) and provides flexibility in windowing, padding, and scaling options for the STFT calculation.

Examples

Apply it as a non-trainable preprocessing layer on 3 audio tracks of 1 channel, 10 seconds and sampled at 16 kHz.

>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, # 50% overlap ... fft_length=512, ... window="hann", ... padding="valid", ... trainable=False, # non-trainable, preprocessing only ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 1))).shape (3, 1249, 257) 

Apply it as a trainable processing layer on 3 stereo audio tracks of 2 channels, 10 seconds and sampled at 16 kHz. This is initialized as the non-trainable layer, but then can be trained jointly within a model.

>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, # 50% overlap ... fft_length=512, ... window="hamming", # hamming windowing function ... padding="same", # padding to preserve the time dimension ... trainable=True, # trainable, this is the default in keras ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 2))).shape (3, 1250, 514) 

Similar to the last example, but add an extra dimension so the output is an image to be used with image models. We apply this here on a signal of 3 input channels to output an image tensor, hence is directly applicable with an image model.

>>> layer = keras.layers.STFTSpectrogram( ... mode='log', ... frame_length=256, ... frame_step=128, ... fft_length=512, ... padding="same", ... expand_dims=True, # this adds the extra dimension ... ) >>> layer(keras.random.uniform(shape=(3, 160000, 3))).shape (3, 1250, 257, 3) 

Arguments

  • mode: String, the output type of the spectrogram. Can be one of "log", "magnitude", "psd", "real", "imag", "angle", "stft". Defaults to "log".
  • frame_length: Integer, The length of each frame (window) for STFT in samples. Defaults to 256.
  • frame_step: Integer, the step size (hop length) between consecutive frames. If not provided, defaults to half the frame_length. Defaults to frame_length // 2.
  • fft_length: Integer, the size of frequency bins used in the Fast-Fourier Transform (FFT) to apply to each frame. Should be greater than or equal to frame_length. Recommended to be a power of two. Defaults to the smallest power of two that is greater than or equal to frame_length.
  • window: (String or array_like), the windowing function to apply to each frame. Can be "hann" (default), "hamming", or a custom window provided as an array_like.
  • periodic: Boolean, if True, the window function will be treated as periodic. Defaults to False.
  • scaling: String, type of scaling applied to the window. Can be "density", "spectrum", or None. Default is "density".
  • padding: String, padding strategy. Can be "valid" or "same". Defaults to "valid".
  • expand_dims: Boolean, if True, will expand the output into spectrograms into two dimensions to be compatible with image models. Defaults to False.
  • data_format: String, either "channels_last" or "channels_first". The ordering of the dimensions in the inputs. "channels_last" corresponds to inputs with shape (batch, height, width, channels) while "channels_first" corresponds to inputs with shape (batch, channels, height, weight). Defaults to "channels_last".

Raises

  • ValueError: If an invalid value is provided for "mode", "scaling", "padding", or other input arguments.
  • TypeError: If the input data type is not one of "float16", "float32", or "float64".

Input shape

A 3D tensor of shape (batch_size, time_length, input_channels), if data_format=="channels_last", and of shape (batch_size, input_channels, time_length) if data_format=="channels_first", where time_length is the length of the input signal, and input_channels is the number of input channels. The same kernels are applied to each channel independently.

Output shape

If data_format=="channels_first" and not expand_dims, a 3D tensor: (batch_size, input_channels * freq_channels, new_time_length) If data_format=="channels_last" and not expand_dims, a 3D tensor: (batch_size, new_time_length, input_channels * freq_channels) If data_format=="channels_first" and expand_dims, a 4D tensor: (batch_size, input_channels, new_time_length, freq_channels) If data_format=="channels_last" and expand_dims, a 4D tensor: (batch_size, new_time_length, freq_channels, input_channels)

where new_time_length depends on the padding, and freq_channels is the number of FFT bins (fft_length // 2 + 1).