Multivariate (multi-channel) dataset #16

ghost · 2022-09-20T08:54:23Z

Hello, thank you for the great repo for audio diffusion!
I just wanted to ask if you had any experience with multi-channel audio diffusion?
This might be out of scope for this repo, but I am currently trying to train a diffusion model on 12-lead electrocardiograms sampled at 500 Hz for 10 seconds (shape [12, 5000]). However, I am having difficulty training the model, so I was wondering whether you have any insight, from your intuition or experience, into which parameters I should try.
Thank you!

flavioschneider · 2022-09-20T10:42:59Z

Maintainer

You cannot pass a length that is not a power of 2, so I'd suggest padding the cardiograms with zeros to a length of 8192. Since the sequence is very short, you can then use a smaller model (~70M params), something like this:

```python
import torch
from audio_diffusion_pytorch import AudioDiffusionModel

model = AudioDiffusionModel(
    in_channels=12,
    patch_size=4,
    kernel_sizes_init=[1, 3, 7],
    multipliers=[1, 2, 4, 4, 4],
    factors=[4, 2, 2, 2],
    num_blocks=[2, 2, 2, 2],
    attentions=[False, True, True, True],
)

# Train model with cardiogram sources
x = torch.randn(1, 12, 8192)
loss = model(x)
loss.backward()  # Do this many times

# Sample 2 cardiograms given start noise
noise = torch.randn(2, 12, 8192)
sampled = model.sample(
    noise=noise,
    num_steps=10  # Suggested range: 2-50
)  # [2, 12, 8192]
```
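For readers following along, here is a minimal sketch of the zero-padding step suggested above, using only PyTorch. The helper names `pad_to_length` and `crop_to_length` are made up for this example and are not part of audio_diffusion_pytorch. Right-padding (rather than centering) is an assumption; the thread does not say where the zeros should go.

```python
import torch
import torch.nn.functional as F

TARGET_LENGTH = 8192  # power of 2 suggested in the reply above


def pad_to_length(ecg: torch.Tensor, target_length: int = TARGET_LENGTH) -> torch.Tensor:
    """Right-pad the time axis (last dimension) with zeros up to target_length."""
    missing = target_length - ecg.shape[-1]
    if missing < 0:
        raise ValueError("recording is longer than target_length")
    # The pair (0, missing) means: no padding on the left, `missing` zeros on the right.
    return F.pad(ecg, (0, missing))


def crop_to_length(ecg: torch.Tensor, original_length: int) -> torch.Tensor:
    """Drop the padded tail so samples match the original recording length."""
    return ecg[..., :original_length]


batch = torch.randn(4, 12, 5000)          # stand-in for real 12-lead ECGs
padded = pad_to_length(batch)             # shape [4, 12, 8192]
restored = crop_to_length(padded, 5000)   # shape [4, 12, 5000]
```

The same crop can be applied to generated samples to bring them back to 5000 time steps.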

9 replies

flavioschneider Sep 22, 2022
Maintainer

Is that plot to scale? If so, something is way off. The network values are clipped to the range [-1, 1], so that plot should not be a possible network output. You might want to check your data; other than that, it's hard to tell without more information.
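One way to run the data check suggested above is to measure how much of the training set falls outside [-1, 1] before training. This is only a sketch with NumPy; `range_report` is a hypothetical helper and the array holds toy values.

```python
import numpy as np


def range_report(ecgs: np.ndarray, low: float = -1.0, high: float = 1.0) -> dict:
    """Summarize the value range and the fraction of values outside [low, high]."""
    outside = (ecgs < low) | (ecgs > high)
    return {
        "min": float(ecgs.min()),
        "max": float(ecgs.max()),
        "fraction_outside": float(outside.mean()),
    }


ecgs = np.array([[-0.5, 0.25, 1.5, 0.0]])  # toy values; one of four is out of range
report = range_report(ecgs)
```

A non-zero `fraction_outside` means some targets lie beyond what a clipped output can reproduce.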

ghost Sep 23, 2022

> Is that plot to scale? If so, something is way off. The network values are clipped to the range [-1, 1], so that plot should not be a possible network output. You might want to check your data; other than that, it's hard to tell without more information.

Sorry for the confusion. The data was first normalized to the -1 to 1 range with a min-max scaler, and after training the samples were displayed after applying inverse_transform to map them back to the raw waveform scale. As you mentioned, I think something definitely went wrong during this process.

Actually, I am getting much better generated samples when training on the raw waveforms, where most values are in the -1 to 1 range anyway, excluding some outliers. Would it make sense to sample without clipping to the [-1, 1] range?

I'm also wondering whether conditional (text-conditioned) generation is ready for use.

Thanks again for your help.

flavioschneider Sep 23, 2022
Maintainer

What I would do in this case is find the theoretical minimum p and maximum q values possible for a human ECG, remove all outliers that exceed them (if there are any), and normalize every ECG with a fixed scale, e.g. s=max(abs(p), q), ecg=ecg/s. Then, after sampling, undo the scaling with ecg=ecg*s (assuming the signals are zero-centered).
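The fixed-scale normalization described above could be sketched as follows with NumPy. The bounds `p` and `q` below are placeholder numbers chosen only for the example, not physiological limits, and "remove all outliers" is interpreted here as dropping whole recordings that leave [p, q]; clipping individual values would be a different choice.

```python
import numpy as np


def fixed_scale(p: float, q: float) -> float:
    """s = max(abs(p), q) for an assumed minimum p and maximum q."""
    return max(abs(p), q)


def drop_outliers(ecgs: np.ndarray, p: float, q: float) -> np.ndarray:
    """Keep only recordings (first axis) whose values all lie within [p, q]."""
    keep = ((ecgs >= p) & (ecgs <= q)).all(axis=(1, 2))
    return ecgs[keep]


# Hypothetical bounds, for illustration only.
p, q = -4.0, 5.0
s = fixed_scale(p, q)

ecgs = np.zeros((3, 12, 5000))
ecgs[0, 0, 0] = 2.5    # within bounds
ecgs[1, 0, 0] = 9.0    # exceeds q, so this recording is dropped
ecgs[2, 0, 0] = -4.0   # exactly on the lower bound, kept

clean = drop_outliers(ecgs, p, q)
normalized = clean / s       # every value now lies within [-1, 1]
restored = normalized * s    # undo the scaling, e.g. on generated samples
```

Because `s` is a single constant for the whole dataset, relative amplitudes between leads and between recordings are preserved.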

Conditional generation is ready, though I haven't tested it much. Check out AudioDiffusionConditional in the readme. You have to provide an embedding, ideally not raw text but a learned embedding such as a transformer output.

ghost Sep 26, 2022

> What I would do in this case is find the theoretical minimum p and maximum q values possible for a human ECG, remove all outliers that exceed them (if there are any), and normalize every ECG with a fixed scale, e.g. s=max(abs(p), q), ecg=ecg/s. Then, after sampling, undo the scaling with ecg=ecg*s (assuming the signals are zero-centered).

I followed your advice, and now training is more stable and the generated ECGs are much more realistic, although in some samples the beats are more irregular. Overall, the generated samples look quite similar to my dataset.

> Conditional generation is ready, though I haven't tested it much. Check out AudioDiffusionConditional in the readme. You have to provide an embedding, ideally not raw text but a learned embedding such as a transformer output.

The sample I attached above is text-conditioned, and I used the text "sinus rhythm, normal ecg" to generate it. However, after training, the generated samples don't seem to follow the text conditions very well. For example, generating a sample with "bradycardia" (slow bpm) or "tachycardia" (fast bpm) produces samples similar to the one above.

You mentioned for AudioDiffusionConditional that I should use an embedding rather than text directly. I am currently using text tokenized with BPE and setting embedding_features to 1 for text conditioning. Will this not work properly? I did it this way because I used tokenized text for other baseline models and wanted to make a fair comparison.

flavioschneider Sep 26, 2022
Maintainer

Very exciting progress! The way you're doing text conditioning is very unlikely to work. Embeddings should be vectors that can be attended to internally with cross attention. Ideally, you would train or use a pretrained transformer to interpret the grammatical content of the text, but in your case it seems that it's enough to correlate a generated ecg with the input text (e.g. if there's the word "tachycardia", the model wouldn't have to understand the grammatical structure of the sentence, but just correlate that word with a given ecg).

To do so, you can add an embedding block before the AudioDiffusionConditional model, something like:

```python
import torch
import torch.nn as nn
from audio_diffusion_pytorch import AudioDiffusionConditional

# How many words (numbers) are in your BPE vocabulary
num_possible_words = 100
# The size of each vector; I'd suggest 768, it's quite standard
embedding_features = 768
# Maximum number of word vectors per waveform
max_length = 6

# Build trainable embedding
embedder = nn.Embedding(num_possible_words, embedding_features)

model = AudioDiffusionConditional(
    in_channels=12,
    embedding_max_length=max_length,
    embedding_features=embedding_features,
    embedding_mask_proba=0.1,  # Conditional dropout of batch elements
    patch_factor=4,
    multipliers=[1, 2, 4, 4, 4],
    factors=[4, 2, 2, 2],
    num_blocks=[2, 2, 2, 2],
    attentions=[0, 1, 1, 1, 1],
)

# Then during training
# E.g. assume that the sentence "sinus rhythm, normal ecg" is tokenized to
# [1, 5, 8, 3, 0, 0] (in this example max length is 6, change it to whatever you want)
tokens = torch.tensor([[1, 5, 8, 3, 0, 0]])
# Convert to learned vectors of shape (1, 6, 768) = (batch_size, max_length, embedding_features)
embedding = embedder(tokens)
ecgs = torch.randn(1, 12, 8192)
model(ecgs, embedding=embedding)
```
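The example above assumes every caption has already been brought to exactly `max_length` token ids. A minimal sketch of that step follows, using only PyTorch. `pad_tokens` is a hypothetical helper, and reserving id 0 for padding is an assumption taken from the trailing zeros in the example token list; check what your BPE tokenizer actually uses.

```python
from typing import List

import torch
import torch.nn as nn

PAD_ID = 0  # assumption: token id 0 is reserved for padding


def pad_tokens(token_ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Truncate or right-pad a token id list to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


max_length = 6
embedding_features = 768
num_possible_words = 100

# Hypothetical BPE outputs for two captions of different lengths.
captions = [[1, 5, 8, 3], [7]]
tokens = torch.tensor([pad_tokens(c, max_length) for c in captions])  # shape [2, 6]

embedder = nn.Embedding(num_possible_words, embedding_features)
embedding = embedder(tokens)  # shape [2, 6, 768]
```

Since the embedder is trainable, its parameters need to be passed to the optimizer together with the diffusion model's parameters, otherwise the word vectors stay at their random initialization.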
