Soundstream Training Goes From Great to Horrible #221

Open

adamfils opened this issue Aug 1, 2023 · 11 comments
adamfils commented Aug 1, 2023

I have been training SoundStream for the past 3 days on my A6000. At 25,000 steps I got amazing results, but after that the loss increased abruptly and subsequent generations are just bad.
As you can see below, from step 25031 onward the loss becomes unstable and blows up.

At 25,000 steps, here is the result:
https://voca.ro/1c10gpytA3id

At 25,500 steps, here is the result:
https://voca.ro/1eaoQiOmo1Se

```
25000: saving to results
25000: saving model to results
25001: soundstream total loss: 4.872, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.284 | discr (scale 0.5) loss: 1.894 | discr (scale 0.25) loss: 1.829
25002: soundstream total loss: 4.893, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.336 | discr (scale 0.5) loss: 1.899 | discr (scale 0.25) loss: 1.887
25003: soundstream total loss: 4.375, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.332 | discr (scale 0.5) loss: 1.825 | discr (scale 0.25) loss: 1.871
25004: soundstream total loss: 4.699, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.229 | discr (scale 0.5) loss: 1.879 | discr (scale 0.25) loss: 1.921
25005: soundstream total loss: 4.486, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.217 | discr (scale 0.5) loss: 1.859 | discr (scale 0.25) loss: 1.928
25006: soundstream total loss: 4.232, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.296 | discr (scale 0.5) loss: 1.842 | discr (scale 0.25) loss: 1.934
25007: soundstream total loss: 4.356, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.056 | discr (scale 0.5) loss: 1.939 | discr (scale 0.25) loss: 1.930
25008: soundstream total loss: 4.532, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.011 | discr (scale 0.5) loss: 1.965 | discr (scale 0.25) loss: 1.964
25009: soundstream total loss: 4.534, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.065 | discr (scale 0.5) loss: 2.011 | discr (scale 0.25) loss: 2.013
25010: soundstream total loss: 4.773, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.297 | discr (scale 0.5) loss: 2.198 | discr (scale 0.25) loss: 2.055
25011: soundstream total loss: 4.817, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.109 | discr (scale 0.5) loss: 2.110 | discr (scale 0.25) loss: 2.033
25012: soundstream total loss: 5.056, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.398 | discr (scale 0.5) loss: 2.042 | discr (scale 0.25) loss: 1.931
25013: soundstream total loss: 5.122, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.212 | discr (scale 0.5) loss: 1.955 | discr (scale 0.25) loss: 1.865
25014: soundstream total loss: 4.553, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.231 | discr (scale 0.5) loss: 1.909 | discr (scale 0.25) loss: 1.913
25015: soundstream total loss: 4.360, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.232 | discr (scale 0.5) loss: 1.847 | discr (scale 0.25) loss: 1.952
25016: soundstream total loss: 4.644, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.279 | discr (scale 0.5) loss: 1.803 | discr (scale 0.25) loss: 1.994
25017: soundstream total loss: 5.561, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.278 | discr (scale 0.5) loss: 1.807 | discr (scale 0.25) loss: 1.943
25018: soundstream total loss: 4.956, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.209 | discr (scale 0.5) loss: 1.713 | discr (scale 0.25) loss: 1.878
25019: soundstream total loss: 5.055, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.179 | discr (scale 0.5) loss: 1.732 | discr (scale 0.25) loss: 1.865
25020: soundstream total loss: 5.168, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.332 | discr (scale 0.5) loss: 1.762 | discr (scale 0.25) loss: 1.853
25021: soundstream total loss: 4.924, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.375 | discr (scale 0.5) loss: 1.813 | discr (scale 0.25) loss: 1.867
25022: soundstream total loss: 4.844, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.462 | discr (scale 0.5) loss: 1.786 | discr (scale 0.25) loss: 1.855
25023: soundstream total loss: 5.200, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.579 | discr (scale 0.5) loss: 1.798 | discr (scale 0.25) loss: 1.822
25024: soundstream total loss: 7.380, soundstream recon loss: 0.002 | discr (scale 1) loss: 2.756 | discr (scale 0.5) loss: 1.805 | discr (scale 0.25) loss: 1.813
25025: soundstream total loss: 4.865, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.723 | discr (scale 0.5) loss: 1.748 | discr (scale 0.25) loss: 1.758
25026: soundstream total loss: 4.889, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.725 | discr (scale 0.5) loss: 1.854 | discr (scale 0.25) loss: 1.846
25027: soundstream total loss: 5.056, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.747 | discr (scale 0.5) loss: 1.817 | discr (scale 0.25) loss: 1.854
25028: soundstream total loss: 5.091, soundstream recon loss: 0.001 | discr (scale 1) loss: 3.242 | discr (scale 0.5) loss: 1.839 | discr (scale 0.25) loss: 1.891
25029: soundstream total loss: 4.385, soundstream recon loss: 0.001 | discr (scale 1) loss: 8.894 | discr (scale 0.5) loss: 1.760 | discr (scale 0.25) loss: 1.883
25030: soundstream total loss: 2.860, soundstream recon loss: 0.001 | discr (scale 1) loss: 108.547 | discr (scale 0.5) loss: 1.708 | discr (scale 0.25) loss: 1.798
25031: soundstream total loss: -15.905, soundstream recon loss: 0.002 | discr (scale 1) loss: 1718.587 | discr (scale 0.5) loss: 1.557 | discr (scale 0.25) loss: 1.979
25032: soundstream total loss: -303.631, soundstream recon loss: 0.024 | discr (scale 1) loss: 10940.722 | discr (scale 0.5) loss: 1.072 | discr (scale 0.25) loss: 3.398
25033: soundstream total loss: -2264.270, soundstream recon loss: 0.295 | discr (scale 1) loss: 234567.777 | discr (scale 0.5) loss: 0.180 | discr (scale 0.25) loss: 5.426
25034: soundstream total loss: -53273.180, soundstream recon loss: 15.740 | discr (scale 1) loss: 1108289.203 | discr (scale 0.5) loss: 0.008 | discr (scale 0.25) loss: 0.970
25035: soundstream total loss: -244286.930, soundstream recon loss: 272.947 | discr (scale 1) loss: 3089418.844 | discr (scale 0.5) loss: 0.010 | discr (scale 0.25) loss: 0.029
25036: soundstream total loss: -648283.398, soundstream recon loss: 2447.980 | discr (scale 1) loss: 7947847.062 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.007
25037: soundstream total loss: -1452483.922, soundstream recon loss: 19413.394 | discr (scale 1) loss: 18546006.250 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.001
25038: soundstream total loss: -2364417.562, soundstream recon loss: 132011.410 | discr (scale 1) loss: 33489656.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.008
25039: soundstream total loss: 2783657.594, soundstream recon loss: 803092.328 | discr (scale 1) loss: 49849376.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.002
25040: soundstream total loss: 14825873.875, soundstream recon loss: 2074252.219 | discr (scale 1) loss: 38289075.500 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.002
25041: soundstream total loss: 11907697.250, soundstream recon loss: 1596693.234 | discr (scale 1) loss: 12477728.375 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.016
25042: soundstream total loss: 2389267.781, soundstream recon loss: 358384.961 | discr (scale 1) loss: 1455136.312 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.009
25043: soundstream total loss: 47939.899, soundstream recon loss: 14739.114 | discr (scale 1) loss: 37.778 | discr (scale 0.5) loss: 63.086 | discr (scale 0.25) loss: 52.932
25044: soundstream total loss: 847.260, soundstream recon loss: 2.112 | discr (scale 1) loss: 15.008 | discr (scale 0.5) loss: 60.966 | discr (scale 0.25) loss: 115.134
25045: soundstream total loss: 936.149, soundstream recon loss: 0.910 | discr (scale 1) loss: 16.900 | discr (scale 0.5) loss: 5.909 | discr (scale 0.25) loss: 0.893
25046: soundstream total loss: 401.222, soundstream recon loss: 0.256 | discr (scale 1) loss: 18.919 | discr (scale 0.5) loss: 6.674 | discr (scale 0.25) loss: 0.226
25047: soundstream total loss: 172.702, soundstream recon loss: 0.054 | discr (scale 1) loss: 18.073 | discr (scale 0.5) loss: 4.793 | discr (scale 0.25) loss: 0.570
```


adamfils commented Aug 1, 2023

My training code:

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
    rq_groups = 2,           # 2 groups of quantizers
    attn_window_size = 128,  # local attention receptive field at bottleneck
    attn_depth = 2           # 2 local attention transformer blocks - the soundstream folks were not experts with attention, so i took the liberty to add some. encodec went with lstms, but attention should be better
)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/home/user/Downloads/LibriSpeech',
    batch_size = 8,
    grad_accum_every = 8,    # effective batch size of 8 * 8 = 64
    # data_max_length = 320 * 32,
    # lr = 2e-6,
    data_max_length_seconds = 3,
    save_model_every = 1000,
    save_results_every = 500,
    num_train_steps = 10000001
).cuda()

trainer.train()
```

@lucidrains (Owner)

@adamfils try loading from the checkpoint just before the collapse, and lowering the learning rate
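
Here is a rough sketch of what that could look like, assuming `SoundStreamTrainer` exposes a `load(path)` method (check your installed version of the repo); the checkpoint filename and the lowered `lr` value below are illustrative, not prescriptive:

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

# rebuild the model with exactly the same hyperparameters used for the original run
soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
    rq_groups = 2,
    attn_window_size = 128,
    attn_depth = 2
)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/home/user/Downloads/LibriSpeech',
    batch_size = 8,
    grad_accum_every = 8,
    data_max_length_seconds = 3,
    lr = 1e-5,                  # lower learning rate for the resumed run (value is a guess)
    save_model_every = 1000,
    save_results_every = 500,
    num_train_steps = 10000001
).cuda()

# resume from the last checkpoint saved before the collapse
# (hypothetical path -- use whatever your results folder actually contains)
trainer.load('./results/soundstream.25000.pt')

trainer.train()
```

Depending on how the checkpoint was saved, the restored optimizer state may override the constructor `lr`, so verify the effective learning rate after loading.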


adamfils commented Aug 2, 2023

Thanks. Also, what is the difference between sample_31500.flac and sample_31500.ema.flac (the EMA and non-EMA audio samples)?
Which should I use to measure the performance of the soundstream model? @lucidrains

@lucidrains (Owner)

@adamfils you want to use the ema version, which stands for exponential moving average

this is a common practice in the generative modeling field, where you keep an exponentially smoothed copy of your generator's parameters, which often leads to better end models
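
For intuition, a minimal sketch of what the EMA copy is (the trainer maintains this for you; the `decay` value and the tiny stand-in model below are purely illustrative):

```python
import copy
import torch

def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.995):
    # ema_param <- decay * ema_param + (1 - decay) * param
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.lerp_(p, 1.0 - decay)

# keep a smoothed copy of the generator alongside the one being optimized
model = torch.nn.Linear(4, 4)      # stand-in for the soundstream generator
ema_model = copy.deepcopy(model)

for step in range(100):
    # ... optimizer.step() on `model` would go here ...
    update_ema(ema_model, model)

# the smoothed ema_model is what the *.ema.flac samples are rendered with
```

Because the EMA weights lag behind the live weights, they can sound worse early in training or right after an instability, which may be part of what you are hearing.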


adamfils commented Aug 3, 2023

Okay, because the EMA samples sound bad while the non-EMA audio samples sound great. 😬
I'm at 38,000 steps and have been training for about 6 days now.
What tweaks would you suggest? @lucidrains

@lucidrains (Owner)

@adamfils yikes, that doesn't sound good! let me check on this, maybe this sunday morning


Fritskee commented Sep 6, 2023

Any updates on how this got fixed? I want to start a training run as well in the coming week.

@lucidrains (Owner)

multiple engineers and researchers have already trained it successfully

you should just go for it, if you have enough data

@lucidrains (Owner)

@Fritskee my next stretch goal is to turn the soundstream training into a CLI, like what i did for lightweight gan


Fritskee commented Sep 6, 2023

> multiple engineers and researchers have already trained it successfully
>
> you should just go for it, if you have enough data

I just wanted to go with LibriSpeech, so I figured that if the weights were already out there, I might as well ask. But you make a fair point!


Fritskee commented Sep 6, 2023

> @Fritskee my next stretch goal is to turn the soundstream training into a CLI, like what i did for lightweight gan

That'd be dope!

I also want to take the time to thank you for all your efforts to democratize the latest research in ML!
