Information sharing #1

Open
Rayhane-mamah opened this issue Mar 7, 2018 · 44 comments

Comments

@Rayhane-mamah

Rayhane-mamah commented Mar 7, 2018

Hello @A-Jacobson.

Great work with your implementation and, more importantly, with your clear representation of the model in your README (100% better than the one presented in the paper x) ).

So I am actually also working on a Tacotron 2 implementation (in TensorFlow) and there are a few things I wanted to check with you; maybe we could help each other out. (implementation here)

  • Does your attention mechanism work? Mine (based on the original tacotron) doesn't seem to capture the alignment correctly.
  • Does your loss decrease insanely fast?

Again, impressive work.

@A-Jacobson
Owner

Hey @Rayhane-mamah,

Thanks for the compliment. Though this is still very much a work in progress. I'll take a look at your work soon!

The attention mechanism doesn't seem to be working correctly as it is. The alignments are all over the place, though the generated spectrograms look quite good after only 1 epoch. I believe this could be due to the use of teacher forcing. The paper just mentions that they use it; they don't mention any ratio, so I have had it set to "always on". I know in NMT it's common to have a teacher forcing ratio of 0.5.
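A minimal sketch of what a teacher forcing ratio means in practice, written in plain Python so it runs without any framework. The function name and the string stand-ins for frames are illustrative assumptions, not code from either repository.

```python
import random


def choose_decoder_input(target_frame, predicted_frame, teacher_forcing_ratio, rng=random):
    """Pick the next decoder input.

    With probability `teacher_forcing_ratio` the ground-truth frame is fed back,
    otherwise the model's own previous prediction is used.
    """
    if rng.random() < teacher_forcing_ratio:
        return target_frame
    return predicted_frame


# ratio = 1.0 reproduces "always on"; ratio = 0.0 is fully free-running.
rng = random.Random(0)
picks = [choose_decoder_input("target", "predicted", 0.5, rng) for _ in range(1000)]
share_teacher_forced = picks.count("target") / len(picks)
print(f"teacher-forced share at ratio 0.5: {share_teacher_forced:.2f}")
```

In a real decoder loop this choice would be made once per decoding step (or once per batch), which is a design decision the paper does not specify.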

I'm planning to tackle the attention problems today, some of the potential issues could be:

  • I'm using the decoder input after the prenet as my attention query vector and it's possible that I'm meant to use the decoder hidden state instead.
  • I'm using a form of dot product attention instead of Bahdanau's. I did this because it tends to work better for NMT and there are fewer parameters.
  • I may have incorporated the location features from the previous step incorrectly. I followed the original paper to the best of my knowledge, but the Tacotron 2 paper mentions a cumulative sum which I am not clear about, unless it's implicit.

Best of luck with your work!

@Rayhane-mamah
Author

Thanks for your quick reply @A-Jacobson.

About the teacher forcing, this is actually a nice perspective I haven't thought about since I only considered using the "always on" teacher forcing.

As for the attention mechanism, as far as I could understand from the paper, they extended Bahdanau's "sum style" attention to use cumulative location features as an extra input (extracted with convolutions). As far as I know, this requires the key, the query and the previous alignments, and if I'm not mistaken, this is the "hybrid" (content + location based) attention, not the purely location-based one (just my point of view).
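To make the "key, query and previous alignments" idea concrete, here is a rough sketch of hybrid attention in plain Python. It is a simplification under stated assumptions: every weight matrix is reduced to a per-dimension scalar and the location convolution has a single filter, so it shows the data flow rather than either implementation in this thread. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding (output length == input length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def hybrid_attention(query, keys, prev_alignment, w_query, w_key, w_loc, v, loc_kernel):
    """Additive (Bahdanau-style) energies with an extra location term.

    energy_j = sum_d v[d] * tanh(w_query[d]*query[d] + w_key[d]*keys[j][d] + w_loc[d]*f_j)
    where f = conv1d(prev_alignment). Weights are per-dimension scalars (assumption).
    """
    loc_feats = conv1d_same(prev_alignment, loc_kernel)
    energies = []
    for key, f in zip(keys, loc_feats):
        hidden = [math.tanh(wq * q + wk * k + wl * f)
                  for wq, q, wk, k, wl in zip(w_query, query, w_key, key, w_loc)]
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alignment = softmax(energies)
    # Context vector: alignment-weighted sum of the encoder outputs (keys).
    context = [sum(a * key[d] for a, key in zip(alignment, keys))
               for d in range(len(keys[0]))]
    return alignment, context


keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 encoder steps, 2 dims (toy sizes)
query = [0.2, -0.1]                          # e.g. a decoder hidden state
prev_alignment = [1.0, 0.0, 0.0]             # attention weights from the previous step
ones = [1.0, 1.0]
alignment, context = hybrid_attention(query, keys, prev_alignment,
                                      ones, ones, ones, ones, [0.25, 0.5, 0.25])
print(alignment, context)
```

Dropping the `w_loc` term gives plain content-based additive attention; feeding a running sum instead of `prev_alignment` gives the cumulative variant discussed later in the thread.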

Best of luck to you too, if I ever find the right way to use the attention, you'll be the first to know!

@A-Jacobson
Owner

Hmm, that's what I did.

previous context --> conv1d layers --> add.

To me "cumulative" would be some weighted sum over previous context vectors, though I suppose that information is implicitly carried forward, as each context vector is computed with information from all of the previous vectors.

Are you using zoneout and LSTMs in your project and still running into this problem?

@Rayhane-mamah
Author

Yeah, I'm supposing that information is implicitly being carried forward since each context vector is computed using the previous one.

I am indeed using zoneout LSTMs (unless they're not working correctly) and am in fact still running into many problems. Even when I change the attention to a more basic one like Luong's or Bahdanau's, mel outputs tend to be blurry, and at some stage in training the "before" loss explodes and attention is completely lost (tends to 0). I'm still not really sure about the reason. However, I think I should point out that I am using a separate LSTM for attention (with 128 units) and concatenating its output with the context vector before sending them to the decoder LSTMs (based on the original Tacotron approach; until now this gave the most "normal" results compared to LuongAttention, before exploding of course).

@A-Jacobson
Owner

A-Jacobson commented Mar 8, 2018

My implementation appears to be working now. I ended up using the last decoder rnn hidden state as the query vector rather than the output of the prenet and fixed a malicious typo related to updating my decoder hidden state. These changes alone are giving reasonable results (The model learns to ignore the padding tokens at 50 steps).

It's odd that your attention is zero, or close to zero, since each frame is being passed through a softmax layer. I would make sure you don't have zeros as input to the attention layer. The most likely culprit would be your hidden state (query vector). I'm also not sure having a separate LSTM layer to generate the query hidden state is necessary, since your hidden state is already the output of your LSTM at the previous step in the loop. With regard to exploding gradients, try gradient clipping.
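For reference, gradient clipping by global norm can be sketched as follows. This is a framework-free illustration with a hypothetical function name; in practice one would use the clipping utility of the framework in use.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one shared factor so their joint L2 norm is <= max_norm.

    `grads` is a list of flat gradient lists, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm or global_norm == 0.0:
        return [list(grad) for grad in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], max_norm=1.0)
print(norm_before, clipped)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is capped.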

As for when we can expect to see alignments, it's supposedly around 15k steps see: keithito/tacotron#90.

Once things appeared to be working I also rewrote my attention mechanism to exactly replicate the one from the Bengio paper and switched from GRUs to LSTMs to more closely represent Tacotron 2. The only thing I haven't yet added is zoneout.

Just out of curiosity, how are you padding your text/spectrograms and how much gpu memory does your implementation take per training batch?

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 10, 2018

Hello again @A-Jacobson, sorry for the late reply.

If your attention works, I would definitely switch to yours too, it seems cleaner (and let's face it, fewer layers = faster computation = happy me 🎉 ).

With that said, I managed to find the source of my problem. It appears that it was 100% related to my weight initialization. I went through all my layers and initialized the weights with Xavier initialization (to keep the same stddev across layers, thereby preventing vanishing or exploding gradients). Now, after visualizing the gradient norms, I can see that all signs of explosion are gone.
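A small sketch of Xavier (Glorot) uniform initialization, to show why it keeps the weight variance tied to the layer sizes. The layer shape (256 x 512) is an arbitrary assumption for illustration only.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


fan_in, fan_out = 256, 512          # hypothetical layer sizes
rng = random.Random(0)
weights = xavier_uniform(fan_in, fan_out, rng)

flat = [w for row in weights for w in row]
empirical_variance = sum(w * w for w in flat) / len(flat)
target_variance = 2.0 / (fan_in + fan_out)   # variance of U(-limit, limit) is limit**2 / 3
print(empirical_variance, target_variance)
```

The empirical variance should land close to `2 / (fan_in + fan_out)`, which is the property that keeps activation and gradient scales roughly constant from layer to layer.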

On the other hand, attention is working properly and I sometimes see it starting to form the right alignment. I am still using a separate LSTM however simply because in the first tacotron, they used a 256-GRU for attention which led me to interpret "The encoder output is consumed by an attention network which summarizes the full encoded sequence as a fixed-length context vector for each decoder output step." as if they used a separate LSTM in tacotron-2 as well. I might be wrong, that's why if your approach works fine, I won't really care about whether they used a separate LSTM or not as it would make a smaller network that yields the same results.

To answer your question, I am padding inputs (texts) with <pad_tokens> ("0" tokens) and padding the outputs (spectrograms) with the same <stop_token> ("0.0" tokens) from which the model is expected to learn to predict the true <stop_token> (using the linear_transform to scalar + sigmoid).

Finally, to answer your GPU memory question, I want to point out that I recently added the reduction factor (originally used in the first Tacotron implementation), which consists of predicting "r" (reduction factor) frames simultaneously at each decoding step. This ensures that the model makes fewer decoding steps in training, reducing the amount of computation and freeing memory, and it seems to allow the model to capture alignment faster.
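The reduction factor can be pictured as regrouping the target frames so that each decoder step emits r of them. The sketch below uses toy sizes and a hypothetical helper name; zero-padding the tail to a multiple of r is an assumption about how leftover frames are handled.

```python
def group_frames(frames, r, pad_frame):
    """Pad the frame sequence to a multiple of r, then group r consecutive frames per decoder step."""
    remainder = len(frames) % r
    if remainder:
        frames = frames + [pad_frame] * (r - remainder)
    return [frames[i:i + r] for i in range(0, len(frames), r)]


# 12 target frames with 3 mel bins each (toy sizes), reduction factor r = 5.
frames = [[float(t)] * 3 for t in range(12)]
steps = group_frames(frames, r=5, pad_frame=[0.0] * 3)
print(len(frames), "frames ->", len(steps), "decoder steps")
```

With r = 5 the decoder loop runs roughly five times fewer iterations per utterance, which is where the memory and speed savings come from.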

Since my main (more powerful) GPU is busy on another project, I'm only using a 920MX (which has 2 GB of VRAM) for training Tacotron 2, and it only supports a batch size of 12 at most (using the reduction factor r=5), but I suspect a 1080Ti would easily train the model with a batch size of 64. I hope this answers your question.

The comment is getting pretty long.. but just to make sure there isn't something wrong with my loss function, I saw that you are only using the MSE of decoder outputs (with no post-net?) and crossentropy of <stop_token> prediction in your loss function. I am doing the same with the exception of adding the "after-postnet" error too (seems to speed up convergence) along with an l2 regularization on all network weights.
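The loss described above can be written out as a sum of four terms. This is a sketch with hypothetical names and unit weighting between the terms, which is an assumption; neither repository's exact weighting is stated in the thread.

```python
import math


def mse(pred, target):
    """Mean squared error over a list of frames (each frame a list of floats)."""
    count = sum(len(row) for row in target)
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / count


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labels)


def combined_loss(mel_before, mel_after, mel_target, stop_probs, stop_labels,
                  weights, l2_scale):
    """MSE before post-net + MSE after post-net + stop-token cross-entropy + L2 penalty."""
    l2_penalty = l2_scale * sum(w * w for w in weights)
    return (mse(mel_before, mel_target)
            + mse(mel_after, mel_target)
            + binary_cross_entropy(stop_probs, stop_labels)
            + l2_penalty)


target = [[0.5, -1.0], [2.0, 0.0]]          # 2 frames, 2 mel bins (toy sizes)
before = [[0.4, -0.8], [1.5, 0.1]]          # decoder output
after = [[0.5, -0.9], [1.9, 0.0]]           # decoder output refined by the post-net
loss = combined_loss(before, after, target, [0.1, 0.8], [0.0, 1.0],
                     weights=[0.3, -0.2], l2_scale=1e-6)
print(loss)
```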

The thing is, my loss decreases amazingly fast (which seems odd actually) and then becomes "constant" after only 600 steps, while mel-spectrogram quality continues to improve (along with audio quality, which I check with a simple Griffin-Lim just to track linguistic improvement, without paying much attention to audio quality in general). Is that supposed to be normal? I will try to share some TensorBoard plots later (just waiting for the alignment to appear before that).

@A-Jacobson
Owner

@Rayhane-mamah, I'm glad you got your project working. I was also going to suggest tuning your learning rate or using cyclical learning rates. Since the paper didn't give weight init details and we aren't using the same batch size or dataset they are using in the paper, their parameters aren't going to be that relevant to us. I got the quickest alignments using the techniques in the stochastic gradient descent with warm restarts paper. My code for that has been pushed to this repo.
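For readers unfamiliar with warm restarts, the schedule from the SGDR paper is a cosine decay that periodically jumps back to the maximum learning rate. The sketch below is a generic version with assumed parameter names and values; it is not the code pushed to the repo.

```python
import math


def sgdr_learning_rate(step, lr_max, lr_min, cycle_len, cycle_mult=1):
    """Cosine-annealed learning rate with warm restarts.

    The first cycle lasts `cycle_len` steps; each following cycle is
    `cycle_mult` times longer than the previous one.
    """
    t_cur, t_i = step, cycle_len
    while t_cur >= t_i:          # find the position inside the current cycle
        t_cur -= t_i
        t_i *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


for step in (0, 50, 99, 100, 150):
    print(step, sgdr_learning_rate(step, lr_max=1e-3, lr_min=1e-5, cycle_len=100))
```

At steps 0 and 100 the rate is back at `lr_max`, and it decays towards `lr_min` in between.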

As for the loss, there's a screenshot of mine in the readme, the starting point is cut off but it usually starts at about 300.0. An exponential decrease like that isn't surprising to me since we are using MSE and our targets aren't z-normalized (at least mine aren't) . If you think about the magnitude of the loss between a random 120 x 700 matrix vs a true spectrogram, it's obvious that the value would start very high. Once the model starts outputting things in the correct range (blurry spectrograms) and has to make small adjustments (vocal patterns) the loss will naturally start decreasing much more slowly.

With regards to the attention mechanism, I seem to remember tacotron one using a double layered attention but as I've been focused on faithfully reproducing tacotron2 I don't know much about it. My attention works the same way as it would in an nmt model, I have an example using it for nmt here (https://github.com/A-Jacobson/minimal-nmt) which produces quite good alignments after about 10 epochs (15 mins).

It seems that I need to do some profiling or look into those reduction layers, though. My model is using roughly 6x the memory you're reporting.

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 10, 2018

About the size of the model, I just had a quick look at your code and might have found some causes of such a big difference in the memory usage I'm experiencing.

I saw you using 1024 units in each of your decoder RNN layers. I am using 1024 for the two layers combined (512 each) and 128 in each prenet layer (256 for 2 layers). They may have meant 1024 units for each layer, but if the model gives nice results with fewer units I'll avoid adding more complexity to the model. But then again, this is a parameter we choose depending on the situation.
Then comes the reduction factor, which reduces the size of the model even more.

On reflection, your attention seems to reproduce Tacotron 2 much better. Will definitely try it out later.
There's just one thing I wanted to check with you since I don't have much experience with PyTorch. Are you using your postnet on each decoder step? Isn't it supposed to improve the output of the decoder after all frames have been predicted?

As for hyper parameters, I also noticed the difference of our case with the paper, I toyed a little with the optimizer's params to minimize the loss shakes and will probably tune them more at a later stage.

Did your model generate any good sounding samples yet?

@A-Jacobson
Owner

A-Jacobson commented Mar 10, 2018

That's an interesting interpretation. I didn't think about that, and without the authors' code to refer to we can truly only guess.

With regard to the results, I haven't written a Griffin-Lim yet, so I've just been looking at the output spectrograms vs targets, and all I can say is that they look quite similar. I was planning to wait until my wavenet is done to listen to them.

@fatchord

fatchord commented Mar 12, 2018

Hi guys, hope you don't mind me chiming in. Regarding this from the paper: "cumulative attention weights from previous decoder time steps" - my initial interpretation of that was to make a tensor of size [Batch, 1, EncoderTimeSteps] and cumulatively add the attention weights from each step to it. So the attention convolution would be looking at all attention locations it had previously contributed to. What do you think?

Looking at the Bengio paper - if I'm not mistaken, they only convolved the attention weights of the previous time step - that sounds very different to "cumulative attention weights"
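The two readings can be compared directly. The sketch below implements the "running sum" interpretation described above and records what the location convolution would see at each decoder step; the previous-step-only variant would simply pass `alignments[t - 1]` instead. The function name and toy values are assumptions for illustration.

```python
def cumulative_alignments(alignments):
    """For each decoder step t, return the element-wise sum of alignments[0..t-1].

    alignments[t][j] is the attention weight on encoder step j at decoder step t.
    """
    running = [0.0] * len(alignments[0])
    history = []
    for alignment in alignments:
        history.append(list(running))   # location feature input at this step
        running = [c + a for c, a in zip(running, alignment)]
    return history


alignments = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
]
for step, seen in enumerate(cumulative_alignments(alignments)):
    print(step, seen)
```

Under this reading the location features encode every encoder position attended so far, not just the most recent one.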

@A-Jacobson
Owner

A-Jacobson commented Mar 12, 2018

Welcome @fatchord. I originally thought the same, but ended up following the equations in the Bengio paper. I then realized that the attention weights are cumulative since we add the previous weights during the calculation.

Ignoring the other information, if:

weights[t] = weights[t] + weights[t-1]

# and

weights[t-1] = weights[t-1] + weights[t-2]

That fits the definition of cumulative to me. Does that make sense?

What still isn’t clear to me is if we backprop through the attention weights from the previous step or if we detach the weights from the graph. I’ve tried both and haven’t noticed much difference.

@fatchord

fatchord commented Mar 13, 2018

Ah I think I see what you mean now - if the current weights are calculated from the last then this is cumulative, but it's a kind of transformed cumulation. Is that what you mean?

Regarding detaching the attention weights - my initial thinking would be to leave it in the graph as the attention is kinda like a recurrent net in itself right? I'm new to attention models so I'm not sure.

So in my own pytorch implementation I too have got very slow training times. I trained a big wavenet recently (took forever) but this part of tacotron2 is even slower - that's in direct conflict with what's stated in the paper - they wrapped up training in a day. Are they predicting multiple spectrograms per decoder time step? Surely they would have mentioned that?

@A-Jacobson
Owner

They predict multiple frames in the first paper I believe. Though it was my understanding that this one is meant to be faster. Granted, this is a google paper that's full of rather googley things such as the use of an internal dataset, and unknown hardware (what kind of gpu can run this with a batch size of 64?). They also mention that they trained their wavenet "with a batch size of 128 distributed across 32 GPUs".

Honestly, unless I'm doing something terribly wrong I'm not likely to fully train this as it's supposed to take a few hundred thousand steps to fully converge. (I can only get a few thousand steps per hour with a smaller batch size). But it has been a decent deep dive into encoder --> attention --> decoder architectures.

As for the wavenet, do you have a link to your implementation? I just managed to crank out what I think is a basic version, but the details in the paper are rather sparse. I'm still not clear on how to get it to generate audio based on the input spectrogram. Do we add the full spectrogram as a conditional input?

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 13, 2018 via email

@A-Jacobson
Owner

Right, the mixture of logistic encoding is from pixelcnn++.

I just built a baby wavenet that’s generating sine waves right now. It seems I have to add local conditioning.

My understanding right now is this:

Training:

Wavenet(full_audio, full_spectrogram)
Output = full_decoded_audio

- Spectrogram is upsampled to the same length as the audio (undo FFT hops)

- Upsampled spec is used as local context for each wavenet block.

Inference:

Wavenet(audio_start_token, spec_frame?)
Output = single audio frame?

As you can see, I'm not too clear on the behavior just yet. To me the original wavenet paper left a lot of details out as well, including the number of filters in all their layers! Though I believe I found some of that info on one of the authors' Twitter feeds.. haha.
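The "undo FFT hops" step in the training outline can be sketched with the simplest possible upsampler: repeating each spectrogram frame once per hop sample. This is only one option (a learned upsampling network is another), and the helper name and toy sizes are assumptions.

```python
def upsample_by_repetition(frames, hop_length, num_samples):
    """Repeat each spectrogram frame `hop_length` times, then clip or pad to `num_samples`.

    The result holds one conditioning vector per audio sample.
    """
    upsampled = [frame for frame in frames for _ in range(hop_length)]
    if len(upsampled) < num_samples:
        # Pad with the last frame if the audio is slightly longer than frames * hop.
        upsampled += [frames[-1]] * (num_samples - len(upsampled))
    return upsampled[:num_samples]


frames = [[0.1], [0.2], [0.3]]      # 3 frames, 1 mel bin (toy sizes)
conditioning = upsample_by_repetition(frames, hop_length=4, num_samples=10)
print(conditioning)
```

During training the whole conditioning sequence is available at once; at inference the same sequence is indexed one audio sample at a time while the network generates.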

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 13, 2018

You work fast x)

I am not really sure about inference time as I am still writing the training part, so I won't misguide you. I should be able to finish the entire thing this weekend (right after I finish those exams..)
As for the training part, I share the same understanding.

I will keep you informed if I find anything useful.

@fatchord

fatchord commented Mar 14, 2018

I've had decent enough results with my wavenet. I haven't implemented the mixed logistics because I wanted to replicate the sound quality of the original wavenet paper first. Actually it'd be great to get your thoughts on the sound quality: [testset.tar.gz](https://github.com/A-Jacobson/tacotron2/files/1810315/testset.tar.gz)

Obviously there's noise from the 8-bit encoding, but besides that all I can hear is just a little bit of phasey/flangey noise around the top end.

My implementation is basically a gigantic jupyter notebook right now so it badly needs refactoring. Once I get around to that (I'm busy mainly with WaveRNN right now), I'll upload it to github.

@fatchord

Oh I almost forgot - for wavenet hyperparams have a look towards the end of the distilled wavenet paper - in section 5 - Experiments they give details there.

@A-Jacobson
Owner

Thanks for the pointers. With regard to wavenet, the diagrams from the original paper led me to believe that the kernel size should always be 2! Ironically, the parallel wavenet paper was the only wavenet paper I hadn't read in depth, since I thought it was just about speedups.

I'm having trouble playing that sound file on this computer; it says the format isn't supported, though I'm not sure my ears would be able to notice anything about it that yours couldn't anyway.

I also don't mind giant jupyter notebooks too much since I'm just looking for hyper parameters and small details so please feel free to share your implementation. That being said, is the audio you generated conditioned on spectrograms or is it using the linguistic features from the original paper? Also are you using fast generation queues or parallel wavenet? I've found that generation of really anything (even a sine wave) with a naive approach is prohibitively slow.

@fatchord

Yeah, I should really write a proper wav saving function - librosa does something strange sometimes when saving - yet another reason why I need to refactor the entire thing. In the meantime I recommend you check out the r9y9 and kan-bayashi repos, both have legit implementations.

As for conditioning - I'm using mel spectrograms. Be extremely careful with mel/sample alignment - that's something that tripped me up initially. Also I'm using fast-queues - it's not that fast, but it does cut down on naive generation time by a factor of around 4 in my experience. That's why I'm so interested in WaveRNN right now - it took around 20 mins to generate the little sample I uploaded earlier - totally impractical.

@A-Jacobson
Owner

A-Jacobson commented Mar 15, 2018

Ya, the fast queues only increase generation time in proportion to the number of layers in your network and they mentioned it isn't much faster (2x maybe) unless you're using more than 10 layers or so. The original is O(2^L), fast queues are O(L) but parallel wavenet claims real time performance. Perhaps that's worth a look. I don't really want to wait 2 hours to generate a decent sized eval clip!

I have checked out both of those repos and a few Chainer repos. But at this point I feel like it's worth my time to build my own up in as clear a way as possible, since I believe the concept of this pseudo-recurrent generative convnet with a wide receptive field could be adapted to other domains. Basically, I'd like to understand it well enough to pull out the ideas where appropriate.

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 15, 2018

Hello it's me, not Mario! (that wasn't funny..)

@A-Jacobson, I have tried implementing your attention here and I'm using it in the decoder here.

Just to make sure I have not done some silly mistakes:

  • previous output through a pre-net
  • Get context vector using encoder outputs + last decoder lstm hidden state (h) + previous alignments
  • concat context + prenet output and pass them through lstm cells to update hidden states and compute rnn output
  • then use the concat context + rnn_out through projections to predict mel spectrograms and <stop_token>.

Am I correct?

Now I remember you saying that the model learned to ignore the input padding at an early stage. Well, mine only seems to look at the padding at such early stages.. (In the following plots, I am actually using a concatenation of the two LSTM cells' hidden states as the query vector, even though in the repository I am only using one of them.)

[Three attention alignment plots]

Could you also provide some alignment plots of your model while it is learning attention? It would really help me to see what it's supposed to output, so I know when it's working properly. (For example, the alignments over the first few thousand steps, up to the point where alignment appears?)

Thanks a lot!

@A-Jacobson
Owner

In concept, it looks correct, except that I use only the last layer's hidden state as the query vector, as is common in NMT. As a warning, I'm not familiar with the base attention class you're inheriting from in tf.contrib and it's been a while since I've touched TF, so I'm not likely to catch subtle bugs in your code!

As for the padding, I'm explicitly shutting off the gradient to the padding embedding in the decoder, so perhaps that is a difference. It's hard to say after only 300 steps though. Many of the plots from other repos, like the one I referenced in this thread, didn't get any kind of alignment at all until ~20k steps. Most of my plots are only from ~3k, so I have mostly checkerboards with the padding as a blur as well.

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 15, 2018 via email

@A-Jacobson
Owner

A-Jacobson commented Mar 15, 2018 via email

@A-Jacobson
Owner

A-Jacobson commented Mar 15, 2018

Plots at 4k steps. You can see that for each frame, it doesn't put weight on anything past the end token, except in cases where there's silence (it doesn't seem to understand commas or periods yet; I also padded my spectrogram with zeros and perhaps should have used -80, which is the value of a full zero-valued audio window before spectrogram extraction). On the right-hand side you can see it is completely uncertain where to look when outputting zero-valued spectrogram frames, since 1.0 / ~140 (len of sequence) ≈ 0.007. That's what I meant when I said it was learning to ignore padding.

attention_4k

output:
output_4k

target:
target_4k
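The "completely uncertain" weights described above are what a softmax produces when every encoder position gets the same score. A quick check in plain Python, assuming a sequence length of 140 as in the comment:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


sequence_length = 140                          # approximate length quoted in the thread
uniform = softmax([0.0] * sequence_length)     # identical scores -> uniform weights
print(uniform[0])                              # close to 1 / 140, i.e. roughly 0.007

peaked = softmax([0.0, 0.0, 10.0])             # one clearly preferred position
print(peaked)
```

A sharp alignment looks like the second case, with nearly all of the weight on one encoder step.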

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 16, 2018 via email

@A-Jacobson
Owner

You're absolutely right about all of those things! Unfortunately, I'm reluctant to change a hyperparameter and reset the training yet again. Which is why I have held off. I suppose I could sort the data between training sessions but I'm really more interested in learning concepts and checking correctness than training efficiency at this point. When you started this thread I was just starting my second day working on this thing.

@fatchord

@A-Jacobson I noticed you're using embeddings for the input to your wavenet - I tried that and it didn't work so well, you're better off with scalars. One-hot inputs work too, but in my experience a simple scalar is the best of all.

Btw - WaveRNN is coming along nicely - check out this unconditioned output - it's early enough in training too:
12k_steps.wav.tar.gz

@A-Jacobson
Owner

Hah, is that what random phonemes strung together sound like? How long does generation from a WaveRNN take vs a normal wavenet?

@A-Jacobson
Owner

Hey @fatchord, I started to add conditioning to my wavenet but realized the Tacotron 2 paper asks for a 12.5 ms FFT hop size, which I used. Unfortunately, that means the spectrogram features have to be upsampled by a factor of ~275. Minor differences (like frames generated from an incomplete audio frame) can be handled by clipping, but they claim they did the upsampling with two transposed convolutions. Of course, they didn't share the parameters they used in these layers. Did you follow this same recipe, or do you use a more friendly hop size with your spectrograms, or perhaps use the feature repeating strategy instead of an upsampling network?

@A-Jacobson
Owner

A-Jacobson commented Mar 18, 2018

Now that I look, they're generating audio at 24 kHz. Even at that rate you'd have to upsample by a factor of 300. Seems odd to try to do that in two layers. Maybe a stride of 10, then a stride of 30?
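The upsampling factors quoted here follow directly from the hop duration and the sample rate. The sketch below redoes the arithmetic and lists the ways to split a factor of 300 across two strided layers; the 22050 Hz rate behind the ~275 figure is an assumption, and the helper names are hypothetical.

```python
def hop_in_samples(sample_rate_hz, hop_ms):
    """Number of audio samples covered by one spectrogram hop."""
    return sample_rate_hz * hop_ms / 1000.0


factor_24k = hop_in_samples(24000, 12.5)   # 300.0
factor_22k = hop_in_samples(22050, 12.5)   # 275.625, not an integer


def stride_pairs(total):
    """All (a, b) with a <= b and a * b == total: candidate strides for two upsampling layers."""
    return [(a, total // a) for a in range(2, int(total ** 0.5) + 1) if total % a == 0]


pairs = stride_pairs(300)
print(factor_24k, factor_22k)
print(pairs)
```

The (10, 30) split suggested above is one of eight integer factorizations of 300. At 22050 Hz the factor is fractional, which is why clipping or a friendlier hop size comes up.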

@fatchord

@A-Jacobson re:wavernn - generation is around 1100 samples per second. The paper mentioned 1600 in their tensorflow regular implementation so I guess it's not too far off - the dynamic graph might be slowing it up a little. I've uploaded a public repo if you wanna check it out.

Re:spectrograms - I sampled at 22050 Hz with a hop size of 256 and an FFT size of 1024. That's roughly in the same ballpark as T2's settings but at a reduced sample rate. I recommend checking out r9y9's wavenet vocoder's spectrogram preprocessing if you are unsure of anything.

@A-Jacobson
Owner

A-Jacobson commented Mar 20, 2018 via email

@fatchord

Well if you're going to follow the paper religiously then you're going to be stuck with those awkward upsampling scales. Still though, there must be a good reason why they picked them.

@fatchord

fatchord commented Mar 31, 2018

Hey guys, what did you make of the latest tacotron papers? I think they're pretty amazing, the style tokens idea is great. Also this opens up the opportunity to use noisy datasets.

One of the co-authors popped up here https://www.reddit.com/r/MachineLearning/comments/87klvo/r_expressive_speech_synthesis_with_tacotron/ - definitely worth checking out. No 'tricks' held back apparently!

@Rayhane-mamah
Author

Rayhane-mamah commented Mar 31, 2018 via email

@fatchord

Congrats on getting T2 to work! I didn't know there was a revised paper either - must have a look now.

@Rayhane-mamah
Author

Thank you @fatchord, the revised version can be found here. I can't seem to find the original paper anywhere anymore, but luckily I have it in PDF format here.

The difference might seem minimal between the two papers, but I really find the second version clearer.. Weird.

@A-Jacobson
Owner

I like the idea of being able to generate high quality speech from noisy speech. Definitely worth a look when I (and my GPU) get some free time again. I saw that reddit thread! They were really getting grilled. Though, other than the wavenet parts, I think the architecture descriptions are pretty clear. The problem of course is the use of internal data, which makes it impossible to completely validate our implementations. It would be nice if they at least let us know the distribution of utterance lengths in their internal data!

@fatchord

fatchord commented Apr 5, 2018

I think the dataset problem may be solvable with crowdsourcing. I mean there's nothing stopping a bunch of random people on the internet picking a high-quality commercial audiobook and manually segmenting it while at the same time logging all time-stamps of the start/end of utterances. Then create a script that will segment according to these time-stamps. If enough people got involved it might only amount to a morning's worth of work per person.

That way anyone can buy the audiobook, run the script and have 20+ hours of high-quality, noise-free TTS data. All legal problems regarding distribution are avoided since the dataset contains no audio, just metadata.
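The segmentation script in this proposal could be as small as the sketch below: the shared metadata is a list of (start, end) timestamps, and each user applies it to their own copy of the audio. The function name, the timestamp format in seconds, and the toy signal are all assumptions for illustration.

```python
def segment_by_timestamps(samples, sample_rate, timestamps):
    """Cut `samples` into utterances given (start_seconds, end_seconds) pairs."""
    segments = []
    for start_s, end_s in timestamps:
        start = int(round(start_s * sample_rate))
        end = int(round(end_s * sample_rate))
        segments.append(samples[start:end])
    return segments


# Toy "audiobook": 100 samples at 10 Hz, with two logged utterances.
audio = list(range(100))
timestamps = [(0.0, 1.5), (2.0, 3.0)]
utterances = segment_by_timestamps(audio, sample_rate=10, timestamps=timestamps)
print([len(u) for u in utterances])
```

Only the timestamp list would need to be distributed, which is the point of the metadata-only idea.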

I was thinking of creating a dedicated subreddit for models like wavenet, tacotron, samplernn etc called r/AudioModels and this might be a nice project to start it all off. What do you guys think?

@Rayhane-mamah
Author

@fatchord, I think it's an awesome idea if it works!

In the meantime, you can check out this newly released open source speech data that can be used for TTS, speech recognition (with the add-noise feature), and audio cleaning (extracting speech from noisy audio). It contains several languages with multiple readers (eng-US, eng-UK, German...) and each reader always has more than 24 hours of speech. I find it very well done, and it's probably worth a look!

@fatchord

fatchord commented Apr 6, 2018

@Rayhane-mamah What a find, thanks! I just downloaded the eng_UK and the quality is really good.

@fatchord

fatchord commented Jun 1, 2018

@Rayhane-mamah @A-Jacobson Hey guys, I just created https://old.reddit.com/r/AudioModels/ today if you wanna check it out.

Cheers!


3 participants