I love this exploration! Thanks for writing and coding this up. Right now, we're working on modifications to the causal conv1d and selective scan CUDA kernels to support defining the input state, so we are reviewing your code carefully.
What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?
Also, a general question: do you have a feeling for why your current implementation isn't working? Might vanishing gradients be an issue when running over longer sequences? I noticed that you're using bf16. I found this caused instability, and using amp for higher precision seemed to help.
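For reference, a minimal sketch of the precision change being suggested: instead of casting the whole model to bf16, keep the master weights in fp32 and let autocast handle mixed precision in the forward pass. The model, dataloader, and hyperparameters here are hypothetical placeholders, not names from train-infinite.py.

```python
import torch

model = MyMambaModel().cuda()            # hypothetical model, parameters kept in fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:                 # hypothetical dataloader
    optimizer.zero_grad(set_to_none=True)
    # forward pass runs in bf16 under autocast, but master weights stay fp32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)              # assumes the model returns a scalar loss
    loss.backward()                      # gradients accumulate against fp32 params
    optimizer.step()
```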
What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?
Ah, this is just based on the intuition that we should clear the hidden states from time to time so that they get regenerated by the newest model weights during training. Nothing rigorous here.
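To make the idea concrete, here is a minimal sketch of one way an exponential fall-off in cache clearing could look. The decay constant, the direction of the fall-off (clearing more often early in training), and all names are assumptions for illustration; the actual rule in train-infinite.py may differ.

```python
import math
import random

def should_clear_cache(step: int, base_prob: float = 1.0, decay: float = 1e-4) -> bool:
    """Clear the cached hidden state with a probability that decays exponentially
    over training, so states produced by much older weights are flushed more
    aggressively early on."""
    prob = base_prob * math.exp(-decay * step)
    return random.random() < prob

# usage inside a hypothetical training loop
cached_state = None
for step, batch in enumerate(batches):   # `batches` is a placeholder iterable
    if should_clear_cache(step):
        cached_state = None              # force the model to rebuild the state
    # ... forward pass would consume and update `cached_state` here ...
```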
Also, a general question: do you have a feeling for why your current implementation isn't working?
I remember the problem I encountered was NaN loss, so probably not vanishing gradients but rather exploding activations or gradients when the sequence length gets too long.
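A simple per-step guard like the following can help localize whether the NaNs first appear in the loss or in the gradients. The gradient-norm clipping shown at the end is a standard mitigation for exploding gradients, included here only as an illustration, not as something the original training script is known to do.

```python
import torch

def check_step(loss: torch.Tensor, model: torch.nn.Module, max_norm: float = 1.0) -> None:
    # flag a non-finite loss before backward has a chance to propagate NaNs
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss detected: {loss.item()}")
    # total gradient norm before clipping; a rapidly growing value points at
    # exploding gradients rather than vanishing ones
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        raise RuntimeError("non-finite gradient norm detected")
```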