I love this exploration! Thanks for writing and coding this up. Right now, we're working on modifications to the causal conv1d and selective scan CUDA kernels to support defining the input state, so we are reviewing your code carefully.
What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?
Also, a general question: do you have a feeling for why your current implementation isn't working? Might vanishing gradients be an issue when running over longer sequences? I noticed that you're using bf16. I found this caused instability, and using amp for higher precision seemed to help.
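For reference, a minimal sketch of the precision change being suggested: instead of casting the whole model to bf16, keep the master weights in fp32 and let autocast handle mixed precision in the forward pass. The model, dataloader, and hyperparameters here are hypothetical placeholders, not names from train-infinite.py.

```python
import torch

model = MyMambaModel().cuda()            # hypothetical model, parameters kept in fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:                 # hypothetical dataloader
    optimizer.zero_grad(set_to_none=True)
    # forward pass runs in bf16 under autocast, but master weights stay fp32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)              # assumes the model returns a scalar loss
    loss.backward()                      # gradients accumulate against fp32 params
    optimizer.step()
```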
What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?
Ah, this is just based on the intuition that we should clear the hidden states from time to time so that they get regenerated by the newest model weights during training. Nothing rigorous here.
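To make the idea concrete, here is a minimal sketch of one way an exponential fall-off in cache clearing could look. The decay constant, the direction of the fall-off (clearing more often early in training), and all names are assumptions for illustration; the actual rule in train-infinite.py may differ.

```python
import math
import random

def should_clear_cache(step: int, base_prob: float = 1.0, decay: float = 1e-4) -> bool:
    """Clear the cached hidden state with a probability that decays exponentially
    over training, so states produced by much older weights are flushed more
    aggressively early on."""
    prob = base_prob * math.exp(-decay * step)
    return random.random() < prob

# usage inside a hypothetical training loop
cached_state = None
for step, batch in enumerate(batches):   # `batches` is a placeholder iterable
    if should_clear_cache(step):
        cached_state = None              # force the model to rebuild the state
    # ... forward pass would consume and update `cached_state` here ...
```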
Also, a general question: do you have a feeling for why your current implementation isn't working?
I remember the problem I encountered was NaN loss, so probably not vanishing gradients but rather exploding activations or gradients when the sequence length gets too long.
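A simple per-step guard like the following can help localize whether the NaNs first appear in the loss or in the gradients. The gradient-norm clipping shown at the end is a standard mitigation for exploding gradients, included here only as an illustration, not as something the original training script is known to do.

```python
import torch

def check_step(loss: torch.Tensor, model: torch.nn.Module, max_norm: float = 1.0) -> None:
    # flag a non-finite loss before backward has a chance to propagate NaNs
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss detected: {loss.item()}")
    # total gradient norm before clipping; a rapidly growing value points at
    # exploding gradients rather than vanishing ones
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        raise RuntimeError("non-finite gradient norm detected")
```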