Summary of the issue

I'm running into issues implementing a vector quantization module. The model gets through one training iteration and then errors out; the full stack trace of the error is further down, this is just to summarize the issue. I can successfully train the exact same model without the quantizer, so the issue is local to how I'm handling quantization.

System and program specs

2020 MacBook Air

The code

The code and instructions for replicating the issue are here. Apologies, cloning the repo may take longer than usual because of a few audio files serving as the dummy training dataset. A few notes about it:
Full stack trace and model implementation details

Stack trace:
Model details:
Replies: 1 comment 1 reply
The problem seems to be related to "parameters" getting added to the model after the first iteration, e.g. the `qt_vals` in `ConvVQ`. In general it looks like you use a lot of state inside the modules which should not be treated as parameters. You can prefix those with `_` so that they are not picked up in the Module's parameters. For example, use `_qt_vals` instead of `qt_vals`, and when you keep track of all the loss values in the modules, use an `_` prefix in the name as well so they are not treated as parameters.
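To make that concrete, here is a minimal sketch (not the code from your repo; the class name is borrowed from it, but the codebook lookup, shapes, and attribute names are made up for illustration) showing which attributes end up in `Module.parameters()`:

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_flatten


class ConvVQ(nn.Module):
    def __init__(self, dims: int, n_codes: int):
        super().__init__()
        # A trainable parameter: no underscore prefix.
        self.codebook = mx.random.normal((n_codes, dims))

        # Running state that should NOT be optimized: the leading "_"
        # keeps these out of Module.parameters().
        self._qt_vals = mx.zeros((n_codes, dims))
        self._losses = []

    def __call__(self, x):
        # Simplified nearest-codebook lookup: squared distance to each code.
        dists = ((x[:, None, :] - self.codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = mx.argmin(dists, axis=-1)
        self._qt_vals = self.codebook[idx]  # cached state, not a parameter
        return self._qt_vals


vq = ConvVQ(dims=8, n_codes=16)
vq(mx.random.normal((4, 8)))
print([k for k, _ in tree_flatten(vq.parameters())])
# ['codebook'] -- the "_" attributes never enter the parameter tree,
# so the tree stays the same after the first forward pass.
```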
Slightly more detailed explanation:

In theory we could be more dynamic with how we initialize the optimizer. However, in your case, I genuinely think those values should not be treated as parameters. Hence prefixing them with `_`.
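For the optimizer side of that, a small hedged sketch (`Toy`, `proj`, and the shapes are invented; `Optimizer.init` is used here only to materialize the state that would otherwise be built on the first `update`):

```python
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_flatten


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)


model = Toy()
opt = optim.Adam(learning_rate=1e-3)

# The per-parameter optimizer state mirrors the parameter tree it sees here.
opt.init(model.trainable_parameters())
print([k for k, _ in tree_flatten(model.trainable_parameters())])
# prints the flattened keys, i.e. 'proj.weight' and 'proj.bias'

# If a new array attribute (say qt_vals) is added to the module during the
# first forward pass, the parameter/gradient tree on the next step gains a
# key that this state was never built for, and the update no longer matches.
```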