Using sparse learning in practice #24

Open
iamanigeeit opened this issue Feb 26, 2022 · 1 comment

@iamanigeeit
Contributor

iamanigeeit commented Feb 26, 2022

Hi Tim, thanks for making this library. I am trying to test it on speech generation models and I have some questions from your code template:

  1. The models come with their own schedulers and optimizers. Can I simply wrap them with decay = CosineDecay ... and mask = Masking(optimizer, ...)? Should I change the optimizer to follow optim.SGD(...) and ignore the scheduler? It looks like mask.step() runs every epoch and replaces the scheduler, but I think I should still keep the optimizer specific to my model (see the sketch after this list).
  2. I understand that density/sparsity is the desired % of weights to keep, while the prune/death rate is an internal parameter that determines what % of weights gets redistributed at each iteration. Is this correct?
  3. In your code, density appears to equal sparsity, although normally I would think density = 1 - sparsity.
  4. The code fails at core.py lines 221-223 when there are RNNs, because for them bias is a boolean and the bias terms are actually bias_ih and bias_hh. I think this might count the parameters better:
for name, tensor in self.modules[0].named_parameters():
    total_size += tensor.numel()
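
For concreteness, this is roughly how I am trying to wire it into my training loop (mainly for question 1). The helpers build_speech_model, build_optimizer, build_lr_scheduler and compute_loss, and all hyperparameter values, are placeholders from my setup; I am pattern-matching the keyword arguments against your template, so they may not match newer versions of the code exactly:

from sparselearning.core import Masking, CosineDecay

# model, optimizer and LR scheduler come from the speech model's own recipe (placeholders)
model = build_speech_model(config)
optimizer = build_optimizer(model, config)            # e.g. Adam, not necessarily optim.SGD
lr_scheduler = build_lr_scheduler(optimizer, config)  # the model's own LR schedule

# wrap the existing optimizer; 0.5 and 0.1 are example prune-rate / density values
decay = CosineDecay(0.5, len(train_loader) * num_epochs)
mask = Masking(optimizer, prune_rate_decay=decay)  # keyword may be named death_rate_decay in some versions
mask.add_module(model, density=0.1)

for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        mask.step()        # per the template, this replaces optimizer.step()
    lr_scheduler.step()    # the model's own LR scheduler, kept as usual

In other words, is it fine to keep the model's original optimizer and LR scheduler and just let Masking wrap the optimizer as above?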
@TimDettmers
Owner

Hi! Thanks for your questions.

  1. The mask scheduler is different from the learning rate scheduler. The learning rate scheduler should be unaffected by the code.
  2. That is correct. The sparsity percentage is kept steady, but the prune rate changes over time.
  3. I think this is correct. For me, it feels more natural to think in terms of density (27% of weights seems more intuitive than 73% sparsity). However, I think I kept the naming in the code as "sparsity" even though I use density conceptually.
  4. This is a good catch! Could you create a pull request for this? I did not test the code on RNNs.
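
For anyone hitting the same error, here is a quick standalone check of the behaviour from point 4, in plain PyTorch with no library code (the layer sizes are arbitrary):

import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1)

# For RNN modules, `bias` is the boolean constructor flag, not a parameter tensor
print(type(lstm.bias))          # <class 'bool'>

# The actual bias tensors appear in named_parameters() as bias_ih_l0 / bias_hh_l0,
# so iterating named_parameters() counts all weights and biases correctly
total_size = 0
for name, tensor in lstm.named_parameters():
    print(name, tuple(tensor.shape))
    total_size += tensor.numel()
print('total parameters:', total_size)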
