Low Bit Optim Instability. #1218

Open
nighting0le01 opened this issue Nov 4, 2024 · 4 comments

Comments

@nighting0le01

Hi, will the torchao low-bit optim allow per-layer selection, i.e. switching specific layers back to 32-bit Adam for stability? Also, what about StableEmbedding layers?
1. Optimize unstable parameters: https://huggingface.co/docs/bitsandbytes/main/en/optimizers#optimize-unstable-parameters
2. Stable Embeddings: https://huggingface.co/docs/bitsandbytes/main/en/reference/nn/embeddings#bitsandbytes.nn.StableEmbedding

I see some divergence with the torchao optimizer that I don't see with bitsandbytes.
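
For reference, a minimal sketch of the bitsandbytes per-parameter override from link 1 above (the model and layer choices here are placeholders; GlobalOptimManager and override_config are the bnb APIs those docs describe):

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Sequential(nn.Embedding(50000, 512), nn.Linear(512, 512))

# register params with the manager while they are still on CPU
mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())

model = model.cuda()
optim = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# keep 32-bit optimizer states for the embedding weight only
mng.override_config(model[0].weight, "optim_bits", 32)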

@gau-nernst
Collaborator

Regarding "per layer selection", I planned for this feature before but forgot about it. Should be easy to add. Do you have an idea how you want this API to look like? I'm thinking like this

optim = AdamW8bit(model.parameters(), exclude_low_bit_optim_params=[model.output.weight])

Regarding StableEmbedding, it looks like an ordinary nn.Module with some custom defaults. You should be able to use it directly from bnb. I don't think we need to re-implement it in ao.
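
For example (a sketch; bnb.nn.StableEmbedding takes the same constructor arguments as nn.Embedding, and the dimensions here are placeholders):

import bitsandbytes as bnb

# drop-in replacement for nn.Embedding: xavier-uniform init plus a LayerNorm on the output
emb = bnb.nn.StableEmbedding(num_embeddings=50000, embedding_dim=512)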

Would you be interested in contributing a PR for the 1st feature?

@nighting0le01
Author

Hi @gau-nernst, yes, I believe this is a decent API. Sure, I can take it up if you can give me some pointers.

@gau-nernst
Collaborator

We currently have some checks so that the low-bit optim is only applied to certain params:

# follow bitsandbytes, only quantize tensors >= 4096 values
# also wrap subclass in DTensor when needed
def _new_buffer(self, p: Tensor, signed: bool):
    if p.numel() >= 4096 and p.numel() % self.block_size == 0:
        if isinstance(p, DTensor):
            out = DTensor.from_local(
                local_tensor=self._subclass_zeros(p.to_local(), signed, self.block_size),
                device_mesh=p.device_mesh,
                placements=p.placements,
                run_check=False,
            )
        else:
            out = self._subclass_zeros(p, signed, self.block_size)
    else:
        out = torch.zeros_like(p)
    return out

You can simply add an extra check that the param is not in the exclude_low_bit_optim_params list (or set), and add this extra argument to each Adam variant.
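
A rough sketch of what that check could look like (exclude_low_bit_optim_params is the proposed new argument, assumed here to be stored on the optimizer; the identity comparison avoids Tensor's elementwise ==):

def _new_buffer(self, p: Tensor, signed: bool):
    # skip quantized state for params the user explicitly excluded
    excluded = any(p is q for q in getattr(self, "exclude_low_bit_optim_params", ()))
    if not excluded and p.numel() >= 4096 and p.numel() % self.block_size == 0:
        # DTensor wrapping omitted here for brevity; see the snippet above
        out = self._subclass_zeros(p, signed, self.block_size)
    else:
        out = torch.zeros_like(p)
    return out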

@nighting0le01
Author

Sounds good! I'll create a PR!
