MAML & LLMs #827
-
Is mlx flexible enough to fine-tune an LLM in the style of MAML? In other words, is it possible to fine-tune an LLM in mlx via bi-level gradient descent?
-
Yes for sure you can compose `vjp`/`value_and_grad`/`grad` to any depth and it will work. So to do a bilevel thing you would do something like:

```python
def step(outer_w, inner_w, x, y):
    def loss(inner_w, x, y):
        return nn.losses.mse_loss(inner_w @ x, y)
    dloss_dinner_w = mx.grad(loss)(inner_w, x, y)
    inner_w = inner_w + (outer_w @ x) * dloss_dinner_w
    return loss(inner_w, x, y)

dstep_douter_w = mx.grad(step)(outer_w, inner_w, x, y)
```

(Super simple + untested but just to give you the flavor of how that could go).
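Not part of the answer above, but here is a self-contained variant of that sketch which actually runs, assuming toy random data and treating `outer_w` as a learned scale on the inner update; the shapes and names are illustrative only:

```python
# Minimal runnable bilevel (MAML-style) sketch with toy data.
# `outer_w` acts as a learnable inner-step scale purely for illustration.
import mlx.core as mx
import mlx.nn as nn

x = mx.random.normal((8, 4))   # toy inputs
y = mx.random.normal((8, 1))   # toy targets

def inner_loss(inner_w, x, y):
    return nn.losses.mse_loss(x @ inner_w, y)

def outer_loss(outer_w, inner_w, x, y):
    # Inner step: one gradient update of inner_w, scaled by outer_w.
    dinner = mx.grad(inner_loss)(inner_w, x, y)
    adapted_w = inner_w - outer_w * dinner
    # Outer objective: loss of the adapted weights. Taking mx.grad of this
    # function w.r.t. outer_w differentiates through the inner mx.grad call,
    # i.e. a bilevel / second-order gradient.
    return inner_loss(adapted_w, x, y)

outer_w = mx.array(0.1)
inner_w = mx.random.normal((4, 1))

douter_w = mx.grad(outer_loss)(outer_w, inner_w, x, y)
print(douter_w.item())
```

For actual MAML you would instead differentiate the adapted loss with respect to the shared initialization (and sum over tasks), but the composition of `mx.grad` calls works the same way.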
-
Thanks @awni for your answer & example. I'm now exploring whether it's possible to apply `mx.vmap` to a BERT model. The first step is to process the text correctly. I understand why the following doesn't work, but I haven't yet come up with an alternative.
-
No you probably can't vmap that out of the box. If you want to vmap over the call of BERT you'd have to do something like:

```python
model = BERT()

def forward(params, x):
    model.update(params)
    return model(x)

vmapfn = mx.vmap(forward)
y = vmapfn(model.parameters(), x)
```
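For reference, here is a runnable version of that pattern with `mlx.nn.Linear` standing in for BERT, and a batch of per-task parameter sets stacked along the vmapped axis; the stacking with `tree_map` and the shapes are illustrative assumptions, not part of the suggestion above:

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_map

model = nn.Linear(4, 2)  # stand-in for BERT()

def forward(params, x):
    # Load this task's parameters into the module, then run it.
    model.update(params)
    return model(x)

num_tasks = 3
# Simulate per-task parameters by stacking copies of the current parameters
# along a new leading axis; axis 0 is the axis mx.vmap maps over.
stacked_params = tree_map(lambda p: mx.stack([p] * num_tasks), model.parameters())
x = mx.random.normal((num_tasks, 5, 4))  # one (5, 4) input batch per task

vmapfn = mx.vmap(forward)      # maps over axis 0 of both arguments
y = vmapfn(stacked_params, x)
print(y.shape)                 # (3, 5, 2)
```

With real per-task adapted weights (e.g. after the inner updates above), you would stack those instead of copies of the same initialization.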
-
@awni I followed your suggestion on how to define