-
Well, the two most likely culprits are the default initialization of parameters and/or the optimizer. Let us know if you need help ensuring the initialization is the same.
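To make that comparison concrete, here is a minimal sketch of forcing both models to start from identical weights by exporting the PyTorch parameters and loading them into the MLX model. `torch_model` and `mlx_model` are stand-ins for the two equivalents discussed in this thread, and the sketch assumes parameter names and shapes line up one-to-one, which may not hold (e.g. linear-layer weight layouts can differ between frameworks).

```python
import mlx.core as mx

# Export every PyTorch parameter as a numpy array.
torch_weights = {k: v.detach().cpu().numpy()
                 for k, v in torch_model.state_dict().items()}

# Convert to MLX arrays and load them into the MLX model; load_weights
# accepts a list of (name, array) pairs and checks shapes against the
# model's own parameters.
mlx_model.load_weights([(k, mx.array(v)) for k, v in torch_weights.items()])
```

If the losses still diverge with identical starting weights, the optimizer or the update step becomes the prime suspect.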
-
For the optimizer and the loss I am using Adam and MSE in both.

PyTorch version:

```python
model = TimeSeriesTransformer(input_dim=src_train.shape[2], model_dim=256).to(device)

# Training setup
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.001)
```

MLX version:

```python
model = TimeSeriesTransformer(input_dim=src_train.shape[2], model_dim=256)
optimizer = optim.Adam(learning_rate=0.001)

def loss_fn(model, x, tgt_inputs, y):
    predict = model(x, tgt_inputs)
    loss = nn.losses.mse_loss(predict, y, reduction="mean")
    return loss
```
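For completeness, here is a minimal sketch of the MLX training step this `loss_fn` would plug into; the batch variables are placeholders rather than code from the thread. A common pitfall when porting from PyTorch is forgetting that MLX is lazy, so the `mx.eval` call at the end of each step matters.

```python
import mlx.core as mx
import mlx.nn as nn

# Wrap the loss so each call returns both the loss value and the gradients.
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

for x, tgt_inputs, y in batches:  # `batches` is a hypothetical iterable
    loss, grads = loss_and_grad_fn(model, x, tgt_inputs, y)
    optimizer.update(model, grads)
    # Force evaluation of the lazily-computed parameters and optimizer state.
    mx.eval(model.parameters(), optimizer.state)
```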
-
OK, I came up with this minimal example to reproduce what I observed. The actual data is 20x larger than the attached dataset.
-
Still observed the same behavior even after replacing the optimizer with
-
I have spent a whole day trying to figure out why this MLX transformer model does not train (the loss stops decreasing after 2 iterations) while its PyTorch equivalent works just fine. I am wondering if anyone could point out what goes wrong in my implementation. I am trying to build a transformer model for time series forecasting.
PyTorch version
MLX version
I wrote the PyTorch version first and then translated it into MLX line by line, so I am wondering whether some differences in default parameters caused the issue. I'm using the same M1 Max for both exercises.
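One quick way to check the default-parameter hypothesis is to dump the shape and spread of every parameter on both sides and compare them. This is a hedged sketch, not code from the attached examples; `model` and `torch_model` refer to the MLX and PyTorch models above.

```python
import mlx.core as mx
from mlx.utils import tree_flatten

# MLX side: print each parameter's name, shape, and standard deviation.
for name, p in tree_flatten(model.parameters()):
    print(f"{name}: shape={p.shape}, std={mx.std(p).item():.4f}")

# PyTorch side, for comparison:
# for name, p in torch_model.state_dict().items():
#     print(f"{name}: shape={tuple(p.shape)}, std={p.std().item():.4f}")
```

Large mismatches in the reported standard deviations would point at different default initializers rather than the optimizer.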