-
Well, the two most likely culprits are the default initialization of parameters and/or the optimizer. Let us know if you need help ensuring the initialization is the same.
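To make that comparison concrete, here is a minimal sketch of forcing both models to start from identical weights by exporting the PyTorch parameters and loading them into the MLX model. `torch_model` and `mlx_model` are stand-ins for the two equivalents discussed in this thread, and the sketch assumes parameter names and shapes line up one-to-one, which may not hold (e.g. linear-layer weight layouts can differ between frameworks).

```python
import mlx.core as mx

# Export every PyTorch parameter as a numpy array.
torch_weights = {k: v.detach().cpu().numpy()
                 for k, v in torch_model.state_dict().items()}

# Convert to MLX arrays and load them into the MLX model; load_weights
# accepts a list of (name, array) pairs and checks shapes against the
# model's own parameters.
mlx_model.load_weights([(k, mx.array(v)) for k, v in torch_weights.items()])
```

If the losses still diverge with identical starting weights, the optimizer or the update step becomes the prime suspect.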
-
For the optimizer and the loss I am using Adam and MSE in both.

PyTorch version:

```python
model = TimeSeriesTransformer(input_dim=src_train.shape[2], model_dim=256).to(device)

# Training setup
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.001)
```

MLX version:

```python
model = TimeSeriesTransformer(input_dim=src_train.shape[2], model_dim=256)
optimizer = optim.Adam(learning_rate=0.001)

def loss_fn(model, x, tgt_inputs, y):
    predict = model(x, tgt_inputs)
    loss = nn.losses.mse_loss(predict, y, reduction="mean")
    return loss
```
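For completeness, here is a minimal sketch of the MLX training step this `loss_fn` would plug into; the batch variables are placeholders rather than code from the thread. A common pitfall when porting from PyTorch is forgetting that MLX is lazy, so the `mx.eval` call at the end of each step matters.

```python
import mlx.core as mx
import mlx.nn as nn

# Wrap the loss so each call returns both the loss value and the gradients.
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

for x, tgt_inputs, y in batches:  # `batches` is a hypothetical iterable
    loss, grads = loss_and_grad_fn(model, x, tgt_inputs, y)
    optimizer.update(model, grads)
    # Force evaluation of the lazily-computed parameters and optimizer state.
    mx.eval(model.parameters(), optimizer.state)
```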
-
OK, I came up with this minimal example to reproduce what I observed. The actual data is 20x larger than the attached dataset.
-
Still observed the same behavior even after replacing the optimizer with
-
I have spent a whole day trying to figure out why this MLX transformer model does not train (the loss stops decreasing after 2 iterations) while its PyTorch equivalent works just fine. I am wondering if anyone could point out what goes wrong in my implementation. I am trying to build a transformer model for time series forecasting.
PyTorch version
MLX version
I wrote the PyTorch version first and then translated it into MLX line by line, so I am wondering whether some differences in default parameters caused the issue. I'm using the same M1 Max for both exercises.
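One quick way to check the default-parameter hypothesis is to dump the shape and spread of every parameter on both sides and compare them. This is a hedged sketch, not code from the attached examples; `model` and `torch_model` refer to the MLX and PyTorch models above.

```python
import mlx.core as mx
from mlx.utils import tree_flatten

# MLX side: print each parameter's name, shape, and standard deviation.
for name, p in tree_flatten(model.parameters()):
    print(f"{name}: shape={p.shape}, std={mx.std(p).item():.4f}")

# PyTorch side, for comparison:
# for name, p in torch_model.state_dict().items():
#     print(f"{name}: shape={tuple(p.shape)}, std={p.std().item():.4f}")
```

Large mismatches in the reported standard deviations would point at different default initializers rather than the optimizer.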