Update smaller-lms.md
blbadger authored Aug 18, 2024
1 parent 7580973 commit 0339b6d
Showing 1 changed file with 1 addition and 1 deletion.
smaller-lms.md (2 changes: 1 addition & 1 deletion)
@@ -660,7 +660,7 @@ The mixer has another significant advantage with respect to embedding training:

With an increase in the number of samples available, the mixer is capable of better performance: for the 512-dimensional masked mixer embeddings given a 32-sized comparison batch, the test loss decreases from 0.1 to 0.053 as the number of samples increases to 550k, and likewise for a 128-sized batch we obtain a test loss of 0.12.

-It is interesting to note that an untrained mixer, while able to accurately represent its input, yields very poor embeddings for retrieval training. For the 200k dataset, an untrained 512-dimensional masked mixer's embeddings lead to practically no learning for batches of size 128, and even for batches of size 32 there is severe overfitting, with the loss never falling below 3.2.
+It is interesting to note that an untrained mixer, while able to accurately represent its input, yields very poor embeddings for retrieval training. For the 200k dataset, an untrained 512-dimensional masked mixer's embeddings lead to practically no learning for batches of size 128, and even for batches of size 32 there is severe overfitting, with the loss never falling below 3.2. Even more unexpected is the finding that the last hidden layer embedding is equally poor for retrieval if the embedding model undergoes autoencoding rather than causal language model training (i.e. next token prediction). This can be shown by training an encoder-decoder architecture in which the encoder mixer's last token's last hidden layer embedding is repeated and passed to a decoder, which is tasked with regenerating the entire input string. This encoder-decoder may be trained remarkably effectively (to a cross-entropy loss below 2.0), but the encoder's embedding is no better for retrieval learning than that of an untrained model. This suggests that the causal language model training process itself is important for language retrieval.

One might wonder whether this is the result of some incompatibility between a transformer's embeddings and a mixer's ability to learn that model's manifold, but this is not the case: if we instead use a bidirectional transformer to learn the retrieval pairs from the transformer's embeddings, we find that this model is far worse than the bidirectional mixer, with practically no learning occurring.
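
For concreteness, here is a minimal sketch of retrieval training on frozen embeddings, assuming (as the loss values above suggest) that each query embedding is scored against its comparison batch of candidate embeddings and trained with cross-entropy on the index of the true match, so that chance level for a 32-sized batch is ln(32) ≈ 3.47. The dot-product scoring head, names, and hyperparameters below are illustrative assumptions only: in the post, the retrieval model trained on these embeddings is itself a bidirectional mixer (or transformer), not a simple projection.

```python
# Illustrative sketch only (not the repository's code): retrieval training on
# embeddings taken from a frozen, pretrained masked mixer. Each query embedding is
# scored against a comparison batch of candidate embeddings, and the head is
# trained with cross-entropy on the index of the true match.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHead(nn.Module):
    """Hypothetical dot-product scoring head over frozen embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.candidate_proj = nn.Linear(dim, dim)

    def forward(self, query_emb, candidate_embs):
        # query_emb: (batch, dim); candidate_embs: (batch, n_comparison, dim)
        q = self.query_proj(query_emb).unsqueeze(1)   # (batch, 1, dim)
        c = self.candidate_proj(candidate_embs)       # (batch, n_comparison, dim)
        return (q * c).sum(dim=-1)                    # similarity logits: (batch, n_comparison)

head = RetrievalHead(dim=512)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# stand-ins for embeddings produced by the frozen mixer
query_emb = torch.randn(8, 512)
candidate_embs = torch.randn(8, 32, 512)   # comparison batch of size 32
target = torch.randint(0, 32, (8,))        # index of the matching candidate

loss = F.cross_entropy(head(query_emb, candidate_embs), target)
loss.backward()                            # chance-level loss here is ln(32) ≈ 3.47
optimizer.step()
```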

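The autoencoding comparison described in the added paragraph can likewise be sketched. Below, the encoder mixer's last token's final hidden state is repeated across every position and handed to a decoder that must regenerate the input; the `MixerBlock` is a simplified masked-mixer block (causal token mixing plus a channel MLP), and all class names, depths, and dimensions are assumptions for illustration rather than the post's actual implementation.

```python
# Illustrative sketch only: an embedding-repetition autoencoder in which the
# encoder's last token's final hidden state is repeated across all positions
# and a decoder is trained to regenerate the entire input sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixerBlock(nn.Module):
    """Simplified masked-mixer block: causal token mixing plus a channel MLP."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.token_norm = nn.LayerNorm(dim)
        self.token_mix = nn.Linear(seq_len, seq_len, bias=False)
        self.channel_norm = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # lower-triangular mask keeps the token-mixing weights causal
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x):                                # x: (batch, seq_len, dim)
        w = self.token_mix.weight * self.mask
        x = x + torch.einsum("ts,bsd->btd", w, self.token_norm(x))
        return x + self.channel_mlp(self.channel_norm(x))

class RepetitionAutoencoder(nn.Module):
    def __init__(self, vocab_size, dim=512, seq_len=512, depth=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.Sequential(*[MixerBlock(dim, seq_len) for _ in range(depth)])
        self.decoder = nn.Sequential(*[MixerBlock(dim, seq_len) for _ in range(depth)])
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                           # tokens: (batch, seq_len), padded to seq_len
        hidden = self.encoder(self.embed(tokens))
        summary = hidden[:, -1, :]                       # last token's last hidden layer embedding
        repeated = summary.unsqueeze(1).expand(-1, tokens.shape[1], -1)
        logits = self.lm_head(self.decoder(repeated))
        # the decoder is trained to regenerate the entire input string
        loss = F.cross_entropy(logits.transpose(1, 2), tokens)
        return loss, summary
```

In this setup `summary` plays the same role as the last hidden layer embedding used for retrieval above; the observation in the commit is that even when reconstruction is trained to below 2.0 cross-entropy, this embedding is no better for retrieval learning than one taken from an untrained model.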
