Where's the second expert per token per layer? #2261
Closed
enn-nafnlaus started this conversation in General
Replies: 1 comment
-
I'm looking into the code right now:
b5f882c
I was under the impression that Mixtral uses 8 experts, with two chosen per token per layer, but I'm not seeing that in the code. In BlockSparseMoE's forward function, it runs `weights, selected_experts = torch.topk(all_probs, self.top_k, dim=-1)`, then flattens the experts, which as far as I can tell yields one expert per token (per layer). And there's only one BlockSparseMoE per MixtralDecoderLayer. Where is it supposedly running two separate experts for each token, to then add and norm? Maybe I'm just missing it...
I was thinking about extending it to (per layer) add a dot product between the experts, to assess the similarity of their output vectors, on the hypothesis that this could serve as a proxy measurement for hallucination: when the model is good at a task, different experts should reach roughly the same conclusion, but when the model is bad at a task, different experts may reach wildly different conclusions. That dot product could then be used as a scalar on a cross product of the output vector and a new (learned) vector, followed by add + norm, then training with examples of proper responses to situations with varying levels of uncertainty.
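Roughly what I have in mind, as a sketch (the names here are made up, and it assumes the two experts' outputs can be captured separately before they're combined; I use cosine similarity as the normalized dot product, and an elementwise product as a stand-in for the cross product):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def uncertainty_update(expert_out_a, expert_out_b, learned_vec, norm):
    # expert_out_a, expert_out_b: (seq_len, model_dim) -- the two chosen
    # experts' outputs for each token, captured before they're summed.
    # learned_vec: a new learned (model_dim,) vector; norm: e.g. nn.LayerNorm.
    combined = expert_out_a + expert_out_b
    # Normalized dot product per token: ~1 when the experts agree, lower when not.
    agreement = F.cosine_similarity(expert_out_a, expert_out_b, dim=-1)  # (seq_len,)
    # Scale the (elementwise) product of the output and the learned vector by
    # the disagreement, so the correction grows when the experts diverge.
    update = (1.0 - agreement).unsqueeze(-1) * (combined * learned_vec)
    return norm(combined + update)  # add + norm

seq_len, model_dim = 4, 8
out = uncertainty_update(
    torch.randn(seq_len, model_dim), torch.randn(seq_len, model_dim),
    nn.Parameter(torch.randn(model_dim)), nn.LayerNorm(model_dim))
```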
But perhaps I'm misunderstanding how the model works and maybe this isn't possible...
-
Oh, duh. self.top_k is 2. So the dims of selected_experts are (sequence_length, 2). Flattened out, it becomes sequence_length * 2. It loses this shape with gather and then comes back to it with scatter, and we're left with dimensions (2 * sequence_length, model_dim), which can be read as e.g. two full hidden-state vectors (model_dim in length) for each token in the sequence.
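To sanity-check the shapes, here's a toy walkthrough (made-up sizes, with repeat_interleave standing in for the gather step; this is not the actual mistral code):

```python
import torch

seq_len, model_dim, n_experts, top_k = 4, 8, 8, 2
hidden = torch.randn(seq_len, model_dim)
all_probs = torch.softmax(torch.randn(seq_len, n_experts), dim=-1)

weights, selected_experts = torch.topk(all_probs, top_k, dim=-1)
print(selected_experts.shape)  # torch.Size([4, 2]) -- (sequence_length, 2)

flat_experts = selected_experts.flatten()
print(flat_experts.shape)      # torch.Size([8]) -- the flattened expert ids

# Each token's hidden state is duplicated once per selected expert, so the
# expert MLPs process 2 * sequence_length rows in total:
expanded = hidden.repeat_interleave(top_k, dim=0)
print(expanded.shape)          # torch.Size([8, 8]) -- (2 * seq_len, model_dim)

# Pretend the expert MLPs ran; their results come back with the same
# (2 * sequence_length, model_dim) shape, i.e. two full model_dim-length
# vectors per token, which the routing weights fold back together:
expert_out = expanded  # placeholder for the experts' output
combined = (expert_out.view(seq_len, top_k, model_dim)
            * weights.unsqueeze(-1)).sum(dim=1)
print(combined.shape)          # torch.Size([4, 8]) -- (seq_len, model_dim)
```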