
The claim about splitting a circuit across two heads #25

Open
Mihonarium opened this issue Oct 5, 2024 · 1 comment

Mihonarium (Contributor) commented Oct 5, 2024

I first pointed this out on Slack back in 2023, but it seems not to have been fixed.

There's this claim in the intro to mech interp material:

[screenshot of the claim]

The OV circuits being rank-64 is not the full reason why the circuit gets split across two heads. You can easily train a rank-64 matrix (two factors, 768x64 @ 64x768) to reach 98.9% top-1 and 99.4% top-5 accuracy on the full OV circuit, which is better than the top-5 accuracy of the model's combined rank-128 matrix.
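A minimal sketch of this kind of experiment (not the exact code from the gist linked below; I'm using TransformerLens here for convenience, and the hyperparameters are illustrative):

```python
# Minimal sketch: train a rank-64 stand-in for a head's W_V @ W_O so that the
# full circuit W_E @ (A @ B) @ W_U copies tokens, i.e. the argmax of row t is t.
# Hyperparameters and step counts are illustrative, not the gist's exact values.
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2", device=device)

W_E = model.W_E.detach()  # (d_vocab, d_model) = (50257, 768)
W_U = model.W_U.detach()  # (d_model, d_vocab)
d_vocab, d_model = W_E.shape
d_head = 64

# Trainable low-rank factors standing in for a single head's W_V @ W_O
A = torch.nn.Parameter(0.02 * torch.randn(d_model, d_head, device=device))
B = torch.nn.Parameter(0.02 * torch.randn(d_head, d_model, device=device))
opt = torch.optim.Adam([A, B], lr=1e-3)  # note: no weight decay

for step in range(5_000):
    toks = torch.randint(0, d_vocab, (1024,), device=device)
    logits = W_E[toks] @ A @ B @ W_U  # a batch of rows of the full OV circuit
    loss = torch.nn.functional.cross_entropy(logits, toks)  # "be the identity"
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():  # top-1 / top-5 copying accuracy on a random sample
    toks = torch.randint(0, d_vocab, (2048,), device=device)
    top5 = (W_E[toks] @ A @ B @ W_U).topk(5, dim=-1).indices
    print("top-1:", (top5[:, 0] == toks).float().mean().item())
    print("top-5:", (top5 == toks[:, None]).any(-1).float().mean().item())
```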

Since a single rank-64 matrix can achieve a much better approximation of the desired 50K x 50K matrix than the rank-128 matrix the model actually uses, the two heads probably aren't just blindly copying tokens, and the rank limit, by itself, may not be a good explanation for the split; the heads are probably doing something else as well.
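For comparison, the copying accuracy of the model's own combined rank-128 OV circuit can be measured the same way. A sketch, where the (layer, head) indices are placeholders rather than necessarily the heads discussed in the material:

```python
# Sketch: copying accuracy of the model's own combined OV circuit for two heads.
# The (layer, head) pairs below are placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
heads = [(9, 6), (9, 9)]  # placeholder (layer, head) pairs

with torch.no_grad():
    # Each head's W_V @ W_O has rank <= d_head = 64, so the sum has rank <= 128.
    W_OV = sum(model.W_V[l, h] @ model.W_O[l, h] for l, h in heads)

    toks = torch.randint(0, model.cfg.d_vocab, (2048,))
    logits = model.W_E[toks] @ W_OV @ model.W_U
    top5 = logits.topk(5, dim=-1).indices
    print("top-1:", (top5[:, 0] == toks).float().mean().item())
    print("top-5:", (top5 == toks[:, None]).any(-1).float().mean().item())
```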

(I was using the MLAB2 w2d4 notebook, so the code might be somewhat incompatible, but it's pretty straightforward and should be easy to reproduce: https://gist.github.com/Mihonarium/7b4b9a4a17c8f1b1c67dc143b9225d53.)

callummcdougall (Owner) commented Oct 15, 2024

Thanks for adding this and supplying the link! I'm sure it's possible to train this directly, but models' attention heads aren't trained to be faithful copying circuits; they're trained to predict next tokens, and having high-fidelity copying heads to implement things like induction circuits is just one of many ways to do that. I also notice that your attached training code specifically trains this head to be the identity matrix, using a tailored loss function and no weight decay, which is an easier setting in which to learn this specific pattern.

Additionally, it's often the case that two different components both start to learn some task X before either learns to do it across the full dataset, leaving the capability split across heads. For example, one head might learn copying as part of an induction circuit, while another learns copying so it can repeat names or proper nouns within a sentence; if these cases don't overlap, both capabilities could plausibly be learned independently. However, I do agree that these heads are very plausibly doing something other than copying!
