Self attention for pooling linear classifier #28
base: master
Conversation
This PR will introduce a `BiAttentionPoolingClassifier` as in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), following the discussion with @sebastianruder in Teams. I ran out of memory on my 1060 while testing the attention module, but was able to at least verify that it is functionally correct. Some changes might be required to ensure that the tensor passed to `self.layers` has the right shape (I'm not quite sure yet). I'll move everything to Colab for testing and see if that helps.
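For readers of this thread, here is a minimal sketch of what such a classifier could look like. This is an assumption-laden illustration, not the PR's actual code: the class layout, dimensions, dropout, and the use of PyTorch's `nn.MultiheadAttention` are all illustrative choices.

```python
import torch
import torch.nn as nn

class BiAttentionPoolingClassifier(nn.Module):
    # Hypothetical sketch: multi-head self-attention over the LM's hidden
    # states, followed by concat pooling (last / max / mean) and a linear
    # head. Sizes and layer choices are assumptions, not the PR's code.
    def __init__(self, emb_dim=400, n_heads=4, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        # concat pooling yields 3 * emb_dim features
        self.layers = nn.Sequential(
            nn.BatchNorm1d(3 * emb_dim),
            nn.Dropout(0.2),
            nn.Linear(3 * emb_dim, n_classes),
        )

    def forward(self, hidden):                # hidden: (batch, seq_len, emb_dim)
        attended, _ = self.attn(hidden, hidden, hidden)
        pooled = torch.cat([attended[:, -1],               # last timestep
                            attended.max(dim=1).values,    # max pool
                            attended.mean(dim=1)], dim=1)  # mean pool
        return self.layers(pooled)
```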
Thanks for this! Looks good to me. Didn't expect that you'd go with multi-head attention right away (was thinking of regular attention), but that should be fine.
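For comparison, the "regular" attention mentioned here could be a single learned query that scores each timestep of the LM output, with the weighted sum replacing (or complementing) max/mean pooling. The sketch below is an assumption for illustration only; dimensions and names are not from the repository.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    # Sketch of single-query ("regular") attention pooling over LM outputs.
    # The 400-dim embedding size is an assumption.
    def __init__(self, emb_dim=400):
        super().__init__()
        self.score = nn.Linear(emb_dim, 1, bias=False)

    def forward(self, hidden):                               # (batch, seq_len, emb_dim)
        weights = torch.softmax(self.score(hidden), dim=1)   # (batch, seq_len, 1)
        return (weights * hidden).sum(dim=1)                 # (batch, emb_dim)
```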
The OOM issue persists even on Colab with 11GB of GPU memory.
It appears that I have run into a memory leak.
I am beginning to implement various options of attention on top of ULMFiT, so naturally I've looked at this code. I don't really understand how the attention module is used here.
One intuitive reason why I think this could be helpful is that we had planned to use XNLI by concatenating the premise and the hypothesis, so it is possible that we learn some premise-to-hypothesis attention through it. What do you think? The dimensionality is preserved, of course, but I don't think that's a big problem. I agree that a more "meaningful" way of applying attention is to attend over the hidden-layer outputs of the forward and backward LMs; in fact, applying attention to the concatenation of the pooling outputs was somewhat foolish of me. What I'll do instead is attend only over a concatenation of the forward and backward LM outputs and also reduce the number of attention heads (which should solve the memory problem). I'll work on it this weekend and update. Feel free to add to this PR if you have ideas for improving it. If you'd like to try a different experiment with attention, that's great too!
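A minimal sketch of that revised plan, assuming PyTorch's `nn.MultiheadAttention`, illustrative shapes, and hypothetical names (`attend_and_pool`, `fwd_out`, `bwd_out` are not from the PR):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: concatenate the forward and backward LM outputs along
# the feature axis, run a small multi-head self-attention over the sequence
# (fewer heads to keep memory in check), then apply the usual concat pooling.
emb_dim, n_heads = 400, 2
attn = nn.MultiheadAttention(2 * emb_dim, n_heads, batch_first=True)

def attend_and_pool(fwd_out, bwd_out):
    # fwd_out, bwd_out: (batch, seq_len, emb_dim) from the two LMs
    hidden = torch.cat([fwd_out, bwd_out], dim=-1)        # (batch, seq_len, 2*emb_dim)
    attended, _ = attn(hidden, hidden, hidden)
    return torch.cat([attended[:, -1],                    # last timestep
                      attended.max(dim=1).values,         # max pool
                      attended.mean(dim=1)], dim=1)       # (batch, 6*emb_dim)
```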
@Dust0x I think all approaches are worth testing. Recently I have been experimenting with different variants of attention, applied to the LM outputs before pooling, on the IMDb task. I've pushed two variants to a small (for now messy) repo, ulmfit_experiments - maybe it will be of some help. Some observations:
Please let me know if you have any thoughts on the subject.
The memory "leak" was my own fault; I changed the way I was using attention and that fixed it. The self-attention module seems to be working okay on the tests I ran locally, so I'll start benchmarking now. @sebastianruder @PiotrCzapla, are there any specific datasets you would like to see results on? @tpietruszka it's very odd that you get the same accuracy on IMDb across all your experiments. Is it possible that somewhere the classification head is hard-coded to …