Self attention for pooling linear classifier #28

Open
wants to merge 3 commits into
base: master

aayux
Contributor

@aayux aayux commented Jan 6, 2019

Add a `BiAttentionPoolingClassifier` (self-attention for the pooling linear classifier), as in [Attention is all you need](https://arxiv.org/abs/1706.03762), following the discussion with @sebastianruder in Teams.

I ran out of memory on my 1060 while testing the attention module, but was able to at least verify that it is functionally correct. Some changes might be required to ensure that the tensor passed to `self.layers` has the right shape (but I'm not quite sure as of now).

I'll move everything to Colab for testing and see if that helps.

aayux added 2 commits January 6, 2019 23:58
Collaborator

@sebastianruder sebastianruder left a comment

Thanks for this! Looks good to me. Didn't expect that you'd go with multi-head attention right away (was thinking of regular attention), but that should be fine.

@aayux
Contributor Author

aayux commented Jan 14, 2019

The OOM issue persists even on Colab with 11GB of GPU memory.

RuntimeError: CUDA out of memory. Tried to allocate 8.41 GiB (GPU 0; 11.17 GiB total capacity; 10.26 GiB already allocated; 518.56 MiB free; 80.50 MiB cached)

It appears that I have run into a memory leak.
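
For context, a single very large allocation like the one in the error above does not have to be a leak: the score tensor of multi-head self-attention grows quadratically with sequence length. The following is a rough back-of-the-envelope sketch; the batch size, head count, and sequence lengths are hypothetical and are not taken from this PR.

```python
def attention_scores_gib(batch_size, num_heads, seq_len, bytes_per_element=4):
    """Size in GiB of one (batch, heads, seq_len, seq_len) float32 score tensor."""
    num_elements = batch_size * num_heads * seq_len * seq_len
    return num_elements * bytes_per_element / 2**30


# Hypothetical shapes, chosen only to show the quadratic growth in seq_len.
for seq_len in (500, 1000, 2000):
    size = attention_scores_gib(batch_size=32, num_heads=8, seq_len=seq_len)
    print(f"seq_len={seq_len}: {size:.2f} GiB")
```

Doubling the sequence length quadruples this tensor, and autograd may keep several intermediates of the same size alive, so long documents with many heads can exhaust an 11 GB card quickly.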

@tpietruszka
Contributor

I am beginning to implement several attention variants on top of ULMFiT, so obviously I've looked at this code. I do not really understand how attention is used here.

  1. I thought that attention would be applied along the sequence length, over the different RNN outputs, in place of mean/max pooling and taking the last output.

  2. As mentioned, I thought about using attention instead of pooling, reducing the dimensions of the network. Here it is used with key=value=query, so if I understand correctly, it preserves the dimensionality, computing a representation of each item in the context of all the other items? I guess I just don't understand; is there an intuitive explanation of what it does?

  3. I thought about using attention in the way I have described above. In such a case, I think the query tensor should be learnable (or multiple tensors for multiple heads). Since this is a classification scenario, the mechanism I want to achieve is: attention returning the most relevant RNN outputs for the classification task at hand (instead of taking a mean/max...). Does that make sense?
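
The pooling idea in points 1 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class name `AttentionPooling`, the `(batch, seq_len, hidden_dim)` input layout, and all sizes are hypothetical.

```python
from typing import Optional

import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Pool RNN outputs over time using one learnable query per head (hypothetical)."""

    def __init__(self, hidden_dim: int, num_heads: int = 1):
        super().__init__()
        # nn.Parameter registers the queries, so they are trained with the model.
        self.queries = nn.Parameter(torch.randn(num_heads, hidden_dim) * hidden_dim ** -0.5)

    def forward(self, outputs: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # outputs: (batch, seq_len, hidden_dim); mask: (batch, seq_len), True for real tokens.
        scores = torch.einsum("bsh,kh->bks", outputs, self.queries)  # (batch, heads, seq_len)
        scores = scores / outputs.size(-1) ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        weights = scores.softmax(dim=-1)      # attention over time steps
        pooled = torch.bmm(weights, outputs)  # (batch, heads, hidden_dim)
        return pooled.flatten(1)              # (batch, heads * hidden_dim)


pool = AttentionPooling(hidden_dim=400, num_heads=2)
rnn_outputs = torch.randn(4, 50, 400)  # stand-in for RNN outputs
pooled = pool(rnn_outputs)             # sequence dimension is reduced away
```

Unlike self-attention with key=value=query, the output here has no sequence dimension, so it can replace (or be concatenated with) the mean/max/last-output features fed to the linear layers.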

@aayux
Contributor Author

aayux commented Feb 4, 2019

@tpietruszka

One intuitive reason why I think this could be helpful is that the way we had planned on using XNLI was by concatenating the premise and the hypothesis -- so it is possible that we learn some premise-to-hypothesis attention through it. What do you think?

Of course, the dimensionality is preserved but I don't think that's a big problem.

I agree that a more "meaningful" way of applying attention is to attend on the hidden layer outputs from the forward and backward LMs. In fact, applying attention to the concatenation of the pooling outputs was somewhat foolish of me.

What I'll do instead is only attend to a concatenation of the forward and backward LM outputs and also reduce the number of attention heads (which should solve the memory problem). I'll work on it this weekend and update.
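
A rough sketch of that plan, assuming the forward and backward LM outputs are already aligned position by position and use the `(seq_len, batch, hidden)` layout. The sizes and variable names are hypothetical, and `torch.nn.MultiheadAttention` is used here only as a stand-in for the attention module in this PR.

```python
import torch
import torch.nn as nn

seq_len, batch, hidden = 50, 4, 400  # hypothetical sizes

fwd_out = torch.randn(seq_len, batch, hidden)  # stand-in for forward LM outputs
bwd_out = torch.randn(seq_len, batch, hidden)  # stand-in for backward LM outputs

# Concatenate along the feature dimension: (seq_len, batch, 2 * hidden).
both = torch.cat([fwd_out, bwd_out], dim=-1)

# Fewer heads means fewer (seq_len x seq_len) score matrices per example.
attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=2)

# Self-attention: query, key and value are all the same tensor.
attended, weights = attn(both, both, both)

print(attended.shape)  # same shape as the input: dimensionality is preserved
print(weights.shape)   # (batch, seq_len, seq_len), averaged over heads
```

Because the output keeps the sequence dimension, a pooling step is still needed before the linear classifier layers.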

Feel free to add to this PR if you have ideas on improving. If you'd like to try a different experiment with attention, that's great too!

@tpietruszka
Contributor

@Dust0x I think all approaches are worth testing...

Recently I have been experimenting with different variants of attention, applied to the LM outputs before pooling, on the IMDb task. I've pushed two variants to a small (for now messy) repo, ulmfit_experiments; maybe it will be of some help.

Some observations:

  1. Whatever I do, I seem to end up with accuracy between 94% and 95%, for both uni- and bi-directional models. It is quite frustrating.
  2. I think attention might be helpful where there are fewer labeled examples, but it needs further testing.
  3. One possible interpretation of the fact that changing the classifier head's architecture does not change much: the 'bottleneck' is the language model, not the classifier head. But then adding bidirectionality should help, and it does not.
  4. In early versions I also had a GPU memory leak. It seems to be solved now, though I'm not sure how. I think it was related to some parameters not being correctly registered as part of a module (and, I guess, not being de-allocated when appropriate).
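
On the registration point in item 4, here is a small sketch of what "not registered" means in PyTorch. The two classes are hypothetical; the sketch only shows how registration can be checked, and does not establish that this was the cause of the leak.

```python
import torch
import torch.nn as nn


class Unregistered(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # A plain tensor and a plain Python list are invisible to nn.Module.
        self.query = torch.randn(dim, requires_grad=True)
        self.heads = [nn.Linear(dim, dim) for _ in range(2)]


class Registered(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # nn.Parameter and nn.ModuleList register everything with the parent module.
        self.query = nn.Parameter(torch.randn(dim))
        self.heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])


n_unregistered = len(list(Unregistered(8).parameters()))
n_registered = len(list(Registered(8).parameters()))
print(n_unregistered)  # 0
print(n_registered)    # 5: the query plus weight and bias of each linear layer
```

Unregistered tensors are skipped by `model.parameters()`, `model.to(device)`, and `state_dict()`, so they are neither optimized nor moved and saved with the rest of the model.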

Please let me know if you have any thoughts on the subject

@aayux
Contributor Author

aayux commented Feb 14, 2019

The memory "leak" was my own fault. I changed the way I was using attention and that fixed it.

The self-attention module seems to be working okay on the tests I ran locally, so I'll start benchmarking now. @sebastianruder @PiotrCzapla are there any specific datasets that you would like to see results on?

@tpietruszka it's very odd that you should get the same accuracy on IMDb across all your experiments. Is it possible that somewhere the classification head is hard-coded to BiPoolingLinearClassifier and it's defaulting to it every time? I'm only suggesting this because this was something that came up when I was experimenting too, and of course it's possible that you have already checked.
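
One way to rule this out is to make head selection fail loudly and to check the type that was actually built. This is a sketch only: the two classes are empty stand-ins for the real heads, and the `HEADS` registry and `build_head` helper are hypothetical, not part of the repository.

```python
class BiPoolingLinearClassifier:
    """Stand-in for the default pooling head (real constructor arguments omitted)."""


class BiAttentionPoolingClassifier:
    """Stand-in for the attention head proposed in this PR."""


# Hypothetical registry; the repository may select the head differently.
HEADS = {
    "pooling": BiPoolingLinearClassifier,
    "attention": BiAttentionPoolingClassifier,
}


def build_head(name):
    """Return the requested head, raising instead of silently using a default."""
    if name not in HEADS:
        raise ValueError(f"unknown classifier head {name!r}; expected one of {sorted(HEADS)}")
    return HEADS[name]()


head = build_head("attention")
# Check the type that was actually built before comparing accuracy numbers.
assert isinstance(head, BiAttentionPoolingClassifier), type(head).__name__
```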

@aayux aayux changed the title [WIP] Self attention for pooling linear classifier Self attention for pooling linear classifier Mar 18, 2020

3 participants