Research: Embedding Aware Attention #122
Comments
Great issue! Reading the paper, the original implementation, and this one, I had the same question about whether embeddings from the same column are used together. Have you benchmarked that already, somehow?
I guess if we take the mean or the max we will lose the sparsity property... I am not sure, but maybe we can torch.stack the features instead of torch.cat them, and apply the sparsemax in this new dimension (see the sketch below). I would like to test this idea and maybe we can evaluate the approaches together.
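A minimal sketch of that stacking idea, assuming for illustration that every categorical feature is embedded with the same dimension and that the Sparsemax module shipped with pytorch-tabnet is used; the tensor shapes and the choice of dim=1 are assumptions, not the repository's actual attention code:

```python
import torch
from pytorch_tabnet.sparsemax import Sparsemax

batch_size, n_features, emb_dim = 32, 10, 4

# One (batch, emb_dim) attention score tensor per original feature
# (hypothetical scores; the same embedding dimension is assumed for all features).
per_feature_scores = [torch.randn(batch_size, emb_dim) for _ in range(n_features)]

# torch.cat would flatten everything to (batch, n_features * emb_dim),
# letting sparsemax zero out individual embedding dimensions independently.
# torch.stack keeps a feature axis instead: (batch, n_features, emb_dim).
stacked = torch.stack(per_feature_scores, dim=1)

# Apply sparsemax along the feature axis so the competition for attention
# happens between original features rather than between embedding dimensions.
mask = Sparsemax(dim=1)(stacked)

print(mask.shape)  # torch.Size([32, 10, 4])
```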
Hello @joseluismoreira, there are two PRs on this topic already:
I guess the sparsity is not very important since you will only break the sparsity for embeddings from the same feature, so in the end you'll still be looking at only a few original features.
Thank you, @Optimox. PR 217 seems to be what I'm looking for. 👍
After sparsemax, only 1% of the features are nonzero. Is this normal?
Hello @W-void, the goal of the sparsemax activation is to get a sparse mask with a lot of 0 values, so yes, it is expected. However, if you think the masks are too sparse you can play with different parameters:
Let me know if this helped.
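The specific parameters were not preserved above, but as a hedged illustration, these are the pytorch-tabnet constructor arguments that typically influence how sparse the masks are (the values shown are assumptions about a reasonable starting point, not a recommendation from the thread):

```python
from pytorch_tabnet.tab_model import TabNetClassifier

# Hypothetical settings that tend to make the masks less sparse.
clf = TabNetClassifier(
    n_steps=3,            # number of decision steps, i.e. how many masks are produced
    gamma=1.3,            # relaxation factor; larger values let features be reused across steps
    lambda_sparse=1e-4,   # lower sparsity regularization -> less aggressive masking
    mask_type="entmax",   # entmax usually produces softer masks than sparsemax
)
```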
@Optimox, thanks! I have set
I have a question about the masks and the "explain" method. I saw in TabModel.py that the model has an "explain" method. I wonder how different masks are produced for each instance of the test data, since the "explain" method does not train the model. Do we apply the masks that were learned up to the last epoch of training to each row of the test data?
The model learns to use its attention layer during training. At inference time, for each row, the trained model decides which columns should be masked or not.
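As a hedged illustration of what this looks like with the library's public API (the synthetic data and the exact shapes printed below are assumptions; check the repository's documentation for the authoritative signature):

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Tiny synthetic dataset, purely for illustration.
X_train = np.random.rand(256, 10).astype(np.float32)
y_train = np.random.randint(0, 2, size=256)
X_test = np.random.rand(32, 10).astype(np.float32)

clf = TabNetClassifier()
clf.fit(X_train, y_train, max_epochs=5)

# explain() only runs a forward pass; no weights are updated.
# It returns a per-row feature importance matrix and the per-step attention masks.
explain_matrix, masks = clf.explain(X_test)

print(explain_matrix.shape)        # (32, 10): aggregated importances per test row
for step, mask in masks.items():   # one mask per decision step
    print(step, mask.shape)
```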
Main Problem
When training with large embedding dimensions, the mask size goes up.
One problem I see is that sparsemax does not know which post-embedding columns come from the same original embedded column, which could make things harder for the model to learn (illustrated below).
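A small illustration of the issue, with made-up feature and dimension counts (none of these numbers come from the issue):

```python
# Suppose we have 5 categorical features, each embedded into 16 dimensions,
# plus 3 numerical features (hypothetical numbers for illustration).
n_categorical, emb_dim, n_numerical = 5, 16, 3

post_embedding_dim = n_categorical * emb_dim + n_numerical
print(post_embedding_dim)  # 83

# The attentive transformer produces one mask value per post-embedding column,
# so sparsemax selects among 83 columns and may keep some dimensions of an
# embedding while zeroing out others, even though all 16 belong to one feature.
```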
Proposed Solutions
It's an open problem, but one approach I see as promising is to create embedding-aware attention.
The idea would be to mask all dimensions coming from the same embedding in the same way, either by using the mean or the max of the initial mask (a sketch follows below).
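A minimal sketch of that reduction, assuming the mask has shape (batch, post_embedding_dim) and that we know which contiguous slice of columns each embedding occupies; the group layout and helper below are hypothetical and not taken from the linked PR:

```python
import torch

def embedding_aware_mask(mask, groups, reduce="mean"):
    """Give every column of the same embedding the same mask value.

    mask:   (batch, post_embedding_dim) output of the attentive transformer.
    groups: one slice of columns per original feature (hypothetical layout).
    reduce: "mean" or "max" over each embedding's columns.
    """
    new_mask = mask.clone()
    for group in groups:
        block = mask[:, group]
        if reduce == "mean":
            value = block.mean(dim=1, keepdim=True)
        else:
            value = block.max(dim=1, keepdim=True).values
        new_mask[:, group] = value.expand_as(block)
    return new_mask

# Example: two embedded features of 4 dimensions each and one numerical column.
mask = torch.rand(8, 9)
groups = [slice(0, 4), slice(4, 8), slice(8, 9)]
shared_mask = embedding_aware_mask(mask, groups, reduce="max")
```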
I implemented a first version here: #92
If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!