Hi! I have a few questions about the differences between the models.
I understand how the recursive model is set up; it is described in the publication. But how is training made efficient in batch mode? As far as I understand, since the attention matrix is never explicitly computed, a triangular (causal) mask cannot simply be applied. How does this work, then? Is it iterative, as in the recursive model, just implemented in CUDA? Or is it easily parallelizable as three matrix multiplications (as in full attention)? I've put a rough sketch of what I mean below.
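To be concrete, this is roughly what I have in mind when I say "batch mode without materializing the attention matrix": the generic prefix-sum trick used for causal kernelized/linear attention, where causality comes from running sums of the per-position outer products instead of a triangular mask. This is only my own sketch under that assumption, not necessarily how it is implemented in this repo; the function name `causal_linear_attention` and the `elu(x) + 1` feature map are just illustrative choices.

```python
import torch

def causal_linear_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim); q and k are assumed to already be
    passed through a positive feature map (e.g. elu(x) + 1)."""
    # Per-position outer products k_t v_t^T: shape (B, T, D, D)
    kv = torch.einsum('btd,bte->btde', k, v)
    # Causal running sums S_t = sum_{s<=t} k_s v_s^T and z_t = sum_{s<=t} k_s
    S = kv.cumsum(dim=1)                        # (B, T, D, D)
    z = k.cumsum(dim=1)                         # (B, T, D)
    # Numerator q_t^T S_t and normalizer q_t^T z_t for every position t
    num = torch.einsum('btd,btde->bte', q, S)   # (B, T, D)
    den = torch.einsum('btd,btd->bt', q, z).unsqueeze(-1).clamp(min=1e-6)
    return num / den

# Usage (shapes only, values are random):
# B, T, D = 2, 128, 64
# q = torch.nn.functional.elu(torch.randn(B, T, D)) + 1
# k = torch.nn.functional.elu(torch.randn(B, T, D)) + 1
# v = torch.randn(B, T, D)
# out = causal_linear_attention(q, k, v)        # (B, T, D)
```

Of course, the naive cumsum above keeps a (B, T, D, D) tensor in memory, so I'd expect a practical implementation to do this chunk-wise or in a fused CUDA kernel; that trade-off is exactly what I'm asking about.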
Thanks!