Hi! I have a few questions about the differences between the models.
I understand how the recursive model is set up; it is described in the publication. But how is training made efficient in batch mode? As far as I understand, since the attention matrix is never explicitly computed, a triangular (causal) mask cannot simply be applied. How does this work, then? Is it iterative, as in the recursive model, just implemented in CUDA? Or is it easily parallelizable as three matrix multiplications (as in full attention)? I've put a rough sketch of what I mean below.
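To be concrete, this is roughly what I have in mind when I say "batch mode without materializing the attention matrix": the generic prefix-sum trick used for causal kernelized/linear attention, where causality comes from running sums of the per-position outer products instead of a triangular mask. This is only my own sketch under that assumption, not necessarily how it is implemented in this repo; the function name `causal_linear_attention` and the `elu(x) + 1` feature map are just illustrative choices.

```python
import torch

def causal_linear_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim); q and k are assumed to already be
    passed through a positive feature map (e.g. elu(x) + 1)."""
    # Per-position outer products k_t v_t^T: shape (B, T, D, D)
    kv = torch.einsum('btd,bte->btde', k, v)
    # Causal running sums S_t = sum_{s<=t} k_s v_s^T and z_t = sum_{s<=t} k_s
    S = kv.cumsum(dim=1)                        # (B, T, D, D)
    z = k.cumsum(dim=1)                         # (B, T, D)
    # Numerator q_t^T S_t and normalizer q_t^T z_t for every position t
    num = torch.einsum('btd,btde->bte', q, S)   # (B, T, D)
    den = torch.einsum('btd,btd->bt', q, z).unsqueeze(-1).clamp(min=1e-6)
    return num / den

# Usage (shapes only, values are random):
# B, T, D = 2, 128, 64
# q = torch.nn.functional.elu(torch.randn(B, T, D)) + 1
# k = torch.nn.functional.elu(torch.randn(B, T, D)) + 1
# v = torch.randn(B, T, D)
# out = causal_linear_attention(q, k, v)        # (B, T, D)
```

Of course, the naive cumsum above keeps a (B, T, D, D) tensor in memory, so I'd expect a practical implementation to do this chunk-wise or in a fused CUDA kernel; that trade-off is exactly what I'm asking about.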
Thanks!