Papers club from the AI team at D-ID. This round covers transformers, from attention to vision. All lectures are in Hebrew.
Lecture | Paper / Resource | Year | Why is it interesting? | Assignee | Recording | Presentation |
---|---|---|---|---|---|---|
Transformers are worth your attention | Attention Is All You Need | 2017 | The paper that started it all: an introduction to the basic concept & a comparison to previous methods such as RNNs. The transformer here has both encoder & decoder layers, creating a seq2seq model | @matan-feldman | zoom (K%32MLKi) | slides |
Transformer tricks - Positional Encoding, Layer Norm, Residual Connections. In code! | The Annotated Transformer | 2017 | A deep dive into the various tricks used to make transformers work. Implementing a transformer without them would lead to poor results | self-work | x | x |
Visualizing Attention | Visualizing Attention in Transformer-Based Language Representation Models; On the Relationship between Self-Attention and Convolutional Layers | 2017 | Attention is useful for explainability too: we can see what the network is using for the task. In this lecture we will examine visualizations of this in NLP & vision | self-work | x | x |
BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2018 | This model from Google uses only encoders & achieved state of the art on many NLP tasks | @leong-deid | zoom (LW$8fQ6f) | slides |
GPT | Language Models are Few-Shot Learners | 2020 | This model from OpenAI (GPT-3) uses only decoders & achieved state-of-the-art text generation. Its predecessor, GPT-2, was initially withheld by its authors, who said it was too dangerous to release. A descendant of the model (Codex) is the backbone of GitHub Copilot | self-work | x | x |
Wav2Vec U | Wav-to-vec U - Unsupervised Speech Recognition | 2021 | This unsupervised model from Facebook is able to learn language representations. We use the supervised version in our A2K input | @matan-feldman | zoom (p.qE+Q59) | slides |
DETR for object detection & segmentation | End-to-End Object Detection with Transformers | 2020 | Taking transformers even further, to other CV tasks. In this paper the authors from Facebook AI combine a CNN with transformers to reduce some of the human priors needed in designing object detection & segmentation models | @talbenh | zoom (17K%NSf3) | slides |
ViT | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | 2020 | Treats the image as a sentence whose words are 16x16 patches. The model achieves SoTA on classification tasks with significantly less compute, given supervised pre-training on large-scale image datasets | @alon-mengi | zoom (ve1VHEM=) | slides |
CLIP | Learning Transferable Visual Models From Natural Language Supervision | 2021 | An OpenAI model that learns two encoders, one for images & one for text. Via contrastive learning it achieves SoTA results on image classification while dramatically increasing robustness over previous methods. It uses internet-scraped data instead of expensive annotated datasets | @amitay-nachmani | zoom (^a1!1BJf) | slides |
Perceiver | Perceiver IO: A General Architecture for Structured Inputs & Outputs | 2021 | Perceiver models use cross-attention & learned latent dictionaries to work on many modalities by reducing the complexity of self-attention. The authors demonstrate that the model reaches baseline-level results on many tasks | @orgoro | zoom (Ba9DQ&Ef) | slides |
DALL·E 2 & Imagen | Hierarchical Text-Conditional Image Generation with CLIP Latents | 2022 | DALL·E 2, from OpenAI, is an AI system that can create realistic images and art from a description in natural language. The model uses CLIP embeddings & diffusion models to generate images from a text description. Google also came up with a competing model called Imagen, whose authors claim superior quality | @talbenh | zoom (7m&@&QRv) | slides |
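
Since every lecture in the table builds on attention, here is a minimal warm-up sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in Attention Is All You Need. It is not taken from any of the lectures: the function names and the toy tokens are made up for illustration, and it uses plain Python lists with a single head, no masking and no learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of float vectors.

    weights[i][j] is how much query i attends to key j.
    Hypothetical helper for illustration only.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim_v)]
        )
    return outputs, weights


# Toy self-attention: three 2-dimensional tokens attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and printing it gives the kind of attention map that the Visualizing Attention lecture looks at. A real transformer would first project the tokens into separate query, key and value spaces and run several heads in parallel.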