
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, & Neil Houlsby (2020)

arXiv.

DOI: https://doi.org/10.48550/arxiv.2010.11929

Abstract. Introduces the Vision Transformer (ViT), which splits an image into fixed-size patches, treats the patches as a sequence, and processes that sequence with a standard transformer encoder. When pretrained on very large datasets (such as ImageNet-21k or JFT-300M), ViT matches or exceeds the performance of convolutional neural networks (CNNs) on image classification.
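
The patch-sequence idea in the abstract can be illustrated with a minimal sketch in plain Python. The function name `image_to_patches` and the toy 4x4 single-channel image are assumptions made for illustration only; this is not the authors' implementation, and it omits the learned linear projection, position embeddings, and class token that precede the transformer encoder.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns a list of patches in
    row-major order, each patch a flat list of patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    # Append all channel values of this pixel.
                    patch.extend(image[row][col])
            patches.append(patch)
    return patches


# Toy example: a 4x4 image with one channel, pixel value = row * 4 + col.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
print(len(patches))   # number of patches in the sequence
print(patches[0])     # first (top-left) patch, flattened
```

With 16x16 patches, as in the title, a 224x224 RGB image becomes a sequence of 196 patches, each flattened to 768 values.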

Tags: transformer vit vision

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale