Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, & Neil Houlsby (2020)
arXiv.
DOI: https://doi.org/10.48550/arxiv.2010.11929
Abstract. Introduces the Vision Transformer (ViT), which splits an image into a sequence of fixed-size patches and processes that sequence with a standard Transformer encoder. When pretrained on very large datasets (such as ImageNet-21k or JFT-300M) and transferred to smaller benchmarks, ViT matches or exceeds state-of-the-art CNNs on image classification.
Tags: transformer vit vision
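Note. The patch-sequence idea in the abstract can be illustrated with a minimal sketch, assuming NumPy and an image stored as a (height, width, channels) array. The function name `image_to_patches`, the 224x224 input size, and the patch size of 16 are illustrative assumptions, not details taken from this record. The sketch covers only the splitting step; the learned linear projection, position embeddings, and the encoder itself are omitted.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches.

    Hypothetical helper for illustration only.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (rows, P, cols, P, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Group the two grid axes first: (rows, cols, P, P, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch, in row-major patch order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Illustrative input: a 224x224 RGB image (assumed sizes).
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` plays the role of one "word" in the input sequence; a transformer encoder would consume these rows after a linear projection.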