This repository contains a simple implementation of the Vision Transformer (ViT) model described in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy et al. The Vision Transformer treats an image as a sequence of fixed-size patches and applies a standard transformer architecture to those patch tokens for image classification.
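As a minimal sketch of the "image as a sequence of patches" idea, the snippet below splits an image into non-overlapping 16x16 patches and flattens each patch into a token vector, which is the first step before ViT's linear projection and transformer layers. The helper name `patchify` is illustrative, not part of this repository's API:

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. the token sequence ViT feeds to its patch-embedding projection.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Reshape into a (grid_h, patch, grid_w, patch, C) grid of patches...
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # ...group the two grid axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image yields 14x14 = 196 tokens of 16*16*3 = 768 values each.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

In the full model, each 768-dimensional patch vector is then linearly projected to the transformer's hidden dimension and combined with positional embeddings.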
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al.