In this repository, we will:
- Explore the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale".
- Implement the vision transformer model from scratch.
- Train the model on the CIFAR-10 dataset and use it to make predictions.

Vision Transformers (ViTs) are a neural network architecture introduced as an alternative to Convolutional Neural Networks (CNNs) for computer vision tasks. They apply the self-attention mechanism, which has been successful in natural language processing, to image data by treating an image as a sequence of patches.
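
To make the "16x16 words" idea concrete, here is a minimal sketch of how an image can be cut into fixed-size patches that play the role of tokens. It is an illustration only, not this repository's implementation: the function name `image_to_patches` is hypothetical, the image is a plain nested list instead of a tensor, and the patch size of 16 is taken from the paper's title (a model trained on 32x32 CIFAR-10 images may well use a smaller patch size).

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns (H / P) * (W / P) patches in row-by-row order, each a flat
    list of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten every pixel (and every channel) inside this patch.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# A dummy 32 x 32 RGB image, the CIFAR-10 resolution.
image = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]

# Assumption: patch size 16, as in the paper's title.
patches = image_to_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 4 768
```

Each flattened patch is then typically projected to an embedding vector, and the resulting sequence is what self-attention operates on, in the same way it operates on word tokens in a language model.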