EsViT

### Paper
Implement [Efficient Self-supervised Vision Transformers for Representation Learning (EsViT)](https://arxiv.org/abs/2106.09785).

### TODOs
- [ ] develop the model in a scalable way (using `*Block` syntax and `nn` modules)
- [ ] draw the model architecture in Figma
- [ ] write the documentation
- [ ] test the model
- [ ] get the pretrained weights, if available