This is a repo of my experiments and notes while learning about LLMs. I'm starting with a decent theoretical understanding of neural networks and hands-on experience training large models on distributed systems, and I'm very comfortable with data and ML engineering.
I've read many papers and a few books in the deep learning and LLM space, but I've never committed to learning things deeply or hands-on. I plan to change that.
Here are all the things I'd like to do:
- Andrej Karpathy's Neural Networks: Zero to Hero guide.
- Thoroughly read The Annotated Transformer and run its code side-by-side.
- Explore the Tensor2Tensor repo.
- Implement FlashAttention myself (in CUDA, maybe?)
- Implement FSDP myself (no idea how!?)
- Model efficiency experiments. Try out the following and benchmark the performance changes:
  - Speculative decoding
  - Knowledge distillation
  - Quantization
  - Pruning
  - Sparsity / low-rank compression
  - etc.
- Play around with LLaMA models locally
- Reread: FlashAttention 1, 2, and PagedAttention
  - FlashAttention 3
  - Depthwise Separable Convolutions for NMT
  - One Model To Learn Them All
  - Self-Attention with Relative Position Representations
  - Self-attention Does Not Need O(n²) Memory
  - Online softmax papers
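One of the efficiency experiments above is quantization. As a starting point, here's a minimal NumPy sketch of symmetric per-tensor int8 quantization (the function names and the NumPy-only setup are my own, not from any particular library):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0  # assumes x is not all zeros
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -1.5, 0.7, 2.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
# round-to-nearest error is bounded by half a quantization step
max_err = float(np.abs(x - x_hat).max())
```

Real frameworks add per-channel scales, zero-points for asymmetric ranges, and calibration, but this captures the core scale/round/clip idea to benchmark against.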
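The online softmax trick from the papers above (the running-normalizer rescaling that FlashAttention builds on) fits in a few lines. This is a NumPy illustration of the recurrence, not a reference implementation:

```python
import numpy as np

def online_softmax(x):
    """Compute the softmax normalizer in a single streaming pass.

    Keeps a running max m and running sum d of exp(x_i - m); whenever a
    larger max arrives, the old sum is rescaled by exp(m_old - m_new).
    """
    m, d = -np.inf, 0.0
    for v in x:
        m_new = max(m, v)
        d = d * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    # second pass just to materialize the probabilities
    return np.exp(np.asarray(x, dtype=np.float64) - m) / d

probs = online_softmax([1.0, 2.0, 3.0, 1000.0])  # stable despite the huge logit
```

The point of the one-pass recurrence is that m and d can be accumulated tile by tile without ever seeing the full row, which is exactly what FlashAttention exploits.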