tinygrad 0.9.0
Close to the new line limit of 8000 lines, sitting at 7958 lines. tinygrad is much more usable now. Just over 1200 commits since 0.8.0.
Release Highlights
- New documentation: https://docs.tinygrad.org
- `gpuctypes` has been brought in tree and is no longer an external dependency. [#3253]
- `AMD=1` and `NV=1` experimental backends that do not require any userspace runtime components like ROCm or CUDA (usage sketch below).
  - These backends should reduce Python overhead, especially in multi-GPU use cases.
- `PTX=1` for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
- Nvidia tensor core support. [#3544]
- `THREEFRY=1` for numpy-less random number generation using threefry2x32. [#2601] [#3785]
- More stabilized multi-tensor API (sharding sketch below).
- Core tinygrad has been refactored into 4 pieces; read more about it here.
- The linearizer and codegen now support generating kernels with multiple outputs.
- Lots of progress towards greater kernel fusion in the scheduler.
  - ReduceOps are fused with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
  - The new LoadOps.ASSIGN allows fusing optimizer updates with the gradient computation (training-step sketch below).
  - Kernels are scheduled in BFS order. This improves resnet and llama speed.
  - W.I.P. on fusing multiple reduces: [#4259] [#4208]
- MLPerf ResNet and BERT, with a W.I.P. UNet3D.
- Llama 3 support with a new `llama3.py` that provides an OpenAI-compatible API (client sketch below). [#4576]
- NF4 quantization support in Llama examples. [#4540]
- `label_smoothing` has been added to `sparse_categorical_crossentropy` (usage sketch below). [#3568]
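
A minimal sketch of opting into the new backends and the threefry RNG. These variables are normally set on the command line (e.g. `NV=1 python3 train.py`); setting them from Python before importing tinygrad, as done here, is an assumption about when they are read.

```python
import os
os.environ["NV"] = "1"        # or AMD=1 for the AMD backend; PTX=1 applies to the CUDA backend
os.environ["THREEFRY"] = "1"  # numpy-less threefry2x32 random number generation

from tinygrad import Tensor, Device

print(Device.DEFAULT)   # should report the selected backend
x = Tensor.rand(4, 4)   # random numbers generated with threefry when THREEFRY=1
print(x.numpy())
```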
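
A sketch of the multi-tensor API using `Tensor.shard`, assuming a machine with at least two devices of the default backend; the shapes and device strings are illustrative only.

```python
from tinygrad import Tensor, Device

# two devices of the default backend (assumes a multi-GPU machine)
devices = (f"{Device.DEFAULT}:0", f"{Device.DEFAULT}:1")

x = Tensor.rand(256, 64).shard(devices, axis=0)  # split along the batch axis
w = Tensor.rand(64, 10).shard(devices)           # replicated on both devices
out = (x @ w).sum()
print(out.numpy())
```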
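
A toy training step illustrating where the ASSIGN fusion helps: the weight update performed inside `opt.step()` can now fuse with the gradient kernels. Shapes and hyperparameters are arbitrary.

```python
from tinygrad import Tensor
from tinygrad.nn.optim import SGD

W = Tensor.rand(16, 4, requires_grad=True)
opt = SGD([W], lr=0.01)

with Tensor.train():                       # enable training mode
    loss = (Tensor.rand(8, 16) @ W).sum()  # toy forward pass
    opt.zero_grad()
    loss.backward()
    opt.step()                             # weight assign can fuse with the grad kernels
```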
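
A hedged client sketch against the `llama3.py` server: the host, port, and model name below are assumptions, not values from these notes; any OpenAI-compatible client should work once the server is running.

```python
import requests

# hypothetical endpoint: adjust host/port to wherever you started examples/llama3.py
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama3",  # hypothetical model name
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```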
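
A small sketch of the new `label_smoothing` parameter; the batch size, class count, and smoothing value are arbitrary.

```python
from tinygrad import Tensor

logits = Tensor.rand(32, 10)                  # batch of 32, 10 classes
labels = Tensor([i % 10 for i in range(32)])  # integer class labels
loss = logits.sparse_categorical_crossentropy(labels, label_smoothing=0.1)
print(loss.numpy())
```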
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.