Releases: tinygrad/tinygrad
tinygrad 0.10.0
A significant under-the-hood update. Over 1200 commits since 0.9.2. Now at 9937 lines.
Release Highlights
- `VIZ=1` to show how rewrites are happening; try it, 0 python dependencies! (See the sketch after this list.)
- 3 new backends
- More Tensor Cores
- Core refactors
- Removal of symbolic, it's just UOp rewrite now
- Many refactors with EXPAND, VECTORIZE, and INDEX
- Progress toward the replacement of `LazyBuffer` with `UOp`
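A minimal way to try the viewer (the file name `viz_demo.py` is just an example): run any small tinygrad program with `VIZ=1` set and the rewrite viewer should open.

```python
# viz_demo.py -- run as `VIZ=1 python viz_demo.py`
from tinygrad import Tensor

x = Tensor.rand(4, 4)
y = (x @ x).relu()
print(y.numpy())  # realizing the output triggers the kernel rewrites that VIZ=1 visualizes
```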
See the full changelog: v0.9.2...v0.10.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.9.2
Small changes. Over 700 commits since 0.9.1.
Release Highlights
- Experimental Monte Carlo Tree Search when `BEAM>=100` (see the sketch after this list). [#5598]
- `TRANSCENDENTAL>=2`, or by default on `CLANG` and `LLVM`, to provide `sin`, `log2`, and `exp2` approximations. [#5187]
- When running with `DEBUG>=2` you now see the tensor ops that are part of a kernel. [#5271]
- `PROFILE=1` for a profiler when using HCQ backends (`AMD`, `NV`).
- Refactor `Linearizer` to `Lowerer`. [#4957]
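As a hedged sketch of trying the MCTS search on a single kernel (the matmul and its 512x512 shapes are arbitrary):

```python
# run as e.g. `BEAM=100 DEBUG=2 python mcts_demo.py` (the file name is illustrative)
from tinygrad import Tensor

a, b = Tensor.rand(512, 512), Tensor.rand(512, 512)
# with BEAM>=100 the kernel search uses MCTS; DEBUG>=2 also prints the tensor ops in the kernel
(a @ b).realize()
```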
See the full changelog: v0.9.1...v0.9.2
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.9.1
Now sitting at 7844 lines, fewer than last release.
Looking to tag releases more often.
Over 320 commits since 0.9.0.
Release Highlights
- Removal of the HSA backend, defaulting to AMD. [#4885]
- tinychat, a pretty simple llm web ui. [#4869]
- SDXL example. [#5206]
- A small tqdm replacement (sketched below). [#4846]
- NV/AMD profiler using perfetto. [#4718]
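A hedged sketch of the tqdm replacement; the `tinygrad.helpers` import path is my assumption, see #4846 for where it actually landed:

```python
import time
from tinygrad.helpers import tqdm  # assumed import path for the in-tree replacement

for _ in tqdm(range(100)):
  time.sleep(0.01)  # stand-in for real work; a progress bar renders with no external tqdm dependency
```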
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.
See the full changelog: v0.9.0...v0.9.1
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.9.0
Close to the new line limit of 8000 lines, sitting at 7958 lines.
tinygrad is much more usable now.
Just over 1200 commits since 0.8.0.
Release Highlights
- New documentation: https://docs.tinygrad.org
- `gpuctypes` has been brought in tree and is no longer an external dependency. [#3253]
- `AMD=1` and `NV=1` experimental backends that don't require any userspace runtime components like ROCm or CUDA.
  - These backends should reduce the amount of python time, specifically with multi-gpu use cases.
- `PTX=1` for rendering directly to ptx instead of cuda. [#3139] [#3623] [#3775]
- Nvidia tensor core support. [#3544]
- `THREEFRY=1` for numpy-less random number generation using threefry2x32. [#2601] [#3785]
- More stabilized multi-tensor API.
- Core tinygrad has been refactored into 4 pieces, read more about it here.
- Linearizer and codegen have support for generating kernels with multiple outputs.
- Lots of progress towards greater kernel fusion in the scheduler.
  - Fusing of ReduceOps with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
  - New LoadOps.ASSIGN allows fusing optimizer updates with grad.
  - Schedule kernels in BFS order. This improves resnet and llama speed.
  - W.I.P. for fusing multiple reduces: [#4259] [#4208]
- MLPerf ResNet and BERT, with a W.I.P. UNet3D.
- Llama 3 support with a new `llama3.py` that provides an OpenAI-compatible API. [#4576]
- NF4 quantization support in Llama examples. [#4540]
- `label_smoothing` has been added to `sparse_categorical_crossentropy` (sketched below). [#3568]
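A minimal sketch of the new `label_smoothing` argument (shapes, values, and the 0.1 factor are made up for illustration):

```python
from tinygrad import Tensor

logits = Tensor.rand(8, 10)                # batch of 8 samples, 10 classes
labels = Tensor([0, 1, 2, 3, 4, 5, 6, 7])  # integer class targets
# label_smoothing=0.1 moves 10% of the target probability mass off the true class
loss = logits.sparse_categorical_crossentropy(labels, label_smoothing=0.1)
print(loss.numpy())
```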
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.
See the full changelog: v0.8.0...v0.9.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.8.0
Close to the new limit of 5000 lines, at 4981.
Release Highlights
- Real dtype support within kernels!
- New `.schedule()` API to separate the concerns of scheduling and running (sketched below).
- New lazy.py implementation that doesn't reorder at build time. `GRAPH=1` is usable to debug issues.
- 95 TFLOP FP16->FP32 matmuls on 7900XTX.
- GPT2 runs (jitted) in 2 ms on NVIDIA 3090.
- Powerful and fast kernel beam search with `BEAM=2`.
- GPU/CUDA/HIP backends switched to `gpuctypes`.
- New (alpha) multigpu sharding API with `.shard`.
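A hedged sketch of what the scheduling/running split looks like, assuming `.schedule()` is exposed on Tensor as the notes suggest:

```python
from tinygrad import Tensor

out = (Tensor.rand(16, 16) + 1).sum()
sched = out.schedule()  # build the list of kernels to run, without running them
print(len(sched), "kernels scheduled")  # actually executing them is a separate concern
```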
See the full changelog: v0.7.0...v0.8.0
Join the Discord!
tinygrad 0.7.0
Bigger again at 4311 lines :( But tons of new features this time!
Just over 500 commits since 0.6.0
.
Release Highlights
- Windows support has been dropped to focus on Linux and Mac OS.
  - Some functionality may work on Windows but no support will be provided; use WSL instead.
- DiskTensors: a way to store tensors on disk has been added.
  - This is coupled with functionality in `state.py` which supports saving/loading safetensors and loading torch weights (see the sketch at the end of this section).
- Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
  - Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
  - Tensor Core behaviour/usage is controlled by the `TC` envvar.
- Kernel optimization with nevergrad.
  - This optimizes the shapes going into the kernel, gated by the `KOPT` envvar.
- P2P buffer transfers are supported on most AMD GPUs when using a single python process.
  - This is controlled by the `P2P` envvar.
- LLaMA 2 support.
  - A requirement of this is bfloat16 support for loading the weights, which is semi-supported by casting them to float16; proper bfloat16 support is tracked at #1290.
  - The LLaMA example now also supports 8-bit quantization via the `--quantize` flag.
- Most MLPerf models have working inference examples. Training these models is currently being worked on.
- Initial multigpu training support.
  - Slow multigpu training by copying through host shared memory.
  - Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
  - See the hlb_cifar10.py example.
- SymbolicShapeTracker and Symbolic JIT.
  - These two combined allow models with changing shapes, like transformers, to be jitted.
  - This means that LLaMA can now be jitted for a massive increase in performance.
  - Be warned that the API for this is very WIP and may change in the future, as may the rest of the tinygrad API.
- aarch64 and ptx assembly backends.
- WebGPU backend; see the `compile_efficientnet.py` example.
- Support for torch-like tensor indexing by other tensors.
- Some more `nn` layers were promoted, namely `Embedding` and various `Conv` layers.
- VITS and so-vits-svc examples added.
- Initial documentation work.
  - Quickstart guide: /docs/quickstart.md
  - Environment variable reference: /docs/env_vars.md
And lots of small optimizations all over the codebase.
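For the DiskTensors/`state.py` item above, a hedged sketch of a safetensors round trip; the exact import paths in this release are my assumption:

```python
from tinygrad.tensor import Tensor
from tinygrad.nn.state import safe_save, safe_load  # import path is an assumption for 0.7.0

tensors = {"weight": Tensor.rand(4, 4)}
safe_save(tensors, "model.safetensors")  # write to a .safetensors file on disk
loaded = safe_load("model.safetensors")  # read it back (backed by disk tensors)
print(loaded["weight"].shape)
```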
See the full changelog: v0.6.0...v0.7.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.6.0
2516 lines now. Some day I promise a release will make it smaller.
- float16 support (needed for LLaMA)
- Fixed critical bug in training BatchNorm
- Limited support for multiple GPUs
- ConvNeXt + several MLPerf models in models/
- More torch-like methods in tensor.py
- Big refactor of the codegen into the Linearizer and CStyle
- Removed CompiledBuffer, use the LazyBuffer ShapeTracker
tinygrad 0.5.0
An upsetting 2223 lines of code, but so much great stuff!
- 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
- A TinyJit for speed (decorate your GPU function today; sketched after this list)
- Support for a lot of onnx, including all the models in the backend tests
- No more MLOP convs, all HLOP (autodiff for convs)
- Improvements to shapetracker and symbolic engine
- 15% faster at running the openpilot model
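A minimal TinyJit sketch; the import location is my assumption (it may differ by release), and shapes must stay fixed across calls:

```python
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit  # assumed import path for this release

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x @ x).relu().realize()  # the jitted function must realize its outputs

for _ in range(3):
  out = step(Tensor.rand(8, 8))  # after warm-up calls, captured kernels replay with new inputs
```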
tinygrad 0.4.0
So many changes since 0.3.0. Fairly stable and correct, though still not fast. The hlops/mlops are solid; it just needs work on the llops.
The first automated release, so hopefully it works?