Releases: tinygrad/tinygrad

tinygrad 0.10.0

19 Nov 00:48

A significant under-the-hood update.
Over 1200 commits since 0.9.2.
The codebase now sits at 9937 lines.

Release Highlights

  • VIZ=1 to show how rewrites are happening; try it (a usage sketch follows this list)
  • 0 Python dependencies!
    • Switch from numpy random to threefry, removing numpy [#6116]
    • Switch from pyobjc to ctypes for Metal, removing pyobjc [#6545]
  • 3 new backends
    • QCOM=1 HCQ backend for runtime speed on Adreno 630 [#5213]
    • CLOUD=1 for remote tinygrad [#6964]
    • DSP=1 backend on Qualcomm devices (alpha) [#6112]
  • More Tensor Cores
    • Apple AMX support [#5693]
    • Intel XMX tensor core support [#5622]
  • Core refactors
    • Removal of symbolic; it's just UOp rewrites now
    • Many refactors with EXPAND, VECTORIZE, and INDEX
    • Progress toward the replacement of LazyBuffer with UOp
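
A minimal way to try the visualizer, as a sketch: set VIZ=1 on any tinygrad program and open the page it serves (the script below is arbitrary; check the console output for the URL).

    # run as: VIZ=1 python viz_demo.py   (the filename is just an example)
    from tinygrad import Tensor

    a = Tensor.rand(64, 64)
    b = Tensor.rand(64, 64)
    (a @ b).realize()  # with VIZ=1, the rewrites behind this kernel are visualized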

See the full changelog: v0.9.2...v0.10.0

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.9.2

13 Aug 23:19

Small changes.
Over 700 commits since 0.9.1.

Release Highlights

  • Experimental Monte Carlo Tree Search for kernel search when BEAM>=100 [#5598] (a flag-usage sketch follows this list)
  • TRANSCENDENTAL>=2 (on by default for CLANG and LLVM) provides sin, log2, and exp2 approximations. [#5187]
  • When running with DEBUG>=2 you now see the tensor ops that are part of a kernel [#5271]
  • PROFILE=1 for a profiler when using HCQ backends (AMD, NV)
  • Refactor Linearizer to Lowerer [#4957]
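
All of these are plain environment variables, so any tinygrad program can exercise them. A hedged sketch (the matmul is arbitrary):

    # BEAM=100  python bench.py   -> Monte Carlo Tree Search kernel search
    # DEBUG=2   python bench.py   -> prints the tensor ops behind each kernel
    # PROFILE=1 python bench.py   -> profiler output on HCQ backends (AMD, NV)
    from tinygrad import Tensor

    x = Tensor.rand(256, 256)
    y = Tensor.rand(256, 256)
    print((x @ y).numpy().sum())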

See the full changelog: v0.9.1...v0.9.2

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.9.1

29 Jun 03:16

Now sitting at 7844 lines, fewer than the last release.
Looking to tag releases more often.

Over 320 commits since 0.9.0.

Release Highlights

Known Issues

  • Using tinygrad in a conda env on macOS is known to cause problems with the METAL backend. See #2226.

See the full changelog: v0.9.0...v0.9.1

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.9.0

28 May 18:48

Close to the new line limit of 8000 lines, sitting at 7958 lines.
tinygrad is much more usable now.

Just over 1200 commits since 0.8.0.

Release Highlights

  • New documentation: https://docs.tinygrad.org
  • gpuctypes has been brought in-tree and is no longer an external dependency. [#3253]
  • AMD=1 and NV=1 experimental backends that don't require any userspace runtime components like ROCm or CUDA.
    • These backends should reduce Python overhead, especially in multi-GPU use cases.
  • PTX=1 for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
  • NVIDIA tensor core support. [#3544]
  • THREEFRY=1 for numpy-less random number generation using threefry2x32. [#2601] [#3785]
  • A more stable multi-tensor API.
  • Core tinygrad has been refactored into 4 pieces, read more about it here.
  • The Linearizer and codegen now support generating kernels with multiple outputs.
  • Lots of progress towards greater kernel fusion in the scheduler.
    • Fusing of ReduceOps with their elementwise children. This trains MNIST and GPT-2 with ~20% fewer kernels and makes LLaMA inference faster.
    • New LoadOps.ASSIGN allows fusing optimizer updates with grad.
    • Schedule kernels in BFS order. This improves ResNet and LLaMA speed.
    • Work in progress on fusing multiple reduces: [#4259] [#4208]
  • MLPerf ResNet and BERT, with a work-in-progress UNet3D
  • Llama 3 support with a new llama3.py that provides an OpenAI-compatible API. [#4576]
  • NF4 quantization support in Llama examples. [#4540]
  • label_smoothing has been added to sparse_categorical_crossentropy. [#3568] (see the sketch after this list)
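
A minimal sketch of the new label_smoothing argument (shapes and values here are arbitrary):

    from tinygrad import Tensor

    logits = Tensor.rand(4, 10)    # batch of 4, 10 classes
    labels = Tensor([3, 1, 0, 7])  # integer class labels
    loss = logits.sparse_categorical_crossentropy(labels, label_smoothing=0.1)
    print(loss.numpy())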

Known Issues

  • Using tinygrad in a conda env on macOS is known to cause problems with the METAL backend. See #2226.

See the full changelog: v0.8.0...v0.9.0

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.8.0

09 Jan 18:16

Close to the new limit of 5000 lines, at 4981.

Release Highlights

  • Real dtype support within kernels!
  • New .schedule() API to separate concerns of scheduling and running
  • The new lazy.py implementation doesn't reorder at build time; GRAPH=1 can be used to debug issues
  • 95 TFLOPS FP16->FP32 matmuls on the 7900 XTX
  • GPT-2 runs (jitted) in 2 ms on an NVIDIA 3090
  • Powerful and fast kernel beam search with BEAM=2
  • GPU/CUDA/HIP backends switched to gpuctypes
  • New (alpha) multi-GPU sharding API with .shard (see the sketch after this list)
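
A hedged sketch of the alpha .shard API: split a tensor across devices along an axis. The device names below are assumptions; substitute whatever devices your setup exposes.

    from tinygrad import Tensor

    t = Tensor.rand(256, 256)
    t2 = t.shard(("GPU:0", "GPU:1"), axis=0)  # half the rows on each device (names assumed)
    print(t2.sum().numpy())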

See the full changelog: v0.7.0...v0.8.0

Join the Discord!

tinygrad 0.7.0

27 Aug 16:40

Bigger again at 4311 lines :( But tons of new features this time!

Just over 500 commits since 0.6.0.

Release Highlights

  • Windows support has been dropped to focus on Linux and macOS.
    • Some functionality may work on Windows but no support will be provided, use WSL instead.
  • DiskTensors: a way to store tensors on disk has been added.
    • This is coupled with functionality in state.py, which supports saving/loading safetensors and loading torch weights (see the sketch after this list).
  • Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
    • Support on the 7900 XTX requires weights and data to be in float16, full float16 compute support will come in a later release.
    • Tensor Core behaviour/usage is controlled by the TC envvar.
  • Kernel optimization with nevergrad
    • This optimizes the shapes going into the kernel, gated by the KOPT envvar.
  • P2P buffer transfers are supported on most AMD GPUs when using a single python process.
    • This is controlled by the P2P envvar.
  • LLaMA 2 support.
    • A requirement of this is bfloat16 support for loading the weights, which is semi-supported by casting them to float16; proper bfloat16 support is tracked at #1290.
    • The LLaMA example now also supports 8-bit quantization using the flag --quantize.
  • Most MLPerf models have working inference examples. Training these models is currently being worked on.
  • Initial multi-GPU training support.
    • Slow multi-GPU training by copying through host shared memory.
    • Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
    • See the hlb_cifar10.py example.
  • SymbolicShapeTracker and Symbolic JIT.
    • Combined, these allow models with changing shapes, like transformers, to be jitted.
    • This means that LLaMA can now be jitted for a massive increase in performance.
    • Be warned that the API for this is very WIP and may change in the future, as may the rest of the tinygrad API.
  • aarch64 and PTX assembly backends.
  • WebGPU backend, see the compile_efficientnet.py example.
  • Support for torch-like tensor indexing by other tensors.
  • Some more nn layers were promoted, namely Embedding and various Conv layers.
  • VITS and so-vits-svc examples added.
  • Initial documentation work.
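
A sketch of the safetensors round-trip. The import paths follow current tinygrad, where these helpers live in tinygrad.nn.state; at the 0.7.0 tag they may sit in state.py directly, and the Linear model is just a stand-in.

    from tinygrad.nn import Linear
    from tinygrad.nn.state import get_state_dict, load_state_dict, safe_save, safe_load

    model = Linear(4, 2)
    safe_save(get_state_dict(model), "model.safetensors")   # save weights to disk
    load_state_dict(model, safe_load("model.safetensors"))  # load them back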

And lots of small optimizations all over the codebase.

See the full changelog: v0.6.0...v0.7.0

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.6.0

26 May 01:02

2516 lines now. Some day I promise a release will make it smaller.

  • float16 support (needed for LLaMA; see the sketch after this list)
  • Fixed a critical bug in BatchNorm training
  • Limited support for multiple GPUs
  • ConvNeXt + several MLPerf models in models/
  • More torch-like methods in tensor.py
  • Big refactor of the codegen into the Linearizer and CStyle
  • Removed CompiledBuffer; use the LazyBuffer ShapeTracker
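
A tiny sketch of the float16 support (the dtypes import follows current tinygrad and may have differed at this tag):

    from tinygrad import Tensor, dtypes

    x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.float16)
    print((x * 2).numpy())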

tinygrad 0.5.0

07 Mar 02:21

An upsetting 2223 lines of code, but so much great stuff!

  • 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
  • A TinyJit for speed (decorate your GPU function today; see the sketch after this list)
  • Support for a lot of ONNX, including all the models in the backend tests
  • No more MLOP convs, all HLOP (autodiff for convs)
  • Improvements to shapetracker and symbolic engine
  • 15% faster at running the openpilot model
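
A minimal TinyJit sketch (the import path follows current tinygrad; at this tag it may have been tinygrad.jit). Decorate a function whose outputs are realized Tensors; the first calls trace and capture kernels, later calls replay them with far less Python overhead.

    from tinygrad import Tensor, TinyJit

    @TinyJit
    def step(x: Tensor) -> Tensor:
        return (x @ x).relu().realize()

    out = None
    for _ in range(4):  # warm-up calls trace, later calls replay
        out = step(Tensor.rand(64, 64))
    print(out.numpy().shape)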

tinygrad 0.4.0

08 Nov 16:49

So many changes since 0.3.0.

Fairly stable and correct, though still not fast. The hlops/mlops are solid; the llops still need work.

The first automated release, so hopefully it works?