Non-ggml backend #31
Comments
Not sure this one is related (as opposed to pickle)? https://github.com/huggingface/safetensors#yet-another-format- |
|
So it looks something like this, am I right? Maybe we should add this to some md file for newcomers (me!). 🤔 graph TD;
A("PyTorch") --"<pre>1️⃣/2️⃣ export_state_dict_checkpoint.py</pre>PyTorch model checkpoints (pth)"--> B(Python) --"<pre>3️⃣ convert-pth-to-ggml.py</pre>ggml-format weights (ggml)"--> C(C++)--"<pre>4️⃣ quantize.cpp</pre>Quantized ggml (bin)"-->D(Rust);
1️⃣ tloen/alpaca-lora/export_state_dict_checkpoint.py (llama-7b-hf) |
Also worth keeping an eye on: @Narsil's https://github.com/Narsil/smelte-rs. |
This is another one that could possibly be worth looking at: https://github.com/coreylowman/dfdx One thing about it is that it seems pretty hard to load models where things like the array dimensions or structure are dynamic. I looked at smelte for other stuff too, but one big con at the moment is that it says it's single threaded. So I don't think it would even be able to get close to the current approach, on CPU at least. |
You'd be surprised :) matmul is still linked against mkl, which is multi-threaded and makes the overall thing fast enough. Even ggml uses threading only for a few select ops, not for all of them. |
Look into Rust's tch crate, which is a high-level deep learning library built on top of PyTorch. PyTorch has built-in support for f16 and q4, so tch may be able to support those formats. |
|
Hey I've started seeing if the code from ggml couldn't be done in pure Rust, here's the first draft: https://github.com/Narsil/rblas It's x86_64, avx-only right now and I'm getting 2x slower than intel-mkl on my old personal computer.
Not sure if I screwed something up in the translation, if the f32 matmul of ggml just isn't as good as intel-mkl, or if my threading policy sucks (I'm using a simple threadpool, which isn't using spinlocks under the hood afaik). Also, threadpool and num_cpus can be removed as dependencies; they just make my life and the code easier. Still, it's there if people find that interesting to work on. |
I'm not sure if it's the same for GPT (I assume it would be) but at least with RWKV the vast, vast majority of the time was spent just in the matrix multiplication. The rest was basically insignificant. You probably can just do simple/unoptimized versions of the other ops and come very close to equal performance as long as the MM part is fast. |
The softmax and layer norm can start to take up some time when not threaded.
I'm beating cblas by ~30% using this code on M1, so I guess it's not that bad. |
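For readers who want to see what "simple/unoptimized versions of the other ops" might look like, here is a minimal single-threaded sketch of softmax and layer norm over one row. The function names and the omission of a learned scale and bias are assumptions made for illustration; this is not code from ggml or from any crate mentioned in the thread.

```rust
/// Numerically stable softmax over a single row, computed in place.
fn softmax(row: &mut [f32]) {
    if row.is_empty() {
        return;
    }
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// Layer norm over a single row, computed in place.
/// Assumption: no learned scale or bias is applied.
fn layer_norm(row: &mut [f32], eps: f32) {
    if row.is_empty() {
        return;
    }
    let n = row.len() as f32;
    let mean = row.iter().sum::<f32>() / n;
    let var = row.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    for x in row.iter_mut() {
        *x = (*x - mean) * inv_std;
    }
}
```

Both functions make a few passes over each row, so their cost grows linearly with row length, while matrix multiplication grows with the product of three dimensions. That is consistent with the observation that these ops only start to matter once the matmul is fast.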
Did you take a look at how https://github.com/Noeda/rllama works? |
Just tossing another idea around, another choice of backend would be to use On my Linux machine, it gives the same performance (both speed and CPU utilization) as Intel MKL, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong. One thing is that it was not able to run on a Mac M1 machine. |
This is indeed true and a testament to its creator. |
@Narsil Are you sure about the statement on noncontiguous calls? In my experiments, which use your ggblas bench script but with faer-core version = 0.8.0 and a change in matrix dims, faer-core still gives the same performance as Intel MKL for both matmul and matmul_t The result logs
I also got similar results for gemm backend (instead of faer-core backend). I also checked that they are indeed calculating AB^T |
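To make the matmul versus matmul_t distinction concrete, here is a naive sketch of both for row-major f32 matrices. The signatures are assumptions for illustration and are not the ggblas, faer-core, or gemm APIs. In `matmul_t` the second operand is stored as B (n x k), so each output element is a dot product of two contiguous rows.

```rust
/// C = A * B, with A (m x k) and B (k x n), all row-major.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            // The inner loop walks one row of B and one row of C contiguously.
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}

/// C = A * B^T, with A (m x k) and B (n x k), all row-major.
fn matmul_t(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), n * k);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        let a_row = &a[i * k..(i + 1) * k];
        for j in 0..n {
            let b_row = &b[j * k..(j + 1) * k];
            // Dot product of two contiguous rows.
            c[i * n + j] = a_row.iter().zip(b_row).map(|(x, y)| x * y).sum::<f32>();
        }
    }
    c
}
```

A quick way to check that a backend really computes AB^T is to compare `matmul_t(a, b_transposed, ...)` against `matmul(a, b, ...)` on small inputs.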
Interesting numbers (they seem pretty high, are you modifying shapes?)
Interestingly, mkl is performing much worse today on my computer (not sure if I updated since). In any case, I thought there was potentially some nice upgrade possible on faer-rs to ideally get the best of both worlds |
Yeap I did modify
This seemed to decrease the variance/mean ratio (not so much). |
Thanks. I don't have much time to add new features on it. In general mkl used to be the best runtime for all engines I tried. |
Hey this is the dfdx author. I recently added f16 support for both cpu & gpu (on main currently waiting for new release from I also moved the CPU matrix multiplication to using Happy to add anything that you guys would need to use it! Edit: I was getting ~30ms/token on an A10 GPU with llama-7b |
Awesome! The main things we'd want that I don't believe Our current thinking on this is to implement a computation graph abstraction (#137), and then have that shell out to alternate backends as required or as available. I'd love to see |
I feel forced to say that this approach has major drawbacks, the biggest of all being that it's hard to implement efficient runtimes. And if someone has a super clever idea to do computations more efficiently, well, it's much harder to implement in those graphs, since you have to speak an entirely new language (the graph structure). My very personal opinion is that we shouldn't have "graph computation" models, but real code, as real code is a full descriptor of your graph (no need to reinvent the wheel there). It's fully qualified already, and there are already great ways to modify any part you want without having to understand the graph structure. If I want to reimplement a given model with a new kernel X that can replace existing ops, there's an obvious way to do it (rewrite the corresponding code). That is very much not easy to do on computation graphs. Ofc, for training, in order to get automatic differentiation, we need a computation graph. PyTorch seems to have made it work correctly without having a "real computation graph"; it ends up being classic Python code, which is where it wins imo. |
Yeah, both of those are accurate.
We have a tracking issue for 4 bit quantization, and someone actually was working on a draft PR for this, but it has kind of stalled out given how specific of a use case it is. So there is a fairly complex way forward, but it'll take a not insignificant amount of time. Luckily llama doesn't use all the possible tensor operations, so the MVP is probably just implementing kernels for the specific ops we'd need. Has anyone done 4 bit quantization on CUDA? Or is this specifically for CPU optimizations?
I was thinking about memory mapped tensors yesterday (probably for similar use case where CPU tensors for llama can just use memory mapping for data storage). There might be a way to do this on top of dfdx, similar to how I did the lazy tensor stuff, however it'd be really unsafe/sketchy. Basically we'd have to construct a
+1 on narsil's response. This seems like a very complex way to gain access to running on multiple devices. The easiest/hackiest way to do it would be with feature flags, something like:
#[cfg(feature = "cuda")]
// call e.g. dfdx backend
#[cfg(feature = "cpu")]
// call existing ggml backend
While you have to maintain two separate pieces of code that do the same thing, I think it's probably simpler than creating/impl'ing a graph abstraction. Thoughts? |
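A slightly fuller sketch of the feature-flag idea, assuming a Cargo feature named `cuda` (the name comes from the comment above; the functions are hypothetical). Both bodies here are plain-Rust stand-ins, since the real ones would call into dfdx or ggml. Using `not(feature = "cuda")` for the fallback keeps exactly one definition active, whichever flags are set.

```rust
/// Name of the backend selected at compile time.
#[cfg(feature = "cuda")]
fn backend_name() -> &'static str {
    "dfdx (cuda)"
}

#[cfg(not(feature = "cuda"))]
fn backend_name() -> &'static str {
    "ggml (cpu)"
}

/// Elementwise add, dispatched at compile time.
#[cfg(feature = "cuda")]
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    // Stand-in: a real build would call the GPU backend here.
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

#[cfg(not(feature = "cuda"))]
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    // Stand-in: a real build would call the existing ggml backend here.
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

The duplication is visible even in this toy: every op needs one body per backend, which is the maintenance cost being weighed against a graph abstraction.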
GPTQ does it : https://github.com/qwopqwop200/GPTQ-for-LLaMa (Triton backed, so you could steal their ptx file ! ) |
Also, I tried out some sketchy mmap stuff, and it seems like you can create a Vec from an mmap buffer. I have no idea how safe it is, but it seems to work (it produces the same generations as the regular copy version) 🤷 Was able to "load" all 13 GB of weights in 10ms on my dev laptop
PR is here coreylowman/llama-dfdx#15 |
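On the safety question: `Vec::from_raw_parts` requires memory that was allocated by the global allocator, which an mmap'd region is not, so a borrowed slice is the more defensible view. The sketch below shows only the zero-copy reinterpretation step, using a plain byte slice as a stand-in for the mapped file (the mapping itself would need a crate such as memmap2, which is left out here). The helper names are hypothetical, and this is not what the linked PR does.

```rust
/// Reinterpret a byte buffer as f32 values without copying.
/// Returns None if the buffer is misaligned or its length is not a multiple of 4.
fn view_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let align = std::mem::align_of::<f32>();
    let size = std::mem::size_of::<f32>();
    if bytes.as_ptr() as usize % align != 0 || bytes.len() % size != 0 {
        return None;
    }
    // SAFETY: the pointer is aligned for f32, the length covers whole f32 values,
    // every bit pattern is a valid f32, and the result borrows from `bytes`.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / size)
    })
}

/// View f32 values as raw bytes (u8 has no alignment requirement).
fn as_bytes(values: &[f32]) -> &[u8] {
    // SAFETY: any f32 buffer is valid to read as bytes of the same total length.
    unsafe {
        std::slice::from_raw_parts(
            values.as_ptr() as *const u8,
            values.len() * std::mem::size_of::<f32>(),
        )
    }
}
```

Because nothing is copied, "loading" costs only the pointer arithmetic; the pages are read from disk when the weights are first touched.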
My thinking on this is that we already use a computation graph, through ggml: https://github.com/rustformers/llm/blob/main/crates/models/llama/src/lib.rs#L141-L326 Replicating this graph would be no worse than the current state of affairs, and it would allow us to directly "compile" our graph to the GGML graph in a way that would let us maintain compatibility. I would need to read more into the state of affairs here before making a decision.
At present, we have five models (with a sixth hopefully coming soon). Multiplying the maintenance work by the number of backends seems intractable over time. I'd like to avoid that as much as possible, for as long as possible.
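As a rough illustration of the "one graph, many backends" idea, the sketch below defines a tiny op graph and a `Backend` trait, then evaluates the graph against whichever backend is passed in. All names (`Backend`, `CpuBackend`, `Node`, `eval`) are hypothetical and far smaller than anything the project would need; this is not the design proposed in #137.

```rust
/// Operations a backend must provide (hypothetical minimal set).
trait Backend {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
    fn mul(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

/// Reference backend in plain Rust.
struct CpuBackend;

impl Backend for CpuBackend {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
    fn mul(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x * y).collect()
    }
}

/// A node in a model's computation graph.
enum Node {
    Input(usize),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

/// Evaluate the graph once; the model definition never names a backend.
fn eval<B: Backend>(node: &Node, inputs: &[Vec<f32>], backend: &B) -> Vec<f32> {
    match node {
        Node::Input(i) => inputs[*i].clone(),
        Node::Add(l, r) => backend.add(&eval(l, inputs, backend), &eval(r, inputs, backend)),
        Node::Mul(l, r) => backend.mul(&eval(l, inputs, backend), &eval(r, inputs, backend)),
    }
}
```

With this shape, each model is written once as a graph and each backend implements the trait once, so the work grows as models plus backends instead of models times backends.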
Very cool! I'll have to give this more of a look soon 🙂 |
I've opened an issue with I imagine it will be non-trivial for them to implement a more freeform interface (if they're interested in doing so), so it may not be done/could take a long time. That being said, I would love to see non-CUDA GPU inference! |
Just listing all potential backends that come to mind, feel free to suggest more:
Note that some of these overlap and/or are at different abstraction levels. I'm just listing them out for general reference. |
https://github.com/OpenNMT/CTranslate2 is another solid choice (cpu / gpu (cuda) support, wide model support matrix) |
This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.
We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).
Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).
Unfortunately, it doesn't support the two formats we need: f16 (original weights) and q4_0/q4_1 (quantized weights). Adding these to the ndarray backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
Torch does support f16 on the GPU only, and burn's Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB, which is difficult to make work with most consumer GPUs.
So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:
1. Quantize to uint8 and use the ndarray/torch backends. This is the least work (at least in theory), but uint8 quantization performs worse than either f16 or q4, from what I've heard.
2. Add f16 to burn's ndarray backend. The torch backend should already work, but it will be very hard to test with most of our machines. Adding support to ndarray for CPU inference shouldn't be impossible either (especially if we just convert to f32 for every operation), but it will be difficult to make it performance-optimal.
3. Add q4_0/1 to burn's ndarray backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using q4 weights), but it has the same performance-optimality issue as f16 on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that ndarray and rustc will remap [[bool; 4]; N] to [u8; N/2]).
This is summarised in the following table:
[Table: backends (ndarray, torch) against formats (uint8, f16, q4); cell contents not preserved]
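On the space-optimality concern for 4-bit elements: since no native 4-bit type exists, the usual workaround is to pack two values into each u8 by hand. The sketch below shows that packing only. It is an assumption-laden illustration, not ggml's q4_0/q4_1 layout, which also stores per-block scale data; the function names are hypothetical.

```rust
/// Pack 4-bit values (each in 0..=15) two per byte, low nibble first.
/// An odd trailing value is stored with a zero high nibble.
fn pack_nibbles(values: &[u8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0f;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0f;
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack `count` 4-bit values from the packed bytes.
fn unpack_nibbles(packed: &[u8], count: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&b| [b & 0x0f, b >> 4])
        .take(count)
        .collect()
}
```

N values take ceil(N/2) bytes this way, but every kernel that reads them has to do the shifting and masking itself, which is why each op would need a specialised implementation.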
An idea that I briefly floated was porting ggml itself to Rust using c2rust and some cleanup work, but that's likely to be quite time-consuming and it locks us out of the relatively-free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.
I believe the other Rust ML crates also do not support f16 or q4, but that's from a cursory exploration. Happy to be proven wrong!
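For the "convert to f32 for every operation" fallback on f16 weights, the core step is decoding IEEE 754 half-precision bits into an f32. The sketch below does this with the standard library only. The function name is hypothetical, and in practice a crate such as `half` would likely be used instead.

```rust
/// Decode IEEE 754 binary16 bits into an f32.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;

    if exp == 0 {
        // Zero or subnormal: value = frac * 2^-24, with the sign applied.
        let magnitude = frac as f32 * 2.0f32.powi(-24);
        return if sign == 1 { -magnitude } else { magnitude };
    }

    let bits = if exp == 0x1f {
        // Infinity or NaN: all-ones exponent, payload shifted into place.
        (sign << 31) | 0x7f80_0000 | (frac << 13)
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        (sign << 31) | ((exp + 127 - 15) << 23) | (frac << 13)
    };
    f32::from_bits(bits)
}
```

Doing this per element before every operation keeps the kernels simple, at the cost of extra work on each read, which matches the performance concern raised for the ndarray option.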