Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64
Comments
@ggerganov Do you want to store k, v or K, V? |
Hello, and sorry to bother you. I made an attempt at doing this and would like to ask for correction, as I have probably gravely misunderstood something. Without dumping my entire code, I have two functions that basically boil down to memcpy-ing the contents of model.memory_[kv]->data. Am I mistaken? I'm asking simply because the result of my attempt is a state that appears entirely random, often outputting nothing at all or garbage text.
bool gptj_make_savepoint(const struct gptj_model & model, gptj_savepoint & savepoint) {
    size_t nelements = ggml_nelements(model.memory_k);
    assert(nelements == ggml_nelements(model.memory_v));
    savepoint.memory.clear();
    assert(ggml_type_size(model.memory_k->type) == sizeof(float));
    assert(ggml_type_size(model.memory_v->type) == sizeof(float));
    savepoint.memory.resize(nelements * 2);
    memcpy(
        &savepoint.memory[0],
        ggml_get_data(model.memory_k),
        sizeof(float) * nelements
    );
    memcpy(
        &savepoint.memory[nelements],
        ggml_get_data(model.memory_v),
        sizeof(float) * nelements
    );
    return true;
}
bool gptj_apply_savepoint(const gptj_savepoint & savepoint, struct gptj_model & model) {
    size_t nelements = savepoint.memory.size() / 2;
    assert(nelements == ggml_nelements(model.memory_k));
    assert(nelements == ggml_nelements(model.memory_v));
    memcpy(
        ggml_get_data(model.memory_k),
        &savepoint.memory[0],
        sizeof(float) * nelements
    );
    memcpy(
        ggml_get_data(model.memory_v),
        &savepoint.memory[nelements],
        sizeof(float) * nelements
    );
    return true;
} |
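For anyone who wants to take a savepoint like the one above to disk, here is a minimal sketch of a binary round trip. The `savepoint_blob` struct, the function names, and the file layout are all hypothetical. The `n_past` field is an assumption that the caller must also remember how many tokens were already evaluated when the memory was captured. It is not a diagnosis of the problem described above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the gptj_savepoint used in the comment above.
struct savepoint_blob {
    int32_t n_past = 0;         // assumption: number of tokens already evaluated
    std::vector<float> memory;  // all K elements followed by all V elements
};

// Writes: n_past, element count, then the raw floats. Returns false on any I/O error.
bool savepoint_write(const savepoint_blob & sp, const std::string & path) {
    assert(sp.memory.size() % 2 == 0);  // K and V halves have equal length
    FILE * f = std::fopen(path.c_str(), "wb");
    if (!f) return false;
    const uint64_t n = sp.memory.size();
    bool ok = std::fwrite(&sp.n_past, sizeof(sp.n_past), 1, f) == 1
           && std::fwrite(&n, sizeof(n), 1, f) == 1
           && (n == 0 || std::fwrite(sp.memory.data(), sizeof(float), n, f) == n);
    ok = (std::fclose(f) == 0) && ok;   // a failed close can mean lost data
    return ok;
}

// Reads the layout produced by savepoint_write. Returns false on any I/O error.
bool savepoint_read(savepoint_blob & sp, const std::string & path) {
    FILE * f = std::fopen(path.c_str(), "rb");
    if (!f) return false;
    uint64_t n = 0;
    bool ok = std::fread(&sp.n_past, sizeof(sp.n_past), 1, f) == 1
           && std::fread(&n, sizeof(n), 1, f) == 1;
    if (ok) {
        sp.memory.resize(n);
        ok = (n == 0) || std::fread(sp.memory.data(), sizeof(float), n, f) == n;
    }
    std::fclose(f);
    return ok;
}
```

The file is written in the host's byte order and float format, so this sketch is only meant for reloading on the same kind of machine that saved it.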
@im-not-tom I got this working on my project, and those are basically the steps I followed 👍 I have verified that this works by saving the memory, restoring it in a different process (after loading the model again), and then comparing the two. |
This requires a future input prompt to be identical to the original prompt, correct? So if I wanted to swap out a few words early in the text each time, it would still require reprocessing everything. I'm struggling to understand, why does Huggingface Transformers not have this issue? It seems like generating from a 500 word input prompt has the same latency as a 3 word prompt there (constant time). Whereas with llama.cpp, the prompt processing time scales linearly with prompt length. |
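On the first question: with causal attention, the cached keys and values for a position depend only on the tokens at or before that position. A saved cache therefore stays valid up to the first token that differs, and only the rest needs re-evaluation. A minimal sketch of that bookkeeping, with a hypothetical `token_id` type and hypothetical function names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using token_id = int;  // hypothetical token type, for illustration only

// Number of leading tokens shared by the cached prompt and the new prompt.
size_t common_prefix_len(const std::vector<token_id> & cached,
                         const std::vector<token_id> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}

// Tokens of the new prompt that still have to be evaluated, assuming the
// cache entries for the shared prefix are reused as-is.
std::vector<token_id> tokens_to_eval(const std::vector<token_id> & cached,
                                     const std::vector<token_id> & prompt) {
    const size_t keep = common_prefix_len(cached, prompt);
    assert(keep <= prompt.size());
    return std::vector<token_id>(prompt.begin() + keep, prompt.end());
}
```

Swapping a few words early in the text leaves only a short shared prefix, so nearly everything after the change is reprocessed. If the whole prompt matches, nothing is left to evaluate. The logits for the last token would then have to come from the saved state, or from re-evaluating that one token.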
This is a great idea. I think the right way forward would be to separate the concepts of the immutable model and its mutable context into their own simple structs. These could then be manipulated and evaluated against in whatever ways new features and implementations see fit: load/save from disk, mmap to memory, transfer over sockets, even from hardware devices like a serial port. Multiple instances could be served by creating threads, or the model and/or context could be shared between processes. Currently I see the problem of having multiple concurrent PRs, each of which tries to implement new functionality by directly modifying the main program in its own way. A simple C API to access the 'model' and 'context' structs could keep the main program lean, clean and fast, and allow all sorts of new functionality to be added through separate modules that interface with this API. You've done an absolutely fantastic job making this minimal, fast and ultra-portable zero-dependency port of LLaMA. It's absolutely delightful in its current state, and I think modularity would be the right approach moving forward, instead of growing the main program into a large monolith with a web of non-portable #ifdefs scattered everywhere. With every functionality-adding module living inside its own .cpp file, independent of the others, any functionality could simply be added or left out by the makefile script. I could see this spawning a whole environment: a new "modules" directory could be added to the repo root, and people could write whatever modules they like to add new functionality. Living inside the modules folder, separated from the main program and included via makefile options, they could also conform less strictly to the rules of no dependencies and full platform compatibility.
This would allow people to build new functionality without taking every platform into account, and to opt in to the features they want with nothing forced upon them. If a non-modular approach were taken, it would inevitably lead this marvelous minimalistic codebase to grow into a large monolith and force people to fork the minimal version and cherry-pick the bugfixes and features they want into their own forks, creating a divergence which in my view would hurt the project in the long run. |
When implementing the feature, this comment may be useful: rustformers/llm#38 (comment) TL;DR is zstd compression at the lowest/fastest setting works well and increasing the compression level doesn't do a lot. Since the memory is quite large (2GB with So I think it's really worth it to use some sort of lightweight compression scheme for at least the memory tensors when saving/restoring state. |
I have implemented functions for getting and setting the rest of the model state. It was necessary to store the logits so that we can eval tokens, save state, restart program, load state and then sample. For completeness I also stored the embedding vector. Because the whole state is not in one contiguous memory buffer I decided on an output pointer parameter to get the state data. |
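To illustrate the output-pointer idea, here is a rough sketch of packing state that lives in separate buffers into one caller-provided block. The `eval_state` struct, the function names, and the layout are hypothetical and are not taken from the pull request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical state that is not stored in one contiguous buffer.
struct eval_state {
    std::vector<float> logits;
    std::vector<float> embedding;
};

// Each section is a 64-bit element count followed by the raw floats.
static size_t section_size(const std::vector<float> & v) {
    return sizeof(uint64_t) + v.size() * sizeof(float);
}

// Bytes the caller must allocate before calling copy_state.
size_t state_size(const eval_state & s) {
    return section_size(s.logits) + section_size(s.embedding);
}

static uint8_t * put(uint8_t * out, const std::vector<float> & v) {
    const uint64_t n = v.size();
    std::memcpy(out, &n, sizeof(n));
    out += sizeof(n);
    if (n) std::memcpy(out, v.data(), n * sizeof(float));
    return out + n * sizeof(float);
}

static const uint8_t * get(const uint8_t * in, std::vector<float> & v) {
    uint64_t n = 0;
    std::memcpy(&n, in, sizeof(n));
    in += sizeof(n);
    v.resize(n);
    if (n) std::memcpy(v.data(), in, n * sizeof(float));
    return in + n * sizeof(float);
}

// Copies the state into dest (at least state_size(s) bytes); returns bytes written.
size_t copy_state(const eval_state & s, uint8_t * dest) {
    assert(dest != nullptr);
    const uint8_t * end = put(put(dest, s.logits), s.embedding);
    return static_cast<size_t>(end - dest);
}

// Restores the state from a buffer filled by copy_state; returns bytes read.
size_t set_state(eval_state & s, const uint8_t * src) {
    assert(src != nullptr);
    const uint8_t * end = get(get(src, s.logits), s.embedding);
    return static_cast<size_t>(end - src);
}
```

A caller would allocate `state_size(s)` bytes, call `copy_state`, and later pass the same bytes to `set_state` before sampling.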
@xaedes do you have a PR for this? |
Just created the pull request: #1105 |
I believe #1169 covers this |
Idea from: #23 (comment)
We can add a --cache_prompt flag that, if added, will dump the computed KV caches of the prompt processing to the disk in a file with a name produced by the hash of the prompt. Next time you run, it will first check if we have a stored KV cache for this hash and load it straight from disk instead of computing it. Great task for contributing to the project!
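The lookup described above could be sketched roughly as follows. FNV-1a is just one simple hash chosen for illustration. The function names, the `.kvcache` extension, and the directory parameter are hypothetical and not part of the project.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// 64-bit FNV-1a over the raw prompt bytes (illustrative choice of hash).
uint64_t prompt_hash(const std::string & prompt) {
    uint64_t h = 0xcbf29ce484222325ULL;      // FNV offset basis
    for (unsigned char c : prompt) {
        h ^= c;
        h *= 0x100000001b3ULL;               // FNV prime
    }
    return h;
}

// Builds "<dir>/<16 hex digits>.kvcache" from the prompt hash.
std::string cache_path_for_prompt(const std::string & prompt,
                                  const std::string & dir = ".") {
    char name[17];
    const int written = std::snprintf(name, sizeof(name), "%016llx",
                                      static_cast<unsigned long long>(prompt_hash(prompt)));
    assert(written == 16);
    return dir + "/" + name + ".kvcache";
}

// True if a cache file for this prompt can be opened for reading.
bool cache_exists(const std::string & path) {
    std::ifstream f(path, std::ios::binary);
    return f.good();
}
```

On startup, a caller would compute `cache_path_for_prompt(prompt)`, load the file if `cache_exists` returns true, and otherwise evaluate the prompt and write the cache afterwards. Because distinct prompts can collide on a hash, it may be safer to store the prompt tokens in the file and compare them before trusting the cache.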