
llama : refactor src/llama.cpp #10902

Merged 25 commits into master on Jan 3, 2025
Conversation

ggerganov (Owner) commented Dec 19, 2024:

Attempting to split src/llama.cpp into a few separate modules. This is very much work in progress; I'm mainly opening this PR so people can keep track and suggest improvements as we move along. This part does not involve functional changes, just code reorganization and decoupling to make it easier to work with the codebase. The batch and KV cache abstractions and reimplementations will be done in follow-up PRs.

graph TD;
chat;
model_loader;
model   --> arch[<b>arch </b>];
model   --> hparams[<b>hparams </b>];
model   ----> mmap[<b>mmap </b> <br><br> llama_file <br> llama_mmap <br> llama_mlock];
model   -.-> model_loader;
model   --> vocab;
vocab   --> unicode;
adapter -.-> model;
kv_cache -.-> batch;
kv_cache -.-> cparams;
kv_cache -.-> model;
context --> adapter[<b>adapter</b> <br><br> llama_adapter_cvec <br> llama_adapter_lora];
context -.-> batch;
context --> cparams;
context --> kv_cache;
context --> model;

style adapter fill:green
style arch fill:green
style batch fill:green
style chat fill:green
style cparams fill:green
style hparams fill:green
style kv_cache fill:green
style mmap fill:green
style model fill:green
style model_loader fill:green
style unicode fill:green
style vocab fill:green

TODO

  • move the llama_mmaps and llama_mlocks from llama_model to llama_context? (no)
  • change _internal suffix to _impl (next PR)
  • add llama_tensor_loader ?
  • model loading
  • quantization

ngxson (Collaborator) commented Dec 19, 2024:

I think the control_vector and lora related stuff should be regrouped into a module, maybe called adapters (if someone has a better name, feel free to comment). That's because they work in basically the same way, by "adding things" on top of the original cgraph.

ggerganov force-pushed the gg/llama-refactor-0 branch 8 times, most recently from 524886b to 7ab08d5 (December 22, 2024 16:24)
The github-actions bot added the examples and devops labels (December 22, 2024)
ggerganov force-pushed the gg/llama-refactor-0 branch 2 times, most recently from be8f568 to dcbfda1 (December 22, 2024 20:30)
ggerganov force-pushed the gg/llama-refactor-0 branch 7 times, most recently from ba48e37 to 0ccae21 (December 23, 2024 17:22)
ggerganov force-pushed the gg/llama-refactor-0 branch from 1e7e338 to 597ae05 (January 2, 2025 10:39)
ggerganov force-pushed the gg/llama-refactor-0 branch from 1521f9e to c16630e (January 2, 2025 15:29)
ggerganov force-pushed the gg/llama-refactor-0 branch 2 times, most recently from 391a111 to 089cf4a (January 2, 2025 19:37)
ggerganov force-pushed the gg/llama-refactor-0 branch from 089cf4a to e06d267 (January 2, 2025 19:40)
ggerganov marked this pull request as ready for review (January 2, 2025 20:02)
ggerganov requested a review from ngxson as a code owner (January 2, 2025 20:02)
ggerganov (Owner, Author) commented:

I think this is a good place to merge this change. The project builds faster now, and hopefully the code is organized a bit better. I will continue refactoring in follow-up PRs; any suggestions and recommendations are welcome. I've left some TODOs around the code and will try to address those next. After that, I will be looking for ways to separate the KV cache from the llama_context and enable support for multiple KV cache implementations.

ngxson (Collaborator) left a review:

LGTM overall, thanks for taking the time during the holidays to finish this. Happy new year btw 🎉

Review comment on common/common.h (outdated):
@@ -24,13 +24,12 @@

#define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"

// TODO: "lora_adapter" is tautology
ngxson (Collaborator):

Not sure what you mean by this. I think "lora_adapter" is not a tautology, because there can be multiple types of adapters, and there can also be "lora_a", "lora_b", "lora_scale".

ggerganov (Owner, Author):

I thought that "lora" already implies "adapter", since it comes from "LOw-Rank Adapter". So it seems to me that common_lora_adapter_info should simply be called common_lora_info.

ngxson (Collaborator) commented Jan 2, 2025:

Hmm no, the "A" means "adaptation", not "adapter". Quoting from this article:

LoRA, which stands for “Low-Rank Adaptation”, distinguishes itself by training and storing the additional weight changes in a matrix while freezing all the pre-trained model weights. LoRA is not called an “adapter” because it does not add adapters. Instead, it is referred to as “adaptation” to describe the process of fine-tuning the domain data and tasks.

Funny enough, I've just found out that "adapter" is technically a different technique than LoRA, first introduced in this paper. But the way they work is quite similar, adding nodes to the existing cgraph. So I guess the term "adapter" is being used correctly in our context in llama.cpp, since both LoRA and cvector are just additions on top of the model's cgraph.

ggerganov merged commit f66f582 into master on Jan 3, 2025 (1 check passed)
ggerganov deleted the gg/llama-refactor-0 branch (January 3, 2025 08:18)
MaggotHATE (Contributor) commented:

After this PR, common_tokenize fails to tokenize any text and exits without console errors on llama_tokenize. Windows 10, static CPU build. The model and ctx are loaded at that moment (main-based application).

What change caused this? I don't see changes to the llama_tokenize_internal function, for example. Are models required to be reconverted after this PR?

ggerganov (Owner, Author) commented:

There should be no functional changes. Try cleaning your build folder.

MaggotHATE (Contributor) commented:

"There should be no functional changes."

There's definitely a change: I see llama_n_ctx returning random numbers instead of the actual context size. At the same time, the context size is reported correctly on loading:

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 16384, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 56, can_shift = 1

but later:

load: llama_n_ctx = 458134640

ngxson (Collaborator) commented Jan 3, 2025:

There is a slight change in common_init_result: both model and ctx are now unique pointers. I'm not sure if it's the root problem (i.e. you need to modify your code to use model.get() instead of just model).

Please note that we don't provide a stable interface for the common library, so undocumented breaking changes are expected.
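
(A minimal sketch of the adjustment described above, assuming the common_init_result members from this PR, model and context held as smart pointers, and a common_params argument; the surrounding function is hypothetical and not part of the PR.)

#include "common.h"
#include "llama.h"

static void run_inference(common_params & params) {
    common_init_result llama_init = common_init_from_params(params);

    // The members are now smart pointers, so use .get() to obtain the raw
    // pointers expected by the llama.h API.
    llama_model   * model = llama_init.model.get();
    llama_context * ctx   = llama_init.context.get();

    // ... use model / ctx here, while llama_init is still in scope ...

    // No explicit llama_free()/llama_free_model(): the deleters run when
    // llama_init goes out of scope at the end of this function.
}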

MaggotHATE (Contributor) commented:

"you need to modify your code to use model.get() instead of just model"

That's already done, following the changes this PR introduces in main. The problem is that everything before llama_n_ctx seems to work fine, but it looks like ctx either doesn't load properly or is unloaded (which didn't happen before and probably shouldn't happen).

"undocumented breaking changes"

That's exactly what I'm trying to find.

ngxson (Collaborator) commented Jan 3, 2025:

Be careful: with a unique_ptr, the ctx may be deallocated when it goes out of scope. You should make sure that common_init_result stays valid in scope; otherwise, .release() the pointer from the unique_ptr and use it as a raw pointer.
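
(To make the pitfall concrete, a hypothetical and intentionally broken helper; this is a sketch, not code from the PR.)

llama_context * load_ctx_broken(common_params & params) {
    common_init_result llama_init = common_init_from_params(params);
    // BUG: llama_init is destroyed when this function returns, so the
    // llama_context it owns is freed and the returned pointer dangles.
    return llama_init.context.get();
}

// Fix: either keep the common_init_result alive for as long as the context
// is used, or transfer ownership out with .release() and call llama_free()
// manually later.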

MaggotHATE (Contributor) commented:

"the ctx may be deallocated when it goes out of scope"

Yes, it looks like that's what happens here. I see that llama_free(ctx) and llama_free_model(model) are no longer used, so deallocation and unloading happen automatically. Is there a way to keep model and ctx alive manually now?

Although, it's a bit weird that this wasn't needed previously.

ngxson (Collaborator) commented Jan 3, 2025:

llama_free is no longer needed because common_init_result is in the scope of main(), and the deleter will be called automatically when we exit main() (i.e. it goes out of scope).

Another way is the following (but remember to free them manually afterwards):

common_init_result llama_init = common_init_from_params(params);

// take ownership out of the smart pointers; they will no longer free anything
llama_model * model = llama_init.model.release();
llama_context * ctx = llama_init.context.release();

// ... later, freeing is our responsibility again
llama_free(ctx);
llama_free_model(model);

MaggotHATE (Contributor) commented Jan 3, 2025:

llama_model * model = llama_init.model.release();
llama_context * ctx = llama_init.context.release();

Yes, thank you, that's exactly what I needed!

So, release() makes it possible to match the previous behavior (for example, reloading works exactly as it used to), while .get() relies on common_init_result staying alive (I tested it, and keeping common_init_result as part of the object helps, but regeneration doesn't work - it looks like it should be cleared/reset manually too).
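
(A minimal sketch of the "keep common_init_result as part of the object" approach mentioned above, assuming common_init_result is default-constructible and move-assignable, which it should be since it holds smart pointers; the wrapper class and its member names are hypothetical.)

#include "common.h"
#include "llama.h"

struct model_holder {                     // hypothetical application-side wrapper
    common_init_result init;              // owns the model and the context
    llama_model   * model = nullptr;      // raw views, valid while init lives
    llama_context * ctx   = nullptr;

    bool load(common_params & params) {
        init  = common_init_from_params(params);
        model = init.model.get();
        ctx   = init.context.get();
        return model != nullptr && ctx != nullptr;
    }

    void unload() {
        // reassigning resets the smart pointers, freeing model and context
        init  = common_init_result{};
        model = nullptr;
        ctx   = nullptr;
    }
};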

ngxson (Collaborator) commented Jan 3, 2025:

In C++, you have the notion of "ownership". For example, std::string owns a char * buffer under the hood that you can access using str.data(), and the buffer is freed once the std::string goes out of scope.

Same goes for unique_ptr: the llama_model * is owned by the unique_ptr, and get() lets you access it. You can also release() it from being owned.

unique_ptr cannot be reassigned; you can only reset() it or std::move it. Edit: it can be reassigned, which is equivalent to reset().
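
(A self-contained illustration of these ownership rules with plain std::unique_ptr, no llama.cpp types involved.)

#include <cstdio>
#include <memory>
#include <string>

int main() {
    auto p = std::make_unique<std::string>("hello");

    std::string * raw = p.get();             // borrow: p still owns the string
    std::printf("%s\n", raw->c_str());

    std::string * owned_by_us = p.release(); // p gives up ownership, nothing is freed
    delete owned_by_us;                      // now deletion is our job

    p = std::make_unique<std::string>("world"); // reassignment adopts a new object
    p.reset();                                  // frees "world" immediately

    return 0;                                   // nothing left for p to free
}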

MaggotHATE (Contributor) commented:

"Same goes for unique_ptr: the llama_model * is owned by the unique_ptr, and get() lets you access it. You can also release() it from being owned."

I understand that, but the problem is that release() isn't used anywhere for model and ctx, so there are no examples to learn that from. Plus, in a practical sense, the behavior in the examples has changed (they are examples to learn from!): they used to rely on the free functions, while now they rely on simply going out of scope.
