
llama : refactor src/llama.cpp #10902

Merged 25 commits into master on Jan 3, 2025
Conversation

ggerganov (Owner) commented Dec 19, 2024:

Attempting to split src/llama.cpp into a few separate modules. This is very much work in progress; I'm mainly opening this PR so people can keep track and suggest improvements as we move along. This part does not involve functional changes, just code reorganization and decoupling to make it easier to work with the codebase. The batch and KV cache abstractions and reimplementations will be done in follow-up PRs.

graph TD;
chat;
model_loader;
model   --> arch[<b>arch </b>];
model   --> hparams[<b>hparams </b>];
model   ----> mmap[<b>mmap </b> <br><br> llama_file <br> llama_mmap <br> llama_mlock];
model   -.-> model_loader;
model   --> vocab;
vocab   --> unicode;
adapter -.-> model;
kv_cache -.-> batch;
kv_cache -.-> cparams;
kv_cache -.-> model;
context --> adapter[<b>adapter</b> <br><br> llama_adapter_cvec <br> llama_adapter_lora];
context -.-> batch;
context --> cparams;
context --> kv_cache;
context --> model;

style adapter fill:green
style arch fill:green
style batch fill:green
style chat fill:green
style cparams fill:green
style hparams fill:green
style kv_cache fill:green
style mmap fill:green
style model fill:green
style model_loader fill:green
style unicode fill:green
style vocab fill:green

TODO

  • move the llama_mmaps and llama_mlocks from llama_model to llama_context? (no)
  • change _internal suffix to _impl (next PR)
  • add llama_tensor_loader ?
  • model loading
  • quantization

ngxson (Collaborator) commented Dec 19, 2024:

I think the control_vector and lora related stuff should be regrouped into a module, maybe called adapters (if someone has a better name, feel free to comment). That's because they work in basically the same way, by "adding things" on top of the original cgraph.

ggerganov force-pushed the gg/llama-refactor-0 branch 8 times, most recently from 524886b to 7ab08d5 (December 22, 2024 16:24)
The github-actions bot added the examples and devops labels (December 22, 2024)
ggerganov force-pushed the gg/llama-refactor-0 branch 2 times, most recently from be8f568 to dcbfda1 (December 22, 2024 20:30)
ggerganov force-pushed the gg/llama-refactor-0 branch 7 times, most recently from ba48e37 to 0ccae21 (December 23, 2024 17:22)
ggerganov force-pushed the gg/llama-refactor-0 branch from 1e7e338 to 597ae05 (January 2, 2025 10:39)
ggerganov force-pushed the gg/llama-refactor-0 branch from 1521f9e to c16630e (January 2, 2025 15:29)
ggerganov force-pushed the gg/llama-refactor-0 branch 2 times, most recently from 391a111 to 089cf4a (January 2, 2025 19:37)
ggerganov force-pushed the gg/llama-refactor-0 branch from 089cf4a to e06d267 (January 2, 2025 19:40)
ggerganov marked this pull request as ready for review (January 2, 2025 20:02)
ggerganov requested a review from ngxson as a code owner (January 2, 2025 20:02)
ggerganov (Owner, Author) commented:

I think this is a good place to merge this change. The project builds faster now, and hopefully the code is organized a bit better. I will continue refactoring in follow-up PRs; any suggestions and recommendations are welcome. I've left some TODOs around the code and will try to address those next. After that, I will be looking for ways to separate the KV cache from the llama_context and enable support for multiple KV cache implementations.

ngxson (Collaborator) left a review:

LGTM overall, thanks for taking the time during the holidays to finish this. Happy new year btw 🎉

Review comment on common/common.h (outdated):
@@ -24,13 +24,12 @@

#define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"

// TODO: "lora_adapter" is tautology
ngxson (Collaborator):

Not sure what you mean by this. I think "lora_adapter" is not a tautology, because there can be multiple types of adapters, and there can also be "lora_a", "lora_b", "lora_scale".

ggerganov (Owner, Author):

I thought that "lora" already implies "adapter", since it comes from "LOw-Rank Adapter". So it seems to me that common_lora_adapter_info should simply be called common_lora_info.

ngxson (Collaborator) commented Jan 2, 2025:

Hmm no, the "A" means "adaptation", not "adapter". Quoting from this article:

LoRA, which stands for “Low-Rank Adaptation”, distinguishes itself by training and storing the additional weight changes in a matrix while freezing all the pre-trained model weights. LoRA is not called an “adapter” because it does not add adapters. Instead, it is referred to as “adaptation” to describe the process of fine-tuning the domain data and tasks.

Funny enough, I've just found out that "adapter" is technically a different technique than LoRA, first introduced in this paper. But the way they work is quite similar, adding nodes to the existing cgraph. So I guess the term "adapter" is being used correctly in our context in llama.cpp, since both LoRA and cvector are just additions on top of the model's cgraph.

ggerganov merged commit f66f582 into master on Jan 3, 2025 (1 check passed)
ggerganov deleted the gg/llama-refactor-0 branch (January 3, 2025 08:18)
MaggotHATE (Contributor) commented:

After this PR, common_tokenize fails to tokenize any text and exits without console errors on llama_tokenize. Windows 10, static CPU build. The model and ctx are loaded at that moment (main-based application).

What change caused this? I don't see changes to the llama_tokenize_internal function, for example. Are models required to be reconverted after this PR?

ggerganov (Owner, Author) commented:

There should be no functional changes. Try cleaning your build folder.

MaggotHATE (Contributor) commented:

"There should be no functional changes."

There's definitely a change: I see llama_n_ctx returning random numbers instead of the actual context size. At the same time, the context size is reported correctly on loading:

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 16384, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 56, can_shift = 1

but later:

load: llama_n_ctx = 458134640

ngxson (Collaborator) commented Jan 3, 2025:

There is a slight change in common_init_result: both model and ctx are now unique pointers. I'm not sure if it's the root problem (i.e. you need to modify your code to use model.get() instead of just model).

Please note that we don't provide a stable interface for the common library, so undocumented breaking changes are expected.
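
(A minimal sketch of the adjustment described above, assuming the common_init_result members from this PR, model and context held as smart pointers, and a common_params argument; the surrounding function is hypothetical and not part of the PR.)

#include "common.h"
#include "llama.h"

static void run_inference(common_params & params) {
    common_init_result llama_init = common_init_from_params(params);

    // The members are now smart pointers, so use .get() to obtain the raw
    // pointers expected by the llama.h API.
    llama_model   * model = llama_init.model.get();
    llama_context * ctx   = llama_init.context.get();

    // ... use model / ctx here, while llama_init is still in scope ...

    // No explicit llama_free()/llama_free_model(): the deleters run when
    // llama_init goes out of scope at the end of this function.
}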

MaggotHATE (Contributor) commented:

"you need to modify your code to use model.get() instead of just model"

That's already done, following the changes this PR introduces in main. The problem is that everything before llama_n_ctx seems to work fine, but it looks like ctx either doesn't load properly or is unloaded (which didn't happen before and probably shouldn't happen).

"undocumented breaking changes"

That's exactly what I'm trying to find.

ngxson (Collaborator) commented Jan 3, 2025:

Be careful: with a unique_ptr, the ctx may be deallocated when it goes out of scope. You should make sure that common_init_result stays valid in scope; otherwise, .release() the pointer from the unique_ptr and use it as a raw pointer.
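
(To make the pitfall concrete, a hypothetical and intentionally broken helper; this is a sketch, not code from the PR.)

llama_context * load_ctx_broken(common_params & params) {
    common_init_result llama_init = common_init_from_params(params);
    // BUG: llama_init is destroyed when this function returns, so the
    // llama_context it owns is freed and the returned pointer dangles.
    return llama_init.context.get();
}

// Fix: either keep the common_init_result alive for as long as the context
// is used, or transfer ownership out with .release() and call llama_free()
// manually later.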

MaggotHATE (Contributor) commented:

"the ctx may be deallocated when it goes out of scope"

Yes, it looks like that's what happens here. I see that llama_free(ctx) and llama_free_model(model) are no longer used, so deallocation and unloading happen automatically. Is there a way to keep model and ctx alive manually now?

Although, it's a bit weird that this wasn't needed previously.

ngxson (Collaborator) commented Jan 3, 2025:

llama_free is no longer needed because common_init_result is in the scope of main(), and the deleter will be called automatically when we exit main() (i.e. it goes out of scope).

Another way is the following (but remember to free them manually afterwards):

common_init_result llama_init = common_init_from_params(params);

// take ownership out of the smart pointers; they will no longer free anything
llama_model * model = llama_init.model.release();
llama_context * ctx = llama_init.context.release();

// ... later, freeing is our responsibility again
llama_free(ctx);
llama_free_model(model);

MaggotHATE (Contributor) commented Jan 3, 2025:

llama_model * model = llama_init.model.release();
llama_context * ctx = llama_init.context.release();

Yes, thank you, that's exactly what I needed!

So, release() makes it possible to match the previous behavior (for example, reloading works exactly as it used to), while .get() relies on common_init_result staying alive (I tested it, and keeping common_init_result as part of the object helps, but regeneration doesn't work - it looks like it should be cleared/reset manually too).
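
(A minimal sketch of the "keep common_init_result as part of the object" approach mentioned above, assuming common_init_result is default-constructible and move-assignable, which it should be since it holds smart pointers; the wrapper class and its member names are hypothetical.)

#include "common.h"
#include "llama.h"

struct model_holder {                     // hypothetical application-side wrapper
    common_init_result init;              // owns the model and the context
    llama_model   * model = nullptr;      // raw views, valid while init lives
    llama_context * ctx   = nullptr;

    bool load(common_params & params) {
        init  = common_init_from_params(params);
        model = init.model.get();
        ctx   = init.context.get();
        return model != nullptr && ctx != nullptr;
    }

    void unload() {
        // reassigning resets the smart pointers, freeing model and context
        init  = common_init_result{};
        model = nullptr;
        ctx   = nullptr;
    }
};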

ngxson (Collaborator) commented Jan 3, 2025:

In C++, you have the notion of "ownership". For example, std::string owns a char * buffer under the hood that you can access using str.data(), and the buffer is freed once the std::string goes out of scope.

Same goes for unique_ptr: the llama_model * is owned by the unique_ptr, and get() lets you access it. You can also release() it from being owned.

unique_ptr cannot be reassigned; you can only reset() it or std::move it. Edit: it can be reassigned, which is equivalent to reset().
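
(A self-contained illustration of these ownership rules with plain std::unique_ptr, no llama.cpp types involved.)

#include <cstdio>
#include <memory>
#include <string>

int main() {
    auto p = std::make_unique<std::string>("hello");

    std::string * raw = p.get();             // borrow: p still owns the string
    std::printf("%s\n", raw->c_str());

    std::string * owned_by_us = p.release(); // p gives up ownership, nothing is freed
    delete owned_by_us;                      // now deletion is our job

    p = std::make_unique<std::string>("world"); // reassignment adopts a new object
    p.reset();                                  // frees "world" immediately

    return 0;                                   // nothing left for p to free
}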

MaggotHATE (Contributor) commented:

"Same goes for unique_ptr: the llama_model * is owned by the unique_ptr, and get() lets you access it. You can also release() it from being owned."

I understand that, but the problem is that release() isn't used anywhere for model and ctx, so there are no examples to learn that from. Plus, in a practical sense, the behavior in the examples has changed (they are examples to learn from!): they used to rely on the free functions, while now they rely on simply going out of scope.
