
Add AWQ (Activation-aware Weight Quantization) for llama, llama2, mpt, and mistral models #4593

Merged · 34 commits · Dec 27, 2023

Changes from 1 commit

Commits (34)
2ea3934
update: awq support llama-7b model
Dec 14, 2023
8a3cece
update: change order
Dec 14, 2023
0adf4c7
update: benchmark results for llama2-7b
Dec 16, 2023
e851199
update: mistral 7b v1 benchmark
Dec 18, 2023
eb9a790
update: support 4 models
Dec 18, 2023
576d28b
fix: Readme
Dec 18, 2023
4cad8d7
update: ready for PR
Dec 19, 2023
f97c587
update: readme
Dec 19, 2023
ef61a66
fix: readme
Dec 19, 2023
f8cf783
update: change order import
Dec 19, 2023
1b300cb
black
Dec 19, 2023
8fece75
format code
Dec 19, 2023
8177ad4
update: work for bot mpt and awqmpt
Dec 19, 2023
d2e9d00
update: readme
Dec 19, 2023
0610672
Rename to llm_build_ffn_mpt_awq
Dec 20, 2023
c02f6df
Formatted other files
Dec 20, 2023
71c0a27
Fixed params count
Dec 20, 2023
741b7fb
Merge branch 'github' of https://gitlab.vinai.io/mlbooster/llama.cpp …
Dec 20, 2023
e04b8f0
fix: remove code
Dec 22, 2023
48cd819
update: more detail for mpt
Dec 22, 2023
6fcdb07
fix: readme
Dec 22, 2023
b00e2d9
fix: readme
Dec 22, 2023
440cc2f
update: change folder architecture
Dec 22, 2023
00f48ad
fix: common.cpp
Dec 22, 2023
9b742c5
fix: readme
Dec 22, 2023
e8fae2d
Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …
Dec 22, 2023
a600c61
fix: remove ggml_repeat
namtranase Dec 22, 2023
2187a8d
update: cicd
namtranase Dec 22, 2023
e9ad5fe
update: cicd
namtranase Dec 23, 2023
13f60c4
uppdate: remove use_awq arg
namtranase Dec 25, 2023
44f4ce2
Merge branch 'master' of https://github.com/namtranase/llama.cpp
namtranase Dec 25, 2023
d089842
update: readme
namtranase Dec 25, 2023
278f3e9
Merge branch 'master' into HEAD
ggerganov Dec 27, 2023
9174699
llama : adapt plamo to new ffn
ggerganov Dec 27, 2023
Rename to llm_build_ffn_mpt_awq
Le Hoang Anh committed Dec 20, 2023
commit 0610672b19b5c2e7676859e7ae5c03a99736633a
2 changes: 2 additions & 0 deletions common/common.cpp
@@ -1100,11 +1100,13 @@ void llama_batch_add(

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(gpt_params & params) {
auto mparams = llama_model_params_from_gpt_params(params);

llama_model * model = llama_load_model_from_file(params.model.c_str(), mparams);
if (model == NULL) {
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
return std::make_tuple(nullptr, nullptr);
}

auto cparams = llama_context_params_from_gpt_params(params);

llama_context * lctx = llama_new_context_with_model(model, cparams);
100 changes: 33 additions & 67 deletions llama.cpp
@@ -454,8 +454,8 @@ static std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES =
{ LLM_TENSOR_ATTN_QKV, "blk.%d.attn_qkv" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_ACT, "blk.%d.ffn.act"},
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
{ LLM_TENSOR_FFN_ACT, "blk.%d.ffn.act" },
},
},
{
@@ -1178,6 +1178,7 @@ struct llama_hparams {

float f_clamp_kqv;
float f_max_alibi_bias;

bool use_awq;
Owner:

Don't think this has to be an hparam. We can determine whether to apply ffn_act or not based on its presence in the model data.

Contributor Author:

I agree with you, but do you think there would be problems with future modifications of other models if they have scale layers like MPT?

Owner:

Unless I am missing some detail, it should work and can be extended to other models in a similar way. If ffn_act is NULL then no scaling is applied, so we fall back to the original non-AWQ behaviour.


bool operator!=(const llama_hparams & other) const {
@@ -1274,7 +1275,7 @@ struct llama_layer {
// ff bias
struct ggml_tensor * ffn_down_b; // b2
struct ggml_tensor * ffn_up_b; // b3
struct ggml_tensor *ffn_act;
struct ggml_tensor * ffn_act;
};

struct llama_kv_cell {
@@ -3423,10 +3424,10 @@ static void llm_load_tensors(
layer.ffn_norm = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, backend);

layer.ffn_down = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd}, backend_split);
layer.ffn_up = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, backend_split);
if (model.hparams.use_awq) {
layer.ffn_act = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_ACT, "scales", i), {n_ff}, backend);
}
layer.ffn_up = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, backend_split);

if (backend == GGML_BACKEND_GPU) {
if (model.hparams.use_awq) {
@@ -3436,10 +3437,9 @@
ggml_nbytes(layer.wo) +
ggml_nbytes(layer.ffn_norm) +
ggml_nbytes(layer.ffn_down) +
ggml_nbytes(layer.ffn_act) +
ggml_nbytes(layer.ffn_up);
}
else {
ggml_nbytes(layer.ffn_up) +
ggml_nbytes(layer.ffn_act);
} else {
vram_weights +=
ggml_nbytes(layer.attn_norm) +
ggml_nbytes(layer.wqkv) +
@@ -3647,7 +3647,8 @@ static bool llama_model_load(const std::string & fname, llama_model & model, con
llama_model_loader ml(fname, params.use_mmap, params.kv_overrides);

model.hparams.vocab_only = params.vocab_only;
model.hparams.use_awq = params.use_awq;
model.hparams.use_awq = params.use_awq;

llm_load_arch (ml, model);
llm_load_hparams(ml, model);
llm_load_vocab (ml, model);
@@ -3935,7 +3936,7 @@ static struct ggml_tensor * llm_build_ffn(
return cur;
}

static struct ggml_tensor *llm_build_ffn(
static struct ggml_tensor * llm_build_ffn_mpt_awq(
Owner:

This function can be merged into the existing llm_build_ffn.

The ggml_div operator can optionally be applied if act_scales is non-NULL.

Contributor Author:

Thank you for your feedback. For more detail, only MPT models need an additional scale for the GELU activation (original code). I will merge llm_build_ffn_mpt_awq into llm_build_ffn, but need to add the act_scales parameter (used only for MPT and NULL for other models).
Llama, Llama2, and Mistral models do not need the activation scaling, so they do not need any modification.

(A sketch of such a merged FFN builder follows this function's diff below.)

struct ggml_context *ctx,
struct ggml_tensor *cur,
struct ggml_tensor *up,
@@ -3950,72 +3951,39 @@ static struct ggml_tensor *llm_build_ffn(
const llm_build_cb &cb,
int il)
{
struct ggml_tensor *tmp = ggml_mul_mat(ctx, up, cur);
struct ggml_tensor * tmp = ggml_mul_mat(ctx, up, cur);
cb(tmp, "ffn_up", il);

if (up_b)
{
if (up_b) {
tmp = ggml_add(ctx, tmp, up_b);
cb(tmp, "ffn_up_b", il);
}

if (gate)
{
switch (type_gate)
{
case LLM_FFN_SEQ:
{
cur = ggml_mul_mat(ctx, gate, tmp);
cb(cur, "ffn_gate", il);
}
break;
case LLM_FFN_PAR:
{
cur = ggml_mul_mat(ctx, gate, cur);
cb(cur, "ffn_gate", il);
}
break;
}

if (gate_b)
{
cur = ggml_add(ctx, cur, gate_b);
cb(cur, "ffn_gate_b", il);
}
}
else
{
cur = tmp;
}
cur = tmp;

switch (type_op)
{
case LLM_FFN_GELU_ACT:
{
cur = ggml_gelu(ctx, cur);
cb(cur, "ffn_relu", il);
struct ggml_tensor *repeat = ggml_repeat(ctx, act_scales, cur);
cb(repeat, "ffn_repeat(scales)", il);
cur = ggml_div(ctx, cur, repeat);
cb(cur, "ffn_div(gelu)", il);
}
break;
switch (type_op) {
case LLM_FFN_GELU_ACT:
{
cur = ggml_gelu(ctx, cur);
cb(cur, "ffn_relu", il);
struct ggml_tensor *repeat = ggml_repeat(ctx, act_scales, cur);
cb(repeat, "ffn_repeat(scales)", il);
cur = ggml_div(ctx, cur, repeat);
cb(cur, "ffn_div(gelu)", il);
} break;
}

if (type_gate == LLM_FFN_PAR)
{
if (type_gate == LLM_FFN_PAR) {
cur = ggml_mul(ctx, cur, tmp);
cb(cur, "ffn_gate_par", il);
}

cur = ggml_mul_mat(ctx, down, cur);
if (down_b)
{
if (down_b) {
cb(cur, "ffn_down", il);
}

if (down_b)
{
if (down_b) {
cur = ggml_add(ctx, cur, down_b);
}
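A minimal sketch of the merge discussed in the review thread above, assuming act_scales is simply passed as an extra (possibly NULL) argument; the function name and the reduced parameter list (no gate or bias handling) are illustrative only, not the code this PR lands:

```cpp
// Sketch only: GELU FFN with optional AWQ activation scales.
static struct ggml_tensor * llm_build_ffn_gelu_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * up,
        struct ggml_tensor  * down,
        struct ggml_tensor  * act_scales, // NULL for non-AWQ models
        const llm_build_cb  & cb,
        int                   il) {
    cur = ggml_mul_mat(ctx, up, cur);
    cb(cur, "ffn_up", il);

    cur = ggml_gelu(ctx, cur);
    cb(cur, "ffn_gelu", il);

    if (act_scales != NULL) {
        // AWQ path: divide the activations by the per-channel scales,
        // mirroring the LLM_FFN_GELU_ACT branch above.
        cur = ggml_div(ctx, cur, ggml_repeat(ctx, act_scales, cur));
        cb(cur, "ffn_act", il);
    }

    cur = ggml_mul_mat(ctx, down, cur);
    cb(cur, "ffn_down", il);

    return cur;
}
```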

@@ -5133,21 +5101,17 @@ struct llm_build_context {
LLM_NORM, cb, il);
cb(cur, "ffn_norm", il);
if (hparams.use_awq) {
cur = llm_build_ffn(ctx0, cur,
cur = llm_build_ffn_mpt_awq(ctx0, cur,
model.layers[il].ffn_up, NULL,
NULL, NULL,
model.layers[il].ffn_down, NULL,
model.layers[il].ffn_act,
LLM_FFN_GELU_ACT, LLM_FFN_SEQ, cb, il);

}
else {
} else {
cur = llm_build_ffn(ctx0, cur,
model.layers[il].ffn_up, NULL,
NULL, NULL,
model.layers[il].ffn_down, NULL,
LLM_FFN_GELU, LLM_FFN_SEQ, cb, il);

}
cb(cur, "ffn_out", il);
}
@@ -5558,7 +5522,7 @@ static const std::unordered_map<const char *, llm_offload_func_e> k_offload_map
{ "ffn_gate", OFFLOAD_FUNC },
{ "ffn_gate_b", OFFLOAD_FUNC },
{ "ffn_gate_par", OFFLOAD_FUNC },
{"ffn_act", OFFLOAD_FUNC },
{ "ffn_act", OFFLOAD_FUNC },
{ "ffn_down", OFFLOAD_FUNC },
{ "ffn_down_b", OFFLOAD_FUNC },
{ "ffn_out", OFFLOAD_FUNC },
@@ -8864,9 +8828,9 @@ struct llama_model_params llama_model_default_params() {
/*.progress_callback_user_data =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.vocab_only =*/ false,
/*.use_awq =*/ false,
/*.use_mmap =*/ true,
/*.use_mlock =*/ false,
/*.use_awq =*/ false,
};

#ifdef GGML_USE_METAL
@@ -8960,7 +8924,9 @@ struct llama_model * llama_load_model_from_file(
const char * path_model,
struct llama_model_params params) {
ggml_time_init();

llama_model * model = new llama_model;

unsigned cur_percentage = 0;
if (params.progress_callback == NULL) {
params.progress_callback_user_data = &cur_percentage;
@@ -9087,7 +9053,7 @@ struct llama_context * llama_new_context_with_model(
if (params.embedding){
ctx->embedding.resize(hparams.n_embd);
}

{
static const size_t tensor_alignment = 32;
// the compute buffer is used to store the tensor and graph structs, while the allocator buffer is used for the tensor data
2 changes: 1 addition & 1 deletion llama.h
@@ -192,7 +192,7 @@ extern "C" {
bool vocab_only; // only load the vocabulary, no weights
bool use_mmap; // use mmap if possible
bool use_mlock; // force system to keep model in RAM
bool use_awq; // whether to use awq quantization
bool use_awq; // whether to use awq quantization
Owner:

I think an AWQ model should use the activation scales by default. AFAIU, using an AWQ model without applying the activation scales would produce incorrect results because the FFN data has been scaled, so I don't see a reason to be able to turn off AWQ.

};
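To make the "incorrect results" point concrete, a hedged sketch of the FFN as built by the LLM_FFN_GELU_ACT path in this commit (biases omitted, matching the MPT call; s denotes the per-channel ffn_act scales):

```latex
% AWQ-MPT FFN as built above (\oslash = element-wise division):
%   y = W_{\mathrm{down}} \left( \mathrm{GELU}(W_{\mathrm{up}} x) \oslash s \right)
% Skipping the division would instead compute
%   W_{\mathrm{down}} \, \mathrm{GELU}(W_{\mathrm{up}} x),
% whose channels differ from the intended activations by the factors s_i,
% so an AWQ-converted model run without its scales gives wrong output.
```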

struct llama_context_params {