-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cohere Model Release #29622
Cohere Model Release #29622
Conversation
Cohere Model Release
Some cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge work ! Thanks a lot for this !
I left couple of comments - mostly being that we can leverage # Copied from
mechanism almost all over the places and simply put # Ignore copy
statements at the places where the implementation differs from Llama, please have a look at my comments
Also it seems the documentation, and testing files are missing - they should be pretty easy to add as the model is mostly copied from llama so I'd expect all tests to pass out of the box ! For a similar PR: #29215 Starcoder was mostly copied from Mistral with minor architectural changes, please have a look at that PR as well and try to take some inspiration from there - I really think it shouldn't be hard to make this PR mergeable ASAP ! 🚀
Also, let's remove pretraining_tp
as from the checkpoints I inspected, it's either set to 1: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/config.json#L27 or None
so we can remove it completely from the code base of this model
For making the CI happy, you can first make sure that make fixup
passes on your local machine (this will check the # Copied from
mechanism)
Let me know if you need any help !
@@ -331,6 +331,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h | |||
1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. | |||
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. | |||
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. | |||
1. **[Cohere](https://huggingface.co/docs/transformers/main/model_doc/cohere)** (from Cohere) released with the paper [Command-R: Retrieval Augmented Generation at Production Scale](<https://txt.cohere.com/command-r/>) by Cohere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to run make fix-copies
to propagate these changes to all other READMEs
) | ||
|
||
|
||
class LayerNorm(nn.Module): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you prepend the name of these modules with Cohere
? -- > CohereLayerNorm
|
||
|
||
class LayerNorm(nn.Module): | ||
def __init__(self, hidden_size, eps=1e-5, bias=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This layer norm is different from the Llama layernorm because of the bias term, can we explcitly mention that in the docstring of this module?
def __init__(self, hidden_size, eps=1e-5, bias=False): | ||
super().__init__() | ||
self.weight = nn.Parameter(torch.ones(hidden_size)) | ||
self.bias = nn.Parameter(torch.zeros(hidden_size)) if bias else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually bias
seems to be always set to None (looking at the LayerNorm
definitions below) - can we instead here remove bias
, copy the LayerNorm module from Llama and add a # Copied from
statement on top of the module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Llama LayerNorm is a bit different from this one which happens in FP32. I kept the bias incase we decide to add bias in the Future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok !
return causal_mask | ||
|
||
|
||
class CohereForCausalLM(CoherePreTrainedModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class CohereForCausalLM(CoherePreTrainedModel): | |
# Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM with Llama->Cohere | |
class CohereForCausalLM(CoherePreTrainedModel): |
|
||
@add_start_docstrings_to_model_forward(COHERE_INPUTS_DOCSTRING) | ||
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC) | ||
def forward( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def forward( | |
# Ignore copy | |
def forward( |
Since here the difference with llama is that we multiply the lm logits with logits_scale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A file for modeling and tokenizer tests is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of tests are required for modeling and tokenization @younesbelkada ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can copy over tests/models/llama/test_modeling_llama.py
and tests/models/llama/test_tokenization_llama.py
and adapt it for Cohere. Note we also need an integration test so you can remove the llama integration tests and replace it with new ones
@@ -310,6 +310,8 @@ | |||
title: CodeGen | |||
- local: model_doc/code_llama | |||
title: CodeLlama | |||
- local: model_doc/cohere | |||
title: Cohere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cohere doc page is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you need to add a field for the slow tokenizer in order for the failing CI to pass, e.g.:
("bloom", (None, "BloomTokenizerFast" if is_tokenizers_available() else None)), |
@@ -137,6 +143,10 @@ | |||
), | |||
), | |||
("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)), | |||
( | |||
"cohere", | |||
("CohereTokenizerFast" if is_tokenizers_available() else None,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
("CohereTokenizerFast" if is_tokenizers_available() else None,), | |
(None, "CohereTokenizerFast" if is_tokenizers_available() else None,), |
( | ||
"cohere", | ||
( | ||
"CohereTokenizerFast" if is_tokenizers_available() else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"CohereTokenizerFast" if is_tokenizers_available() else None, | |
None, "CohereTokenizerFast" if is_tokenizers_available() else None, |
* fixes for pr * pr fixes for the format * pr fixes for the format * src/transformers/models/auto/tokenization_auto.py
* tokenizer test * format fix
Fixes in Tokenization Tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the huge work ! This is much cleaner ! 🤩 I left few possible enhancements to make the code easier to maintain in the future ! We should be really close merging this ! 🚀
docs/source/en/model_doc/cohere.md
Outdated
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | ||
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) | |
tokenizer = AutoTokenizer.from_pretrained(model_id) | |
model = AutoModelForCausalLM.from_pretrained(model_id) |
These won't be needed once we merge the PR right?
docs/source/en/model_doc/cohere.md
Outdated
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | ||
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) | |
tokenizer = AutoTokenizer.from_pretrained(model_id) | |
model = AutoModelForCausalLM.from_pretrained(model_id) |
docs/source/en/model_doc/cohere.md
Outdated
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | ||
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config) | |
tokenizer = AutoTokenizer.from_pretrained(model_id) | |
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config) |
def __init__(self, hidden_size, eps=1e-5, bias=False): | ||
super().__init__() | ||
self.weight = nn.Parameter(torch.ones(hidden_size)) | ||
self.bias = nn.Parameter(torch.zeros(hidden_size)) if bias else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok !
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq) | ||
t = t / self.scaling_factor | ||
freqs = torch.outer(t, self.inv_freq) | ||
emb = torch.repeat_interleave(freqs, 2, dim=-1) | ||
self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False) | ||
self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq) | |
t = t / self.scaling_factor | |
freqs = torch.outer(t, self.inv_freq) | |
emb = torch.repeat_interleave(freqs, 2, dim=-1) | |
self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False) | |
self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False) |
_sin_cached
and _cos_cached
seems to be not used below, I think we can remove them !
class CohereForCausalLM(CoherePreTrainedModel): | ||
_tied_weights_keys = ["model.embed_tokens.weight", "lm_head.weight"] | ||
|
||
def __init__(self, config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __init__(self, config): | |
# Ignore copy | |
def __init__(self, config): |
return causal_mask | ||
|
||
|
||
class CohereForCausalLM(CoherePreTrainedModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why the copied from cannot be used here? It will be very useful to easily maintain the methods below such as _prepare_inputs_for_generation
, otherwise you can also try to put a copied from statement on the _prepare_inputs_for_generation
method
from transformers import CohereForCausalLM, CohereModel | ||
|
||
|
||
class CohereModelTester: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class CohereModelTester: | |
# Copied from transformers.tests.models.llama.test_modeling_llama.LlamaModelTester with Llama->Cohere | |
class CohereModelTester: |
Can we add also copied from on tests as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are differences in the tests.
|
||
|
||
@require_torch | ||
class CohereModelTest(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here for copied from
class CohereIntegrationTest(unittest.TestCase): | ||
@unittest.skip("Logits are not exactly the same, once we fix the instabalities somehow, will update!") | ||
@slow | ||
def test_model_logits(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these the real values obtained?
Could you rather add some end-to-end generation tests similarly as:
def test_starcoder2_batched_generation_sdpa(self): |
* fix modeling final nits and add proper test file * for now leave empty tests * add integration test * push new test
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Hi @saurabhdash2512, this looks really good! We've just made some updates to the (I realize you've already rebased quite recently - I'm sorry about this!) |
@Rocketknight1 Done! Sync'ed with main. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @saurabhdash and team !
* Cohere Model Release (#1) Cohere Model Release * Remove unnecessary files and code (#2) Some cleanup * Delete cohere-model directory (#3) * Make Fix (#5) * Pr fixes (#6) * fixes for pr * pr fixes for the format * pr fixes for the format * src/transformers/models/auto/tokenization_auto.py * Tokenizer test (#8) * tokenizer test * format fix * Adding Docs and other minor changes (#7) * Add modeling tests (#9) * Smol Fix (#11) * tokenization tests are fixed * format fixes * fix pr doc tests * fix pr doc tests * fix pr doc tests * fix pr style check * small changes in cohere.md * FIX: Address final comments for transformers integration (#13) * fix modeling final nits and add proper test file * for now leave empty tests * add integration test * push new test * fix modeling cohere (#14) * Update chat templates to use the new API (#15) --------- Co-authored-by: ahmetustun <ahmetustun89@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Cohere Model Release
What does this PR do?
This PR adds the Cohere Model Family with the release of the Command-R model weights
More information about the model can be found here
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik