Add Command R and Command R+ models #2548

Merged: 11 commits merged into stanford-crfm:main from command_r on May 20, 2024

Conversation

andyt-cohere (Contributor) commented Apr 3, 2024

To do:

  • PR description
  • Deciding what to keep / delete for old Cohere content
  • Impact of chat limitations (see notes in cohere_client.py)
  • Testing on supported evals?

Tested basic setup via: helm-run --run-entries "mmlu:subject=anatomy,model_deployment=cohere/command-r" --suite v1 --max-eval-instances 1 --disable-cache, and likewise for command-r-plus.

@@ -0,0 +1 @@
cohereApiKey: TODO
andyt-cohere (Contributor, Author):

Change this to ensure things work

yifanmai (Collaborator) left a comment:

Awesome, thanks for this pull request! I left some high level suggestions.

"""
# Cohere has a Python SDK, but it requires additional post-processing to convert their response
# objects (`Generations`) to JSON, the form the cache expects the responses to be in.
response = requests.request(
method="POST",
url=get_cohere_url(CohereClient.GENERATE_ENDPOINT),
url=get_cohere_url(CohereClient.CHAT_ENDPOINT),
yifanmai (Collaborator):

I'd suggest splitting this into CohereTextClient and CohereChatClient (which can inherit from a common CohereAbstractClient subclass for common logic), then Command R can use CohereChatClient while the other models can keep using CohereTextClient.
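
A rough sketch of that split might look like the following (class layout, base URL, endpoint paths, and method names are illustrative assumptions, not code from this PR or from HELM):

from abc import ABC, abstractmethod
from typing import Any, Dict

import requests


class CohereAbstractClient(ABC):
    """Shared request logic; subclasses only choose which endpoint to call."""

    BASE_URL: str = "https://api.cohere.ai/v1"  # assumed base URL, for illustration only

    def __init__(self, api_key: str):
        self.api_key = api_key

    @abstractmethod
    def endpoint(self) -> str:
        """Relative endpoint path, e.g. 'generate' or 'chat'."""

    def _post(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # POST the raw JSON payload and return the parsed JSON response.
        response = requests.post(
            f"{self.BASE_URL}/{self.endpoint()}",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
        )
        response.raise_for_status()
        return response.json()


class CohereTextClient(CohereAbstractClient):
    """Older command models keep using the text (generate) API."""

    def endpoint(self) -> str:
        return "generate"


class CohereChatClient(CohereAbstractClient):
    """Command R / Command R+ use the chat API."""

    def endpoint(self) -> str:
        return "chat"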

andyt-cohere (Contributor, Author) commented Apr 4, 2024:

Good point! Although I think going forward we will only be supporting chat and not generate for Cohere models.

Do you think I should still go with the split approach?

P.S. Is there any way to check what I would break if I removed the no-longer-supported generate code? Do you know whether it is possible to delete some of the old Cohere code and move forward with the new paradigm?

andyt-cohere (Contributor, Author):

I am curious if we could even remove some old Cohere stuff.

In the latest version of this PR, I explored some deletions we can make, such as tokenizer removals. Let me know what you think!

yifanmai (Collaborator):

Regarding checking what breaks - you should be able to run e.g. cohere/command-light and that should still work.

I'm inclined to keep the old text API endpoint support, so that people can still use them for reproducing old results (at least until the text API endpoint is gone).

As for the tokenizer, can the command-r tokenizer be used for the older command models, or do those use a different tokenizer? Also, is there still a (small) length limit on the tokenizer API? If either of those assumptions is no longer true, it would simplify a lot of HELM code here.

# However, even sending a request with 60,000 characters sometimes fails, so we set the
# maximum length to 50,000, which is about 8,333 tokens.
# TODO: followed up with Cohere support with an example of a failure case
TOKENIZE_API_MAX_TEXT_LENGTH: int = 50_000
yifanmai (Collaborator):

Does this limit still exist?

andyt-cohere (Contributor, Author):

Not any more!

-    1 <= len(text) <= CohereTokenizer.TOKENIZE_API_MAX_TEXT_LENGTH
-), f"Invalid text length: {len(text)}. Valid length: [1..{CohereTokenizer.TOKENIZE_API_MAX_TEXT_LENGTH:,d}]"
+assert (1 <= len(text)), f"Invalid empty text with length: {len(text)}. Valid length: >= 1"
+request["model"] = "command-r" ### To do
yifanmai (Collaborator):

You can read this from the request (and also, avoid mutating the input parameter):

if request["tokenizer"] == "cohere/command-r":
    request = {"model: "command-r", **request}

# now accept a num_generations argument in the JSON and will always return an array of generations"
# Note that the API version is decoupled from the model version.
DEFAULT_API_VERSION: str = "2021-11-08"

TOKENIZE_ENDPOINT: str = "tokenize"
yifanmai (Collaborator):

You could consider using the Hugging Face tokenizer version of the Command R tokenizer to avoid using the API, like so.
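
For example, a local tokenization call with the Hugging Face tokenizer might look like this (the repository id and the need for a recent transformers release are assumptions, and this is not the linked HELM change):

from transformers import AutoTokenizer

# Assumed Hugging Face repo id for the Command R tokenizer.
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

text = "The quick brown fox jumps over the lazy dog."
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(len(token_ids), tokenizer.convert_ids_to_tokens(token_ids)[:5])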

andyt-cohere (Contributor, Author):

Much cleaner, thank you!

max_sequence_length: 2048
client_spec:
  class_name: "helm.clients.cohere_client.CohereClient"
window_service_spec:
yifanmai (Collaborator):

Related to later note about tokenizers: if you use the Hugging Face tokenizer rather than the Cohere tokenizer API, then you can just delete window_service_spec (which exists to work around the length limit of the Cohere tokenizer API), which will make HELM use the default window service.
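
For illustration, the deployment entry could then reduce to something like the snippet above without the window_service_spec block (the tokenizer_name value here is an assumption):

- name: cohere/command-r
  model_name: cohere/command-r
  tokenizer_name: cohere/command-r  # assumed to resolve to the Hugging Face tokenizer config
  max_sequence_length: 2048  # carried over from the snippet above; adjust to the model's real context length
  client_spec:
    class_name: "helm.clients.cohere_client.CohereClient"
  # no window_service_spec: HELM falls back to the default window service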

andyt-cohere (Contributor, Author):

Thank you! Wonderful advice

andyt-cohere (Contributor, Author) commented Apr 4, 2024

@yifanmai thank you so much for the advice.

Let me know if you can take another look at things! I think we may need some thought on what to preserve / what to delete


Also curious what the consequence is of:

  • Not supporting multiple completions (we could perhaps write a for loop around do_it, as sketched below; it is unclear whether that would violate any designs around parallelism or perhaps cause timeouts)
  • Not supporting log-likelihoods (LLHs) in responses, and responses not explicitly returning tokens (would the text need to be re-tokenized?)

i.e. in src/helm/clients/cohere_client.py

Overall, do these limitations rule out most HELM evals, or can we go far with just generative evals?
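
For reference, the for-loop idea from the first bullet would look roughly like this (a sketch only; do_it stands in for whatever function issues a single-completion chat request, which is an assumption about the client's structure):

from typing import Any, Callable, Dict, List

def collect_completions(do_it: Callable[[], Dict[str, Any]], num_completions: int) -> List[Dict[str, Any]]:
    # Emulate num_completions by issuing one single-generation chat request per completion.
    # Whether this interacts badly with caching, parallelism, or timeouts is the open question above.
    return [do_it() for _ in range(num_completions)]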

yifanmai (Collaborator) left a comment:

FYI: I will be traveling until 2024-04-10 (next Wednesday), so I might not respond to your comments until then.

It's okay to (1) only support single completions and (2) not have tokens in the results (you would set the tokens: List[Token] field to an empty array [] in RequestResult). Your current implementation looks fine and should allow you to run all the scenarios in HELM Lite, which is our "flagship" leaderboard.

You won't be able to run some other scenarios that require token logprobs (e.g. the language modeling scenarios).

The contract is that models marked as FULL_FUNCTIONALITY_TEXT_MODEL_TAG should implement all these features and can support any scenario, whereas models marked as LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG only need to implement returning text, and are only guaranteed to work on HELM Lite scenarios.

We currently don't do any kind of fine-grained capabilities validation (i.e. checking which scenarios work on which models). We may add that in the future.
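
For concreteness, a minimal sketch of (1) and (2) in a client (the import path and field names are recalled from the HELM codebase and may not match exactly):

from helm.common.request import GeneratedOutput, RequestResult

def build_result(response_text: str, cached: bool, request_time: float) -> RequestResult:
    # Single completion with no per-token logprobs: tokens is just an empty list.
    completion = GeneratedOutput(text=response_text, logprob=0.0, tokens=[])
    return RequestResult(
        success=True,
        cached=cached,
        request_time=request_time,
        completions=[completion],
        embedding=[],
    )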

"""
# Cohere has a Python SDK, but it requires additional post-processing to convert their response
# objects (`Generations`) to JSON, the form the cache expects the responses to be in.
response = requests.request(
method="POST",
url=get_cohere_url(CohereClient.GENERATE_ENDPOINT),
url=get_cohere_url(CohereClient.CHAT_ENDPOINT),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding checking what breaks - you should be able to run e.g. cohere/command-light and that should still work.

I'm inclined to keep the old text API endpoint support, so that people can still use them for reproducing old results (at least until the text API endpoint is gone).

As for the tokenizer, can the command-r tokenizer be used for the older command models, or do those use a different tokenizer? Also, is there still a (small) length limit on the tokenizer API? If either of those assumptions are no longer true, it would simplify a lot of HELM code here.

yifanmai (Collaborator) commented Apr 5, 2024

Also, could you rebase / merge with #2551, which requires renaming Sequence to GeneratedOutput? I will try to minimize these kinds of breaking changes to the Client API in the future.

andyt-cohere (Contributor, Author):

No problem! Will be sure to rebase, and will make sure we support Command-Light with the mentioned abstractions.

P.S. I am OOO next week.

andyt-cohere (Contributor, Author):

There doesn't appear to be a tokenizer length cap anymore (I tried some large lengths), but I would need to use the right tokenizer for Command Light (which is currently not the Command R one).

yifanmai (Collaborator):

Hi @andyt-cohere, would you be okay if I made changes directly on your GitHub fork and branch? This is pretty close to being merge-ready, and I can spend some time next week working on it.

@yifanmai yifanmai marked this pull request as ready for review May 7, 2024 21:18
@yifanmai yifanmai requested review from percyliang and farzaank May 7, 2024 21:18
yifanmai (Collaborator) commented May 7, 2024

@andyt-cohere I reverted some of the changes, so now this change only handles adding the Command R model. I'm planning to add the new tokenizer in pull request #2623, and to delete the old tokenizers and window services in another future pull request.

@yifanmai yifanmai self-requested a review May 7, 2024 21:20
@yifanmai yifanmai changed the title Command R Add Command R and Command R+ models May 7, 2024
@yifanmai yifanmai merged commit c5c451c into stanford-crfm:main May 20, 2024
6 checks passed
@yifanmai yifanmai deleted the command_r branch May 20, 2024 20:41
@yifanmai yifanmai restored the command_r branch May 20, 2024 20:41
xuwangyin pushed a commit to xuwangyinx/helm that referenced this pull request Jun 23, 2024
Co-authored-by: Yifan Mai <yifan@cs.stanford.edu>