🔖 Issues Auto-Labeller #2542

Merged
merged 3 commits into from
Jan 17, 2025

August-murr
Collaborator

I was playing around with the OpenAI API and thinking about useful things we could do with the repo's data, such as issues and feedback: automating some tasks, or producing analyses and reports. That led me to the idea of an auto-labeller.

This auto-labeller uses the OpenAI API, so it needs an OPENAI_API_KEY secret. It will cost a few bucks per month. There are also string-based and regex-based labellers, which may be faster.
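
For comparison, a regex-based labeller of the kind mentioned above can be sketched in a few lines. This is only an illustration and not the code in this PR: the `LABEL_PATTERNS` table, its label names, and its keywords are all hypothetical.

```python
import re

# Hypothetical label -> pattern table; names and keywords are illustrative only.
LABEL_PATTERNS = {
    "deepspeed": re.compile(r"\bdeepspeed\b", re.IGNORECASE),
    "question": re.compile(r"\bhow (do|can|to)\b|\?\s*$", re.IGNORECASE | re.MULTILINE),
}


def regex_labels(text: str) -> list[str]:
    """Return every label whose pattern matches the issue text."""
    return [label for label, pattern in LABEL_PATTERNS.items() if pattern.search(text)]


print(regex_labels("DeepSpeed crashes on startup"))
```

A table like this is cheap and fast, but it only catches wording that someone anticipated, which is the trade-off against an LLM-based labeller.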

If you are interested then we could test and see how consistent it is.

A couple of ideas for improvement:

We could one-shot or few-shot prompt it for better accuracy, or try a different model. Both options can cost more.

As for speed, it takes about 30 seconds to label an issue.
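
The few-shot idea can be sketched as a chat message list in which already-labelled issues are replayed as user/assistant turns before the new issue. This is a minimal sketch under assumptions: the function name `build_few_shot_messages`, the system prompt wording, and the example issues are hypothetical and not taken from this PR.

```python
def build_few_shot_messages(issue_text, examples, labels):
    """Build a chat message list with labelled examples ahead of the new issue.

    `examples` is a list of (issue_text, label) pairs used as demonstrations.
    """
    system = "You label GitHub issues. Answer with exactly one label from: " + ", ".join(labels)
    messages = [{"role": "system", "content": system}]
    for example_text, example_label in examples:
        # Each demonstration is one user turn followed by the expected answer.
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": example_label})
    # The issue to label always comes last.
    messages.append({"role": "user", "content": issue_text})
    return messages


messages = build_few_shot_messages(
    "Training hangs with ZeRO-3",
    examples=[("How do I change the batch size?", "question")],
    labels=["question", "issue"],
)
```

Every demonstration adds input tokens to each request, which is where the extra cost comes from.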

Any other ideas for useful things we could do with the OpenAI API or other tools, either to automate tasks or to analyze the data and get some value out of it?

@August-murr August-murr self-assigned this Jan 4, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

Fun feature! Do you have a demo repo?

@qgallouedec
Member

Have you tried with the HF api? It could be a free alternative

@August-murr
Collaborator Author

Fun feature! Do you have a demo repo?

Just pushed it to my own fork

@qgallouedec
Member

I'll open a batch of issues to test it

@August-murr
Collaborator Author

Have you tried with the HF api? It could be a free alternative

Honestly, this was really effortless since I simply forked a mostly functional actions extension. Modifying it to work with the HF API will require much more effort. Also, it uses GPT-4o, and there aren't many open-source models that are this accurate.

If it's absolutely necessary, then I can do it, but I honestly don't think it's worth the effort.

However, if you believe it is important, then I'll go ahead and do it.

@qgallouedec
Member

It doesn't seem like a big deal to me. Something like this could probably work:

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.2-1B-Instruct", token="your_token")
content = "Find the label among these: question, issue."
completion = client.chat_completion(messages=[{"role": "user", "content": content}], max_tokens=256)
response = completion.choices[0].message.content

there aren't many open-source models that are this accurate.

This task is very simple, so I don't think we absolutely need GPT-4o here. And even if the labelling fails, it's not a big deal.
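
Since a small model may answer with extra words or with a label that does not exist, one way to keep failures harmless is to validate the response against the allowed labels and skip labelling otherwise. The sketch below is an assumption about how that could be done, not code from this PR; `pick_label` is a hypothetical helper.

```python
from typing import Optional


def pick_label(response: str, allowed: list[str]) -> Optional[str]:
    """Map a free-form model response to one allowed label, or None if unclear."""
    # Drop surrounding whitespace and common trailing punctuation or quotes.
    cleaned = response.strip().strip(".\"'`").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    # Fall back to labels mentioned inside a longer answer; accept only an unambiguous one.
    mentioned = [label for label in allowed if label.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else None


print(pick_label(" Question.\n", ["question", "issue"]))
```

Returning `None` lets the workflow leave the issue unlabelled instead of applying a wrong or nonexistent label.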

@August-murr
Collaborator Author

August-murr commented Jan 6, 2025

ok got it

@qgallouedec
Member

Do you know if you can access the tag description? It could help the model in its prediction

@August-murr
Collaborator Author

Do you know if you can access the tag description? It could help the model in its prediction

Tag description as in the label description? Like:
🚀 deepspeed --> Related to deepspeed

If so, yes, it is part of the prompt.
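
As an illustration of putting label descriptions into the prompt, here is a minimal sketch. The prompt wording and the `build_prompt` helper are hypothetical; only the `🚀 deepspeed` label and its description come from the example above.

```python
def build_prompt(issue_text: str, label_descriptions: dict[str, str]) -> str:
    """Build a labelling prompt that lists each label with its description."""
    label_lines = "\n".join(
        f"- {name}: {description}" for name, description in label_descriptions.items()
    )
    return (
        "Pick the single best label for the GitHub issue below.\n"
        f"Available labels:\n{label_lines}\n\n"
        f"Issue:\n{issue_text}"
    )


labels = {"🚀 deepspeed": "Related to deepspeed"}
prompt = build_prompt("Training hangs with ZeRO-3", labels)
print(prompt)
```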

@August-murr
Collaborator Author

I tried using the Llama 1B model, and it "functioned," but for TRL I switched to the 70B model. However, I couldn't test it with the 70B model because it requires a subscription.

Don't forget to add the HF_API_KEY to the secrets.

I got a context length error (limit of 4096 tokens) when using the Llama 1B model, which was weird since it supports up to 128k tokens. Since I can't use the 70B model, I'm unsure if it's a problem or not.

@August-murr August-murr marked this pull request as ready for review January 12, 2025 10:35
Member

@qgallouedec qgallouedec left a comment

Nice! Just commit the suggested change please

@qgallouedec qgallouedec changed the title Issues Auto-Labeller 🔖 Issues Auto-Labeller Jan 12, 2025
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@August-murr
Collaborator Author

I got a context length error (limit of 4096 tokens) when using the Llama 1B model, which was weird since it supports up to 128k tokens. Since I can't use the 70B model, I'm unsure if it's a problem or not.

This can be problematic when dealing with issues that require a long context. The exact error message received was:
Input validation error: inputs tokens + max_new_tokens must be <= 4096. Given: 9223 inputs tokens and 50 max_new_tokens
I couldn't find a solution or a parameter to raise this limit; it possibly comes from the inference endpoint.
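
The check in the error message is simple arithmetic, which the sketch below reproduces with the numbers reported above. The helper names are hypothetical, and the default of 4096 is taken from the error text, not from any documented setting.

```python
def fits_context(input_tokens: int, max_new_tokens: int, limit: int = 4096) -> bool:
    """Mirror the validation rule: input tokens + max_new_tokens must be <= limit."""
    return input_tokens + max_new_tokens <= limit


def max_input_tokens(max_new_tokens: int, limit: int = 4096) -> int:
    """Largest input size that still leaves room for the generated tokens."""
    return limit - max_new_tokens


# Values from the reported error: 9223 input tokens and 50 max_new_tokens.
print(fits_context(9223, 50))
print(max_input_tokens(50))
```

With 50 new tokens reserved, the input has to shrink from 9223 tokens to at most 4046 before the request is accepted.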

@qgallouedec
Member

qgallouedec commented Jan 12, 2025

A bit hacky, but you can take the first 15000 characters. It should be enough for most issues:

content = content[:15000]

@August-murr
Collaborator Author

A bit hacky, but you can take the first 15000 characters. It should be enough for most issues:

content = content[:15000]

More like 4000, but it works well.
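
A character-based cut like the one discussed can be sketched as below. This is an assumption about how the truncation might be wired in, not the merged code: `truncate_issue` and the `prompt_prefix` argument are hypothetical, and a character limit only approximates a token limit, since one character can map to more than one token.

```python
def truncate_issue(body: str, prompt_prefix: str, max_chars: int = 4000) -> str:
    """Prepend the fixed prompt text and cut the issue body to fit a character budget."""
    # Reserve room for the fixed part of the prompt, then slice the body.
    budget = max(0, max_chars - len(prompt_prefix))
    return prompt_prefix + body[:budget]


prefix = "Label this issue:\n"
content = truncate_issue("x" * 10000, prefix)
print(len(content))
```

Cutting from the end keeps the title and opening description, which usually carry most of the signal for labelling.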

@qgallouedec qgallouedec merged commit cdc16f3 into huggingface:main Jan 17, 2025
11 of 13 checks passed
3 participants