Function Calling Benchmark by Composio
Welcome to the official GitHub repository for Composio's Function Calling Benchmark. This repository contains a benchmark of 50 function calling problems, each designed to be solved using one of the 8 provided function schemas, which are inspired by some of ClickUp's integration endpoints.
The benchmark tests the ability of various models to call the correct function for a given prompt. Each problem presents a scenario in a ClickUp workspace that requires one specific function to solve. The provided function schemas outline the structure and parameters of the functions that can be used.
A distinguishing feature of this benchmark is that its problems are designed to test how well models handle real-world API structures, and how their performance changes under different optimizations.
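To make the idea of "correctly calling a function" concrete, here is a minimal scoring sketch. It assumes that a call counts as correct only when both the function name and all arguments match the reference solution exactly; the actual scoring rule used in the notebooks is not stated in this README. The function name `create_space` and its arguments are hypothetical and are not taken from the benchmark files.

```python
def is_correct(predicted: dict, expected: dict) -> bool:
    """Return True when the function name and all arguments match exactly.

    Assumption: exact match is the scoring rule. Python dict equality
    ignores key order, so argument order does not matter.
    """
    if predicted.get("name") != expected.get("name"):
        return False
    return predicted.get("arguments", {}) == expected.get("arguments", {})


def accuracy(predictions: list, solutions: list) -> float:
    """Fraction of problems whose predicted call matches the reference call."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        return 0.0
    correct = sum(is_correct(p, s) for p, s in zip(predictions, solutions))
    return correct / len(solutions)


# Hypothetical reference solutions and model outputs (illustration only).
solutions = [
    {"name": "create_space", "arguments": {"team_id": 1, "name": "Marketing"}},
    {"name": "create_space", "arguments": {"team_id": 2, "name": "Sales"}},
]
predictions = [
    {"name": "create_space", "arguments": {"name": "Marketing", "team_id": 1}},
    {"name": "create_space", "arguments": {"team_id": 2, "name": "Sale"}},
]

print(accuracy(predictions, solutions))  # prints 0.5
```

Under this rule a single wrong argument value makes the whole call incorrect, which is one reason deeply nested, real-world API schemas can be hard for models.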
- prompts/: Prompts used to check and modify the problems and schema.
- clickup_space_benchmark.json: The problems and their correct solutions.
- clickup_space_schema.json: Function schemas that the LLMs use to solve the benchmark problems.
- *.ipynb (in relevant branches): Different optimization techniques applied to the LLMs to check their performance against the benchmark.
We ran all experiments in notebooks, as this makes it easier to keep track of the results.
We have tested different function calling models; the result notebooks for each model are stored in a separate branch.
Currently we have experimented with:
- gpt-4-turbo-preview - OpenAI - branch
- gpt-4-turbo - OpenAI - branch
- gpt-4-0125-preview - OpenAI - branch
- claude-3-haiku-20240307 - Anthropic - branch
- claude-3-sonnet-20240229 - Anthropic - branch
- claude-3-opus-20240229 - Anthropic - branch
- Functionary Models(MeetKai)
- Mistral Models
- Open-Gorilla Models
- NexusRaven Models
All of these optimizations have been tested with the models, and each technique is explained here.
# | Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
---|---|---|---|---|---|---|---|
1 | No System Prompt | 0.36 | 0.36 | 0.353 | 0.48 | 0.6 | 0.42 |
2 | Flattening Schema | 0.527 | 0.487 | 0.533 | 0.5 | 0.58 | 0.5 |
3 | Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 | 0.54 | 0.6 | 0.54 |
4 | Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 | 0.54 | 0.54 | 0.54 |
5 | Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 | 0.52 | 0.62 | 0.52 |
6 | Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 | 0.52 | 0.6 | 0.52 |
7 | Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 | 0.46 | 0.62 | 0.46 |
8 | Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 | 0.5 | 0.64 | 0.46 |
9 | Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 | 0.5 | 0.6 | 0.6 |
10 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 | 0.58 | 0.74 | 0.58 |
11 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 | 0.6 | 0.76 | 0.64 |
12 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 | 0.68 | 0.76 | 0.66 |
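As an illustration of the "Flattening Schema" approach named in the table, the sketch below collapses nested object parameters into a single level of top-level parameters. This is one plausible interpretation, not necessarily the transformation used in the notebooks: the `__` separator, the helper names, the rule for propagating `required`, and the example schema (`create_space`, `features`, `due_dates`) are all assumptions made for illustration.

```python
def flatten_properties(properties: dict, required: list,
                       prefix: str = "", sep: str = "__"):
    """Collapse nested object parameters into one level.

    Returns a tuple (flat_properties, flat_required). Nested names are
    joined with `sep`, e.g. features__due_dates__enabled.
    """
    flat_props = {}
    flat_required = []
    for name, spec in properties.items():
        full_name = f"{prefix}{sep}{name}" if prefix else name
        is_required = name in required
        if spec.get("type") == "object" and "properties" in spec:
            child_props, child_required = flatten_properties(
                spec["properties"], spec.get("required", []), full_name, sep
            )
            flat_props.update(child_props)
            # Assumption: a nested field stays required only if every
            # enclosing object is required as well.
            if is_required:
                flat_required.extend(child_required)
        else:
            flat_props[full_name] = spec
            if is_required:
                flat_required.append(full_name)
    return flat_props, flat_required


def flatten_schema(schema: dict, sep: str = "__") -> dict:
    """Return a copy of a function schema with flattened parameters."""
    params = schema.get("parameters", {})
    props, required = flatten_properties(
        params.get("properties", {}), params.get("required", []), sep=sep
    )
    return {
        **schema,
        "parameters": {"type": "object", "properties": props, "required": required},
    }


# Hypothetical nested schema, loosely modelled on a workspace API.
nested_schema = {
    "name": "create_space",
    "description": "Create a space in a workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "team_id": {"type": "integer"},
            "features": {
                "type": "object",
                "properties": {
                    "due_dates": {
                        "type": "object",
                        "properties": {"enabled": {"type": "boolean"}},
                        "required": ["enabled"],
                    },
                },
                "required": ["due_dates"],
            },
        },
        "required": ["team_id", "features"],
    },
}

flat = flatten_schema(nested_schema)
print(sorted(flat["parameters"]["properties"]))
```

With a flattened schema the model fills in flat argument names, so a real pipeline would need to map them back into the nested structure before calling the API.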
We welcome contributions to this repository. If you have a model that you would like to test against the benchmark, feel free to open a pull request. If you encounter any issues while using the benchmark, please open an issue.
This project is licensed under the terms of the MIT license.
Composio is an organization dedicated to advancing the field of artificial intelligence. We create benchmarks, develop models, and build tools to push the boundaries of what is possible in AI. Follow us on Twitter for updates on our latest projects.
© 2024 Composio, All Rights Reserved.