Decomposition of paper Q&A using humans and language models
- Recipes are decompositions of a task into subtasks. The meaning of a recipe is: if a human executed these steps and did a good job at each workspace in isolation, the overall answer would be good. This decomposition may be informed by what we think ML can do at this point, but the recipe itself (as an abstraction) doesn't know about specific agents. (See the sketch after this list.)
- Agents perform atomic subtasks of predefined shapes, like completion, scoring, or classification. Agents don't know which recipe is calling them, and they don't maintain state between subtasks. Agents generally try to complete all subtasks they're asked to complete (however badly), but some will not have implementations for certain task types.
-
The mode in which a recipe runs is a global setting that can affect every agent call. For instance, whether to use humans or agents. Recipes can also run with certain
RecipeSettings
, which can map a task type to a specificagent_name
, which can modify which agent is used for that specfic type of task.
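A minimal sketch of how these pieces fit together, assuming ICE's Python `recipe` API (the `answer` function and its prompt are hypothetical):

```python
from ice.recipe import recipe


async def answer(question: str = "What is a placebo?") -> str:
    # The recipe names an atomic subtask of a predefined shape (a completion).
    # Which agent executes it (human or model) is determined by the global
    # mode and any RecipeSettings, not by the recipe itself.
    agent = recipe.agent()
    return await agent.complete(prompt=question)


recipe.main(answer)
```

Running this with `--mode machine` would dispatch the completion to a model-backed agent; a human mode would route the same subtask to a person.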
Setup:

- Install Docker Desktop.
- Add required secrets to `.env`; see `.env.example` for a model (and the hypothetical example after this list).
- Start ICE in its own terminal and leave it running:

  ```sh
  scripts/run-local.sh
  ```
- Go through the tutorial.
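As a hypothetical illustration of the secrets step above (the variable name is an assumption; `.env.example` is the authoritative list), `.env` might look like:

```sh
# .env: hypothetical contents; consult .env.example for the keys ICE actually requires
OPENAI_API_KEY=sk-...
```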
Run a recipe in machine mode:

```sh
scripts/run-recipe.sh --mode machine
```
You can run on the iteration gold standards of a specific recipe like this:

```sh
scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate
```
To run over multiple gold standard splits, provide them separated by spaces:

```sh
scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate validation
```
The streamlit apps require the streamlit variant of the Docker image:

```sh
STREAMLIT=1 scripts/run-local.sh
```

Run the streamlit apps like this:

```sh
scripts/run-streamlit.sh
```
This opens a multi-page app that lets you select specific scripts. To add a page, create a script in the `streamlits/pages` folder, as in the sketch below.
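As a hypothetical example of such a page (the file name and contents are illustrative; it uses the standard `streamlit` API):

```python
# streamlits/pages/my_page.py: hypothetical example page
import streamlit as st

st.title("My recipe explorer")
recipe_name = st.text_input("Recipe name", value="placebotree")
st.write(f"Selected recipe: {recipe_name}")
```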
When you run a recipe, ICE will evaluate the results based on the gold standards in `gold_standards/`. You'll see the results on-screen, and they'll be saved as CSVs in `data/evaluation_csvs/`. You can then upload the CSVs to the "Performance dashboard" and "Individual paper eval" tables in the ICE Airtable.
To evaluate in-app QA:

- Set up both `ice` and `elicit-next` so that they can run on your computer.
- Switch to the `eval` branch of `elicit-next`, or a branch based on the `eval` branch. This branch should contain the QA code and gold standards that you want to evaluate.
- If the `ice` QA gold standards (`gold_standards/gold_standards.csv`) may not be up-to-date, download this Airtable view (all rows, all fields) as a CSV and save it as `gold_standards/gold_standards.csv`.
- Duplicate the "All rows, all fields" view in Airtable, then in your duplicated view, filter to exactly the gold standards you'd like to evaluate and download it as a CSV. Save that CSV as `api/eval/gold_standards/gold_standards.csv` in `elicit-next`.
- Make sure `api/eval/papers` in `elicit-next` contains all of the gold standard papers you want to evaluate.
- In `ice`, run `scripts/eval-in-app-qa.sh <path to elicit-next>`. If you have `elicit-next` cloned as a sibling of `ice`, this would be `scripts/eval-in-app-qa.sh $(pwd)/../elicit-next/`.
This will generate the same sort of eval as for ICE recipes.
Recipes that use Torch require the torch variant of the Docker image:

```sh
TORCH=1 scripts/run-local.sh
```
Cheap integration tests:

```sh
scripts/run-recipe.sh --mode test
```

Unit tests:

```sh
scripts/run-tests.sh
```
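Unit tests are plain Python test functions; a minimal hypothetical example (assuming `scripts/run-tests.sh` invokes a pytest-style runner, and this file name is illustrative):

```python
# tests/test_example.py: hypothetical; shows only the shape of a unit test
def test_smoke():
    assert 1 + 1 == 2
```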
To add a new dependency:

- Manually add the dependency to `pyproject.toml` (see the illustrative snippet below).
- Update the lock file and install the changes:

  ```sh
  docker compose exec ice poetry lock --no-update
  docker compose exec ice poetry install  # if you're running a variant image, pass --extras streamlit or --extras torch
  ```
The lockfile update step will take about 15 minutes.
You do not need to stop, rebuild, and restart the docker containers.
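As an illustration of the first step (the package and version here are made up; add your line to the existing Poetry dependencies table), the `pyproject.toml` addition might look like:

```toml
[tool.poetry.dependencies]
# ...existing dependencies...
numpy = "^1.24"
```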
To upgrade poetry to a new version:
- In the Dockerfile, temporarily change `pip install -r poetry-requirements.txt` to `pip install poetry==DESIRED_VERSION` (see the sketch after these steps).
- Generate a new `poetry-requirements.txt`:

  ```sh
  docker compose -f docker-compose.yml -f docker-compose.build.yml up -d
  docker compose exec ice bash -c 'pip freeze > poetry-requirements.txt'
  ```

- Revert the Dockerfile changes.
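As a sketch of the temporary Dockerfile edit (the exact surrounding line and version number are assumptions):

```dockerfile
# Original line, commented out temporarily:
# RUN pip install -r poetry-requirements.txt
RUN pip install poetry==1.5.1
```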
Before making a PR, check linting, types, tests, etc.:

```sh
scripts/checks.sh
```
Reminder: Traces contain source code, so be sure you want to share all the code called by your recipe.
- Publish the trace to https://github.com/oughtinc/static and wait for the github-pages workflow to finish.
- Add the trace information to `ui/helpers/recipes.ts`.