Commit
Merge branch 'main' of https://github.com/oughtinc/ice into reppertj/…
…add-prompt-helper
Showing 43 changed files with 545 additions and 384 deletions.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
.gitignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
[flake8] | ||
ignore = E203,W503,E501,E712,E211 | ||
ignore = E203,W503,E501,E712,E211,F403,F405 | ||
max-line-length = 120 | ||
exclude = __pycache__,node_modules |
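The two codes added to `ignore` above, F403 and F405, are the pyflakes checks that flag wildcard imports and names that may come from them. The sketch below shows the pattern that would trigger both checks if they were not ignored. The standard-library `math` module is only a stand-in here; the diff does not show which modules in the repository rely on star imports.

```python
# Without F403 in the ignore list, flake8 flags this wildcard import
# because it cannot tell which names the import defines.
from math import *


def hypotenuse(a: float, b: float) -> float:
    # Without F405 in the ignore list, flake8 flags `sqrt` here as a name
    # that may be undefined or may come from the star import above.
    return sqrt(a * a + b * b)


print(hypotenuse(3.0, 4.0))
```

Ignoring both codes project-wide is convenient when many modules re-export names this way, at the cost of flake8 no longer catching genuinely undefined names in those modules.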
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,20 @@ | ||
secrets | ||
__pycache__ | ||
.coverage* | ||
store/* | ||
.envrc | ||
.env | ||
**/.vscode/* | ||
data | ||
.mypy_cache | ||
.pre-commit-home | ||
log*.txt | ||
.dmypy.json | ||
**/__pycache__ | ||
**/.DS_Store | ||
**/.vscode | ||
**/*.pickle | ||
**/*.pt | ||
**/*.sqlite | ||
/.coverage | ||
/.env | ||
/.git | ||
/.mypy_cache | ||
/.pre-commit-home | ||
/data | ||
/papers/*.pdf | ||
/papers/*.txt | ||
!/papers/keenan-2018-tiny.txt | ||
!/papers/keenan-2018.pdf | ||
/store | ||
/ui/.next/ | ||
/ui/node_modules | ||
/ui/public/traces/*.jsonl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2022 Ought | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,165 +1,57 @@ | ||
# Interactive Composition Explorer 🧊 | ||
|
||
Decomposition of paper Q&A using humans and language models | ||
|
||
## Table of contents | ||
|
||
- [Design principles](#design-principles) | ||
- [Running ICE locally](#running-ice-locally) | ||
- [Requirements](#requirements) | ||
- [Setup](#setup) | ||
- [Running ICE on the command line](#running-ice-on-the-command-line) | ||
- [Human data collection](#human-data-collection) | ||
- [GPT](#gpt) | ||
- [Streamlit](#streamlit) | ||
- [Evaluation](#evaluation) | ||
- [Evaluate in-app QA results](#evaluate-in-app-qa-results) | ||
- [Development](#development) | ||
- [Adding new Python dependencies](#adding-new-python-dependencies) | ||
- [Contributions](#contributions) | ||
|
||
## Design principles | ||
ICE is a Python library and trace visualizer for language model programs. | ||
|
||
- **Recipes** are decompositions of a task into subtasks. | ||
|
||
The meaning of a recipe is: If a human executed these steps and did a good job at each workspace in isolation, the overall answer would be good. This decomposition may be informed by what we think ML can do at this point, but the recipe itself (as an abstraction) doesn’t know about specific agents. | ||
## Screenshot | ||
|
||
- **Agents** perform atomic subtasks of predefined shapes, like completion, scoring, or classification. | ||
<p align="center"> | ||
<img alt="ice-screenshot" src="https://user-images.githubusercontent.com/382515/192681645-6ed87072-2dc6-4982-92d1-8de209bc3ef6.png" /> | ||
Execution trace visualized in ICE | ||
</p> | ||
|
||
Agents don't know which recipe is calling them. Agents don’t maintain state between subtasks. Agents generally try to complete all subtasks they're asked to complete (however badly), but some will not have implementations for certain task types. | ||
## Features | ||
|
||
- The **mode** in which a recipe runs is a global setting that can affect every agent call. For instance, whether to use humans or agents. Recipes can also run with certain `RecipeSettings`, which can map a task type to a specific `agent_name`, which can modify which agent is used for that specific type of task. | ||
- Run language model recipes in different modes: humans, human+LM, LM | ||
- Inspect the execution traces in your browser for debugging | ||
- Define and use new language model agents, e.g. chain-of-thought agents | ||
- Run recipes quickly by parallelizing language model calls | ||
- Reuse component recipes such as question-answering, ranking, and verification | ||
|
||
## Running ICE locally | ||
|
||
### Prerequisites | ||
## Getting started | ||
|
||
1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) | ||
|
||
### Setup | ||
|
||
1. Add required secrets to `.env`. See `.env.example` for a model. | ||
1. Add required secrets to `.env`. See [`.env.example`](https://github.com/oughtinc/ice/blob/main/.env.example) for a model. | ||
|
||
1. Start ICE in its own terminal and leave it running: | ||
|
||
```sh | ||
scripts/run-local.sh | ||
``` | ||
|
||
1. Go through [the tutorial](https://primer.ought.org/). | ||
|
||
### Advanced command line usage | ||
|
||
#### Gold standards | ||
|
||
```sh | ||
scripts/run-recipe.sh --mode machine | ||
``` | ||
|
||
You can run on the iteration gold standards of a specific recipe like this: | ||
|
||
```sh | ||
scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate | ||
``` | ||
|
||
To run over multiple gold standard splits, just provide them separated by spaces: | ||
|
||
```sh | ||
scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate validation | ||
``` | ||
|
||
### Streamlit | ||
|
||
These require the streamlit variant of the Docker image: | ||
|
||
```sh | ||
STREAMLIT=1 scripts/run-local.sh | ||
``` | ||
|
||
Run the streamlit apps like this: | ||
|
||
```sh | ||
scripts/run-streamlit.sh | ||
``` | ||
|
||
This opens a multi-page app that lets you select specific scripts. | ||
|
||
To add a page, simply create a script in the `streamlits/pages` folder. | ||
1. Go through [the primer](https://primer.ought.org/). | ||
|
||
### Evaluation | ||
## Terminology | ||
|
||
When you run a recipe, ICE will evaluate the results based on the gold standards in `gold_standards/`. You'll see the results on-screen, and they'll be saved as CSVs in `data/evaluation_csvs/`. You can then upload the CSVs to the "Performance dashboard" and "Individual paper eval" tables in the [ICE Airtable](https://airtable.com/app4Fo26j2vGYufCe/tblkFq839UrBrj9P9/viwDkUqYMQtDAl773?blocks=hide). | ||
|
||
#### Evaluate in-app QA results | ||
|
||
1. Set up both `ice` and `elicit-next` so that they can run on your computer | ||
2. Switch to the `eval` branch of `elicit-next`, or a branch from the eval branch. This branch should contain the QA code and gold standards that you want to evaluate. | ||
3. If the `ice` QA gold standards (`gold_standards/gold_standards.csv`) may not be up-to-date, download [this Airtable view](https://airtable.com/app4Fo26j2vGYufCe/tbl0JN0LFtDi5SrS5/viws799VwN4AXMNii?blocks=hide) (all rows, all fields) as a CSV and save it as `gold_standards/gold_standards.csv` | ||
4. Duplicate the [All rows, all fields](https://airtable.com/app4Fo26j2vGYufCe/tbl0JN0LFtDi5SrS5/viws799VwN4AXMNii?blocks=hide) view in Airtable, then in your duplicated view, filter to exactly the gold standards you'd like to evaluate and download it as a CSV. Save that CSV as `api/eval/gold_standards/gold_standards.csv` in `elicit-next` | ||
5. Make sure `api/eval/papers` in `elicit-next` contains all of the gold standard papers you want to evaluate | ||
6. In `ice`, run `scripts/eval-in-app-qa.sh <path to elicit-next>`. If you have `elicit-next` cloned as a sibling of `ice`, this would be `scripts/eval-in-app-qa.sh $(pwd)/../elicit-next/`. | ||
|
||
This will generate the same sort of eval as for ICE recipes. | ||
|
||
### Using PyTorch | ||
|
||
```sh | ||
TORCH=1 scripts/run-local.sh | ||
``` | ||
|
||
## Development | ||
|
||
### Running tests | ||
|
||
Cheap integration tests: | ||
|
||
```sh | ||
scripts/run-recipe.sh --mode test | ||
``` | ||
|
||
Unit tests: | ||
|
||
```sh | ||
scripts/run-tests.sh | ||
``` | ||
|
||
### Adding new Python dependencies | ||
|
||
1. Manually add the dependency to `pyproject.toml` | ||
2. Update the lock file and install the changes: | ||
|
||
```sh | ||
docker compose exec ice poetry lock --no-update | ||
docker compose exec ice poetry install # if you're running a variant image, pass --extras streamlit or --extras torch | ||
``` | ||
|
||
The lockfile update step will take about 15 minutes. | ||
- **Recipes** are decompositions of a task into subtasks. | ||
|
||
You **do not** need to stop, rebuild, and restart the docker containers. | ||
The meaning of a recipe is: If a human executed these steps and did a good job at each workspace in isolation, the overall answer would be good. This decomposition may be informed by what we think ML can do at this point, but the recipe itself (as an abstraction) doesn’t know about specific agents. | ||
|
||
### Upgrading poetry | ||
- **Agents** perform atomic subtasks of predefined shapes, like completion, scoring, or classification. | ||
|
||
To upgrade poetry to a new version: | ||
Agents don't know which recipe is calling them. Agents don’t maintain state between subtasks. Agents generally try to complete all subtasks they're asked to complete (however badly), but some will not have implementations for certain task types. | ||
|
||
1. In the Dockerfile, temporarily change `pip install -r poetry-requirements.txt` to `pip install poetry==DESIRED_VERSION` | ||
2. Generate a new `poetry-requirements.txt`: | ||
```sh | ||
docker compose -f docker-compose.yml -f docker-compose.build.yml up -d | ||
docker compose exec ice bash -c 'pip freeze > poetry-requirements.txt' | ||
``` | ||
3. Revert the Dockerfile changes | ||
- The **mode** in which a recipe runs is a global setting that can affect every agent call. For instance, whether to use humans or agents. Recipes can also run with certain `RecipeSettings`, which can map a task type to a specific `agent_name`, which can modify which agent is used for that specific type of task. | ||
|
||
### Contributions | ||
## Contributions | ||
|
||
Before making a PR, check linting, types, tests, etc: | ||
ICE is an [open-source](https://github.com/oughtinc/ice/blob/main/LICENSE.md) project by [Ought](https://ought.org/). We're an applied ML lab building the AI research assistant [Elicit](https://elicit.org/). | ||
|
||
```sh | ||
scripts/checks.sh | ||
``` | ||
We welcome community contributions: | ||
|
||
## Sharing recipe traces | ||
- If you're a developer, you can dive into the codebase and help us fix bugs, improve code quality and performance, or add new features. | ||
- If you're a language model researcher, you can help us add new agents or improve existing ones, and refine or create new recipes and recipe components. | ||
|
||
Reminder: Traces contain source code, so be sure you want to share all the code called by your recipe. | ||
For larger contributions, make an issue for discussion before submitting a PR. | ||
|
||
1. Publish the trace to https://github.com/oughtinc/static and wait for the github-pages workflow to finish. | ||
2. Add the trace information to `ui/helpers/recipes.ts`. | ||
And for even larger contributions, join us - [we're hiring](https://ought.org/careers)! |
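The Terminology section in the README diff above describes recipes as task decompositions, agents as stateless workers that do not know which recipe calls them, and the mode as a global setting that selects who handles each call. The sketch below illustrates those three ideas together, along with the "parallelizing language model calls" feature. It is not ICE's actual API: `Agent`, `EchoMachineAgent`, `ScriptedHumanAgent`, `AGENTS_BY_MODE`, and `answer_questions` are all hypothetical names, and no real language model or human interface is involved.

```python
import asyncio
from typing import Protocol


class Agent(Protocol):
    """Hypothetical agent interface: stateless, unaware of the calling recipe."""

    async def complete(self, prompt: str) -> str: ...


class EchoMachineAgent:
    """Stand-in for a language model agent (no real model is called)."""

    async def complete(self, prompt: str) -> str:
        return f"machine answer to: {prompt}"


class ScriptedHumanAgent:
    """Stand-in for a human agent; returns a canned reply instead of asking anyone."""

    async def complete(self, prompt: str) -> str:
        return f"human answer to: {prompt}"


# The mode acts as a global switch that decides which agent handles every call.
AGENTS_BY_MODE: dict[str, Agent] = {
    "machine": EchoMachineAgent(),
    "human": ScriptedHumanAgent(),
}


async def answer_questions(questions: list[str], mode: str) -> list[str]:
    """Hypothetical recipe: one subtask per question, with all subtasks run concurrently."""
    agent = AGENTS_BY_MODE[mode]
    # asyncio.gather preserves the order of its inputs in the returned results.
    return await asyncio.gather(*(agent.complete(q) for q in questions))


answers = asyncio.run(
    answer_questions(["Was a placebo used?", "How many participants?"], mode="machine")
)
print(answers)
```

Because the recipe only depends on the `Agent` interface, switching `mode` from `"machine"` to `"human"` changes who answers each subtask without touching the recipe itself, which mirrors the separation the README describes.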
This file was deleted.
This file was deleted.
File renamed without changes.
This file was deleted.