Preliminary implementation of the inference engine for OpenAssistant. This is strictly for local development, although you might have limited success using it as a basis for self-hosting OA. There is no guarantee that this will not change in the future; in fact, expect it to change.
The services of the inference stack are prefixed with "inference-" in the unified compose descriptor. Before building them, please ensure that you have Docker's new BuildKit backend enabled. See the FAQ for more info.
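If BuildKit is not already active on your setup, a common way to enable it for the classic CLI is via environment variables (a minimal sketch; recent Docker / Compose v2 releases already use BuildKit by default, so this may be unnecessary):

```bash
# Enable BuildKit for `docker build` and for docker-compose v1 builds
# (recent Docker / Compose v2 versions default to BuildKit already)
export DOCKER_BUILDKIT=1
export COMPOSE_DOCKER_CLI_BUILD=1
```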
To build the services, run:

```bash
docker compose --profile inference build
```
Spin up the stack:

```bash
docker compose --profile inference up -d
```
Tail the logs:

```bash
docker compose logs -f \
    inference-server \
    inference-worker
```
Note: The compose file contains the bind mounts enabling you to develop on the modules of the inference stack, and on the `oasst-shared` package, without rebuilding.
Note: You can change the model by editing the `MODEL_CONFIG_NAME` variable in the `docker-compose.yaml` file. Valid model names can be found in `model_configs.py`.
Note: You can spin up any number of workers by adjusting the number of replicas of the `inference-worker` service to your liking.
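If you prefer not to edit the compose file, the `--scale` flag of `docker compose up` achieves the same thing; a sketch assuming two workers are wanted:

```bash
# Start (or update) the stack with two inference-worker replicas
docker compose --profile inference up -d --scale inference-worker=2
```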
Note: Please wait for the `inference-text-generation-server` service to output `{"message":"Connected"}` before starting to chat.
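One way to watch for that message (a minimal sketch, using the service name from the note above):

```bash
# Follow the text-generation server logs and stop at the first "Connected" line
docker compose logs -f inference-text-generation-server | grep -m1 '"message":"Connected"'
```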
Run the text client and start chatting:

```bash
cd text-client
pip install -r requirements.txt
python __main__.py
# You'll soon see a `User:` prompt, where you can type your prompts.
```
We run distributed load tests using the locust Python package.

```bash
pip install locust
cd tests/locust
locust
```
Navigate to http://0.0.0.0:8089/ to view the locust UI.
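Locust can also run headless from the command line, which is handy for quick smoke tests; a sketch with made-up user counts and duration, assuming the inference server listens on localhost:8000 as in the docs command below:

```bash
# 10 simulated users, spawned at 2/s, running for 1 minute without the web UI
locust --headless --host http://localhost:8000 -u 10 -r 2 -t 1m
```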
To update the API docs, run the command below once the inference server is running; it downloads the inference OpenAPI JSON into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
```
Then make a PR to have the updated docs merged.