gpustack/vox-box

Vox Box

A text-to-speech and speech-to-text server compatible with the OpenAI API, with Whisper, FunASR, Bark, and CosyVoice backends.

Requirements

Installation

You can install the project using pip:

pip install vox-box

# On macOS, manually install `openfst`, `pynini`, and `wetextprocessing` after installing `vox-box` so that `cosyvoice` works:
brew install openfst
export CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include
export LIBRARY_PATH=$(brew --prefix openfst)/lib
pip install pynini==2.1.6
pip install wetextprocessing==1.0.4.1

Usage

vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir ./cache/data-dir --host 0.0.0.0 --port 80

# Windows
vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\Users\michelia\AppData\Roaming\vox-box --host 0.0.0.0 --port 8082

Options

  • -d, --debug: Enable debug mode.
  • --host: Host to bind the server to. Default is 0.0.0.0.
  • --port: Port to bind the server to. Default is 80.
  • --model: Path to the model.
  • --device: Device to run the model on, e.g., cuda:0. Default is cpu.
  • --huggingface-repo-id: Hugging Face repo ID of the model.
  • --model-scope-model-id: ModelScope model ID of the model.
  • --data-dir: Directory in which to store downloaded model data. The default is OS-specific.
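
When the server is started from a script rather than a shell, the options above can be assembled into an argument list. The following is a minimal sketch that uses only the flags documented in this section; the helper name `build_start_command` and its defaults are illustrative and not part of vox-box.

```python
import shlex


def build_start_command(repo_id, host="0.0.0.0", port=80, device="cpu",
                        data_dir=None, debug=False):
    """Assemble the argv list for `vox-box start` (hypothetical helper)."""
    cmd = [
        "vox-box", "start",
        "--huggingface-repo-id", repo_id,
        "--host", host,
        "--port", str(port),
        "--device", device,
    ]
    # --data-dir is optional; when omitted, vox-box uses its OS-specific default.
    if data_dir is not None:
        cmd += ["--data-dir", data_dir]
    if debug:
        cmd.append("--debug")
    return cmd


def format_command(cmd):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the server without going through a shell, which avoids quoting problems with paths that contain spaces.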

Supported Models

| Model | Type | Link | Verified Platforms |
| --- | --- | --- | --- |
| Faster-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-large-v1 | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-medium | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-small | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-small.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-distil-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | macOS ✅ |
| Faster-distil-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | macOS ✅ |
| Faster-distil-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny.en | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh-streaming | speech-to-text | Hugging Face, ModelScope | Linux ✅, macOS ✅ |
| Paraformer-en | speech-to-text | Hugging Face, ModelScope | |
| Conformer-en | speech-to-text | Hugging Face, ModelScope | |
| SenseVoiceSmall | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Bark | text-to-speech | Hugging Face | Linux ✅, Windows, macOS ✅ |
| Bark-small | text-to-speech | Hugging Face | Linux ✅, Windows, macOS ✅ |
| CosyVoice2-0.5B | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-Instruct | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-SFT | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-25Hz | text-to-speech | ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |

Supported APIs

Create speech

Endpoint: POST /v1/audio/speech

Generates audio from the input text. Compatible with the OpenAI audio/speech API.

Example Request:

curl http://localhost/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice",
    "input": "Hello world",
    "voice": "English Female"
  }' \
  --output speech.mp3

Response: The audio file content.
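
The same request can be issued from Python with only the standard library. This is a sketch rather than an official client: the function names are hypothetical, the base URL assumes the default host and port, and the `Authorization` header is sent only if a key is supplied, since this README does not say whether the server checks it.

```python
import json
import urllib.request


def build_speech_request(text, voice, model="cosyvoice",
                         base_url="http://localhost", api_key=None):
    """Build the POST /v1/audio/speech request (hypothetical helper)."""
    payload = {"model": model, "input": text, "voice": voice}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Mirrors the curl example; whether the server validates it is an assumption.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def synthesize(text, voice, out_path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, voice, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

With a CosyVoice model running, `synthesize("Hello world", "English Female")` would be the equivalent of the curl call above.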

Create transcription

Endpoint: POST /v1/audio/transcriptions

Transcribes audio into the input language. Compatible with the OpenAI audio/transcription API.

Example Request:

curl http://localhost/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3"

Response:

{
  "text": "Hello world."
}
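
For callers that cannot shell out to curl, the multipart upload can be built by hand with the Python standard library. The sketch below makes several assumptions: the helper names are hypothetical, the file part is labeled `application/octet-stream`, and the response is taken to be the JSON object with a `text` field shown above.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(audio_bytes, filename, model,
                                base_url="http://localhost"):
    """Build the POST /v1/audio/transcriptions request (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    # Form field carrying the model name.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")
    # Headers of the file part; the raw audio bytes follow directly.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, model="whisper-large-v3", **kwargs):
    """Upload the audio file at path and return the transcribed text."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(audio, os.path.basename(path), model, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```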

List Models

Endpoint: GET /v1/models

Returns the currently running models.

Get Model

Endpoint: GET /v1/models/{model_id}

Returns the currently running model with the given ID.

Get Voices

Endpoint: GET /v1/voices

Returns the voices supported by the currently running model.
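
The three read-only endpoints above (`/v1/models`, `/v1/models/{model_id}`, and `/v1/voices`) can be queried with one small helper. This sketch assumes the responses are JSON; their exact schema is not documented here, so the helpers return the parsed body unchanged. All function names are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote


def endpoint_url(path, base_url="http://localhost"):
    """Join the server base URL and an API path."""
    return base_url.rstrip("/") + path


def model_url(model_id, base_url="http://localhost"):
    """URL for GET /v1/models/{model_id}; the ID is percent-encoded, '/' is kept."""
    return endpoint_url("/v1/models/" + quote(model_id), base_url)


def get_json(url):
    """Fetch a URL and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def list_models(base_url="http://localhost"):
    return get_json(endpoint_url("/v1/models", base_url))


def get_model(model_id, base_url="http://localhost"):
    return get_json(model_url(model_id, base_url))


def list_voices(base_url="http://localhost"):
    return get_json(endpoint_url("/v1/voices", base_url))
```

Calling `list_voices()` before a speech request is one way to find a valid value for the `voice` field.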

Health Check

Endpoint: GET /health

Returns the health check result for Vox Box.
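
Because model loading can take a while after `vox-box start`, a client may want to poll `/health` before sending requests. The sketch below assumes that a healthy server answers with HTTP 200; the response body is not documented here and is ignored. The function names, retry count, and delay are illustrative.

```python
import time
import urllib.request


def is_healthy(base_url="http://localhost", timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all derive from OSError.
        return False


def wait_until_healthy(base_url="http://localhost", attempts=30, delay=1.0,
                       check=is_healthy):
    """Poll the health check until it passes or the attempts run out."""
    for _ in range(attempts):
        if check(base_url):
            return True
        time.sleep(delay)
    return False
```

The `check` parameter exists so the polling loop can be exercised without a running server.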