Whisper command-line client, based on CTranslate2, that is compatible with the original OpenAI client.
It uses CTranslate2 and the faster-whisper Whisper implementation, which is up to 4 times faster than openai/whisper for the same accuracy while using less memory.
Goals of the project:
- Provide an easy way to use the CTranslate2 Whisper implementation
- Ease the migration for people using OpenAI Whisper CLI
Open dubbing is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages.
Check it out now: open-dubbing
To install the latest stable version, just type:
pip install -U whisper-ctranslate2
Alternatively, if you are interested in the latest development (non-stable) version from this repository, just type:
pip install git+https://github.com/Softcatala/whisper-ctranslate2
You can also use a prebuilt Docker image. First pull the image:
docker pull ghcr.io/softcatala/whisper-ctranslate2:latest
The Docker image includes the small, medium and large-v2 models.
To run it:
docker run --gpus "device=0" \
-v "$(pwd)":/srv/files/ \
-it ghcr.io/softcatala/whisper-ctranslate2:latest \
/srv/files/e2e-tests/gossos.mp3 \
--output_dir /srv/files/
Notes:
- --gpus "device=0" gives access to the GPU. If you do not have a GPU, remove this.
- "$(pwd)":/srv/files/ maps your current directory to /srv/files/ inside the container
GPU and CPU support are provided by CTranslate2.
It has compatibility with x86-64 and AArch64/ARM64 CPU and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.
GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation for details.
By default the best hardware available is selected for inference. You can use the --device and --device_index options to control the selection manually.
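As a minimal sketch of manual device selection, the helper below prints the flags to pass. The function name `device_flags` is hypothetical, the values `cuda` and `cpu` are assumed to be accepted by --device (check `whisper-ctranslate2 --help`), and the presence of `nvidia-smi` is only a rough heuristic for a usable GPU.

```shell
# Print device flags for whisper-ctranslate2 (hypothetical helper).
# Assumption: --device accepts "cuda" and "cpu".
device_flags() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Heuristic only: nvidia-smi on PATH suggests an NVIDIA GPU is present.
    # Index 0 selects the first GPU.
    printf '%s\n' "--device cuda --device_index 0"
  else
    printf '%s\n' "--device cpu"
  fi
}

# Example (left unquoted on purpose so the flags split into separate words):
# whisper-ctranslate2 myfile.mp3 $(device_flags)
```

On a multi-GPU machine you would change the index passed to --device_index rather than rely on the first device.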
Same command line as OpenAI Whisper.
To transcribe:
whisper-ctranslate2 inaguracio2011.mp3 --model medium
To translate:
whisper-ctranslate2 inaguracio2011.mp3 --model medium --task translate
The Whisper translate task translates the transcription from the source language into English (the only supported target language).
To list all the supported options with their help text, run:
whisper-ctranslate2 --help
On top of the OpenAI Whisper command line options, there are some specific options provided by CTranslate2 or whisper-ctranslate2.
Batched inference transcribes each segment independently, which can provide an additional 2x-4x speed increase:
whisper-ctranslate2 inaguracio2011.mp3 --batched True
You can additionally use the --batch_size option to specify the maximum number of parallel requests sent to the model for decoding.
Batched inference uses the Voice Activity Detection (VAD) filter and ignores the following parameters: compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, prompt_reset_on_temperature, prefix, hallucination_silence_threshold.
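Because those parameters are silently ignored, it can help to check a command line before switching it to batched mode. The sketch below is a hypothetical helper (`ignored_in_batched` is not part of the tool) that prints any flag from the list above. It assumes each parameter is exposed as a command line flag of the same name prefixed with `--`.

```shell
# Print every argument that batched inference ignores (hypothetical helper).
# Assumption: each parameter is a CLI flag named "--<parameter>".
ignored_in_batched() {
  for arg in "$@"; do
    case "$arg" in
      --compression_ratio_threshold|--logprob_threshold|--no_speech_threshold)
        printf '%s\n' "$arg" ;;
      --condition_on_previous_text|--prompt_reset_on_temperature)
        printf '%s\n' "$arg" ;;
      --prefix|--hallucination_silence_threshold)
        printf '%s\n' "$arg" ;;
    esac
  done
}

# Example: list the flags that will have no effect in this batched run.
# ignored_in_batched --batched True --batch_size 8 --prefix "Hello"
```

An empty result means none of the listed flags were passed.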
The --compute_type option, which accepts the values default, auto, int8, int8_float16, int16, float16 and float32, indicates the type of quantization to use. On CPU, int8 gives the best performance:
whisper-ctranslate2 myfile.mp3 --compute_type int8
The --model_directory option lets you specify the directory from which to load a CTranslate2 Whisper model, for example your own quantized Whisper model or your own fine-tuned Whisper version. The model must be in CTranslate2 format.
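A quick sanity check before pointing --model_directory at a folder can save a confusing failure. The sketch below uses a hypothetical helper, `is_ct2_model_dir`, and assumes that a converted CTranslate2 model directory contains a `model.bin` file. It does not validate the model itself, and `./my-whisper-ct2` is a placeholder path.

```shell
# Return 0 if the directory looks like a CTranslate2 model (hypothetical helper).
# Assumption: a converted model directory contains a file named model.bin.
is_ct2_model_dir() {
  [ -d "$1" ] && [ -f "$1/model.bin" ]
}

# Example with a placeholder path:
# if is_ct2_model_dir ./my-whisper-ct2; then
#   whisper-ctranslate2 myfile.mp3 --model_directory ./my-whisper-ct2
# else
#   echo "./my-whisper-ct2 does not look like a CTranslate2 model" >&2
# fi
```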
The --vad_filter option enables voice activity detection (VAD) to filter out parts of the audio without speech. This step uses the Silero VAD model:
whisper-ctranslate2 myfile.mp3 --vad_filter True
The VAD filter accepts multiple additional options to determine the filter behavior:
--vad_onset VALUE (float)
Probabilities above this value are considered as speech.
--vad_min_speech_duration_ms VALUE (int)
Final speech chunks shorter than min_speech_duration_ms are thrown out.
--vad_max_speech_duration_s VALUE (int)
Maximum duration of speech chunks in seconds. Longer chunks will be split at the timestamp of the last silence.
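The three options above can be combined with --vad_filter in one invocation. The sketch below assembles them with a hypothetical helper, `vad_flags`. The default numbers are illustrative placeholders, not recommended or documented defaults.

```shell
# Assemble VAD flags for whisper-ctranslate2 (hypothetical helper).
# $1: onset probability (float), $2: minimum speech duration in ms (int),
# $3: maximum speech duration in seconds (int).
# The fallback values are illustrative only.
vad_flags() {
  onset="${1:-0.5}"
  min_ms="${2:-250}"
  max_s="${3:-30}"
  printf '%s\n' "--vad_filter True --vad_onset $onset --vad_min_speech_duration_ms $min_ms --vad_max_speech_duration_s $max_s"
}

# Example (left unquoted on purpose so the flags split into separate words):
# whisper-ctranslate2 myfile.mp3 $(vad_flags 0.6 300 20)
```

Raising the onset value makes the filter stricter about what counts as speech, since only probabilities above it are kept.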
The --print_colors True option prints the transcribed text using an experimental color coding strategy based on whisper.cpp to highlight words with high or low confidence:
whisper-ctranslate2 myfile.mp3 --print_colors True
The --live_transcribe True option activates the live transcription mode from your microphone:
whisper-ctranslate2 --live_transcribe True --language en
There is experimental diarization support using pyannote.audio to identify speakers. At the moment, the support is at segment level.
To enable diarization you need to follow these steps:
- Install pyannote.audio with pip install pyannote.audio
- Accept the pyannote/segmentation-3.0 user conditions
- Accept the pyannote/speaker-diarization-3.1 user conditions
- Create an access token at hf.co/settings/tokens.
Then run the command, passing the Hugging Face API token as a parameter to enable diarization:
whisper-ctranslate2 --hf_token YOUR_HF_TOKEN
The name of the speaker is then added to the output files (e.g. JSON, VTT and SRT files):
[SPEAKER_00]: There is a lot of people in this room
The --speaker_name SPEAKER_NAME option lets you use your own string to identify the speaker.
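To keep the token out of scripts, one possible approach is to read it from an environment variable. The sketch below is a hypothetical helper, `diarization_flags`. The variable name `HF_TOKEN` is this example's own convention and is not assumed to be read by whisper-ctranslate2 itself, and `NARRATOR` is a placeholder speaker name.

```shell
# Print diarization flags, taking the token from $HF_TOKEN (hypothetical helper).
# $1 (optional): custom speaker name passed to --speaker_name.
diarization_flags() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  flags="--hf_token $HF_TOKEN"
  if [ -n "${1:-}" ]; then
    flags="$flags --speaker_name $1"
  fi
  printf '%s\n' "$flags"
}

# Example (left unquoted on purpose so the flags split into separate words):
# export HF_TOKEN=YOUR_HF_TOKEN
# whisper-ctranslate2 myfile.mp3 $(diarization_flags NARRATOR)
```

Note that the token still appears in the process arguments of the final command, so this only avoids hard-coding it in files.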
Check our frequently asked questions for common questions.
Jordi Mas jmas@softcatala.org