forked from openvinotoolkit/openvino.genai
# Add whisper pipeline - Initial commit (openvinotoolkit#789)
This is a work-in-progress PR. Todos:

- [x] use WhisperFeatureExtractor for audio preprocessing
- [x] compute `assets/whisper/mel_filters_data.bin` on initialization
- [x] move wav reader to sample utils
- [ ] Longer audio inputs (>30s): poor quality results at chunking borders. Long audio inputs are split into 30 s chunks, which loses context at the chunk border. This could be partially solved by [chunking with stride](https://huggingface.co/blog/asr-chunking).
- [ ] add perf metrics
- [x] update docstrings
- [ ] update documentation
- [x] add python bindings
- [x] add tests
- [ ] add cpp, python samples tests
- [x] fix win build
- [x] fetch `dr_wav.h` with `FetchContent`
- [ ] support different languages, language autodetection
- [ ] support translation
- [ ] support timestamps
- [x] remove constructor with infer requests
- [x] rename pipeline to WhisperPipeline
- [ ] Whisper pipeline doesn't need a tokenizer; it uses the detokenizer only. Implement detokenizer-only initialization for `ov::genai::Tokenizer`
- [ ] Check discrete GPU. Integrated GPU works as expected.
- [ ] Investigate use of `RemoteTensor` for GPU
- [ ] Add batch
- [ ] Add sampler, inherit WhisperGenerationConfig from GenerationConfig

Current limitations:

- No resampling during preprocessing. Input raw speech should have a 16 kHz sampling rate.
- No normalization during preprocessing. Input raw speech should be normalized to near the [-1, 1] range.

Tickets: CVS-147994, CVS-146010, CVS-152522
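The "chunking with stride" idea referenced in the todo list splits long audio into fixed-size windows that overlap their neighbours, so context lost at a hard 30 s border can be recovered by reconciling overlapping decodes. A minimal sketch of the windowing step (a hypothetical helper, not the pipeline's actual implementation, which currently uses plain non-overlapping 30 s chunks):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Split raw speech into windows of `chunk_len` samples that overlap their
// neighbours by `stride` samples on each side. Requires chunk_len > 2 * stride.
// Hypothetical helper for illustration only; reconciling the overlapping
// decoded text is a separate step not shown here.
std::vector<std::vector<float>> chunk_with_stride(const std::vector<float>& speech,
                                                  std::size_t chunk_len,
                                                  std::size_t stride) {
    std::vector<std::vector<float>> chunks;
    const std::size_t step = chunk_len - 2 * stride;  // distance between window starts
    for (std::size_t start = 0; start < speech.size(); start += step) {
        const std::size_t end = std::min(start + chunk_len, speech.size());
        chunks.emplace_back(speech.begin() + start, speech.begin() + end);
        if (end == speech.size())
            break;  // last window reaches the end of the signal
    }
    return chunks;
}
```

For 16 kHz audio, a 30 s window with a 5 s stride would be `chunk_with_stride(speech, 30 * 16000, 5 * 16000)`.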
1 parent d831e64, commit 7b81bcb. 31 changed files with 2,511 additions and 178 deletions.
New CMake build script for the sample (+38 lines):

```cmake
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

find_package(OpenVINOGenAI REQUIRED PATHS
    "${CMAKE_BINARY_DIR}"  # Reuse the package from the build.
    ${OpenVINO_DIR}  # GenAI may be installed alongside OpenVINO.
)

if(POLICY CMP0135)
    cmake_policy(SET CMP0135 NEW)
endif()

if(POLICY CMP0169)
    cmake_policy(SET CMP0169 OLD)
endif()

include(FetchContent)

if(NOT TARGET dr_libs)
    FetchContent_Declare(dr_libs
        URL https://github.com/mackron/dr_libs/archive/da35f9d6c7374a95353fd1df1d394d44ab66cf01.tar.gz
        URL_HASH SHA256=2704d347f480ca1bc92233fb01747e4550cc8031735b6ea62ca9990ebb8851ae)
    FetchContent_MakeAvailable(dr_libs)
endif()

add_executable(whisper_speech_recognition whisper_speech_recognition.cpp audio_utils.cpp)
target_link_libraries(whisper_speech_recognition PRIVATE openvino::genai)
target_include_directories(whisper_speech_recognition PRIVATE "$<BUILD_INTERFACE:${dr_libs_SOURCE_DIR}>")
set_target_properties(whisper_speech_recognition PROPERTIES
    COMPILE_PDB_NAME whisper_speech_recognition
    # Ensure out of box LC_RPATH on macOS with SIP
    INSTALL_RPATH_USE_LINK_PATH ON)
target_compile_features(whisper_speech_recognition PRIVATE cxx_std_11)

install(TARGETS whisper_speech_recognition
    RUNTIME DESTINATION samples_bin/
    COMPONENT samples_bin
    EXCLUDE_FROM_ALL)
```
New README for the sample (+38 lines):
# Whisper automatic speech recognition sample

This example showcases inference of Whisper speech recognition models. The application doesn't have many configuration options, to encourage the reader to explore and modify the source code; for example, change the device for inference to GPU. The sample features `ov::genai::WhisperPipeline` and uses an audio file in WAV format as an input source.

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../requirements.txt](../../requirements.txt) for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base
```

## Prepare audio file

Prepare an audio file in WAV format with a 16 kHz sampling rate.
## Run

`whisper_speech_recognition whisper-base sample.wav`

Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM; for example, a larger model such as `openai/whisper-large-v3` can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU.

Models can be downloaded from [OpenAI on Hugging Face](https://huggingface.co/openai).

### Troubleshooting

#### Empty or rubbish output

Example output:
```
----------------
```

To resolve this, ensure that the audio data has a 16 kHz sampling rate.
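Since the sample does no resampling of its own, a recording at another rate must be converted to 16 kHz beforehand. A naive linear-interpolation resampler is sketched below for illustration only; a production resampler (e.g. windowed-sinc, as in libsamplerate) gives noticeably better quality:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive linear-interpolation resampler: maps each output sample position back
// into the input signal and interpolates between the two nearest input samples.
// A sketch only; prefer a windowed-sinc resampler for real use.
std::vector<float> resample_linear(const std::vector<float>& in,
                                   double src_rate, double dst_rate) {
    if (in.empty() || src_rate <= 0.0 || dst_rate <= 0.0)
        return {};
    const std::size_t out_len =
        static_cast<std::size_t>(in.size() * dst_rate / src_rate);
    std::vector<float> out(out_len);
    for (std::size_t i = 0; i < out_len; ++i) {
        const double pos = i * src_rate / dst_rate;      // position in input signal
        const std::size_t i0 = static_cast<std::size_t>(pos);
        const std::size_t i1 = std::min(i0 + 1, in.size() - 1);
        const double frac = pos - static_cast<double>(i0);
        out[i] = static_cast<float>((1.0 - frac) * in[i0] + frac * in[i1]);
    }
    return out;
}
```

For example, `resample_linear(speech, 44100.0, 16000.0)` would downsample a 44.1 kHz recording to the 16 kHz the sample expects.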