[LLM_BENCH] Whisper pipeline (openvinotoolkit#1025)
tickets: CVS-152393

[whisper_tiny_genai_API.txt](https://github.com/user-attachments/files/17629272/whisper_tiny_genai_API.txt)

[whisper_tiny_optimum-intel_API.txt](https://github.com/user-attachments/files/17629275/whisper_tiny_optimum-intel_API.txt)

---------

Co-authored-by: Xiping Yan <xiping.yan@intel.com>
Co-authored-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Andrei Kochin <andrei.kochin@intel.com>
Co-authored-by: Ekaterina Aidova <ekaterina.aidova@intel.com>
5 people authored Nov 5, 2024
1 parent cf9ec64 commit 2370f6a
Showing 17 changed files with 656 additions and 116 deletions.
11 changes: 11 additions & 0 deletions .github/workflows/llm_bench-python.yml
@@ -82,6 +82,17 @@ jobs:
run: |
wget -O ./ov_models/soulcard.safetensors https://civitai.com/api/download/models/72591
python ./tools/llm_bench/benchmark.py -m ./ov_models/dreamlike-art-dreamlike-anime-1.0/FP16/ -pf ./tools/llm_bench/prompts/stable-diffusion.jsonl -d cpu -n 1 --genai --lora ./ov_models/soulcard.safetensors --lora_alphas 0.7
- name: Test whisper-tiny on Linux
run: |
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch main --single-branch https://huggingface.co/datasets/facebook/multilingual_librispeech
cd multilingual_librispeech
git lfs pull -I /data/mls_polish/train/audio/3283_1447_000.tar.gz
mkdir data/mls_polish/train/audio/3283_1447_000
tar zxvf data/mls_polish/train/audio/3283_1447_000.tar.gz -C data/mls_polish/train/audio/3283_1447_000/
cd ..
optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny ./ov_models/whisper-tiny
python ./tools/llm_bench/benchmark.py -m ./ov_models/whisper-tiny --media multilingual_librispeech/data/mls_polish/train/audio/3283_1447_000/3283_1447_000000.flac -d cpu -n 1
python ./tools/llm_bench/benchmark.py -m ./ov_models/whisper-tiny --media multilingual_librispeech/data/mls_polish/train/audio/3283_1447_000/3283_1447_000000.flac -d cpu -n 1 --genai
- name: WWB Tests
run: |
GIT_CLONE_PROTECTION_ACTIVE=false pip install -r ${{ env.WWB_PATH }}/requirements.txt
2 changes: 1 addition & 1 deletion tools/llm_bench/README.md
@@ -170,4 +170,4 @@ OpenVINO is by default built with [oneTBB](https://github.com/oneapi-src/oneTBB/
## 7. Additional Resources

- **Error Troubleshooting:** Check the [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.
- **Prompt file syntax and attributes:** Refer to [PROMPT.md](./doc/PROMPT.md) for how to write a prompt file.
3 changes: 3 additions & 0 deletions tools/llm_bench/benchmark.py
@@ -15,6 +15,7 @@
import task.text_generation as bench_text
import task.image_generation as bench_image
import task.super_resolution_generation as bench_ldm_sr
import task.speech_to_text_generation as bench_speech

DEFAULT_TORCH_THREAD_NUMS = 16
mem_consumption = MemConsumption()
@@ -46,6 +47,7 @@ def get_argprser():
help='Prompt file(s) in jsonl format. Multiple prompt files should be separated with space(s).')
parser.add_argument('-pi', '--prompt_index', nargs='+', type=num_iters_type, default=None,
help='Run the specified prompt index. You can specify multiple prompt indexes, separated by spaces.')
parser.add_argument('--media', default=None, help='Media file path for speech or visual models.')
parser.add_argument(
'-ic',
'--infer_count',
@@ -153,6 +155,7 @@ def get_argprser():
'image_gen': bench_image.run_image_generation_benchmark,
'code_gen': bench_text.run_text_generation_benchmark,
'ldm_super_resolution': bench_ldm_sr.run_ldm_super_resolution_benchmark,
'speech2text': bench_speech.run_speech_2_txt_benchmark,
}


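For context, a minimal sketch of how the resolved use-case string selects a benchmark function from the mapping above. This is not the exact benchmark.py flow; the mapping name `CASE_TO_BENCH` and the call signature are assumptions used for illustration.

```python
# Hedged sketch: assumes the dict above is named CASE_TO_BENCH and that the use case
# (e.g. 'speech2text' for whisper models, per USE_CASES in config_class.py) has
# already been detected from the model name.
def run_for_use_case(use_case, model_path, args):
    bench_fn = CASE_TO_BENCH[use_case]  # e.g. bench_speech.run_speech_2_txt_benchmark
    return bench_fn(model_path, args)   # call signature is an assumption
```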
74 changes: 74 additions & 0 deletions tools/llm_bench/doc/NOTES.md
@@ -0,0 +1,74 @@
# Notes
## chatglm2-6b - AttributeError: can't set attribute
If you download chatglm2-6b from Hugging Face, convert it to OpenVINO IR files, and run it with benchmark.py, the following error may occur:
```bash
AttributeError: can't set attribute
```
Reproduced with https://huggingface.co/THUDM/chatglm2-6b at commit 7fabe56db91e085c9c027f56f1c654d137bdba40 <br />
As discussed in https://huggingface.co/THUDM/chatglm2-6b/discussions/99 <br />
Solution: update `tokenization_chatglm.py` as follows: <br />
```Python
self.vocab_file = vocab_file
self.tokenizer = SPTokenizer(vocab_file)
+ kwargs.pop("eos_token", None)
+ kwargs.pop("pad_token", None)
+ kwargs.pop("unk_token", None)
self.special_tokens = {
"<bos>": self.tokenizer.bos_id,
"<eos>": self.tokenizer.eos_id,
```
> The solution works for chatglm3-6b as well.
## Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled
When converting Qwen-7B-Chat-Int4 to OpenVINO IR files with convert.py, the following error may occur:
```bash
raise AssertionError("Torch not compiled with CUDA enabled")
```
Reproduced with https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 at commit 8750247cc50f2a7bb84bef322f7707159b700723 <br />
Solution: update `modeling_qwen.py` as follows: <br />
```Python
-SUPPORT_CUDA = torch.cuda.is_available()
+SUPPORT_CUDA = False
SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
```
## Baichuan2-7B-Chat - AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
When converting Baichuan2-7B-Chat to OpenVINO IR files with convert.py, the following error may occur:
```bash
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
```
Reproduced with https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat at commit 84603cde5ebffb6084e476cfaeceaf0b8b91fe54 <br />
See https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/discussions/2 <br />
Solution: update `tokenization_baichuan.py` as follows: <br />
```Python
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+ self.vocab_file = vocab_file
+ self.add_bos_token = add_bos_token
+ self.add_eos_token = add_eos_token
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(vocab_file)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
- self.vocab_file = vocab_file
- self.add_bos_token = add_bos_token
- self.add_eos_token = add_eos_token
- self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
- self.sp_model.Load(vocab_file)
```
## CompressWeights Mode INT4 - ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
When converting an LLM downloaded from Hugging Face to OpenVINO IR files with convert.py using the INT4 CompressWeights mode, the following error may occur:
```bash
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
```
Root cause: the wikitext dataset was not downloaded correctly, or the Hugging Face Hub could not be reached. <br />
Solution: <br />
As described in https://huggingface.co/docs/datasets/loading#arrow, copy the wikitext dataset to the `~/.cache/huggingface/datasets/` folder and set the environment variable `HF_DATASETS_OFFLINE` to 1.
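A minimal shell sketch of this workaround (the `./wikitext` source path is an assumption; point it at your local copy of the dataset):
```bash
# Copy a locally downloaded wikitext dataset into the Hugging Face datasets cache
# and force offline mode so INT4 data-aware compression does not try to reach the Hub.
mkdir -p ~/.cache/huggingface/datasets/
cp -r ./wikitext ~/.cache/huggingface/datasets/
export HF_DATASETS_OFFLINE=1
```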
39 changes: 39 additions & 0 deletions tools/llm_bench/doc/PROMPT.md
@@ -0,0 +1,39 @@
> [!NOTE]
> Currently llm_bench only supports prompt files in JSON Lines format with the suffix `.jsonl`.
> If no prompt file is provided, a default prompt is used.
> A prompt file can contain multiple prompts; use the `--prompt_index` option to select which prompts to run.
## 1. Text Generation
Supported parameters that can be set are:
* `prompt` - input prompt text for the text generation

Prompt file example:
```json
{"prompt": "what is openvino?"}
{"prompt": "A chat between a curious user and an artificial intelligence assistant."}
```

## 2. Stable Diffusion
Supported parameters that can be set are:
* `steps` - inference steps (default 20)
* `width` - resolution width (default 512)
* `height` - resolution height (default 512)
* `guidance_scale` - guidance scale
* `prompt` - input prompt text for the image generation

Prompt file example:
```json
{"steps":"10", "width":"256", "height":"256", "guidance_scale":"1.0", "prompt": "side profile centered painted portrait, Gandhi rolling a blunt, Gloomhaven, matte painting concept art, art nouveau, 8K HD Resolution, beautifully background"}
```

## 3. LDM Super Resolution
Supported parameters that can be set are:
* `steps` - inference steps (default 50)
* `width` - resize image width (default 128)
* `height` - resize image height (default 128)
* `prompt` - image path

Prompt file example:
```json
{"steps": "20", "width": "256", "height": "256", "prompt": "./image_256x256_size/4.png"}
```

## 4. Whisper
Supported parameters that can be set are:
* `media` - audio file path
* `language` - language of the audio (default `<|en|>`)
* `timestamp` - whether to return timestamps (default `true`)

Prompt file example:
```json
{"media": "./audio/intel_ad_90s_128kbps.mp3", "language": "<|en|>", "timestamp": false}
{"media": "./audio/intel_ad_120s_128kbps.mp3", "language": "<|en|>", "timestamp": true}
```
6 changes: 4 additions & 2 deletions tools/llm_bench/llm_bench_utils/config_class.py
@@ -9,7 +9,8 @@
OVModelForSeq2SeqLM,
OVStableDiffusionPipeline,
OVLatentConsistencyModelPipeline,
OVStableDiffusionXLPipeline
OVStableDiffusionXLPipeline,
OVModelForSpeechSeq2Seq
)
from llm_bench_utils.ov_model_classes import OVMPTModel, OVLDMSuperResolutionPipeline, OVChatGLMModel

@@ -41,6 +42,7 @@
'chatglm2': OVModelForCausalLM,
'chatglm3': OVModelForCausalLM,
'chatglm': OVChatGLMModel,
'whisper': OVModelForSpeechSeq2Seq,
}

PT_MODEL_CLASSES_MAPPING = {
@@ -56,7 +58,7 @@

USE_CASES = {
'image_gen': ['stable-diffusion-', 'ssd-', 'deepfloyd-if', 'tiny-sd', 'small-sd', 'lcm-', 'sdxl', 'dreamlike'],
'text2speech': ['whisper'],
'speech2text': ['whisper'],
'image_cls': ['vit'],
'code_gen': ['replit', 'codegen2', 'codegen', 'codet5', "stable-code"],
'text_gen': [
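As a hedged illustration of what the new `'whisper': OVModelForSpeechSeq2Seq` entry above selects, a Whisper IR exported with optimum-cli could be loaded roughly like this. The paths are assumptions, not part of the change:
```python
# Minimal sketch, assuming ./ov_models/whisper-tiny was produced by
# `optimum-cli export openvino --model openai/whisper-tiny ./ov_models/whisper-tiny`.
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor

model = OVModelForSpeechSeq2Seq.from_pretrained("./ov_models/whisper-tiny", device="CPU")
processor = AutoProcessor.from_pretrained("./ov_models/whisper-tiny")
```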
23 changes: 4 additions & 19 deletions tools/llm_bench/llm_bench_utils/gen_output_data.py
@@ -5,7 +5,6 @@

def gen_iterate_data(
iter_idx='',
loop_idx='',
in_size='',
infer_count='',
out_size='',
@@ -17,37 +16,23 @@ def gen_iterate_data(
max_uss_mem='',
prompt_idx='',
tokenization_time=[],
loop_data=None
):
iter_data = {}
iter_data['iteration'] = iter_idx
iter_data['loop_idx'] = loop_idx
iter_data['input_size'] = in_size
iter_data['infer_count'] = infer_count
iter_data['output_size'] = out_size
iter_data['generation_time'] = gen_time
iter_data['latency'] = latency
iter_data['result_md5'] = res_md5
iter_data['first_token_latency'] = ''
iter_data['other_tokens_avg_latency'] = ''
iter_data['first_token_infer_latency'] = ''
iter_data['other_tokens_infer_avg_latency'] = ''
iter_data['max_rss_mem_consumption'] = max_rss_mem
iter_data['max_shared_mem_consumption'] = max_shared_mem
iter_data['max_uss_mem_consumption'] = max_uss_mem
iter_data['prompt_idx'] = prompt_idx
iter_data['tokenization_time'] = tokenization_time[0] if len(tokenization_time) > 0 else ''
iter_data['detokenization_time'] = tokenization_time[1] if len(tokenization_time) > 1 else ''

if loop_data is not None:
iter_data['enc_token_latency'] = loop_data['enc_token_time']
iter_data['enc_infer_latency'] = loop_data['enc_infer_time']
iter_data['first_token_latency'] = loop_data['dec_1st_token_time']
iter_data['other_tokens_avg_latency'] = loop_data['dec_2nd_tokens_time']
iter_data['first_token_infer_latency'] = loop_data['dec_1st_infer_time']
iter_data['other_tokens_infer_avg_latency'] = loop_data['dec_2nd_infers_time']
else:
iter_data['enc_token_latency'] = ''
iter_data['enc_infer_latency'] = ''
iter_data['first_token_latency'] = ''
iter_data['other_tokens_avg_latency'] = ''
iter_data['first_token_infer_latency'] = ''
iter_data['other_tokens_infer_avg_latency'] = ''

return iter_data
136 changes: 136 additions & 0 deletions tools/llm_bench/llm_bench_utils/hook_forward_whisper.py
@@ -0,0 +1,136 @@
import time
import copy
import llm_bench_utils.hook_greedy_search


class WhisperHook:
def __init__(self):
self.enc_infer_count = 0
self.time_data = []
self.latency_list = []
self.tm_list = []
self.tm_infer_list = []
self.greedy_hook = None

def get_time_list(self):
first_token_latency = 0
for data in self.time_data:
if 'enc_token_time' in data:
first_token_latency += data['enc_token_time']
if 'dec_token_time' in data:
first_token_latency += data['dec_token_time'][0]
self.tm_list.extend(copy.deepcopy(data['dec_token_time'][1:]))
self.tm_list.insert(0, first_token_latency)
return self.tm_list

def get_time_infer_list(self):
first_infer_latency = 0
for data in self.time_data:
if 'enc_infer_time' in data:
first_infer_latency += data['enc_infer_time']
if 'dec_infer_time' in data:
first_infer_latency += data['dec_infer_time'][0]
self.tm_infer_list.extend(copy.deepcopy(data['dec_infer_time'][1:]))
self.tm_infer_list.insert(0, first_infer_latency)
return self.tm_infer_list

def get_whisper_latency(self):
self.latency_list.clear()
for data in self.time_data:
latency_data = {}
if 'enc_token_time' in data and 'enc_infer_time' in data:
latency_data['enc_token_time'] = round(data['enc_token_time'] * 1000, 2)
latency_data['enc_infer_time'] = round(data['enc_infer_time'] * 1000, 2)
if 'dec_token_time' in data:
dec_token_count = len(data['dec_token_time'])
dec_infer_count = len(data['dec_infer_time'])
latency_data['dec_token_count'] = dec_token_count
latency_data['dec_infer_count'] = dec_infer_count
latency_data['dec_1st_token_time'] = round(data['dec_token_time'][0] * 1000, 2) if dec_token_count > 0 else 'NA'
latency_data['dec_2nd_tokens_time'] = round(sum(data['dec_token_time'][1:]) * 1000 / (dec_token_count - 1), 2) if dec_token_count > 1 else 'NA'
latency_data['dec_1st_infer_time'] = round(data['dec_infer_time'][0] * 1000, 2) if dec_infer_count > 0 else 'NA'
latency_data['dec_2nd_infers_time'] = round(sum(data['dec_infer_time'][1:]) * 1000 / (dec_infer_count - 1), 2) if dec_infer_count > 1 else 'NA'
self.latency_list.append(latency_data)

def print_whisper_latency(self, iter, prompt_idx):
self.get_whisper_latency()
str = ''
for idx, data in enumerate(self.latency_list):
title = f'[ INFO ] [{iter}][P{prompt_idx}][L{idx}]'
if 'enc_token_time' in data and 'enc_infer_time' in data:
str += \
f"{title} encoder token latency: {data['enc_token_time']:.2f} ms/token, " \
f"encoder infers latency: {data['enc_infer_time']:.2f} ms/infer"
if 'dec_1st_token_time' in data and 'dec_2nd_tokens_time' in data:
str += \
f"\n{title} decoder first token latency: {data['dec_1st_token_time']} ms/token, " \
f"decoder other tokens latency: {data['dec_2nd_tokens_time']} ms/token, " \
f"decoder tokens count: {data['dec_token_count']}\n"
if 'dec_1st_infer_time' in data and 'dec_2nd_infers_time' in data:
str += \
f"{title} decoder first infer latency: {data['dec_1st_infer_time']} ms/infer, " \
f"decoder other infers latency: {data['dec_2nd_infers_time']} ms/infer, " \
f"decoder infers count: {data['dec_infer_count']}"
if idx < len(self.latency_list) - 1:
str += '\n'
return str

def clear_statistics(self):
self.enc_infer_count = 0
self.time_data.clear()
self.tm_list.clear()
self.tm_infer_list.clear()
if self.greedy_hook is not None:
self.greedy_hook.clear_time_list()
self.greedy_hook.clear_time_infer_list()

def new_text_encoder(self, pipe):
old_text_encoder = pipe.model.encoder.forward

def my_text_encoder(*args, **kwargs):
t1 = time.time()
r = old_text_encoder(*args, **kwargs)
t2 = time.time()
text_encoder_token_time = t2 - t1
if self.enc_infer_count > 0:
prev_loop_data = self.time_data[self.enc_infer_count - 1]
prev_loop_data['enc_token_time'] = text_encoder_token_time
return r
pipe.model.encoder.forward = my_text_encoder

def new_text_encoder_request(self, pipe):
old_text_encoder_request = pipe.model.encoder.request

def my_text_encoder_request(*args, **kwargs):
loop_data = {}
t1 = time.time()
r = old_text_encoder_request(*args, **kwargs)
t2 = time.time()
text_encoder_infer_time = t2 - t1
loop_data['enc_infer_time'] = text_encoder_infer_time
self.time_data.append(loop_data)
self.enc_infer_count += 1
return r
pipe.model.encoder.request = my_text_encoder_request

def new_text_sample(self, pipe):
self.greedy_hook = llm_bench_utils.hook_greedy_search.GreedySearchHook()
self.greedy_hook.new_forward(pipe.model)

def new_generate(self, pipe):
old_generate = pipe.model.generate

def my_generate(attention_mask, **kwargs):
r = old_generate(attention_mask, **kwargs)
self.set_decoder_time_data()
return r
pipe.model.generate = my_generate

def set_decoder_time_data(self):
if self.enc_infer_count > 0:
prev_loop_data = self.time_data[self.enc_infer_count - 1]
prev_loop_data['dec_token_time'] = copy.deepcopy(self.greedy_hook.get_time_list())
prev_loop_data['dec_infer_time'] = copy.deepcopy(self.greedy_hook.get_time_infer_list())
if self.greedy_hook is not None:
self.greedy_hook.clear_time_list()
self.greedy_hook.clear_time_infer_list()
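A hedged sketch of how these hooks might be wired around a Whisper pipeline object for benchmarking. The `pipe` wrapper with a `.model` attribute and the transcription call are assumptions used for illustration, not the exact llm_bench code:
```python
# Hypothetical wiring: instrument a pipeline whose .model is an optimum-intel
# Whisper model, run one transcription, then read the collected statistics.
whisper_hook = WhisperHook()
whisper_hook.new_text_encoder(pipe)          # times pipe.model.encoder.forward
whisper_hook.new_text_encoder_request(pipe)  # times pipe.model.encoder.request
whisper_hook.new_text_sample(pipe)           # hooks greedy search on pipe.model
whisper_hook.new_generate(pipe)              # collects decoder times after generate

result = pipe(audio_path)                    # assumed transcription call
print(whisper_hook.print_whisper_latency(iter=0, prompt_idx=0))
tm_list = whisper_hook.get_time_list()       # first entry = encoder + first decoder token
whisper_hook.clear_statistics()
```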