[LLM_BENCH] Whisper pipeline (openvinotoolkit#1025)
Tickets: CVS-152393

Attachments: [whisper_tiny_genai_API.txt](https://github.com/user-attachments/files/17629272/whisper_tiny_genai_API.txt), [whisper_tiny_optimum-intel_API.txt](https://github.com/user-attachments/files/17629275/whisper_tiny_optimum-intel_API.txt)

Co-authored-by: Xiping Yan <xiping.yan@intel.com>
Co-authored-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Andrei Kochin <andrei.kochin@intel.com>
Co-authored-by: Ekaterina Aidova <ekaterina.aidova@intel.com>
1 parent cf9ec64 · commit 2370f6a · Showing 17 changed files with 656 additions and 116 deletions.
@@ -0,0 +1,74 @@
# Notes
## chatglm2-6b - AttributeError: can't set attribute
Download chatglm2-6b from Hugging Face, convert it to OpenVINO IR files, and run it with benchmark.py; the following error may occur:
```bash
AttributeError: can't set attribute
```
Reproduced with https://huggingface.co/THUDM/chatglm2-6b at commit 7fabe56db91e085c9c027f56f1c654d137bdba40 <br />
As discussed in https://huggingface.co/THUDM/chatglm2-6b/discussions/99 <br />
Solution: update `tokenization_chatglm.py` as follows: <br />
```Python
self.vocab_file = vocab_file
self.tokenizer = SPTokenizer(vocab_file)
+ kwargs.pop("eos_token", None)
+ kwargs.pop("pad_token", None)
+ kwargs.pop("unk_token", None)
self.special_tokens = {
    "<bos>": self.tokenizer.bos_id,
    "<eos>": self.tokenizer.eos_id,
```
> The solution works for chatglm3-6b as well.
## Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled
When converting Qwen-7B-Chat-Int4 to OpenVINO IR files with convert.py, the following error may occur:
```bash
raise AssertionError("Torch not compiled with CUDA enabled")
```
Reproduced with https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 at commit 8750247cc50f2a7bb84bef322f7707159b700723 <br />
Solution: update `modeling_qwen.py` as follows: <br />
```Python
-SUPPORT_CUDA = torch.cuda.is_available()
+SUPPORT_CUDA = False
SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
```
## Baichuan2-7B-Chat - AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
When converting Baichuan2-7B-Chat to OpenVINO IR files with convert.py, the following error may occur:
```bash
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
```
Reproduced with https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat at commit 84603cde5ebffb6084e476cfaeceaf0b8b91fe54 <br />
See https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/discussions/2 <br />
Solution: update `tokenization_baichuan.py` as follows: <br />
```Python
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+ self.vocab_file = vocab_file
+ self.add_bos_token = add_bos_token
+ self.add_eos_token = add_eos_token
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(vocab_file)
super().__init__(
    bos_token=bos_token,
    eos_token=eos_token,
    clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    **kwargs,
)
- self.vocab_file = vocab_file
- self.add_bos_token = add_bos_token
- self.add_eos_token = add_eos_token
- self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
- self.sp_model.Load(vocab_file)
```
## CompressWeights Mode INT4 - ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
When downloading an LLM from Hugging Face and converting it to OpenVINO IR files with convert.py using CompressWeights mode INT4, the following error may occur:
```bash
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
```
Root cause: the wikitext dataset was not downloaded correctly, or the Hugging Face Hub could not be reached. <br />
Solution: <br />
Following https://huggingface.co/docs/datasets/loading#arrow , copy the wikitext dataset to the ~/.cache/huggingface/datasets/ folder and set the environment variable HF_DATASETS_OFFLINE to 1.
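A minimal sketch of this offline workaround (the source path of the previously downloaded dataset is a placeholder; the cache location and environment variable come from the note above):
```bash
# Copy a previously downloaded wikitext dataset into the local Hugging Face
# datasets cache (default location), then force offline mode so convert.py
# does not try to reach the Hub.
mkdir -p ~/.cache/huggingface/datasets/
cp -r /path/to/local/wikitext ~/.cache/huggingface/datasets/
export HF_DATASETS_OFFLINE=1
```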
@@ -0,0 +1,39 @@
> [!NOTE]
> Currently llm_bench only supports JSON files with the suffix .jsonl.
> If there is no prompt file, the default value is used.
> A prompt file can contain multiple prompts; specify which prompt to run with the option --prompt_index.
## 1. Text Generation
Supported parameters that can be set are:
* `prompt` - input prompt text for the text generation
Prompt file example:
{"prompt": "what is openvino?"}
{"prompt": "A chat between a curious user and an artificial intelligence assistant."}
## 2. Stable-diffusion
Supported parameters that can be set are:
* `steps` - inference steps (default 20)
* `width` - resolution width (default 512)
* `height` - resolution height (default 512)
* `guidance_scale` - guidance scale
* `prompt` - input prompt text for the image generation
Prompt file example:
{"steps":"10", "width":"256", "height":"256", "guidance_scale":"1.0", "prompt": "side profile centered painted portrait, Gandhi rolling a blunt, Gloomhaven, matte painting concept art, art nouveau, 8K HD Resolution, beautifully background"}
## 3. Ldm-super-resolution
Supported parameters that can be set are:
* `steps` - inference steps (default 50)
* `width` - resize image width (default 128)
* `height` - resize image height (default 128)
* `prompt` - input image path
Prompt file example:
{"steps": "20", "width": "256", "height": "256", "prompt": "./image_256x256_size/4.png"}
## 4. Whisper
Supported parameters that can be set are:
* `media` - audio file path
* `language` - language of the audio (default <|en|>)
* `timestamp` - whether to emit timestamps (default true)
Prompt file example:
{"media": "./audio/intel_ad_90s_128kbps.mp3", "language": "<|en|>", "timestamp":false}
{"media": "./audio/intel_ad_120s_128kbps.mp3", "language": "<|en|>", "timestamp":true}
136 changes: 136 additions & 0 deletions in tools/llm_bench/llm_bench_utils/hook_forward_whisper.py
@@ -0,0 +1,136 @@
```python
import time
import copy
import llm_bench_utils.hook_greedy_search


class WhisperHook:
    def __init__(self):
        self.enc_infer_count = 0
        self.time_data = []
        self.latency_list = []
        self.tm_list = []
        self.tm_infer_list = []
        self.greedy_hook = None

    def get_time_list(self):
        # First-token latency is the encoder token time plus the first decoder token time.
        first_token_latency = 0
        for data in self.time_data:
            if 'enc_token_time' in data:
                first_token_latency += data['enc_token_time']
            if 'dec_token_time' in data:
                first_token_latency += data['dec_token_time'][0]
                self.tm_list.extend(copy.deepcopy(data['dec_token_time'][1:]))
        self.tm_list.insert(0, first_token_latency)
        return self.tm_list

    def get_time_infer_list(self):
        # Same aggregation as get_time_list, but for raw inference-request timings.
        first_infer_latency = 0
        for data in self.time_data:
            if 'enc_infer_time' in data:
                first_infer_latency += data['enc_infer_time']
            if 'dec_infer_time' in data:
                first_infer_latency += data['dec_infer_time'][0]
                self.tm_infer_list.extend(copy.deepcopy(data['dec_infer_time'][1:]))
        self.tm_infer_list.insert(0, first_infer_latency)
        return self.tm_infer_list

    def get_whisper_latency(self):
        self.latency_list.clear()
        for data in self.time_data:
            latency_data = {}
            if 'enc_token_time' in data and 'enc_infer_time' in data:
                latency_data['enc_token_time'] = round(data['enc_token_time'] * 1000, 2)
                latency_data['enc_infer_time'] = round(data['enc_infer_time'] * 1000, 2)
            if 'dec_token_time' in data:
                dec_token_count = len(data['dec_token_time'])
                dec_infer_count = len(data['dec_infer_time'])
                latency_data['dec_token_count'] = dec_token_count
                latency_data['dec_infer_count'] = dec_infer_count
                latency_data['dec_1st_token_time'] = round(data['dec_token_time'][0] * 1000, 2) if dec_token_count > 0 else 'NA'
                latency_data['dec_2nd_tokens_time'] = round(sum(data['dec_token_time'][1:]) * 1000 / (dec_token_count - 1), 2) if dec_token_count > 1 else 'NA'
                latency_data['dec_1st_infer_time'] = round(data['dec_infer_time'][0] * 1000, 2) if dec_infer_count > 0 else 'NA'
                latency_data['dec_2nd_infers_time'] = round(sum(data['dec_infer_time'][1:]) * 1000 / (dec_infer_count - 1), 2) if dec_infer_count > 1 else 'NA'
            self.latency_list.append(latency_data)

    def print_whisper_latency(self, iter, prompt_idx):
        self.get_whisper_latency()
        out_str = ''
        for idx, data in enumerate(self.latency_list):
            title = f'[ INFO ] [{iter}][P{prompt_idx}][L{idx}]'
            if 'enc_token_time' in data and 'enc_infer_time' in data:
                out_str += \
                    f"{title} encoder token latency: {data['enc_token_time']:.2f} ms/token, " \
                    f"encoder infers latency: {data['enc_infer_time']:.2f} ms/infer"
            if 'dec_1st_token_time' in data and 'dec_2nd_tokens_time' in data:
                out_str += \
                    f"\n{title} decoder first token latency: {data['dec_1st_token_time']} ms/token, " \
                    f"decoder other tokens latency: {data['dec_2nd_tokens_time']} ms/token, " \
                    f"decoder tokens count: {data['dec_token_count']}\n"
            if 'dec_1st_infer_time' in data and 'dec_2nd_infers_time' in data:
                out_str += \
                    f"{title} decoder first infer latency: {data['dec_1st_infer_time']} ms/infer, " \
                    f"decoder other infers latency: {data['dec_2nd_infers_time']} ms/infer, " \
                    f"decoder infers count: {data['dec_infer_count']}"
            if idx < len(self.latency_list) - 1:
                out_str += '\n'
        return out_str

    def clear_statistics(self):
        self.enc_infer_count = 0
        self.time_data.clear()
        self.tm_list.clear()
        self.tm_infer_list.clear()
        if self.greedy_hook is not None:
            self.greedy_hook.clear_time_list()
            self.greedy_hook.clear_time_infer_list()

    def new_text_encoder(self, pipe):
        # Wrap encoder.forward to measure the encoder time seen at the token level.
        old_text_encoder = pipe.model.encoder.forward

        def my_text_encoder(*args, **kwargs):
            t1 = time.time()
            r = old_text_encoder(*args, **kwargs)
            t2 = time.time()
            text_encoder_token_time = t2 - t1
            if self.enc_infer_count > 0:
                prev_loop_data = self.time_data[self.enc_infer_count - 1]
                prev_loop_data['enc_token_time'] = text_encoder_token_time
            return r
        pipe.model.encoder.forward = my_text_encoder

    def new_text_encoder_request(self, pipe):
        # Wrap the raw encoder inference request to measure pure inference time
        # and to open a new per-loop record in time_data.
        old_text_encoder_request = pipe.model.encoder.request

        def my_text_encoder_request(*args, **kwargs):
            loop_data = {}
            t1 = time.time()
            r = old_text_encoder_request(*args, **kwargs)
            t2 = time.time()
            text_encoder_infer_time = t2 - t1
            loop_data['enc_infer_time'] = text_encoder_infer_time
            self.time_data.append(loop_data)
            self.enc_infer_count += 1
            return r
        pipe.model.encoder.request = my_text_encoder_request

    def new_text_sample(self, pipe):
        # Reuse the greedy-search hook to collect per-token decoder timings.
        self.greedy_hook = llm_bench_utils.hook_greedy_search.GreedySearchHook()
        self.greedy_hook.new_forward(pipe.model)

    def new_generate(self, pipe):
        # Wrap generate() so decoder timings are flushed into time_data after each call.
        old_generate = pipe.model.generate

        def my_generate(attention_mask, **kwargs):
            r = old_generate(attention_mask, **kwargs)
            self.set_decoder_time_data()
            return r
        pipe.model.generate = my_generate

    def set_decoder_time_data(self):
        if self.enc_infer_count > 0:
            prev_loop_data = self.time_data[self.enc_infer_count - 1]
            prev_loop_data['dec_token_time'] = copy.deepcopy(self.greedy_hook.get_time_list())
            prev_loop_data['dec_infer_time'] = copy.deepcopy(self.greedy_hook.get_time_infer_list())
        if self.greedy_hook is not None:
            self.greedy_hook.clear_time_list()
            self.greedy_hook.clear_time_infer_list()
```
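For orientation, a minimal sketch of how this hook might be wired to an optimum-intel Whisper model (the IR path and the `pipe` wrapper object are assumptions for illustration; the hook methods are the ones defined in the file above, which expect an object exposing `.model`):
```Python
from types import SimpleNamespace

from llm_bench_utils.hook_forward_whisper import WhisperHook
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

# Hypothetical OpenVINO IR directory for whisper-tiny.
model = OVModelForSpeechSeq2Seq.from_pretrained("./whisper-tiny-ov")
pipe = SimpleNamespace(model=model)  # stand-in for the benchmark's pipeline wrapper

hook = WhisperHook()
hook.new_text_encoder(pipe)          # time encoder.forward per call
hook.new_text_encoder_request(pipe)  # time the raw encoder inference request
hook.new_text_sample(pipe)           # install the greedy-search hook for decoder token timings
hook.new_generate(pipe)              # collect decoder timings after each generate() call

# ... run transcription through pipe.model.generate(...), then read the statistics:
per_token_latency = hook.get_time_list()        # first entry = encoder + first decoder token
per_infer_latency = hook.get_time_infer_list()
```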