[LLM_BENCH] Whisper pipeline (openvinotoolkit#1025)
Tickets: CVS-152393

Attachments: [whisper_tiny_genai_API.txt](https://github.com/user-attachments/files/17629272/whisper_tiny_genai_API.txt), [whisper_tiny_optimum-intel_API.txt](https://github.com/user-attachments/files/17629275/whisper_tiny_optimum-intel_API.txt)

Co-authored-by: Xiping Yan <xiping.yan@intel.com>
Co-authored-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Andrei Kochin <andrei.kochin@intel.com>
Co-authored-by: Ekaterina Aidova <ekaterina.aidova@intel.com>
1 parent cf9ec64 · commit 2370f6a · Showing 17 changed files with 656 additions and 116 deletions.
@@ -0,0 +1,74 @@
# Notes
## chatglm2-6b - AttributeError: can't set attribute
Download chatglm2-6b from Hugging Face, convert it to OpenVINO IR files, and run it with benchmark.py; the following error may occur:
```bash
AttributeError: can't set attribute
```
Reproduced with https://huggingface.co/THUDM/chatglm2-6b at commit 7fabe56db91e085c9c027f56f1c654d137bdba40 <br />
As discussed in https://huggingface.co/THUDM/chatglm2-6b/discussions/99 <br />
Solution: update `tokenization_chatglm.py` as follows: <br />
```Python
self.vocab_file = vocab_file
self.tokenizer = SPTokenizer(vocab_file)
+ kwargs.pop("eos_token", None)
+ kwargs.pop("pad_token", None)
+ kwargs.pop("unk_token", None)
self.special_tokens = {
    "<bos>": self.tokenizer.bos_id,
    "<eos>": self.tokenizer.eos_id,
```
> The solution works for chatglm3-6b as well.
## Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled
When converting Qwen-7B-Chat-Int4 to OpenVINO IR files with convert.py, the following error may occur:
```bash
raise AssertionError("Torch not compiled with CUDA enabled")
```
Reproduced with https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 at commit 8750247cc50f2a7bb84bef322f7707159b700723 <br />
Solution: update `modeling_qwen.py` as follows: <br />
```Python
-SUPPORT_CUDA = torch.cuda.is_available()
+SUPPORT_CUDA = False
SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
```
## Baichuan2-7B-Chat - AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
When converting Baichuan2-7B-Chat to OpenVINO IR files with convert.py, the following error may occur:
```bash
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
```
Reproduced with https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat at commit 84603cde5ebffb6084e476cfaeceaf0b8b91fe54 <br />
See https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/discussions/2 <br />
Solution: update `tokenization_baichuan.py` as follows: <br />
```Python
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+ self.vocab_file = vocab_file
+ self.add_bos_token = add_bos_token
+ self.add_eos_token = add_eos_token
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(vocab_file)
super().__init__(
    bos_token=bos_token,
    eos_token=eos_token,
    clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    **kwargs,
)
- self.vocab_file = vocab_file
- self.add_bos_token = add_bos_token
- self.add_eos_token = add_eos_token
- self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
- self.sp_model.Load(vocab_file)
```
## CompressWeights Mode INT4 - ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
When downloading an LLM from Hugging Face and converting it to OpenVINO IR files with convert.py using CompressWeights mode INT4, the following error may occur:
```bash
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
```
Root cause: the wikitext dataset was not downloaded correctly, or the Hugging Face Hub could not be reached. <br />
Solution: <br />
Following https://huggingface.co/docs/datasets/loading#arrow , copy the wikitext dataset to the ~/.cache/huggingface/datasets/ folder and set the environment variable HF_DATASETS_OFFLINE to 1.
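A minimal sketch of this offline workaround (the source path of the previously downloaded dataset is a placeholder; the cache location and environment variable come from the note above):
```bash
# Copy a previously downloaded wikitext dataset into the local Hugging Face
# datasets cache (default location), then force offline mode so convert.py
# does not try to reach the Hub.
mkdir -p ~/.cache/huggingface/datasets/
cp -r /path/to/local/wikitext ~/.cache/huggingface/datasets/
export HF_DATASETS_OFFLINE=1
```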
@@ -0,0 +1,39 @@
> [!NOTE]
> Currently llm_bench only supports JSON files with the suffix .jsonl.
> If there is no prompt file, the default value is used.
> A prompt file can contain multiple prompts; specify which prompt to run with the option --prompt_index.
## 1. Text Generation
Supported parameters that can be set are:
* `prompt` - input prompt text for the text generation
Prompt file example:
{"prompt": "what is openvino?"}
{"prompt": "A chat between a curious user and an artificial intelligence assistant."}
## 2. Stable-diffusion
Supported parameters that can be set are:
* `steps` - inference steps (default 20)
* `width` - resolution width (default 512)
* `height` - resolution height (default 512)
* `guidance_scale` - guidance scale
* `prompt` - input prompt text for the image generation
Prompt file example:
{"steps":"10", "width":"256", "height":"256", "guidance_scale":"1.0", "prompt": "side profile centered painted portrait, Gandhi rolling a blunt, Gloomhaven, matte painting concept art, art nouveau, 8K HD Resolution, beautifully background"}
## 3. Ldm-super-resolution
Supported parameters that can be set are:
* `steps` - inference steps (default 50)
* `width` - resize image width (default 128)
* `height` - resize image height (default 128)
* `prompt` - input image path
Prompt file example:
{"steps": "20", "width": "256", "height": "256", "prompt": "./image_256x256_size/4.png"}
## 4. Whisper
Supported parameters that can be set are:
* `media` - audio file path
* `language` - language of the audio (default <|en|>)
* `timestamp` - whether to emit timestamps (default true)
Prompt file example:
{"media": "./audio/intel_ad_90s_128kbps.mp3", "language": "<|en|>", "timestamp":false}
{"media": "./audio/intel_ad_120s_128kbps.mp3", "language": "<|en|>", "timestamp":true}
136 changes: 136 additions & 0 deletions in tools/llm_bench/llm_bench_utils/hook_forward_whisper.py
@@ -0,0 +1,136 @@
```python
import time
import copy
import llm_bench_utils.hook_greedy_search


class WhisperHook:
    def __init__(self):
        self.enc_infer_count = 0
        self.time_data = []
        self.latency_list = []
        self.tm_list = []
        self.tm_infer_list = []
        self.greedy_hook = None

    def get_time_list(self):
        # First-token latency is the encoder token time plus the first decoder token time.
        first_token_latency = 0
        for data in self.time_data:
            if 'enc_token_time' in data:
                first_token_latency += data['enc_token_time']
            if 'dec_token_time' in data:
                first_token_latency += data['dec_token_time'][0]
                self.tm_list.extend(copy.deepcopy(data['dec_token_time'][1:]))
        self.tm_list.insert(0, first_token_latency)
        return self.tm_list

    def get_time_infer_list(self):
        # Same aggregation as get_time_list, but for raw inference-request timings.
        first_infer_latency = 0
        for data in self.time_data:
            if 'enc_infer_time' in data:
                first_infer_latency += data['enc_infer_time']
            if 'dec_infer_time' in data:
                first_infer_latency += data['dec_infer_time'][0]
                self.tm_infer_list.extend(copy.deepcopy(data['dec_infer_time'][1:]))
        self.tm_infer_list.insert(0, first_infer_latency)
        return self.tm_infer_list

    def get_whisper_latency(self):
        self.latency_list.clear()
        for data in self.time_data:
            latency_data = {}
            if 'enc_token_time' in data and 'enc_infer_time' in data:
                latency_data['enc_token_time'] = round(data['enc_token_time'] * 1000, 2)
                latency_data['enc_infer_time'] = round(data['enc_infer_time'] * 1000, 2)
            if 'dec_token_time' in data:
                dec_token_count = len(data['dec_token_time'])
                dec_infer_count = len(data['dec_infer_time'])
                latency_data['dec_token_count'] = dec_token_count
                latency_data['dec_infer_count'] = dec_infer_count
                latency_data['dec_1st_token_time'] = round(data['dec_token_time'][0] * 1000, 2) if dec_token_count > 0 else 'NA'
                latency_data['dec_2nd_tokens_time'] = round(sum(data['dec_token_time'][1:]) * 1000 / (dec_token_count - 1), 2) if dec_token_count > 1 else 'NA'
                latency_data['dec_1st_infer_time'] = round(data['dec_infer_time'][0] * 1000, 2) if dec_infer_count > 0 else 'NA'
                latency_data['dec_2nd_infers_time'] = round(sum(data['dec_infer_time'][1:]) * 1000 / (dec_infer_count - 1), 2) if dec_infer_count > 1 else 'NA'
            self.latency_list.append(latency_data)

    def print_whisper_latency(self, iter, prompt_idx):
        self.get_whisper_latency()
        out_str = ''
        for idx, data in enumerate(self.latency_list):
            title = f'[ INFO ] [{iter}][P{prompt_idx}][L{idx}]'
            if 'enc_token_time' in data and 'enc_infer_time' in data:
                out_str += \
                    f"{title} encoder token latency: {data['enc_token_time']:.2f} ms/token, " \
                    f"encoder infers latency: {data['enc_infer_time']:.2f} ms/infer"
            if 'dec_1st_token_time' in data and 'dec_2nd_tokens_time' in data:
                out_str += \
                    f"\n{title} decoder first token latency: {data['dec_1st_token_time']} ms/token, " \
                    f"decoder other tokens latency: {data['dec_2nd_tokens_time']} ms/token, " \
                    f"decoder tokens count: {data['dec_token_count']}\n"
            if 'dec_1st_infer_time' in data and 'dec_2nd_infers_time' in data:
                out_str += \
                    f"{title} decoder first infer latency: {data['dec_1st_infer_time']} ms/infer, " \
                    f"decoder other infers latency: {data['dec_2nd_infers_time']} ms/infer, " \
                    f"decoder infers count: {data['dec_infer_count']}"
            if idx < len(self.latency_list) - 1:
                out_str += '\n'
        return out_str

    def clear_statistics(self):
        self.enc_infer_count = 0
        self.time_data.clear()
        self.tm_list.clear()
        self.tm_infer_list.clear()
        if self.greedy_hook is not None:
            self.greedy_hook.clear_time_list()
            self.greedy_hook.clear_time_infer_list()

    def new_text_encoder(self, pipe):
        # Wrap encoder.forward to measure the encoder time seen at the token level.
        old_text_encoder = pipe.model.encoder.forward

        def my_text_encoder(*args, **kwargs):
            t1 = time.time()
            r = old_text_encoder(*args, **kwargs)
            t2 = time.time()
            text_encoder_token_time = t2 - t1
            if self.enc_infer_count > 0:
                prev_loop_data = self.time_data[self.enc_infer_count - 1]
                prev_loop_data['enc_token_time'] = text_encoder_token_time
            return r
        pipe.model.encoder.forward = my_text_encoder

    def new_text_encoder_request(self, pipe):
        # Wrap the raw encoder inference request to measure pure inference time
        # and to open a new per-loop record in time_data.
        old_text_encoder_request = pipe.model.encoder.request

        def my_text_encoder_request(*args, **kwargs):
            loop_data = {}
            t1 = time.time()
            r = old_text_encoder_request(*args, **kwargs)
            t2 = time.time()
            text_encoder_infer_time = t2 - t1
            loop_data['enc_infer_time'] = text_encoder_infer_time
            self.time_data.append(loop_data)
            self.enc_infer_count += 1
            return r
        pipe.model.encoder.request = my_text_encoder_request

    def new_text_sample(self, pipe):
        # Reuse the greedy-search hook to collect per-token decoder timings.
        self.greedy_hook = llm_bench_utils.hook_greedy_search.GreedySearchHook()
        self.greedy_hook.new_forward(pipe.model)

    def new_generate(self, pipe):
        # Wrap generate() so decoder timings are flushed into time_data after each call.
        old_generate = pipe.model.generate

        def my_generate(attention_mask, **kwargs):
            r = old_generate(attention_mask, **kwargs)
            self.set_decoder_time_data()
            return r
        pipe.model.generate = my_generate

    def set_decoder_time_data(self):
        if self.enc_infer_count > 0:
            prev_loop_data = self.time_data[self.enc_infer_count - 1]
            prev_loop_data['dec_token_time'] = copy.deepcopy(self.greedy_hook.get_time_list())
            prev_loop_data['dec_infer_time'] = copy.deepcopy(self.greedy_hook.get_time_infer_list())
        if self.greedy_hook is not None:
            self.greedy_hook.clear_time_list()
            self.greedy_hook.clear_time_infer_list()
```
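For orientation, a minimal sketch of how this hook might be wired to an optimum-intel Whisper model (the IR path and the `pipe` wrapper object are assumptions for illustration; the hook methods are the ones defined in the file above, which expect an object exposing `.model`):
```Python
from types import SimpleNamespace

from llm_bench_utils.hook_forward_whisper import WhisperHook
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

# Hypothetical OpenVINO IR directory for whisper-tiny.
model = OVModelForSpeechSeq2Seq.from_pretrained("./whisper-tiny-ov")
pipe = SimpleNamespace(model=model)  # stand-in for the benchmark's pipeline wrapper

hook = WhisperHook()
hook.new_text_encoder(pipe)          # time encoder.forward per call
hook.new_text_encoder_request(pipe)  # time the raw encoder inference request
hook.new_text_sample(pipe)           # install the greedy-search hook for decoder token timings
hook.new_generate(pipe)              # collect decoder timings after each generate() call

# ... run transcription through pipe.model.generate(...), then read the statistics:
per_token_latency = hook.get_time_list()        # first entry = encoder + first decoder token
per_infer_latency = hook.get_time_infer_list()
```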