openai-api backend #52

Merged: 9 commits merged into ufal:main on Feb 19, 2024

Conversation

tijszwinkels
Contributor

This PR implements a back-end that uses the OpenAI Whisper api.
This way, no expensive GPU server is necessary to run this software.
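
For context, a minimal sketch of what such a backend boils down to, assuming the official openai Python package and the hosted whisper-1 model (the names here are illustrative, not necessarily the PR's exact code):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(path, language="nl"):
    # Send one audio chunk to the hosted Whisper model; verbose_json
    # returns segments with timestamps, which the streaming logic needs.
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,
            response_format="verbose_json",
        )
```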

@tijszwinkels force-pushed the openai-api-backend branch 3 times, most recently from cd7d629 to b7cb783 on January 25, 2024 09:21
@Gldkslfmsd
Collaborator

Wow, thanks a lot! You're solving my current issue :)

Have you compared latency-quality to faster-whisper? Can you share the results? I'd like to merge it only if there's evidence that this feature is useful.

@tijszwinkels
Contributor Author

Anecdotally this works well, but let me see whether I can give some objective results.

Latencies are roughly comparable to my (abused) 3090:

OpenAI:

(realtime) tijs@Pillar:~/os/whisper_streaming$ python whisper_online.py --lan nl --min-chunk-size 10 --backend openai-api ~/call_tijs_jeroen.mp3
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat openai-api-output.txt | grep -E "latency|transcribing"
transcribing 10.01 seconds from 0.00
## last processed 10.01 s, now is 11.38, the latency is 1.37
transcribing 20.02 seconds from 0.00
## last processed 20.02 s, now is 21.96, the latency is 1.95
transcribing 21.03 seconds from 9.00
## last processed 30.03 s, now is 31.80, the latency is 1.77
transcribing 23.04 seconds from 17.00
## last processed 40.04 s, now is 41.80, the latency is 1.76
transcribing 22.54 seconds from 27.50
## last processed 50.04 s, now is 51.65, the latency is 1.60
transcribing 32.55 seconds from 27.50
## last processed 60.05 s, now is 61.92, the latency is 1.86
transcribing 42.56 seconds from 27.50
## last processed 70.06 s, now is 72.85, the latency is 2.79
transcribing 22.65 seconds from 57.42
## last processed 80.07 s, now is 82.30, the latency is 2.23
transcribing 32.66 seconds from 57.42
## last processed 90.08 s, now is 92.53, the latency is 2.45
transcribing 42.66 seconds from 57.42
## last processed 100.08 s, now is 102.92, the latency is 2.84
transcribing 23.11 seconds from 86.98
## last processed 110.09 s, now is 111.85, the latency is 1.76
transcribing 33.12 seconds from 86.98
## last processed 120.10 s, now is 122.32, the latency is 2.22
transcribing 43.13 seconds from 86.98
## last processed 130.11 s, now is 132.76, the latency is 2.65
transcribing 27.57 seconds from 112.54
## last processed 140.11 s, now is 141.35, the latency is 1.24
transcribing 30.58 seconds from 119.54
## last processed 150.12 s, now is 151.72, the latency is 1.60
transcribing 40.59 seconds from 119.54
## last processed 160.13 s, now is 162.17, the latency is 2.03

faster-whisper:

(realtime) tijs@Pillar:~/os/whisper_streaming$ LD_LIBRARY_PATH=/usr/lib/wsl/lib::/home/tijs/anaconda3/envs/realtime/lib/python3.10/site-packages/nvidia/cudnn/lib/ python whisper_online.py --lan nl --min-chunk-size 10 ~/call_tijs_jeroen.mp3 | tee faster-whisper-log.txt
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat faster-whisper-output.txt | grep -E "latency|transcribing"
transcribing 10.01 seconds from 0.00
## last processed 10.01 s, now is 11.27, the latency is 1.26
transcribing 20.02 seconds from 0.00
## last processed 20.02 s, now is 21.95, the latency is 1.93
transcribing 20.85 seconds from 9.18
## last processed 30.03 s, now is 31.65, the latency is 1.63
transcribing 29.24 seconds from 10.80
## last processed 40.04 s, now is 41.70, the latency is 1.66
transcribing 39.24 seconds from 10.80
## last processed 50.04 s, now is 52.22, the latency is 2.17
transcribing 32.51 seconds from 27.54
## last processed 60.05 s, now is 61.92, the latency is 1.87
transcribing 42.52 seconds from 27.54
## last processed 70.06 s, now is 72.71, the latency is 2.65
transcribing 52.53 seconds from 27.54
## last processed 80.07 s, now is 83.28, the latency is 3.21
transcribing 23.03 seconds from 67.04
## last processed 90.07 s, now is 92.14, the latency is 2.07
transcribing 21.88 seconds from 78.20
## last processed 100.08 s, now is 101.91, the latency is 1.82
transcribing 29.75 seconds from 80.34
## last processed 110.09 s, now is 112.69, the latency is 2.60
transcribing 39.76 seconds from 80.34
## last processed 120.10 s, now is 123.01, the latency is 2.91
transcribing 49.77 seconds from 80.34
## last processed 130.11 s, now is 133.84, the latency is 3.74
transcribing 34.85 seconds from 105.26
## last processed 140.11 s, now is 142.06, the latency is 1.95
transcribing 44.86 seconds from 105.26
## last processed 150.12 s, now is 152.70, the latency is 2.58

@tijszwinkels
Contributor Author

tijszwinkels commented Jan 25, 2024

Transcription of a professional conversation:

(realtime) tijs@Pillar:~/os/whisper_streaming$ cat openai-api-log.txt | head -n 7
Model configuration is set to use the OpenAI Whisper API.
21964.8111 0 9000 Hoi Hoi. Hallo Thijs, mijn ene AirPod, daar komt een soort geest uit, geen idee wat hij aan het doen is. Maar het klonk niet als jouw stem.
31797.1096 9000 17000 Als het goed is hoor je me toch wel zo. Ja. Oké. Dan doe ik het maar. Ik hoor je goed hoor. Dus dat gaat helemaal goed. Oké.
41797.7839 17000 27500 Nee, ik zat na te denken, die partij die doorstuurde ben ik net wat kort ingelopen. Die doen ook transcriptie van audio.
51644.9432 27500 34500 Maar er ligt nog geen connectie met
72847.0478 35005 57420 is het enige voordeel. Nee en ook eens naar die APIs kijkt, kijk het lijkt dat zij alleen integreren met, ja ik weet niet eens echt wat FHIR is, maar in elk geval met medische systemen, ja Fast Healthcare Interoperability Resources.
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat faster-whisper-log.txt | head -n 5
21947.3231 2000 9180  Hoi Hoi Hallo Thijs, mijn ene AirPod Daar komt een soort geest uit Geen idee wat hij aan het doen is Maar het klonk niet als jouw stem
31654.8061 9240 10800  Als het goed is hoor je me toch wel zo
52216.2840 12700 34800  Ik hoor je goed hoor, dus dat gaat helemaal goed Nee, ik zat na te denken, die partij die doorstuurde ben ik net wat kort ingelopen Die doen ook transcriptie van audio Maar er ligt nog geen connectie met
83275.8024 34800 68900  VK Nee, en ook als je naar die APIs kijkt Kijk, het lijkt dat zij alleen integreren met, ja ik weet niet eens echt wat FHIR is Maar in elk geval met medische systemen, ja Fast Healthcare Interoperability Resources Dus in zoverre, ja zij zijn zo gefocust op de medische markt dat ze ook weer niet echt een concurrent van ons zijn Kijk, voor VK kan
92141.5589 68940 79220  het op een gegeven moment natuurlijk wel interessant worden Maar ook dan, het is wel heel specifiek wat zij doen, wij zijn veel breder Ja,

The results are not identical, but seem of similar quality.

@Gldkslfmsd
Collaborator

OK. Whether the results are identical should be checked in --comp_unaware mode.

I confirm it looks very good. I will run a benchmark and evaluate latency-quality.

@Gldkslfmsd
Collaborator

Btw., how is it possible to use VAD?

@Gldkslfmsd
Collaborator

Also, let's add logging of how many seconds are actually processed through the API, so that the cost can be calculated.
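
A minimal sketch of such logging, with a hypothetical counter class (not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

class APIUsageCounter:
    """Hypothetical helper: accumulate how much audio has been sent to the API."""

    def __init__(self):
        self.transcribed_seconds = 0.0

    def add(self, chunk_duration_s: float):
        self.transcribed_seconds += chunk_duration_s
        logger.info("OpenAI API processed %.1f s of audio in total",
                    self.transcribed_seconds)
```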

@tijszwinkels
Contributor Author

Regarding VAD: the API tells us how likely it is that there's no speech. See: https://github.com/ufal/whisper_streaming/pull/52/files#diff-a270860122060d07d4ae5ba131afc258fd70131ed20b8aa8c258303789a1c8bdR167

Right now I just skip these segments regardless of VAD settings, but that's possibly not the right way to do it.
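
For reference, a rough sketch of that filtering, assuming verbose_json segments that expose a no_speech_prob field (the 0.8 threshold is illustrative):

```python
def keep_speech_segments(segments, no_speech_threshold=0.8):
    # Drop segments the API itself marks as probably containing no speech,
    # regardless of the local VAD settings (as described above).
    return [s for s in segments if s.no_speech_prob < no_speech_threshold]
```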

Gldkslfmsd added a commit that referenced this pull request Jan 25, 2024
@Gldkslfmsd
Collaborator

I made some updates to your PR: VAD and translate options, and code cleanup.

@tijszwinkels
Contributor Author

lovely, thanks!

@Gldkslfmsd
Collaborator

So, I got the results on ESIC dev2: 3 times 43 minutes, 3 times 27 documents, ASR in En, Cs, De.

I compared comp.-aware mode with segment-15 and min-chunk-size 1.0 s.

The WER of the OpenAI API is twice as high as that of the faster-whisper large-v3 model on an NVIDIA A100. I hypothesize that they don't use the large model in the API but a smaller one with worse quality.

The latency of the OpenAI API is 3 times worse, and it is very unstable and unreliable.

Cost: approximately 8 times the audio duration is processed in this mode, so 0.048 USD per minute in streaming mode.
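
For reference, the arithmetic behind that figure, assuming OpenAI's listed whisper-1 price of 0.006 USD per minute of audio:

```python
price_per_audio_minute = 0.006   # USD, whisper-1 list price (assumption)
overlap_factor = 8               # ~8x the audio duration is re-sent in this mode
print(overlap_factor * price_per_audio_minute)  # -> 0.048 USD per streamed minute
```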

So, @tijszwinkels, or anyone, do you have similar results, or better? Someone proposed the Azure Whisper API; it could be more reliable and faster.

@Gldkslfmsd
Collaborator

I used the VAD. My second hypothesis is that it should be improved in the OpenAI API backend. Right now it works incorrectly, filtering out segments with a no-speech probability threshold > 0.8. Maybe it should rather filter words?

@tijszwinkels
Contributor Author

Thank you for these tests! At least this is good to know.
According to the docs (https://platform.openai.com/docs/guides/speech-to-text), they're still using the large-v2 model (weird that they don't update their own API), but I wouldn't expect the difference between v2 and v3 to be quite this large.

Actually, another reason (sorry for not thinking of this earlier): the OpenAI API doesn't provide word-level timestamps (only segment-level), so I interpolated word-level timestamps by assuming equal length for all words. This obviously leads to incorrect timestamps in some cases, but I find it hard to estimate the consequences for the final output.
See: https://github.com/ufal/whisper_streaming/pull/52/files#diff-a270860122060d07d4ae5ba131afc258fd70131ed20b8aa8c258303789a1c8bdR178 - What do you think?

Alternatively, we could implement other APIs, such as one of the WhisperX back-ends on https://replicate.com/. The disadvantage is that the APIs I've seen there so far don't take the audio chunk from the API request but expect the audio to be uploaded to a publicly available URL, which seems cumbersome for many small audio chunks. But I could search around a bit more.


# Assign start and end times for each word
# We only have timestamps per segment, so interpolating start and end-times
# assuming equal duration per word
@Gldkslfmsd
Collaborator

I suggest proportional to the character length
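
A possible sketch of that suggestion (hypothetical helper, not the PR's code): split a segment's duration across its words in proportion to their character lengths:

```python
def interpolate_word_times(words, seg_start, seg_end):
    # Distribute the segment duration over the words proportionally
    # to each word's character length instead of equal shares.
    total_chars = sum(len(w) for w in words) or 1
    duration = seg_end - seg_start
    timed, cursor = [], seg_start
    for w in words:
        step = duration * len(w) / total_chars
        timed.append((w, cursor, cursor + step))
        cursor += step
    return timed
```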

@Gldkslfmsd
Collaborator

The difference between large-v2 and v3 is around 1% WER.
I noticed that some segments or words are omitted in the API outputs. I think the VAD causes the quality issue. Maybe also the timestamps.

Let's not go with the Replicate back-end; it doesn't seem useful. Maybe the Azure API?

@tijszwinkels
Contributor Author

> I think the VAD causes the quality issue. Maybe also the timestamps.

What if we disable VAD? Lots of hallucinations?

@Gldkslfmsd
Collaborator

> > I think the VAD causes the quality issue. Maybe also the timestamps.
>
> What if we disable VAD? Lots of hallucinations?

Yes, hallucinations on silence and non-voice sounds. We could run VAD locally and send only the voiced parts of the audio to the API.
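
One possible shape of that idea, sketched with Silero VAD loaded via torch.hub (an assumption; not part of this PR):

```python
import torch

# Silero VAD returns the model plus a tuple of helper functions.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

wav = read_audio("chunk.wav", sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)
# Concatenate only the voiced regions and send just those to the API.
voiced = collect_chunks(speech_ts, wav)
```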

@Gldkslfmsd
Collaborator

So, I checked one En ASR document. Faster-whisper large-v2 has 9.5% WER with VAD, 10.2% without VAD.
The OpenAI API has 13.14% WER regardless of the mode: comp. aware with VAD, comp. unaware with VAD on or off.

@Gldkslfmsd
Collaborator

I compared the outputs and errors. I suspect the rough approximation of word-level timestamps in the OpenAI API backend: it misses some words in the middle of sentences regardless of VAD.

[Screenshot taken 2024-01-29 16-03-38]

Left: OpenAI API, right: faster-whisper v2. The top line in both is gold.

@tijszwinkels
Contributor Author

Thanks for your extensive testing! I'll update the estimation to be based on character length, but I might not get round to it until Wednesday, unfortunately.

@Gldkslfmsd
Collaborator

Well, I suggest trying the Azure API instead of the character-proportional timestamps; those would be only a slightly better approximation. The OpenAI API is also very slow and has unstable latency. Someone told me that Azure has better latency and is more robust, and it has word-level timestamps. The API itself should be very easy to swap in for the OpenAI one.

Anytime is OK, I don't have any immediate plans for this anyway. Thanks!

@tijszwinkels
Contributor Author

Alright, if Azure basically fixes all these issues, I'll definitely look into that first then!

@tijszwinkels
Contributor Author

tijszwinkels commented Feb 8, 2024

I'm not very enthusiastic about the Azure API.

They have two options.

The Speech-to-text option is designed to be synchronous, but it has no settings and no timestamps. It just returns the whole text for the sent audio file at once. This makes it entirely unusable for our purposes.

The Batch transcription option is more flexible and can do word-level timestamps, but it's explicitly designed for large batch jobs and most certainly not for latency-sensitive applications:

> Batch transcription jobs are scheduled on a best-effort basis. At peak hours, it may take up to 30 minutes or longer for a transcription job to start processing.

So I don't think using the Azure APIs is the way to go.

@tijszwinkels
Contributor Author

tijszwinkels commented Feb 9, 2024

The OpenAI API actually supports word-level timestamps through timestamp_granularities[]. :)

Fix coming soon!
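
A minimal sketch of the request, assuming the openai Python SDK (v1) and verbose_json output:

```python
from openai import OpenAI

client = OpenAI()

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries real start/end times, so no interpolation is needed.
for w in transcript.words:
    print(w.start, w.end, w.word)
```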

@tijszwinkels
Contributor Author

The word-timestamp interpolation based on character length had the same problem with missing words at chunk boundaries, but with word-level timestamps, the OpenAI API gives results identical to offline Whisper in my tests:

Original:
Nadat de ervaring mij geleerd had, dat al wat zo in het gewone leven volkomt ijdel en nietig is, en ik inzag dat alles waarvoor en wat ik vreesde niets goeds nog kwaads bevatte, tenzij alleen voor zover mijn gemoed er door bewogen werd, besloot ik eindelijk te onderzoeken of er ook iets bestond dat een waarachtig goed was, dat men deelachtig zou kunnen worden, en waardoor alleen, met verwerping van al het overige, de ziel kon worden vervuld.

Streaming:
Nadat de ervaring mij geleerd had dat al wat zo in het gewone leven volkomt ijdel en nietig is en ik inzag dat alles waarvoor en wat ik vreesde niets goeds nog kwaads bevatte tenzij alleen voor zover mijn gemoed er door bewogen werd besloot ik eindelijk te onderzoeken of er ook iets bestond dat een waarachtig goed was dat men deelachtig zou kunnen worden en waardoor alleen met verwerping van al het overige de ziel kon worden vervuld.

In my preliminary testing, this seems good enough for my use case!
I wonder what you think.

@Gldkslfmsd
Collaborator

Thanks, @tijszwinkels! I plan to test the code next week; I'm busy.

Meanwhile, can you merge main into this branch and test it? There's a new feature of automatic language detection when the language parameter is None. Does it work with the API?

@tijszwinkels
Contributor Author

Rebased the branch and made language auto-detection work with --lan auto.

Right now the default language is 'en' if not specified on the CLI. Perhaps the default should be auto?
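
A small sketch of what --lan auto could map to, assuming that omitting the optional language parameter lets the API auto-detect the language (the helper name is hypothetical):

```python
def api_language_kwargs(lan):
    # With --lan auto, omit `language` entirely so the API detects it;
    # otherwise pass the ISO-639-1 code through (e.g. "nl", "en").
    if lan is None or lan == "auto":
        return {}
    return {"language": lan}
```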

@Gldkslfmsd
Collaborator

Thanks. OK, let's make auto the default.

@Gldkslfmsd
Collaborator

OK, I tested it; the quality seems alright.

The latency is around 4 seconds higher than local faster-whisper with min-chunk-size 1 second, but it can't be better...

So let's merge it.

@Gldkslfmsd merged commit e11a5ba into ufal:main on Feb 19, 2024