openai-api backend #52

Merged: 9 commits merged into ufal:main on Feb 19, 2024

Conversation

tijszwinkels
Contributor

This PR implements a back-end that uses the OpenAI Whisper api.
This way, no expensive GPU server is necessary to run this software.
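
For context, a minimal sketch of what such a backend boils down to, assuming the official openai Python package and the hosted whisper-1 model (the names here are illustrative, not necessarily the PR's exact code):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(path, language="nl"):
    # Send one audio chunk to the hosted Whisper model; verbose_json
    # returns segments with timestamps, which the streaming logic needs.
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,
            response_format="verbose_json",
        )
```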

@tijszwinkels force-pushed the openai-api-backend branch 3 times, most recently from cd7d629 to b7cb783 on January 25, 2024 09:21
@Gldkslfmsd
Collaborator

Wow, thanks a lot! You're solving my current issue :)

Have you compared latency-quality to faster-whisper? Can you share the results? I'd like to merge it only if there's evidence that this feature is useful.

@tijszwinkels
Contributor Author

Anecdotally this works well, but let me see whether I can give some objective results.

Latencies are roughly comparable to my (abused) 3090:

OpenAI:

(realtime) tijs@Pillar:~/os/whisper_streaming$ python whisper_online.py --lan nl --min-chunk-size 10 --backend openai-api ~/call_tijs_jeroen.mp3
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat openai-api-output.txt | grep -E "latency|transcribing"
transcribing 10.01 seconds from 0.00
## last processed 10.01 s, now is 11.38, the latency is 1.37
transcribing 20.02 seconds from 0.00
## last processed 20.02 s, now is 21.96, the latency is 1.95
transcribing 21.03 seconds from 9.00
## last processed 30.03 s, now is 31.80, the latency is 1.77
transcribing 23.04 seconds from 17.00
## last processed 40.04 s, now is 41.80, the latency is 1.76
transcribing 22.54 seconds from 27.50
## last processed 50.04 s, now is 51.65, the latency is 1.60
transcribing 32.55 seconds from 27.50
## last processed 60.05 s, now is 61.92, the latency is 1.86
transcribing 42.56 seconds from 27.50
## last processed 70.06 s, now is 72.85, the latency is 2.79
transcribing 22.65 seconds from 57.42
## last processed 80.07 s, now is 82.30, the latency is 2.23
transcribing 32.66 seconds from 57.42
## last processed 90.08 s, now is 92.53, the latency is 2.45
transcribing 42.66 seconds from 57.42
## last processed 100.08 s, now is 102.92, the latency is 2.84
transcribing 23.11 seconds from 86.98
## last processed 110.09 s, now is 111.85, the latency is 1.76
transcribing 33.12 seconds from 86.98
## last processed 120.10 s, now is 122.32, the latency is 2.22
transcribing 43.13 seconds from 86.98
## last processed 130.11 s, now is 132.76, the latency is 2.65
transcribing 27.57 seconds from 112.54
## last processed 140.11 s, now is 141.35, the latency is 1.24
transcribing 30.58 seconds from 119.54
## last processed 150.12 s, now is 151.72, the latency is 1.60
transcribing 40.59 seconds from 119.54
## last processed 160.13 s, now is 162.17, the latency is 2.03

faster-whisper:

(realtime) tijs@Pillar:~/os/whisper_streaming$ LD_LIBRARY_PATH=/usr/lib/wsl/lib::/home/tijs/anaconda3/envs/realtime/lib/python3.10/site-packages/nvidia/cudnn/lib/ python whisper_online.py --lan nl --min-chunk-size 10 ~/call_tijs_jeroen.mp3 | tee faster-whisper-log.txt
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat faster-whisper-output.txt | grep -E "latency|transcribing"
transcribing 10.01 seconds from 0.00
## last processed 10.01 s, now is 11.27, the latency is 1.26
transcribing 20.02 seconds from 0.00
## last processed 20.02 s, now is 21.95, the latency is 1.93
transcribing 20.85 seconds from 9.18
## last processed 30.03 s, now is 31.65, the latency is 1.63
transcribing 29.24 seconds from 10.80
## last processed 40.04 s, now is 41.70, the latency is 1.66
transcribing 39.24 seconds from 10.80
## last processed 50.04 s, now is 52.22, the latency is 2.17
transcribing 32.51 seconds from 27.54
## last processed 60.05 s, now is 61.92, the latency is 1.87
transcribing 42.52 seconds from 27.54
## last processed 70.06 s, now is 72.71, the latency is 2.65
transcribing 52.53 seconds from 27.54
## last processed 80.07 s, now is 83.28, the latency is 3.21
transcribing 23.03 seconds from 67.04
## last processed 90.07 s, now is 92.14, the latency is 2.07
transcribing 21.88 seconds from 78.20
## last processed 100.08 s, now is 101.91, the latency is 1.82
transcribing 29.75 seconds from 80.34
## last processed 110.09 s, now is 112.69, the latency is 2.60
transcribing 39.76 seconds from 80.34
## last processed 120.10 s, now is 123.01, the latency is 2.91
transcribing 49.77 seconds from 80.34
## last processed 130.11 s, now is 133.84, the latency is 3.74
transcribing 34.85 seconds from 105.26
## last processed 140.11 s, now is 142.06, the latency is 1.95
transcribing 44.86 seconds from 105.26
## last processed 150.12 s, now is 152.70, the latency is 2.58

@tijszwinkels
Contributor Author

tijszwinkels commented Jan 25, 2024

Transcription of a professional conversation:

(realtime) tijs@Pillar:~/os/whisper_streaming$ cat openai-api-log.txt | head -n 7
Model configuration is set to use the OpenAI Whisper API.
21964.8111 0 9000 Hoi Hoi. Hallo Thijs, mijn ene AirPod, daar komt een soort geest uit, geen idee wat hij aan het doen is. Maar het klonk niet als jouw stem.
31797.1096 9000 17000 Als het goed is hoor je me toch wel zo. Ja. Oké. Dan doe ik het maar. Ik hoor je goed hoor. Dus dat gaat helemaal goed. Oké.
41797.7839 17000 27500 Nee, ik zat na te denken, die partij die doorstuurde ben ik net wat kort ingelopen. Die doen ook transcriptie van audio.
51644.9432 27500 34500 Maar er ligt nog geen connectie met
72847.0478 35005 57420 is het enige voordeel. Nee en ook eens naar die APIs kijkt, kijk het lijkt dat zij alleen integreren met, ja ik weet niet eens echt wat FHIR is, maar in elk geval met medische systemen, ja Fast Healthcare Interoperability Resources.
(realtime) tijs@Pillar:~/os/whisper_streaming$ cat faster-whisper-log.txt | head -n 5
21947.3231 2000 9180  Hoi Hoi Hallo Thijs, mijn ene AirPod Daar komt een soort geest uit Geen idee wat hij aan het doen is Maar het klonk niet als jouw stem
31654.8061 9240 10800  Als het goed is hoor je me toch wel zo
52216.2840 12700 34800  Ik hoor je goed hoor, dus dat gaat helemaal goed Nee, ik zat na te denken, die partij die doorstuurde ben ik net wat kort ingelopen Die doen ook transcriptie van audio Maar er ligt nog geen connectie met
83275.8024 34800 68900  VK Nee, en ook als je naar die APIs kijkt Kijk, het lijkt dat zij alleen integreren met, ja ik weet niet eens echt wat FHIR is Maar in elk geval met medische systemen, ja Fast Healthcare Interoperability Resources Dus in zoverre, ja zij zijn zo gefocust op de medische markt dat ze ook weer niet echt een concurrent van ons zijn Kijk, voor VK kan
92141.5589 68940 79220  het op een gegeven moment natuurlijk wel interessant worden Maar ook dan, het is wel heel specifiek wat zij doen, wij zijn veel breder Ja,

The results are not identical, but seem of similar quality.

@Gldkslfmsd
Collaborator

OK. Whether the results are identical should be checked in --comp_unaware mode.

I confirm it looks very good. I will run a benchmark and evaluate latency-quality.

@Gldkslfmsd
Collaborator

Btw., how is it possible to use VAD?

@Gldkslfmsd
Collaborator

Also, let's add logging of how many seconds are actually processed through the API, so that the cost can be calculated.
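
A minimal sketch of such logging, with a hypothetical counter class (not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

class APIUsageCounter:
    """Hypothetical helper: accumulate how much audio has been sent to the API."""

    def __init__(self):
        self.transcribed_seconds = 0.0

    def add(self, chunk_duration_s: float):
        self.transcribed_seconds += chunk_duration_s
        logger.info("OpenAI API processed %.1f s of audio in total",
                    self.transcribed_seconds)
```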

@tijszwinkels
Contributor Author

Regarding VAD: the API tells us how likely it is that there's no speech. See: https://github.com/ufal/whisper_streaming/pull/52/files#diff-a270860122060d07d4ae5ba131afc258fd70131ed20b8aa8c258303789a1c8bdR167

Right now I just skip these segments regardless of VAD settings, but that's possibly not the right way to do it.
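
For reference, a rough sketch of that filtering, assuming verbose_json segments that expose a no_speech_prob field (the 0.8 threshold is illustrative):

```python
def keep_speech_segments(segments, no_speech_threshold=0.8):
    # Drop segments the API itself marks as probably containing no speech,
    # regardless of the local VAD settings (as described above).
    return [s for s in segments if s.no_speech_prob < no_speech_threshold]
```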

Gldkslfmsd added a commit that referenced this pull request Jan 25, 2024
@Gldkslfmsd
Collaborator

I made some updates to your PR: VAD and translate options, and code cleanup.

@tijszwinkels
Contributor Author

lovely, thanks!

@Gldkslfmsd
Collaborator

So, I got the results on ESIC dev2: 3 times 43 minutes, 3 times 27 documents, ASR in En, Cs, De.

I compared comp.-aware mode with segment-15 and min-chunk-size 1.0 s.

The WER of the OpenAI API is twice as high as that of the faster-whisper large-v3 model on an NVIDIA A100. I hypothesize that they don't use the large model in the API but a smaller one with worse quality.

The latency of the OpenAI API is 3 times worse, and it is very unstable and unreliable.

Cost: approximately 8 times the audio duration is processed in this mode, so 0.048 USD per minute in streaming mode.
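
For reference, the arithmetic behind that figure, assuming OpenAI's listed whisper-1 price of 0.006 USD per minute of audio:

```python
price_per_audio_minute = 0.006   # USD, whisper-1 list price (assumption)
overlap_factor = 8               # ~8x the audio duration is re-sent in this mode
print(overlap_factor * price_per_audio_minute)  # -> 0.048 USD per streamed minute
```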

So, @tijszwinkels, or anyone, do you have similar results, or better? Someone proposed the Azure Whisper API; it could be more reliable and faster.

@Gldkslfmsd
Collaborator

I used the VAD. My second hypothesis is that it should be improved in the OpenAI API backend. Right now it works incorrectly, filtering out segments with a no-speech probability threshold > 0.8. Maybe it should rather filter words?

@tijszwinkels
Contributor Author

Thank you for these tests! At least this is good to know.
According to the docs (https://platform.openai.com/docs/guides/speech-to-text), they're still using the large-v2 model (weird that they don't update their own API), but I wouldn't expect the difference between v2 and v3 to be quite this large.

Actually, another reason (sorry for not thinking of this earlier): the OpenAI API doesn't provide word-level timestamps (only segment-level), so I interpolated word-level timestamps by assuming equal length for all words. This obviously leads to incorrect timestamps in some cases, but I find it hard to estimate the consequences for the final output.
See: https://github.com/ufal/whisper_streaming/pull/52/files#diff-a270860122060d07d4ae5ba131afc258fd70131ed20b8aa8c258303789a1c8bdR178 - What do you think?

Alternatively, we could implement other APIs, such as one of the WhisperX back-ends on https://replicate.com/. The disadvantage is that the APIs I've seen there so far don't take the audio chunk from the API request but expect the audio to be uploaded to a publicly available URL, which seems cumbersome for many small audio chunks. But I could search around a bit more.


# Assign start and end times for each word
# We only have timestamps per segment, so interpolating start and end-times
# assuming equal duration per word
@Gldkslfmsd
Collaborator

I suggest proportional to the character length
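
A possible sketch of that suggestion (hypothetical helper, not the PR's code): split a segment's duration across its words in proportion to their character lengths:

```python
def interpolate_word_times(words, seg_start, seg_end):
    # Distribute the segment duration over the words proportionally
    # to each word's character length instead of equal shares.
    total_chars = sum(len(w) for w in words) or 1
    duration = seg_end - seg_start
    timed, cursor = [], seg_start
    for w in words:
        step = duration * len(w) / total_chars
        timed.append((w, cursor, cursor + step))
        cursor += step
    return timed
```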

@Gldkslfmsd
Collaborator

The difference between large-v2 and v3 is around 1% WER.
I noticed that some segments or words are omitted in the API outputs. I think the VAD causes the quality issue. Maybe also the timestamps.

Let's not go with the Replicate back-end; it doesn't seem useful. Maybe the Azure API?

@tijszwinkels
Contributor Author

> I think the VAD causes the quality issue. Maybe also the timestamps.

What if we disable VAD? Lots of hallucinations?

@Gldkslfmsd
Collaborator

> > I think the VAD causes the quality issue. Maybe also the timestamps.
>
> What if we disable VAD? Lots of hallucinations?

Yes, hallucinations on silence and non-voice sounds. We could run VAD locally and send only the voiced parts of the audio to the API.
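
One possible shape of that idea, sketched with Silero VAD loaded via torch.hub (an assumption; not part of this PR):

```python
import torch

# Silero VAD returns the model plus a tuple of helper functions.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

wav = read_audio("chunk.wav", sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)
# Concatenate only the voiced regions and send just those to the API.
voiced = collect_chunks(speech_ts, wav)
```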

@Gldkslfmsd
Collaborator

So, I checked one En ASR document. Faster-whisper large-v2 has 9.5% WER with VAD, 10.2% without VAD.
The OpenAI API has 13.14% WER regardless of the mode: comp. aware with VAD, comp. unaware with VAD on or off.

@Gldkslfmsd
Collaborator

I compared the outputs and errors. I suspect the rough approximation of word-level timestamps in the OpenAI API backend: it misses some words in the middle of sentences regardless of VAD.

[Screenshot taken 2024-01-29 16-03-38]

Left: OpenAI API, right: faster-whisper v2. The top line in both is gold.

@tijszwinkels
Contributor Author

Thanks for your extensive testing! I'll update the estimation to be based on character length, but I might not get round to it until Wednesday, unfortunately.

@Gldkslfmsd
Collaborator

Well, I suggest trying the Azure API instead of the character-proportional timestamps; those would be only a slightly better approximation. The OpenAI API is also very slow and has unstable latency. Someone told me that Azure has better latency and is more robust, and it has word-level timestamps. The API itself should be very easy to swap in for the OpenAI one.

Anytime is OK, I don't have any immediate plans for this anyway. Thanks!

@tijszwinkels
Contributor Author

Alright, if Azure basically fixes all these issues, I'll definitely look into that first then!

@tijszwinkels
Contributor Author

tijszwinkels commented Feb 8, 2024

I'm not very enthusiastic about the Azure API.

They have two options.

The Speech-to-text option is designed to be synchronous, but it has no settings and no timestamps. It just returns the whole text for the sent audio file at once. This makes it entirely unusable for our purposes.

The Batch transcription option is more flexible and can do word-level timestamps, but it's explicitly designed for large batch jobs and most certainly not for latency-sensitive applications:

> Batch transcription jobs are scheduled on a best-effort basis. At peak hours, it may take up to 30 minutes or longer for a transcription job to start processing.

So I don't think using the Azure APIs is the way to go.

@tijszwinkels
Contributor Author

tijszwinkels commented Feb 9, 2024

The OpenAI API actually supports word-level timestamps through timestamp_granularities[]. :)

Fix coming soon!
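
A minimal sketch of the request, assuming the openai Python SDK (v1) and verbose_json output:

```python
from openai import OpenAI

client = OpenAI()

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries real start/end times, so no interpolation is needed.
for w in transcript.words:
    print(w.start, w.end, w.word)
```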

@tijszwinkels
Contributor Author

The word-timestamp interpolation based on character length had the same problem with missing words at chunk boundaries, but with word-level timestamps, the OpenAI API gives results identical to offline Whisper in my tests:

Original:
Nadat de ervaring mij geleerd had, dat al wat zo in het gewone leven volkomt ijdel en nietig is, en ik inzag dat alles waarvoor en wat ik vreesde niets goeds nog kwaads bevatte, tenzij alleen voor zover mijn gemoed er door bewogen werd, besloot ik eindelijk te onderzoeken of er ook iets bestond dat een waarachtig goed was, dat men deelachtig zou kunnen worden, en waardoor alleen, met verwerping van al het overige, de ziel kon worden vervuld.

Streaming:
Nadat de ervaring mij geleerd had dat al wat zo in het gewone leven volkomt ijdel en nietig is en ik inzag dat alles waarvoor en wat ik vreesde niets goeds nog kwaads bevatte tenzij alleen voor zover mijn gemoed er door bewogen werd besloot ik eindelijk te onderzoeken of er ook iets bestond dat een waarachtig goed was dat men deelachtig zou kunnen worden en waardoor alleen met verwerping van al het overige de ziel kon worden vervuld.

In my preliminary testing, this seems good enough for my use case!
I wonder what you think.

@Gldkslfmsd
Collaborator

Thanks, @tijszwinkels! I plan to test the code next week; I'm busy.

Meanwhile, can you merge main into this branch and test it? There's a new feature of automatic language detection when the language parameter is None. Does it work with the API?

@tijszwinkels
Contributor Author

Rebased the branch and made language auto-detection work with --lan auto.

Right now the default language is 'en' if not specified on the CLI. Perhaps the default should be auto?
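
A small sketch of what --lan auto could map to, assuming that omitting the optional language parameter lets the API auto-detect the language (the helper name is hypothetical):

```python
def api_language_kwargs(lan):
    # With --lan auto, omit `language` entirely so the API detects it;
    # otherwise pass the ISO-639-1 code through (e.g. "nl", "en").
    if lan is None or lan == "auto":
        return {}
    return {"language": lan}
```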

@Gldkslfmsd
Collaborator

Thanks. OK, let's make auto the default.

@Gldkslfmsd
Collaborator

OK, I tested it; the quality seems alright.

The latency is around 4 seconds higher than local faster-whisper with min-chunk-size 1 second, but it can't be better...

So let's merge it.

@Gldkslfmsd merged commit e11a5ba into ufal:main on Feb 19, 2024