openai-api backend #52
Conversation
(force-pushed from cd7d629 to b7cb783)
Wow, thanks a lot! You're solving my current issue :) Have you compared latency and quality to faster-whisper? Can you share the results? I'd like to merge it only if there's evidence that this feature is useful. |
Anecdotally this works well, but let me see whether I can give some objective results. Latencies are roughly comparable to those on my (abused) 3090: OpenAI:
faster-whisper:
|
transcription of a professional conversation:
The results are not identical, but seem of similar quality. |
OK. Whether the results are identical should be checked with --comp_unaware mode. I confirm it looks very good. I will run a benchmark and evaluate latency and quality. |
Btw., how is it possible to use VAD? |
Also, let's add logging of how many seconds of audio are actually processed through the API, so that the cost can be calculated. |
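A minimal sketch of what such logging could look like, assuming the backend knows the duration of each chunk it sends; the class, names, and the per-minute price are illustrative assumptions, not part of the PR:

```python
# Hypothetical sketch: track how many seconds of audio were sent to the API
# and log a running cost estimate. Names and the assumed price are illustrative.
import logging

logger = logging.getLogger(__name__)

class ApiUsageTracker:
    ASSUMED_PRICE_PER_MINUTE_USD = 0.006  # assumed OpenAI Whisper pricing

    def __init__(self):
        self.seconds_processed = 0.0

    def add_chunk(self, chunk_seconds: float) -> None:
        """Record one transcribed chunk and log cumulative usage and cost."""
        self.seconds_processed += chunk_seconds
        cost = self.seconds_processed / 60.0 * self.ASSUMED_PRICE_PER_MINUTE_USD
        logger.info("API has processed %.1f s of audio so far (approx. %.4f USD)",
                    self.seconds_processed, cost)
```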
Regarding VAD: the API tells us how likely it is that there's no speech. See: https://github.com/ufal/whisper_streaming/pull/52/files#diff-a270860122060d07d4ae5ba131afc258fd70131ed20b8aa8c258303789a1c8bdR167 Right now I just skip these segments regardless of VAD settings, but that's possibly not the right way to do it. |
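For illustration, a minimal sketch of this segment-level skipping, assuming the segments from a verbose_json response have been parsed into plain dicts carrying a no_speech_prob field; the 0.8 threshold is the value discussed later in this thread:

```python
# Sketch: drop segments that the API marks as probably containing no speech.
# Assumes dict-like segments from a verbose_json response with a no_speech_prob field.
NO_SPEECH_THRESHOLD = 0.8  # illustrative threshold

def keep_speech_segments(segments, threshold=NO_SPEECH_THRESHOLD):
    """Return only the segments that are likely to contain speech."""
    return [seg for seg in segments if seg.get("no_speech_prob", 0.0) <= threshold]
```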
I made updates in your PR: VAD and translate options, and code cleanup. |
lovely, thanks! |
So, I got the results on ESIC dev2: 3 times 43 minutes, 3 times 27 docs, ASR in En, Cs, De. I compared comp. aware mode with segment-15 and min-chunk-size 1.0 s. The WER of the OpenAI API is twice as high as faster-whisper with the large-v3 model on an NVIDIA A100. I hypothesize that they don't use the large model in the API, but a smaller one with worse quality. The latency of the OpenAI API is 3 times worse, and it is very unstable and unreliable. Cost: approximately 8 times the audio duration is processed with this mode, so about 0.048 USD per minute of audio in the streaming mode (8 × 0.006 USD/min). So, @tijszwinkels , or anyone, do you have similar results, or better? Someone proposed the Azure Whisper API; it could be more reliable and faster. |
I used the VAD. My second hypothesis is that it should be improved in the OpenAI API backend. It currently works incorrectly, filtering out segments with a threshold > 0.8. Maybe it should rather work on words? |
Thank you for these tests! At least this is good to know! Actually, another possible reason (sorry for not thinking of this earlier): the OpenAI API doesn't provide word-level timestamps (only segment-level), so I interpolated word-level timestamps by assuming equal length for all words. This obviously leads to incorrect timestamps in some cases, but I find it hard to estimate the consequence for the final output. Alternatively, we could implement other APIs, such as one of the WhisperX back-ends on https://replicate.com/ . The disadvantage is that the APIs I've seen there so far don't take the audio chunk from the API request but expect the audio to be uploaded to a publicly available URL, which seems cumbersome for many small audio chunks. But I could search around a bit more. |
whisper_online.py:
# Assign start and end times for each word
# We only have timestamps per segment, so interpolating start and end times
# assuming equal duration per word
I suggest interpolating proportionally to the character length of each word. |
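A minimal sketch of that suggestion, with hypothetical names, distributing a segment's time span over its words in proportion to their character lengths:

```python
# Sketch: interpolate per-word timestamps from a single segment-level timestamp,
# giving each word a share of the segment duration proportional to its length.
def interpolate_word_timestamps(seg_start, seg_end, words):
    """Return (word, start, end) tuples; words is a list of strings."""
    total_chars = sum(len(w) for w in words)
    if not words or total_chars == 0:
        return []
    duration = seg_end - seg_start
    result, t = [], seg_start
    for w in words:
        w_dur = duration * len(w) / total_chars
        result.append((w, t, t + w_dur))
        t += w_dur
    return result
```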
The difference between large-v2 and v3 is around 1% WER. Let's not go to the Replicate back-end, it doesn't seem useful. Maybe the Azure API? |
What if we disable VAD? Lots of hallucinations? |
https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-preview-of-openai-whisper-in-azure-openai-service/ba-p/3928388 -- Azure should have word-level timestamps |
Yes, hallucinations on silence and non-voice sounds. We could run VAD locally and send only the voiced parts of the audio to the API. |
So, I checked one En ASR document. Faster-whisper large-v2 has 9.5% WER with VAD, 10.2% without VAD. |
Thanks for your extensive testing! I'll update to estimation based on character length, but might not get round to it until Wednesday, unfortunately. |
Well, I suggest trying the Azure API instead of the character-proportional timestamps, which would be only a slightly better approximation. The OpenAI API is also very slow and has unstable latency. Someone told me that Azure has better latency and is more robust, and it has word-level timestamps. It should be very easy to replace the OpenAI API with it. Anytime is OK, I don't have any near-term plans with this anyway. Thanks! |
Alright, if Azure basically fixes all these issues, I'll definitely look into that first then! |
I'm not very enthusiastic about the Azure API. They have two options. The Speech-to-text API is designed to be synchronous, but has no settings and no timestamps; it just returns the whole text for the sent audio file at once. This makes it entirely unusable for our purposes. The Batch transcription API is more flexible and can do word-level timestamps, but it's explicitly designed for large batch jobs and most certainly not for latency-sensitive applications:
So I don't think using the Azure APIs is the way to go. |
The OpenAI API actually supports word-level timestamps. Fix coming soon! |
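For reference, a minimal sketch of requesting word-level timestamps with the openai Python client; the file name is illustrative, an OPENAI_API_KEY is assumed in the environment, and word granularity requires response_format="verbose_json":

```python
# Sketch: ask the OpenAI transcription endpoint for word-level timestamps.
# The audio file name is illustrative, not taken from this PR.
from openai import OpenAI

client = OpenAI()
with open("chunk.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )
for word in transcript.words:
    print(word.word, word.start, word.end)
```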
The word-timestamp interpolation based on character length had the same problem with missing words on chunk boundaries, but with word-level timestamps, the OpenAI API gives identical results to the offline Whisper in my tests:
In my preliminary testing, this seems good enough for my use case! |
Thanks, @tijszwinkels ! I plan to test the code next week; I'm busy. Meanwhile, can you merge main into this branch and test it? There's a new feature of automatic language detection when the language parameter is None. Does it work with the API? |
(force-pushed from 00d0179 to 5428d1c)
Rebased the branch and made language auto-detection work with --lan auto. Right now the default language is 'en' if not specified on the CLI. Perhaps the default should be auto? |
Thanks. OK, let's make auto default. |
OK, I tested it and the quality seems alright. The latency is around 4 seconds higher than local faster-whisper with a min-chunk-size of 1 second, but it can't be better... So let's merge it. |
This PR implements a back-end that uses the OpenAI Whisper API.
This way, no expensive GPU server is necessary to run this software.