New Fork: Web client + WebSocket + own VAD impl. #105
Comments
@marcinmatys Hi, thanks for the fork. It's really a godsend, since I was looking to put together something similar. :)
@vuduc153 Thanks for your feedback. When silence is detected, the OnlineASRProcessor finish() and init() methods are called to read the uncommitted transcription and clear the buffer. We lose context and are left with an uncommitted transcription, but in my opinion this does not have a significant impact on quality. However, I must say that this implementation is just my experiment. You have to run the tests yourself and decide whether it is appropriate or not. You could remove the online.init() line from the code below and check the difference.
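The snippet referred to above is not reproduced in this thread. As a rough illustration only, the silence handling described could be sketched as follows. `StubOnlineProcessor`, `handle_chunk`, and the `reset_on_silence` flag are hypothetical names invented for this sketch; the stub merely mimics the finish() and init() method names mentioned above and is not the fork's actual code.

```python
class StubOnlineProcessor:
    """Hypothetical stand-in for OnlineASRProcessor (not the real class)."""

    def __init__(self):
        self.init()

    def init(self):
        # Clear the audio buffer; any accumulated context is lost here.
        self.buffer = []

    def insert_audio_chunk(self, chunk):
        self.buffer.append(chunk)

    def finish(self):
        # Return whatever is still uncommitted; a placeholder string here.
        return f"<uncommitted text for {len(self.buffer)} chunk(s)>"


def handle_chunk(online, chunk, is_silence, reset_on_silence=True):
    """Feed one chunk; on silence, flush uncommitted text and optionally reset."""
    if not is_silence:
        online.insert_audio_chunk(chunk)
        return None
    flushed = online.finish()
    if reset_on_silence:
        # Skipping this call corresponds to the "remove online.init()" experiment.
        online.init()
    return flushed
```

Calling `handle_chunk(..., reset_on_silence=False)` keeps the buffer after a pause, which is one way to compare the two behaviours side by side.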
@marcinmatys Thanks for the reply. I just wanted to confirm whether that's indeed the intended logic. Calculating
Thanks for the nice work, @marcinmatys. I briefly looked at your README2 and found that you're using numpy sound intensity detection as "VAD". I think that way you can detect silence vs. non-silence, but what about noise vs. speech? In the vad_streaming branch I'm using Silero VAD, a neural torch model that detects voice vs. non-voice (noise, silence, music, etc.). It should be more robust than your numpy approach. Silero is used as the VAD in the default offline Whisper, and it was recommended to me in #39.
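To make the silence vs. noise distinction concrete, here is a minimal sketch of an intensity-based detector in numpy. It is not the fork's implementation: the function name `is_silence`, the RMS measure, and the threshold value of 0.01 are all assumptions chosen for illustration, with audio assumed to be float samples in [-1, 1].

```python
import numpy as np


def is_silence(chunk: np.ndarray, threshold: float = 0.01) -> bool:
    """Return True when the chunk's RMS energy is below `threshold`.

    `chunk` holds float audio samples in [-1, 1]; `threshold` is an
    assumed value that would need tuning per microphone and room.
    """
    if chunk.size == 0:
        return True
    rms = np.sqrt(np.mean(np.square(chunk.astype(np.float64))))
    return bool(rms < threshold)
```

A detector like this only measures loudness, so loud non-speech noise is classified as non-silence just like speech, which is the limitation raised above.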
@vuduc153 Thanks for this information and the PR. You are right; there is probably an issue with long pauses. However, there is also a problem with your new logic, so we need to improve your fix. I will write the details in the PR comment.
@Gldkslfmsd Thank you for your response and explanations. Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment we are in: whether there is noise around us, what kind of noise it is, and what microphone we are using. We have two types of microphones:
Headset microphone: the microphone in a headset that is positioned near the mouth.
Omnidirectional microphone: a microphone used in conference settings that captures sound from all directions.
I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from a speaker on the desk next to me. The headset microphone did not pick up this noise, even when the speaker was really close. Do you think that numpy sound intensity detection could work more efficiently than Silero? Maybe there should be an option to use either one: if we need a more robust tool, we use Silero; if not, we use simple numpy.
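The proposed option could look roughly like the sketch below: each detector is a callable that takes an audio chunk and returns True when it contains voice, and a name selects which one to use. The names `VadFn`, `energy_vad`, and `select_vad` are hypothetical, and only the simple numpy backend is shown; a Silero-based callable would have to be registered under its own key by the caller and is not implemented here.

```python
from typing import Callable, Dict

import numpy as np

# A detector takes one audio chunk and returns True when it contains voice.
VadFn = Callable[[np.ndarray], bool]


def energy_vad(threshold: float = 0.01) -> VadFn:
    """Build a loudness-based detector; `threshold` is an assumed value."""
    def detect(chunk: np.ndarray) -> bool:
        if chunk.size == 0:
            return False
        rms = float(np.sqrt(np.mean(np.square(chunk.astype(np.float64)))))
        return rms >= threshold
    return detect


def select_vad(name: str, backends: Dict[str, VadFn]) -> VadFn:
    """Pick a detector by name, failing loudly on unknown names."""
    if name not in backends:
        raise ValueError(f"unknown VAD backend {name!r}; choose from {sorted(backends)}")
    return backends[name]
```

With `backends = {"numpy": energy_vad(0.01)}`, calling `select_vad("numpy", backends)` returns the loudness-based detector, and a more robust model could be added to the same dictionary without changing the calling code.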
It's verified and it works very well, but the code is ugly. It needs to be cleaned up, made transparent, and self-documented. Then it can be merged. It's not in my time schedule now.
I believe there are some good reasons why Silero exists. Check their paper and other VAD papers. They may have tested it rigorously, and you can reproduce some of the tests. Numpy may be faster, simpler to install, and good enough for many. If you present evidence, we can integrate it as an option.
Hi, @marcinmatys, so I suggest you create a new repo, whisper_streaming_websocket or whatever. Definitely put your Web client + WebSocket server there; about your VAD I'm not sure, it's up to you. I will then reference your project from the README and give you credit. Thanks! Good luck!
@Gldkslfmsd Ok, sure
And finally, please give me some time for that, because I am now engaged in other projects...
I have created a fork of whisper_streaming, so I took the liberty of writing about it here.
We may close this issue soon as it is information only.
I encourage you to check it out if you are interested in topics such as
Web Browser-Based client with WebSocket Communication,
Voice Activity Detection, and Silence Processing.
If you have any comments, please write here or check out the feedback section in my README.