Speaker ID training #164
Comments
This is a good question! Right now we're not technically using pyannote in the tool, only some of its functionality to embed what speakers "sound like". This is how the algorithm works behind the scenes:
I hope this makes sense! Now, pyannote is quite powerful in actually identifying speakers. The plan is to go further with this, so that we're able to:
We were focused on integrating other stuff until now, but hopefully we'll be able to get to that soon. The main reason we chose to code speaker change detection first instead of speaker identification is that the pyannote models have a few special licensing needs (e.g. you have to have a Hugging Face account and accept the author's terms before being able to download them), and I wanted to dig a bit further into what that would mean for the final user who has no idea how these things work. Nevertheless, it's probably doable... it just needs a bit of attention to make it work. After writing all this down, I'm not sure that I answered your question, so feel free to ping me back!! Cheers
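For anyone curious what embedding-based speaker change detection can look like in general, here is a minimal sketch. It is not StoryToolkit's actual code: the function names, the toy 3-dimensional vectors, and the 0.7 threshold are all assumptions made for illustration. In practice the embeddings would come from a speaker embedding model, one per transcript segment, and the threshold would need tuning.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_speaker_changes(embeddings, threshold=0.7):
    """Return the indices of segments that likely start a new speaker.

    A change is flagged when a segment's embedding is less similar to the
    previous segment's embedding than `threshold` (a hypothetical value).
    """
    changes = []
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            changes.append(i)
    return changes


# Toy data standing in for real per-segment speaker embeddings:
# the first two segments "sound" alike, as do the last two.
segments = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.2],
]

print(detect_speaker_changes(segments))  # prints [2]
```

Note that this only says where the voice changes, not who is speaking. Recognizing specific people would additionally require reference embeddings for each known speaker to compare against.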
I haven't looked much at the indexing side of your tool, but I do know that tens of thousands of dollars are spent per season on transcriptions for many of these unscripted shows. The limited Speaker ID you are doing now is already what separates you from similar tools, for me. If you could make that work first, I think you would get some attention.
Is your feature request related to a problem? Please describe.
I work with a lot of reality television, and I'm deep in the South for a couple of my shows. The speakers are often mumbling in a deep vernacular. I would like to be able to submit some clips to expand the tool's ability to identify these individual speakers. The model actually does much better than I expected at figuring out what they are saying, but it doesn't have much luck differentiating people, and it sometimes breaks off into a new speaker mid-sentence. I'm not actually a programmer, although I'm not bad at pretending to be one. I would like to be able to improve the tool so it can recognize the people who appear in the show the most.
Describe the solution you'd like
It looks like pyannote is what you are using for identification, so I'm asking more for clarity for us laymen. Is it possible to train the tool to recognize certain people better in order to increase transcription efficiency? If so, would that be something that I can do directly via StoryToolkit, or would I need to work directly through pyannote? Based on your answer I can research how to actually accomplish this; I'm mostly just looking for a direction to go in.
Describe alternatives you've considered
Cry
Additional context