Implement several new metrics for speech recognition #2451
Tests seem to fail because…
Added a link in the main post to a tutorial I just finished.
Updated the tutorial with the new TextEncoder HF interface. I believe this should be ready for review again.
```python
batch_precision = precision_values * precision_weights

for i, utt_id in enumerate(ids):
    # TODO: optionally provide a token->token map
```
It's not actually implemented yet, but the TODO indicates that it can be done and roughly where. It can be done later, or just implemented if we figure out it's useful in practice.
I am actually not fully sure what form it would take, and it wouldn't be very useful without a way to present it (which doesn't seem very convenient to do in a text interface, as opposed to e.g. a graph/table view using graphviz or matplotlib).
That said, I still don't understand why CI is failing.
The error is due to https://huggingface.co/speechbrain/SSL_Quantization being private. Could you please add a doctest skip in the example so that it is not run?
The code looks clean! I am only wondering why the CI is not asking you to complete the "Returns" section of some docstrings, since it should be required; other than that, it looks good to me.
As for the tutorial, could you please try to stick with the format of the SpeechBrain Colabs? (same header, etc.)
What does this PR do?
The goal of this PR is to introduce a number of new metrics, along with the supporting interfaces and package integrations that go with them. The metrics picked here were suggested and compared in the paper "Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition", with the hope of addressing shortcomings of the WER in ASR.
Much of the PR upgrades existing metrics for flexibility, and some of the metrics are suitable for tasks other than ASR.
No new required dependencies are added. flair and spaCy are added as optional dependencies, in the form of optional modules under speechbrain.lobes, if only to help make model loading as consistent as possible with how we do HF hub loading in SB. Whether this whole approach is the way to go should be discussed for this PR... I feel weird about these modules because they add annoying dependencies for the CI and for docs generation, and they are rather incomplete, as they only implement what is necessary for this PR (but they can be extended for more use cases).
Other changes
Tutorial:
The following tutorial demonstrates how to use all of the proposed metrics on sample ASR predictions over a French corpus (taken from https://github.com/thibault-roux/hypereval/). The metrics are defined in terms of hyperparameters, so that they can easily be copied and integrated into recipes:
https://gist.github.com/asumagic/75a362614b55695be8c4b729567b252a
Introduced/suggested metrics
Part-of-speech Error Rate (POSER)
WER is estimated over parts of speech instead of words.
In order to support this conveniently, this PR adds a thin integration with the flair toolkit, which is frequently used to implement POS-tagging models.
The paper proposes a variant (uPOSER) with broad POS categories, but we do not implement it explicitly: with this PR, uPOSER can be achieved with the synonym dictionary mechanism (or with the token mapping that already exists in the ER classes, though I haven't tried that).
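For illustration, here is a rough sketch (not the PR's actual interface) of how POS sequences could be extracted with flair before computing an error rate over them; the model name and label access assume a recent flair release:

```python
# Hypothetical sketch: turn transcripts into POS tag sequences with flair,
# then compute the usual WER over the tag sequences to obtain POSER.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/pos-english")  # example model choice

def to_pos_tags(text: str) -> list:
    sentence = Sentence(text)
    tagger.predict(sentence)
    return [token.get_label("pos").value for token in sentence]

ref_tags = to_pos_tags("the cat sat on the mat")
hyp_tags = to_pos_tags("the cat sat in the mat")
# POSER == WER(ref_tags, hyp_tags), using any edit-distance-based ER utility
```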
Lemma Error Rate (LER)
WER is estimated over lemmas instead of words.
In order to support this conveniently, this PR adds a thin integration with the spaCy toolkit. Note that spaCy's model download mechanism is different and does not use the HF hub, so we do not try to integrate it any more tightly.
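As a sketch of the idea (the model name is just an example and assumes the pipeline was downloaded beforehand):

```python
# Hypothetical sketch: map transcripts to lemma sequences with spaCy,
# then compute the usual WER over lemmas to obtain LER.
import spacy

# e.g. after: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

def to_lemmas(text: str) -> list:
    return [token.lemma_ for token in nlp(text)]

print(to_lemmas("les chats mangeaient"))  # lemmas such as ['le', 'chat', 'manger']
```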
Embedding Error Rate (EmbER)
EmbER weights the WER with a check over the cosine similarity of word embeddings. See the code for more details.
Because word-level embeddings are required, subword tokenization is an issue. Thus, this PR also adds a simple wrapper for flair embeddings, which provides support for word-level embeddings like fastText. The models are rather large, but it works.
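The core idea can be sketched as follows (the threshold and penalty values here are illustrative hyperparameters, not the PR's defaults):

```python
# Rough sketch of the EmbER weighting: substitutions between words whose
# embeddings are similar enough receive a reduced penalty in the WER.
import numpy as np

def ember_substitution_weight(emb_a, emb_b, threshold=0.4, low_penalty=0.1):
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return low_penalty if cos >= threshold else 1.0
```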
Note: Facebook's fasttext package was initially used, but it comes with headaches at install time and has been archived; flair was being integrated anyway, with equivalent and significantly stronger word embedding support, so that was both more powerful and simpler.
BERTScore
BERTScore introduces recall, precision and F1 metrics which are computed over contextualized embeddings extracted from a BERT-like LM, currently hardcoded to use a HuggingFace Transformers interface. See code and docs for more details.
This PR adds a simple, well-documented reimplementation of BERTScore, which should closely match the scores obtained by the reference implementation. No additional dependency is required.
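In condensed form, the computation looks roughly like this (special-token filtering and optional IDF weighting are omitted; the model name is an example):

```python
# Condensed sketch of BERTScore: greedy cosine matching of contextualized
# token embeddings between hypothesis and reference.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # example model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

ref, hyp = embed("hello there world"), embed("hello world")
sim = hyp @ ref.T                          # pairwise cosine similarities
precision = sim.max(dim=1).values.mean()   # best match for each hyp token
recall = sim.max(dim=0).values.mean()      # best match for each ref token
f1 = 2 * precision * recall / (precision + recall)
```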
Sentence Semantic Distance (SemDist)
SemDist compares sentence embeddings output by a BERT-like LM, using cosine similarity.
Two modes are currently proposed to determine what to compute the similarity on, one of them being the [CLS] token embedding.
Additionally, Roux's paper cited earlier uses a sentence embedding model, which we do not explicitly use. Currently, this is hardcoded to use a HuggingFace Transformers LM interface. No interface for sentence embedding models is currently provided, but this would be an easy addition; however, it would require adding a dependency, as HF Transformers does not seem to wrap such models.
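A minimal sketch of the [CLS] mode (the model name is an example; some papers additionally rescale the result by a constant factor):

```python
# Sketch of SemDist: 1 - cosine similarity between sentence-level embeddings,
# here taken from the [CLS] token of a BERT-like model.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # example model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def cls_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0, 0]  # [CLS] position

ref = cls_embedding("turn the lights on")
hyp = cls_embedding("turn the light on")
semdist = 1.0 - torch.nn.functional.cosine_similarity(ref, hyp, dim=0)
```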
Synonym dictionaries
This PR also allows defining "synonym" dictionaries for words that should be considered identical for the WER.
Since the WER function has been made to accept an equality function as a parameter, plugging this into the WER calculation is trivial.
As mentioned earlier, one of the use cases is to define classes of words that should be considered equivalent when wrapping the WER (e.g. for the uPOSER metric implementation).
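A toy sketch of what such an equality function can look like (the dictionary contents and names are made up):

```python
# Toy sketch: an equality function that treats listed synonyms as identical,
# pluggable into a WER computation that accepts a custom equality function.
synonyms = {"car": {"automobile"}, "automobile": {"car"}}

def words_equal(ref_word: str, hyp_word: str) -> bool:
    return ref_word == hyp_word or hyp_word in synonyms.get(ref_word, set())

assert words_equal("car", "automobile")
assert not words_equal("car", "boat")
```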