
Implement several new metrics for speech recognition #2451

Merged
merged 101 commits into speechbrain:develop from metrics-roux22-interspeech on Mar 29, 2024

Conversation

@asumagic (Collaborator) commented Mar 4, 2024

What does this PR do?

The goal of this PR is to introduce a number of new metrics, along with the supporting interfaces and package integrations they require. The metrics chosen here were suggested and compared in the paper Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition, with the hope of addressing shortcomings of the WER in ASR.

Much of the PR upgrades existing metrics for flexibility, and some of the metrics are suitable for tasks other than ASR.

No new required dependencies are added. flair and spaCy are added as optional dependencies, in the form of optional modules under speechbrain.lobes, mainly to keep model loading as consistent as possible with how we do HF hub loading in SpeechBrain.
Whether this whole approach is the way to go should be discussed in this PR. I feel uneasy about these modules: they add annoying dependencies to the CI and to docs generation, and they are rather incomplete, as they only implement what is necessary for this PR (though they can be extended for more use cases).

Other changes

  • The WER calculation was fixed to work with empty references. This can be changed, but I figure it is sane enough to scale errors as if the reference contained one word.
  • There were some other edits to e.g. the WER calculation code, but nothing that changes the default behavior.

Tutorial:

The following tutorial demonstrates how to use all of the proposed metrics, defined as hyperparameters, on sample ASR predictions over a French corpus (taken from https://github.com/thibault-roux/hypereval/), so that they can easily be copied and integrated into recipes:

https://gist.github.com/asumagic/75a362614b55695be8c4b729567b252a

Introduced/suggested metrics

Part-of-speech Error Rate (POSER)

WER is estimated over parts of speech instead of words.
In order to support this conveniently, this PR adds a thin integration with the flair toolkit, which is frequently used to implement POS-tagging models.

The paper proposes a variant (uPOSER) that uses broad POS categories, but we do not reference that detail explicitly: with this PR, uPOSER can be implemented with the synonym dictionary mechanism (or with the token mapping that already exists in the error rate classes, though I haven't tried that).
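For illustration, here is a minimal sketch of how POSER could be computed by pairing flair tagging with SpeechBrain's existing ErrorRateStats. The model id and utterances are assumptions, and the PR's actual wrapper under speechbrain.lobes may expose a different interface:

```python
# Hypothetical POSER sketch: compute an error rate over POS tags instead of words.
from flair.data import Sentence
from flair.models import SequenceTagger

from speechbrain.utils.metric_stats import ErrorRateStats

tagger = SequenceTagger.load("flair/upos-english")  # illustrative model id

def pos_tags(text):
    # Tag a sentence and return one POS label per word.
    sentence = Sentence(text)
    tagger.predict(sentence)
    return [token.get_label(tagger.label_type).value for token in sentence]

poser = ErrorRateStats()
poser.append(
    ids=["utt1"],
    predict=[pos_tags("the cat eats the fish")],
    target=[pos_tags("a cat ate some fish")],
)
print(poser.summarize())
```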

Lemma Error Rate (LER)

WER is estimated over lemmas instead of words.
In order to support this conveniently, this PR adds a thin integration with the spaCy toolkit. Note that spaCy's download mechanism is different and does not use the HF hub, so we do not try to integrate it any more tightly.
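As a rough sketch, LER can be approximated by lemmatizing both sides with spaCy and scoring the lemmas with ErrorRateStats. The fr_core_news_md model and example utterances are assumptions, not what the PR ships:

```python
# Hypothetical LER sketch: compute an error rate over lemmas instead of words.
# Assumes `python -m spacy download fr_core_news_md` has been run beforehand.
import spacy

from speechbrain.utils.metric_stats import ErrorRateStats

nlp = spacy.load("fr_core_news_md")

def lemmas(text):
    # Lemmatize a sentence and return one lemma per word.
    return [token.lemma_ for token in nlp(text)]

ler = ErrorRateStats()
ler.append(
    ids=["utt1"],
    predict=[lemmas("les chats mangeaient")],
    target=[lemmas("le chat mangeait")],
)
print(ler.summarize())
```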

Embedding Error Rate (EmbER)

EmbER weights the WER with a check over the cosine similarity of word embeddings. See the code for more details.

Because word-level embeddings are required, subword tokenization is an issue. Thus, this PR also adds a simple wrapper for flair embeddings, which provides support for word-level embeddings such as fastText. The models are rather large, but it works.

Note: Facebook's fasttext package was initially used, but it comes with headaches at install time and has been archived. Since flair was being integrated anyway and offers equivalent, and significantly stronger, word embedding support, switching to it was both more powerful and simpler.
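To make the idea concrete, here is a hypothetical sketch of the EmbER weighting: substitutions between words whose embeddings are close in cosine similarity count as only a fraction of an error. The threshold and discount values are illustrative, not the PR's defaults:

```python
# Hypothetical EmbER sketch: discount substitution errors between similar words.
import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

embeddings = WordEmbeddings("fr")  # word-level fastText vectors via flair

def word_vector(word):
    sentence = Sentence(word)
    embeddings.embed(sentence)
    return sentence[0].embedding

def substitution_weight(ref_word, hyp_word, threshold=0.4, discount=0.1):
    # Substitutions between near-synonyms count as a fraction of a full error.
    similarity = torch.nn.functional.cosine_similarity(
        word_vector(ref_word), word_vector(hyp_word), dim=0
    )
    return discount if similarity.item() >= threshold else 1.0

print(substitution_weight("voiture", "auto"))     # likely discounted
print(substitution_weight("voiture", "poisson"))  # likely a full error
```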

BERTScore

[Image: BERTScore recall example]

BERTScore introduces recall, precision, and F1 metrics computed from contextualized embeddings produced by a BERT-like LM, currently hardcoded to use a HuggingFace Transformers interface. See the code and docs for more details.

This PR adds a simple, well-documented reimplementation of BERTScore, which should closely match the scores obtained by the reference implementation. No additional dependency is required.
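For intuition, a bare-bones sketch of the BERTScore computation follows (greedy cosine matching over contextualized token embeddings). The model name is an assumption, and unlike the full metric this skips special-token masking and IDF weighting:

```python
# Minimal BERTScore sketch, independent of the reimplementation in this PR.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")  # illustrative model
model = AutoModel.from_pretrained("camembert-base")

def token_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

ref = token_embeddings("le chat dort sur le canapé")
hyp = token_embeddings("le chien dort sur le canapé")

similarity = ref @ hyp.T                      # pairwise cosine similarities
recall = similarity.max(dim=1).values.mean()  # best hyp match per ref token
precision = similarity.max(dim=0).values.mean()
f1 = 2 * precision * recall / (precision + recall)
print(f"recall={recall.item():.3f} precision={precision.item():.3f} f1={f1.item():.3f}")
```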

Sentence Semantic Distance (SemDist)

[Image: SemDist example]

SemDist compares sentence embeddings output by a BERT-like LM, using cosine similarity.

Two modes are currently proposed to determine what to compute the similarity on:

  • mean of all contextualized embeddings
  • embedding of the output [CLS] token

Additionally, Roux's paper cited earlier uses a dedicated sentence embedding model, which we do not use explicitly. Currently, this is hardcoded to use a HuggingFace Transformers LM interface. No interface for sentence embedding models is provided yet, but this would be an easy addition; however, it would require adding a dependency, as HF Transformers does not seem to wrap such models.
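Here is a minimal sketch of the SemDist computation with the two pooling modes described above; the model name is illustrative, and the implementation in this PR may scale or report the distance differently:

```python
# Hypothetical SemDist sketch: cosine distance between sentence embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")  # illustrative model
model = AutoModel.from_pretrained("camembert-base")

def sentence_embedding(text, mode="mean"):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    # "mean" pools all contextualized embeddings; "cls" keeps the first token.
    return hidden.mean(dim=0) if mode == "mean" else hidden[0]

ref = sentence_embedding("le chat dort")
hyp = sentence_embedding("le chien dort")
semdist = 1.0 - torch.nn.functional.cosine_similarity(ref, hyp, dim=0)
print(semdist.item())
```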

Synonym dictionaries

This PR also allows defining "synonym" dictionaries of words that should be considered identical by the WER.
Since the WER function now accepts an equality function as a parameter, plugging synonyms into the WER calculation is trivial.

As mentioned earlier, one of the use cases is to define classes that should be considered equivalent when wrapping the WER (e.g. for the uPOSER metric implementation); a minimal sketch follows.
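The following is a self-contained toy WER with a pluggable equality function; the parameter and helper names are hypothetical, not the actual SpeechBrain API touched by this PR. Note the max(len(ref), 1) denominator, matching the empty-reference fix mentioned above:

```python
# Illustrative synonym-aware WER; names and signature are hypothetical.
def wer(ref, hyp, equals=lambda a, b: a == b):
    # Classic Levenshtein distance, parameterized over token equality.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if equals(ref[i - 1], hyp[j - 1]) else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # substitution (free for synonyms)
            )
    # Scale as if an empty reference contained one word (cf. the fix above).
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

SYNONYMS = {"ok": {"okay", "alright"}}
synonym_equals = lambda a, b: a == b or b in SYNONYMS.get(a, set())

print(wer("this is ok".split(), "this is okay".split(), synonym_equals))  # 0.0
```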


Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@asumagic asumagic force-pushed the metrics-roux22-interspeech branch from 014e4f0 to b71e830 Compare March 12, 2024 12:46
@asumagic asumagic force-pushed the metrics-roux22-interspeech branch from 82f400f to 5c90ea9 Compare March 20, 2024 12:50
@asumagic asumagic added the enhancement New feature or request label Mar 21, 2024
@asumagic (Collaborator, Author) commented:

Tests seem to fail because speechbrain/SSL_Quantization on HF returns a 401; should things requiring HF even be part of the doctests?

@asumagic asumagic force-pushed the metrics-roux22-interspeech branch from a76f72d to 9cdfad0 Compare March 22, 2024 08:52
@asumagic (Collaborator, Author) commented:

Added a link in the main post to a tutorial I just finished.

@asumagic asumagic force-pushed the metrics-roux22-interspeech branch from 899822b to b0f5d1d Compare March 28, 2024 14:56
@asumagic asumagic force-pushed the metrics-roux22-interspeech branch from 5ec6e09 to 0791eaf Compare March 28, 2024 15:04
@asumagic (Collaborator, Author) left a comment:

Updated the tutorial with the new TextEncoder HF interface. I believe this should be ready to review again.

Resolved review threads: speechbrain/utils/metric_stats.py, speechbrain/lobes/models/flair/embeddings.py
batch_precision = precision_values * precision_weights

for i, utt_id in enumerate(ids):
    # TODO: optionally provide a token->token map
@asumagic (Collaborator, Author) commented on the TODO above:

It's not actually implemented yet, but the TODO indicates that it can be done and roughly where. It can be done later, or implemented once we figure out it's useful in practice.

I am actually not fully sure what form it would take, and it wouldn't be very useful without a way to present it (which doesn't seem very convenient to do in a text interface, as opposed to e.g. a graph/table view using graphviz or matplotlib).

@asumagic (Collaborator, Author) commented:

That said, I still don't understand why CI is failing.

@Adel-Moumen (Collaborator) commented:

> That said, I still don't understand why CI is failing.

The error is due to https://huggingface.co/speechbrain/SSL_Quantization being private. Could you please add a doctest skip to the example so that it is not run?

@Adel-Moumen (Collaborator) left a comment:

The code looks clean! I am only wondering why the CI is not asking you to complete the "Returns" section of some docstrings, since it should be required, but other than that it looks good to me.

@Adel-Moumen (Collaborator) commented:

As for the tutorial, could you please try to stick with the shape of the SpeechBrain Colabs (same header, etc.)?

@Adel-Moumen (Collaborator) left a comment:

LGTM. Thanks @asumagic, this is great work.

CC: @mrouvier :)

@Adel-Moumen Adel-Moumen merged commit 1350e9b into speechbrain:develop Mar 29, 2024
5 checks passed
@asumagic asumagic mentioned this pull request Apr 11, 2024
Labels: enhancement (New feature or request), ready to review (Waiting on reviewer to provide feedback)