[WIP] initial implementation to support audio processing as arrays #40
Conversation
@emanuele-moscato I've left several TODO markers around for points to discuss. You can find them with:
Look at the explainers under:
Force-pushed from c0c631e to ed15ea1
Resolved review thread (outdated): ferret/explainers/explanation_speech/gradient_speech_explainer.py
…to the speechxai_utils.py module
To do:
Note: the audio transcription functions
…ers, changes to ExplanationSpeech Class
Force-pushed from 5fd52b2 to 721beb2
Resolved review thread (outdated): ferret/explainers/explanation_speech/equal_width/gradient_equal_width_explainer.py
Resolved review thread (outdated): ferret/explainers/explanation_speech/equal_width/lime_equal_width_explainer.py
Resolved review thread (outdated): ferret/explainers/explanation_speech/equal_width/loo_equal_width_explainer.py
Issues to solve:
Action: modified
Result: numerically it is equivalent: both return an array of dtype float32 normalized by a factor of 2 ** 15, but the shape is different (for mono audio)
Action: this is inferred by looking at the shape of the array: a 1-dimensional array is fine; a 2-dimensional array is fine only if the trailing dimension is 1.
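That shape check can be sketched as follows; the helper name `to_mono_1d` and its exact placement are illustrative assumptions, not code from this PR:

```python
import numpy as np


def to_mono_1d(audio: np.ndarray) -> np.ndarray:
    """Coerce a decoded audio array to the 1-D mono layout.

    A 1-dimensional array is accepted as-is; a 2-dimensional array is
    accepted only when its trailing dimension is 1 (a single channel),
    in which case it is flattened. Anything else is rejected as
    multi-channel audio.
    """
    if audio.ndim == 1:
        return audio
    if audio.ndim == 2 and audio.shape[-1] == 1:
        return audio.reshape(-1)
    raise ValueError(f"Expected mono audio, got shape {audio.shape}")
```

This keeps the check purely shape-based, matching the comment above: no channel metadata is required from the caller.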
- if word timestamps are not provided, they are generated on the fly
- each set of word timestamps expects word transcripts
- word timestamps are no longer external to the FerretAudio class
- add a new notebook to show this behavior
- updated the new notebook
- adapted the paraling explainer
- [WIP] the code crashes if no ffmpeg is found on the machine
- final edits to the methods to update
- updated the notebook name
- [WIP] need to check that everything returns the expected results
- [WIP] need to check that the notebook with local loading works
Regarding the normalization: we should not touch the input array unless 1) it's not normalized and 2) we are transcribing it with whisperX -- that's the only case where we really need normalization + 16kHz sampling rate.
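A minimal sketch of that policy, assuming a hypothetical `normalize_for_whisperx` helper and a simple amplitude heuristic to detect unnormalized int16-range samples (both are illustrative, not the PR's actual code; resampling to 16kHz is omitted here):

```python
import numpy as np


def normalize_for_whisperx(audio: np.ndarray, transcribe: bool) -> np.ndarray:
    """Leave the input array untouched unless we are about to transcribe
    it with whisperX and it is not already normalized to [-1.0, 1.0]."""
    if not transcribe:
        # Not transcribing: never touch the user's array.
        return audio
    audio = audio.astype(np.float32)
    if np.abs(audio).max() > 1.0:
        # Heuristic: values outside [-1, 1] look like raw int16 samples,
        # so scale by 2 ** 15 as discussed above.
        audio = audio / 2**15
    return audio
```

Already-normalized input passes through unchanged even when transcribing, which satisfies condition 1) above.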
…o using the numpy array when the pydub AS is not needed
…iners), commit re-evaluated notebooks
This PR has multiple goals.
The main one is to support numpy arrays for audio rather than local paths only. Supporting only paths is a strong limitation, since many audio datasets today come from the HF Hub already decoded into np.ndarray.
However, while inspecting the code, I noticed there are still several dependencies on local files that we don't want to keep. For example, the code expects:
Another important goal is to standardize transcript generation, which is currently performed only by the LOO explainers.
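As a design sketch of what standardized transcript generation could look like (the function name, signature, and caching behavior are assumptions for illustration, not this PR's API): every explainer would call one shared helper that reuses word timestamps when they are already available and otherwise generates them on the fly via an injected transcriber.

```python
from typing import Callable, List, Optional, Tuple

import numpy as np

# (word, start_seconds, end_seconds)
WordTimestamp = Tuple[str, float, float]


def get_word_timestamps(
    audio: np.ndarray,
    transcriber: Callable[[np.ndarray], List[WordTimestamp]],
    cached: Optional[List[WordTimestamp]] = None,
) -> List[WordTimestamp]:
    """Return word-level timestamps for `audio`.

    If timestamps were already computed (e.g. stored on a FerretAudio
    instance), reuse them; otherwise generate them on the fly with the
    injected transcriber (e.g. a whisperX-backed callable), so that all
    explainers share a single transcription path instead of each
    implementing its own.
    """
    if cached is not None:
        return cached
    return transcriber(audio)
```

Injecting the transcriber as a callable keeps the explainers decoupled from any specific ASR backend, which also makes this step easy to stub out in tests.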