
[WIP] initial implementation to support audio processing as arrays #40

Merged · 18 commits from feat/support-speech-from-array into dev · Mar 27, 2024

Conversation

@g8a9 (Owner) commented Mar 6, 2024

This PR has multiple goals.
The main one is to support numpy arrays for audio rather than local paths only. Accepting only local paths is a strong limitation, since many audio datasets today come from the HF Hub already decoded into np.ndarray.

However, while inspecting the code, I noticed there are still several dependencies on local files that we don't want to have. For example, the code expects:

  • local transcripts for most of the explainers (I think we should compute them in a single, centralized place)
  • local files for adding noise (white and pink)

Another important goal is to standardize transcript generation, which is currently done only by the LOO explainers.
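To make the main goal concrete, here is a rough sketch of the intended call pattern. The dataset name, the FerretAudio import path, and its constructor signature are illustrative assumptions, not the final API:

```python
from datasets import load_dataset

from ferret import FerretAudio  # assumed import path

# Audio columns on the HF Hub are typically decoded on access:
ds = load_dataset("PolyAI/minds14", "en-US", split="train")
sample = ds[0]["audio"]  # {"array": np.ndarray, "sampling_rate": int, ...}

# What this PR enables (hypothetical signature): wrap the decoded array
# directly instead of pointing to a file on disk.
audio = FerretAudio(sample["array"], sampling_rate=sample["sampling_rate"])
```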

@g8a9 (Owner, Author) commented Mar 7, 2024

@emanuele-moscato I've left several TODO markers around for points to discuss; you can find them by searching for #TODO GA.

@emanuele-moscato (Collaborator) commented Mar 7, 2024

Plan for the explainers under ferret/explainers/explanation_speech/:

  • Wherever the path to an audio file is currently required (typically in the compute_explanation methods), pass an object of type FerretAudio (newly defined in this PR) instead; see the sketch after this list.
  • Make FerretAudio able to transcribe to text (a dedicated method called from within the explainers when/if needed). --> NOT VALID ANYMORE
  • Check the SpeechXAI examples notebook.
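A minimal sketch of the first point. The field names and the explainer signature are assumptions; the real class is the one added in this PR:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FerretAudio:
    """Assumed shape of the new wrapper: a decoded waveform plus its rate."""
    array: np.ndarray
    sampling_rate: int


def compute_explanation(audio: FerretAudio, target: int = 0):
    """Explainers receive the wrapper rather than a path on disk."""
    waveform = audio.array  # no file I/O inside the explainer anymore
    ...
```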

@gaiageagea force-pushed the feat/support-speech-from-array branch from c0c631e to ed15ea1 on March 15, 2024 08:04
@emanuele-moscato (Collaborator) commented Mar 15, 2024

To do:

  • In the definition of the ExplanationSpeech class, rename the kwarg audio_path to audio and fix the code accordingly everywhere the class is used (making sure an object of type FerretAudio is passed).
  • Address the comments in the last code review.
  • Finish updating the speech explainers so they accept an object of type FerretAudio as input.
  • See the previous to-do list.

Note: the audio transcription functions transcribe_audio and transcribe_audio_with_model were moved from ferret/explainers/explanation_speech/utils_removal.py to ferret/speechxai_utils.py to avoid a circular import. The rest of the code has already been updated, but please give it one last check!
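For reference, the updated import (module and function names as stated above):

```python
# New home of the transcription helpers (previously in
# ferret/explainers/explanation_speech/utils_removal.py):
from ferret.speechxai_utils import transcribe_audio, transcribe_audio_with_model
```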

@gaiageagea force-pushed the feat/support-speech-from-array branch from 5fd52b2 to 721beb2 on March 15, 2024 18:36
@emanuele-moscato (Collaborator) commented Mar 18, 2024

Issues to solve:

  • When loading an audio file into a FerretAudio object and then extracting the AudioSegment object (to_pydub method), the audio gets distorted. Check FerretAudio's to_pydub method.

Action: modified FerretAudio's to_pydub method so that it creates a pydub AudioSegment from an array that is always unnormalized and of dtype int16 (sketched below).
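A minimal sketch of that fix, assuming the stored array is float and normalized to [-1, 1] (the 2 ** 15 factor matches the result noted below):

```python
import numpy as np
from pydub import AudioSegment


def to_pydub(array: np.ndarray, sampling_rate: int) -> AudioSegment:
    """Always hand pydub an unnormalized int16 buffer to avoid distortion."""
    if array.dtype != np.int16:
        # Undo the [-1, 1] normalization; clip to stay inside the int16 range.
        array = np.clip(array * (2 ** 15), -(2 ** 15), 2 ** 15 - 1).astype(np.int16)
    return AudioSegment(
        array.tobytes(),
        frame_rate=sampling_rate,
        sample_width=2,  # int16 -> 2 bytes per sample
        channels=1,      # mono only (see the shape check further down)
    )
```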

  • Move audio resampling (when the native sample rate differs from 16 kHz) into the transcription method.
  • Move normalization of the audio array (if needed) into the transcription method as well, so there is no ambiguity about whether the user-provided array is normalized: it stays as it is from the start.
  • Check that the new way of obtaining arrays from audio (librosa.load) returns the same as the old one (AudioSegment.from_wav --> pydub_to_np).

Result: numerically it does; both return an array of dtype float32 normalized by a factor of 2 ** 15, but the shapes differ: for mono audio, librosa returns a flattened array while pydub_to_np returns an array of shape (n_samples, 1).
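A quick check of that equivalence (the file name is a placeholder; the import path and return value of the project's pydub_to_np helper are assumptions):

```python
import librosa
import numpy as np
from pydub import AudioSegment

from ferret.speechxai_utils import pydub_to_np  # assumed import path

y_librosa, sr = librosa.load("clip.wav", sr=None)         # float32, shape (n_samples,)
y_pydub = pydub_to_np(AudioSegment.from_wav("clip.wav"))  # float32, shape (n_samples, 1)

# Numerically identical once the trailing channel axis is collapsed:
assert np.allclose(y_librosa, y_pydub.squeeze(axis=-1))
```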

  • Raise an error if the numpy array or audio file passed to FerretAudio has more than one channel (we only support mono audio!).

Action: this is inferred by looking at the shape of the array: a 1-dimensional array is fine; a 2-dimensional array is fine only if the trailing dimension is 1 (shape (n_samples, 1)); an array of dimension > 2 has a shape we don't understand, so an exception is raised.
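A sketch of that check (the helper name is assumed):

```python
import numpy as np


def validate_mono(array: np.ndarray) -> np.ndarray:
    """Accept only mono audio, flattening a trailing singleton channel axis."""
    if array.ndim == 1:
        return array  # (n_samples,): already mono
    if array.ndim == 2 and array.shape[1] == 1:
        return array.squeeze(axis=1)  # (n_samples, 1): flatten to mono
    raise ValueError(f"Only mono audio is supported; got shape {array.shape}.")
```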

  • Remove the remove_word_np function if it ends up not being used.

emanuele-moscato and others added commits on March 19, 2024 12:55:
- if word timestamps are not provided, they are generated on the fly
- each word timestamp expects a word transcript
- word timestamps are not external to the FerretAudio class
- add a new notebook to show this behavior
- updated the new notebook
- adapted the paraling explainer
- [WIP] code crashes if no ffmpeg is found on the machine
- final edits to methods to update
- update the notebook name
- WIP: need to check that everything returns expected results
- WIP: need to check that the notebook with local loading works
@g8a9 (Owner, Author) commented Mar 19, 2024

Regarding normalization: we should not touch the input array unless (1) it is not normalized and (2) we are transcribing it with whisperX. That is the only case where we really need normalization and a 16 kHz sampling rate.
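A sketch of that policy (the function name is illustrative; the 2 ** 15 factor and the 16 kHz target come from the discussion above):

```python
import librosa
import numpy as np

WHISPERX_SR = 16_000  # whisperX expects normalized float audio at 16 kHz


def prepare_for_whisperx(array: np.ndarray, sampling_rate: int) -> np.ndarray:
    """Return a normalized, 16 kHz version for transcription; the user's
    array itself is never modified."""
    audio = np.asarray(array, dtype=np.float32)
    if np.abs(audio).max() > 1.0:  # looks unnormalized (e.g. raw int16 values)
        audio = audio / (2 ** 15)
    if sampling_rate != WHISPERX_SR:
        audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=WHISPERX_SR)
    return audio
```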

@emanuele-moscato merged commit 4d46242 into dev on Mar 27, 2024 (1 check passed)