explainable: Add a new section
jonnor committed Sep 30, 2020
1 parent ffc2bb6 commit 061e09c
Showing 2 changed files with 132 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -96,6 +96,7 @@ Rough notes on various topics.
* [Applications](./applications.md). Practical applications of Machine Hearing
* [Tasks](./tasks.md). Established problem formulations
* [Audio Quality](./audio-quality). Metrics for measuring audio quality
* [Explainable models for Audio](./explainable).
* [Features](./features.md). Feature representations
* [Preprocessing](./preprocessing.md). Preprocessing techniques
* [DCASE2018](./dcase2018.md). Notes from DCASE2018 challenge and conference
131 changes: 131 additions & 0 deletions explainable/README.md
@@ -0,0 +1,131 @@

## Explainable Audio ML
In many domains and tasks it is highly desirable
that predictive models are explainable (also known as interpretable models),
and audio is no different.

Being able to explain model behavior in general,
for a subset of the data, or for a particular input instance can be useful for
checking robustness, debugging models, ensuring compliance, etc.

Some techniques that are being explored (a minimal SHAP sketch follows the list):

* Local Interpretable Model-agnostic Explanations (LIME)
* Layer-wise relevance propagation (LRP)
* SHapley Additive exPlanations (SHAP)
* DeepLIFT (Deep Learning Important FeaTures)
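
As a concrete example of the model-agnostic kind, below is a minimal sketch applying SHAP's KernelExplainer to a black-box audio classifier operating on flattened log-mel features. The feature layout and the placeholder `predict_proba` function are assumptions for illustration, not taken from any particular paper.

```python
import numpy as np
import shap

# Assumed black-box classifier: takes (n_samples, n_features) arrays of
# flattened log-mel features and returns class probabilities.
# Placeholder for illustration; replace with e.g. model.predict(X).
def predict_proba(X):
    p = 1 / (1 + np.exp(-X.mean(axis=1)))
    return np.stack([p, 1 - p], axis=1)

n_features = 64 * 10                       # e.g. 64 mel bands x 10 frames, flattened
background = np.zeros((10, n_features))    # reference/background samples
x = np.random.randn(1, n_features)         # instance to explain

explainer = shap.KernelExplainer(predict_proba, background)
shap_values = explainer.shap_values(x, nsamples=200)

# For a multi-class model, shap returns one attribution array per class
# (the exact container type varies between shap versions).
relevance = np.array(shap_values[0]).reshape(64, 10)   # class 0, back to (mels, frames)
```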

Quite often this is done on a spectrogram (time-frequency) representation of the audio.
The spectrograms themselves may have some degree of interpretability.
And as a 2D representation, a spectrogram also makes it possible to apply image-based explanation methods.
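
For example, LIME's image explainer can be applied directly to a spectrogram treated as a grayscale image. A minimal sketch, assuming a hypothetical `clip.wav` and a placeholder classifier in place of a real model:

```python
import numpy as np
import librosa
from lime import lime_image

y, sr = librosa.load("clip.wav", sr=16000)            # hypothetical input file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
S_db = librosa.power_to_db(S, ref=np.max)
# Scale to [0, 1] so the spectrogram can be treated as a grayscale image
img = (S_db - S_db.min()) / (S_db.max() - S_db.min())

def classify(images):
    # LIME passes a batch of RGB images (N, H, W, 3); collapse to one channel.
    specs = images.mean(axis=-1)
    # Placeholder "model": replace with real classifier probabilities.
    energy = specs.mean(axis=(1, 2))
    p = 1 / (1 + np.exp(-(energy - energy.mean())))
    return np.stack([p, 1 - p], axis=1)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(img, classify, top_labels=2,
                                         hide_color=0, num_samples=500)
_, mask = explanation.get_image_and_mask(explanation.top_labels[0],
                                         positive_only=True, num_features=5)
# `mask` marks the spectrogram regions (superpixels) supporting the top prediction.
```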

In audio we might want to (a small sketch follows this list):

- See strength of feature contributions wrt time
- See contributions in time-frequency (spectrogram) space
- Listen to only the contributing parts of the sound (source-separation)
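
A small sketch of all three, assuming a time-frequency attribution map `relevance` (here just random placeholder values) aligned with an STFT of the clip:

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=16000)     # hypothetical input clip
S = librosa.stft(y, n_fft=1024, hop_length=256)

# Assume `relevance` is a time-frequency attribution map with the same shape
# as S, produced by LIME/SHAP/LRP etc. Random placeholder values here.
relevance = np.random.randn(*S.shape)

# 1) Strength of contributions over time: aggregate across frequency
contribution_per_frame = relevance.clip(min=0).sum(axis=0)
times = librosa.frames_to_time(np.arange(len(contribution_per_frame)),
                               sr=sr, hop_length=256)   # x-axis for plotting

# 2) Time-frequency view: `relevance` itself can be plotted like a spectrogram

# 3) Listen to only the contributing parts: keep bins with positive relevance
mask = (relevance > 0).astype(float)
y_explained = librosa.istft(S * mask, hop_length=256, length=len(y))
sf.write("explained.wav", y_explained, sr)
```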

Different approaches include:

- create a model which is interpretable/explainable by construction
- generate explainer models on top of a black-box model

## Papers

### audioLIME: Listenable Explanations Using Source Separation
[Paper link](https://arxiv.org/abs/2008.00582).
By Verena Haunschmid, Ethan Manilow, Gerhard Widmer.

Uses LIME together with source separation to decompose the audio into multiple sources.
Audio for each of these sources can then be constructed and listened to independently.
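
A minimal sketch of the underlying idea (not the actual audioLIME package API): use separated sources as the interpretable components, sample random on/off combinations of them, and fit a weighted linear surrogate whose coefficients become per-source contributions. The separator and classifier here are stand-ins.

```python
import numpy as np
from sklearn.linear_model import Ridge

def audiolime_style_explanation(sources, predict_fn, num_samples=200):
    """sources: list of 1-D arrays that (approximately) sum to the input mix.
    predict_fn: maps a waveform to the probability of the class of interest."""
    n = len(sources)
    Z = np.random.randint(0, 2, size=(num_samples, n))   # random on/off masks
    Z[0] = 1                                             # always include the full mix
    preds = np.array([predict_fn(np.sum([z_i * s for z_i, s in zip(z, sources)], axis=0))
                      for z in Z])
    # Weight samples by similarity to the full mix (fraction of sources kept)
    weights = Z.mean(axis=1)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, preds, sample_weight=weights)
    return surrogate.coef_                               # one contribution per source

# Placeholder usage: real sources would come from a separator (e.g. spleeter),
# and predict_fn from a trained tagger/classifier.
sources = [np.random.randn(16000) for _ in range(4)]
scores = audiolime_style_explanation(sources,
                                     predict_fn=lambda y: float(np.tanh(np.abs(y).mean())))
```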

### Towards explainable Music Emotion Recognition: The route via mid-level features
[Paper link](http://archives.ismir.net/ismir2019/paper/000027.pdf).
By Shreyan Chowdhury, Andreu Vall, Verena Haunschmid, Gerhard Widmer.
From Institute of Computational Perception, Johannes Kepler University Linz, Austria.
Published at ISMIR 2019.

Proposal for a way to make an *explainable* model for music emotion.
Model output: emotion estimates along 8 dimensions: Valence, Energy, Tension, Anger, Fear, Happy, Sad, Tender.
Defined 7 perceptual mid-level features (Melodiousness, Articulation, Rhythmic Stability, Rhythmic Complexity, Dissonance, Tonal Stability, Modality/Minorness),
and asked human evaluators to rate music clips in terms of these.

Used a deep neural network to estimate these perceptual features.
Then used a linear model on top of these features to obtain output emotions.
Compare to a standard neural network model (not inherently explainable).
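
A minimal sketch of this two-stage architecture in Keras; the layer sizes and spectrogram input shape are assumptions, not the paper's exact network:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_MIDLEVEL = 7    # melodiousness, articulation, ..., modality/minorness
N_EMOTIONS = 8    # valence, energy, tension, anger, fear, happy, sad, tender

# Stage 1: CNN predicting the 7 perceptual mid-level features from a spectrogram
spec_input = layers.Input(shape=(128, 256, 1))        # assumed mel-spectrogram shape
x = layers.Conv2D(32, 3, activation="relu")(spec_input)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
midlevel = layers.Dense(N_MIDLEVEL, name="midlevel")(x)

# Stage 2: a plain linear layer from mid-level features to emotions.
# Its weight matrix is directly readable as "which perceptual feature
# drives which emotion", which is what makes the model explainable.
emotions = layers.Dense(N_EMOTIONS, name="emotions")(midlevel)

model = models.Model(spec_input, [midlevel, emotions])
model.compile(optimizer="adam", loss={"midlevel": "mse", "emotions": "mse"})
```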

Results:
When the two stages were trained separately, performance dropped a bit.
When trained jointly, the performance of the explainable model matched that of the standard model.

Very simple idea. Easy to adapt to other tasks **if mid-level evaluations are available**.

Weakness: no baseline "explainer" model is included,
such as using SHAP on the baseline neural network.


### Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals
[Paper link](https://arxiv.org/abs/1807.03418).
By Sören Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert Müller, Wojciech Samek.
From Fraunhofer and TU Berlin.

Uses layer-wise relevance propagation (LRP), previously proposed by same authors.

Tested on speech classification tasks: spoken digit classification and gender classification.
Evaluated both a spectrogram-based model and a raw-waveform audio model. Very similar performance.

Shows relevance maps on the spectrograms.
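
As a rough illustration (not the authors' code), the LRP epsilon rule for a small fully-connected ReLU network fits in a few lines of numpy: starting from the output relevance, each layer redistributes relevance to its inputs in proportion to their contribution to each pre-activation.

```python
import numpy as np

def lrp_epsilon(weights, biases, activations, relevance_out, eps=1e-6):
    """weights[l]: (in, out) matrix of layer l; activations[l]: input to layer l.
    relevance_out: relevance at the network output (e.g. one-hot * output score).
    Returns relevance attributed to the input features."""
    R = relevance_out
    for W, b, a in zip(reversed(weights), reversed(biases), reversed(activations)):
        z = a @ W + b                                  # pre-activations of this layer
        z = z + eps * np.where(z >= 0, 1.0, -1.0)      # epsilon stabiliser
        s = R / z                                      # relevance per unit pre-activation
        R = a * (s @ W.T)                              # redistribute to the layer inputs
    return R

# Tiny example: 2-layer ReLU network on a flattened spectrogram
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1, b1 = rng.standard_normal((64, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 2)), np.zeros(2)
h = np.maximum(0, x @ W1 + b1)
out = h @ W2 + b2

relevance_out = np.zeros(2)
relevance_out[out.argmax()] = out.max()                # explain the winning class
R_input = lrp_epsilon([W1, W2], [b1, b2], [x, h], relevance_out)
# R_input can be reshaped to time-frequency and shown as a relevance map.
```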

#### [audioMNIST](https://github.com/soerenab/AudioMNIST)
30000 audio recordings (ca. 9.5 hours) of spoken digits (0-9) in English.
50 repetitions per digit for each of the 60 different speakers.
Controlled conditions: quiet offices, recorded with a RØDE NT-USB microphone, mono channel at 48 kHz, 16 bit.
Meta information including age (22-61 years), gender (12 female/48 male), origin and accent of all speakers.

### Understanding the Importance of Heart Sound Segmentation for Heart Anomaly Detection
[Paper link](https://arxiv.org/abs/2005.10480).
By Theekshana Dissanayake, Tharindu Fernando, Simon Denman, Sridha Sridharan, Houman Ghaemmaghami, Clinton Fookes.

Heart sound segmentation, as a preprocessing step before classification.
To detect abnormal heart sounds.

Used SHAP to see whether the model relies more on the segmentation part of the model or on local features,
and to indicate whether the S1 or the S2 heart sounds are focused on.
Used SHAP to compute feature contributions at different points in time.

Combined SHAP with Occlusion Maps to find the importance of a feature region.
Plotted on the time axis.
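
A minimal sketch of the occlusion part (not the paper's code): slide a mask along the time axis and record how much the abnormality probability drops when each region is hidden. The placeholder signal and classifier are illustrative only.

```python
import numpy as np

def occlusion_importance_over_time(x, predict_fn, mask_size=400, stride=200, fill_value=0.0):
    """x: 1-D heart sound signal.
    predict_fn: returns the probability of the 'abnormal' class for one input."""
    baseline = predict_fn(x)
    importance = np.zeros(len(x))
    counts = np.zeros(len(x))
    for start in range(0, len(x) - mask_size + 1, stride):
        occluded = x.copy()
        occluded[start:start + mask_size] = fill_value
        drop = baseline - predict_fn(occluded)         # big drop = important region
        importance[start:start + mask_size] += drop
        counts[start:start + mask_size] += 1
    return importance / np.maximum(counts, 1)

# Placeholder usage: a fake signal and a fake classifier for illustration
signal = np.random.randn(4000)
prob_abnormal = lambda s: float(1 / (1 + np.exp(-np.abs(s).mean())))
imp = occlusion_importance_over_time(signal, prob_abnormal)
# Plotting `imp` against time shows which regions (e.g. around the S1 or S2 sounds)
# the model relies on; per-frame SHAP values can be plotted on the same axis.
```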

Annoying: Probability scores not shown on normal/abnormal examples.

#### PhysioNet heart abnormality dataset
[Challenge](https://physionet.org/content/challenge-2016/1.0.0/)
[Dataset](https://archive.physionet.org/physiobank/database/challenge/2016/)
Heart sound recordings were sourced from several contributors around the world,
collected in either a clinical or nonclinical environment, from both healthy subjects and pathological patients.
The Challenge training set consists of five databases (A through E) containing a total of 3,126 heart sound recordings,
lasting from 5 seconds to just over 120 seconds. The entire training set is 169 MB.

The heart sound recordings were collected from different locations on the body.
The four typical locations are the aortic, pulmonic, tricuspid and mitral areas,
but recordings could come from one of nine different locations.

Please note that due to the uncontrolled environment of the recordings,
many recordings are corrupted by various noise sources, such as talking, stethoscope motion, breathing and intestinal sounds. Some recordings were difficult or even impossible to classify as normal or abnormal.


## Tools

### [tf-explain](https://github.com/sicara/tf-explain#available-methods)
Can visualize Activations, Gradient sensitivity on the input, Occlusion Sensitivity, Grad-CAM, etc. Layer-wise Relevance Propagation is on the roadmap.
Callback-based for Keras. Takes validation data as input.
Integrates with TensorBoard, so results can be viewed live there.

The resolution of Grad-CAM is the spatial resolution of the last convolutional feature map,
which can cause it to miss details.
For occlusion sensitivity, one must choose suitable values for the mask size and stride.
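
To make the resolution point concrete, a generic Grad-CAM sketch in Keras (not tf-explain's internals): the heat map is computed at the spatial size of the chosen convolutional feature map and must be upsampled to the spectrogram size, which is where detail is lost.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, x, class_index, conv_layer_name):
    """x: batch of spectrogram 'images' (N, H, W, C).
    Returns a heat map at the spatial resolution of the chosen conv layer."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)              # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average-pool the gradients
    cam = tf.einsum("nhwc,nc->nhw", conv_out, weights)  # weighted sum over channels
    cam = tf.nn.relu(cam)                               # keep positive evidence only
    return cam.numpy()   # H and W here are the *feature map* size,
                         # typically much coarser than the input spectrogram

# Hypothetical usage, assuming a trained Keras CNN on spectrograms:
# heatmap = grad_cam(model, spec_batch, class_index=3, conv_layer_name="conv2d_3")
# The heat map must be resized to the spectrogram shape for overlay, losing detail.
```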



