Python implementation of 'Blind estimation of reverberation time' [1].
[1] Ratnam, Rama; Jones, Douglas; Wheeler, Bruce; O'Brien, William; Lansing, Charissa; Feng, Albert (2003). Blind estimation of reverberation time. The Journal of the Acoustical Society of America, 114, 2877-2892. doi:10.1121/1.1616578.
For the evaluation, a speech utterance was taken from the NOIZEUS database [3], a noisy speech corpus.
We assume that the reverberant tail of a decaying sound
$y$ is the product of a fine structure $x$, which is a random process,
and an envelope $a$, which is deterministic.
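The model above can be illustrated with a short simulation. This is a minimal sketch, assuming the envelope takes the geometric form $a^n$ and the fine structure is zero-mean Gaussian noise with standard deviation $\sigma$. The function names `a_from_rt60` and `simulate_decay`, and the sampling rate used in the example, are hypothetical and not taken from this repository.

```python
import numpy as np


def a_from_rt60(rt60, fs):
    """Per-sample decay ratio a for which the envelope a**n drops by 60 dB
    (a factor of 1e-3 in amplitude) after rt60 seconds at sampling rate fs."""
    return 10.0 ** (-3.0 / (rt60 * fs))


def simulate_decay(a, sigma, num_samples, seed=0):
    """Synthetic free decay y(n) = a**n * x(n), with x(n) ~ N(0, sigma**2).

    Assumption of this sketch: x is white Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    n = np.arange(num_samples)
    x = rng.normal(0.0, sigma, num_samples)  # random fine structure
    return (a ** n) * x                      # deterministic envelope times noise


# Example: a 0.5 s reverberation time at an assumed 16 kHz sampling rate.
a = a_from_rt60(0.5, 16000)
y = simulate_decay(a, sigma=1.0, num_samples=8000)
```

Note how close `a` is to 1 for realistic reverberation times, which is why the search for $a$ has to resolve very small differences.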
Given the likelihood function $L\left( {y;a,\sigma } \right)$, the parameters $a$ and $\sigma$ are estimated as follows.
- The geometric ratio is strongly compressive: in practical scenarios the values of $a$ are expected to be close to 1, whereas $\sigma$ spans a broad range.
- Because of the shape of the gradient $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial a}}$, a search that starts from an initial value smaller than $a$ requires the root-solving strategy to descend the gradient fast enough.
- The estimates are obtained with a numerical, iterative approach by solving $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial a}} = 0$; $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial \sigma }} = 0$.
- Estimating $a^*$:
  - The search interval was bisected until the zero was bracketed.
  - The Newton–Raphson method was then applied to refine the root, ${a_{n + 1}} = {a_n} - \frac{{\frac{{\partial \ln L\left( {y;{a_n},\sigma } \right)}}{{\partial a}}}}{{\frac{{{\partial ^2}\ln L\left( {y;{a_n},\sigma } \right)}}{{\partial {a^2}}}}}$.
- Estimating $\sigma$:
  $${\sigma ^2} = \frac{1}{N}\sum\limits_{n = 0}^{N - 1} {{a^{ - 2n}}y{{\left( n \right)}^2}}$$
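The two-stage root search (bisection, then Newton–Raphson) can be sketched as follows. This is an illustrative variant, not necessarily what this repository does: it assumes the Gaussian model $y(n) = a^n x(n)$, and it eliminates $\sigma$ by substituting the closed-form $\sigma^2$ into the condition for $a$. Under that assumption the condition reduces to finding the $a$ at which the weighted mean of $n$, with weights $a^{-2n} y(n)^2$, equals $(N-1)/2$. The function name `estimate_decay`, the initial bracket, and the iteration counts are hypothetical.

```python
import numpy as np


def _weighted_mean_var(a, log_y2, n):
    """Mean and variance of n under weights a**(-2n) * y(n)**2,
    computed in the log domain to avoid overflow."""
    log_w = log_y2 - 2.0 * n * np.log(a)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    mean = np.dot(w, n)
    var = np.dot(w, (n - mean) ** 2)
    return mean, var


def estimate_decay(y, lo=0.5, hi=1.5, bisect_iters=20, newton_iters=10):
    """Estimate (a, sigma) for y(n) = a**n * x(n), x Gaussian (assumed)."""
    y = np.asarray(y, dtype=float)
    num = y.size
    n = np.arange(num, dtype=float)
    log_y2 = np.log(np.maximum(y ** 2, np.finfo(float).tiny))
    target = 0.5 * (num - 1)

    # Stage 1: bisection. f(a) = mean(a) - target decreases with a,
    # so f > 0 means the root lies to the right of the midpoint.
    for _ in range(bisect_iters):
        mid = 0.5 * (lo + hi)
        mean, _ = _weighted_mean_var(mid, log_y2, n)
        if mean > target:
            lo = mid
        else:
            hi = mid

    # Stage 2: Newton-Raphson on f(a), using f'(a) = -2 * var / a.
    a = 0.5 * (lo + hi)
    for _ in range(newton_iters):
        mean, var = _weighted_mean_var(a, log_y2, n)
        if var <= 0.0:
            break
        a = a + a * (mean - target) / (2.0 * var)
        a = min(max(a, lo), hi)  # stay inside the bracket from stage 1

    # Closed-form sigma given a: sigma^2 = (1/N) * sum a^(-2n) y(n)^2.
    sigma = np.sqrt(np.mean(np.exp(log_y2 - 2.0 * n * np.log(a))))
    return a, sigma
```

Working with log-weights is a design choice of this sketch: with $a$ close to 1 and long frames, $a^{-2n}$ evaluated directly at trial values far from the root can overflow.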
The model will fail when (1) the estimation frames do not fall within a region of free decay, and (2) the sound has a gradual rather than rapid offset.
- In the first case, the damping of sound in a room cannot occur at a rate faster than the free decay. A robust strategy would be to select a threshold value of $a^*$ such that the left tail of the probability density function of $a^*$, $p\left( {{a^*}} \right)$, has area $\gamma$: $a = \arg \left\{ {P\left( x \right) = \gamma ;\,\,\,P\left( x \right) = \int_0^x {p\left( {{a^*}} \right)} d{a^*}} \right\}$.
- In the second case, ${p\left( {{a^*}} \right)}$ is likely to be multimodal. The strategy then is to select the first dominant peak in ${p\left( {{a^*}} \right)}$, $a = \min \arg \left\{ {dp\left( {{a^*}} \right)/d{a^*} = 0} \right\}$.
- For a unimodal symmetric distribution with $\gamma = 0.5$, the filter will track the peak value, i.e., the median. In connected speech, where peaks cannot be clearly discriminated or the distribution is multimodal, $\gamma$ should be picked based on the statistics of gap durations.
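Both selection rules can be sketched on a set of per-frame estimates of $a^*$. This is a minimal illustration, assuming the density $p(a^*)$ is approximated by a histogram; the function names, the bin count, and the `min_frac` threshold that defines a "dominant" peak are hypothetical choices, not values from the source.

```python
import numpy as np


def select_by_percentile(a_estimates, gamma):
    """Value below which a fraction gamma of the estimates fall,
    i.e. the point where the empirical CDF reaches gamma."""
    return float(np.quantile(a_estimates, gamma))


def first_dominant_peak(a_estimates, bins=50, min_frac=0.5):
    """Center of the first (smallest a*) histogram peak whose height is at
    least min_frac times the tallest bin. min_frac is an assumed heuristic."""
    counts, edges = np.histogram(a_estimates, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0], counts, [0]))  # zero bins at both ends
    threshold = min_frac * counts.max()
    for i in range(counts.size):
        left, mid, right = padded[i], padded[i + 1], padded[i + 2]
        # Local maximum (right end of any plateau) that is tall enough.
        if mid > 0 and mid >= threshold and mid >= left and mid > right:
            return float(centers[i])
    return float(centers[np.argmax(counts)])  # fallback: global maximum


# Example with two clusters of estimates; the lower one is the first peak.
a_est = np.concatenate([
    np.full(5, 0.980), np.full(40, 0.990),
    np.full(10, 0.992), np.full(60, 0.998),
])
peak = first_dominant_peak(a_est, bins=10)
median = select_by_percentile(a_est, 0.5)
```

In this example the median falls in the upper cluster while the first dominant peak falls in the lower one, which shows why the two rules can disagree on multimodal data.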