The Blind RT60 Estimation module is a Python implementation based on the paper "Blind estimation of reverberation time" by Ratnam et al. [1]. It aims to estimate the reverberation time (RT60) of an input audio signal.
[1] Ratnam, R., Jones, D., Wheeler, B., O'Brien, W., Lansing, C., & Feng, A. (2003). Blind estimation of reverberation time. The Journal of the Acoustical Society of America, 114, 2877-2892. doi:10.1121/1.1616578.
For the evaluation, a speech utterance was taken from the NOIZEUS database [3], a noisy speech corpus.
pip install blind_rt60
import matplotlib.pyplot as plt
from scipy.io import wavfile
from blind_rt60 import BlindRT60

# Load the audio signal (x) and its sampling frequency (fs)
fs, x = wavfile.read("path/to/audio/file.wav")

# Create an instance of the BlindRT60 estimator
estimator = BlindRT60()

# Estimate the RT60
rt60_estimate = estimator(x, fs)

# Visualize the results
fig = estimator.visualize(x, fs)
plt.show()
The BlindRT60 class accepts various parameters that allow customization of the estimation process. Here are the key parameters:
- fs: Sample rate of the audio signal.
- framelen: Length of each analysis frame in seconds.
- hop: Hop size between analysis frames in seconds.
- percentile: Pre-specified percentile value for RT60 estimation.
- a_init: Initial value for the decay rate parameter.
- sigma2_init: Initial value for the signal variance parameter.
- max_itr: Maximum number of iterations for convergence.
- max_err: Maximum error for convergence.
- a_range: Range of valid values for the decay rate parameter.
- bisected_itr: Number of iterations for the bisection method.
- sigma2_range: Range of valid values for the signal variance parameter.
- verbose: Enable verbose output for each iteration.
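To make `framelen` and `hop` concrete, the sketch below shows how frame and hop lengths given in seconds could map to sample indices. The helper `frame_bounds` is hypothetical and not part of `blind_rt60`, the values used are illustrative rather than the package defaults, and the package's internal framing may differ.

```python
def frame_bounds(num_samples, fs, framelen, hop):
    """Return (start, stop) sample indices of each full analysis frame.

    Hypothetical helper for illustration; framelen and hop are in seconds.
    """
    frame_size = int(round(framelen * fs))
    hop_size = int(round(hop * fs))
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("framelen and hop must map to at least one sample")
    # Only frames that fit entirely inside the signal are kept.
    return [(start, start + frame_size)
            for start in range(0, num_samples - frame_size + 1, hop_size)]


# One second of audio at 8 kHz, 0.5 s frames, 0.25 s hop (illustrative values)
bounds = frame_bounds(num_samples=8000, fs=8000, framelen=0.5, hop=0.25)
```

With these values each frame spans 4000 samples and consecutive frames overlap by half.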
Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on the GitHub repository.
This project is licensed under the MIT License. See the LICENSE file for more information.
For any inquiries or questions, please contact zoreasaf@gmail.com.
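We assume that the reverberant tail of a decaying sound $y$ is the product of a fine structure $x$ that is a random process, and an envelope $a$ that is deterministic.
The model for room decay then suggests that the observations $y$ are specified by

$$y(n) = a^n x(n), \quad n = 0, \ldots, N - 1$$

where $x(n)$ are i.i.d. zero-mean Gaussian samples with variance $\sigma^2$.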
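For each estimation interval of $N$ samples, the likelihood function of $y$ is

$$L\left( {y;a,\sigma } \right) = {\left( {\frac{1}{{2\pi {\sigma ^2}}}} \right)^{N/2}}{a^{ - \frac{{N\left( {N - 1} \right)}}{2}}}\exp \left( { - \frac{1}{{2{\sigma ^2}}}\sum\limits_{n = 0}^{N - 1} {{a^{ - 2n}}y{{\left( n \right)}^2}} } \right)$$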
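The decay model can be illustrated with a minimal sketch, assuming $y(n) = a^n x(n)$ with $x(n)$ zero-mean Gaussian of variance $\sigma^2$. The function names `simulate_decay` and `log_likelihood` are hypothetical and are not taken from the `blind_rt60` package.

```python
import math
import random


def simulate_decay(a, sigma, num_samples, seed=0):
    """Draw y(n) = a**n * x(n) with x(n) ~ N(0, sigma**2)."""
    rng = random.Random(seed)
    return [a ** n * rng.gauss(0.0, sigma) for n in range(num_samples)]


def log_likelihood(y, a, sigma):
    """Log-likelihood of y under the exponentially damped Gaussian model."""
    N = len(y)
    weighted_energy = sum(a ** (-2 * n) * v * v for n, v in enumerate(y))
    return (-0.5 * N * math.log(2.0 * math.pi * sigma ** 2)
            - 0.5 * N * (N - 1) * math.log(a)
            - weighted_energy / (2.0 * sigma ** 2))


# Simulate a decaying tail and compare the true decay rate with a wrong one
y = simulate_decay(a=0.99, sigma=1.0, num_samples=400)
ll_true = log_likelihood(y, a=0.99, sigma=1.0)
ll_wrong = log_likelihood(y, a=0.90, sigma=1.0)
```

The log-likelihood evaluated at the generating decay rate is far larger than at the mismatched one, which is what the maximum-likelihood estimator exploits.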
Given the likelihood function, the parameters $a$ and $\sigma$ can be estimated using a maximum-likelihood approach.
- The geometric ratio $a$ is highly compressive: in real scenarios its values are expected to be close to 1. Conversely, $\sigma$ exhibits a broad range.
- Examining the gradient $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial a}}$ shows that, when the search starts from an initial value smaller than $a$, the root-solving strategy must descend the gradient fast enough.
- The estimates are obtained with a numerical, iterative approach by solving $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial a}} = 0$ and $\frac{{\partial \ln L\left( {y;a,\sigma } \right)}}{{\partial \sigma }} = 0$.
- Estimating $a^*$:
  - The search interval is bisected until the zero is bracketed.
  - The Newton–Raphson method is then applied to refine the root: ${a_{n + 1}} = {a_n} - \frac{{\frac{{\partial \ln L\left( {y;{a_n},\sigma } \right)}}{{\partial a}}}}{{\frac{{{\partial ^2}\ln L\left( {y;{a_n},\sigma } \right)}}{{\partial {a^2}}}}}$.
- Estimating $\sigma$:

$${\sigma ^2} = \frac{1}{N}\sum\limits_{n = 0}^{N - 1} {{a^{ - 2n}}y{{\left( n \right)}^2}}$$
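The estimation steps above can be sketched as follows. This is a simplified illustration, not the package's implementation: it substitutes the closed-form $\sigma^2$ into the score equation so that only a one-dimensional root in $a$ remains, then brackets that root by bisection and refines it with Newton–Raphson. All function names, the default range, and the iteration counts are assumptions chosen for the example.

```python
import random


def weighted_moments(y, a):
    """Return S0, S1, S2 with Sk = sum of n**k * a**(-2n) * y(n)**2."""
    s0 = s1 = s2 = 0.0
    for n, v in enumerate(y):
        w = a ** (-2 * n) * v * v
        s0 += w
        s1 += n * w
        s2 += n * n * w
    return s0, s1, s2


def profile_score(y, a):
    """(a / N) * d(ln L)/da after substituting sigma**2 = S0 / N."""
    s0, s1, _ = weighted_moments(y, a)
    return s1 / s0 - 0.5 * (len(y) - 1)


def estimate_decay(y, a_range=(0.9, 0.9999), bisect_iters=8,
                   max_iters=50, tol=1e-10):
    """Estimate (a, sigma**2) for one frame; hypothetical helper."""
    lo, hi = a_range
    # The profile score decreases with a, so a valid bracket has
    # a positive value at lo and a negative value at hi.
    if profile_score(y, lo) <= 0.0 or profile_score(y, hi) >= 0.0:
        raise ValueError("root is not bracketed by a_range")
    # Bisection narrows the bracket around the zero.
    for _ in range(bisect_iters):
        mid = 0.5 * (lo + hi)
        if profile_score(y, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    # Newton-Raphson refinement, clamped to the bracket for safety.
    for _ in range(max_iters):
        s0, s1, s2 = weighted_moments(y, a)
        mean_n = s1 / s0
        score = mean_n - 0.5 * (len(y) - 1)
        slope = -(2.0 / a) * (s2 / s0 - mean_n * mean_n)
        a_new = min(max(a - score / slope, lo), hi)
        converged = abs(a_new - a) < tol
        a = a_new
        if converged:
            break
    sigma2 = weighted_moments(y, a)[0] / len(y)
    return a, sigma2


# Synthetic frame with a = 0.99 and sigma = 0.5 (illustrative values)
rng = random.Random(0)
y = [0.99 ** n * rng.gauss(0.0, 0.5) for n in range(400)]
a_hat, sigma2_hat = estimate_decay(y)
```

The frame is kept short here because `a ** (-2 * n)` grows quickly for small `a` and long frames; a production implementation would need to guard against overflow.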
The model fails when (1) the estimation frames do not fall within a region of free decay, and (2) the sound has a gradual rather than rapid offset.
- In the first case, the damping of sound in a room cannot occur at a rate faster than the free decay. A robust strategy is to select a threshold value such that the left tail of the probability density function of $a^*$, $p(a^*)$, accounts for a pre-specified percentile $\gamma$.
- In the second case, $p(a^*)$ is likely to be multimodal. The strategy then is to select the first dominant peak.
- For a unimodal symmetric distribution with $\gamma = 0.5$, the filter tracks the peak value, i.e., the median. In connected speech, where peaks cannot be clearly discriminated or the distribution is multimodal, $\gamma$ should be picked based on the statistics of gap durations.
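The percentile-based selection can be sketched as below, assuming a plain quantile over the per-frame estimates $a^*$. The helper names are hypothetical, and the package may use a different rule (for example a histogram-based one). The conversion to RT60 follows from the envelope $a^n$ dropping by 60 dB when $20\log_{10}(a^n) = -60$, i.e. after $n = -3/\log_{10}(a)$ samples.

```python
import math


def select_decay_rate(a_estimates, gamma=0.5):
    """Pick the gamma-quantile (left tail) of the per-frame estimates."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    ordered = sorted(a_estimates)
    index = min(int(gamma * len(ordered)), len(ordered) - 1)
    return ordered[index]


def decay_rate_to_rt60(a, fs):
    """Time in seconds for the envelope a**n to drop by 60 dB."""
    return -3.0 / (fs * math.log10(a))


# Illustrative per-frame estimates; the larger values mimic frames
# that do not fall within a region of free decay.
a_estimates = [0.9990, 0.9992, 0.9991, 0.9998, 0.9999]
a_selected = select_decay_rate(a_estimates, gamma=0.5)
rt60 = decay_rate_to_rt60(a_selected, fs=8000)
```

Lowering `gamma` moves the selection toward the smallest estimates, that is, toward the fastest decays observed.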