Note: In October 2024, we released a new API for the code. The initial versions are preserved and archived in a separate folder for reference and continued use if needed. You can find the archived initial code versions here.
A Python package for running WARP-Q, a full-reference metric for predicting speech quality in generative neural speech codecs.
WARP-Q is a Python library designed to evaluate the quality of modern generative neural speech codecs and traditional low bit-rate speech coders. It uses a subsequence dynamic time warping (SDTW) algorithm to measure the similarity between a reference (original) and a test (degraded) speech signal, producing raw and normalized quality scores.
- News
- Overview
- Installation
- Usage
  - Importing the Package
  - Evaluating Quality of Two Audio Files
  - Save Detailed Results to a CSV File
  - Evaluate Two Audio Arrays
  - Evaluate Batches of Files From CSV Inputs
- Visualizing WARP-Q Scores Using the `plot_warpq_scores` Function
  - Plotting Scores Grouped by Condition or Degradation Type
- WARP-Q Score Normalization
- Plot Examples: MOS vs. WARP-Q Scores
- Citing
- October 2024: We launched a new API for the WARP-Q model on PyPI, with several key improvements:
  - Efficient Computation: Compute WARP-Q scores for single or batch audio files via CSV inputs.
  - Flexible SDTW Parameters: Customize subsequence dynamic time warping (SDTW) settings.
  - Alignment Indices: Get time and frame indices for aligned audio segments.
  - Score Normalization: Normalize WARP-Q scores for consistent evaluation.
  - Detailed Results: Save results in DataFrames for analysis.
  - Utility Functions: Plot results, aggregate data, and load audio files.
  - Accelerated Processing: Support for parallel processing to speed up computations.
- November 2022: Added a new API with two running modes via command-line arguments and different pretrained mapping models. Examples of how to use the code are also included.
- July 2022: A new manuscript entitled "Speech quality assessment with WARP-Q: From similarity to subsequence dynamic time warp cost" has been accepted for publication in the IET Signal Processing Journal. In this paper, we present the detailed design of WARP-Q with a comprehensive evaluation and analysis of the model components, design decisions, and salience of parameters to the model's performance. The paper also presents a comprehensive set of benchmarks with standard and new datasets, and benchmarks against other standard and state-of-the-art full-reference speech quality metrics. Furthermore, we compared WARP-Q results to the results from two state-of-the-art reference-free speech quality measures. We also explored the possibility of mapping raw WARP-Q scores onto target MOS using different machine learning (ML) models.
- Jan 2021: Published the initial code of WARP-Q.
Speech coding has been shown to achieve good speech quality using either waveform matching or parametric reconstruction. For very low bit rate streams, recently developed generative speech models can reconstruct high-quality wideband speech from the bit streams of standard parametric encoders at less than 3 kb/s. Generative codecs produce high-quality speech by synthesising it with a DNN from the parametric input.
The problem is that the existing objective speech quality models (e.g., ViSQOL, POLQA) cannot be used to accurately evaluate the quality of coded speech from generative models as they penalise based on signal differences not apparent in subjective listening test results. Motivated by this observation, we propose the WARP-Q metric, which is robust to low perceptual signal changes introduced by low bit rate neural vocoders. Figure 1 illustrates a high‐level block diagram of the WARP‐Q metric.
The WARP-Q algorithm consists of four processing stages:
- Pre-processing: silent non-speech segments from the reference and degraded signals are detected and removed using a voice activity detection (VAD) algorithm.
- Feature extraction: the Mel-frequency cepstral coefficients (MFCCs) of the reference and degraded signals are first generated. The obtained MFCC representations are then normalised so that they have the same segmental statistics (zero mean and unit variance) using cepstral mean and variance normalisation (CMVN).
- Similarity comparison: WARP-Q employs the subsequence dynamic time warping (SDTW) algorithm to assess the similarity between reference and degraded signals in the MFCC domain [1], [2]. Figure 2 shows an example of this process, which involves:
  - Dividing the normalized MFCCs of the degraded signal into $L$ patches.
  - For each degraded patch $X$, computing the SDTW alignment cost between $X$ and the reference MFCC matrix $Y$.
  - Calculating the accumulated alignment cost matrix $D_{(X,Y)}$ and determining its optimal path $P^\ast$ between $X$ and $Y$.
Figure 2: Accumulated alignment cost matrix $D_{(X,Y)}$ and optimal path $P^\ast$ between the reference MFCC matrix $Y$ and a 2-second patch $X$ from the coded signal's MFCC matrix (highlighted in green). Optimal indices $a^\ast$ and $b^\ast$ are indicated.
- Subsequence score aggregation: the final quality score is represented by the median of all alignment costs (see the code sketch below).
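For illustration, here is a minimal sketch of stages 2–4, assuming `librosa` for MFCC extraction and subsequence DTW. The parameter values mirror the package defaults documented below, but the VAD stage is omitted and a simplified global CMVN replaces the package's sliding-window CMVN, so this is a sketch of the technique rather than the package internals:

```python
import librosa
import numpy as np

SR, N_MFCC, FMAX = 16000, 13, 5000
WIN = int(0.032 * SR)                       # 32 ms analysis window
HOP = WIN - int(0.004 * SR)                 # 4 ms overlap between frames
SIGMA = np.array([[1, 0], [0, 3], [1, 3]])  # SDTW step-size conditions

def mfcc_cmvn(x):
    # Stage 2: MFCCs, then a simplified global CMVN (the package uses a
    # sliding-window CMVN; global normalization keeps the sketch short).
    m = librosa.feature.mfcc(y=x, sr=SR, n_mfcc=N_MFCC, fmax=FMAX,
                             n_fft=WIN, hop_length=HOP)
    return (m - m.mean(axis=1, keepdims=True)) / (m.std(axis=1, keepdims=True) + 1e-8)

def warpq_sketch(ref, deg, patch_s=0.4, hop_s=0.2):
    # Stage 1 (VAD) is omitted for brevity; `ref` and `deg` are 1-D arrays at SR.
    Y, X_full = mfcc_cmvn(ref), mfcc_cmvn(deg)
    patch = int(patch_s * SR / HOP)          # patch length in MFCC frames
    hop = int(hop_s * SR / HOP)              # patch hop in MFCC frames
    costs = []
    for start in range(0, X_full.shape[1] - patch + 1, hop):
        X = X_full[:, start:start + patch]   # one degraded patch
        # Stage 3: subsequence DTW of the patch against the full reference
        D, _ = librosa.sequence.dtw(X=X, Y=Y, metric='euclidean',
                                    step_sizes_sigma=SIGMA, subseq=True)
        costs.append(D[-1, :].min() / D.shape[0])  # best alignment cost
    # Stage 4: aggregate patch costs into a single raw score
    return float(np.median(costs))
```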
An evaluation using waveform matching, parametric, and generative neural vocoder based codecs, as well as channel and environmental noise, shows that WARP-Q has better correlation and codec quality ranking for novel codecs than traditional metrics, and that it is versatile enough to capture other types of degradation, such as additive noise and transmission channel degradations. The results show that although WARP-Q is a simple model built on well-established speech signal processing features and algorithms, it solves the unmet need for a speech quality model that can be applied to generative neural codecs.
You can install WARP-Q directly from PyPI:
```bash
pip install warpq
```
Once installed, you can import the WARP-Q metric class:

```python
from warpq.core import warpqMetric
```
You can also import utility functions:
```python
from warpq.utils import load_audio, plot_warpq_scores, group_dataframe_by_columns
```
To create an instance of the `warpqMetric` class and initialize the model with default parameters, we run:

```python
# Create an instance of the warpqMetric class
model = warpqMetric()
```
Creating a `warpqMetric` object initializes an instance of the class with the following parameters:
- `sr` (`int`, default: `16000`): Sampling frequency of audio signals in Hertz (Hz).
- `frame_ms` (`int`, default: `32`): Length of audio frame in milliseconds for framing.
- `overlap_ms` (`int`, default: `4`): Length of overlap between consecutive frames in milliseconds.
- `n_mfcc` (`int`, default: `13`): Number of Mel-frequency cepstral coefficients to compute.
- `fmax` (`int`, default: `5000`): Cutoff frequency for MFCC computation.
- `patch_size` (`float`, default: `0.4`): Size of each patch in seconds for processing.
- `patch_hop` (`float`, default: `0.2`): Hop size between patches in seconds.
- `sigma` (`list`, default: `[[1, 0], [0, 3], [1, 3]]`): Step-size conditions for subsequence dynamic time warping (SDTW).
- `apply_vad` (`bool`, default: `True`): Flag to determine if voice activity detection (VAD) should be applied.
- `score_fn` (`str`, default: `'median'`): Function to compute the final score. Options are `'mean'` or `'median'`.
- `cmvnw_win_time` (`float`, default: `0.836`): Size of the sliding window for local normalization (in seconds).
- `max_score` (`float`, default: `3.5`): Maximum raw score for normalization.
- `n_jobs` (`int`, default: `-1`): Number of cores to use for parallel processing. If `None` or `-1`, all available cores will be used.
We can also call the `warpqMetric` class with customized parameters. For example:

```python
# Initialize the class with custom parameters
model = warpqMetric(sr=8000, frame_ms=25, overlap_ms=10)
```
This allows you to set custom values for parameters such as the sampling rate (`sr`), frame length (`frame_ms`), and overlap between frames (`overlap_ms`), among others.
The `evaluate()` function from the `warpqMetric` class computes the WARP-Q score between two input speech signals, which can be either audio file paths or audio arrays, and returns detailed alignment information describing how the degraded audio matches the reference. It takes the following inputs:
- `ref_audio` (`str` or `np.ndarray`): Path to the reference audio file or a NumPy array of the reference audio signal.
- `deg_audio` (`str` or `np.ndarray`): Path to the degraded audio file or a NumPy array of the degraded audio signal.
- `arr_sr` (`int`, optional): Sampling rate, required only if providing audio arrays.
- `save_csv_path` (`str`, optional): Path to save the detailed results in a CSV file. If `None`, results are not saved. If a valid path is provided, the results will be saved in CSV format, with columns including reference and degraded audio descriptions, WARP-Q scores, alignment costs, and timing information for each patch. If the file already exists, new results will be appended without the header.
- `verbose` (`bool`, optional): If `True`, outputs messages about the processing.
The `evaluate()` function returns a dictionary (`dict`) containing the WARP-Q results and detailed alignment information, including:
- `raw_warpq_score` (`float`): The computed WARP-Q score between the reference and degraded audio.
- `normalized_warpq_score` (`float`): The normalized WARP-Q score between `0` and `1`, where `1` indicates the best audio quality. Please see the normalization section below for more details.
- `total_patch_count` (`int`): The total number of patches generated from the degraded signal's MFCC, representing the number of segments in the degraded signal after applying the sliding window.
- `alignment_costs` (`list`): A list of DTW alignment costs for each degraded MFCC patch, representing how well each patch matches its aligned subsequence in the reference MFCC. Length is equal to `total_patch_count`.
- `aligned_ref_time_ranges` (`list`): List of `(start_time, end_time)` tuples containing the start and end time stamps (in seconds) of the best-matching subsequences in the reference MFCC, as aligned to each patch in the degraded signal using DTW. Length is equal to `total_patch_count`.
- `aligned_ref_frame_indices` (`list`): List of `(a_ast, b_ast)` tuples containing the start and end frame indices of the best-matching subsequences in the reference MFCC, corresponding to the aligned subsequences. Length is equal to `total_patch_count`.
- `deg_patch_time_ranges` (`list`): List of `(start_time, end_time)` tuples containing the start and end time stamps (in seconds) of each patch in the degraded signal's MFCC, generated using a sliding-window approach. Length is equal to `total_patch_count`.
- `deg_patch_frame_indices` (`list`): List of `(start_frame, end_frame)` tuples containing the start and end frame indices of each patch in the degraded signal's MFCC, corresponding to the patches created by the sliding-window process. Length is equal to `total_patch_count`.
You can compute the WARP-Q score between two audio files (reference and degraded) as follows:
```python
# Create an instance of the warpqMetric class
model = warpqMetric()

# Evaluate the audio quality between two files
results = model.evaluate('audio/ref_audio.wav', 'audio/deg_audio.wav', verbose=True)

# Access the raw and normalized WARP-Q scores
raw_warpq_score = results["raw_warpq_score"]
normalized_warpq_score = results["normalized_warpq_score"]

# Print the results
print(f"Raw WARP-Q Score: {raw_warpq_score}")
print(f"Normalized WARP-Q Score: {normalized_warpq_score}")
```
It is possible to save the results obtained from the `model.evaluate` function to a CSV file for further analysis. This can be done by setting the parameter `save_csv_path` to the desired file path:

```python
results = model.evaluate('audio/ref_audio.wav', 'audio/deg_audio.wav', save_csv_path="csv/results.csv", verbose=True)
```
Below is an example of a few rows from the saved CSV file:
You can load audio files using the `load_audio` function and pass the audio data to the `model.evaluate` function, as shown in the following example:

```python
# Load the reference and degraded audio files
ref_arr, deg_arr, ref_sr, deg_sr = load_audio(ref_path="audio/ref_audio.wav", deg_path="audio/deg_audio.wav", sr=16000, native_sr=False, verbose=True)

# Run the model using the loaded audio arrays
results = model.evaluate(ref_arr, deg_arr, arr_sr=ref_sr)
```
Note that the value passed to the `arr_sr` parameter should match the class-defined sampling rate `self.sr`.
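If you load audio at its native rate instead, you can resample before evaluation. Here is a minimal sketch using `librosa` (it assumes the class rate is exposed as `model.sr`, per the note above):

```python
import librosa

# Resample the arrays if their rate differs from the model's sampling rate
# (model.sr is assumed to hold the class-defined rate mentioned above).
if ref_sr != model.sr:
    ref_arr = librosa.resample(ref_arr, orig_sr=ref_sr, target_sr=model.sr)
if deg_sr != model.sr:
    deg_arr = librosa.resample(deg_arr, orig_sr=deg_sr, target_sr=model.sr)

results = model.evaluate(ref_arr, deg_arr, arr_sr=model.sr)
```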
The `evaluate_from_csv` function from the `warpqMetric` class allows you to compute WARP-Q scores for multiple audio files listed in a CSV file. This is useful when you need to evaluate audio quality for a large number of file pairs (reference and degraded).
The `evaluate_from_csv` function takes the following inputs:
- `input_csv` (`str`): Path to a CSV file with specified reference and degraded wave columns.
- `ref_wave_col` (`str`): Name of the reference wave column. Default is `'ref_wave'`.
- `deg_wave_col` (`str`): Name of the degraded wave column. Default is `'deg_wave'`.
- `raw_score_col` (`str`): Column name where raw scores will be saved. Default is `'Raw WARP-Q Score'`.
- `output_csv` (`str`): Path to save results. If `None`, results are not saved.
- `save_details` (`bool`): If `True`, save detailed results (alignment costs, times) in the same DataFrame.
It returns the following:
- `pd.DataFrame`: DataFrame with computed WARP-Q scores and detailed results if requested.
- Additional detailed results (saved when `save_details=True`) include:
  - `total_patch_count` (`int`): The total number of patches in the degraded signal.
  - `alignment_costs` (`list`): The alignment costs for each patch between the degraded and reference signals.
  - `deg_patch_time_ranges` (`list`): List of tuples for (start, end) times in seconds of each patch in the degraded signal.
  - `aligned_ref_time_ranges` (`list`): List of tuples for (start, end) times in seconds of the aligned segments in the reference signal.
To use this function, first prepare a CSV file with columns specifying the reference and degraded audio files. The CSV file must contain at least the following two columns:
- `ref_wave`: Column containing paths to the reference audio files.
- `deg_wave`: Column containing paths to the degraded audio files.
You may optionally include additional columns, such as Mean Opinion Score (MOS), degradation type, condition (experiment), and database for each file, if such information is accessible. These columns can facilitate further analysis, such as plotting WARP-Q scores against MOS or evaluating performance based on degradation types or experimental conditions.
An example of the CSV file:

| database | ref_wave | deg_wave | condition | degradation_type | MOS |
|----------|----------|----------|-----------|------------------|-----|
| set1 | ref_audio_1.wav | deg_audio_1.wav | condA | noise | 4.5 |
| set2 | ref_audio_2.wav | deg_audio_2.wav | condB | reverb | 3.8 |
| set3 | ref_audio_3.wav | deg_audio_3.wav | condC | echo | 4.1 |
| set4 | ref_audio_4.wav | deg_audio_4.wav | condD | clipping | 4.3 |
You can optionally add more columns, but the function will primarily rely on the `ref_wave` and `deg_wave` columns for evaluating the audio quality.
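For illustration, here is a hypothetical snippet that builds such a CSV with pandas (the file and column names follow the example table above):

```python
import pandas as pd

# Build an input CSV matching the expected layout; only ref_wave and
# deg_wave are required, the other columns are optional metadata.
pd.DataFrame({
    "database": ["set1", "set2"],
    "ref_wave": ["ref_audio_1.wav", "ref_audio_2.wav"],
    "deg_wave": ["deg_audio_1.wav", "deg_audio_2.wav"],
    "condition": ["condA", "condB"],
    "degradation_type": ["noise", "reverb"],
    "MOS": [4.5, 3.8],
}).to_csv("audio_files.csv", index=False)
```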
```python
results_df = model.evaluate_from_csv(
    input_csv="audio_files.csv",
    ref_wave_col="ref_wave",
    deg_wave_col="deg_wave",
    raw_score_col="WARP-Q score",
    output_csv="results_df.csv",
    save_details=True
)
```
Note that audio files with short durations are skipped in the computation, and their results are replaced with `np.nan`. Below is an example of how the results from the `evaluate_from_csv` function might look when saved to a CSV file.
The `alignment_costs`, `deg_patch_time_ranges`, and `aligned_ref_time_ranges` lists are saved as strings in the CSV file. To convert these strings back to Python lists, you can use `ast.literal_eval`:
```python
import ast

import numpy as np
import pandas as pd

# Load the CSV file
df_loaded = pd.read_csv('results_df.csv')

# Convert the strings back to lists, handling NaN values (e.g., skipped files)
for col in ['alignment_costs', 'deg_patch_time_ranges', 'aligned_ref_time_ranges']:
    df_loaded[col] = df_loaded[col].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else np.nan)

print(df_loaded)
```
The `plot_warpq_scores` function generates a scatter plot of MOS versus WARP-Q scores and calculates the Pearson and Spearman correlation coefficients. The function supports color encoding and marker styling based on categories such as `condition` or `degradation_type` to enhance the plot's clarity. It takes the following inputs:
- `df`: A `pandas` DataFrame or path to a CSV file containing MOS and WARP-Q scores.
- `mos_col`: The column name for the MOS.
- `warpq_col`: The column name for WARP-Q scores. Default is `"Raw WARP-Q Score"`.
- `hue_col`: Optional. A column name used to color the points by category (e.g., `condition` or `degradation_type`).
- `style_col`: Optional. A column name to differentiate marker styles in the scatter plot.
- `save_path`: Optional. The path to save the plot as a `.png` file. The `.png` extension will be added if not provided.
By default, the function plots MOS and WARP-Q scores for each audio file in the dataset, allowing you to directly assess the relationship between subjective and objective quality metrics.
```python
warp_plot = model.plot_warpq_scores(
    df="results_df.csv",  # Path to CSV containing MOS and WARP-Q scores for each audio file (e.g., obtained from the model.evaluate_from_csv function)
    mos_col="MOS",  # Column containing MOS
    warpq_col="WARP-Q score",  # Column containing WARP-Q scores
    # warpq_col="Normalized WARP-Q score",  # or plot the normalized quality scores
    title="MOS vs WARP-Q for Individual Files",
    save_path="mos_vs_warpq_individual.png"
)
```
This example generates a scatter plot comparing MOS and WARP-Q scores for each individual audio file. The plot is saved as `mos_vs_warpq_individual.png`.
In addition to plotting scores for individual files, it is often insightful to group the data by specific conditions such as degradation type, experiment condition, database, or codec. Grouping scores can help you better understand how different types of degradation impact audio quality and assess the overall performance of a codec or processing technique across multiple conditions.
To achieve this, you can first group the data using the `group_dataframe_by_columns` function and then plot the aggregated results. The `group_dataframe_by_columns` function takes the following inputs:
- `data`: A `pandas` DataFrame to group. If not provided, data can be loaded from a CSV via `csv_path`.
- `csv_path`: Path to a CSV file to load data from if no DataFrame is provided.
- `group_cols`: A list of columns to group by (e.g., `["Degradation Type", "Condition"]`).
- `agg_cols`: A list of columns to apply the aggregation function to (e.g., `["MOS", "Raw WARP-Q Score"]`).
- `agg_func`: The aggregation function to apply. Default is `"mean"`, but you can also apply other functions such as `"sum"`, `"min"`, `"max"`, etc.
- `output_csv`: Optional. Path to save the grouped data as a CSV file.
```python
# Group data by degradation type and calculate the mean MOS and WARP-Q scores for each group
grouped_df = model.group_dataframe_by_columns(
    csv_path="results_df.csv",  # Path to the CSV file
    group_cols=["degradation_type"],  # Grouping by degradation type
    agg_cols=["MOS", "WARP-Q score", "Normalized WARP-Q score"],  # Columns to aggregate
    agg_func="mean",  # Aggregating by mean values
    output_csv="grouped_by_degradation.csv"  # Save grouped data to a new CSV
)

# Plot the grouped data
warp_plot = model.plot_warpq_scores(
    df="grouped_by_degradation.csv",  # Use the grouped data
    mos_col="MOS",  # Column containing aggregated MOS
    warpq_col="WARP-Q score",  # Column containing aggregated WARP-Q scores
    # warpq_col="Normalized WARP-Q score",  # or plot the normalized quality scores
    hue_col="degradation_type",  # Color points by degradation type
    title="MOS vs WARP-Q by Degradation Type",
    # title="MOS vs normalized WARP-Q by Degradation Type",
    save_path="mos_vs_warpq_degradation.png"
)
```
The WARP-Q metric provides raw scores that exhibit a negative correlation with quality. This means that lower values (closer to zero) indicate higher quality, while higher scores reflect lower quality. This behavior arises because WARP-Q is based on the alignment cost between the reference and degraded speech signals.
The alignment cost is calculated using Subsequence Dynamic Time Warping (SDTW), which measures how well the degraded speech aligns with the reference speech over time. When the alignment cost is low, it indicates that the codec or degradation type has preserved the speech signal well, resulting in higher quality. Conversely, a high alignment cost suggests more distortion, meaning the signal quality has deteriorated.
To present WARP-Q scores with positive correlation (where higher scores indicate better quality), we normalize the raw WARP-Q scores to a 0 to 1 scale. The normalization is performed using the following equation:

$$\text{Normalized WARP-Q Score} = 1 - \frac{\text{Raw WARP-Q Score}}{\text{Max WARP-Q Score}}$$

where:

- `Raw WARP-Q Score` is the score produced by the WARP-Q metric (based on subsequence DTW alignment cost).
- `Max WARP-Q Score` is a predefined maximum score used for normalization.
The `Max WARP-Q Score` is set to `3.5` based on evaluations across the four different databases used in our papers. This value ensures that the normalized scores range from `0` to `1` with positive correlation to quality.
If the `Max WARP-Q Score` is set too low, the normalization term `Raw WARP-Q Score / Max WARP-Q Score` can exceed `1`. In such cases, the normalized score is clipped to `0`.
Therefore, if you notice that many scores are zero, this likely indicates that the `Max WARP-Q Score` is too low for your dataset or codec, and you may need to adjust it.
You can adjust the `Max WARP-Q Score` to better fit your specific use case or database. To do this, pass the desired maximum value when creating an instance of the WARP-Q class:

```python
model = warpqMetric(max_score=your_desired_max_score)
```
This allows you to fine-tune the normalization process according to the characteristics of your dataset and codec performance, ensuring better alignment with subjective quality assessments.
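As a minimal sketch of the normalization described above (an illustration of the equation, not the package internals):

```python
def normalize_warpq(raw_score: float, max_score: float = 3.5) -> float:
    """Map a raw WARP-Q score to [0, 1], where 1 means best quality.

    Scores are clipped at 0 when raw_score exceeds max_score, matching
    the clipping behavior described above.
    """
    return max(0.0, 1.0 - raw_score / max_score)
```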
It is possible to normalize the raw WARP-Q score to align with the Mean Opinion Score (MOS), which typically ranges from `1` to `5`. Assuming a linear rescaling of the 0-to-1 normalized score onto this range, the normalization would be:

$$\text{MOS-scaled Score} = 1 + 4\left(1 - \frac{\text{Raw WARP-Q Score}}{\text{Max WARP-Q Score}}\right)$$
where:

- A score of `1` indicates low quality,
- A score of `5` indicates high quality.
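A hypothetical one-liner for this rescaling (not part of the current package, which keeps scores in the 0 to 1 range):

```python
def to_mos_scale(normalized_score: float) -> float:
    """Linearly map a normalized WARP-Q score from [0, 1] to the 1-5 MOS range."""
    return 1.0 + 4.0 * normalized_score
```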
In the current implementation, we scale the scores to the `0` to `1` range to keep things simple. A more robust mapping model based on machine learning algorithms is under development; it will be released soon and will provide better correlations with subjective quality scores. Such a model can handle various distortion and coding scenarios more effectively.
The following plots demonstrate the relationship between MOS and WARP-Q scores, both before and after normalization. In the normalized plots, the WARP-Q scores are scaled from 0 to 1. These examples cover two cases:
- Per Audio File: This shows the alignment between MOS and WARP-Q scores for individual audio files.
- Per Codec: This demonstrates the comparison between MOS and WARP-Q scores aggregated by codec.
The plots are based on the `Genspeech` and `TCD-VoIP` databases described in the papers above and illustrate how normalization impacts score distribution and the alignment between MOS and WARP-Q.
Please cite our papers if you find this repository useful:
```bibtex
@article{Wissam_IET_Signal_Process2022,
  author  = {Jassim, Wissam A. and Skoglund, Jan and Chinen, Michael and Hines, Andrew},
  title   = {Speech quality assessment with WARP-Q: From similarity to subsequence dynamic time warp cost},
  journal = {IET Signal Processing},
  volume  = {n/a},
  number  = {n/a},
  pages   = {},
  doi     = {https://doi.org/10.1049/sil2.12151},
  url     = {https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/sil2.12151},
  eprint  = {https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/sil2.12151}
}
```
```bibtex
@INPROCEEDINGS{Wissam_ICASSP2021,
  author    = {Jassim, Wissam A. and Skoglund, Jan and Chinen, Michael and Hines, Andrew},
  booktitle = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Warp-Q: Quality Prediction for Generative Neural Speech Codecs},
  year      = {2021},
  pages     = {401-405},
  doi       = {10.1109/ICASSP39728.2021.9414901}
}
```
Dr Wissam A Jassim
wissam.a.jassim@gmail.com
October 6, 2024