[EMNLP 2024] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

yzGuu830/efficient-speech-codec

Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

This is the code repository for the ESC codec presented in the paper "ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers" [arXiv].

  • Our neural speech codec, at only 30MB, can compress 16kHz speech to 1.5, 3, 4.5, 6, 7.5 and 9kbps more efficiently while maintaining reconstruction quality comparable to Descript's audio codec (DAC).
  • We provide Model Checkpoints for different ESC variants and DAC models, along with a Demo Page with multilingual speech samples.

An illustration of ESC Architecture

Usage

Model Checkpoints

Codec            Checkpoint   #Param.
ESC-Base         Download     8.39M
ESC-Base(adv)    Download     8.39M
ESC-Large        Download     15.58M
DAC-Tiny(adv)    Download     8.17M
DAC-Tiny         Download     8.17M
DAC-Base(adv)    Download     74.31M

Install Dev Dependencies

pip install -r requirements.txt

To compress and decompress audio

python -m scripts.compress --input /path/to/input.wav --save_path /path/to/output --model_path /path/to/model --num_streams 6 --device cpu

This creates a .pth file (the codes) and a .wav file (the reconstructed audio) under save_path. The codec supports num_streams from 1 to 6; each stream adds 1.5kbps, so the supported bitrates range from 1.5kbps to 9.0kbps.
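The mapping from num_streams to bitrate is simple arithmetic, sketched below. The helper names are hypothetical and not part of the esc package, and the compression ratio assumes the uncompressed input is 16-bit PCM at 16kHz, which the README does not state. The size is the nominal payload only; the saved .pth file may carry extra serialization overhead.

```python
KBPS_PER_STREAM = 1.5  # from the README: bitrate = num_streams * 1.5 kbps


def bitrate_kbps(num_streams: int) -> float:
    """Nominal bitrate in kbps for a given number of streams (1 to 6)."""
    if not 1 <= num_streams <= 6:
        raise ValueError("num_streams must be between 1 and 6")
    return num_streams * KBPS_PER_STREAM


def compressed_size_bytes(duration_s: float, num_streams: int) -> float:
    """Nominal payload size in bytes for `duration_s` seconds of speech."""
    bits = bitrate_kbps(num_streams) * 1000 * duration_s
    return bits / 8


def compression_ratio(num_streams: int, sample_rate: int = 16000,
                      bits_per_sample: int = 16) -> float:
    """Raw PCM bitrate divided by codec bitrate (16-bit PCM is an assumption)."""
    raw_kbps = sample_rate * bits_per_sample / 1000
    return raw_kbps / bitrate_kbps(num_streams)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, bitrate_kbps(n), round(compression_ratio(n), 1))
```

Under these assumptions, ten seconds of speech at num_streams=6 corresponds to a payload of roughly 11KB.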

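import torch
import torchaudio
from esc import ESC

# `config` is the dict of ESC constructor arguments; define it before this point.
model = ESC(**config)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()
x, _ = torchaudio.load("input.wav")  # the codec targets 16kHz speech
with torch.no_grad():
    # Encoding at num_streams * 1.5 kbps.
    codes, f_shape = model.encode(x, num_streams=6)
    # Decoding.
    recon_x = model.decode(codes, f_shape)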

The snippet above shows programmatic use of esc to compress audio tensors loaded with torchaudio. For more details, see the example.ipynb notebook.

Training

We provide our development training and evaluation datasets on Hugging Face.

accelerate launch main.py --exp_name esc9kbps --config_path ./configs/9kbps_esc_base.yaml --wandb_project efficient-speech-codec --lr 1.0e-4 --num_epochs 80 --num_pretraining_epochs 15 --num_devices 4 --dropout_rate 0.75 --save_path /path/to/output --seed 53

We use the accelerate library to handle distributed training and the wandb library for logging. With 4 NVIDIA RTX 4090 GPUs, training an ESC codec takes ~12h for 250k training steps on 180k 3-second speech clips with a batch size of 36. For detailed configurations, refer to the ./configs/ folder.
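The training figures quoted above can be turned into a rough budget. This is a back-of-the-envelope sketch: the function name is hypothetical, and the number of passes over the data assumes that the batch size of 36 counts clips per optimizer step across all GPUs, which the README does not specify.

```python
def training_budget(steps=250_000, hours=12.0, clips=180_000,
                    clip_seconds=3.0, batch_size=36):
    """Derive rough training statistics from the figures quoted in the README."""
    dataset_hours = clips * clip_seconds / 3600   # total audio in the training set
    steps_per_second = steps / (hours * 3600)     # average optimizer throughput
    clips_seen = steps * batch_size               # assumes batch_size is per step, all GPUs
    passes_over_data = clips_seen / clips
    return {
        "dataset_hours": dataset_hours,
        "steps_per_second": steps_per_second,
        "passes_over_data": passes_over_data,
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value:.2f}")
```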

Evaluation

python -m scripts.test --eval_folder_path path/to/data --batch_size 12 --model_path /path/to/model --device cuda

This runs codec evaluation at all bandwidths on a test set folder. Four metrics are reported: PESQ, Mel Distance, SI-SDR and Bitrate Utilization Rate. The evaluation statistics are saved into model_path by default.
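Of these metrics, SI-SDR has a compact closed form. The sketch below follows the commonly used definition (project the estimate onto the zero-mean reference, then compare target and residual energy in dB). It is a pure-Python illustration, not the code used by scripts.test, and the function name and eps value are assumptions.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length 1-D sequences of floats."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so a DC offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target_energy = alpha * alpha * ref_energy
    # Whatever is left after removing the scaled target counts as distortion.
    noise_energy = sum((e - alpha * r) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference first, rescaling the reconstruction leaves the score essentially unchanged. Higher values indicate a closer reconstruction.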

Results

Performance Evaluation: We provide a comprehensive performance comparison of ESC with Descript's audio codec (DAC) at different model sizes, with and without adversarial training.
