[EMNLP 2024] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

yzGuu830/efficient-speech-codec

ESC: High-Fidelity Speech Coding with Efficient Cross-Scale Vector Quantized Transformers

[arXiv] This is the code repository for ESC, the speech codec presented in the paper "ESC: High-Fidelity Speech Coding with Efficient Cross-Scale Vector Quantized Transformers".

  • Our neural speech codec, at only 30MB, efficiently compresses 16kHz speech to 1.5, 3, 4.5, 6, 7.5 and 9kbps while maintaining reconstruction quality comparable to Descript's audio codec.
  • We provide Model Checkpoints and a Demo Page.

An illustration of ESC Architecture

Usage

Install Dev Dependencies

pip install -r requirements.txt

To compress and decompress audio

python -m scripts.compress --input /path/to/input.wav --save_path /path/to/output --model_path /path/to/model --num_streams 6 --device cpu

This will create .pth and .wav files (the codes and the reconstructed audio) under save_path. Our codec supports num_streams from 1 to 6, corresponding to bitrates from 1.5 to 9.0kbps.

import torch
import torchaudio
from models import ESC

# `config` must be defined beforehand: a dict of ESC constructor arguments
model = ESC(**config)
model.load_state_dict(
    torch.load("model.pth", map_location="cpu")["model_state_dict"],
)
model = model.to("cuda")
x, _ = torchaudio.load("input.wav")
x = x.to("cuda")  # Tensor.to() is not in-place; keep the returned tensor
# encode to codes
codes, pshape = model.encode(x, num_streams=6)
# decode to audio
recon_x = model.decode(codes, pshape)

This snippet shows programmatic usage of ESC: it loads audio with torchaudio and compresses and reconstructs the resulting tensor.
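The relationship between num_streams and bitrate described above can be sketched as a small helper. This is only an illustration: the function name bitrate_kbps is hypothetical and not part of the repository, and the 1.5kbps-per-stream step is inferred from the listed bitrates (1.5, 3, 4.5, 6, 7.5 and 9kbps for 1 to 6 streams).

```python
# Hypothetical helper, not part of the ESC codebase.
# Assumption: each stream contributes 1.5 kbps, as implied by the listed bitrates.
KBPS_PER_STREAM = 1.5
MAX_STREAMS = 6


def bitrate_kbps(num_streams: int) -> float:
    """Return the nominal bitrate in kbps for a given number of streams."""
    if not 1 <= num_streams <= MAX_STREAMS:
        raise ValueError(f"num_streams must be between 1 and {MAX_STREAMS}")
    return KBPS_PER_STREAM * num_streams


if __name__ == "__main__":
    for n in range(1, MAX_STREAMS + 1):
        print(f"num_streams={n}: {bitrate_kbps(n)} kbps")
```

A helper like this can be handy for labelling output files or plots by bitrate when sweeping over num_streams.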

Training

We provide our development training and evaluation datasets on Hugging Face.

accelerate launch main.py --exp_name esc9kbps --config_path ./configs/9kbps_final.yaml --wandb_project efficient-speech-codec --lr 1.0e-4 --num_epochs 80 --num_pretraining_epochs 15 --num_devices 4 --dropout_rate 0.75 --save_path /path/to/output --seed 53

We use the accelerate library to handle distributed training and the wandb library for logging. With 4 NVIDIA RTX 4090 GPUs, training an ESC codec takes ~12h for 250k training steps on 180k 3-second audio clips with a batch size of 36. For detailed configurations, please refer to the ./configs/ folder.

Evaluation

python -m scripts.test --eval_folder_path path/to/data --batch_size 12 --model_path /path/to/model --device cuda

This will run codec evaluation at all bandwidths on a test set folder. We report four metrics: PESQ, Mel Distance, SI-SDR and Bitrate Utilization Rate. The evaluation statistics are saved into model_path by default.
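For readers unfamiliar with SI-SDR, one of the metrics listed above, here is a minimal sketch of the commonly used definition in plain Python. It is an assumption-laden illustration, not the implementation used by scripts.test: the function name si_sdr and the eps parameter are hypothetical, and the sketch skips the mean removal that some implementations apply first.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    Illustrative sketch only; the repository's own metric code may differ.
    `reference` and `estimate` are equal-length sequences of samples.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)
    # Scaled reference is the "target"; the remainder is treated as distortion.
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the reference is rescaled before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what makes the metric scale-invariant.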

Results

Performance Evaluation

We provide a performance comparison with Descript's audio codec (DAC) at different model sizes.
