pip install -r requirements.txt
If you want to use apex for AMP training, clone the apex source code from its repository on github.com and install it from source.
To pretrain our model from scratch, please first download our processed pretraining dataset, Spotify-100k.
Warning: The spotify.tgz file is 96GB and contains 960 hours of audio (*.npy). Use a download tool that can pause and resume, such as aria2, to download it efficiently.
Then, download pre-trained WavLM and RoBERTa models from huggingface.co (optional), and run scripts/train-960.sh
The transForV4.pkl file (in spotify.tgz) includes a dataset of 358,268 audio-text pairs, each less than 10 seconds in duration, complete with precise timestamp alignment details. Here's what an example entry looks like:
['/<directory>/spotify-960/0/0/show_002B8PbILr169CdsS9ySTH/0.npy',
[0, 405, 18, 627, 19456, 1644, 102, 43264, 4, 3056, 6025, 7325, 3479, 254, 5652, 10162, 4, 2678, 4783, 2],
["it's", 1, 3, 0, 8000],
['the', 3, 4, 6400, 9600],
['mother', 4, 5, 8000, 14400],
['back', 5, 6, 12800, 17600],
['a', 6, 7, 16000, 19200],
['podcast.', 7, 9, 17600, 35200],
['well', 9, 10, 88000, 99200],
['that', 10, 11, 100800, 105600],
['was', 11, 12, 104000, 110400],
['longer', 12, 14, 108800, 115200],
['than', 14, 15, 113600, 118400],
['expected.', 15, 17, 116800, 132800],
['oh', 17, 18, 145600, 148800],
['my', 18, 19, 147200, 152000],
-1]
Each dataset entry consists of four parts:
- Audio File Path: The location of the audio file in .npy format.
- Text IDs: Tokens generated by RoBERTa corresponding to the text.
- Previous Turn Index: The last element, which references the ID of the previous turn, with -1 indicating no prior reference.
- Audio-Text Alignment: Detailed information on how the text aligns with audio segments. For instance, the word "it's" aligns with text IDs [1:3] and audio samples [0:8000]; at a sample rate of 16000 Hz, this segment represents 0.5 seconds of audio.
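The entry layout described above can be unpacked with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `parse_entry` and the returned dictionary keys are hypothetical, and it assumes that the token and sample ranges are half-open Python-style slices, as the `[1:3]` notation suggests.

```python
SAMPLE_RATE = 16000  # Hz, the sample rate stated in the text above


def parse_entry(entry):
    """Split one dataset entry into its four parts (layout assumed from the example)."""
    audio_path = entry[0]     # path to the .npy audio file
    text_ids = entry[1]       # RoBERTa token IDs
    alignments = entry[2:-1]  # [word, tok_start, tok_end, sample_start, sample_end]
    prev_turn = entry[-1]     # -1 means there is no previous turn

    words = []
    for word, tok_start, tok_end, sample_start, sample_end in alignments:
        words.append({
            "word": word,
            "token_ids": text_ids[tok_start:tok_end],
            "start_sec": sample_start / SAMPLE_RATE,
            "end_sec": sample_end / SAMPLE_RATE,
        })

    return {
        "audio_path": audio_path,
        "text_ids": text_ids,
        "words": words,
        "prev_turn": None if prev_turn == -1 else prev_turn,
    }


# Shortened copy of the example entry shown above (first two words only).
example = [
    "/<directory>/spotify-960/0/0/show_002B8PbILr169CdsS9ySTH/0.npy",
    [0, 405, 18, 627, 19456, 1644, 102, 43264, 4, 3056, 6025, 7325,
     3479, 254, 5652, 10162, 4, 2678, 4783, 2],
    ["it's", 1, 3, 0, 8000],
    ["the", 3, 4, 6400, 9600],
    -1,
]

parsed = parse_entry(example)
print(parsed["words"][0])
```

For the first word, the sketch yields token IDs `[405, 18]` and a span from 0.0 to 0.5 seconds, matching the "it's" example in the description.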
We provide the pre-trained checkpoint of our model at huggingface.co. To reproduce the results in our paper, please first download the pre-processed fine-tuning data: MOSI, MOSEI, IEMOCAP, and MINTREC. These are all stored as pickle files and can be used directly. Then run scripts/finetune.sh
Since SpokenWOZ data is large and constantly updated, please obtain it from the source.
To access the training, validation, and test files in the datasets, you can use the following command to extract the mosi.tgz file:
tar -xzvf mosi.tgz
Once extracted, you'll find .pkl files for training, validation, and testing. Each pickle file contains a list of samples, and each sample includes the following components:
- Audio Features: This field contains the audio feature data.
- Text Token IDs: Here, you'll find the IDs corresponding to text tokens.
- Label: This is the label assigned to the sample.
- History Audio Features (if applicable): If present, this field contains historical audio feature data.
- History Text Token IDs (if applicable): Similar to the above, this includes historical text token IDs, if available.
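A quick way to check what an extracted file contains is to load it and look at the first sample. The sketch below is an assumption-laden illustration: the helper names `load_samples` and `describe_sample` are hypothetical, and because the container type of each sample (tuple, list, or dict) is not stated here, the code inspects it rather than assuming a layout. Loading may also require the libraries used to create the features (for example NumPy) to be installed.

```python
import pickle


def load_samples(pkl_path):
    """Load one split and check that it is a list of samples, as described above."""
    with open(pkl_path, "rb") as f:
        samples = pickle.load(f)
    if not isinstance(samples, list):
        raise TypeError(f"expected a list of samples, got {type(samples).__name__}")
    return samples


def describe_sample(sample):
    """Map each field of a sample to its type name, without assuming its layout."""
    if isinstance(sample, dict):
        return {key: type(value).__name__ for key, value in sample.items()}
    if isinstance(sample, (list, tuple)):
        return {index: type(value).__name__ for index, value in enumerate(sample)}
    return {"value": type(sample).__name__}
```

Usage would look like `samples = load_samples(path_to_split)` followed by `print(len(samples), describe_sample(samples[0]))`, where `path_to_split` points to one of the extracted .pkl files.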
We hope this information helps you use the dataset effectively. If you have any questions or need further assistance, please feel free to reach out.