Code for "Homophone Reveals the Truth: A Reality Check for Speech2Vec"
A brief version of this report has been accepted at ICASSP 2023: "A Reality Check and A Practical Baseline for Semantic Speech Embedding"
- Free GPU RAM >= 3GB
- Free System RAM >= 30GB
- PyTorch version >= 1.8.2
Download dataset.zip (2.4GB) from Google Drive and unzip it to the project root. Its structure is shown below:
dataset/
├── info
│   ├── 500h_word2wav_keys.pkl
│   ├── 500h_word_counter.pkl
│   ├── 500h_word_split.pkl
│   └── eval
│       ├── all_words_5846.pkl
│       ├── files
│       │   ├── EN-MC-30.txt
│       │   ├── EN-MEN-TR-3k.txt
│       │   ├── EN-MTurk-287.txt
│       │   ├── EN-MTurk-771.txt
│       │   ├── EN-RG-65.txt
│       │   ├── EN-RW-STANFORD.txt
│       │   ├── EN-SIMLEX-999.txt
│       │   ├── EN-SimVerb-3500.txt
│       │   ├── EN-VERB-143.txt
│       │   ├── EN-WS-353-ALL.txt
│       │   ├── EN-WS-353-REL.txt
│       │   ├── EN-WS-353-SIM.txt
│       │   └── EN-YP-130.txt
│       └── homophone.txt
├── split_mfcc_dict.pkl
└── split_mfcc_mean_std.pkl
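The `.pkl` files can be inspected with Python's `pickle` module. A minimal sketch, under the assumption that `500h_word_counter.pkl` maps each word to its occurrence count in the 500-hour corpus (a toy object stands in for the real file here):

```python
import pickle
from collections import Counter

def load_pickle(path):
    """Load any of the dataset .pkl files, e.g. dataset/info/500h_word_counter.pkl."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-in for the real counter file; the actual schema is an assumption.
toy = Counter({"speech": 120, "vector": 45, "homophone": 7})
blob = pickle.dumps(toy)

counter = pickle.loads(blob)   # with the real file: load_pickle("dataset/info/500h_word_counter.pkl")
print(counter.most_common(2))  # most frequent words first
```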
We also provide some speech sentence segment examples in SentenceSegmentExamples.zip.
This instruction describes how we generated these files.
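The EN-*.txt files under eval/files are standard word-similarity benchmarks. Assuming the conventional layout (one pair per line: word1, word2, human score, whitespace-separated), they can be parsed like this; the sample lines below are illustrative, not quoted from the files:

```python
from io import StringIO

# Stand-in for e.g. open("dataset/info/eval/files/EN-WS-353-ALL.txt")
sample = StringIO("tiger\tcat\t7.35\nbook\tpaper\t7.46\n")

pairs = []
for line in sample:
    w1, w2, score = line.split()          # assumed three-column format
    pairs.append((w1, w2, float(score)))

print(pairs)
```

Embedding quality is then typically scored as the Spearman correlation between the human scores and the cosine similarities of the word pairs.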
Just run `python 1_train.py`.
The full training process (500 epochs) takes 8.4 days on an AMD 3900XT + RTX 3090 machine.
Use wandb to view the training process:
- Create a `.wb_config.json` file in the project root with the following content: `{ "WB_KEY": "Your wandb auth key" }`
- Add `--dryrun=False` to the training command, for example: `python 1_train.py --dryrun=False`
The checkpoints and embeddings of every epoch are in Full500EpochModelsEmbedings.zip (2.9GB).
- The Rand Init model corresponds to: epoch-01_ws0.10_men0.08_loss-1.000000.pkl
- The 500-Epoch model corresponds to: epoch499_ws0.15_men0.08_loss0.247943.pkl
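The checkpoint filenames appear to encode the epoch, the WS-353 score (`ws`), the MEN score (`men`), and the training loss; this regex is an assumption inferred only from the two filenames above:

```python
import re

# Assumed filename pattern: epoch[-]NN_wsX_menY_lossZ.pkl
pattern = re.compile(r"epoch-?(\d+)_ws([\d.]+)_men([\d.]+)_loss(-?[\d.]+)\.pkl")

name = "epoch499_ws0.15_men0.08_loss0.247943.pkl"
m = pattern.fullmatch(name)
epoch, ws, men, loss = (int(m.group(1)), float(m.group(2)),
                        float(m.group(3)), float(m.group(4)))
print(epoch, ws, men, loss)
```

The same pattern also matches the Rand Init checkpoint, whose `loss-1.000000` suffix suggests a sentinel value for "not yet trained".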
Feel free to contact me if you have any questions:
- Email: my@huacishu.com
- WeChat: