SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples
This is the repository for SNCSE.
SNCSE aims to alleviate feature suppression in contrastive learning for unsupervised sentence embedding. In this field, feature suppression means that models fail to distinguish and decouple textual similarity from semantic similarity. As a result, they may overestimate the semantic similarity of any pair of sentences with similar textual content, regardless of the actual semantic difference between them, and they may underestimate the semantic similarity of pairs with few words in common. (Please refer to Section 5 of our paper for several instances and a detailed analysis.) To this end, we propose to take the negation of the original sentences as soft negative samples and introduce them into the traditional contrastive learning framework through a bidirectional margin loss (BML). The structure of SNCSE is as follows:
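As a rough illustration of how soft negatives enter training, below is a minimal sketch of a bidirectional margin loss over cosine similarities. The variable names and the margin values are assumptions made for illustration only; please refer to the paper and the training code for the actual formulation and hyperparameters.

```python
import torch
import torch.nn.functional as F

def bidirectional_margin_loss(anchor, positive, soft_negative, alpha=0.1, beta=0.3):
    """Sketch of a bidirectional margin loss (BML) over cosine similarities.

    Intuition: the similarity between an anchor and its soft negative (the
    negated sentence) should be lower than the similarity between the anchor
    and its positive view, but only by a bounded amount, since the negation
    still shares most of its surface form with the original sentence.
    `alpha` and `beta` are illustrative margins, not the paper's values.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, soft_negative, dim=-1)
    delta = sim_neg - sim_pos  # expected to lie roughly within [-beta, -alpha]
    loss = F.relu(delta + alpha) + F.relu(-delta - beta)
    return loss.mean()
```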
The performance of SNCSE on the STS tasks with different encoders is as follows:
To reproduce the above results, please download the files and unzip them to replace the original folder. Then download the models from Google or Baidu, modify the file path variables, and run:
python bert_prediction.py
python roberta_prediction.py
To train SNCSE, please download the training file and put it at /SNCSE/data. You can either run:
python generate_soft_negative_samples.py
to generate soft negative samples, or use our file at /Files/soft_negative_samples.txt. Then you may modify and run train_SNCSE.sh.
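For reference, a simplified sketch of rule-based negation is shown below. It is only an illustration of the idea of negating a sentence to obtain a soft negative; it is not the procedure in generate_soft_negative_samples.py, which should be consulted for the actual rules.

```python
# Simplified sketch: negate a sentence by inserting "not" after the first
# auxiliary verb, if one is present. Illustration only, not the repository's
# generate_soft_negative_samples.py.
AUXILIARIES = {"is", "are", "was", "were", "am", "be", "can", "could", "will",
               "would", "shall", "should", "may", "might", "must", "do", "does", "did"}

def negate(sentence: str) -> str:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in AUXILIARIES:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    # Fallback: no auxiliary found; a real implementation would handle this
    # case as well (e.g. with a parser), here the sentence is left unchanged.
    return sentence

print(negate("The man is playing the guitar."))
# The man is not playing the guitar.
```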
To evaluate the checkpoints saved during training on the development set of the STSB task, please run:
python bert_evaluation.py
python roberta_evaluation.py
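The evaluation reports the Spearman correlation between the cosine similarity of the two sentence embeddings and the human-annotated score. A minimal sketch of that metric is given below; the encoder function, the data format, and the variable names are placeholders, not the actual code in bert_evaluation.py or roberta_evaluation.py.

```python
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(encode, sentence_pairs, gold_scores):
    """Spearman correlation between cosine similarities and gold STS scores.

    `encode` is a placeholder for any function mapping a list of sentences to
    a tensor of embeddings; `sentence_pairs` is a list of (sentence_a,
    sentence_b) tuples and `gold_scores` the matching human-annotated labels.
    """
    sents_a, sents_b = zip(*sentence_pairs)
    emb_a = encode(list(sents_a))
    emb_b = encode(list(sents_b))
    cosine = F.cosine_similarity(emb_a, emb_b, dim=-1).tolist()
    return spearmanr(cosine, gold_scores).correlation
```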
Feel free to contact the authors at wanghao2@sensetime.com for any questions.
@article{wang2022sncse,
title={SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples},
author={Wang, Hao and Li, Yangguang and Huang, Zhen and Dou, Yong and Kong, Lingpeng and Shao, Jing},
journal={arXiv preprint arXiv:2201.05979},
year={2022}
}