SimpleX is a simple and strong baseline model for collaborative filtering tasks. This repo provides the official open-source implementation of our paper:
- Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. SimpleX: A Simple and Strong Baseline for Collaborative Filtering, in CIKM 2021.
SimpleX presents a simple unified CF model, which follows the commonly-used two-tower network structure to support efficient retrieval from a large item corpus. The user tower outputs a weighted combination of user profile embedding and aggregated behavior sequence embedding. The model structure is general, and with appropriate settings, it can instantiate related models such as MF, YouTubeNet, and one-hop GNN. Based on the model, we evaluate the effectiveness of cosine contrastive loss and negative sampling.
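The weighted combination in the user tower can be illustrated with a small NumPy sketch. It is not the repository's implementation: the function names `user_tower` and `l2_normalize` are hypothetical, mean pooling stands in for the behavior aggregator, and any learned projection applied to the pooled history is left out for brevity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def user_tower(user_emb, history_emb, history_mask, g):
    """Combine a user profile embedding with mean-pooled behavior embeddings.

    user_emb:     (d,) user profile embedding
    history_emb:  (n, d) embeddings of the items in the behavior sequence
    history_mask: (n,) 1.0 for real items, 0.0 for padding slots
    g:            combination weight in [0, 1] (hypothetical name for the weight)
    """
    denom = max(history_mask.sum(), 1.0)  # avoid division by zero for empty histories
    pooled = (history_emb * history_mask[:, None]).sum(axis=0) / denom
    return g * user_emb + (1.0 - g) * pooled

rng = np.random.default_rng(0)
d = 4
user_emb = rng.normal(size=d)
history_emb = rng.normal(size=(3, d))
history_mask = np.array([0.0, 1.0, 1.0])  # first slot is padding
item_embs = rng.normal(size=(5, d))       # item tower output for 5 candidate items

u = user_tower(user_emb, history_emb, history_mask, g=0.5)
scores = l2_normalize(item_embs) @ l2_normalize(u)  # cosine similarity per item
top2 = np.argsort(-scores)[:2]                      # indices of the 2 best items
print(top2, scores[top2])
```

In this sketch, `g = 1` keeps only the profile embedding (an MF-style user representation), while `g = 0` keeps only the aggregated behavior sequence.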
Our experiments were conducted in the following environment settings. For reproducibility, please follow the instructions #32 to install the dependent packages.
- CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
- GPU: Tesla P100 16G
- RAM: 755G

- python: 3.6.5
- pytorch: 1.0.1.post2
- pandas: 0.23.0
- numpy: 1.18.1
- scipy: 1.1.0
- sklearn: 0.23.1
- pyyaml: 5.1
- h5py: 2.7.1
- tqdm: 4.59.0
- faiss-cpu: 1.7.0
- recbox: 0.0.4
Dataset config
data_root: ./data/Yelp/ # data directory to save h5 data
data_format: csv # input data format
train_data: ./data/Yelp/Yelp18_m1/train.csv # training data path
valid_data: ./data/Yelp/Yelp18_m1/test.csv # validation data path
test_data: ./data/Yelp/Yelp18_m1/test.csv # test data path
item_corpus: ./data/Yelp/Yelp18_m1/item_corpus.csv # item corpus which maps corpus_index to item features
min_categr_count: 1 # min count to filter category features,
                    # e.g., features with fewer than 10 occurrences may be set to a default "OOV" token
query_index: query_index # query_index to group metrics per request/user
corpus_index: corpus_index # corpus_index used to map to item ids and features
# feature_cols can be defined with the following keys:
#   name: feature column name in csv
#   active: True/False, whether to use the feature
#   dtype: int/str, the input data dtype
#   type: "index"/"categorical"/"sequence", types of features
#   source: "user"/"item"/"context" (optional), used to group features
#   splitter: (optional) the separator used to split str sequence
#   max_len: (optional) the max length to chunk or pad sequence feature
#   padding: "pre"/"post" (optional), whether to pad before or after the original sequence
#   embedding_callback: (optional) "layers.MaskedAveragePooling()" is used by default.
#                       When set to "null", the sequence embedding output will not be aggregated.
#   share_embedding: (optional) specify which features to share embedding table
feature_cols:
    - {'name': 'query_index', 'active': True, 'dtype': int, 'type': 'index'}
    - {'name': 'corpus_index', 'active': True, 'dtype': int, 'type': 'index'}
    - {'name': 'user_id', 'active': True, 'dtype': str, 'type': 'categorical', 'source': 'user'}
    - {'name': 'user_history', 'active': True, 'dtype': str, 'type': 'sequence', 'source': 'user', 'splitter': '^',
       'max_len': 500, 'padding': 'pre', 'embedding_callback': null}
    - {'name': 'item_id', 'active': True, 'dtype': str, 'type': 'categorical', 'source': 'item',
       'share_embedding': 'user_history'}
label_col: {name: label, dtype: float} # specify label column name and dtype
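To make the `splitter`, `max_len`, and `padding` options concrete, here is a minimal sketch of how a raw `user_history` string could be turned into a fixed-length sequence. The helper `encode_history` and the `"<pad>"` token are hypothetical, and the choice to keep the last `max_len` entries of an over-long sequence is an assumption; the actual preprocessing in the repository may differ.

```python
def encode_history(raw, splitter="^", max_len=5, padding="pre", pad_value="<pad>"):
    """Split a raw sequence string and pad or truncate it to max_len tokens.

    Assumption: over-long sequences keep their last max_len entries.
    """
    tokens = [t for t in raw.split(splitter) if t]  # drop empty pieces
    tokens = tokens[-max_len:]                      # truncate from the front
    pad = [pad_value] * (max_len - len(tokens))
    if padding == "pre":
        return pad + tokens   # padding before the original sequence
    if padding == "post":
        return tokens + pad   # padding after the original sequence
    raise ValueError(f"unknown padding mode: {padding!r}")

print(encode_history("12^7^33"))                  # ['<pad>', '<pad>', '12', '7', '33']
print(encode_history("12^7^33", padding="post"))  # ['12', '7', '33', '<pad>', '<pad>']
print(encode_history("1^2^3^4^5^6^7"))            # ['3', '4', '5', '6', '7']
```

With `'padding': 'pre'`, as in the `user_history` column above, the real items sit at the end of the sequence and the padding slots at the start.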
Model config
model: SimpleX # model class name
dataset_id: yelp18_m1_9217a019 # dataset id to join data config
metrics: ['Recall(k=20)', 'Recall(k=50)', 'NDCG(k=20)', 'NDCG(k=50)', 'HitRate(k=20)', 'HitRate(k=50)'] # metrics for evaluation
optimizer: adam # optimizer set to adam by default
learning_rate: 1.e-4 # learning rate
batch_size: 512
num_negs: 1000 # number of samples for negative sampling
embedding_dim: 64
aggregator: mean # behavior aggregator: mean/user_attention/self_attention
gamma: 1 # combination weight g
user_id_field: user_id
item_id_field: item_id
user_history_field: user_history # behavior sequence
embedding_regularizer: 1.e-8 # L2 regularization weight for embedding parameters
net_regularizer: 0 # L2 regularization weight for network parameters
net_dropout: 0.1 # dropout rate for network
attention_dropout: 0 # dropout rate for attention if used
enable_bias: False # whether to add bias term
similarity_score: cosine # similarity score measure: cosine/dot
loss: CosineContrastiveLoss # loss used in training
margin: 0.9 # the margin `m` threshold for CCL
negative_weight: 150 # negative weight `w` for CCL
sampling_num_process: 1 # number of processes for negative sampling
fix_sampling_seeds: False # whether to use fixed random seeds for negative sampling
ignore_pos_items: False # whether to mask out positive items during negative sampling.
                        # When set to True, training becomes slower but gives better results.
epochs: 100 # the max epochs for training. Typically, training will stop by early stopping.
shuffle: True # whether to shuffle data samples for training
seed: 2019 # random seed used to ensure reproducibility
monitor: 'Recall(k=20)' # metric used to monitor the evaluation results for early stopping
monitor_mode: 'max' # `max`/`min`, indicates whether higher or lower is better for the monitor metric
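The roles of `margin` and `negative_weight` can be seen in a small NumPy sketch of the cosine contrastive loss for a single positive pair. It assumes the loss takes the form (1 - s_pos) + (w / |N|) * sum(max(0, s_neg - m)), where s denotes cosine similarity and N is the set of sampled negatives; the function name and signature are hypothetical and this is not the repository's `CosineContrastiveLoss` class.

```python
import numpy as np

def cosine_contrastive_loss(pos_score, neg_scores, margin=0.9, negative_weight=150.0):
    """CCL sketch for one positive pair and its sampled negatives.

    pos_score:  cosine similarity of the positive (user, item) pair
    neg_scores: (n,) cosine similarities of the sampled negative items
    """
    pos_loss = 1.0 - pos_score                              # pull the positive toward 1
    neg_loss = np.maximum(0.0, neg_scores - margin).mean()  # only negatives above the margin count
    return pos_loss + negative_weight * neg_loss

neg_scores = np.array([0.95, 0.2, -0.4, 0.91])
loss = cosine_contrastive_loss(0.8, neg_scores)
print(loss)  # about 2.45: 0.2 from the positive, 150 * 0.015 from two hard negatives
```

Negatives whose similarity is already below the margin contribute nothing, so with `num_negs: 1000` most sampled items are filtered out and only the hard ones drive the gradient.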
Results on Yelp18_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
YouTubeNet [RecSys'16] | 0.0686 | 0.0567 |
ENMF [TOIS'20] | 0.0650 | 0.0515 |
NGCF [SIGIR'19] | 0.0579 | 0.0477 |
LightGCN [SIGIR'20] | 0.0649 | 0.0530 |
SGL-ED [SIGIR'21] | 0.0675 | 0.0555 |
UltraGCN [CIKM'21] | 0.0683 | 0.0561 |
MF-CCL [CIKM'21] | 0.0698 | 0.0572 |
SimpleX [CIKM'21] | 0.0701 | 0.0575 |
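For readers unfamiliar with the reported metrics, the following sketch computes Recall@k and binary-relevance NDCG@k for one user. The function names are hypothetical and the sketch assumes the common definitions; the evaluation code used for the tables may differ in details such as how scores are averaged over users.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's held-out items that appear in the top-k list."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [10, 4, 7, 1, 9]   # item ids sorted by predicted score
relevant = {4, 9, 8}        # held-out items for this user
print(recall_at_k(ranked, relevant, k=5))  # 2 of the 3 relevant items are retrieved
print(ndcg_at_k(ranked, relevant, k=5))
```

Recall@k ignores where in the top-k list a hit occurs, whereas NDCG@k discounts hits at lower ranks, which is why the two columns can rank models differently.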
Follow the steps below to reproduce the results
# convert data format
cd data/Yelp/Yelp18_m1
python

# run the model
python --config ./config/MF_CCL_yelp18_m1 --expid MF_CCL_yelp18_m1 --gpu 0
python --config ./config/SimpleX_yelp18_m1 --expid SimpleX_yelp18_m1 --gpu 0
See the running logs:
Results on Gowalla_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
YouTubeNet [RecSys'16] | 0.1754 | 0.1473 |
ENMF [TOIS'20] | 0.1523 | 0.1315 |
NGCF [SIGIR'19] | 0.1570 | 0.1327 |
LightGCN [SIGIR'20] | 0.1830 | 0.1554 |
UltraGCN [CIKM'21] | 0.1862 | 0.1580 |
MF-CCL [CIKM'21] | 0.1837 | 0.1493 |
SimpleX [CIKM'21] | 0.1872 | 0.1557 |
Follow the steps below to reproduce the results
# convert data format
cd data/Gowalla/Gowalla_m1
python

# run the model
python --config ./config/MF_CCL_gowalla_m1 --expid MF_CCL_gowalla_m1 --gpu 0
python --config ./config/SimpleX_gowalla_m1 --expid SimpleX_gowalla_m1 --gpu 0
See the running logs:
Results on AmazonBooks_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
YouTubeNet [RecSys'16] | 0.0502 | 0.0388 |
ENMF [TOIS'20] | 0.0359 | 0.0281 |
NGCF [SIGIR'19] | 0.0344 | 0.0263 |
LightGCN [SIGIR'20] | 0.0411 | 0.0315 |
SGL-ED [SIGIR'21] | 0.0478 | 0.0379 |
UltraGCN [CIKM'21] | 0.0681 | 0.0556 |
MF-CCL [CIKM'21] | 0.0559 | 0.0447 |
SimpleX [CIKM'21] | 0.0583 | 0.0468 |
Follow the steps below to reproduce the results
# convert data format
cd data/Amazon/AmazonBooks_m1
python

# run the model
python --config ./config/MF_CCL_amazonbooks_m1 --expid MF_CCL_amazonbooks_m1 --gpu 0
python --config ./config/SimpleX_amazonbooks_m1 --expid SimpleX_amazonbooks_m1 --gpu 0
See the running logs:
Results on AmazonCDs_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
NGCF [SIGIR'19] | 0.1258 | 0.0792 |
BGCF [KDD'20] | 0.1506 | 0.0948 |
SimpleX [CIKM'21] | 0.1763 | 0.1145 |
Follow the steps below to reproduce the results
# convert data format
cd data/Amazon/AmazonCDs_m1
python

# run the model
python --config ./config/SimpleX_amazoncds_m1 --expid SimpleX_amazoncds_m1 --gpu 0
See the running log: results/SimpleX_amazoncds_m1_014_c5143710.log
Results on AmazonMovies_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
NGCF [SIGIR'19] | 0.0866 | 0.0555 |
BGCF [KDD'20] | 0.1066 | 0.0693 |
SimpleX [CIKM'21] | 0.1342 | 0.0926 |
Follow the steps below to reproduce the results
# convert data format
cd data/Amazon/AmazonMovies_m1
python

# run the model
python --config ./config/SimpleX_amazonmovies_m1 --expid SimpleX_amazonmovies_m1 --gpu 0
See the running log: results/SimpleX_amazonmovies_m1_009_88b07f96.log
Results on AmazonBeauty_m1:

Model | Recall@20 | NDCG@20 |
--- | --- | --- |
NGCF [SIGIR'19] | 0.1513 | 0.0917 |
BGCF [KDD'20] | 0.1534 | 0.0912 |
SimpleX [CIKM'21] | 0.1721 | 0.1028 |
Follow the steps below to reproduce the results
# convert data format
cd data/Amazon/AmazonBeauty_m1
python

# run the model
python --config ./config/SimpleX_amazonbeauty_m1 --expid SimpleX_amazonbeauty_m1 --gpu 0
See the running log: results/SimpleX_amazonbeauty_m1_001_bcec104e.log
Results on AmazonElectronics_m1:

Model | F1@20 | NDCG@20 |
--- | --- | --- |
ENMF [TOIS'20] | 0.0314 | 0.0823 |
NBPO [SIGIR'20] | 0.0313 | 0.0810 |
UltraGCN [CIKM'21] | 0.0330 | 0.0829 |
SimpleX [CIKM'21] | 0.0338 | 0.0842 |
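Some of the result tables report F1@20 or Precision@20 rather than Recall@20. As a reminder of how these relate, here is a minimal per-user sketch under the usual definitions; the function name is hypothetical and the repository's evaluation code may aggregate over users differently.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """Return (precision@k, recall@k, F1@k) for one user's ranked list."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

ranked = [10, 4, 7, 1, 9]   # item ids sorted by predicted score
relevant = {4, 9, 8}        # held-out items for this user
precision, recall, f1 = precision_recall_f1_at_k(ranked, relevant, k=5)
print(precision, recall, f1)  # 2 hits: precision 2/5, recall 2/3, F1 about 0.5
```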
Follow the steps below to reproduce the results
# convert data format
cd data/Amazon/AmazonElectronics_m1
python

# run the model
python --config ./config/SimpleX_amazonelectronics_m1 --expid SimpleX_amazonelectronics_m1 --gpu 0
See the running log: results/SimpleX_amazonelectronics_m1_110_a6e18467.log
Results on CiteUlikeA_m1:

Model | Precision@20 | Recall@20 |
--- | --- | --- |
ENMF [TOIS'20] | 0.0748 | 0.0280 |
NGCF [SIGIR'19] | 0.0517 | 0.0193 |
DHCF [KDD'20] | 0.0635 | 0.0249 |
SimpleX [CIKM'21] | 0.0754 | 0.0269 |
Follow the steps below to reproduce the results
# convert data format
cd data/CiteULike/CiteUlikeA_m1
python

# run the model
python --config ./config/SimpleX_citeulikea_m1 --expid SimpleX_citeulikea_m1 --gpu 0
See the running log: results/SimpleX_citeulikea_m1_005_fe2b7f3d.log
Results on Movielens1M_m1:

Model | F1@20 | NDCG@20 | Recall@20 |
--- | --- | --- | --- |
ENMF [TOIS'20] | 0.1640 | 0.2656 | |
NGCF [SIGIR'19] | 0.1582 | 0.2511 | 0.2513 |
LCFN [ICML'20] | 0.1625 | 0.2603 | |
LightGCN [SIGIR'20] | 0.2427 | 0.2576 | |
UltraGCN [CIKM'21] | 0.2004 | 0.2642 | 0.2787 |
SimpleX [CIKM'21] | 0.1658 | 0.2670 | 0.2802 |
Follow the steps below to reproduce the results
# convert data format
cd data/MovieLens/Movielens1M_m1
python

# run the model
python --config ./config/SimpleX_movielens1m_m1 --expid SimpleX_movielens1m_m1 --gpu 0
See the running log: results/SimpleX_movielens1m_m1_021_6b1eda86.log
For reproducing our baselines, please refer to the BARS benchmark at