SimpleX

SimpleX is a simple and strong baseline model for collaborative filtering tasks. This repo provides the official open-source implementation of our paper:

Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. SimpleX: A Simple and Strong Baseline for Collaborative Filtering, in CIKM 2021.

Model Structure

SimpleX presents a simple unified CF model, which follows the commonly-used two-tower network structure to support efficient retrieval from a large item corpus. The user tower outputs a weighted combination of user profile embedding and aggregated behavior sequence embedding. The model structure is general, and with appropriate settings, it can instantiate related models such as MF, YouTubeNet, and one-hop GNN. Based on the model, we evaluate the effectiveness of cosine contrastive loss and negative sampling.
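The weighted combination in the user tower can be illustrated with a minimal pure-Python sketch. It assumes the combination takes the form `g * user_emb + (1 - g) * pooled_history`, uses mean pooling as the aggregator, and omits any projection layer; the function names `mean_pool`, `user_tower`, and `cosine` are hypothetical and not taken from this repo.

```python
import math

def mean_pool(item_embs):
    """Average a list of equal-length embedding vectors (a `mean`-style aggregator)."""
    dim = len(item_embs[0])
    return [sum(e[d] for e in item_embs) / len(item_embs) for d in range(dim)]

def user_tower(user_emb, history_embs, g):
    """Blend the profile embedding with the pooled behavior embedding.

    Assumed form: g * user_emb + (1 - g) * pooled_history.
    """
    pooled = mean_pool(history_embs)
    return [g * u + (1.0 - g) * p for u, p in zip(user_emb, pooled)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

# Toy 2-dimensional embeddings, for illustration only.
user_emb = [1.0, 0.0]
history_embs = [[0.0, 1.0], [0.0, 3.0]]

mf_like = user_tower(user_emb, history_embs, g=1.0)  # history ignored, as in MF
blended = user_tower(user_emb, history_embs, g=0.5)  # equal mix of both parts
score = cosine(blended, [0.5, 1.0])                  # user-item matching score
```

Under this assumed form, setting `g` to 1 drops the behavior sequence entirely, which is how the model would reduce to a plain MF-style user embedding.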

Environments

Our experiments were conducted in the following environment. For reproducibility, please follow the instructions in #32 to install the dependent packages.

Hardware

CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
GPU: Tesla P100 16G
RAM: 755G

Software

python: 3.6.5
pytorch: 1.0.1.post2
pandas: 0.23.0
numpy: 1.18.1
scipy: 1.1.0
sklearn: 0.23.1
pyyaml: 5.1
h5py: 2.7.1
tqdm: 4.59.0
faiss-cpu: 1.7.0
recbox: 0.0.4

Configuration Guide

Dataset config

data_root: ./data/Yelp/  # data directory to save h5 data
data_format: csv  # input data format
train_data: ./data/Yelp/Yelp18_m1/train.csv  # training data path
valid_data: ./data/Yelp/Yelp18_m1/test.csv  # validation data path
test_data: ./data/Yelp/Yelp18_m1/test.csv  # test data path
item_corpus: ./data/Yelp/Yelp18_m1/item_corpus.csv  # item corpus which maps corpus_index to item features
min_categr_count: 1  # min count to filter categorical features;
                     # e.g., with a value of 10, features with fewer than 10 occurrences may be mapped to a default "OOV" token
query_index: query_index  # query_index to group metrics per request/user
corpus_index: corpus_index  # corpus_index used to map to item ids and features
# feature_cols can be defined with the following keys:
#     name: feature column name in csv
#     active: True/False, whether to use the feature
#     dtype: int/str, the input data dtype
#     type: "index"/"categorical"/"sequence", types of features
#     source: "user"/"item"/"context" (optional), used to group features
#     splitter: (optional) the separator used to split a str sequence
#     max_len: (optional) the max length to chunk or pad sequence feature
#     padding: "pre"/"post" (optional), whether to pad before or after the original sequence
#     embedding_callback: (optional) "layers.MaskedAveragePooling()" is used by default.
#                         When set to "null", the sequence embedding output will not be aggregated.
#     share_embedding: (optional) name of the feature whose embedding table this feature shares
feature_cols:
    - {'name': 'query_index', 'active': True, 'dtype': int, 'type': 'index'}
    - {'name': 'corpus_index', 'active': True, 'dtype': int, 'type': 'index'}
    - {'name': 'user_id', 'active': True, 'dtype': str, 'type': 'categorical', 'source': 'user'}
    - {'name': 'user_history', 'active': True, 'dtype': str, 'type': 'sequence', 'source': 'user', 'splitter': '^',
       'max_len': 500, 'padding': 'pre', 'embedding_callback': null}
    - {'name': 'item_id', 'active': True, 'dtype': str, 'type': 'categorical', 'source': 'item', 'share_embedding': 'user_history'}
label_col: {name: label, dtype: float}  # specify label column name and dtype
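To make the `splitter`, `max_len`, and `padding` keys concrete, here is a minimal sketch of how a raw `user_history` string could be turned into a fixed-length id list. It is not the repo's preprocessing code: the function name `encode_sequence`, the `pad_id`/`oov_id` values, and the choice to keep the most recent items when truncating are all assumptions made for illustration.

```python
def encode_sequence(raw, vocab, splitter="^", max_len=500, padding="pre",
                    pad_id=0, oov_id=1):
    """Split a raw sequence string, map tokens to ids, then truncate and pad.

    Assumptions: unknown tokens map to `oov_id`, truncation keeps the last
    `max_len` tokens, and `pad_id` fills the remaining positions.
    """
    tokens = raw.split(splitter) if raw else []
    ids = [vocab.get(tok, oov_id) for tok in tokens]
    ids = ids[-max_len:]                  # assumed: keep the most recent items
    pad = [pad_id] * (max_len - len(ids))
    return pad + ids if padding == "pre" else ids + pad

# Hypothetical vocabulary built from item ids seen in training.
vocab = {"i1": 2, "i2": 3, "i3": 4}

pre_padded = encode_sequence("i1^i2^i9", vocab, max_len=5, padding="pre")
post_padded = encode_sequence("i1^i2^i9", vocab, max_len=5, padding="post")
truncated = encode_sequence("i1^i2^i3", vocab, max_len=2)
```

With `padding: 'pre'`, padding ids are placed before the real items, so the most recent behaviors sit at the end of the sequence.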

Model config

model: SimpleX  # model class name
dataset_id: yelp18_m1_9217a019  # dataset id to join data config
metrics: ['Recall(k=20)', 'Recall(k=50)', 'NDCG(k=20)', 'NDCG(k=50)', 'HitRate(k=20)', 'HitRate(k=50)'] # metrics for evaluation
optimizer: adam  # optimizer set to adam by default
learning_rate: 1.e-4  # learning rate
batch_size: 512 
num_negs: 1000  # number of samples for negative sampling
embedding_dim: 64  
aggregator: mean  # behavior aggregator: mean/user_attention/self_attention
gamma: 1  # combination weight g
user_id_field: user_id  
item_id_field: item_id  
user_history_field: user_history  # behavior sequence
embedding_regularizer: 1.e-8  # L2 regularization weight for embedding parameters
net_regularizer: 0  # L2 regularization weight for network parameters
net_dropout: 0.1  # dropout rate for network
attention_dropout: 0  # dropout rate for attention if used
enable_bias: False  # whether to add bias term
similarity_score: cosine  # similarity score measure: cosine/dot
loss: CosineContrastiveLoss  # loss used in training
margin: 0.9  # the margin `m` threshold for CCL
negative_weight: 150  # negative weight `w` for CCL
sampling_num_process: 1  # number of processes for negative sampling
fix_sampling_seeds: False  # whether to use fixed random seeds for negative sampling
ignore_pos_items: False  # whether to mask out positive items during negative sampling.
                         # When set to True, training becomes slower but gives better results.
epochs: 100  # the max epochs for training. Typically, training will stop by early stopping.
shuffle: True  # whether to shuffle data samples for training
seed: 2019  # random seed used to ensure reproducibility
monitor: 'Recall(k=20)'  # metric monitored on the evaluation results for early stopping
monitor_mode: 'max'  # `max`/`min`, whether a higher or a lower value of the monitored metric is better
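The roles of `margin` (`m`) and `negative_weight` (`w`) can be seen in a minimal sketch of a cosine contrastive loss for a single positive pair. It assumes the loss has the form `(1 - pos_score) + w * mean(max(0, neg_score - m))` over the sampled negatives; the function name `cosine_contrastive_loss` is hypothetical, and the repo's own implementation may differ in batching and reduction.

```python
def cosine_contrastive_loss(pos_score, neg_scores, margin=0.9, negative_weight=150):
    """Loss for one (user, positive item) pair and its sampled negatives.

    pos_score and neg_scores are cosine similarities in [-1, 1].
    Assumed form: (1 - pos) + w * mean(max(0, neg - m)).
    """
    pos_loss = 1.0 - pos_score
    # Only negatives scoring above the margin contribute to the loss.
    neg_loss = sum(max(0.0, s - margin) for s in neg_scores) / len(neg_scores)
    return pos_loss + negative_weight * neg_loss

# Toy scores: one hard negative (0.95) exceeds the margin of 0.9.
loss = cosine_contrastive_loss(0.8, [0.95, 0.5, 0.1, 0.9])

# With no negative above the margin, only the positive term remains.
easy_loss = cosine_contrastive_loss(0.8, [0.5, 0.1, -0.3])
```

In this form, a larger `margin` filters out more easy negatives, while `negative_weight` balances the averaged negative term against the single positive term when `num_negs` is large.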