dev branch - Number of unique items does not match max item index #17
Open
Description
In line 235 of daisy/utils/loader.py:
def __get_stats(self, df):
    user_num = df['user'].nunique()
    item_num = df['item'].nunique()
    return user_num, item_num
Here, item_num is the number of unique items that appear in the data.
However, in daisy/utils/sampler.py, line 71, we have:
uni_negs = np.random.choice(
    np.setdiff1d(np.arange(self.item_num), past_inter),  # where self.item_num = config['item_num']
    size=uniform_num
)
other_negs = np.random.choice(
    np.arange(self.item_num),
    size=other_num,
    p=self.pop_prob
)
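In the sampler, however, item_num is treated as the exclusive upper bound on item indices (maximum index + 1): np.arange(self.item_num) is used as the full set of item indices for the set difference. Because item_num is actually the count of unique items, any item whose index is greater than or equal to item_num can never be drawn, so negatives are sampled from a smaller pool than intended, which is incorrect.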
In fact, in ml-100k, there are 1152 unique items involved in an interaction, while there are 1682 possible items (meaning 530 were never involved in an interaction).
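The mismatch can be reproduced on a toy example. The sketch below uses made-up interactions (not ml-100k) and assumes item IDs are raw, non-contiguous indices; the variable names are illustrative and do not come from the repository.

```python
import numpy as np
import pandas as pd

# Hypothetical toy interactions with sparse item IDs (not ml-100k data).
df = pd.DataFrame({
    "user": [0, 0, 1, 2],
    "item": [0, 3, 3, 7],
})

item_num = df["item"].nunique()    # number of distinct items: 3
max_index = int(df["item"].max())  # largest item ID: 7

# The candidate pool the sampler would build from item_num.
candidates = np.arange(item_num)   # [0, 1, 2]

# Items that exist in the data but lie outside the candidate pool.
unreachable = np.setdiff1d(df["item"].unique(), candidates)
print(item_num, max_index, unreachable)
```

With these IDs, items 3 and 7 fall outside np.arange(item_num) and could never be sampled as negatives.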
There are multiple ways to fix this issue, and the choice is left to future collaborators.
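Two possible directions, as a rough sketch only: neither is taken from the repository, and the toy DataFrame and variable names are assumptions for illustration. Option A keeps the raw IDs and defines item_num as maximum index + 1; Option B remaps item IDs to contiguous codes so that the unique count and the index range agree.

```python
import pandas as pd

# Hypothetical toy interactions with sparse item IDs.
df = pd.DataFrame({"user": [0, 0, 1, 2], "item": [0, 3, 3, 7]})

# Option A: keep raw IDs and treat item_num as the size of the index range.
item_num_a = int(df["item"].max()) + 1

# Option B: remap item IDs to contiguous codes 0..n-1, so that
# nunique() and max() + 1 coincide.
codes, uniques = pd.factorize(df["item"], sort=True)
df_b = df.assign(item=codes)
item_num_b = df_b["item"].nunique()

print(item_num_a, item_num_b, codes.tolist())
```

Under Option A, any per-item array such as the popularity probabilities passed as p would presumably need length item_num_a, with zero weight for items that never appear. Under Option B, the mapping in uniques would have to be kept to translate codes back to the original IDs.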