
dev branch - Number of unique items does not match max item index #17

Open
@SyedZaheen

Description

In line 235 of daisy/utils/loader.py:

    def __get_stats(self, df):
        user_num = df['user'].nunique()
        item_num = df['item'].nunique()

        return user_num, item_num

Here, item_num is the total number of unique items that appear in the data.

However, in daisy/utils/sampler.py line 71 we have:

                uni_negs = np.random.choice(
                    np.setdiff1d(np.arange(self.item_num), past_inter), # where self.item_num = config['item_num']
                    size=uniform_num
                )
                other_negs = np.random.choice(
                    np.arange(self.item_num),
                    size=other_num,
                    p=self.pop_prob
                )

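Here, item_num is used as an exclusive upper bound on item indices: np.arange(self.item_num) yields 0 to item_num - 1, and the set difference against past_inter only makes sense if that range covers every possible item index. In other words, the sampler treats item_num as the maximum item index plus one, not as the count of unique items. When the two differ, negatives are drawn from a smaller pool than the full set of items, which is incorrect.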

In fact, in ml-100k, there are 1152 unique items involved in an interaction, while there are 1682 possible items (meaning 530 were never involved in an interaction).
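The mismatch can be reproduced on a toy frame. This is a minimal sketch, assuming item IDs are non-contiguous integers in an item column; the data is made up for illustration and is not ml-100k.

```python
import numpy as np
import pandas as pd

# Hypothetical toy interactions: item IDs 1, 2 and 4 never appear.
df = pd.DataFrame({"user": [0, 0, 1, 2], "item": [0, 3, 5, 5]})

item_num = df["item"].nunique()    # count of unique items, as in __get_stats
max_index = int(df["item"].max())  # largest item index present in the data

# Candidate pool as built in the sampler: indices 0 .. item_num - 1
pool = np.arange(item_num)

# Indices that exist in the ID space but can never be drawn as negatives
unreachable = np.setdiff1d(np.arange(max_index + 1), pool)
print(item_num, max_index, unreachable)
```

In this toy case items 3 and 5 were interacted with, yet neither can ever be sampled because both lie outside the pool.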

There are multiple ways to fix this issue, and the choice is up to future collaborators.
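Two possible directions are sketched below, as an illustration only. The function names stats_by_max_index and reindex_contiguous are hypothetical and do not exist in daisy. Option A reports the size of the index range instead of the unique count; option B remaps IDs to a contiguous range so that the two quantities coincide.

```python
import pandas as pd

def stats_by_max_index(df):
    """Option A (hypothetical): return max index + 1 instead of the unique count."""
    user_num = int(df["user"].max()) + 1
    item_num = int(df["item"].max()) + 1
    return user_num, item_num

def reindex_contiguous(df):
    """Option B (hypothetical): remap IDs to 0..n-1 so nunique() == max index + 1."""
    out = df.copy()
    for col in ("user", "item"):
        # pd.factorize assigns integer codes 0..n-1 in order of first appearance
        out[col], _ = pd.factorize(out[col])
    return out

# Toy data with non-contiguous item IDs (made up for illustration)
df = pd.DataFrame({"user": [0, 0, 1, 2], "item": [0, 3, 5, 5]})

a_users, a_items = stats_by_max_index(df)
remapped = reindex_contiguous(df)
b_items = remapped["item"].nunique()
```

With option A, any per-item array passed to np.random.choice as p (such as pop_prob) would also need length item_num, since p must match the size of the candidate array. With option B, the original IDs would have to be kept somewhere if they are needed later.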
