dev branch - Number of unique items does not match max item index

In line 235 of daisy/utils/loader.py:

```    
    def __get_stats(self, df):
        user_num = df['user'].nunique()
        item_num = df['item'].nunique()

        return user_num, item_num
```
Here, `item_num` is the **total number of unique items** that appear in the interactions.

However, in line 71 of daisy/utils/sampler.py we have:

```
                uni_negs = np.random.choice(
                    np.setdiff1d(np.arange(self.item_num), past_inter), # where self.item_num = config['item_num']
                    size=uniform_num
                )
                other_negs = np.random.choice(
                    np.arange(self.item_num),
                    size=other_num,
                    p=self.pop_prob
                )
```

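In this snippet, `item_num` is treated as the **size of the item index space** (the maximum item index plus one), because `np.arange(self.item_num)` enumerates the candidate indices before the set difference is taken. When `item_num` is actually the number of unique items and the item indices are not contiguous, every item whose index is at or above `item_num` is excluded, so negatives are sampled from a smaller pool than intended, which is incorrect.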

In fact, in ml-100k, there are 1152 unique items involved in an interaction, while there are 1682 possible items (meaning 530 were never involved in an interaction).
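To make the mismatch concrete, here is a minimal sketch on a toy DataFrame (not the real ml-100k data). It assumes item ids are non-negative integers with gaps; the variable names `index_space`, `pool`, and `unreachable` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy interactions with sparse item ids (0, 1, 5, 9), e.g. after filtering.
df = pd.DataFrame({"user": [0, 0, 1, 2], "item": [0, 5, 9, 1]})

# What __get_stats computes: the count of distinct items.
item_num = df["item"].nunique()

# What np.arange(...) would need in order to cover every item id.
index_space = int(df["item"].max()) + 1

# Candidate pool as built in the sampler: indices 0 .. item_num - 1.
pool = np.arange(item_num)

# Item ids that exist in the data but can never be drawn from the pool.
unreachable = np.setdiff1d(df["item"].unique(), pool)

print(item_num, index_space, unreachable)
```

In this toy case the pool covers indices 0 to 3 only, so items 5 and 9 can never be sampled as negatives, while indices 2 and 3, which correspond to no item in the data, can be.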

There are multiple ways to fix this issue, and the choice is left to the maintainers and future collaborators.
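Two possible directions, sketched under the assumption that user and item ids are non-negative integers. The function names `get_stats_by_max_index` and `reindex_contiguous` are hypothetical and not part of the repository.

```python
import pandas as pd


def get_stats_by_max_index(df):
    """Option A: report the size of the index space (max id + 1)."""
    user_num = int(df["user"].max()) + 1
    item_num = int(df["item"].max()) + 1
    return user_num, item_num


def reindex_contiguous(df):
    """Option B: remap ids to 0 .. n-1 so that nunique() == max() + 1."""
    out = df.copy()
    # pd.factorize returns (codes, uniques); codes are contiguous from 0.
    out["user"] = pd.factorize(out["user"])[0]
    out["item"] = pd.factorize(out["item"])[0]
    return out
```

With option A, any array indexed by item (for example the `p=self.pop_prob` argument, which `np.random.choice` requires to have the same length as the candidate array) would need to be sized to the new `item_num`. With option B, the existing `nunique()`-based `__get_stats` stays valid, but the mapping back to the original ids has to be kept if they are needed later.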




