[bug] objective/entropy < 0 when using RLOOTrainer and PPOTrainer #2496
Open
Description
trl/trl/trainer/rloo_trainer.py
Line 443 in 1661bc2
This happens because, in the current code, the padded positions of `logprobs` are filled with 1:
INVALID_LOGPROB = 1.0
logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)
I don't understand why INVALID_LOGPROB is set to 1. Wouldn't it work just as well if it were set to 0?
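To illustrate why the fill value of 1.0 can push the logged entropy below zero: if the metric is computed as the mean of `-logprobs` and the reduction does not re-exclude the padded positions, every padded slot contributes exactly -1.0 to the average. The sketch below is a toy reproduction with made-up tensors, not TRL's actual trainer code; the variable names are illustrative assumptions:

```python
import torch

INVALID_LOGPROB = 1.0  # the fill value used for padded positions

# toy per-token logprobs for 2 sequences of length 4
logprobs = torch.tensor([[-0.1, -0.2, -0.1, -0.2],
                         [-0.1, -0.3, -0.2, -0.1]])
# pretend the last three tokens of the second sequence are padding
padding_mask = torch.tensor([[False, False, False, False],
                             [False, True,  True,  True]])

masked = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)

# naive reduction: averages over ALL positions, so each padded slot
# contributes -INVALID_LOGPROB = -1.0 and drags the estimate negative
naive_entropy = (-masked).mean()

# masked reduction: only non-padded positions enter the mean
correct_entropy = (-masked[~padding_mask]).mean()

print(naive_entropy.item())    # negative, despite entropy being >= 0
print(correct_entropy.item())  # positive, as expected
```

Note that filling with 0 instead of 1 would not fix a missing mask either; it would merely bias the mean toward zero instead of toward -1. The real fix is to exclude the padded positions from the reduction.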