
update PER support #25

Merged 1 commit into werner-duvaud:prioritized_replay on Mar 23, 2020

Conversation

xuxiyang1993
Contributor

Hi,

Thanks for the information!

For the 'wrap' mode in numpy.put(), you are correct. I fixed this line in this commit.

For the mean, I don't have a reference. Since we first sample games and then sample transitions from those games, we need two prioritized sampling processes, which differs from the Prioritized Experience Replay paper. An alternative to the mean could be the max, since averaging can wash out transitions with high priorities. I'm still testing these two schemes.
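For concreteness, a minimal sketch of that two-stage sampling (all names are illustrative, not the actual repository code; `aggregate` is the mean-vs-max switch mentioned above):

```python
import numpy as np

def sample_position(games, aggregate=np.mean):
    # `games` is a list where each entry holds the per-position priorities
    # of one game (illustrative structure, not the real buffer layout).
    # Stage 1: sample a game proportionally to its aggregated priority.
    game_priorities = np.array([aggregate(p) for p in games])
    game_probs = game_priorities / game_priorities.sum()
    game_index = np.random.choice(len(games), p=game_probs)

    # Stage 2: sample a position inside the chosen game proportionally
    # to the per-position priorities.
    position_priorities = np.asarray(games[game_index], dtype=float)
    position_probs = position_priorities / position_priorities.sum()
    position_index = np.random.choice(len(position_priorities), p=position_probs)

    return game_index, position_index
```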

  • I also added an option to choose whether or not to use prioritized replay (self.PER in the config file). If set to False, the algorithm never updates the priorities, so all transitions keep equal priorities (currently 1.0), which is equivalent to uniform sampling.

  • I made the alpha parameter of the PER algorithm configurable by adding self.PER_alpha to the config file.
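For reference, the two options could sit in the game config roughly like this (a sketch; the class name follows the repository's convention and the default values are placeholders):

```python
class MuZeroConfig:
    def __init__(self):
        # Prioritized replay options (sketch; default values are placeholders)
        self.PER = True       # False -> priorities are never updated and stay at 1.0, i.e. uniform sampling
        self.PER_alpha = 1.0  # alpha exponent applied to priorities, as in the PER paper
```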

For the remaining items on the list, here are my thoughts:

  • Add the loss scaling using the importance sampling ratio. (I'm having trouble figuring out how to do this without turning the buffer into a very long list with all the steps of each game)

        I think to calculate the IS weights for each sample, we need $N$ and $P(i)$ in this formula:

        $$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$

        where $N$ is the total number of transitions in the buffer, and $P(i)$ is the probability of transition $i$ being sampled, which can be computed from the game priorities and the transition priorities (a rough sketch follows after this list).

  • Maybe assign an initial value of probabilities based on the loss of root.value and the predicted value in MCTS (or 1 as you did, could be a parameter).

        As suggested by the Distributed Prioritized Experience Replay paper (page 4, paragraph 2), the initial priority can either be set to the "maximum priority seen so far" (which performs well when the replay buffer is small but does not scale to cases with many actors and a large replay buffer) or to the "current n-step TD error" (as you suggested, which requires an additional prediction step). We can make this configurable (see the sketch below).
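To make the IS-weight and initial-priority points concrete, here is a rough sketch; the function names, the `beta` argument, and the `use_max` switch are illustrative, not existing code:

```python
def importance_sampling_weight(game_prob, position_prob, n_total, beta=1.0):
    # P(i) under the two-stage scheme: probability of sampling the game
    # times the probability of sampling the position inside that game.
    p_i = game_prob * position_prob
    # w_i = (1 / (N * P(i)))**beta, as in the PER paper.
    return (1.0 / (n_total * p_i)) ** beta


def initial_priority(max_priority_so_far, td_error=None, use_max=True):
    # use_max=True: "maximum priority seen so far" -- cheap, but may not
    # scale when there are many actors and a large replay buffer.
    # use_max=False: TD-style error |root value - predicted value|,
    # which requires an additional prediction step.
    return max_priority_so_far if use_max else abs(td_error)
```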

Let me know if my points make sense to you!

@werner-duvaud
Owner

Hi,

Thank you for the update and the answers.

It seems clear to me.

Should we scale the IS weights by the maximum one as suggested in the Prioritized Experience Replay paper page 5?

For stability reasons, we always normalize weights by $1/\max_i w_i$ so that they only scale the update downwards
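
Concretely, that normalization would amount to something like this (a sketch, assuming `weights` holds the IS weights of a sampled batch as a NumPy array):

```python
import numpy as np

weights = np.array([0.2, 0.5, 1.3])  # IS weights of a sampled batch (dummy values)
weights = weights / weights.max()    # normalize by the max so updates are only scaled downwards
```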

@werner-duvaud werner-duvaud merged commit 2c3b9fb into werner-duvaud:prioritized_replay Mar 23, 2020
egafni pushed a commit to egafni/muzero-general that referenced this pull request Apr 15, 2021
EpicLiem pushed a commit to EpicLiem/muzero-general-chess-archive that referenced this pull request Feb 4, 2023