Hi,
Thanks for the information!

For the 'wrap' mode in `numpy.put()`, you are correct; I fixed that line in this commit. For the mean, I don't have a reference. Since we first sample games and then sample transitions from those games, we need two prioritized sampling processes, which differs from the Prioritized Experience Replay paper. An alternative to the mean could be the max, since after averaging, transitions with high priorities may be washed out. I'm still testing these two schemes.
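As a rough illustration of the two aggregation schemes (the function and variable names here are hypothetical, not the actual implementation):

```python
import numpy as np

def game_priority(transition_priorities, mode="mean"):
    # Aggregate per-transition priorities into one game-level priority used by
    # the first (game-level) stage of the two-level prioritized sampling.
    # "mean" can wash out a few high-priority transitions inside a long game,
    # whereas "max" keeps the game likely to be sampled as long as it contains
    # at least one high-priority transition.
    if mode == "mean":
        return np.mean(transition_priorities)
    if mode == "max":
        return np.max(transition_priorities)
    raise ValueError(f"unknown aggregation mode: {mode}")
```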
I also added an option to choose whether or not to use prioritized replay (`self.PER` in the config file). If it is set to False, the algorithm never updates the priorities, so all transitions keep equal priorities (currently 1.0), which is equivalent to uniform sampling. I also made the alpha in the PER algorithm configurable by adding `self.PER_alpha` to the config file.
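To make the new options concrete, here is a rough sketch of the relevant config fields (only the names `self.PER` and `self.PER_alpha` come from the actual config; the surrounding class is just a placeholder):

```python
class Config:
    def __init__(self):
        # If False, priorities are never updated and stay at 1.0,
        # so prioritized sampling degenerates to uniform sampling.
        self.PER = True
        # Exponent alpha from the PER paper; priorities are used as p_i ** alpha,
        # so alpha = 0 also recovers uniform sampling.
        self.PER_alpha = 1.0
```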
For the remaining items on the list, here are my thoughts:

I think that to calculate the IS weights for each sample, we need $N$ and $P(i)$ in this formula:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

where $P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$ is the sampling probability of transition $i$ and $N$ is the replay buffer size, as defined in the Prioritized Experience Replay paper.
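A minimal sketch of that computation, assuming we have the full array of buffer priorities and the indices of the sampled transitions (the function name and signature are hypothetical):

```python
import numpy as np

def importance_sampling_weights(priorities, sampled_idx, alpha, beta):
    # P(i) = p_i^alpha / sum_k p_k^alpha over the whole buffer of size N
    probs = priorities ** alpha
    probs = probs / probs.sum()
    N = len(priorities)
    # w_i = (1/N * 1/P(i))^beta, then normalized by the maximum weight for
    # stability, as done in the PER paper.
    weights = (N * probs[sampled_idx]) ** (-beta)
    return weights / weights.max()
```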
As suggested by the Distributed Prioritized Experience Replay paper (page 4, paragraph 2), the initial priority can either be set to the maximum priority seen so far (which performs well when the replay buffer is small but does not scale to cases with many actors and a large replay buffer) or to the current n-step TD error (as you suggested, which requires an additional prediction step). We can make this configurable.
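A hedged sketch of how the two initialization schemes could be exposed behind a config switch (all names here, including `initial_priority_mode` and `predict_value`, are hypothetical placeholders):

```python
def initial_priority(config, max_priority_seen, model, observation, n_step_return):
    if config.initial_priority_mode == "max_seen":
        # Cheap: reuse the maximum priority seen so far. Works well with a small
        # replay buffer but becomes stale with many actors / a large buffer.
        return max_priority_seen
    else:
        # "td_error": pay for one extra prediction step and use the current
        # n-step TD error as the initial priority.
        predicted_value = model.predict_value(observation)
        return abs(n_step_return - predicted_value)
```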
Let me know if my points make sense to you!