-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pid port + duplicate rank_zero logging #2231
Conversation
@ZhaofengWu mind putting your Full name + email here for the commit? |
Zhaofeng Wu; zfw7@cs.washington.edu :) |
This comment should also be changed then |
Codecov Report
@@ Coverage Diff @@
## master #2231 +/- ##
======================================
- Coverage 88% 88% -0%
======================================
Files 70 70
Lines 5415 5421 +6
======================================
+ Hits 4759 4761 +2
- Misses 656 660 +4 |
unfortunately, this way of co-authoring is pure git thing and dos do not appear in any GitHub stats, see my comment here #1920 (comment) |
if 'LOCAL_RANK' in os.environ: | ||
rank_zero_only.rank = os.environ['LOCAL_RANK'] | ||
if 'SLURM_JOB_ID' in os.environ: | ||
rank_zero_only.rank = os.environ['SLURM_JOB_ID'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line hear breaks logging in my setup. rank_zero_only.rank is set to a non-zero value (the slurm job id) and thus non of the logging functions are ever executed. Did you mean SLURM_PROCID
which is the MPI rank?
I think this pull request breaks logging in some setups (when running on slurm clusters). See https://github.com/PyTorchLightning/pytorch-lightning/pull/2231/files#r443131338 (sorry for the double post, I did not realize that the review also shows up in this thread) |
Implements #2140 in a way that works.
For local ddp, we now use the process id as a numpy sub-seed to generate a random port. This way since the user sets the full seed we can use the parent process id as a seed before spawning off children.
Also solves an issue where on init when ranks are know, we shouldn't see duplicate messages