Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug-Fix]:properties current_epoch and global_step between model and trainer same always #3785

Merged

Conversation

nrupatunga
Copy link
Contributor

@nrupatunga nrupatunga commented Oct 2, 2020

What does this PR do?

  • This PR makes sure that properties current_epoch and global_step shared between the lightning module and the trainer remain the same always.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

Did you have fun?

Yes I had fun learning from Adrian 🙃

@mergify mergify bot requested a review from a team October 2, 2020 02:25
@@ -122,6 +116,14 @@ def __init__(self, *args, **kwargs):
def example_input_array(self) -> Any:
return self._example_input_array

@property
def current_epoch(self):
return self.trainer.current_epoch if self.trainer else 0
Copy link
Contributor

@ananthsub ananthsub Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be -1 if self.trainer isn't set? that could indicate that this is currently outside of training

Copy link
Contributor Author

@nrupatunga nrupatunga Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a nice addition but isn't trainer = None, when it isn't set and that could be an indication too?

  • have set it to 0 as previously it was self.current_epoch = 0 in __init__

nrupatunga and others added 2 commits October 2, 2020 11:40
type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
@nrupatunga nrupatunga removed the request for review from a team October 2, 2020 06:13
@mergify mergify bot requested a review from a team October 2, 2020 06:13
@nrupatunga nrupatunga removed the request for review from a team October 2, 2020 06:14
@mergify mergify bot requested a review from a team October 2, 2020 06:15
@ydcjeff
Copy link
Contributor

ydcjeff commented Oct 2, 2020

Can you please re-add those properties comments?

@Borda Borda added the bug Something isn't working label Oct 2, 2020
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind share motivation of this addition since it is just one-to-one mapping to trainer attribute...?

tests/models/test_restore.py Outdated Show resolved Hide resolved
@mergify mergify bot requested a review from a team October 2, 2020 06:53
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@nrupatunga nrupatunga requested review from Borda and removed request for a team October 2, 2020 07:07
@mergify mergify bot requested a review from a team October 2, 2020 07:08
@nrupatunga
Copy link
Contributor Author

Yes, this doesn't affect users.
But I think if we mention in the docs, other users or newcomers can know this they are the same and don't need to ask questions or try comparing them.

What do you think?

@ydcjeff

Should I mention that self.current_epoch will be same as trainer.current_epoch. (similarly for global_step)

I was thinking if that information is really needed for the user to know? As the user will care only about if he could access those at any point in time and expect it to be right.

@ydcjeff
Copy link
Contributor

ydcjeff commented Oct 3, 2020

I am suggesting that I prefer to have them in the docs. But, core contributors will decide to add it or not.

@nrupatunga
Copy link
Contributor Author

nrupatunga commented Oct 3, 2020

I am suggesting that I prefer to have them in the docs. But, core contributors will decide to add it or not.

I agree, I think adding has an advantage that the user will have clarity that self.current_epoch = trainer.current_epoch

@nrupatunga
Copy link
Contributor Author

nrupatunga commented Oct 3, 2020

@ydcjeff @ananthsub
Also, now adding change logs resulted in 4 failed check, Can you help me on why this might be happening

one of the logs:
Failed to fetch https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/Packages.gz File has unexpected size (47871 != 49498). Mirror sync in progress? [IP: 152.195.19.142 443]

Is there a way to rerun them

@ydcjeff
Copy link
Contributor

ydcjeff commented Oct 3, 2020

Don't worry about the failed ones, these are unrelated to this PR.

@nrupatunga nrupatunga requested review from awaelchli and removed request for a team October 3, 2020 16:50
@mergify mergify bot requested a review from a team October 3, 2020 16:51
@nrupatunga nrupatunga removed the request for review from a team October 3, 2020 16:51
@mergify mergify bot requested a review from a team October 3, 2020 16:51
Copy link
Contributor

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice

CHANGELOG.md Outdated Show resolved Hide resolved
@mergify mergify bot requested a review from a team October 4, 2020 00:01
@nrupatunga nrupatunga removed the request for review from a team October 4, 2020 02:31
@mergify mergify bot requested a review from a team October 4, 2020 02:31
@Borda
Copy link
Member

Borda commented Oct 4, 2020

Seems that some tests are failing...

@mergify mergify bot requested a review from a team October 4, 2020 13:04
Comment on lines +40 to +42
def test_resume_from_checkpoint(tmpdir):
""" Test that properties like `current_epoch` and `global_step`
in model and trainer are always the same. """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should extend this test somehow to check for all the things that are required to be restored. Not here maybe.

@mergify mergify bot requested a review from a team October 4, 2020 13:31
@mergify mergify bot requested a review from a team October 4, 2020 13:33
@edenlightning edenlightning added this to the 1.0 milestone Oct 4, 2020
@williamFalcon williamFalcon merged commit 7d47ed1 into Lightning-AI:master Oct 5, 2020
@Borda Borda modified the milestones: 1.0, 0.10.0 Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants