Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support num_sanity_val_steps=-1 #2246

Merged
merged 19 commits into from
Jul 23, 2020
Merged

support num_sanity_val_steps=-1 #2246

merged 19 commits into from
Jul 23, 2020

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Jun 18, 2020

What does this PR do?

Fixes #1715
Trainer(num_sanity_val_steps=-1) is now possible and will run all val dataloaders in full.

TODO:

  • is limit_val_batches supposed to influence sanity val steps? ok, clarified by william

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team June 18, 2020 21:53
@awaelchli awaelchli added the feature Is an improvement or enhancement label Jun 18, 2020
@awaelchli awaelchli force-pushed the feat/all_sanity_steps branch from 8b50ecf to d375e2a Compare June 20, 2020 10:13
@codecov
Copy link

codecov bot commented Jun 20, 2020

Codecov Report

Merging #2246 into master will decrease coverage by 0%.
The diff coverage is 92%.

@@          Coverage Diff           @@
##           master   #2246   +/-   ##
======================================
- Coverage      92%     92%   -0%     
======================================
  Files          74      74           
  Lines        6316    6322    +6     
======================================
+ Hits         5786    5791    +5     
- Misses        530     531    +1     

@pep8speaks
Copy link

pep8speaks commented Jun 22, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-23 10:48:02 UTC

@awaelchli
Copy link
Contributor Author

@williamFalcon is limit_val_batches=0 supposed to influence num_sanity_val_steps, i.e., should
Trainer(num_sanity_val_steps=5, limit_val_batches=0) run 5 checks or none? Currently it runs none.

@awaelchli awaelchli changed the title support sanity_val_step=-1 support num_sanity_val_steps=-1 Jun 22, 2020
@awaelchli awaelchli marked this pull request as ready for review June 22, 2020 23:18
@mergify
Copy link
Contributor

mergify bot commented Jun 23, 2020

This pull request is now in conflict... :(

@awaelchli awaelchli requested review from Borda and williamFalcon June 23, 2020 23:26
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

tests/trainer/test_trainer.py Outdated Show resolved Hide resolved
@Borda Borda added the ready PRs ready to be merged label Jun 26, 2020
@Borda Borda requested review from MattPainter01, neggert and a team June 26, 2020 23:05
@Borda
Copy link
Member

Borda commented Jun 26, 2020

is limit_val_batches supposed to influence sanity val steps?

it depends if you shuffle them... @williamFalcon?

@Borda Borda removed the ready PRs ready to be merged label Jun 26, 2020
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@mergify
Copy link
Contributor

mergify bot commented Jun 27, 2020

This pull request is now in conflict... :(

@awaelchli
Copy link
Contributor Author

awaelchli commented Jun 29, 2020

@williamFalcon Ok, I hope I understood you correctly. I updated the PR. Here are some examples of how it currently is:

# runs exactly 5 sanity val steps
Trainer(num_sanity_val_steps=5)

# runs exactly 5 sanity val steps
Trainer(num_sanity_val_steps=5, limit_val_batches=3)

# runs 0 sanity val steps
Trainer(num_sanity_val_steps=5, limit_val_batches=0)

# runs 0 sanity val steps (but run fast_dev check batches as usual)
Trainer(num_sanity_val_steps=5, limit_val_batches=3, fast_dev_run=True)

We can also substitute 5 with -1 of course.
Does this make sense?

@williamFalcon
Copy link
Contributor

williamFalcon commented Jun 29, 2020

almost perfect!

fast_dev_run always runs 1 of each and completes.

@awaelchli
Copy link
Contributor Author

awaelchli commented Jun 29, 2020

yes exactly, I did not change that. I just did not conceptually count it as "sanity check" because in the code it follows a different path. (in fact, fast_dev_run sets num_sanity_val_steps=0 in Trainer init)

# runs 0 sanity steps, 1 val step and 1 train step.
so Trainer(num_sanity_val_steps=5, fast_dev_run=True)

@mergify
Copy link
Contributor

mergify bot commented Jul 1, 2020

This pull request is now in conflict... :(

@mergify
Copy link
Contributor

mergify bot commented Jul 1, 2020

This pull request is now in conflict... :(

@awaelchli awaelchli requested a review from Borda July 11, 2020 22:45
@awaelchli
Copy link
Contributor Author

@williamFalcon kindly requesting your review :)

@mergify
Copy link
Contributor

mergify bot commented Jul 14, 2020

This pull request is now in conflict... :(

@williamFalcon
Copy link
Contributor

@awaelchli this looks great. Can you rebase so we can merge?

@@ -302,7 +302,7 @@ def init_test_tqdm(self) -> tqdm:
def on_sanity_check_start(self, trainer, pl_module):
super().on_sanity_check_start(trainer, pl_module)
self.val_progress_bar = self.init_sanity_tqdm()
self.val_progress_bar.total = trainer.num_sanity_val_steps * len(trainer.val_dataloaders)
self.val_progress_bar.total = convert_inf(trainer.num_sanity_val_steps * len(trainer.val_dataloaders))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tqdm does not understand float("inf"), so we have to convert it to None in the case num_sanity_val_steps=inf or dataloader has inf length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right but where is this imported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see... just kind of stuck at the bottom lol

@mergify mergify bot requested a review from a team July 23, 2020 00:32
@awaelchli
Copy link
Contributor Author

@williamFalcon rebase done.

@property
def disable_validation(self) -> bool:
""" Check if validation is disabled during training. """
disable_validation = not (self.is_overridden('validation_step') and self.limit_val_batches > 0) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when all are not, would it be easier to write enable_validation and negate it here? (vise verso to this state)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mergify mergify bot requested a review from a team July 23, 2020 09:39
@Borda Borda self-requested a review July 23, 2020 09:39
@williamFalcon
Copy link
Contributor

@awaelchli why does this drop coverage so much? maybe the logical part was wrong?

@awaelchli
Copy link
Contributor Author

awaelchli commented Jul 23, 2020

@williamFalcon it is because it fails to upload the coverage, http error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run full validation epoch before training
4 participants