Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support EarlyStoppingHook #739

Merged
merged 31 commits into from
Mar 6, 2023

Conversation

nijkah
Copy link
Contributor

@nijkah nijkah commented Nov 17, 2022

Motivation

Closes #356

Modification

Added a new variable stop_training in EpochBasedTrainLoop and IterBasedTrainLoop.
Edited run method as

    def run(self) -> torch.nn.Module:
        """Launch training."""
        while self._epoch < self._max_epochs and not self.stop_training:
            self.run_epoch()

**
EarlyStoppingHook itself does not save the best checkpoint. Should we support it?

I referred to some logics from

mmengine/hooks/early_stopping_hook.py Show resolved Hide resolved
mmengine/hooks/early_stopping_hook.py Outdated Show resolved Hide resolved
mmengine/hooks/early_stopping_hook.py Outdated Show resolved Hide resolved
@nijkah nijkah requested review from HAOCHENYE and removed request for RangiLyu and zhouzaida November 18, 2022 11:43
@codecov
Copy link

codecov bot commented Nov 19, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@25dfe41). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head fe6ff30 differs from pull request most recent head 046d7be. Consider uploading reports for the commit 046d7be to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #739   +/-   ##
=======================================
  Coverage        ?   76.71%           
=======================================
  Files           ?      139           
  Lines           ?    10921           
  Branches        ?     2184           
=======================================
  Hits            ?     8378           
  Misses          ?     2186           
  Partials        ?      357           
Flag Coverage Δ
unittests 76.71% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@nijkah
Copy link
Contributor Author

nijkah commented Nov 19, 2022

Hi @HAOCHENYE, I have a little question.
In the UT with initializing Runner, the error raises when the experiment_name of Runner isn't specified. Also, it happens when the experiment_name is duplicated with the other UTs although there is a logic for cleaning its temporary directory.

Is it intended (the developers should consider) or unintended behavior?

@HAOCHENYE
Copy link
Collaborator

Hi @HAOCHENYE, I have a little question. In the UT with initializing Runner, the error raises when the experiment_name of Runner isn't specified. Also, it happens when the experiment_name is duplicated with the other UTs although there is a logic for cleaning its temporary directory.

Is it intended (the developers should consider) or unintended behavior?

Usually, if the user does not specify the experiment name, Runner will automatically generate the experiment name and each Runner instance will have unique experiment names with different timestamp suffix. However, if the time interval between Runner's creation is so short that the timestamp suffixes are identical, it could lead to Runners with duplicate experiment names, which is not allowed in current design.

MMEngine has multiple global variances such as MessageHub, Visualizer and DefaultScope, and all these variables are marked with the experiment name of Runner. Global variables are very dangerous, therefore these variables must have unique names.

As for the unit tests will often raise an error about the duplicated experiment name, it is caused by the irregularity of the current unit testing, and we plan to refactor the unit tests, and provide a more general function to build Runner in unit tests.

@HAOCHENYE HAOCHENYE added this to the 0.5.0 milestone Nov 20, 2022
@nijkah
Copy link
Contributor Author

nijkah commented Nov 21, 2022

One more question! I am thinking about which experiments we should consider for no improvement.

Current implementation retains top scores ignoring the recent inferior results.
But it may be more intuitive to retain latest scores to check the trend of training.

Which one will be better?

@HAOCHENYE
Copy link
Collaborator

One more question! I am thinking about which experiments we should consider for no improvement.

Current implementation retains top scores ignoring the recent inferior results. But it may be more intuitive to retain latest scores to check the trend of training.

Which one will be better?

Additional notes: We could also borrow the implementation in

https://github.com/Lightning-AI/lightning/blob/9a4e8a8c528d6475ab33c46a0a84e27273cc10bd/src/pytorch_lightning/callbacks/early_stopping.py#L38

Back to the topic, it could make more sense to me to directly compare the latest metric with the previous best Nth metrics and finally update the Nth metrics with the latest one, rather than first update the Nth metrics, and then compare them 🤣. what do you think about it?

@nijkah
Copy link
Contributor Author

nijkah commented Nov 22, 2022

I think your provided link looks neat. I'll rewrite the code especially for stopping logic and __init__ parameters.

@zhouzaida
Copy link
Collaborator

Hi @nijkah , thanks for your contribution. We are planning to merge this PR today and release a new version 0.6.0. The feature to resume can be supported in the future if users require it. In addition, this PR can be merged after resolving the conflict.

@zhouzaida I apologize for not finishing the rest work. I think it is okay to merge right now. I'll try to stay tuned to the future work for this.

Hi @nijkah , thanks for your contribution again 👍.

nijkah and others added 2 commits February 23, 2023 22:07
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
zhouzaida
zhouzaida previously approved these changes Mar 5, 2023
@zhouzaida zhouzaida modified the milestones: 0.6.0, 0.7.0 Mar 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Does mmengine implement the early stopping? Support Early Stop
5 participants