Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added accumulation of loggers' metrics for the same steps #1278

Conversation

alexeykarnachev
Copy link
Contributor

@alexeykarnachev alexeykarnachev commented Mar 29, 2020

Before submitting

  • Was this discussed/approved via a Github issue? #1173
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1173.
This PR adds new method LightningLoggerBase.agg_and_log_metrics. This method has the same signature as and abstract LightningLoggerBase.log_metrics method.
New method aggregates metrics with the same step value and then, delegates the logging procedure to log_metrics.
From the client perspective nothing has changed. An abstract log_metrics still in place and it still needs to be implemented in the child loggers, but the Trainer now calls the agg_and_log_metrics.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Mar 29, 2020

Hello @alexeykarnachev! Thanks for updating this PR.

Line 558:21: E731 do not assign a lambda expression, use a def

Comment last updated at 2020-04-07 16:06:12 UTC

@alexeykarnachev alexeykarnachev changed the title Added accumulation of loggers' metrics for the same steps [WIP] Added accumulation of loggers' metrics for the same steps Mar 29, 2020
@codecov
Copy link

codecov bot commented Mar 29, 2020

Codecov Report

Merging #1278 into master will decrease coverage by 0%.
The diff coverage is 73%.

@@          Coverage Diff           @@
##           master   #1278   +/-   ##
======================================
- Coverage      92%     91%   -0%     
======================================
  Files          63      64    +1     
  Lines        3332    3369   +37     
======================================
+ Hits         3059    3079   +20     
- Misses        273     290   +17     

@alexeykarnachev
Copy link
Contributor Author

Oh, I did the merge instead of rebase. sry

@alexeykarnachev alexeykarnachev changed the title [WIP] Added accumulation of loggers' metrics for the same steps Added accumulation of loggers' metrics for the same steps Mar 29, 2020
@Borda Borda requested review from ethanwharris, a team and jeremyjordan March 29, 2020 21:50
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not very easy to follow what is the usage since the new functions re not used anywhere...

pytorch_lightning/loggers/metrics_agg.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/metrics_agg.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/metrics_agg.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/base.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/base.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/metrics_agg.py Outdated Show resolved Hide resolved
tests/loggers/test_metrics_agg.py Outdated Show resolved Hide resolved
@alexeykarnachev alexeykarnachev changed the title Added accumulation of loggers' metrics for the same steps [WIP] Added accumulation of loggers' metrics for the same steps Mar 30, 2020
@mergify
Copy link
Contributor

mergify bot commented Mar 30, 2020

This pull request is now in conflict... :(

@alexeykarnachev alexeykarnachev changed the title [WIP] Added accumulation of loggers' metrics for the same steps Added accumulation of loggers' metrics for the same steps Mar 30, 2020
@Borda Borda requested a review from a team March 30, 2020 16:27
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way of adding abstraction with agg_and_log_metrics, but feel that the logger usage is too rigid...
cc: @PyTorchLightning/core-contributors @justusschock ^^

pytorch_lightning/loggers/metrics_agg.py Outdated Show resolved Hide resolved
pytorch_lightning/loggers/base.py Outdated Show resolved Hide resolved
@mergify
Copy link
Contributor

mergify bot commented Mar 30, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team March 30, 2020 22:33
@Borda
Copy link
Member

Borda commented Mar 31, 2020

@justusschock I think that it would be nice to help @alexeykarnachev with growing some aggregation which may be later useful also for metrics than shaking there and back... do you have any suggestion/idea?

@justusschock
Copy link
Member

justusschock commented Mar 31, 2020

@Borda @alexeykarnachev back when I've written my own framework I solved it like this (of course replace the numpy function with something appropriate/torch functions):
https://github.com/delira-dev/delira/blob/master/delira/utils/dict_reductions.py

Then each logger was given a dict of key, reduce_specifier like this:

{'key1': 'max', 'key2': 'min'}

and we had a fallback function (mean) for non-specified keys.
We basically did this, because you would probably want to log scalars more often then images. And we used the same for metric reduction.

For reference, this was our whole logger implementation (in some parts not the most beautiful code :D ):
https://github.com/delira-dev/delira/blob/master/delira/logging/base_logger.py

@williamFalcon
Copy link
Contributor

@Borda @alexeykarnachev @justusschock can we get this merged soon so we can add to 0.7.2?

@Borda
Copy link
Member

Borda commented Apr 2, 2020

not yet we have some API disagreement...

@alexeykarnachev
Copy link
Contributor Author

alexeykarnachev commented Apr 2, 2020

planning to add all discussed fixes tomorrow

@Borda Borda added bug Something isn't working feature Is an improvement or enhancement labels Apr 3, 2020
@Borda Borda added this to the 0.7.2 milestone Apr 3, 2020
@alexeykarnachev
Copy link
Contributor Author

Now, there are two arguments in the LightningLoggerBase. They have default values, so the
inherited classes won't break when they call super.
But what do you think about adding these arguments to all inherited classes constructors explicitly (also with default value)?

Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykarnachev I made some updates, could you pls check it?

@williamFalcon
Copy link
Contributor

@alexeykarnachev we're preparing the 0.7.2 release for tomorrow. would be good to get this in there :)

@alexeykarnachev alexeykarnachev changed the title [WIP] Added accumulation of loggers' metrics for the same steps Added accumulation of loggers' metrics for the same steps Apr 7, 2020
@Borda Borda requested review from justusschock and a team April 7, 2020 15:16
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls add note to the changelog (extra line) about refactoring / remam TensorRunningAccum

@Borda Borda added the ready PRs ready to be merged label Apr 7, 2020
@Borda
Copy link
Member

Borda commented Apr 7, 2020

@alexeykarnachev GREAT job and sorry for overwriting some of your work...

>>> d2 = {'a': 1.1, 'b': 2.2, 'v': 1}
>>> d3 = {'a': 1.1, 'v': 2.3}
>>> dflt_func = min
>>> agg_funcs = {'a': np.mean, 'v': max}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't numpy functions slow things down because everything needs to go to cpu? we don't want to move things to CPU for the user ever haha. Every cpu calls slows training down a ton

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to have it rather as Tensor...

Copy link
Member

@Borda Borda Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykarnachev we may use the Running accum or make another for full accum and extending every N steps and copy existing...
https://discuss.pytorch.org/t/dynamically-extend-the-tensor/39553/2
or https://discuss.pytorch.org/t/appending-to-a-tensor/2665/9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the structure will change from list of dict to dict of tensors but not sure if it makes much faster... also it will allow us to use agg implemented for Torch

Copy link
Contributor Author

@alexeykarnachev alexeykarnachev Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I thought, that at this point all values are already on cpu.
Here we call the log procedure:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L74

And few lines above this call, we transform the metrics to the scalars:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L64

So, do we really need tensors here? Metrics which come to the LightningLoggerBase are already itemed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Borda , the TrainerLoggingMixin forces all metrics to be a scalars. And it seems not very convenient to transform them back to tensors. Besides, as I see, metrics could never be a tensor. For example, I can pass scalar value metric even on the training_step, like so:

    def training_step(self, batch, batch_idx):
        loss, logits, _ = self.forward(batch)
        lr = self.trainer.optimizers[0].param_groups[0]['lr']
        log = {'Loss/train': loss, 'Learning-Rate': lr}
        return {'loss': loss, 'log': log}

Here, the lr is a scalar, it's on cpu and it is not a tensor.

What do you think on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, let's get this done and think about speedup later... :]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok perfect.

this might actually be the main cause of the minimal speed discrepancy between lightning and pure pytorch.

@Borda Borda added ready PRs ready to be merged and removed ready PRs ready to be merged labels Apr 7, 2020
@williamFalcon williamFalcon merged commit ddbf7de into Lightning-AI:master Apr 8, 2020
@alexeykarnachev alexeykarnachev deleted the fix-acuumulate-batches-logging branch April 8, 2020 13:13
alsrgv added a commit to carbonrobotics/pytorch-lightning that referenced this pull request Apr 11, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Jun 7, 2020
…AI#1278)

* `add_argparse_args` method fixed (argument types added)

* autopep8 fixes

* --gpus=0 removed from test (for ci tests)

* Update pytorch_lightning/trainer/trainer.py

Co-Authored-By: Joe Davison <joe@huggingface.co>

* test_with_accumulate_grad_batches added

* agg_and_log_metrics logic added to the base logger class

* small format fix

* agg metrics strategies removed (not to complicate stuff)

* agg metrics: handle zero step

* autopep8

* changelog upd

* flake fix

* metrics aggregators factored out, metrics_agg.py added + tests

* metrics agg default value added

* Update pytorch_lightning/loggers/metrics_agg.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* metrics aggregators factored out, metrics_agg.py added + tests

* metrics agg default value added

* Update pytorch_lightning/loggers/metrics_agg.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* remove .item which causes sync issues (Lightning-AI#1254)

* remove .item which causes sync issues

* fixed gradient acc sched

* fixed gradient acc sched

* test_metrics_agg.py removed (all tested in doctrings), agg metrics refactored

* test_metrics_agg.py removed (all tested in doctrings), agg metrics refactored

* autopep8

* loggers base.py types fixed

* test

* test

* metrics aggregation for loggers: each key now has a specific function (or default one)

* metrics aggregation for loggers: each key now has a specific function (or default one)

* docstrings upd

* manual typehints removed from docstrings

* batch_size decreased for test `test_with_accumulate_grad_batches`

* extend running accum

* refactor

* fix tests

* fix tests

* allowed_types generator scoped

* trainer.py distutils was imported twice, fixed

* TensorRunningAccum refactored

* TensorRunningAccum added to change log (Changed)

* change log pull link added

Co-authored-by: Joe Davison <joe@huggingface.co>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature Is an improvement or enhancement ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metrics are logged on each batch, but not on each accum. step
8 participants