-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added accumulation of loggers' metrics for the same steps #1278
Added accumulation of loggers' metrics for the same steps #1278
Conversation
Hello @alexeykarnachev! Thanks for updating this PR.
Comment last updated at 2020-04-07 16:06:12 UTC |
Codecov Report
@@ Coverage Diff @@
## master #1278 +/- ##
======================================
- Coverage 92% 91% -0%
======================================
Files 63 64 +1
Lines 3332 3369 +37
======================================
+ Hits 3059 3079 +20
- Misses 273 290 +17 |
Oh, I did the merge instead of rebase. sry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not very easy to follow what is the usage since the new functions re not used anywhere...
This pull request is now in conflict... :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the way of adding abstraction with agg_and_log_metrics
, but feel that the logger usage is too rigid...
cc: @PyTorchLightning/core-contributors @justusschock ^^
This pull request is now in conflict... :( |
@justusschock I think that it would be nice to help @alexeykarnachev with growing some aggregation which may be later useful also for metrics than shaking there and back... do you have any suggestion/idea? |
@Borda @alexeykarnachev back when I've written my own framework I solved it like this (of course replace the numpy function with something appropriate/torch functions): Then each logger was given a dict of key, reduce_specifier like this:
and we had a fallback function (mean) for non-specified keys. For reference, this was our whole logger implementation (in some parts not the most beautiful code :D ): |
@Borda @alexeykarnachev @justusschock can we get this merged soon so we can add to 0.7.2? |
not yet we have some API disagreement... |
planning to add all discussed fixes tomorrow |
Now, there are two arguments in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykarnachev I made some updates, could you pls check it?
@alexeykarnachev we're preparing the 0.7.2 release for tomorrow. would be good to get this in there :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls add note to the changelog (extra line) about refactoring / remam TensorRunningAccum
@alexeykarnachev GREAT job and sorry for overwriting some of your work... |
>>> d2 = {'a': 1.1, 'b': 2.2, 'v': 1} | ||
>>> d3 = {'a': 1.1, 'v': 2.3} | ||
>>> dflt_func = min | ||
>>> agg_funcs = {'a': np.mean, 'v': max} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't numpy functions slow things down because everything needs to go to cpu? we don't want to move things to CPU for the user ever haha. Every cpu calls slows training down a ton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so to have it rather as Tensor...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykarnachev we may use the Running accum or make another for full accum and extending every N steps and copy existing...
https://discuss.pytorch.org/t/dynamically-extend-the-tensor/39553/2
or https://discuss.pytorch.org/t/appending-to-a-tensor/2665/9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the structure will change from list of dict to dict of tensors but not sure if it makes much faster... also it will allow us to use agg implemented for Torch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. I thought, that at this point all values are already on cpu.
Here we call the log procedure:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L74
And few lines above this call, we transform the metrics to the scalars:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L64
So, do we really need tensors here? Metrics which come to the LightningLoggerBase
are already item
ed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Borda , the TrainerLoggingMixin
forces all metrics to be a scalars. And it seems not very convenient to transform them back to tensors. Besides, as I see, metrics could never be a tensor. For example, I can pass scalar value metric even on the training_step
, like so:
def training_step(self, batch, batch_idx):
loss, logits, _ = self.forward(batch)
lr = self.trainer.optimizers[0].param_groups[0]['lr']
log = {'Loss/train': loss, 'Learning-Rate': lr}
return {'loss': loss, 'log': log}
Here, the lr
is a scalar, it's on cpu and it is not a tensor.
What do you think on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, let's get this done and think about speedup later... :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok perfect.
this might actually be the main cause of the minimal speed discrepancy between lightning and pure pytorch.
…ghtning-AI#1278)" This reverts commit ddbf7de.
…AI#1278) * `add_argparse_args` method fixed (argument types added) * autopep8 fixes * --gpus=0 removed from test (for ci tests) * Update pytorch_lightning/trainer/trainer.py Co-Authored-By: Joe Davison <joe@huggingface.co> * test_with_accumulate_grad_batches added * agg_and_log_metrics logic added to the base logger class * small format fix * agg metrics strategies removed (not to complicate stuff) * agg metrics: handle zero step * autopep8 * changelog upd * flake fix * metrics aggregators factored out, metrics_agg.py added + tests * metrics agg default value added * Update pytorch_lightning/loggers/metrics_agg.py Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com> * metrics aggregators factored out, metrics_agg.py added + tests * metrics agg default value added * Update pytorch_lightning/loggers/metrics_agg.py Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com> * remove .item which causes sync issues (Lightning-AI#1254) * remove .item which causes sync issues * fixed gradient acc sched * fixed gradient acc sched * test_metrics_agg.py removed (all tested in doctrings), agg metrics refactored * test_metrics_agg.py removed (all tested in doctrings), agg metrics refactored * autopep8 * loggers base.py types fixed * test * test * metrics aggregation for loggers: each key now has a specific function (or default one) * metrics aggregation for loggers: each key now has a specific function (or default one) * docstrings upd * manual typehints removed from docstrings * batch_size decreased for test `test_with_accumulate_grad_batches` * extend running accum * refactor * fix tests * fix tests * allowed_types generator scoped * trainer.py distutils was imported twice, fixed * TensorRunningAccum refactored * TensorRunningAccum added to change log (Changed) * change log pull link added Co-authored-by: Joe Davison <joe@huggingface.co> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
Before submitting
What does this PR do?
Fixes #1173.
This PR adds new method
LightningLoggerBase.agg_and_log_metrics
. This method has the same signature as and abstractLightningLoggerBase.log_metrics
method.New method aggregates metrics with the same
step
value and then, delegates the logging procedure tolog_metrics
.From the client perspective nothing has changed. An abstract
log_metrics
still in place and it still needs to be implemented in the child loggers, but theTrainer
now calls theagg_and_log_metrics
.Did you have fun?
Make sure you had fun coding 🙃