Horovod: adjust base LR used by schedulers to scale with the number of workers #2626
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master   #2626    +/-   ##
========================================
  Coverage      91%     92%
========================================
  Files          72      74      +2
  Lines        6131    6321    +190
========================================
+ Hits         5599    5791    +192
+ Misses        532     530      -2
This pull request is now in conflict... :(
@williamFalcon @Borda The failing tests seem unrelated. Could you take a look at the PR and check that everything makes sense? As a follow-up, I want to add a param to …
Yes, no connection to this PR.
Hello. I was curious whether there have been any other PRs that add the ability for users to specify which learning-rate scaling strategy should be used with, e.g., DDP or Horovod. As a frequent user of DDP with Lightning, I would love to have my optimizer's learning rate automatically scaled according to the effective batch size across nodes and their GPUs (i.e., the total world size).
In #2574 it was observed that the learning rate used to initialize the LR schedulers conflicts with the way we scale the learning rate by the number of Horovod workers: because the schedulers are initialized before the scaling, their stored base learning rates override the scaled values. This PR fixes that so both the optimizers and the LR schedulers are scaled up correctly.
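The gist of the fix can be sketched as follows. This is a minimal illustration, not the actual Lightning implementation: the helper name `scale_lrs_for_horovod` and its list arguments are hypothetical, and it assumes `hvd.init()` has already been called, standard `torch.optim` optimizers, and schedulers that captured `base_lrs` at construction time.

```python
import horovod.torch as hvd


def scale_lrs_for_horovod(optimizers, lr_schedulers):
    """Sketch: scale optimizer LRs and scheduler base LRs by the Horovod world size."""
    world_size = hvd.size()

    # Scale the learning rate stored in every optimizer param group.
    for optimizer in optimizers:
        for param_group in optimizer.param_groups:
            param_group["lr"] *= world_size

    # Torch LR schedulers snapshot the optimizer's LR into `base_lrs` when they
    # are constructed. If a scheduler was created before the scaling above, that
    # snapshot still holds the unscaled LR and would override the scaling on the
    # next scheduler step, so it must be rescaled as well.
    for scheduler in lr_schedulers:
        scheduler.base_lrs = [lr * world_size for lr in scheduler.base_lrs]
```

The key point is the second loop: scaling only the optimizer's param groups is not enough, because schedulers recompute the LR from their own `base_lrs` on every step.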
Follow-ups to consider would be: