zero grad params on initialization #484

Open
willwhitney opened this issue Nov 19, 2015 · 4 comments
Comments

@willwhitney

th> lin = nn.Linear(2,2)
th> p1, gp1 = lin:getParameters()
th> p1
 0.4611
-0.6737
-0.6769
 0.3312
-0.3065
-0.0952
[torch.DoubleTensor of size 6]
th> gp1
-2.6816e+154
-2.6816e+154
 2.9644e-323
 2.7813e-309
-2.6816e+154
-2.6816e+154
[torch.DoubleTensor of size 6]

Is there a reason grad params don't start zeroed when a module is initialized? This seems super dangerous, and since initialization only happens once, zeroing them wouldn't be a big performance hit.
@soumith
Member

soumith commented Nov 19, 2015

the training loop is usually run as:

model:zeroGradParameters()
criterion:forward(model:forward(...), target)
model:backward(...)
optimization

after every mini-batch, you need to zero the gradient buffers for correctness anyway. Initializing them with zeros would likely hide bugs caused by forgetting to zero the grad buffers on every iteration...
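The loop above can be sketched concretely in Lua. This is a minimal illustration, not code from the thread: the criterion, data, and learning rate are all made up for the example, and a plain `updateParameters` SGD step stands in for the "optimization" line.

```lua
-- Sketch of the standard Torch training loop described above.
require 'nn'

local model = nn.Linear(2, 2)
local criterion = nn.MSECriterion()       -- illustrative choice of loss
local input = torch.randn(2)              -- dummy data for the sketch
local target = torch.randn(2)

for i = 1, 100 do
  model:zeroGradParameters()              -- clear stale gradients before backward
  local output = model:forward(input)
  local loss = criterion:forward(output, target)
  local gradOutput = criterion:backward(output, target)
  model:backward(input, gradOutput)       -- accumulates into gradWeight / gradBias
  model:updateParameters(0.1)             -- plain SGD step, learning rate 0.1
end
```

Because `backward` accumulates rather than overwrites gradients, skipping the `zeroGradParameters()` call makes each step use the sum of all previous gradients, which is exactly the bug zero-initialization would mask.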

@willwhitney
Author

Yup, I get that this is the standard form. But intuitively, you'd expect this ordering to work just as well:

criterion:forward(model:forward(...), target)
model:backward(...)
optimization
model:zeroGradParameters()

This probably isn't a big deal either way (I came across it by chance, not as a bug), but since all the other fields get initialized for you, it seems like this one should be too.

@soumith
Member

soumith commented Dec 19, 2015

this has come up several times in the past. Maybe we should initialize gradWeight / gradBias with NaNs.

@Kaixhin
Contributor

Kaixhin commented Dec 29, 2015

I had assumed gradients were zero-initialised and happened to write my optimisation loop the latter way around, so +1 for initialising with NaNs (for the reasoning given above).
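For what the NaN proposal would look like in practice, here is a hedged sketch (not an actual patch from this thread): fill the flattened gradient storage with NaN right after construction, so a forgotten `zeroGradParameters()` poisons the parameter update visibly instead of silently reusing uninitialized memory.

```lua
-- Sketch: NaN-initialize the gradient buffers of a freshly built module.
require 'nn'

local lin = nn.Linear(2, 2)
local params, gradParams = lin:getParameters()
gradParams:fill(0/0)   -- 0/0 evaluates to NaN for Lua doubles

-- Any update that consumes gradParams without first calling
-- lin:zeroGradParameters() now turns the parameters into NaNs,
-- which surfaces the bug immediately rather than hiding it.
```

Since NaN propagates through every arithmetic operation, the first missed zeroing would corrupt the loss on the very next forward pass, making the mistake easy to spot.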
