FPN training: divide by zero, RPNL1Loss explodes #146
Comments
The same situation occurs for me when I use FPN+ResNet101+dcn to train my own dataset, but the same data works fine for ResNet101+dcn. The bad log looks like this:
Epoch[0] Batch [100] Speed: 3.10 samples/sec Train-RPNAcc=0.714563, RPNLogLoss=0.677545, RPNL1Loss=0.119950, Proposal FG Fraction=0.008675, R-CNN FG Accuracy=0.034800, RCNNAcc=0.956340, RCNNLogLoss=1.054744, RCNNL1Loss=191189370617.151367,
Epoch[0] Batch [200] Speed: 3.12 samples/sec Train-RPNAcc=0.720455, RPNLogLoss=0.663638, RPNL1Loss=0.113055, Proposal FG Fraction=0.008540, R-CNN FG Accuracy=0.033646, RCNNAcc=0.954282, RCNNLogLoss=1.296537, RCNNL1Loss=257069246790015490457600.000000,
Epoch[0] Batch [300] Speed: 3.08 samples/sec Train-RPNAcc=0.721229, RPNLogLoss=0.648105, RPNL1Loss=0.111896, Proposal FG Fraction=0.008614, R-CNN FG Accuracy=0.038954, RCNNAcc=0.953531, RCNNLogLoss=nan, RCNNL1Loss=nan,
I encountered the same problem as yours. Have you solved it? @smorrel1
Did you use the default learning rate (0.01)?
@Puzer Thanks, that solved it!
I have changed the learning rate to 1e-5, but the error is still raised.
I solved the problem (at least it worked in my case) by changing the source code of the annotation loader.
The VOC-format loader makes pixel indexes 0-based by subtracting 1 from each box coordinate; if you do not convert your own data accordingly, a coordinate of 0 minus 1 wraps around to 65535 in the unsigned coordinate array, which drives the training loss to NaN. You can add a clamp there so the converted coordinates never go below 0. Hope it helps.
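For anyone who wants to see the failure concretely, here is a minimal sketch of the wraparound and one possible guard, assuming the loader stores box coordinates in an unsigned NumPy array and subtracts 1 the way common pascal_voc-style loaders do (the array names, dtype, and the clamp below are illustrative assumptions, not code from this repository):

```python
import numpy as np

# Many VOC-style loaders store box coordinates in an unsigned array and
# subtract 1 to convert 1-based annotations to 0-based pixel indexes.
xmin = 0                                        # annotation that is already 0-based
shifted = np.array([xmin - 1], dtype=np.int64)  # 0 - 1 = -1 after the conversion
boxes = shifted.astype(np.uint16)               # the cast wraps -1 around to 65535
print(boxes)                                    # [65535] -> huge regression targets -> NaN loss

# Defensive fix: clamp before the cast so coordinates can never go negative.
boxes_fixed = np.clip(shifted, 0, None).astype(np.uint16)
print(boxes_fixed)                              # [0]
```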
I found it to be a combination of the <1 box edges and the higher learning rate for fewer GPUs.
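In case it helps others hitting this on a smaller machine: a minimal sketch of the linear learning-rate scaling rule the comments above are alluding to, under the assumption that the reference config was tuned for 8 GPUs with a base learning rate of 0.01 and 1 image per GPU (these base values and names are illustrative, not read from this repository's configs):

```python
# Linear scaling rule: keep the learning rate proportional to the total batch size.
# base_lr, base_num_gpus and images_per_gpu are illustrative assumptions here,
# not values taken from this repository's config files.
base_lr = 0.01        # learning rate the config was tuned for
base_num_gpus = 8     # GPU count the config was tuned for
images_per_gpu = 1

def scaled_lr(num_gpus):
    """Scale the learning rate with the effective (total) batch size."""
    base_batch = base_num_gpus * images_per_gpu
    batch = num_gpus * images_per_gpu
    return base_lr * batch / base_batch

print(scaled_lr(1))   # 0.00125 -- much lower when training on a single GPU
print(scaled_lr(4))   # 0.005
```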
Hi, could you please assist? I'm training FPN on COCO as per the instructions and get a large RPNL1Loss. It is coming down very slowly and I suspect training may not work, or at least be delayed a lot.
Any assistance appreciated! Thanks, Stephen
log-error.txt