
Cannot replicate the results! #8

Open
leehomyc opened this issue Aug 11, 2020 · 5 comments

Comments

@leehomyc

I tried to replicate the XceptionNet results but failed. I strictly followed the data pre-processing and training steps. However, the log loss on the DFDC public test set is 0.4 at best after multiple runs, while with your pre-trained XceptionNet the test loss is 0.3. I have several questions (the log loss I compute is sketched after the list):

  • It takes us a day to train each epoch. How can we train 20 epochs within a day?
  • The code saves the best model after running validation at each epoch. However, training one epoch takes a long time, and running validation only once per epoch may not be enough. I found that the model overfits quickly, and the best validation model is not really the best test model (ckpt-1 may be better).
  • For XceptionNet, is there any reason it does not use ImageNet pre-training?
  • For the same setting, does the test loss differ a lot between different runs?
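
For reference, by log loss I mean the standard DFDC-style binary cross-entropy over per-video fake probabilities; a minimal sketch of how I compute it (the `labels`/`probs` arrays are hypothetical placeholders):

```python
import numpy as np

def dfdc_log_loss(labels, probs, eps=1e-15):
    """Binary log loss (cross-entropy) over per-video fake probabilities."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=np.float64)
    return float(-np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)))

# Example: three videos with labels 1 = fake, 0 = real.
print(dfdc_log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
```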
@cuihaoleo
Owner

Is 0.3 the exact log loss you got? That is even better than our Kaggle record without WS-DAN (0.3250). We cannot give a specific suggestion without further evidence. We will try the Xception replication in our environment and see what happens.

It takes us a day to train each epoch. How can we train 20 epochs within a day?

Are you talking about the Xception code or the WS-DAN code? The Xception code should not be that slow: it only samples around 10% of the frames in each epoch, to save time and to validate more often. Could your I/O be too slow?
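
To be clear about what I mean by sampling: the training set is re-drawn every epoch so that one epoch only touches roughly 10% of the extracted frames. A minimal sketch of that pattern (class and variable names here are illustrative, not the exact code in this repo):

```python
import random
from torch.utils.data import Dataset

class SubsampledFrameDataset(Dataset):
    """Each epoch uses a fresh ~10% subset of all frames, so one epoch is
    short and validation can run more often."""

    def __init__(self, frame_paths, labels, ratio=0.1):
        self.frame_paths = frame_paths
        self.labels = labels
        self.ratio = ratio
        self.resample()

    def resample(self):
        """Draw a new random subset; call this at the start of every epoch."""
        k = max(1, int(len(self.frame_paths) * self.ratio))
        self.indices = random.sample(range(len(self.frame_paths)), k)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        idx = self.indices[i]
        # Real code would decode and transform the image here.
        return self.frame_paths[idx], self.labels[idx]
```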

The code saves the best model after running validation at each epoch. However, training one epoch takes a long time, and running validation only once per epoch may not be enough. I found that the model overfits quickly, and the best validation model is not really the best test model (ckpt-1 may be better).

As mentioned above, the code only samples around 10% of the frames in each epoch, so validation is more frequent. And it is very possible that the best validation model is not the best test model, which almost every DFDC team suffered from.
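
For reference, the checkpointing pattern under discussion is roughly the following: a numbered checkpoint on every validation pass, and best.pth overwritten only when the validation loss improves (a generic sketch, not a verbatim copy of the repo code):

```python
import torch

def save_checkpoints(model, epoch, val_loss, best_val_loss, out_dir="."):
    """Write ckpt-<epoch>.pth on every validation pass; update best.pth only
    when the validation loss improves. Returns the new best loss."""
    torch.save(model.state_dict(), f"{out_dir}/ckpt-{epoch}.pth")
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), f"{out_dir}/best.pth")
    return best_val_loss
```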

For XceptionNet, is there any reason it does not use ImageNet pre-training?

I checked with our members. The Xception model did use ImageNet-pretrained weights for initialization (from https://github.com/Cadene/pretrained-models.pytorch). Sorry, this is not reflected in the code; I will update it later.
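
For anyone replicating this, initializing Xception from that package looks roughly like the sketch below. The two-class head replacement is only shown as the usual way to adapt it for binary classification; check the updated code for our exact setup.

```python
import torch.nn as nn
import pretrainedmodels  # https://github.com/Cadene/pretrained-models.pytorch

# Load Xception with ImageNet-pretrained weights.
model = pretrainedmodels.__dict__['xception'](num_classes=1000, pretrained='imagenet')

# Swap the classifier head for binary real/fake prediction (assumed setup).
model.last_linear = nn.Linear(model.last_linear.in_features, 2)
```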

For the same setting, does the test loss differ a lot between different runs?

Randomness in augmentation (and other parts) could impact the result, but intuitively we don't think it affects it much.
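
If you want to check how much of the gap is run-to-run noise, fixing the usual seeds before training is a quick test (a generic sketch, not something our training script necessarily does):

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    """Fix the common sources of randomness for more comparable runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```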

@leehomyc
Author

Thanks for your answer; it clears up a lot of the questions in my mind. I have one additional question: in your CSV file, the number of frames sometimes does not match the actual number of frames in the video. How did you determine the number of frames in a video?
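
To illustrate why I ask: the frame count reported by the container metadata and the number of frames that actually decode can disagree, and I am not sure which one the CSV records. A minimal sketch of the two counts (using OpenCV, which is just my assumption about the tooling):

```python
import cv2

def frame_counts(video_path):
    """Compare the container's reported frame count with the number of
    frames that actually decode; the two can differ for DFDC videos."""
    cap = cv2.VideoCapture(video_path)
    reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # from container metadata
    decoded = 0
    while True:
        ok, _ = cap.read()
        if not ok:
            break
        decoded += 1
    cap.release()
    return reported, decoded
```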

@leehomyc
Author

Also, do you notice which epoch usually becomes best.pth? In my case it is ckpt-5.pth; should I train longer, or does it simply converge at epoch 5?

@leehomyc
Author

Sorry, I should make it clear that 0.4 is our XceptionNet result; when I take your model from Google Drive, the XceptionNet result is 0.3.
Also, I switched to the ImageNet-pretrained XceptionNet, and it does not make much of a difference to the final log loss, which is still 0.4+. I did not change any of your code, so I am not sure what went wrong.
I also notice that you substitute a random image when a frame does not exist. Does that affect the final results?
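
To be concrete, the fallback I mean looks roughly like this (my paraphrase of the data-loading behaviour, not a verbatim copy of your code):

```python
import os
import random
import cv2

def load_frame(frame_path, all_frame_paths):
    """Load a frame image; if the file is missing, fall back to a random
    existing frame instead (the behaviour I am asking about)."""
    if not os.path.exists(frame_path):
        frame_path = random.choice(all_frame_paths)
    return cv2.imread(frame_path)
```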

@leehomyc
Author

Hi, another question I want to ask: I assume xception-hg-2.pth is the best.pth you saved when running train-xception.py? However, I found that the test log loss of ckpt-1.pth is usually smaller than that of best.pth, although they are both 0.4+.
For WS-DAN, could you let me know how you saved ckpt_x.pth and ckpt_e.pth? The code does not seem to show this at all.
