Cannot replicate the results! #8
Is 0.3 log loss the exact value you got? That is even better than our Kaggle record without WS-DAN (0.3250). We cannot give a specific suggestion without further evidence; we will try the Xception replication in our environment and see what happens.

> It takes us a day to train each epoch. How do you train 20 epochs within a day?

Are you talking about the Xception code or the WS-DAN code? The Xception code should not be that slow: it only samples around 10% of the frames in each epoch, to save time and validate more often. Could your I/O be too slow?

> It saves the best model after running validation at each epoch. However, training one epoch takes a long time, and running validation only once per epoch may not be enough. I found that the model overfits quickly, and the best-validation model is not really the best test model (ckpt-1 may be better).

As mentioned above, the code only samples around 10% of the frames in each epoch, so validation is more frequent. And it is very possible that the best-validation model is not the best test model, which almost every DFDC team suffered from.

> For XceptionNet, is there any reason it does not use ImageNet pretraining?

I checked with our members. The Xception model did use ImageNet-pretrained weights for initialization (from https://github.com/Cadene/pretrained-models.pytorch). Sorry, this is not reflected in the code; I will update it later.

> For the same setting, does the test loss differ a lot between different runs?

Randomness in augmentation (and other parts) could impact the result, but intuitively we do not think it affects it much.
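Since the answer above points to the Cadene repository for the ImageNet-pretrained Xception weights, here is a minimal sketch of how those weights are typically loaded and adapted for a binary real/fake classifier; the two-class replacement of `last_linear` is an illustrative assumption, not code taken from this repository:

```python
# Minimal sketch (assumption, not this repository's code): load the
# ImageNet-pretrained Xception from pretrained-models.pytorch and
# replace the classifier head with a binary real/fake output.
import torch.nn as nn
import pretrainedmodels  # https://github.com/Cadene/pretrained-models.pytorch

model = pretrainedmodels.__dict__['xception'](num_classes=1000,
                                              pretrained='imagenet')
model.last_linear = nn.Linear(model.last_linear.in_features, 2)  # real vs. fake
```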
Thanks for your answer; it clears up a lot of the questions in my mind. I have one additional question: in your CSV file, the number of frames sometimes does not match the actual number of frames in the video. How did you determine the number of frames in a video?
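One common reason for such mismatches (an assumption on my part, not confirmed by the maintainers) is that the frame count stored in the container metadata can differ from the number of frames that actually decode. A quick OpenCV check of the two counts:

```python
# Compare the container's reported frame count with the number of
# frames that actually decode; a gap between the two is a common
# reason a CSV frame count differs from what gets extracted.
import cv2

def frame_counts(video_path: str):
    cap = cv2.VideoCapture(video_path)
    reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # metadata estimate
    decoded = 0
    while True:
        ok, _ = cap.read()
        if not ok:
            break
        decoded += 1
    cap.release()
    return reported, decoded

# Example with a hypothetical path:
# print(frame_counts("dfdc_train_part_0/abcdefghij.mp4"))
```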
Also, do you notice which epoch usually becomes best.pth? In my case it is ckpt-5.pth; should I train longer, or does it simply converge at epoch 5?
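For reference, the reply above says the best model is saved after validation at each (sub-sampled) epoch; a generic sketch of that save-best pattern, as an illustration rather than the repository's actual training loop, looks like this:

```python
import torch

def train_with_best_checkpoint(model, train_one_epoch, validate, num_epochs):
    """Illustrative save-best loop (not the repository's actual code):
    'best.pth' is overwritten whenever validation loss improves, while
    'ckpt-{epoch}.pth' is written every epoch regardless."""
    best_val_loss = float("inf")
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model)                 # caller-supplied training step
        val_loss = validate(model)             # caller-supplied validation pass
        torch.save(model.state_dict(), f"ckpt-{epoch}.pth")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best.pth")
    return best_val_loss
```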
Sorry, I should make it clear that 0.4 is our own XceptionNet result; when I take your model from Google Drive, the XceptionNet result is 0.3.
Hi, another question I want to ask: I assume xception-hg-2.pth is the best.pth you saved when running train-xception.py? However, I found that the test log loss of ckpt-1.pth is usually smaller than that of best.pth, although both are 0.4+.
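For anyone comparing checkpoints this way, here is a small sketch (an assumed setup, not the repository's evaluation script) of how a DFDC-style binary log loss is computed from per-video fake probabilities:

```python
# Assumed evaluation sketch (not the repository's script): compute the
# DFDC-style binary log loss from per-video "fake" probabilities.
import numpy as np
from sklearn.metrics import log_loss

labels = np.array([0, 1, 1, 0])          # toy example: 1 = fake, 0 = real
probs  = np.array([0.1, 0.8, 0.6, 0.3])  # predicted probability of "fake"

# Clipping guards against infinite loss from overconfident predictions.
probs = np.clip(probs, 1e-15, 1 - 1e-15)
print(log_loss(labels, probs))           # lower is better; ~0.3 is strong on DFDC
```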
I tried to replicate the XceptionNet results but failed. I strictly followed the data pre-processing and training; however, the log loss on the DFDC public test set is 0.4 at best after multiple runs, while using the pre-trained XceptionNet the test loss is 0.3. I have several questions, which are quoted and answered in the reply above.