About loss = nan #2

Open
Walleclipse opened this issue Feb 22, 2019 · 3 comments

Comments

@Walleclipse

Hi, when I run train_ECM.py on the sample data you provided, I get loss = nan. What could be causing this?
Here is the loss for the first few logged steps; by step 60 it is already nan.
Start training ...
step 20, loss = 10.188582,perp: 17026.870
(0.380 sec/step)
step 40, loss = 5.171560,perp: 79.408
(0.380 sec/step)
step 60, loss = nan,perp: nan
(0.396 sec/step)

But when I run train.py, training converges:
Start training ...
step 20, loss = 3.336257,perp: 26.973
(0.323 sec/step)
step 40, loss = 2.431179,perp: 10.790
(0.398 sec/step)
step 60, loss = 1.425767,perp: 3.964
(0.397 sec/step)
step 80, loss = 0.656351,perp: 1.895
(0.386 sec/step)

Is there a bug in ECM_model?
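
For anyone comparing runs like the two above, here is a minimal sketch that scans a training log for the first logged step with a non-finite loss. It assumes only the line format shown in this thread (`step N, loss = X,perp: Y`); the function name `first_non_finite_step` is made up for illustration and is not part of this repository.

```python
import math
import re

# Matches lines such as "step 60, loss = nan,perp: nan" (format taken from the logs in this thread).
STEP_LINE = re.compile(r"step\s+(\d+),\s*loss\s*=\s*([^,\s]+)\s*,\s*perp:\s*(\S+)")


def first_non_finite_step(log_text):
    """Return the first logged step whose loss is NaN or infinite, or None if all are finite."""
    for line in log_text.splitlines():
        match = STEP_LINE.search(line)
        if match is None:
            continue  # timing lines such as "(0.380 sec/step)" are skipped
        step = int(match.group(1))
        loss = float(match.group(2))  # float("nan") parses fine
        if not math.isfinite(loss):
            return step
    return None


sample_log = """\
step 20, loss = 10.188582,perp: 17026.870
(0.380 sec/step)
step 40, loss = 5.171560,perp: 79.408
(0.380 sec/step)
step 60, loss = nan,perp: nan
(0.396 sec/step)
"""

print(first_non_finite_step(sample_log))
```

Since the log is printed every 20 steps, a result of 60 only narrows the first NaN down to somewhere in steps 41 to 60; it does not say what caused it.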

@1YCxZ
Owner

1YCxZ commented Feb 23, 2019

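Hi, I checked, and the model should be fine.
The sample data contains only 10 examples and is only meant to show the data format. Try the STC dataset instead; I have put the link in the README.
Here is part of the log from training the ECM model on the STC dataset just now: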
Initializing embeddings ...
Done.
Building model architecture ...
building model... ...
read_gate concat LSTMState C and H
write gate concat LSTMState C and H
/home/ychliu/tf_1.4_p27/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Done.
2019-02-23 09:31:24.293662: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-23 09:31:24.943148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:8a:00.0
totalMemory: 11.90GiB freeMemory: 10.37GiB
2019-02-23 09:31:24.943222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 7, name: TITAN Xp, pci bus id: 0000:8a:00.0, compute capability: 6.1)
Trying to restore saved checkpoints from ./works/example_debug//nn_models/ ... Checkpoint found: ./works/example_debug//nn_models/model.ckpt-1000
Global step was: 1000
Restoring... Done.
Loading data ...
Start training ...
step 1200, loss = 5.964163,perp: 326.911
val_prep: 384.564
(0.326 sec/step)
step 1400, loss = 5.468798,perp: 214.066
val_prep: 353.045
(0.288 sec/step)
step 1600, loss = 5.757420,perp: 277.068
val_prep: 287.971
(0.262 sec/step)
step 1800, loss = 5.543399,perp: 229.734
val_prep: 364.986
(0.280 sec/step)
step 2000, loss = 5.248699,perp: 165.369
val_prep: 320.422
(0.295 sec/step)
Storing checkpoint to ./works/example_debug//nn_models/ ... Done.
step 2200, loss = 5.683836,perp: 262.020
val_prep: 248.307
(0.296 sec/step)
step 2400, loss = 5.359348,perp: 190.597
val_prep: 239.272
(0.284 sec/step)
step 2600, loss = 5.527216,perp: 227.533
val_prep: 218.751
(0.292 sec/step)
step 2800, loss = 5.323764,perp: 185.976
val_prep: 216.894
(0.285 sec/step)
step 3000, loss = 5.305732,perp: 178.029
val_prep: 220.237
(0.276 sec/step)
Storing checkpoint to ./works/example_debug//nn_models/ ... Done.
step 3200, loss = 5.426156,perp: 203.885
val_prep: 242.733
(0.243 sec/step)
step 3400, loss = 5.159526,perp: 154.070
val_prep: 278.893
(0.296 sec/step)
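
A log like the one above can be summarized with a few lines of Python. The sketch below pairs each `step` line with the `val_prep` line that follows it and reports the step with the lowest value. It assumes `val_prep` is a validation perplexity (the thread does not say so explicitly), and the function name `best_validation_step` is hypothetical.

```python
import re

# Line formats are taken from the log above; everything else is illustrative.
STEP_LINE = re.compile(r"^\s*step\s+(\d+),")
VAL_LINE = re.compile(r"val_prep:\s*([0-9.]+)")


def best_validation_step(log_text):
    """Return (step, val_prep) for the lowest logged val_prep, or None if none was logged."""
    best = None
    current_step = None
    for line in log_text.splitlines():
        step_match = STEP_LINE.search(line)
        if step_match:
            current_step = int(step_match.group(1))
            continue
        val_match = VAL_LINE.search(line)
        if val_match and current_step is not None:
            value = float(val_match.group(1))
            if best is None or value < best[1]:
                best = (current_step, value)
    return best


excerpt = """\
step 2600, loss = 5.527216,perp: 227.533
val_prep: 218.751
(0.292 sec/step)
step 2800, loss = 5.323764,perp: 185.976
val_prep: 216.894
(0.285 sec/step)
step 3000, loss = 5.305732,perp: 178.029
val_prep: 220.237
(0.276 sec/step)
"""

print(best_validation_step(excerpt))
```

Run on the three-step excerpt, this picks out step 2800, which is also the lowest `val_prep` in the full log posted above.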

@Walleclipse
Author

OK, thank you very much!
I'll try a larger dataset!

@1YCxZ
Owner

1YCxZ commented Feb 23, 2019

Haha, you're welcome. Feel free to reach out anytime.
