About loss = nan #2

Walleclipse · 2019-02-22T03:29:15Z

Hi, when I run train_ECM.py on the sample data you provided, I get loss = nan. What could be causing this?
Here is the loss for the first few steps; it is nan from step 60 onward.
Start training ...
step 20, loss = 10.188582,perp: 17026.870
(0.380 sec/step)
step 40, loss = 5.171560,perp: 79.408
(0.380 sec/step)
step 60, loss = nan,perp: nan
(0.396 sec/step)

However, when I run train.py, training converges:
Start training ...
step 20, loss = 3.336257,perp: 26.973
(0.323 sec/step)
step 40, loss = 2.431179,perp: 10.790
(0.398 sec/step)
step 60, loss = 1.425767,perp: 3.964
(0.397 sec/step)
step 80, loss = 0.656351,perp: 1.895
(0.386 sec/step)

Is there a bug in ECM_model?
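
To pin down the step at which training first diverges, the log output above can be scanned automatically. The following is a minimal sketch, not part of the repository: the helper name `first_non_finite_step` and the regular expression are assumptions based only on the log line format shown in this thread (`step N, loss = X,perp: Y`).

```python
import math
import re

# Assumed log format, taken from the lines quoted in this thread:
#   "step 60, loss = nan,perp: nan"
LOG_LINE = re.compile(r"step\s+(\d+),\s*loss\s*=\s*([^,\s]+)\s*,\s*perp:\s*(\S+)")


def first_non_finite_step(log_text):
    """Return the first logged step whose loss is NaN or infinite, else None."""
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip timing lines and any other output
        step = int(match.group(1))
        loss = float(match.group(2))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return step
    return None


# The train_ECM.py log quoted above.
ecm_log = """Start training ...
step 20, loss = 10.188582,perp: 17026.870
(0.380 sec/step)
step 40, loss = 5.171560,perp: 79.408
(0.380 sec/step)
step 60, loss = nan,perp: nan
(0.396 sec/step)"""

# The train.py log quoted above, which stays finite.
baseline_log = """Start training ...
step 20, loss = 3.336257,perp: 26.973
(0.323 sec/step)
step 40, loss = 2.431179,perp: 10.790
(0.398 sec/step)"""

print(first_non_finite_step(ecm_log))       # 60
print(first_non_finite_step(baseline_log))  # None
```

Because the script only logs every 20 steps, the reported step is an upper bound: the loss may have become non-finite anywhere between steps 41 and 60.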

1YCxZ · 2019-02-23T01:45:40Z

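Hi, I checked, and the model should be fine.
The sample data contains only 10 entries and is only meant to show the data format. You could try the STC dataset; I have put the link in the README.
Here is part of the log from an ECM model I just trained on the STC dataset: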
Initializing embeddings ...
Done.
Building model architecture ...
building model... ...
read_gate concat LSTMState C and H
write gate concat LSTMState C and H
/home/ychliu/tf_1.4_p27/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Done.
2019-02-23 09:31:24.293662: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-23 09:31:24.943148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:8a:00.0
totalMemory: 11.90GiB freeMemory: 10.37GiB
2019-02-23 09:31:24.943222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 7, name: TITAN Xp, pci bus id: 0000:8a:00.0, compute capability: 6.1)
Trying to restore saved checkpoints from ./works/example_debug//nn_models/ ... Checkpoint found: ./works/example_debug//nn_models/model.ckpt-1000
Global step was: 1000
Restoring... Done.
Loading data ...
Start training ...
step 1200, loss = 5.964163,perp: 326.911
val_prep: 384.564
(0.326 sec/step)
step 1400, loss = 5.468798,perp: 214.066
val_prep: 353.045
(0.288 sec/step)
step 1600, loss = 5.757420,perp: 277.068
val_prep: 287.971
(0.262 sec/step)
step 1800, loss = 5.543399,perp: 229.734
val_prep: 364.986
(0.280 sec/step)
step 2000, loss = 5.248699,perp: 165.369
val_prep: 320.422
(0.295 sec/step)
Storing checkpoint to ./works/example_debug//nn_models/ ... Done.
step 2200, loss = 5.683836,perp: 262.020
val_prep: 248.307
(0.296 sec/step)
step 2400, loss = 5.359348,perp: 190.597
val_prep: 239.272
(0.284 sec/step)
step 2600, loss = 5.527216,perp: 227.533
val_prep: 218.751
(0.292 sec/step)
step 2800, loss = 5.323764,perp: 185.976
val_prep: 216.894
(0.285 sec/step)
step 3000, loss = 5.305732,perp: 178.029
val_prep: 220.237
(0.276 sec/step)
Storing checkpoint to ./works/example_debug//nn_models/ ... Done.
step 3200, loss = 5.426156,perp: 203.885
val_prep: 242.733
(0.243 sec/step)
step 3400, loss = 5.159526,perp: 154.070
val_prep: 278.893
(0.296 sec/step)

Walleclipse · 2019-02-23T03:25:10Z

OK, thank you very, very much!
I'll try a larger dataset!

1YCxZ · 2019-02-23T03:56:37Z

Haha, you're welcome. Happy to discuss further.
