ported from pytorch-examples
- torchvision:
pip install torchvision
- tqdm:
pip install tqdm
Run the example:
python mnist.py
The same example with logging via a tqdm progress bar:
python mnist_with_tqdm_logger.py
MNIST example with training and validation monitoring using Tensorboard
- Tensorboard:
pip install tensorboard
Run the example:
python mnist_with_tensorboard.py --log_dir=/tmp/tensorboard_logs
Start tensorboard:
tensorboard --logdir=/tmp/tensorboard_logs/
MNIST example with training and validation monitoring using Visdom
- Visdom:
pip install visdom
Start visdom:
python -m visdom.server
Run the example:
python mnist_with_visdom.py
MNIST example with training and validation monitoring using ClearML
- ClearML Python client:
pip install clearml
Run the example:
python mnist_with_clearml_logger.py
This example shows how to save a checkpoint of the trainer, model, optimizer, and LR scheduler. The user can resume training from the latest stored checkpoint. In addition, a training crash can be emulated.
The --deterministic option sets up the trainer as a DeterministicEngine. This trainer synchronizes the dataflow on each epoch so that the dataflow is the same when training is resumed.
Please see the documentation for more details.
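The save/resume idea can be illustrated without any framework. The following is a minimal sketch in plain Python, not Ignite's implementation: the names run, epoch_batches and SimulatedCrash, the JSON checkpoint layout, and the per-epoch seeding scheme are all assumptions made for illustration. It writes a checkpoint at the end of every epoch, can emulate a crash at a given iteration, and resumes from the stored checkpoint. Because the shuffle order is re-derived from the seed and the epoch number, a resumed run should see the same batches as an uninterrupted one.

```python
import json
import os
import random
import tempfile


class SimulatedCrash(RuntimeError):
    """Raised to emulate a training crash (hypothetical helper)."""


def epoch_batches(data, batch_size, seed, epoch):
    # Re-seed from (seed, epoch) so the batch order depends only on the epoch,
    # not on how many random numbers were drawn before a restart.
    rng = random.Random(seed * 1000 + epoch)
    order = list(data)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def run(data, epochs, ckpt_path, seed=0, batch_size=4, crash_iteration=None):
    # Toy "training state": a single weight plus bookkeeping counters.
    state = {"epoch": 0, "iteration": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # resume from the stored checkpoint

    seen = []  # batches processed during this call, kept for inspection
    for epoch in range(state["epoch"], epochs):
        for batch in epoch_batches(data, batch_size, seed, epoch):
            state["iteration"] += 1
            if state["iteration"] == crash_iteration:
                raise SimulatedCrash(f"crashed at iteration {crash_iteration}")
            # Stand-in for a real optimizer step.
            state["weight"] += 0.01 * sum(batch) / len(batch)
            seen.append(batch)
        state["epoch"] = epoch + 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)  # checkpoint at the end of every epoch
    return state, seen
```

Calling run once with crash_iteration set, catching SimulatedCrash, and then calling run again with the same ckpt_path continues from the last completed epoch. In this sketch the work done in the partially completed epoch is discarded and repeated after the restart.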
- torchvision:
pip install torchvision
- tqdm:
pip install tqdm
- TensorboardX:
pip install tensorboardX
- Tensorboard:
pip install tensorboard
Training
python mnist_save_resume_engine.py --log_dir=logs/run_1 --epochs=10
# or same in deterministic mode
python mnist_save_resume_engine.py --log_dir=logs-det/run_1 --deterministic --epochs=10
Resume the training
python mnist_save_resume_engine.py --log_dir=logs/run_2 --resume_from=logs/run_1/checkpoint_5628.pt --epochs=10
# or same in deterministic mode
python mnist_save_resume_engine.py --log_dir=logs-det/run_2 --resume_from=logs-det/run_1/checkpoint_5628.pt --deterministic --epochs=10
Start tensorboard:
tensorboard --logdir=.
The script logs batch stats (mean/std of images, median of targets), the norms of the model weights, and the norms of the computed gradients to run.log and resume_run.log, so that training behaviour can be compared in both cases. If the --deterministic option is set, we can observe the same values after resuming the training.
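One way to check this is to diff the two log files. The helper below is a small sketch using the standard difflib module; log_diff and read_lines are hypothetical names, and the sketch assumes the two files have already been trimmed to the same range of iterations, since the exact log format is not described here.

```python
import difflib


def read_lines(path):
    # Read a log file into a list of lines without trailing newlines.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


def log_diff(lines_a, lines_b, name_a="run.log", name_b="resume_run.log"):
    # Return unified-diff lines; an empty list means the logs are identical.
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )
```

For example, log_diff(read_lines("run.log"), read_lines("resume_run.log")) returns an empty list when the two files match line for line.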
[Images omitted: side-by-side comparison of the non-deterministic and deterministic runs, and a comparison of run.log vs resume_run.log for the deterministic run.]
Initial training with a crash
python mnist_save_resume_engine.py --crash_iteration 5700 --log_dir=logs/run_3_crash --epochs 10
# or same in deterministic mode
python mnist_save_resume_engine.py --crash_iteration 5700 --log_dir=logs-det/run_3_crash --epochs 10 --deterministic
Resume from the latest checkpoint
python mnist_save_resume_engine.py --resume_from logs/run_3_crash/checkpoint_6.pt --log_dir=logs/run_4 --epochs 10
# or same in deterministic mode
python mnist_save_resume_engine.py --resume_from logs-det/run_3_crash/checkpoint_6.pt --log_dir=logs-det/run_4 --epochs 10 --deterministic
[Images omitted: side-by-side comparison of the non-deterministic and deterministic runs.]