[BUG]: AssertionError: you are calculating the l2 norm twice #2382
Closed
Description
🐛 Describe the bug
I hit AssertionError: you are calculating the l2 norm twice, which looks similar to another issue.
I suspect it is related to set_l2_norm. Strangely, the error is not raised at the very beginning but appears after several steps. Once it happens, the loss becomes NaN.
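To pin down the step at which the loss first turns non-finite, something like the following could be used. This is only a minimal sketch, assuming the loss is recorded per step as a Python float (for example via loss.item()); first_nonfinite_step and the sample values are hypothetical and not part of ColossalAI or the dreambooth script.

```python
import math
from typing import Iterable, Optional


def first_nonfinite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the 0-based index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-step loss values, for illustration only.
recorded = [0.52, 0.48, float("nan"), float("nan")]
print(first_nonfinite_step(recorded))
```

Comparing that step with the step at which the AssertionError is raised would show whether the NaN follows the assertion or precedes it.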
To Reproduce
python -m torch.distributed.run --nproc_per_node=$GPU_NUM --nnodes=$WORLD_SIZE \
--node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./train_dreambooth_colossalai.py
I'm working with this dreambooth example. Training fails on both a single machine and multiple machines.
Environment
I installed the latest ColossalAI from source, as instructed by the new README.