[BUG]: AssertionError: you are calculating the l2 norm twice #2382

Closed
@haofanwang

Description

🐛 Describe the bug

I'm hitting AssertionError: you are calculating the l2 norm twice, which looks similar to another issue.

I suspect it is related to set_l2_norm. Strangely, the error is not raised at the very beginning, but appears after several steps. Once it happens, the loss becomes nan.
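
For context, the message suggests a per-step guard around the gradient-norm computation: a value is recorded once and must be cleared before it can be recorded again. Below is a minimal sketch of that pattern; the names (GradNormTracker, set_l2_norm, reset) are illustrative assumptions and are not claimed to match ColossalAI's actual implementation.

# Sketch of a one-shot guard that raises this assertion when the l2 norm
# is recorded twice within the same step. All names are hypothetical.
class GradNormTracker:
    def __init__(self):
        self._l2_norm = None

    def set_l2_norm(self, value: float) -> None:
        # A second call before reset() means the norm was computed twice,
        # so the assertion fires.
        assert self._l2_norm is None, "you are calculating the l2 norm twice"
        self._l2_norm = value

    def reset(self) -> None:
        # Expected to run once per optimizer step so the next step can
        # record a fresh norm; if it never runs, the assertion above trips.
        self._l2_norm = None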

To Reproduce

python -m torch.distributed.run --nproc_per_node=$GPU_NUM --nnodes=$WORLD_SIZE \
  --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./train_dreambooth_colossalai.py

I'm working with this DreamBooth example. Training fails on both single-machine and multi-machine setups.

Environment

I installed the latest ColossalAI from source, as instructed by the new README.
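
To pin down the environment for this report, here is a quick check of what actually got installed (it uses only standard importlib.metadata and torch calls; "colossalai" is assumed to be the installed distribution name):

# Print the installed ColossalAI and torch versions for the bug report.
from importlib.metadata import PackageNotFoundError, version

import torch

try:
    print("colossalai:", version("colossalai"))
except PackageNotFoundError:
    print("colossalai: not installed")
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())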
