Distributed mnist is unexpectedly slow #271
Comments
How about the bandwidth in your cluster?
@gaocegege, local network, no unusual bottlenecks.
@panchul I ran into a similar problem when using DataParallel(...) in my code, but did not find a good solution. Distributed deep learning workloads depend heavily on network bandwidth. If the network is not the bottleneck, try enlarging the batch size based on the number of workers. Refer to https://github.com/pytorch/pytorch/issues/3917 for more detail.
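A minimal sketch of the batch-size suggestion above, assuming the training script already initializes torch.distributed and uses a DistributedSampler; make_loader, BASE_BATCH_SIZE, and the linear scaling factor are illustrative, not taken from the original mnist example:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

BASE_BATCH_SIZE = 64  # illustrative; the upstream mnist example has its own default


def make_loader(dataset):
    """Grow the per-worker batch with the number of workers, so each epoch
    needs fewer iterations and therefore fewer gradient all-reduces."""
    world_size = dist.get_world_size()
    sampler = DistributedSampler(dataset)
    # Larger batches -> fewer synchronizations per epoch. Note that the
    # learning rate usually needs re-tuning when the global batch grows.
    return DataLoader(dataset,
                      batch_size=BASE_BATCH_SIZE * world_size,
                      sampler=sampler)
```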
I am running into the same problem; have you solved it?
After each iteration (i.e., each batch), every replica sends out its gradients (size ≈ model size). You can check your network bandwidth and compare it with the model size to see if the network is the bottleneck. If it is, you can either use a bigger batch size, as xq2005 said, or use no_sync in DDP.
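A sketch of the two checks suggested above, assuming the model is already wrapped in DistributedDataParallel; model_size_mb, train, accum_steps, and loss_fn are hypothetical names, not part of the original mnist example:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def model_size_mb(model: torch.nn.Module) -> float:
    """Rough size of the gradients exchanged per iteration, in megabytes.
    Compare this against the measured bandwidth between the two VMs."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6


def train(ddp_model: DDP, loader, optimizer, loss_fn, accum_steps: int = 4):
    """Gradient accumulation with DDP.no_sync(): only every accum_steps-th
    backward pass triggers the all-reduce, cutting communication roughly
    by a factor of accum_steps."""
    ddp_model.train()
    for step, (x, y) in enumerate(loader):
        if (step + 1) % accum_steps != 0:
            # Skip the all-reduce; gradients accumulate locally in param.grad.
            with ddp_model.no_sync():
                loss_fn(ddp_model(x), y).backward()
        else:
            # This backward synchronizes the accumulated gradients across replicas.
            loss_fn(ddp_model(x), y).backward()
            optimizer.step()
            optimizer.zero_grad()
```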
I ran the mnist example with 2 workers on a 2-node Kubernetes cluster running on 2 VMs, and expected it to be faster compared with the 1-worker case. However, the time actually increased, and got even worse the more workers I added. I made several test runs; the timing is reproducible. No GPUs (they are explicitly disabled in the container spec template). Here is the node information:
Below is the minimally modified pytorch-operator/examples/mnist/v1/pytorch_job_mnist_gloo.yaml I used: