This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Distributed mnist is unexpectedly slow #271

Open
panchul opened this issue May 11, 2020 · 7 comments

@panchul commented May 11, 2020

I ran the mnist example with 2 workers on a 2-node Kubernetes cluster running on 2 VMs and expected it to be faster than the 1-worker case. However, the time actually increased, and it got slower the more workers I added. I made several test runs; the timing is reproducible:

  • 1 master, 1 worker:  100 seconds
  • 1 master, 2 workers: 2 minutes 56 seconds (176 seconds)
  • 1 master, 6 workers: 7 minutes 59 seconds (479 seconds)

No GPUs (they are explicitly disabled in the container spec template). Here is the node information:

$ k get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-12102812-vmss000000   Ready    agent   28d   v1.15.10
aks-nodepool1-12102812-vmss000001   Ready    agent   28d   v1.15.10

Below is the minimally modified pytorch-operator/examples/mnist/v1/pytorch_job_mnist_gloo.yaml I used:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: alek8106/pytorch-dist-mnist-test:1.0
              args: ["--backend", "gloo", "--no-cuda"]
              resources:
                limits:
              #    nvidia.com/gpu: 1
    Worker:
      #replicas: 1
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: alek8106/pytorch-dist-mnist-test:1.0
              args: ["--backend", "gloo", "--no-cuda"]
              resources:
                limits:
               #   nvidia.com/gpu: 1
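For context, here is a minimal sketch (an editor's illustration, not the exact code shipped with the example) of how a training script typically consumes the --backend gloo / --no-cuda arguments above: the PyTorch operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into each pod, and the script initializes the process group from that environment before wrapping the model in DistributedDataParallel. The function names below are illustrative.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(backend: str = "gloo"):
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # which the PyTorchJob controller sets on every master/worker pod.
    dist.init_process_group(backend=backend, init_method="env://")
    print(f"rank {dist.get_rank()} of world size {dist.get_world_size()} ready")

def wrap(model):
    # With --no-cuda the model stays on CPU; gloo performs the gradient
    # all-reduce over the pod network on every backward pass.
    return DDP(model)
```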
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.78


@gaocegege (Member)

What is the bandwidth in your cluster?

@panchul (Author) commented May 26, 2020

@gaocegege, it is a local network, with no unusual bottlenecks.

@xq2005 commented Jun 18, 2020

@panchul I ran into a similar problem when using DataParallel(...) in my code, but I did not find a good solution. Distributed deep learning workloads depend heavily on network bandwidth. If the network is not the bottleneck, try increasing the batch size in proportion to the number of workers, as sketched below.

Refer to https://github.com/pytorch/pytorch/issues/3917 for more detail.
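A minimal sketch of that suggestion, assuming the process group is already initialized and BASE_BATCH_SIZE is the batch size that worked for a single worker (both names are illustrative): scale the per-worker batch with the world size so each gradient all-reduce amortizes over more computation. Note that the learning rate usually needs retuning when the effective batch size grows.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

BASE_BATCH_SIZE = 64  # illustrative single-worker batch size

def make_loader(dataset):
    world_size = dist.get_world_size()
    # A larger per-worker batch means fewer iterations per epoch, and therefore
    # fewer gradient synchronizations over the network.
    batch_size = BASE_BATCH_SIZE * world_size
    sampler = DistributedSampler(dataset)  # each rank trains on its own shard
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```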

@lwj1980s commented Nov 6, 2020

I am running into the same problem. Have you solved it?

@jalola commented Jun 25, 2021

After each iteration (i.e., each batch), all of the replicas send out their gradients (whose size equals the model size). If the model size is 100 MB:

  • 1 node: no gradients need to be sent
  • 2 nodes: 2 x 100 MB = 200 MB per iteration

You can check your network bandwidth and compare it with the model size to see whether the network is the bottleneck. If the network is the cause, you can use a bigger batch size as @xq2005 said, or use no_sync in DDP; see the sketch below.
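A minimal sketch of the no_sync() approach, assuming model, loader and optimizer already exist and the process group is initialized (ACCUM_STEPS is an illustrative accumulation factor): gradients are accumulated locally for several batches and only all-reduced on every ACCUM_STEPS-th step, cutting the communication volume roughly by that factor.

```python
import contextlib
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 4  # illustrative: synchronize gradients every 4 batches

ddp_model = DDP(model)

for step, (data, target) in enumerate(loader):
    sync_now = (step + 1) % ACCUM_STEPS == 0
    # no_sync() skips the all-reduce; gradients keep accumulating locally
    # until the next synchronized backward pass.
    ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
    with ctx:
        # Divide by ACCUM_STEPS so the accumulated gradient keeps a comparable scale.
        loss = F.nll_loss(ddp_model(data), target) / ACCUM_STEPS
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()
```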

@gaocegege (Member)