
[utils] add synchronized cuda memory monitor #740

Merged
merged 1 commit into hpcaitech:main on Apr 13, 2022
Conversation

@1SAA (Contributor) commented Apr 12, 2022

No description provided.

@1SAA requested a review from feifeibear, April 12, 2022 09:11
def finish(self):
    torch.cuda.synchronize()
    self.time_stamps.append(time())
    max_usage = torch.cuda.max_memory_allocated()
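
For context, a minimal sketch of how a monitor built around the quoted finish() method might look; the class name, the start() logic, and the mem_stats field are assumptions for illustration, not code from this PR:

import torch
from time import time

class SyncCudaMemoryMonitor:
    """Hypothetical sketch: record peak CUDA memory between start() and finish()."""

    def __init__(self):
        self.time_stamps = []
        self.mem_stats = []

    def start(self):
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()  # so max_memory_allocated() covers only this window
        self.time_stamps.append(time())

    def finish(self):
        torch.cuda.synchronize()  # wait for queued kernels before reading the peak statistic
        self.time_stamps.append(time())
        max_usage = torch.cuda.max_memory_allocated()
        self.mem_stats.append(max_usage)
        return max_usage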

@1SAA (Contributor, Author) commented:

PyTorch uses a caching allocator.
max_memory_allocated returns the maximum CUDA memory occupied by tensors.
max_memory_reserved returns the maximum CUDA memory held by the allocator.
As long as the GPU still has enough free memory, the caching allocator may not recycle unused segments,
so the reserved memory can be much larger than the allocated memory. This makes the reserved number
inaccurate if we want to know the maximum memory used by all tensors.
On the other hand, the CUDA memory actually in use can be larger than what is reported as allocated.
It is better to set the maximum CUDA memory budget slightly smaller than the actual CUDA memory,
so that some memory is left as a buffer for PyTorch.
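
A minimal sketch (not part of this PR) illustrating the allocated vs. reserved split described above; the tensor size is arbitrary:

import torch

# Allocate ~1 GiB of float32 and free it again. The caching allocator keeps the
# segments, so memory_reserved() stays high while memory_allocated() drops.
x = torch.empty(256, 1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")

del x
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")  # back near 0
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")    # still ~1024 MiB of cached segments

# The peak statistics split the same way: max_memory_allocated() is the peak
# occupied by tensors, max_memory_reserved() the peak held by the allocator.
print(torch.cuda.max_memory_allocated() // 2**20, "MiB peak allocated")
print(torch.cuda.max_memory_reserved() // 2**20, "MiB peak reserved")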

@1SAA merged commit 340e59f into hpcaitech:main on Apr 13, 2022