
Training for a few epochs, then memory suddenly OOM #467

Closed
1 of 2 tasks
mpj1234 opened this issue Jan 18, 2023 · 9 comments
Labels
bug Something isn't working as intended in the official Ultralytics package.

Comments

@mpj1234

mpj1234 commented Jan 18, 2023

Search before asking

  • I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

During training there was no problem for the first few epochs, but then an OOM error suddenly occurred.


Environment

Model: YOLOv8s

Device: A40 (48 GB)
Environment:
torch 1.10.0+cu113
torchvision 0.11.1+cu113
opencv-contrib-python 4.2.0.32
opencv-python 4.2.0.32
python 3.8.10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@mpj1234 mpj1234 added the bug Something isn't working as intended in the official Ultralytics package. label Jan 18, 2023
@AyushExel
Contributor

@mpj1234 what's your batch size? Reducing that might help.

@mpj1234
Author

mpj1234 commented Jan 18, 2023

The OOM occurred at batch size = 120.

Now I'm running with batch size = 100; eight epochs have completed, so there is no problem for the time being.

@glenn-jocher
Member

glenn-jocher commented Jan 18, 2023

👋 Hello! Thanks for asking about CUDA memory issues. YOLOv5/v8 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU it is important to keep your batch-size small enough that you do not use all of your GPU memory, otherwise you will see a CUDA Out Of Memory (OOM) Error and your training will crash. You can observe your CUDA memory utilization using either the nvidia-smi command or by viewing your console output:

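If you prefer to check utilization from inside Python rather than with nvidia-smi, here is a minimal sketch using PyTorch's built-in memory counters (these are plain PyTorch calls, not part of YOLOv5/v8 itself):

```python
# Minimal sketch: report CUDA memory utilization from inside a training script.
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print current, reserved, and peak CUDA memory in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3    # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3      # memory held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / 1024**3     # peak allocation since the last reset
    print(f"{tag} allocated={allocated:.2f}G reserved={reserved:.2f}G peak={peak:.2f}G")
```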

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are:

  • Reduce --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Train with multi-GPU at the same --batch-size
  • Upgrade your hardware to a larger GPU
  • Train on free GPU backends with up to 16GB of CUDA memory: Open In Colab Open In Kaggle
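For YOLOv8 specifically, the equivalent knobs are arguments to train() in the ultralytics Python API. A hedged sketch (the dataset YAML path is a placeholder for your own config):

```python
# Sketch: applying the mitigations above with the Ultralytics YOLOv8 Python API.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # pick a smaller variant (n < s < m < l < x) to cut memory
model.train(
    data="my_dataset.yaml",       # placeholder: your dataset config
    imgsz=512,                    # reduced image size
    batch=64,                     # reduced batch size
)
```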

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.
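In YOLOv8 the same behaviour is requested by passing batch=-1 to train(); a minimal sketch (the dataset YAML is again a placeholder):

```python
# Sketch: AutoBatch in YOLOv8 solves for ~90% CUDA memory utilization (single-GPU only).
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.train(data="my_dataset.yaml", batch=-1)   # -1 enables AutoBatch; dataset YAML is a placeholder
```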


Good luck 🍀 and let us know if you have any other questions!

@mpj1234
Author

mpj1234 commented Jan 19, 2023

Yesterday, after I switched to batch size = 100, 67 epochs ran normally, but at epoch 68 a lot of memory was suddenly allocated and OOM appeared. I think there may be a memory leak in the code.


I'm now trying automatic batch and still experimenting; the automatic batch allocation is 151.


If it's not a bug, I hope you can explain why, after training for so many epochs, a lot of extra GPU memory suddenly needs to be allocated.
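One way to tell a gradual leak apart from a one-off spike is to log the peak CUDA memory for each epoch; a hedged sketch, where train_one_epoch is a hypothetical stand-in for your training loop:

```python
# Sketch: steadily rising peaks suggest a leak; a single jump suggests one unusually
# memory-heavy batch (e.g. many instances). train_one_epoch is hypothetical.
import torch

def train_with_memory_log(model, loader, epochs, train_one_epoch):
    for epoch in range(epochs):
        torch.cuda.reset_peak_memory_stats()
        train_one_epoch(model, loader)
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"epoch {epoch}: peak CUDA memory {peak_gib:.2f} GiB")
```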

@mpj1234
Author

mpj1234 commented Jan 19, 2023

This is with automatic batch, and it still OOMs.
I'm surprised that the earlier epochs are fine but then it suddenly OOMs later on. Is there an explanation for this?


@Laughing-q
Member

@mpj1234 hi, it looks like the number of instances in your dataset is variable. Memory usage is instance-related: the more instances you have, the more memory is occupied. So an OOM can happen when one batch suddenly contains many more instances. You will have to reduce the batch size or use a smaller model to resolve this OOM issue.
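One way to check whether instance counts are the culprit is to look at the distribution of labels per image in your dataset (YOLO-format label files have one row per instance); a sketch with a placeholder label directory:

```python
# Sketch: inspect how uneven the per-image instance counts are in a YOLO-format dataset.
from pathlib import Path

label_dir = Path("datasets/my_data/labels/train")   # placeholder path to your label files
counts = [sum(1 for line in f.read_text().splitlines() if line.strip())
          for f in label_dir.glob("*.txt")]
if counts:
    print(f"images: {len(counts)}  mean instances/image: {sum(counts) / len(counts):.1f}  max: {max(counts)}")
```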

@mpj1234
Author

mpj1234 commented Jan 19, 2023

ok, Thanks♪(・ω・)ノ

@Petros626

For me, the following things seem to work (a combined sketch follows this list):

  1. reduce the batch size
  2. call torch.cuda.empty_cache() before the training job; some frameworks do this automatically
  3. set os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
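A hedged sketch combining points 2 and 3 (the allocator option has to be set before the first CUDA allocation, expandable_segments requires a recent PyTorch, and the dataset YAML is a placeholder):

```python
# Sketch: set the allocator option before any CUDA work, clear the cache, then train.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"   # needs a recent PyTorch

import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()          # release cached, unused blocks back to the driver

from ultralytics import YOLO
YOLO("yolov8s.pt").train(data="my_dataset.yaml", batch=64)           # dataset YAML is a placeholder
```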

@glenn-jocher
Member

@Petros626 thank you for sharing these suggestions! Reducing batch size and using torch.cuda.empty_cache() are effective, and setting PYTORCH_CUDA_ALLOC_CONF can help manage memory fragmentation. These approaches align well with best practices for resolving CUDA OOM issues.
