
Training for a few epochs, then memory suddenly OOM #467

Closed
1 of 2 tasks
mpj1234 opened this issue Jan 18, 2023 · 9 comments
Labels
bug Something isn't working as intended in the official Ultralytics package.

Comments

@mpj1234

mpj1234 commented Jan 18, 2023

Search before asking

  • I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

During training there was no problem for the first few epochs, but then an OOM error suddenly occurred.


Environment

Model: YOLOv8s

Device: A40 (48 GB)
Environment:
torch 1.10.0+cu113
torchvision 0.11.1+cu113
opencv-contrib-python 4.2.0.32
opencv-python 4.2.0.32
python 3.8.10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@mpj1234 mpj1234 added the bug Something isn't working as intended in the official Ultralytics package. label Jan 18, 2023
@AyushExel
Contributor

@mpj1234 what's your batch size? Reducing that might help.

@mpj1234
Author

mpj1234 commented Jan 18, 2023

The OOM occurred at batch size = 120.

Now I'm running with batch size = 100; eight epochs have completed, so there is no problem for the time being.

@glenn-jocher
Member

glenn-jocher commented Jan 18, 2023

👋 Hello! Thanks for asking about CUDA memory issues. YOLOv5/v8 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU it is important to keep your batch-size small enough that you do not use all of your GPU memory, otherwise you will see a CUDA Out Of Memory (OOM) Error and your training will crash. You can observe your CUDA memory utilization using either the nvidia-smi command or by viewing your console output:

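If you prefer to check utilization from inside Python rather than with nvidia-smi, here is a minimal sketch using PyTorch's built-in memory counters (these are plain PyTorch calls, not part of YOLOv5/v8 itself):

```python
# Minimal sketch: report CUDA memory utilization from inside a training script.
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print current, reserved, and peak CUDA memory in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3    # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3      # memory held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / 1024**3     # peak allocation since the last reset
    print(f"{tag} allocated={allocated:.2f}G reserved={reserved:.2f}G peak={peak:.2f}G")
```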

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are:

  • Reduce --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Train with multi-GPU at the same --batch-size
  • Upgrade your hardware to a larger GPU
  • Train on free GPU backends with up to 16GB of CUDA memory: Open In Colab Open In Kaggle
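For YOLOv8 specifically, the equivalent knobs are arguments to train() in the ultralytics Python API. A hedged sketch (the dataset YAML path is a placeholder for your own config):

```python
# Sketch: applying the mitigations above with the Ultralytics YOLOv8 Python API.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # pick a smaller variant (n < s < m < l < x) to cut memory
model.train(
    data="my_dataset.yaml",       # placeholder: your dataset config
    imgsz=512,                    # reduced image size
    batch=64,                     # reduced batch size
)
```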

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.
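In YOLOv8 the same behaviour is requested by passing batch=-1 to train(); a minimal sketch (the dataset YAML is again a placeholder):

```python
# Sketch: AutoBatch in YOLOv8 solves for ~90% CUDA memory utilization (single-GPU only).
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.train(data="my_dataset.yaml", batch=-1)   # -1 enables AutoBatch; dataset YAML is a placeholder
```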


Good luck 🍀 and let us know if you have any other questions!

@mpj1234
Author

mpj1234 commented Jan 19, 2023

Yesterday, after I switched to batch size = 100, 67 epochs ran normally, but at epoch 68 a lot of memory was suddenly allocated and OOM appeared. I think there may be a memory leak in the code.


I'm now trying automatic batch and still experimenting; the automatic batch allocation is 151.


If it's not a bug, I hope you can explain why, after training for so many epochs, a lot of extra GPU memory suddenly needs to be allocated.
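One way to tell a gradual leak apart from a one-off spike is to log the peak CUDA memory for each epoch; a hedged sketch, where train_one_epoch is a hypothetical stand-in for your training loop:

```python
# Sketch: steadily rising peaks suggest a leak; a single jump suggests one unusually
# memory-heavy batch (e.g. many instances). train_one_epoch is hypothetical.
import torch

def train_with_memory_log(model, loader, epochs, train_one_epoch):
    for epoch in range(epochs):
        torch.cuda.reset_peak_memory_stats()
        train_one_epoch(model, loader)
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"epoch {epoch}: peak CUDA memory {peak_gib:.2f} GiB")
```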

@mpj1234
Author

mpj1234 commented Jan 19, 2023

This is with automatic batch, and it still OOMs.
I'm surprised that the earlier epochs are fine but then it suddenly OOMs later on. Is there an explanation for this?


@Laughing-q
Member

@mpj1234 hi, it looks like the number of instances in your dataset is variable. Memory usage is instance-related: the more instances you have, the more memory is occupied. So an OOM can happen when one batch suddenly contains many more instances. You will have to reduce the batch size or use a smaller model to resolve this OOM issue.
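One way to check whether instance counts are the culprit is to look at the distribution of labels per image in your dataset (YOLO-format label files have one row per instance); a sketch with a placeholder label directory:

```python
# Sketch: inspect how uneven the per-image instance counts are in a YOLO-format dataset.
from pathlib import Path

label_dir = Path("datasets/my_data/labels/train")   # placeholder path to your label files
counts = [sum(1 for line in f.read_text().splitlines() if line.strip())
          for f in label_dir.glob("*.txt")]
if counts:
    print(f"images: {len(counts)}  mean instances/image: {sum(counts) / len(counts):.1f}  max: {max(counts)}")
```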

@mpj1234
Author

mpj1234 commented Jan 19, 2023

ok, Thanks♪(・ω・)ノ

@Petros626

For me, the following things seem to work (a combined sketch follows this list):

  1. reduce the batch size
  2. call torch.cuda.empty_cache() before the training job; some frameworks do this automatically
  3. set os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
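A hedged sketch combining points 2 and 3 (the allocator option has to be set before the first CUDA allocation, expandable_segments requires a recent PyTorch, and the dataset YAML is a placeholder):

```python
# Sketch: set the allocator option before any CUDA work, clear the cache, then train.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"   # needs a recent PyTorch

import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()          # release cached, unused blocks back to the driver

from ultralytics import YOLO
YOLO("yolov8s.pt").train(data="my_dataset.yaml", batch=64)           # dataset YAML is a placeholder
```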

@glenn-jocher
Member

@Petros626 thank you for sharing these suggestions! Reducing batch size and using torch.cuda.empty_cache() are effective, and setting PYTORCH_CUDA_ALLOC_CONF can help manage memory fragmentation. These approaches align well with best practices for resolving CUDA OOM issues.
