Services running out of disk space when pulling images on autoscaled machines #6990
Open
Description
Services running on autoscaled machines run out of disk space while being opened and fail to start.
An error similar to the following is logged, indicating that the disk ran out of space:
"2024-12-23T08:15:55.593Z","ip-10-0-3-149","dy-sidecar_a2a82999-1c8d-4cfe-b081-41c6a587366b.1.vtsax3o9cnggg3fv9ql1d37av","log_level=ERROR | log_timestamp=2024-12-23 08:15:55,593 | log_source=servicelib.docker_utils:pull_image(253) | log_uid=None | log_msg=Unexpected error while validating 'pull_progress={'errorDetail': {'message': 'failed to register layer: write /usr/sbin/wipefs: no space left on device'}, 'error': 'failed to register layer: write /usr/sbin/wipefs: no space left on device'}'. TIP: This is probably an unforeseen pull status text that shall be added to the code. The pulling process will still continue.
NOTE
The error can also occur when pulling the inputs, the state, or the outputs!
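The `pull_progress` payload in the log above carries the failure in its `errorDetail` and `error` fields. As a minimal sketch only (the helper name `is_no_space_error` is hypothetical and is not part of `servicelib`), a pull-progress entry shaped like that payload could be classified as a disk-full failure like this:

```python
# Hypothetical helper, not part of servicelib: classifies a pull-progress
# entry shaped like the payload in the log above.
NO_SPACE_MARKER = "no space left on device"


def is_no_space_error(pull_progress: dict) -> bool:
    """Return True if the entry reports that the disk is full."""
    detail = pull_progress.get("errorDetail") or {}
    messages = (detail.get("message", ""), pull_progress.get("error", ""))
    return any(NO_SPACE_MARKER in str(msg).lower() for msg in messages)


# Payload copied from the log line in this issue.
entry = {
    "errorDetail": {
        "message": "failed to register layer: write /usr/sbin/wipefs: no space left on device"
    },
    "error": "failed to register layer: write /usr/sbin/wipefs: no space left on device",
}

if __name__ == "__main__":
    print(is_no_space_error(entry))
```

Recognizing this case explicitly would let the pull fail early with a clear message instead of logging it as an unforeseen status text and continuing.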
After the user was asked to start their service again, the service started successfully. This time it was running on a machine with 873.3 GB of free space.
What I think happened
I can think of the following possible situations:
- a machine with less disk space was used and the service did not fit
- the disk space on a previously used machine ran out
- a mix of the two situations above
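To help tell these situations apart, free disk space could be checked before an image is pulled. The following is only a sketch under stated assumptions: the function name `has_enough_space`, the 10% safety margin, and the idea that the required size is known up front are all hypothetical and not taken from the existing code.

```python
import shutil


def has_enough_space(path: str, required_bytes: int, margin: float = 0.1) -> bool:
    """Check whether `path` has room for `required_bytes`.

    `margin` is an assumed safety reserve, expressed as a fraction of the
    total disk size, that is kept free on top of the requested bytes.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    reserve = int(usage.total * margin)
    return usage.free - reserve >= required_bytes


if __name__ == "__main__":
    # "." is used so the example runs anywhere; on a real node the path would
    # be wherever Docker stores image layers (assumption: often /var/lib/docker).
    image_size_bytes = 5 * 1024**3  # hypothetical 5 GiB image
    print(has_enough_space(".", image_size_bytes))
```

Logging the result of such a check together with the machine name would show whether a failing machine started with too little space or filled up over time.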