Hello, I have a dataset of 7T tokens which, when I run it through the gpt-neox codebase, produces about 5,000 .npy files. I can train a 7B model on this with 32 GPUs, but when I try to use 64 GPUs, I get an error saying too many files are open, hitting the max-open-files limit. I believe a file descriptor is opened for each of the ~5,000 .npy files by every GPU and worker process, so the more GPUs, the more open files. Has anyone else run into a similar limit? The current limit reported by `ulimit -n` is 1048576.
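For reference, here's a quick check I can run inside the training process to confirm the per-process limit and how many descriptors are actually open (a minimal sketch; the `/proc/self/fd` path assumes Linux):

```python
import os
import resource

# Per-process soft/hard limits on open file descriptors
# (the soft limit should match what `ulimit -n` reports in the launching shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Count descriptors currently open by this process (Linux-only).
num_open = len(os.listdir("/proc/self/fd"))
print(f"open fds in this process: {num_open}")
```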
Here's the error I got:
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E: nfd = dup(fd)
GPUCA6E: self._target(*self._args, **self._kwargs)
GPUCA6E: ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
GPUCA6E: do_one_step()
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
GPUCA6E: r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/queues.py", line 122, in get
GPUCA6E: return _ForkingPickler.loads(res)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
GPUCA6E: fd = df.detach()
GPUCA6E: ^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
GPUCA6E: return reduction.recv_handle(conn)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
GPUCA6E: return recvfds(s, 1)[0]
GPUCA6E: ^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E: raise EOFError
GPUCA6E: EOFError
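The last frames (`rebuild_storage_fd` / `recv_handle`) suggest the DataLoader workers are passing tensors back using torch's default file_descriptor sharing strategy, so each batch in flight consumes extra descriptors on top of the memory-mapped .npy files. One workaround I'm considering (not yet verified on this setup): switch the sharing strategy to file_system and raise the soft limit to the hard limit at startup:

```python
import resource
import torch.multiprocessing as mp

# Share tensors via files in shared memory instead of passing raw fds
# over sockets; avoids one fd per in-flight tensor in the consumer process.
mp.set_sharing_strategy("file_system")

# Raise this process's soft fd limit to the hard limit; DataLoader
# workers inherit it on fork.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```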