what's the biggest dataset you've tried? #1253

Open
exnx opened this issue Jul 15, 2024 · 0 comments
Labels
bug Something isn't working

Comments

exnx (Contributor) commented Jul 15, 2024

Hello, I have a dataset of 7T tokens, which, when preprocessed with the gpt-neox codebase, produces about 5000 .npy files. I can train a 7B model on this with 32 GPUs, but when I try to use 64 GPUs I get an error saying too many files have been opened, i.e. the limit on the maximum number of open files is reached. I believe a file descriptor is opened for each of the ~5000 .npy files by every GPU and dataloader worker, so the more GPUs, the more open files. Has anyone else run into a similar limit? The current limit reported by ulimit -n is 1048576.
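
(Added for reference, not part of the original report: a minimal, Linux-only sketch of how one might watch how close a rank's process gets to that descriptor limit while the dataloader workers spin up.)

```python
# Not from the original report: a quick Linux-only check of how many file
# descriptors this process currently holds vs. its RLIMIT_NOFILE soft limit.
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)  # what `ulimit -n` / `ulimit -Hn` report
open_fds = len(os.listdir("/proc/self/fd"))               # fds open in the current process

print(f"open fds: {open_fds} / soft limit: {soft} (hard limit: {hard})")
```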

Here's the error I got:

GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:     nfd = dup(fd)
GPUCA6E:           ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E:     self._target(*self._args, **self._kwargs)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
GPUCA6E:     do_one_step()
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
GPUCA6E:     r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
GPUCA6E:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/queues.py", line 122, in get
GPUCA6E:     return _ForkingPickler.loads(res)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
GPUCA6E:     fd = df.detach()
GPUCA6E:          ^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
GPUCA6E:     return reduction.recv_handle(conn)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
GPUCA6E:     return recvfds(s, 1)[0]
GPUCA6E:            ^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E:     raise EOFError
GPUCA6E: EOFError
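
(Added note, not part of the original issue: the EOFError in rebuild_storage_fd is raised while a pin_memory worker receives a tensor over torch.multiprocessing's default file-descriptor sharing strategy, which fails once the process hits its open-file limit. Below is a hedged sketch of two workarounds commonly suggested for this symptom; whether either is enough with ~5000 memory-mapped .npy files per worker is untested here.)

```python
# Hedged sketch, not the gpt-neox fix: two workarounds commonly suggested for
# "[Errno 24] Too many open files" / EOFError in DataLoader pin_memory workers.
import resource
import torch.multiprocessing as mp

# 1. Share tensors between processes via the filesystem instead of passing one
#    file descriptor per tensor, which keeps far fewer fds open at a time.
mp.set_sharing_strategy("file_system")

# 2. Raise this process's soft RLIMIT_NOFILE up to the hard limit; a shell-level
#    `ulimit -n` only applies to processes launched from that shell/session.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```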
exnx added the bug (Something isn't working) label on Jul 15, 2024