Add support for expanding the list of organisms #2

Open
JackKay404 opened this issue Sep 16, 2024 · 11 comments
@JackKay404

Hi,
Love the package, great work!
It would be nice to be able to expand the model to organisms not on the list, or maybe this can already be done via the fine-tuning script? Thanks!

JackKay404 changed the title from "Add support for expanding the list of organisms / Docker" to "Add support for expanding the list of organisms" on Sep 16, 2024
@Adibvafa
Owner

Hello!
Thank you for opening an issue.

You can use the finetuning guide in the README and finetune.py to finetune the model on any new dataset.
To add new organisms, you need to use both pretrain.py and finetune.py to train the model from scratch.

@Adibvafa
Owner

Please reopen the issue if you get into any problems during training!

Adibvafa added the enhancement (New feature or request) label on Sep 17, 2024
@JackKay404
Author

JackKay404 commented Oct 1, 2024

Hi @Adibvafa,

I'm having some difficulties training a new model and am hoping you might be able to help.

I prepared a new CSV pretraining dataset by combining your original dataset with some additional organisms.
Then I used the prepare_training_data function to generate a JSON file from the new dataset.
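Roughly what I did (file names are placeholders, and the prepare_training_data import path and call signature here are from memory, so check the actual function in the package):

import pandas as pd
from CodonTransformer.CodonData import prepare_training_data  # assumed import path

# Combine the original pretraining CSV with additional organisms (placeholder file names).
original = pd.read_csv("original_dataset.csv")
extra = pd.read_csv("extra_organisms.csv")
combined = pd.concat([original, extra], ignore_index=True)

# Write the JSON file consumed by pretrain.py (assumed call signature).
prepare_training_data(combined, "new_training_data.json")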

I'm working in a Docker container which has access to the host's GPU, so as an initial test I ran the first script from the README and got the expected output.

Then I tried running the following from inside the container:

python pretrain.py \
    --tokenizer_path /CodonTransformer/src/CodonTransformerTokenizer.json \
    --train_data_path path/to/new_training_data.json \
    --checkpoint_dir path/to/checkpoints_dir \
    --batch_size 6 \
    --max_epochs 5 \
    --num_workers 5 \
    --accumulate_grad_batches 1 \
    --num_gpus 16 \
    --learning_rate 0.00005 \
    --warmup_fraction 0.1 \
    --save_interval 5 \
    --seed 123

But it fails with the following output (the last error is blank):

/usr/local/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
BigBirdForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/codont_worker/CodonTransformer/pretrain.py", line 239, in <module>
[rank0]:     main(args)
[rank0]:   File "/codont_worker/CodonTransformer/pretrain.py", line 178, in main
[rank0]:     trainer.fit(harnessed_model, data_loader)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 938, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1233, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:

The output of nvidia-smi from within the container is below:

Tue Oct  1 16:06:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off | 00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8              13W / 165W |     20MiB / 16380MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Would really appreciate any advice on this.

Thanks!

@gui11aume
Collaborator

Hi @JackKay404 and thanks for your interest in CodonTransformer! It is hard to diagnose the problem. All we have is "NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error", which may be a GPU configuration issue. Would you know if the machine can run other training loops using PyTorch Lightning?

@JackKay404
Author

Not sure about PyTorch Lightning, but I have been able to perform other GPU-enabled tasks using a similar containerised setup. Is there a test case I could try?

@gui11aume
Collaborator

Unfortunately, we do not have a test case for the trainer. Let's see if we can help a bit... You can check that NVCC is set up correctly (I suppose it is if you can run other loops).

nvcc --version

Otherwise, since the issue arises when setting up the profiler, you can also try removing it the hard way: edit the code of pretrain.py and set trainer = Trainer(profiler=None, ...), i.e., add an argument to force-remove the profiler.
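Something like this (only the profiler=None part matters; the other Trainer arguments below are illustrative, not the ones pretrain.py actually passes):

from pytorch_lightning import Trainer

# Sketch of the suggested edit: pass profiler=None so the Trainer never sets up a profiler.
trainer = Trainer(
    profiler=None,        # explicitly disable profiling
    accelerator="gpu",
    devices=1,
    max_epochs=5,
    precision="16-mixed",
)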

@JackKay404
Author

JackKay404 commented Oct 2, 2024

Thanks for all the help!

After running nvcc --version, it seems that it is not installed:
bash: nvcc: command not found

This seems strange because I am able to run nvidia-smi from within the container, but I guess I need to install the NVIDIA Container Toolkit; maybe I'm missing something...

I re-built my image using nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 as the base, so the container now has nvcc 12.4:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Now when I run the command I get a different error:

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/codont_worker/CodonTransformer/pretrain.py", line 239, in <module>
[rank0]:     main(args)
[rank0]:   File "/codont_worker/CodonTransformer/pretrain.py", line 178, in main
[rank0]:     trainer.fit(harnessed_model, data_loader)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in _run
[rank0]:     self.strategy.setup(self)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/ddp.py", line 174, in setup
[rank0]:     self.setup_optimizers(trainer)
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 138, in setup_optimizers
[rank0]:     self.optimizers, self.lr_scheduler_configs = _init_optimizers_and_lr_schedulers(self.lightning_module)
[rank0]:                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/core/optimizer.py", line 179, in _init_optimizers_and_lr_schedulers
[rank0]:     optim_conf = call._call_lightning_module_hook(model.trainer, "configure_optimizers", pl_module=model)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 167, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/codont_worker/CodonTransformer/pretrain.py", line 90, in configure_optimizers
[rank0]:     total_steps=self.trainer.estimated_stepping_batches,
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1675, in estimated_stepping_batches
[rank0]:     self.fit_loop.setup_data()
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 263, in setup_data
[rank0]:     iter(self._data_fetcher)  # creates the iterator inside the fetcher
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 111, in __iter__
[rank0]:     batch = super().__next__()
[rank0]:             ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
[rank0]:     batch = next(self.iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
[rank0]:     out = next(self._iterator)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
[rank0]:     out[i] = next(self.iterators[i])
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: TypeError: Caught TypeError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 253, in _worker_loop
[rank0]:     fetcher = _DatasetKind.create_fetcher(dataset_kind, dataset, auto_collation, collate_fn, drop_last)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 80, in create_fetcher
[rank0]:     return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 22, in __init__
[rank0]:     self.dataset_iter = iter(dataset)
[rank0]:                         ^^^^^^^^^^^^^
[rank0]:   File "/codont_worker/CodonTransformer/CodonTransformer/CodonUtils.py", line 530, in __iter__
[rank0]:     world_size = int(os.environ.get(self.world_size_handle))
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

Could this be because I only have a single GPU?

Update:

I think the issue may be that I am trying to train on a single node with a single GPU.
I end up hitting the NotImplementedError at /CodonTransformer/CodonUtils.py, line 514, after setting dist_env=None and exporting the environment variables WORLD_SIZE=1 and LOCAL_RANK=-1.

Adibvafa reopened this on Oct 2, 2024
@Adibvafa
Owner

Adibvafa commented Oct 2, 2024

@gui11aume Could the issue be that the custom SLURM JSON loader we used doesn't support a single GPU?
I can refactor it to use the input JSON dataset directly with torch Dataset and DataLoader objects.
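For illustration, a rough sketch of that kind of refactor, assuming the training JSON holds one record per line (the real record format may differ):

import json
from torch.utils.data import Dataset, DataLoader

class JsonLinesDataset(Dataset):
    # Map-style dataset over the training JSON; assumes one JSON record per line.
    def __init__(self, path):
        with open(path) as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]

# A map-style dataset needs no WORLD_SIZE/LOCAL_RANK handling of its own;
# Lightning adds a DistributedSampler when more than one process is used.
loader = DataLoader(JsonLinesDataset("new_training_data.json"), batch_size=6, num_workers=5)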

@gui11aume
Collaborator

gui11aume commented Oct 2, 2024

The reason is almost certainly that the code is strongly "addicted" to a SLURM environment and has not been tested on many other machines. Here the issue is obvious: os.environ.get(self.world_size_handle) does not find the expected environment variable and returns None, which cannot be converted to an int. The code should be os.environ.get(self.world_size_handle, 1) so that the world size defaults to 1. That may fix this particular issue, but the code is highly non-portable, so I suspect other issues will arise.
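Spelled out as a standalone snippet (the handle name below is only a stand-in for self.world_size_handle):

import os

# Stand-in for self.world_size_handle; the real attribute comes from the dataset class.
world_size_handle = "WORLD_SIZE"

# Before: int(os.environ.get(world_size_handle)) raises TypeError when the variable is unset.
# After: fall back to a world size of 1 outside SLURM / multi-process runs.
world_size = int(os.environ.get(world_size_handle, 1))
print(world_size)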

@JackKay404
Author

Theoretically, if my system had another CUDA GPU, would that be a fix? Alternatively, could you suggest some code edits for a single-GPU workaround? @Adibvafa

@Adibvafa
Owner

I will open a PR to add support for non-SLURM environments this weekend.

Adibvafa self-assigned this on Oct 17, 2024
Adibvafa added the bug (Something isn't working) label on Oct 17, 2024