Add support for expanding the list of organisms #2
Hi,
Love the package, great work!
It would be nice to enable expansion of the model to organisms not on the list. Or maybe this can be done via the fine-tuning script? Thanks!

Comments
Hello! You can use the fine-tuning guide in the README.
Please reopen the issue if you run into any problems during training!
Hi @Adibvafa, I'm having some difficulties training a new model and hoping you might be able to help. I prepared a new CSV pretraining dataset by combining your original dataset with some additional organisms. I'm working in a Docker container which has access to the host's GPU, so I did an initial test of running the first script from the README, and I get the expected output. Then I tried running the following from inside the container:
But it fails with the following output (the last error is blank):
The output of
I would really appreciate any advice on this. Thanks!
Hi @JackKay404, and thanks for your interest in CodonTransformer! It is hard to diagnose the problem. All we have is
Not sure about PyTorch Lightning, but I have been able to perform other GPU-enabled tasks using a similar containerised setup. Is there a test case I could try?
Unfortunately, we do not have a test case for the trainer. Let's see if we can help a bit... You can check that NVCC is set up correctly (I suppose it is if you can run other training loops).
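As a generic sanity check (not a repo script), you can also confirm from Python that PyTorch sees the GPU inside the container:

```python
import torch

# Minimal GPU visibility check from inside the container.
print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should print True
print(torch.cuda.device_count())  # should be >= 1
```

If `torch.cuda.is_available()` returns False while nvidia-smi works, the problem is usually the container runtime rather than the training code.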
Otherwise, since the issue arises when setting up the profiler, you can also try removing it the hard way by editing the code of
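For illustration, a profiler in a PyTorch Lightning script is typically passed to the `Trainer`; here is a hedged sketch of the "hard way" (the surrounding arguments are placeholders, not the repo's actual configuration):

```python
import pytorch_lightning as pl

# If the training script builds its Trainer with a profiler, e.g.
#   trainer = pl.Trainer(profiler="pytorch", ...)
# editing that call to pass None makes Lightning skip profiler setup entirely.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    profiler=None,  # no profiler is created
)
```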
Thanks for all the help! After running the check, the result seems strange because I am able to run nvidia-smi from within the container; I guess I need to install the NVIDIA Container Toolkit, but maybe I'm missing something... I re-built my image using
Now when I run the command, I get a different error:

Could this be because I only have a single GPU? Update: I think the issue may be that I am trying to train on a single node with a single GPU?
@gui11aume Could the issue be that the custom SLURM JSON loader we used doesn't support only a single GPU?
The reason is certainly that the code is strongly "addicted" to a SLURM environment and was not tested on many machines. Here the issue is obvious:
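To make the SLURM coupling concrete, here is a hypothetical sketch of the failure mode (not the repo's exact code): setup code that reads SLURM environment variables has nothing to read outside a SLURM job.

```python
import os

# Hypothetical SLURM-coupled setup: outside a SLURM job these environment
# variables are unset, so this raises KeyError (or yields wrong node/device
# counts if defaults are guessed instead).
num_nodes = int(os.environ["SLURM_NNODES"])
gpus_per_node = int(os.environ["SLURM_GPUS_ON_NODE"])
```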
Theoretically, if my system had another CUDA GPU, would that be a fix? Alternatively, could you suggest some code edits for a single-GPU workaround? @Adibvafa
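For reference, a single-node, single-GPU run in PyTorch Lightning generally reduces to a Trainer configured like this (a minimal sketch, assuming the SLURM-specific configuration can be bypassed; argument values are illustrative):

```python
import pytorch_lightning as pl

# Pin the run to one node and one GPU, avoiding SLURM-derived settings.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=1,
)
```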
I will open a PR to add support for non-SLURM environments this weekend.