Colabfold_batch connects to mmseqs server despite msa input #563
If you turn on |
Is there a way of obtaining template PDB files locally? And if not is it feasible to obtain the templates of many predictions (>10000) from the msa server given the resource limitations of the server? |
Currently,
MMSEQS_PATH="/path/to/your/mmseqs2/for_colabfold"
DATABASE_PATH="/mnt/databases"
INPUTFILE="ras_raf.fasta"
OUTPUTDIR="ras_raf"
colabfold_search \
--use-env 1 \
--use-templates 1 \
--db-load-mode 2 \
--db2 pdb100_230517 \
--mmseqs ${MMSEQS_PATH}/bin/mmseqs \
--threads 4 \
${INPUTFILE} \
${DATABASE_PATH} \
${OUTPUTDIR}
Then, use
INPUTFILE="RAS_RAF.a3m"
PDBHITFILE="RAS_RAF_pdb100_230517.m8"
LOCALPDBPATH="/path/to/pdb_mmcif/mmcif_files"
RANDOMSEED=0
colabfold_batch \
--amber \
--templates \
--use-gpu-relax \
--pdb-hit-file ${PDBHITFILE} \
--local-pdb-path ${LOCALPDBPATH} \
--random-seed ${RANDOMSEED} \
${INPUTFILE} \
ras_raf |
That's great. Is there a way of doing this for a batch of proteins with respective .a3m files and .m8 files? |
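One possible way to extend the two commands above to many proteins is a plain shell loop that runs colabfold_batch once per .a3m file and pairs it with its hit file. This is only a sketch: the helper name predict_all is hypothetical, and it assumes each hit file is named <name>_pdb100_230517.m8 next to <name>.a3m, as in the RAS_RAF example. Only flags that already appear in this thread are used.

```shell
#!/bin/sh
# Sketch: one colabfold_batch call per .a3m file, each with its own .m8 file.
# Assumption: hit files are named <name>_pdb100_230517.m8 (as in the example
# above). Override M8_SUFFIX if your files are named differently.

# COLABFOLD_BATCH can be set to "echo" for a dry run that prints the commands.
COLABFOLD_BATCH="${COLABFOLD_BATCH:-colabfold_batch}"
M8_SUFFIX="${M8_SUFFIX:-_pdb100_230517.m8}"

# predict_all MSA_DIR LOCAL_PDB_PATH OUT_DIR   (hypothetical helper)
predict_all() {
    msa_dir=$1
    local_pdb_path=$2
    out_dir=$3
    for a3m in "$msa_dir"/*.a3m; do
        [ -e "$a3m" ] || continue              # directory has no .a3m files
        name=$(basename "$a3m" .a3m)
        m8="$msa_dir/$name$M8_SUFFIX"
        if [ ! -f "$m8" ]; then
            echo "skipping $name: no hit file $m8" >&2
            continue
        fi
        # One result directory per query, to keep outputs separate.
        "$COLABFOLD_BATCH" \
            --templates \
            --pdb-hit-file "$m8" \
            --local-pdb-path "$local_pdb_path" \
            "$a3m" \
            "$out_dir/$name"
    done
}
```

For example, `COLABFOLD_BATCH=echo; predict_all ras_raf /path/to/pdb_mmcif/mmcif_files predictions` prints the commands without running them. The other options from the example (--amber, --use-gpu-relax, --random-seed) can be added to the call in the same way.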
I am trying to use a fasta file with multiple complexes as an input for colabfold_search. This is my code:
I get this error:
Does colabfold_search support multiple sequences as input when generating both .a3m and .m8 files? |
Yes, you can obtain a3m files for multiple inputs. Here is my example. I'm using ColabFold 1.5.5 (a00ce1b).
# I'm using MMseqs2 commit hash 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1
colabfold_search \
--use-env 1 \
--use-templates 1 \
--db-load-mode 2 \
--mmseqs /path/to/your/mmseqs2/bin/mmseqs \
--db2 pdb100_230517 \
--threads 4 \
ras_raf.fasta \
/mnt/databases \
manual_ras_raf
$ ls -lt /mnt/databases
-rw-r--r-- 1 moriwaki staffs 5797891705 May 22 2023 uniref30_2302_db_mapping
-rw-r--r-- 1 moriwaki staffs 667957493 May 22 2023 uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs 64064274015 Jun 13 2023 pdb100_a3m.ffdata
-rw-r--r-- 1 moriwaki staffs 6389810 Jun 13 2023 pdb100_a3m.ffindex
-rw-r--r-- 1 moriwaki staffs 43200163261 Oct 9 17:34 uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs 8910693488 Oct 9 17:35 uniref30_2302_db_h.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 9 17:35 uniref30_2302_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs 5787495369 Oct 9 17:36 uniref30_2302_db
-rw-r--r-- 1 moriwaki staffs 879290728 Oct 9 17:36 uniref30_2302_db.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 9 17:36 uniref30_2302_db.dbtype
-rw-r--r-- 1 moriwaki staffs 83036144795 Oct 9 17:57 uniref30_2302_db_seq
-rw-r--r-- 1 moriwaki staffs 8957791292 Oct 9 17:58 uniref30_2302_db_seq.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 9 17:58 uniref30_2302_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs 8709887243 Oct 9 18:07 uniref30_2302_db_aln
-rw-r--r-- 1 moriwaki staffs 867494002 Oct 9 18:07 uniref30_2302_db_aln.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 9 18:07 uniref30_2302_db_aln.dbtype
lrwxrwxrwx 1 moriwaki staffs 24 Oct 9 18:07 uniref30_2302_db_seq_h.index -> uniref30_2302_db_h.index
lrwxrwxrwx 1 moriwaki staffs 25 Oct 9 18:07 uniref30_2302_db_seq_h.dbtype -> uniref30_2302_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs 18 Oct 9 18:07 uniref30_2302_db_seq_h -> uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs 228709249024 Oct 9 18:20 uniref30_2302_db.idx
-rw-r--r-- 1 moriwaki staffs 506 Oct 9 18:20 uniref30_2302_db.idx.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 9 18:20 uniref30_2302_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs 24 Oct 9 18:21 uniref30_2302_db.idx_mapping -> uniref30_2302_db_mapping
lrwxrwxrwx 1 moriwaki staffs 25 Oct 9 18:21 uniref30_2302_db.idx_taxonomy -> uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs 0 Oct 9 18:21 UNIREF30_READY
-rw-r--r-- 1 moriwaki staffs 25108896515 Oct 10 09:07 colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs 18036930897 Oct 10 09:09 colabfold_envdb_202108_db_h.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 09:09 colabfold_envdb_202108_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs 26732224605 Oct 10 09:14 colabfold_envdb_202108_db
-rw-r--r-- 1 moriwaki staffs 5260769931 Oct 10 09:15 colabfold_envdb_202108_db.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 09:15 colabfold_envdb_202108_db.dbtype
-rw-r--r-- 1 moriwaki staffs 92749953996 Oct 10 09:46 colabfold_envdb_202108_db_seq
-rw-r--r-- 1 moriwaki staffs 18917335740 Oct 10 09:49 colabfold_envdb_202108_db_seq.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 09:49 colabfold_envdb_202108_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs 27929446713 Oct 10 09:57 colabfold_envdb_202108_db_aln
-rw-r--r-- 1 moriwaki staffs 5214433987 Oct 10 09:58 colabfold_envdb_202108_db_aln.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 09:58 colabfold_envdb_202108_db_aln.dbtype
-rw-r--r-- 1 moriwaki staffs 1907 Oct 10 11:23 colabfold_envdb_202108_db.idx.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 11:23 colabfold_envdb_202108_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs 33 Oct 10 13:38 colabfold_envdb_202108_db_seq_h.index -> colabfold_envdb_202108_db_h.index
lrwxrwxrwx 1 moriwaki staffs 34 Oct 10 13:39 colabfold_envdb_202108_db_seq_h.dbtype -> colabfold_envdb_202108_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs 27 Oct 10 13:39 colabfold_envdb_202108_db_seq_h -> colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs 562358472704 Oct 16 01:28 colabfold_envdb_202108_db.idx
-rw-r--r-- 1 moriwaki staffs 0 Oct 10 13:40 COLABDB_READY
-rw-r--r-- 1 moriwaki staffs 25 Oct 10 13:47 pdb100_230517.source
-rw-r--r-- 1 moriwaki staffs 27989933 Oct 10 13:47 pdb100_230517_h
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 13:47 pdb100_230517_h.dbtype
-rw-r--r-- 1 moriwaki staffs 65092975 Oct 10 13:47 pdb100_230517
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 13:47 pdb100_230517.dbtype
-rw-r--r-- 1 moriwaki staffs 6279753 Oct 10 13:47 pdb100_230517.index
-rw-r--r-- 1 moriwaki staffs 6116273 Oct 10 13:47 pdb100_230517_h.index
-rw-r--r-- 1 moriwaki staffs 5178372 Oct 10 13:47 pdb100_230517.lookup
-rw-r--r-- 1 moriwaki staffs 1443213312 Oct 10 13:47 pdb100_230517.idx
-rw-r--r-- 1 moriwaki staffs 383 Oct 10 13:47 pdb100_230517.idx.index
-rw-r--r-- 1 moriwaki staffs 4 Oct 10 13:47 pdb100_230517.idx.dbtype
-rw-r--r-- 1 moriwaki staffs 0 Oct 10 13:47 PDB_READY
-rw-r--r-- 1 moriwaki staffs 0 Oct 10 13:54 PDB100_READY
Then, I obtained the a3m and m8 files in the
Note that |
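Before starting a long search, it can help to confirm that a database directory looks like the listing above. The following is a minimal sketch with a hypothetical helper, check_colabfold_dbs. It only checks for the *_READY marker files and a few database files named in that listing. Whether colabfold_search itself consults these markers is not stated in this thread, so treat this as a sanity check, not a guarantee.

```shell
#!/bin/sh
# Sketch: verify that a ColabFold database directory contains the marker and
# database files shown in the listing above. The file list is taken from that
# listing; other database versions will use different names.

# check_colabfold_dbs DB_DIR   (hypothetical helper)
# Returns 0 if every expected file exists, non-zero otherwise.
check_colabfold_dbs() {
    db_dir=$1
    missing=0
    for f in UNIREF30_READY COLABDB_READY PDB_READY PDB100_READY \
             uniref30_2302_db.dbtype colabfold_envdb_202108_db.dbtype \
             pdb100_230517.dbtype pdb100_a3m.ffdata pdb100_a3m.ffindex; do
        if [ ! -e "$db_dir/$f" ]; then
            echo "missing: $db_dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Usage: `check_colabfold_dbs /mnt/databases && echo "databases look complete"`.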
Thank you for this example. But how would I use an input file like this?
|
@YoshitakaMo Thank you for fixing this. I have noticed another related issue, where the templates that are picked at the beginning of the prediction are different when I use I use this input:
When using
However, when doing:
then this is the
Clearly the templates are similar between the methods and the resulting predictions are also similar in this instance, but I had cases where the predicted structures were significantly different. Is this intended behaviour? |
Hi @YoshitakaMo, how long does the search usually take? I followed your instructions (https://qiita.com/Ag_smith/items/bfcf94e701f1e6a2aa90) and installed everything on our HPC cluster, but without loading the databases fully into RAM. I tested it with a few proteins, and the search takes quite a long time (1h+), which I assume is abnormally long. Do you know of anything obvious that could cause these search times? In my case I ran:
colabfold_search \
--use-env 1 \
--use-templates 0 \
--db-load-mode 2 \
--mmseqs /projects/0/prjs0859/ml/algorithms/colabfold/mmseqs/bin/mmseqs \
--threads 8 \
/projects/0/prjs0859/ml/inputs/alphafold/fastas/7XTB_5.fasta \
/projects/2/managed_datasets/AlphaFold_mmseqs2/ \
/projects/0/prjs0859/ml/outputs/msa/
Any input would be greatly appreciated, thanks! |
Long search times are expected for single queries. colabfold_search is intended for larger scale runs with hundreds or thousands of queries. It still works for single queries but doesn’t scale down well. |
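Since colabfold_search is aimed at many queries per run, one way to act on this is to merge per-protein FASTA files into a single input file before searching. The sketch below uses a hypothetical helper, merge_fastas. It assumes the inputs end in .fasta and that the merged file is written outside the input directory, so it is not picked up by its own glob.

```shell
#!/bin/sh
# Sketch: concatenate many FASTA files into one query file so that a single
# colabfold_search run handles all of them.

# merge_fastas FASTA_DIR MERGED_FASTA   (hypothetical helper)
# Prints the number of records (header lines) in the merged file.
merge_fastas() {
    fasta_dir=$1
    merged=$2
    : > "$merged"                              # create or truncate the output
    for f in "$fasta_dir"/*.fasta; do
        [ -e "$f" ] || continue                # directory has no .fasta files
        # "awk 1" copies the file and adds a final newline if it is missing,
        # so a header never gets glued onto the previous sequence.
        awk 1 "$f" >> "$merged"
    done
    grep -c '^>' "$merged"
}
```

The merged file can then be passed to colabfold_search in place of a single-protein FASTA, for example `merge_fastas fastas all_queries.fasta` followed by the search command used elsewhere in this thread.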
Hello, |
@crisdarbellay For 5000 predictions |
This sounds about right. In the paper we show that we ran a proteome with 1.7k proteins in 2h on a 24-core CPU. The server is optimized for low latency for single queries, not for the highest possible throughput. |
So, how is it possible that obtaining an MSA from the server takes mere seconds? Is it just a matter of running colabfold_search on a much larger batch? I have around 6m proteins for which I need to compute MSAs. What am I missing? |
The server takes about one minute-ish per MSA (it can take much longer for long sequences). The MSAs stay cached for a while, so if you request the same sequence again it will not recompute the MSA, but return it from the cache (instantly).
You can reduce sensitivity slightly if you really want to speed up the MSA computation part. |
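For a rough sense of scale, the figure quoted earlier in the thread (1.7k proteins in 2h on a 24-core CPU) can be extrapolated linearly. This is only back-of-the-envelope arithmetic: it assumes runtime grows in proportion to the number of queries on comparable hardware, which the thread does not claim, and it reads "6m" as six million. The helper name estimate_hours is hypothetical.

```shell
#!/bin/sh
# Sketch: naive linear extrapolation from "1700 proteins in 2 hours".
# Real runtimes depend on sequence lengths, hardware and database settings.

# estimate_hours N_QUERIES   (hypothetical helper)
estimate_hours() {
    # hours = N * 2 / 1700
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 2 / 1700 }'
}

estimate_hours 6000000   # prints 7058.8 (about 294 days on one such machine)
```

Under these assumptions a job of that size would need to be spread over many nodes, or run with reduced sensitivity as suggested above.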
Hi! I am in a similar position, where I have to predict thousands of structures. I am considering running Thanks! |
@Nuta0 |
Expected Behavior
I want to create msas for a batch of heterodimers using colabfold_search and predict the structure for the msas using colabfold_batch without using the mmseqs server for msa generation.
Current Behavior
Colabfold_search is generating MSAs as expected. At the moment it looks like colabfold_batch is using the server to generate MSAs again, even though I point it to the already generated a3m files. I am not sure if it is actually connecting to the server, but I do think it is wasting time waiting for something, since it outputs "pending".
Steps to Reproduce (for bugs)
Here is the input I use for colabfold_search
and then colabfold_batch
ColabFold Output (for bugs)
Here are the first few lines of output and log.txt when I run colabfold_batch.
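A quick way to check a run for the symptom described above is to count the lines in log.txt that mention "pending". This is a minimal sketch with a hypothetical helper, count_pending. The exact wording and capitalization of ColabFold's log messages are not shown in this thread, so the match is case-insensitive and the pattern may need adjusting.

```shell
#!/bin/sh
# Sketch: count log lines containing "pending" (any capitalization).
# A non-zero count suggests the run waited on something, as reported above.

# count_pending LOG_FILE   (hypothetical helper)
count_pending() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # being treated as a failure while still printing the count (0).
    grep -ci 'pending' "$1" || true
}
```

Usage: `count_pending /path/to/output_dir/log.txt`, where the path is whatever result directory was passed to colabfold_batch.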