Colabfold_batch connects to mmseqs server despite msa input #563

Open
Nuta0 opened this issue Jan 29, 2024 · 18 comments

Nuta0 commented Jan 29, 2024

Expected Behavior

I want to create MSAs for a batch of heterodimers using colabfold_search and then predict structures from those MSAs with colabfold_batch, without using the MMseqs2 server for MSA generation.

Current Behavior

colabfold_search generates the MSAs as expected. However, colabfold_batch appears to use the server to generate MSAs again, even though I point it at the already generated a3m files. I am not sure whether it actually connects to the server, but it does seem to waste time waiting for something, since it reports PENDING.

Steps to Reproduce (for bugs)

Here is the input I use for colabfold_search

module load Miniconda3/22.11.1-1
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
module load MMseqs2/15-6f452
colabfold_search input.fasta /data/gpfs/datasets/mmseqs/uniref30_2302 msas

and then colabfold_batch

module load Miniconda3/22.11.1-1
module load CUDA/12.2.0
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
colabfold_batch --amber --templates --use-gpu-relax msas predictions

ColabFold Output (for bugs)

Here are the first few lines of output and log.txt when I run colabfold_batch.

2024-01-26 12:25:04.257622: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-26 12:25:05.750801: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
SUBMIT → PENDING → COMPLETE: 100%|██████████| 300/300 [elapsed: 00:14 remaining: 00:00]
SUBMIT → PENDING:   0%|          | 0/300 [elapsed: 00:01 remaining: ?]
2024-01-26 12:25:00,517 Running colabfold 1.5.5 (941feece178db14c9af1580eefbf4a8fe4e5b5af)
2024-01-26 12:25:05,724 Running on GPU
2024-01-26 12:25:13,220 Found 9 citations for tools or databases
2024-01-26 12:25:13,220 Query 1/100: Heterodimer (length 261)
2024-01-26 12:25:17,176 Sleeping for 8s. Reason: PENDING
2024-01-26 12:26:05,156 Sequence 0 found templates: ['Xxx8']
2024-01-26 12:26:05,156 Sequence 1 found no templates

YoshitakaMo (Collaborator) commented Jan 29, 2024

If you turn on the --templates arg, colabfold_batch will try to connect to the MSA server to obtain template PDB hits. Since the effect of template structures is small in many cases (except when only limited MSAs are available), you can turn off --templates.
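
For example, a minimal sketch of the non-template run on the precomputed MSAs, reusing the msas/predictions directory names from the commands above (adjust paths for your setup):

colabfold_batch --amber --use-gpu-relax msas predictions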

Nuta0 (Author) commented Jan 29, 2024

Is there a way of obtaining template PDB files locally? And if not, is it feasible to obtain templates for many predictions (>10000) from the MSA server, given its resource limitations?

YoshitakaMo (Collaborator) commented Jan 29, 2024

Currently, colabfold_search can generate a list of template PDB hits together with the MSA files. For example,

MMSEQS_PATH="/path/to/your/mmseqs2/for_colabfold"
DATABASE_PATH="/mnt/databases"
INPUTFILE="ras_raf.fasta"
OUTPUTDIR="ras_raf"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --db2 pdb100_230517 \
  --mmseqs ${MMSEQS_PATH}/bin/mmseqs \
  --threads 4 \
  ${INPUTFILE} \
  ${DATABASE_PATH} \
  ${OUTPUTDIR}

Then, use colabfold_batch with --pdb-hit-file PDBHITFILE, where PDBHITFILE is the .m8 file generated by colabfold_search. Note that an mmCIF file database (/path/to/pdb_mmcif/mmcif_files) is required on your machine, just like for the original AlphaFold2.

INPUTFILE="RAS_RAF.a3m"
PDBHITFILE="RAS_RAF_pdb100_230517.m8"
LOCALPDBPATH="/path/to/pdb_mmcif/mmcif_files"
RANDOMSEED=0

colabfold_batch \
  --amber \
  --templates \
  --use-gpu-relax \
  --pdb-hit-file ${PDBHITFILE} \
  --local-pdb-path ${LOCALPDBPATH} \
  --random-seed ${RANDOMSEED} \
  ${INPUTFILE} \
  ras_raf

Nuta0 (Author) commented Jan 29, 2024

That's great. Is there a way of doing this for a batch of proteins with their respective .a3m and .m8 files?

Nuta0 (Author) commented Jan 30, 2024

I am trying to use a fasta file with multiple complexes as an input for colabfold_search. This is my code:

module load Miniconda3/22.11.1-1
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
module load MMseqs2/15-6f452

DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db2 pdb100_230517 \
  --threads 16 \
  input/${input_file} \
  ${DATABASE_PATH} \
  msas

I get this error:

Traceback (most recent call last):
  File "/data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/mmseqs/search.py", line 385, in main
    os.rename(
FileNotFoundError: [Errno 2] No such file or directory: 'msas/pdb100_230517.m8' -> 'msas/Complex_2_pdb100_230517.m8'

Does colabfold_search support multiple sequences as input when generating both .a3m and .m8 files?

YoshitakaMo (Collaborator) commented Jan 31, 2024

Does colabfold_search support multiple sequences as input when generating both .a3m and .m8 files?

Yes, you can obtain a3m files for multiple inputs.

Here is my example. I'm using ColabFold 1.5.5 (a00ce1b).

  • Input ras_raf.fasta
>RAS_RAF
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDL
AARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPG
CMSCKCVLS:
PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLIGEELQVDFL
  • command
# I'm using MMseqs2 commit hash 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1 
colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --mmseqs /path/to/your/mmseqs2/bin/mmseqs \ 
  --db2 pdb100_230517 \
  --threads 4 \
  ras_raf.fasta \
  /mnt/databases \
  manual_ras_raf
  • /mnt/databases directory
$ ls -lt /mnt/databases

-rw-r--r-- 1 moriwaki staffs   5797891705 May 22  2023 uniref30_2302_db_mapping
-rw-r--r-- 1 moriwaki staffs    667957493 May 22  2023 uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs  64064274015 Jun 13  2023 pdb100_a3m.ffdata
-rw-r--r-- 1 moriwaki staffs      6389810 Jun 13  2023 pdb100_a3m.ffindex
-rw-r--r-- 1 moriwaki staffs  43200163261 Oct  9 17:34 uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs   8910693488 Oct  9 17:35 uniref30_2302_db_h.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:35 uniref30_2302_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs   5787495369 Oct  9 17:36 uniref30_2302_db
-rw-r--r-- 1 moriwaki staffs    879290728 Oct  9 17:36 uniref30_2302_db.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:36 uniref30_2302_db.dbtype
-rw-r--r-- 1 moriwaki staffs  83036144795 Oct  9 17:57 uniref30_2302_db_seq
-rw-r--r-- 1 moriwaki staffs   8957791292 Oct  9 17:58 uniref30_2302_db_seq.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:58 uniref30_2302_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs   8709887243 Oct  9 18:07 uniref30_2302_db_aln
-rw-r--r-- 1 moriwaki staffs    867494002 Oct  9 18:07 uniref30_2302_db_aln.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 18:07 uniref30_2302_db_aln.dbtype
lrwxrwxrwx 1 moriwaki staffs           24 Oct  9 18:07 uniref30_2302_db_seq_h.index -> uniref30_2302_db_h.index
lrwxrwxrwx 1 moriwaki staffs           25 Oct  9 18:07 uniref30_2302_db_seq_h.dbtype -> uniref30_2302_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs           18 Oct  9 18:07 uniref30_2302_db_seq_h -> uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs 228709249024 Oct  9 18:20 uniref30_2302_db.idx
-rw-r--r-- 1 moriwaki staffs          506 Oct  9 18:20 uniref30_2302_db.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 18:20 uniref30_2302_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs           24 Oct  9 18:21 uniref30_2302_db.idx_mapping -> uniref30_2302_db_mapping
lrwxrwxrwx 1 moriwaki staffs           25 Oct  9 18:21 uniref30_2302_db.idx_taxonomy -> uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs            0 Oct  9 18:21 UNIREF30_READY
-rw-r--r-- 1 moriwaki staffs  25108896515 Oct 10 09:07 colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs  18036930897 Oct 10 09:09 colabfold_envdb_202108_db_h.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:09 colabfold_envdb_202108_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs  26732224605 Oct 10 09:14 colabfold_envdb_202108_db
-rw-r--r-- 1 moriwaki staffs   5260769931 Oct 10 09:15 colabfold_envdb_202108_db.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:15 colabfold_envdb_202108_db.dbtype
-rw-r--r-- 1 moriwaki staffs  92749953996 Oct 10 09:46 colabfold_envdb_202108_db_seq
-rw-r--r-- 1 moriwaki staffs  18917335740 Oct 10 09:49 colabfold_envdb_202108_db_seq.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:49 colabfold_envdb_202108_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs  27929446713 Oct 10 09:57 colabfold_envdb_202108_db_aln
-rw-r--r-- 1 moriwaki staffs   5214433987 Oct 10 09:58 colabfold_envdb_202108_db_aln.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:58 colabfold_envdb_202108_db_aln.dbtype
-rw-r--r-- 1 moriwaki staffs         1907 Oct 10 11:23 colabfold_envdb_202108_db.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 11:23 colabfold_envdb_202108_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs           33 Oct 10 13:38 colabfold_envdb_202108_db_seq_h.index -> colabfold_envdb_202108_db_h.index
lrwxrwxrwx 1 moriwaki staffs           34 Oct 10 13:39 colabfold_envdb_202108_db_seq_h.dbtype -> colabfold_envdb_202108_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs           27 Oct 10 13:39 colabfold_envdb_202108_db_seq_h -> colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs 562358472704 Oct 16 01:28 colabfold_envdb_202108_db.idx
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:40 COLABDB_READY
-rw-r--r-- 1 moriwaki staffs           25 Oct 10 13:47 pdb100_230517.source
-rw-r--r-- 1 moriwaki staffs     27989933 Oct 10 13:47 pdb100_230517_h
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517_h.dbtype
-rw-r--r-- 1 moriwaki staffs     65092975 Oct 10 13:47 pdb100_230517
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517.dbtype
-rw-r--r-- 1 moriwaki staffs      6279753 Oct 10 13:47 pdb100_230517.index
-rw-r--r-- 1 moriwaki staffs      6116273 Oct 10 13:47 pdb100_230517_h.index
-rw-r--r-- 1 moriwaki staffs      5178372 Oct 10 13:47 pdb100_230517.lookup
-rw-r--r-- 1 moriwaki staffs   1443213312 Oct 10 13:47 pdb100_230517.idx
-rw-r--r-- 1 moriwaki staffs          383 Oct 10 13:47 pdb100_230517.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517.idx.dbtype
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:47 PDB_READY
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:54 PDB100_READY

Then I obtained the a3m and m8 files in the manual_ras_raf directory; RAS_RAF_pdb100_230517.m8 contains:

101	7kyz_A	0.856	188	26	1	1	188	1	187	3.275E-63	215	167M1I20M
101	2mse_B	0.848	185	28	0	1	185	1	185	1.583E-62	213	185M
101	7tlk_B	0.934	167	11	0	1	167	1	167	4.075E-62	212	167M
101	7t1f_A	0.923	169	13	0	1	169	1	169	4.075E-62	212	169M
...
...
101	6pgo_B	0.804	169	16	2	1	169	1	152	1.592E-46	167	31M7I20M10I101M
101	4m1s_C	0.796	167	15	2	1	167	1	148	2.987E-46	166	28M10I23M9I97M
101	6o62_A	0.299	167	109	3	5	170	8	167	7.677E-46	165	27M6I68M1I15M1D49M
101	4m21_C	0.789	166	15	2	2	167	1	146	6.945E-45	162	27M11I22M9I97M
102	4g3x_B	1.000	77	0	0	5	81	1	77	1.003E-28	110	77M
102	3kud_B	0.986	76	1	0	6	81	1	76	4.892E-28	108	76M
102	3kuc_B	0.973	76	2	0	6	81	1	76	9.221E-28	107	76M
102	1rrb_A	0.986	76	1	0	6	81	1	76	1.266E-27	107	76M
...
...
102	2mse_D	0.578	76	29	1	6	81	1	73	5.603E-22	91	47M3I26M
102	2mse_D	0.578	76	29	1	6	81	1	73	5.603E-22	91	47M3I26M
102	5yxi_A	0.500	74	37	0	5	78	3	76	1.561E-18	81	74M
102	6ntd_B	0.733	75	9	1	5	79	1	64	9.674E-17	76	47M11I17M
102	6ntc_B	0.706	75	10	2	6	80	1	63	9.682E-13	64	17M3I24M9I22M

Note that 101 and 102 represent the first and second sequence in the input fasta file, respectively.
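
As a minimal sketch (assuming standard awk is available), you could count how many template hits each query index received in that .m8 file:

# count hits per query index (first column of the .m8 file)
awk '{count[$1]++} END {for (q in count) print q, count[q]}' manual_ras_raf/RAS_RAF_pdb100_230517.m8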

Nuta0 (Author) commented Jan 31, 2024

Thank you for this example. But how would I use an input file like this?

>complex_a
MFAWVSVSQSYGVIEILKDIMNKVMGIKKKGTNTGITVEDFEQMGEEEVRQHLHDFLRDKKYLVVMDDVWTVDVWRQIHQIFPNVNNGSRILLTTRNMEVARHAEPWIPPHEPHLLNDTHSLELFCRKAFPANQDVPTELEPLSQKLAKR:
MCGGLPLALVVLGGLMSRKDPSYDTWLRVAQSMNWESSGEGQECLGILGLSYNDLPYQLKPCFLYITAFPEDSIIPVSKLARLWIAEGFILEEQRQTMEDTARDWLDELVQRCMIQVVKRSVTRGRVKSIRIHDMLRDFGLLEARKDGFLHVCSTDA
>complex_b
MVVSSHRVAFHDRINEEVAVSSPHLRTLLGSNLILTNAGRFLNGLNLLRVLDLEGARDLKKLPKQMGNMIHLRYLGLRRTGLKRLPSSIGHLLNLQTLDARGTYISWLPKSFWKIRTLRYVYINILAFLSAPIIG:
MDHKNLQALKITWINVDVMDMIRLGGIRFIKNWVTTSDSAEMAYERIFSESFGKSLEKMDSLVSLNMYVKELPKDIFFAHARPLPKLRSLYLGG
>complex_c
MSFQQQQLPDITQFPPNLTKLILISFHLEQDPMPVLEKLPNLRLLELCGAYHGKSMSC:
MSAGGFPRLQHLILEDLYDLEAWRVEVGAMPRLTNLTIRWCGMLKMLPEGLQHVTTVRELKLIDMPREFSDKVRSEDGYKVTHPLHYY

@YoshitakaMo (Collaborator)

I've pushed a fix for this issue, @Nuta0. See #567.
Please update your (local)ColabFold and try using the CSV format for input.

Nuta0 (Author) commented Feb 22, 2024

@YoshitakaMo Thank you for fixing this.

I have noticed another related issue: the templates picked at the beginning of the prediction differ when I use colabfold_batch directly, compared to when I run colabfold_search first and then colabfold_batch.

I use this input:

id,sequence
3kud,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH:PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLIGEELQVDFL
ras,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS
1BJP_2,PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR
1BJP_ras,PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS

When using colabfold_batch --templates --amber --use-gpu-relax input.csv prediction, these are the contents of the 1BJP_2_template_domain_names.json file:

{"A": ["3mb2_C", "2fm7_B", "4fdx_A", "3ry0_B", "1bjp_A", "6fps_P", "4faz_C", "7m59_B", "6bgn_C", "1otf_D", "3abf_B", "5clo_C", "6fps_R", "7xuy_A", "5cln_I", "6blm_A", "7puo_F", "2op8_A", "4x1c_F", "6blm_A"], "B": ["3mb2_C", "2fm7_B", "4fdx_A", "3ry0_B", "1bjp_A", "6fps_P", "4faz_C", "7m59_B", "6bgn_C", "1otf_D", "3abf_B", "5clo_C", "6fps_R", "7xuy_A", "5cln_I", "6blm_A", "7puo_F", "2op8_A", "4x1c_F", "6blm_A"]}

However, when doing:

MMSEQS_PATH="/apps/easybuild-2022/easybuild/software/MPI/GCC/11.3.0/OpenMPI/4.1.4/MMseqs2/15-6f452/bin/mmseqs"
DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"
INPUTFILE="input.csv"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --mmseqs ${MMSEQS_PATH} \
  --db2 pdb100_230517 \
  --threads 4 \
  ${INPUTFILE} \
  ${DATABASE_PATH} \
  msas
INPUTFILE="1BJP_2.a3m"
PDBHITFILE="1BJP_2_pdb100_230517.m8"
LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0

colabfold_batch \
  --amber \
  --templates \
  --use-gpu-relax \
  --pdb-hit-file ${PDBHITFILE} \
  --local-pdb-path ${LOCALPDBPATH} \
  --random-seed ${RANDOMSEED} \
  ${INPUTFILE} \
  prediction

then this is the 1BJP_2_template_domain_names.json file:

{"A": ["4x1c_H", "1bjp_B", "1bjp_A", "1bjp_E", "6fps_N", "6fps_Q", "6fps_P", "7xuy_A", "3ry0_B", "3ry0_A", "2op8_B", "2op8_A", "7puo_F", "7puo_C", "4x1c_G", "7puo_B", "7puo_D", "7puo_E", "7puo_A", "4faz_C"], "B": ["4x1c_H", "1bjp_B", "1bjp_A", "1bjp_E", "6fps_N", "6fps_Q", "6fps_P", "7xuy_A", "3ry0_B", "3ry0_A", "2op8_B", "2op8_A", "7puo_F", "7puo_C", "4x1c_G", "7puo_B", "7puo_D", "7puo_E", "7puo_A", "4faz_C"]}

Clearly the templates are similar between the two methods, and the resulting predictions are also similar in this instance, but I have had cases where the predicted structures were significantly different. Is this intended behaviour?

@Cryptheon

Hi @YoshitakaMo,

How long does the search usually take? I followed your instructions (https://qiita.com/Ag_smith/items/bfcf94e701f1e6a2aa90) and installed everything on an HPC system, though without loading the database fully into RAM. I tested it with a few proteins, but the search takes quite a long time (1h+); I assume this is abnormally long.

Do you know if there might be something obvious that could explain these search times? In my case I ran:

colabfold_search --use-env 1 --use-templates 0 --db-load-mode 2 --mmseqs /projects/0/prjs0859/ml/algorithms/colabfold/mmseqs/bin/mmseqs --threads 8 /projects/0/prjs0859/ml/inputs/alphafold/fastas/7XTB_5.fasta /projects/2/managed_datasets/AlphaFold_mmseqs2/ /projects/0/prjs0859/ml/outputs/msa/

Any input would be greatly appreciated, thanks!

@milot-mirdita (Collaborator)

Long search times are expected for single queries. colabfold_search is intended for larger scale runs with hundreds or thousands of queries.

It still works for single queries but doesn’t scale down well.
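
As a minimal sketch, batching many single-sequence FASTA files into one query file lets a single colabfold_search run amortize the database prefilter cost (file and directory names here are placeholders):

cat fastas/*.fasta > all_queries.fasta
colabfold_search --threads 16 all_queries.fasta /path/to/databases msas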

@crisdarbellay

Hello,
I followed the instructions at https://qiita.com/Ag_smith/items/bfcf94e701f1e6a2aa90, but I still see long search times (~1h). I have a lot of RAM (~750GB), so I should be able to reproduce roughly the same speed as the ColabFold server, right? I have around 5,000 predictions to make. How could I optimize the run and search time?
Thank you for your work!

Nuta0 (Author) commented Apr 17, 2024

@crisdarbellay For 5000 predictions colabfold_search takes around 6 h with 16 CPU cores in my tests.

@milot-mirdita (Collaborator)

This sounds about right; in the paper we show that we ran a proteome with 1.7k proteins in 2 h on a 24-core CPU.

The server is optimized for low latency for single queries, not for the highest possible throughput. colabfold_search is intended for that.

@Cryptheon

So how is it possible that obtaining an MSA from the server takes mere seconds? Is it a matter of just using colabfold_search on a much larger batch? I have around 6 million proteins for which I need to compute MSAs. What am I missing?

milot-mirdita (Collaborator) commented Apr 18, 2024

The server takes about a minute per MSA (it can become much longer for long sequences). The MSAs stay cached for a while, so if you request the same sequence again it will not recompute the MSA but return it from the cache (instantly).

colabfold_search's raw throughput is still much better than the server's; it should be much faster than the 6 million * 1 minute the server would take (divided by the number of workers), and much, much faster than running colabfold_search 6 million times. But that still means you will need to throw quite a bit of CPU at the problem of computing 6 million MSAs.

You can reduce the sensitivity slightly if you really want to speed up the MSA computation part.
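
For example, a minimal sketch assuming your colabfold_search build exposes the MMseqs2 sensitivity via an -s flag (check colabfold_search --help for the exact option name and its default in your version):

colabfold_search -s 6 --use-env 1 --use-templates 0 --threads 16 \
  queries.fasta /path/to/databases msas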

@ahof1704

Hi!

I am in a similar position, where I have to predict thousands of structures. I am considering running colabfold_search first to speed up the process. If I understand correctly, colabfold_search doesn't need GPUs, right? Could I compute the MSAs on a CPU node with many cores and large RAM and then move them to a GPU node for structure prediction?
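
Concretely, something like this minimal sketch (assuming a shared filesystem between the nodes; paths and thread counts are placeholders):

# CPU node: MSA generation only, no GPU needed
colabfold_search --threads 32 input.fasta /path/to/databases msas
# GPU node: structure prediction from the precomputed MSAs
colabfold_batch --use-gpu-relax msas predictions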

Thanks!


crisdarbellay commented May 2, 2024

@Nuta0
Could I see an example where you run predictions on that many queries? Are you using multiple FASTA files or one FASTA file with all the sequences?
I think I'm doing something wrong...
