Home
Easy to use protein structure and complex prediction.
By default, ColabFold requests the multiple sequence alignment (MSA) input required for structure prediction from the public MSA server. Using the server is free, but it has a few limitations. For example, we rate-limit requests to ensure that one user does not overload the server, and you should not request MSAs from multiple IPs. Large-scale structure predictions should therefore not be done with the public MSA server. Additionally, you might not be allowed to send your protein sequences to a third-party server.
For this purpose, we developed colabfold_search, which runs the MSA generation step with MMseqs2 locally on your own resources. Read on to learn how to use colabfold_search correctly and get the most speed out of it.
When we developed colabfold_search, we intended it to be used with a large number of query sequences. Skip to the next section if you want to use colabfold_search with a single query or a small number of queries.
MMseqs2's default algorithm for homology search uses a double consecutive k-mer matching strategy. For this, the k-mers of the target databases (i.e., UniRef30 and ColabFoldDB) are generated and stored in an index. This index can be precomputed or generated on the fly.
Precomputing the index and keeping it fully in RAM allows the MSA server to answer MSA requests very quickly. However, the precomputed index must fit entirely in RAM, with some additional RAM to spare for each CPU thread. Currently, we recommend a machine with 2 TB of RAM for this purpose. As such machines are not very common, we do not recommend setting up your own server.
Running MMseqs2 with an index generated on the fly has much smaller resource requirements, because MMseqs2 can generate k-mers for only a subset of the target sequences, run its prefiltering algorithm, and then proceed with the next subset. This comes at the cost of a few minutes of overhead per prefiltering step for the on-the-fly computation. This overhead quickly adds up to a considerable amount, since we tweaked the search procedure to run in multiple iterations and against two databases (UniRef30 and ColabFoldDB).
For on-the-fly searches, we recommend downloading and setting up the databases with the following command:
MMSEQS_NO_INDEX=1 ./setup_databases.sh /path/to/db_folder
MMSEQS_NO_INDEX=1 ensures that we do not generate and store the large precomputed index, which can take a considerable amount of disk space. If you have already called setup_databases.sh without MMSEQS_NO_INDEX=1, you can either delete the .idx* files in /path/to/db_folder (however, DO NOT delete the .index files) or run colabfold_search with MMSEQS_IGNORE_INDEX=1 set (MMSEQS_IGNORE_INDEX=1 colabfold_search ...).
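The clean-up described above can be sketched as a small shell function. This is a minimal sketch, not part of ColabFold: the function name remove_precomputed_index is hypothetical, and it assumes the database folder is flat and that the index files all carry a literal ".idx" in their names. To preview what would be removed, replace the -exec part with -print first.

```shell
#!/bin/sh
# Hypothetical helper (not part of ColabFold): delete the precomputed index
# files (*.idx*) in a database folder and leave the *.index files alone.
remove_precomputed_index() {
    db_folder=$1
    # '*.idx*' only matches names containing the literal ".idx"
    # (for example foo.idx or foo.idx.index). A name such as foo.index
    # does not contain ".idx" and is therefore kept.
    find "$db_folder" -type f -name '*.idx*' -exec rm -f {} +
}

# Usage (the path is the placeholder used in the text above):
# remove_precomputed_index /path/to/db_folder
```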
First step: Prepare a FASTA file (queries.fa) with the sequences to be predicted.
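As an illustration, a queries.fa file could be written like this. The two entries are arbitrary placeholder sequences with made-up names, not real proteins; replace them with your own. Each record is a header line starting with ">" followed by the amino-acid sequence.

```shell
#!/bin/sh
# Write a small FASTA file with two placeholder queries.
# The names and sequences below are illustrative only.
cat > queries.fa <<'EOF'
>query_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV
>query_2
GSHMSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDW
EOF

# Sanity check: count the header lines, one per query sequence.
grep -c '^>' queries.fa
```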
Second step: Call colabfold_search
MMSEQS_IGNORE_INDEX=1 colabfold_search queries.fa /path/to/db_folder output_folder
If you are using database versions other than the defaults currently set in the colabfold_search script, you need to adjust the --db[1..3] parameters.
Last step: Predict structures with colabfold_batch
colabfold_batch output_folder output_predictions_folder
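The search and prediction steps can be chained in one script. The following is a rough sketch under stated assumptions: the function names run and msa_and_predict and the DRY_RUN variable are hypothetical conveniences, not part of ColabFold. Only the two commands shown above are actually invoked, with the same argument order.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 the command is only printed,
# otherwise it is executed as given.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the two steps described above.
# Arguments: query FASTA, database folder, MSA output folder, prediction folder.
msa_and_predict() {
    queries=$1
    db_folder=$2
    msa_dir=$3
    pred_dir=$4
    # Generate MSAs locally, ignoring any precomputed index.
    run env MMSEQS_IGNORE_INDEX=1 colabfold_search "$queries" "$db_folder" "$msa_dir" || return 1
    # Predict structures from the generated MSAs.
    run colabfold_batch "$msa_dir" "$pred_dir"
}

# Usage (paths are the placeholders used in the text above):
# msa_and_predict queries.fa /path/to/db_folder output_folder output_predictions_folder
```

Stopping after a failed search avoids starting colabfold_batch on an incomplete MSA folder.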
For thousands of queries, the overhead described above can be largely ignored. However, for single queries it can become prohibitively large.
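The amortization argument can be made concrete with a little arithmetic. The numbers below are illustrative assumptions only (6 prefiltering steps at 3 minutes of overhead each); they are not measurements, and the function name is hypothetical. The point is that the fixed overhead is shared by all queries in one run.

```shell
#!/bin/sh
# Hypothetical helper: fixed on-the-fly indexing overhead per query, in ms.
# Arguments: number of prefiltering steps, overhead minutes per step,
# number of queries in the run. All example values are assumptions.
per_query_overhead_ms() {
    steps=$1
    minutes_per_step=$2
    n_queries=$3
    echo $(( steps * minutes_per_step * 60000 / n_queries ))
}

per_query_overhead_ms 6 3 1      # single query: 1080000 ms (18 minutes)
per_query_overhead_ms 6 3 5000   # 5000 queries: 216 ms per query
```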
We have implemented a different prefiltering strategy within MMseqs2 that trades high RAM usage for higher CPU usage. This strategy is based on quickly computing ungapped alignments for all diagonals of each query sequence. Compared to the normal double consecutive k-mer matching strategy, it is overhead-free but has a lower total throughput for a large number of queries. Because this is a different algorithm, it will result in different MSAs.
Modify the search procedure above to include the --prefilter-mode 1 parameter; otherwise, proceed in the same way.
MMSEQS_IGNORE_INDEX=1 colabfold_search queries.fa /path/to/db_folder output_folder --prefilter-mode 1
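If the number of queries varies between runs, the choice of strategy could be scripted. This is only a sketch: the function name prefilter_args is hypothetical, and the default cut-off of 10 sequences is an arbitrary assumption rather than a recommendation from ColabFold. Keep in mind that the two strategies produce different MSAs.

```shell
#!/bin/sh
# Hypothetical helper: print "--prefilter-mode 1" when the FASTA file holds
# at most THRESHOLD sequences, and print nothing otherwise.
# The default threshold of 10 is an arbitrary assumption.
prefilter_args() {
    fasta=$1
    threshold=${2:-10}
    # Count header lines; "|| true" keeps going when grep finds none.
    n=$(grep -c '^>' "$fasta" || true)
    if [ "$n" -le "$threshold" ]; then
        echo "--prefilter-mode 1"
    fi
}

# Usage (the substitution is left unquoted on purpose so that it expands
# to two separate arguments, or to nothing):
# MMSEQS_IGNORE_INDEX=1 colabfold_search queries.fa /path/to/db_folder output_folder $(prefilter_args queries.fa)
```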