Workflows
The recommended workflow for assessing the completeness and contamination of genome bins is to use lineage-specific marker sets. This workflow consists of 4 mandatory (M) steps and 1 recommended (R) step:
(M) > checkm tree <bin folder> <output folder>
(R) > checkm tree_qa <output folder>
(M) > checkm lineage_set <output folder> <marker file>
(M) > checkm analyze <marker file> <bin folder> <output folder>
(M) > checkm qa <marker file> <output folder>
The tree command places genome bins into a reference genome tree. All genomes to be analyzed must reside in a single bins directory. CheckM assumes genome bins are in FASTA format with the extension fna, though this can be changed with the -x flag. The tree command can optionally be followed by the tree_qa command, which reports the number of phylogenetically informative marker genes found in each genome bin along with a taxonomic string indicating its approximate placement in the tree. If desired, genome bins with few phylogenetically informative marker genes may be removed to reduce the computational requirements of the following commands. Alternatively, if only genomes from a particular taxonomic group are of interest, these can be moved to a new directory and analyzed separately. The lineage_set command creates a marker file indicating lineage-specific marker sets suitable for evaluating each genome. This marker file is passed to the analyze command in order to identify marker genes and estimate the completeness and contamination of each genome bin. Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin.
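The step-by-step workflow above can be sketched in Python as a list of commands to review before running them. This is a minimal sketch and not part of CheckM: the function name lineage_steps and the example paths (bins, checkm_out, lineage.ms) are hypothetical, and only the subcommands and argument order listed above are used.

```python
import shlex


def lineage_steps(bin_dir, out_dir, marker_file, include_tree_qa=True):
    """Return the lineage-specific workflow as a list of argv lists.

    All names and paths are illustrative; nothing is executed here.
    """
    steps = [["checkm", "tree", bin_dir, out_dir]]
    if include_tree_qa:
        # Recommended (R) step: reports informative marker genes per bin.
        steps.append(["checkm", "tree_qa", out_dir])
    # Remaining mandatory (M) steps, in the order given in the workflow.
    steps += [
        ["checkm", "lineage_set", out_dir, marker_file],
        ["checkm", "analyze", marker_file, bin_dir, out_dir],
        ["checkm", "qa", marker_file, out_dir],
    ]
    return steps


if __name__ == "__main__":
    # Print the commands instead of running them (a dry run).
    for argv in lineage_steps("bins", "checkm_out", "lineage.ms"):
        print(shlex.join(argv))
```

To execute the steps rather than print them, each list could be passed to subprocess.run(argv, check=True), which stops the chain at the first failing command.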
For convenience, the 4 mandatory steps can be executed using:
> checkm lineage_wf <bin folder> <output folder>
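Since CheckM assumes bins carry a particular file extension (fna unless changed with -x), it can help to confirm what the bins directory actually contains before starting a long run. The sketch below is one way to do that. The helper names find_bins and lineage_wf_command are hypothetical, and the sketch assumes lineage_wf accepts the -x flag in the same way as the commands described above.

```python
from pathlib import Path


def find_bins(bin_dir, extension="fna"):
    """Return the sorted files in bin_dir whose suffix is .<extension>."""
    suffix = f".{extension}"
    return sorted(
        p for p in Path(bin_dir).iterdir() if p.is_file() and p.suffix == suffix
    )


def lineage_wf_command(bin_dir, out_dir, extension="fna"):
    """Build the lineage_wf argv list, adding -x only for a non-default extension."""
    argv = ["checkm", "lineage_wf"]
    if extension != "fna":
        # Assumption: lineage_wf takes -x like the individual steps do.
        argv += ["-x", extension]
    argv += [bin_dir, out_dir]
    return argv


if __name__ == "__main__":
    bin_dir = "bins"  # hypothetical directory name
    if Path(bin_dir).is_dir():
        bins = find_bins(bin_dir)
        print(f"{len(bins)} bin file(s) with extension fna in {bin_dir}")
    print(" ".join(lineage_wf_command(bin_dir, "checkm_out")))
```

An empty result from find_bins usually means the bins use a different extension, in which case passing that extension to both helpers keeps the check and the command consistent.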
In some cases it is convenient to analyze all genome bins with the same marker set. A common example would be a set of genomes from the same taxonomic group. The workflow for using a taxon-specific marker set consists of 3 mandatory (M) steps and 1 recommended (R) step:
(R) > checkm taxon_list
(M) > checkm taxon_set <rank> <taxon> <marker file>
(M) > checkm analyze <marker file> <bin folder> <output folder>
(M) > checkm qa <marker file> <output folder>
The taxon_list command produces a table indicating all taxa for which a marker set can be produced. All supported taxa at a given taxonomic rank can be listed by passing the --rank flag to taxon_list:
> checkm taxon_list --rank phylum
The taxon_set command is used to produce marker sets for a specific taxon:
> checkm taxon_set phylum Cyanobacteria cyanobacteria.ms
The marker file produced by the taxon_set command is passed to the analyze command in order to identify marker genes within each genome bin and estimate completeness and contamination. All putative genomes to be analyzed must reside in a single bins directory. CheckM assumes genomes are in FASTA format with the extension fna, though this can be changed with the -x flag. Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin.
For convenience, the above workflow can be executed in a single step:
> checkm taxonomy_wf <rank> <taxon> <bin folder> <output folder>
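The taxon-specific workflow and its single-step equivalent can be sketched side by side in Python. This is an illustrative sketch only: the function names taxonomy_steps and taxonomy_wf_command and the paths bins and checkm_out are hypothetical, while the subcommands, argument order, and the Cyanobacteria example come from the text above.

```python
import shlex


def taxonomy_steps(rank, taxon, marker_file, bin_dir, out_dir):
    """Return the taxon-specific workflow as a list of argv lists."""
    return [
        # Recommended (R) step: list the taxa available at this rank.
        ["checkm", "taxon_list", "--rank", rank],
        # Mandatory (M) steps.
        ["checkm", "taxon_set", rank, taxon, marker_file],
        ["checkm", "analyze", marker_file, bin_dir, out_dir],
        ["checkm", "qa", marker_file, out_dir],
    ]


def taxonomy_wf_command(rank, taxon, bin_dir, out_dir):
    """Return the single command that covers the three mandatory steps."""
    return ["checkm", "taxonomy_wf", rank, taxon, bin_dir, out_dir]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in taxonomy_steps(
        "phylum", "Cyanobacteria", "cyanobacteria.ms", "bins", "checkm_out"
    ):
        print(shlex.join(argv))
    print(shlex.join(taxonomy_wf_command("phylum", "Cyanobacteria", "bins", "checkm_out")))
```

Note that taxonomy_wf takes no marker file argument, so the explicit steps are the better fit when the marker file produced by taxon_set needs to be kept or reused.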
CheckM supports using custom marker genes for assessing genome completeness and contamination. The desired marker genes must be specified as hidden Markov models (HMMs) constructed with HMMER. Genome quality can be assessed using these marker genes as follows:
> checkm analyze <custom HMM file> <bin folder> <output folder>
> checkm qa <custom HMM file> <output folder>
This HMM file is passed to the analyze command in order to identify marker genes and estimate the completeness and contamination of each genome bin. Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin.
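Before running the custom workflow, it can be useful to confirm how many marker genes the HMM file actually contains. The sketch below counts profiles and builds the two commands shown above. It assumes the file is in HMMER3 ASCII format, where each profile has one header line beginning with NAME. The function names count_hmm_models and custom_marker_steps, and the file name custom.hmm, are hypothetical.

```python
from pathlib import Path


def count_hmm_models(hmm_path):
    """Count profiles in a HMMER3 ASCII file by counting NAME header lines.

    Assumption: one 'NAME  <id>' header line per profile.
    """
    count = 0
    with open(hmm_path) as handle:
        for line in handle:
            if line.startswith("NAME "):
                count += 1
    return count


def custom_marker_steps(hmm_file, bin_dir, out_dir):
    """Return the two commands of the custom marker gene workflow."""
    return [
        ["checkm", "analyze", hmm_file, bin_dir, out_dir],
        ["checkm", "qa", hmm_file, out_dir],
    ]


if __name__ == "__main__":
    hmm_file = "custom.hmm"  # hypothetical file name
    if Path(hmm_file).is_file():
        print(f"{hmm_file}: {count_hmm_models(hmm_file)} marker model(s)")
    for argv in custom_marker_steps(hmm_file, "bins", "checkm_out"):
        print(" ".join(argv))
```

A count that differs from the expected number of marker genes is a cheap early signal that the wrong file, or a truncated one, is about to be used.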
CheckM tends to underestimate the completeness of CPR (Patescibacteria) genomes because they often lack many of the 104 "universal" bacterial marker genes used by CheckM. The 43 marker genes proposed by Brown et al. (2015) likely provide improved estimates of the quality of CPR genomes. This marker set has been provided by Adarsh Singh and Nate Cira for use in CheckM and is available in the custom_marker_sets directory. It can be applied to CPR genomes as follows:
> checkm analyze cpr_43_markers.hmm <bin folder> <output folder>
> checkm qa cpr_43_markers.hmm <output folder>
As an example, this marker set was applied to two CPR genomes:
- Candidatus Roizmanbacteria bacterium GCA_001790095.1, which has an estimated completeness of 95.4% with the CPR marker genes and 72.1% with the 104 bacterial marker genes
- Candidatus Paceibacter normanii GCA_000398025.1, which has an estimated completeness of 69.8% with the CPR marker genes and 58.6% with the 104 bacterial marker genes
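The size of the difference between the two marker sets in these examples can be worked out directly. The short sketch below uses only the four completeness values quoted above; the dictionary keys and the function name completeness_gain are hypothetical labels chosen for illustration.

```python
# Completeness estimates (%) quoted in the two examples above.
estimates = {
    "GCA_001790095.1": {"cpr_43": 95.4, "bacteria_104": 72.1},
    "GCA_000398025.1": {"cpr_43": 69.8, "bacteria_104": 58.6},
}


def completeness_gain(values):
    """Percentage-point difference between the CPR and bacterial estimates."""
    return round(values["cpr_43"] - values["bacteria_104"], 1)


if __name__ == "__main__":
    for accession, values in estimates.items():
        print(f"{accession}: +{completeness_gain(values)} percentage points")
```

For these two genomes the CPR marker set raises the completeness estimate by 23.3 and 11.2 percentage points respectively.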
When using custom marker sets, note that the output for marker lineage is N/A, # genomes is -1 (indicating that it is unknown how many genomes were used to create the custom marker set), and # marker sets is 1.