
k mers hashes rescue

Téo Lemane edited this page Oct 22, 2021 · 4 revisions

In kmtricks, k-mer filtering leverages k-mer abundances across samples. The following parameters control this procedure.

  • --hard-min INT: All k-mers with an abundance less than this parameter are discarded.
  • --soft-min INT/STR/FLOAT: All k-mers with an abundance between hard-min and soft-min are considered rescue-able. Instead of a single integer, you can provide the path to a file containing one threshold per line, in the same order as in the input fof. You can also provide a float. In this case, one specific threshold T per sample is computed such that the number of k-mers occurring T times is smaller than VALUE x nb_kmers, where VALUE is the given float.
  • --share-min INT: If a k-mer is rescue-able, it is conserved if it is solid (with an abundance greater than soft-min) in at least share-min other sample(s).
  • --recurrence-min INT: All k-mers that do not occur in at least recurrence-min sample(s) are discarded.
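The float form of --soft-min can be illustrated with a minimal sketch. It is not the kmtricks implementation: the function name `soft_min_from_ratio` is hypothetical, and it assumes that nb_kmers is the number of distinct k-mers in the sample and that T is the smallest abundance satisfying the condition. The description above does not state either point.

```python
from collections import Counter


def soft_min_from_ratio(abundances, ratio):
    """Return the smallest T such that the number of k-mers occurring
    exactly T times is smaller than ratio * nb_kmers.

    abundances: one abundance per distinct k-mer of a single sample.
    ratio: the float passed as soft-min (VALUE in the text above).
    """
    if not abundances or ratio <= 0:
        raise ValueError("need a non-empty sample and a positive ratio")

    hist = Counter(abundances)       # abundance -> number of k-mers
    limit = ratio * len(abundances)  # VALUE x nb_kmers (assumed: distinct k-mers)

    t = 1
    # Assumption: scan upward and stop at the first abundance that is rare enough.
    while hist.get(t, 0) >= limit:
        t += 1
    return t


# Toy sample: 6 k-mers seen once, 3 seen twice, 1 seen three times.
sample = [1] * 6 + [2] * 3 + [3]
print(soft_min_from_ratio(sample, 0.25))  # limit is 2.5, so T = 3
```

With one such call per sample, each sample gets its own threshold, which is what makes the soft-min sample-specific.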

The figure below shows an example of the rescue procedure using sample-specific soft-min and the following parameters: hard-min 1, share-min 3 and recurrence-min 2.

  • H1 has an abundance lower than 3 in D0 but it is solid in at least share-min samples (D2, D3, D4). It is therefore conserved in D0 (right part of the figure).
  • H2 is non-solid in D1, D3 and D4 and is solid in only 2 samples, fewer than share-min. H2 is therefore discarded in D1, D3 and D4.
  • H3 is solid in only one sample. Since recurrence-min cannot be satisfied, the whole row is discarded (shown as dash signs in the figure; in hash mode, this corresponds to the null bit-vector).
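The three cases above can be reproduced with a short sketch of the rescue logic. This is an illustration under stated assumptions, not the kmtricks code: `rescue_row` is a hypothetical helper, "solid" is taken to mean an abundance of at least soft-min, and recurrence-min is checked on the samples that remain after rescue. The abundances and soft-min values are made up to match the descriptions of H1, H2 and H3. They are not read from the figure.

```python
def rescue_row(counts, soft_min, hard_min, share_min, recurrence_min):
    """Filter one k-mer (or hash) row; counts holds one abundance per sample."""
    # Assumption: solid means abundance >= the sample's own soft-min.
    solid = [c >= hard_min and c >= s for c, s in zip(counts, soft_min)]
    n_solid = sum(solid)

    kept = []
    for c, is_solid in zip(counts, solid):
        if c < hard_min:
            kept.append(0)  # below hard-min: always discarded
        elif is_solid:
            kept.append(c)  # solid: always kept
        else:
            # Rescue-able: kept only if solid in at least share-min other
            # samples (this sample is not solid, so n_solid counts others only).
            kept.append(c if n_solid >= share_min else 0)

    # Assumption: recurrence-min applies to the samples kept after rescue.
    if sum(1 for c in kept if c > 0) < recurrence_min:
        return [0] * len(counts)  # whole row discarded
    return kept


# Hypothetical soft-min per sample (D0..D4) and hypothetical abundances.
soft_min = [3, 2, 3, 4, 2]
rows = {
    "H1": [2, 0, 5, 4, 6],
    "H2": [4, 1, 3, 2, 1],
    "H3": [1, 5, 1, 1, 0],
}
for name, counts in rows.items():
    print(name, rescue_row(counts, soft_min, hard_min=1, share_min=3, recurrence_min=2))
# H1 [2, 0, 5, 4, 6]   rescued in D0
# H2 [4, 0, 3, 0, 0]   discarded in D1, D3 and D4
# H3 [0, 0, 0, 0, 0]   whole row discarded
```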