Hi,
I ran KmerGenie on a merged data file (concatenated from six individual FASTQ files: three R1 files and their three corresponding paired-end R2 files):
$ ./kmergenie -k 40 -l 16 -s 2 -t 16 -o 16-40histo merged.fq.gz
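(As a side note on the merge step: concatenating gzipped FASTQ files directly is valid, because a sequence of gzip members is itself a valid gzip stream. A minimal sketch, using two tiny placeholder FASTQ records rather than real data:)

```python
# Sanity check that concatenated gzip files form a valid input stream,
# which is why something like `cat *_R1.fq.gz *_R2.fq.gz > merged.fq.gz`
# produces a file that downstream tools can read.
# The two tiny FASTQ records below are placeholders, not real data.
import gzip

rec1 = b"@read1\nACGTACGT\n+\nIIIIIIII\n"
rec2 = b"@read2\nTTGGCCAA\n+\nIIIIIIII\n"

with open("merged_demo.fq.gz", "wb") as out:
    out.write(gzip.compress(rec1))   # gzip member 1
    out.write(gzip.compress(rec2))   # gzip member 2

# GzipFile transparently reads all members of a multi-member file.
with gzip.open("merged_demo.fq.gz", "rb") as f:
    data = f.read()

print(data.decode(), end="")
```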
and got the results:
Predicted best k: 40
Predicted assembly size: 772582595 bp
Then I ran
$ ./kmergenie -k 60 -l 42 -s 2 -t 16 -o 42-60histo merged.fq.gz
which gave:
Predicted best k: 60
Predicted assembly size: 802168662 bp
Then I ran
$ ./kmergenie -k 120 -l 60 -s 10 -t 16 -o 60-120histo merged.fq.gz
and obtained:
Predicted best k: 120
Predicted assembly size: 831614678 bp
Here are the graphs created by KmerGenie: https://drive.google.com/file/d/0Bx7ogD6DesvEWE1MWVNQU0pGN00/view?usp=sharing
Besides, the k-mer multiplicity graphs seem to be okay, with a clear second peak whose position decreases from around 27 to around 8 as k increases from 16 to 120.
I don't understand why the suggested best k keeps going higher. Is maximizing the total number of distinct k-mers always the right choice, or could there be something wrong with our data or with the parameters we used with KmerGenie? Our paired-end data are from a MiSeq run, with read lengths ranging from 35 to 310 bp and a very sharp peak at 310 bp.
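(For intuition on why the distinct-k-mer count can keep rising with k: once k exceeds what short repeats can hide, each extra repeat copy contributes its own flanking k-mers. This is only a toy sketch with an invented repeat length and copy number, not KmerGenie's actual model, but it shows the monotone trend:)

```python
# Toy illustration: a "genome" with three copies of the same 500 bp
# repeat interspersed in unique sequence. Larger k spans more of each
# repeat's flanks, so more k-mers become distinct as k grows.
# Repeat length, copy number, and sizes are invented for illustration.
import random

random.seed(1)

def rand_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

repeat = rand_seq(500)
genome = (rand_seq(20000) + repeat +
          rand_seq(20000) + repeat +
          rand_seq(20000) + repeat +
          rand_seq(20000))

def distinct_kmers(s, k):
    # Count distinct k-mers (forward strand only, for simplicity).
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

for k in (16, 40, 60, 120):
    print(f"k={k:3d}  distinct k-mers: {distinct_kmers(genome, k)}")
```

With three or more copies of a repeat, the distinct count grows roughly linearly in k (until k approaches the read length), which matches the intuition that higher k resolves repeats better.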
Any suggestions or help would be welcome, thank you very much!
Phuong
I agree with this answer. To complete it: yes, we've found it's wise to use such high k-mer values when KmerGenie suggests them, as they allow for better resolution of repeats.
Thanks a lot!