Sketching paired-end read data #32

mw55309 · 2016-07-26T07:57:59Z

Hi

I note no mention of paired-end data in the docs; sketching with multiple files creates a multi-file sketch, but if I want a single sketch per paired-end sample, how is this done?

I assume I should be able to use sub-shells and/or zcat and pipe to stdin; however, when I did this on our cluster I got core dumps (admittedly, I seem to be having some problems with zcat on our cluster).

At the very least, it may be a good idea to update your docs to show how to sketch paired-end read data.

Cheers
Mick

ondovb · 2016-07-27T00:59:15Z

You have the right idea catting to stdin. I added this step to the tutorial to clarify, but this could certainly be more elegant, maybe with a flag that pools all inputs into one sketch (possibly a behavior of the read flag -r).

If you want to try avoiding zcat, you can cat the gz files directly; mash will inflate each one in-stream.
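
For anyone wondering why plain cat is enough here: the gzip format allows several compressed files to be concatenated, and a decompressor reads the result as one stream. Below is a minimal sketch of that property using only standard tools (no mash involved). The file names and the two tiny FASTQ records are made up for illustration.

```shell
#!/bin/sh
# Demonstrates that concatenated .gz files decompress as one stream.
# File names and read records are made up for illustration only.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two tiny gzipped FASTQ files standing in for a read pair.
printf '@read1/1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/sample_1.fq.gz"
printf '@read1/2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/sample_2.fq.gz"

# Concatenate the compressed bytes, then decompress once.
pooled=$(cat "$tmp/sample_1.fq.gz" "$tmp/sample_2.fq.gz" | gzip -dc)

# Decompress each file separately and concatenate the text.
separate=$(gzip -dc "$tmp/sample_1.fq.gz"; gzip -dc "$tmp/sample_2.fq.gz")

if [ "$pooled" = "$separate" ]; then
    echo "identical"
fi
```

Both routes give the same eight lines of FASTQ, so the stream that reaches stdin after `cat a.gz b.gz` carries the reads of both files in order.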

mw55309 · 2016-08-04T15:12:30Z

I do seem to have this persistent core dump:

zcat SRR1262647_1.fastq.gz | ../mash-Linux64-v1.1/mash sketch -k 21 -r -
Sketching from stdin...
Segmentation fault

Same with cat:

cat SRR1262647_1.fastq.gz | ../mash-Linux64-v1.1/mash sketch -k 21 -r -
Sketching from stdin...
Segmentation fault

Works fine without piping:

../mash-Linux64-v1.1/mash sketch -k 21 -r SRR1262647_1.fastq.gz
Sketching SRR1262647_1.fastq.gz...
Estimated genome size: 7.01971e+08
Estimated coverage: 2.942
Writing to SRR1262647_1.fastq.gz.msh...

Any ideas?

ondovb · 2016-08-09T22:17:19Z

Looks like stdin input was broken in 1.1. A fix is now in the latest source and will be included in the next release.

ondovb · 2016-08-29T20:17:53Z

Should be fixed in v1.1.1.

alienzj · 2018-04-03T08:49:50Z

Hello,

cat sample_1.fq.gz sample_2.fq.gz | mash sketch -k 21 -s 10000 -r - -o sample
mash info sample.msh

output:

Header:
  Hash function (seed):          MurmurHash3_x64_128 (42)
  K-mer size:                    21 (64-bit hashes)
  Alphabet:                      ACGT (canonical)
  Target min-hashes per sketch:  10000
  Sketches:                      1

Sketches:
  [Hashes]  [Length]   [ID]  [Comment]

  10000     815194870  -      -

The ID is empty.
If we have paired-end FASTQ files for many samples, sketch each sample this way, paste all the sketches into a single file, and then run mash dist on it, every ID in the output is empty.

We can do the following instead:

cat sample_1.fq.gz sample_2.fq.gz > sample.fq.gz
mash sketch -k 21 -s 10000 sample.fq.gz -o sample

This avoids the ID issue, but then we can't enjoy the pleasure of a Unix stream pipeline : )

Thanks to the author.
Such a great and creative tool!

ondovb · 2018-09-26T19:58:13Z

In the latest source, the -r flag combines all input files into one sketch, and the empty ID and comment fields can be filled with -I and -C.
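
A rough sketch of how that could look for a batch of paired-end samples, assuming a mash build that includes the -r pooling behavior and the -I and -C options. The sample names, the `<sample>_1.fq.gz` / `<sample>_2.fq.gz` naming scheme, the comment text, and the `run` and `sketch_pair` helpers are all hypothetical; -k 21 and -s 10000 are taken from the commands earlier in this thread. By default the script only prints the commands, so it can be checked without mash installed.

```shell
#!/bin/sh
# One sketch per paired-end sample, with an explicit ID and comment.
# Assumes a mash build where -r pools inputs and -I / -C are available.

# Print the command when DRY_RUN=1 (the default), run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: sketch both mates of one sample into <sample>.msh.
sketch_pair() {
    sample=$1
    run mash sketch -r -k 21 -s 10000 -I "$sample" -C "paired-end" \
        -o "$sample" "${sample}_1.fq.gz" "${sample}_2.fq.gz"
}

# Hypothetical sample names.
for sample in sampleA sampleB; do
    sketch_pair "$sample"
done
```

Running it with `DRY_RUN=0` executes the commands. Afterwards `mash info sampleA.msh` should show whether the ID column is filled in as intended.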

mw55309 mentioned this issue Jul 26, 2016

Paired-data more distant than different samples #33

Closed

ondovb added the enhancement label Jul 27, 2016

ondovb closed this as completed Aug 29, 2016

tseemann mentioned this issue Jun 11, 2018

Option to set the "ID" of a mash sketch #87

Closed

aswarren mentioned this issue Jul 26, 2018

I lied to PATRIC & it liked it PATRIC3/patric3_website#1962

Closed
