Question

Random Access remote BAM files

1


8 months ago

Lucas R.F. ▴ 20

Hello,

I wanted to ask what solutions are out there for random access to BAM files over HTTP.

Of course, the first answer here is samtools/htslib/pysam, but the current version of htslib creates open-ended range GET requests, and those requests lead to inflated egress costs when working on S3 infrastructure.

I described this behavior here:

https://github.com/samtools/htslib/issues/1670

I was curious whether anybody else has experienced this behavior and maybe has a workaround for it.

IGV/IGVjs creates clean range requests when accessing data via HTTP, but I don’t see an option to use this functionality outside of those programs, for example in a pipeline or a command-line tool.

A solution could be to parse the .bai file and derive the byte range to request from that data. Maybe somebody has some code to share.
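As a rough sketch of that idea: each chunk in a .bai index is a pair of BGZF virtual offsets, and the upper 48 bits of a virtual offset give the byte position of a compressed block in the BAM file. The helper below (chunkToRangeHeader is a made-up name, and the example offsets are invented) turns one chunk into a closed Range header. It pads the end by 64 KiB, the maximum BGZF block size, because the compressed size of the last block is not known from the index alone.

```javascript
// Hypothetical helper: turn one BAI chunk (a pair of BGZF virtual offsets)
// into a closed HTTP Range header value.
const MAX_BGZF_BLOCK_SIZE = 65536n; // a BGZF block is at most 64 KiB

function chunkToRangeHeader(beginVirtualOffset, endVirtualOffset) {
  // A virtual offset stores the compressed file offset of a block in its
  // upper 48 bits and the offset within the uncompressed block in the lower 16.
  const firstByte = BigInt(beginVirtualOffset) >> 16n;
  const lastBlockStart = BigInt(endVirtualOffset) >> 16n;
  // The chunk ends somewhere inside its last block, so pad by the maximum
  // block size. HTTP range ends are inclusive, hence the "- 1n".
  const lastByte = lastBlockStart + MAX_BGZF_BLOCK_SIZE - 1n;
  return `bytes=${firstByte}-${lastByte}`;
}

// Invented example chunk: begins in the block at byte 1000,
// ends in the block at byte 200000.
const chunkBegin = (1000n << 16n) | 37n;
const chunkEnd = (200000n << 16n) | 512n;
console.log(chunkToRangeHeader(chunkBegin, chunkEnd)); // bytes=1000-265535
```

The padding may fetch a little more than strictly needed, but the request stays bounded instead of running to the end of the file.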

Happy about any feedback on this topic.

Best,
Stephan

htslib BAM • 1.3k views

ADD COMMENT • link updated 5 months ago by a.penatauber • 0 • written 8 months ago by Lucas R.F. ▴ 20

0


Hi Lucas, I am looking for similar functionality, as I'm working with a large volume of CRAM files on S3, and downloading them whole would cost tens of thousands of dollars. Meanwhile, studying a single gene locus only requires downloading a few MB of data per individual. Have you been able to figure out a workaround or simple tool to download just the byte range for a specified genetic locus?

Thanks

ADD REPLY • link 5 months ago by a.penatauber • 0

score 1 · Answer 1 · 2024-02-15

1


8 months ago

Alex Reynolds 35k

For command-line work, you could look at using NodeJS with the GMOD bam-js library: https://github.com/GMOD/bam-js

The bam-js library in turn relies on another GMOD library called generic-filehandle that appears to fetch a specified byte range: https://github.com/GMOD/generic-filehandle/blob/06056a2135ddef119262195dd4d5556dfc74b050/src/remoteFile.ts#L72-L134

Perhaps you could run this through a proxy to view headers and confirm whether the byte range is calculated more efficiently, or fork the relevant libraries and modify them to write the range header value to a debugger. Or perhaps S3 logs are granular enough to show the same request details.

Here's a generic example of a Node.js script:

#!/usr/bin/env node

const { BamFile } = require('@gmod/bam');
const { performance } = require('perf_hooks');

// Region to query and location of the remote BAM (placeholder URL)
const bamChrom = 'chr2';
const bamStart = 90383700;
const bamEnd = 90384700;
const bamRootUrl = 'https://foo.cloudfront.net';
const bamUrl = `${bamRootUrl}/reads.bam`;

(async () => {

  const startTime = performance.now();

  // Use bamUrl/baiUrl for remote files; bamPath is meant for local paths
  const bamHandle = new BamFile({
    bamUrl: bamUrl,
    baiUrl: `${bamUrl}.bai`
  });
  // The header has to be read before records can be queried
  const bamHeader = await bamHandle.getHeader();
  const bamReads = await bamHandle.getRecordsForRange(bamChrom, bamStart, bamEnd);
  // console.log(`${JSON.stringify(bamReads)}`);

  const finishTime = performance.now();
  const elapsedTime = finishTime - startTime;
  console.log(`Execution time: ${elapsedTime} ms`);
})();
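
To check the range headers without setting up a proxy, one option is to wrap fetch and record the Range header of every request. This is only a minimal sketch and not part of bam-js: makeRangeLoggingFetch and the stub are hypothetical names, and it assumes Node 18+ for the global Headers class. It only inspects headers passed in the second fetch argument.

```javascript
// Hypothetical wrapper: records the Range header of each request and flags
// open-ended ranges such as "bytes=1000-".
function makeRangeLoggingFetch(baseFetch, log = []) {
  const loggingFetch = async (input, init = {}) => {
    // Headers accepts a plain object, an array of pairs, or a Headers
    // instance, and looks names up case-insensitively.
    const range = new Headers(init.headers).get('range');
    if (range !== null) {
      log.push({ range, openEnded: /^bytes=\d+-$/.test(range) });
    }
    return baseFetch(input, init);
  };
  return { loggingFetch, log };
}

// Demo with a stub in place of the real fetch, so nothing is downloaded.
const stubFetch = async () => ({ status: 206 });
const { loggingFetch, log } = makeRangeLoggingFetch(stubFetch);

(async () => {
  const url = 'https://foo.cloudfront.net/reads.bam';
  await loggingFetch(url, { headers: { Range: 'bytes=0-65535' } });
  await loggingFetch(url, { headers: { Range: 'bytes=65536-' } });
  console.log(log); // the second entry is flagged as open-ended
})();
```

For real traffic, the wrapper would take the global fetch instead of the stub and be handed to whatever performs the requests. I believe generic-filehandle's RemoteFile takes a custom fetch in its options, but treat that as an assumption and check the version you have installed.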

ADD COMMENT • link 8 months ago by Alex Reynolds 35k

0


I will have a look, thanks for the reply, I am happy about any input :)

ADD REPLY • link 8 months ago by Lucas R.F. ▴ 20

0


If you try this out, I would be curious to know if you see the same open range issue with this library. I use this library for moving a fair bit of data and knowing if this issue requires fixing would be helpful.

ADD REPLY • link 8 months ago by Alex Reynolds 35k

1


Hey, I tested bam-js and monitored my traffic, and the range requests look good to me. Thanks for the help. If I have more updates, I will keep you in the loop.

ADD REPLY • link 8 months ago by Lucas R.F. ▴ 20

0


Good to know, thanks for following up

ADD REPLY • link 8 months ago by Alex Reynolds 35k

0


Really appreciate your feedback. Do you know if there is a function in bam-js to lift the raw data to the server without parsing the reads? I am trying to create a sort of sub-BAM containing the header and the blocks from the regions of interest.

ADD REPLY • link 8 months ago by Stephan • 0

0


I don't know. Perhaps you could fork the repository and use the BAI parser to get the desired byte range from the index. Then you can just fetch the raw bytes from the BAM file using a generic fetcher (fetch, axios, etc.).
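
A minimal sketch of the raw fetch step, assuming the byte offsets have already been derived from the index. fetchRawRange is a hypothetical helper, the URL and offsets in the usage comment are placeholders, and the default relies on the global fetch in Node 18+.

```javascript
// Hypothetical helper: download bytes firstByte..lastByte (inclusive)
// without parsing them.
async function fetchRawRange(url, firstByte, lastByte, fetchImpl = globalThis.fetch) {
  if (lastByte < firstByte) {
    throw new RangeError('lastByte must be >= firstByte');
  }
  const response = await fetchImpl(url, {
    // Both ends are given, so the request is closed rather than open-ended.
    headers: { Range: `bytes=${firstByte}-${lastByte}` },
  });
  // 206 Partial Content means the server honoured the range. A 200 would
  // mean it ignored the header and is sending the whole file.
  if (response.status !== 206) {
    throw new Error(`Expected 206 Partial Content, got ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}

// Usage (performs a real request; URL and offsets are placeholders):
// fetchRawRange('https://foo.cloudfront.net/reads.bam', 1000, 265535)
//   .then((bytes) => console.log(`${bytes.length} raw bytes`));
```

Rejecting anything other than 206 is a deliberate choice here, since silently accepting a full-file 200 response is exactly the egress cost this thread is trying to avoid.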

ADD REPLY • link 8 months ago by Alex Reynolds 35k