MarkDuplicatesSpark GATK flags for overcoming space issues
4 weeks ago

Hi everyone,

I am trying to run GATK MarkDuplicatesSpark on a compute cluster with a very large SAM file (494 GB).

Does anyone know of any flags I can use with MarkDuplicatesSpark to get around running out of temporary disk space? Or any other workarounds, such as a way to use less temporary space for the operation? Any help would be greatly appreciated.

Here is the command I used:

gatk MarkDuplicatesSpark -I ${aligned_reads}/ERR9880493.paired.sam -O ${aligned_reads}/ERR9880493_marked_duplicates.bam --conf spark.sql.shuffle.partitions=100

The run fails because the /tmp directory does not have enough available disk space:

Caused by: java.io.IOException: No space left on device
    at java.base/java.io.FileOutputStream.writeBytes(Native Method)
    at java.base/java.io.FileOutputStream.write(FileOutputStream.java:349)
    at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
    at org.apache.spark.io.MutableCheckedOutputStream.write(MutableCheckedOutputStream.scala:43)
    at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)
    at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:286)
    at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:285)
    at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:143)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:323)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:232)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:444)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:222)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:182)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

11:18:26.176 INFO ShutdownHookManager - Shutdown hook called
11:18:26.177 INFO ShutdownHookManager - Deleting directory /tmp/spark-df9c8d33-c610-4dac-826e-8e0103845c

4 weeks ago

From the MarkDuplicatesSpark documentation (https://gatk.broadinstitute.org/hc/en-us/articles/360037224932-MarkDuplicatesSpark):

"Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See this forum discussion for details."

and, from the argument list on the same page:

--tmp-dir (default: null): Temp directory to use.
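
As a rough sketch of how the two options fit together, the script below checks the free space on a candidate scratch directory and then prints (without running) a MarkDuplicatesSpark command that points both spark.local.dir and --tmp-dir at it. SCRATCH_DIR, MIN_FREE_KB, the file names input.sam and output.bam, and the helper functions free_kb and build_cmd are placeholders invented for this example, not part of GATK; adjust them to your cluster.

```shell
#!/bin/sh
# Sketch only: SCRATCH_DIR, MIN_FREE_KB and the file names are assumptions.
SCRATCH_DIR="${SCRATCH_DIR:-${TMPDIR:-/tmp}}"
MIN_FREE_KB="${MIN_FREE_KB:-1}"

# Print the free space (in KiB) on the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print a MarkDuplicatesSpark command line for input $1, output $2, temp dir $3.
# The same directory is used for Spark's scratch space and GATK's temp files.
build_cmd() {
    printf 'gatk MarkDuplicatesSpark -I %s -O %s --conf spark.local.dir=%s --tmp-dir %s\n' \
        "$1" "$2" "$3" "$3"
}

avail="$(free_kb "$SCRATCH_DIR")"
if [ "${avail:-0}" -lt "$MIN_FREE_KB" ]; then
    echo "Only ${avail:-0} KiB free in ${SCRATCH_DIR}; pick another directory." >&2
else
    # Dry run: print the command instead of executing it.
    build_cmd input.sam output.bam "$SCRATCH_DIR"
fi
```

Run it as, for example, SCRATCH_DIR=/PATH/TO/TEMP/DIR MIN_FREE_KB=1000000000 sh check_tmp.sh (the script name is arbitrary), then copy the printed command once the check passes. How much free space is actually enough depends on the input and is not something this sketch can tell you.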

Hi Pierre,

I tried adding the arguments you suggested, and the job ran for a few hours before failing with a different error ("Disk quota exceeded").

I created a new temporary directory, and it showed 18P of available storage.

I ran:

gatk MarkDuplicatesSpark -I ${aligned_reads}/ERR9880493.paired.sam -O ${aligned_reads}/ERR9880493_marked_duplicates.bam --conf spark.local.dir=/corral/mdacc/MCB24068/MC38/tmp --tmp-dir /corral/mdacc/MCB24068/MC38/tmp

My SAM file is 494 GB.
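
A note on the second error: "Disk quota exceeded" is normally a per-user or per-group limit being hit, which is a different condition from "No space left on device", so the free space reported for the filesystem does not by itself show how much a single job may write. A minimal sketch for watching this while the job runs is below. TMP_DIR, QUOTA_KB and the helpers dir_usage_kb and near_limit are hypothetical names for this example, and the quota value has to come from whatever quota tool or documentation your cluster provides.

```shell
#!/bin/sh
# Sketch only: TMP_DIR and QUOTA_KB are placeholders you must set yourself.
TMP_DIR="${TMP_DIR:-${TMPDIR:-/tmp}}"   # the directory passed to --tmp-dir
QUOTA_KB="${QUOTA_KB:-1048576}"         # your quota in KiB (assumed value)

# Print the space (in KiB) currently used under directory $1.
dir_usage_kb() {
    du -sk "$1" 2>/dev/null | awk '{ print $1 }'
}

# Succeed if usage $1 is at or above $3 percent of limit $2 (integers, KiB).
near_limit() {
    [ $(( $1 * 100 )) -ge $(( $2 * $3 )) ]
}

used="$(dir_usage_kb "$TMP_DIR")"
used="${used:-0}"
if near_limit "$used" "$QUOTA_KB" 90; then
    echo "Warning: ${TMP_DIR} holds ${used} KiB, at or above 90% of ${QUOTA_KB} KiB." >&2
else
    echo "${TMP_DIR} holds ${used} KiB of an assumed ${QUOTA_KB} KiB limit."
fi
```

This only measures what is stored under one directory. If the quota is counted across a whole project or user allocation, other files count against it too, so treat the result as a lower bound.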
