Hi everyone,
I am trying to run GATK MarkDuplicatesSpark on a computational cluster with a very large SAM file (494 GB).
Does anyone know of any flags I can use with MarkDuplicatesSpark to overcome the out-of-space error shown below? Or any other workaround, such as a way to make the operation use less temporary space? Any help would be greatly appreciated.
Here is the command I used:
gatk MarkDuplicatesSpark -I ${aligned_reads}/ERR9880493.paired.sam -O ${aligned_reads}/ERR9880493_marked_duplicates.bam --conf spark.sql.shuffle.partitions=100
The job fails because the /tmp directory does not have enough available disk space:
Caused by: java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:349)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at org.apache.spark.io.MutableCheckedOutputStream.write(MutableCheckedOutputStream.scala:43)
at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)
at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:286)
at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:285)
at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:143)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:323)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:232)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:444)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:222)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:182)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
11:18:26.176 INFO ShutdownHookManager - Shutdown hook called
11:18:26.177 INFO ShutdownHookManager - Deleting directory /tmp/spark-df9c8d33-c610-4dac-826e-8e0103845c
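For reference, the only workaround I have come up with so far is to point both GATK's temporary directory and Spark's local scratch directory at a larger filesystem instead of /tmp. This is just a sketch of what I mean; /scratch/$USER/tmp is a placeholder for whatever large scratch area the cluster provides, and I am not sure this is the recommended way to set these options:

# Placeholder path on a larger filesystem than /tmp
mkdir -p /scratch/$USER/tmp

gatk MarkDuplicatesSpark \
    -I ${aligned_reads}/ERR9880493.paired.sam \
    -O ${aligned_reads}/ERR9880493_marked_duplicates.bam \
    --tmp-dir /scratch/$USER/tmp \
    --conf spark.local.dir=/scratch/$USER/tmp \
    --conf spark.sql.shuffle.partitions=100

Would something like this work, or is there a better way to control where MarkDuplicatesSpark writes its shuffle files?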
Hi Pierre,
I tried adding the argument you gave me, and it ran for a few hours before failing with another error (Disk quota exceeded).
I created a new tmp directory, and it had 18P of available storage.
I ran:
My SAM file size is 494 GB.
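In case it helps with diagnosing the "Disk quota exceeded" error: since the filesystem itself still shows plenty of free space, I am wondering whether a per-user quota is the limit instead. These are only guesses at how I would check that, since I do not know which filesystem type the cluster's scratch area uses (the lfs command only applies to Lustre, and /scratch/$USER/tmp is the same placeholder path as above):

df -h /scratch/$USER/tmp        # free space on the filesystem holding the tmp directory
quota -s                        # per-user quota, if standard Linux disk quotas are enabled
lfs quota -h -u $USER /scratch  # per-user quota, if the scratch area is Lustre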