Question

How to Handle and Merge Feature Counts for Mixed Single-End and Paired-End RNA-Seq Data? Post:

0

Entering edit mode

7 weeks ago

roxana ▴ 10

Hello everyone,

I’m working on RNA-Seq data analysis and encountered an issue when using the featureCounts tool from the Subread package. I have two types of RNA-Seq data:

The first three samples are single-end reads (control samples). The next three samples are paired-end reads (disease samples). I attempted to use the following featureCounts command:

featureCounts -a Homo_sapiens.GRCh38.112.gtf -F GTF -p -C -T 8 -G Homo_sapiens.GRCh38.dna.primary_assembly.fa -o Sixsamplecontrolearly-PCOS.txt \
/home/tj/SRR868862/SRR868862Aligned_sorted.bam \
/home/tj/SRR868881/SRR868881Aligned_sorted.bam \
/home/tj/SRR868884/SRR868884Aligned_sorted.bam \
/home/tj/SRR868859/Aligned_sorted.bam \
/home/tj/SRR868863/SRR868863Aligned_sorted.bam \
/home/tj/SRR868873/SRR868873Aligned_sorted.bam

However, I ran into a problem because the -p flag, which enables paired-end counting, causes an issue with the single-end reads in the first three samples. The -p flag forces featureCounts to treat all the BAM files as paired-end data, which is incorrect for the single-end samples.

Solution Approach: After some research and advice, I realized that the correct approach is to handle single-end and paired-end samples separately:

Step 1: Run featureCounts Separately for Single-End and Paired-End Samples For Single-End Samples (Control Samples): Run featureCounts without the -p flag:

featureCounts -a Homo_sapiens.GRCh38.112.gtf -F GTF -T 8 -G Homo_sapiens.GRCh38.dna.primary_assembly.fa -o ControlSamplesFeatureCounts.txt \
/home/tj/SRR868862/SRR868862Aligned_sorted.bam \
/home/tj/SRR868881/SRR868881Aligned_sorted.bam \
/home/tj/SRR868884/SRR868884Aligned_sorted.bam

For Paired-End Samples (Disease Samples): Run featureCounts with the -p flag for the paired-end data:

featureCounts -a Homo_sapiens.GRCh38.112.gtf -F GTF -p -C -T 8 -G Homo_sapiens.GRCh38.dna.primary_assembly.fa -o DiseaseSamplesFeatureCounts.txt \
/home/tj/SRR868859/Aligned_sorted.bam \
/home/tj/SRR868863/SRR868863Aligned_sorted.bam \
/home/tj/SRR868873/SRR868873Aligned_sorted.bam

Step 2: Merge the Resulting Count Files After running featureCounts separately, I needed to merge the two resulting count files into a single count matrix for downstream analysis. Here are a few ways to accomplish this:

Using Command-Line Tools (paste or join):

Using paste: Merges files side by side:

paste ControlSamplesFeatureCounts.txt DiseaseSamplesFeatureCounts.txt > MergedFeatureCounts.txt
Using join: Merges based on a common field (gene ID):


sort -k1,1 ControlSamplesFeatureCounts.txt > sorted_ControlSamplesFeatureCounts.txt
sort -k1,1 DiseaseSamplesFeatureCounts.txt > sorted_DiseaseSamplesFeatureCounts.txt
join -1 1 -2 1 sorted_ControlSamplesFeatureCounts.txt sorted_DiseaseSamplesFeatureCounts.txt > MergedFeatureCounts.txt

Using R:

# Load count data from both files
control_counts <- read.table("ControlSamplesFeatureCounts.txt", header = TRUE, row.names = 1)
disease_counts <- read.table("DiseaseSamplesFeatureCounts.txt", header = TRUE, row.names = 1)

# Merge the data frames by row names (gene IDs)
merged_counts <- merge(control_counts, disease_counts, by = "row.names", all = TRUE)

# Set row names back to gene IDs after merging
rownames(merged_counts) <- merged_counts$Row.names
merged_counts <- merged_counts[, -1]  # Remove redundant row names column

# Write the merged data to a new file
write.table(merged_counts, file = "MergedFeatureCounts.txt", sep = "\t", quote = FALSE)

My Question: Are there any better or more efficient methods to handle and merge count files from mixed single-end and paired-end RNA-Seq samples? Are there any potential pitfalls with this approach that I should be aware of?

I appreciate any insights or suggestions from the community!

Thank you!

RNA-Seq • 355 views

ADD COMMENT • link updated 7 weeks ago by i.sudbery 20k • written 7 weeks ago by roxana ▴ 10

0

Entering edit mode

Interstingly, the SRA records for both your control and disease samples claim to be single ended.

ADD REPLY • link 7 weeks ago by i.sudbery 20k

score 1 · Answer 1 · 2024-09-12

1

Entering edit mode

7 weeks ago

i.sudbery 20k

You can't really hand mixed single and paired-end reads together in the same experiment, parituclarly if your control samples are single ended, and your treatment samples are pair-ended. The best way to proceed is to discard the read2s from the treatment samples, and only use the read1s.

ADD COMMENT • link 7 weeks ago by i.sudbery 20k

0

Entering edit mode

Also, if read lengths are different, trim the longer so it matches the other samples.

Are your u sure these are appropriate controls? Library prepped at the same time as the treatments?

ADD REPLY • link 7 weeks ago by swbarnes2 14k