DMSA

An analysis pipeline for identifying differentially methylated DNA sites between two groups from MeDIP-Seq data (Illumina paired-end reads), with multiple biological replicates in each condition/group.

Usage of the scripts in the DMSA pipeline:
1) Get_properly_mapping_stats.pl
 Usage: perl Get_properly_mapping_stats.pl input1 input2 output

 This script calculates mapping statistics for the provided ".sam" file.
 Input1 is a list of chromosome names (e.g. chr1, chr2, chrX, chrM), one chromosome per line.
 Input2 is the ".sam" file generated by BWA.
 Output is a summary of the total number of reads, the number of properly paired-end mapped reads, the number of properly paired-end mapped reads with good mapping quality (Q>=20), and the mapping statistics for each chromosome. Name the output file like "*_bwa_mappping_stats.xls" if "Summary_stats.pl" will be run later.
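
The counting described above can be sketched in Python as follows. This is only an illustration, not the code of "Get_properly_mapping_stats.pl": the function name mapping_stats is hypothetical, reads are counted per SAM record rather than per pair, and "properly paired" is taken to mean that SAM flag bit 0x2 is set.

```python
from collections import Counter

def mapping_stats(sam_lines, chromosomes, min_mapq=20):
    """Count total, properly paired, and high-quality properly paired records.

    Hypothetical helper: the real script may count pairs instead of records.
    """
    wanted = set(chromosomes)
    total = proper = proper_q = 0
    per_chrom = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, mapq = int(fields[1]), fields[2], int(fields[4])
        total += 1
        if flag & 0x2:                    # SAM flag bit 0x2: properly paired
            proper += 1
            if mapq >= min_mapq:          # good mapping quality (Q>=20)
                proper_q += 1
                if chrom in wanted:
                    per_chrom[chrom] += 1
    return total, proper, proper_q, per_chrom
```

Because the function only needs an iterable of lines, it can be fed an open file handle as well as a list of strings.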


2) Summary_stats.pl
 Usage: perl Summary_stats.pl input1 input2 output1 output2 output3

 This script summarizes the statistics generated by "Get_properly_mapping_stats.pl".

 Input1 is a ".fai" file that lists each chromosome name and its length (e.g. mm9.fa.fai for the mouse mm9 genome).
 Input2 is a directory that contains the result files generated by "Get_properly_mapping_stats.pl".
 Output1 is a summary, for all samples, of the number of total reads, the number of properly paired-end reads, and the number of properly paired-end reads with mapping quality >=20.
 Output2 is a summary, for all samples, of the number of properly paired-end reads with mapping quality >=20 on each chromosome.
 Output3 is similar to Output2, but the counts are normalized by the total count of each sample and by the length of each chromosome (the total count is scaled to 1 million reads and each chromosome length to 100 Mbp).
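
One plausible reading of the Output3 normalization is sketched below in Python. The function names read_fai and normalize_count are hypothetical, and the exact total used by "Summary_stats.pl" (assumed here to be the sample's count of properly paired reads with mapping quality >=20) is an assumption.

```python
def read_fai(fai_lines):
    """Map chromosome name to length from ".fai" lines (name<TAB>length<TAB>...)."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths

def normalize_count(count, sample_total, chrom_length,
                    per_reads=1_000_000, per_bases=100_000_000):
    """Scale a per-chromosome count to 1 million reads and 100 Mbp of chromosome."""
    return count * (per_reads / sample_total) * (per_bases / chrom_length)
```

For example, 500 reads on a 200 Mbp chromosome in a sample with 2 million reads would be reported as 500 * 0.5 * 0.5 = 125 under this scheme.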


3) Map_Q20_sam.pl
 Usage: perl Map_Q20_sam.pl input output

 This script extracts the reads with high mapping quality from a ".sam" file. If reads that are both properly paired-end aligned and of high mapping quality are required, combine this script with samtools (e.g. samtools view -f 2 -h raw.bam | Map_Q20_sam.pl - properly_aligned_and_high_mapping_quality.sam).

 Input is a raw ".sam" file.
 Output is a ".sam" file that keeps only the reads with high mapping quality.
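
A minimal Python sketch of this kind of filter is shown below. It is not the code of "Map_Q20_sam.pl": the function name filter_high_mapq is hypothetical, the threshold of 20 is inferred from the script name, and passing header lines through unchanged is an assumption.

```python
def filter_high_mapq(sam_lines, min_mapq=20):
    """Yield header lines and alignment lines whose MAPQ (column 5) is >= min_mapq."""
    for line in sam_lines:
        # Header lines start with "@" and are assumed to be kept as-is.
        if line.startswith("@") or int(line.split("\t")[4]) >= min_mapq:
            yield line
```

Writing it as a generator keeps memory use flat, which matters because ".sam" files are often many gigabytes in size.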


4) sam2bed_PE.pl
 Usage: perl sam2bed_PE.pl input output

 This script converts a properly paired-end aligned ".sam" file to a ".bed" file.

 Input is a ".sam" file generated by BWA (only properly paired-end aligned reads should be kept).
 Output is a ".bed" file of coordinates.
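
One common way to do such a conversion is to write a single BED interval per sequenced fragment, as sketched below in Python. Whether "sam2bed_PE.pl" emits one interval per fragment or one per read is not stated here, so this layout is an assumption, as is the function name sam_to_bed_fragments.

```python
def sam_to_bed_fragments(sam_lines):
    """Yield one BED line (chrom, start, end, name) per properly paired fragment."""
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, chrom = fields[0], fields[2]
        pos, tlen = int(fields[3]), int(fields[8])
        if tlen > 0:                      # leftmost mate carries the positive TLEN
            start = pos - 1               # SAM is 1-based, BED is 0-based half-open
            yield f"{chrom}\t{start}\t{start + tlen}\t{name}"
```

Taking only the mate with a positive template length avoids reporting each fragment twice.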


5) MACS_1GetPeakSummits.pl
 Usage: perl MACS_1GetPeakSummits.pl output input1 input2 (input3..inputN)

 This script collects the peak summit positions from all samples.

 Output lists all potential peak summit positions across all samples, based on the MACS output of each individual sample.
 Input1 is the directory that contains the MACS outputs.
 Input2 is a prefix for one group of samples. For example, suppose one group of samples is named s_7_AAA, s_7_BBB, s_7-CCC, s_7XYZ; they all start with "s_7". Then "perl MACS_1GetPeakSummits.pl summits_position.xls /home/projects/MACS_peaks s_7" will consider all peak files that start with "s_7" to get all potential peak summit positions.
 Input3 to InputN are optional. They are prefixes for other groups of samples (each should be a different group from Input2).
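
The prefix selection and the pooling of summits can be sketched in Python as below. Parsing of the MACS peak files is deliberately left out because the column layout depends on the MACS version; the function names and the example file names are hypothetical.

```python
def files_for_prefixes(filenames, prefixes):
    """Return the file names that start with any of the given group prefixes."""
    return sorted(name for name in filenames if name.startswith(tuple(prefixes)))

def merge_summits(summits_by_file):
    """Pool (chromosome, position) summits from all files, dropping duplicates."""
    merged = set()
    for summits in summits_by_file.values():
        merged.update(summits)
    return sorted(merged)
```

Sorting the pooled summits by chromosome and position makes the list suitable for a later binary search.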


6) MACS_2ReadCount_BinarySearch.pl
 Usage: perl MACS_2ReadCount_BinarySearch.pl F input1 input2 input3 output1 output2 input_prefix

 This script calculates the count and normalized count data for all peak summit regions in one sample or in multiple samples.

 F: Use either "T" or "F" to indicate whether a header line should be written. No header line is needed here, so "F" is used.
 Input1 is Output1 of "Summary_stats.pl", which is used for normalization.
 Input2 is the summit position file, i.e. the Output of "MACS_1GetPeakSummits.pl".
 Input3 is the directory of bed files generated by the "sam2bed_PE.pl" script.
 Output1 is the count data for one sample or multiple samples, depending on the input_prefix.
 Output2 is the normalized count data for one sample or multiple samples, depending on the input_prefix.
 Input_prefix can be either a sample name or the prefix of a group of samples (e.g. "s_7_TBL3S5" for a sample name; "s_7" for a group of samples). We recommend using a sample name here, because going through all samples in one run takes a long time. It is better to run the samples independently on different computing nodes to speed up the whole process.
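
The script name suggests that reads are counted by binary search over sorted coordinates. A minimal Python sketch of that idea follows; the window half-width, counting by read start position, and scaling to 1 million reads are all assumptions, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_in_window(sorted_starts, summit, half_width):
    """Count reads whose start lies in [summit - half_width, summit + half_width].

    sorted_starts must be sorted in ascending order (one chromosome at a time).
    """
    lo = bisect_left(sorted_starts, summit - half_width)
    hi = bisect_right(sorted_starts, summit + half_width)
    return hi - lo

def normalize_to_million(count, sample_total):
    """Scale a raw count to a library size of 1 million reads."""
    return count * 1_000_000 / sample_total
```

Each lookup costs O(log n) in the number of reads, so the coordinates only need to be sorted once per chromosome.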


7) MACS_3TTest.pl
 Usage: perl MACS_3TTest.pl input_suffix group1 group2 output

 This script performs Student's t-test on the count data of two groups.

 Input_suffix is the suffix of the output files generated by "MACS_2ReadCount_BinarySearch.pl" (e.g. ".count.norm.xls").
 Group1 is one group of samples, with the sample names separated by colons (e.g. s_7_TBL3S1:s_7_TBL3S2:s_7_TBL3S3).
 Group2 is the other group of samples, with the sample names separated by colons.
 Output is the result file of Student's t-test.
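
For reference, the classical two-sample Student's t statistic (equal variances assumed) can be computed for one peak summit as in the Python sketch below. Whether "MACS_3TTest.pl" uses this pooled-variance form is an assumption, and the function name student_t is hypothetical; a p-value would additionally require the t-distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def student_t(group1, group2):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test.

    Each group needs at least two values, and the pooled variance must be non-zero.
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    t = (mean(group1) - mean(group2)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df
```

A colon-separated group argument such as the Group1 example can be split into sample names with "s_7_TBL3S1:s_7_TBL3S2:s_7_TBL3S3".split(":").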


8) MACS_4group_peaks.pl
 Usage: perl MACS_4group_peaks.pl distance_cutoff input output

 This script groups peaks whose summits are located within a certain distance of each other.

 Distance_cutoff is a number that defines the distance within which two peak summits will be grouped.
 Input is the file of differentially methylated sites parsed from the output of "MACS_3TTest.pl".
 Output lists all the grouped differentially methylated sites.
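
One simple way to implement such grouping is to sort the summits and chain neighbours that lie within the cutoff, as in the Python sketch below. Comparing each summit with the previous one on the same chromosome, and treating the cutoff as inclusive, are assumptions; the function name group_summits is hypothetical.

```python
def group_summits(summits, distance_cutoff):
    """Group (chromosome, position) summits that chain within distance_cutoff."""
    groups = []
    for chrom, pos in sorted(summits):
        last = groups[-1] if groups else None
        # Extend the current group if the previous summit is close enough.
        if last and last[-1][0] == chrom and pos - last[-1][1] <= distance_cutoff:
            last.append((chrom, pos))
        else:
            groups.append([(chrom, pos)])
    return groups
```

With this chaining rule a group can span more than the cutoff overall, as long as every consecutive pair of summits in it is within the cutoff.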




About

Differentially Methylated Sites Analyzer
