Commit

update README
Shunhua Han committed Feb 4, 2021
1 parent d9ff2e0 commit c3df727
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ In the first stage, all reads from a whole genome shotgun sequence dataset are q

In the second stage, junction reads on each side of TE identified in the first stage are separately aligned to a reference genome that is hardmasked using the same TE library from stage one. Genome-wide coverage profiles are computed and genomic intervals with enriched coverage that represents 5' and 3' clusters of junction reads are annotated in bed format. Regions of overlap between intervals of 5' and 3' clusters of junction reads define the locations of the TSDs for candidate non-reference TE insertions. The orientation of the TE is determined from the relative orientation of alignments of the junction reads to the reference genome and TE library.

- In the third stage, all reads from the original whole genome shotgun sequence dataset are used to query against a modified version of reference genome that is hard-masked for all regions not in the vicinity of candidate non-reference TE insertions. This additional mapping step is necessary to obtain all reads that span the TE-flank junction, as well as identify if any reads are present for the alternative "reference" haplotype that does not carry the TE insertion. For each candidate non-reference TE insertion site, 'Junction reads' covering 5' and 3' side of each candidate insertion are counted as number of soft-clipped reads overlapping a 10bp window on the 5' and 3' side of the TSD, respectively. 'Reference reads' were counted as number of non-soft-clipped reads spanning the TSD with at least 3bp extension on both side. The allele frequency for non-reference TEs is heuristically estimated as `max(5' Junction reads, 3' Junction reads)/Reference reads`.
+ In the third stage, all reads from the original whole genome shotgun sequence dataset are used to query against a modified version of reference genome that is hard-masked for all regions not in the vicinity of candidate non-reference TE insertions. This additional mapping step is necessary to obtain all reads that span the TE-flank junction, as well as identify if any reads are present for the alternative "reference" haplotype that does not carry the TE insertion. For each candidate non-reference TE insertion site, Junction reads covering 5' and 3' side of each candidate insertion are counted as number of soft-clipped reads overlapping a 10bp window on the 5' and 3' side of the TSD, respectively (cigar5' and cigar3' in the diagram). 'Non-reference reads' were estimated as `max(cigar5', cigar3')`. 'Reference reads' were estimated as number of non-soft-clipped reads spanning the TSD with at least 3bp extension on both side. The allele frequency for non-reference TEs is heuristically estimated as `Non-reference reads/(Reference reads + Non-reference reads)`.

Reference TE insertions are detected using a similar strategy to non-reference insertions, independently of any reference TE annotation. The first stage in detecting reference TE insertions is identical to the first stage of detecting non-reference TE insertions described above. The second stage in identifying reference TE insertions involves alignment of the renamed, but otherwise unmodified, junction reads to the reference genome. Alignments of the complete junction read (i.e. non-TE and TE components) are clustered to identify the two ends of the reference TE insertion. The orientation of the reference TE is then determined from the relative orientation of alignments of the junction reads to the reference genome and TE library.
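
The revised allele frequency estimate in the third-stage paragraph can be illustrated with a minimal Python sketch. It is not the tool's own code: the function name `allele_frequency`, its parameter names, and the choice to return `None` when no reads support either allele are assumptions made for illustration only.

```python
from typing import Optional


def allele_frequency(cigar5: int, cigar3: int, reference_reads: int) -> Optional[float]:
    """Heuristic allele frequency of a non-reference TE insertion.

    cigar5 / cigar3: soft-clipped junction read counts on the 5' and 3'
    side of the TSD. reference_reads: non-soft-clipped reads spanning the TSD.
    """
    # Non-reference reads are estimated as the larger of the two junction counts.
    non_reference_reads = max(cigar5, cigar3)
    total = reference_reads + non_reference_reads
    # Assumption: with no supporting reads at all, the frequency is undefined.
    if total == 0:
        return None
    return non_reference_reads / total


# Example: 8 and 6 junction reads, 12 reference reads -> 8 / (12 + 8)
print(allele_frequency(8, 6, 12))
```

Because the non-reference count also appears in the denominator, the estimate stays within 0 to 1. The previous formula, `max(5' Junction reads, 3' Junction reads)/Reference reads`, could exceed 1 and was undefined when no reference reads were present.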
