Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Aug 19;25(1):267.
doi: 10.1186/s12859-024-05901-8.

GCphase: an SNP phasing method using a graph partition and error correction algorithm

Affiliations

GCphase: an SNP phasing method using a graph partition and error correction algorithm

Junwei Luo et al. BMC Bioinformatics. .

Abstract

Background: The utilization of long reads for single nucleotide polymorphism (SNP) phasing has become popular, providing substantial support for research on human diseases and genetic studies in animals and plants. However, due to the complexity of the linkage relationships between SNP loci and sequencing errors in the reads, the recent methods still cannot yield satisfactory results.

Results: In this study, we present a graph-based algorithm, GCphase, which utilizes the minimum cut algorithm to perform phasing. First, based on alignment between long reads and the reference genome, GCphase filters out ambiguous SNP sites and useless read information. Second, GCphase constructs a graph in which a vertex represents alleles of an SNP locus and each edge represents the presence of read support; moreover, GCphase adopts a graph minimum-cut algorithm to phase the SNPs. Next, GCpahse uses two error correction steps to refine the phasing results obtained from the previous step, effectively reducing the error rate. Finally, GCphase obtains the phase block. GCphase was compared to three other methods, WhatsHap, HapCUT2, and LongPhase, on the Nanopore and PacBio long-read datasets. The code is available from https://github.com/baimawjy/GCphase .

Conclusions: Experimental results show that GCphase under different sequencing depths of different data has the least number of switch errors and the highest accuracy compared with other methods.

Keywords: Error correction; Graph minimum-cut algorithm; Haplotype assembly; SNP phasing.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
Overview of the GCphase workflow. a Data preprocessing. GCphase simplifies the reads into a format that contains only SNP information. b Selecting SNPs and long reads. SNP loci with disproportionately large allele ratios (the number of reads supporting the major allele accounts for more than 85% of the total number of reads) and reads with insufficient SNP information (indicated by red borders in the graph) were removed. c Constructing the Graph. The two alleles of SNP loci are represented as vertices in the graph, and the reads supporting two alleles are represented as edges. d Partitioning Graph. The graph is partitioned into two sets with the smallest intersection using the minimum-cut algorithm. e Correcting errors and Producing haplotype blocks. After undergoing two error correction steps, the algorithm traverses the maximal connected components in the graph to generate haplotype blocks as the output
Fig. 2
Fig. 2
Filtering criteria for SNP loci and edges. Alleles of the same colour connected by lines represent a single read, where different colours indicate different haplotypes. In the figure, at the fourth SNP (represented by red dashed lines), the major allele 'A' has a count of 9, while the minor allele 'G' has a count of 1. The proportion of the major allele reaches 90%, which exceeds the set threshold of 85%. Therefore, SNP4 is considered a fuzzy SNP, and phasing cannot be effectively performed. Thus, SNP4 is removed. The read only aligned to SNP8 and represented in red is shown in the figure. However, since they cover only one SNP locus, they cannot provide effective information for SNP phasing. Therefore, these reads were deleted
Fig. 3
Fig. 3
For the initial set partitioning G0 and G1 of the initial SNP to be phased, there are edges connecting the two alleles vg0 and vg1 of any SNP locus vg with the points in both sets. The cut value is the sum of the weights of the edges connecting the allele vertex with all vertices in its complementary set. Therefore, when the two alleles belong to different sets, different cut values exist. a When vg0 ∊ G0, vg1 ∊ G1, the cut value is the sum of the weights of the red edges, which is equal to 16. b When vg1 ∊ G0, vg0 ∊ G1, the cut value is the sum of the weights of the red edges, which is equal to 5. The cut value (16) of the allocation method in (a) is greater than the cut value (5) of the allocation method in (b). Therefore, the allocation method for the two alleles vg0 and vg1 of SNP tends to be the allocation method in (b), which is vg1 ∊ G0, vg0 ∊ G1

Similar articles

References

    1. Lan W, Lai D, Chen Q, Wu X, Chen B, Liu J, Chen YPP. LDICDL: LncRNA-disease association identification based on collaborative deep learning. IEEE/ACM Trans Comput Biol Bioinf. 2020;19(3):1715–23.10.1109/TCBB.2020.3034910 - DOI - PubMed
    1. Chaisson MJP, Sanders AD, Zhao X, et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 2019;10(1):1–16. 10.1038/s41467-018-08148-z - DOI - PMC - PubMed
    1. 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526(7571), 68 - PMC - PubMed
    1. Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol. 2021;22(1):1–24. 10.1186/s13059-021-02328-9 - DOI - PMC - PubMed
    1. Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods. 2017;14(4):407–10. 10.1038/nmeth.4184 - DOI - PubMed

LinkOut - more resources