Virus Evol. 2021 Jul 30;7(2):veab064.
doi: 10.1093/ve/veab064. eCollection 2021.

Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool

Áine O'Toole et al. Virus Evol. 2021.

Abstract

The response of the global virus genomics community to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been unprecedented, with significant advances made towards the 'real-time' generation and sharing of SARS-CoV-2 genomic data. The rapid growth in virus genome data production has necessitated the development of new analytical methods that can deal with orders of magnitude more genomes than were previously available. Here, we present and describe Phylogenetic Assignment of Named Global Outbreak Lineages (pangolin), a computational tool that has been developed to assign the most likely lineage to a given SARS-CoV-2 genome sequence according to the Pango dynamic lineage nomenclature scheme. To date, nearly two million virus genomes have been submitted to the web-application implementation of pangolin, which has facilitated SARS-CoV-2 genomic epidemiology and provided researchers with access to actionable information about the pandemic's transmission lineages.

Keywords: SARS-CoV-2; genomic surveillance; lineage; phylogenetics; software.

Conflict of interest statement

None declared.

Figures

Figure 1.
(A) Time series of the number of countries that have reported SARS-CoV-2 genome sequences (shaded area) and a curve showing trends in the geographic variation of reporting, quantified as the Shannon diversity (H) of sequence sampling location labels (curve). (B) The accumulation of SARS-CoV-2 genetic diversity over time, measured as the mean genetic distance of sampled sequences from the reference sequence (accession: EPI_ISL_406801). Shaded regions indicate one standard deviation from the mean. (C) Number of designated Pango lineages over the course of the pandemic. As more countries have contributed sequences, and as genetic diversity has accumulated throughout, the Pango nomenclature has continued to define distinct lineages that represent the emerging edge of the pandemic. SARS-CoV-2 genome sequences and metadata described were sourced from GISAID on 31 May 2021.
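The Shannon diversity in panel (A) can be illustrated with a minimal sketch. It assumes the standard definition H = -Σ p_i ln p_i over the relative frequencies of location labels and uses the natural logarithm; the caption does not state the log base, and the function name `shannon_diversity` is hypothetical.

```python
import math
from collections import Counter


def shannon_diversity(labels):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over label frequencies.

    Assumption: natural log; the figure caption does not state the base.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Toy location labels, not real GISAID metadata.
single_country = ["UK", "UK", "UK", "UK"]
two_countries = ["UK", "USA", "UK", "USA"]
h_single = shannon_diversity(single_country)  # no geographic variation
h_two = shannon_diversity(two_countries)      # even split across two labels
```

Under this definition H is zero when every sequence comes from one location and grows as sampling spreads more evenly across locations.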
Figure 2.
Workflow describing the process of Pango lineage designation and assignment of lineages to SARS-CoV-2 genome sequences using pangolin. An estimated global SARS-CoV-2 phylogeny is periodically manually curated to ‘designate’ lineages and sequences (left). A list of sequences and the lineages to which they have been designated (the ‘sequence designation list’) is maintained by the Pango team at https://github.com/cov-lineages/pango-designation (last accessed: 29 June 2021). These designations and the associated genome sequences from GISAID are used as input for the pangoLEARN training pipeline (https://github.com/cov-lineages/pangoLEARN; last accessed: 29 June 2021) (centre). Once this is completed a new pangoLEARN data release is tagged (centre). This creates the machine learning model that pangolin uses to assign genomes (right). Users can then submit a SARS-CoV-2 genome query sequence and pangolin will assign the most likely lineage based on the currently established lineage designations. *In addition to assignment using the pangoLEARN model, certain lineages of interest are assigned by checking for specific defining SNPs with some built-in flexibility (e.g. B.1.1.7 is assigned by checking for the presence of at least 5 of the 17 defining SNPs that fall on the basal branch of the lineage). These additional ad hoc rules may be subject to revision or removal to maintain performance of the pangolin system.
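The SNP-based rule mentioned at the end of this caption (for example, at least 5 of the 17 defining SNPs for B.1.1.7) can be sketched as a simple threshold check. This is an illustration only, not pangolin's code: the function name, the 1-based position convention, and the toy SNP positions below are all assumptions, and the query is assumed to be already aligned to the reference.

```python
def matches_snp_rule(aligned_seq, defining_snps, min_hits):
    """Return True if at least `min_hits` defining SNPs are present.

    aligned_seq:   query sequence in reference coordinates (assumed).
    defining_snps: dict mapping 1-based position -> expected base.
    """
    hits = sum(
        1
        for pos, base in defining_snps.items()
        if 0 < pos <= len(aligned_seq) and aligned_seq[pos - 1].upper() == base
    )
    return hits >= min_hits


# Toy SNP set with hypothetical positions (NOT the real B.1.1.7 sites).
toy_snps = {2: "T", 5: "G", 8: "A"}
carries_lineage = matches_snp_rule("ATCCGTTA", toy_snps, min_hits=2)
lacks_lineage = matches_snp_rule("ATCCNTTC", toy_snps, min_hits=2)
```

Requiring only a subset of the defining sites gives the flexibility the caption describes: a genome with a few ambiguous or missing positions can still satisfy the rule.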
Figure 3.
Pangolin web application interface. (A) The image shows the landing page of the pangolin web application where users can either select or drag and drop a local file into the web browser. (B) The results page showing a processed file, the sequence name for each sequence and the assigned lineage. Links to the UK and global microreact.org builds, as well as the cov-lineages.org web pages for each lineage are represented by the three icons on the right.
Figure 4.
Distribution of genome completeness as a percentage of informative coding region sites for all SARS-CoV-2 sequences on GISAID. 1,382,550 sequences were assessed for ambiguity in coding regions, including both whole genome sequences with high ambiguity content and short fragments that have been uploaded to GISAID. 1,284,427 sequences had <5 per cent ambiguous sites across the virus coding region (i.e. were at least 95 per cent complete). Sequences that have been designated a lineage are indicated (n = 438,440). This 95 per cent completeness threshold was enacted as of Pango designation version 1.2 (GISAID data sourced on 7 May 2021).
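A completeness check of the kind described here might look like the following sketch. It assumes an "informative" site is an unambiguous A, C, G or T and that the caller supplies the coding-region bounds; the caption gives neither the exact definition nor the coordinates, so `percent_complete`, `passes_threshold` and their parameters are hypothetical.

```python
def percent_complete(seq, start=0, end=None):
    """Percentage of sites in seq[start:end] that are unambiguous A/C/G/T.

    Assumption: anything other than A, C, G or T counts as uninformative.
    """
    region = seq[start:end].upper()
    if not region:
        return 0.0
    informative = sum(region.count(base) for base in "ACGT")
    return 100.0 * informative / len(region)


def passes_threshold(seq, threshold=95.0, start=0, end=None):
    """True if the region meets the completeness threshold (95 per cent here)."""
    return percent_complete(seq, start, end) >= threshold


# Toy sequences, not real genomes.
partial = "ACGTNNACGT"    # 8 of 10 sites informative
complete = "ACGT" * 5     # every site informative
partial_pct = percent_complete(partial)
```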
Figure 7.
Performance of pangolin in response to missing data. (A) Pangolin assignment accuracy over a gradient of percentage ambiguities. Correct assignments decline with increasing numbers of ambiguous sites, and this initially leads to a greater proportion of genomes assigned to the parent lineage. At greater percentages of ambiguity, minimap2 fails to effectively map against the reference genome, and beyond 24 per cent ambiguity all genomes fail to map. (B) Overlapping amplicons generated from the ARTIC ncov2019 primer scheme v3 (98 amplicons across two pools). (C) Pangolin assignments over a sliding window of triple amplicon dropouts coloured by correct lineage assignment, ancestor lineage assigned or incorrect assignment. SARS-CoV-2 genome schema with genome positions and largest genes (ORF1ab, S, M and N) labelled.
Figure 8.
The behaviour of pangolin in response to simulated recombinant genomes. (A) Genome graph of SARS-CoV-2. The position of percentage cut-off sites is shown along the top of the graph and nucleotide base position along the bottom. (B) Schema describing the structure of simulated recombinants. Each recombinant is a combination of two distinct lineages in varying proportions. The three example recombinants show a 20 per cent 5ʹ lineage (80 per cent 3ʹ lineage) recombinant, a 50:50 recombinant and an 80 per cent 5ʹ lineage (20 per cent 3ʹ lineage) recombinant. (C) A density curve over the SARS-CoV-2 genome highlighting the relative importance of particular sites within the decision tree. The density is calculated based on the number of rules in the decision tree that include a given site in the genome. (D) The horizontal axis indicates the percentage of the 5ʹ lineage present in a given recombinant genome. Each bar represents 125,300 simulated recombinants. Stacked colours indicate the count of the recombinants that had either the 5ʹ lineage assigned, the 3ʹ lineage assigned, an ancestral lineage assigned, or an incorrect assignment (i.e. a sibling or unrelated lineage).
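The recombinant structure in panel (B) can be sketched as a single-breakpoint splice of two aligned genomes. This is an assumed construction based only on the caption: the function name is hypothetical, both inputs are assumed to be the same length in reference coordinates, and the breakpoint is taken as a rounded percentage of that length.

```python
def make_recombinant(five_prime_seq, three_prime_seq, pct_five_prime):
    """Join the first pct_five_prime per cent of one genome to the rest of another.

    Assumption: both sequences are aligned to the same coordinates and
    have equal length, so a single index serves as the breakpoint.
    """
    if len(five_prime_seq) != len(three_prime_seq):
        raise ValueError("sequences must be the same length")
    if not 0 <= pct_five_prime <= 100:
        raise ValueError("pct_five_prime must be between 0 and 100")
    breakpoint_index = round(len(five_prime_seq) * pct_five_prime / 100)
    return five_prime_seq[:breakpoint_index] + three_prime_seq[breakpoint_index:]


# Toy "genomes" standing in for two distinct lineages.
lineage_a = "A" * 10
lineage_b = "C" * 10
rec_20 = make_recombinant(lineage_a, lineage_b, 20)
rec_50 = make_recombinant(lineage_a, lineage_b, 50)
rec_80 = make_recombinant(lineage_a, lineage_b, 80)
```

The three toy calls mirror the 20:80, 50:50 and 80:20 examples named in the caption.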
Figure 5.
Performance of different pangoLEARN models. (A) Number of genomes submitted to GISAID by the date of model training, and thus number of genomes included in a given model training. (B) Training time (hours) for each model type (logistic regression, random forest, decision tree), tested on SARS-CoV-2 genome data releases from April to October 2020. All models except the multinomial logistic regression scale acceptably with increasing sequence and lineage counts. (C) The average lineage recall rate for each model for each data release. All models performed well, with the random forests each slightly beating the decision trees. (D) The average F1 scores for each of the models for each of the data releases. These scores were closely correlated with the recall rate.
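The average recall and F1 scores in panels (C) and (D) can be illustrated with a small sketch. It assumes an unweighted (macro) average over the lineages present in the true labels; the caption does not say how the averages were taken, and `macro_recall_f1` is a hypothetical name.

```python
from collections import Counter


def macro_recall_f1(true_labels, predicted_labels):
    """Unweighted mean of per-lineage recall and F1.

    Assumption: averages run over lineages that occur in true_labels.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # missed the true lineage
            fp[pred] += 1   # wrongly called the predicted lineage
    lineages = sorted(set(true_labels))
    if not lineages:
        return 0.0, 0.0
    recalls, f1_scores = [], []
    for lineage in lineages:
        recalls.append(tp[lineage] / (tp[lineage] + fn[lineage]))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive here
        # because each lineage occurs at least once in true_labels.
        f1_scores.append(
            2 * tp[lineage] / (2 * tp[lineage] + fp[lineage] + fn[lineage])
        )
    return sum(recalls) / len(lineages), sum(f1_scores) / len(lineages)


# Toy labels, not real assignments.
truth = ["A", "A", "B", "B"]
calls = ["A", "B", "B", "B"]
mean_recall, mean_f1 = macro_recall_f1(truth, calls)
```

Because F1 combines recall with precision, the two averages tend to move together whenever precision is stable, which is consistent with the close correlation the caption reports.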
Figure 6.
Performance of pangolin for genomes with increasing numbers of simulated additional mutations. (A) Boxplot showing the spread of the majority of data, separated by whether the lineage was correctly assigned, assigned an ancestral lineage of the designated lineage, assigned a descendant lineage, or incorrectly assigned. The whiskers define the 5th and 95th percentile range. (B) Proportion of genomes assigned correctly, incorrectly or to an ancestral or descendant (child) lineage, normalised for a given percentage identity.
