Hostname: page-component-7bb8b95d7b-fmk2r Total loading time: 0 Render date: 2024-10-02T12:56:26.156Z Has data issue: false hasContentIssue false

Combined visualization of genomic and epidemiological data for outbreaks

Published online by Cambridge University Press:  30 September 2024

Carl J. E. Suster*
Affiliation:
Centre for Infectious Diseases and Microbiology – Public Health, Westmead Hospital, Westmead, NSW, Australia Sydney Infectious Diseases Institute, The University of Sydney, Sydney, NSW, Australia School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia
Anne E. Watt
Affiliation:
Sydney Infectious Diseases Institute, The University of Sydney, Sydney, NSW, Australia School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology and Medical Research, NSW Health Pathology, Westmead, NSW, Australia
Qinning Wang
Affiliation:
Centre for Infectious Diseases and Microbiology – Public Health, Westmead Hospital, Westmead, NSW, Australia Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology and Medical Research, NSW Health Pathology, Westmead, NSW, Australia
Sharon C.-A. Chen
Affiliation:
Centre for Infectious Diseases and Microbiology – Public Health, Westmead Hospital, Westmead, NSW, Australia Sydney Infectious Diseases Institute, The University of Sydney, Sydney, NSW, Australia School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology and Medical Research, NSW Health Pathology, Westmead, NSW, Australia
Jen Kok
Affiliation:
Centre for Infectious Diseases and Microbiology – Public Health, Westmead Hospital, Westmead, NSW, Australia Sydney Infectious Diseases Institute, The University of Sydney, Sydney, NSW, Australia School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology and Medical Research, NSW Health Pathology, Westmead, NSW, Australia
Vitali Sintchenko
Affiliation:
Centre for Infectious Diseases and Microbiology – Public Health, Westmead Hospital, Westmead, NSW, Australia Sydney Infectious Diseases Institute, The University of Sydney, Sydney, NSW, Australia School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia Centre for Infectious Diseases and Microbiology Laboratory Services, Institute of Clinical Pathology and Medical Research, NSW Health Pathology, Westmead, NSW, Australia
*
Corresponding author: Carl J. E. Suster; Email: carl.suster@sydney.edu.au
Rights & Permissions [Opens in a new window]

Abstract

In epidemiological investigations, pathogen genomics can provide insights and test epidemiological hypotheses that would not have been possible through traditional epidemiology. Tools to synthesize genomic analysis with other types of data are a key requirement of genomic epidemiology. We propose a new ‘phylepic’ visualization that combines a phylogenomic tree with an epidemic curve. The combination visually links the molecular time represented in the tree to the calendar time in the epidemic curve, a correspondence that is not easily represented by existing tools. Using an example derived from a foodborne bacterial outbreak, we demonstrated that the phylepic chart communicates that what appeared to be a point-source outbreak was in fact composed of cases associated with two genetically distinct clades of bacteria. We provide an R package implementing the chart. We expect that visualizations that place genomic analyses within the epidemiological context will become increasingly important for outbreak investigations and public health surveillance of infectious diseases.

Type
Short Paper
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Report

High-resolution characterization and clustering of pathogen genomes using whole genome sequencing (WGS) has proven to be a powerful complementary tool to investigations of infectious disease outbreaks [Reference Armstrong1, Reference Baker2]. An epidemiological investigation will often involve establishing a case definition to demarcate the outbreak, geospatial and temporal clustering, case follow-up and further investigations to establish transmission pathways or common reservoirs [3]. Genomic epidemiology draws together bioinformatic analysis from pathogen genomic sequences and epidemiological data to provide context and support to inferences of transmission. Synthesis of genomic and epidemiological data is required to perform these inferences, however, this can be challenging given the complexity of the underlying data [Reference Watt4]. These challenges can be compounded when genomic and epidemiological investigations are conducted by discrete organizations (e.g., public health laboratories and local public health services respectively) with different information contexts and objectives.

The epidemic curve, a histogram of cases binned along the time axis, is a key representation relied upon in disease control. It is straightforward to prepare from line listing data, supports both data exploration and statistical epidemiology [Reference Nishiura, Chowell and Hyman5], and is straightforward to interpret. Genomic epidemiology instead relies upon phylogenomic trees, which display a detailed summary of the relationships between the included genomes under the particular assumptions of the underlying evolutionary models. Phylogenomic trees are prepared using specialized tools that implement the evolutionary models and efficiently search for the optimal tree (e.g., IQ-TREE, http://www.iqtree.org/). Trees are an essential representation of the salient information contained in a set of genomes that support genomic epidemiology and public health surveillance. Visually extracting this information relies upon knowledge of the conventions of tree drawing, and some familiarity with the process of their preparation. New users of trees may therefore require dedicated training so as to minimize the risk of misinterpretation.

We propose a new visualization called the ‘phylepic’ chart that synthesizes epidemic curves and phylogenomic trees. The structure of the phylepic chart facilitates the visual linkage of the molecular time represented in the tree with the epidemiological time represented in the epidemic curve [Reference Sintchenko and Holmes6]. For illustration, we adapted a dataset from the investigation of a point-source foodborne outbreak in New South Wales (NSW), Australia. Clinical cases presenting with gastroenteritis were epidemiologically linked to the same food product, sparking an investigation. Bacterial isolates were collected during the outbreak period from clinical samples, food samples, and food manufacturing facilities, and all underwent WGS. The resulting sequences were analysed alongside historical sequences collected in NSW using conventional bioinformatic pipelines to produce a core genome phylogenomic tree. For the purpose of illustration, we have pruned the resulting phylogeny and edited some details in the scenario. The chart is presented in Figure 1.

Figure 1. A phylepic chart using an illustrative dataset adapted from a foodborne enteric pathogen outbreak in NSW. The panel on the left shows the phylogenomic tree with tips coloured by genomic cluster membership. Numbers on branches are bootstrap support values expressed as percentages. The panel on the upper right shows the epidemic curve using the date of sample collection. The calendar grid (boxed) in the lower right connects the tree to the epidemic curve, and each tile shows the day of the month of the sample collection date. Samples with a collection date prior to the outbreak period are shown with triangles at the left edge of the date scale. Colours on tree tips, calendar tiles, and the epidemic curve refer to the same genomic clusters.

Relying on the epidemiological data alone, the silhouette of the epidemic curve in Figure 1 is consistent with a single point-source outbreak. The epidemic curve is coloured by genomic cluster membership. The presence of two genomic clusters suggests that there are two distinct strains associated with cases occurring in the same period. The phylogenomic tree provides the more complete genomic context of the outbreak by providing detailed information about the inferred relatedness of bacterial sequences. The tree reveals that the most recent common ancestor of the two outbreak clusters (Clust-04 and Clust-05) is also the common ancestor of several other clusters and singletons in the historical dataset. This provides evidence that Clust-04 and Clust-05 arose from separate introductions of the pathogen into the food product. From the tree alone the epidemiological link between the two clusters is not evident and there is no representation of the temporal proximity of cases. Only the combination of the two lines of evidence leads to the conclusion of distinct introductions to the same food product.

The calendar grid in the lower right of Figure 1 shows a visual projection of the tree onto a time axis representing the date of sample collection. For this particular chart, the time axis was restricted to the investigation period, meaning that for isolates collected before mid-August, there is no calendar tile. To distinguish these from cases with missing metadata, triangles are drawn at the left edge of the calendar to indicate out-of-range values. The epidemic curve and calendar grid are binned by epidemiological week. The phylepic chart is most readable when the tree is relatively small and there are not too many date bins, both of which are reasonable restrictions for the intended application in outbreak investigation. The bin width can be decreased to a single day or increased to another interval depending on the appropriate resolution for an investigation.

While we have chosen a foodborne bacterial pathogen for this example, phylepic charts are equally compatible with any other pathogens such as respiratory viruses. The tree need not necessarily be built from the consensus genome or a core genome as in our example; it is sufficient that a tree of some sort is prepared that is meaningful in the context of the investigation. It remains important to convey any caveats and limitations of the phylogenomic analysis alongside the results to ensure correct interpretation.

To communicate relevant epidemiological context, the most popular tools for visualizing trees (e.g., the ggtree R package, https://yulab-smu.top/treedata-book/, or the ETE toolkit Python library, http://etetoolkit.org/) allow tips to be annotated with an array of coloured boxes or symbols. This can work well for categorical or binary data such as the isolation source of samples, results of laboratory assays, or the presence of antimicrobial resistance, toxins, or virulence factors. These annotations can be incorporated into the phylepic chart, as shown with the coloured tiles indicating clinical and environmental sample sources in Figure 1. Additional columns can be incorporated to represent other categorical or quantitative data as required. With some care, it is possible to use existing software libraries such as those mentioned above to produce a chart similar to Figure 1.

Dates associated with each case (reporting, sample collection, symptom onset) are key data for generating epidemiological hypotheses. These dates do not directly relate to the branch lengths in phylogenomic trees. Instead, the axis extending from the root of the tree to its tips can be thought of as molecular time, reflecting the evolutionary distances predicted by the model. It is possible to express branch lengths in units of time using an inferred molecular clock rate, and by incorporating temporal information as explicit constraints in the model [Reference Sagulenko, Puller and Neher7]. With a time-scaled tree of this sort, an alternative visualization is made possible where an epidemic curve is vertically aligned with the tree [Reference Sintchenko and Holmes6]. This can be helpful to show how population-level dynamics over a longer period correspond to structural features of the tree, however, it is less suitable for high-resolution outbreak investigation since the dates associated with individual cases are not shown.

We have implemented the phylepic chart in an R package. Our package uses the ggplot2 library [Reference Wickham8] as a framework, the ggraph library (https://ggraph.data-imaginist.com) for drawing the tree, and the cowplot library (https://wilkelab.org/cowplot) for aligning panels. The package is designed so that minimal code is required to implement visualizations such as Figure 1, while full customisation and annotation of the plots are possible using the underlying frameworks. Manual curation and careful consideration of how data are represented remain important in crafting an effective visualization of the data. The package takes as input a prebuilt phylogeny (in any commonly used tree format such as Newick) and tabular data containing at minimum a column of dates associated with each sequence in the tree. Compared to existing software libraries for annotating phylogenetic trees, the package implements routines for date handling (including automatic binning by week) and ensuring data matching and consistency across panels.

As genomic epidemiology becomes a routine part of public health outbreak investigations, effective integration of genomic and epidemiological evidence to support collaboration between relevant public health and clinical teams remains a key challenge. Communication and visualization tools will assist in conveying complex genomic data into investigation and response contexts [Reference Arnott9, Reference Argimón10]. Our experience with using phylepic charts in public health reporting has shown that they can help epidemiologists and other public health professionals to accurately interpret phylogenomic trees by relating genomic features to more familiar epidemiological details of the cases under investigation.

Data availability statement

The R package described in this report is available for download from GitHub (https://github.com/cidm-ph/phylepic) and the Comprehensive R Archive Network (https://cran.r-project.org/package=phylepic). The specific version used is archived on Zenodo (https://doi.org/10.5281/zenodo.11479719). The package contains the tree in Newick format, the metadata table in CSV format, and the R code necessary to reproduce Figure 1, with commentary explaining usage.

Acknowledgements

We gratefully acknowledge Health Protection NSW and the NSW Health Pathology-ICPMR Microbial Genomics Reference Laboratory for providing the tree used in our example and the motivating context from the associated outbreak investigation.

Author contribution

Conceptualization: V.S.; Data curation: C.J.E.S.; Software: C.J.E.S.; Supervision: V.S.; Visualization: C.J.E.S.; Writing – original draft: C.J.E.S.; Writing – review and editing: A.E.W., Q.W., S.C.-A.C., J.K., V.S.

Funding statement

This study was supported by the Prevention Research Support Program funded by the New South Wales Ministry of Health.

Competing interest

The authors declare none.

References

Armstrong, GL, et al. (2019) Pathogen genomics in public health. New England Journal of Medicine Massachusetts Medical Society 381, 25692580.Google ScholarPubMed
Baker, KS (2020) Microbe hunting in the modern era: Reflecting on a decade of microbial genomic epidemiology. Current Biology 30, R1124R1130.CrossRefGoogle ScholarPubMed
World Health Organization (2021) Framework and Toolkit for Infection Prevention and Control in Outbreak Preparedness, Readiness and Response at the National Level. Geneva: World Health Organization.Google Scholar
Watt, AE, et al. (2022) State-wide genomic epidemiology investigations of COVID-19 in healthcare workers in 2020 Victoria, Australia: Qualitative thematic analysis to provide insights for future pandemic preparedness. The Lancet Regional Health – Western Pacific 25, 100487.CrossRefGoogle ScholarPubMed
Nishiura, H (2016) Methods to determine the end of an infectious disease epidemic: A short review. In Chowell, G and Hyman, JM (eds), Mathematical and Statistical Modeling for Emerging and re-Emerging Infectious Diseases. Cham: Springer International Publishing, pp. 291301.CrossRefGoogle Scholar
Sintchenko, V and Holmes, EC (2015) The role of pathogen genomics in assessing disease transmission. BMJ 350, h1314.CrossRefGoogle ScholarPubMed
Sagulenko, P, Puller, V and Neher, RA (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 4, vex042.CrossRefGoogle ScholarPubMed
Wickham, H (2016) ggplot2: Elegant Graphics for Data Analysis, 2nd Edn. Cham: Springer International Publishing .CrossRefGoogle Scholar
Arnott, A, et al. (2021) Documenting elimination of co-circulating COVID-19 clusters using genomics in New South Wales, Australia. BMC Research Notes 14, 415.CrossRefGoogle ScholarPubMed
Argimón, S, et al. (2016) Microreact: Visualizing and sharing data for genomic epidemiology and phylogeography. Microbial Genomics Microbiology Society 2, e000093.Google ScholarPubMed
Figure 0

Figure 1. A phylepic chart using an illustrative dataset adapted from a foodborne enteric pathogen outbreak in NSW. The panel on the left shows the phylogenomic tree with tips coloured by genomic cluster membership. Numbers on branches are bootstrap support values expressed as percentages. The panel on the upper right shows the epidemic curve using the date of sample collection. The calendar grid (boxed) in the lower right connects the tree to the epidemic curve, and each tile shows the day of the month of the sample collection date. Samples with a collection date prior to the outbreak period are shown with triangles at the left edge of the date scale. Colours on tree tips, calendar tiles, and the epidemic curve refer to the same genomic clusters.