Evaluation of variant calling algorithms for wastewater-based epidemiology using mixed populations of SARS-CoV-2 variants in synthetic and wastewater samples



Bassano, Irene, Ramachandran, Vinoy, Khalifa, Mohammad, Lilley, Chris, Brown, Mathew, van Aerle, Ronny, Denise, Hubert ORCID: 0000-0001-9862-5890, Rowe, William, George, Airey, Cairns, Edward
et al. (2022) Evaluation of variant calling algorithms for wastewater-based epidemiology using mixed populations of SARS-CoV-2 variants in synthetic and wastewater samples. [Preprint]

Access the full text of this item via the Open Access URL listed below.

Abstract

Wastewater-based epidemiology (WBE) has been used extensively throughout the COVID-19 pandemic to detect and monitor the spread and prevalence of SARS-CoV-2 and its variants. It has proven an excellent complement to clinical sequencing, supporting the insights gained and helping to inform public health decisions. Consequently, many groups globally have developed bioinformatics pipelines to analyse sequencing data from wastewater. Accurate calling of mutations is critical in this process and in the assignment of circulating variants, yet, to date, the performance of variant-calling algorithms in wastewater samples has not been investigated. To address this, we compared the performance of six variant callers widely used in bioinformatics pipelines (VarScan, iVar, GATK, FreeBayes, LoFreq and BCFtools) on 19 synthetic samples with known ratios of three different SARS-CoV-2 variants (Alpha, Beta and Delta), as well as 13 wastewater samples collected in London between 15 and 18 December 2021. We used the fundamental parameters of recall (sensitivity) and precision (positive predictive value) to confirm the presence of mutational profiles defining specific variants across the six variant callers. Our results show that BCFtools, FreeBayes and VarScan found the expected variants with higher precision and recall than GATK or iVar, although iVar identified more expected defining mutations than the other callers. LoFreq gave the least reliable results because of the high number of false-positive mutations it detected, which lowered its precision. Similar results were obtained for both the synthetic and wastewater samples.
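
As a rough illustration of the two metrics named in the abstract, the sketch below computes precision, TP / (TP + FP), and recall, TP / (TP + FN), for one caller's output against a set of expected defining mutations. It is a minimal example under stated assumptions, not the authors' evaluation code: the function name precision_recall and the mutation labels are hypothetical placeholders, and a real comparison would first parse each caller's VCF output.

```python
def precision_recall(called, expected):
    """Return (precision, recall) for called mutations against an expected set.

    Both arguments are iterables of mutation labels. This is an illustrative
    helper, not the evaluation code used in the preprint.
    """
    called, expected = set(called), set(expected)
    true_pos = len(called & expected)    # expected mutations that were called
    false_pos = len(called - expected)   # called mutations that were not expected
    false_neg = len(expected - called)   # expected mutations that were missed
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    return precision, recall


# Placeholder labels, invented for illustration (not real defining mutations).
expected = {"A100T", "C200G", "G300A", "T400C"}
called = {"A100T", "C200G", "G300A", "G500T", "C600A"}

precision, recall = precision_recall(called, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

In this toy case the caller recovers three of the four expected mutations (recall 0.75) but also reports two unexpected ones, so precision falls to 0.60. This is the same pattern the abstract describes for a caller that reports many false positives.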

Item Type: Preprint
Uncontrolled Keywords: 31 Biological Sciences, 3102 Bioinformatics and Computational Biology, Emerging Infectious Diseases, Coronaviruses, Infectious Diseases, Coronaviruses Disparities and At-Risk Populations, Bioengineering, Genetics, 3 Good Health and Well Being
Divisions: Faculty of Health and Life Sciences
Faculty of Health and Life Sciences > Institute of Infection, Veterinary and Ecological Sciences
Depositing User: Symplectic Admin
Date Deposited: 02 Dec 2022 09:10
Last Modified: 21 Jun 2024 13:05
DOI: 10.1101/2022.06.06.22275866
Open Access URL: https://www.medrxiv.org/content/10.1101/2022.06.06...
URI: https://livrepository.liverpool.ac.uk/id/eprint/3166475