PanForest: predicting genes in genomes using random forests



Beavan, AJS ORCID: 0000-0002-8219-6742, Domingo-Sananes, MR and McInerney, JO ORCID: 0000-0003-1885-2503
(2026) PanForest: predicting genes in genomes using random forests. Bioinformatics (Oxford, England), 42 (1). btag005. ISSN 1367-4803, 1367-4811

btag005.pdf - Author Accepted Manuscript
Available under License Creative Commons Attribution.


Abstract

MOTIVATION: The presence or absence of some genes in a genome can influence whether other genes are likely to be present or absent. Understanding these gene co-occurrence and avoidance patterns reveals fundamental principles of genome organization, with applications ranging from evolutionary reconstruction to rational design of synthetic genomes.

RESULTS: PanForest, presented here, uses random forest classifiers to predict the presence and absence of genes in genomes from the set of other genes present. Performance statistics output by PanForest reveal how predictable each gene's presence or absence is, based on the presence or absence of other genes in the genome. Further, PanForest produces statistics indicating the importance of each gene in predicting the presence or absence of each other gene. The PanForest software can run serially or in parallel, thereby facilitating the analysis of pangenomes at Network of Life scale. A pangenome of 12 741 accessory genes in 1000 Escherichia coli genomes was analysed in around 5 h using eight processors. To demonstrate PanForest's utility, we present a case study and show that certain genes associated with resistance to antimicrobial drugs reliably predict the presence or absence of other genes associated with resistance to the same drug. Further, we highlight several associations between those genes and others that are either not known to be associated with antimicrobial resistance (AMR) or are associated with resistance to other drugs. We envisage PanForest's use in studies concerning the dynamics of gene distributions in pangenomes, in disciplines ranging from biomedical science and synthetic biology to molecular ecology.

AVAILABILITY AND IMPLEMENTATION: The software is freely available with a full manual at www.github.com/alanbeavan/PanForest (DOI: https://doi.org/10.5281/zenodo.17865482).
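
The idea described in the abstract, predicting one gene's presence or absence from all the other genes and then asking which predictors mattered, can be illustrated with a small toy. The sketch below is not PanForest's code and makes no claim about its implementation: it is a from-scratch random forest using only the Python standard library, run on a synthetic presence/absence matrix. The gene names, the matrix, the tree depth, the number of trees and every function name are assumptions made for illustration. In the synthetic data, geneB always co-occurs with geneA, and the remaining genes are random.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow(rows, labels, rng, importance, n_total, depth=0, max_depth=3):
    """Grow one tree. A leaf is 0/1; a node is (feature, absent_branch, present_branch)."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == max_depth or gini(labels) == 0.0:
        return majority
    n_feat = len(rows[0])
    # Consider a random subset of about sqrt(n_feat) predictor genes at each node.
    candidates = rng.sample(range(n_feat), max(1, int(n_feat ** 0.5)))
    best = None
    for f in candidates:
        absent = [y for x, y in zip(rows, labels) if x[f] == 0]
        present = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not absent or not present:
            continue  # this gene does not separate the samples at this node
        weighted = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        gain = gini(labels) - weighted
        if best is None or gain > best[0]:
            best = (gain, f)
    if best is None:
        return majority
    gain, f = best
    # Impurity-decrease importance, weighted by the share of samples at this node.
    importance[f] += gain * len(labels) / n_total
    branches = []
    for value in (0, 1):
        idx = [i for i, x in enumerate(rows) if x[f] == value]
        branches.append(grow([rows[i] for i in idx], [labels[i] for i in idx],
                             rng, importance, n_total, depth + 1, max_depth))
    return (f, branches[0], branches[1])

def predict_tree(tree, x):
    """Follow presence/absence splits until a leaf (0 or 1) is reached."""
    while isinstance(tree, tuple):
        f, absent, present = tree
        tree = present if x[f] else absent
    return tree

def fit_forest(rows, labels, n_trees=50, seed=0):
    """Fit trees on bootstrap samples; return (trees, normalised importances)."""
    rng = random.Random(seed)
    n = len(rows)
    importance = [0.0] * len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of genomes
        trees.append(grow([rows[i] for i in idx], [labels[i] for i in idx],
                          rng, importance, n))
    total = sum(importance) or 1.0
    return trees, [v / total for v in importance]

def predict_forest(trees, x):
    """Majority vote over all trees."""
    votes = sum(predict_tree(t, x) for t in trees)
    return int(2 * votes >= len(trees))

def predict_gene(matrix, target, n_train):
    """Fit a forest for one target gene; return (held-out accuracy, importances)."""
    rows = [[v for j, v in enumerate(g) if j != target] for g in matrix]
    labels = [g[target] for g in matrix]
    trees, importance = fit_forest(rows[:n_train], labels[:n_train])
    test = list(zip(rows[n_train:], labels[n_train:]))
    correct = sum(predict_forest(trees, x) == y for x, y in test)
    return correct / len(test), importance

# Synthetic pangenome (hypothetical): 200 genomes x 5 genes, geneB mirrors geneA.
data_rng = random.Random(1)
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
matrix = []
for _ in range(200):
    a = data_rng.randint(0, 1)
    matrix.append([a, a] + [data_rng.randint(0, 1) for _ in range(3)])

# One forest per gene, each trained on 150 genomes and scored on the other 50.
results = {}
for t, name in enumerate(genes):
    accuracy, importance = predict_gene(matrix, t, n_train=150)
    predictors = [g for j, g in enumerate(genes) if j != t]
    weights = dict(zip(predictors, importance))
    results[name] = (accuracy, weights)
    print(name, round(accuracy, 2), max(weights, key=weights.get))
```

Each line of output gives a gene, its held-out accuracy, and its highest-weighted predictor. These loosely mirror the two kinds of statistic the abstract mentions: how predictable each gene is, and how important each other gene is for that prediction. For a real pangenome, an established random forest library would be the more usual choice than hand-written trees.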

Item Type: Article
Uncontrolled Keywords: Escherichia coli, Computational Biology, Genomics, Genome, Bacterial, Algorithms, Software, Random Forest
Divisions: Faculty of Health & Life Sciences
Faculty of Health & Life Sciences > Inst. Infection, Vet & Ecological Sciences
Faculty of Health & Life Sciences > Inst. Infection, Vet & Ecological Sciences > Inst. Infection, Vet & Ecological Sciences (T&R Staff)
Faculty of Health & Life Sciences > Inst. Infection, Vet & Ecological Sciences > Evolution, Ecology & Behaviour
Depositing User: Symplectic Admin
Date Deposited: 19 Jan 2026 11:00
Last Modified: 28 Feb 2026 01:28
DOI: 10.1093/bioinformatics/btag005
URI: https://livrepository.liverpool.ac.uk/id/eprint/3196664