Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs



Noble, Peter-John Mantyla, Appleton, Charlotte, Radford, Alan David ORCID: 0000-0002-4590-1334 and Nenadic, Goran (2021) Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs. PLOS ONE, 16 (12). e0260402.

FinalSubmittedWordDoc_withFigures.docx - Author Accepted Manuscript

Abstract

A key goal of disease surveillance is to identify outbreaks of known or novel diseases in a timely manner. Such an outbreak, associated with acute vomiting in dogs, occurred in the UK between December 2019 and March 2020. We tracked this outbreak using the clinical free text component of anonymised electronic health records (EHRs) collected from a sentinel network of participating veterinary practices. We sourced the free text (narrative) component of each EHR supplemented with one of 10 practitioner-derived main presenting complaints (MPCs), with the 'gastroenteric' MPC identifying cases involved in the disease outbreak. Such clinician-derived annotation systems can suffer from poor compliance, requiring retrospective, often manual, coding and thereby limiting real-time usability, especially where an outbreak of a novel disease might not present clinically as a currently recognised syndrome or MPC. Here, we investigate the use of an unsupervised method of EHR annotation using latent Dirichlet allocation (LDA) topic modelling to identify topics inherent within the clinical narrative component of EHRs. The model comprised 30 topics, which were used to annotate EHRs spanning the natural disease outbreak and to investigate whether any given topic might mirror the outbreak time course. Each narrative was annotated, using the LdaModel module of the Gensim library, with the topic best representing its text. Counts of narratives labelled with one of the topics significantly matched the disease outbreak as identified by the practitioner-derived 'gastroenteric' MPC (Spearman correlation 0.978); no other topic showed a similar time course. Using artificially injected outbreaks, it was possible to see other topics that would match other MPCs, including respiratory disease. The underlying topics were readily evaluated using simple word-cloud representations and a freely available package (LDAVis), providing rapid insight into the clinical basis of each topic.
This work shows that unsupervised record annotation using topic modelling, linked to simple text visualisations, can provide an easily interrogable method to identify and characterise outbreaks and other anomalies of known and previously uncharacterised diseases based on changes in clinical narratives.
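
The annotation and comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each narrative already has a topic distribution in the form of (topic_id, probability) pairs, the shape Gensim's `LdaModel.get_document_topics` returns for a document, and it uses only the Python standard library. The function names, the two-topic toy records, and the weekly MPC counts are all hypothetical and chosen only to show how dominant-topic labels become a weekly time series that is then compared with MPC counts by Spearman correlation.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return the id of the most probable topic for one narrative.

    topic_probs is a list of (topic_id, probability) pairs.
    """
    return max(topic_probs, key=lambda pair: pair[1])[0]


def weekly_topic_counts(records, topic_id, n_weeks):
    """Count, per week, the narratives whose dominant topic is topic_id.

    records is a list of (week_index, topic_probs) pairs.
    """
    counts = Counter(
        week for week, probs in records if dominant_topic(probs) == topic_id
    )
    return [counts.get(week, 0) for week in range(n_weeks)]


def average_ranks(values):
    """Rank values starting from 1, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / (sd_x * sd_y)


# Hypothetical toy data: (week, topic distribution) for eight narratives.
records = [
    (0, [(0, 0.70), (1, 0.30)]),
    (0, [(0, 0.20), (1, 0.80)]),
    (1, [(0, 0.90), (1, 0.10)]),
    (1, [(0, 0.60), (1, 0.40)]),
    (2, [(0, 0.80), (1, 0.20)]),
    (2, [(0, 0.55), (1, 0.45)]),
    (2, [(0, 0.51), (1, 0.49)]),
    (3, [(0, 0.10), (1, 0.90)]),
]
# Hypothetical weekly counts of one practitioner-assigned MPC.
mpc_counts = [2, 4, 7, 1]

for topic in (0, 1):
    series = weekly_topic_counts(records, topic, n_weeks=4)
    print(topic, series, spearman(series, mpc_counts))
```

In this toy setting the topic whose weekly counts rise and fall with the MPC counts gives a high positive correlation and the other does not, which mirrors the kind of comparison the abstract reports for the 'gastroenteric' MPC. With real data, a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written rank correlation.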

Item Type: Article
Uncontrolled Keywords: Animals, Dogs, Gastroenteritis, Dog Diseases, Population Surveillance, Disease Outbreaks, Electronic Health Records, Data Curation, Unsupervised Machine Learning, United Kingdom
Divisions: Faculty of Health and Life Sciences
Faculty of Health and Life Sciences > Institute of Infection, Veterinary and Ecological Sciences
Depositing User: Symplectic Admin
Date Deposited: 12 Jan 2022 15:29
Last Modified: 18 Jan 2023 21:16
DOI: 10.1371/journal.pone.0260402
URI: https://livrepository.liverpool.ac.uk/id/eprint/3146677