Optimising the statistical pipeline for quantitative proteomics



Price, Hayley (2023) Optimising the statistical pipeline for quantitative proteomics. PhD thesis, University of Liverpool.

200991684_Oct2023.pdf - Author Accepted Manuscript


Abstract

Background: Label-free quantitative proteomics applies differential expression (DE) analysis to high-throughput mass spectrometry data, providing insight into disease biomarkers and protein involvement in metabolic pathways, and facilitating drug discovery. Applying statistical techniques to assess the significance of proteins changing in abundance is complicated by the properties of the data. Small numbers of samples containing vast numbers of features result in large sample-to-sample variation, where the comparison of means can be distorted by outliers. Limitations of benchmarking data and the complexity of the algorithms make software comparison challenging. Full optimisation of the proteomics workflow is difficult, and obtaining optimal results intuitively is a daunting task for the biologist. The aim of this Industrial CASE PhD studentship, in collaboration with Nonlinear Dynamics, the developers of Progenesis QI for Proteomics (QIP), is to provide an improved statistical pipeline that could be implemented in the Progenesis QIP workflow.

Methods: Three existing statistical approaches (QPROT, ANOVA as implemented directly in Progenesis QIP, and MSstats) were benchmarked both traditionally, using spike-in datasets, and with a novel method that uses biological data and applies pathway analysis as an evaluation metric. Normalisation methods and the optimal threshold for defining significance were also investigated. Following this, an optimised proteomics pipeline was developed and implemented on a high-performance computing cluster, which ran multiple combinations of DE analysis method, normalisation method, and significance threshold in parallel. Functional enrichment analysis of the proteins defined as changing was used to assess the results, and the optimal parameter combination was returned to the user. The effectiveness of this approach was demonstrated by comparing the best results from the pipeline with enrichment analysis of the output from the current Progenesis QIP workflow.

Results: Overall, the benchmarking results gave no consensus on the best DE method, normalisation method, or significance threshold, and the correct combination of parameters appeared to depend on the characteristics of the individual dataset. The results also showed that the choice of an appropriate normalisation method is an important and underappreciated factor in differential expression analysis, and that the optimal threshold for defining significance varied greatly from the generally accepted value of p < 0.05. The optimised pipeline’s performance was superior to a standard analysis using Progenesis QIP. To our knowledge, this is the only end-to-end pathway analysis pipeline designed for proteomics data, enabling users to iterate through multiple options to find the best normalisation method and the best significance threshold for pathway analysis.
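
The parameter search described in the Methods can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a hypothetical callable `run_pipeline(de_method, normalisation, threshold)` that returns the set of protein identifiers called as changing, and it scores each combination with a one-sided hypergeometric enrichment test against a single pathway. The abstract does not state the actual scoring rule, so that choice is an assumption, and every name below is illustrative.

```python
from itertools import product
from math import comb


def hypergeom_pvalue(hits, drawn, pathway_size, background_size):
    """P(X >= hits) when `drawn` proteins are sampled without replacement
    from `background_size` proteins, `pathway_size` of which are in the pathway."""
    upper = min(drawn, pathway_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, drawn - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(background_size, drawn)


def best_combination(de_methods, normalisations, thresholds,
                     run_pipeline, pathway, background_size):
    """Try every (DE method, normalisation, threshold) combination and
    return (p_value, de_method, normalisation, threshold) for the
    combination with the smallest enrichment p-value.

    `run_pipeline` is a hypothetical stand-in for the real analysis step.
    """
    results = []
    for de, norm, alpha in product(de_methods, normalisations, thresholds):
        changing = run_pipeline(de, norm, alpha)  # set of protein IDs
        if changing:
            hits = len(changing & pathway)
            p = hypergeom_pvalue(hits, len(changing), len(pathway), background_size)
        else:
            p = 1.0  # nothing called as changing, so no evidence of enrichment
        results.append((p, de, norm, alpha))
    return min(results, key=lambda r: r[0])
```

The combinations are independent of one another, which is what makes this kind of search easy to distribute across a cluster. A real pipeline would typically score many pathways and correct for multiple testing instead of relying on a single p-value.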

Item Type: Thesis (PhD)
Uncontrolled Keywords: Label-free, quantitative proteomics, differential expression, functional enrichment
Divisions: Faculty of Health and Life Sciences
Depositing User: Symplectic Admin
Date Deposited: 05 Feb 2024 16:32
Last Modified: 05 Feb 2024 16:32
DOI: 10.17638/03176846
Supervisors:
  • Jones, Andy
  • Savage, Natasha
  • Beynon, Rob
  • Morns, Ian
URI: https://livrepository.liverpool.ac.uk/id/eprint/3176846