Variable Selection Methods for Classification: Application to Metabolomics Data



Ibrahim, Nurain (2020) Variable Selection Methods for Classification: Application to Metabolomics Data. PhD thesis, University of Liverpool.

Abstract

Metabolomics is an emerging field that focuses on the study of small molecules (metabolites) and their chemical processes. Metabolomics data are high-dimensional, with p>>n, where p is the number of variables and n is the sample size. Variable selection is therefore a key step in metabolomics studies. Variable selection methods fall into three categories: filter, wrapper and embedded methods. Common univariate filter methods such as the t-test and ANOVA (analysis of variance) have often been used in the literature to identify important metabolites for a given clinical problem. A challenge in metabolomics research is that metabolite variables tend to be highly correlated. Multivariate approaches that take into account the correlation among variables, such as PCA (principal component analysis), have been applied to reduce the dimensionality of metabolite datasets. The correlation-sharing t-test method (corT) is a filter method that also considers the correlation among variables, but to my knowledge it has only been applied to genomic data. Penalized regression, and in particular the embedded method Lasso, has also been applied for variable selection with the aim of minimising the overfitting that often affects prediction models in this area. In this thesis I presented a literature review on variable selection methods and classification methods applied to metabolomics data. I proposed an extended version of the variable selection method corT, which I name the adjusted correlation-sharing t-test (adjcorT). Simulation studies were carried out to compare the performance of several variable selection methods (T, corT, adjcorT and Lasso) using logistic regression for data classification. Simulations assumed a set of 200 variables, of which 2 were discriminators.
A range of sample sizes (n=50, 76, 100, 300, 500, 1000, 2000 and 20000) and of correlation values among the discriminant variables (ρ=-0.8, -0.5, -0.2, 0, 0.2, 0.5, 0.8) were considered to explore the effect that sample size and correlation have on the classification accuracy of each method. These methods were also applied to metabolomics datasets, including data from patients with colorectal cancer (aimed at discriminating between non-cancer vs colorectal cancer groups, and healthy control vs adenoma groups) as well as kidney disease and infant sepsis datasets. R code was developed to analyse the datasets. Cross-validation, with data split into two sets (80% for training and 20% for validation), was used to compare the performance of the variable selection methods in terms of classification accuracy, sensitivity, specificity and area under the ROC curve (AUC). Results from the simulation studies indicate that for small sample sizes (n=50, 76), T, corT, adjcorT and Lasso often failed to select the two discriminatory variables. For example, for ρ=0.5 and n=50, the two discriminatory variables were selected only 3%, 12%, 11% and 0% of the time, respectively. Nevertheless, the detection rates for adjcorT and Lasso improved for strong negative correlations (Table 4.3). These results are consistent with the better classification accuracy observed for adjcorT and Lasso for strong negative correlations (-1.0 < ρ ≤ -0.5; Table 4.4). As the sample size increased towards n=300, all methods improved in their ability to select the two discriminatory variables, with Lasso underperforming for strong positive correlations and corT underperforming for moderate and strong negative correlations.
These differences can explain the dissimilarities in classification accuracy observed across methods for sample sizes n=300, 500 and 1000, with Lasso showing poorer performance than T, corT and adjcorT for strong positive correlations, and corT showing poorer performance than the other methods for moderate and strong negative correlations (Tables 4.5 and 4.6). As the sample size increased further, T, adjcorT and Lasso offered a similar level of accuracy, but corT still underperformed for moderate and strong negative correlations at larger sample sizes (Table 4.7). In the clinical applications, corT and adjcorT showed a similar level of classification accuracy, possibly due to the positive correlation that exists among most metabolites. For non-cancer and cancer discrimination, the T method showed the worst classification accuracy, followed by Lasso. Methods corT and adjcorT achieved the best level of discrimination, although this was still low (AUC of 0.60; Table 5.3). For healthy control and adenoma discrimination, however, methods corT and adjcorT showed the lowest AUC, followed by the T method. Lasso achieved the best level of discrimination, although this remained low (AUC of 0.65; Table 5.8). For the discrimination between bacterial and non-bacterial sepsis cases, Lasso performed better than the other variable selection methods, with 83.1% classification accuracy (Table 5.13). Lasso also offered the best level of discrimination between healthy controls and kidney disease (AUC=0.90; Table 5.21), although the four methods showed comparable performance (AUCs of 0.86 and 0.87 were achieved with the T method and with the corT and adjcorT methods, respectively). My work based on simulations shows that adjcorT offers a flexible approach for variable selection aimed at clinical classification, especially for datasets involving negative correlations between discriminators and for medium and large samples, where adjcorT consistently performs better than corT.
These findings were, however, not reproduced by the analyses of real data. I believe this is possibly due to the lack of negative correlations among metabolites in the datasets considered. Both adjcorT and corT are filter variable selection methods. Given that adjcorT performed better than corT for negative correlations and similarly for positive correlations across all sample sizes investigated, adjcorT is expected to offer advantages over corT as a variable selection method for the analysis of some metabolomics data.
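
To make the simulation design in the abstract more concrete, the following is a minimal Python sketch of one ingredient only: generating 200 variables of which 2 are correlated discriminators, and checking how often a plain t-test filter ranks both of them on top. It is not the thesis's R code, and it does not implement corT, adjcorT or Lasso. The function names, the mean shift `delta`, the balanced two-class design, the unit variances, the use of Welch's t-statistic and the "top 2 by absolute t" selection rule are all assumptions made for illustration.

```python
import math
import random
import statistics


def simulate(n, rho, p=200, delta=1.0, rng=None):
    """Return (X, y) with n samples and p variables.

    Columns 0 and 1 are the two discriminators; within each class their
    correlation is rho. The remaining p - 2 columns are pure noise.
    delta (class mean shift) is an illustrative assumption.
    """
    rng = rng or random.Random(0)
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes (assumption)
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        d1 = z1 + delta * label
        # Mixing z1 and z2 this way gives unit variance and correlation rho.
        d2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2 + delta * label
        noise = [rng.gauss(0, 1) for _ in range(p - 2)]
        X.append([d1, d2] + noise)
        y.append(label)
    return X, y


def welch_t(a, b):
    """Welch's two-sample t-statistic for lists a and b."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )


def top_k_by_t(X, y, k=2):
    """Indices of the k variables with the largest absolute t-statistic."""
    p = len(X[0])
    scores = []
    for j in range(p):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def detection_rate(n, rho, reps=20, seed=1):
    """Fraction of simulated datasets in which both discriminators are selected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        X, y = simulate(n, rho, rng=rng)
        if set(top_k_by_t(X, y)) == {0, 1}:
            hits += 1
    return hits / reps
```

Calling, for example, `detection_rate(50, 0.5)` and `detection_rate(300, 0.5)` gives a rough feel for how the chance of recovering both discriminators depends on sample size under these assumed settings. The resulting numbers depend on `delta` and the selection rule, so they are not comparable to the percentages reported in the thesis.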

Item Type: Thesis (PhD)
Divisions: Faculty of Health and Life Sciences > Institute of Life Courses and Medical Sciences > School of Medicine
Depositing User: Symplectic Admin
Date Deposited: 04 Sep 2020 10:27
Last Modified: 18 Jan 2023 23:46
DOI: 10.17638/03093506
Supervisors:
URI: https://livrepository.liverpool.ac.uk/id/eprint/3093506