Predicting the Predisposition to Colorectal Cancer based on SNP Profiles of Immune Phenotypes using Supervised Learning Models
Cakmak, Ali,
Ibrahimzada, Ali Reza,
Arikan, Soykan,
Ayaz, Huzeyfe,
Demirkol, Seyda,
Sonmez, Dilara,
Hakan, Mehmet Tolgahan,
Surmen, Saime Turan,
Horozoglu, Cem,
Kucukhuseyin, Ozlem,
Cacina, Canan,
Kiran, Bayram,
Zeybek, Umit,
Baysan, Mehmet,
and Yaylim, Ilhan
Medical & Biological Engineering & Computing, 2022
This study explores the machine learning-based assessment of predisposition to
colorectal cancer based on single nucleotide polymorphisms (SNP). Such a computational
approach may be used as a risk indicator and an auxiliary diagnostic method that complements
traditional methods such as biopsy and CT scans. Moreover, it may be used to develop a low-cost
screening test for the early detection of colorectal cancers to improve public health. We employ
several supervised classification algorithms and apply data imputation to fill in
missing genotype values. The dataset comprises SNPs from 115 individuals, observed at particular
colorectal cancer-associated genomic loci located within DNA regions of 11 selected genes.
We make the following observations: (i) A Random Forest-based classifier
using one-hot encoding and K-Nearest Neighbor (KNN)-based imputation performs the best
among the studied classifiers with an F1 score of 89% and Area Under the Curve (AUC) score of
0.96. (ii) One-hot encoding together with K-Nearest-Neighbor-based data imputation increases the
F1 scores by around 26% in comparison to the baseline approach, which does not employ them.
(iii) The proposed model outperforms a commonly employed state-of-the-art approach,
ColonFlag, under all evaluated settings by up to 24% in terms of the AUC score. Based on the
high accuracy of the constructed predictive models, the 11 studied genes may be considered a
candidate gene panel for colon cancer risk screening.
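
To make the preprocessing described above concrete, the following is a minimal sketch of nearest-neighbor imputation of missing genotypes followed by one-hot encoding. It is an illustration under stated assumptions, not the authors' pipeline: the genotype labels, the toy samples, the mismatch-based distance, the majority vote among neighbors, the value of k, and the order of the two steps are all hypothetical choices that the abstract does not specify.

```python
from collections import Counter

# Hypothetical genotype categories for a biallelic SNP (assumption).
GENOTYPES = ("AA", "AG", "GG")


def mismatch_distance(a, b):
    """Fraction of mismatching genotypes over loci observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")  # no shared loci: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing genotype (None) with the most common genotype
    at that locus among the k nearest samples that have it observed."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples with an observed value at locus j.
            donors = sorted(
                (mismatch_distance(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )
            nearest = [rows[idx][j] for _, idx in donors[:k]]
            if nearest:
                imputed[i][j] = Counter(nearest).most_common(1)[0][0]
    return imputed


def one_hot(rows, categories=GENOTYPES):
    """Expand each genotype into one binary indicator per category."""
    encoded = []
    for row in rows:
        vector = []
        for value in row:
            vector.extend(1 if value == c else 0 for c in categories)
        encoded.append(vector)
    return encoded


# Toy data: 4 samples x 3 SNP loci, with one missing genotype.
samples = [
    ["AA", "AG", "GG"],
    ["AA", "AG", None],
    ["AA", "AG", "GG"],
    ["GG", "GG", "AA"],
]

filled = knn_impute(samples, k=2)
features = one_hot(filled)
```

The resulting binary feature vectors could then be passed to a supervised classifier such as a Random Forest. One-hot encoding avoids imposing an artificial ordering on genotype categories, which is one plausible reason it might help over an integer encoding.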