Paper Title
Detecting Chronic Kidney Disease(CKD) at the Initial Stage: A Novel Hybrid Feature-selection Method and Robust Data Preparation Pipeline for Different ML Techniques
Paper Authors
Paper Abstract
Chronic Kidney Disease (CKD) affects almost 800 million people around the world, and around 1.7 million people die each year because of it. Detecting CKD at the initial stage is essential for saving millions of lives. Many researchers have applied various Machine Learning (ML) methods to detect CKD at an early stage, but detailed studies are still missing. We present a structured and thorough method for handling the complexities of medical data while achieving optimal performance. This study also aims to give researchers a clear picture of the medical data preparation pipeline. In this paper, we apply KNN Imputation to impute missing values, Local Outlier Factor to remove outliers, SMOTE to handle data imbalance, stratified K-fold Cross-validation to validate the ML models, and a novel hybrid feature selection method to remove redundant features. The algorithms applied in this study are Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbor, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Finally, Random Forest detects CKD with 100% accuracy without any data leakage.
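To make the preparation steps named in the abstract concrete, the following is a minimal sketch of how they might be chained inside each cross-validation fold so that no information from the held-out fold reaches training. It is an illustration under stated assumptions, not the authors' implementation: the synthetic data, the parameter values, the function names `smote_like` and `cross_validate_without_leakage`, and the per-fold ordering are all hypothetical. The oversampler is a simplified SMOTE-style stand-in written with NumPy and scikit-learn for binary labels, and the paper's hybrid feature selection step is omitted because the abstract does not describe it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors


def smote_like(X, y, k=5, seed=0):
    """Simplified SMOTE-style oversampling for binary labels (hypothetical helper).

    New minority samples are interpolated between a minority point and one of
    its k nearest minority neighbours until both classes have equal size.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    if n_new == 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Drop the first neighbour column, which is normally the point itself.
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), n_new)
    picked = neigh[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


def cross_validate_without_leakage(X, y, n_splits=5, seed=0):
    """Stratified K-fold CV with every preparation step fitted on the training fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        # 1. KNN imputation: fitted on training rows, applied to both splits.
        imputer = KNNImputer(n_neighbors=5).fit(X_tr)
        X_tr, X_te = imputer.transform(X_tr), imputer.transform(X_te)

        # 2. Local Outlier Factor: drop flagged rows from the training split only.
        inliers = LocalOutlierFactor(n_neighbors=20).fit_predict(X_tr) == 1
        X_tr, y_tr = X_tr[inliers], y_tr[inliers]

        # 3. Oversample the minority class in the training split only.
        X_tr, y_tr = smote_like(X_tr, y_tr, seed=seed)

        # 4. Train and score on the untouched (but imputed) held-out fold.
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.array(scores)


# Synthetic, imbalanced data with about 5% missing values (not the CKD dataset).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (150, 6)), rng.normal(3, 1, (50, 6))])
y = np.array([0] * 150 + [1] * 50)
X[rng.random(X.shape) < 0.05] = np.nan

scores = cross_validate_without_leakage(X, y)
print("mean accuracy over folds:", scores.mean())
```

Fitting the imputer, the outlier detector, and the oversampler inside the loop is the design choice that matters here: if any of them were fitted on the full dataset before splitting, statistics from the held-out rows would influence training and inflate the reported accuracy. In practice, the `SMOTE` class from the imbalanced-learn package could replace `smote_like`.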