Paper Title
An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification
Paper Authors
Paper Abstract
Precise and efficient automated identification of Gastrointestinal (GI) tract diseases can help doctors treat more patients and improve the rate of disease detection and identification. Currently, automatic analysis of diseases in the GI tract is a hot topic in both computer science and medical-related journals. Nevertheless, the evaluation of such automatic analyses is often incomplete or simply wrong. Algorithms are often tested only on small and biased datasets, and cross-dataset evaluations are rarely performed. A clear understanding of evaluation metrics and of how machine learning models behave across datasets is crucial to bring research in the field to a new quality level. Towards this goal, we present comprehensive evaluations of five distinct machine learning models, based on global features and deep neural networks, that can classify 16 key types of GI tract conditions, including pathological findings, anatomical landmarks, polyp removal conditions, and normal findings, from images captured by common GI tract examination instruments. In our evaluation, we introduce performance hexagons built from six performance metrics, namely recall, precision, specificity, accuracy, F1-score, and the Matthews Correlation Coefficient (MCC), to demonstrate how the real capabilities of models can be determined rather than evaluating them superficially. Furthermore, we perform cross-dataset evaluations, using different datasets for training and testing. With these cross-dataset evaluations, we demonstrate the challenge of building a generalizable model that could be used across different hospitals. Our experiments clearly show that more sophisticated performance metrics and evaluation methods are needed to obtain reliable models, rather than depending on evaluations based on splits of the same dataset; in other words, performance metrics should always be interpreted together instead of relying on a single metric.
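
The performance hexagon reads these six metrics off a confusion matrix. As a reference, the sketch below computes them for a single class in a one-vs-rest setting; it illustrates only the standard formulas under assumed conditions, not the authors' exact multi-class protocol, and the counts in the example are hypothetical.

```python
# Minimal sketch: the six "performance hexagon" metrics from a binary
# (one-vs-rest) confusion matrix. The paper covers 16 GI tract classes;
# per-class computation and averaging are assumed here for illustration.
import math

def hexagon_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, precision, specificity, accuracy, F1-score, and MCC."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0     # positive predictive value
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

# Hypothetical class-imbalanced example: accuracy alone looks strong,
# while F1 and MCC expose a weak classifier.
print(hexagon_metrics(tp=5, fp=20, tn=900, fn=75))
```

On these hypothetical counts, accuracy comes out around 0.91 while F1 and MCC stay below 0.1, which is precisely the kind of disagreement that makes interpreting the six metrics together necessary.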
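
The cross-dataset protocol itself is straightforward: fit on one dataset and test on another. The sketch below imitates it with scikit-learn on synthetic data, where a distribution shift between two hypothetical hospitals stands in for acquisition bias; the data generator, features, and logistic-regression classifier are illustrative assumptions, not the paper's global-feature or deep-neural-network pipeline.

```python
# Hedged sketch of a cross-dataset evaluation loop on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef

rng = np.random.default_rng(0)

def make_dataset(shift: float, n: int = 500):
    """Synthetic two-class 'hospital' data; `shift` models acquisition bias."""
    X = rng.normal(shift, 1.0, size=(n, 8))
    y = (X[:, 0] + rng.normal(0.0, 1.0, n) > shift).astype(int)
    return X, y

datasets = {"hospital_a": make_dataset(0.0), "hospital_b": make_dataset(1.5)}

for train_name, (X_tr, y_tr) in datasets.items():
    model = LogisticRegression().fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in datasets.items():
        # train == test reuses the training data and is optimistic;
        # the cross pairs show how far performance drops under shift.
        y_pred = model.predict(X_te)
        print(f"train={train_name} test={test_name} "
              f"MCC={matthews_corrcoef(y_te, y_pred):.2f} "
              f"F1={f1_score(y_te, y_pred):.2f}")
```

With the synthetic shift in place, the same-dataset scores look healthy while the cross-dataset MCC collapses toward zero, mirroring the generalization gap between hospitals that the abstract describes.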