PuSH - Publication Server of Helmholtz Zentrum München

Berrar, D.* ; Dubitzky, W.

Should significance testing be abandoned in machine learning?

Int. J. Data Sci. Anal. 7, 247-257 (2019)
Open Access Green
Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
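The paradox described in the abstract can be illustrated numerically. The sketch below is not the authors' experiment (which compares classifiers over benchmark data sets). It is a textbook-style setting chosen here for illustration: a normal mean with known standard deviation `sigma`, a point null hypothesis H0: mu = 0, an alternative with prior mu ~ N(0, tau^2), and a prior probability of 0.5 for H0. The function names and all parameter values are assumptions made for this example. Holding the z statistic fixed while the sample size n grows keeps the p value constant but drives the posterior probability of H0 toward 1.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p value of a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def posterior_prob_null(z: float, n: int, sigma: float = 1.0,
                        tau: float = 1.0, prior_null: float = 0.5) -> float:
    """Posterior probability of H0: mu = 0 given a sample mean with z score z.

    Assumed model (illustrative, not taken from the paper):
      H0: sample mean ~ N(0, sigma^2 / n)
      H1: mu ~ N(0, tau^2), so sample mean ~ N(0, tau^2 + sigma^2 / n)
    """
    r = n * tau ** 2 / sigma ** 2  # prior variance relative to sampling variance
    # Bayes factor in favor of H0: ratio of the two marginal densities.
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z ** 2 * r / (1.0 + r))
    return bf01 * prior_null / (bf01 * prior_null + (1.0 - prior_null))


# Keep z fixed (so the p value never changes) and let n grow.
z = 3.0
for n in (10, 1_000, 100_000, 100_000_000):
    p = two_sided_p_value(z)
    post = posterior_prob_null(z, n)
    print(f"n={n:>11,d}  p value={p:.4f}  P(H0 | data)={post:.4f}")
```

In this setting the p value stays below 0.01 for every n, while the posterior probability of H0 rises from a small value at n = 10 to a value close to 1 at very large n. The result depends on the assumed prior width `tau` and on the prior probability of H0, which is part of why the paradox remains a subject of debate.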
Publication type Article: Journal article
Document type Scientific article
Keywords Bayesian Test ; Classification ; Jeffreys–Lindley Paradox ; P Value ; Significance Test
ISSN (print) / ISBN 2364-415X
e-ISSN 2364-4168
Source details Volume: 7, Issue: 4, Pages: 247-257
Publisher Springer
Place of publication Cham (ZG)
Review status Peer reviewed