
References

Published online by Cambridge University Press: 07 November 2024

Nathalie Japkowicz, American University, Washington DC
Zois Boukouvalas, American University, Washington DC

Type: Chapter
Book: Machine Learning Evaluation: Towards Reliable and Responsible AI, pp. 387–402
Publisher: Cambridge University Press
Print publication year: 2024
Chapter DOI: https://doi.org/10.1017/9781009003872.021


References

Abadi, M., Chu, A., Goodfellow, I. et al. (2016). Deep learning with differential privacy. In Kruegel, C., Myers, A., and Halevi, S. (eds.) Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318. Association for Computing Machinery.
Abdi, H. (2007). Multiple correlation coefficient. In Salkind, N. J. (ed.) Encyclopedia of Measurement and Statistics, 648–651. SAGE Publications.
Absalom, E. E., Ikotun, A. M., Oyelade, O. N. et al. (2022). A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743. https://doi.org/10.1016/j.engappai.2022.104743
Adali, T., Anderson, M., and Fu, G.-S. (2014). Diversity in independent component and vector analyses: identifiability, algorithms, and applications in medical imaging. IEEE Signal Processing Magazine, 31(3), 18–33.
Adali, T., and Calhoun, V. D. (2022). Reproducibility and replicability in neuroimaging data analysis. Current Opinion in Neurology, 35(4), 475–481.
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. G. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34, 29304–29320.
Aggarwal, C. C., Kong, X., Gu, Q., Han, J., and Yu, P. S. (2014). Active learning: a survey. In Aggarwal, C. C. (ed.) Data Classification: Algorithms and Applications, 599–634. Chapman and Hall.
Alaiz-Rodríguez, R., Japkowicz, N., and Tischer, P. (2008). Visualizing classifier performance on different domains. In Proceedings of the 20th IEEE International Conference on Tools with Artificial Intelligence, 3–10. IEEE Computer Society.
Ali, R., Lee, S., and Chung, T. C. (2017). Accurate multi-criteria decision making methodology for recommending machine learning algorithm. Expert Systems with Applications, 71, 257–278.
Alpaydın, E. (1999). Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11, 1885–1892.
Amancio, D. R., Comin, C. H., Casanova, D. et al. (2014). A systematic comparison of supervised classifiers. PLoS ONE, 9(4), e94137.
Amodei, D., Olah, C., Steinhardt, J. et al. (2016). Concrete problems in AI safety. arXiv:1606.06565.
Andersson, A., Davidsson, P., and Linden, J. (1999). Measure-based classifier performance evaluation. Pattern Recognition Letters, 20(11–13), 1165–1173.
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23, 321–327.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv:1708.05866.
Ashayeri, C., and Jha, B. (2021). Evaluation of transfer learning in data-driven methods in the assessment of unconventional resources. Journal of Petroleum Science and Engineering, 207, 109178.
Atanov, A., Xu, S., Beker, O., Filatov, A., and Zamir, A. (2022). Simple control baselines for evaluating transfer learning. arXiv:2202.03365.
Bahari, M. H., and Hamme, H. V. (2014). Normalized ordinal distance; a performance metric for ordinal, probabilistic-ordinal or partial-ordinal classification problems. In Issac, B. and Israr, N. (eds.) Case Studies in Intelligent Computing: Achievements and Trends, 285–302. CRC Press.
Bahri, M., Bifet, A., Gama, J., Gomes, H. M., and Maniu, S. (2021). Data stream analysis: foundations, major tasks and tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(3), e1405.
Barocas, S., and Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104, 671–732.
Beck, N., Sivasubramanian, D., Dani, A., Ramakrishnan, G., and Iyer, R. K. (2021). Effective evaluation of deep active learning on image classification tasks. arXiv:2106.15324.
Bekker, J., and Davis, J. (2020). Learning from positive and unlabeled data: a survey. Machine Learning, 109, 719–760.
Bellinger, C., Corizzo, R., and Japkowicz, N. (to appear). Performance estimation bias in class imbalance with minority subconcepts. In Moniz, N., Branco, P., Japkowicz, N., Woźniak, M., and Wang, S. (eds.) Proceedings of Machine Learning Research.
Benavoli, A., Corani, G., Demšar, J., and Zaffalon, M. (2017). Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18(1), 2653–2688.
Berrar, D. P. (2016). Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Machine Learning, 106, 911–949.
Berrar, D. P., and Lozano, J. A. (2013). Significance tests or confidence intervals: Which are preferable for the comparison of classifiers? Journal of Experimental & Theoretical Artificial Intelligence, 25, 189–206.
Bishop, C. M., and Nasrabadi, N. M. (2006). Pattern recognition and machine learning. Springer.
Borji, A. (2019). Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 179, 41–65.
Bouckaert, R. R. (2003). Choosing between two learning algorithms based on calibrated tests. In Fawcett, T., and Mishra, N. (eds.) Proceedings of the Twentieth International Conference on Machine Learning, 51–58. AAAI Press.
Bouckaert, R. R. (2004). Estimating replicability of classifier learning experiments. In Brodley, C. (ed.) Proceedings of the Twenty-First International Conference on Machine Learning, paper 15. AAAI Press.
Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, vol. 3176 of Lecture Notes in Artificial Intelligence, 169–207. Springer Verlag.
Bradford, J. P., Kunz, C., Kohavi, R., Brunk, C., and Brodley, C. E. (1998). Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, 131–136. Springer.
Branco, P., Torgo, L., and Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys, 49, 1–50.
Branco, P., Torgo, L., and Ribeiro, R. P. (2017). Relevance-based evaluation metrics for multi-class imbalanced domains. In Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., and Moon, Y.-S. (eds.) Pacific-Asia Conference on Knowledge Discovery and Data Mining, 698–710. Springer International Publishing.
Brownlee, J. (2016). Statistical methods for machine learning: discover how to transform data into knowledge with Python. Jason Brownlee. https://machinelearningmastery.com/statistics_for_machine_learning
Brundage, M., Avin, S., Wang, J. et al. (2020). Toward trustworthy AI development: mechanisms for supporting verifiable claims. arXiv:2004.07213.
Bryson, J. J. (2019). The past decade and future of AI’s impact on society. www.bbvaopenmind.com/en/articles/the-past-decade-and-future-of-ais-impact-on-society/
Brzezinski, D. W., Stefanowski, J., Susmaga, R., and Szczech, I. (2020). On the dynamics of classification measures for imbalanced and streaming data. IEEE Transactions on Neural Networks and Learning Systems, 31, 2868–2878.
Buja, A., Stuetzle, W., and Shen, Y. (2005). Loss functions for binary class probability estimation: structure and applications. www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf
Buolamwini, J., and Gebru, T. (2018). Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability, and Transparency, 77–91. Proceedings of Machine Learning Research.
Campos, G. O., Zimek, A., Sander, J. et al. (2015). On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30, 891–927.
Canbek, G., Temizel, T. T., and Sağiroğlu, S. (2022). PToPI: a comprehensive review, analysis, and knowledge representation of binary classification performance measures/metrics. SN Computer Science, 4(1), 13.
Caruana, R., and Niculescu-Mizil, A. (2004). Data mining in metric space: an empirical analysis of supervised learning performance criteria. In Gehrke, J. and DuMouchel, W. (eds.) Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 69–78. Association for Computing Machinery.
Caruana, R., and Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. In Cohen, W. and Moore, A. (eds.) Proceedings of the 23rd International Conference on Machine Learning, 161–168. Association for Computing Machinery.
Celikyilmaz, A., Clark, E., and Gao, J. (2020). Evaluation of text generation: a survey. arXiv:2006.14799.
Cerqueira, V., Torgo, L., and Mozetic, I. (2020). Evaluating time series forecasting models: an empirical study on performance estimation methods. Machine Learning, 109, 1997–2028.
Chalapathy, R., and Chawla, S. (2019). Deep learning for anomaly detection: a survey. arXiv:abs/1901.03407.
Chalapathy, R., Menon, A. K., and Chawla, S. (2018). Anomaly detection using one-class neural networks. arXiv:1802.06360.
Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: a survey. ACM Computing Surveys, 41, 15:1–15:58.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., and Blei, D. (2009). Reading tea leaves: how humans interpret topic models. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 288–296. Curran Associates Inc.
Chawla, N., Bowyer, K., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chen, T. Y., Kuo, F.-C., Liu, H. et al. (2019). Metamorphic testing: a review of challenges and opportunities. ACM Computing Surveys, 51(1), 4.
Chernik, M. R. (2007). Bootstrap methods: a guide for practitioners and researchers. 2nd ed. Wiley.
Chicco, D., and Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 1–13.
Cho, H., Matthews, G. J., and Harel, O. (2019). Confidence intervals for the area under the receiver operating characteristic curve in the presence of ignorable missing data. International Statistical Review, 87, 152–177.
Choudhury, A., and Daly, S. J. (2019). Combining quality metrics using machine learning for improved and robust HDR image quality assessment. Electronic Imaging, 31, 1–7.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155.
Collberg, C., Proebsting, T., and Warren, A. M. (2015). Repeatability and benefaction in computer systems research. Technical Report 14(4). University of Arizona.
Confalonieri, R., Coba, L., Wagner, B., and Besold, T. R. (2021). A historical perspective of explainable artificial intelligence. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(1), e1391.
Corani, G., Benavoli, A., Demšar, J., Mangili, F., and Zaffalon, M. (2017). Statistical comparison of classifiers through Bayesian hierarchical modelling. Machine Learning, 106(11), 1817–1837.
Corbett-Davies, S., and Goel, S. (2018). The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv:1808.00023.
Cortes, C., and Mohri, M. (2004). AUC optimization vs. error rate minimization. In Thrun, S., Saul, L. K., and Schölkopf, B. (eds.) Proceedings of the 16th International Conference on Neural Information Processing Systems, 16, 313–320. MIT Press.
Cortes, C., and Mohri, M. (2005). Confidence intervals for the area under the ROC curve. In Saul, L., Weiss, Y., and Bottou, L. (eds.) Proceedings of the 17th International Conference on Advances in Neural Information Processing Systems, 17, 305–312. MIT Press.
Crothers, E. (2020). Ethical detection of online influence campaigns using transformer language models. Master’s thesis, University of Ottawa, Canada.
Crothers, E., Japkowicz, N., and Viktor, H. L. (2023). Machine-generated text: a comprehensive survey of threat models and detection methods. IEEE Access, 11, 70977–71002.
Cumming, G. (2013). Understanding the new statistics: effect sizes, confidence intervals, and meta-analysis. Routledge.
Damasceno, L. P., Cavalcante, C. C., Adali, T., and Boukouvalas, Z. (2021). Independent vector analysis using semi-parametric density estimation via multivariate entropy maximization. In ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, 3715–3719. IEEE.
Davis, J., and Goadrich, M. H. (2006). The relationship between precision-recall and ROC curves. In Cohen, W. and Moore, A. (eds.) Proceedings of the 23rd International Conference on Machine Learning, 233–240. Association for Computing Machinery.
Deeks, J. J., and Altman, D. G. (2004). Diagnostic tests 4: likelihood ratios. British Medical Journal, 329, 168–169.
Dembla, G. (2020). Intuition behind log-loss score. Medium, Blog. https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Demšar, J. (2008). On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in Conjunction with ICML, 65. Citeseer.
Díaz-Rodríguez, N., Lomonaco, V., Filliat, D., and Maltoni, D. (2018). Don’t forget, there is more than forgetting: new metrics for continual learning. arXiv:1810.13166.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
Domingos, P. (2000). A unified bias–variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning, 231–238. Morgan Kaufmann.
Domingues, R., Filippone, M., Michiardi, P., and Zouaoui, J. (2018). A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognition, 74, 406–421.
Douglas, H. (2009). Science, policy, and the value-free ideal. University of Pittsburgh Press.
Drummond, C. (2006). Machine learning as an experimental science (revisited). In Drummond, C., Elazmeh, W., and Japkowicz, N. (eds.) Proceedings of the AAAI’06 Workshop on Evaluation Methods for Machine Learning. AAAI Press.
Drummond, C. (2009). Replicability is not reproducibility: nor is it good science. In Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML. www.site.uottawa.ca/~cdrummon/pubs/ICMLws09.pdf
Dubber, M., Pasquale, F., and Das, S. (2021). The Oxford handbook of ethics of AI. Oxford University Press.
Dwarakanath, A., Ahuja, M., Sikand, S. et al. (2018). Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In Bodden, E. (ed.) Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 118–128. Association for Computing Machinery.
Efron, B., and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall.
Elazmeh, W., Japkowicz, N., and Matwin, S. (2006). A framework for measuring classification difference with imbalance. In Fürnkranz, J., Scheffer, T., and Spiliopoulou, M. (eds.) Proceedings of the 2006 European Conference on Machine Learning, 126–137. Springer-Verlag.
Elkan, C. P., and Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 213–220. Association for Computing Machinery.
Espadoto, M., Martins, R. M., Kerren, A., Hirata, N. S., and Telea, A. C. (2019). Toward a quantitative survey of dimension reduction techniques. IEEE Transactions on Visualization and Computer Graphics, 27(3), 2153–2173.
Eykholt, K., Evtimov, I., Fernandes, E. et al. (2018). Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1625–1634. IEEE.
Faber, K., Corizzo, R., Sniezynski, B., and Japkowicz, N. (2022a). Active lifelong anomaly detection with experience replay. In IEEE 9th International Conference on Data Science and Advanced Analytics, 1–10. IEEE.
Faber, K., Corizzo, R., Sniezynski, B., and Japkowicz, N. (2023a). Lifelong learning for anomaly detection: new challenges, perspectives, and insights. arXiv:2303.07557.
Faber, K., Corizzo, R., Sniezynski, B., and Japkowicz, N. (2023b). VLAD: task-agnostic VAE-based lifelong anomaly detection. Neural Networks, 165, 248–273.
Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. (1999). AdaCost: misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, 97–105. Morgan Kaufmann.
Farquhar, S., and Gal, Y. (2018). Towards robust evaluations of continual learning. arXiv:abs/1805.09733.
Fawcett, T. (2004). ROC graphs: notes and practical considerations for data mining researchers. Machine Learning, 31(1), 1–38.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.
Fawcett, T., and Niculescu-Mizil, A. (2007a). PAV and the ROC convex hull. Machine Learning, 68(1), 97–106.
Ferri, C., Flach, P. A., and Hernandez-Orallo, J. (2003). Improving the AUC of probabilistic estimation trees. In Proceedings of the 14th European Conference on Machine Learning, 121–132. Springer.
Ferri, C., Hernandez-Orallo, J., and Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30, 27–38.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver and Boyd.
Flach, P. A. (2003). The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In Proceedings of the Twentieth International Conference on Machine Learning, 194–201. AAAI Press.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Fredrikson, M., Jha, S., and Ristenpart, T. (2015). Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1322–1333. Association for Computing Machinery.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Fuernkranz, J., and Flach, P. A. (2005). ROC ‘n’ rule learning – towards a better understanding of covering algorithms. Machine Learning, 58(1), 39–77.
Gaber, M. M. (2012). Advances in data stream mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 79–85.
Gama, J., Sebastião, R., and Rodrigues, P. P. (2013). On evaluating stream learning algorithms. Machine Learning, 90, 317–346.
García, S., and Herrera, F. (2008). An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9, 2677–2694.
Gardner, P., Lord, C., and Barthorpe, R. J. (2018). An evaluation of validation metrics for probabilistic model outputs. In Proceedings of the ASME 2018 Verification and Validation Symposium, VVS2018–9327. IEEE.
Garg, A., Zhang, W., Samaran, J., Savitha, R., and Foo, C.-S. (2021). An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems, 33(6), 2508–2517.
Gaudette, L., and Japkowicz, N. (2009). Evaluation methods for ordinal classification. In Proceedings of the 2009 Canadian Conference on Artificial Intelligence, 207–210. Springer-Verlag.
Gautret, P., Lagier, J.-C., Parola, P. et al. (2020). Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. International Journal of Antimicrobial Agents, 56, 105949.
Gelman, A., Carlin, J. B., Stern, H. S. et al. (2013). Bayesian data analysis. Chapman and Hall.
Ghosh, K., Bellinger, C., Corizzo, R. et al. (2022). The class imbalance problem in deep learning. Machine Learning, 1–57. https://doi.org/10.1007/s10994-022-06268-8
Gibaja, E. L., and Ventura, S. (2014). A tutorial on multi-label learning. ACM Computing Surveys, 47(3), 1–38.
Gill, J., and Meir, K. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52, 647–674.
Goix, N. (2016). How to evaluate the quality of unsupervised anomaly detection algorithms? arXiv:abs/1607.01152.
Goldstein, M., and Uchida, S. (2016). A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4), e0152173.
Golub, T. R., Slonim, D. K., Tamayo, P. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531–537.
Gomes, H. M., Read, J., Bifet, A., Barddal, J. P., and Gama, J. (2019). Machine learning for streaming data: state of the art, challenges, and opportunities. ACM SIGKDD Explorations Newsletter, 21, 6–22.
Goodfellow, I., Pouget-Abadie, J., Mirza, M. et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Gundersen, O. E., Shamsaliei, S., and Isdahl, R. J. (2022). Do machine learning platforms provide out-of-the-box reproducibility? Future Generation Computer Systems, 126, 34–47.
Guo, Q., Xie, X., Ma, L. et al. (2018). An orchestrated empirical study on deep learning frameworks and platforms. arXiv:abs/1811.05187.
Gutiérrez, P. A., Pérez-Ortiz, M., Sánchez-Monedero, J., Fernández-Navarro, F., and Hervás-Martínez, C. (2016). Ordinal regression methods: survey and experimental study. IEEE Transactions on Knowledge and Data Engineering, 28, 127–146.
Hackenberger, B. K. (2019). Bayes or not Bayes, is this the question? Croatian Medical Journal, 60(1), 50–52.
Halligan, S., Altman, D. G., and Mallett, S. (2014). Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: a discussion and proposal for an alternative approach. European Radiology, 25, 932–939.
Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–15.
Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
Hand, D. J., and Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.
Hansen, L. K., and Rieger, L. (2019). Interpretability in intelligent systems – a new concept? In Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., and Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 41–49. Springer.
Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. In Lee, D. D., von Luxburg, U., Garnett, R., Sugiyama, M., and Guyon, I. (eds.) Proceedings of the 30th International Conference on Neural Information Processing Systems, 47. Curran Associates Inc.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning: data mining, inference and prediction. Springer-Verlag.
He, X., and Frey, E. C. (2008). The meaning and use of the volume under a three-class ROC surface (VUS). IEEE Transactions on Medical Imaging, 27(5), 577–588.
Herbrich, R. (2002). Learning kernel classifiers. MIT Press.
Hinton, P. (1995). Statistics explained. Routledge.
Ho, N., and Kim, Y.-C. (2021). Evaluation of transfer learning in deep convolutional neural network models for cardiac short axis slice classification. Scientific Reports, 11, 1839.
Hossin, M., and Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5, 1–11.
Howell, D. C. (2020). Statistical methods for psychology. 5th ed. Wadsworth Press.
Huang, J., and Ling, C. X. (2007). Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 859–864. Morgan Kaufmann.
Huang, J., Ling, C. X., Zhang, H., and Matwin, S. (2008). Proper model selection with significance test. In Proceedings of the European Conference on Machine Learning, 536–547.
Huang, Y., Li, W., Macheret, E., Gabriel, R. A., and Ohno-Machado, L. (2020). A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27, 621–633.
Hulse, J. V., Khoshgoftaar, T. M., and Napolitano, A. (2009). An empirical comparison of repetitive undersampling techniques. In 2009 IEEE International Conference on Information Reuse & Integration, 29–34. IEEE.
Hyndman, R. J., and Athanasopoulos, G. (2013). Forecasting: principles and practice. 3rd ed. OTexts. https://otexts.com/fpp3/
Japkowicz, N., Myers, C. E., and Gluck, M. A. (1995). A novelty detection approach to classification. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 518–523. IJCAI.
Japkowicz, N., Sanghi, P., and Tischer, P. (2008). A projection-based framework for classifier performance evaluation. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases – Part I, 548–563. Springer-Verlag.
Japkowicz, N., and Shah, M. (2011). Evaluating learning algorithms: a classification perspective. Cambridge University Press.
Japkowicz, N., and Stephen, S. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5), 429–450.
Johnson, J. M., and Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6, 1–54.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241–254.
Jordan, S. M., Chandak, Y., Cohen, D., Zhang, M., and Thomas, P. S. (2020). Evaluating the performance of reinforcement learning algorithms. arXiv:abs/2006.16958.
Kamiran, F., and Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1–33.
Kapil, A. R. (2018). Data Vedas: an introduction to data science. Archish Rai Kapil.
Kaur, D., Uslu, S., Rittichier, K. J., and Durresi, A. (2022). Trustworthy artificial intelligence: a review. ACM Computing Surveys, 55(2), 1–38.
Khan, S. S., and Madden, M. G. (2014). One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29, 345–374.
Kim, T., Eltoft, T., and Lee, T.-W. (2006). Independent vector analysis: an extension of ICA to multivariate components. In International Conference on Independent Component Analysis and Signal Separation, 165–172. Springer.
Kingma, D. P., and Welling, M. (2013). Auto-encoding variational Bayes. arXiv:1312.6114.
Kleiman, R., and Page, D. (2019). AUCμ: a performance metric for multi-class machine learning models. In International Conference on Machine Learning, 3439–3447. Proceedings of Machine Learning Research.
Klement, W., Flach, P. A., Japkowicz, N., and Matwin, S. (2011). Smooth receiver operating characteristics (smROC) curves. In Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, Part II, 193–208. Springer.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1137–1143. Morgan Kaufmann.
Koller, D., and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.
Kononenko, I., and Bratko, I. (2004). Information-based evaluation criterion for classifier’s performance. Machine Learning, 6, 67–80.
Korycki, Ł., and Krawczyk, B. (2021). Streaming decision trees for lifelong learning. In Proceedings of Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, Part I, 502–518. Springer.
Kovács, G. (2019). Smote-variants: a Python implementation of 85 minority oversampling techniques. Neurocomputing, 366, 352–354.
Krawczyk, B., Minku, L. L., Gama, J., Stefanowski, J., and Wozniak, M. (2017). Ensemble learning for data stream analysis: a survey. Information Fusion, 37, 132–156.
Krempl, G., Žliobaitė, I., Brzezinski, D. W. et al. (2014). Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter, 16, 1–10.
Kubat, M., Holte, R. C., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30, 195–215.
Kukačka, J., Golkov, V., and Cremers, D. (2017). Regularization for deep learning: a taxonomy. arXiv:1710.10686.
Kukar, M., and Kononenko, I. (2002). Reliable classifications with machine learning. In Proceedings of the 13th European Conference on Machine Learning, 219–231. Springer.
Kukar, M. Z., and Kononenko, I. (1998). Cost-sensitive learning with neural networks. In Proceedings of the 13th European Conference on Artificial Intelligence, 445–449. Wiley.
Kuncheva, L. I. (2014). Combining pattern classifiers: methods and algorithms. 2nd ed. Wiley.
Kuncheva, L. I., Whitaker, C. J., Shipp, C. A., and Duin, R. P. W. (2003). Limits on the majority vote accuracy in classifier fusion. Pattern Analysis and Applications, 6, 22–31.
Lachiche, N., and Flach, P. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In Proceedings of the 20th International Conference on Machine Learning, 416–423. AAAI Press.
Landgrebe, T., Paclík, P., Tax, D. J. M., Verzakov, S., and Duin, R. P. W. (2004). Cost-based classifier evaluation for imbalanced problems. In Proceedings of the 10th International Workshop on Structural and Syntactic Pattern Recognition and 5th International Workshop on Statistical Techniques in Pattern Recognition, 762–770, vol. 3138 of Lecture Notes in Computer Science. Springer Verlag.
Lange, M. D., Aljundi, R., Masana, M. et al. (2022). A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 3366–3385.
Lau, J. H., Newman, D., and Baldwin, T. (2014). Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 530–539. Association for Computational Linguistics.
Lavesson, N., and Davidsson, P. (2008a). Generic methods for multi-criteria evaluation. In Proceedings of the 8th SIAM International Conference on Data Mining, 541–546. SIAM.
Lavesson, N., and Davidsson, P. (2008b). Towards application-specific evaluation metrics. In Proceedings of the 3rd Workshop on Evaluation Methods for Machine Learning. ICML.
Laviolette, F., Marchand, M., Shah, M., and Shanian, S. (2010). Learning the set covering machine by bound minimization and margin-sparsity trade-off. Machine Learning, 78(1–2), 275–301.
Lebanon, G., and Lafferty, J. D. (2002). Cranking: combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, 363–370. Morgan Kaufmann.
Lee, K.-S., Jung, S.-K., Ryu, J.-J., Shin, S., and Choi, J. (2020). Evaluation of transfer learning with deep convolutional neural networks for screening osteoporosis in dental panoramic radiographs. Journal of Clinical Medicine, 9(2), 392.
Lee, N. T., Resnick, P., and Barton, G. (2019). Algorithmic bias detection and mitigation: best practices and policies to reduce consumer harms. Brookings Institute Reports. www.brookings.edu/articles/algorithmic-bias-detection-and-mitigation-best-practices-and-policies-to-reduce-consumer-harms/
Li, B., Qi, P., Liu, B. et al. (2023). Trustworthy AI: from principles to practices. ACM Computing Surveys, 55(9), 1–46.
Li, M., and Vitányi, P. (1997). An introduction to Kolmogorov complexity and its applications. 2nd ed. Springer-Verlag.
Li, N., Li, T., and Venkatasubramanian, S. (2006). t-Closeness: privacy beyond k-anonymity and ℓ-diversity. In 23rd International Conference on Data Engineering, 106–115. IEEE.
Li, Y., Chen, M.-H., Liu, Y., He, D., and Xu, Q. (2022). An empirical study on the efficacy of deep active learning for image classification. arXiv:abs/2212.03088.
Linardatos, P., Papastefanopoulos, V., and Kotsiantis, S. (2020). Explainable AI: a review of machine learning interpretability methods. Entropy, 23(1), 18.
Lindley, D., and Scott, W. (1984). New Cambridge statistical tables. 2nd ed. Cambridge University Press.
Lindquist, E. F. (1940). Statistical analysis in educational research. Houghton Mifflin.
Ling, C. X., Huang, J., and Zhang, H. (2003). AUC: a statistically consistent and more discriminating measure than accuracy. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 519–526. IJCAI.
Liu, W., Li, R., Zheng, M. et al. (2020). Towards visually explaining variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8642–8651. IEEE.
Liu, X. Y., and Zhou, Z. H. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1), 63–77.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23.
Long, Q., Bhinge, S., Levin-Schwartz, Y. et al. (2019). The role of diversity in data-driven analysis of multi-subject fMRI data: comparison of approaches based on independence and sparsity using global performance metrics. Human Brain Mapping, 40(2), 489–504.
Long, Q., Jia, C., Boukouvalas, Z. et al. (2018). Consistent run selection for independent component analysis: application to fMRI analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2581–2585. IEEE.
Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems. Curran Associates. https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
Luque, A., Carrasco, A., Martín, A., and de las Heras, A. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91, 216–231.
Lydersen, S. (2019). Statistical power: before, but not after! Tidsskr Nor Laegeforen, 139(2). doi:10.4045/tidsskr.18.0847.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). ℓ-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 3-es.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297. De Gruyter.
Macskassy, S. A., Provost, F., and Rosset, S. (2005). Pointwise ROC confidence bounds: an empirical evaluation. In Proceedings of the Workshop on ROC Analysis in Machine Learning, 537–544. Association for Computing Machinery.
Makhlouf, K., Zhioua, S., and Palamidessi, C. (2021). Machine learning fairness notions: bridging the gap with real-world applications. Information Processing & Management, 58, 102642.
Marchand, M., and Shawe-Taylor, J. (2002). The set covering machine. Journal of Machine Learning Research, 3, 723–746.
Mehrabi, N., Morstatter, F., Saxena, N. A., Lerman, K., and Galstyan, A. G. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54, 1–35.
Melnik, O., Vardi, Y., and Zhang, C. (2004). Mixed group ranks: preference and confidence in classifier combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 973–981.
Merkle, E. C., and Steyvers, M. (2013). Choosing a strictly proper scoring rule. Decision Analysis, 10, 292–304.
Midway, S. R., Robertson, M., Flinn, S., and Kaller, M. D. (2020). Comparing multiple comparisons: practical guidance for choosing the best multiple comparisons test. PeerJ, 8, e10387.
Minaee, S., Boykov, Y., Porikli, F. M. et al. (2022). Image segmentation using deep learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 3523–3542.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Moroney, C., Crothers, E., Mittal, S. et al. (2021). The case for latent variable vs deep learning methods in misinformation detection: an application to COVID-19. In 24th International Conference on Discovery Science, 422–432. Springer.
Müller, V. C. (2020). Ethics of artificial intelligence and robotics. In Zalta, E. N. and Nodelman, U. (eds.) The Stanford Encyclopedia of Philosophy. Fall 2023 ed. Stanford University.
Munjal, P., Hayat, N., Hayat, M., Sourati, J., and Khan, S. (2020). Towards robust and reproducible active learning using neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 223–232. IEEE.
Murphy, C., Kaiser, G. E., and Arias, M. (2007). An approach to software testing of machine learning applications. In Proceedings of the International Conference on Software Engineering and Knowledge Engineering, 167. Knowledge Systems Institute Graduate School.
Murphy, C., Kaiser, G. E., Hu, L., and Wu, L. L. (2008). Properties of machine learning applications for use in metamorphic testing. In Proceedings of the International Conference on Software Engineering and Knowledge Engineering, 867–872. Knowledge Systems Institute Graduate School.
Murtagh, F., and Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2. https://api.semanticscholar.org/CorpusID:18990050
Murtagh, F., and Contreras, P. (2017). Algorithms for hierarchical clustering: an overview, II. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7. https://api.semanticscholar.org/CorpusID:38660367
Murua, A. (2002). Upper bounds for error rates of linear combinations of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 591–602.
Nadeau, C., and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239–281.
Nakhaeizadeh, G., and Schnabl, A. (1997). Development of multi-criteria metrics for evaluation of data mining algorithms. In Proceedings of KDD97 Conference, 37–42. AAAI Press.
Nakhaeizadeh, G., and Schnabl, A. (1998). Towards the personalization of algorithms evaluation in data mining. In Proceedings of KDD98 Conference, 289–293. AAAI Press.
Narasimhamurthy, A. M. (2005). Theoretical bounds of majority voting performance for a binary classification problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1988–1995.
New, A., Baker, M. M., Nguyen, E. Q., and Vallabha, G. (2022). Lifelong learning metrics. arXiv:abs/2201.08278.
Neyman, J., and Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Part I. Biometrika, 20A(1/2), 175–240.
Niculescu-Mizil, A., and Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 625–632. Association for Computing Machinery.
Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., and Tran, D. (2019). Measuring calibration in deep learning. arXiv:abs/1904.01685.
Ntoutsi, E., Fafalios, P., Gadiraju, U. et al. (2020). Bias in data-driven artificial intelligence systems—an introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3), e1356.
Nuzzo, R. (2014). Statistical errors. Nature, 506(7487), 150.
O’Brien, D. B., Gupta, M. R., and Gray, R. M. (2008). Cost-sensitive multi-class classification from probability estimates. In Proceedings of the 25th International Conference on Machine Learning, 712–719. Association for Computing Machinery.
Oliver, A., Odena, A., Raffel, C., Cubuk, E. D., and Goodfellow, I. J. (2018). Realistic evaluation of deep semi-supervised learning algorithms. In Proceedings of the 32nd Conference on Neural Information Processing Systems. Curran Associates.
Pang, G., Shen, C., Cao, L., and van den Hengel, A. (2021). Deep learning for anomaly detection. ACM Computing Surveys, 54, 1–38.
Pearson, E. S., and Hartley, H. O., eds. (1970). Biometrika tables for statisticians. Vol. 1. 3rd ed. Cambridge University Press.
Perera, P., Oza, P., and Patel, V. M. (2021). One-class classification: a survey. arXiv:abs/2101.03064.
Perezgonzalez, J. D. (2015). Fisher, Neyman–Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6, 233.
Petsche, T., Marcantonio, A., Darken, C. J. et al. (1995). A neural network autoassociator for induction motor failure prediction. In Proceedings of NIPS Advances in Neural Information Processing Systems. MIT Press.
Piano, S. L. (2020). Ethical principles in machine learning and artificial intelligence: cases from the field and possible ways forward. Humanities and Social Sciences Communications, 7, 1–7.
Pimentel, M. A. F., Clifton, D. A., Clifton, L. A., and Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215–249.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61–74.
Poole, D. L., and Mackworth, A. K. (2017). Artificial intelligence – foundations of computational agents. 2nd ed. Cambridge University Press.
Prokopalo, Y., Meignier, S., Galibert, O., Barrault, L., and Larcher, A. (2020). Evaluation of lifelong learning systems. In Proceedings of the Twelfth Language Resources and Evaluation Conference, 1833–1841. European Language Resources Association.
Provost, F., and Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning, 52(3), 199–215.
Puerto, M., Kellett, M., Nikopoulou, R. et al. (2022). Assessing the trade-off between prediction accuracy and interpretability for topic modeling on energetic materials corpora. arXiv:2206.00773.
Raschka, S. (2014). cochransq: Cochran’s Q test for comparing multiple classifiers. https://rasbt.github.io/mlxtend/user_guide/evaluate/cochrans_q/
Reich, Y., and Barai, S. V. (1999). Evaluating machine learning models for engineering problems. Artificial Intelligence in Engineering, 13(3), 257–272.
Ren, P., Xiao, Y., Chang, X. et al. (2020). A survey of deep active learning. ACM Computing Surveys, 54, 1–40.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. Association for Computing Machinery.
Rigaki, M., and Garcia, S. (2020). A survey of privacy attacks in machine learning. arXiv:2007.07646.
Roberts, C. (2017). How to unit test machine learning code. Medium, Blog. https://thenerdstation.medium.com/how-to-unit-test-machine-learning-code-57cf6fd81765
Rodriguez, M. Z., Comin, C. H., Casanova, D. et al. (2019). Clustering algorithms: a comparative approach. PLoS ONE, 14(1), e0210236.
Roh, Y., Heo, G., and Whang, S. E. (2021). A survey on data collection for machine learning: a big data – AI integration perspective. IEEE Transactions on Knowledge and Data Engineering, 33, 1328–1347.
Rosset, S. (2004). Model selection via the AUC. In Proceedings of the Twenty-First International Conference on Machine Learning, 89. Association for Computing Machinery.
Saito, T., and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432.
Salzberg, S. L. (1997). On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1, 317–327.
Santos-Rodríguez, R., Guerrero-Curieses, A., Alaiz-Rodríguez, R., and Cid-Sueiro, J. (2009). Cost-sensitive learning based on Bregman divergences. Machine Learning, 76(2–3), 271–285.
Saunders, J. D., and Freitas, A. A. (2022). Evaluating the predictive performance of positive-unlabelled classifiers: a brief critical review and practical recommendations for improvement. arXiv:abs/2206.02423.
Schölkopf, B., Williamson, R. C., Smola, A., Shawe-Taylor, J., and Platt, J. C. (1999). Support vector method for novelty detection. In Proceedings of NIPS Advances in Neural Information Processing Systems. MIT Press.
Schubert, E., Wojdanowski, R., Zimek, A., and Kriegel, H.-P. (2012). On evaluation of outlier rankings and outlier scores. In Proceedings of the 2012 SIAM International Conference on Data Mining, 1047–1058. SIAM.
Scott, W. A. (1955). Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Scudder, H. J. (1965). Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11, 363–371.
Serdar, C. C., Cihan, M., Yücel, D., and Serdar, M. A. (2021). Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochemia Medica, 31(1), 27–53.
Settles, B. (2009). Active learning literature survey. Technical Report. University of Wisconsin.
Shah, M. (2006). Sample compression, margins and generalization: extensions to the set covering machine. Ph.D. thesis, University of Ottawa, Canada.
Shokri, R., and Shmatikov, V. (2015). Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1310–1321. Association for Computing Machinery.
Shorten, C., and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6, 1–48.
Singh, A. (2022). A hands-on introduction to time series classification (with Python code). Analytics Vidhya. www.analyticsvidhya.com/blog/2019/01/introduction-time-series-classification/
Soares, C., Costa, J., and Brazdil, P. (2000). A simple and intuitive measure for multicriteria evaluation of classification algorithms. In Proceedings of the ECML’2000 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 87–96.
Sokolova, M., Japkowicz, N., and Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In Proceedings of the 2006 Australian Conference on Artificial Intelligence, 1015–1021. Springer.
Sokolova, M., and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437.
Song, L., and Mittal, P. (2021). Systematic evaluation of privacy risks of machine learning models. arXiv:2003.10595.
Sorower, M. S. (2010). A literature survey on algorithms for multi-label learning. Oregon State University, Corvallis, 18(1), 25.
Stapor, K., Ksieniewicz, P., García, S., and Wozniak, M. (2021). How to design the fair experimental classifier evaluation. Applied Soft Computing, 104, 107219.
Sun, X., Zhou, T., Li, G. et al. (2017). An empirical study on real bugs for machine learning programs. In 24th Asia-Pacific Software Engineering Conference, 348–357. IEEE.
Sweeney, L. (2002). k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570.
Sydell, L. (2016). It ain’t me, babe: researchers find flaws in police facial recognition technology. NPR. www.npr.org/sections/alltechconsidered/2016/10/25/499176469/it-aint-me-babe-researchers-find-flaws-in-police-facial-recognition
Szegedy, C., Zaremba, W., Sutskever, I. et al. (2014). Intriguing properties of neural networks. arXiv:1312.6199.
Tax, D. M. J., and Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45–66.
Tevet, G., Habib, G., Shwartz, V., and Berant, J. (2018). Evaluating text GANs as language models. arXiv:1810.12686.
Tipping, M. E., and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.
Tomczak, M., and Tomczak, E. (2014). The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21), 19–25.
Vaicenavicius, J., Widmann, D., Andersson, C. R. et al. (2019). Evaluating model calibration in classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 3459–3467. Proceedings of Machine Learning Research.
van Engelen, J. E., and Hoos, H. H. (2019). A survey on semi-supervised learning. Machine Learning, 109, 373–440.
Vanderlooy, S., and Hüllermeier, E. (2008). A critical analysis of variants of the AUC. Machine Learning, 72(3), 247–262.
Weiss, S. M., and Kulikowski, C. A. (1991). Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning and expert systems. Morgan Kaufmann.
Wood, M. (2005). Bootstrapped confidence intervals as an approach to statistical inference. Organizational Research Methods, 8, 454–470.
Wu, S., Flach, P. A., and Ferri, C. (2007). An improved model selection heuristic for AUC. In 18th European Conference on Machine Learning, vol. 4701, 478–487. Springer.
Xiao, Q., Li, K., Zhang, D., and Xu, W. (2018). Security risks in deep learning implementations. In 2018 IEEE Security and Privacy Workshops, 123–128. IEEE.
Yan, L., Dodier, R., Mozer, M. C., and Wolniewicz, R. (2003). Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistic. In Proceedings of the International Conference on Machine Learning, 848–855. AAAI Press.
Yao, Y. (1995). Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46(2), 133–145.
Yousef, W. A., Wagner, R. F., and Loew, M. H. (2005). Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recognition Letters, 26, 2600–2610.
Yousef, W. A., Wagner, R. F., and Loew, M. H. (2006). Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1809–1817.
Yuheng, S., and Hao, Y. (2017). Image segmentation algorithms overview. arXiv:abs/1707.02051.
Zadrozny, B., and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, 609–616. Morgan Kaufmann.
Zadrozny, B., and Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–699. Association for Computing Machinery.
Zadrozny, B., Langford, J., and Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining, 435–442. IEEE.
Zaiontz, C. (2023). Real statistics using Excel. https://real-statistics.com/
Zeiler, M. D., and Fergus, R. (2014). Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.) Computer Vision. ECCV 2014, vol. 8689 of Lecture Notes in Computer Science, 818–833. Springer.
Zhang, Y.-J. (1996). A survey on evaluation methods for image segmentation. Pattern Recognition, 29(8), 1335–1346.
Zhang, H., Fritts, J. E., and Goldman, S. A. (2008). Image segmentation evaluation: a survey of unsupervised methods. Computer Vision and Image Understanding, 110(2), 260–280.
Zhang, J., Harman, M., Ma, L., and Liu, Y. (2022). Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering, 48, 1–36.
Zhang, M.-L., and Zhou, Z.-H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26, 1819–1837.
Zhang, Y., Kang, B., Hooi, B., Yan, S., and Feng, J. (2023). Deep long-tailed learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10795–10816.
Zhao, S., Song, J., and Ermon, S. (2017). InfoVAE: information maximizing variational autoencoders. arXiv:1706.02262.
Zhou, J., Gandomi, A. H., Chen, F., and Holzinger, A. (2021). Evaluating the quality of machine learning explanations: a survey on methods and metrics. Electronics, 10(5), 1–19.
Zhu, X. (2005). Semi-supervised learning literature survey. Technical Report 1530. University of Wisconsin, Madison.
