Comparative Evaluation of Data Cleaning Techniques Using Cricket Data: A Statistical and Machine Learning Perspective

Authors

  • Muhammad Fahim Khan, Department of Statistics, University of Peshawar
  • Aizaz Shah, Department of Statistics, University of Peshawar
  • Muhammad Asif, Department of Statistics, University of Peshawar
  • Kainat Sabir, Department of Statistics, University of Peshawar
  • Sana Ullah, Department of Statistics, University of Peshawar
  • Mansoor Ahmad, Department of Statistics, University of Peshawar
  • Qamruz Zaman, Department of Statistics, University of Peshawar

Keywords:

Data cleaning, Imputation, Cricket analytics, Outlier handling, MICE, Winsorization, Random Forest

Abstract

Data cleaning is essential to the quality and reliability of data used in statistical and machine learning analyses. This study presents a comparative evaluation of data cleaning techniques using real cricket data on the International Cricket Council (ICC) Top-30 T20 bowlers (2023). The dataset comprised matches, innings, balls bowled, runs conceded, wickets taken, and performance points. Missing values and outliers were introduced artificially to simulate realistic data quality problems. Five data cleaning methods were evaluated: mean imputation, median imputation, k-nearest neighbors (KNN) imputation, multiple imputation by chained equations (MICE), and a Winsorized-Mean approach. A Random Forest model was fitted to the data cleaned by each method, and its predictive performance was assessed using 5-fold cross-validation. The evaluation metrics were the root mean square error (RMSE) and the coefficient of determination (R²). The Winsorized-Mean method achieved the lowest RMSE and the highest R², although the differences among methods were not statistically significant at α = 0.05 according to the Friedman test. The findings suggest that Winsorization combined with mean imputation can handle outliers and missingness effectively in moderately sized cricket datasets. The study underscores the importance of rigorous data preprocessing for robust statistical inference and predictive modeling.
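
To make the Winsorized-Mean idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the observed values are Winsorized first and that the missing entries are then filled with the mean of the Winsorized values; the abstract does not state the order of the two steps or the Winsorization limits. The function name winsorized_mean_impute, the 10% trim fraction, and the runs-conceded numbers are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def winsorized_mean_impute(values, trim_fraction=0.1):
    """Winsorize observed values, then fill missing entries (None)
    with the mean of the Winsorized values.

    Assumption: the k smallest and k largest observed values are
    replaced by their nearest retained neighbours, where
    k = int(trim_fraction * number_of_observed_values).
    """
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    k = int(trim_fraction * n)
    if n == 0 or 2 * k >= n:
        raise ValueError("not enough observed values for this trim fraction")

    # Caps are the (k+1)-th smallest and (k+1)-th largest observed values.
    low_cap = observed[k]
    high_cap = observed[n - 1 - k]

    # Clip observed values to the caps; keep missing entries as None for now.
    clipped = [None if v is None else min(max(v, low_cap), high_cap)
               for v in values]

    # Mean imputation using the Winsorized (clipped) values only.
    fill = mean(v for v in clipped if v is not None)
    return [fill if v is None else v for v in clipped]


# Made-up "runs conceded" column: one missing value and one injected outlier.
runs_conceded = [410, 395, None, 430, 402, 388, 2500, 415, 420, 399, 407]
cleaned = winsorized_mean_impute(runs_conceded)
print(cleaned)
```

In this toy column the injected value 2500 is pulled back to 430, the smallest value 388 is raised to 395, and the missing entry is filled with the mean of the Winsorized values. Had the plain mean been used instead, the single outlier would have inflated the imputed value considerably, which is the motivation for combining the two steps.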

Published

2025-10-31

How to Cite

Muhammad Fahim Khan, Aizaz Shah, Muhammad Asif, Kainat Sabir, Sana Ullah, Mansoor Ahmad, & Qamruz Zaman. (2025). Comparative Evaluation of Data Cleaning Techniques Using Cricket Data: A Statistical and Machine Learning Perspective. Journal for Current Sign, 3(4), 539–545. Retrieved from http://currentsignreview.com/index.php/JCS/article/view/402