Comparative Evaluation of Data Cleaning Techniques Using Cricket Data: A Statistical and Machine Learning Perspective
Keywords:
Data cleaning, Imputation, Cricket analytics, Outlier handling, MICE, Winsorization, Random Forest
Abstract
Data cleaning is essential to the quality and reliability of data used in statistical and machine learning analyses. This study compares data cleaning techniques using real cricket data on the International Cricket Council (ICC) Top-30 T20 bowlers (2023). The dataset contains the variables matches, innings, balls bowled, runs conceded, wickets taken, and performance points. The data were artificially contaminated with missing values and outliers to simulate realistic data quality problems. Five data cleaning methods were evaluated: mean imputation, median imputation, k-nearest neighbor (KNN) imputation, multiple imputation by chained equations (MICE), and a Winsorized-Mean approach. A Random Forest model was fitted to the data produced by each method, and its predictive performance was assessed with 5-fold cross-validation using the root mean square error (RMSE) and the coefficient of determination (R²). The Winsorized-Mean method achieved the lowest RMSE and the highest R², although the differences among methods were not statistically significant at α = 0.05 according to the Friedman test. The findings suggest that Winsorization combined with mean imputation handles outliers and missingness effectively in moderately sized cricket datasets. The study underscores the importance of rigorous data preprocessing for robust statistical inference and predictive modeling.
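To make the Winsorized-Mean idea concrete, the following is a minimal sketch of one plausible reading of it: observed values are clipped to a lower and an upper percentile, and missing entries are then filled with the mean of the clipped values. The abstract does not state the cut-offs or the order of operations, so the 5th and 95th percentile defaults, the function names `percentile` and `winsorized_mean_impute`, and the sample values are illustrative assumptions, not the study's implementation.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("no observed values to compute a percentile from")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def winsorized_mean_impute(values, lower_q=5.0, upper_q=95.0):
    """Clip observed values to [lower_q, upper_q] percentiles, then fill
    missing entries (None) with the mean of the clipped values.

    The percentile cut-offs are assumed defaults, not taken from the study.
    """
    observed = sorted(v for v in values if v is not None)
    lo = percentile(observed, lower_q)
    hi = percentile(observed, upper_q)
    # Winsorize: pull extreme observations back to the percentile bounds.
    clipped = [None if v is None else min(max(v, lo), hi) for v in values]
    # Impute: replace missing entries with the Winsorized mean.
    fill = statistics.fmean([c for c in clipped if c is not None])
    return [fill if c is None else c for c in clipped]


# Hypothetical "runs conceded" column with one missing value and one outlier.
runs_conceded = [1, 2, 3, 4, 100, None]
cleaned = winsorized_mean_impute(runs_conceded, lower_q=25, upper_q=75)
print(cleaned)
```

Because the outlier is clipped before the mean is computed, the imputed value is not dragged toward the contaminated observation, which is the motivation for combining the two steps in this order.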