Comprehensive evaluation of data preprocessing and visualization techniques for enhanced classification and sampling

Date

2025

Publisher

Springer

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Effective representation and visualization of data are critical components of data analysis, particularly in classification tasks. This paper presents a comprehensive evaluation of several functions employed in data preprocessing and visualization, emphasizing their roles in enhancing data representation, facilitating classification, and optimizing sampling techniques. We explore the jitter function, which mitigates overplotting in visualizations by introducing small random variations to data points, thereby improving clarity in the depiction of class distributions. The hexagonal binning function aggregates data into hexagonal grids, enabling the identification of density patterns and enhancing the understanding of class separability in two-dimensional space. The center function is examined for its utility in computing centroids of data clusters, aiding in visualizing class distributions and enhancing clustering algorithms. Additionally, we investigate the swarm function, which serves dual purposes: as an optimization technique in particle swarm optimization for feature selection, and as a visualization tool that illustrates data point distributions without overlap. The random function is discussed for its role in generating synthetic datasets and initializing parameters, which is crucial for achieving balanced and representative training samples. Lastly, the square function is evaluated for its application in distance calculations and error metrics, which are essential for assessing model performance in classification tasks. The experimental results reveal that the random function consistently shows the highest means and variability across most distributions, while the center function, despite exhibiting lower means, demonstrates a higher coefficient of variation (CV) and entropy, indicating greater uncertainty. Conversely, the jitter function displays lower means and variances, typically exhibiting more predictability and less uncertainty. This comprehensive evaluation highlights the importance of these functions in preprocessing and visualizing data, ultimately contributing to improved classification outcomes and enhanced interpretability of data-driven insights.
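
To make some of the quantities named in the abstract concrete, the following is a minimal sketch in standard-library Python of jittering, centroid computation, squared Euclidean distance, the coefficient of variation, and a histogram-based Shannon entropy. The abstract does not give the paper's implementations, so every function name, parameter, and default here (for example `scale`, `seed`, and `bins`) is a hypothetical choice made for illustration, and the hexagonal binning and swarm functions are not covered.

```python
import math
import random
import statistics
from collections import Counter


def jitter(values, scale=0.1, seed=0):
    """Add uniform noise in [-scale, scale] to each value to reduce overplotting.

    Assumption: uniform noise; the paper may use a different distribution.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (undefined when the mean is 0)."""
    return statistics.stdev(values) / statistics.mean(values)


def shannon_entropy(values, bins=10):
    """Shannon entropy in bits of an equal-width histogram of the values.

    Assumption: entropy is estimated from binned counts; other estimators exist.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical, so there is no uncertainty
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Small usage example on made-up data (not data from the paper).
cluster = [(0, 0), (2, 0), (2, 2), (0, 2)]
center = centroid(cluster)
spread = [squared_distance(p, center) for p in cluster]
noisy_x = jitter([p[0] for p in cluster], scale=0.05)
```

A higher coefficient of variation alongside a lower mean, as the abstract reports for the center function, is possible because the CV normalizes the standard deviation by the mean.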

Keywords

Data representation, Data visualization, Classification, Error metrics, Overplotting

Source

Cluster Computing-The Journal of Networks Software Tools And Applications

WoS Quartile

Q1

Scopus Quartile

Q1

Volume

28

Issue

7

Citation