Comparison of methods and analysis strategies for clustering the objects of the multidimensional dataset “Forest Cover Type”


Published: Jan 2, 2026
Keywords:
Multivariate data, Big data, Hierarchical clustering, Partitional clustering, Mixed-type data
Emmanouil Pratsinakis
https://orcid.org/0000-0002-3725-3525
Zacharenia Kyrana
https://orcid.org/0000-0001-9269-0675
Nikolaos Papafilippoy
https://orcid.org/0009-0003-3148-7229
Angelos Markos
George Menexes
https://orcid.org/0000-0002-1034-7345
Abstract

Multidimensional, multivariate datasets with mixed-type data give researchers the opportunity to apply a variety of statistical approaches and clustering (classification) methods through statistical packages and programming languages. The choice of clustering method can affect the results obtained, with the distance measure and the linkage (joining) method playing a key role. In this study, partitional clustering (k-means) and hierarchical cluster analysis were compared using the "Forest Cover Type" dataset. The main objective was to apply and compare these methods in dividing the dataset into groups (clusters). The results showed that there are several analysis strategies for clustering mixed-type data, depending on how the data are coded and which measurement scale is chosen for the input variables. Python produced the results faster than SPSS, by up to more than 100%, and the results of the two were similar apart from minor differences due to numerical rounding. Hierarchical cluster analysis could not be performed on this dataset, or on other datasets of similar size, with the specific PC configuration used for the analyses, since both software packages crashed. This is probably a disadvantage, as hierarchical cluster analysis allows the number of clusters to be determined from dendrograms by combining various distances and linkage methods, which cannot be achieved with the k-means method. Finally, the classification (clustering) results were found to depend on the coding strategy and on the measurement scale selected for the variables used in the analysis.
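One plausible reason for the crashes reported above is memory. Agglomerative hierarchical clustering is commonly implemented on top of a matrix holding one distance per pair of objects, and that matrix grows quadratically with the number of objects. The abstract does not state the cause of the crashes, so this is an assumption. The sketch below estimates the size of such a matrix. The row count of 581,012 is taken from the public UCI description of the Covertype dataset and is assumed to match the data analysed here. The function name `condensed_distance_bytes` and the 8-byte (double precision) storage are illustrative choices, not details from the study.

```python
def condensed_distance_bytes(n_objects: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store one distance per unordered pair of objects.

    This is the size of a "condensed" pairwise distance matrix,
    i.e. n * (n - 1) / 2 values, assuming 8-byte floats by default.
    """
    n_pairs = n_objects * (n_objects - 1) // 2
    return n_pairs * bytes_per_value


# Assumption: row count from the UCI Covertype description,
# not a figure given in the abstract.
N_ROWS_ASSUMED = 581_012

for n in (1_000, 10_000, 100_000, N_ROWS_ASSUMED):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} objects -> {gib:,.2f} GiB of pairwise distances")
```

Under these assumptions the full dataset would need more than a terabyte for the distances alone, far beyond the memory of a typical desktop PC. By contrast, k-means only keeps the data and the k cluster centres in memory, which is consistent with it running on the same machine.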

Article Details
Section: Empirical studies