Comparison of machine learning algorithms for predictive classification and the selection of important features


Published: Jan 2, 2026
Nikolaos Papafilippou
https://orcid.org/0009-0003-3148-7229
Zacharenia Kyrana
https://orcid.org/0000-0001-9269-0675
Emmanouil Pratsinakis
https://orcid.org/0000-0002-3725-3525
Christos Dordas
https://orcid.org/0000-0002-7027-474X
Angelos Markos
George Menexes
https://orcid.org/0000-0002-1034-7345
Abstract

In the present study, Machine Learning algorithms were compared in terms of their predictive ability in classification and the identification of the features that contribute most to it. The algorithms, evaluated mainly in terms of classification accuracy, were Support Vector Classification (SVC), multinomial Logistic Regression, Stochastic Gradient Descent (SGD), Decision Trees, K-Nearest Neighbors (K-NN), Gaussian Naive Bayes, Neural Networks, and the ensemble methods Random Forest and Extra Trees. Optimal parameters for the algorithms were sought with the GridSearch method, while AdaBoost and cross-validation were applied to strengthen the results. The dataset used was the ‘Forest Covertype’ dataset (n = 581,012) from the UCI Machine Learning Repository, which contains information on various forest areas with the aim of predicting the type of forest cover. The algorithms were evaluated on both the original and the standardized data. The results showed that the K-NN algorithm had the highest accuracy on the original data, while the Random Forest and Extra Trees algorithms exhibited the highest accuracy in both cases. Standardization of the data had no effect on the accuracy of the Decision Trees, Random Forest, Extra Trees, and multinomial Logistic Regression algorithms; it improved the accuracy of the SVC, Neural Networks, and SGD algorithms, and reduced the accuracy of the K-NN and Gaussian Naive Bayes algorithms. Additionally, the feature importance analysis showed that elevation, soil type, and wilderness area contributed the most to the classification. Furthermore, the prediction for a random data vector was the same across all algorithms applied to the standardized data, whereas on the original data it differed for the K-NN and Extra Trees algorithms.
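The evaluation protocol described above (cross-validated accuracy on raw versus standardized data, GridSearch hyperparameter tuning, and tree-ensemble feature importances) can be sketched with scikit-learn. This is a minimal illustration, not the study's actual code: it uses the small built-in wine dataset as a stand-in (the paper's data would be obtained via `sklearn.datasets.fetch_covtype`), and the parameter grids and model settings shown here are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset as a stand-in for fetch_covtype() (n = 581,012)
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Compare the same learner on original vs. standardized features,
# plus a tree ensemble (unaffected by standardization).
models = {
    "knn_raw": KNeighborsClassifier(),
    "knn_std": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5)  # 5-fold CV accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

# Hyperparameter search with GridSearch, as in the study (grid is illustrative)
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)
print("best k:", grid.best_params_["n_neighbors"])

# Feature importances from the tree ensemble; in the paper, elevation,
# soil type, and wilderness area ranked highest on Covertype.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
top3 = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:3]
print("top 3 feature indices:", [i for i, _ in top3])
```

Swapping in the remaining classifiers (SVC, SGD, Gaussian Naive Bayes, etc.) only changes the entries of `models`; the surrounding cross-validation loop stays the same.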

Article Details
Section: Empirical studies