Comparison of machine learning algorithms for predictive classification and the selection of important features


Published: Jan 2, 2026
Nikolaos Papafilippou
https://orcid.org/0009-0003-3148-7229
Zacharenia Kyrana
https://orcid.org/0000-0001-9269-0675
Emmanouil Pratsinakis
https://orcid.org/0000-0002-3725-3525
Christos Dordas
https://orcid.org/0000-0002-7027-474X
Angelos Markos
George Menexes
https://orcid.org/0000-0002-1034-7345
Abstract

In the present study, Machine Learning algorithms were compared in terms of their predictive ability in classification and the identification of the features that contribute most to it. The algorithms, evaluated mainly in terms of classification accuracy, were Support Vector Classification (SVC), multinomial Logistic Regression, Stochastic Gradient Descent (SGD), Decision Trees, K-Nearest Neighbors (K-NN), Gaussian Naive Bayes, Neural Networks, and the ensemble methods Random Forest and Extra Trees. Optimal parameters for the algorithms were sought with the GridSearch method, while AdaBoost and cross-validation were applied to strengthen the results. The dataset used was the ‘Forest Covertype’ dataset (n = 581,012) from the UCI Machine Learning Repository, which contains information on various forest areas with the aim of predicting the type of forest cover. The algorithms were evaluated on both the original and the standardized data. The results showed that the K-NN algorithm had the highest accuracy on the original data, while the Random Forest and Extra Trees algorithms exhibited the highest accuracy in both cases. Standardization of the data had no effect on the accuracy of the Decision Trees, Random Forest, Extra Trees, and multinomial Logistic Regression algorithms; it improved the accuracy of the SVC, Neural Networks, and SGD algorithms, and reduced the accuracy of the K-NN and Gaussian Naive Bayes algorithms. Additionally, the feature importance analysis showed that elevation, soil type, and wilderness area contributed the most to the classification. Furthermore, the prediction for a random data vector was the same across all algorithms applied to the standardized data, whereas on the original data it differed for the K-NN and Extra Trees algorithms.
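The evaluation protocol described above (cross-validated accuracy on raw versus standardized data, GridSearch hyperparameter tuning, and tree-ensemble feature importances) can be sketched with scikit-learn. This is a minimal illustration, not the study's actual code: it uses the small built-in wine dataset as a stand-in (the paper's data would be obtained via `sklearn.datasets.fetch_covtype`), and the parameter grids and model settings shown here are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset as a stand-in for fetch_covtype() (n = 581,012)
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Compare the same learner on original vs. standardized features,
# plus a tree ensemble (unaffected by standardization).
models = {
    "knn_raw": KNeighborsClassifier(),
    "knn_std": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5)  # 5-fold CV accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

# Hyperparameter search with GridSearch, as in the study (grid is illustrative)
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)
print("best k:", grid.best_params_["n_neighbors"])

# Feature importances from the tree ensemble; in the paper, elevation,
# soil type, and wilderness area ranked highest on Covertype.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
top3 = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:3]
print("top 3 feature indices:", [i for i, _ in top3])
```

Swapping in the remaining classifiers (SVC, SGD, Gaussian Naive Bayes, etc.) only changes the entries of `models`; the surrounding cross-validation loop stays the same.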

Article Details
Section: Empirical studies