Using Data Analytics methods before using Machine Learning algorithms: prediction on mixed data

Nikolaos Papafilippou; Zacharenia Kyrana; Emmanouil Pratsinakis; Angelos Markos; George Menexes

PDF file (in Greek)

Published: Apr 22, 2024

Keywords:

Multivariate data; Multidimensional data; Mixed-type data; Principal Components Analysis; Multiple Correspondence Analysis; Machine learning; Application of Machine Learning algorithms

Nikolaos Papafilippou

ARISTOTLE UNIVERSITY OF THESSALONIKI

https://orcid.org/0009-0003-3148-7229

Zacharenia Kyrana

ARISTOTLE UNIVERSITY OF THESSALONIKI

Emmanouil Pratsinakis

ARISTOTLE UNIVERSITY OF THESSALONIKI

Angelos Markos

DEMOCRITUS UNIVERSITY OF THRACE

George Menexes

ARISTOTLE UNIVERSITY OF THESSALONIKI

Abstract

This study investigated whether certain Data Analysis methods, used as a preparatory stage for Machine Learning methods, can improve their predictive ability. The Data Analysis methods examined were Principal Component Analysis, Multiple Correspondence Analysis, and Non-Linear (Categorical) Principal Component Analysis with optimal scaling. The Machine Learning methods examined were the Support Vector Machine (SVM), specifically the Support Vector Classifier (SVC), Stochastic Gradient Descent (SGDClassifier), Naïve Bayes (GaussianNB), K-Nearest Neighbors (KNN), Decision Tree Classifier, Random Forest Classifier, and Multinomial Logistic Regression. Tests were performed on data collected in a nationwide survey. The total sample comprised 42,593 teenagers, who were interviewed and answered more than 155 questions about their eating habits. Body Mass Index (BMI) was the dependent variable. BMI was measured and used in the analyses both as a quantitative variable and as a qualitative one, with the index values divided into classes based on the recommendations of the World Health Organization. According to the test results for this data set, prediction is more reliable when BMI is used as a qualitative ordinal dependent variable with four classes. Designing a data analysis strategy saves time and helps in selecting the best prediction model, while dimensionality reduction, even if it does not improve the predictive ability of the models, at least contributes to the “interpretability” of the results.
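The general idea of placing a dimensionality reduction step in front of a classifier can be illustrated with a minimal sketch. It is not the authors' code or data: the data set is synthetic, the quartile cut points stand in for the WHO-based BMI classes, the choice of scikit-learn, KNN with 15 neighbors, three components, and 5-fold cross-validation are all assumptions made for illustration. Only PCA on numeric variables is shown; Multiple Correspondence Analysis and Categorical PCA for the qualitative variables are not covered here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for survey data (hypothetical, not the study's data):
# 600 respondents, 20 numeric items driven by 3 latent factors plus noise.
n_samples, n_features, n_latent = 600, 20, 3
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.5 * rng.normal(size=(n_samples, n_features))

# Continuous target binned into four ordinal classes (0..3) at its quartiles.
# The quartiles are placeholders, not the WHO BMI thresholds.
y_cont = latent[:, 0] + 0.3 * rng.normal(size=n_samples)
y = np.digitize(y_cont, np.quantile(y_cont, [0.25, 0.5, 0.75]))

# Same classifier with and without a PCA step in front of it.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KNeighborsClassifier(n_neighbors=15),
)

# 5-fold cross-validated accuracy for each pipeline.
models = {"KNN only": baseline, "PCA + KNN": reduced}
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f}")
```

Wrapping the reduction step and the classifier in one pipeline means the components are re-estimated on each training fold, so the held-out fold does not leak into the reduction. Whether the reduced pipeline scores higher depends on the data; as the abstract notes, the reduction may help interpretability even when it does not help accuracy.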

Article Details

How to Cite
Papafilippou, N., Kyrana, Z., Pratsinakis, E., Markos, A., & Menexes, G. (2024). Using Data Analytics methods before using Machine Learning algorithms: prediction on mixed data. Data Analysis Bulletin, 20(1), 32–44. Retrieved from https://ejournals.epublishing.ekt.gr/index.php/dab/article/view/33723

Section
Empirical studies

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Authors who publish their work in the journal DATA ANALYSIS BULLETIN agree to the following terms:

1. Authors will not be charged any submission, processing or publication fees for their work. These costs are covered by the Greek Society of Data Analysis.

2. The copyright of papers published in the journal DATA ANALYSIS BULLETIN is protected by the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. The Authors retain the copyright and grant the journal the right of first publication. This license allows third parties to use the work in any form for non-commercial purposes only. If third parties modify or adapt the material, they must license the modified material for non-commercial purposes only and under identical terms.

3. Such use is permitted provided that the terms of the licence concerning the reference to the original author and the original publication in the journal DATA ANALYSIS BULLETIN are maintained.

4. Authors may enter into separate, additional contracts and agreements for the non-exclusive distribution of the work as published in the DATA ANALYSIS BULLETIN journal (e.g., deposit in academic repositories), provided that the first publication in the DATA ANALYSIS BULLETIN journal is acknowledged and cited.

5. The DATA ANALYSIS BULLETIN journal allows and encourages authors to deposit their work in institutional (e.g. the repository of the National Documentation Centre) or thematic repositories, after publication in DATA ANALYSIS BULLETIN and under Open Access conditions, as determined by their research funders and/or the institutions with which they collaborate, as appropriate. When submitting their work, authors should provide information on the publication of the work in the journal and the sources of funding for their research. Lists of institutional and thematic repositories by country are available at http://opendoar.org/countrylist.php. Authors can deposit their work free of charge in the repository www.zenodo.org, which is supported by OpenAIRE (www.openaire.eu), as part of the European Commission's policies to support Open Academic Research.


