Using Data Analytics methods before using Machine Learning algorithms: prediction on mixed data


Published: Apr 22, 2024
Keywords:
Multivariate data; Multidimensional data; Mixed-type data; Principal Components Analysis; Multiple Correspondence Analysis; Machine learning; Application of Machine Learning algorithms
Nikolaos Papafilippou
https://orcid.org/0009-0003-3148-7229
Zacharenia Kyrana
Emmanouil Pratsinakis
Angelos Markos
George Menexes
Abstract

In this study, we investigated whether certain Data Analysis methods can serve as a preparatory stage for Machine Learning methods and improve their predictive ability. The Data Analysis methods examined were Principal Component Analysis, Multiple Correspondence Analysis, and Non-Linear (Categorical) Principal Component Analysis with optimal scaling. The Machine Learning methods examined were the Support Vector Machine (SVM), specifically the Support Vector Classifier (SVC), Stochastic Gradient Descent (SGDClassifier), Naïve Bayes (GaussianNB), K-Nearest Neighbors (KNN), Decision Tree Classifier, Random Forest Classifier, and Multinomial Logistic Regression. Tests were performed on data collected in a nationwide survey. The total sample comprised 42,593 teenagers, who were interviewed and answered more than 155 questions about their eating habits. Body Mass Index (BMI) was set as the dependent variable. BMI was measured and used in the analyses both as a quantitative variable and as a qualitative one, with the values of the index divided into classes based on the recommendations of the World Health Organization. According to the test results for this data set, prediction is more reliable when the dependent variable is BMI treated as a qualitative ordinal variable with four classes. Designing a data analysis strategy saves time and helps in selecting the best prediction model, while dimensionality reduction, even if it does not improve the predictive ability of the models, at least contributes to the “interpretability” of the results.
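The step of turning BMI from a quantitative variable into an ordinal one with four classes can be sketched as follows. This is only an illustration: the abstract does not state the cut points, so the widely used WHO adult thresholds (18.5, 25, 30 kg/m²) are assumed here, whereas a study of teenagers may well have used age- and sex-specific WHO references instead. The names `CUT_POINTS`, `LABELS`, `bmi`, and `bmi_class` are hypothetical.

```python
from bisect import bisect_right

# ASSUMPTION: WHO adult cut points in kg/m^2. The study's actual thresholds
# are not given in the abstract and may be age-specific for teenagers.
CUT_POINTS = [18.5, 25.0, 30.0]
LABELS = ["underweight", "normal", "overweight", "obese"]


def bmi(weight_kg: float, height_m: float) -> float:
    """Return Body Mass Index: weight divided by height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_class(value: float) -> str:
    """Map a BMI value to one of four ordered classes.

    bisect_right places a value equal to a cut point in the higher class,
    so 18.5 counts as "normal" and 25.0 as "overweight".
    """
    return LABELS[bisect_right(CUT_POINTS, value)]


if __name__ == "__main__":
    for weight, height in [(48.0, 1.70), (70.0, 1.75), (85.0, 1.75), (100.0, 1.75)]:
        value = bmi(weight, height)
        print(f"{value:.1f} -> {bmi_class(value)}")
```

Keeping the labels in an ordered list preserves the ordinal nature of the variable, so the list index can serve as an integer-coded target for a classifier.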

Article Details
  • Section: Empirical studies