Comparison of dimensionality reduction methods and strategies on multivariate categorical data


Published: Jan 2, 2026
Keywords:
Multivariate Analysis; Multiple Correspondence Analysis (MCA); Nonlinear Categorical Principal Component Analysis (CATPCA); Factor Analysis for Mixed Data (FAMD); Nonlinear Canonical Correlation Analysis (NLCCA); Multiple Factor Analysis (MFA); Principal Component Analysis (PCA)
Nikolaos Papafilippou
https://orcid.org/0009-0003-3148-7229
Zacharenia Kyrana
https://orcid.org/0000-0001-9269-0675
Emmanouil Pratsinakis
https://orcid.org/0000-0002-3725-3525
Christos Dordas
https://orcid.org/0000-0002-7027-474X
Angelos Markos
George Menexes
https://orcid.org/0000-0002-1034-7345
Abstract




The analysis of high-dimensional categorical data poses significant challenges in Data Science, Machine Learning, and Statistics, particularly regarding the study of the variability (inertia) of the measured characteristics, its structure and components, and the interpretation of the results. This paper addresses these issues by investigating and comparing various methods and strategies for the dimensionality reduction of categorical data. The strategies were applied to the Forest Cover Type dataset (n = 581,012) from the UCI Machine Learning Repository. The proposed strategies, which provided additional and sometimes different information about the structure of variability, were evaluated by applying and comparing several methods: Multiple Correspondence Analysis (MCA), Nonlinear Categorical Principal Components Analysis with Optimal Scaling (CATPCA), Principal Components Analysis (PCA), Factor Analysis for Mixed Data (FAMD), Nonlinear Canonical Correlation Analysis (NLCCA), and Multiple Factor Analysis (MFA). The results showed that different strategies are probably required depending on the nature of the data and the research objectives. They also demonstrated the applicability of each method in different contexts and revealed that, while no single approach is universally superior, strategies tailored to the nature of the data, such as Singular Value Decomposition (SVD) of several correlation matrices followed by PCA, a combination of MCA and CATPCA, or advanced methods such as FAMD, NLCCA, or MFA, offer alternative solutions. In general, it is wiser to apply different analysis strategies depending on the objectives of the study and on the researcher's decision about how the variables should be treated (nominal, ordinal, or scale) within a specific scientific framework.
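To make the mechanics of one of these strategies concrete, the minimal sketch below implements MCA as an SVD of the standardized indicator (complete disjunctive) matrix, using only numpy and pandas. It is an illustration under stated assumptions, not the analysis pipeline of the paper; the file and column names in the usage comment are hypothetical and do not reflect the repository's exact encoding of the Forest Cover Type data.

import numpy as np
import pandas as pd

def mca(df, n_components=2):
    # MCA as correspondence analysis of the 0/1 indicator matrix built from the categorical columns.
    Z = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)   # indicator (disjunctive) matrix
    P = Z / Z.sum()                                             # correspondence matrix
    r = P.sum(axis=1)                                           # row masses
    c = P.sum(axis=0)                                           # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))          # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)            # SVD yields the principal axes
    row_coords = (U[:, :n_components] * sv[:n_components]) / np.sqrt(r)[:, None]  # row principal coordinates
    inertias = sv[:n_components] ** 2                           # principal inertias (eigenvalues)
    return row_coords, inertias

# Hypothetical usage on a few categorical columns (names are illustrative):
# X = pd.read_csv("covertype.csv", usecols=["wilderness_area", "soil_type", "cover_type"])
# coords, inertias = mca(X, n_components=2)

The sketch only illustrates how the inertia of categorical data is decomposed via SVD; full-featured implementations of MCA, FAMD, and MFA are available in established software such as the R package FactoMineR or the Python package prince, and CATPCA is available in IBM SPSS Categories.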





Article Details
  • Section: Empirical studies