Hypatia Digital Library: A novel text classification approach for small text fragments


Published: Dec 1, 2019
Keywords:
Digital libraries, Statistical natural language processing, Text classification, WEKA, Word stemming
Ioannis Triantafyllou
Frosso Vorgia
Alexandros Koulouris
Abstract

Purpose - This paper extends the authors' prior work on text classification in Hypatia, the digital library of the University of Western Attica. The main objective is to provide an accurate automated classification tool as an alternative to manual assignment.


Design/methodology/approach - The crucial step in text classification is selecting the most important term-words for document representation. The document collection consists of 718 abstracts in Medicine, Tourism and Food Technology. Two weighting methods were investigated: classic TF.IDF and DEVMAX.DF. The latter was proposed by the authors as a more accurate term-word selection tool for smaller text fragments. Classification was conducted by applying 14 classifiers available in WEKA.
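
As a rough illustration of the classic TF.IDF weighting mentioned above, the following minimal sketch scores each term by its relative frequency in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function names, the whitespace tokenizer, and the toy documents are assumptions made for illustration; this is not the authors' implementation, and DEVMAX.DF is not sketched because the abstract does not define it.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def top_terms(docs, k):
    """Rank terms by their highest weight in any document and keep the top k."""
    best = {}
    for doc_weights in tfidf_weights(docs):
        for term, score in doc_weights.items():
            best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Toy stand-ins for abstracts from three subject areas (invented examples).
docs = [
    "patient therapy clinical study",
    "hotel tourism study",
    "food safety study",
]
weights = tfidf_weights(docs)
selected = top_terms(docs, 3)
```

In this toy collection the word "study" occurs in every document, so its weight is zero and it is never selected as a representative term, which is the behaviour a term-word selection step relies on.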


Findings - The classification process yielded an excellent precision score of about 97%, and DEVMAX.DF outperformed classic TF.IDF.

Article Details
  • Section: Research Articles