Hypatia Digital Library: A novel text classification approach for small text fragments

Ioannis Triantafyllou; Frosso Vorgia; Alexandros Koulouris

Published: Dec 1, 2019

Keywords:

Digital libraries; Statistical natural language processing; Text classification; WEKA; Word stemming

Ioannis Triantafyllou

Department of Archival, Library and Information Studies, University of West Attica, Athens, Greece

Frosso Vorgia

Department of Archival, Library and Information Studies, University of West Attica, Athens, Greece

Alexandros Koulouris

Department of Archival, Library and Information Studies, University of West Attica, Athens, Greece

Abstract

Purpose - This paper extends the authors' prior work on text classification in Hypatia, the digital library of the University of West Attica. The main objective is to provide an accurate automated classification tool as an alternative to manual assignment.

Design/methodology/approach - The crucial step in text classification is selecting the most important term-words for document representation. The document collection consists of 718 abstracts in Medicine, Tourism and Food Technology. Two weighting methods were investigated: classic TF.IDF and DEVMAX.DF. The latter was proposed by the authors as a more accurate term-word selection tool for smaller text fragments. Classification was conducted by applying 14 classifiers available in WEKA.
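
As a rough illustration of term-word selection by weighting, the sketch below computes the textbook TF.IDF weight, tf * log(N / df), and keeps the highest-weighted terms of each document. It is not the authors' implementation: the whitespace tokenizer, the function names `tfidf_weights` and `top_terms`, and the three sample strings are assumptions made for this example. DEVMAX.DF is not sketched because its definition is not given in this abstract.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: lowercase whitespace tokenization, with no stemming or stop-word removal.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def top_terms(weights, k):
    """Return the k highest-weighted terms of one document (ties broken alphabetically)."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


# Hypothetical mini-collection standing in for abstracts from three subject areas.
docs = [
    "patient treatment clinical trial patient",
    "hotel tourism destination travel",
    "food safety fermentation food quality",
]
w = tfidf_weights(docs)
print(top_terms(w[0], 2))
```

A term that occurs in every document receives weight zero under this formula, so it is never selected. The selected terms would then serve as the features passed to a classifier.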

Findings - The classification process yielded a precision of approximately 97%, and DEVMAX.DF performed better than classic TF.IDF.

Article Details

How to Cite
Triantafyllou, I., Vorgia, F., & Koulouris, A. (2019). Hypatia Digital Library: A novel text classification approach for small text fragments. Journal of Integrated Information Management, 4(2), 16–23. Retrieved from https://ejournals.epublishing.ekt.gr/index.php/jiim/article/view/37872

Issue
Vol. 4 No. 2 (2019): Jul-Dec 2019

Section
Research Articles

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

