ECTI Transactions on Computer and Information Technology, Volume 15, Issue 2, Pages 166-176, 21/04/2021
Hierarchical text classification using relative inverse document frequency
Abstract
Automated text classification over a hierarchical taxonomy has become a challenge as knowledge organization is increasingly used to express relations among classes in a tree structure. Categories on the same branch share overlapping generalized concepts inherited from their super-category. This overlap makes classification more difficult as the complexity of the hierarchy grows. This paper uses the frequency of terms occurring in related categories of the hierarchical tree to aid document classification. Relative Inverse Document Frequency (IDFr) extends term weighting with four components computed over the located category, its parent category, its sibling categories, and its child categories. These components are used to build a classifier model with a centroid-based technique. In an experiment on hierarchical text classification of Thai documents, IDFr achieved its best accuracy and F-measure of 53.65% and 50.80%, respectively, when applied to the Top-n feature set, which is 2.35% and 1.15% higher than traditional term frequency-inverse document frequency (TF-IDF).
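The abstract names the four category groups and the centroid-based classifier but does not give the IDFr formula. The sketch below is therefore only an illustration under stated assumptions: the toy hierarchy, the English sample documents, the helper names (`related`, `idf`, `weight`, `centroid`, `classify`), and in particular the way the sibling and own-category IDF values are combined are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Toy hierarchy and corpus (hypothetical data, for illustration only).
PARENT = {"sports": None, "football": "sports", "tennis": "sports"}
DOCS = [
    ("football", "goal striker match league"),
    ("football", "goal keeper match penalty"),
    ("tennis", "serve racket match set"),
    ("tennis", "serve ace racket court"),
]

def related(cat):
    """Return the four category groups named in the abstract for `cat`."""
    parent = PARENT[cat]
    return {
        "own": {cat},
        "parent": {parent} if parent else set(),
        "siblings": {c for c, p in PARENT.items() if p == parent and c != cat},
        "children": {c for c, p in PARENT.items() if p == cat},
    }

def idf(term, cats):
    """Smoothed IDF of `term` over documents labelled with a category in `cats`."""
    docs = [text.split() for c, text in DOCS if c in cats]
    df = sum(term in d for d in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def weight(term, tf, cat):
    """Placeholder weighting (an assumption, NOT the paper's IDFr formula):
    boost terms that are rare among siblings but common in the own category."""
    groups = related(cat)
    return tf * idf(term, groups["siblings"]) / idf(term, groups["own"])

def centroid(cat):
    """Unit-length sum of the weighted term vectors of the category's documents."""
    vec = defaultdict(float)
    for c, text in DOCS:
        if c == cat:
            for term, tf in Counter(text.split()).items():
                vec[term] += weight(term, tf, cat)
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

CENTROIDS = {cat: centroid(cat) for cat in ("football", "tennis")}

def classify(text):
    """Assign the leaf category whose centroid has the highest cosine similarity."""
    tf = Counter(text.split())
    norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0

    def score(cat):
        return sum(CENTROIDS[cat].get(t, 0.0) * v for t, v in tf.items()) / norm

    return max(CENTROIDS, key=score)

print(classify("penalty goal in the match"))  # football
```

In this toy setup a term such as "match", which occurs in both sibling categories, receives a lower weight than "goal" or "serve", which occur in only one. That is the kind of overlap between categories on the same branch that the abstract describes. The parent and child groups are returned by `related` but are not used by the placeholder weighting.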
Document Type
Article
Source Type
Journal
Keywords
Hierarchical Categories; Hierarchical Text Classification; Relative Inverse Documents Frequency (IDFr); Term Weighting
ASJC Subject Area
Computer Science : Information Systems; Computer Science : Computer Networks and Communications; Decision Sciences : Information Systems and Management; Engineering : Electrical and Electronic Engineering
Funding Agency
National Science and Technology Development Agency