
ECTI Transactions on Computer and Information Technology, Volume 15, Issue 2, Pages 166-176, 21/04/2021
Hierarchical text classification using relative inverse document frequency

Boonthida Chiraratanasopha, Thanaruk Theeramunkong, Salin Boonbrahm

Abstract

Automated text classification over a hierarchical taxonomy has become a challenge as knowledge organization is increasingly used to express relations among classes in a tree structure. Categories on the same branch share overlapping generalized concepts inherited from their super-category, and the difficulty this overlap causes in classification grows with the complexity of the hierarchy. This paper uses the frequency of terms occurring in related categories of the hierarchical tree to help classify documents. The four extended terms for Relative Inverse Document Frequency (IDFr) weighting cover a term's own (located) category, its parent category, its sibling categories, and its child categories. These are used to build a classifier model with a centroid-based technique. In an experiment on hierarchical text classification of Thai documents, IDFr achieved its best accuracy and F-measure, 53.65% and 50.80%, when applied to the Top-n feature set; these are higher than traditional term frequency-inverse document frequency (TF-IDF) by 2.35% and 1.15%, respectively.
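The abstract does not give the IDFr formula, so the following is only a minimal sketch of the baseline it compares against: traditional TF-IDF weighting combined with a centroid-based classifier. The flat two-category setup, the function names, the category labels, and the toy English tokens are all assumptions made for illustration; they are not taken from the paper, which works on a category hierarchy of Thai documents.

```python
import math
from collections import Counter, defaultdict


def tf_idf_vectors(docs):
    """Return (TF-IDF vectors, IDF table) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [
        {t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in docs
    ]
    return vectors, idf


def normalize(vec):
    """Scale a sparse vector to unit length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def build_centroids(vectors, labels):
    """Sum the unit-length vectors of each category, then normalize."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in normalize(vec).items():
            sums[label][t] += w
    return {label: normalize(s) for label, s in sums.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(tokens, idf, centroids):
    """Assign the category whose centroid is most similar to the document."""
    counts = Counter(tokens)
    vec = normalize({t: c * idf[t] for t, c in counts.items() if t in idf})
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy corpus with hypothetical category labels.
train_docs = [
    ["rice", "farm", "harvest"],
    ["farm", "soil", "crop"],
    ["bank", "loan", "interest"],
    ["loan", "credit", "bank"],
]
train_labels = ["agriculture", "agriculture", "finance", "finance"]

vectors, idf = tf_idf_vectors(train_docs)
centroids = build_centroids(vectors, train_labels)
predicted = classify(["farm", "crop", "price"], idf, centroids)
```

Under the approach the abstract describes, the IDF factor would presumably be replaced by a weight that also accounts for a term's frequency in the parent, sibling, and child categories; that weighting is defined in the paper itself and is not reproduced here.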

Document Type

Article

Source Type

Journal

Keywords

Hierarchical Categories; Hierarchical Text Classification; Relative Inverse Documents Frequency (IDFr); Term Weighting

ASJC Subject Area

Computer Science: Information Systems; Computer Science: Computer Networks and Communications; Decision Sciences: Information Systems and Management; Engineering: Electrical and Electronic Engineering

Funding Agency

National Science and Technology Development Agency

Access to Document

DOI: 10.37936/ecti-cit.2021152.240515

Citations (Scopus): 0