International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE): a monthly peer-reviewed and refereed journal
ISSN Online 2278-1021 | ISSN Print 2319-5940 | Since 2012
Volume 5, Issue 9, September 2016

A Different Type of Feature Selection Methods for Text Categorization on Imbalanced Data

Senthil Kumar B, Bhavitha Varma E

DOI: 10.17148/IJARCCE.2016.5963

Abstract: Text categorization is an important and well-studied area of pattern recognition with a variety of modern applications. Effective spam email filtering, automated document organization and management, and improved information retrieval all benefit from techniques in this field. Feature selection, the problem of choosing the most relevant features from what can be a very large feature set, is particularly important for accurate text categorization. The proposed system works in three stages. (i) It applies two well-known pre-processing methods, the Porter and Lancaster stemmers, to prepare the training data. (ii) It selects features. A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), mutual information (MI), Ng-Goh-Low (NGL), Galavotti-Sebastiani-Simi (GSS), relevancy score (RS), multi-sets of features (MSF), document frequency (DF), and odds ratio (OR) are considered the most effective. Pruning techniques that ignore features based on term frequency (TF) and document frequency (DF) are also proposed, to further reduce the set of candidate features (typically words) in a document before a feature selection method is applied. (iii) Finally, it classifies documents using the selected features with two algorithms, k-nearest neighbors (KNN) and Naive Bayes. Two benchmark collections were chosen as testbeds: Reuters-21578 and a small portion of Reuters Corpus Volume 1 (RCV1). Results are reported for the two classifiers and both data collections, and indicate that a further increase in performance is obtained by combining uncorrelated and high-performing feature selection methods.
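
To make stage (ii) concrete, the following is a minimal sketch of DF-based pruning followed by chi-square (CHI) and information gain (IG) scoring, using the common two-by-two term/category contingency formulation. It is not the authors' implementation: the function names, the `min_df` threshold, and the four-document toy corpus are assumptions made for illustration, and IG is computed here in its binary, per-category form.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized documents and one category label each.
DOCS = [
    ["wheat", "price", "rise", "tonnes"],
    ["wheat", "export"],
    ["oil", "price"],
    ["oil", "export", "rise"],
]
LABELS = ["grain", "grain", "crude", "crude"]


def prune_by_df(docs, min_df=2):
    """Drop terms that occur in fewer than min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return [[t for t in tokens if df[t] >= min_df] for tokens in docs]


def contingency(docs, labels, term, category):
    """Return document counts (a, b, c, d):
    a = term present, in category      b = term present, other category
    c = term absent, in category       d = term absent, other category
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present, in_cat = term in tokens, label == category
        if present and in_cat:
            a += 1
        elif present:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 term/category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def _entropy(counts):
    """Shannon entropy (bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(x / total) * math.log2(x / total) for x in counts if x > 0)


def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term is present."""
    n = a + b + c + d
    h_category = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) + ((c + d) / n) * _entropy([c, d])
    return h_category - h_given_term


def rank_terms(docs, labels, category, metric, top_k=3):
    """Score every vocabulary term with `metric` and return the top_k terms."""
    vocab = sorted({t for tokens in docs for t in tokens})
    scored = [(metric(*contingency(docs, labels, t, category)), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # high score first, ties by name
    return [t for _, t in scored[:top_k]]


if __name__ == "__main__":
    pruned = prune_by_df(DOCS, min_df=2)
    print(rank_terms(pruned, LABELS, "grain", chi_square, top_k=2))
    print(rank_terms(pruned, LABELS, "grain", information_gain, top_k=2))
```

Both metrics here are two-sided, so a term that is a strong negative indicator of a category scores as high as a strong positive one. The selected terms would then form the feature vectors passed to a classifier such as KNN or Naive Bayes in stage (iii).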



Keywords: Locally Weighted Spectral Cluster, matrix, Local Scaling, Estimating Weight based Clusters

How to Cite:

[1] Senthil Kumar B, Bhavitha Varma E, “A Different Type of Feature Selection Methods for Text Categorization on Imbalanced Data,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2016.5963