Abstract: The exponential growth of the internet over the past decade has led to millions of web pages being published on every subject. The internet provides only a medium for communication between computers and for accessing online documents over the network; it does not organize this large amount of data. Subject-based web directories such as the Open Directory Project's (ODP) Directory Mozilla (DMOZ) and Yahoo organize web pages in a hierarchy. Because of the rapid growth in the number of web pages, maintaining such a directory service requires machine learning techniques that categorize pages automatically. The textual information in a page serves as a hint for assigning the page to a class. We propose a method that uses an extended TDW scheme for feature representation and a naïve Bayesian classifier to build the classification model. Web page categorization offers a wide range of benefits, including knowledge base construction, improved quality of web search results, web content filtering, and focused crawling.
Keywords: Categorization, Extended TDW Matrix, Naive Bayesian, Feature selection.