Abstract: Because forums are rich in information, researchers are increasingly interested in mining knowledge from them. In this work, forum posts and replies are clustered and analyzed to improve users' knowledge of the field. To harvest knowledge from a forum, its contents must first be downloaded. A forum board or thread is usually divided into multiple pages connected by page-flipping links, and a forum site contains several page types, such as entry pages, thread pages, and page-flipping pages. Forum mining has three phases: preprocessing, mining the data with data mining strategies such as clustering, and post-processing. In preprocessing, raw data is transformed into a usable format, mainly by parsing and cleaning. The pages are downloaded as HTML files and parsed, and attributes such as forum ID, forum title, thread count, and post count are assigned. Once parsing is complete, data cleaning is applied to the downloaded post sets to automatically remove noisy and irrelevant data. A clustering algorithm is then applied to the preprocessed data to group the forums into clusters, using all topics and subtopics of the forum. The four clustering dimensions are the number of posts per topic, the average sentiment value per topic, the percentage of positive posts per topic, and the percentage of negative posts per topic. The number of posts per topic is determined by the number of replies to a post. The sentiment value of a topic is identified from user replies and describes user opinion. The positive and negative dimensions are also determined from user replies and describe how users perceive the posts; they are additionally used to identify user attitudes and the pros and cons of the specific topics discussed in a particular forum. In the post-processing stage, a number of clusters are obtained. The final clusters group topics with similar sentiment values and user opinions. Based on the sentiment values, the positive and negative posts are clustered for each thread. Information seekers and decision makers can benefit from this clustering, which simplifies the decision-making process.

 

Keywords: Clustering, Forum, Graph, Threads.
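
The four clustering dimensions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `topic_features`, the `(topic, sentiment)` input format, and the convention that a score above zero is positive and below zero is negative are all assumptions made for illustration.

```python
from collections import defaultdict


def topic_features(posts):
    """Compute the four per-topic dimensions from (topic, sentiment) pairs.

    Assumed convention (not from the paper): sentiment > 0 is a positive
    post, sentiment < 0 is a negative post, and 0 is neutral.
    """
    scores_by_topic = defaultdict(list)
    for topic, sentiment in posts:
        scores_by_topic[topic].append(sentiment)

    features = {}
    for topic, scores in scores_by_topic.items():
        n = len(scores)
        features[topic] = {
            "post_count": n,
            "avg_sentiment": sum(scores) / n,
            "positive_pct": 100.0 * sum(s > 0 for s in scores) / n,
            "negative_pct": 100.0 * sum(s < 0 for s in scores) / n,
        }
    return features


# Hypothetical sample data: topic labels and sentiment scores are made up.
sample_posts = [
    ("battery", 1.0), ("battery", -1.0), ("battery", 1.0), ("battery", 1.0),
    ("screen", -1.0), ("screen", -1.0),
]
feats = topic_features(sample_posts)
```

Each topic's four values form one feature vector, which could then be passed to any standard clustering algorithm; the abstract does not say which algorithm is used.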