Abstract:
Because forums are rich in information, researchers are increasingly interested
in mining knowledge from them. In this work, forum posts and replies are
clustered and analyzed to improve users' knowledge of a field. To harvest
knowledge from a forum, its contents must first be downloaded.
A forum board or thread is usually divided into multiple pages that are linked
by page-flipping links. Forum sites contain several page types, such as entry
pages, thread pages, and page-flipping pages. Forum mining has three phases:
preprocessing, mining the data by applying data mining strategies such as
clustering, and post-processing. In preprocessing, raw data is transformed
into a usable format, mainly by parsing and cleaning. During preprocessing, the
pages are downloaded
as HTML files, which are then parsed and assigned attributes such as forum id,
forum title, thread count, and post count. Once parsing is complete, a data
cleaning process is applied to the downloaded post sets to automatically remove
noisy and irrelevant data. A clustering
algorithm is then applied to the preprocessed data to group the forums into
clusters. The clustering uses all topics and subtopics of the forum. The four
dimensions of clustering are the number of posts per topic, the average
sentiment value per topic, the percentage of positive posts per topic, and the
percentage of negative posts per topic. The posts-per-topic dimension is
determined by the number of replies to a post. The sentiment value of a topic
is derived from user replies and describes user opinion. The positive and
negative dimensions are likewise determined from user replies and describe how
users perceive the posts. These two dimensions are also used to identify user
attitudes and the pros and cons of the specific topics discussed in a
particular forum.
In the post-processing stage, a number of clusters are obtained. The final
clusters group topics with similar sentiment values and user opinions. Based on
the sentiment values, the positive and negative posts are clustered for each
thread. Information seekers and decision makers can benefit from this
clustering, which simplifies the decision-making process.
Keywords: Clustering, Forum, Graph, Threads.
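
The four clustering dimensions described in the abstract can be illustrated
with a minimal sketch. It assumes each post has already been cleaned and
assigned a sentiment score in [-1, 1]; how the scores are obtained is not
specified here. The abstract does not name a clustering algorithm, so plain
k-means is used purely as a stand-in. All function names (topic_features,
kmeans) and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def topic_features(posts):
    """Build the four per-topic dimensions from (topic, sentiment) pairs.

    Assumption: sentiment is a float in [-1, 1] produced by an earlier step.
    Returns {topic: (post_count, avg_sentiment, positive_pct, negative_pct)}.
    """
    by_topic = defaultdict(list)
    for topic, sentiment in posts:
        by_topic[topic].append(sentiment)
    features = {}
    for topic, scores in by_topic.items():
        n = len(scores)
        features[topic] = (
            float(n),                                 # number of posts per topic
            mean(scores),                             # average sentiment per topic
            100.0 * sum(s > 0 for s in scores) / n,   # percentage of positive posts
            100.0 * sum(s < 0 for s in scores) / n,   # percentage of negative posts
        )
    return features


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means (a stand-in algorithm), seeded with the first k points."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members, if it has any.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels


# Hypothetical sample: (topic, sentiment score of one post or reply).
posts = [
    ("battery", 0.8), ("battery", 0.6), ("battery", -0.2),
    ("screen", -0.7), ("screen", -0.5),
    ("camera", 0.9), ("camera", 0.7),
]
features = topic_features(posts)
topics = list(features)
labels = kmeans([features[t] for t in topics], k=2)
for topic, label in zip(topics, labels):
    print(topic, label, features[topic])
```

In this sample the mostly positive topics end up in one cluster and the
negative topic in the other. The dimensions are not rescaled, so the two
percentage dimensions dominate the distance; a real pipeline would likely
normalize each dimension first.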