Abstract: Text mining has gained huge momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing, a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the occurrence of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to calculate these weights. This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependence structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can result in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose a change to tf-idf that accounts for this bias. We show the effects of both the bias and the correction using data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.
Keywords: Sentiment Analysis, Text Mining, Statistical Bias, Discourse, TF-IDF
DOI: 10.17148/IJARCCE.2018.71102