Abstract: Data mining techniques help uncover patterns among data attributes and enable probabilistic prediction of label attributes. With predictive modeling as its focus, this paper applies analytics to a dataset of real-world text messages. Two classification techniques, Decision Tree and Random Forest, are combined with the Bag of Words model, Latent Semantic Analysis, Singular Value Decomposition, and feature engineering to classify each message into one of two classes: legitimate text messages (HAM) and SPAM. The paper presents a thorough study and analysis of the techniques applied for classification and prediction, and discusses how the Vector Space Model makes the dataset suitable for the prediction and classification algorithms.
Keywords: Bag of Words Model, Document Frequency Matrix, Stop Words, Stemming, Cross Validation, Decision Tree, Random Forest, TF-IDF, Documents, Terms, Corpus, Vector Space Model, Latent Semantic Analysis, Singular Value Decomposition, n-gram, Feature Engineering.