Abstract: Clustering is a useful exploratory technique for the analysis of gene expression data. In particular, model-based clustering assumes that the data are generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions. The issues of selecting a ‘good’ clustering method and determining the ‘correct’ number of clusters then reduce to model selection problems within a probabilistic framework. This paper presents an attribute clustering method that groups genes based on their interdependence in order to mine meaningful patterns from gene expression data. It can be used for gene grouping and classification. By clustering attributes, the search dimension of a data mining algorithm is reduced. For this reason, gene grouping and selection are important preprocessing steps that allow many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. Applying our OFS algorithm to gene expression data exposes important clusters of genes. Grouping genes by feature interdependence within each group helps to capture different aspects of the gene association patterns in that group. The important genes selected from each group then carry useful information for gene expression classification and identification.
Keywords: Feature Selection, Online Learning, Large-scale Data Mining, Classification.
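To illustrate what an information measure of interdependence between two attributes could look like, the following is a minimal sketch. The abstract does not specify the measure, so the choice made here (mutual information normalized by joint entropy), the function names `entropy` and `interdependence`, and the assumption that expression levels have already been discretized are all illustrative and not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def interdependence(x, y):
    """Hypothetical interdependence score between two discretized attributes.

    Assumption: mutual information I(X;Y) divided by joint entropy H(X,Y).
    The score is 0 for independent attributes and 1 when each attribute
    fully determines the other.
    """
    joint = entropy(list(zip(x, y)))
    if joint == 0.0:
        # Both attributes are constant, so there is nothing to share.
        return 0.0
    mutual = entropy(x) + entropy(y) - joint
    return mutual / joint


# Toy, made-up discretized expression levels for three genes over four samples.
gene_a = ["hi", "hi", "lo", "lo"]
gene_b = ["lo", "lo", "hi", "hi"]  # mirrors gene_a exactly
gene_c = ["hi", "lo", "hi", "lo"]  # unrelated to gene_a in this sample

score_ab = interdependence(gene_a, gene_b)
score_ac = interdependence(gene_a, gene_c)
```

Under these assumptions, a criterion function for attribute clustering could reward placing high-scoring pairs such as `gene_a` and `gene_b` in the same cluster while keeping low-scoring pairs such as `gene_a` and `gene_c` apart.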