A Simple Mixture Model For Unsupervised Text Categorisation
Free (open access)
F. Clérot, F. Fessant, O. Collin, O. Cappé & E. Moulines
Automatically segmenting text corpora into thematically related groups is a complex exploratory analysis problem. In this article, we outline our multi-stage exploratory analysis process and investigate the performance of a simple statistical model. After describing this model and its fitting procedure, we illustrate its performance on the segmentation of a corpus of CKM-related texts in English.

Keywords: text mining, exploratory analysis, clustering, mixture model.

1 Introduction

Clustering is a key tool in exploratory data analysis: segmenting the data into homogeneous groups leads to a more synthetic understanding of the data, makes it possible to build powerful visualisations, and is often the first step towards more specific analyses such as supervised classification. Although less standard in the analysis of text data, clustering has recently received a lot of attention, with the goal of bringing these same benefits to text data analysis.

There are, however, significant differences between text data analysis and numerical data analysis. For numerical data, cluster "homogeneity" is judged from a metric in data space; for text data, it is implicitly understood that "homogeneity" means "topical homogeneity", a notion which is more difficult to measure.

In this article, the purpose of text clustering is to build a topical segmentation of a corpus. Because of the difficulty of defining a priori a topical homogeneity measure, the text clustering analysis must be considered as a part of