Text Mining: Crossing The Chasm Between The Academy And The Industry


E M Silva, H A do Prado & E Ferneda


The existence of a chasm between the development of new technologies and their adoption has been widely recognized. Some reasons that make the transition from academy to industry hard for a new technology are: (a) the weak usability that emergent technologies commonly present, compared with the ease of use that ordinary users require; (b) the few successful experiences reported; and (c) the lack of an adequate methodology for the new tools. In this paper we argue that text mining technology sits exactly at the chasm point, and we study hypothesis (c) above. The starting point of our argument is a contradiction: an extraordinary amount of information exists in text form, about 80% of all the information in a company, while the share of text mining/web mining applications does not go beyond 7%. At the same time, we observe that the available technological alternatives show an excellent level of maturity, with many functions and adequate interfaces for the common user. The research was carried out by means of a case study in which we used texts issued by a journalistic agency. To explore our hypothesis, we applied the CRISP-DM method, which was originally conceived for data mining. The contribution of this work includes the examination of the methodological hypothesis for the lack of text mining applications, an experience report describing the steps carried out to apply CRISP-DM to text mining, and the findings in the target domain.

1 Introduction

Since the early nineties, researchers in Knowledge Discovery from Databases (KDD) have dedicated intensive efforts to extracting human-understandable patterns from structured databases, as well as to making the whole work as automatic as