WIT Press

Knowledge Discovery In Large Text Databases Using The MST Algorithm


Free (open access)








508 kb

Paper DOI



WIT Press


V. Romanov & E. Pantileeva


In many cases the problem in situations of preliminary stage decision-making consists of the collection of relevant data. The next stage is crucial for the problem situation concept and relation extractions. Both these functions are considered as embedded parts of information systems, permanently updating the situation picture. Such a dynamic picture is formed as a maximum spanning tree for a graph, representing the covariance matrix for word/lemmas or concept pairs. Being displayed fragmentary on the screen like pages of a road map with different edge (road) widths, MST serves as a dynamically changing thesaurus or semantic net for query navigation. Sequentially cutting the weak edges of MST, we can extract lemma clusters, forming concepts and frames, which can be reflected as relational database tables. Recognizing attributes, table names and attribute values by the means of comparison with preliminary prepared categories extensional we can fill, row by row, such formed database tables, extracting knowledge from the text. The described method was tested as part of some real projects, including in particular the abstracts database, divided on thematic classes. The extensional categories are preliminarily formed as the parts of domain ontology and contain such concepts as enterprises, official structures and their units, production units, activities, constructing operations. Besides the database tables, containing data extracted from text, semi-automatic MST analysis gives the possibility of formulating the rules, extracted from text collections which are being formulated in PROLOG, describing the conditions of the transition from one state to another within the same situation. Some ways of overcoming the great dimensionality of a covariance matrix for increasing computing speed were justified. It was shown that a set of lemmas, belonging to MST, can contain rather long sequences of terms (5–6 terms), which are essential for that problem situation and expose, through statistical links, important semantic relations. Keywords: text mining, maximum spanning tree, data extraction, Galois lattices.


text mining, maximum spanning tree, data extraction, Galois lattices.