WIT Press


Multilingual Text Mining

Price

Free (open access)

Paper DOI

10.2495/DATA050091

Volume

35

Pages

6

Published

2005

Size

488 kb

Author(s)

F. Neri

Abstract

The availability of huge amounts of textual data from a bewildering variety of sources leads to a well-known paradox: an overload of information means no usable knowledge. In fact, up to 80% of electronic data is textual. Moreover, the most valuable information is encoded in pages that are written in various native languages but are relevant even to non-native speakers. The process of accessing all these raw data, heterogeneous in the language used, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, and hinges greatly on the ability to master the problems of multilingualism. Through multilingual text mining, users can get an overview of great volumes of textual data in the form of a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. This paper describes the approach used by SYNTHEMA for multilingual text mining, showing the classification results on around 600 breaking news items written in English, Italian and French.

1 Multilingual resources construction

Generally speaking, the manual construction and maintenance of multilingual language resources is expensive and requires considerable effort. SYNTHEMA was established in 1994 by computer scientists from the IBM Research Center, with the expertise and skills to provide effective software solutions and to carry out R&D in the Natural Language Processing area. It has been involved in Machine Translation, Information Extraction and Text Mining activities since 1996, primarily in the field of Technology Watch. The growing availability of comparable and parallel corpora has pushed SYNTHEMA to develop specific methods for the semi-automatic updating of lexical resources, based on Natural Language Understanding and Machine Learning. These techniques detect multilingual lexicons from such corpora, by extracting all the

Keywords