WIT Press


Text Preparation Through Extended Tokenization

Price

Free (open access)

Volume

37

Pages

9

Published

2006

Size

533 kb

Paper DOI

10.2495/DATA060021

Copyright

WIT Press

Author(s)

M. Hassler & G. Fliedl

Abstract

Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Beyond the text mining community this job is taken for granted. Commonly it is seen as an already solved problem comprising the identification of word borders and punctuation marks separated by spaces and line breaks. In our view, however, it should manage language-related word dependencies, incorporate domain-specific knowledge, and handle morphosyntactically relevant linguistic specificities. Therefore, we propose rule-based extended tokenization including all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are identification and disambiguation of all kinds of linguistic markers, detection and expansion of abbreviations, treatment of special formats, and typing of tokens including single- and multi-tokens. To improve the quality of text mining we suggest linguistically-based tokenization as a necessary step preceding further text processing tasks. In this paper, we focus on the task of improving the quality of standard tagging.

1 Introduction

Nearly all researchers concerned with text mining presuppose tokenizing as the first step during text preparation [1–5]. Good surveys of tokenization techniques are provided by Frakes and Baeza-Yates [6], Baeza-Yates and Ribeiro-Neto [7], and Manning and Schütze [8, pp. 124–136]. But, as we know, only very few reflect tokenization as a task of multi-language text processing with far-reaching impact [9]. This involves language-related knowledge about linguistically

Keywords

text preparation, natural language processing, tokenization, tagging improvement, tokenization prototype.
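
To make the features named in the abstract more concrete, the following is a minimal Python sketch of rule-based tokenization with abbreviation expansion, special-format handling, and token typing. It is not the authors' prototype: the abbreviation dictionary, the rule names (DATE, ABBREVIATION, NUMBER, WORD, PUNCTUATION), and the function name tokenize are all assumptions chosen for illustration.

```python
import re

# Hypothetical abbreviation dictionary (an assumption, not taken from the paper).
ABBREVIATIONS = {
    "e.g.": "for example",
    "Dr.": "Doctor",
    "approx.": "approximately",
}

# Ordered rules: at a given position, the earliest rule that matches wins.
# Longer abbreviations are listed first so they are tried before shorter ones.
TOKEN_RULES = [
    ("DATE", r"\d{4}-\d{2}-\d{2}"),  # one example of a "special format"
    ("ABBREVIATION", "|".join(
        re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[^\W\d_]+(?:-[^\W\d_]+)*"),  # letters, optionally hyphenated
    ("PUNCTUATION", r"[^\w\s]"),
]
TOKEN_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(text):
    """Return (surface, type, expansion) triples for the given text.

    Characters that match no rule (whitespace, underscores) are skipped.
    """
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        kind = match.lastgroup  # name of the rule that matched
        surface = match.group()
        expansion = ABBREVIATIONS.get(surface) if kind == "ABBREVIATION" else None
        tokens.append((surface, kind, expansion))
    return tokens


if __name__ == "__main__":
    sample = "Dr. Smith paid approx. 3.5 EUR on 2006-05-01, e.g. for books."
    for token in tokenize(sample):
        print(token)
```

In this sketch the periods in "Dr." and "e.g." stay attached to their abbreviations, while the final period of the sentence is typed as punctuation. A plain whitespace-and-punctuation splitter would not make that distinction.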