WIT Press

An Evaluation Component For Categorization Systems


Free (open access)







P Gerstl, U Hofmann & A Lang


Automatic categorization is a way to combat the information overflow many companies face every day. For automatic categorization to be useful, the categorization model must correctly represent the intended ‘meaning’ of the categories. The model is created in a training step from representative sets of training documents for each category. Its quality therefore depends critically on the quality of the ‘training base’: the set of categories that make up the taxonomy, together with the sets of training documents associated with each category of the taxonomy. While a number of categorization systems are commercially available, assessing categorization quality is a costly and labor-intensive task when performed manually on the basis of separate validation data. In this article, we describe an evaluation component that automatically evaluates the quality of a training base independently of the categorization system used. Furthermore, we show how this component can be used to propose changes to the training base that help reduce the number of problematic areas, thereby improving the quality of the categorization system. We illustrate our approach using a sample taxonomy with 10 categories and approximately 400 training documents.

1 Introduction

The amount of information available electronically increases by the minute, not only on the Internet but also in intranets, mail systems, repositories and databases. To turn this information into a valuable asset, a company must find ways to provide its employees, customers and business partners with the subset that is relevant and useful for their task. A common practice for managing this information complexity is to de-