A Comparison Of Two Algorithms For Discovering Repeated Word Sequences
Free (open access)
R. Tesar, D. Fiala, F. Rousselot & K. Jezek
We will make the readers of this paper familiar with two basic approaches to repeated sequences extraction – a suffix tree based method and an inverted list based method. The first algorithm (ST) makes use of a tree data structure known from suffix tree clustering (STC) where each node represents one word and the root represents the null word. Thus, each path from the root is a phrase occurring somewhere in the corpus. The second applies an inverted index (broadly used in information retrieval), more specifically its hash table implementation (HT). Occurrences of each word in this index are employed to construct lists of words following the current word in different places of the corpus. These lists are then modified in a recursive fashion to satisfy the constraints of repeated segments. We will introduce both methods and compare them in terms of their time and space complexity, efficiency, effectiveness and results yielded with some sample input data. We will conclude that whereas the ST-algorithm is better with regards to time cost, it is outperformed by the HT-approach in respect of space. Finally, we will suggest several possible applications of repeated sequences. Keywords: repeated sequences, repeated segments, frequent phrases, suffix tree, algorithms, comparison. 1 Introduction Repeated word sequences, sometimes referred to as repeated segments, phrases, co-occurrences, or collocations, are simply sequences of words that occur more than once in a text or corpus. Whether all words in the text are taken into account
repeated sequences, repeated segments, frequent phrases, suffix tree, algorithms, comparison.