Experiments in Unsupervised Entropy-Based Corpus Segmentation

The paper presents an entropy-based approach to segmenting a corpus into words when no additional information about the corpus or the language is available, and no other resources such as a lexicon or grammar. To segment the corpus, the algorithm searches for separators without knowing a priori which symbols constitute them. Good results can be obtained with corpora containing "clearly perceptible" separators such as a blank or a newline.
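
The abstract does not spell out how the entropy is computed, so the following is only a minimal sketch of one common way to look for separators without prior knowledge, not the paper's actual algorithm. It assumes that a separator symbol tends to be followed by many different symbols, so the entropy of its successor distribution is high. The function names `successor_entropy` and `rank_separator_candidates` and the context length `k` are hypothetical choices made for this illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(corpus: str, k: int = 1) -> dict[str, float]:
    """Map each length-k context to the entropy (in bits) of the next symbol.

    Assumption for this sketch: a high value means the context is followed
    by many different symbols, which is what we expect after a separator.
    """
    followers: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(corpus) - k):
        followers[corpus[i:i + k]][corpus[i + k]] += 1

    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def rank_separator_candidates(corpus: str) -> list[tuple[str, float]]:
    """Rank single symbols by successor entropy, highest first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    scores = successor_entropy(corpus, k=1)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "the cat the dog the fox "
    for symbol, bits in rank_separator_candidates(sample)[:3]:
        print(repr(symbol), round(bits, 3))
```

On a toy corpus such as the one in the `__main__` block, the blank ranks first because it is followed by the widest variety of symbols. A real corpus would need longer contexts and a criterion for deciding how many top-ranked symbols to treat as separators, neither of which this sketch attempts.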

Postscript provided by author: http://lcg-www.uia.ac.be/conll99/papers/kempe.ps.gz

This is the abstract of a paper presented at the CoNLL-99 workshop.

Last update: May 23, 2000. erikt@uia.ua.ac.be