André Kempe, andre.kempe@xrce.xerox.com
The paper presents an entropy-based approach to segmenting a corpus into words when neither additional information about the corpus or the language nor other resources such as a lexicon or grammar are available. To segment the corpus, the algorithm searches for separators without knowing a priori which symbols constitute them. Good results can be obtained with corpora containing "clearly perceptible" separators such as blank or new-line.
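The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way an entropy criterion could point to separator candidates, not the paper's actual procedure. It assumes a simplified heuristic: for every symbol, compute the Shannon entropy of the symbol that follows it, then rank symbols by that value. The function names `successor_entropy` and `rank_separator_candidates` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(text):
    """Map each symbol to the Shannon entropy (in bits) of its successor.

    Assumption for illustration: a symbol followed by many different
    symbols with similar frequency is treated as a separator candidate.
    """
    followers = defaultdict(Counter)
    # Count which symbol follows each symbol in the corpus.
    for cur, nxt in zip(text, text[1:]):
        followers[cur][nxt] += 1

    entropy = {}
    for sym, counts in followers.items():
        total = sum(counts.values())
        entropy[sym] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def rank_separator_candidates(text):
    """Return symbols sorted by decreasing successor entropy (ties by symbol)."""
    ent = successor_entropy(text)
    return sorted(ent, key=lambda s: (-ent[s], s))
```

In the toy string `"xaxbxcxd"`, the symbol `x` is followed by four different symbols with equal frequency, so it ranks first. Whether a blank or new-line ranks first on a real corpus depends on the corpus, and a usable segmenter would need more context than a single preceding symbol.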
Postscript provided by author: http://lcg-www.uia.ac.be/conll99/papers/kempe.ps.gz