Unsupervised Learning of Word Boundary with Description Length Gain

Chunyu Kit, ctckit@cityu.edu.hk
Yorick Wilks, yorick@dcs.shef.ac.uk

This paper presents an unsupervised approach to lexical acquisition based on a goodness measure, description length gain (DLG), formulated from classic information theory within the minimum description length (MDL) paradigm. The learning algorithm seeks an optimal segmentation of an utterance that maximises the description length gain from the individual segments. The resulting segments correspond well to lexical items (in particular, words) in a natural language such as English. Learning experiments on large-scale corpora (e.g., the Brown corpus) have shown the effectiveness of both the learning algorithm and the goodness measure that guides the learning.
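The abstract does not spell out the formulas, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes that the description length of a token sequence is its empirical entropy-based code length in bits, and that the DLG of a segment is the reduction in description length obtained by replacing every occurrence of the segment with one new symbol and appending the segment itself once after a delimiter. The function names (description_length, dlg, segment), the delimiter token, and the max_len limit are hypothetical choices made for illustration.

```python
import math
from collections import Counter

SEP = "<sep>"  # assumed delimiter token; multi-character, so it never equals a corpus character


def description_length(tokens):
    """Assumed measure: -sum over types x of c(x) * log2(c(x) / n), in bits."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())


def replace_all(tokens, seg, new_symbol):
    """Replace non-overlapping occurrences of seg (a tuple), scanning left to right."""
    out, i, k = [], 0, len(seg)
    while i < len(tokens):
        if tuple(tokens[i:i + k]) == seg:
            out.append(new_symbol)
            i += k
        else:
            out.append(tokens[i])
            i += 1
    return out


def dlg(corpus, seg):
    """Assumed gain: DL(corpus) - DL(corpus with seg replaced, plus SEP and seg appended)."""
    seg = tuple(seg)
    new_symbol = ("SEG", seg)  # a fresh symbol distinct from every character
    rewritten = replace_all(corpus, seg, new_symbol) + [SEP] + list(seg)
    return description_length(corpus) - description_length(rewritten)


def segment(utterance, corpus, max_len=4):
    """Dynamic programming over split points, maximising the summed gain of the segments.

    Single characters are given a gain of 0 (an assumption of this sketch).
    """
    n = len(utterance)
    best = [0.0] + [-math.inf] * n  # best[j]: best total gain for utterance[:j]
    back = [0] * (n + 1)            # back[j]: start index of the last segment
    cache = {}
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = utterance[i:j]
            if len(seg) == 1:
                gain = 0.0
            else:
                if seg not in cache:
                    cache[seg] = dlg(corpus, seg)
                gain = cache[seg]
            if best[i] + gain > best[j]:
                best[j] = best[i] + gain
                back[j] = i
    segments, j = [], n
    while j > 0:
        segments.append(utterance[back[j]:j])
        j = back[j]
    return segments[::-1]
```

With a toy corpus such as list("ab" * 20), the call segment("abab", list("ab" * 20)) returns ['ab', 'ab'], since the repeated pair "ab" yields a larger gain than any other split. Scoring each candidate independently against a fixed corpus is a simplification chosen to keep the sketch short.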

Postscript supplied by author: http://www.dcs.shef.ac.uk/~yorick/papers/unsup.ps

This is the abstract of a paper presented at the CoNLL-99 workshop.

Last update: May 22, 2000. erikt@uia.ua.ac.be