Improving Automatically Parsed Dutch Treebanks

Friday, January 29, 2016 - 15:00 - 16:30
Annexe, Building R, Stadcampus, University of Antwerp
Gertjan Van Noord

In this presentation, we describe our efforts (some in vain) to improve the automatically parsed Lassy Large treebank. We describe some aspects of the Alpino parser and some recent attempts at improving it. Alpino is a hybrid system in which a hand-written grammar and a large dictionary are combined with a statistical disambiguation component. The disambiguation component uses co-occurrence information extracted from large treebanks to improve disambiguation accuracy. We also describe a recent experiment that adds word embedding features to the disambiguation component.

We then zoom in on the part-of-speech annotation layer of the existing Lassy Large treebank, suggesting that the part-of-speech labels, originally provided by a separate POS tagger, are of questionable quality. We analyse some of the reasons for this and describe our efforts to provide part-of-speech labels as a side effect of parsing. Initial experimental results indicate a potentially large improvement in POS-tagging accuracy when the parser is used as a tagger.


