NP Bracketing

In 1995 Lance Ramshaw and Mitch Marcus put forward a standard data set for NP chunking: recognizing non-overlapping text parts that consist of noun phrases (NPs) [RM95]. These NPs were non-recursive and lacked post-modifying phrases such as prepositional phrases. The recognition of these NPs would be a useful step towards a complete parsing process. Therefore we propose to expand this NP chunking data set and use it for an extended task: NP bracketing, the recognition of all noun phrase structures in a text.

The NP chunking data set put forward by Ramshaw and Marcus consists of two parts: training data and test data. Both parts have been extracted from the Wall Street Journal corpus (WSJ). The training part contains sections 15-18 of this corpus and the test part consists of section 20. In order to make the data usable for the NP bracketing task, it needs to be extended with an annotation for the complete noun phrase structure. This information can be extracted from the WSJ corpus as well. The location of the chunking data files and the bracketing data files can be found in the Software and Data section on this page.

The performance of a machine learning algorithm on this task is measured with two scores: precision and recall. Precision is the percentage of noun phrases found by the algorithm that are correct, and recall is the percentage of NPs defined in the corpus that were found by the bracketing program. The machine learning algorithm is assumed to output a balanced bracketing structure, which means that every opening bracket must match a closing bracket and vice versa. The two rates can be combined in one measure, the F rate: F = 2*precision*recall / (recall+precision) [Rij79].
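The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the evalb implementation: the function names (bracket_spans, f_rate, score), the use of "[" and "]" as separate bracket tokens, and the representation of an NP as an unlabeled (start, end) word span are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive (assumed representation)


def bracket_spans(tokens: List[str]) -> List[Span]:
    """Return the word span of every bracket pair in a token sequence.

    Raises ValueError when the bracketing is not balanced.
    """
    spans: List[Span] = []
    open_starts: List[int] = []
    position = 0  # number of words seen so far
    for token in tokens:
        if token == "[":
            open_starts.append(position)
        elif token == "]":
            if not open_starts:
                raise ValueError("closing bracket without opening bracket")
            spans.append((open_starts.pop(), position))
        else:
            position += 1
    if open_starts:
        raise ValueError("opening bracket without closing bracket")
    return spans


def f_rate(precision: float, recall: float) -> float:
    """F = 2*precision*recall / (recall+precision), as defined above."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (recall + precision)


def score(gold: Set[Span], found: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) as percentages.

    Simplification: spans are compared as sets, so two NPs covering
    exactly the same words count only once.
    """
    correct = len(gold & found)
    precision = 100.0 * correct / len(found) if found else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall, f_rate(precision, recall)


gold = set(bracket_spans("[ [ the man ] with [ the telescope ] ]".split()))
found = set(bracket_spans("[ the man ] with [ the telescope ]".split()))
precision, recall, f = score(gold, found)
print(f"precision={precision:.2f} recall={recall:.2f} F={f:.2f}")
```

In this made-up example the program finds both inner NPs but misses the enclosing one, so every NP it proposes is correct while only two of the three corpus NPs are recovered.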

The following results have been reported for this data set (CR=crossing rate):

             +------+-----------+--------++-------++
             |  CR  | precision | recall ||   F   ||
   +---------+------+-----------+--------++-------++
   | [TKS00] |      |   90.00%  | 78.38% || 83.79 || 
   | [TKS99] | 0.14 |   91.28%  | 76.06% || 82.98 || 
   +---------+------+-----------+--------++-------++
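
This page does not define the crossing rate. The sketch below assumes the usual PARSEVAL-style notion: a proposed bracket crosses a corpus bracket when the two overlap without one containing the other, and the rate is the average number of crossing proposed brackets per sentence. Whether the CR figure in the table was computed in exactly this way is an assumption, and the function names are illustrative only.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive (assumed representation)


def crosses(a: Span, b: Span) -> bool:
    """True when the spans overlap but neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def crossing_count(gold: Set[Span], found: Set[Span]) -> int:
    """Number of proposed brackets that cross at least one corpus bracket."""
    return sum(1 for f in found if any(crosses(f, g) for g in gold))


def crossing_rate(sentences: List[Tuple[Set[Span], Set[Span]]]) -> float:
    """Average crossing count per sentence (assumed definition of CR)."""
    if not sentences:
        return 0.0
    total = sum(crossing_count(gold, found) for gold, found in sentences)
    return total / len(sentences)


gold = {(0, 2), (3, 5), (0, 5)}
sentences = [
    (gold, {(1, 4)}),  # overlaps (0, 2) and (3, 5) without nesting
    (gold, {(0, 2)}),  # matches a corpus bracket, no crossing
]
print(crossing_rate(sentences))
```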

[KD00] have done similar work for both NPs and VPs. They obtained similar results with more training data but without using lexical information. [Bra99] has reported NP bracketing results for German.

NP chunking and NP bracketing are two intermediate steps towards the goal of the TMR network Learning Computational Grammars: learning the structure of noun phrases.

Software and Data

ftp://ftp.cis.upenn.edu/pub/chunker/
The NP chunking data supplied by Lance Ramshaw and Mitch Marcus. Their NP chunker and their WVLC95 paper can also be obtained from this site. The data contains one word per line and each line contains six fields of which only the first two fields are relevant: the word and the part-of-speech tag assigned by the Brill tagger.
http://www.cnts.ua.ac.be/conll99/npb/data/
The supplementary data for the NP bracketing task. In order to create the complete data sets for this task you need to combine this data with the Ramshaw and Marcus chunking data. This directory contains a Perl script conll2evalb which can be used to convert the bracketing format to a format usable by the bracketing score program evalb. Only NP* and WHNP* phrases are used. All others, like NX and NAC, are ignored.
http://www.cnts.ua.ac.be/conll99/corpora/
A larger data set for this task. The training material consists of almost the complete WSJ corpus. This data set is password protected. Contact osborne@let.rug.nl for more information.
http://cs.nyu.edu/cs/projects/proteus/evalb/
A bracketing score program which can be used for evaluating the output of the bracketing program. Created by Satoshi Sekine and Michael John Collins.
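
As an illustration of the data description above, the following sketch reads the word and part-of-speech tag from each line of the chunking data and applies the NP*/WHNP* label filter. It rests on several assumptions that are not stated on this page: fields are separated by whitespace, sentences are separated by empty lines, and a label matches NP* or WHNP* when it starts with that prefix. The function names and the sample lines (including the filler values f3 to f6) are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

WordTag = Tuple[str, str]  # (word, part-of-speech tag)


def read_sentences(lines: Iterable[str]) -> Iterator[List[WordTag]]:
    """Yield sentences as lists of (word, POS tag) pairs.

    Only the first two fields of each line are kept; the remaining
    fields are ignored. Empty lines are assumed to end a sentence.
    """
    sentence: List[WordTag] = []
    for line in lines:
        fields = line.split()
        if not fields:
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(fields) < 2:
            raise ValueError(f"expected at least two fields: {line!r}")
        sentence.append((fields[0], fields[1]))
    if sentence:
        yield sentence


def is_scored_label(label: str) -> bool:
    """True for NP* and WHNP* labels; NX, NAC and all others are ignored."""
    return label.startswith("NP") or label.startswith("WHNP")


# Hypothetical sample input; f3..f6 stand in for the four ignored fields.
sample = [
    "He PRP f3 f4 f5 f6",
    "reckons VBZ f3 f4 f5 f6",
    "",
    "Stocks NNS f3 f4 f5 f6",
]
for sentence in read_sentences(sample):
    print(sentence)
```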

Related information

http://www.cnts.ua.ac.be/~erikt/research/np-chunking.html
Background information on NP chunking.
http://www.cnts.ua.ac.be/conll99/
Home page of the workshop on Computational Natural Language Learning (CoNLL-99).
http://www.cnts.ua.ac.be/lcg/
Home page of the TMR network - Learning Computational Grammars.

References

[Bra99]
Thorsten Brants, Cascaded Markov Models. In: "Proceedings of EACL'99", Association for Computational Linguistics, 1999.
http://xxx.lanl.gov/abs/cs.CL/9906009
[KD00]
Yuval Krymolowski and Ido Dagan, Incorporating Compositional Evidence in Memory-Based Partial Parsing. In: "Proceedings of ACL 2000", Hong Kong, 2000.
[Osb99]
Miles Osborne, "MDL-based DCG Induction for NP Identification", In: "Proceedings of CoNLL-99", Bergen, Norway, 1999.
http://www.cnts.ua.ac.be/lcg/ps/osborne.conll99.ps.gz
[Rij79]
C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
[RM95]
Lance A. Ramshaw and Mitchell P. Marcus, Text Chunking Using Transformation-Based Learning. In: "Proceedings of the Third ACL Workshop on Very Large Corpora", Association for Computational Linguistics, 1995.
ftp://ftp.cis.upenn.edu/pub/chunker/wvlcbook.ps.gz
[TKS99]
Erik F. Tjong Kim Sang, "Noun Phrase Detection by Repeated Chunking", talk presented at the NP Identification session of the CoNLL-99 workshop, Bergen, Norway, 1999.
http://www.cnts.ua.ac.be/~erikt/talks/conll99.sh.ps
[TKS00]
Erik F. Tjong Kim Sang, Noun Phrase Representation by System Combination. In: "Proceedings of ANLP-NAACL 2000", Seattle, WA, USA, 2000.
http://www.cnts.ua.ac.be/~erikt/papers/naacl2000.ps

Last update: May 08, 2005. erikt@uia.ua.ac.be