In 1995 Lance Ramshaw and Mitch Marcus put forward a standard data set
for NP chunking: recognizing non-overlapping text parts that consist of
noun phrases (NPs) [RM95].
These NPs were non-recursive and lacked post-modifying phrases such as
The recognition of these NPs would be a useful step towards a complete
Therefore we propose to expand this NP chunking data set and use it for
an extended task: NP bracketing, the recognition of all noun phrase
structures in a text.
The NP chunking data set put forward by Ramshaw and Marcus
consists of two parts: training data and test data.
Both parts have been extracted from the Wall Street Journal corpus
The training part contains sections 15-18 of this corpus and the test
part consists of section 20.
In order to make the data usable for the NP bracketing task, it needs
to be extended with an annotation for the complete noun phrase
This information can be extracted from the WSJ corpus as well.
The location of the chunking data files and the bracketing data
files can be found in the Software and Data section on this page.
The performance of the machine learning algorithm is measured with two
scores: precision and recall.
Precision is the percentage of noun phrases found by the algorithm that
are correct, and recall is the percentage of NPs defined
in the corpus that were found by the chunking program.
The machine learning algorithm is assumed to output a balanced
bracketing structure, which means that every opening bracket must
match a closing bracket and vice versa.
The two rates can be combined in one measure, the F rate, where
F = 2*precision*recall / (precision+recall) [Rij79].
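
The scores described above can be illustrated with a minimal sketch. It is not the evalb program and makes several assumptions: the bracketing is given as a list of tokens in which "[" and "]" mark NP boundaries, a found noun phrase counts as correct only if its span matches a corpus span exactly, and the function names bracket_spans, f_rate and score are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) word positions, end exclusive


def bracket_spans(tokens: List[str]) -> Set[Span]:
    """Collect the word spans enclosed by "[" ... "]" pairs.

    Raises ValueError when the bracketing is not balanced.
    """
    spans: Set[Span] = set()
    open_positions: List[int] = []
    word_index = 0
    for token in tokens:
        if token == "[":
            open_positions.append(word_index)
        elif token == "]":
            if not open_positions:
                raise ValueError("closing bracket without opening bracket")
            spans.add((open_positions.pop(), word_index))
        else:
            word_index += 1  # ordinary word: advance the word counter
    if open_positions:
        raise ValueError("opening bracket without closing bracket")
    return spans


def f_rate(precision: float, recall: float) -> float:
    """F = 2*precision*recall / (precision+recall); 0.0 if both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def score(gold: Set[Span], found: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) for exact span matches."""
    correct = len(gold & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_rate(precision, recall)


gold = bracket_spans("[ [ the man ] with [ the hat ] ]".split())
found = bracket_spans("[ the man ] with [ the ] hat".split())
precision, recall, f = score(gold, found)
print(precision, recall, f)
```

In this invented example the corpus defines three NPs and the output contains two, one of which is correct, so precision is 1/2, recall is 1/3 and the F rate is 0.4. Applied to percentages, f_rate(0.9000, 0.7838) gives 0.8379, so a precision of 90.00% and a recall of 78.38% combine to an F rate of 83.79.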
The following results have been reported for this data set:
| reference | CR | precision | recall || F ||
| [TKS00] | | 90.00% | 78.38% || 83.79 ||
| [TKS99] | 0.14 | 91.28% | 76.06% || 82.98 ||
[KD00] have done similar work for both NPs and VPs.
They obtained similar results with more training data but
without using lexical information.
[Bra99] has reported NP bracketing results for German.
NP chunking and NP bracketing are two intermediate steps to achieving
the goal of the TMR network Learning Computational Grammars, to learn
the structure of noun phrases.
Software and Data
The NP chunking data supplied by Lance Ramshaw and Mitch Marcus.
Their NP chunker and their WVLC95 paper can also be obtained from
The data contains one word per line and each line contains six
fields of which only the first two fields are relevant: the
word and the part-of-speech tag assigned by the Brill tagger.
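
A minimal sketch of reading this format is given below. It assumes that the six fields are separated by whitespace, which is not stated above, and it skips lines with fewer than two fields. The function name read_word_tags and the sample lines, including the placeholder values in fields three to six, are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_word_tags(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (word, part-of-speech tag) from one-word-per-line data."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) < 2:
            continue  # blank or malformed line: nothing to extract
        yield fields[0], fields[1]  # the remaining fields are ignored


# Invented sample; f3..f6 stand in for the four fields that are not used.
sample = [
    "The DT f3 f4 f5 f6",
    "",
    "man NN f3 f4 f5 f6",
]
pairs = list(read_word_tags(sample))
print(pairs)
```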
The supplement data for the NP bracketing task.
In order to create the complete data sets for this task you need
to combine this data with the Ramshaw and Marcus chunking data.
This directory contains a Perl script conll2evalb which can be
used to convert the bracketing format to a format usable by
the bracketing score program evalb.
Only NP* and WHNP* phrases are used.
All others, like NX and NAC, are ignored.
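
The phrase selection can be sketched as a simple label test. This is not taken from the conll2evalb script: it assumes that the * in NP* and WHNP* stands for any suffix of the phrase label, and the function name is_np_label is hypothetical.

```python
def is_np_label(label: str) -> bool:
    """True for labels that start with NP or WHNP, False otherwise."""
    # Assumption: "NP*" and "WHNP*" match any label with that prefix.
    return label.startswith(("NP", "WHNP"))


labels = ["NP", "NP-SBJ", "WHNP-1", "NX", "NAC", "VP"]
kept = [label for label in labels if is_np_label(label)]
print(kept)
```

With this test NX and NAC are dropped because neither label starts with NP or WHNP.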
A larger data set for this task.
The training material consists of almost the complete WSJ corpus.
This data set is password protected.
for more information.
A bracketing score program which can be used for evaluating the
output of the bracketing program.
Created by Satoshi Sekine and Michael John Collins.
Cascaded Markov Models,
In: "Proceedings of EACL'99", Association for Computational Linguistics,
Yuval Krymolowski and Ido Dagan,
Incorporating Compositional Evidence in Memory-Based Partial Parsing.
In: "Proceedings of ACL 2000", Hong Kong, 2000.
Miles Osborne, "MDL-based DCG Induction for NP Identification",
In: "Proceedings of CoNLL-99", Bergen, Norway, 1999.
C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
Lance A. Ramshaw and Mitchell P. Marcus,
Text Chunking Using Transformation-Based Learning.
In: "Proceedings of the Third ACL Workshop on Very Large Corpora",
Association for Computational Linguistics, 1995.
Erik F. Tjong Kim Sang,
"Noun Phrase Detection by Repeated Chunking",
talk presented at the NP Identification session of the CoNLL-99
workshop, Bergen, Norway, 1999.
Erik F. Tjong Kim Sang,
Noun Phrase Representation by System Combination.
In "Proceedings of ANLP-NAACL 2000", Seattle, WA, USA, 2000.
Last update: May 08, 2005.