Language-Independent Named Entity Recognition (I)

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in [ORG Real Madrid ] .

The shared task of CoNLL-2002 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task will be offered training and test data for at least two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. Information sources other than the training data may be used in this shared task. We are especially interested in methods that can use additional unannotated data for improving their performance (for example co-training).

Background information

Named Entity Recognition (NER) is a subtask of Information Extraction. Different NER systems were evaluated as a part of the Sixth Message Understanding Conference in 1995 (MUC6). The target language was English. The participating systems performed well. However, many of them used language-specific resources for performing the task and it is unknown how they would have performed on another language than English [PD97].

After 1995 NER systems have been developed for some European languages and a few Asian languages. There have been at least two studies that have applied one NER system to different languages. Palmer and Day [PD97] have used statistical methods for finding named entities in newswire articles in Chinese, English, French, Japanese, Portuguese and Spanish. They found that the difficulty of the NER task was different for the six languages but that a large part of the task could be performed with simple methods. Cucerzan and Yarowsky [CY99] used both morphological and contextual clues for identifying named entities in English, Greek, Hindi, Rumanian and Turkish. With minimal supervision, they obtained overall F measures between 40 and 70, depending on the languages used.

Software and Data

The data consists of two columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word and the second the named entity tag. The tags have the same format as in the chunking task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). Here is an example:

        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O

The data consists of three files per language: one training file and two test files testa and testb. The first test file will be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. Currently there are data files available for two languages: Spanish and Dutch.

http://www.cnts.ua.ac.be/conll2002/ner.tgz
The data sets and evaluation software for this shared task in one gzipped tar file. You can also retrieve these files one by one: data and software.
Xavier Carreras provides the Spanish data sets with part of speech tags (20030803)
http://www.cnts.ua.ac.be/conll2000/chunking/output.html
Output example of the evaluation script for this shared task: conlleval. The example deals with text chunking, a task which uses the same output format as this named entity task. The output of the NER system for each word should be appended behind each line, with a single space between the line and the output tag.

The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000. The annotation was carried out by the TALP Research Center of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).

The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data was annotated as a part of the Atranos project at the University of Antwerp.

If your have data for other languages that you would like to make available, please contact erikt@uia.ua.ac.be We are interested in tokenized text files containing about 250,000 words for which the named entities (names of persons, organizations, locations and other) have been marked up.

Results

Twelve systems have participated in the CoNLL-2002 shared task. They used a wide variety of machine learning techniques. Here is an overview of their performance on the two test data sets:

              +-----------+-----------++-----------++
     Spanish  | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CMP02]  |   81.38%  |   81.40%  ||   81.39   || ±1.5
   | [Flo02]  |   78.70%  |   79.40%  ||   79.05   || ±1.4
   | [CY02]   |   78.19%  |   76.14%  ||   77.15   || ±1.4
   | [WNC02]  |   75.85%  |   77.38%  ||   76.61   || ±1.4
   | [BHM02]  |   74.19%  |   77.44%  ||   75.78   || ±1.4
   | [Tjo02]  |   76.00%  |   75.55%  ||   75.78   || ±1.5
   | [PWM02]  |   74.32%  |   73.52%  ||   73.92   || ±1.5
   | [Jan02]  |   74.03%  |   73.76%  ||   73.89   || ±1.5
   | [Mal02]  |   73.93%  |   73.39%  ||   73.66   || ±1.6
   | [Tsu02]  |   69.04%  |   74.12%  ||   71.49   || ±1.4
   | [BV02]   |   60.53%  |   67.29%  ||   63.73   || ±1.8
   | [MM02]   |   56.28%  |   66.51%  ||   60.97   || ±1.7
   +----------+-----------+-----------++-----------++
   | baseline |   26.27%  |   56.48%  ||   35.86   || ±1.3
   +----------+-----------+-----------++-----------++

              +-----------+-----------++-----------++
     Dutch    | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CMP02]  |   77.83%  |   76.29%  ||   77.05   || ±1.5
   | [WNC02]  |   76.95%  |   73.83%  ||   75.36   || ±1.6
   | [Flo02]  |   75.10%  |   74.89%  ||   74.99   || ±1.5
   | [BHM02]  |   72.69%  |   72.45%  ||   72.57   || ±1.4
   | [CY02]   |   73.03%  |   71.62%  ||   72.31   || ±1.6
   | [PWM02]  |   74.01%  |   68.90%  ||   71.36   || ±1.6
   | [Tjo02]  |   72.56%  |   68.88%  ||   70.67   || ±1.6
   | [Jan02]  |   70.11%  |   69.26%  ||   69.68   || ±1.7
   | [Mal02]  |   70.88%  |   65.50%  ||   68.08   || ±1.9
   | [Tsu02]  |   57.33%  |   65.02%  ||   60.93   || ±1.7
   | [MM02]   |   56.22%  |   63.24%  ||   59.52   || ±2.0
   | [BV02]   |   51.89%  |   47.78%  ||   49.75   || ±2.2
   +----------+-----------+-----------++-----------++
   | baseline |   81.29%  |   45.42%  ||   58.28   || ±1.4
   +----------+-----------+-----------++-----------++

Here are some remarks on these results:

The baseline results have been produced by a system which only selects complete named entities which appear in the training data.
The column to the right of the F rates shows estimations of the significance intervals for the F rates. They have been obtained with bootstrap resampling [Nor89]. F rates outside of these intervals are assumed to be significantly different from the related F rate (p<0.05).
The results of [BV02] mentioned here are different from those listed in the related paper because the latter were produced by incorrect software.
The baseline numbers for Dutch mentioned in the introduction paper to this shared task have been generated from data sets with errors and are thus different from those mentioned in the table here.
The system of [BV02] performs worse than the baseline system when processing the Dutch data because the authors used a poor representation of the data. They had removed all sentence breaks.

The system of Xavier Carreras, Luís Màrquez and Luís Padró [CMP02] outperformed all other systems by a significant margin, both on the Spanish test data (81.39) and the Dutch test data (77.05). It should be noted that they have used some additional information beside the training data in the Spanish experiment. Without this additional information their system (79.28) does not perform significantly better than that of [Flo02] (79.05). The [CMP02] system uses AdaBoost applied to decision trees.

The papers associated with the participating systems can be found in the reference section below.

Related information

http://cnts.uia.ac.be/conll2003/ner/
The CoNLL-2003 shared task deals with Language-Independent Named Entity Recognition as well.
http://cnts.uia.ac.be/conll2002/
Home page of the workshop on Computational Natural Language Learning (CoNLL-2002) of which this shared task is part of.
http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
Home page of the Sixth Message Understanding Conference (1995) that introduced named entity recognition as shared task.
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
Home page of the Seventh Message Understanding Conference (1998) which contained a named entity recognition as shared task.
http://www.nist.gov/speech/tests/ie-er/er_99/er_99.htm
Home page of the 1999 DARPA-TIDES Information Extraction-Entity Recognition (IE-ER) technology evaluation project, which contained a named entity recognition task.
http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/met.htm
Information on the Multilingual Entity Task Conference (MET) which contained named entity recognition for Chinese, Japanese and Spanish (overview).
http://www.calle.com/world/
List of about 2.8 million locations on Earth.

References

This is a list of papers that are relevant for this task.

CoNLL-2002 Shared Task Papers

Note: in some cases the output files provided here contain results which are slightly different from those mentioned in the papers.

[TKS02]
Erik F. Tjong Kim Sang, Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 155-158.
paper: [ps] [ps.gz] [pdf] [bibtex]
sheets: [ps] [ps.gz] [pdf]
[BV02]
William J. Black and Argyrios Vasilakopoulos, Language-Independent Named Entity Classification by Modified Transformation-Based Learning and by Decision Tree Induction. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 159-162.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[BHM02]
John D. Burger, John C. Henderson and William T. Morgan, Statistical Named Entity Recognizer Adaptation. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 163-166.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[CMP02]
Xavier Carreras, Lluís Márques and Lluís Padró, Named Entity Extraction using AdaBoost In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 167-170.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[CY02]
Silviu Cucerzan and David Yarowsky, Language Independent NER using a Unified Model of Internal and Contextual Evidence. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 171-174.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[Flo02]
Radu Florian, Named Entity Recognition as a House of Cards: Classifier Stacking. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 175-178.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[Jan02]
Martin Jansche, Named Entity Extraction with Conditional Markov Models and Classifiers. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 179-182.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[Mal02]
Robert Malouf, Markov models for language-independent named entity recognition. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 187-190.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[MM02]
Paul McNamee and James Mayfield, Entity Extraction Without Language-Specific Resources. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 183-186.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[PWM02]
Jon Patrick, Casey Whitelaw and Robert Munro, SLINERC: The Sydney Language-Independent Named Entity Recogniser and Classifier. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 199-202.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[Tjo02]
Erik F. Tjong Kim Sang, Memory-Based Named Entity Recognition. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 203-206.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[Tsu02]
Koji Tsukamoto, Yutaka Mitsuishi and Manabu Sassano, Learning with Multiple Stacking for Named Entity Recognition. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 191-194.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]
[WNC02]
Dekai Wu, Grace Ngai, Marine Carpuat, Jeppe Larsen and Yongsheng Yang, Boosting for Named Entity Recognition. In: Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 195-198.
paper: [ps] [ps.gz] [pdf] [bibtex]
system output: [tgz] [files]

Other related publications

A paper that is related to the topic of this shared task is the EMNLP-99 paper by Cucerzan and Yarowsky [CY99]. Interesting papers about using unsupervised data, though not for the NER task, are those of Mitchell [Mit99] and Banko and Brill [BB01].

[BB01]
Michele Banko and Eric Brill, Scaling to Very Very Large Corpora for Natural Language Disambiguation. In Proceedings of ACL 2001, Toulouse, France, 2001, pp. 26-33.
http://www.research.microsoft.com/users/mbanko/ACL2001VeryVeryLargeCorpora.pdf
[Bor99]
Andrew Borthwick, A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, New York University, 1999.
http://cs.nyu.edu/cs/projects/proteus/publication/papers/borthwick_thesis.ps
[BV00]
Sabine Buchholz and Antal van den Bosch, Integrating seed names and n-grams for a named entity list and classifier, In: Proceedings of LREC-2000, Athens, Greece, June 2000, pp. 1215-1221.
http://ilk.kub.nl/downloads/pub/papers/ilk.0002.ps.gz
[CBFR99]
Nancy Chinchor, Erica Brown, Lisa Ferro and Patty Robinson, 1999 Named Entity Recognition Task Definition, MITRE, 1999.
http://www.nist.gov/speech/tests/ie-er/er_99/doc/ne99_taskdef_v1_4.pdf
[CS99]
Michael Collins and Yoram Singer, Unsupervised models for named entity classification. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, University of Maryland, MD, 1999.
http://citeseer.nj.nec.com/collins99unsupervised.html
[CY99]
Silviu Cucerzan and David Yarowsky, Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of 1999 Joint SIGDAT Conference on EMNLP and VLC, University of Maryland, MD, 1999.
http://citeseer.nj.nec.com/cucerzan99language.html
[Mit99]
Tom M. Mitchell, The Role of Unlabeled Data in Supervised Learning. In Proceedings of the Sixth International Colloquium on Cognitive Science, San Sebastian, Spain, 1999.
http://citeseer.nj.nec.com/mitchell99role.html
[MMG99]
Andrei Mikheev, Marc Moens and Claire Grover, Named Entity Recognition without Gazetteers, In Proceedings of EACL'99, Bergen, Norway, 1999, pp. 1-8.
http://www.ltg.ed.ac.uk/~mikheev/papers_my/eacl99.ps
[Nor89]
Eric W. Noreen, Computer-Intensive Methods for Testing Hypotheses John Wiley & Sons, 1989.
[PD97]
David D. Palmer and David S. Day, A Statistical Profile of the Named Entity Task. In Proceedings of Fifth ACL Conference for Applied Natural Language Processing (ANLP-97), Washington D.C., 1997
http://crow.ee.washington.edu/people/palmer/papers/anlp97.ps

Last update: May 08, 2005. erikt@uia.ua.ac.be