Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:
[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in [ORG Real Madrid ] .
The shared task of CoNLL-2002 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task will be offered training and test data for at least two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. Information sources other than the training data may be used in this shared task. We are especially interested in methods that can use additional unannotated data for improving their performance (for example co-training).
Named Entity Recognition (NER) is a subtask of Information Extraction. Different NER systems were evaluated as a part of the Sixth Message Understanding Conference in 1995 (MUC6). The target language was English. The participating systems performed well. However, many of them used language-specific resources for performing the task, and it is unknown how they would have performed on a language other than English [PD97].
Since 1995, NER systems have been developed for some European languages and a few Asian languages. At least two studies have applied a single NER system to different languages. Palmer and Day [PD97] used statistical methods for finding named entities in newswire articles in Chinese, English, French, Japanese, Portuguese and Spanish. They found that the difficulty of the NER task differed across the six languages but that a large part of the task could be performed with simple methods. Cucerzan and Yarowsky [CY99] used both morphological and contextual clues for identifying named entities in English, Greek, Hindi, Romanian and Turkish. With minimal supervision, they obtained overall F measures between 40 and 70, depending on the languages used.
The data consists of two columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word and the second is the named entity tag. The tags have the same format as in the chunking task: a B denotes the first item of a phrase and an I any non-initial word. Words tagged with O are outside of named entities. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). Here is an example:
        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O
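The format above is straightforward to read programmatically. What follows is a minimal Python sketch, not part of the shared task software. The names read_sentences and extract_entities are hypothetical, and the code assumes that every non-empty line holds exactly one word and one tag separated by whitespace. Treating an I- tag with no matching open phrase as the start of a new phrase is also an assumption of this sketch.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]        # (word, tag)
Entity = Tuple[str, int, int]  # (type, start index, end index exclusive)


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group "word tag" lines into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split()  # assumes exactly two columns
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Entity]:
    """Convert B-/I-/O tags into (type, start, end) phrase spans."""
    entities: List[Entity] = []
    start, etype = None, None
    for i, (_, tag) in enumerate(sentence):
        # An I- tag of the same type continues the currently open phrase.
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        # Any other tag closes the open phrase, if there is one.
        if etype is not None:
            entities.append((etype, start, i))
            start, etype = None, None
        # B- opens a new phrase; a stray I- is treated the same way (assumption).
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    if etype is not None:
        entities.append((etype, start, len(sentence)))
    return entities


sample = """Del B-PER
Bosque I-PER
in O
Real B-ORG
Madrid I-ORG
. O

Wolff B-PER
"""

sentences = read_sentences(sample.splitlines())
spans = [extract_entities(s) for s in sentences]
print(spans)  # [[('PER', 0, 2), ('ORG', 3, 5)], [('PER', 0, 1)]]
```

Spans use an exclusive end index, so the words of a phrase are sentence[start:end].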
The data consists of three files per language: one training file and two test files testa and testb. The first test file will be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. Currently there are data files available for two languages: Spanish and Dutch.
The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000. The annotation was carried out by the TALP Research Center of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).
The Dutch data consists of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data was annotated as a part of the Atranos project at the University of Antwerp.
If you have data for other languages that you would like to make available, please contact erikt@uia.ua.ac.be. We are interested in tokenized text files containing about 250,000 words for which the named entities (names of persons, organizations, locations and other) have been marked up.
Twelve systems participated in the CoNLL-2002 shared task. They used a wide variety of machine learning techniques. Here is an overview of their performance on the final test data (testb) for the two languages:
              +-----------+-----------++-----------++
     Spanish  | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CMP02]  |   81.38%  |   81.40%  ||   81.39   || ±1.5
   | [Flo02]  |   78.70%  |   79.40%  ||   79.05   || ±1.4
   | [CY02]   |   78.19%  |   76.14%  ||   77.15   || ±1.4
   | [WNC02]  |   75.85%  |   77.38%  ||   76.61   || ±1.4
   | [BHM02]  |   74.19%  |   77.44%  ||   75.78   || ±1.4
   | [Tjo02]  |   76.00%  |   75.55%  ||   75.78   || ±1.5
   | [PWM02]  |   74.32%  |   73.52%  ||   73.92   || ±1.5
   | [Jan02]  |   74.03%  |   73.76%  ||   73.89   || ±1.5
   | [Mal02]  |   73.93%  |   73.39%  ||   73.66   || ±1.6
   | [Tsu02]  |   69.04%  |   74.12%  ||   71.49   || ±1.4
   | [BV02]   |   60.53%  |   67.29%  ||   63.73   || ±1.8
   | [MM02]   |   56.28%  |   66.51%  ||   60.97   || ±1.7
   +----------+-----------+-----------++-----------++
   | baseline |   26.27%  |   56.48%  ||   35.86   || ±1.3
   +----------+-----------+-----------++-----------++
              +-----------+-----------++-----------++
     Dutch    | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CMP02]  |   77.83%  |   76.29%  ||   77.05   || ±1.5
   | [WNC02]  |   76.95%  |   73.83%  ||   75.36   || ±1.6
   | [Flo02]  |   75.10%  |   74.89%  ||   74.99   || ±1.5
   | [BHM02]  |   72.69%  |   72.45%  ||   72.57   || ±1.4
   | [CY02]   |   73.03%  |   71.62%  ||   72.31   || ±1.6
   | [PWM02]  |   74.01%  |   68.90%  ||   71.36   || ±1.6
   | [Tjo02]  |   72.56%  |   68.88%  ||   70.67   || ±1.6
   | [Jan02]  |   70.11%  |   69.26%  ||   69.68   || ±1.7
   | [Mal02]  |   70.88%  |   65.50%  ||   68.08   || ±1.9
   | [Tsu02]  |   57.33%  |   65.02%  ||   60.93   || ±1.7
   | [MM02]   |   56.22%  |   63.24%  ||   59.52   || ±2.0
   | [BV02]   |   51.89%  |   47.78%  ||   49.75   || ±2.2
   +----------+-----------+-----------++-----------++
   | baseline |   81.29%  |   45.42%  ||   58.28   || ±1.4
   +----------+-----------+-----------++-----------++
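The F column in both tables is consistent with the balanced F measure, the harmonic mean of precision and recall. The text here does not spell out the formula, so the Python sketch below should be read as an assumption rather than the official scoring script. The function names are hypothetical, and counting a predicted phrase as correct only when its type and boundaries both match the gold annotation exactly is likewise an assumption of this sketch.

```python
from typing import Set, Tuple

# (sentence index, entity type, start index, end index exclusive)
Span = Tuple[int, str, int, int]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F: harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Phrase-level precision and recall in percent, using exact span matches."""
    correct = len(gold & predicted)
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall


# Reproduce two F values from the Spanish table.
print(round(f_measure(81.38, 81.40), 2))  # 81.39 ([CMP02])
print(round(f_measure(26.27, 56.48), 2))  # 35.86 (baseline)

# Toy example: the third predicted phrase has the wrong right boundary.
gold = {(0, "PER", 0, 1), (0, "LOC", 6, 7), (0, "PER", 10, 12), (0, "ORG", 20, 22)}
predicted = {(0, "PER", 0, 1), (0, "LOC", 6, 7), (0, "PER", 10, 11)}
p, r = precision_recall(gold, predicted)
```

In the toy example, two of the three predicted phrases are correct and two of the four gold phrases are found, so a partially matched name earns no credit under this scheme.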
Here are some remarks on these results:
The system of Xavier Carreras, Lluís Màrquez and Lluís Padró [CMP02] outperformed all other systems by a significant margin, both on the Spanish test data (81.39) and the Dutch test data (77.05). It should be noted that they used some additional information besides the training data in the Spanish experiment. Without this additional information, their system (79.28) does not perform significantly better than that of [Flo02] (79.05). The [CMP02] system uses AdaBoost applied to decision trees.
The papers associated with the participating systems can be found in the reference section below.
This is a list of papers that are relevant for this task.
Note: in some cases the output files provided here contain results which are slightly different from those mentioned in the papers.
A paper that is related to the topic of this shared task is the EMNLP-99 paper by Cucerzan and Yarowsky [CY99]. Interesting papers about using unsupervised data, though not for the NER task, are those of Mitchell [Mit99] and Banko and Brill [BB01].