This is an output example for the Perl script conlleval, which can be used for measuring the performance of a system that has processed the CoNLL-2000 shared task data. The input of this script should consist of lines similar to the shared task data files. Here is an example:
Boeing NNP B-NP B-NP
's POS B-NP B-NP
747 CD I-NP I-NP
jetliners NNS I-NP I-NP
. . O O

Rockwell NNP B-NP B-NP
said VBD B-VP B-VP
the DT B-NP B-NP
agreement NN I-NP I-NP
Each line contains four symbols: the current word, its part-of-speech tag (POS), the chunk tag according to the corpus and the predicted chunk tag. Sentences have been separated by empty lines. We have processed the file output.txt with the evaluation script as:
conlleval < output.txt
and the result was:
processed 961 tokens with 459 phrases; found: 539 phrases; correct: 371.
accuracy:  87.20%; precision:  68.83%; recall:  80.83%; FB1:  74.35
             ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
             ADVP: precision:  45.45%; recall:  62.50%; FB1:  52.63  11
               NP: precision:  64.98%; recall:  78.63%; FB1:  71.16  317
               PP: precision:  83.18%; recall:  98.89%; FB1:  90.36  107
             SBAR: precision:  66.67%; recall:  33.33%; FB1:  44.44  3
               VP: precision:  69.00%; recall:  79.31%; FB1:  73.80  100
The program counted 961 tokens (words and punctuation signs) with 459 phrases according to the corpus. The chunking software found 539 phrases, of which 371 were correct. The evaluation script reports the percentage of tokens that received the correct chunk tag (accuracy; this number should be ignored), followed by the precision, recall and FB1 rates, both overall and for each of the six chunk types in the data. The number that follows the FB1 score of each chunk type is the number of phrases of that type found by the chunker; these numbers add up to 539. Only chunk types that are present in the corpus data or in the predicted chunks are mentioned in the evaluation overview. The zero scores for the ADJP type arise because there are six ADJP phrases in this data and none of them was detected by the chunker: the single ADJP phrase it proposed was incorrect.
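The overall rates in this output can be rechecked from the three phrase counts. The following is a minimal sketch in Python, not part of conlleval itself; the function name prf and its parameter names are chosen here for illustration only. It assumes the usual definitions: precision is correct/found, recall is correct/corpus phrases, and FB1 is the harmonic mean of the two.

```python
def prf(correct, found, gold):
    """Return (precision, recall, FB1) as percentages.

    correct: phrases found by the chunker that match the corpus
    found:   all phrases proposed by the chunker
    gold:    all phrases in the corpus
    """
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    fb1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, fb1


# Overall counts taken from the example output.
p, r, f = prf(correct=371, found=539, gold=459)
print(f"precision: {p:.2f}%; recall: {r:.2f}%; FB1: {f:.2f}")
# prints: precision: 68.83%; recall: 80.83%; FB1: 74.35
```

The guards against zero denominators cover cases like ADJP here, where no proposed phrase is correct and all three rates are reported as 0.00.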
In case you work with data files in which sentences have been separated by empty lines, please make sure that you have kept these empty lines in the input of the evaluation program conlleval. If you remove the empty lines then the program may not observe the end of a sentence-final clause or the beginning of a sentence-initial clause. Because of this a sentence-final clause in combination with a sentence-initial clause of the same type will be regarded as one big clause. The evaluation program needs the empty lines between the sentences to know where such clauses end and where they begin.
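The effect of a missing empty line can be illustrated with a small span extractor. This is a simplified sketch in Python and not the logic that conlleval actually uses; the function extract_chunks is hypothetical. It assumes a tagging scheme in which a span may begin with an I- tag when it does not continue a span of the same type, and it represents the empty line between sentences as an empty string.

```python
def extract_chunks(tags):
    """Collect (type, start, end) spans from a flat list of chunk tags.

    An empty string marks a sentence boundary (the empty line).
    A span starts at a B- tag, or at an I- tag that does not continue
    a span of the same type. It ends at O, at a boundary, or where a
    new span starts.
    """
    chunks = []
    current_type, start = None, None
    for i, tag in enumerate(tags + [""]):  # sentinel closes the last span
        prefix, _, ctype = tag.partition("-")
        inside = prefix == "I" and ctype == current_type
        if current_type is not None and not inside:
            chunks.append((current_type, start, i - 1))
            current_type = None
        if prefix in ("B", "I") and not inside:
            current_type, start = ctype, i
    return chunks


# Sentence 1 ends inside an NP, sentence 2 starts with an NP.
with_break = ["B-NP", "I-NP", "", "I-NP", "O"]
without_break = ["B-NP", "I-NP", "I-NP", "O"]

print(extract_chunks(with_break))     # [('NP', 0, 1), ('NP', 3, 3)]
print(extract_chunks(without_break))  # [('NP', 0, 2)]
```

With the boundary in place, two spans are found. Once the empty line is removed, the sentence-final and sentence-initial spans of the same type are read as one.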
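conlleval recognizes the following command line options:
-l: generate LaTeX output
-d char: use char as the delimiter instead of white space
-r: accept raw result tags (without B- and I- prefixes; assumes one word per chunk)
-o token: use token as the tag for items that are outside of chunks (default: O)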
Example: if your output file MYFILE looks like this:
0,1,1,0,0,1,giraffe,giraffe
0,1,0,1,0,1,zebra,giraffe
0,0,0,0,0,1,NOEXIST,NOEXIST
then you have commas rather than white space between tokens, you are using neither B- prefixes nor I- prefixes, and you use the token NOEXIST as the outside tag. In that case you should apply the software to the file in the following way:
conlleval -d , -r -o NOEXIST < MYFILE
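To make the meaning of these three options concrete, here is a rough Python sketch of how such a line could be interpreted. It is an illustration under stated assumptions, not conlleval's own code: the function read_tags and its parameters are hypothetical, the last two fields of a line are taken to be the corpus tag and the predicted tag, and each raw tag other than the outside tag is treated as a chunk of one word.

```python
def read_tags(line, delimiter=None, outside="O", raw=False):
    """Return (corpus_tag, predicted_tag) for one input line.

    delimiter: field separator; None means any white space
    outside:   tag used for items that are outside of chunks
    raw:       tags carry no B-/I- prefix (one word per chunk assumed)
    """
    fields = line.strip().split(delimiter)
    corpus, predicted = fields[-2], fields[-1]

    def normalise(tag):
        if tag == outside:
            return "O"
        # In raw mode every tagged word is treated as its own chunk.
        return "B-" + tag if raw else tag

    return normalise(corpus), normalise(predicted)


lines = [
    "0,1,1,0,0,1,giraffe,giraffe",
    "0,1,0,1,0,1,zebra,giraffe",
    "0,0,0,0,0,1,NOEXIST,NOEXIST",
]
for line in lines:
    print(read_tags(line, delimiter=",", outside="NOEXIST", raw=True))
# ('B-giraffe', 'B-giraffe')
# ('B-zebra', 'B-giraffe')
# ('O', 'O')
```

With the default arguments the same function reads the white-space separated format shown at the top of this page, for example read_tags("Boeing NNP B-NP B-NP") gives ('B-NP', 'B-NP').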