Clauses are word sequences which contain a subject and a predicate. Here is an example of a sentence and its clauses obtained from Wall Street Journal section 15 of the Penn Treebank [MSM93]:
(S The deregulation of railroads and trucking companies (SBAR that (S began in 1980) ) enabled (S shippers to bargain for transportation) . )
The clauses of this sentence have been enclosed between brackets. A tag next to the open bracket denotes the type of the clause.
In the CoNLL-2001 shared task, the goal is to identify clauses in text. Training and test data for this task are available. This data consists of the same partitions of the Wall Street Journal part (WSJ) of the Penn Treebank as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The clause segmentation of the data has been derived from the Penn Treebank by a program written by Sabine Buchholz from Tilburg University, The Netherlands.
The shared task consists of three parts: identifying clause start positions, recognizing clause end positions and building complete clauses. We have not used clauses labeled with FRAG or RRC, and all clause labels have been converted to S. The goal of this task is to develop machine learning methods which, after a training phase, can recognize the clause segmentation of the test data as well as possible. For all three parts of the shared task, the clause segmentation methods will be evaluated with the F rate, which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [Rij79].
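The F rate formula above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code; the function name `f_rate` and the zero-division guard are assumptions made for the example. The sample values are the precision and recall reported for [CM01] on the original test data.

```python
def f_rate(precision: float, recall: float) -> float:
    """Combine precision and recall: F = 2*precision*recall / (recall+precision)."""
    if precision + recall == 0:
        # Assumption for this sketch: define F as 0 when both rates are 0.
        return 0.0
    return 2 * precision * recall / (recall + precision)


# Precision and recall of [CM01] on part 3, given as percentages.
print(round(f_rate(84.82, 73.28), 2))  # prints 78.63
```

Because the published figures are rounded, recomputing F from a rounded precision and recall can differ from a published F value in the last digit.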
There have been some earlier studies in identifying clauses. [Abn90] used a clause filter as a part of his CASS parser. It consists of two parts: one for recognizing basic clauses and one for repairing difficult cases (clauses without subjects and clauses with additional VPs). [Eje96] showed that a parser can benefit from automatically identified clause boundaries in discourse. [Lef98] built a rule-based algorithm for finding clauses in English and Portuguese texts. [Ora00] used memory-based learning techniques for finding clauses in the Susanne corpus. His system included a rule-based post-processing phase for improving clause recognition performance.
The training and test data consist of four columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second a part-of-speech tag derived by the Brill tagger, the third a chunk tag generated by a chunker [TKS00] and the fourth a corresponding clause tag extracted from the Penn Treebank. The chunk tags contain two parts: one stating whether the word is chunk initial (B) or not (I), and one holding the type of the chunk (NP, VP, PP, etcetera). There are two varieties of clause tags: one for the start/end parts, which uses S (start), E (end) and X (neither), and one for the complete clause segmentation, which contains (S* (start), *S) (end), * (neither) and several combinations such as (S(S*S) (two clause starts and one clause end). Here is an example:
The            DT  B-NP S/X/(S*
deregulation   NN  I-NP X/X/*
of             IN  B-PP X/X/*
railroads      NNS B-NP X/X/*
and            CC  O    X/X/*
trucking       NN  B-NP X/X/*
companies      NNS I-NP X/X/*
that           WDT B-NP S/X/(S*
began          VBD B-VP S/X/(S*
in             IN  B-PP X/X/*
1980           CD  B-NP X/E/*S)S)
enabled        VBD B-VP X/X/*
shippers       NNS B-NP S/X/(S*
to             TO  B-VP X/X/*
bargain        VB  I-VP X/X/*
for            IN  B-PP X/X/*
transportation NN  B-NP X/E/*S)
.              .   O    X/E/*S)
In this example, the fourth column contains the clause tags for parts 1, 2 and 3 of the shared task, separated by slashes. In the third column, the O chunk tag is used for tokens which are not part of any chunk.
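As an illustration of the format described above, the following Python sketch reads the four-column data and turns the part 3 tags of a sentence into clause spans. It is a sketch under stated assumptions, not part of the shared task software: the names `read_sentences` and `clause_spans` are hypothetical, and the input is assumed to carry the three clause tags separated by slashes, as in the example.

```python
from typing import List, Tuple

Token = Tuple[str, str, str, str]  # word, POS tag, chunk tag, clause tag


def read_sentences(text: str) -> List[List[Token]]:
    """Split four-column data into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, clause = fields
        current.append((word, pos, chunk, clause))
    if current:
        sentences.append(current)
    return sentences


def clause_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert part 3 tags such as '(S*' or '*S)S)' into (start, end) token indices."""
    spans: List[Tuple[int, int]] = []
    open_starts: List[int] = []
    for i, tag in enumerate(tags):
        before, _, after = tag.partition("*")
        for _ in range(before.count("(S")):   # each "(S" opens a clause here
            open_starts.append(i)
        for _ in range(after.count("S)")):    # each "S)" closes the innermost open clause
            if not open_starts:
                raise ValueError("clause end without a matching start")
            spans.append((open_starts.pop(), i))
    if open_starts:
        raise ValueError("clause start without a matching end")
    return sorted(spans)


SAMPLE = """\
The DT B-NP S/X/(S*
deregulation NN I-NP X/X/*
of IN B-PP X/X/*
railroads NNS B-NP X/X/*
and CC O X/X/*
trucking NN B-NP X/X/*
companies NNS I-NP X/X/*
that WDT B-NP S/X/(S*
began VBD B-VP S/X/(S*
in IN B-PP X/X/*
1980 CD B-NP X/E/*S)S)
enabled VBD B-VP X/X/*
shippers NNS B-NP S/X/(S*
to TO B-VP X/X/*
bargain VB I-VP X/X/*
for IN B-PP X/X/*
transportation NN B-NP X/E/*S)
. . O X/E/*S)
"""

sentences = read_sentences(SAMPLE)
# The third slash-separated field of the clause column holds the part 3 tag.
part3_tags = [clause.split("/")[2] for _, _, _, clause in sentences[0]]
print(clause_spans(part3_tags))  # prints [(0, 17), (7, 10), (8, 10), (12, 16)]
```

The four spans correspond to the whole sentence, the SBAR clause "that began in 1980", the embedded clause "began in 1980" and the clause "shippers to bargain for transportation". A stack of open start positions is used because clauses nest but never cross.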
There are two evaluation programs (Perl) available: one for parts 1 and 2 (conlleval1) and one for part 3 (conlleval3). The input of the programs should consist of a file which is the same as the test data but which contains an additional final column holding the results that should be evaluated. The programs should be invoked as conlleval1 < file
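Preparing such an input file amounts to appending one predicted tag to every token line of the test data. The sketch below shows one way to do this in Python. The function name `add_prediction_column` and the sample lines are hypothetical, and it is an assumption of this example that each test line ends in a single gold clause tag.

```python
from typing import Iterable, List


def add_prediction_column(lines: Iterable[str], predictions: Iterable[str]) -> List[str]:
    """Append one predicted tag to each token line; keep empty lines as sentence breaks."""
    predicted = iter(predictions)
    output: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            output.append("")                      # sentence boundary, no tag consumed
        else:
            output.append(f"{line} {next(predicted)}")
    if next(predicted, None) is not None:
        raise ValueError("more predictions than token lines")
    return output


# Hypothetical part 1 lines (word, POS tag, chunk tag, gold start tag).
test_lines = ["The DT B-NP S", "deregulation NN I-NP X", ""]
for row in add_prediction_column(test_lines, ["S", "X"]):
    print(row)
```

The resulting lines can be written to a file and passed to the evaluation program on standard input, as in the invocation shown above.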
Six systems have participated in the CoNLL-2001 shared task. They used a wide variety of machine learning techniques. Here is an overview of their performance on the test data of part 3 of the shared task (full clause identification), together with other results (*) for this data set that were published after the workshop:
           +-----------+-----------++-----------++
 test org  | precision |   recall  ||     F     ||
+----------+-----------+-----------++-----------++
| [CMPR02] |   90.18%  |   72.59%  ||   80.44   || (*)
| [CM01]   |   84.82%  |   73.28%  ||   78.63   ||
| [MP01]   |   70.89%  |   65.57%  ||   68.12   ||
| [TKS01]  |   76.91%  |   60.61%  ||   67.79   ||
| [PG01]   |   73.75%  |   60.00%  ||   66.17   ||
| [Dej01]  |   72.56%  |   54.55%  ||   62.27   ||
| [Ham01]  |   55.81%  |   45.99%  ||   50.42   ||
+----------+-----------+-----------++-----------++
| baseline |   98.44%  |   31.48%  ||   47.71   ||
+----------+-----------+-----------++-----------++
The baseline results were produced by a system which only put clause brackets around sentences. All of the participating systems outperformed the baseline. Most systems obtained an F rate between 62 and 68. One system [Ham01] performed below the rest, but it did not use all of the training data. The system of Xavier Carreras and Luís Màrquez [CM01] outperformed all other systems, both on the main part of the shared task (F=78.63) and on the other two parts. It uses AdaBoost applied to decision trees.
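The baseline described above is simple enough to sketch in Python: it opens a clause at the first token of a sentence and closes it at the last. The function name `baseline_tags` is hypothetical, and the tag `(S*S)` for a one-token sentence is an assumption based on the tag combinations of the part 3 format.

```python
from typing import List


def baseline_tags(sentence_length: int) -> List[str]:
    """Part 3 tags that put a single clause bracket pair around the whole sentence."""
    if sentence_length < 1:
        raise ValueError("a sentence needs at least one token")
    if sentence_length == 1:
        return ["(S*S)"]  # assumed combination: start and end on the same token
    return ["(S*"] + ["*"] * (sentence_length - 2) + ["*S)"]


print(baseline_tags(4))  # prints ['(S*', '*', '*', '*S)']
```

Such a system proposes only sentence-level clauses, which is consistent with the high precision and low recall of the baseline in the table.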
Xavier Carreras has reported errors in the test data set testb3 which concerned the presence of duplicate clauses: (S(S words S)S). These clauses were removed on August 3, 2003. Here are the results of the systems that participated in the shared task for the corrected test data set:
           +-----------+-----------++-----------++
 test cor  | precision |   recall  ||     F     ||
+----------+-----------+-----------++-----------++
| [CM03]   |   87.99%  |   81.01%  ||   84.36   || (*)
| [CMPR02] |   90.18%  |   78.11%  ||   83.71   || (*)
| [CM01]   |   84.82%  |   78.85%  ||   81.73   ||
| [MP01]   |   70.85%  |   70.51%  ||   70.68   ||
| [TKS01]  |   76.91%  |   65.22%  ||   70.58   ||
| [PG01]   |   73.75%  |   64.56%  ||   68.85   ||
| [Dej01]  |   72.56%  |   58.69%  ||   64.89   ||
| [Ham01]  |   55.81%  |   49.49%  ||   52.46   ||
+----------+-----------+-----------++-----------++
| baseline |   98.44%  |   33.88%  ||   50.41   ||
+----------+-----------+-----------++-----------++
The correction influences the recall and F rates, all of which improve.
The papers associated with the participating systems can be found in the reference section below.
This reference section contains two parts: first the papers from the shared task session at CoNLL-2001 and then the other related publications.
Note: at the workshop some of the participants presented results that differ from the ones mentioned in their papers. Whenever possible, an update of the paper with the improved results is available alongside the original version.