NEON/DAESO Sentence compression
The compression system deletes parts of the sentence in order to
compress it. This system was developed for Dutch within the NEON
project and the DAESO project, and is based on earlier research for
English in the MUSA project.
The system takes a hybrid approach that combines rules and
statistics. First, each sentence is parsed with the memory-based
shallow parser. The parser tokenizes the sentence and assigns
part-of-speech tags, IOB chunk tags and lemmas to every token. The
compression system uses the predicted chunk tags to determine which
words or phrases are candidates for removal. The following types of
phrases or words are marked as candidates for removal:
For each of the categories above, some specific rules and constraints
are implemented. For example, some adverbs such as 'not' are never
removed, and adjectives can only be candidates if they are part, but
not the last part, of a noun phrase. After all candidates have been
identified, an importance score is computed for every candidate. The
candidates with the lowest scores are deleted. The importance of a
candidate depends on two types of weights: the rule-weight and the
surprise value. The rule-weight represents the importance of the
candidate's part-of-speech tag. For example, according to the
rule-weights, adjectives are less important than nouns. The surprise
value is based on TF*IDF scores computed on a large corpus. The higher
the surprise value of a candidate, the less likely it is to be
removed. The compression system iteratively removes the least
important candidates until the required compression rate is met.
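
The removal loop described above can be illustrated with a minimal
sketch. It is not the system's actual code. The names (Candidate,
RULE_WEIGHT, importance, compress), the weight values, the tag names
and the choice to multiply the rule-weight by the surprise value are
all assumptions made for illustration; the text does not say how the
two weights are combined.

```python
from dataclasses import dataclass

# Hypothetical rule-weights per part-of-speech tag (values are made up).
RULE_WEIGHT = {"ADJ": 0.3, "ADV": 0.4, "N": 1.0}


@dataclass
class Candidate:
    start: int       # index of the first token of the candidate
    end: int         # index one past the last token of the candidate
    pos: str         # part-of-speech tag used to look up the rule-weight
    surprise: float  # surprise value, e.g. derived from TF*IDF scores


def importance(cand: Candidate) -> float:
    # Assumption: the two weights are combined by multiplication.
    return RULE_WEIGHT.get(cand.pos, 1.0) * cand.surprise


def compress(tokens: list[str], candidates: list[Candidate],
             target_rate: float) -> str:
    """Delete the least important candidates until the character-based
    compression rate is at or below target_rate."""
    original = " ".join(tokens)
    removed: set[int] = set()
    current = original
    for cand in sorted(candidates, key=importance):
        if len(current) / len(original) <= target_rate:
            break
        removed.update(range(cand.start, cand.end))
        current = " ".join(t for i, t in enumerate(tokens)
                           if i not in removed)
    return current


tokens = ["de", "zeer", "oude", "man", "loopt", "langzaam"]
candidates = [
    Candidate(1, 2, "ADV", 0.5),  # "zeer"
    Candidate(2, 3, "ADJ", 1.0),  # "oude"
    Candidate(5, 6, "ADV", 2.0),  # "langzaam"
]
print(compress(tokens, candidates, target_rate=0.8))
```

With these made-up weights, "zeer" and "oude" have the lowest scores
and are removed first; "langzaam" is kept because the target rate has
already been reached.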
In this demo, the compression rate is computed as:
number-of-characters-in-compressed-sentence /
number-of-characters-in-original-sentence.
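
As a small worked example of this ratio, the sketch below computes the
rate for one sentence pair. The function name compression_rate is
hypothetical, and counting spaces as characters is an assumption; the
text only says "number of characters".

```python
def compression_rate(original: str, compressed: str) -> float:
    """Characters in the compressed sentence divided by characters in
    the original sentence (spaces included, by assumption)."""
    if not original:
        raise ValueError("original sentence must not be empty")
    return len(compressed) / len(original)


# 12 characters kept out of 22, so the rate is 12 / 22.
rate = compression_rate("de zeer oude man loopt", "de man loopt")
print(round(rate, 3))
```

Under this definition a lower rate means a shorter output, and a rate
of 1.0 means nothing was removed.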
For more information, contact Vincent Van Asch or Walter Daelemans.