Background

The Dutch Multi-Document summarizer is developed within the Daeso project, a part of the STEVIN initiative. The Daeso project investigates the detection of semantic overlap between Dutch sentences. The following two sentences illustrate semantic overlap: they express the same information in different ways.

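De kans dat een Elfstedentocht georganiseerd kan worden is afgenomen tot eens per achttien jaar.
("The chance that an Elfstedentocht can be organized has dropped to once every eighteen years.")

Eens in de 18 jaar zal er een Elfstedentocht kunnen worden gehouden.
("Once every 18 years it will be possible to hold an Elfstedentocht.")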

For an NLP application such as automatic multi-document summarization, detecting semantic overlap is essential to avoid redundancy. Many current state-of-the-art systems use word overlap as a measure of semantic overlap and therefore fail to detect paraphrases. The Dutch Multi-Document summarizer in this demo is our baseline system and also uses word overlap. As a next step, we hope to improve this system with a more sophisticated semantic overlap detection module that is being developed in the Daeso project.
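To make the limitation of word overlap concrete, the sketch below scores the two example sentences above with a simple Jaccard measure over lowercased word sets. This is an illustration only: the tokenizer and the `jaccard` function are assumptions made for this example, not the overlap measure used by the baseline system.

```python
import re


def tokens(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words in the two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


s1 = ("De kans dat een Elfstedentocht georganiseerd kan worden "
      "is afgenomen tot eens per achttien jaar.")
s2 = "Eens in de 18 jaar zal er een Elfstedentocht kunnen worden gehouden."

shared = tokens(s1) & tokens(s2)
print(sorted(shared))
print(round(jaccard(s1, s2), 3))
```

The two sentences share only 6 of their 21 distinct words, and most of those are function words. Pairs such as "achttien" and "18" or "georganiseerd" and "gehouden" contribute nothing, even though the sentences mean the same thing.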

The core of this summarizer is the MEAD summarization toolkit. The original MEAD summarizer handles English and Chinese text. We adapted the MEAD system for Dutch.

How does it work?

MEAD's basic method is as follows. The documents are processed sentence by sentence, and an importance weight is computed for every sentence. The sentences are then sorted by importance. The summary starts with the sentence that has the highest weight. The next most important sentence is then compared with the sentences already in the summary; if it does not overlap with them, it is added to the summary. This is repeated until the maximum summary size is reached.
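The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration and not MEAD's actual code: the function names, the Jaccard word-overlap measure, the `overlap_threshold` value, and the use of a sentence count as the summary size limit are all assumptions made for the example. Importance weights are taken as given input.

```python
import re


def tokens(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences (assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_summary(scored_sentences, max_sentences=3, overlap_threshold=0.5):
    """Greedily pick high-weight sentences that do not overlap with earlier picks.

    scored_sentences: list of (sentence, importance_weight) pairs.
    max_sentences and overlap_threshold are hypothetical parameters.
    """
    # Sort sentences by importance, highest weight first.
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    summary = []
    for sentence, _weight in ranked:
        if len(summary) >= max_sentences:
            break
        # Add the sentence only if it is dissimilar to everything chosen so far.
        if all(word_overlap(sentence, chosen) < overlap_threshold
               for chosen in summary):
            summary.append(sentence)
    return summary


scored = [
    ("The cat sat on the mat.", 0.9),
    ("The cat sat on the mat today.", 0.8),
    ("Dogs bark loudly.", 0.5),
]
result = select_summary(scored, max_sentences=2)
print(result)
```

In this toy input the second sentence is skipped because it shares almost all of its words with the first one, so the lower-weighted third sentence is selected instead.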

This summarizer uses three types of information to compute the importance score: the position of the sentence within the document, the sentence length, and a more complex information value based on word frequencies. For more information, please see the following references:

CLIPS