The Chatty Corpus: a gold standard for Dutch chat normalization

Title: The Chatty Corpus: a gold standard for Dutch chat normalization
Publication Type: Talks
Authors: Peersman, C., Kestemont, M., De Decker, B., De Pauw, G., Luyckx, K., Morante, R., Vaassen, F., van de Loo, J., & Daelemans, W.
Place Presented: Presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN2013), Enschede, The Netherlands
Year of Publication: 2013
Date Presented: 18/01/2013
Abstract

In recent years, numerous forms of Internet communication, such as email, blogs, social network posts, tweets and chat room conversations, have emerged together with a new language variety called chat speak. In Flanders, chat speak includes not only the Internet abbreviations (e.g. ‘lol’) and spelling errors that are typical of the genre, but also representations of colloquial Flemish, a conglomerate of Dutch dialects spoken in the north of Belgium. This language variety differs significantly from standard (Netherlandic) Dutch because it displays many more dialectal features, which carry over into chat speak. Because state-of-the-art NLP tools fail to correctly analyse the surface forms of (Flemish) chat language, we propose to normalize this ‘anomalous’ input into a format suitable for existing NLP solutions for standard Dutch. To this end, we have annotated a substantial part of a corpus of Flemish Dutch chat posts collected from the Belgian online social network Netlog, in order to provide a gold standard for evaluating future approaches to automatic (Flemish) chat speak normalization. In our presentation, we discuss our annotation guidelines and inter-annotator agreement scores. However, in a world where adolescent (chat) language varies constantly, we believe that normalizing this text genre calls for machine learning approaches that require minimal supervision, in order to reduce the cost and effort of manual annotation. Therefore, we also discuss the normalization strategies we will investigate in our future research on this topic.