CLiPS Stylometry Investigation (CSI) Corpus

Creative Commons License

CLiPS Research Center, University of Antwerp
Ben Verhoeven, Walter Daelemans


The CSI corpus is a corpus of student texts in two genres, essays and reviews, and is expanded yearly. It is intended primarily for stylometric research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade).

The current version of the corpus was assembled in October 2015. Previous versions are available from the authors on request by e-mail.


We would like to express our gratitude to Katrien Verreyken, Shanti Verellen, Sarah Van Hoof, Dominiek Sandra and Reinhild Vandekerckhove (University of Antwerp) for their help in collecting all the data.

This corpus was constructed within the framework of the AMiCA project, funded by the Flemish Agency for Innovation through Science and Technology (IWT).


If you use this dataset in your research, make sure to cite the following paper:

Verhoeven, Ben & Daelemans, Walter (2014). CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland.