TwiSty Corpus

Spanish, Portuguese, French, Dutch, Italian, German

Creative Commons License


Ben Verhoeven (1), Walter Daelemans (1), Barbara Plank (2)

(1) CLiPS Research Center, University of Antwerp, Belgium
(2) University of Groningen, The Netherlands


TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. We distribute the Twitter IDs of these authors, together with the IDs of those of their tweets that were available at the time of corpus development. The tweets have undergone language identification and are split into two categories: Confirmed (identified as written in the language under which the author is listed) and Other.
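
As an illustration of the description above, the following is a minimal sketch of how one author entry might be represented in memory once the distributed IDs have been loaded. It is not the corpus's file format: the class name AuthorRecord, all of its field names, and the helper is_valid_mbti are hypothetical and chosen only for this example. The only facts taken from the description are that each author has an MBTI label, a gender label, and tweet IDs divided into Confirmed and Other.

```python
from dataclasses import dataclass, field

# The four MBTI dichotomies, in the order the letters appear in a type string.
MBTI_AXES = ("IE", "NS", "TF", "JP")


def is_valid_mbti(label: str) -> bool:
    """Return True if label is a well-formed four-letter MBTI type."""
    label = label.upper()
    return len(label) == 4 and all(
        ch in axis for ch, axis in zip(label, MBTI_AXES)
    )


@dataclass
class AuthorRecord:
    # Hypothetical field names; the distributed files may use different keys.
    user_id: str
    language: str
    gender: str
    mbti: str
    confirmed_tweet_ids: list = field(default_factory=list)
    other_tweet_ids: list = field(default_factory=list)

    def __post_init__(self):
        # Reject malformed personality labels early and normalise the case.
        if not is_valid_mbti(self.mbti):
            raise ValueError(f"not an MBTI type: {self.mbti!r}")
        self.mbti = self.mbti.upper()

    def usable_tweet_ids(self, include_other: bool = False) -> list:
        """Return Confirmed tweet IDs, optionally followed by Other ones."""
        ids = list(self.confirmed_tweet_ids)
        if include_other:
            ids.extend(self.other_tweet_ids)
        return ids


# Example with made-up placeholder values.
record = AuthorRecord("123", "nl", "F", "infp", ["1", "2"], ["3"])
print(record.mbti, record.usable_tweet_ids())
```

Keeping the two tweet categories in separate lists lets an experiment default to the Confirmed tweets and opt in to the Other tweets explicitly.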


This research is supported by a doctoral grant from the FWO Research Council - Flanders for the first author. We thank Guy De Pauw and Tom De Smedt for technical support. Part of this research was carried out in the framework of the AMiCA (IWT SBO-project 120007) project, funded by the Flemish government agency for Innovation by Science and Technology (IWT).


If you use this dataset in your research, make sure to cite the following paper:

Verhoeven, B., Daelemans, W., & Plank, B. (2016). TwiSty: a multilingual Twitter Stylometry corpus for gender and personality profiling. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.