Teaching materials Walter Daelemans: stylometry

Computational Stylometry at CLiPS

Computational stylometry develops techniques for inferring information about the authors of texts from an automatic linguistic analysis of those texts. This information can include the author's gender, region of origin, age, personality (extraverted or introverted), or education level, and ultimately an indication of the author's identity (authorship attribution). These techniques are interesting for basic research on the linguistic properties of text that determine style (as opposed to register, topic, etc.), but they are also useful in literary research (resolving disputed authorship) and in forensic applications (disputed authorship of suicide notes, blackmail letters, etc.).

Our approach to computational stylometry uses robust text analysis based on shallow parsing to create a document representation in terms of linguistic features (e.g. frequency distributions of function words or of syntactic patterns), and combines this with machine learning to learn correlations between linguistic features and author properties. Much work remains to be done on both aspects: Which linguistic features work best for which task? Which type of machine learning, and which machine learning algorithms, are best suited to the task?
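
As a rough illustration of the idea described above (a document represented by function-word frequencies, plus a learner that links such representations to author labels), here is a minimal sketch in Python. It is not the CLiPS pipeline: it uses whitespace tokenization instead of shallow parsing, a tiny hypothetical function-word list, invented toy texts and author labels, and a simple 1-nearest-neighbour rule with cosine similarity as a stand-in for whichever learning algorithm one would actually choose.

```python
import math
from collections import Counter

# Hypothetical, tiny function-word list chosen for illustration only;
# real studies typically use far larger lists.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "i", "you", "it", "that"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = text.lower().split()  # naive tokenization, an assumption
    counts = Counter(tokens)
    total = len(tokens) or 1       # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def predict(train, text):
    """Return the label of the most similar training text (1-nearest neighbour)."""
    profile = function_word_profile(text)
    best = max(train, key=lambda item: cosine(function_word_profile(item[0]), profile))
    return best[1]


# Invented toy training data: (text, author label) pairs.
train = [
    ("the history of the city and the growth of the port in the age of trade", "author_a"),
    ("i think you know that i love it and you love it too", "author_b"),
]

print(predict(train, "you said that i should try it and i did"))
```

With these toy texts the script prints author_b, because the test sentence shares its pronoun-heavy function-word profile with the second training text. The same skeleton can be used to compare feature sets or learners by swapping out function_word_profile or predict.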

Reading list for students
If you are interested in doing a BA or MA thesis on computational stylometry, these are useful starting points:

General:

Stamatatos, E. (2009), A Survey of Modern Authorship Attribution Methods, Journal of the American Society for Information Science and Technology, 60(3), 538-556. [pdf]

Koppel, M., Argamon, S. and Shimoni, A. (2002), Automatically Categorizing Written Texts by Author Gender, Literary and Linguistic Computing, 17(4), 401-412. [pdf]

Juola, P. (2008), Authorship Attribution, Foundations and Trends in Information Retrieval. (Especially the introductory chapters.) [pdf]

Currently, CLiPS has research projects using and developing computational stylometry techniques for the following topics. Some key publications of these projects or of similar projects are listed:

Authorship attribution:

Stylometry has produced impressive results when a disputed text has only two or a few possible authors and when ample material is available to construct a model of their style. But what if we have many possible authors and only a few hundred words to build our models on?

Kim Luyckx and Walter Daelemans, Authorship Attribution and Verification with Many Authors and Limited Data. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 513-520, Manchester, UK, 2008. [pdf]

Personality detection:

It is possible to decide with reasonable certainty whether someone is extraverted or introverted on the basis of a text written by him or her, even if the text is about a prosaic subject such as a documentary on artificial life.

Kim Luyckx and Walter Daelemans, Personae: A Corpus for Author and Personality Prediction from Text. In: Proceedings of the 6th Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, 2008. [pdf]

Emotion detection:

In the serious game Delearyous, people can practice their communication skills by interacting through text with customers and trying to influence their state of mind. To support this, we analyze the text along dimensions such as cooperativeness and aggressiveness.

Related publications:

Diana Inkpen and Fazel Keshtkar and Diman Ghazi. Analysis and Generation of Emotion in Texts. In: KEPT 2009 Knowledge Engineering-Principles and Techniques, Selected Papers, pages 3-13, 2009. [pdf]

Oren Tsur and Dmitry Davidov and Ari Rappoport. ICWSM — A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In Proceedings of the Fourth International Conference on Weblogs and Social Media (ICWSM-2010), 2010. [pdf]

Copyist detection:

The attribution of authorship to medieval works poses special problems. Copyists are known to have introduced their own stylistic and dialectal preferences when copying texts, not unlike editors changing an author's text before publication. It is an interesting challenge to develop methods that can separate author style from copyist or editor style. An additional problem here is that we have to develop tools (part-of-speech taggers, lemmatizers) for these old and difficult variants of Dutch.

Kestemont, M. & K. Van Dalen-Oskam, ‘Predicting the past: memory-based copyist and author discrimination in medieval epics’, in: Proceedings of the twenty-first Benelux conference on artificial intelligence (BNAIC 2009), T. Calders; K. Tuyls & M. Pechenizkiy (eds.). Eindhoven 2009, 121-128. [pdf]

Detecting pedophiles in social networks:

Social networks are a lot of fun, but pedophiles sometimes abuse them to get into contact with children, posing as children themselves. Can we spot them using a combination of topic detection methods and stylometric techniques? And more generally, can we detect the age of an author on the basis of a text? An additional problem is that we have to develop tools (part-of-speech tagging, lemmatization) for chat language.

Related publications:

Rashid, A., Rayson, P., Greenwood, P., Walkerdine, J., Duquenoy, P., Watson, P., Brennan, M., Jones, M. Isis: Protecting Children in Online Social Networks. At the International Conference on Advances in the Analysis of Online Paedophile Activity, Paris, June 2009 [pdf]

Latapy, M. Measurement and Analysis of P2P Activity Against Paedophile Content. General Public Report. 13p. 2009. [pdf]

Haichao Dong, Siu Cheung Hui and Yulan He. Structural Analysis of Chat Messages for Topic Detection. Online Information Review, 30 (5), 496-516, 2006. [pdf]

Burger J. and Henderson J. An Exploration of Observable Features Related to Blogger Age. 2006. [pdf]

Last modified: Tue Oct 19 14:26:06 CEST 2010