Text-based age and gender prediction for online safety monitoring

Title: Text-based age and gender prediction for online safety monitoring
Publication Type: Talks
Authors: van de Loo, J., De Pauw, G., & Daelemans, W.
Place Presented: 26th Meeting of Computational Linguistics in the Netherlands (CLIN26), Amsterdam, The Netherlands.
Year of Publication: 2015
Date Presented: 18/12/2015
Abstract

We present results of author profiling experiments that explore the capabilities of text-based age and gender prediction for online safety monitoring. In the AMiCA project, we are developing a monitoring tool that automatically detects harmful content and conduct in online social networks, such as cyberbullying, “grooming” activities by sexual predators, and suicidal behavior. Author profiling (i.e., automatically detecting “metadata” of authors, such as their age and gender) is an important subtask in this application.

The use case in which age and gender classification is most evidently relevant is the detection of sexual predators, who may provide false age and gender information in their user profiles. For other use cases, too, automatically detected age and gender information can be useful for risk estimation. For age prediction, various age categories can be relevant, based on legal constraints (e.g. minors vs. adults) or age-related statistics (e.g. suicide incidence rates across age groups).

In our study, we evaluated and compared binary age classifiers trained to separate younger and older authors according to different age boundaries. Experiments were carried out on a dataset of nearly 380,000 Dutch chat posts from the Belgian social network Netlog, using a ten-fold cross-validation setup. We found that macro-averaged F-scores increased when the age boundary was raised and that practically applicable performance levels can be achieved, which makes such classifiers a useful component in a cybersecurity monitoring tool for social network moderators.
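
The evaluation setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example ages, and the convention that an author at exactly the boundary age counts as "older" are all assumptions made for illustration. It shows how a single age boundary turns ages into binary labels, how a macro-averaged F-score weights both classes equally, and one simple way to partition items into ten folds.

```python
from typing import List, Sequence


def binarize_ages(ages: Sequence[int], boundary: int) -> List[str]:
    """Map each age to a binary label for a given age boundary.

    Assumption: authors at exactly the boundary age count as 'older'.
    """
    return ["younger" if age < boundary else "older" for age in ages]


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score for one class, computed from true/false positives and negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores (macro-averaging)."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)


def ten_fold_indices(n_items: int, n_folds: int = 10) -> List[List[int]]:
    """Assign item indices to folds round-robin; each fold serves once as test set."""
    return [list(range(fold, n_items, n_folds)) for fold in range(n_folds)]


# Hypothetical ages, relabelled under two different boundaries.
ages = [13, 15, 17, 19, 24, 31]
labels_16 = binarize_ages(ages, boundary=16)
labels_18 = binarize_ages(ages, boundary=18)
```

Because macro-averaging gives the smaller class the same weight as the larger one, the score is not inflated when a boundary splits the data unevenly, which matters when comparing classifiers across different age boundaries.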