Google Summer of Code 2017 | Ideas Page

The Computational Linguistics & Psycholinguistics Research Group of the University of Antwerp (CLiPS) focuses on applications of statistical and machine learning methods, trained on corpus data, to explain human language acquisition and processing data, and to develop automatic text analysis systems that are accurate, efficient, and robust enough to be used in practical applications. 

Project information

We conduct research into text analytics (e.g., do adults use more punctuation than adolescents?) and its real-world applications (e.g., can we predict age by punctuation?). Many of our resources are freely available. Here is some reading on how we constructed or applied such resources, for example for sentiment analysis, demography prediction, and detection of subversive behavior (cyberbullying, grooming, hate speech, ...). 

We frequently release open source tools, such as Pattern, a popular Python toolkit for online text mining and text analytics. If you are feeling adventurous, also check out the new Grasp toolkit on GitHub.

We collaborate closely with Red Hen Lab, who develop and maintain the massive NewsScape corpus of world-wide news. We also collaborate with EMRG, who develop tools for data visualization. When you apply for a project with us, you are welcome to work with them too.

Here are some project ideas:

Pattern 3. The Pattern toolkit currently only works on Python 2, while students at universities are moving on to Python 3. We want to keep offering students (at any university) free tools for data mining, natural language processing, and network analysis, so we are looking for help porting Pattern to Python 3. Pattern is multilingual, with part-of-speech tagging for English, Spanish, German, etc., and sentiment analysis for English, French, Italian, etc. Projects aiming to support more languages are also welcome.
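To give a flavor of what Pattern's multilingual sentiment analysis does under the hood, here is a minimal, self-contained sketch of lexicon-based polarity scoring. The lexicon entries and modifier weights below are hypothetical toy examples, not Pattern's actual data.

```python
# Sketch of lexicon-based sentiment scoring (toy lexicon, not Pattern's).
# Each word maps to a polarity in [-1.0, +1.0]; modifiers such as
# "very" scale the polarity of the word that follows them.

LEXICON = {
    "good": +0.7,
    "great": +0.8,
    "bad": -0.7,
    "terrible": -0.9,
}

MODIFIERS = {"very": 1.5, "hardly": 0.5}

def polarity(text):
    """Average polarity of the known words in the text, in [-1.0, +1.0]."""
    scores = []
    scale = 1.0
    for w in text.lower().split():
        if w in MODIFIERS:
            scale = MODIFIERS[w]
        elif w in LEXICON:
            # Clip the scaled score back into [-1.0, +1.0].
            scores.append(max(-1.0, min(1.0, LEXICON[w] * scale)))
            scale = 1.0
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("a very good movie"))  # positive
print(polarity("a terrible plot"))    # negative
```

A production lexicon covers thousands of words per language, which is why building one for a new language is a project in its own right.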

Fake news. Fake news websites publish hoaxes, propaganda and disinformation, often using social media to drive web traffic and amplify their effect. We are interested in developing systems that can automatically detect fake news, or predict its credibility (as a value between 0.0 and 1.0, for example). We collaborate with linguists, journalists and data scientists.
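One simple way to produce a credibility value between 0.0 and 1.0 is to squash a weighted sum of hand-crafted signals through a logistic function. The features and weights below are hypothetical placeholders; a real system would learn them from annotated data.

```python
import math

# Sketch: combine hand-crafted signals into a credibility score in
# [0.0, 1.0] with a logistic function. Feature names and weights are
# hypothetical; a trained model would estimate them from data.

WEIGHTS = {
    "known_source": +2.0,        # domain appears on a list of known outlets
    "all_caps_headline": -1.5,   # headline is entirely upper case
    "exclamation_marks": -0.5,   # weight per exclamation mark
}

def credibility(features):
    """Map a feature dict {name: value} to a score in [0.0, 1.0]."""
    z = sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

trusted = {"known_source": 1, "all_caps_headline": 0, "exclamation_marks": 0}
shouty  = {"known_source": 0, "all_caps_headline": 1, "exclamation_marks": 3}
print(credibility(trusted))  # close to 1.0
print(credibility(shouty))   # close to 0.0
```

The interesting research question is which signals to use; several candidates are listed below.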

Some project ideas to explore:

  • The term "fake news" has spiked since Trump's presidency. When it is used on social media, it is more often used by Trump supporters (80% of all cases) than by Trump opponents. Within the community of supporters on social media there is a smaller subcommunity that seems more prone to spreading misleading information than others. Can we use this as a cue for identifying dubious links?
  • Can we track the timeline of a breaking news item? Journalists usually need an hour or two to fact check a news item before adopting it. Sources that copy news items within minutes are often suspicious or careless.
  • Can we track the primary source of a news item spreading on social media? If thousands of reposts all link back to a single source that is not a known expert, that is suspicious too.
  • Can we track the co-occurrence of statements across news items, for example by Googling words or sentences? How many credible / dubious results show up?
  • Can we scan a news item for dramatic word use? (e.g., horrendous, baffling, ...)
  • Can we build a classifier for dramatic, sensational news vs. objective news?
  • Can we automatically identify communities that spread sensational news?
  • Can we automatically visualize communities that spread fake news?
  • Can we learn from visual style? (ALL CAPS HEADLINES, ominous colors, ...)
  • Can we detect manipulated images using deep learning?
  • Can we learn from dates and numbers that don't match up across articles?
  • Are web pages with clickbait less reliable? ("10 things you never knew...")
  • How do we deal with factual news that also tells us how we should feel about it? 
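As a starting point for the dramatic-word idea above, a baseline is simply the fraction of words in a headline that come from a sensationalism lexicon. The lexicon here is a tiny hypothetical sample; a real project would build and evaluate a proper one.

```python
# Sketch: score a headline by the fraction of its words drawn from a
# small sensationalism lexicon (toy sample below, not a real resource).

SENSATIONAL = {"horrendous", "baffling", "shocking", "unbelievable",
               "outrageous", "bombshell"}

def drama_score(headline):
    """Fraction of words that are sensational, in [0.0, 1.0]."""
    words = [w.strip(".,!?").lower() for w in headline.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SENSATIONAL)
    return hits / len(words)

print(drama_score("Shocking, unbelievable bombshell rocks city!"))
print(drama_score("Council approves new budget"))
```

Such a score could then feed into a sensational-vs-objective classifier, or serve as one feature in a credibility model.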


Resources. We are always working to improve our tools for different languages. Do you want to conduct a focused, basic research project? Here are some ideas:

  • Part-of-speech tagging of short, social media texts.
  • Multilingual sentiment analysis of short, social media texts.
  • Dataset of social media texts by depressed teens.
  • Dataset of social media texts promoting hate and bullying.
  • Python module for automatic concept extraction, using Wikipedia as a knowledge base.
  • Common Kids Crawl Corpus. Natural Language Generation for children is an active research area with applications in entertainment, education, and healthcare. Large-scale common crawl corpora provide the backbone of NLG systems. However, these datasets are not assembled with child safety in mind, and can potentially contain harmful or offensive information. The Common Kids Crawl would utilise existing child safety browsing systems to source online text, and assemble it into an age-certified corpus. Such a corpus would be highly novel and useful to other developers. 
  • ...
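For the concept extraction idea, one common approach is longest-match lookup of word n-grams against the set of Wikipedia article titles. The title set below is a tiny hypothetical stand-in for the real list of titles, which would be loaded from a Wikipedia dump or queried via its API.

```python
# Sketch: extract concepts by matching the longest word n-grams in a
# text against known article titles. TITLES is a toy stand-in for the
# actual set of Wikipedia titles.

TITLES = {"machine learning", "natural language processing",
          "antwerp", "social media"}

def extract_concepts(text, max_n=3):
    """Return known concepts found in the text, preferring longer matches."""
    words = text.lower().replace(",", "").split()
    found = []
    i = 0
    while i < len(words):
        for n in range(max_n, 0, -1):  # try the longest n-gram first
            gram = " ".join(words[i:i + n])
            if gram in TITLES:
                found.append(gram)
                i += n  # skip past the matched span
                break
        else:
            i += 1
    return found

print(extract_concepts("We apply machine learning to social media texts"))
```

Mapping the matched titles back to Wikipedia pages then gives each concept a definition, categories, and links to related concepts for free.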

We will assist you in publishing your research and point out interesting vacancies.

Project Leader(s): 
Tom De Smedt
Walter Daelemans
Guy De Pauw
External Collaborator(s): 
Red Hen Lab
