100 days of web mining
In this experiment, we collected Google News stories at regular 1-hour intervals between November 22, 2010, and March 8, 2011, resulting in a set of 6,405 news stories. We grouped these per day and then determined the top daily keywords using tf-idf, a measure of a word's uniqueness or importance in a document. For example, if the word news is mentioned every day, it is not particularly distinctive on any given day.
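Concretely, tf-idf multiplies how often a word occurs in a document (term frequency) by how rare it is across all documents (inverse document frequency). Below is a minimal, plain-Python sketch of one common formulation; Pattern's exact weighting may differ slightly, and the toy corpus is purely illustrative:

from math import log

def tf_idf(word, document, corpus):
    # Term frequency: relative frequency of the word in this document.
    tf = document.count(word) / float(len(document))
    # Document frequency: fraction of documents that contain the word.
    df = sum(1 for d in corpus if word in d) / float(len(corpus))
    # A word that occurs in every document (df = 1) scores zero.
    return tf * log(1 / df)

corpus = [['news', 'korea', 'artillery'],  # day 1
          ['news', 'assange', 'arrest'],   # day 2
          ['news', 'egypt', 'mubarak']]    # day 3

print tf_idf('news', corpus[0], corpus)    # 0.0, occurs every day
print tf_idf('korea', corpus[0], corpus)   # ~0.37, unique to one day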
To set up the experiment we used the Pattern web mining module for Python.
The basic script is simple enough:
from pattern.web import Newsfeed, plaintext
from pattern.db import date
from pattern.vector import Model, Document, LEMMA

news, url = {}, 'http://news.google.com/news?output=rss'
for story in Newsfeed().search(url, cached=False):
    d = str(date(story.date, format='%Y-%m-%d'))
    s = plaintext(story.description)
    # Each key in the news dictionary is a date: news is grouped per day.
    # Each value is a dictionary of id => story items.
    # We use hash(story.description) as a unique id to avoid duplicate content.
    news.setdefault(d, {})[hash(s)] = s
m = Model()
for date, stories in news.items():
    s = stories.values()
    s = ' '.join(s).lower()
    # Each day of news is a single document.
    # By adding all documents to a model we can calculate tf-idf.
    m.append(Document(s, stemmer=LEMMA, exclude=['news', 'day'], name=date))

for document in m:
    print document.name
    print document.keywords(top=10)
In the image below, important words (i.e., events) that occurred across multiple days are highlighted (we took a word's document frequency as an indication). You might remember the North Korean artillery attack on a South Korean island on November 23, 2010, the arrest of Julian Assange of WikiLeaks in early December, the shooting of congresswoman Gabrielle Giffords, the unrest in Egypt and the subsequent ousting of Hosni Mubarak, and the Libyan revolt.
Simultaneously, we mined Twitter messages containing the words I love or I hate – 35,784 love-tweets and 35,212 hate-tweets in total. One would expect a correlation between important media events and strongly voiced opinions on Twitter, right? Not so. Across all hate-tweets, only one daily top keyword matched a media event: on November 24, the most discussed word in hate-tweets was food, which correlates with news coverage on December 1st (but this relation is not very meaningful).
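The tweets were collected with Pattern as well. Below is a minimal sketch of how such a love/hate corpus can be gathered with pattern.web's Twitter service; the query strings, page range, count and filename here are illustrative, not the exact settings we used:

from pattern.web import Twitter, plaintext
from pattern.db import Datasheet

table = Datasheet()
for query in ('"i love"', '"i hate"'):
    for page in range(1, 10):
        # Each start page returns a new batch of recent tweets.
        for tweet in Twitter().search(query, start=page, count=100, cached=False):
            table.append([query, plaintext(tweet.text), tweet.date])

table.save('tweets.txt')  # Illustrative filename.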
The name Mubarak (for example) was mentioned only five times in our Twitter corpus (e.g., in "I love you as much as Mubarak loves his chair", or in "How do I hate thee, Mubarak? Let the people count the ways"). The names Gabrielle Giffords and Julian Assange were never mentioned. Perhaps we missed a number of tweets correlating with media events. Perhaps the Twitter buzz does not discuss news in terms of I love or I hate. Instead, consider these tweets from the hate-corpus, which are exemplary in terms of language use: "I hate when dudes text me boring shit, whats up uvsvorjibne go fuck yourself!", or "I hate that my son is making me watch this dumb ass movie. I'm gonna fart and see how long it takes for him to notice". The word ass occurs 2,439 times. Phrases containing I love or I hate seem to be used predominantly by teenagers. Perhaps adults retweet, or express their opinions in more intricate language forms such as irony or sarcasm.

We then calculated the document frequency of each word in the hate-corpus. A higher document frequency indicates that the word is present in more documents (i.e., bundles of daily tweets). By taking the most frequent words we get an idea of the words habitually used in the corpus during the 100-day timeframe:
m = Model.load('tweet-hate.pickle')
w = m.vector.keys()
w = [(m.df(w), w) for w in w]
w = sorted(w, reverse=True)
print w[:10]
Top 10: | bitch | shit | school | girl | time | fuck | friend | ass | nigga | justin bieber |
Daily drudge
Assuming we have a collection of hate-tweets organised as a list of (tweet, date)-tuples, it is not difficult to group the tweets by day and look at the difference between weekdays and weekends. In general, this difference is small, likely because Twitter messages are retweeted across days.
daily = {}
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
for tweet, date in hate_tweets:
    # Collect tweets in a dictionary indexed by weekday.
    daily.setdefault(days[date.weekday()], []).append(tweet)

m = Model()
for k, v in daily.items():
    m.append(Document(' '.join(v).lower(), name=k, stemmer=LEMMA))

for document in m:
    print document.name
    print document.keywords(10)
Here are the top keywords of hate-tweets grouped by weekday (the first row gives the number of tweets for each day):
Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
5622 | 6610 | 5689 | 5727 | 4596 | 3472 | 3573 |
monday | school | shit | shit | shit | shit | shit |
shit | shit | time | time | time | time | girl |
time | time | school | girl | girl | girl | time |
school | bitch | girl | bitch | school | ass | bitch |
bitch | damn | bitch | talk | bitch | bitch | fuck |
Traffic is heaviest on Tuesday, with almost twice as many tweets as on weekend days. The litany of swear words is constant across the week. If we filter these out, we arrive at a new level of universal annoyance:
Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
monday | sleep | sick | sick | sick | sleep | katie |
morning | sick | snow | home | song | home | monday |
sleep | cold | cold | cold | sleep | night | justin |
home | home | song | morning | cold | wake | sick |
sick | morning | morning | snow | morning | hair | real |
Here, tweets appear to be preoccupied with sleep, early mornings, bad weather and sickness (our data is from November to March). On Saturday, however, hair and nighttime play a more prominent role. Sunday is Justin Bieber-bashing day. Some more filtering reveals the importance of cars and dates on Friday, movies on Saturday, games on Sunday, and mothers in general:
Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
monday | mad | nigga | mom | night | night | monday |
math | night | talk | nigga | kid | wake | mom |
teacher | weather | kid | talk | wake | mom | game |
mom | annoy | mad | stuff | date | movie | fan |
wrong | talk | question | suck | car | house | watch |
The pattern.en.wordlist module has a number of word lists (ACADEMIC, PROFANITY, TIME) that can be used to filter noise from a document. For example, academic words include domain, research and technology; profanity includes words such as shit and hell.
from pattern.en.wordlist import ACADEMIC
from pattern.vector import Document

d = Document(open("paper.txt").read(), exclude=ACADEMIC)
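The same mechanism can be used to strip the swear words from the weekday rankings above. A sketch, reusing the daily dictionary from the weekday script and treating the wordlists as plain Python lists:

from pattern.en.wordlist import PROFANITY, TIME
from pattern.vector import Model, Document, LEMMA

# Treat swear words and time-related words as noise.
noise = list(PROFANITY) + list(TIME)

m = Model()
for k, v in daily.items():
    m.append(Document(' '.join(v).lower(), name=k, stemmer=LEMMA, exclude=noise))

for document in m:
    print document.name
    print document.keywords(5)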
Twitter is the new shampoo.
In another experiment, we mined Twitter for 35,371 tweets containing the words is the new, such as: "green is the new gold" or "lipstick is the new trend". We can parse the nouns in the comparison from each tweet and bundle them in a graph. Calculating centrality should then give us an idea of new concepts pointing to newer concepts, pointing to the newest concept.
[Movie: animation of the "is the new" graph, also available on Vimeo]
We calculated eigenvector centrality (the measure behind Google's PageRank) on the full graph: roughly, how many other nodes are (indirectly) connected to each node in the graph. Nodes with a high eigenvector weight can be considered more important. We still get a lot of profanity noise, but green energy, China as an emerging economy and handheld devices such as Google's Android phone are salient examples of trends surfacing in 2010-2011. The top 100 for "X is the new Y" includes:
black | green | shit | hell | money | food | China | ass | Android | Perry | hipster |
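The ranking itself can be computed directly on a pattern.graph graph. A sketch, assuming a list comparisons of (concept1, concept2, date)-tuples like those produced by the comparison() function below, and assuming that eigenvector_centrality() stores each node's score in Node.weight:

from pattern.graph import Graph

g = Graph()
for concept1, concept2, d in comparisons:  # Assumed list of mined tuples.
    g.add_node(concept1)
    g.add_node(concept2)
    g.add_edge(concept1, concept2)  # concept1 is the new concept2.

g.eigenvector_centrality()  # Assumed to store each node's score in Node.weight.
for node in sorted(g.nodes, key=lambda n: n.weight, reverse=True)[:10]:
    print node.id, node.weight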
Source code
The visualization was realized in NodeBox for OpenGL, a Python module that generates 2D interactive animation using OpenGL. It comes with functionality for drawing graphs – more specifically with a Graph object that is identical to pattern.graph.Graph. Output from Pattern can easily be plugged into NodeBox. Essentially, we have the Twitter data stored in a Datasheet over which we create an iterator function. Each time comparison() is called it returns a (concept1, concept2, date)-tuple (or None), where concept1 is the new concept2:
from pattern.db import Datasheet, date
from pattern.en import parse, Sentence
from pattern.search import search

rows = iter(Datasheet.load('the_new.txt'))

def comparison(exclude=['', 'i', 'me', 'it', 'side', 'he', 'mine']):
    try:
        r = rows.next()  # Fetch next row in the matrix.
        p = Sentence(parse(r[2], lemmata=True, light=True))
        m = search('NP be the new NP', p)
        a = m[0].constituents()[ 0].head.string.upper().strip('.!\'"#')
        b = m[0].constituents()[-1].head.string.upper().strip('.!\'"#')
        if a not in exclude and b not in exclude:
            # Additionally, we could check if a and b occur in WordNet
            # to get "clean" nouns in the output.
            return b, a, date(r[-1])
    except:
        pass
from nodebox.graphics import *
from nodebox.graphics.physics import Graph

g = Graph()
for i in range(700):
    n = comparison()
    if n is not None:
        b, a, d = n
        g.add_node(b, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
        g.add_node(a, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
        g.add_edge(b, a, stroke=(1,1,1,0.1))

# split() returns the graph's connected subgraphs; keep the first one.
g = g.split()[0]
Next, we implement a NodeBox draw() loop, which is called for each frame of animation. It updates and draws the graph, and incrementally adds new comparisons to it.
def draw(canvas):
    background(0.18, 0.22, 0.28)
    translate(300, 300)
    g.betweenness_centrality()
    g.update()
    g.draw()
    for i in range(4):
        # Add up to 4 new nodes per frame.
        n = comparison()
        if n:
            b, a, d = n
            # Only grow the graph where it connects to existing nodes.
            if a in g or b in g:
                if a in g: g[a].text.fill.alpha = 0.75
                if b in g: g[b].text.fill.alpha = 0.75
                g.add_node(a, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
                g.add_node(b, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
                g.add_edge(b, a, stroke=(1,1,1,0.1))
    for n in g.nodes:
        # Nodes with more connections grow bigger.
        n.radius = 3 + n.weight*6 + len(n.links)*0.5

canvas.size = 600, 600
canvas.fps = 40
canvas.run(draw)