Using Wikicorpus & NLTK to build a Spanish part-of-speech tagger

  Tom De Smedt (Computational Linguistics Research Group, University of Antwerp)


Pattern contains part-of-speech taggers for a number of languages (including English, German, French and Dutch). Part-of-speech tagging is useful in many data mining tasks. A part-of-speech tagger takes a string of text and identifies the sentences and the words in the text along with their word type. The word type or part-of-speech can vary according to a word's role in the sentence. For example, in English, can can be a verb or a noun: in "Can I have a can of soda?", the first can is a modal verb and the second can is a noun.

The output takes the following form:

Can/MD I/PRP have/VB a/DT can/NN of/IN soda/NN ?/.

POS-tag MD indicates a modal verb, PRP a personal pronoun, VB a verb, DT a determiner, NN a noun and IN a preposition. The tags are part of the Penn Treebank II tagset.
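In code, tagged output like this can be represented as a list of (word, tag)-tuples, the same format the scripts in this article work with. A minimal sketch of the example sentence:

```python
# The example sentence as (word, tag)-tuples, with Penn Treebank tags.
# Note the two readings of "can": modal verb (MD) vs. noun (NN).
tagged = [("Can", "MD"), ("I", "PRP"), ("have", "VB"), ("a", "DT"),
          ("can", "NN"), ("of", "IN"), ("soda", "NN"), ("?", ".")]

print(" ".join("%s/%s" % (w, tag) for w, tag in tagged))
```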

One approach to building a part-of-speech tagger is to use a treebank and a machine learning algorithm. A treebank is a large text corpus (e.g., 1 million words and more) where each sentence has been annotated with syntactic structure (i.e., tagged by hand). A machine learning algorithm can then be used to train a part-of-speech tagger, by inferring statistical rules and patterns from the treebank.

In the past, treebanks had to be constructed manually by human annotators. This is expensive and time consuming. It can take a linguistics research group several years to construct a treebank. The availability and quality of free treebanks varies from language to language. For Spanish, we can use the freely available Spanish Wikicorpus (Reese, Boleda, Cuadros, Padró & Rigau, 2010).

Reese, S., Boleda, G., Cuadros, M., Padró, L., & Rigau, G. (2010). Wikicorpus: A Word-Sense Disambiguated Multilingual Wikipedia Corpus. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC'10).

1. Reading the Spanish Wikicorpus

The corpus contains over 50 text files, each 25-100MB in size. Each line in each file is a word with its lemma (base form) and its part-of-speech tag in the Parole tagset. Some lines are metadata (e.g., <doc>).

<doc id="267233" title="Deltoide" dbindex="85004"> 
En en NP00000 0
geometría geometría NCFS000 0
, , Fc 0
un uno DI0MS0 0
deltoide deltoide NCFS000 0
es ser VSIP3S0 01775973
un uno DI0MS0 0
cuadrilátero cuadrilátero NCMS000 0
no no RN 0
regular regular AQ0CS0 01891762
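Each data line can be split on spaces into its four fields. A minimal sketch, using the line for "es" from the excerpt above:

```python
# One Wikicorpus data line has four space-separated fields:
# the word, its lemma, its Parole tag, and a word-sense id (0 = none).
line = "es ser VSIP3S0 01775973"
w, lemma, tag, sense = line.split(" ")
# w = "es", lemma = "ser", tag = "VSIP3S0", sense = "01775973"
```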

The following function reads the corpus efficiently. Note how open() is used as an iterator that yields lines from a given text file. This way, we avoid loading the whole text file into memory. The function simplifies the tags and returns a list of sentences, in which each sentence is a list of (word, tag)-tuples:

from glob import glob
from codecs import open, BOM_UTF8

def wikicorpus(words=1000000, start=0):
    s = [[]]
    i = 0
    # Assumes the Wikicorpus text files are in the current folder.
    for f in glob("*")[start:]:
        for line in open(f, encoding="latin-1"):
            if line == "\n" or line.startswith((
              "<doc", "</doc>", "ENDOFARTICLE", "REDIRECT")):
                continue
            w, lemma, tag, x = line.split(" ")
            if tag.startswith("Fp"):   # Punctuation subtypes, e.g., Fpa, Fpt.
                tag = tag[:3]
            elif tag.startswith("V"):  # VMIP3P0 => VMI
                tag = tag[:3]
            elif tag.startswith("NC"): # NCMS000 => NCS
                tag = tag[:2] + tag[3]
            else:                      # DI0MS0 => DI
                tag = tag[:2]
            for w in w.split("_"): # Puerto_Rico
                s[-1].append((w, tag)); i += 1
            if tag == "Fp" and w == ".": # End of sentence.
                s.append([])
            if i >= words:
                return s[:-1]

The function can be adapted to read other corpora, of course.
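The tag simplification inside wikicorpus() can be illustrated in isolation. A sketch of the same logic as a standalone function (simplify() is our own name for it, not part of Pattern):

```python
def simplify(tag):
    # Punctuation subtypes such as Fpa / Fpt keep three characters.
    if tag.startswith("Fp"):
        return tag[:3]
    # Verbs keep type and mood: VMIP3P0 => VMI.
    if tag.startswith("V"):
        return tag[:3]
    # Common nouns keep type and number: NCMS000 => NCS.
    if tag.startswith("NC"):
        return tag[:2] + tag[3]
    # Everything else keeps the first two characters: DI0MS0 => DI.
    return tag[:2]

print(simplify("VMIP3P0"))  # VMI
print(simplify("NCMS000"))  # NCS
```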


2. Extracting a lexicon of known words

Pattern uses Brill's algorithm to construct its part-of-speech taggers. Other algorithms are more robust, but a Brill tagger is fast and compact (i.e., 1 MB of data) so it makes a good candidate for Pattern. Brill's algorithm essentially produces a lexicon of known words and their part-of-speech tag (e.g., can → VB), along with some rules for unknown words, and rules that change the tag according to a word's role in the sentence (e.g., "can of soda" → NN).

Using a large lexicon of the most common words and tagging unknown words as nouns, we can get quite decent tagging accuracy: between 80% and 90%. Then, we can use rules to improve the part-of-speech tags. Constructing the lexicon is not difficult. We simply count how often each word occurs in Wikicorpus and how often each of its tags occurs (some words have multiple tags), and take the most frequent words, each with its most frequent tag.

from collections import defaultdict

# "el" => {"DA": 3741, "NP": 243, "CS": 13, "RG": 7}) 
lexicon = defaultdict(lambda: defaultdict(int))
for sentence in wikicorpus(1000000):
    for w, tag in sentence:
        lexicon[w][tag] += 1

top = []  
for w, tags in lexicon.items():    
    freq = sum(tags.values())      # 3741 + 243 + ...
    tag  = max(tags, key=tags.get) # DA
    top.append((freq, w, tag))

top = sorted(top, reverse=True)[:100000] # top 100,000
top = ["%s %s" % (w, tag) for freq, w, tag in top if w]

open("es-lexicon.txt", "w").write(BOM_UTF8 + "\n".join(top).encode("utf-8"))

The result is stored in an es-lexicon.txt file for reuse.
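To see why a lexicon alone already gets us a decent baseline, the lookup-with-noun-default strategy can be sketched as follows (the mini-lexicon and tag_baseline() are hypothetical; the real es-lexicon.txt has 100,000 entries):

```python
# Hypothetical mini-lexicon; in practice, load es-lexicon.txt instead.
lexicon = {"el": "DA", "gato": "NCS", "duerme": "VMI", ".": "Fp"}

def tag_baseline(words, lexicon, default="NCS"):
    # Known words get their most frequent tag; unknown words default to noun.
    return [(w, lexicon.get(w, default)) for w in words]

print(tag_baseline(["el", "gato", "duerme", "."], lexicon))
```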


3. Extracting contextual rules with NLTK

The Natural Language Toolkit for Python (NLTK) has an implementation of Brill's algorithm that we can use to learn contextual rules. First, we anonymize proper nouns, because we want to learn general rules (e.g., "any proper noun followed by a verb") instead of rules about specific names (e.g., "Puerto Rico followed by a verb").

sentences = wikicorpus(words=1000000)

ANONYMOUS = "anonymous"
for s in sentences:
    for i, (w, tag) in enumerate(s):
        if tag == "NP": # NP = proper noun in Parole tagset.
            s[i] = (ANONYMOUS, "NP")

We can then train NLTK's FastBrillTaggerTrainer. It is based on a unigram tagger, which is simply a lexicon of known words and their part-of-speech tag. It will then boost the accuracy with a set of contextual rules that change a word's part-of-speech tag depending on the surrounding words.

from nltk.tag import UnigramTagger
from nltk.tag import FastBrillTaggerTrainer

from nltk.tag.brill import SymmetricProximateTokensTemplate
from nltk.tag.brill import ProximateTokensTemplate
from nltk.tag.brill import ProximateTagsRule
from nltk.tag.brill import ProximateWordsRule

ctx = [ # Context = surrounding words and tags.
    SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 1)),
    SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 2)),
    SymmetricProximateTokensTemplate(ProximateTagsRule,  (1, 3)),
    SymmetricProximateTokensTemplate(ProximateTagsRule,  (2, 2)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (0, 0)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
    ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1)),
]

tagger = UnigramTagger(sentences)
tagger = FastBrillTaggerTrainer(tagger, ctx, trace=0)
tagger = tagger.train(sentences, max_rules=100)

#print tagger.evaluate(wikicorpus(10000, start=1))  

Brill's algorithm uses an iterative approach to learn contextual rules. In short, this means that it tries different combinations of interesting rules to find a subset that produces the best tagging accuracy. This process is time-consuming (minutes or hours), so we want to store the final subset for reuse.

Brill's algorithm in NLTK defines context using indices. For example, (1, 2) in the previous script means: one or two words (or tags) after the current word. Brill's original implementation uses commands to describe context, e.g., NEXT1OR2WORD or NEXT1OR2TAG. Pattern also uses these commands, so we need to map NLTK's indices to the command set:

ctx = []

for rule in tagger.rules():
    a = rule.original_tag
    b = rule.replacement_tag
    c = rule._conditions
    if len(c) != 1: # More complex rules are ignored in this script.
        continue
    x = c[0][2]
    r = c[0][:2]
    if isinstance(rule, ProximateTagsRule):
        if r == (-1, -1): cmd = "PREVTAG"
        if r == (+1, +1): cmd = "NEXTTAG"
        if r == (-2, -1): cmd = "PREV1OR2TAG"
        if r == (+1, +2): cmd = "NEXT1OR2TAG"
        if r == (-3, -1): cmd = "PREV1OR2OR3TAG"
        if r == (+1, +3): cmd = "NEXT1OR2OR3TAG"
        if r == (-2, -2): cmd = "PREV2TAG"
        if r == (+2, +2): cmd = "NEXT2TAG"
    if isinstance(rule, ProximateWordsRule):
        if r == (+0, +0): cmd = "CURWD"
        if r == (-1, -1): cmd = "PREVWD"
        if r == (+1, +1): cmd = "NEXTWD"
        if r == (-2, -1): cmd = "PREV1OR2WD"
        if r == (+1, +2): cmd = "NEXT1OR2WD"
    ctx.append("%s %s %s %s" % (a, b, cmd, x))

open("es-context.txt", "w").write(BOM_UTF8 + "\n".join(ctx).encode("utf-8"))

We end up with a file es-context.txt containing 100 contextual rules in a format usable with Pattern.
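To illustrate what such a rule does, consider the PREVTAG command. A hypothetical sketch (apply_prevtag() is illustrative, not Pattern's actual code), where a rule "VB NN PREVTAG DT" retags a verb reading after a determiner as a noun, cf. "a can of soda":

```python
def apply_prevtag(tagged, old, new, prev):
    # Retag a word from `old` to `new` if the previous word's tag is `prev`.
    for i in range(1, len(tagged)):
        w, t = tagged[i]
        if t == old and tagged[i - 1][1] == prev:
            tagged[i] = (w, new)
    return tagged

print(apply_prevtag([("a", "DT"), ("can", "VB")], "VB", "NN", "DT"))
```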


4. Rules for unknown words based on word suffixes

By default, unknown words (= not in lexicon) will be tagged as nouns. We can improve this with morphological rules, in other words, rules based on word prefixes and suffixes. For example, English words ending in -ly are usually adverbs: really, extremely, and so on. Similarly, Spanish words that end in -mente are adverbs. Spanish words ending in -ando or -iendo are verbs in the present participle: hablando, escribiendo, and so on.

# {"mente": {"RG": 4860, "SP": 8, "VMS": 7}}
suffix = defaultdict(lambda: defaultdict(int))

for sentence in wikicorpus(1000000):
    for w, tag in sentence:
        x = w[-5:] # Last 5 characters.
        if len(x) < len(w) and tag != "NP":
            suffix[x][tag] += 1

top = []
for x, tags in suffix.items():
    tag = max(tags, key=tags.get) # RG
    f1  = sum(tags.values())      # 4860 + 8 + 7
    f2  = tags[tag] / float(f1)   # 4860 / 4875
    top.append((f1, f2, x, tag))

top = sorted(top, reverse=True)
top = filter(lambda (f1, f2, x, tag): f1 >= 10 and f2 > 0.8, top)
top = filter(lambda (f1, f2, x, tag): tag != "NCS", top)
top = top[:100] 
top = ["%s %s fhassuf %s %s" % ("NCS", x, len(x), tag) for f1, f2, x, tag in top]

open("es-morphology.txt", "w").write(BOM_UTF8 + "\n".join(top).encode("utf-8"))

We end up with a file es-morphology.txt containing 100 suffix rules in a format usable with Pattern.

To clarify this, examine the table below. We read 1 million words from Wikicorpus, of which 4,875 end in -mente. 98% of those are tagged RG (the Parole tag for adverb, RB in the Penn tagset); the same suffix is also tagged SP (preposition) 8 times and VMS (verb) 7 times.

The above script has two constraints for rule selection: f1 >= 10 discards rules that match fewer than 10 words, and f2 > 0.8 discards rules for which the most frequent tag covers less than 80% of the cases. This means that unknown words ending in -mente will be tagged as adverbs, since we consider the other cases negligible. We can experiment with different settings to see if the accuracy of the tagger improves.

Frequency  Suffix  Parts-of-speech          Example
     5986  -ación  99% NCS + 1% SP          derivación
     4875  -mente  98% RG + 1% SP + 1% VMS  correctamente
     3276  -iones  99% NCP + 1% VMS         dimensiones
     1824  -mbién  100% RG                  también
     1247  -embre  99% W + 1% NCS           septiembre
     1134  -dades  99% NCP + 1% SP          posibilidades
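A suffix rule such as "NCS mente fhassuf 5 RG" reads as: if a word tagged NCS (the noun default) has the 5-character suffix -mente, retag it as RG. A hypothetical sketch of how such rules apply (apply_fhassuf() and the -iendo rule are illustrative, not Pattern's actual code):

```python
def apply_fhassuf(word, tag, rules):
    # Each rule = (expected tag, suffix, suffix length, new tag),
    # mirroring a line such as "NCS mente fhassuf 5 RG".
    for old, suffix, n, new in rules:
        if tag == old and word[-n:] == suffix:
            return new
    return tag

rules = [("NCS", "mente", 5, "RG"), ("NCS", "iendo", 5, "VMG")]
print(apply_fhassuf("correctamente", "NCS", rules))  # RG
```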


5. Subclassing the pattern.text Parser class

In summary, we constructed an es-lexicon.txt with the part-of-speech tags of known words (steps 1-2) together with an es-context.txt (step 3) and an es-morphology.txt (step 4). We can use these to create a parser for Spanish by subclassing the base Parser in the pattern.text module. The pattern.text module has base classes for Parser, Lexicon, Morphology, etc. Take a moment to review the source code, and the source code of other parsers in Pattern. You'll notice that all parsers follow the same simple steps. A template for new parsers is included in pattern.text.xx.

The Parser base class has the following methods with default behavior:

  • Parser.find_tokens()  finds sentence markers (.?!) and splits punctuation marks from words,
  • Parser.find_tags()    finds word part-of-speech tags,
  • Parser.find_chunks()  finds words that belong together (e.g., the black cats),
  • Parser.find_labels()  finds word roles in the sentence (e.g., subject and object), 
  • Parser.find_lemmata() finds word base forms (cats → cat),
  • Parser.parse()        executes the above steps on a given string.

We can create an instance of the SpanishParser and feed it our data. We will need to redefine find_tags() to map Parole tags to Penn Treebank tags (which all other parsers in Pattern use as well).

from pattern.text import Parser

    "AO": "JJ"  ,   "I": "UH"  , "VAG": "VBG",
    "AQ": "JJ"  ,  "NC": "NN"  , "VAI": "MD", 
    "CC": "CC"  , "NCS": "NN"  , "VAN": "MD", 
    "CS": "IN"  , "NCP": "NNS" , "VAS": "MD", 
    "DA": "DT"  ,  "NP": "NNP" , "VMG": "VBG",
    "DD": "DT"  ,  "P0": "PRP" , "VMI": "VB", 
    "DI": "DT"  ,  "PD": "DT"  , "VMM": "VB", 
    "DP": "PRP$",  "PI": "DT"  , "VMN": "VB", 
    "DT": "DT"  ,  "PP": "PRP" , "VMP": "VBN",
    "Fa": "."   ,  "PR": "WP$" , "VMS": "VB", 
    "Fc": ","   ,  "PT": "WP$" , "VSG": "VBG",
    "Fd": ":"   ,  "PX": "PRP$", "VSI": "VB", 
    "Fe": "\""  ,  "RG": "RB"  , "VSN": "VB", 
    "Fg": "."   ,  "RN": "RB"  , "VSP": "VBN",
    "Fh": "."   ,  "SP": "IN"  , "VSS": "VB", 
    "Fi": "."   ,                  "W": "NN", 
    "Fp": "."   ,                  "Z": "CD", 
    "Fr": "."   ,                 "Zd": "CD", 
    "Fs": "."   ,                 "Zm": "CD", 
   "Fpa": "("   ,                 "Zp": "CD",
   "Fpt": ")"   ,    
    "Fx": "."   ,    
    "Fz": "."   

def parole2penntreebank(token, tag):
    return token, PAROLE.get(tag, tag)

class SpanishParser(Parser):
    def find_tags(self, tokens, **kwargs):
        # Parser.find_tags() can take an optional map(token, tag) function,
        # which returns an updated (token, tag)-tuple for each token. 
        kwargs.setdefault("map", parole2penntreebank)
        return Parser.find_tags(self, tokens, **kwargs)

Load the lexicon and the rules in an instance of SpanishParser:

from pattern.text import Lexicon

lexicon = Lexicon(
        path = "es-lexicon.txt",
  morphology = "es-morphology.txt",
     context = "es-context.txt",
    language = "es"
)

parser = SpanishParser(
     lexicon = lexicon,
     default = ("NCS", "NP", "Z"),
    language = "es"
)

def parse(s, *args, **kwargs):
    return parser.parse(s, *args, **kwargs)

It is still missing features (notably lemmatization) but our Spanish parser is essentially ready for use:

print parse(u"El gato se sentó en la alfombra.", chunks=False)
El/DT gato/NN se/PRP sentó/VB en/IN la/DT alfombra/NN ./.


6. Testing the accuracy of the parser

The following script can be used to test the accuracy of the parser against Wikicorpus. We used 1.5 million words with 300 contextual and 100 morphological rules for an accuracy of about 91%. So about 9% of words get the wrong tag, but the parser is fast and compact – the data files are about 1MB in size. Note how we pass map=None to the parse() command. This parameter is passed on to SpanishParser.find_tags(), so that the original Parole tags are returned, which we can compare to the tags in Wikicorpus.

i = 0
n = 0
for s1 in wikicorpus(100000, start=1):
    s2 = " ".join(w for w, tag in s1)
    s2 = parse(s2, tags=True, chunks=False, map=None).split()[0]
    for (w1, tag1), (w2, tag2) in zip(s1, s2):
        if tag1 == tag2:
            i += 1
        n += 1
print float(i) / n