Natural Language Processing in Python

From Training Material
Jump to: navigation, search


Title

Natural Language Processing in Python
Author
Krzysztof Mędrela
Subfooter

Natural Language Processing in Python          Krzysztof Mędrela


Contents

1. Accessing Text Corpora and Lexical Resources ¶

A text corpus is a structured collections of texts. It’s a collection of files inside a directory.

Corporas can be categorized by genre or topic. Sometimes the categories may overlap, that is one text may belong to more than one genre or topic.

1.1. Common Sources for Corpora ¶

There are some builtin corpuses distributed along with nltk library. All corpuses live inside nltk.corpus subpackage. One of the corpuses is named gutenberg.

Corpuses may need to be downloaded. Execute the following code:

import nltk
print(nltk.corpus.gutenberg.fileids())

If it results in an error:

**********************************************************************
  Resource u'corpora/gutenberg' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/chris/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

You need to open the NLTK downloader by typing:

nltk.download()

A new window appears. You need to go to “corpora” tab. And then install appropriate corpuses. Here is how the window looks like:

Downloading-corporas.png

Once you do this, you can close the window and again execute the following code. It’ll print file names in this corpus.

import nltk
nltk.corpus.gutenberg.fileids()
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

We’ll also some other corporas, so please download entire collection:

nltk.download()

In the collections tab, select book collection and click download. Wait while all files are downloaded.

Downloading-collection.png

You can specify where nltk should look for corpora:

nltk.data.path
['/home/chris/nltk_data',
 '/usr/share/nltk_data',
 '/usr/local/share/nltk_data',
 '/usr/lib/nltk_data',
 '/usr/local/lib/nltk_data']

Corpuses most important methods¶

from nltk.corpus import gutenberg

raw(filename) let you get the file content as a string:

print(gutenberg.raw('bible-kjv.txt')[:500])  # display only first 500 characters
[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Da

raw() returns raw content of entire corpus. All files are concatenated:

print(gutenberg.raw()[:500])
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his

words(filename) splits the file content into list of words:

gutenberg.words('bible-kjv.txt')
[u'[', u'The', u'King', u'James', u'Bible', u']', ...]

Get list of sentences, where each sentence is represented as a list of words:

gutenberg.sents('bible-kjv.txt')
[[u'[', u'The', u'King', u'James', u'Bible', u']'], [u'The', u'Old', u'Testament', u'of', u'the', u'King', u'James', u'Bible'], ...]

Find the absolute path to given corpora file. Useful if you want to know where it lives.

gutenberg.abspath('bible-kjv.txt')
FileSystemPathPointer(u'/home/chris/nltk_data/corpora/gutenberg/bible-kjv.txt')

Exercise¶

Explore file austen-persuasion.txt from gutenberg corpus. How many words does the file has? How many different words are there?

New concepts introduced in this exercise:

  • len()
  • set()

Solution¶

We can get a list of words and count them:

words = gutenberg.words('austen-persuasion.txt')
len(words)
98171

Given the list of words, we can delete duplications by converting it to a set, and then, count unique words.

len(set(words))
6132

However, if we do care about letter case and we don’t want to treat own and Own as two different words, we can lower case all words:

all_words = len(set(w.lower() for w in words))
all_words
5835

And if we want to filter out punctuation:

unique = len(set(w.lower() for w in words if w.isalpha()))
unique
5739

You can compute lexical dispersion:

all_words / unique
1

Exercise¶

A trigram is a tuple of three words. You can generate trigrams for any text:

words = 'The Tragedie of Macbeth by William Shakespeare . The Tragedie is great .'.split(' ')
nltk.trigrams(words)
<generator object trigrams at 0xadf7a2c>

You can generate random text by following this algorithm:

  1. Start from two random words. The random words can be considered the seed. Output both words.
  2. Output any word that appears in corpus next to the last two outputted word.
  3. Repeat step 2 as many times as you want.

Write a program that generates random text using this algorithm. Hint: use random.choice to choose randomly one element from a list. New concepts introduced in this exercise:

  • nltk.trigram
  • tuples (pairs, unpacking)
  • dictionaries (creating new key, in, accessing existing key, iterating)
  • list.append
  • generate the mapper
  • defaultdict
  • generate the mapper with defaultdict
  • str.lower()
  • list comprehension
  • str.isalpha()
  • show the stub of generate function and its usage
  • write generate function

Solution¶

from collections import defaultdict
from random import choice

words = (word.lower() for word in gutenberg.words() if word.isalpha())
trigrams = nltk.trigrams(words)
mapper = defaultdict(list)
for a, b, c in trigrams:
    mapper[a, b].append(c)

def generate(word1, word2, N):
    sentence = [word1, word2]
    for i in xrange(N):
        new_word = choice(mapper[word1, word2])
        sentence.append(new_word)
        word1, word2 = word2, new_word
    return " ".join(sentence)

# word1, word2 = choice(mapper.keys())  # use for random results
word1, word2 = 'i', 'can'
generate(word1, word2, 30)

u'i can make me say not ye that speak evil against you and i fear this glorious thing is against them since by many a good thousand a year two thousand pounds'

Exercise¶

Write function find_language that takes a word and returns a list of language that this word may be in. Use udhr corpus. Narrow down your search scope to files in Latin1 encoding. Lookup words one and ein. New concepts introduced in this exercise:

  • str.endswith
  • and boolean operator

Solution¶

from nltk.corpus import udhr

def find_language(word):
    word = word.lower()
    return [
        language
        for language in udhr.fileids()
        if language.endswith('Latin1')
        and word in nltk.corpus.udhr.words(language)
    ]

find_language('one')
[u'English-Latin1', u'NigerianPidginEnglish-Latin1', u'Picard-Latin1']
find_language('ein')
[u'Faroese-Latin1',
 u'German_Deutsch-Latin1',
 u'Norwegian_Norsk-Nynorsk-Latin1']

1.2. Conditional Frequency Distributions¶

Frequency distribution¶

A frequency distribution tells you how often given word appears in a corpus. nltk library has a builtin class to compute frequency distribution:

fd = nltk.FreqDist(gutenberg.words('bible-kjv.txt'))

How many times does the word the appear?

fd['the']
62103

What percentage of all words is the word the?

fd.freq('the') * 100
6.1448329497533285

What are the most common 10 words?

fd.most_common(10)
[(u',', 70509),
 (u'the', 62103),
 (u':', 43766),
 (u'and', 38847),
 (u'of', 34480),
 (u'.', 26160),
 (u'to', 13396),
 (u'And', 12846),
 (u'that', 12576),
 (u'in', 12331)]

If you’re using jupyter notebook (formerly known as ipython notebook), include the following line:

%matplotlib inline

Let’s generate cumulative frequency plot for most common words:

fd.plot(30, cumulative=True)

2016-12-29 1017.png

Exercise¶

Find all words that occur at least ten times in brown corpus.

Solution¶

from nltk.corpus import brown
fd = nltk.FreqDist(brown.words())
frequent_words = [word
                  for word, frequency in fd.iteritems()
                  if frequency >= 10]
frequent_words[:20]
[u'woods',
 u'hanging',
 u'Western',
 u'co-operation',
 u'bringing',
 u'wooden',
 u'non-violent',
 u'inevitably',
 u'errors',
 u'cooking',
 u'Hamilton',
 u'College',
 u'Foundation',
 u'kids',
 u'climbed',
 u'controversy',
 u'rebel',
 u'golden',
 u'Harvey',
 u'stern']

Exercise¶

How frequently do letters appear in English text? Use swadesh corpora.

Solution¶

Let’s see how the corpus is structured:

from nltk.corpus import swadesh
swadesh.fileids()
[u'be',
 u'bg',
 u'bs',
 u'ca',
 u'cs',
 u'cu',
 u'de',
 u'en',
 u'es',
 u'fr',
 u'hr',
 u'it',
 u'la',
 u'mk',
 u'nl',
 u'pl',
 u'pt',
 u'ro',
 u'ru',
 u'sk',
 u'sl',
 u'sr',
 u'sw',
 u'uk']

Each file contains an example text in one language.

text = swadesh.raw('en')

# Filter out punctuation, spaces etc.
letters = [letter for letter in text.lower() if letter.isalpha()]

fd = nltk.FreqDist(letters)
fd.plot()

2016-12-29 1021.png

Exercise¶

Zipf’s Law states that the frequency of a word is inversely proportional to its rank. Does this law holds for English?

Hint: use logarithmic x and y axes when plotting number of occurrences against rank.

New concepts introduced in this exercise:

  • plotting with matplotlib: xscale(‘log’), yscale(‘log’), plot(array)

Solution¶

fd = nltk.FreqDist(gutenberg.words('bible-kjv.txt'))
occurences = [o for word, o in fd.most_common()]

from matplotlib import pylab
pylab.xscale('log')
pylab.yscale('log')
pylab.plot(occurences)
[<matplotlib.lines.Line2D at 0x15fe490c>]

2016-12-29 1026.png

Conditional frequency distribution¶

A conditional frequency distribution is a collection of frequency distributions, each one for a different “condition”. The condition is usually the category of a text.

nltk provides ConditionalFreqDist that accepts list of (category, word) pairs, in contrast to FreqDist which accepts simply list of words.

We’ll work on brown corpus where files have categories:

from nltk.corpus import brown
brown.categories()
[u'adventure',
 u'belles_lettres',
 u'editorial',
 u'fiction',
 u'government',
 u'hobbies',
 u'humor',
 u'learned',
 u'lore',
 u'mystery',
 u'news',
 u'religion',
 u'reviews',
 u'romance',
 u'science_fiction']

Let’s count the conditional frequency distribution:

from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (category, word)
    for category in brown.categories()
    for word in brown.words(categories=category))

You can see, that indeed conditions are categories:

cfd.conditions()
[u'mystery',
 u'belles_lettres',
 u'humor',
 u'government',
 u'fiction',
 u'reviews',
 u'religion',
 u'romance',
 u'science_fiction',
 u'adventure',
 u'editorial',
 u'hobbies',
 u'lore',
 u'news',
 u'learned']

How many times does the word love appear in romances?

cfd['romance']['love']
32

1.3. Counting Words by Genre¶

Lets display more data. How many times does the most common words appear in different genres?
fd = nltk.FreqDist(brown.words())
words_and_counts = fd.most_common(10)
words = [x[0] for x in words_and_counts]
cfd.tabulate(samples=words)
                  the     ,     .    of   and    to     a    in  that    is
      adventure  3370  3488  4057  1322  1622  1309  1354   847   494    98
 belles_lettres  9726  9166  6397  6289  4282  4084  3308  3089  1896  1799
      editorial  3508  2766  2481  1976  1302  1554  1095  1001   578   744
        fiction  3423  3654  3639  1419  1696  1489  1281   916   530   144
     government  4143  3405  2493  3031  1923  1829   867  1319   489   649
        hobbies  4300  3849  3453  2390  2144  1797  1737  1427   514   959
          humor   930  1331   877   515   512   463   505   334   241   117
        learned 11079  8242  6773  7418  4237  3882  3215  3644  1695  2403
           lore  6328  5519  4367  3668  2758  2530  2304  2001   984  1007
        mystery  2573  2805  3326   903  1215  1284  1136   658   494   116
           news  5580  5188  4030  2849  2146  2116  1993  1893   802   732
       religion  2295  1913  1382  1494   921   882   655   724   475   533
        reviews  2048  2318  1549  1299  1103   706   874   656   336   513
        romance  2758  3899  3736  1186  1776  1502  1335   875   583   150
science_fiction   652   791   786   321   278   305   222   152   126    47

Exercise¶

How frequently do letters appear in different languages? Use swadesh corpus. New concepts introduced in this exercise:

  • list comprehension with multiple _for_s
  • cfd.tabulate(**conditions*=...)* and cfd.plot(**conditions*=...)*

Solution¶

cfd = nltk.ConditionalFreqDist(
    (lang, letter)
    for lang in swadesh.fileids()
    for letter in swadesh.raw(lang).lower()
    if letter.isalpha())
cfd.tabulate(samples='abcdefghijklmnopqrstuvwxyz')
     a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
be   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
bg   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
bs 124  16  13  31  71   0  21   7 117  46  43  37  22  40  77  29   0  57  46  95  36  37   0   0   0  16
ca 149  16  48  23 115  13  30   3  49   7   0 104  34  57  65  39  10 149  53  60  59  15   0   6   6   0
cs  67  13  27  36  68   0   0  38  25  12  40  45  27  37  69  32   0  43  39 107  26  36   0   0  11  20
cu   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
de  68  26  43  27 196  29  34  64  67   3  20  52  25 134  18   7   1  73  71  55  36   6  31   0   0  14
en  75  20  19  31 108  29  24  53  56   0  18  57  24  56  65  10   1  67  52  77  34   8  34   0  16   2
es 158  19  56  42 141  10  18  21  59  13   0  69  32  60 140  40   9 139  64  49  46  21   0   0   2  10
fr  63  13  44  27 181  19  22  14  79   7   0  53  32  72  81  32   9 135  59  57  76  23   0   3   1   1
hr 121  16  13  32  71   0  20   7 117  46  42  37  22  41  80  29   0  56  46  96  35  37   0   0   0  16
it 130  17  76  37 174  18  36  13 105   0   0  53  30  74 135  30   8 121  57  70  45  22   0   0   0   2
la 108  15  50  39 169  17  25   6 109   0   0  56  40  69  48  30  15 132 109  49 113  30   0   7   0   0
mk   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
nl  73  20  15  51 182   4  39  25  65  20  27  48  23 121  70  13   0  72  42  62  29  37  35   0   0  21
pl  96  16  39  34  69   0  20  11  86  12  44  21  26  41  78  36   0  60  41  36  17   0  47   0  57  63
pt 189  19  59  44 158  20  32  27  70   5   0  51  55  65 157  43  14 187  72  63  63  31   0   5   0   5
ro 214  24  79  35 147  20  32  10  92   3   0  44  47  84  52  56   0 100  60  83  79  16   0   1   0   7
ru   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
sk  91  11  21  31  83   1   1  30  51   6  48  34  26  40  68  32   0  59  45  45  20  41   0   0  19  16
sl  94  21   8  32 104   0  17  13 112  23  44  45  25  41  76  29   0  60  38  93  17  34   0   0   0  20
sr   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
sw 121  17   6  43  47  21  42  20  35  12  37  51  40  73  25   8   0  78  60  73  25  33   0   0  12   0
uk   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

You can even plot the results:

cfd.plot(samples='abcdefghijklmnopqrstuvwxyz',
         conditions=['en', 'ru', 'pl', 'de'])

2016-12-29 1045.png

Exercise¶

Which initial letters are more frequent for males versus females? Use names corpus.

New concepts introduced in this exercise:

  • str[0]

Solution¶

from nltk.corpus import names

cfd = nltk.ConditionalFreqDist(
    (gender, name[0].lower())
    for gender in names.fileids()
    for name in names.words(gender))
cfd.plot()

2016-12-29 1048.png

Exercise¶

What is the lexical diversity of different genres? Lexical diversity is the ratio of number of words and number of unique words. That is, it’s the average number of times each word appear in the text.

New concepts introduced in this exercise:

  • 3/2==1 issue

Solution¶

def get_lexical_diversity(words):
    return len(words) / float(len(set(words)))

for category in brown.categories():
    words = brown.words(categories=[category])
    diversity = get_lexical_diversity(words)
    print('{:5.2f} {}'.format(diversity, category))
7.81 adventure
 9.40 belles_lettres
 6.23 editorial
 7.36 fiction
 8.57 government
 6.90 hobbies
 4.32 humor
10.79 learned
 7.61 lore
 8.19 mystery
 6.99 news
 6.18 religion
 4.72 reviews
 8.28 romance
 4.48 science_fiction

1.4. Creating Own Corpus¶

A corpus is just a collection of files. You can easily create your own corpus by enumerating a list of all files creating your corpus. It’s wise to keep all the files in one directory.

my_corpus = nltk.corpus.PlaintextCorpusReader('/home/chris/nltk_data/corpora/abc/', '.*')
my_corpus.fileids()
['README', 'rural.txt', 'science.txt']

Your own corporas have access to all the methods available in standard corporas:

my_corpus.words()
[u'Australian', u'Broadcasting', u'Commission', ...]

1.5. Pronouncing Dictionary¶

A pronouncing dictionary is a lexical resource where for each word you’re given pronunciation. This pronunciation was designed to be used by speech synthesizers.

entries = nltk.corpus.cmudict.entries()
len(entries)
133737

133737 The entries are basically pairs of word and pronunciation. Pronunciation is a list of phones represented as strings. The interpretation of pronunciation is described in more details on Wikipedia

for entry in entries[39943:39951]:
    print(entry)
(u'explorer', [u'IH0', u'K', u'S', u'P', u'L', u'AO1', u'R', u'ER0'])
(u'explorers', [u'IH0', u'K', u'S', u'P', u'L', u'AO1', u'R', u'ER0', u'Z'])
(u'explores', [u'IH0', u'K', u'S', u'P', u'L', u'AO1', u'R', u'Z'])
(u'exploring', [u'IH0', u'K', u'S', u'P', u'L', u'AO1', u'R', u'IH0', u'NG'])
(u'explosion', [u'IH0', u'K', u'S', u'P', u'L', u'OW1', u'ZH', u'AH0', u'N'])
(u'explosions', [u'IH0', u'K', u'S', u'P', u'L', u'OW1', u'ZH', u'AH0', u'N', u'Z'])
(u'explosive', [u'IH0', u'K', u'S', u'P', u'L', u'OW1', u'S', u'IH0', u'V'])
(u'explosively', [u'EH2', u'K', u'S', u'P', u'L', u'OW1', u'S', u'IH0', u'V', u'L', u'IY0'])

1.6. Shoebox and Toolbox Lexicons¶

A Toolbox file, previously known as Shoebox file, is one of the most popular tools used by linguists. This kind of file let you associate as many data about each word as you want. On the other side, the loose structure of Toolbox files makes it hard to use it.

Each line in the file represents one word. For each word, there is a dictionary of information about this word. The keys of this dictionary represents what type of data it is. For example, ps stands for part-of-speech. The value of this dictionary is actual data. This dictionary is represented as a list of pairs instead of pure Python dictionary.

from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[:2]
[(u'kaa',
  [(u'ps', u'V'),
   (u'pt', u'A'),
   (u'ge', u'gag'),
   (u'tkp', u'nek i pas'),
   (u'dcsv', u'true'),
   (u'vx', u'1'),
   (u'sc', u'???'),
   (u'dt', u'29/Oct/2005'),
   (u'ex', u'Apoka ira kaaroi aioa-ia reoreopaoro.'),
   (u'xp', u'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   (u'xe', u'Apoka is gagging from food while talking.')]),
 (u'kaa',
  [(u'ps', u'V'),
   (u'pt', u'B'),
   (u'ge', u'strangle'),
   (u'tkp', u'pasim nek'),
   (u'arg', u'O'),
   (u'vx', u'2'),
   (u'dt', u'07/Oct/2006'),
   (u'ex', u'Rera rauroro rera kaarevoi.'),
   (u'xp', u'Em i holim pas em na nekim em.'),
   (u'xe', u'He is holding him and strangling him.'),
   (u'ex',
    u'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'),
   (u'xp',
    u'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   (u'xe',
    u"They strangled the man's neck with rope because he was very stubborn and arrogant."),
   (u'ex',
    u'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
   (u'xp',
    u'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'),
   (u'xe',
    u"They strangled the man's neck with a rope. And he couldn't breathe and he died.")])]

1.7. Senses and Synonyms¶

Let’s consider the following sentences:

Benz is credited with the invention of the motorcar.

Benz is credited with the invention of the automobile.

Both sentences have exactly the same meaning, because motorcar and automobile are synonyms. These words have the same meaning.

WordNet is semantically oriented dictionary of English. It’s aware that some words are synonyms.

from nltk.corpus import wordnet as wn

We can find all synsets of motorcar. A synset is a collections of synonyms.

wn.synsets('motorcar')
[Synset('car.n.01')]

Let’s see what are the synonyms in this synset:

wn.synset('car.n.01').lemma_names()
[u'car', u'auto', u'automobile', u'machine', u'motorcar']

WordNet contains much more data, like examples for given synset:

wn.synset('car.n.01').examples()
[u'he needs a car to get to work']

And even definition:

wn.synset('car.n.01').definition()
u'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

Word car is ambiguous and belongs to multiple synsets:

for synset in wn.synsets('car'):
    print("{} -> {}".format(
        synset.name(),
car.n.01 -> car, auto, automobile, machine, motorcar
car.n.02 -> car, railcar, railway_car, railroad_car
car.n.03 -> car, gondola
car.n.04 -> car, elevator_car
cable_car.n.01 -> cable_car, car

        ", ".join(synset.lemma_names())))

1.8. Hierarchies¶

WordNet is aware of hierarchies of words.

Hyponyms are more specific concepts of a general concept. For example, ambulance is a hyponym of car.

motorcar = wn.synset('car.n.01')
hyponyms = motorcar.hyponyms()
hyponyms[0]
Synset('ambulance.n.01')

We can display all words that are more specific than motorcar:

[u'Model_T',
 u'S.U.V.',
 u'SUV',
 u'Stanley_Steamer',
 u'ambulance',
 u'beach_waggon',
 u'beach_wagon',
 u'bus',
 u'cab',
 u'compact',
 u'compact_car',
 u'convertible',
 u'coupe',
 u'cruiser',
 u'electric',
 u'electric_automobile',
 u'electric_car',
 u'estate_car',
 u'gas_guzzler',
 u'hack',
 u'hardtop',
 u'hatchback',
 u'heap',
 u'horseless_carriage',
 u'hot-rod',
 u'hot_rod',
 u'jalopy',
 u'jeep',
 u'landrover',
 u'limo',
 u'limousine',
 u'loaner',
 u'minicar',
 u'minivan',
 u'pace_car',
 u'patrol_car',
 u'phaeton',
 u'police_car',
 u'police_cruiser',
 u'prowl_car',
 u'race_car',
 u'racer',
 u'racing_car',
 u'roadster',
 u'runabout',
 u'saloon',
 u'secondhand_car',
 u'sedan',
 u'sport_car',
 u'sport_utility',
 u'sport_utility_vehicle',
 u'sports_car',
 u'squad_car',
 u'station_waggon',
 u'station_wagon',
 u'stock_car',
 u'subcompact',
 u'subcompact_car',
 u'taxi',
 u'taxicab',
 u'tourer',
 u'touring_car',
 u'two-seater',
 u'used-car',
 u'waggon',
 u'wagon']

A hypernym is the opposite to a hyponym. For example, car is a hypernym of ambulance.

ambulance = wn.synset('ambulance.n.01')
ambulance.hypernyms()
[Synset('car.n.01')]

Exercise¶

What percentage of noun synsets have no hyponyms?

You can get all noun synsets using wn.all_synsets('n').

New concepts introduced in this exercise:

  • sum()
  • wn.all_synsets

Solution¶

all = sum(1 for x in wn.all_synsets('n'))
no_hyponyms = sum(1 for n in wn.all_synsets('n') if n.hyponyms())
no_hyponyms / float(all) * 100
20.328807160689276

Exercise¶

What is the branching factor of noun hypernyms? That is, how many hyponyms on average has each noun synset?

Solution¶

hyponyms = [len(n.hyponyms()) for n in wn.all_synsets('n')]
sum(hyponyms) / float(len(hyponyms))
0.9237045606770992

1.9. Lexical Relations: Meronyms, Holonyms¶

A word is meronym of another word if the former is component of the latter. For example, trunk is a part of tree, so trunk is a meronym of a tree.

wn.synset('tree.n.01').part_meronyms()
[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]
wn.synset('tree.n.01').substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

A holonym is kind of the opposite to a meronym. Holonyms of a word are the things it’s contained in. For example, a collection of trees forms a forest, so forest is a holonym of tree.

wn.synset('tree.n.01').member_holonyms()
[Synset('forest.n.01')]

Note that if A is a meronym of B, it doesn’t mean that B is a holonym of A. For example, tree is not a holonym of trunk.

wn.synset('trunk.n.01').member_holonyms()
[]

1.10. Semantic Similarity¶

We have seen that WordNet is aware of semantic relations between words.

All words create a network where each word is a node and hyponyms and hypernyms relations are edges.

If two synsets are closely related, they must share a very specific hypernym:

right_whale = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
right_whale.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]

And that hypernym must be low down in the hierarchy:

wn.synset('whale.n.02').min_depth()
13

In contrast, two synsets that are not related at all share a very general hypernym:

novel = wn.synset('novel.n.01')
right_whale.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]

This hypernym is actually the root of the hierarchy:

wn.synset('entity.n.01').min_depth()
0

Exercise¶

Given list of words, find similarity of each pair of synsets. Use synset.path_similarity(another_synset) method.

New concepts introduced in this exercise:

  • synset.path_similarity
  • write solution without string formatting
  • str.split
  • str.join
  • str.format

Solution¶

words = 'food fruit car'.split(' ')

synsets = [synset
           for word in words
           for synset in wn.synsets(word, 'n')]

for s in synsets:
    similarities = [s.path_similarity(t)*100 for t in synsets]
    row = ' '.join('{:3.0f}'.format(s) for s in similarities)
    print('{:20} {}'.format(s.name(), row))
food.n.01            100  20  10   9  10   8   8   9   8   8   8
food.n.02             20 100  10   9  10   8   8   9   8   8   8
food.n.03             10  10 100   7   8  10   6   7   7   7   7
fruit.n.01             9   9   7 100  10   6   8   9   8   8   8
yield.n.03            10  10   8  10 100   6  10  12  11  11  11
fruit.n.03             8   8  10   6   6 100   5   6   6   6   6
car.n.01               8   8   6   8  10   5 100  20   8   8   8
car.n.02               9   9   7   9  12   6  20 100  10  10  10
car.n.03               8   8   7   8  11   6   8  10 100  33  33
car.n.04               8   8   7   8  11   6   8  10  33 100  33
cable_car.n.01         8   8   7   8  11   6   8  10  33  33 100

2. Processing Raw Text¶

In this chapter, we need the following imports:

import nltk, re

2.0. Accessing raw text¶

Downloading files¶

from urllib import urlopen

url = "http://www.gutenberg.org/files/1112/1112.txt"
raw = urlopen(url).read()
print(raw[:200])
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give i

As you can see, there is some metadata at the beginning of the file.

print(raw[3040:4290])
                          THE PROLOGUE

                        Enter Chorus.


  Chor. Two households, both alike in dignity,
    In fair Verona, where we lay our scene,
    From ancient grudge break to new mutiny,
    Where civil blood makes civil hands unclean.
    From forth the fatal loins of these two foes
    A pair of star-cross'd lovers take their life;
    Whose misadventur'd piteous overthrows
    Doth with their death bury their parents' strife.
    The fearful passage of their death-mark'd love,
    And the continuance of their parents' rage,
    Which, but their children's end, naught could remove,
    Is now the two hours' traffic of our stage;
    The which if you with patient ears attend,
    What here shall miss, our toil shall strive to mend.
                                                         [Exit.]




ACT I. Scene I.
Verona. A public place.

Enter Sampson and Gregory (with swords and bucklers) of the house
of Capulet.


  Samp. Gregory, on my word, we'll not carry coals.

  Greg. No, for then we should be colliers.

  Samp. I mean, an we be in choler, we'll draw.

  Greg. Ay, while you live, draw your neck out of collar.

  Samp. I strike quickly, being moved.

Let’s analyze this text

tokens = nltk.word_tokenize(raw)
tokens[:20]
['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Romeo',
 'and',
 'Juliet',
 ',',
 'by',
 'William',
 'Shakespeare',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone']

Let’s find collocations – sequences of words that co-occur more often thatn would be expected by chance.

text = nltk.Text(tokens)
text.collocations(num=50)
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Gutenberg-
tm electronic; Archive Foundation; electronic works; Gutenberg
Literary; United States; thou art; thou wilt; William Shakespeare;
thou hast; art thou; public domain; electronic work; Friar Laurence;
County Paris; set forth; Gutenberg-tm License; Chief Watch; thousand
times; PROJECT GUTENBERG; good night; Scene III; Enter Juliet;
copyright holder; Enter Romeo; Make haste; Complete Works; pray thee;
Enter Friar; Wilt thou; Friar John; upon thy; free distribution; woful
day; paragraph 1.F.3; Lammas Eve; Plain Vanilla; considerable effort;
intellectual property; restrictions whatsoever; solely singular;
derivative works; Enter Old; honest gentleman; Thou shalt; Enter
Benvolio; Art thou; Hast thou

2.1. Printing¶

In Python 2, print is a special keyword:

text = 'In 2015, e-commerce sales by businesses with 10 or more employees'
print text
In 2015, e-commerce sales by businesses with 10 or more employees

In Python 3, print is a function and parenthesis are required:

print(text)
In 2015, e-commerce sales by businesses with 10 or more employees

The above code works also in Python 2 - the parenthesis are simply ignored.

2.2. Truncating¶

Truncating restricts the maximum length of strings. We truncate text, so that it has a substring of the first specified number of characters.

print(text[:20])
In 2015, e-commerce

2.3. Extracting Parts of String¶

Also named slicing:

print(text[5:20])
15, e-commerce

This works also for lists:

l = ['first', 'second', 'third', '4th', '5th']
print(l[2:4])
['third', '4th']

2.4. Accessing Individual Characters¶

This works in the same way as accessing individual elements in a list:

print(text[5])
1

Or you can get the n-th element from the end:

print(text[-5])
0

2.5. Searching, Replacing, Splitting, Joining¶

Searching¶

text.find('sales')
20

or

'sales' in text
True

Replacing¶

text.replace('2015', '2016')
'In 2016, e-commerce sales by businesses with 10 or more employees'

Splitting¶

tokens = text.split(' ')
tokens
['In',
 '2015,',
 'e-commerce',
 'sales',
 'by',
 'businesses',
 'with',
 '10',
 'or',
 'more',
 'employees']

Joining¶

' '.join(tokens)
'In 2015, e-commerce sales by businesses with 10 or more employees'

or

print('\n'.join(tokens))
In
2015,
e-commerce
sales
by
businesses
with
10
or
more
employees

2.6. Regular Expressions¶

The simplest kinds of pattern matching can be handled with string methods:

'worked'.endswith('ed')
True

However, if you want to be able to do more complex stuff, you need a more powerful tool. Regular expressions come to rescue!

We need the right module:

import re

Use PyRegex for cheat sheet.

There are three most important methods in re module: search, match and findall.

We can use re.search function to look for a pattern anywhere inside a string.

help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.

We can use re.match function to look for a pattern at the beginning of the string.

help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

There is also re.findall method:

help(re.findall)
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

Exercises¶

We will use words corpus:

wordlist = nltk.corpus.words.words('en')

Find all English words in words corpus ending with ed (without using endswith method):

wordlist = nltk.corpus.words.words('en')
[w for w in wordlist if re.search('ed$', w)][:10]
[u'abaissed',
 u'abandoned',
 u'abased',
 u'abashed',
 u'abatised',
 u'abed',
 u'aborted',
 u'abridged',
 u'abscessed',
 u'absconded']

Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. Find matching words?

[w for w in wordlist if re.search('^..j..t..$', w)][:10]
[u'abjectly',
 u'adjuster',
 u'dejected',
 u'dejectly',
 u'injector',
 u'majestic',
 u'objectee',
 u'objector',
 u'rejecter',
 u'rejector']

The T9 system is used for entering text on mobile phones). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence?

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)][:10]
[u'gold', u'golf', u'hold', u'hole']

Exercise¶

Now, well use treebank corpus.
wsj = sorted(set(nltk.corpus.treebank.words()))
  1. Find all words that contain at least one digit.
  2. Find four-digit numbers.
  3. Find all integer numbers that have four or more digits.
  4. Find decimal numbers.
  5. Find words ending with ed or ing.

Solution¶

1. Find all words that contain at least one digit.

[w for w in wsj if re.search('[0-9]', w)][:10]
[u"'30s",
 u"'40s",
 u"'50s",
 u"'80s",
 u"'82",
 u"'86",
 u'-1',
 u'-10',
 u'-100',
 u'-101']

2. Find four-digit numbers:

[w for w in wsj if re.search('^[0-9]{4}$', w)][:10]
[u'1614',
 u'1637',
 u'1787',
 u'1901',
 u'1903',
 u'1917',
 u'1925',
 u'1929',
 u'1933',
 u'1934']

3. Find all integer numbers that have four or more digits.

[w for w in wsj if re.search('^[0-9]{4,}$', w)][:10]
[u'1614',
 u'1637',
 u'1787',
 u'1901',
 u'1903',
 u'1917',
 u'1925',
 u'1929',
 u'1933',
 u'1934']

4. Find decimal numbers.

[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)][:10]
[u'0.0085',
 u'0.05',
 u'0.1',
 u'0.16',
 u'0.2',
 u'0.25',
 u'0.28',
 u'0.3',
 u'0.4',
 u'0.5']

5. Find words ending with ed or ing.

[w for w in wsj if re.search('(ed|ing)$', w)][:10]
[u'62%-owned',
 u'Absorbed',
 u'According',
 u'Adopting',
 u'Advanced',
 u'Advancing',
 u'Alfred',
 u'Allied',
 u'Annualized',
 u'Anything']

6. Find words like black-and-white, father-in-law etc.

[w for w in wsj if re.search('^[a-z]+-[a-z]+-[a-z]+$', w)][:10]
[u'anti-morning-sickness',
 u'black-and-white',
 u'bread-and-butter',
 u'built-from-kit',
 u'cash-and-stock',
 u'cents-a-unit',
 u'computer-system-design',
 u'day-to-day',
 u'do-it-yourself',
 u'easy-to-read']

2.7. Detecting Word Patterns¶

Example with .startswith.

Find out all vowels:

word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
len(re.findall(r'[aeiou]', word))
16

Exercise¶

What are the most common sequences of two or more vowels in English language?

Use entire treebank corpus. What is the format of treebank corpus files?

Solution¶

wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs
                   for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common()
[(u'io', 549),
 (u'ea', 476),
 (u'ie', 331),
 (u'ou', 329),
 (u'ai', 261),
 (u'ia', 253),
 (u'ee', 217),
 (u'oo', 174),
 (u'ua', 109),
 (u'au', 106),
 (u'ue', 105),
 (u'ui', 95),
 (u'ei', 86),
 (u'oi', 65),
 (u'oa', 59),
 (u'eo', 39),
 (u'iou', 27),
 (u'eu', 18),
 (u'oe', 15),
 (u'iu', 14),
 (u'ae', 11),
 (u'eau', 10),
 (u'uo', 8),
 (u'oui', 6),
 (u'ao', 6),
 (u'uou', 5),
 (u'eou', 5),
 (u'uee', 4),
 (u'aa', 3),
 (u'ieu', 3),
 (u'uie', 3),
 (u'eei', 2),
 (u'iao', 1),
 (u'iai', 1),
 (u'aii', 1),
 (u'aiia', 1),
 (u'eea', 1),
 (u'ueui', 1),
 (u'ooi', 1),
 (u'oei', 1),
 (u'ioa', 1),
 (u'uu', 1),
 (u'aia', 1)]

Exercise¶

Extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Build an dictionary mapping from the sequences to list of words containing given sequence. New concepts introduced in this exercise:

  • return value of re.findall
  • two-letter strings can be interpreted as a pair
  • build the mapping using a dictionary
  • nltk.Index
  • build the mapping using an index

Solution¶

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv
       for w in rotokas_words
       for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
    a   e   i   o   u
k 418 148  94 420 173
p  83  31 105  34  51
r 187  63  84  89  79
s   0   0 100   2   1
t  47   8   0 148  37
v  93  27 105  48  49

Building the dictionary:

from collections import defaultdict
mapping = defaultdict(list)
for w in rotokas_words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        mapping[cv].append(w)
mapping['su']
[u'kasuari']

Building the mapping using nltk.Index:

cv_word_pairs = [(cv, w)
                 for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']
[u'kasuari']

2.8. Stemming¶

nltk library is well documented. You can learn a lot just by inspecting its modules and classes:

help(nltk.stem)
Help on package nltk.stem in nltk:

NAME
    nltk.stem - NLTK Stemmers

FILE
    /usr/local/lib/python2.7/dist-packages/nltk/stem/__init__.py

DESCRIPTION
    Interfaces used to remove morphological affixes from words, leaving
    only the word stem.  Stemming algorithms aim to remove those affixes
    required for eg. grammatical role, tense, derivational morphology
    leaving only the stem of the word.  This is a difficult problem due to
    irregular words (eg. common verbs in English), complicated
    morphological rules, and part-of-speech and sense ambiguities
    (eg. ceil- is not the stem of ceiling).

    StemmerI defines a standard interface for stemmers.

PACKAGE CONTENTS
    api
    isri
    lancaster
    porter
    regexp
    rslp
    snowball
    util
    wordnet

Exercise¶

However, as an exercise of writing regular expression, write a function that returns a stem of given word.

New concepts in this exercise:

  • more than one group in regular expressions; how to process them?

Solution¶

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

Results:

raw = """
DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony.
"""

tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]
['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

Builtin stemmers¶

We don’t have to write our own regular expressions. nltk library has builtin stemmers:

[x for x in dir(nltk) if 'stemmer' in x.lower()]
['ISRIStemmer',
 'LancasterStemmer',
 'PorterStemmer',
 'RSLPStemmer',
 'RegexpStemmer',
 'SnowballStemmer',
 'StemmerI']

The most important one is PorterStemmer. It can handle correctly words like lying (which is mapped to lie, not ly):

porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]
[u'DENNI',
 u':',
 u'Listen',
 u',',
 u'strang',
 u'women',
 u'lie',
 u'in',
 u'pond',
 u'distribut',
 u'sword',
 u'is',
 u'no',
 u'basi',
 u'for',
 u'a',
 u'system',
 u'of',
 u'govern',
 u'.',
 u'Suprem',
 u'execut',
 u'power',
 u'deriv',
 u'from',
 u'a',
 u'mandat',
 u'from',
 u'the',
 u'mass',
 u',',
 u'not',
 u'from',
 u'some',
 u'farcic',
 u'aquat',
 u'ceremoni',
 u'.']

2.9. Tokenization¶

Searching tokenized text¶

So far, we searched across one long string. However, it’s possible to write regular expressions to search across multiple strings, i.e. list of words. Use nltk.Text.findall method, not the one from re library which cannot handle such kind of regular expressions.

moby = nltk.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave

Exercise¶

Search for patterns like A and other Bs in brown corpus.

Solution¶

from nltk.corpus import brown
words = nltk.corpus.brown.words()
hobbies_learned = nltk.Text(words)
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
companies and other corporations; Union and other members; Wagner and
other officials; grillwork and other things; this and other units;
supervisors and other employees; suits and other misunderstandings;
dormitories and other buildings; ships and other submarines; Taylor
and other officers; chemicals and other products; amount and other
particulars; wife and other employes; Vietnam and other nations;
British and other replies; civic and other leaders; political and
other resources; Legion and other groups; this and other origins;
these and other matters; mink and other animals; These and other
figures; Ridge and other parts; Newman and other Tractarians; lawyers
and other managers; Awakening and other revivals; speed and other
activities; water and other liquids; tomb and other landmarks; Statues
and other monuments; pearls and other jewels; charts and other items;
roads and other features; figures and other objects; manure and other
fertilizers; these and other reasons; duration and other
circumstances; experience and other sources; television and other
mass; compromise and other factors; This and other fears; peoples and
other nations; head and other members; prints and other things; Bank
and other instrumentalities; candy and other presents; accepted and
other letters; schoolmaster and other privileges; proctors and other
officers; color and other varieties; Providence and other garages;
taxes and other revenues; Central and other railroads; this and other
departments; polls and other reports; ethnic and other lines;
management and other resources; loan and other provisions; India and
other countries; memoranda and other records; textile and other
industries; Leesona and other companies; Admissions and other issues;
military and other areas; demands and other factors; abstracts and
other compilations; iron and other metals; House and other places;
Bordel and other places; blues and other songs; oxygen and other
gases; structure and other buildings; sand and other girls; this and
other spots

Approaches to tokenization¶

So far, we used the builtin nltk.word_tokenize function that accepts a long string and returns a list of words.

help(nltk.tokenize)
Help on package nltk.tokenize in nltk:

NAME
    nltk.tokenize - NLTK Tokenizer Package

FILE
    /usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py

DESCRIPTION
    Tokenizers divide strings into lists of substrings.  For example,
    tokenizers can be used to find the words and punctuation in a string:

        >>> from nltk.tokenize import word_tokenize
        >>> s = '''Good muffins cost $3.88nin New York.  Please buy me
        ... two of them.nnThanks.'''
        >>> word_tokenize(s)
        ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
        'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

    This particular tokenizer requires the Punkt sentence tokenization
    models to be installed. NLTK also provides a simpler,
    regular-expression based tokenizer, which splits text on whitespace
    and punctuation:

        >>> from nltk.tokenize import wordpunct_tokenize
        >>> wordpunct_tokenize(s)
        ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
        'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

    We can also operate at the level of sentences, using the sentence
    tokenizer directly as follows:

        >>> from nltk.tokenize import sent_tokenize, word_tokenize
        >>> sent_tokenize(s)
        ['Good muffins cost $3.88nin New York.', 'Please buy mentwo of them.', 'Thanks.']
        >>> [word_tokenize(t) for t in sent_tokenize(s)]
        [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
        ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

    Caution: when tokenizing a Unicode string, make sure you are not
    using an encoded version of the string (it may be necessary to
    decode it first, e.g. with s.decode("utf8").

    NLTK tokenizers can produce token-spans, represented as tuples of integers
    having the same semantics as string slices, to support efficient comparison
    of tokenizers.  (These methods are implemented as generators.)

        >>> from nltk.tokenize import WhitespaceTokenizer
        >>> list(WhitespaceTokenizer().span_tokenize(s))
        [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
        (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

    There are numerous ways to tokenize text.  If you need more control over
    tokenization, see the other methods provided in this package.

    For further information, please see Chapter 3 of the NLTK book.

PACKAGE CONTENTS
    api
    casual
    mwe
    punkt
    regexp
    sexpr
    simple
    stanford
    stanford_segmenter
    texttiling
    treebank
    util

FUNCTIONS
    sent_tokenize(text, language='english')
        Return a sentence-tokenized copy of text,
        using NLTK's recommended sentence tokenizer
        (currently PunktSentenceTokenizer
        for the specified language).

        :param text: text to split into sentences
        :param language: the model name in the Punkt corpus

    word_tokenize(text, language='english')
        Return a tokenized copy of text,
        using NLTK's recommended word tokenizer
        (currently TreebankWordTokenizer
        along with PunktSentenceTokenizer
        for the specified language).

        :param text: text to split into sentences
        :param language: the model name in the Punkt corpus

2.10. Normalization of Text¶

Normalization of text is not only stemming, but also some more basic operations. For example, we may want to convert text to lowercase, so that we treat words the and The as the same word

2.11. Word Segmentation (especially in Chinese)¶

Sentence Segmentation¶

So far, we manipulated text at the low level, that is at the level of single words. However, it’s useful to operate on bigger structures, like i.e. sentences. nltk supports sentence segmentation. For example, you can easily find out what is the average sentences length:

len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20

There is a builtin function nltk.sent_tokenize that works in similar way to nltk.word_tokenize:

raw = nltk.corpus.gutenberg.raw('austen-emma.txt')
sentences = nltk.sent_tokenize(raw)
for sent in sentences[:5]:
    print(sent)
    print('---')
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
---
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
---
Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
---
Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
---
Between _them_ it was more the intimacy
of sisters.

Word Segmentation¶

For some writing systems, tokenizing text is made more difficult by the fact that there is no visual representation of word boundaries. For example, in Chinese, the three-character string: 爱国人 (ai4 love [verb], guo3 country, ren2 person) could be tokenized as 爱国 / , country-loving person, or as  / 国人, love country-person.
NLTK book

3. Categorizing and Tagging Words¶

The following imports are required in this chapter:

import nltk

Most of the time we’ll use nltk.tag submodule. Its docstring is a great introduction to categorizing and tagging words.

help(nltk.tag)
Help on package nltk.tag in nltk:

NAME
    nltk.tag - NLTK Taggers

FILE
    /usr/local/lib/python2.7/dist-packages/nltk/tag/__init__.py

DESCRIPTION
    This package contains classes and interfaces for part-of-speech
    tagging, or simply "tagging".

    A "tag" is a case-sensitive string that specifies some property of a token,
    such as its part of speech.  Tagged tokens are encoded as tuples
    (tag, token).  For example, the following tagged token combines
    the word 'fly' with a noun part of speech tag ('NN'):

        >>> tagged_tok = ('fly', 'NN')

    An off-the-shelf tagger is available.  It uses the Penn Treebank tagset:

        >>> from nltk import pos_tag, word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

    This package defines several taggers, which take a list of tokens,
    assign a tag to each one, and return the resulting list of tagged tokens.
    Most of the taggers are built automatically based on a training corpus.
    For example, the unigram tagger tags each word w by checking what
    the most frequent tag for w was in a training corpus:

        >>> from nltk.corpus import brown
        >>> from nltk.tag import UnigramTagger
        >>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
        >>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
        >>> for word, tag in tagger.tag(sent):
        ...     print(word, '->', tag)
        Mitchell -> NP
        decried -> None
        the -> AT
        high -> JJ
        rate -> NN
        of -> IN
        unemployment -> None

    Note that words that the tagger has not seen during training receive a tag
    of None.

    We evaluate a tagger on data that was not seen during training:

        >>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
        0.73...

    For more information, please consult chapter 5 of the NLTK Book.

PACKAGE CONTENTS
    api
    brill
    brill_trainer
    crf
    hmm
    hunpos
    mapping
    perceptron
    senna
    sequential
    stanford
    tnt
    util

FUNCTIONS
    pos_tag(tokens, tagset=None)
        Use NLTK's currently recommended part of speech tagger to
        tag the given list of tokens.

            >>> from nltk.tag import pos_tag
            >>> from nltk.tokenize import word_tokenize
            >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
            [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
            ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
            >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
            [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
            ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

        NB. Use pos_tag_sents() for efficient tagging of more than one sentence.

        :param tokens: Sequence of tokens to be tagged
        :type tokens: list(str)
        :param tagset: the tagset to be used, e.g. universal, wsj, brown
        :type tagset: str
        :return: The tagged tokens
        :rtype: list(tuple(str, str))

    pos_tag_sents(sentences, tagset=None)
        Use NLTK's currently recommended part of speech tagger to tag the
        given list of sentences, each consisting of a list of tokens.

        :param tokens: List of sentences to be tagged
        :type tokens: list(list(str))
        :param tagset: the tagset to be used, e.g. universal, wsj, brown
        :type tagset: str
        :return: The list of tagged sentences
        :rtype: list(list(tuple(str, str)))

DATA
    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...

3.1. Tagged Corpora¶

Just use .tagged_words() method instead of .words():

nltk.corpus.brown.tagged_words(categories='news')
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]

Make sure that your corpus is tagged. If your corpus is not tagged, there will be no .tagged_words() method and you’ll get an error:

nltk.corpus.gutenberg.tagged_words(categories='news')
Traceback (most recent call last):

  File "", line 1, in 
    nltk.corpus.gutenberg.tagged_words(categories='news')

AttributeError: 'PlaintextCorpusReader' object has no attribute 'tagged_words'

Exercise¶

How often do different parts of speech appear in different genres? Use brown corpus.

New concepts introduced in this exercise:

  • brown.categories
  • brown.tagged_words(**tagset*=...)*

Solution¶

cfd = nltk.ConditionalFreqDist(
    (category, tag)
    for category in nltk.corpus.brown.categories()
    for (word, tag) in nltk.corpus.brown.tagged_words(
        categories=category, tagset='universal'))
cfd.tabulate()
     .   ADJ   ADP   ADV  CONJ   DET  NOUN   NUM  PRON   PRT  VERB     X
      adventure 10929  3364  7069  3879  2173  8155 13354   466  5205  2436 12274    38
 belles_lettres 20975 13311 23214  8498  5954 21695 39546  1566  7715  3919 26354   349
      editorial  7099  4958  7613  2997  1862  7416 15170   722  2291  1536  9896    44
        fiction 10170  3684  7198  3716  2261  8018 13509   446  4799  2305 12299    83
     government  7598  5692 10221  2333  2560  8043 19486  1612  1269  1358  9872    73
        hobbies  9792  6559 10070  3736  2948  9497 21276  1426  2334  1889 12733    85
          humor  3419  1362  2392  1172   713  2459  4405   154  1357   663  3546    53
        learned 19632 15520 25938  8582  5783 22296 46295  3187  3799  3471 27145   240
           lore 13073  8445 14405  5500  3793 13381 26749  1270  3970  2644 16972    97
        mystery  8962  2721  5755  3493  1696  6209 10673   472  4362  2426 10383    17
           news 11928  6706 12355  3349  2717 11389 30654  2166  2535  2264 14399    92
       religion  4809  3004  5335  2087  1353  4948  8616   510  1770   910  6036    21
        reviews  5354  3554  4832  2083  1453  4720 10528   477  1246   870  5478   109
        romance 11397  3912  6918  3986  2469  7211 12550   321  5748  2655 12784    71
science_fiction  2428   929  1451   828   416  1582  2747    79   934   483  2579    14

Exercise¶

What words do follow often? How often are they verbs? Use news category from brown corpus. Remove duplicates and sort the result. New concepts introduced in this exercise:

  • remind of nltk.bigrams
  • set to remove duplicates
  • sorted

Solution¶

Words that follow often:

words = nltk.corpus.brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(words) if a == 'often'))
[u',',
 u'.',
 u'a',
 u'acceptable',
 u'ambiguous',
 u'build',
 u'did',
 u'enough',
 u'hear',
 u'in',
 u'mar',
 u'needs',
 u'obstructed',
 u'that']

What part-of-speech are these words?

tagged_words = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
fd = nltk.FreqDist(b[1]
                   for (a, b) in nltk.bigrams(tagged_words)
                   if a[0] == 'often')
fd.tabulate()
RB  ADP    .  ADJ  ADV  DET
   6    2    2    2    1    1

How often (in percentages) are they verbs?

fd.freq('VERB')
0.42857142857142855

Exercise¶

Find out words which are highly ambiguous as to their part of speech tag. That is, find words that are at least four different parts of speech in different contexts. Use brown corpus.

For example, close can be:

  • an adjective: Only the families and a dozen close friends will be present.
  • an adverb: Both came close to playing football at the University of Oklahoma.
  • a verb: Position your components before you close them in.
  • a noun: One night, at the close of the evening service, he came forward.

Solution¶

tagged_words = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
cfd = nltk.ConditionalFreqDist((word.lower(), tag)
                               for (word, tag) in tagged_words)

for word in sorted(cfd.conditions()):
    if len(cfd[word]) > 3:
        tags = [tag for (tag, _) in cfd[word].most_common()]
        print('{:20} {}'.format(word, ' '.join(tags)))
best                 ADJ ADV NOUN VERB
close                ADV ADJ VERB NOUN
open                 ADJ VERB NOUN ADV
present              ADJ ADV NOUN VERB
that                 ADP DET PRON ADV

3.2. Tagged Tokens¶

You can construct tagged tokens from scratch:

tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
('fly', 'NN')

However, most of the time you’ll operate on a much higher level, that is on whole tagged corpora.

3.3. Part-of-Speech Tagset¶

Here is a simplified part-of-speech tagset:

2016-12-22 1008.png

3.4. Python Dictionaries¶

Python dictionaries are a data type that let you map from arbitrary immutable data (called keys) to another data (named values). Most of the time, keys are strings. For example, you can map from string to strings:

pos = {}
pos['ideas'] = 'NOUN'
pos
{'ideas': 'NOUN'}

defaultdict¶

A very common scenario is when you want to have a default value when a key is not in the dictionary. For example, if you count words, you map from words to number of occurrences. The default value should be zero.

from collections import defaultdict
words = 'A B C A'.split(' ')
count = defaultdict(int)
for word in words:
    count[word] += 1
count
defaultdict(int, {'A': 2, 'B': 1, 'C': 1})

Another use case is when you want to have more than one value for each key. This can be simulated by a dictionary mapping from keys to list of values. The default value is a new empty list:

all_words = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
words_by_part_of_speech = defaultdict(list)
for word, part_of_speech in all_words[:100]:
    words_by_part_of_speech[part_of_speech].append(word)
for key, words in words_by_part_of_speech.items():
    print('{} -> {}'.format(key, ", ".join(words)))
ADV -> further
NOUN -> Fulton, County, Jury, Friday, investigation, Atlanta's, primary, election, evidence, irregularities, place, jury, term-end, presentments, City, Committee, charge, election, praise, thanks, City, Atlanta, manner, election, September-October, term, jury, Fulton, Court, Judge, Durwood, Pye, reports, irregularities, primary, Mayor-nominate, Ivan
ADP -> of, that, in, that, of, of, of, for, in, by, of, in, by
DET -> The, an, no, any, The, the, which, the, the, the, the, which, the, The, the, which
. -> ``, '', ., ,, ,, ``, '', ., ``, ''
PRT -> to
VERB -> said, produced, took, said, had, deserves, was, conducted, had, been, charged, investigate, was, won
CONJ -> and
ADJ -> Grand, recent, Executive, over-all, Superior, possible, hard-fought

3.5. Words to Properties mapping¶

Exercise¶

Create a mapping from a word to its part of speech. Note that one word can have more than one part of speech tag (i.e. close). Use news category from brown corpus.

New concepts introduced in this exercise:

  • set.add

Solution¶

tagged_words = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
words_to_part_of_speech = defaultdict(set)
for word, part_of_speech in tagged_words[:200]:
    words_to_part_of_speech[word].add(part_of_speech)
for word, part_of_speeches in words_to_part_of_speech.items()[:10]:
    print('{} -> {}'.format(word, ", ".join(part_of_speeches)))
modernizing -> VERB
produced -> VERB
thanks -> NOUN
Durwood -> NOUN
find -> VERB
Jury -> NOUN
September-October -> NOUN
had -> VERB
, -> .
to -> PRT, ADP

3.6. Automatic Tagging¶

Sometimes you don’t have access to tagged corpus. You have only a raw text. In that case, you may want to tag the text.

Builtin Taggers¶

nltk has a lot of builtin taggers:

[x for x in dir(nltk) if 'tagger' in x.lower()]
['AffixTagger',
 'BigramTagger',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CRFTagger',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ContextTagger',
 'DefaultTagger',
 'HiddenMarkovModelTagger',
 'HunposTagger',
 'NgramTagger',
 'PerceptronTagger',
 'RegexpTagger',
 'SennaChunkTagger',
 'SennaNERTagger',
 'SennaTagger',
 'SequentialBackoffTagger',
 'StanfordNERTagger',
 'StanfordPOSTagger',
 'StanfordTagger',
 'TaggerI',
 'TrigramTagger',
 'UnigramTagger']

We’ll use brown corpus to evaluate taggers performance:

brown_tagged_sents = nltk.corpus.brown.tagged_sents(categories='news', tagset='universal')

DefaultTagger¶

Default tagger assigns the same tag to every token. Let’s see what is the most common part of speech?

tagged_words = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
tags = [tag for (word, tag) in tagged_words]
fd = nltk.FreqDist(tags)
fd.most_common(10)
[(u'NOUN', 30654),
 (u'VERB', 14399),
 (u'ADP', 12355),
 (u'.', 11928),
 (u'DET', 11389),
 (u'ADJ', 6706),
 (u'ADV', 3349),
 (u'CONJ', 2717),
 (u'PRON', 2535),
 (u'PRT', 2264)]

This is the simplest possible tagger and it’s not very good:

default_tagger = nltk.DefaultTagger('NOUN')
default_tagger.evaluate(brown_tagged_sents)
0.30485112476878096

RegexpTagger¶

The regular expression tagger tries to match a word against different regular expressions. The regular expression can i.e. check the suffix. If there is a match, it assigns the tag associated with the regexp.

patterns = [
    (r'.*ing$', 'VERB'),               # gerunds
    (r'.*ed$', 'VERB'),                # simple past
    (r'.*es$', 'VERB'),                # 3rd singular present
    (r'.*ould$', 'VERB'),               # modals
    (r'.*\'s$', 'NOUN'),               # possessive nouns
    (r'.*s$', 'NOUN'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),  # cardinal numbers
    (r'.*', 'NOUN'),                    # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)

However, it’s performance is not very good:

regexp_tagger.evaluate(brown_tagged_sents)
0.34932474093521887

UnigramTagger¶

This tagger is given the list of i.e. hundred most frequent words along with their most likely tag. This data is used to tag new words. If a word is not known to the tagger, it assigns None.

fd = nltk.FreqDist(nltk.corpus.brown.words(categories='news'))
most_freq_words = fd.most_common(100)
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words(categories='news', tagset='universal'))
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
likely_tags.items()[:20]
[(u'all', u'PRT'),
 (u'over', u'ADP'),
 (u'years', u'NOUN'),
 (u'against', u'ADP'),
 (u'its', u'DET'),
 (u'before', u'ADP'),
 (u'(', u'.'),
 (u'had', u'VERB'),
 (u',', u'.'),
 (u'to', u'PRT'),
 (u'only', u'ADJ'),
 (u'under', u'ADP'),
 (u'has', u'VERB'),
 (u'New', u'ADJ'),
 (u'them', u'PRON'),
 (u'his', u'DET'),
 (u'Mrs.', u'NOUN'),
 (u'they', u'PRON'),
 (u'not', u'ADV'),
 (u'now', u'ADV')]

What is its performance?

unigram_tagger = nltk.UnigramTagger(model=likely_tags)
unigram_tagger.evaluate(brown_tagged_sents)
0.46503371322871295

UnigramTagger with backlog¶

If the UnigramTagger does not recognize a word and cannot assign any tag, it can delegate to another tagger. What is the performance if we delegate to the default tagger?

unigram_with_backlog = nltk.UnigramTagger(model=likely_tags,
                                          backoff=nltk.DefaultTagger('NOUN'))
unigram_with_backlog.evaluate(brown_tagged_sents)
0.7582194641685065

If we use 10000 most common words instead of 100 for the unigram tagger, we’ll get much better results:

fd = nltk.FreqDist(nltk.corpus.brown.words(categories='news'))
most_freq_words = fd.most_common(10000)
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words(categories='news', tagset='universal'))
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
unigram_with_backlog = nltk.UnigramTagger(model=likely_tags,
                                          backoff=nltk.DefaultTagger('NOUN'))
unigram_with_backlog.evaluate(brown_tagged_sents)
0.9489826361954771

3.7. Determining the Category of a Word¶

An example regular expression tagger tried to determine the part of speech by looking at the suffix of a word. This is known as morphological clue.

If you look at context in which a word can occur, you try to leverage syntactic clues. For example, adjectives can occur before a noun or after words like be. This way, you can identify that near is an adjective in the following parts of sentences:

  • the near window
  • the end is near

The meaning of a word can help you determine the part of speech. For example, nouns are basically names of people, places or things. This is known as semantic clues.

4. Text Classification (Machine Learning)¶

import nltk

Docstrings of nltk submodules are very helpful to start learning a new topic:

help(nltk.classify)
Help on package nltk.classify in nltk:

NAME
    nltk.classify

FILE
    /usr/local/lib/python2.7/dist-packages/nltk/classify/__init__.py

DESCRIPTION
    Classes and interfaces for labeling tokens with category labels (or
    "class labels").  Typically, labels are represented with strings
    (such as 'health' or 'sports').  Classifiers can be used to
    perform a wide range of classification tasks.  For example,
    classifiers can be used...

    - to classify documents by topic
    - to classify ambiguous words by which word sense is intended
    - to classify acoustic signals by which phoneme they represent
    - to classify sentences by their author

    Features
    ========
    In order to decide which category label is appropriate for a given
    token, classifiers examine one or more 'features' of the token.  These
    "features" are typically chosen by hand, and indicate which aspects
    of the token are relevant to the classification decision.  For
    example, a document classifier might use a separate feature for each
    word, recording how often that word occurred in the document.

    Featuresets
    ===========
    The features describing a token are encoded using a "featureset",
    which is a dictionary that maps from "feature names" to "feature
    values".  Feature names are unique strings that indicate what aspect
    of the token is encoded by the feature.  Examples include
    'prevword', for a feature whose value is the previous word; and
    'contains-word(library)' for a feature that is true when a document
    contains the word 'library'.  Feature values are typically
    booleans, numbers, or strings, depending on which feature they
    describe.

    Featuresets are typically constructed using a "feature detector"
    (also known as a "feature extractor").  A feature detector is a
    function that takes a token (and sometimes information about its
    context) as its input, and returns a featureset describing that token.
    For example, the following feature detector converts a document
    (stored as a list of words) to a featureset describing the set of
    words included in the document:

        >>> # Define a feature detector function.
        >>> def document_features(document):
        ...     return dict([('contains-word(%s)' % w, True) for w in document])

    Feature detectors are typically applied to each token before it is fed
    to the classifier:

        >>> # Classify each Gutenberg document.
        >>> from nltk.corpus import gutenberg
        >>> for fileid in gutenberg.fileids(): # doctest: +SKIP
        ...     doc = gutenberg.words(fileid) # doctest: +SKIP
        ...     print fileid, classifier.classify(document_features(doc)) # doctest: +SKIP

    The parameters that a feature detector expects will vary, depending on
    the task and the needs of the feature detector.  For example, a
    feature detector for word sense disambiguation (WSD) might take as its
    input a sentence, and the index of a word that should be classified,
    and return a featureset for that word.  The following feature detector
    for WSD includes features describing the left and right contexts of
    the target word:

        >>> def wsd_features(sentence, index):
        ...     featureset = {}
        ...     for i in range(max(0, index-3), index):
        ...         featureset['left-context(%s)' % sentence[i]] = True
        ...     for i in range(index, max(index+3, len(sentence))):
        ...         featureset['right-context(%s)' % sentence[i]] = True
        ...     return featureset

    Training Classifiers
    ====================
    Most classifiers are built by training them on a list of hand-labeled
    examples, known as the "training set".  Training sets are represented
    as lists of (featuredict, label) tuples.

PACKAGE CONTENTS
    api
    decisiontree
    maxent
    megam
    naivebayes
    positivenaivebayes
    rte_classify
    scikitlearn
    senna
    svm
    tadm
    textcat
    util
    weka

4.1. Supervised Classification¶

Classification is the process of assigning a class label to an input. For example, you may want to find out whether an email is spam or not. In that case there are two class labels: spam and non-spam, and the input is the email.

Supervised classification is built based on some training corpus where each input has an already assigned class label. For example, for supervised email classification you may want to use a corpus of emails along with information whether they’re spam or not.

Classifiers don’t work directly in the input. Instead, you convert the input to a featureset, that is a dictionary mapping from feature names to their values. Then, classifiers look at the featuresets.

Features are some properties of the input. For example, in the case of the spam classification, it can be a number of occurrences for each word. That is, the featureset can be a dictionary mapping from words to the number its occurrences. It’s important that feature values are primitive values, like boolean, number or a string.

4.2. Sentence Segmentation¶

We’ll use supervised classification for sentence segmentation, that is to the task of splitting list of tokens into list of sentences (which are list of tokens).

This indeed is classification problem – for each token, you can assign whether it’s an end of sentence or not.

Input¶

We’ll use treebank_raw corpus. We need to convert the input into a format that is more convenient for supervised learning:

sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

print(' '.join(tokens[:50]))
. START Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov . 29 . Mr . Vinken is chairman of Elsevier N . V ., the Dutch publishing group . . START Rudolph Agnew , 55 years old and former chairman of Consolidated

Extracting Features¶

Classifiers don’t operate directly at the input. Instead, they have a look at featuresets which contains some properties of the input. Therefore, we need a function that convert from the input to the featuresets:

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

And then, convert each dot token to the featureset:

featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']
featuresets[:8]
[({'next-word-capitalized': False,
   'prev-word': u'nov',
   'prev-word-is-one-char': False,
   'punct': u'.'},
  False),
 ({'next-word-capitalized': True,
   'prev-word': u'29',
   'prev-word-is-one-char': False,
   'punct': u'.'},
  True),
 ({'next-word-capitalized': True,
   'prev-word': u'mr',
   'prev-word-is-one-char': False,
   'punct': u'.'},
  False),
 ({'next-word-capitalized': True,
   'prev-word': u'n',
   'prev-word-is-one-char': True,
   'punct': u'.'},
  False),
 ({'next-word-capitalized': False,
   'prev-word': u'group',
   'prev-word-is-one-char': False,
   'punct': u'.'},
  True),
 ({'next-word-capitalized': True,
   'prev-word': u'.',
   'prev-word-is-one-char': True,
   'punct': u'.'},
  False),
 ({'next-word-capitalized': False,
   'prev-word': u'conglomerate',
   'prev-word-is-one-char': False,
   'punct': u'.'},
  True),
 ({'next-word-capitalized': True,
   'prev-word': u'.',
   'prev-word-is-one-char': True,
   'punct': u'.'},
  False)]

Learning Classifier¶

nltk comes with a lot of classifiers:

[x for x in dir(nltk) if 'classifier' in x.lower()]
['ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConditionalExponentialClassifier',
 'DecisionTreeClassifier',
 'MaxentClassifier',
 'MultiClassifierI',
 'NaiveBayesClassifier',
 'PositiveNaiveBayesClassifier',
 'SklearnClassifier',
 'WekaClassifier',
 '__classifiers__',
 'rte_classifier']

But we’ll use NaiveBayesClassifier.

classifier = nltk.NaiveBayesClassifier.train(featuresets)

Now, you classify an example sentence. First, we need to split the sentence into tokens:

tokens = nltk.word_tokenize('This is an example sentence, mr . Smith. And this is another one.')
tokens
['This',
 'is',
 'an',
 'example',
 'sentence',
 ',',
 'mr',
 '.',
 'Smith',
 '.',
 'And',
 'this',
 'is',
 'another',
 'one',
 '.']

Then, we construct featuresets (but only for punctuation tokens):

test_featuresets = [punct_features(tokens, i)
                    for i in range(1, len(tokens)-1)
                    if tokens[i] in '.?!']
test_featuresets
[{'next-word-capitalized': True,
  'prev-word': 'mr',
  'prev-word-is-one-char': False,
  'punct': '.'},
 {'next-word-capitalized': True,
  'prev-word': 'smith',
  'prev-word-is-one-char': False,
  'punct': '.'}]

And then, we can classify both dots:

classifier.classify_many(test_featuresets)
[False, True]

You can see that the first dot (in an example sentence, mr . Smith) was assigned False, which means it’s not the end of sentence. At the same time, the second dot (in Smith. And this is) was correctly detected as end of sentence.

You construct feature extractor by trial and error. Classifiers come with methods that help you find out which features are most informative.

classifier.show_most_informative_features()
Most Informative Features
               prev-word = u'mr'           False : True   =    112.7 : 1.0
               prev-word = u'3'            False : True   =     46.9 : 1.0
               prev-word = u'7'            False : True   =     42.3 : 1.0
               prev-word = u'2'            False : True   =     27.6 : 1.0
               prev-word = u'6'            False : True   =     25.0 : 1.0
               prev-word = u'n'            False : True   =     19.4 : 1.0
   prev-word-is-one-char = True            False : True   =     15.3 : 1.0
               prev-word = u'1'            False : True   =     15.2 : 1.0
               prev-word = u's'            False : True   =     14.9 : 1.0
               prev-word = u'corp'         False : True   =     14.7 : 1.0

4.3. Evaluation & Cross Validation¶

Evaluation¶

Machine learning involves a lot of trial and error. We could see that you construct feature extractor by intuition. In order to help our intuition, we should evaluate classifiers, that is find its score:

nltk.classify.accuracy(classifier, featuresets)
0.9704102219233356

This represents how much token are correctly classified. However, this number is overestimated. This is because we evaluate the classifier on the same data that it was trained. The classifier could simplify remember the input and expected class label.

Therefore, we need to split all input into three sets:

  • the training set used to train all classifiers you build,
  • the dev-test set used to evaluate all these classifiers in order to compare them,
  • the test set used to evaluate the best classifier.

As a rule of thumb, split all the data in 60%/20%/20% proportions.

index1 = int(len(featuresets) * 0.6)
index2 = int(len(featuresets) * 0.8)
train_set = featuresets[:index1]
devtest_set = featuresets[index1:index2]
test_set = featuresets[index2:]
len(train_set), len(devtest_set), len(test_set)
(3568, 1190, 1190)

Now, use train_set for training and test_set for evaluating the classifier. We don’t use devtest_set now, because we have only one classifier and we don’t have to compare different classifiers.

classifier = nltk.NaiveBayesClassifier.train(train_set) nltk.classify.accuracy(classifier, test_set)

0.9554621848739496

As you can see, this is lower accuracy than when you evaluate on training dataset:

nltk.classify.accuracy(classifier, train_set)
0.9694506726457399

Cross validation¶

However, sometimes the input is small enough and you cannot sacrifice 40% of the dataset just for evaluation purposes.

One solution to this problem is to perform multiple evaluations on different test sets, then to combine the scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds. For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable. NLTK book

Exercise¶

Male and female names have distinctive characteristics. For example, if a name ends with a, e or i, it’s likely to be female. Build a classifier that looks at the first and last letter in a name. Evaluate its accuracy. Use names corpus. Then, use another classifier (DecisionTreeClassifier) and compare it with the first one. New concepts introduced in this exercise:

  • concatenating two lists
  • random.shuffle
  • remind if str[-1]
  • nltk.DecisionTreeClassifier

Solution¶

First, we need input data in a convenient format:

male_names = nltk.corpus.names.words('male.txt')
female_names = nltk.corpus.names.words('female.txt')
labeled_names = [(name, 'male') for name in male_names] + \
                [(name, 'female') for name in female_names]

import random
random.shuffle(labeled_names)
labeled_names[:10]
[(u'Erik', 'male'),
 (u'Lolita', 'female'),
 (u'Merralee', 'female'),
 (u'Ramsey', 'male'),
 (u'Georgiamay', 'female'),
 (u'Aristotle', 'male'),
 (u'Virgilio', 'male'),
 (u'Diann', 'female'),
 (u'Alejandrina', 'female'),
 (u'Albrecht', 'male')]

Let’s build feature extractor:

def gender_features(name):
    return {'first_letter': name[0],
            'last_letter': name[-1]}

gender_features('Shrek')
{'first_letter': 'S', 'last_letter': 'k'}

Convert the input data into featuresets:

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

And split it into different sets:

index1 = int(len(featuresets) * 0.6)
index2 = int(len(featuresets) * 0.8)
train_set = featuresets[:index1]
devtest_set = featuresets[index1:index2]
test_set = featuresets[index2:]
len(train_set), len(devtest_set), len(test_set)
(4766, 1589, 1589)

Build classifier:

classifier = nltk.NaiveBayesClassifier.train(train_set)

Does it recognize male names?

classifier.classify(gender_features('Neo'))
'male'

Does it recognize female names?

classifier.classify(gender_features('Maria'))
'female'

How accurate this classifier is?

nltk.classify.accuracy(classifier, devtest_set)
0.7602265575833858

And build another classifier:

tree_classifier = nltk.DecisionTreeClassifier.train(train_set)

How accurate this classifier is?

nltk.classify.accuracy(tree_classifier, devtest_set)
0.7551919446192574

Which classifier is better? Once you know that, you need to find out how this classifier performs on new data.

int(nltk.classify.accuracy(classifier, test_set))
print(nltk.classify.accuracy(tree_classifier, test_set))
0.774701069855
0.763373190686

4.4. Decision Trees¶

In the previous exercise, we used a DecisionTreeClassifier.

The advantage of decisions tree is that their model is easy to understand. You can even convert it into pseudocode:

print(tree_classifier.pseudocode()[:1000])
if last_letter == u'a': return 'female'
if last_letter == u'b':
  if first_letter == u'A': return 'male'
  if first_letter == u'C': return 'male'
  if first_letter == u'D': return 'female'
  if first_letter == u'G': return 'male'
  if first_letter == u'H': return 'male'
  if first_letter == u'J': return 'male'
  if first_letter == u'R': return 'male'
  if first_letter == u'T': return 'male'
if last_letter == u'c': return 'male'
if last_letter == u'd':
  if first_letter == u'A': return 'male'
  if first_letter == u'B': return 'male'
  if first_letter == u'C': return 'male'
  if first_letter == u'D': return 'male'
  if first_letter == u'E': return 'male'
  if first_letter == u'F': return 'male'
  if first_letter == u'G': return 'male'
  if first_letter == u'H': return 'male'
  if first_letter == u'I': return 'female'
  if first_letter == u'J': return 'male'
  if first_letter == u'K': return 'male'
  if first_letter == u'L': return 'male'
  if first_letter == u'M': return 'male'
  if fi

5. Extracting Information from Text¶

In this chapter, we need the following imports:

import nltk

The architecture of simple information extraction system is:

  • At the beginning, we have raw text as a string.
  • Result of sentence segmentation is list of strings.
  • Result of tokenization is list of list of strings.
  • Result of part of speech tagging is list of list of pairs (word, tag).
  • Result of entity detection is list of trees.
  • Result of relation detection is list of tuples (entity, relation, entity).

We’ll focus on entity detection. The basic technique is chunking.

5.1. Chunking¶

Shallow parsing (also chunking or light parsing) is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings (noun groups or phrases,, verb groups, etc.). Wikipedia

nltk library has chunk module:

help(nltk.chunk)
Help on package nltk.chunk in nltk:

NAME
    nltk.chunk

FILE
    /usr/local/lib/python2.7/dist-packages/nltk/chunk/__init__.py

DESCRIPTION
    Classes and interfaces for identifying non-overlapping linguistic
    groups (such as base noun phrases) in unrestricted text.  This task is
    called "chunk parsing" or "chunking", and the identified groups are
    called "chunks".  The chunked text is represented using a shallow
    tree called a "chunk structure."  A chunk structure is a tree
    containing tokens and chunks, where each chunk is a subtree containing
    only tokens.  For example, the chunk structure for base noun phrase
    chunks in the sentence "I saw the big dog on the hill" is::

      (SENTENCE:
        (NP: <I>)
        <saw>
        (NP: <the> <big> <dog>)
        <on>
        (NP: <the> <hill>))

    To convert a chunk structure back to a list of tokens, simply use the
    chunk structure's leaves() method.

    This module defines ChunkParserI, a standard interface for
    chunking texts; and RegexpChunkParser, a regular-expression based
    implementation of that interface. It also defines ChunkScore, a
    utility class for scoring chunk parsers.

    RegexpChunkParser
    =================

    RegexpChunkParser is an implementation of the chunk parser interface
    that uses regular-expressions over tags to chunk a text.  Its
    parse() method first constructs a ChunkString, which encodes a
    particular chunking of the input text.  Initially, nothing is
    chunked.  parse.RegexpChunkParser then applies a sequence of
    RegexpChunkRule rules to the ChunkString, each of which modifies
    the chunking that it encodes.  Finally, the ChunkString is
    transformed back into a chunk structure, which is returned.

    RegexpChunkParser can only be used to chunk a single kind of phrase.
    For example, you can use an RegexpChunkParser to chunk the noun
    phrases in a text, or the verb phrases in a text; but you can not
    use it to simultaneously chunk both noun phrases and verb phrases in
    the same text.  (This is a limitation of RegexpChunkParser, not of
    chunk parsers in general.)

    RegexpChunkRules
    ----------------

    A RegexpChunkRule is a transformational rule that updates the
    chunking of a text by modifying its ChunkString.  Each
    RegexpChunkRule defines the apply() method, which modifies
    the chunking encoded by a ChunkString.  The
    RegexpChunkRule class itself can be used to implement any
    transformational rule based on regular expressions.  There are
    also a number of subclasses, which can be used to implement
    simpler types of rules:

        - ChunkRule chunks anything that matches a given regular
          expression.
        - ChinkRule chinks anything that matches a given regular
          expression.
        - UnChunkRule will un-chunk any chunk that matches a given
          regular expression.
        - MergeRule can be used to merge two contiguous chunks.
        - SplitRule can be used to split a single chunk into two
          smaller chunks.
        - ExpandLeftRule will expand a chunk to incorporate new
          unchunked material on the left.
        - ExpandRightRule will expand a chunk to incorporate new
          unchunked material on the right.

    Tag Patterns
    ~~~~~~~~~~~~

    A RegexpChunkRule uses a modified version of regular
    expression patterns, called "tag patterns".  Tag patterns are
    used to match sequences of tags.  Examples of tag patterns are::

         r'(<DT>|<JJ>|<NN>)+'
         r'<NN>+'
         r'<NN.*>'

    The differences between regular expression patterns and tag
    patterns are:

        - In tag patterns, '<' and '>' act as parentheses; so
          '<NN>+' matches one or more repetitions of '<NN>', not
          '<NN' followed by one or more repetitions of '>'.
        - Whitespace in tag patterns is ignored.  So
          '<DT> | <NN>' is equivalant to '<DT>|<NN>'
        - In tag patterns, '.' is equivalant to '[^{}<>]'; so
          '<NN.*>' matches any single tag starting with 'NN'.

    The function tag_pattern2re_pattern can be used to transform
    a tag pattern to an equivalent regular expression pattern.

    Efficiency
    ----------

    Preliminary tests indicate that RegexpChunkParser can chunk at a
    rate of about 300 tokens/second, with a moderately complex rule set.

    There may be problems if RegexpChunkParser is used with more than
    5,000 tokens at a time.  In particular, evaluation of some regular
    expressions may cause the Python regular expression engine to
    exceed its maximum recursion depth.  We have attempted to minimize
    these problems, but it is impossible to avoid them completely.  We
    therefore recommend that you apply the chunk parser to a single
    sentence at a time.

    Emacs Tip
    ---------

    If you evaluate the following elisp expression in emacs, it will
    colorize a ChunkString when you use an interactive python shell
    with emacs or xemacs ("C-c !")::

        (let ()
          (defconst comint-mode-font-lock-keywords
            '(("<[^>]+>" 0 'font-lock-reference-face)
              ("[{}]" 0 'font-lock-function-name-face)))
          (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

    You can evaluate this code by copying it to a temporary buffer,
    placing the cursor after the last close parenthesis, and typing
    "C-x C-e".  You should evaluate it before running the interactive
    session.  The change will last until you close emacs.

    Unresolved Issues
    -----------------

    If we use the re module for regular expressions, Python's
    regular expression engine generates "maximum recursion depth
    exceeded" errors when processing very large texts, even for
    regular expressions that should not require any recursion.  We
    therefore use the pre module instead.  But note that pre
    does not include Unicode support, so this module will not work
    with unicode strings.  Note also that pre regular expressions
    are not quite as advanced as re ones (e.g., no leftward
    zero-length assertions).

    :type CHUNK_TAG_PATTERN: regexp
    :var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
         pattern is valid.

PACKAGE CONTENTS
    api
    named_entity
    regexp
    util

FUNCTIONS
    ne_chunk(tagged_tokens, binary=False)
        Use NLTK's currently recommended named entity chunker to
        chunk the given list of tagged tokens.

    ne_chunk_sents(tagged_sentences, binary=False)
        Use NLTK's currently recommended named entity chunker to chunk the
        given list of tagged sentences, each consisting of a list of tagged tokens.

We first need to preprocess any input:

def preprocess(document):

   sentences = nltk.sent_tokenize(document)
   sentences = [nltk.word_tokenize(sent) for sent in sentences]
   sentences = [nltk.pos_tag(sent, tagset='universal') for sent in sentences]
   return sentences

sentences = preprocess('The little yellow dog barked at the cat.') sentences </source>

[[('The', u'DET'),
  ('little', u'ADJ'),
  ('yellow', u'ADJ'),
  ('dog', u'NOUN'),
  ('barked', u'VERB'),
  ('at', u'ADP'),
  ('the', u'DET'),
  ('cat', u'NOUN'),
  ('.', u'.')]]

We’ll do noun phrase chunking.

grammar = "NP: {<DET>?<ADJ>*<NOUN>}"

cp = nltk.RegexpParser(grammar)
cp.parse(sentences[0])

2016-12-29 1136.png

Let’s try a bit more complex sentence:

sentences = preprocess('The market for system-management software for Digital\'s hardware is fragmented enough that a giant such as Computer Associates should do well there.')
result = cp.parse(sentences[0])
result

As you can see, this tree is to large, so we’ll switch to another representation:

Course-5 9 0.png

print(result.pformat())
(S
  (NP The/DET market/NOUN)
  for/ADP
  (NP system-management/ADJ software/NOUN)
  for/ADP
  (NP Digital/NOUN)
  's/PRT
  (NP hardware/NOUN)
  is/VERB
  fragmented/VERB
  enough/ADV
  that/ADP
  a/DET
  giant/ADJ
  such/ADJ
  as/ADP
  (NP Computer/NOUN)
  (NP Associates/NOUN)
  should/VERB
  do/VERB
  well/ADV
  there/ADV
  ./.)

Exercise¶

Find all sequences in the form of NOUN to NOUN, i.e. combined to achieve or want to buy. Use brown corpus.

New concepts introduced in this exercise:

  • tree.subtrees()
  • subtree.label()

Solution¶

First, find all sequences of NOUN PRT NOUN where PRT stands for particle (like to).

cp = nltk.RegexpParser('CHUNK: {<VERB> <PRT> <VERB>}')
sentences = preprocess(nltk.corpus.gutenberg.raw()[:10000])
for sent in sentences:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)
(CHUNK seemed/VERB to/PRT unite/VERB)
(CHUNK ceased/VERB to/PRT hold/VERB)
(CHUNK left/VERB to/PRT dine/VERB)
(CHUNK disposed/VERB to/PRT think/VERB)
(CHUNK going/VERB to/PRT see/VERB)
(CHUNK coming/VERB to/PRT see/VERB)
(CHUNK like/VERB to/PRT put/VERB)
(CHUNK are/VERB to/PRT be/VERB)
(CHUNK used/VERB to/PRT see/VERB)

Now, let’s filter out other particles:

cp = nltk.RegexpParser('CHUNK: {<VERB> <to> <VERB>}')
sentences = preprocess(nltk.corpus.gutenberg.raw()[:10000])
for sent in sentences:
    sent = [(word, 'to' if word.lower() == 'to' else tag)
            for (word, tag) in sent]
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)
(CHUNK seemed/VERB to/to unite/VERB)
(CHUNK ceased/VERB to/to hold/VERB)
(CHUNK left/VERB to/to dine/VERB)
(CHUNK disposed/VERB to/to think/VERB)
(CHUNK going/VERB to/to see/VERB)
(CHUNK coming/VERB to/to see/VERB)
(CHUNK like/VERB to/to put/VERB)
(CHUNK are/VERB to/to be/VERB)
(CHUNK used/VERB to/to see/VERB)

5.2. Chinking¶

We can define a chink - a sequence of tokens that should be not included in a chunk. Chunking let you define what should be included, while chinking is the opposite – it excludes some sequences of tokens. Results of chinking for three situations:

2016-12-29 1155.png

Let’s put entire sentence into a single chunk, and then split it into smaller chunks by chinking:

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VERB|ADP>+{      # Chink sequences of VERB and ADP
  """
sentences = preprocess('The little yellow dog barked at the cat.')
cp = nltk.RegexpParser(grammar)
cp.parse(sentences[0])

2016-12-29 1159.png

5.3. Tags vs Trees¶

We’ve seen that chunk structures can be represented as shallow trees. However, there is another format - IOB.

The IOB format (short for Inside, Outside, Beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. Named Entity Recognition).[1] The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them. An O tag indicates that a token belongs to no chunk. Wikipedia

You can easily convert tree to IOB format.

iob = nltk.tree2conllstr(result)
print(iob)
The DET B-NP
market NOUN I-NP
for ADP O
system-management ADJ B-NP
software NOUN I-NP
for ADP O
Digital NOUN B-NP
's PRT O
hardware NOUN B-NP
is VERB O
fragmented VERB O
enough ADV O
that ADP O
a DET O
giant ADJ O
such ADJ O
as ADP O
Computer NOUN B-NP
Associates NOUN B-NP
should VERB O
do VERB O
well ADV O
there ADV O
. . O

Exercise¶

Convert back iob to a list of tuples (word, tag, IOB-tag).

Solution¶

tree = nltk.conllstr2tree(iob)
nltk.tree2conlltags(tree)
[(u'The', u'DET', u'B-NP'),
 (u'market', u'NOUN', u'I-NP'),
 (u'for', u'ADP', u'O'),
 (u'system-management', u'ADJ', u'B-NP'),
 (u'software', u'NOUN', u'I-NP'),
 (u'for', u'ADP', u'O'),
 (u'Digital', u'NOUN', u'B-NP'),
 (u"'s", u'PRT', u'O'),
 (u'hardware', u'NOUN', u'B-NP'),
 (u'is', u'VERB', u'O'),
 (u'fragmented', u'VERB', u'O'),
 (u'enough', u'ADV', u'O'),
 (u'that', u'ADP', u'O'),
 (u'a', u'DET', u'O'),
 (u'giant', u'ADJ', u'O'),
 (u'such', u'ADJ', u'O'),
 (u'as', u'ADP', u'O'),
 (u'Computer', u'NOUN', u'B-NP'),
 (u'Associates', u'NOUN', u'B-NP'),
 (u'should', u'VERB', u'O'),
 (u'do', u'VERB', u'O'),
 (u'well', u'ADV', u'O'),
 (u'there', u'ADV', u'O'),
 (u'.', u'.', u'O')]

6. Analyzing Sentence Structure¶

6.1. Context Free Grammar¶

Composability of sentences¶

We’ll start from a discussion on some grammatical dilemmas. The very first important concept is composability of sentences. Sentences have an interesting property that you can take smaller sentences and combine that inside a larger sentences.

Consider this example:

  • Apples are red.
  • I think that apples are red.
  • Alice reported that I think that apples are red.

For a moment, let replace whole sentences with symbol S. We’ll see patters like:

  • I think that S.
  • Alice reported that S.

There could be other templates we can use like S when S, for example Apples are red when they’re ripe.

We can easily construct some really long sentences in this way. Even a very long sentence has simple structure like, for example S when S, as long as the sentence is grammatically correct.

Basic concepts¶

We’ll call these templates or patterns like S when S productions.

We’ll assume that a language is considered to be nothing more than just a big collection of all grammatical sentences. We exclude all ungrammatical ones. At the same time, we include all sentences that are grammatical, but have no sense.

A language can be described by a grammar which is a set of recursive productions, like S -> S when S. A grammar is a declarative specification of well-formedness or a description of the structure of an unlimited set of sentences.

While a grammar is declarative specification, parser for a given grammar is a procedural interpretation of the grammar. There are many parsers:

import nltk

[i for i in dir(nltk) if 'parser' in i.lower()]
['BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'ChartParser',
 'ChunkParserI',
 'EarleyChartParser',
 'FeatureBottomUpChartParser',
 'FeatureBottomUpLeftCornerChartParser',
 'FeatureChartParser',
 'FeatureEarleyChartParser',
 'FeatureIncrementalBottomUpChartParser',
 'FeatureIncrementalBottomUpLeftCornerChartParser',
 'FeatureIncrementalChartParser',
 'FeatureIncrementalTopDownChartParser',
 'FeatureTopDownChartParser',
 'IncrementalBottomUpChartParser',
 'IncrementalBottomUpLeftCornerChartParser',
 'IncrementalChartParser',
 'IncrementalLeftCornerChartParser',
 'IncrementalTopDownChartParser',
 'InsideChartParser',
 'LeftCornerChartParser',
 'LongestChartParser',
 'MaltParser',
 'NonprojectiveDependencyParser',
 'ParserI',
 'ProbabilisticNonprojectiveParser',
 'ProbabilisticProjectiveDependencyParser',
 'ProjectiveDependencyParser',
 'RandomChartParser',
 'RecursiveDescentParser',
 'RegexpChunkParser',
 'RegexpParser',
 'ShiftReduceParser',
 'SteppingChartParser',
 'SteppingRecursiveDescentParser',
 'SteppingShiftReduceParser',
 'TopDownChartParser',
 'TransitionParser',
 'UnsortedChartParser',
 'ViterbiParser',
 'load_parser',
 'nonprojectivedependencyparser',
 'projectivedependencyparser',
 'transitionparser']

Different parsers for the same grammar may vary in time and space complexity – some of them are faster and some of them are slower. Some of them use more memory and some of them use less. Some parsers cannot handle complex grammars (to be precise, left-recursive ones; that’ll be explained later). Some parsers may not found a syntax tree for a valid sentence. There is no one “best” parser in that sense and you need to choose the best one.

A parser takes a sentence (in a format of a list of tokens) and outputs syntax trees.

There are a lot of new concepts. To better understand all these concepts and how they fit together, let’s consider a very simple example:

import nltk

grammar = nltk.CFG.fromstring("""
S -> 'I' 'think' 'that' S
S -> 'Alice' 'reported' 'that' S
S -> 'apples' 'are' 'red'
""")

statement = 'Alice reported that I think that apples are red'.split(' ')

# parsers will be described in details later
parser = nltk.ChartParser(grammar)

trees = list(parser.parse(statement))

for tree in trees:
    print(tree)

trees[0]
(S Alice reported that (S I think that (S apples are red)))

2016-12-29 1202.png

Exercise¶

Trees are objects that provide some useful methods. There is a builtin method for computing the height of a tree. What it is?

Solution¶

trees[0].height()
4

Exercise¶

Write a grammar to parse sentence Apples are red when they are ripe. Start with the same code as above, except for starting from different grammar:

?
S -> 'apples' 'are' 'red'
S -> 'they' 'are' 'ripe'

Solution¶

import nltk

grammar = nltk.CFG.fromstring("""
S -> S 'when' S
S -> 'apples' 'are' 'red'
S -> 'they' 'are' 'ripe'
""")

statement = 'apples are red when they are ripe'.split(' ')

parser = nltk.ChartParser(grammar)

trees = list(parser.parse(statement))

for tree in trees:
    print(tree)

tree
(S (S apples are red) when (S they are ripe))

2016-12-29 1206.png

Now it’s easier to see that, indeed, syntax trees are trees.

Ambiguity¶

Once we understand the basic concepts and once we discussed the very first interesting property of sentences, which is composability, let’s dive into more complicated examples.

Almost every sentence, except for the short ones, are ambiguous. Let’s consider this statement:

  • “I shot an elephant in my pajamas”.
  • How an elephant got into my pajamas?

Let’s take a closer look at the ambiguity in this phrase. Let’s try to write a grammar for this sentence.

import nltk

# S = sentence (I shot an elephant in my pajamas)
# PP = prepositional phrase (in my pajamas)
# VP = verb phrase (shot an elephant, shot an elephant in my pajamas)
# NP = noun phrase (I, an elephant, my pajamas, an elephant in my pajamas)
# Det = determiner (an, my)
# N = noun (elephant, pajamas)
# V = verb (shot)
# P = preposition (in)

grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
VP -> V NP | VP PP
NP -> Det N | Det N PP | 'I'
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

statement = 'I shot an elephant in my pajamas'.split(' ')
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(statement))

for tree in trees:
    print(tree)
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
trees[0]

2016-12-29 1208.png

trees[1]

2016-12-29 1212.png

This kind of ambiguity is called prepositional phrase attachment ambiguity.

Exercise¶

Consider the sentence Kim arrived or Dana left and everyone cheered. Write down the parenthesized forms or syntax trees to show the relative scope of and and or. Write grammar to parse this sentence. Generate tree structures corresponding to both of these interpretations. Hint: start from writing a grammar to parse subsentences like: * Kim arrived,

  • Dana left,
  • everyone cheered.

Solution¶

import nltk

grammar = nltk.CFG.fromstring("""
S -> S 'or' S
S -> S 'and' S
S -> N V
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' | 'cheered'
""")

statement = 'Kim arrived or Dana left and everyone cheered'.split(' ')
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(statement))
for tree in trees:
    print tree
(S
  (S (S (N Kim) (V arrived)) or (S (N Dana) (V left)))
  and
  (S (N everyone) (V cheered)))
(S
  (S (N Kim) (V arrived))
  or
  (S (S (N Dana) (V left)) and (S (N everyone) (V cheered))))
trees[0]

2016-12-29 1216.png

trees[1]

2016-12-29 1217.png

Exercise¶

Consider the sequence of words: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. This is a grammatically correct sentence, as explained at Wikipedia:

The sentence uses three distinct meanings of the word buffalo:

  • the city of Buffalo, New York;
  • the verb (uncommon in usage) to buffalo, meaning “to bully, harass, or intimidate” or “to baffle”; and
  • the animal, bison (often called buffalo in North America).

The sentence can be phrased differently as “Those buffalo(es) from Buffalo that are intimidated by buffalo(es) from Buffalo intimidate buffalo(es) from Buffalo.”

Consider the tree diagram presented on this Wikipedia page, and write down a suitable grammar. Normalize case to lowercase, to simulate the problem that a listener has when hearing this sentence. Can you find other parses for this sentence? How does the number of parse trees grow as the sentence gets longer? (More examples of these sentences can be found at another Wikipedia page)

2016-12-29 1220.png

  • PN = proper noun
  • N = noun
  • V = verb
  • NP = noun phrase
  • RC = relative clause
  • VP = verb phrase
  • S = sentence

Solution¶

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PN N
NP -> NP RC
RC -> NP V
VP -> V NP
N -> 'buffalo'
V -> 'buffalo'
PN -> 'buffalo'
""")

statement = 'buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo'.split(' ')
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(statement))
for tree in trees:
    print tree
(S
  (NP (PN buffalo) (N buffalo))
  (VP
    (V buffalo)
    (NP
      (NP (PN buffalo) (N buffalo))
      (RC (NP (PN buffalo) (N buffalo)) (V buffalo)))))
(S
  (NP
    (NP (PN buffalo) (N buffalo))
    (RC (NP (PN buffalo) (N buffalo)) (V buffalo)))
  (VP (V buffalo) (NP (PN buffalo) (N buffalo))))
trees[0]
trees[1]

More about concepts¶

Each node in the syntax tree is called a constituent.

The immediate constituents of S are N and V.

Start-symbol of a grammar is the lefthand side of the first production. In all examples, it was S.

More about grammars¶

You cannot place multiword lexical items:

NP -> 'New York'

You may have a recursive grammar (Nom = nominals):

Nom -> Adj Nom

You can have indirect recursion:

S -> NP VP
VP -> V S

A grammar may have a left-recursive productions:

X -> X Y

Later on we’ll see that it’s important, because some parsers (RecursiveDescentParser) cannot handle left-recursive productions.

6.2. Parsers¶

ChartParser¶

  • Is slow and not memory efficient.
  • Can handle left-recursive grammars.
  • Always finds all possible syntax trees.
  • Is the best parser to start with. Consider other parsers only if this one is not fast enough or it’s not memory efficient enough.

RecursiveDescentParser¶

  • Does top-down parsing.
  • Does not work with left-recursive productions (they sent it to an infinite loop).
  • Wastes a lot of time, considering words and structures that do not correspond to the input sentence.
  • When backtracking, this parser discards parsed constituents that will need to be rebuilt again later.

ShiftReduceParser¶

  • Does bottom-up parsing.
  • Does not implement any backtracking, so it is not guaranteed to find a parse for a text, even if one exists. Furthermore, it will only find at most one parse, even if more parses exist.

7. Building Feature Based Grammars¶

import nltk

7.0 Motivation¶

There are some issues with context free grammars. Let’s see an example.

Exercise¶

Develop a context free grammar to parse sentences like:

  • This dog runs.
  • These dogs run.

Write function parse that:

  • takes a sentence (string)
  • uses global variable grammar
  • prints all trees (or a message if the sentence is invalid)
  • returns the first one.

Solution¶

grammar = nltk.CFG.fromstring("""
S   ->   NP V
NP  ->   Det N
Det  ->  'this' | 'these'
N    ->  'dog' | 'dogs'
V    ->  'run' | 'runs'
""")

def parse(sentence):
    parser = nltk.ChartParser(grammar)
    tokens = sentence.split(' ')
    trees = list(parser.parse(tokens))
    for tree in trees:
        print(tree)
    if len(trees) > 0:
        return trees[0]
    else:
        print("Ungrammatical sentence.")

This correctly parses the first sentence:

parse('this dog runs')
(S (NP (Det this) (N dog)) (V runs))

2016-12-29 1238.png

as well as the second one:

parse('these dogs run')
(S (NP (Det these) (N dogs)) (V run))

2016-12-29 1240.png

Unfortunately, this also allows incorrect variants:

parse('this dog run')
(S (NP (Det this) (N dog)) (V run))

What can we do?

Exercise¶

Develop a grammar that parses sentences like This dog runs. and These dogs run., but doesn’t accept This dog run..

Hint: duplicate each predicate.

Solution¶

grammar = nltk.CFG.fromstring("""
S -> NP_SG VP_SG
S -> NP_PL VP_PL
NP_SG -> Det_SG N_SG
NP_PL -> Det_PL N_PL
VP_SG -> V_SG
VP_PL -> V_PL

Det_SG -> 'this'
Det_PL -> 'these'
N_SG -> 'dog'
N_PL -> 'dogs'
V_SG -> 'runs'
V_PL -> 'run'
""")

This correctly parses valid sentences:

parse('this dog runs')
(S (NP_SG (Det_SG this) (N_SG dog)) (VP_SG (V_SG runs)))

as well as this one:

parse('these dogs run')
(S (NP_PL (Det_PL these) (N_PL dogs)) (VP_PL (V_PL run)))

It also blocks invalid sentences:

parse('this dog run')
Ungrammatical sentence.

and this one too:

parse('these dogs runs')
Ungrammatical sentence.

Conclusion¶

You can see this approach is not going to scale.

7.1. Grammatical Features¶

Fixing these dog runs¶

The traditional categories of context-free grammar are atomic symbols.

We can extend the framework of context free grammars with features, to gain more fine-grained control over grammatical categories and productions.

Let’s rewrite our example of running dogs using feature based context free grammars. First, we need to define grammar. This time, it’ll be FeatureGrammar:

featured_grammar = nltk.grammar.FeatureGrammar.fromstring('''
S -> NP[NUM=?n] V[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'these'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
V[NUM=sg] -> 'runs'
V[NUM=pl] -> 'run'
''')

We also need a parser. We cannot use traditional parsers that we used so far, because they work with casual grammars. Fortunately, nltk comes with a lot of parsers for feature based grammars to:

[x for x in dir(nltk.parse)
 if 'feature' in x.lower()
 and 'parser' in x.lower()]
['FeatureBottomUpChartParser',
 'FeatureBottomUpLeftCornerChartParser',
 'FeatureChartParser',
 'FeatureEarleyChartParser',
 'FeatureIncrementalBottomUpChartParser',
 'FeatureIncrementalBottomUpLeftCornerChartParser',
 'FeatureIncrementalChartParser',
 'FeatureIncrementalTopDownChartParser',
 'FeatureTopDownChartParser']

We’ll use FeatureChartParser for the same reason we chose ChartParser:

def featured_parse(sentence):
    featured_parser = nltk.parse.FeatureChartParser(featured_grammar)
    tokens = sentence.split(' ')
    trees = list(featured_parser.parse(tokens))
    for tree in trees:
        print(tree)
    if len(trees) > 0:
        return trees[0]
    else:
        print("Ungrammatical sentence.")

Now, we can parse valid sentences:

featured_parse('this dog runs')
(S[]
  (NP[NUM='sg'] (Det[NUM='sg'] this) (N[NUM='sg'] dog))
  (V[NUM='sg'] runs))

2016-12-29 1243.png

as well as this one:

featured_parse('this dog runs')
(S[]
  (NP[NUM='sg'] (Det[NUM='sg'] this) (N[NUM='sg'] dog))
  (V[NUM='sg'] runs))

2016-12-29 1243.png

And reject this one:

featured_parse('these dogs runs')
Ungrammatical sentence.

as well as this one:

featured_parse('this dog run')
Ungrammatical sentence.

as well as this one:

featured_parse('these dog runs')
Ungrammatical sentence.

Exercise¶

Have a look at an example of more complicated feature based grammar:

nltk.data.show_cfg('grammars/book_grammars/feat0.fcfg')
% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> N[NUM=?n]
NP[NUM=?n] -> PropN[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl]
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg]-> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children'
IV[TENSE=pres,  NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres,  NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'

Develop some correct and incorrect sentences that are valid in this grammar.

Solution¶

feat0_parser = nltk.load_parser('grammars/book_grammars/feat0.fcfg')

def feat0_parse(sentence):
    tokens = sentence.split(' ')
    trees = list(feat0_parser.parse(tokens))
    for tree in trees:
        print(tree)
    if len(trees) > 0:
        return trees[0]
    else:
        print("Ungrammatical sentence.")

Correct one:

feat0_parse('this dog disappears')
(S[]
  (NP[NUM='sg'] (Det[NUM='sg'] this) (N[NUM='sg'] dog))
  (VP[NUM='sg', TENSE='pres']
    (IV[NUM='sg', TENSE='pres'] disappears)))

2016-12-29 1248.png

Incorrect one:

feat0_parse('these dog disappears')
Ungrammatical sentence.

Grammatical one:

feat0_parse('all dogs see Kim')
(S[]
  (NP[NUM='pl'] (Det[NUM='pl'] all) (N[NUM='pl'] dogs))
  (VP[NUM='pl', TENSE='pres']
    (TV[NUM='pl', TENSE='pres'] see)
    (NP[NUM='sg'] (PropN[NUM='sg'] Kim))))

2016-12-29 1253.png

Ungrammatical one:

feat0_parse('all dogs see this Kim')
Ungrammatical sentence.

Valid one:

feat0_parse('every child sees some cars')
(S[]
  (NP[NUM='sg'] (Det[NUM='sg'] every) (N[NUM='sg'] child))
  (VP[NUM='sg', TENSE='pres']
    (TV[NUM='sg', TENSE='pres'] sees)
    (NP[NUM='pl'] (Det[] some) (N[NUM='pl'] cars))))

2016-12-29 1255.png

This grammar still accepts some sentences which are invalid:

feat0_parse('every child sees some car')
(S[]
  (NP[NUM='sg'] (Det[NUM='sg'] every) (N[NUM='sg'] child))
  (VP[NUM='sg', TENSE='pres']
    (TV[NUM='sg', TENSE='pres'] sees)
    (NP[NUM='sg'] (Det[] some) (N[NUM='sg'] car))))

7.2. Processing Feature Structures¶

Boolean features¶

So far, we have seen categorical features. These are example of atomic features that can be decomposed into subparts.

Another example are boolean features. If you have only two categories: + and -, you can use a shorter syntax. You can write +PLURAL or -PLURAL instead of PLURAL=+ or PLURAL=-.

Exercise¶

Rewrite the following grammar, so that it uses boolean features.

S -> NP[NUM=?n] V[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'these'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
V[NUM=sg] -> 'runs'
V[NUM=pl] -> 'run'

Solution¶

featured_grammar = nltk.grammar.FeatureGrammar.fromstring('''
S -> NP[PLURAL=?P] V[PLURAL=?P]
NP[PLURAL=?P] -> Det[PLURAL=?P] N[PLURAL=?P]
Det[-PLURAL] -> 'this'
Det[+PLURAL] -> 'these'
N[-PLURAL] -> 'dog'
N[+PLURAL] -> 'dogs'
V[-PLURAL] -> 'runs'
V[+PLURAL] -> 'run'
''')

Let’s see if it handles sentences correctly:

featured_parse('this dog runs')
(S[]
  (NP[-PLURAL] (Det[-PLURAL] this) (N[-PLURAL] dog))
  (V[-PLURAL] runs))

2016-12-29 1257.png

And reject this one:

featured_parse('these dogs runs')
Ungrammatical sentence.

Exercise¶

Write a grammar to parse sentences like:

  • I am happy.
  • She is happy.
  • Kim is happy.
  • You are happy.

And reject invalid ones like:

  • She am happy.
  • Kim are happy

Solution¶

featured_grammar = nltk.grammar.FeatureGrammar.fromstring('''
S -> PRON[NUM=?n, PER=?p] N[NUM=?n, PER=?p] 'happy'

PRON[NUM=sg, PER=1] -> 'I'
PRON[PER=2] -> 'you'
PRON[NUM=sg, PER=3] -> 'he' | 'she' | 'it' | 'Kim'
PRON[NUM=pl, PER=1] -> 'we'
PRON[NUM=pl, PER=3] -> 'they'

N[NUM=sg, PER=1] -> 'am'
N[PER=2] -> 'are'
N[NUM=pl] -> 'are'
N[NUM=sg, PER=3] -> 'is'
''')

Let’s check valid sentences:

featured_parse('I am happy')
(S[] (PRON[NUM='sg', PER=1] I) (N[NUM='sg', PER=1] am) happy)

2016-12-29 1259.png

and this one:

featured_parse('she is happy')
(S[] (PRON[NUM='sg', PER=3] she) (N[NUM='sg', PER=3] is) happy)

2016-12-29 1301.png

as well as this one:

featured_parse('Kim is happy')
(S[] (PRON[NUM='sg', PER=3] Kim) (N[NUM='sg', PER=3] is) happy)

2016-12-29 1302.png

and this one:

featured_parse('you are happy')
(S[] (PRON[PER=2] you) (N[NUM='pl'] are) happy)
(S[] (PRON[PER=2] you) (N[PER=2] are) happy)

2016-12-29 1304.png

Let’s check invalid sentences:

featured_parse('she am happy')
Ungrammatical sentence.

and this one:

featured_parse('Kim are happy')
Ungrammatical sentence.

Complex features¶

Features can be atomic values like strings or booleans. They can be also complex values holding more than one atomic value (or yet another complex value). Here are a few examples:

S                    -> NP[AGR=?n] VP[AGR=?n]
NP[AGR=?n]           -> PropN[AGR=?n]
VP[AGR=?n]           -> Cop[AGR=?n] Adj

Cop[AGR=[NUM=sg, PER=3]]              -> 'is'
PropN[AGR=[NUM=sg, PER=3]]            -> 'Kim'
Adj                                   -> 'happy'

Exercise¶

Rewrite the previous grammar so that it uses complex features.

Solution¶

featured_grammar = nltk.grammar.FeatureGrammar.fromstring('''
S -> PRON[AGR=?a] N[AGR=?a] 'happy'

PRON[AGR=[NUM=sg, PER=1]] -> 'I'
PRON[AGR=[PER=2]] -> 'you'
PRON[AGR=[NUM=sg, PER=3]] -> 'he' | 'she' | 'it' | 'Kim'
PRON[AGR=[NUM=pl, PER=1]] -> 'we'
PRON[AGR=[NUM=pl, PER=3]] -> 'they'

N[AGR=[NUM=sg, PER=1]] -> 'am'
N[AGR=[PER=2]] -> 'are'
N[AGR=[NUM=pl]] -> 'are'
N[AGR=[NUM=sg, PER=3]] -> 'is'
''')

Let’s check valid sentences:

featured_parse('I am happy')
(S[]
  (PRON[AGR=[NUM='sg', PER=1]] I)
  (N[AGR=[NUM='sg', PER=1]] am)
  happy)

2016-12-29 1307.png

and this one:

featured_parse('she is happy')
(S[]
  (PRON[AGR=[NUM='sg', PER=3]] she)
  (N[AGR=[NUM='sg', PER=3]] is)
  happy)

2016-12-29 1314.png

as well as this one:

featured_parse('Kim is happy')
(S[]
  (PRON[AGR=[NUM='sg', PER=3]] Kim)
  (N[AGR=[NUM='sg', PER=3]] is)
  happy)

2016-12-29 1316.png

and this one:

featured_parse('you are happy')
(S[] (PRON[AGR=[PER=2]] you) (N[AGR=[NUM='pl']] are) happy)
(S[] (PRON[AGR=[PER=2]] you) (N[AGR=[PER=2]] are) happy)

2016-12-29 1317.png

Let’s check invalid sentences:

featured_parse('she am happy')
Ungrammatical sentence.

and this one:

featured_parse('Kim are happy')
Ungrammatical sentence.

Tuple features¶

You can use combine primitive strings into tuples. This is actually a common way of mechanically translating one language into another one, i.e. SQL.

featured_grammar = nltk.grammar.FeatureGrammar.fromstring( S[SEM=(?first + ?second)] -> NP[SEM=?first] VP[SEM=?second] NP[SEM='John'] -> 'John' | 'he' VP[SEM='lives'] -> 'lives' )

Let’s see that it actually works:

featured_parse('John lives')
Ungrammatical sentence.

and one sentence more:

featured_parse('he lives')
Ungrammatical sentence.

Exercise¶

Develop a grammar to convert sentence like What cities are located in China? into SQL queries like SELECT Country FROM city_table WHERE CITY = 'athens'. First of all, write a non-feature based grammar that parses sentences like What cities are located in China?. Start from the following stub:

S -> NP VP
???
NP -> 'Greece'
NP -> 'China'
Det -> 'Which' | 'What'
N -> 'cities'
IV -> 'are'
A -> 'located'
P -> 'in'

Solution - step 1¶

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> IV PP
VP -> IV AP
NP -> Det N
PP -> P NP
AP -> A PP
NP -> 'Greece'
NP -> 'China'
Det -> 'Which' | 'What'
N -> 'cities'
IV -> 'are'
A -> 'located'
P -> 'in'
""")

Let’s check an example sentence:

parse('What cities are located in China')
(S
  (NP (Det What) (N cities))
  (VP (IV are) (AP (A located) (PP (P in) (NP China)))))

2016-12-29 1320.png

Convert this grammar to a feature based grammar, so that it converts the sentence into SQL query.

Use % start S at the beginning of the grammar.

Solution - step 2¶

featured_grammar = nltk.grammar.FeatureGrammar.fromstring('''
% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
N[SEM='City FROM city_table'] -> 'cities'
IV[SEM=''] -> 'are'
A[SEM=''] -> 'located'
P[SEM=''] -> 'in'
''')

Let’s see that it actually works:

t = featured_parse('What cities are located in China')
t
(S[SEM=(SELECT, City FROM city_table, WHERE, , , Country="china")]
  (NP[SEM=(SELECT, City FROM city_table)]
    (Det[SEM='SELECT'] What)
    (N[SEM='City FROM city_table'] cities))
  (VP[SEM=(, , Country="china")]
    (IV[SEM=''] are)
    (AP[SEM=(, Country="china")]
      (A[SEM=''] located)
      (PP[SEM=(, Country="china")]
        (P[SEM=''] in)
        (NP[SEM='Country="china"'] China)))))

2016-12-29 1322.png

Let’s see the end result (SQL query)

' '.join(t.label()['SEM'])
u'SELECT City FROM city_table WHERE   Country="china"'

8. Analyzing the Meaning of Sentences¶

We need the following imports in this chapter:

import nltk

8.1. Semantics and Logic¶

One approach to analyzing the meaning of sentence is to leverage its semantic. Look at the sentences:

  • John is 25 old. He lives in Titchfield.

There are two entities: Titchfield and John. John is referred to in to different ways (John and he), but both words represent the same person.

If you think about it, you can realize that definite noun phrases (like the apple) and proper nouns (like John, London, Microsoft) refer to things in the world. They are names of some entities. This is the very first fundamental notion of semantic - meaning of words and sentences.

What’s more, each sentence can be true or false in certain situations. John can be 25 old or not. He may live in Titchfield or not.

You also realize that you can reason about sentences. Consider the following sentences:

  • Sylvania is to the north of Freedonia.
  • Freedonia is to the north of Sylvania.

There are two entities: Sylvania and Freedonia. You can quickly realize that these sentences are inconsistent - they cannot be both true. How can this conclusion be made by a computer? How can a computer infer this conclusion?

In this chapter, we’ll focus on logic-based approach which is guided by our judgment of consistency and inconsistency.

8.2. Propositional Logic¶

We need a logical language that makes reasoning formally explicit.

Propositional logic let us represent some linguistic structures like:

  • A and B
  • A or B,
  • not A,
  • if A, then B
  • and A if and only if B (or shortly A iff B).

Here are a few examples of what sentences can be represented in propositional logic:

  • John is 25 old and John lives in Titchfield. - A and B
  • It’s not true that Freedonia is to the north of Sylvania. - not A

Syntax¶

We can represent these structures using special syntax:

nltk.Expression.fromstring('A & B')
<AndExpression (A & B)>

You can represent negation with -:

nltk.Expression.fromstring('-A')
<NegatedExpression -A>

Use | for disjunction:

nltk.Expression.fromstring('A | B')
<OrExpression (A | B)>

Use -> for implication:

nltk.Expression.fromstring('A -> B')
<ImpExpression (A -> B)>

Use <-> for equivalence:

nltk.Expression.fromstring('A <-> B')
<IffExpression (A <-> B)>

You can combine them all. Use parenthesis when necessary:

nltk.Expression.fromstring('A | (B -> --C)')
<OrExpression (A | (B -> --C))>

Exercise¶

Translate the following sentences into propositional logic and verify that they can be processed with Expression.fromstring().

Provide a key which shows how the propositional variables in your translation correspond to expressions of English.

  1. If Angus sings, it is not the case that Bertie sulks.
  2. Cyril runs and barks.
  3. It will snow if it doesn’t rain.
  4. It’s not the case that Irene will be happy if Olive or Tofu comes.
  5. Pat didn’t cough or sneeze.
  6. If you don’t come if I call, I won’t come if you call.

Solution¶

1. If Angus sings, it is not the case that Bertie sulks.

# A = Angus sings
# B = Bertie sulks
nltk.Expression.fromstring('A -> -B')
<ImpExpression (A -> -B)>

2. Cyril runs and barks.

# C = Cyril runs
# D = Cyril barks
nltk.Expression.fromstring('C & D')
<AndExpression (C & D)>

3. It will snow if it doesn’t rain.

# E = it rains
# F = it will snow
nltk.Expression.fromstring('-E -> F')
<ImpExpression (-E -> F)>

4. It’s not the case that Irene will be happy if Olive or Tofu comes.

# G = Olive comes
# H = Tofu comes
# I = Irene will be happy
nltk.Expression.fromstring('-((G | H) -> I)')
<NegatedExpression -((G | H) -> I)>

5. Pat didn’t cough or sneeze.

# J = Pat cough
# K = Pat sneeze
nltk.Expression.fromstring('-(J | K)')
<NegatedExpression -(J | K)>

6. If you don’t come if I call, I won’t come if you call.

# L = I call
# M = You come
# N = You call
# O = I come
nltk.Expression.fromstring('(L -> -M) -> (N -> -O)')
<ImpExpression ((L -> -M) -> (N -> -O))>

Valuation¶

All these expressions are true or false, depending on the valuations of each sentence. Let’s create an example valuation, that is a mapping from basic expressions (like A, B or C) to their values (True of False):

val = nltk.Valuation([('A', True), ('B', True), ('C', False)])

As mentioned, it’s a plain map:

val['B']
True

We need one more piece, which we won’t explain now:

dom = set()
g = nltk.Assignment(dom)
m = nltk.Model(dom, val)

And we can evaluate different expressions:

m.evaluate('A & B', g)
True

Proving¶

Let’s consider the following sentences:

  • Sylvania is to the north of Freedonia. abbreviated SnF
  • Freedonia is not to the north of Sylvania. abbreviated NotFnS

You can realize that the second sentence must be true if the first one holds. This is because is_to_the_north relation is asymmetrical.

Let’s think about it again. SnF is our assumption. NotFnS is our conclusion. The step of moving from one or more assumptions to a conclusion is inference. We can make an argument that infers a conclusion based on some assumptions:

SnF = nltk.Expression.fromstring('SnF')
NotFnS = nltk.Expression.fromstring('-FnS')
R = nltk.Expression.fromstring('SnF -> -FnS')

prover = nltk.Prover9()
assumptions = [SnF, R]
conclusion = NotFnS
prover.prove(conclusion, assumptions)
True

Propositional Logic can handle all the boolean operators. However, how could we represent the general truth behind is_to_the_north? How can we tell computer that if A is to the north of B then B cannot be to the north of A, for all A and B?

8.3. First-Order Logic¶

Predicates¶

To represent sentences like John walks., we need somehow to represents relations. In the first-order logic, we call entities terms. Predicates take one or more terms as arguments and are true or false.

For example, in the sentence John walks we have one term - John. We also have a predicate walk, which is true for John. walk is an example of unary predicate. We can write walk(John) which is just a different syntax:

nltk.Expression.fromstring("walk(john)")
<ApplicationExpression walk(john)>

In the sentence Sylvania is to the north of Freedonia we have two terms: Sylvania and Freedonia. is_to_the_north_of is an example of binary predicate. We also have a predicate is to the north of, which is true for the pair of (Sylvania, Freedonia). Therefore, we can write:

nltk.Expression.fromstring("is_to_the_north_of(Sylvania, Freedonia)")
<ApplicationExpression is_to_the_north_of(Sylvania,Freedonia)>

Logic-base approach doesn’t care about the names of terms or predicates. It utilizes only relations between them.

Exercise¶

Translate the following sentences into predicate-argument formula of first order logic.

  1. Angus likes Cyril and Irene hates Cyril.
  2. Tofu is taller than Bertie.
  3. Bruce loves himself and Pat does too.
  4. Cyril saw Bertie, but Angus didn’t.
  5. Cyril is a fourlegged friend.
  6. Tofu and Olive are near each other.

Solution¶

1. Angus likes Cyril and Irene hates Cyril.

nltk.Expression.fromstring("like(Angus, Cyril) & hate(Irene, Cyril)")
<AndExpression (like(Angus,Cyril) & hate(Irene,Cyril))>

2. Tofu is taller than Bertie.

nltk.Expression.fromstring("taller(Tofu, Bertie)")
<ApplicationExpression taller(Tofu,Bertie)>

3. Bruce loves himself and Pat does too.

nltk.Expression.fromstring("love(Bruce, Bruce) & love(Pat, Pat)")
<AndExpression (love(Bruce,Bruce) & love(Pat,Pat))>

4. Cyril saw Bertie, but Angus didn’t.

nltk.Expression.fromstring("saw(Cyril, Bertie) & -saw(Angus, Bertie)")
<AndExpression (saw(Cyril,Bertie) & -saw(Angus,Bertie))>

5. Cyril is a fourlegged friend.

nltk.Expression.fromstring("has_four_legs(Cyril) & friendly(Cyril)")
<AndExpression (has_four_legs(Cyril) & friendly(Cyril))>

6. Tofu and Olive are near each other.

nltk.Expression.fromstring("near(Tofu, Olive) & near(Olive, Tofu)")
<AndExpression (near(Tofu,Olive) & near(Olive,Tofu))>

Quantifiers¶

How do we represent sentences like the following ones?

  • All dogs have four legs.
  • There exists a city in the north of Sylvania.
  • There is no city in the north of Sylvania.

Or even a simpler one:

Or even a simpler one: * It is a dog and it walks.

This one contains indefinite noun phrases. This sentence is not true nor false unless we provide some context. This is basically an open formula with two occurrences of variable x:

e = nltk.Expression.fromstring("dog(x) and walk(x)")

ou can see there is one free variable:

e.free()
{Variable('x')}

While simple expressions like is_to_the_north_of(Sylvania, Freedonia) don’t have any free variable:

nltk.Expression.fromstring("is_to_the_north_of(Sylvania, Freedonia)").free()
set()

You can bind the variable by placing an existential quantifier (there exists x), so that you can represent a statement like At least one entity is a dog and walks., which we write in this way:

nltk.Expression.fromstring("exists x. (dog(x) & walk(x))")
<ExistsExpression exists x.(dog(x) & walk(x))>

There is also universal quantifier (for all x). It represents sentences like Every entity is a dog and walks.. We write:

nltk.Expression.fromstring("all x. (dog(x) & walk(x))")
<AllExpression all x.(dog(x) & walk(x))>

We can use implication to say things like every dog:

e = nltk.Expression.fromstring("all x. (dog(x) -> walk(x))")
e
<AllExpression all x.(dog(x) -> walk(x))>

Now, these expressions can be true or false. They don’t have any free variables:

e.free()
set()

Note that variables should be single letters:

nltk.Expression.fromstring("walk(x)").free()
{Variable('x')}

Otherwise, they are interpreted as entity names:

nltk.Expression.fromstring("walk(john)").free()
set()

Exercise¶

Translate the following sentences into quantified formulas of first order logic.

  1. Angus likes someone and someone likes Julia.
  2. Angus loves a dog who loves him.
  3. Nobody smiles at Pat.
  4. Somebody coughs and sneezes.
  5. Nobody coughed or sneezed.
  6. Bruce loves somebody other than Bruce.
  7. Nobody other than Matthew loves Pat.
  8. Cyril likes everyone except for Irene.
  9. Exactly one person is asleep.

Solution¶

1. Angus likes someone and someone likes Julia.

nltk.Expression.fromstring("(exists x. like(Angus, x)) & (exists y. like(y, Julia))")
<AndExpression (exists x.like(Angus,x) & exists y.like(y,Julia))>

assuming that Angus likes John and Matthew likes Julia is acceptable (that is, both someones can refer to different people).

2. Angus loves a dog who loves him.

nltk.Expression.fromstring("exists x. (love(Angus, x) & love(x, Angus))")
<ExistsExpression exists x.(love(Angus,x) & love(x,Angus))>

3. Nobody smiles at Pat.

nltk.Expression.fromstring("all x. - smile_at(x, Pat)")
<AllExpression all x.-smile_at(x,Pat)>

4. Somebody coughs and sneezes.

nltk.Expression.fromstring("exists x. (cough(x) & sneeze(x))")
<ExistsExpression exists x.(cough(x) & sneeze(x))>

5. Nobody coughed or sneezed.

nltk.Expression.fromstring("- exists x. (cough(x) | sneeze(x))")
<NegatedExpression -exists x.(cough(x) | sneeze(x))>

or

nltk.Expression.fromstring("all x. - (cough(x) | sneeze(x))")
<AllExpression all x.-(cough(x) | sneeze(x))>

which is equivalent to:

nltk.Expression.fromstring("all x. (-cough(x) & -sneeze(x))")
<AllExpression all x.(-cough(x) & -sneeze(x))>

6. Bruce loves somebody other than Bruce.

nltk.Expression.fromstring("exists x. (love(Bruce, x) & (x != Bruce))")
<ExistsExpression exists x.(love(Bruce,x) & -(x = Bruce))>

7. Nobody other than Matthew loves Pat.

nltk.Expression.fromstring("- exists x. (x != Matthew & love(x, Pat))")
<NegatedExpression -exists x.(-(x = Matthew) & love(x,Pat))>

8. Cyril likes everyone except for Irene.

nltk.Expression.fromstring("all x. (x != Irene -> like(Cyril, x))")
<AllExpression all x.(-(x = Irene) -> like(Cyril,x))>

9. Exactly one person is asleep.

nltk.Expression.fromstring("exists x. (asleep(x) & (all y. asleep(y) -> (x == y)))")
<ExistsExpression exists x.(asleep(x) & (all y.asleep(y) -> (x = y)))>

Exercise (proving)¶

Recall the constraint of is_to_the_north_of relation. If A is_to_the_north_of B, then it’s not true that B is_to_the_north_of A.

How do you represent it using the syntax described in this chapter? Use this rule to infer that Freedonia is not to the north of Sylvania, assuming that Sylvania is to the north of Freedonia.

Solution¶

NotFnS = nltk.Expression.fromstring('-is_to_the_north_of(Freedonia, Sylvania)')
SnF = nltk.Expression.fromstring('is_to_the_north_of(Sylvania, Freedonia)')
R = nltk.Expression.fromstring('all x. all y. (is_to_the_north_of(x, y) -> -is_to_the_north_of(y, x))')

prover = nltk.Prover9()
prover.prove(NotFnS, [SnF, R])
True

If you want to see details, add verbose=True parameter:

prover.prove(NotFnS, [SnF, R], verbose=True)
Calling: /usr/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    is_to_the_north_of(Sylvania,Freedonia).
    all x all y (is_to_the_north_of(x,y) -> -(is_to_the_north_of(y,x))).
end_of_list.

formulas(goals).
    -(is_to_the_north_of(Freedonia,Sylvania)).
end_of_list.



Return code: 0
stdout:
 ============================== Prover9 ===============================
Prover9 (32) version 2009-02A, February 2009.
Process 14187 was started by chris on chris,
Fri Dec  9 20:28:05 2016
The command was "/usr/bin/prover9".
============================== end of head ===========================

============================== INPUT =================================
assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
is_to_the_north_of(Sylvania,Freedonia).
(all x all y (is_to_the_north_of(x,y) -> -is_to_the_north_of(y,x))).
end_of_list.

formulas(goals).
-is_to_the_north_of(Freedonia,Sylvania).
end_of_list.

============================== end of input ==========================

============================== PROCESS NON-CLAUSAL FORMULAS ==========

% Formulas that are not ordinary clauses:
1 (all x all y (is_to_the_north_of(x,y) -> -is_to_the_north_of(y,x))) # label(non_clause).  [assumption].
2 -is_to_the_north_of(Freedonia,Sylvania) # label(non_clause) # label(goal).  [goal].

============================== end of process non-clausal formulas ===

============================== PROCESS INITIAL CLAUSES ===============

% Clauses before input processing:

formulas(usable).
end_of_list.

formulas(sos).
is_to_the_north_of(Sylvania,Freedonia).  [assumption].
-is_to_the_north_of(x,y) | -is_to_the_north_of(y,x).  [clausify(1)].
is_to_the_north_of(Freedonia,Sylvania).  [deny(2)].
end_of_list.

formulas(demodulators).
end_of_list.

============================== PREDICATE ELIMINATION =================

No predicates eliminated.

============================== end predicate elimination =============

Term ordering decisions:
Predicate symbol precedence:  predicate_order([ is_to_the_north_of ]).
Function symbol precedence:  function_order([ Freedonia, Sylvania ]).
After inverse_order:  (no changes).
Unfolding symbols: (none).

Auto_inference settings:
  % set(neg_binary_resolution).  % (HNE depth_diff=0)
  % clear(ordered_res).  % (HNE depth_diff=0)
  % set(ur_resolution).  % (HNE depth_diff=0)
    % set(ur_resolution) -> set(pos_ur_resolution).
    % set(ur_resolution) -> set(neg_ur_resolution).

Auto_process settings:
  % set(unit_deletion).  % (Horn set with negative nonunits)

kept:      3 is_to_the_north_of(Sylvania,Freedonia).  [assumption].
kept:      4 -is_to_the_north_of(x,y) | -is_to_the_north_of(y,x).  [clausify(1)].
kept:      5 is_to_the_north_of(Freedonia,Sylvania).  [deny(2)].

============================== end of process initial clauses ========

============================== CLAUSES FOR SEARCH ====================

% Clauses after input processing:

formulas(usable).
end_of_list.

formulas(sos).
3 is_to_the_north_of(Sylvania,Freedonia).  [assumption].
4 -is_to_the_north_of(x,y) | -is_to_the_north_of(y,x).  [clausify(1)].
5 is_to_the_north_of(Freedonia,Sylvania).  [deny(2)].
end_of_list.

formulas(demodulators).
end_of_list.

============================== end of clauses for search =============

============================== SEARCH ================================

% Starting search at 0.01 seconds.

given #1 (I,wt=3): 3 is_to_the_north_of(Sylvania,Freedonia).  [assumption].

given #2 (I,wt=6): 4 -is_to_the_north_of(x,y) | -is_to_the_north_of(y,x).  [clausify(1)].
-------- Proof 1 --------

============================== PROOF =================================

% Proof 1 at 0.01 (+ 0.00) seconds.
% Length of proof is 6.
% Level of proof is 2.
% Maximum clause weight is 6.
% Given clauses 2.

1 (all x all y (is_to_the_north_of(x,y) -> -is_to_the_north_of(y,x))) # label(non_clause).  [assumption].
2 -is_to_the_north_of(Freedonia,Sylvania) # label(non_clause) # label(goal).  [goal].
3 is_to_the_north_of(Sylvania,Freedonia).  [assumption].
4 -is_to_the_north_of(x,y) | -is_to_the_north_of(y,x).  [clausify(1)].
5 is_to_the_north_of(Freedonia,Sylvania).  [deny(2)].
6 $F.  [resolve(4,a,3,a),unit_del(a,5)].

============================== end of proof ==========================

============================== STATISTICS ============================

Given=2. Generated=4. Kept=3. proofs=1.
Usable=2. Sos=1. Demods=0. Limbo=0, Disabled=3. Hints=0.
Kept_by_rule=0, Deleted_by_rule=0.
Forward_subsumed=0. Back_subsumed=0.
Sos_limit_deleted=0. Sos_displaced=0. Sos_removed=0.
New_demodulators=0 (0 lex), Back_demodulated=0. Back_unit_deleted=0.
Demod_attempts=0. Demod_rewrites=0.
Res_instance_prunes=0. Para_instance_prunes=0. Basic_paramod_prunes=0.
Nonunit_fsub_feature_tests=0. Nonunit_bsub_feature_tests=1.
Megabytes=0.02.
User_CPU=0.01, System_CPU=0.00, Wall_clock=0.

============================== end of statistics =====================

============================== end of search =========================

THEOREM PROVED

THEOREM PROVED

Exiting with 1 proof.

------ process 14187 exit (max_proofs) ------

Process 14187 exit (max_proofs) Fri Dec  9 20:28:05 2016
True

8.4. Discourse Semantics¶

So far, we focused on single sentences. A discourse is a sequence of sentences. Now we’ll focus on discourses.

Processing Discourses¶

dt = nltk.DiscourseTester(['A student dances', 'Every student is a person'])
dt.readings()
s0 readings:

s0-r0: exists z1.(student(z1) & dance(z1))

s1 readings:

s1-r0: all z1.(student(z1) -> person(z1))

Let’s see what grammar was used?

dt.grammar()
% start S
# Grammar Rules
S[SEM = <app(?subj,?vp)>] -> NP[NUM=?n,SEM=?subj] VP[NUM=?n,SEM=?vp]
NP[NUM=?n,SEM=<app(?det,?nom)> ] -> Det[NUM=?n,SEM=?det]  Nom[NUM=?n,SEM=?nom]
NP[LOC=?l,NUM=?n,SEM=?np] -> PropN[LOC=?l,NUM=?n,SEM=?np]
NP[-LOC,NUM=sg,SEM=<Q. (- exists x. (person(x) & Q(x)))>] -> 'nobody' | 'Nobody'
NP[-LOC,NUM=sg,SEM=<Q. exists x. (person(x) & Q(x))>] -> 'somebody' | 'Somebody'
Pred[SEM=?prd] -> PredN[SEM=?prd] | PP[+LOC,+PRED,SEM=?prd] | Adj[SEM=?prd]
PredN[NUM=?n, SEM=?nom] -> Det[NUM=?n] Nom[NUM=?n, SEM=?nom]
Nom[NUM=?n,SEM=?nom] -> N[NUM=?n,SEM=?nom]
Nom[NUM=?n,SEM=<app(?pp,?nom)>] -> N[NUM=?n,SEM=?nom] PP[SEM=?pp]
VP[NUM=?n,SEM=<app(?v,?obj)>] -> TV[NUM=?n,SEM=?v] NP[SEM=?obj]
VP[NUM=?n,SEM=<app(?v,?prd)>] -> AuxP[+COP,NUM=?n,SEM=?v] Pred[SEM=?prd]
VP[+neg,NUM=?n,SEM=<app(?v,?vp)>] -> AuxP[-COP,NUM=?n,SEM=?v] VP[NUM=pl,SEM=?vp]
AuxP[COP=?c,NUM=?n,SEM=<app(?neg,?aux)>] -> Aux[COP=?c,NUM=?n,SEM=?aux] Neg[SEM=?neg]
AuxP[COP=?c,NUM=?n,SEM=?aux] -> Aux[COP=?c,NUM=?n,SEM=?aux]
VP[NUM=?n,SEM=?v] -> IV[NUM=?n,SEM=?v]
VP[NUM=?n,SEM=<app(?pp,?vp)>] -> VP[NUM=?n,SEM=?vp] PP[-PRED,SEM=?pp]
PP[LOC=?l,PRED=?prd,SEM=<app(?p,?np)>] -> P[LOC=?l,PRED=?prd,SEM=?p] NP[LOC=?l,SEM=?np]
# Lexical Rules
PropN[-LOC,NUM=sg,SEM=<P.P(John)>] -> 'John'
PropN[-LOC,NUM=sg,SEM=<P.P(Mary)>] -> 'Mary'
PropN[-LOC,NUM=sg,SEM=<P.P(Suzie)>] -> 'Suzie'
PropN[-LOC,NUM=sg,SEM=<P.P(Vincent)>] -> 'Vincent'
PropN[-LOC,NUM=sg,SEM=<P.P(Mia)>] -> 'Mia'
PropN[-LOC,NUM=sg,SEM=<P.P(Marsellus)>] -> 'Marsellus'
PropN[-LOC,NUM=sg,SEM=<P.P(Fido)>] -> 'Fido'
PropN[+LOC, NUM=sg,SEM=<P.P(Noosa)>] -> 'Noosa'
NP[-LOC, NUM=sg, SEM=<P.x.P(x)>] -> 'who' | 'Who'
Det[NUM=sg,SEM=<P Q.all x.(P(x) -> Q(x))>] -> 'every' | 'Every'
Det[NUM=pl,SEM=<P Q.all x.(P(x) -> Q(x))>] -> 'all' | 'All'
Det[SEM=<P Q.exists x.(P(x) & Q(x))>] -> 'some' | 'Some'
Det[NUM=sg,SEM=<P Q.exists x.(P(x) & Q(x))>] -> 'a' | 'A'
Det[NUM=sg,SEM=<P Q.(- exists x.(P(x) & Q(x)))>] -> 'no' | 'No'
Det[NUM=sg,SEM=<P Q.exists x.((P(x) & Q(x)) & all y.(P(y) -> (x = y)))>] -> 'the' | 'The'
N[NUM=sg,SEM=<x.boy(x)>] -> 'boy'
N[NUM=pl,SEM=<x.boy(x)>] -> 'boys'
N[NUM=sg,SEM=<x.girl(x)>] -> 'girl'
N[NUM=pl,SEM=<x.girl(x)>] -> 'girls'
N[NUM=sg,SEM=<x.dog(x)>] -> 'dog'
N[NUM=pl,SEM=<x.dog(x)>] -> 'dogs'
N[NUM=sg,SEM=<x.student(x)>] -> 'student'
N[NUM=pl,SEM=<x.student(x)>] -> 'students'
N[NUM=sg,SEM=<x.person(x)>] -> 'person'
N[NUM=pl,SEM=<x.person(x)>] -> 'persons'
N[NUM=sg,SEM=<x.boxerdog(x)>] -> 'boxer'
N[NUM=pl,SEM=<x.boxerdog(x)>] -> 'boxers'
N[NUM=sg,SEM=<x.boxer(x)>] -> 'boxer'
N[NUM=pl,SEM=<x.boxer(x)>] -> 'boxers'
N[NUM=sg,SEM=<x.garden(x)>] -> 'garden'
N[NUM=sg,SEM=<x.kitchen(x)>] -> 'kitchen'
Adj[SEM=<x.happy(x)>] -> 'happy'
Adj[SEM=<x.drunk(x)>] -> 'drunk'
Adj[SEM=<x.married(x)>] -> 'married'
TV[NUM=sg,SEM=<X y.X(x.chase(y,x))>,tns=pres] -> 'chases'
TV[NUM=pl,SEM=<X y.X(x.chase(y,x))>,tns=pres] -> 'chase'
TV[NUM=sg,SEM=<X y.X(x.marry(y,x))>,tns=pres] -> 'marries'
TV[NUM=pl,SEM=<X y.X(x.marry(y,x))>,tns=pres] -> 'marry'
TV[NUM=sg,SEM=<X y.X(x.know(y,x))>,tns=pres] -> 'knows'
TV[NUM=pl,SEM=<X y.X(x.know(y,x))>,tns=pres] -> 'know'
TV[NUM=sg,SEM=<X y.X(x.see(y,x))>,tns=pres] -> 'sees'
TV[NUM=pl,SEM=<X y.X(x.see(y,x))>,tns=pres] -> 'see'
IV[NUM=sg,SEM=<x.bark(x)>,tns=pres] -> 'barks'
IV[NUM=pl,SEM=<x.bark(x)>,tns=pres] -> 'bark'
IV[NUM=sg,SEM=<x.walk(x)>,tns=pres] -> 'walks'
IV[NUM=pl,SEM=<x.walk(x)>,tns=pres] -> 'walk'
IV[NUM=pl,SEM=<x.dance(x)>,tns=pres] -> 'dance'
IV[NUM=sg,SEM=<x.dance(x)>,tns=pres] -> 'dances'
Aux[+COP,NUM=sg,SEM=<P x.P(x)>,tns=pres] -> 'is'
Aux[+COP,NUM=pl,SEM=<P x.P(x)>,tns=pres] -> 'are'
Aux[-COP,NUM=sg,SEM=<P x.P(x)>,tns=pres] -> 'does'
Aux[-COP,NUM=pl,SEM=<P x.P(x)>,tns=pres] -> 'do'
P[+LOC,-PRED,SEM=<X P x.X(y.(P(x) & in(x,y)))>] -> 'in'
P[+LOC,+PRED,SEM=<X x.X(y.in(x,y))>] -> 'in'
P[-LOC,SEM=<X P x.X(y.(P(x) & with(x,y)))>] -> 'with'
Neg[SEM=<T P.T(x.(- P(x)))>] -> 'not'

Why are discourses important?¶

How do you interpret statement like He walks.? To understand this sentence, you need to look back to find out who is he referring to. This is known as pronoun resolution. Sometimes it’s not clear how to resolve such ambiguities:

  • Angus used to have a dog. But he recently disappeared.
  • Angus used to have a dog. He took him for walks.

In the first sentence, he probably refers to the dog. In the second sentence, he refers to Angus.

9. Managing Linguistic Data¶

So far, we talked that corporas can be in different formats, depending whether they’re tagged or not and on other factors.

9.1. Data Formats (Lexicon vs Text)¶

There are two fundamental data types: lexicons and texts. Lexicon is a format where each line represents one record, for example a word or a phrase The line starts with the word or phrase and is followed with data about this record, for example:

  • comparative wordlist
  • verb paradigm
  • dictionary definition

Text is just a long sequence of words or tagged words. Newline characters don’t play any role for computer.

Below, you can see comparison of lexicon and texts from NLTK Book:

Lexicon-vs-text.png

9.2. Metadata¶

Metadata is structured data about data. It helps you discover language resources.

OAI stands for Open Archives initiative and provides a common framework across digital repositories of scholarly materials.

OLAC stands for Open Language Archives Community, an international partnership of institutions and individuals creating a worldwide virtual library of language resources. It’s homepage is http://www.language-archives.org/.

There are so many language resources that you may need a search engine. There actually is one in OLAC Language Resource Catalog.

See an example record (project Gutenberg). Keep in mind that corpora often are delivered with a README file:

Keep in mind that corpora often are delivered with a README file:

import nltk
print(nltk.corpus.gutenberg.readme()[:1000])
Project Gutenberg Selections
http://gutenberg.net/

This corpus contains etexts from from Project Gutenberg,
by the following authors:

* Jane Austen (3)
* William Blake (2)
* Thornton W. Burgess
* Sarah Cone Bryant
* Lewis Carroll
* G. K. Chesterton (3)
* Maria Edgeworth
* King James Bible
* Herman Melville
* John Milton
* William Shakespeare (3)
* Walt Whitman

The beginning of the body of each book could not be identified automatically,
so the semi-generic header of each file has been removed, and included below.
Some source files ended with a line "End of The Project Gutenberg Etext...",
and this has been deleted.

Information about Project Gutenberg (one page)

We produce about two million dollars for each hour we work.  The
fifty hours is one conservative estimate for how long it we take
to get any etext selected, entered, proofread, edited, copyright
searched and analyzed, the copyright letters written, etc.  This
projected audience is one hundred million readers.  If our value