Intro to spaCy

Sun 28 July 2019

spaCy is a tool for natural language processing (NLP). It’s incredibly user-friendly and has several unique, useful features.

This tutorial covers the basics of spaCy, showing you how to install and get started, and then helping you to become familiar with some of the core concepts and features built into spaCy. It also introduces displaCy, spaCy’s viz package. With displaCy, we can view the syntactic dependencies of a doc (text object) in really pretty visualizations, as well as label and colour-code entities.

Installation
NLP and Doc objects
Tokens
- Lexical attributes
- Additional attributes
Match patterns
- Matcher: Overview
Hashing, string stores and vocab
Entity identification
displaCy
- Customization
spacy.explain

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Installing and downloading

To install spaCy, simply run the following lines of code.

!pip install spacy
!python -m spacy download en_core_web_sm

You should get the following output:

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

The NLP object and the Doc object

The NLP object contains the processing pipeline. Once instantiated, it can be used to analyze text.

It is instantiated as follows:

from spacy.lang.en import English

nlp = English()

Next, we pass our text data to the nlp object and assign the result to a variable, conventionally named doc. This doc variable now contains a Doc object, which we’ll see below.

doc = nlp('The alien spaceship travelled 12 billion light years to explore new galaxies.')
print(doc)
print(type(doc))

The alien spaceship travelled 12 billion light years to explore new galaxies.
<class 'spacy.tokens.doc.Doc'>

It behaves like a Python sequence in that you can iterate over and index its contents. [Note: for the sake of being explicit I passed a raw string to nlp, but you can just as easily pass a variable that contains your text.]

We can use the .text attribute to get the full text contents of the doc object.

doc.text

'The alien spaceship travelled 12 billion light years to explore new galaxies.'

spaCy uses statistical models to make predictions based on context, which is critically important when processing language. Without context analysis, there’s very little we can say about the words in a given sentence or phrase: words that look identical don’t always mean the same thing or play the same structural role. That depends on where they occur in a sentence and which other words surround them. (E.g., “She ate baked potatoes” vs. “She baked potatoes”: in the first sentence, baked is an adjective, but in the second, it’s a verb.)

Things we can do in spaCy with statistical models:

parts-of-speech (POS) tagging
(named) entity identification
syntactic dependency parsing

To use one of spaCy’s models, you’ll need to import spacy, then load a package, such as the en_core_web_sm package.

If you haven’t already downloaded it, you can use:

$ python -m spacy download en_core_web_sm

import spacy

# now we'll simply reinstantiate
nlp = spacy.load('en_core_web_sm')

# same sentence as before, just with 'alien' swapped for 'SpaceX' :P
doc = nlp('The SpaceX spaceship travelled 12 billion light years to explore new galaxies.')

Tokens

“Token” is just another word for “item”. In NLP, any individual, discrete object within the text (e.g., a word, punctuation mark, or number) would be a token. We can access individual tokens in a text by iterating over the doc object, and can thus see what types of tokens it contains, along with their lexical attributes.

Lexical just means “relating to the words or vocabulary of a language”, and so a “lexical attribute” is just some feature relating to the words/vocabulary of your doc.

for token in doc:
    print(token)

The
SpaceX
spaceship
travelled
12
billion
light
years
to
explore
new
galaxies
.

Lexical Attributes in spaCy

Linguistics refresher:

POS: the syntactic category a word (or here, token) belongs to, e.g., noun, verb, adjective

Dependency: the grammatical role of a word, relative to the other words in the sentence, e.g. direct/indirect object

Head: the lexical node (on a tree) that governs the word in question

Lemma: the base form of a word (its simplest morphological realization), e.g. “run” is the lemma for “running”

Shape: not linguistics-related per se, but this one is actually pretty cool — the shape attribute shows you what the token “looks like” without showing you the token itself (is it a digit, does it contain capital letters, etc.).
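
To make the shape idea concrete, here is a minimal plain-Python sketch of how such a shape string could be computed. The function name approx_shape is hypothetical, and the rule of capping runs of the same character type at four is an assumption chosen to agree with the shapes shown in this post (e.g. XxxxxX for SpaceX); spaCy’s own implementation may differ in the details.

```python
def approx_shape(text):
    """Approximate a token's shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation and other symbols are kept as-is
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        if run <= 4:  # assumption: runs longer than four are truncated
            shape.append(mapped)
    return "".join(shape)


for word in ["The", "SpaceX", "spaceship", "12", "."]:
    print(word, approx_shape(word))
```

The useful property is that words of very different lengths and spellings collapse onto a small set of shapes, which makes the attribute handy as a coarse feature.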

The following are some useful and interesting token attributes.

These return boolean values:

is_alpha  (does it consist only of alphabetic characters)
is_punct  (is it punctuation)
like_num  (does it resemble a number, whether written as digits or as a word)

These return the index, the POS, the dependency, the head, the lemma and the shape of a token, respectively:

i
pos_
dep_
head
lemma_
shape_

Additional Attributes

We can call doc.ents to get the entities and iterate over this to find the label (i.e., the category) for each entity (which we’ll do in a minute) using the label_ attribute.

Below I’ll run each of the attributes for every token using list comprehensions and toss those outputs into a dataframe, which will now contain all the relevant lexical information we gathered for each of the tokens.

tokens = [token for token in doc]
pos = [token.pos_ for token in doc]
dep = [token.dep_ for token in doc]
heads = [token.head for token in doc]
lemmas = [token.lemma_ for token in doc]
shapes = [token.shape_ for token in doc]

df = pd.DataFrame({
    'token': tokens,
    'pos': pos,
    'dependency': dep,
    'head': heads,
    'lemma': lemmas,
    'shape': shapes
})


df

	token	pos	dependency	head	lemma	shape
0	The	DET	det	spaceship	the	Xxx
1	SpaceX	PROPN	compound	spaceship	SpaceX	XxxxxX
2	spaceship	NOUN	nsubj	travelled	spaceship	xxxx
3	travelled	VERB	ROOT	travelled	travel	xxxx
4	12	NUM	compound	billion	12	dd
5	billion	NUM	nummod	years	billion	xxxx
6	light	ADJ	amod	years	light	xxxx
7	years	NOUN	dobj	travelled	year	xxxx
8	to	PART	aux	explore	to	xx
9	explore	VERB	advcl	travelled	explore	xxxx
10	new	ADJ	amod	galaxies	new	xxx
11	galaxies	NOUN	dobj	explore	galaxy	xxxx
12	.	PUNCT	punct	travelled	.	.
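
The head column can be read as a tree: every token points to the token that governs it, and the root points to itself. As a small illustration, the sketch below rebuilds that tree in plain Python from the (token, head) pairs in the table above. The variable and function names are illustrative, and the pairs are copied by hand from the table rather than read from the doc object.

```python
from collections import defaultdict

# (token, head) pairs copied by hand from the table above.
# Keying by token text only works here because every token in this sentence is unique.
pairs = [
    ("The", "spaceship"), ("SpaceX", "spaceship"), ("spaceship", "travelled"),
    ("travelled", "travelled"), ("12", "billion"), ("billion", "years"),
    ("light", "years"), ("years", "travelled"), ("to", "explore"),
    ("explore", "travelled"), ("new", "galaxies"), ("galaxies", "explore"),
    (".", "travelled"),
]

root = None
children = defaultdict(list)
for token, head in pairs:
    if token == head:
        root = token  # the root is its own head
    else:
        children[head].append(token)


def show(token, depth=0):
    """Print the tree with one level of indentation per dependency step."""
    print("  " * depth + token)
    for child in children[token]:
        show(child, depth + 1)


show(root)
```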

We can use like_num to find any numeric tokens in the text. This includes both digits and words for numbers.

for token in doc:
    if token.like_num:
        print(f'Token: {token}   Index: {token.i}')

Token: 12   Index: 4
Token: billion   Index: 5
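
Roughly speaking, a check like this has to accept digits as well as number words. The sketch below is a simplified plain-Python stand-in, not spaCy’s code: the function name looks_like_num and the short NUMBER_WORDS list are assumptions made for illustration.

```python
# A deliberately short, hypothetical word list; a real one would be much longer.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "twelve", "hundred", "thousand", "million", "billion",
}


def looks_like_num(text):
    """Return True if the text resembles a number (digits, fraction or number word)."""
    cleaned = text.replace(",", "").replace(".", "")
    if cleaned.isdigit():
        return True
    if cleaned.count("/") == 1:
        numerator, denominator = cleaned.split("/")
        if numerator.isdigit() and denominator.isdigit():
            return True
    return cleaned.lower() in NUMBER_WORDS


words = ["The", "SpaceX", "spaceship", "travelled", "12", "billion", "light",
         "years", "to", "explore", "new", "galaxies", "."]
for index, word in enumerate(words):
    if looks_like_num(word):
        print(f"Token: {word}   Index: {index}")
```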

We can check to see if the model recognized SpaceX as an entity…

for ent in doc.ents:
    print(ent.text, ent.label_)

12 billion light MONEY

…Looks like it didn’t. “12 billion light” is also not money here.

However, all is not lost! We can use a special feature called match patterns, along with manually adding this element to our list of entities, doc.ents (which we’ll get to shortly), to set things straight.

Match Patterns

What do they do?

One of the coolest things about spaCy is its built-in ability to let us search for patterns without having to rely on regular expressions. Match patterns (the spaCy regex equivalent) let you search a Doc for tokens and token attributes, whereas regular expressions only operate on raw strings.

Match patterns also let you customize what kind of pattern you’re looking for, which isn’t always a simple sequence of characters. When we’re doing NLP analyses, we might want to look for words belonging to a particular POS category, or words with a particular lemma. For instance, setting the lemma search to “run” would return tokens whose base form is “run”, such as “running” or “runs”.

What do they look like?

Match patterns are lists of dictionaries, wherein each key is the attribute and each value is the corresponding value you’re looking for.

pattern = [{'ATTRIBUTE': 'VALUE'}]

Each dictionary corresponds to exactly one token. If you have more than one dictionary, the list of dictionaries will correspond to a sequence of tokens appearing in that order (if that sequence exists in the doc).
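
To see what “one dictionary per token, in order” means in practice, here is a toy matcher in plain Python. It is only a sketch of the idea and not how spaCy implements its Matcher: tokens are represented as hand-written dictionaries, and the function name find_sequences is hypothetical.

```python
def find_sequences(tokens, pattern):
    """Return (start, end) index pairs where every dict in pattern matches
    the token at the same position in a window of consecutive tokens."""
    if not pattern:
        return []
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            token.get(attr) == value
            for token, spec in zip(window, pattern)
            for attr, value in spec.items()
        ):
            matches.append((start, start + width))  # end index is exclusive
    return matches


# Hand-written token attributes for "The SpaceX spaceship travelled"
tokens = [
    {"TEXT": "The", "POS": "DET"},
    {"TEXT": "SpaceX", "POS": "PROPN"},
    {"TEXT": "spaceship", "POS": "NOUN"},
    {"TEXT": "travelled", "POS": "VERB"},
]

print(find_sequences(tokens, [{"TEXT": "SpaceX"}]))                 # one-token pattern
print(find_sequences(tokens, [{"POS": "PROPN"}, {"POS": "NOUN"}]))  # two tokens in a row
```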

Let’s say we have a doc containing the string "Hello world!" as well as "it's nice to say hello" and we want to find all instances of "hello" where it’s followed by a noun, as well as all instances where it’s not.

We can use the OP key with the value '*' to indicate that this token is optional (i.e., “find this element 0 or more times”).

For this particular case, this step is somewhat redundant, but I want to illustrate the optional feature in the simplest way possible.
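
A pattern for this case might look like the one below. This is an illustrative guess, paired with a toy plain-Python matcher that honours the OP key. The LOWER and POS keys, the hand-tagged tokens and the function names are assumptions for illustration; spaCy’s Matcher is implemented differently and may report its matches differently.

```python
def match_ends(tokens, i, pattern, j):
    """Return every index at which pattern[j:] can end when it starts at token i."""
    if j == len(pattern):
        return {i}
    spec = pattern[j]
    op = spec.get("OP", "1")  # "1" means: match exactly once
    attrs = {key: value for key, value in spec.items() if key != "OP"}
    ends = set()
    if op in ("?", "*"):
        # optional token: try continuing without consuming anything
        ends |= match_ends(tokens, i, pattern, j + 1)
    if i < len(tokens) and all(tokens[i].get(k) == v for k, v in attrs.items()):
        # token i fits this dict: consume it and move to the next dict
        ends |= match_ends(tokens, i + 1, pattern, j + 1)
        if op in ("*", "+"):
            # repeatable token: consume it and stay on the same dict
            ends |= match_ends(tokens, i + 1, pattern, j)
    return ends


def toy_match(tokens, pattern):
    """Return sorted (start, end) pairs for all non-empty matches."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_ends(tokens, start, pattern, 0)
        if end > start
    )


# Hand-tagged tokens for "Hello world!" and "it's nice to say hello"
first = [{"LOWER": "hello", "POS": "INTJ"}, {"LOWER": "world", "POS": "NOUN"},
         {"LOWER": "!", "POS": "PUNCT"}]
second = [{"LOWER": "it", "POS": "PRON"}, {"LOWER": "'s", "POS": "AUX"},
          {"LOWER": "nice", "POS": "ADJ"}, {"LOWER": "to", "POS": "PART"},
          {"LOWER": "say", "POS": "VERB"}, {"LOWER": "hello", "POS": "INTJ"}]

# "hello", followed by zero or more nouns
pattern = [{"LOWER": "hello"}, {"POS": "NOUN", "OP": "*"}]

print(toy_match(first, pattern))
print(toy_match(second, pattern))
```

In this toy version, the first sentence yields two overlapping matches ("Hello" on its own and "Hello world"), while the second yields only the bare "hello", because nothing follows it.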

Some attributes for rule-based matching:

            return bool:

TEXT        IS_ALPHA     IS_LOWER   
LOWER       IS_ASCII     IS_UPPER   
LENGTH      IS_DIGIT     IS_TITLE      
POS         IS_PUNCT     IS_SPACE
OP          IS_STOP

To start using match patterns, run the following lines of code:

from spacy.matcher import Matcher

# initialize matcher using vocabulary shared with the nlp object
matcher = Matcher(nlp.vocab)

Note: this Matcher Explorer is a really nice tool for checking and trying out various match patterns — give it a whirl.

# first we set the pattern
pattern = [{'TEXT': 'SpaceX'}]

# then we add the pattern to our matcher
# (this is the spaCy v2 signature; spaCy v3 expects matcher.add('spacex', [pattern]))
matcher.add('spacex', None, pattern)

# then we pass the text we want to search to the matcher
matches = matcher(doc)
matches

[(1533271143234181118, 1, 2)]

The matcher returns a list of tuples, where the first value is the match ID, the second is the start index, and the third is the end index of the match. The end index is exclusive, so slicing the doc from start to end gives exactly the matched tokens.

for match_id, start, stop in matches:
    matched_span = doc[start:stop]
    print(matched_span)

SpaceX

Using Matcher: Overview

Steps:

Import Matcher from spacy.matcher
Load model (e.g., 'en_core_web_sm')
Create nlp object (if not already previously done)
Instantiate Matcher with shared vocab (pass in nlp.vocab)
Pass your match pattern to matcher.add, where:
- argument 1 = a string representing a unique ID of your choosing to identify the pattern
- argument 2 = optional callback (if none, set None)
- argument 3 = list of token descriptions, i.e., the pattern
Call matcher on the doc object (or whatever you’ve called your doc object) and toss this into a variable (here, I’ve used matches to keep things simple)
Optional: peek at what’s inside. Remember that each returned tuple contains the match ID, and the start and stop indices of the match.
To return the match(es) as a string (to actually see what they are, since we might not always know what strings our match is going to return!), use a loop to iterate over each tuple in the list, as demonstrated above.

Hashing, vocab and string stores

The way we access the vocabulary in spaCy is through hash IDs and string stores. The vocabulary will depend on which package we passed into spacy.load, and will contain all the unique tokens within that package / dataset.

spaCy stores tokens in the vocabulary as hashes, such that every identical token string will have the same hash ID (a unique numeric identifier). Hashes enable memory efficiency since this way, if an identical string occurs multiple times (e.g. the word “the”), it only needs to be stored once in the vocabulary (and not multiple times as a string).

Every hash corresponds to a string token, and every string will have a corresponding hash. You can use either one to look up the other, as long as that string is already in the vocabulary.

If the word is new to the vocabulary, we must first hash it (as I’ll show below). We can always get the hash of a word by looking up a string (if it’s not already there, a new one will be generated), but if the word is new and doesn’t yet have a hash, we can’t search from hash ID to string (it won’t exist!).
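
The two-way lookup can be pictured with a small plain-Python class. This is a toy model, not spaCy’s StringStore: the class name ToyStringStore is hypothetical, and hashlib.blake2b is used here merely as a stand-in for whatever hash function spaCy uses internally, so the numbers will not match spaCy’s.

```python
import hashlib


def hash_string(text):
    """Map a string to a 64-bit integer (stand-in hash, not spaCy's)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Store each distinct string once and look it up by hash, or vice versa."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash_string(text)
        self._by_hash[key] = text  # identical strings share one entry
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)  # string -> hash always works
        if item not in self._by_hash:
            raise KeyError(f"no string stored for hash {item}")
        return self._by_hash[item]  # hash -> string needs a stored entry


store = ToyStringStore()
spaceship_hash = store.add("spaceship")
print(store[spaceship_hash])                 # back to the string
print(store["spaceship"] == spaceship_hash)  # same string, same hash
```

The asymmetry is the point: a hash can always be computed from a string, but going from a hash back to a string only works if that string was stored at some point.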

# reminder of the contents of our doc object
doc.text

'The SpaceX spaceship travelled 12 billion light years to explore new galaxies.'

We can search for a hash or a string using

nlp.vocab.strings['string']
nlp.vocab.strings[hash]

doc.vocab.strings['string']
doc.vocab.strings[hash]

where 'string' is the string token you’re searching for and hash is an integer representing the hash ID of an existing string. To store a new string explicitly, you can call nlp.vocab.strings.add('string'), which returns its hash.

spaceship_hash = nlp.vocab.strings['spaceship']
spaceship_hash

2527206094249092639

If you search for a hash that doesn’t correspond to any string in the vocab, it’ll throw an error:

nlp.vocab.strings[4523523] # has no corresponding string in the vocab

KeyError: "[E018] Can't retrieve string for hash '4523523'."

The string for hash 4523523 can’t be retrieved because it doesn’t match any existing item in the vocab.

Once we know the hash of a string, we can call it within nlp.vocab.strings to once again get back the corresponding string.

spaceship_string = nlp.vocab.strings[spaceship_hash]
spaceship_string

'spaceship'

Entity identification: Fixing up our entities

You’ll recall that spaCy incorrectly identified “12 billion light” as a MONEY entity, and missed SpaceX altogether. SpaceX should be labelled as an organization – ORG – while we can create a “distance” label – DIST – for “light years”.

pattern = [{'TEXT': 'SpaceX'}]
matcher.add('spacex', None, pattern)  # add() returns None, so there is nothing to assign

matcher(doc)

[(1533271143234181118, 1, 2)]

Now we know the start and stop indices for our SpaceX entity. Let’s do the same for “light years”.

matcher = Matcher(nlp.vocab)

pattern = [{'LIKE_NUM': True}, {'LIKE_NUM': True}, {'POS': 'ADJ'}, {'POS': 'NOUN'}]

matcher.add('is_numeric', None, pattern)
matches = matcher(doc)
matches

[(14514059297638010452, 4, 8)]

Here I’m iterating over my matches to check that matched_span actually returns what we want it to return, which would indicate that the indices returned by the matcher are correct.

for match_id, start, stop in matches:
    matched_span = doc[start+2:stop]
    print(matched_span)

light years

Yep, good to go!

Note: matched_span originally returned the whole phrase, “12 billion light years” because of how I wrote the match pattern. I only want “light years” though, so I just added 2 to the start index, to effectively jump past the first two tokens of the match (“12 billion”).

from spacy.tokens import Doc, Span

# using the start/stop indices returned to us by the matcher
span_1 = Span(doc, 1, 2, label='ORG')
print(span_1.text, span_1.label_)

span_2 = Span(doc, 6, 8, label='DIST')
print(span_2.text, span_2.label_)

# Setting the doc's entities using the span object--this also overwrites the existing incorrect entity
doc.ents = [span_1, span_2]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

SpaceX ORG
light years DIST
[('SpaceX', 'ORG'), ('light years', 'DIST')]

displaCy

This is probably my favourite feature. Maybe as a linguist I’m slightly biased toward neat visual representations of syntactic parses, but I also think this tool is just pretty objectively cool.

displaCy takes your doc object and visually depicts its syntactic dependencies, in addition to providing the POS of each word.

First, import displaCy, then instantiate your nlp and doc objects (if you haven’t already done so), and then call displacy.render on your doc object (or list of docs!). This will render a dependency-parsed sentence!

It’s at this step, within displacy.render, that you may also specify any additional preferred parameters (which I’ll go into more below).

from spacy import displacy

# displacy.serve(doc) starts a local web server that serves your visualization as an HTML page. Useful for longer sentences!
# html = displacy.render(doc) # optional: throw this into a variable, the default in spaCy documentation is "html"

To simplify things a little bit, I’ll create two new doc objects called “dogs” and “cats”.

dogs = nlp(u"Megan loves dogs.")
cats = nlp(u"Mary loves cats.")

In displacy.render, there are two style options:

dep returns a visualization of the syntactic dependencies (default)
ent returns a prettified sentence which highlights the entities it contains.

For example:

displacy.render([dogs, cats], style='ent')

Megan PERSON loves dogs.

Mary PERSON loves cats.

displacy.render([dogs, cats], style='dep')

displacy.render(doc, style='ent')

The SpaceX ORG spaceship travelled 12 billion light years DIST to explore new galaxies.

Customization

The function signature for displacy.render is as follows:

Signature:
displacy.render(
    docs,
    style='dep',
    page=False,
    minify=False,
    jupyter=None,
    options={},
    manual=False,
)
Docstring:
Render displaCy visualisation.

We can pass a dictionary of attributes into the options argument to ultra-customize the viz. Below is a list of valid attributes (option dict keys).

Options:

fine_grained
collapse_punct
collapse_phrases
compact
color
bg
font
offset_x
arrow_stroke
arrow_width
arrow_spacing
word_spacing
distance

We’ll go through a few, but to read more about what each of these arguments does, check out the displaCy documentation.

options = {'word_spacing': 25}

displacy.render(dogs, style='dep', options=options)

options = {'distance': 120}

displacy.render(dogs, style='dep', options=options)

In the code block below, I’m populating the options dictionary with the features I want to adjust:

word_spacing sets the (vertical) space between the words and the arrows
compact makes the arrows squared to conserve space
distance sets the distance between each of the words (basically, the width of the entire viz)
arrow_spacing sets the space between arrows (so the smaller the int value you pass, the less space between arrows)
offset_x is the amount of space between the “edge” of the viz and the first word that appears
arrow_width sets the size of the “arrowhead”
color sets the color of the text and arrows
bg sets the background color.

options = {
    'word_spacing': 25,
    'compact': False, # default
    'distance': 85,
    'arrow_spacing': 2,
    'offset_x': 17,
    'arrow_width': 6,
    'color': '#ffffff',
    'bg': '#0be3df'
}

displacy.render(doc, style='dep', options=options)

spacy.explain

This function is super great when dealing with abbreviations and shorthands, which tend to be ubiquitous in a package filled with descriptors like “adverbial clause modifier”. Not exactly compact. From time to time, we might encounter an abbreviation/label we’re unfamiliar with, and in such cases we can call spacy.explain, which takes any spaCy abbreviation, shorthand or label as a string argument and returns a short description of it. Check it out:

spacy.explain('JJ')

'adjective'

Here are a few more:

print('advcl:', spacy.explain('advcl'))
print('\namod:', spacy.explain('amod'))
print('\nnummod:', spacy.explain('nummod'))

print('\nORG:', spacy.explain('ORG'))

advcl: adverbial clause modifier

amod: adjectival modifier

nummod: numeric modifier

ORG: Companies, agencies, institutions, etc.

tags = [token.tag_ for token in tokens]
tags

['DT',
 'NNP',
 'NN',
 'VBD',
 'CD',
 'CD',
 'JJ',
 'NNS',
 'TO',
 'VB',
 'JJ',
 'NNS',
 '.']

tags_explained = [spacy.explain(i) for i in tags]

We can use a simple zip function to create a list of tuples containing each of the tags, along with their respective “definitions”.

list(zip(tags, tags_explained))

[('DT', 'determiner'),
 ('NNP', 'noun, proper singular'),
 ('NN', 'noun, singular or mass'),
 ('VBD', 'verb, past tense'),
 ('CD', 'cardinal number'),
 ('CD', 'cardinal number'),
 ('JJ', 'adjective'),
 ('NNS', 'noun, plural'),
 ('TO', 'infinitival to'),
 ('VB', 'verb, base form'),
 ('JJ', 'adjective'),
 ('NNS', 'noun, plural'),
 ('.', 'punctuation mark, sentence closer')]
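
Because several tags repeat, it can be handy to collapse the pairs into a lookup table. A minimal sketch, with the two lists typed out by hand from the output above so that it runs on its own (the name glossary is arbitrary):

```python
tags = ['DT', 'NNP', 'NN', 'VBD', 'CD', 'CD', 'JJ',
        'NNS', 'TO', 'VB', 'JJ', 'NNS', '.']
tags_explained = [
    'determiner', 'noun, proper singular', 'noun, singular or mass',
    'verb, past tense', 'cardinal number', 'cardinal number', 'adjective',
    'noun, plural', 'infinitival to', 'verb, base form', 'adjective',
    'noun, plural', 'punctuation mark, sentence closer',
]

# dict() keeps one entry per distinct tag, in order of first appearance
glossary = dict(zip(tags, tags_explained))

for tag, meaning in glossary.items():
    print(f'{tag:<4} {meaning}')
```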
