An Exercise in BS4: Scraping language typological data
Sun 18 April 2021
This post demonstrates how bs4’s BeautifulSoup can be deployed to scrape webpages. The example I’ve created here uses data on languages and language families from Ethnologue.com, a robust digital catalogue of typological information on the world’s languages.
Check out “Spin the Wheel”, the function I created to randomly select a language and display its primary typological features (name, country of origin, language family, genetic subgroups). A fun and easy way to learn about the countless (well, 7400ish) other terrestrial languages hitherto unknown to me!
Table of Contents
- Step One: Get a list of page URLs
- Step Two: Create basic language dictionaries
- Step Three: Add language-specific info from each language page
- Step Four: Write a function to pick a language, any language
Packages used:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
Step One: Get a list of page URLs🔗
Each Browse by Language Name page (corresponding to a letter from A-Z) contains an enormous list of languages. In order to access every language individually, we must first be able to access each page.
- The URL for the “A” page is https://www.ethnologue.com/browse/names/a, the “B” page is https://www.ethnologue.com/browse/names/b, and so on.
- We’ll use a loop to compile a list of URLs, each made up of the common base (https://www.ethnologue.com/browse/names/) plus each respective letter of the alphabet.
1. Use the built-in map function with chr to generate the lowercase letters a-z from their code points (97 through 122).
alpha = list(map(chr, range(97, 123)))
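As a quick sanity check, the same list can also be built from the standard library's string.ascii_lowercase constant. The sketch below only confirms that the two approaches agree; it is an alternative, not a correction.

```python
import string

# Build the alphabet from code points 97 ('a') through 122 ('z').
alpha = list(map(chr, range(97, 123)))

# string.ascii_lowercase holds the same 26 letters as a single string.
assert alpha == list(string.ascii_lowercase)

print(alpha[:3], alpha[-3:])
```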
There’s one additional page after Z: “ǀ”, which is the IPA symbol for a dental click. Just append this to alpha and we’re good to go.
Note: although this character resembles the standard pipe character (and is itself sometimes called a “pipe letter”), the two are not the same symbol.
alpha.append('ǀ')
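To see that the two characters really are distinct, the standard library's unicodedata module can report each one's code point and official Unicode name. The variable names here are only for illustration.

```python
import unicodedata

click = '\u01c0'  # the click letter appended to alpha
pipe = '|'        # the ASCII pipe found on most keyboards

for ch in (click, pipe):
    # Print the code point in U+XXXX form alongside the Unicode name.
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# U+01C0 LATIN LETTER DENTAL CLICK
# U+007C VERTICAL LINE
```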
2. Create a list containing the corresponding URL for each letter a-z by looping over the alpha list.
- First create an empty list called urls. Then take the base of the URL that’s common to each page, and assign it to the variable url_start.
- The for loop adds the complete URL to urls.
urls = []
url_start = 'https://www.ethnologue.com/browse/names/'
for letter in alpha:
    urls.append(url_start + letter)
Inspecting the last four URLs in the list:
>>> urls[-4:]
['https://www.ethnologue.com/browse/names/x',
'https://www.ethnologue.com/browse/names/y',
'https://www.ethnologue.com/browse/names/z',
'https://www.ethnologue.com/browse/names/ǀ']
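For comparison, here is a compact sketch of the same URL list built with a list comprehension. It also shows what the final, non-ASCII URL looks like once percent-encoded with urllib.parse.quote. As far as I know requests performs this encoding on its own, so the encoded form is shown for illustration only; the name encoded_last is mine.

```python
from urllib.parse import quote

url_start = 'https://www.ethnologue.com/browse/names/'

# a-z plus the dental click letter (U+01C0)
alpha = [chr(code) for code in range(97, 123)] + ['\u01c0']

# One URL per letter, equivalent to the for loop version.
urls = [url_start + letter for letter in alpha]

# Percent-encoded form of the last URL (the click letter is two bytes in UTF-8).
encoded_last = url_start + quote(alpha[-1])

print(len(urls))
print(encoded_last)
```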
I’ll be using the root URL a few times in the next couple of steps, so here I’m assigning it to a variable called main.
main = 'https://www.ethnologue.com'
Step Two: Create basic language dicts📚
Next, loop through all pages A-Z and make a list containing language dictionaries.
The language dictionaries will store all of the relevant language information available on the current page.
# empty list to contain language dicts
languages = []
# iterating through each page A-Z
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    item_list = soup.find_all('span', {'class': 'field-content'})  # each item is a language listed on the page
    for item in item_list:
        language = {}  # initialize empty dict to store the available info for the language
        slug = item.find('a')['href']
        language['name'] = item.text
        language['slug'] = slug
        language['link'] = main + slug
        languages.append(language)
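To try the extraction logic without hitting the network, the same selectors can be run against a small inline snippet. The HTML below is a hypothetical stand-in that merely imitates the structure the loop expects (a span with class field-content wrapping a link); the real pages may differ. It uses the built-in html.parser so that lxml is not required, and the names sample_html and sample_languages are mine.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used in the loop.
sample_html = '''
<div><span class="field-content"><a href="/language/zun">Zuni</a></span></div>
<div><span class="field-content"><a href="/language/miy">Mixtec, Ayutla</a></span></div>
'''

main = 'https://www.ethnologue.com'
soup = BeautifulSoup(sample_html, 'html.parser')

sample_languages = []
for item in soup.find_all('span', {'class': 'field-content'}):
    slug = item.find('a')['href']  # relative path to the language page
    sample_languages.append({
        'name': item.text,
        'slug': slug,
        'link': main + slug,
    })

print(sample_languages)
```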
Peeking at the last three entries in our languages list:
>>> languages[-3:]
[{'name': 'Zuni',
'slug': '/language/zun',
'link': 'https://www.ethnologue.com/language/zun'},
{'name': 'ǀGwi',
'slug': '/language/gwj',
'link': 'https://www.ethnologue.com/language/gwj'},
{'name': 'ǀXam',
'slug': '/language/xam',
'link': 'https://www.ethnologue.com/language/xam'}]
Checking how many languages are in the list:
>>> len(languages)
7545
Cool. We now have a master list containing the name, slug and page link for each of the 7545 languages listed in the “Browse by Language Name” pages.
I’d like to add more specific information to the language dicts, which can be scraped from the individual language pages in Step Three.
Step Three: Add language-specific info from each language page📄
Quick Tip: enumerate() in for loops
- enumerate() iterates over an iterable and produces an enumerate object, which yields tuples of the form (count, value).
- Where a value x is the nth item in an iterable (counting from 0), count = n and value = x.
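A small illustration of the tip, using a two-item stand-in list (languages_demo is a made-up name, with entries borrowed from this post). The first loop indexes into the tuple; the second shows the equivalent tuple-unpacking form.

```python
languages_demo = [{'name': 'Zuni'}, {'name': 'Balochi, Eastern'}]

# Each item produced by enumerate() is a (count, value) tuple.
for pair in enumerate(languages_demo):
    index = pair[0]
    entry = pair[1]
    print(index, entry['name'])

# The tuple can also be unpacked directly in the for statement.
for index, entry in enumerate(languages_demo):
    languages_demo[index]['position'] = index

print(languages_demo)
```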
In the following code, this is used to access the index of items in the iterable. See the documentation for how this works in a little more detail.
Warning: this took almost three hours to run, what with a modest 7545 pages to scrape. Not for the faint of heart/processor.
for lang in enumerate(languages):
    # getting the location of the current item
    index = lang[0]
    # scraping
    url = lang[1]['link']
    res = requests.get(url)
    print(res.status_code)  # I like to use this to see where the code is at as it's running
    soup = BeautifulSoup(res.content, 'lxml')
    # getting relevant info from the html
    country_origin = soup.find('h2', {'class': 'field-content'})('a')[0].text
    resources = soup.find('div', {'class': 'views-field views-field-nothing-2'})('span')[1]('a')[0]['href']
    subgroups_list = []
    subgroups = soup.find_all('span', re.compile('lineage-item'))
    for sub in subgroups:
        subgroups_list.append(sub.text)
    # adding key-value pairs with relevant info to existing language dict
    languages[index]['country_origin'] = country_origin
    languages[index]['lineage'] = subgroups_list
    languages[index]['additional_resources'] = resources
    if len(subgroups_list) == 0:
        languages[index]['family'] = ''  # some languages have no family (e.g. language isolates)
    else:
        languages[index]['family'] = subgroups_list[0]
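Since a single missing element would raise an exception partway through a multi-hour run, one option is to move the parsing into a helper that tolerates gaps. The following is a minimal sketch under assumptions, not the code used for this post: parse_language_page is a hypothetical helper, the sample HTML is invented to imitate the selectors above, and html.parser is swapped in for lxml so the example has no extra dependency.

```python
import re
from bs4 import BeautifulSoup

def parse_language_page(html):
    """Extract country, lineage and family, returning blanks when absent."""
    soup = BeautifulSoup(html, 'html.parser')

    # Guard each lookup so a missing tag yields '' instead of an exception.
    heading = soup.find('h2', {'class': 'field-content'})
    link = heading.find('a') if heading else None

    # Match any span whose class contains 'lineage-item'.
    spans = soup.find_all('span', class_=re.compile('lineage-item'))
    lineage = [span.text for span in spans]

    return {
        'country_origin': link.text if link else '',
        'lineage': lineage,
        'family': lineage[0] if lineage else '',  # isolates have no family
    }

# Invented markup for illustration only.
sample_page = '''
<h2 class="field-content"><a href="/country/MX">Mexico</a></h2>
<span class="lineage-item">Otomanguean</span>
<span class="lineage-item">Eastern Otomanguean</span>
'''

print(parse_language_page(sample_page))
print(parse_language_page('<p>nothing useful here</p>'))
```

With a helper like this, the loop body could reduce to a call along the lines of languages[index].update(parse_language_page(res.content)), though that wiring is untested against the live site.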
Saving the updated languages object with pickle:
import pickle
# the with block closes the file once the dump completes
with open('languages.obj', 'wb') as languages_file:
    pickle.dump(languages, languages_file)
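Reloading the saved object in a later session avoids repeating the scrape. Here is a small round-trip sketch: it writes a two-entry stand-in list to a temporary directory (so it does not touch a real languages.obj) and reads it back. The names languages_demo, path and restored are illustrative.

```python
import os
import pickle
import tempfile

# Two entries copied from the earlier output, standing in for the full list.
languages_demo = [
    {'name': 'Zuni', 'slug': '/language/zun',
     'link': 'https://www.ethnologue.com/language/zun'},
    {'name': 'ǀXam', 'slug': '/language/xam',
     'link': 'https://www.ethnologue.com/language/xam'},
]

path = os.path.join(tempfile.mkdtemp(), 'languages.obj')

# Write the list to disk; the file is closed when the block exits.
with open(path, 'wb') as f:
    pickle.dump(languages_demo, f)

# Read it back, as one would at the start of a new session.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == languages_demo)  # True
```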
Just writing a quick function to grab a random number (one that will correspond to the index of some language dict in languages):
def get_random():
    rand = np.random.randint(0, len(languages), 1)[0]
    return rand
Now it’s possible to index languages at a random location with the get_random() function.
>>> languages[get_random()]
{'name': 'Mixtec, Ayutla',
'slug': '/language/miy',
'link': 'https://www.ethnologue.com/language/miy',
'country_origin': 'Mexico',
'lineage': ['Otomanguean',
'Eastern Otomanguean',
'Amuzgo-Mixtecan',
'Mixtecan',
'Mixtec'],
'additional_resources': 'http://language-archives.org/language/miy',
'family': 'Otomanguean'}
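If NumPy isn't otherwise needed, the standard library's random module offers the same behavior. The sketch below shows two variants on a stand-in list: random.randrange for a random index and random.choice for picking an entry directly. The function name get_random_index is hypothetical.

```python
import random

languages_demo = [{'name': 'Zuni'}, {'name': 'Mixtec, Ayutla'}, {'name': 'Balochi, Eastern'}]

def get_random_index(items):
    """Return a random valid index into items."""
    return random.randrange(len(items))

index = get_random_index(languages_demo)
print(languages_demo[index])

# random.choice skips the index and returns an element directly.
entry = random.choice(languages_demo)
print(entry)
```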
Step Four: Write a function to pick a language, any language🎲
Use the following code to produce a random language and its typological information from the list of 7545 languages 🥳
Note: there’s no return statement in the following function, since the desired output is purely for visual purposes.
def spin_the_wheel():
    rand = get_random()
    name = languages[rand]['name']
    origin = languages[rand]['country_origin']
    family = languages[rand]['family']
    lineage = languages[rand]['lineage']
    print(f'Language name: {name} \n' +
          f'Country of origin: {origin} \n' +
          f'Language Family: {family}')
    print('--- \nLineage: (in descending order from highest-level group)\n')
    # print each subgroup on its own line, highest-level group first
    for group in lineage:
        print(group)
Give it a spin…
>>> spin_the_wheel()
Language name: Balochi, Eastern
Country of origin: Pakistan
Language Family: Indo-European
---
Lineage: (in descending order from highest-level group)
Indo-European
Indo-Iranian
Iranian
Western
Northwestern
Balochi
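As a possible variation, the formatting can be separated from the random pick so that the text is returned rather than printed, which makes it easier to reuse or test. This is a sketch only: format_language is a hypothetical helper, and the entry is reconstructed from the output shown above.

```python
def format_language(entry):
    """Return the display text for one language dict as a single string."""
    lines = [
        f"Language name: {entry['name']}",
        f"Country of origin: {entry['country_origin']}",
        f"Language Family: {entry['family']}",
        '---',
        'Lineage: (in descending order from highest-level group)',
        '',
        *entry['lineage'],  # one subgroup per line
    ]
    return '\n'.join(lines)

# Entry rebuilt from the sample output for Eastern Balochi.
balochi = {
    'name': 'Balochi, Eastern',
    'country_origin': 'Pakistan',
    'family': 'Indo-European',
    'lineage': ['Indo-European', 'Indo-Iranian', 'Iranian',
                'Western', 'Northwestern', 'Balochi'],
}

print(format_language(balochi))
```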