An Exercise in BS4: Scraping language typological data


This post demonstrates how bs4’s BeautifulSoup can be deployed to scrape webpages. The example I’ve created here uses data on languages and language families from Ethnologue.com, a robust digital catalogue of typological information on the world’s languages.

Check out “Spin the Wheel”, the function I created to randomly select a language and display its primary typological features (name, country of origin, language family, genetic subgroups). A fun and easy way to learn about the countless (well, 7400ish) other terrestrial languages hitherto unknown to me!


Packages used:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re

 

Step One: Get a list of page URLs🔗

Each Browse by Language Name page (corresponding to a letter from A-Z) contains an enormous list of languages. In order to access every language individually, we must first be able to access each page.

1. Use the built-in map() function with chr() to turn the character codes 97 to 122 into the lowercase letters a-z.

alpha = list(map(chr, range(97, 123)))

There’s one additional page after Z: “ǀ”, which is the IPA symbol for a dental click. Just append this to alpha and we’re good to go.

Note: although it resembles the standard pipe operator (and despite the fact that it’s also called a “pipe letter”), these are not the same symbol.

alpha.append('ǀ')
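To make the difference between the two look-alike symbols concrete, here is a small sketch using only the standard library. The variable names are my own; the check simply looks up each character's Unicode name and shows the percent-encoded form the click letter would take in a URL path.

```python
import unicodedata
from urllib.parse import quote

click = '\u01c0'  # the dental click letter appended to alpha
pipe = '|'        # the ordinary keyboard pipe

click_name = unicodedata.name(click)
pipe_name = unicodedata.name(pipe)
encoded = quote(click)  # UTF-8 percent-encoding of the click letter

print(click_name)     # LATIN LETTER DENTAL CLICK
print(pipe_name)      # VERTICAL LINE
print(click == pipe)  # False
print(encoded)        # %C7%80
```

Typing a plain pipe instead of the click letter would therefore build a different URL.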

2. Create a list containing the corresponding URL for each letter a-z by looping over the alpha list.

urls = []
url_start = 'https://www.ethnologue.com/browse/names/'
for letter in alpha:
    urls.append(url_start + letter)

Inspecting the last four URLs in the list:

>>> urls[-4:]

  ['https://www.ethnologue.com/browse/names/x',
   'https://www.ethnologue.com/browse/names/y',
   'https://www.ethnologue.com/browse/names/z',
   'https://www.ethnologue.com/browse/names/ǀ']
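For comparison, the same list can be built a little more compactly. This is just an alternative sketch, not the approach used above: it assumes the standard library's string.ascii_lowercase for the alphabet and a list comprehension in place of the loop.

```python
import string

url_start = 'https://www.ethnologue.com/browse/names/'

# a-z from the standard library, plus the dental click letter (U+01C0)
alpha = list(string.ascii_lowercase) + ['\u01c0']

# one URL per letter, built in a single expression
urls = [url_start + letter for letter in alpha]

print(len(urls))  # 27
```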

I’ll be using the root URL a few times in the next couple of steps, so here I’m assigning it to a variable called main.

main = 'https://www.ethnologue.com'

 

Step Two: Create basic language dicts📚

Next, loop through all pages A-Z and make a list containing language dictionaries.

The language dictionaries will store all of the relevant language information available on the current page.

# empty list to contain language dicts
languages = []

# iterating through each page A-Z
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    item_list = soup.find_all('span', {'class': 'field-content'}) # each item is a language listed on the page

    for item in item_list:
        language = {} # initialize empty dict to store the available info for the language
        slug = item.find('a')['href']
        language['name'] = item.text
        language['slug'] = slug
        language['link'] = main+slug
        languages.append(language)

Peeking at the last three entries in our languages list:

>>> languages[-3:]

  [{'name': 'Zuni',
    'slug': '/language/zun',
    'link': 'https://www.ethnologue.com/language/zun'},
   {'name': 'ǀGwi',
    'slug': '/language/gwj',
    'link': 'https://www.ethnologue.com/language/gwj'},
   {'name': 'ǀXam',
    'slug': '/language/xam',
    'link': 'https://www.ethnologue.com/language/xam'}]
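Concatenating main and slug works here because every slug starts with a slash. As a hedged alternative, the standard library's urljoin does the same job and also copes with a base URL that has a path of its own. The values below are taken from the output above.

```python
from urllib.parse import urljoin

main = 'https://www.ethnologue.com'
slug = '/language/zun'

# joining the root URL and the slug
link = urljoin(main, slug)
print(link)  # https://www.ethnologue.com/language/zun

# urljoin also resolves the slug against a full page URL
same_link = urljoin('https://www.ethnologue.com/browse/names/z', slug)
print(same_link == link)  # True
```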

Checking how many languages are in the list:

>>> len(languages)

  7545

Cool. We now have a master list containing the name, slug and page link for each of the 7545 languages listed in the “Browse by Language Name” pages.

I also want to add more specific information to each language dict. That information lives on the individual language pages, which we’ll scrape in Step Three.

 

Step Three: Add language-specific info from each language page📄

Quick Tip: enumerate() in for loops

enumerate() wraps an iterable and yields (index, item) pairs. In the following code, it gives us the index of each language dict in languages so the dict can be updated in place. See the documentation for how this works in a little more detail.

Warning: this took almost three hours to run, what with a modest 7545 pages to scrape. Not for the faint of heart/processor.
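With a run this long, a single failed request can be costly. The helper below is a minimal sketch of one way to retry a request a few times before giving up. The name fetch_with_retry and its parameters are hypothetical, and the fetching function is passed in so the helper can be tried out without touching the network.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times; return the first 200 response or None."""
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    return None
```

In the loop, requests.get(url) could then be swapped for fetch_with_retry(requests.get, url), with a check for None before parsing.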

for index, lang in enumerate(languages):  # enumerate() yields (index, item) pairs

    # scraping
    url = lang['link']
    res = requests.get(url)
    print(res.status_code) # I like to use this to see where the code is at as it's running
    soup = BeautifulSoup(res.content, 'lxml')

    # getting relevant info from the html
    country_origin = soup.find('h2', {'class': 'field-content'})('a')[0].text
    resources = soup.find('div', {'class': 'views-field views-field-nothing-2'})('span')[1]('a')[0]['href']
    subgroups_list = []
    subgroups = soup.find_all('span', re.compile('lineage-item'))
    for sub in subgroups:
        subgroups_list.append(sub.text)

    # adding key-value pairs with relevant info to existing language dict
    languages[index]['country_origin'] = country_origin
    languages[index]['lineage'] = subgroups_list
    languages[index]['additional_resources'] = resources
    if len(subgroups_list) == 0:
        languages[index]['family'] = '' # some languages have no family (e.g. language isolates)
    else:
        languages[index]['family'] = subgroups_list[0]
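One thing to watch for: soup.find() returns None when nothing matches, so a page that lacks one of these elements would stop the whole loop with an error. Below is a sketch of a more defensive lookup. The sample HTML is invented for illustration and only loosely modelled on the selectors above, the helper name first_link_text is hypothetical, and the built-in 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selectors used above
SAMPLE_HTML = """
<h2 class="field-content">A language of <a href="/country/MX">Mexico</a></h2>
<span class="lineage-item">Otomanguean</span>
<span class="lineage-item">Mixtecan</span>
"""

def first_link_text(soup, tag, class_name):
    """Return the text of the first <a> inside the first matching tag, or ''."""
    parent = soup.find(tag, {'class': class_name})
    if parent is None:
        return ''
    link = parent.find('a')
    return link.text if link is not None else ''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

country = first_link_text(soup, 'h2', 'field-content')          # element present
missing = first_link_text(soup, 'div', 'views-field-nothing-2')  # element absent
lineage = [span.text for span in soup.find_all('span', class_='lineage-item')]

print(country)  # Mexico
print(lineage)  # ['Otomanguean', 'Mixtecan']
```

Returning an empty string mirrors how the loop already treats a missing family.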

 
Saving the updated languages object with pickle:

import pickle

# the with block closes the file once the dump has finished
with open('languages.obj', 'wb') as languages_file:
    pickle.dump(languages, languages_file)
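The point of pickling is to avoid repeating the three-hour scrape, so it helps to be able to read the file back in a later session. Here is a rough sketch with a pair of helpers. The function names are my own, and the default path matches the filename used above.

```python
import pickle

def save_languages(languages, path='languages.obj'):
    """Write the list of language dicts to disk."""
    with open(path, 'wb') as f:
        pickle.dump(languages, f)

def load_languages(path='languages.obj'):
    """Read the list of language dicts back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

In a fresh session, languages = load_languages() would restore the list without any scraping.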

 
Just writing a quick function to grab a random number (one that will correspond to the index of some language dict in languages):

def get_random():
    rand = np.random.randint(0, len(languages), 1)[0]
    return rand

 
Now it’s possible to index languages at a random location with the get_random() function.

>>> languages[get_random()]

  {'name': 'Mixtec, Ayutla',
   'slug': '/language/miy',
   'link': 'https://www.ethnologue.com/language/miy',
   'country_origin': 'Mexico',
   'lineage': ['Otomanguean',
    'Eastern Otomanguean',
    'Amuzgo-Mixtecan',
    'Mixtecan',
    'Mixtec'],
   'additional_resources': 'http://language-archives.org/language/miy',
   'family': 'Otomanguean'}
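If the index itself isn't needed, the standard library offers a shortcut. This is an alternative sketch rather than the approach used in this post: random.choice picks an item from the list directly. The helper name and the two-item sample list are for illustration only.

```python
import random

# two entries standing in for the full languages list
sample_languages = [
    {'name': 'Zuni', 'slug': '/language/zun'},
    {'name': 'Mixtec, Ayutla', 'slug': '/language/miy'},
]

def get_random_language(languages):
    """Return one language dict chosen uniformly at random."""
    return random.choice(languages)

picked = get_random_language(sample_languages)
print(picked['name'])
```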

 

Step Four: Write a function to pick a language, any language🎲

Use the following code to produce a random language and its typological information from the list of 7545 languages 🥳

Note: there’s no return statement in the following function, since the desired output is purely for visual purposes.

def spin_the_wheel():

    rand = get_random()
    name = languages[rand]['name']
    origin = languages[rand]['country_origin']
    family = languages[rand]['family']
    lineage = languages[rand]['lineage']

    print(f'Language name: {name} \n' +
          f'Country of origin: {origin} \n' +
          f'Language Family: {family}')

    print('--- \nLineage: (in descending order from highest-level group)\n')
    for group in lineage:
        print(group)

 
Give it a spin…

>>> spin_the_wheel()

    Language name: Balochi, Eastern
    Country of origin: Pakistan
    Language Family: Indo-European
    ---
    Lineage: (in descending order from highest-level group)

    Indo-European
    Indo-Iranian
    Iranian
    Western
    Northwestern
    Balochi
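Should the text ever be needed somewhere other than the console, one option is to separate building the text from printing it. The sketch below uses a hypothetical helper, format_language, that returns the same layout as a string. The sample dict reuses the Balochi output above.

```python
def format_language(language):
    """Build the display text for one language dict instead of printing it."""
    header = [
        f"Language name: {language['name']}",
        f"Country of origin: {language['country_origin']}",
        f"Language Family: {language['family']}",
        '---',
        'Lineage: (in descending order from highest-level group)',
        '',
    ]
    return '\n'.join(header + language['lineage'])

balochi = {
    'name': 'Balochi, Eastern',
    'country_origin': 'Pakistan',
    'family': 'Indo-European',
    'lineage': ['Indo-European', 'Indo-Iranian', 'Iranian',
                'Western', 'Northwestern', 'Balochi'],
}

text = format_language(balochi)
print(text)
```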

 

Step Five: Discover some new languages🗣💬