An Exercise in BS4: Scraping language typological data

Sun 18 April 2021

This post demonstrates how bs4’s BeautifulSoup can be deployed to scrape webpages. The example I’ve created here uses data on languages and language families from Ethnologue.com, a robust digital catalogue of typological information on the world’s languages.

Check out “Spin the Wheel”, the function I created to randomly select a language and display its primary typological features (name, country of origin, language family, genetic subgroups). A fun and easy way to learn about the countless (well, 7400ish) other terrestrial languages hitherto unknown to me!

Step One: Get a list of page URLs
Step Two: Create basic language dictionaries
Step Three: Add language-specific info from each language page
Step Four: Write a function to pick a language, any language

Packages used:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re

Step One: Get a list of page URLs🔗

Each Browse by Language Name page (corresponding to a letter from A-Z) contains an enormous list of languages. In order to access every language individually, we must first be able to access each page.

The URL for the “A” page is https://www.ethnologue.com/browse/names/a, the “B” page is https://www.ethnologue.com/browse/names/b, and so on.
We’ll use a loop to compile a list of URLs, each made up of the common base (https://www.ethnologue.com/browse/names/) plus each respective letter of the alphabet.

1. Use the built-in `map` function with `chr` to generate the lowercase alphabet from its character codes (97 to 122).

alpha = list(map(chr, range(97, 123)))
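As a quick cross-check, the standard library's string.ascii_lowercase holds the same 26 letters, so the map/chr approach can be compared against it. A minimal sketch (the name alpha_from_string is just for illustration):

```python
import string

# chr() turns a Unicode code point into its character; 97-122 are 'a'-'z'
alpha = list(map(chr, range(97, 123)))

# string.ascii_lowercase is the constant 'abcdefghijklmnopqrstuvwxyz'
alpha_from_string = list(string.ascii_lowercase)

print(alpha == alpha_from_string)
print(alpha[:3], alpha[-3:])
```

Either version works here; the map/chr form just makes the character codes explicit.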

There’s one additional page after Z — “ǀ”, which is the IPA symbol for a dental click. Just append this to alpha and we’re good to go.

Note: although this symbol resembles the standard pipe character (and is itself sometimes called a “pipe letter”), the two are not the same symbol.

alpha.append('ǀ')
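To see the difference for yourself, you can compare the code points of the two characters. The sketch below also shows how the click letter looks once percent-encoded, which is the form a non-ASCII character normally takes inside a URL on the wire. The variable names click and pipe are my own.

```python
import unicodedata
from urllib.parse import quote

click = '\u01c0'  # the dental click letter used for the extra Ethnologue page
pipe = '|'        # the ordinary keyboard pipe character

# Print the code point and official Unicode name of each character
for char in (click, pipe):
    print(f'U+{ord(char):04X}', unicodedata.name(char))

# quote() percent-encodes the character's UTF-8 bytes
print(quote(click))
```

The two characters look nearly identical in most fonts, so copy-pasting the click letter is safer than typing a pipe.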

2. Create a list containing the corresponding URL for each letter a-z by looping over the `alpha` list.

First create an empty list called urls. Then take the base of the URL that’s common to each page, and assign it to the variable url_start.
The for loop adds the complete URL to urls.

urls = []
url_start = 'https://www.ethnologue.com/browse/names/'
for letter in alpha:
    urls.append(url_start+letter)

Inspecting the list of URLs we just added to:

>>> urls[-4:]

  ['https://www.ethnologue.com/browse/names/x',
   'https://www.ethnologue.com/browse/names/y',
   'https://www.ethnologue.com/browse/names/z',
   'https://www.ethnologue.com/browse/names/ǀ']
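For what it's worth, the same list can be built in one line with a list comprehension. This is only an alternative sketch of the loop above, rebuilt from scratch so it runs on its own:

```python
url_start = 'https://www.ethnologue.com/browse/names/'

# a-z plus the extra dental click page
alpha = [chr(code) for code in range(97, 123)] + ['\u01c0']

# One URL per letter: the common base plus the letter itself
urls = [url_start + letter for letter in alpha]

print(len(urls))
print(urls[0])
```

The explicit for loop and the comprehension produce the same list, so it comes down to which one reads better to you.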

I’ll be using the root URL a few times in the next couple of steps, so here I’m assigning it to a variable called main.

main = 'https://www.ethnologue.com'

Step Two: Create basic language dicts📚

Next, loop through all pages A-Z and make a list containing language dictionaries.

The language dictionaries will store all of the relevant language information available on the current page.

# empty list to contain language dicts
languages = []

# iterating through each page A-Z
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    item_list = soup.find_all('span', {'class': 'field-content'}) # each item is a language listed on the page

    for item in item_list:
        language = {} # initialize empty dict to store the available info for the language
        slug = item.find('a')['href']
        language['name'] = item.text
        language['slug'] = slug
        language['link'] = main+slug
        languages.append(language)

Peeking at the last three entries in our languages list:

>>> languages[-3:]

  [{'name': 'Zuni',
    'slug': '/language/zun',
    'link': 'https://www.ethnologue.com/language/zun'},
   {'name': 'ǀGwi',
    'slug': '/language/gwj',
    'link': 'https://www.ethnologue.com/language/gwj'},
   {'name': 'ǀXam',
    'slug': '/language/xam',
    'link': 'https://www.ethnologue.com/language/xam'}]

Checking how many languages are in the list:

>>> len(languages)

  7545

Cool. We now have a master list containing the name, slug and page link for each of the 7545 languages listed in the “Browse by Language Name” pages.
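If you want to try the parsing logic without hitting the site, here's a small offline sketch. The HTML snippet is hypothetical (loosely modelled on what the loop above expects, so the real markup may well differ), and I'm using the built-in 'html.parser' so that lxml isn't required. The names sample_html and sample_languages are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a "Browse by Language Name" page
sample_html = """
<div>
  <span class="field-content"><a href="/language/zun">Zuni</a></span>
  <span class="field-content"><a href="/language/xam">\u01c0Xam</a></span>
</div>
"""

main = 'https://www.ethnologue.com'
soup = BeautifulSoup(sample_html, 'html.parser')

sample_languages = []
for item in soup.find_all('span', {'class': 'field-content'}):
    slug = item.find('a')['href']  # relative path of the language page
    sample_languages.append({'name': item.text, 'slug': slug, 'link': main + slug})

print(sample_languages)
```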

I definitely want to add more specific information to the language dicts. That information lives on the individual language pages, which we’ll scrape in Step Three.

Step Three: Add language-specific info from each language page📄

Quick Tip: `enumerate()` in for loops

enumerate() wraps an iterable and returns an enumerate object, which yields tuples of the form (count, value).
By default the count starts at 0, so where a value x is the item at index n in the iterable, count = n and value = x.

In the following code, this is used to access the index of items in the iterable. See the documentation for how this works in a little more detail.
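Here's a tiny standalone example of what enumerate() produces, using a few language names as a made-up sample list:

```python
sample = ['Zuni', 'Mixtec, Ayutla', 'Balochi, Eastern']

# Materialize the (count, value) tuples to see them all at once
pairs = list(enumerate(sample))
print(pairs)

# Unpacking the tuple directly in the for statement
for index, name in enumerate(sample):
    print(index, name)

# The optional start argument shifts the count
numbered = list(enumerate(sample, start=1))
print(numbered)
```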

Warning: this took almost three hours to run, what with a modest 7545 pages to scrape. Not for the faint of heart/processor.

for index, lang in enumerate(languages): # unpack the index and the language dict

    # scraping
    url = lang['link']
    res = requests.get(url)
    print(res.status_code) # I like to use this to see where the code is at as it's running
    soup = BeautifulSoup(res.content, 'lxml')

    # getting relevant info from the html
    country_origin = soup.find('h2', {'class': 'field-content'})('a')[0].text
    resources = soup.find('div', {'class': 'views-field views-field-nothing-2'})('span')[1]('a')[0]['href']
    subgroups_list = []
    subgroups = soup.find_all('span', re.compile('lineage-item'))
    for sub in subgroups:
        subgroups_list.append(sub.text)

    # adding key-value pairs with relevant info to existing language dict
    languages[index]['country_origin'] = country_origin
    languages[index]['lineage'] = subgroups_list
    languages[index]['additional_resources'] = resources
    if len(subgroups_list) == 0:
        languages[index]['family'] = '' # some languages have no family (e.g. language isolates)
    else:
        languages[index]['family'] = subgroups_list[0]
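One thing to keep in mind with a run this long: soup.find() returns None when nothing matches, so a page that lacks one of the expected elements would stop the loop with an error. Below is a sketch of a more defensive lookup. The helper first_link_text and both HTML snippets are hypothetical, and I'm using 'html.parser' so the example runs without lxml.

```python
from bs4 import BeautifulSoup

def first_link_text(soup, tag, css_class):
    """Return the text of the first <a> inside the first matching tag, or '' if absent."""
    container = soup.find(tag, {'class': css_class})
    if container is None:
        return ''
    link = container.find('a')
    return link.text if link is not None else ''

# Hypothetical snippets: one page with the country header, one without
page_with_country = '<h2 class="field-content"><a href="/country/MX">Mexico</a></h2>'
page_without_country = '<p>No header here</p>'

found = first_link_text(BeautifulSoup(page_with_country, 'html.parser'), 'h2', 'field-content')
missing = first_link_text(BeautifulSoup(page_without_country, 'html.parser'), 'h2', 'field-content')

print(repr(found), repr(missing))
```

Returning an empty string mirrors how the loop above already handles languages with no family.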

Saving the updated languages object with pickle:

import pickle

# the with block closes the file once the dump is complete
with open('languages.obj', 'wb') as languages_file:
    pickle.dump(languages, languages_file)
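Loading the data back later is the mirror image of saving it. Here's a self-contained round trip on a small stand-in list, written to a temporary directory so it doesn't touch your real languages.obj (the sample data and path are placeholders):

```python
import os
import pickle
import tempfile

# Small stand-in for the real languages list
sample_languages = [{'name': 'Zuni', 'slug': '/language/zun'}]

path = os.path.join(tempfile.mkdtemp(), 'languages.obj')

# Save: 'wb' opens the file for writing bytes
with open(path, 'wb') as out_file:
    pickle.dump(sample_languages, out_file)

# Load: 'rb' opens the file for reading bytes
with open(path, 'rb') as in_file:
    restored = pickle.load(in_file)

print(restored == sample_languages)
```

With a three-hour scrape behind you, being able to reload the results instead of re-running the loop is well worth the few extra lines.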

Just writing a quick function to grab a random number (one that will correspond to the index of some language dict in languages):

def get_random():
    rand = np.random.randint(0, len(languages), 1)[0]
    return rand

Now it’s possible to index languages at a random location with the get_random() function.

>>> languages[get_random()]

  {'name': 'Mixtec, Ayutla',
   'slug': '/language/miy',
   'link': 'https://www.ethnologue.com/language/miy',
   'country_origin': 'Mexico',
   'lineage': ['Otomanguean',
    'Eastern Otomanguean',
    'Amuzgo-Mixtecan',
    'Mixtecan',
    'Mixtec'],
   'additional_resources': 'http://language-archives.org/language/miy',
   'family': 'Otomanguean'}
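As an aside, the standard library's random module can do the same job without NumPy. This is just an alternative sketch, not what the rest of the post uses; get_random_index and the sample list are names I made up for it.

```python
import random

# Stand-in for the full languages list
sample_languages = ['Zuni', 'Mixtec, Ayutla', 'Balochi, Eastern']

def get_random_index(items):
    """Return a random valid index into items."""
    return random.randrange(len(items))

index = get_random_index(sample_languages)
print(sample_languages[index])

# random.choice skips the index and returns a random item directly
pick = random.choice(sample_languages)
print(pick)
```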

Step Four: Write a function to pick a language, any language🎲

Use the following code to produce a random language and its typological information from the list of 7545 languages 🥳

Note: there’s no return statement in the following function, since the desired output is purely for visual purposes.

def spin_the_wheel():

    rand = get_random()
    name = languages[rand]['name']
    origin = languages[rand]['country_origin']
    family = languages[rand]['family']
    lineage = languages[rand]['lineage']

    print(f'Language name: {name} \n' +
          f'Country of origin: {origin} \n' +
          f'Language Family: {family}')

    print('--- \nLineage: (in descending order from highest-level group)\n')
    for group in lineage:
        print(group)

Give it a spin…

>>> spin_the_wheel()

    Language name: Balochi, Eastern
    Country of origin: Pakistan
    Language Family: Indo-European
    ---
    Lineage: (in descending order from highest-level group)

    Indo-European
    Indo-Iranian
    Iranian
    Western
    Northwestern
    Balochi
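If you'd rather have the text as a value (say, to write it to a file or test it), one option is a variant that builds and returns the string instead of printing it. This is a sketch under my own assumptions: describe_language is a hypothetical helper, and the '(none listed)' placeholder for languages with no family is my addition.

```python
def describe_language(language):
    """Build the same report as a single string instead of printing it."""
    lines = [
        f"Language name: {language['name']}",
        f"Country of origin: {language['country_origin']}",
        f"Language Family: {language['family'] or '(none listed)'}",
        '---',
        'Lineage: (in descending order from highest-level group)',
        '',
    ]
    lines.extend(language['lineage'])  # one lineage group per line
    return '\n'.join(lines)

# Sample dict using the values from the output above
sample = {
    'name': 'Balochi, Eastern',
    'country_origin': 'Pakistan',
    'family': 'Indo-European',
    'lineage': ['Indo-European', 'Indo-Iranian', 'Iranian',
                'Western', 'Northwestern', 'Balochi'],
}

report = describe_language(sample)
print(report)
```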

c y b e r l i n g u a l
