One Weird Query: Resolving LC Subject Strings to URIs Using Python

Several times a year, the same question comes up on Twitter or in one of the metadata/library/archives Slacks: “Does anyone know a working reconciliation service for LCSH?”1 The person tends to be using something like OpenRefine to match a bunch of textual subject headings with their appropriate LC linked data URIs.

For example, the complex subject “Brewery workers–Labor unions” is represented by http://id.loc.gov/authorities/subjects/sh85136543. But the Library of Congress doesn’t offer the same kind of service for making those matches that places like Wikidata do, so people hack together their own and share them.2

This week, I needed to test 513 simple, complex, and coordinated subjects and get their respective URIs, if said URIs existed. I considered poking around to see what was out there or once again putting out a call for services. But then I had an idea… I’m writing it up to a) share the script and b) offer a peek inside my brain.

For the tl;dr, download the Python script and a sample CSV with a few different kinds of LCSH and experiment with it yourself.

If you want a surprise/extra info, you still might want to scroll to the end!

Walking Through the Experimental Process

The Library of Congress’s Linked Data service provides a guide to known-label retrieval.3 In this case, the full subject is the label. If you know the exact subject, you can get redirected to the page using https://id.loc.gov/authorities/label/{label}. You can add the authority type, e.g. /subjects/ after /authorities/. This ensures that you’ll be redirected to the exact page since sometimes two different vocabularies have the same label.
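For example, with the subject I use below, the two forms look like this (the second is scoped to the LCSH vocabulary specifically):

https://id.loc.gov/authorities/label/Brewery%20workers
https://id.loc.gov/authorities/subjects/label/Brewery%20workers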

I decided to see what kind of data I could get by using Python’s requests module with the redirection URI. What kind of data would I receive? What was in the headers? Could I parse it with something like BeautifulSoup?

I started with a simple subject, Brewery workers. I needed to escape the space, so I manually replaced it with %20. In the Python 3 terminal with the requests module imported, I ran the following query:

stuff = requests.get("https://id.loc.gov/authorities/label/Brewery%20workers")

I then tested it:

>>> stuff
<Response [200]>

I looked up what kinds of data I could expect in a requests response. This requests documentation seems to still apply to Python 3 (note the tag at the top of the page) and proved helpful. So I tested a few fields.

>>> stuff.status_code
200

hm, ok…

>>> stuff.headers
{'Access-Control-Allow-Headers': 'Content-Type, Access-Control-Allow-Headers, Authorization, X-Requested-With', 'Access-Control-Allow-Methods': 'HEAD, POST, GET, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Age': '0', 'Cache-Control': 'public, max-age=2419200', 'Cf-Cache-Status': 'DYNAMIC', 'Cf-Ray': '551eda324d16f21e-ORD', 'Connection': 'keep-alive', 'Content-Security-Policy': 'frame-ancestors self', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Wed, 08 Jan 2020 14:27:28 GMT', 'Server': 'cloudflare', 'Via': '1.1 varnish-v4', 'X-Frame-Options': 'sameorigin', 'X-Preflabel': 'Brewery workers', 'X-Uri': 'http://id.loc.gov/authorities/subjects/sh85016774', 'X-Varnish': '14944587', 'Transfer-Encoding': 'chunked'}

Great, so .headers has everything I need. X-Uri has not just the URL of the page but the actual proper linked data URI. And X-Preflabel has the full subject string so I can confirm the matches.4

In the documentation linked above, I saw that one could easily get the value of a key in the response.

>>> stuff.headers['X-Uri']
'http://id.loc.gov/authorities/subjects/sh85016774'
>>> stuff.headers['X-Preflabel']
'Brewery workers'

Ok, good. I then tried the same with complex subjects, e.g. Brewery workers--Labor unions, and it worked too (a quick sketch of that check is below). Fantastic! Now I just had to automate things and ensure that non-matches didn’t cause errors.
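Roughly, that complex-subject check looked like this (the X-Uri is the one cited in the example at the top of the post):

>>> stuff = requests.get("https://id.loc.gov/authorities/label/Brewery%20workers--Labor%20unions")
>>> stuff.headers['X-Uri']
'http://id.loc.gov/authorities/subjects/sh85136543'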

Script and Sample Data

I had a CSV with all the subjects in the column subject.5 I needed to open it, process each line, make the query, and then write out a new CSV with the matches.

I started by opening and reading the file:

with open(csvSource, newline='') as data:
  reader = csv.DictReader(data)
  for row in reader:
    subjectLabel = row['subject']

But I needed to solve a couple of problems. First, the subjects needed to be properly escaped for URI encoding. I identified the main characters that needed to be replaced and wrote a little function with a quick set of replacements. (I will need to revisit this for more thorough escaping.) Second, the subjects I was using had a blank space on each side of the --: Brewery workers -- Labor unions vs. Brewery workers--Labor unions. Because real spaces needed to be replaced with %20, I replaced the superfluous spaces first, then did the rest. The little function I wrote came out like this:

def URI_escape(value):
  return value.replace(' -- ', '--').replace(' ', '%20').replace(',', '%2C').replace("'","%27").replace('(', '%28').replace(')', '%29')

It handles spaces, commas, apostrophes, and parentheses. Obviously it’ll need to do more.

Update: I tried furl and urllib. urllib did what I needed with the least overkill. I’m leaving the above to demonstrate my process, but the new code is included in the full script at the bottom and in the linked script. Note that it still first replaces any spaces around the --.
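For reference, the urllib version is just this (it’s the same two-liner that appears in the finished script, and it needs import urllib.parse):

def URI_escape(value):
  return urllib.parse.quote(value.replace(' -- ', '--'))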

So now the process looks like:

with open(csvSource, newline='') as data:
  reader = csv.DictReader(data)
  for row in reader:
    subjectLabel = URI_escape(row['subject'])

Now I needed to create the URI and make the request. I did these as separate steps because I was also printing to the console and testing as I went, and I left them that way.

with open(csvSource, newline='') as data:
  reader = csv.DictReader(data)
  for row in reader:
    subjectLabel = URI_escape(row['subject'])
    subjectURI = 'http://id.loc.gov/authorities/subjects/label/' + subjectLabel
    subjectResponse = requests.get(subjectURI)

I now had the response. But I had to be sure that it wasn’t a 404. The subjects I was going through included pre-coordinated headings with no associated URIs. I tested a bit and determined that the response should be 200 if the label resolved directly, 404 if it didn’t, and 400 if I screwed up the escaping. Because I was querying the subjects vocabulary directly, there should never be an issue of duplicates.

I added the following test:

if subjectResponse.status_code == 200:

and did a quick run-through with the console as below, just printing out the values to see how many matches I got, how many I didn’t get, and to compare the initial subject with the X-Preflabel.

with open(csvSource, newline='') as data:
  reader = csv.DictReader(data)
  for row in reader:
    subjectLabel = URI_escape(row['subject'])
    subjectURI = 'http://id.loc.gov/authorities/subjects/label/' + subjectLabel
    subjectResponse = requests.get(subjectURI)
    if subjectResponse.status_code == 200:
      print(row['subject'])
      print(subjectResponse.headers['X-Preflabel'])
      print(subjectResponse.headers['X-Uri'])
    else:
      print("no match")

The test was successful enough that keeping the label part in the final script seemed a bit like overkill. I still did it, for peace of mind.

I then wrote the rest of the script, turning the above into a function, adding a function to create a CSV with the final values, calling the first function from within the CSV-creation function, and adding commands to write values into the new CSV. I added input questions for the path to the input and output CSVs.

Plot Twist 1

Update: When I shared this, someone pointed out that since I’d determined I only needed the headers, I could use requests.head instead of requests.get. This kind of thing is another part of the refining process. I used a full get to see what I might get out of the page, but now that I know I don’t need it, I need to re-evaluate how much work the script does.

Updating the finished script required a little extra testing. When I use requests.head, the status code is 302, not 200, because the response is the redirect rather than the final page. However, a small batch test determined that the contents of its headers otherwise matched what I was getting from my original query. So I changed the requests call and updated the test from checking for a 200 to checking for a 302.
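A quick sketch of that check in the console, reusing the simple subject from earlier (note the 302 and the same X-Uri as before):

>>> check = requests.head("https://id.loc.gov/authorities/subjects/label/Brewery%20workers")
>>> check.status_code
302
>>> check.headers['X-Uri']
'http://id.loc.gov/authorities/subjects/sh85016774'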

Finished Script (Revised)

The finished script is below, with the revisions noted above (URL escaping and requests.head vs. requests.get). You can download the full script and a sample CSV with a few different kinds of LCSH and experiment with it yourself. You’ll need Python 3 with the requests module installed; csv, os, time, and urllib are part of the standard library. If you didn’t save the CSV in the same directory, be sure to input the full path.

import requests, csv, os, time, urllib.parse

# this expects a CSV with a column titled "subject." You may add additional fields, e.g. the ASpace ID of the subject record, see https://github.com/ruthtillman/subjectreconscripts/blob/master/retrieve-lc-uris-from-csv.py

def URI_escape(value):
  return urllib.parse.quote(value.replace(' -- ', '--'))

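# look up each subject from the source CSV against id.loc.gov and write one output row per subject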
def get_subject_URIs(writer,csvSource):
  with open(csvSource, newline='') as data:
    reader = csv.DictReader(data)
    for row in reader:
      subjectLabel = URI_escape(row['subject'])
      subjectURI = 'http://id.loc.gov/authorities/subjects/label/' + subjectLabel
      subjectResponse = requests.head(subjectURI)
      if subjectResponse.status_code == 302:
          writer.writerow({'subject' : row['subject'], 'LC_URI' : subjectResponse.headers['X-Uri'], 'LC_Label': subjectResponse.headers['X-Preflabel']})
      else:
          writer.writerow({'subject' : row['subject'], 'LC_URI' : '', 'LC_Label': ''})
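      # pause between lookups so the script doesn't hammer id.loc.gov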
      time.sleep(4)

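# set up the output CSV with a header row, then run the lookups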
def write_subject_csv(csvOutput,csvSource):
    fieldnames = ['subject', 'LC_URI', 'LC_Label']
    with open(csvOutput, 'w', newline='') as outputFile:
        writer = csv.DictWriter(outputFile, fieldnames=fieldnames)
        writer.writeheader()
        get_subject_URIs(writer,csvSource)

csvOutput=input("Name of output CSV: ")
csvSource=input("Path to / name of source CSV: ")

write_subject_csv(csvOutput,csvSource)
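A run looks something like this (assuming you saved the script under the name used in the repo linked in the comment above; the CSV names here are just placeholders):

$ python3 retrieve-lc-uris-from-csv.py
Name of output CSV: subjects-with-uris.csv
Path to / name of source CSV: subjects.csv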

Using some other script in my toolkit, I was able to match over 200 subjects and update their records in ArchivesSpace to include the URIs. I also now have a method, if a slow one, of getting exact matches. I plan to test it out on names in the future.

Plot Twist 2

Part of the point of the blog post was to invite folks inside my process, since I sometimes get asked how I do various things I do. It’s wild and experimental in my brain, full of trying things to see what happens and checking documentation.

When I shared this on Twitter, Dominic Byrd-McDevitt tested and confirmed that (as with the actual authorities pages) adding .json or .rdf to the end of one of these redirect URLs works the same way it does when you apply it to the regular URL. https://id.loc.gov/authorities/label/Orchids.json takes you to the JSON-LD record, etc. Since I’d used this with the regular pages, it was one of the many little ideas which had popped into my head as a possibility. But when I checked out the headers, I got completely distracted and went off down the rabbit hole of making it work.

The headers had everything I needed. But if you’re reading this with an eye toward manipulating the X-Uri to get the file you need, you might want to try the appending trick instead!
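A minimal sketch of that approach, using the Orchids example above:

orchids = requests.get("https://id.loc.gov/authorities/label/Orchids.json")
record = orchids.json()   # the JSON-LD record, no header-parsing needed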

Footnotes


  1. The term “reconciliation” gets used in several ways, and in this case it should probably be called something like “entity resolution.” As I describe it above, I’ll refer to it simply as “matching,” because that’s what we’re doing here. ↩︎

  2. Or at least that’s how I’d put it in brief. ↩︎

  3. Ironically, as of today the guide has an error in that “orchids” must be capitalized to work! https://id.loc.gov/authorities/label/orchids does not resolve but https://id.loc.gov/authorities/label/Orchids does. ↩︎

  4. I also looked at stuff.content, which was the whole dang HTML page and might’ve been helpful if I were going to use BeautifulSoup, but that would’ve been so much more work than turned out to be needed… And honestly, if I’d needed to do that much work, I probably would’ve sent the well-known Slack message/tweet about finding services. ↩︎

  5. It also had ArchivesSpace IDs, which I won’t be addressing in the script below. The script below does have a link in a comment to the GitHub repo where I’m working on my own scripts, if you’d like to see that version. ↩︎