Learning Cell Cross in OpenRefine

Abstract

This written tutorial and prerecorded demo are derived from a 1-hour webinar held on March 26, 2020.

Date
2020-03-26
Location
Online

Sometimes, when working with spreadsheet data, you need to get one or more values from spreadsheet A into spreadsheet B, matching them with the correct row. There are several ways you can do that, but one of the easiest I’ve found is to use OpenRefine and “Cell Cross.” It’s kind of what it sounds like – cross-referencing between OpenRefine projects.

OpenRefine uses Google Refine Expression Language (GREL). It’s very powerful, but is a little different than most languages (SQL, Python, etc.) which you might already use for data manipulations. In this demo, I hope to make it a little more familiar.

View Demo on YouTube

Watch a video of the walkthrough below:

Sample Project

For the demo, I’ve designed a simple project. Suppose you want to add a finding aid link to each of your MARC records. You have one spreadsheet which tracks all your ArchivesSpace projects and has the project’s ASpaceID, title, extent, and URL in it. You’ve got a second sheet which matches your ASpaceIDs (as archivesID) and your catalog IDs for MARC records (as catkey).

I’ve created sample data below which you can download and use yourself in OpenRefine. When doing a Cell Cross, you’ll need both sheets imported into OpenRefine and to know their project names and column names. I tend to keep them open in two separate Firefox tabs so I can review them.

A screenshot of the process of creating an OpenRefine project. Red circles highlight the project name in the upper right corner and the titles of the two most important columns

Sample Data

You can download the sample data to test it yourself:

One note: while I’m matching IDs in this demo, you can match any two things with, well, match. That can include long test strings, place names, or even LCSH subjects formatted like “Quilting–History.” You may need to manipulate the sheets so your data does match, but that’s another of OpenRefine’s strengths and can be found in other tutorials.

Archives Sample

ASpaceID title extent url
232 Marmot Field Notes 1 linear ft https://aspace.myinstitution.edu/public/repositories/2/resources/232/
284 Village Green Preservation Society Records 184 linear ft https://aspace.myinstitution.edu/public/repositories/2/resources/284/
395 Society for the Dissemination of Hyrax Memes Image Collection 837 digital images https://aspace.myinstitution.edu/public/repositories/2/resources/395/
12 Bulletins of the Esteemed Founder 42 linear feet https://aspace.myinstitution.edu/public/repositories/2/resources/12/
546 Midsomer Pines Garden Club Records 22 linear feet, 2 hydrangeas https://aspace.myinstitution.edu/public/repositories/2/resources/546/
8593 Colin J. Former-President Realia Collection 384 buttons, 2 linear feet, 4 urns https://aspace.myinstitution.edu/public/repositories/2/resources/8593/
23 Doomsday Machine Project Records 8 cubic feet, 2 500GB hard drives https://aspace.myinstitution.edu/public/repositories/2/resources/23/
934 Jessica Fletcher Papers 29 bound journals, 6 linear feet https://aspace.myinstitution.edu/public/repositories/2/resources/934/

Key Match Sample

catkey archivesID
3935823 232
3942352 12
93525335 8593
8643413 4302
3953838 934
39582803 546
490235223 13
9382124 395
29483520 23
928423523 932

Performing Cell Cross

Navigate to the project where you want to add new data (in this case, project key_match_sample).

Select the drop-down in the column you want to match (in this case, archivesID). Choose Edit Column -> Add Column Based on This Column.

A screenshot of the function to select a column, choos edit column, and add column based on this column in the same spreadsheet using archivesID

If the column is an ID, for example, now you have to find/match that same ID in your second project. Format your cell cross statement as follows, where otherProject, columnToMatch, and columnToAdd represent variables:

cell.cross(otherProject, columnToMatch)[0].cells[columnToAdd].value

In this case, that looks like:

cell.cross("archives_sample", "ASpaceID")[0].cells["url"].value

(note the quotes)

A screenshot in which the above formula is entered and you can preview matches of how it will result.

Broken down, that can be read as “in the project archives_sample, match the value of each row in the source column (project: key_match_sample, column: archivesID) with a row in column ASpaceID. If such a match exists, then fill the new column with the value for the field url from that same row.”

The preview window shows a snapshot of the results you’ll get. As you can see, most things have a match and the URLs look good!

For now, just ignore the [0]. In this case, it makes your new data appear as one value instead of an array with one value. If you’re expecting more than one row in the other spreadsheet to match, experiment with removing this to view it as an array.

On Error or No Match

What happens if there’s no match? Or what if we don’t know whether the spreadsheets have 100% overlap? Maybe we have all the catkeys and archivesIDs, but we only have URLs for the records which are published. Maybe we’re trying to match our circulation count spreadsheet with a spreadsheet of consortial holdings. We’re using the OCLC number, which works most of the time but sometimes there’s no match. Maybe we have the old OCLC number.

highlighting the On Error choice box which lets you choose to set to blank

Fortunately, OpenRefine offers an “On error” choice. I chose “set to blank.” You can then do whatever you need with the rows which do or don’t match.

You can add as many columns as you want this way. You can use it to pull together major components of several spreadsheets based on a shared value or values. You can even pull data across more than the 2 projects as shown below.

Extending the Data

Suppose you have another project in which you have your old finding aid URLS and catkeys for their respective MARC records. It might look like something like the below:

catkey old finding aid url
3935823 https://findingaids.mylibrary.edu/935.html
39582803 https://findingaids.mylibrary.edu/25.html
29483520 https://findingaids.mylibrary.edu/742.html

You want to set up redirects from your old finding aid site, but you need to pair the old and new URLs. Since you’ve now got a copy of key_match_sample which contains catkeys and ASpace URLs (even if you deleted the archives IDs), you can build on that with this data.

Using the technique above, you could add a new column to the project, based on the catkey column, and bring over every new finding aid URL if it exists. Let’s suppose the project is named old_fa_urls. Your cell cross would be based on the catkey column in key_match_sample and look like:

cell.cross("old_fa_urls", "catkey")[0].cells["old finding aid url"].value

In this case, the based-on column and the matching column in the other spreadsheet are both named catkey. This doesn’t cause a conflict, but it’s a reminder to keep tract of your project and column names.

You could then export key_match_sample as a spreadsheet from OpenRefine, delete the catkey and archivesID columns, and ask the person handling your site to set up redirects from https://findingaids.mylibrary.edu/935.html to https://aspace.myinstitution.edu/public/repositories/2/resources/232/, for example.