Good afternoon, my name is Ruth Kitchin Tillman. I'm the Cataloging Systems and Linked Data Strategist at the Penn State University Libraries. In my presentation, I'm going to be talking about a project we recently completed with the Association of Religion Data Archives, a multi-institution data catalog located in Penn State's Social Science Research Institute. I want to highlight two elements of the project. First, I will talk about the process by which I turned a request for something else into an opportunity to explore linked data. I saw a chance to use it in a meaningful way and our partnership paid off better than I could've foreseen. Second, I will do a brief walkthrough of how we described datasets and modeled a data catalog using the Schema.org ontology. In the second part, I want to emphasize that our choices were based on the data available and that they were one of several ways one could approach this work.
Let me start by introducing The ARDA. The Association of Religion Data Archives has been making datasets available online for 21 years now. It is funded by the Lilly Endowment, Templeton Foundation, Penn State, and Chapman University, with partners at other locations. Although the site contains much more than datasets, it includes over 1000 datasets collected over those two decades. The datasets are primarily sociological surveys of religious groups, created by researchers all over the world. The ARDA team has made these datasets available in formats from STATA to SPSS to ASCII to Excel and includes codebooks. The site has built-in tools for assessing data and creating cross-references. It's really fantastic and I encourage you to check it out if you're interested in that kind of thing.
All the metadata about the datasets is held in a database and displayed on the website, which is .ASP.
Critically, because of the grant funding, The ARDA has a full-time developer in-house who was pulled into this project. Her name was also Ruth, Ruth Christensen, making this the first time I've ever had another Ruth on a project, instead of, say, 3 Nathans and 2 Johns. Without her tech expertise, we likely would not have been able to make this project happen.
The ARDA approached cataloging with the kind of request that makes our hearts go pitter-patter — or “buhhh? pitter-patter.” They asked whether we could create MARC records for their over-1000 datasets. They would be happy to provide the data in whatever form we desired.
Their goals made a fair amount of sense. The entire incident had, apparently, been spurred by the PI searching WorldCat and realizing that their project was not represented beyond a few white papers. He wanted to change that. This had also led him to realize that their datasets weren't available in our local catalog either. He knew just enough about catalog records to know what he wanted. Perhaps an additional impetus was that an even larger data catalog, the ICPSR, or Inter-university Consortium for Political and Social Research, makes it easy for universities like ours to integrate its records, and its data was showing up in WorldCat results.
Now, don't get the idea that I said no and told them we'd do something else. Heck no. I said yes, and.
I was fairly new to Penn State when we had this conversation, but was the faculty liaison for the team which would be doing the work and facilitated the meetings. I also had to review what was possible and whether it would cause too much work for the department. As I did this, I saw an opportunity to “yes, and” the proposal. Yes, this data could realistically be turned into some fair, minimal MARC records.
But what I saw as I spent time on their site was the opportunity to describe it in a different way. Part of my title is “Linked Data Strategist” and one of the most challenging parts of my work is to determine where linked data is a reasonable and strategic choice, vs. where it's just fun or where it's not really useful.
I'll note something very important here. If we hadn't had the scripting capacity in cataloging, we probably would've just proposed the schema part of the work. And, as I mentioned earlier, if they had not had a skilled programmer to add JSON-LD to the site's pages, we probably would've only been able to do the MARC.
We saw this as an opportunity to consult with them on their own system, something for which we're not the maintainer. It would not add to our permanent workload, nor did we anticipate schema.org changing significantly. If there's a major change in future, we should be able to sit down and work through it with them. Even now, we're reviewing how we might change the data as they finally get DOIs for their system.
I think these kinds of opportunities to augment systems outside the library can be very practical opportunities which don't add to our own overhead and can have great results, as I'll outline at the end.
I'll next walk through the project and how we developed our model, then talk about how it's turned out — so far.
Our first step, for both the MARC and Schema.org work, was to sit down with the developer and project manager and take a look at their database fields. We asked for sample data as XML, which is fairly easy to visually review, and went through what the fields looked like in practice.
As I mentioned before, this is a 20-year project. Wow, was it ever. We had 20 years of data, which was actually 20 years of user-provided data, although the team had done interventions and cleanup. Fortunately, they'd put in some standardization over the years, like cleaning up all the dates to fit 4-digit years. But there were some inherent limits based on a) what they'd collected and b) how it was formatted.
First, we determined the minimum fields which were shared between all items and what was shared between most items. The project manager, Gail Johnston Ulmer, worked with Ruth C to clean up the fields that just needed a little more standardization.
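A rough sketch of that kind of field survey, assuming the sample records have been loaded as Python dictionaries. The field names, the sample rows, and the 80% cutoff for "most" are all hypothetical and are not The ARDA's actual database schema.

```python
from collections import Counter


def field_coverage(records):
    """Return the fraction of records in which each field has a non-empty value."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value not in (None, "", []):
                counts[field] += 1
    total = len(records)
    return {field: counts[field] / total for field in counts}


def split_fields(records, most_threshold=0.8):
    """Separate fields filled in every record from fields filled in most records."""
    coverage = field_coverage(records)
    shared_by_all = sorted(f for f, c in coverage.items() if c == 1.0)
    shared_by_most = sorted(
        f for f, c in coverage.items() if most_threshold <= c < 1.0
    )
    return shared_by_all, shared_by_most


# Hypothetical sample rows standing in for an export from the database.
sample = [
    {"title": "Survey A", "year": "1998", "funder": "Foundation X", "technique": ""},
    {"title": "Survey B", "year": "2001", "funder": "Foundation Y", "technique": ""},
    {"title": "Survey C", "year": "2005", "funder": "", "technique": "Mail survey"},
    {"title": "Survey D", "year": "2010", "funder": "Foundation X", "technique": ""},
    {"title": "Survey E", "year": "2014", "funder": "Foundation Z", "technique": ""},
]

shared_by_all, shared_by_most = split_fields(sample)
print("In every record:", shared_by_all)
print("In most records:", shared_by_most)
```

Fields that land in neither list, like the sparsely filled technique column in this made-up sample, are the ones that would need cleanup before they could be mapped.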
We also evaluated uncontrolled fields, like the user-entered keywords, for how we might represent them. Some fields, we simply rejected as not structured enough. It would've been great to include the Measurement Technique, for example, but it wasn't easily feasible. The perfect would've been the enemy of the good enough and done here. Since it's an ongoing project, however, they might do their own cleanup and revisit the data once it has been made more granular. This certainly isn't the last word.
This is a great example of funder data which is functional for search but certainly does not match formal or legal names. Most should match actual entities, except “an anonymous Catholic foundation.”
If one were starting from scratch, one might be able to standardize this better and use institutions’ legal names, possibly even including URIs. But at 20 years and 1000 datasets, we chose to err on the side of working with what we had.
We chose to model the data catalog as datasets all pointing (“includedInDataCatalog”) to the main browse page, which we declared a data catalog. We could've chosen to use the inverse property, linking to each dataset as “dataset” from the data catalog, or represented it both ways. With over 1000 datasets and growing, we had to think about what would be an extensible model and also not overload the data catalog page with all those links in the JSON-LD. We don't know at what point indexers would stop paying attention.
This is what extensibility looks like. Need to add three new datasets? Point them the right way and they should all be parsed as part of the same data catalog.
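To make the direction of that link concrete, here is a minimal sketch in Python that builds three new dataset stubs, each pointing at the same catalog page. The URLs and names are placeholders on example.org, not The ARDA's real addresses, and the helper function is hypothetical.

```python
import json

# Hypothetical URL for the browse page that was declared the data catalog.
CATALOG_URL = "https://example.org/data-archive/browse"


def dataset_stub(page_url, name):
    """Build a minimal schema.org Dataset that points back to the catalog."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": page_url,
        "name": name,
        # The link lives on the dataset, so the catalog page never has to change.
        "includedInDataCatalog": {"@type": "DataCatalog", "@id": CATALOG_URL},
    }


new_datasets = [
    dataset_stub(f"https://example.org/data-archive/dataset/{n}", f"Example Survey {n}")
    for n in (1, 2, 3)
]

print(json.dumps(new_datasets[0], indent=2))
```

With the inverse approach, the catalog page's JSON-LD would instead carry a "dataset" list that grows with every addition, which is the overload we wanted to avoid.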
I'll do a quick overview of the actual fields we used for the Data Catalog. The data catalog was the only manually-encoded record, which we reviewed in some detail.
As mentioned before, we created the records using schema.org. We set the page URL as the @id, because they didn't have anything more stable, like a DOI. For the datacatalog, its schema type was — Data Catalog. We reviewed the various names used for the data archives, chose one, and listed the others as alternate names. We did not try using the more official legalName anywhere in this data. In the DataCatalog, it was primarily because we wanted to use common names. For the rest, we were dealing with user-entered data and didn't have the space to validate that for every entry.
The project manager wrote a description for the main data catalog and ran it by the PI.
Since the data catalog is the collection of surveys and the project itself has grown much bigger, we listed The ARDA as the creator of this data catalog.
We also needed to acknowledge all the funders.
And finally, we had a lot of discussions over how we should describe it. I don't think this is necessarily the best way to go about it. But for the data catalog record, we paired URIs and terms from LCSH, since we could find some which fit the project pretty well. I would be interested in hearing whether folks might recommend other sources of description instead.
This is an overview of the DataCatalog profile, and my slides will be available later if you'd like to use it to start your own.
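For anyone following along without the slides, a sketch of what such a DataCatalog record might look like, written as a Python dictionary and serialized to JSON-LD. Several things here are assumptions rather than our actual record: the example.org URLs, which name is primary and which are alternates, the use of "about" to carry the LCSH pairs, and the placeholder heading and description text.

```python
import json

data_catalog = {
    "@context": "https://schema.org",
    "@type": "DataCatalog",
    # Assumption: placeholder page URL used as the identifier.
    "@id": "https://example.org/data-archive/browse",
    # Assumption: the choice of primary versus alternate names is illustrative.
    "name": "Association of Religion Data Archives",
    "alternateName": ["The ARDA", "ARDA"],
    "description": "Placeholder for the description written by the project manager.",
    "creator": {"@type": "Organization", "name": "The ARDA"},
    # Common names as used in the talk, not legal names.
    "funder": [
        {"@type": "Organization", "name": "Lilly Endowment"},
        {"@type": "Organization", "name": "Templeton Foundation"},
        {"@type": "Organization", "name": "Penn State"},
        {"@type": "Organization", "name": "Chapman University"},
    ],
    # Assumption: "about" is one possible home for a paired URI and term.
    # Both the URI and the label below are placeholders, not real LCSH entries.
    "about": [
        {
            "@type": "Thing",
            "@id": "https://example.org/placeholder-subject-uri",
            "name": "Placeholder subject heading",
        }
    ],
}

print(json.dumps(data_catalog, indent=2))
```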
Now, let's look at a single item record, each of which was generated from the database. Because the project was in .ASP, the web pages had to be manually regenerated once we set up the fields and Ruth C wrote the code. Regeneration and review, although we did not review them all individually, probably took the longest in this project.
Single items are again in the context of schema. As I mentioned earlier, they don't yet have DOIs, though they've finally got a chance to set them up at an affordable rate. So the page's URL is the “@id” field. The Type is always dataset. We also put the identifier piece in “identifier” in case people use that for their search or we come up with another use. Identifier was one of the friendliest pieces of uniform data.
Name was generated from the database's title field.
Like the type, the includedInDataCatalog always pointed to the main browse page, although for a site with multiple data catalogs, this could be broken out.
Thanks to some cleanup work from Gail and others, we were able to include a valid dateCreated for each record.
Although the ARDA created the data catalog, it is only the host, or the maintainer, of the datasets. We decided that provider was a good choice to describe its role.
Creators are simply a string, from the database field. Fortunately, people's names, when they occur, are broken out, generally separate from their institutional affiliation. In some cases, like the example above, the creator was considered to be two associations, which ran the project, rather than named individuals. These were 720s in MARC, added entry uncontrolled names.
We also included funder information. Although the Lilly Endowment funded both The ARDA and this particular dataset, most datasets are funded by other organizations or not funded at all.
Of the free-text fields, the descriptive information was the easiest to handle in both types of record. In the MARC, it was a 520, an abstract. Here, we simply mapped it to description, with no expectation of additional structures.
And finally, as above, we had to figure out what to do with descriptive terms. We could've also put these in a keyword list. We may yet. Unlike the data catalog description, I don't see any way we can streamline the wide variety of keywords and map them to linked data vocabularies. Gail did quite a lot of work reviewing these for the issues which normally arise in user-generated keywords, corrected tons of typos, and standardized the form of repeated terms.
Sometimes good enough is good enough.
This is an overview of the single record profile. About 40% of it was hardcoded, with the rest supplied by parsing the same database which fed the HTML webpages.
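As an illustration of that split between hardcoded and database-driven values, here is a sketch in Python. The real implementation was written by Ruth C in ASP, so this is not her code. The row keys, URLs, and sample values are hypothetical, and putting the descriptive terms under "about" is an assumption.

```python
import json

# Hardcoded pieces shared by every record (URL is a placeholder).
CATALOG_URL = "https://example.org/data-archive/browse"
PROVIDER = {"@type": "Organization", "name": "The ARDA"}


def dataset_jsonld(row, page_url):
    """Map one database row (hypothetical keys) to a schema.org Dataset."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": page_url,
        "identifier": row["identifier"],
        "name": row["title"],
        "includedInDataCatalog": {"@type": "DataCatalog", "@id": CATALOG_URL},
        "dateCreated": row["year"],
        "provider": PROVIDER,
        "creator": row["creators"],  # a plain string, as in the database field
        "description": row["description"],
    }
    # Most datasets have other funders or none, so only add the key when present.
    if row.get("funder"):
        record["funder"] = row["funder"]
    # Assumption: descriptive terms carried as simple named things under "about".
    if row.get("terms"):
        record["about"] = [{"@type": "Thing", "name": t} for t in row["terms"]]
    return record


def as_script_tag(record):
    """Wrap the JSON-LD in the script element that goes into the page."""
    payload = json.dumps(record, indent=2, ensure_ascii=False)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


sample_row = {
    "identifier": "EXAMPLE1",
    "title": "Example Congregational Survey",
    "year": "1998",
    "creators": "Example Association; Another Example Association",
    "description": "Placeholder abstract for an example survey.",
    "funder": "",
    "terms": ["Example term one", "Example term two"],
}

record = dataset_jsonld(sample_row, "https://example.org/data-archive/dataset/EXAMPLE1")
print(as_script_tag(record))
```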
Everyone involved in the project was thrilled when, shortly after we finished the work, Google launched its dataset search. We searched for records and found very neat, if basic, snippets from the data. The team has seen a marked increase in record use and was able to communicate to their funders that they were already included in what funders saw as a cutting edge place for discovery.
We still have a few things we'd like to tackle. Stub records are fine, since this has enough for a person to determine whether they should click through. But we need to review and see whether there's a way to encode creator data so that the dataset search makes use of it. When you search for one of the PIs for this dataset, such as “Leland Harder,” you should get this result. But you don't yet.
This project was a great experience both for me and our team in cataloging. While the others were less involved in the schema portion, it was a chance to look at parallels between traditional systems and linked data in a way that was far less complex than something like BIBFRAME. As I expressed in the title, I hope to create converts to linked data through a kind of syncretism — the term for bringing together parts of different faith traditions into a religious practice. What if, instead of trying to radically convert people and practices to BIBFRAME in one go, we seek these intermediate paths to introduce the ideas?
The project was also a good opportunity to communicate the kind of work we do to others in the library. Although it's still of critical importance to library functions, cataloging is often assumed to be stagnant. This was a chance to remind folks that, ultimately, we work with data which can be reused in a variety of ways. Similarly, there were demonstrable effects for the center, which we hope will open up new possibilities on campus, particularly for people who use simpler systems like WordPress. While I can't manage .ASP, I already have the PHP code to insert schema.org on my own WordPress sites and could easily translate that. We may be able to reuse both our profile and the data itself. In fact, what's up next? We're looking at downloading and parsing the JSON-LD schema.org data and turning it into Wikidata records for the datasets.
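The parsing half of that next step might start out something like this sketch, which pulls JSON-LD out of a page using only Python's standard library. The sample HTML, the class and function names, and the choice of fields to carry forward are hypothetical, and nothing here touches Wikidata itself.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of every JSON-LD script element in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the schema.org Dataset records embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [r for r in parser.records if r.get("@type") == "Dataset"]


def candidate_fields(record):
    """Pick out the values one might review before drafting a Wikidata item."""
    return {
        "label": record.get("name"),
        "description": record.get("description"),
        "source_url": record.get("@id"),
    }


# Hypothetical page content standing in for a downloaded dataset page.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "@id": "https://example.org/data-archive/dataset/1",
 "name": "Example Survey 1", "description": "Placeholder abstract."}
</script>
</head><body></body></html>"""

for dataset in extract_datasets(SAMPLE_HTML):
    print(candidate_fields(dataset))
```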