Linked Data is Made of Systems

… and that’s why it can be so hard to learn.

When I was attending LD4 last week, a session on tools/cataloging turned to the subject of linked data fatigue. One attendee noted that when she’d graduated from library school a decade ago, she was told the move to BIBFRAME wasn’t far off. Others, including me, recounted similar experiences or struggles we’d had in learning linked data.

To be entirely open, although my job title currently includes “Linked Data Strategist,” I responded very poorly to the attempts to teach linked data in my library school class on description. Charitably, those attempts didn’t fit my learning style. Less charitably, they reflected a tendency we have in libraries to downplay the concreteness of our profession and dive straight into theory and examples outside our domain because “we’re about so much more than books.” We are. But complex topics are still best taught starting in a domain familiar to the group, even if it would be limiting never to move beyond it.

(Having taken SPARQL Fundamentals I & II with Robert Chavez, once I had already begun to understand linked data, I’ll add that some people are very good at teaching it. I assume the rest of his certificate courses are just as good, but I haven’t taken them.)

The Systems We Don’t Have

It was after this conversation that the phrase “linked data is made of systems” began bouncing around my head. How is it that we can learn to do things in HTML, XML/XSLT, MARC, and even SQL, while linked data is often an extra reach? I believe it’s because of the systems that already exist around those formats, and the ones that don’t exist around linked data.

It’s easy enough to get a free text editor (like Atom, where I’m writing this) that highlights your HTML or XML tags and performs validation. You can view the HTML in a browser right away, even if you have no site to put it on. You can change tags and see what happens. Plenty of other people use HTML, so it’s not hard to find a buddy who can explain why your text is getting bigger every paragraph.1 If you have oXygen or something similar, you can set up XSLT transformations of your XML and view the result. MarcEdit makes MARC something anyone can create, plenty of MARC data is freely available online, and practically all libraries use it.

The landscape for linked data is far more promising than it was when I started library school nearly 10 years ago. It’s easy to update Wikidata and see results within the system, at least, and there’s a helpful SPARQL query editor (an example query is below). You can use the SPARQL service or editor to query records in the British National Bibliography. I see some promise in data.bnf.fr. On the rest of the web, you can find some linked data (or data structured as RDF, at least) in search result knowledge cards and the like. But it’s not very…linked.
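
If you want something concrete to try first, here’s the sort of query you can paste straight into the Wikidata query editor and run; the editor supplies the wd:/wdt: prefixes for you. The property and item IDs here (P31 for “instance of,” Q5 for “human,” P214 for VIAF ID) are the ones I remember, and the editor’s autocomplete will confirm them.

    # A starter query for the Wikidata query editor: ten people
    # who have a VIAF identifier, with English labels.
    SELECT ?person ?personLabel ?viaf WHERE {
      ?person wdt:P31 wd:Q5 ;      # instance of: human
              wdt:P214 ?viaf .     # VIAF ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10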

The Vision/System Disconnect

First, let me get this out of the way and say that while noble efforts are being made around BIBFRAME tooling, nothing for BIBFRAME comes anywhere near the scale or utility of the tools available for the formats I mentioned above. BIBFRAME is also extremely heavy: our 12 GB of MARC came out at something more like 98 GB when transformed to BIBFRAME. I experimented a bit with it in Blazegraph, but it was unwieldy.

We simply do not have the systems to match our vision. Instead, we have what are essentially a bunch of databases working in triples. They can be very nice databases. They’re very expansive because we can create new predicates. We can get data from their endpoints and include it in our pages. But while I’ve seen things like querying DBpedia or Wikidata to create knowledge cards in catalogs using data from just those systems, I haven’t seen cases of querying Wikidata to get some other data point, like a person’s ORCID, and then pinging ORCID to get data about that person.2 & 3 Such a case would be the simplest level of inferencing, not the web of interlinked data some of us imagine, but it’s a start.
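
For what it’s worth, here’s roughly what that two-step lookup could look like; a sketch rather than anything I’ve built. I’m using Tim Berners-Lee (wd:Q80) as the example and assuming Wikidata records his ORCID under property P496. The second step is plain HTTP content negotiation against orcid.org (as in note 2), not SPARQL.

    # Step 1: ask Wikidata for the person's ORCID iD (property P496).
    SELECT ?orcid WHERE {
      wd:Q80 wdt:P496 ?orcid .
    }
    # Step 2 happens outside SPARQL: take the ID that comes back,
    # request http://orcid.org/<that id> with an "Accept: text/turtle"
    # header (see note 2), and combine the result with the Wikidata data.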

When it comes to linked data that you or I can actually create, outside of wiki-type projects, we’re often limited to embedding schema.org data in our websites or the like.4 In fact, that’s what I was presenting on: a very satisfying partnership with a dataset catalog/center on campus. We generated schema.org markup and Google incorporated it into its dataset explorer. What we made was primarily text as RDF, with some linking and a simple relational model between the datasets and the data catalog. From what I can tell, Google is harvesting it rather than doing any kind of live linked data querying. Hurrah for providing structured, machine-usable data. But we’re still not fulfilling the dream or promise of linked data.

The Unproveable Pudding

If the proof of the pudding is in the eating, I would argue that linked data is currently a pudding we can smell (and most of the time it smells delicious) but can never quite reach or eat. Considering how much I enjoyed LD4, this post came out more cynical than I’d anticipated when I started. But I spelled out similar concerns in my recent chapter “Barriers to Ethical Name Modeling in Current Linked Data Encoding Practice.” The sections on the promise of inferencing (page 247) and on Barrier 3, “Infrastructure, Scale, and Searching the Open Web” (page 253), are particularly relevant here.

I had originally intended to include here some of the system-type things I had to learn, like the difference between an XML schema declaration and an RDF ontology declaration. But I’m over 1,000 words and am going to save those for another post. I do have a post on an intro to RDF specifically for metadata-type librarians, but I feel I need another on the learning problems I encountered.

Coming back, then, to the subject of linked data fatigue and the lack of systems: do I have anything helpful to offer? To start, I think we need to recognize that without a concrete way to engage and practice, any kind of learning is going to be far more difficult; it’s hard for people to understand what linked data is, why they should care, and how they can use it (partly because they can’t, much). Show the kinds of information we can’t, or don’t, put in LC… and then where else we might get it.

Locate the places where it can work. It’s been a while since I tested this particular concept, but see what you can do with open endpoints. One experiment that comes to mind: use LC (see footnote 3) to get a Wikidata identifier for a person, find their DBpedia resource from Wikidata, get the IDs of their collaborators from DBpedia, round-trip that to Wikidata, and either generate something from Wikidata or go all the way back to LC. A rough sketch of the Wikidata-to-DBpedia leg is below.
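
This is only a sketch, run from the Wikidata query service: it assumes the DBpedia endpoint is on the service’s federation allow-list and that the example item (Tim Berners-Lee, wd:Q80) has an LC authority ID recorded under P244.

    # Sketch: start from a Wikidata item with an LC authority ID,
    # then hop to DBpedia via its owl:sameAs links back to Wikidata.
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?lcNumber ?dbpResource ?abstract WHERE {
      VALUES ?person { wd:Q80 }        # example item
      ?person wdt:P244 ?lcNumber .     # Library of Congress authority ID
      SERVICE <https://dbpedia.org/sparql> {
        ?dbpResource owl:sameAs ?person ;
                     dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
      }
    }

From the DBpedia resource you’d then follow whatever relationship properties it exposes (collaborators, influences, and so on) and carry those identifiers back to Wikidata or LC, which is exactly the part our current systems make hard.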

As I did at LD4, I would also recommend this Code4Lib Journal article by Stacy Allison-Cassin and Dan Scott, in which they demonstrate concrete effects of Wikidata edits. It may not be a web of interlinked data, but it seems like a good way to introduce some of the concepts.

And keep working on those systems. If using a Fedora 3 triplestore that’s just a glorified database could undo years of negative feelings toward linked data (after that one class) and show me how much more could be done with truly linked data (see note 4), then I don’t think it’s hopeless.

Notes


  1. Each line opens a new H2 or H3 tag and never closes them. It’s sublime. Thanks to @xyzzy@cybre.space for reupping that beauty. ↩︎

  2. As far as I know, ORCID doesn’t provide the actual citations as linked data. But it does have an API. You can also get basic person info with queries like curl -v -L -H "Accept: text/turtle" http://orcid.org/0000-0003-4547-8879 ↩︎

  3. One of the people involved in the Library of Congress linked data authorities announced in an LD4 session that they would be publishing Wikidata identifiers in LC linked data records, starting May 20. This could be one way to get (comparatively) canonical Wikidata IDs and use those to get other IDs. And it is possible to set up linked data cross-queries, but it’s not easy and it’s not done at scale (from what I can see). ↩︎

  4. My first real encounter with RDF was in NASA Goddard’s Fedora 3 repository. However, I would not actually describe it as linked data; it was more of an RDF-structured database and never linked outside itself to other linked data. It used internal URIs where possible, plus a small amount of text, to create very nice interlinked RDF stubs that could just as easily have been represented in SQL without a functional difference. ↩︎