An Introduction to RDF for Librarians (of a Metadata Bent)
I’ve talked for a while about possibly writing a brief introduction to linked data aimed at library students and at librarians who are still figuring out what it means. What’s held me back has primarily been the fear of saying something imperfect or incomplete.1 But as I did with the Introduction to SQL, I’m going to recommend some books which I think will give you a fuller picture. Reading Semantic Web for the Working Ontologist helped me solidify things that I’d pieced together from other sources. I’d also recommend Karen Coyle’s Understanding the Semantic Web: Bibliographic Data and Metadata, based on what she’s said about this in other places, even though I haven’t read this specific book. Based on having taken the advanced SPARQL class, I’d recommend courses in the Library Juice Academy Certificate in XML and RDF-Based Systems, at least while taught by Robert Chavez.
I’m going to approach it from the angle of what excites me about linked data/RDF from a metadata perspective. I’ll talk about some of the fundamentals, give a few examples of what it looks like, and then talk about where I see the potential. My goal in writing this is that, by the end, you’ll see why people get excited about linked data and some of its potential in libraries, archives, etc.
What is RDF?
RDF stands for “Resource Description Framework.” Immediately, you’ll see the connection to libraries. We have resources. We describe them. We use frameworks to do so. Of course we have various XML languages already, or MARC, so why this resource description framework?
For me, the easiest way to start thinking about RDF was as a metadata database—but one with extraordinary flexibility. RDF uses “triples” to make one statement about one resource, or “subject”.2
Let’s get three major words defined here:
- Subject — The resource that’s being described. Book, person, LCSH, website, function, anything that can be described can be a subject. This is always a URI (uniform resource identifier) of some kind. That could be a URL. It could also be another kind of identifier, such as an ISBN.3
- Predicate — Also known as a “property,” this is a URI which fulfills the role of the database field name or the name of an XML tag. It declares what kind of thing is about to be said about the Subject. A very simple example would be <dc:title> in XML. In RDF, the equivalent would be <http://purl.org/dc/terms/title>, although it may sometimes be written as dc:title (see “Serialization” below).
- Object — The value of a statement. This can be a URI, like the other two, or it can be what’s called a “Literal,” meaning a string, a number, or a date, enclosed in quotation marks. Strings are what we normally think of as text. We can get more specific about what this Literal is with datatype and language modifiers (see “Datatype and Language Modifiers”).
These are combined into what are known as Triples (or “statements”), three-part assertions composed of Subject, Predicate, and Object.
You may have already seen this diagram from the W3C Recommendation for RDF
but what does it actually mean? For most of my examples today, I’m going to assume that the reader is familiar with Dublin Core, since it’s widely taught in library schools and its basic properties are pretty easy for librarians to understand. Let’s look at this way of rendering a title in Subject Predicate Object format. In each set of rows in the table, I’m getting more specific.
| Subject | Predicate | Object |
|---|---|---|
| A thing | A type of information about/property of the thing | The value |
| The URI or other identifier of a resource | The URI of a property that will be assigned to the resource | A Literal (string, date, number) or URI representing the value of the property |
| The book | has a title | which is Kindred. |
The actual RDF statement would be:
<https://www.goodreads.com/book/show/60931> <http://purl.org/dc/terms/title> "Kindred" .
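To make the three-part shape concrete, here is a minimal Python sketch that holds one triple in a small data structure and writes it out as the statement above. The `Triple` class and `to_ntriples` function are names made up for this illustration, not part of any RDF library, and the escaping is deliberately simplified.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # a URI
    predicate: str  # a URI
    obj: str        # a URI or a literal value
    obj_is_uri: bool = False


def to_ntriples(t: Triple) -> str:
    """Render one triple as a single N-Triples line."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


kindred = Triple(
    "https://www.goodreads.com/book/show/60931",
    "http://purl.org/dc/terms/title",
    "Kindred",
)
line = to_ntriples(kindred)
print(line)
```

The point is only that a statement is three slots plus a final period; a real application would normally lean on an RDF library rather than string formatting.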
Let’s compare RDF to your average database. We’ll say we have a database that’s made up of information about people, perhaps library patrons. In the database, you might have a table with the following information:
An important part there is the Patron_ID, which is acting as a “foreign key” within the database to link all tables back to some primary record of the patron. The names table, like the Address table, breaks off a chunk of info into a separate table. In RDF, this is how that information would be written. This example uses the N-Triples format:
<http://ourinternalsystem.org/248249> <http://schema.org/givenName> "Ruth" .
<http://ourinternalsystem.org/248249> <http://schema.org/additionalName> "K" .
<http://ourinternalsystem.org/248249> <http://schema.org/familyName> "Tillman" .
At first, this seems a little like overkill. Why not just use a database? But there are two things the schema links do that our database doesn’t do. First, they make a statement about this data. They say “hey, this data is in line with the way schema.org defines this particular property.” We could put all of our names with someone else’s names and they’d parse right. We wouldn’t have to map “given_name” to someone else’s “firstName.” This is a very simple example, but it’s something we don’t get in a regular database (but do get in XML).
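As a rough sketch of the database comparison, the Python below turns one patron row into the same three statements. The column names (`patron_id`, `given_name`, `middle_name`, `family_name`) and the mapping table are assumptions invented for this example; only the schema.org properties and the subject URI come from the statements above.

```python
SCHEMA = "http://schema.org/"
BASE = "http://ourinternalsystem.org/"

# Hypothetical mapping from local column names to schema.org properties.
COLUMN_TO_PROPERTY = {
    "given_name": SCHEMA + "givenName",
    "middle_name": SCHEMA + "additionalName",
    "family_name": SCHEMA + "familyName",
}


def row_to_triples(row: dict) -> list[tuple[str, str, str]]:
    """Turn one database row into (subject, predicate, literal) triples."""
    subject = BASE + str(row["patron_id"])
    return [
        (subject, COLUMN_TO_PROPERTY[col], value)
        for col, value in row.items()
        if col in COLUMN_TO_PROPERTY and value
    ]


row = {"patron_id": 248249, "given_name": "Ruth", "middle_name": "K", "family_name": "Tillman"}
triples = row_to_triples(row)
for s, p, o in triples:
    print(f'<{s}> <{p}> "{o}" .')
```

Once the local column names have been swapped for shared property URIs, another system that also uses schema.org can read the names without a custom field mapping.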
RDF is rather like XML in that there is no default or baseline language. RDF goes one further by not having a single set of syntactic requirements. There are various syntaxes which one can use to express RDF, which we’ll go over below. Just like there is an XML Schema, there is an RDF Schema with some basic classes and properties, but it’s not built into RDF itself. There are various RDF ontologies (or vocabularies; the line between the two seems to have to do with rigor) which one can use without one being more fundamental than the others. Ones I will include in this post are Dublin Core, SKOS, and Schema.org. Perhaps the best way of understanding what RDF is, underneath it all, is to think about Dublin Core. Dublin Core is not (just) an XML schema. Dublin Core is a basic vocabulary that has been expressed in XML and in RDF. You can write Dublin Core as XML. You can write it as RDF. There are reasons to choose either method.
But, unlike XML or databases, we can easily use multiple vocabularies in describing a resource without causing any issues. We can use as many or as few as we want (of course we may give ourselves a headache doing so). Schema, for example, supports “honorificSuffix” for things like “Ruth K. Tillman, MLS” but doesn’t support “Micah D. Tillman, Jr.” So I go to Linked Open Vocabularies and search for “name suffix.” I then find that the Bibo Ontology supports <http://purl.org/ontology/bibo/suffixName>. I decide it’s a relevant ontology to use (vs. the other options) and we’re off to the races.
Obviously one wants to mix and match with care. But unlike in an XML file, one can decide to describe the same thing with a few local terms, a few terms from one major schema, and a few terms from another. For it to work in systems, of course, the local system has to be programmed to know which ontologies to support (see “But What Does It MEAN?” below).
Serialization, or Linked Data Doesn't Have to Look Scary
Brief working definition: serialization is how data gets stored, organized, and written to disk. It’s mostly important to know as the word which describes how RDF is expressed or formatted, e.g. serialization formats. From now on, I’m going to use a serialization format called Turtle in examples, but I want to briefly define each of the major ones.
- Turtle — Probably the briefest way of writing RDF. Like XML namespace declarations, it assigns a short prefix to a namespace URI, e.g. “dc” or “dcterms” to “http://purl.org/dc/terms/”, and then lets one write the property as dc:title. It also allows one to make multiple statements about a subject without repeating the subject (noted and demonstrated in the examples below). Saved with extension .ttl.
- N-Triples — A longer, but very straightforward way of writing triples which doesn’t require any prefixes up front. This is the format used in the examples above this section, e.g. <http://example.com/id/my-subject/> <http://purl.org/dc/terms/title> "The title" . (the final period is part of the statement). Good and stable, but also sometimes a little intimidating for learners. Saved with extension .nt.
- RDF/XML — Gross (subjective opinion). I love XML and I love RDF and I hate RDF/XML. It’s a reason many people look at RDF and think “I’ll never get this,” probably because while it makes sense as one serialization, it’s how a lot of people get introduced to RDF and it doesn’t lend itself to thinking in triples. This slide deck by Dorothea Salo walks you through RDF/XML once you already understand the basics of both. Saved with extension .rdf.
- JSON-LD — A bit more elaborate than Turtle or N-Triples, but will be very comfortable and familiar to people already familiar with writing JSON. May be used by programmers who also work with JSON and can easily adapt their programs. Along with Microdata/RDFa, a way to embed linked data in HTML pages that Google, etc., will actually read.
- Microdata/RDFa — Two rather simple ways of writing RDF in HTML by using attributes. Each defines a set of attributes which, used together, allow you to refer to the ontologies you’ll be using and designate the property the tag encloses. Less messy than RDF/XML, not my favorite, but fairly useful to learn how to do when using RDF in web pages vs. in triplestores. Combine these and Schema.org’s vocabulary and you’re playing ball with Google.
- N-Quads — Like N-Triples but with an optional fourth element at the end of each statement, which names the graph (or context) the triple belongs to. Avoid until you’re ready to go next-level.
- Triplestore — a database designed to store data that’s serialized in some triple format and allows you to query it.
- SPARQL — a protocol/language for querying RDF (not unlike SQL for databases or XQuery for XML). SPARQL/Update (or SPARUL) is the way to perform updates using SPARQL (like SQL Update queries, but broken off from regular SPARQL, which is only for querying, not for updating).
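To give a feel for what a triplestore and a query language do together, here is a toy Python sketch: a list of triples and a `match` function where `None` acts as a wildcard. This is not SPARQL and not how a real triplestore is built; the `STORE` list and `match` function are invented for illustration, and the prefixed names are kept as plain strings for brevity.

```python
from typing import Optional

Triple = tuple[str, str, str]

# A toy "triplestore": just a list of (subject, predicate, object) strings.
STORE: list[Triple] = [
    ("book:3124906", "dc:title", "Semantic Web for the Working Ontologist"),
    ("book:3124906", "dc:creator", "loc:n2007089310"),
    ("book:3124906", "dc:creator", "loc:n86056933"),
    ("book:60931", "dc:title", "Kindred"),
]


def match(s: Optional[str] = None, p: Optional[str] = None, o: Optional[str] = None) -> list[Triple]:
    """Return every triple that agrees with the given parts; None is a wildcard."""
    return [
        t for t in STORE
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


creators = [o for _, _, o in match(s="book:3124906", p="dc:creator")]
print(creators)
```

A SPARQL query does something conceptually similar, stating a pattern with some parts left open as variables, though with far more expressive power.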
Turtle in a Little More Depth
Turtle documents contain a list of prefixes and the URIs with which they’re associated, followed by a list of triples. Prefix URIs that are URLs usually end with “/” or “#” (occasionally “?”), depending on the vocabulary. Dublin Core, for example, formats its URIs as “http://purl.org/dc/terms/title”, which means its prefix URI is “http://purl.org/dc/terms/”. But if it had been “http://purl.org/dc/terms#title”, the prefix URI would end with the #.
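Prefix expansion is just string concatenation, which a few lines of Python can show. The `PREFIXES` table and `expand` function are illustrative names, not part of any library; the prefix URIs are the ones used in this post's examples.

```python
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "book": "https://www.goodreads.com/book/show/",
    "loc": "http://id.loc.gov/authorities/names/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as dc:title into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"undeclared prefix: {prefix}")
    return PREFIXES[prefix] + local


print(expand("dc:title"))
```

This is also why the trailing “/” or “#” matters: leave it off and the local name gets glued directly onto the end of the prefix URI.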
The three parts of a triple must be separated by whitespace, and each statement is followed by a period, a semi-colon, or a comma. A period indicates that it is a complete statement. A semi-colon indicates that what follows (usually on a new line indented with two or four spaces or a tab) will be a Predicate and Object which relate to the same Subject. One must then use a period after the final statement that applies to that Subject. A comma indicates that the same Subject and Predicate have another Object, which follows the comma. Comma-separated Objects must end with either a semi-colon or a period.
Let’s look at three ways of formatting the same very basic information about a book, “https://www.goodreads.com/book/show/3124906”, namely its title and its authors. For the authors, I’ll be using the URIs to their Library of Congress Linked Data Service pages, e.g. http://id.loc.gov/authorities/names/n2007089310. All these statements contain the same information and should parse exactly the same.
First, I can format as straight-up triples.
@prefix dc: <http://purl.org/dc/terms/> .
@prefix book: <https://www.goodreads.com/book/show/> .
@prefix loc: <http://id.loc.gov/authorities/names/> .

book:3124906 dc:title "Semantic Web for the Working Ontologist" .
book:3124906 dc:creator loc:n2007089310 .
book:3124906 dc:creator loc:n86056933 .
Or, I can break it down using semi-colons, because I’m repeating the subject and don’t need to do so.
@prefix dc: <http://purl.org/dc/terms/> .
@prefix book: <https://www.goodreads.com/book/show/> .
@prefix loc: <http://id.loc.gov/authorities/names/> .

book:3124906 dc:title "Semantic Web for the Working Ontologist" ;
    dc:creator loc:n2007089310 ;
    dc:creator loc:n86056933 .

Finally, I’m also repeating the property dc:creator and I don’t need to be. So I can write:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix book: <https://www.goodreads.com/book/show/> .
@prefix loc: <http://id.loc.gov/authorities/names/> .

book:3124906 dc:title "Semantic Web for the Working Ontologist" ;
    dc:creator loc:n2007089310, loc:n86056933 .
The prefixes are extra lines compared to N-Triples for the few statements above, but once you have a gigantic file full of these statements (which can then be ingested and used by a triplestore), it saves a lot of space and generally simplifies things. I find Turtle the friendliest way of serializing RDF for my human eyes. It’s a straightforward set of triples, but even those can be condensed a little and still make sense. Importantly, one can also use a full URI, for whatever reason, like in a file of N-Triples, and the statement remains valid.
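The semi-colon and comma shortcuts amount to grouping triples by Subject and then by Predicate. Here is a small Python sketch of that grouping; `to_turtle` is a made-up helper, it assumes the terms are already written as prefixed names or quoted literals, and it leaves out the @prefix lines a real Turtle file would need.

```python
from collections import defaultdict


def to_turtle(triples: list[tuple[str, str, str]]) -> str:
    """Group triples by subject, then predicate, and write condensed Turtle.

    Terms are assumed to be already formatted (prefixed names or quoted literals).
    """
    grouped: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)

    blocks = []
    for s, preds in grouped.items():
        # A comma joins objects sharing a predicate; a semi-colon joins predicates.
        lines = [f"{p} {', '.join(objs)}" for p, objs in preds.items()]
        blocks.append(f"{s} " + " ;\n    ".join(lines) + " .")
    return "\n".join(blocks)


triples = [
    ("book:3124906", "dc:title", '"Semantic Web for the Working Ontologist"'),
    ("book:3124906", "dc:creator", "loc:n2007089310"),
    ("book:3124906", "dc:creator", "loc:n86056933"),
]
turtle = to_turtle(triples)
print(turtle)
```

Three flat statements go in and the two-line condensed form comes out, which is all the punctuation is doing.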
Datatype and Language Modifiers
Sometimes in RDF, even when using a Literal vs. a URI in the Object (e.g. “Semantic Web for the Working Ontologist”), you’ll want to specify the type of information it is or something else about it. The following section is specific to Turtle serialization with a note about N-Triples at the end.
One might want to differentiate French and English titles:
film:9258 dc:title "Man Bites Dog"@en, "C'est arrivé près de chez vous"@fr .
This method gets used a lot in data that’s being used in a multi-lingual context. Semantic Web for the Working Ontologist uses the example of agricultural datasets where one needs to indicate the name of an animal, using skos:prefLabel, in multiple languages. The system may choose to only display English labels or to let one toggle between languages. But within the data, they’re equal.
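How a system chooses which language to show is up to the system. A minimal Python sketch of one possible policy follows (preferred language, then a fallback, then whatever comes first); `pick_label` is a hypothetical function, and the titles are the ones from the film example above.

```python
# (value, language tag) pairs for one subject's dc:title, as in the film example.
titles = [
    ("Man Bites Dog", "en"),
    ("C'est arrivé près de chez vous", "fr"),
]


def pick_label(labels: list[tuple[str, str]], preferred: str, fallback: str = "en") -> str:
    """Return the label in the preferred language, else the fallback, else the first."""
    by_lang = {lang: value for value, lang in labels}
    return by_lang.get(preferred) or by_lang.get(fallback) or labels[0][0]


print(pick_label(titles, "fr"))
print(pick_label(titles, "de"))
```

The data itself stays neutral; the preference lives entirely in the display code.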
One may also want to indicate what kind of data a Literal contains. I’m going to break my rule so far of using real URIs and use a fake example.org namespace, where we presume example.org contains lists of TV episodes with a URI prefix of show/show-name/episodes.
@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ep: <http://example.org/show/agent-carter/episodes/> .

ep:15 dc:title "Monsters" ;
    dc:issued "2016-02-16"^^xsd:date ;
    dc:description "Peggy and Jarvis mount a rescue attempt to recover Dottie Underwood." .

The ^^xsd:date suffix indicates that the value is in line with the way XML Schema defines a date. For objects such as system transactions which have an actual timestamp, one can use ^^xsd:dateTime. One may also define numbers as ^^xsd:integer, etc., or specify that a string really is a string. Using ^^xsd:string feels like overkill, but it’s a way to be very precise about one’s data. For complete precision, one could write the above statements as:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ep: <http://example.org/show/agent-carter/episodes/> .

ep:15 dc:title "Monsters"^^xsd:string ;
    dc:issued "2016-02-16"^^xsd:date ;
    dc:description "Peggy and Jarvis mount a rescue attempt to recover Dottie Underwood."^^xsd:string .
A brief note on datatypes and N-Triples: in N-Triples, the entire datatype URI is simply spelled out, e.g. "2016-02-16"^^<http://www.w3.org/2001/XMLSchema#date>.
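One practical payoff of datatypes is that software can turn the text of a Literal into a real value. The Python sketch below handles a handful of XML Schema types; the `CONVERTERS` table and `to_python` function are invented for this illustration and cover only the datatypes mentioned in this section.

```python
from datetime import date, datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

# Converters for a few common XML Schema datatypes; anything else stays a string.
CONVERTERS = {
    XSD + "date": date.fromisoformat,
    XSD + "dateTime": datetime.fromisoformat,
    XSD + "integer": int,
    XSD + "string": str,
}


def to_python(lexical: str, datatype: str = XSD + "string"):
    """Convert a literal's text into a Python value based on its datatype URI."""
    return CONVERTERS.get(datatype, str)(lexical)


issued = to_python("2016-02-16", XSD + "date")
print(issued.year, to_python("15", XSD + "integer") + 1)
```

With the datatype attached, "2016-02-16" stops being a string that happens to look like a date and becomes something a system can sort and compare.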
Where Does the Linking Come In?
I hope by this point I’ve managed to convey some of the basics of RDF. You have a Subject, a thing about which you’re making statements. You have a Predicate, the property you’re about to describe about the Subject. And then you have the Object, which may be a string, a date, a number, or another URI which fits with the property to make a statement about the Subject. A sample statement might be “This book was written by Octavia Butler” where “This book” is the Subject, “was written by” is the Predicate, and “Octavia Butler” is the Object.
RDF and Linked Data get used rather interchangeably. Linked Data also includes HTTP and URIs (see the principles of linked data) and describes the entire infrastructure, which includes RDF statements at and about URIs, most of which can be accessed via HTTP (you can technically have a URI like an ISBN, <urn:isbn:9781934389-18-8>, which doesn’t link to anything but which does provide a unique identifier for the book). In best case scenarios, while some Objects may be Literals, many will also be links, creating an ecosystem of links.
This is the DBPedia entry for the play A Raisin in the Sun. Note the multiple vocabularies it’s using. Also note how it links to Lorraine Hansberry’s page. Note that Lorraine Hansberry, far down on her page, has a “dbp:influences of bell hooks”. All these things link together. Each link contains further information. Querying a resource with a question like “who did the author of ‘Raisin in the Sun’ influence?” may allow one to get data that follows back along multiple links across multiple sites, rather like a vast database that’s not strictly a database or owned by any one person.
Let’s look at a different kind of concrete example.
@prefix worldcat: <http://worldcat.org/oclc/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix viaf: <http://viaf.org/viaf/> .

worldcat:896427099 dc:title "Cebuano subjects in two frameworks" ;
    dc:creator viaf:75371441 .
That’s my mother’s dissertation. Her VIAF record (the data behind the VIAF page at https://viaf.org/viaf/75371441/) includes an English skos:prefLabel "Sarah J. Bell". It also has a schema:alternateName "Sarah Johanna Bell" and a schema:deathDate "2010". Note the mixing of two different ontologies. Now, assume we have a functional system parsing the statement above. In a best case scenario, it would reach out to VIAF’s RDF file for that URI and grab the skos:prefLabel. It could also grab her dates using schema:birthDate/schema:deathDate. If either my mother’s prefLabel or her dates changed, they would automatically change in the display, which is fetching them from a remote resource.3 When my mom died, I eventually got her LCNAF record updated, but most catalogs still don’t reflect that because they’re using strings vs. pulling in that data from a centralized source. The same could be done for the label of dc:creator itself, which, in the RDF document defining the vocabulary, has rdfs:label "Creator", etc.
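Here is a minimal Python sketch of the difference between storing a name string and storing a URI whose label is looked up at display time. The `AUTHORITY` dictionary is a local stand-in for the remote VIAF data (a real system would fetch and parse it over HTTP), and `display_creator` is a hypothetical function; the label and death date are the values quoted in this section.

```python
# Stand-in for the remote authority data; a real system would fetch this over HTTP.
AUTHORITY = {
    "http://viaf.org/viaf/75371441": {
        "skos:prefLabel": "Sarah J. Bell",
        "schema:deathDate": "2010",
    }
}

record = {
    "dc:title": "Cebuano subjects in two frameworks",
    "dc:creator": "http://viaf.org/viaf/75371441",  # a URI, not a name string
}


def display_creator(rec: dict) -> str:
    """Build the creator display string from the authority data at display time."""
    person = AUTHORITY.get(rec["dc:creator"], {})
    label = person.get("skos:prefLabel", rec["dc:creator"])
    death = person.get("schema:deathDate")
    return f"{label} (d. {death})" if death else label


print(display_creator(record))
```

Because the catalog record holds only the URI, a correction made in the authority data shows up in every display that uses it, with no record-by-record editing.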
Now, if systems were linked well enough together, one might be able to use that same VIAF URI to query a whole bunch of catalogs and see everything she’s written and the systems holding the books. One can do this now when they’re loaded into a single database, like a multi-system catalog or Worldcat, but in theory, it could be much bigger.
Or imagine an archival resource with a URI for the entity that created it and a system which allowed one to look at the URI and find all the archival materials in every repository around the world created by that same entity.
This is the kind of possibility that puts stars in people’s eyes and makes them gesticulate while whispering “linked data.” It’s a pretty good dream. Some people are making it work out better than others. Other dreams include actually getting library catalogs indexed in Google using linked data, which is theoretically more possible with linked data than any other way I’ve heard.
But What Does It MEAN?
This is an aside many of you may be able to skip. However, when first working with linked data (perhaps because SO many examples used fake ontologies hypothetically located at example.org), one of my biggest problems was in figuring out where meaning would be imposed on the statements we were creating. I would also wonder about such rules as “you can only have one skos:prefLabel for a piece (per language)” as it’s not validated in the same way as XML. And, unlike XML, the underlying claim of RDF seemed to be that it was imbuing things with better meaning than we could do in other ways. So, where did that meaning happen?
The answer is that, like many things, the meaning has to be built into the systems which process or read the RDF. For example, a system built to handle SKOS has to know to look for a single “prefLabel” on a piece and return an error if there are two. But it also must know that if the Object is followed by @en, @fr, or another language code, then there can be multiple prefLabel statements as long as no two of them share a language code. It should be able to do this by processing the SKOS namespace document and the rules it contains. It then also needs to know that prefLabel should be displayed as the subject’s main label, or how to handle a display conflict if the subject also has an rdfs:label statement. It’s left up to the coders to know the goal of the system, to work with the people supplying the data to find out what it should mean in terms of display, use, etc., and to make it happen.
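The “one prefLabel per language” rule is a good example of meaning that lives in code. A Python sketch of such a check might look like the following; the function name and the animal labels are made up for illustration, and a real SKOS-aware system would likely do considerably more.

```python
from collections import Counter


def duplicate_pref_label_languages(labels: list[tuple[str, str]]) -> list[str]:
    """Given (value, language) prefLabels for one subject, return any language used twice.

    An empty string stands for a label with no language tag.
    """
    counts = Counter(lang for _, lang in labels)
    return sorted(lang for lang, n in counts.items() if n > 1)


ok = [("horse", "en"), ("cheval", "fr")]
bad = [("horse", "en"), ("pony", "en"), ("cheval", "fr")]
print(duplicate_pref_label_languages(ok))
print(duplicate_pref_label_languages(bad))
```

Nothing in the triples themselves stops the second list from existing; it only becomes an error when some system decides to check.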
A more complicated example would be a statement: resource:342 dc:isPartOf resource:10. The system would need to be programmed to look for this and know what it means. It may be coded to understand dc:isPartOf and dc:hasPart as inverse properties. So without someone even needing to make the statement resource:10 dc:hasPart resource:342, it would know that to be true as well. Or it might not. One must learn what one’s systems do and don’t know and act on. It may be true that in this system, “is part of” is a way of indicating files in a composite digital resource, like multiple research data files. There may also be a statement that resource:10 fedora-rels-ext:isMemberOfCollection collection:12. Whether that applies to resource:342 or not depends entirely on what kind of transitive properties have been programmed into it…or not.
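As a sketch of what “coded to understand inverse properties” could mean, the Python below adds the implied statements to a set of stated ones. The `INVERSES` table and `with_inverses` function are assumptions for illustration; whether a given system does anything like this is exactly the thing one has to find out.

```python
# Pairs of properties this hypothetical system treats as inverses of each other.
INVERSES = {"dc:isPartOf": "dc:hasPart", "dc:hasPart": "dc:isPartOf"}


def with_inverses(triples: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Return the stated triples plus the inverse statements they imply."""
    inferred = {
        (o, INVERSES[p], s)
        for s, p, o in triples
        if p in INVERSES
    }
    return triples | inferred


stated = {("resource:342", "dc:isPartOf", "resource:10")}
known = with_inverses(stated)
print(sorted(known))
```

A system without this table would simply never “know” the hasPart statement, even though a human reader would consider it obvious.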
Like XML and HTML and all other technical things, how engines and software and other things handle RDF differs slightly between them. Sometimes you’ll be involved in saying how it should behave. Sometimes you’ll be figuring out how a system is programmed and cringe writing your RDF accordingly…ideally not doing it wrong, but choosing your properties and such based on what the system supports.
What if I need to do Hierarchical Metadata?
Every example so far has been flat, straightforward metadata. But what if I want to do something more hierarchical? Not all metadata seems to break down into triples (though I’d argue that a lot more does than we’re currently thinking.)
In a perfect world, that’s what the Object part is for. This is how RDF becomes a vast graph. Suppose you want to make a publication statement about something. You might have an RDF record describing that publisher—a preferred label, their city/state, etc. This might be a local record in your system. Then, each time you want to say a book was published by that publisher, you use the URI of that local RDF record as your Object. The system extracts all the fields from that record. This is, in fact, one of the ideals people think of in implementing linked data, that it’s a series of resources connected by relationships, not a bunch of text repeated in every record.
But we don’t live with perfect systems and sometimes hierarchical information is a one-off. That’s where something called “blank nodes” comes in. Put very simply, blank nodes are ways of grouping a set of statements together around something that’s not a URI outside the document. They may look like one of two things in something like Turtle.
@prefix dc: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
<http://eadiva.com/> dc:title "EADiva" ;
    dc:creator [
        schema:givenName "Ruth" ;
        schema:familyName "Tillman"
    ] .

@prefix dc: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
<http://eadiva.com/> dc:title "EADiva" ;
    dc:creator _:935772 .
_:935772 schema:givenName "Ruth" ;
    schema:familyName "Tillman" .
In the first case, brackets indicate that this is an anonymous but unified entity we’re talking about, that the dc:creator has the schema:givenName “Ruth” and the schema:familyName “Tillman.” In the second case, a blank node identifier, _:935772, with an underscore preceding it (putting the “blank” in “blank node”), is referenced in the dc:creator statement, followed by a period. Then a second set of statements is made about this _:935772. This is essentially creating a separate resource in your system, except you’re doing it right inside a specific file and not in a way that can be reused. This second approach works in N-Triples as well. Both work in Turtle and the former is part of Turtle’s condensed way of writing things.
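The two notations describe the same triples, which a short Python sketch can show by flattening a nested description into statements with a generated blank node label. The `flatten` function and the `_:b1` style labels are invented for illustration; any label works, since it only has meaning inside the one document.

```python
from itertools import count


def flatten(subject: str, description: dict, counter=None) -> list[tuple[str, str, str]]:
    """Flatten a nested description into triples, minting a blank node per nested dict."""
    if counter is None:
        counter = count(1)
    triples = []
    for predicate, value in description.items():
        if isinstance(value, dict):
            node = f"_:b{next(counter)}"  # label is only meaningful inside this document
            triples.append((subject, predicate, node))
            triples.extend(flatten(node, value, counter))
        else:
            triples.append((subject, predicate, f'"{value}"'))
    return triples


site = {
    "dc:title": "EADiva",
    "dc:creator": {"schema:givenName": "Ruth", "schema:familyName": "Tillman"},
}
triples = flatten("<http://eadiva.com/>", site)
for t in triples:
    print(" ".join(t), ".")
```

The nested dictionary plays the role of the bracket syntax, and the flattened output corresponds to the labeled form.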
In some cases, people ask “why not have a real, reusable URI for this blank node?” It’s not a bad question and should generally be considered when modeling data. But sometimes you just need to make a whole bunch of statements about an Object in a triple and they really don’t break off neatly to their own URI.
Again, this is a very simple explanation, not as in-depth as one could get.
You Said This Was Exciting
Gosh yes, sorry about that. Now that we’ve established all this, let me tell you what I find exciting about Linked Data:
- The potential to discover related resources.
- The potential for metadata re-use, allowing information to be defined and updated in one location, not typed and retyped throughout a catalog or database (think Library of Congress Name Authority Files and how, in most catalogs, they don’t get updated with a death date when an author dies).
- The potential to incorporate more and different ways of describing objects without having to adhere to single XML schemas or the specific ways in which one may combine them.
- The potential to query multiple URIs which are about the same resource and pull out more combined data than either has separately.
There are other potentials I can’t easily bring up here because I haven’t gone into any kind of details about that kind of thing yet. For example, I’m very excited by the Portland Common Data Model, which will use RDF to allow one to model multi-part works much better than most systems we currently have in place. In that case, the statements will be about works (Classes, we haven’t even covered those, there was so much else to say) and about how they relate to each other, how the files that make them up relate to them, etc. PCDM borrows from other things which already exist (another advantage of RDF, not having to reinvent the wheel) like a method of ordering where one of the properties on a file is what file is iana:next in a sequence, thus making it easy to order a group of files that have a particular order (perhaps book page images).
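Ordering by “next” links works like a linked list. The Python sketch below follows such links from a known first item; the `page:` identifiers and the `order_by_next` function are hypothetical, and this shows only the general idea rather than how PCDM itself models ordering.

```python
def order_by_next(first: str, next_of: dict[str, str]) -> list[str]:
    """Follow iana:next links from the first item to recover the sequence."""
    ordered, seen = [], set()
    current = first
    while current is not None and current not in seen:
        ordered.append(current)
        seen.add(current)  # guards against an accidental loop in the data
        current = next_of.get(current)
    return ordered


# Hypothetical page files; each key's iana:next is its value.
next_of = {"page:1": "page:2", "page:2": "page:3"}
pages = order_by_next("page:1", next_of)
print(pages)
```

Each file only needs to know its immediate successor, so inserting a page means changing one or two statements rather than renumbering everything.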
Recap, or, Oh God You Wrote How Many Thousand Words?*
That was a lot of words, so let me do a quick recap of everything that I’ve talked about.
First, RDF is a “Resource Description Framework.” In this framework, one makes statements as “triples” which are composed of a Subject, a Predicate, and an Object. The Subject is the thing being talked about. The Predicate tells us what properties we’re going to describe/define. And the Object is the description or definition. RDF lets us choose from multiple vocabularies and doesn’t require any particular triple to exist in a file, making it more flexible than a database or XML. Each vocabulary may have its own rules for what kind of data an Object can be, but we can mix and match Dublin Core and SKOS and Schema.org without any negative repercussions. We should always be thoughtful when doing so, however, and aware of how these triples will be used later on.
Second, we went over the various ways RDF can be formatted. We got into some depth on Turtle because it’s fairly friendly to humans. We took a quick look at datatypes and languages.
Third, we looked at what the Linked part means and some basic examples of how one can follow a web of statements. We saw ways it might be used to trace relationships or to pull in information from a remote source.
Fourth, we very briefly talked about hierarchical metadata and blank nodes.
Then I tried to convey why I find all this exciting.
There are a LOT of things I haven’t even had time to talk about here, like rdf:type, or what a Class is. I didn’t talk about RDF as graphs and there’s a lot of exciting stuff that can be said there, but I thought about how many words it would take and decided I’d cover it later if at all. I would again recommend the books and courses I mentioned at the beginning. A lot of official RDF documentation is painfully-opaque as it’s written by people who already know what they’re talking about, which is why I wouldn’t recommend going straight to the documentation.
If you’ve made it through this whole post/document, and it makes sense to you, then you understand the very basic starting points of RDF. But what I hope is that, by reading this, you’ll have some simple grounding for learning more from other resources. You’ll know what you’re reading when you see an N-Triple or Turtle statement, at least. You’ll understand a bit better why people get all starry-eyed over it.
It really is a book-length subject, and one I only feel competent to write about at an entry level rather than in a full set of posts with more details. I may do a few smaller ones in the future, at least about some critical stuff like Classes.
Footnotes, a.k.a. Rejected Parentheticals
Thanks to Kyle Shockey for providing some very helpful feedback on a couple assumptions the first draft of this post made.
(Four, four thousand six hundred words (update, edits brought it over five thousand). If only I could write this much in a day on demand or when I really needed to, vs. when I wake up at 6:30am and think “must write about linked data.") ↩︎
I’m getting a bit reductive here, as people who know a lot about RDF will see, but I think this is the place to start so you don’t end up getting overwhelmed. ↩︎
Isn’t that also scary? Yes, yes it is. Or it can be laggy. That’s why some people download giant exports of resources they use. That way if something remote is wiped, crashes, or otherwise becomes inaccessible, they could either stage a copy or use it to extract the data they were fetching. ↩︎