Time is a Property of Data

January 5, 2021:¹ I query DBpedia to see whether Barack Obama is still listed as President of the United States.² He is.

February 1, 2021, the day I write this, I check again. Good news! He's not.

Bad news: Biden isn't either. And the problem isn't limited to the easy-to-read versions of the data on resource pages. When I go to the DBpedia SPARQL editor, or the one their databus directed me to, or the one some other docs direct me to, or the insecure one I find when I remove .demo. from that URL, and run:

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?office WHERE {
  dbr:Donald_Trump dbp:office ?office .
}

I get the result:

office
"President of the United States"@en

Every time.

Do I get the same result for Biden? No.
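A check like this can also be scripted so that it is easy to repeat. The following is a minimal sketch, not the method used for this post. It assumes the public endpoint at https://dbpedia.org/sparql (only one of the several editors mentioned above) and that the resource for Biden is named Joe_Biden. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Assumption: the public DBpedia endpoint. Other editors may return other data.
ENDPOINT = "https://dbpedia.org/sparql"

# The same query as above, with the resource name left as a placeholder.
QUERY_TEMPLATE = """\
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?office WHERE {{
  dbr:{resource} dbp:office ?office .
}}
"""


def build_query(resource: str) -> str:
    """Fill in a simple resource name such as 'Donald_Trump'."""
    return QUERY_TEMPLATE.format(resource=resource)


def build_request(resource: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": build_query(resource)})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def offices(resource: str, endpoint: str = ENDPOINT) -> List[str]:
    """Send the query and return every value bound to ?office."""
    with urllib.request.urlopen(build_request(resource, endpoint), timeout=30) as resp:
        data = json.load(resp)
    return [row["office"]["value"] for row in data["results"]["bindings"]]
```

Calling offices("Donald_Trump") and offices("Joe_Biden") on different days, and writing down the date next to each answer, would give a crude log of how current the data is.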

Despite how it's started, this is not a political post. This is a post which argues that time is a core property of data.

Early in Masked by Trust: Bias in Library Discovery, Matthew Reidsma shares an exchange in which he and I reviewed encyclopedia snippets recommended by the Summon Discovery service's “Topics” sidebar integration. We identified numerous false statements, including that Barack Obama was president and that Donald Trump was a real estate investor and reality TV figure. Osama bin Laden was alive. We notified Summon about the out-of-date Wikipedia snippets. After several conversations, they've said that their Wikipedia extracts and revisions are now more frequent. Since the other encyclopedia integrations (Gale, Credo, etc.) remain static, we've left the integration turned off for safety.

Time and Data

So if this isn't about politics, why all the political figures?

Since experimenting with discovery systems, learning how they work, and making them better is part of my job, I spend a lot of time thinking about the best queries for different problems. Need to check on a system's timeliness of updates? Try political searches.

When I started at Penn State, we also included summary snippets of encyclopedia entries in our search results. My searches revealed similar problems with those. For example, in 2018, if one searched for “George W. Bush,” the accompanying sidebar text could be best summarized as:

  • Former governor of Texas,
  • Had owned a baseball team,
  • The Supreme Court decided he won the 2000 Presidential Election,
  • Sure wonder what he's gonna do with the Presidency!

I would guess most people my age or older can pinpoint the exact 9-month period in which that encyclopedia entry was written.³

There's a clear difference between reading that in the context of an encyclopedia (which carries a publication date) and reading it alongside one's search results. In the latter context, it is presented as a Piece of Data.⁴ It is implied to be the most recent and accurate piece of data known, especially when it doesn't carry a last-updated date. This doesn't make the encyclopedia's data bad. Summon and the encyclopedia serve incompatible functions. Their integration is inappropriate.
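One way to picture the difference is to make the date travel with the snippet. The following is a minimal sketch under stated assumptions, not a description of how Summon or any other discovery system works. The Snippet class, the render function, and the one-year threshold are all hypothetical. An undated snippet is simply not shown, and an old one is shown with its date and a warning.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Snippet:
    text: str
    source: str
    as_of: Optional[date]  # when the text was written or last revised, if known


def render(snippet: Snippet, today: date, max_age_days: int = 365) -> Optional[str]:
    """Return display text that always carries its date, or None to suppress it."""
    if snippet.as_of is None:
        # No date means no way to tell the reader how current this is.
        return None
    label = f"as of {snippet.as_of.isoformat()}"
    if (today - snippet.as_of).days > max_age_days:
        # Hypothetical policy: keep old text visible, but say that it is old.
        label += "; may be out of date"
    return f"{snippet.text} ({snippet.source}, {label})"
```

With this policy, a snippet written in 2001 and shown in 2018 would still appear, but nobody reading it would mistake it for the latest word.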

Time and DBpedia

So, what about DBpedia?

First, I'll provide a necessarily brief overview of the DBpedia Project. DBpedia's original data project derives structured data from Wikipedia and other Wikimedia projects. It publishes this data as linked data. It interlinks with many other linked data projects. It preceded Wikidata by about 5 years. It was heralded in keynotes, presentations, articles, and books as one of the most promising knowledgebases using linked data. It's been expanded upon and integrated with other projects, including library data projects.

It also took 4 years for the live version to say that Donald Trump was President of the United States and, when it did so, the statement was true for 16 or fewer days.⁵

What happened?

One of the difficulties I've had in trying to catch up with DBpedia over the last few years has been keeping track of its websites and their functions. Last time I checked (October), the most recent data release was from October 2016. Or was it August 2020? The project is spread over quite a few subdomains, which may or may not have the information one needs.

But the good news is that I found an answer! On the latest core dataset releases page, I found a link to this very helpful paper, “The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows” (SEMANTiCS 2020). It:

  • explains the effect that growth and scale had on DBpedia's update processes,
  • confirms that there weren't new core dataset releases published between 2017 and 2019,
  • and outlines the steps taken to create an automated monthly release cycle.

A lot of thought and work has been put into how DBpedia can be both comprehensive and up-to-date. It's got potential to be a major improvement. But…

Conclusion

…although it looks like the question of who's President of the United States will get sorted out in another month or two, I am still left with my own questions. Are monthly updates frequent enough for library data use? What happens if the pipeline breaks down again, or differently? What about other kinds of errors, such as those introduced by badly-formatted source data?⁶
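The last kind of error, a value that arrives as a string where a query expects an entity (see Note 6), is at least easy to test for. The following is a minimal sketch. It assumes results in the standard SPARQL JSON results format, where each binding carries a "type" of "uri", "literal", or "bnode". The function name and the sample rows are made up for illustration.

```python
from typing import Dict, List

Binding = Dict[str, Dict[str, str]]


def non_entity_values(bindings: List[Binding], variable: str) -> List[str]:
    """Return the values bound to `variable` that are not IRIs (entities)."""
    return [
        row[variable]["value"]
        for row in bindings
        if variable in row and row[variable]["type"] != "uri"
    ]


# Made-up sample rows: one entity, one language-tagged string.
sample = [
    {"office": {"type": "uri",
                "value": "http://dbpedia.org/resource/President_of_the_United_States"}},
    {"office": {"type": "literal", "xml:lang": "en",
                "value": "President of the United States"}},
]

strings_found = non_entity_values(sample, "office")
```

A saved query that depends on entities could run a check like this first and report the strings it finds, rather than silently returning nothing.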

As library workers, we respond to many kinds of inquiry. One core need we meet is the provision of reliable sources about current events.⁷ When we present out-of-date information in contexts which imply its currency, such as knowledge boxes, we damage the credibility of the other information we provide. We may even further ideas that we have no desire to propagate.

The statement “Donald Trump is the President of the United States” carries significant implications on February 1, 2021 that it did not carry on January 1, 2021.
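One way to make that concrete is to store each statement with the period during which it holds, and to ask questions with a date attached. The following is a minimal sketch under stated assumptions. It is not how DBpedia or Wikidata model time, the class and function names are hypothetical, and it works at day granularity, so the hour of an inauguration is ignored.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

PRESIDENT = "President of the United States"


@dataclass(frozen=True)
class OfficeStatement:
    person: str
    office: str
    valid_from: date          # first day the statement is true
    valid_to: Optional[date]  # first day it is no longer true; None if still true

    def holds_on(self, day: date) -> bool:
        """True if `day` falls in the half-open range [valid_from, valid_to)."""
        if day < self.valid_from:
            return False
        return self.valid_to is None or day < self.valid_to


STATEMENTS = [
    OfficeStatement("Barack Obama", PRESIDENT, date(2009, 1, 20), date(2017, 1, 20)),
    OfficeStatement("Donald Trump", PRESIDENT, date(2017, 1, 20), date(2021, 1, 20)),
    OfficeStatement("Joe Biden", PRESIDENT, date(2021, 1, 20), None),
]


def holders(office: str, day: date,
            statements: List[OfficeStatement] = STATEMENTS) -> List[str]:
    """Answer 'who held this office on this day?' instead of 'who holds it?'."""
    return [s.person for s in statements if s.office == office and s.holds_on(day)]
```

Asked about January 1, 2021 and about February 1, 2021, the same data gives two different and equally correct answers. Asked without a date, it cannot answer at all.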

And thus, I return to my thesis: Time is a critical property of data. We must treat it as such.

End Notes


  1. While I didn't submit it to the Wayback Machine that day, I was able to determine the date thanks to a message I sent a colleague. Their response: “y-i-k-e-s.” ↩︎

  2. The last Wayback archived copy of the resource page version of the data is from November 2020. ↩︎

  3. Did you remember GWB owned a baseball team? That's certainly not what I associate with him. Also, please don't let the evils of the last 4 years whitewash your memory of what he did with his 8. ↩︎

  4. And, in this context, it is implied to be the most relevant and useful piece of data available. ↩︎

  5. In the Wayback Machine's capture from January 2, 2021, Trump was not described anywhere as President. Based on my messaging history (see Note 1), this was still true on January 5, and Obama was described as President. The Wayback Machine's January 20, 2021 capture contains information that was correct on that day: that Trump was President. ↩︎

  6. The entire reason I started looking at DBpedia was a friend's note that a saved query had recently broken. We quickly identified the cause as data separated by a middot on Wikipedia and parsed by DBpedia as a string, rather than an entity. ↩︎

  7. One may argue that Wikidata, a far more timely source, has its own pitfalls. Just as Wikipedia articles about ongoing events may draw vandals, Wikidata may suffer similar malicious attacks which affect the same information our patrons are trying to understand. ↩︎

Ruth Kitchin Tillman
Cataloging Systems and Linked Data Strategist

Card-carrying quilter. Mennonite. Writer. Worker.