An Introduction to SQL for Librarians

2015-09-15 tech

Update! This post has been adapted into Spanish with permission by Profesor Jose A. Senso at the University of Granada.

After August’s local Code4Lib meetup and SAA meeting, I’ve begun to consider blogging more about technical subjects. The biggest reason I haven’t so far is the fear of not being comprehensive enough or being too wordy. But if you want comprehensive, I suggest getting O’Reilly books and/or Database Design for Mere Mortals.¹ And I’ve accepted I’m going to be wordy…this isn’t my formal work documentation style, this style assumes we’re all still figuring stuff out and tries to be thorough. Some will be beginner stuff and some intermediate. Today is an introduction to SQL written for the people who are already database searching experts—librarians.²

Several years ago, a reference coworker with some technical aptitude sat me down and asked me what SQL was about and whether she could ever learn it. My answer was that, knowing what I did of her work, she probably wouldn’t need to know anything about SQL for her job. But yes, I did believe she could learn it. I think librarians, particularly the ones who spend the most time in subscription databases, have the right mindset to work with SQL.

So if she doesn’t need to learn about it, when might it be appropriate for a librarian to learn SQL?

Whether you need to know it depends on how deeply you have to dive into the software and systems that support the library’s work and what they run on. For example, at my place of work we have a MySQL database where we hold metadata before turning it into Fedora objects. I use SQL to extract that metadata into spreadsheets and then run update queries to improve that metadata. A lot of the work done around migrating Archivists’ Toolkit installations to ArchivesSpace involved digging into both databases. And it translates. Understanding SQL may help you better understand an API for a service you’re trying to set up. I’d definitely recommend it for metadata types and digital archives/collections types.

Preliminaries: Things to Know But Not Worry About

SQL has dialects based on the type of setup you’re using. What I’m going to be talking about relates to MySQL databases. You may be using a PostgreSQL or SQL Server. You can look up dialectical differences here and find threads on various forums talking about why MySQL doesn’t support FULL OUTER JOIN, but my advice for starting out is not to worry too much about it. Most of it’s the same.

Why backticks? Why only sometimes? When looking around, you may see SQL written as

SELECT `comment_author` FROM `wp_comments`;

In MySQL, backticks are used to to indicate that the phrase is some kind of identifier: a database name, a table name, or a field name. It lets MySQL support spaces in field or table names³ and lets you use a reserved word like from as a name without causing errors. If you’re not worried about those use cases, you don’t need to use it but you always can. I’m going to use them in my prose so that you know I’m talking about a database/table/field but not in my code examples because I generally write SQL without them.

Why are Librarians Suited to SQL?

Because querying information is what we do. You’ve thought “ok this is what I want, now where do I go to get it and what should I put into the fields?” In many cases we’re already doing mediated queries which are functionally indistinguishable from SQL. When you’re saying that you want to search for Article from a database where a Title is “Name of the title,” what you’re doing is essentially the same as writing a SQL query. Reports used to get data from or update data in your ILS use the same type of querying.

You know how to do this. All you need is the language to express it. I have faith. Let’s do this.

Get to Know Your Database

Assuming you’ve got access to run queries⁴, now you need to know what you’re querying. You know how you get familiar with running a JSTOR search? You know how to do basic searches and when an advanced search might be needed. You’re familiar with the fields which you can conveniently access from drop-downs or checkboxes. And even if you’re not familiar, you can figure it out because it’s all there on the page. While SQL databases may not be quite as up-front about what’s in them (and some do have composition windows that lets you use fields as drop-downs), it’s still easy to get that information.

Databases contain multiple tables, generally divided by the type of data they store, like omeka_options or accessions. Tables then contain fields, which you can think of as being like an Excel column. In a very simple table, one field might be for a book’s title, the next for its date of publication, the next for its table of contents, etc..

There are two ways to find out what tables are in your database and what fields those contain and a third that may help you get a real idea of what’s in those tables.

Table Information Through the Interface

First, most interfaces actually give you a visual display of the tables and fields.. If I go into phpMyAdmin as seen below, I can click on a database and see a list of its tables plus other info about the tables which I may not need to know if I’m not doing high-level work.

Now if I click on a table, I can see a list of its fields (again, think column names) as well as the first set of records in the fields. This gives me a good idea of what the data actually looks like.

What’s up at the top of this? That’s right, phpMyAdmin is actually using a SQL query to generate this page. Zooming in, we see:

Querying for Detailed Table Information

Second, we can get this info by running MySQL queries. If we run:

SHOW TABLES;

our client should reply with a list of all the tables in the database. Now we know table names!

Then we need to know what the field names are and maybe a bit about what kind of info is in the fields. Using the table name, we run:

DESCRIBE omeka_options;

and get

The first column is field names. The second tells us what kind of info is in the fields. You can look up what each of those mean. These two are the most important (along with key) when you’re starting out. The third tells us if it’s ever allowed to be empty or NULL. The fourth tells us about whether any of these are keys and what kind. Keys are a whole thing about which I should probably write a short post, but you don’t need them for simple queries. The fifth tells us if there is any default value, such as “false” or “true,” but this doesn’t have any. And the final column tells us anything additional— in this case we know that the ID auto-increments.

Querying for Example Table Data

Remember when I zoomed in above to show that phpMyAdmin was using a query to generate data? The most basic way to get an example of what’s in the table is to run:

SELECT *
FROM table_name
LIMIT 25

I’ll explain the elements that make up this query below, but the result should be the first 25 entries from the table. I can look through it to see things like if a field called location uses the value “new” when a book is on the new books shelf, “sl2” when it’s on stack level 2, etc.. This will help me construct queries later.

Write, Write, Write

When getting started, it can be easiest to do this using a pencil and scratch paper, but you can also make notes in a document on your computer as you figure it out.

You’re querying because you WANT something. You probably want information. Maybe you want to build a spreadsheet with some info. Maybe you want a specific list of titles that match X. Maybe you want an article title and the DOI you can use to get to that article title.

Despite having used SQL for around 8 years now, I still start writing things down when I’m doing really complex stuff. Use what you’ve written as you walk through the section about constructing query syntax. Make notes on it as you understand better how you’d express it in the SQL syntax.

Now We Query

Finally! This is what we’ve been building toward. Let’s walk through the syntax by levels of complexity in the query. Every query needs two basic clauses: a SELECT and a FROM.

SELECT — does what it sounds like. SELECT is just going to pull out data, as opposed to commands which change the data: UPDATE or DELETE or INSERT. Most of the time, you’ll be listing field names you want to select, like title. Separate multiple fields with commas. Use a * to select every single field. You can refine a SELECT by starting with * like we did in the query above and then deciding which of those fields you actually want.

FROM — name of the table. Normally that’s just table_name, but sometimes it can be database_name.table_name (with backticks around each). See the period? We use periods to join database, table, and field names for specificity in queries. You only do this when you’re going to be searching multiple databases and multiple tables, which we’re not doing there.

The next most common clause in a query is WHERE.

WHERE — introduces the ability to limit what we’re getting. You may only use one WHERE (see AND/OR). Here’s an example query below. It uses WHERE to specify that we only want books published in 2015. Notice we don’t have to SELECT publication_date to use it in the query. A query can refer to any field in any table you’re referencing in the query.

SELECT title
FROM books
WHERE publication_date = '2015';

WHERE can use a lot of comparison operators besides the = like < > != <> !< !> <= >=, all of which you can learn more about by searching for “SQL Comparison Operators.” It can also use logical operators, including: LIKE, IS NULL, IS UNIQUE, EXISTS, and BETWEEN. I’m not going to spell it out here, but a search for “SQL Logical Operators” will help you out on this one.

After WHERE, we get AND, OR, and NOT. These are all considered to be part of the WHERE clause as you can’t use them without one.

AND — ah the familiar Boolean search. You may use as many ANDs as necessary to add additional requirements to your search. So you might want the date of publication to be in 2015 AND the publication to be from Simon & Shuster. As with an article database or a catalog, AND is limiting and all restrictions must be met.

SELECT title
FROM books
WHERE publication_date = '2015'
AND publisher = 'Simon & Shuster';

OR — more Boolean. Again, use as many as necessary. Maybe you’re looking for books on the new shelf OR books published in 2015.

SELECT title
FROM books
WHERE publication_date = '2015'
OR location = 'new';

NOT — you knew this was coming, right? But unlike in Boolean, NOT needs to be used with WHERE, AND, or OR to do its negation. Let’s suppose that we’re now looking for books that were published in 2015 but haven’t made it to the new shelf yet!

SELECT title
FROM books
WHERE publication_date = '2015'
AND NOT location = 'new';

This can also be done by simply using the != or <> operators to say location != 'new', but there may be reasons you want to do it this way.

You can also nest AND, OR, and NOT within parenthetical statements. Perhaps we know that the data wasn’t always entered cleanly. So we want something from Simon & Shuster but we know somebody probably wrote it as “Simon and Shuster” at least once. Again, this is pretty similar to other searches we do.

SELECT title
FROM books
WHERE publication_date = '2015'
AND (publisher = 'Simon & Shuster' OR publisher='Simon and Shuster');

Since this is a basic overview and we’ve hit up all the basic elements, let’s move to the two basic ways to control output.

LIMIT — lets you control where you start and stop looking at results. LIMIT 25 is acceptable, but you can also write LIMIT 0,25 (like many programming languages, SQL starts counting at 0). You can then run the query a second time and say LIMIT 26, 50 to get the next 25 results. This is really useful when you only need a sampling of results or when you don’t want to overload the server

In the examples above, phpMyAdmin is automatically adding LIMIT 0, 30 when I don’t put in my own limit. At work, I often do LIMIT 0, 1000 and then 1001, 2000, etc. If server you’re querying gets heavy use or your query is really complex, you probably want to lower that number.

ORDER BY— lets you choose how you get the results. Just like anything referenced in the WHERE clause, it doesn’t actually have to be in the list of fields you’re SELECTing. You can select whether you want it ascending (ASC) or descending (DESC). ASC is default, so you don’t have to add it unless you’re being very intentional.

SELECT title
FROM books
WHERE publication_date = '2015'
OR location = 'new'
ORDER BY publication_date DESC;

Summary

That was a lot, so let’s put it together and pare down to the essentials.

To build a SQL query, you first need to know a bit about the database you’re querying from.
Then figure out what it is that you want. Find where the info is in those tables and fields.
Write down your thoughts and refine them as you go.
And finally refine what you wrote down using the section with a basic intro to syntax.

I’m going to write more tutorials in the future, like “how do I get data from two tables?” with companion “what’s a key?”. Most, I think, will be shorter and will refer to this master intro post. I’m open to feedback about style and content. There are a lot of reference sites out there, but that’s not exactly what I’m trying to do here, although I hope this will be a good reference for those who’ve used it. What I hope is that this tutorial made SQL a little less confusing and helped you see that it’s just a way of expressing skills you had all along.

Follow-up Posts on SQL

Get easy hands-on experience with an SQL test table I set up

Footnotes, a.k.a. Rejected Parentheticals

If your library has a subscription, many O’Reilly books are available through Safari. Database Design for Mere Mortals was the text we used in my Database Design/Management class at CUA (with Bruce Hulse, I recommend it if you’re a student). ↩︎
…or archivists, let’s not leave my hybrid-self out! But I’m going to use just the one term. ↩︎
But why would you do this?? Why?? Just use underscores. ↩︎
And for the love of all things holy, stick with SELECT queries until you’re really good at SQL and even then always run SELECT to ensure you’ve got things right before running UPDATE or DELETE. ↩︎