Tutorial: Setting Up a Traject Project on Your Windows Machine

2018-09-13 cataloging, metadata, tech

This tutorial is aimed at metadata workers who may be using Traject, a Ruby gem that can index MARC records for Solr. Right now at Penn State, we’re using it to index our entire catalog in Solr for a Blacklight project. I’ve got a lot of tickets to choose the right fields and subfields for our index and display, so I needed a fast, easy way to test on my own computer. My computer runs Windows, which added some challenges.

The tutorial assumes:

You understand some basics of the Windows command line, such as changing directories.
You can read MARC documentation and have some basic knowledge of how MARC fields, indicators, and subfields work.
You know a little about programmatic principles. Have you ever written a script or tweaked someone else’s? You’re good. A lot of this is copy-paste, edit, then test by running the same command over and over in the terminal.

It does not assume:

You understand Ruby.

Getting Set Up

You’ll need the following things installed:

Ruby for Windows: https://rubyinstaller.org/
Git / Git Bash for Windows: https://git-scm.com/downloads. If you want to be able to use Git from the Windows command line and the Ruby terminal, be sure to choose the installation option that makes Git available from the Windows command line. Otherwise, do the Git cloning pieces from Git Bash.
A text editor (e.g. Sublime, Atom)

And you’ll need to assemble the following:

A traject project. Your own or my starter project.
A sample set of your MARC records. Or use the sample MARC included in the starter project.

Creating/Cloning Your Own Traject Project

If you’re not using an existing traject project, you’ll need to create your own. I’ve created a starter traject project which you can clone and use. If you feel like an expert, you can try following the instructions in Traject’s docs.

To clone my starter project, change directories to where you want it, then run:

git clone https://github.com/ruthtillman/starter_traject.git

Installing Traject

Either clone/download an existing traject project into a directory on your computer or build your own (see above).
From the start menu, open Ruby terminal (“Start Command Prompt with Ruby”) and change directories to your traject project. You may need to run as admin¹.
Run gem install bundler to install bundler if you don’t already have it.
Run bundle install. On this step and the previous one, you may get prompts from Windows Defender or other monitors about Ruby connections. Choose yes. The install may also freeze, which is apparently an IPv6 issue. If it does, try again in a minute.
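If you want to confirm that both installs worked before moving on, a short Ruby check like the one below is one way to do it. This is only a sketch: the method name gem_installed? and the file name check_gems.rb are my own inventions, not part of traject or bundler. Save it as check_gems.rb and run ruby check_gems.rb in the Ruby terminal.

```ruby
# check_gems.rb (hypothetical file name)
# Reports whether the gems this tutorial relies on are installed.

# Hypothetical helper: true if RubyGems can find at least one
# installed version of the named gem.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

%w[bundler traject].each do |name|
  status = gem_installed?(name) ? "installed" : "missing"
  puts "#{name}: #{status}"
end
```

If either gem shows as missing, rerun the matching install step above.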

Building/Testing Your Index

Rather than send the data to a Solr index, this walkthrough uses traject’s debug mode to create a text file. With larger MARC extracts, that text file can still be enormous. Just indexing all the subjects in our catalog creates a text file that’s over 100MB, which is hard to open in a text editor. A future walkthrough will include tips for ensuring it doesn’t get unmanageable and ways to interact with very large faux-indices via grep.

The sample traject repository comes with its own sample config file and sample MARC file (a download created from Library of Congress MARC records). Run the following command in the Ruby command prompt to apply the sample config file to the sample MARC file and write out a sample index text file:

traject --debug-mode -c sample_config.rb sample_marc.mrc > sample_index.txt

If your MARC file is not in your traject directory, then you’ll need to include the full filepath as /path/to/file.mrc (or, because this is Windows, c:\user\path\to\file.mrc; if any directory name contains spaces, wrap the whole path in double quotes). Similarly, you can name your sample_index.txt whatever you want, or remove the > sample_index.txt part to simply output results to the command line. Note that the file sample_index.txt is in the sample traject repository’s .gitignore. If you want to commit versions of your index, you’ll need to remove it there.
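When the output file gets too big to browse comfortably, you can pull out just the lines for one field with a few lines of Ruby. The sketch below assumes each line of the debug output is whitespace-separated as record id, then field name, then values; check a few lines of your own file before relying on that. The method name lines_for_field and the file name filter_index.rb are hypothetical.

```ruby
# filter_index.rb (hypothetical file name)
# Prints only the lines of a debug-mode index file that belong to one field.

# Hypothetical helper. Assumes each line looks like:
#   <record id> <field name> <values...>
# Reads the file one line at a time, so large files are fine.
def lines_for_field(path, field_name)
  File.foreach(path, encoding: "UTF-8").select do |line|
    # Split into at most three pieces; the second piece is the field name.
    line.split(" ", 3)[1] == field_name
  end
end

if __FILE__ == $PROGRAM_NAME
  index_path = ARGV[0] || "sample_index.txt"
  field      = ARGV[1] || "title_t"
  if File.exist?(index_path)
    lines_for_field(index_path, field).each { |line| puts line }
  else
    warn "No such file: #{index_path}"
  end
end
```

Run it as ruby filter_index.rb sample_index.txt title_t, swapping in whichever file and field name you are checking.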

The sample_config.rb file includes Encoding.default_external = "UTF-8" at the top, which we found necessary at Penn State on Windows.
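If you are writing your own config rather than using the sample, the same line can go at the top of it. A minimal sketch follows; the puts line is only there so you can see the setting take effect and is not needed in a real config. On some Windows setups Ruby may otherwise assume the system code page when reading files, which is my understanding of why the line helps.

```ruby
# Top of a traject config file (sketch).
# Tell Ruby to treat files it reads as UTF-8 unless told otherwise.
Encoding.default_external = "UTF-8"

# Optional check: print the encoding Ruby now assumes for external data.
puts Encoding.default_external
```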

Writing/Editing Your Index

To change what’s in your index, open sample_config.rb in your text editor and add, edit, or delete lines. Add more fields, change subfields, etc. I’m pasting some sample code below: the first set is from the main traject repo (with some small edits) and the second is some of my own samples. You’ll notice that I’m using trim_punctuation for these, which removes punctuation from the beginning and end of each value. alternate_script controls linked 880s. By default, they’ll also be extracted. If you set it to false, they won’t be. If you set it to :only, only the linked 880s will be included. Separating them out may make character handling easier.

If you know Ruby, you can try taking it further. However, just copying/pasting/editing what’s below, saving, and running the command line above should get you a fair way.

Traject instructions (adapted):

# Take the value of the first 001 field, and put
# it in output field 'id', to be indexed in Solr
# field 'id'
to_field "id", extract_marc("001")

# separate multiple fields with a :
to_field "title_t", extract_marc("245ab:130")

# Can limit to certain indicators with || chars.
# "*" is a wildcard in indicator spec.  So this is
# 856 with first indicator '0', subfield u.
to_field "email_addresses", extract_marc("856|0*|u")

# Can list tag twice with different field combinations
# to extract separately
to_field "isbn", extract_marc("020a:020aq")

# For MARC Control ('fixed') fields, you can optionally
# use square brackets to take a byte offset.
to_field "language_code", extract_marc("008[35-37]")
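The strings passed to extract_marc pack a lot into a small space, so it can help to see one pulled apart. The sketch below is my own illustration of how the pieces read (tag, optional indicators between | characters, subfield codes, or a byte range in square brackets). It is not traject’s parser, the method name describe_spec is hypothetical, and it only handles three-digit tags.

```ruby
# Hypothetical helper for reading extract_marc strings.
# Not traject's own parser; handles three-digit tags only.
def describe_spec(spec_string)
  # A colon separates independent field specs.
  spec_string.split(":").map do |part|
    if (m = part.match(/\A(\d{3})\[(\d+)(?:-(\d+))?\]\z/))
      # Control field with a byte offset or range, e.g. 008[35-37]
      { tag: m[1], bytes: (m[2].to_i..(m[3] || m[2]).to_i) }
    elsif (m = part.match(/\A(\d{3})(?:\|(.)(.)\|)?([a-z0-9]*)\z/))
      # Data field: optional |xy| indicators ("*" is a wildcard), then subfields
      { tag: m[1], indicator1: m[2], indicator2: m[3], subfields: m[4].chars }
    else
      raise ArgumentError, "unrecognized spec: #{part}"
    end
  end
end

# Try it on the specs used above.
["245ab:130", "856|0*|u", "008[35-37]"].each do |spec|
  puts "#{spec} => #{describe_spec(spec).inspect}"
end
```

An empty subfields list means no subfields were named in the spec, as with the 130 above.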

My own samples:

to_field 'subject_topic_facet', extract_marc("600|*0|abcdq:610|*0|ab:611|*0|ab:630|*0|ab:650|*0|a:653|*0|a", :trim_punctuation => true)
to_field "published_display", extract_marc("264|*1|abc3:260abcefg3", :trim_punctuation => true, :alternate_script => false)
to_field 'published_vern_display', extract_marc("264|*1|abc3:260abcefg3", :trim_punctuation => true, :alternate_script => :only)
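To get a feel for why trim_punctuation matters for fields like 260 and 264, where values often end in ISBD punctuation such as " :" or ",", here is a rough stand-in you can run on its own. It is not traject’s implementation, which handles more cases than this; the method name rough_trim is hypothetical and it only deals with a single trailing comma, slash, semicolon, or colon.

```ruby
# Hypothetical, simplified stand-in for the kind of cleanup
# :trim_punctuation performs. Not traject's actual code.
def rough_trim(value)
  # Drop one trailing , / ; or : plus any whitespace around it.
  value.sub(/\s*[,\/;:]\s*\z/, "").strip
end

puts rough_trim("Philadelphia :")     # prints "Philadelphia"
puts rough_trim("University Press,")  # prints "University Press"
```

Comparing your debug output with and without :trim_punctuation => true is the surest way to see what traject itself does with your records.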

Make your changes, run and check results, and try again. The best part about working with a MARC dump on your local machine and not writing to a Solr index is that it’s hard to mess things up. Just try fields and subfields until you like what you see.

Footnotes

¹ For Penn State folks or others using Privilege Guard, search for the Ruby terminal in the Start menu, right-click to “open file location,” then right-click on it in the directory and run with Privilege Guard. Those who have admin privileges may choose to simply right-click on it in the Start menu and “run as administrator.”