Extra-curricular Challenge: Archiving a Xanga Blog

2011-11-06 archives, tech

In LBSC 605, Intro to Archives, I did a literature review of articles on blog archival. I found so little that dealt with actual blogging that I had to expand it to blogs and dynamic websites. It was a bit disappointing, but preparing that review reminded me of a little blog that I wanted to save.

The Blog

In the fall of 2004, my mother was diagnosed with terminal cancer. One of her many concerns became the preservation of family stories, mostly the ones she’d told us as kids or the ones which had been told her by older relatives who were already gone. At my sister’s suggestion, she began blogging the stories in early 2005 using Xanga.

The blog consists of 19 posts from 2005 to mid-2006. Her posting frequency was affected by her treatments and she eventually began writing down the stories by hand in her free time, which she found easier than sitting at a computer. It’s an incomplete scrap of a blog, but it’s also 19 stories which I’m not sure she duplicated elsewhere. The blog hasn’t been touched since 2006.

My goal is to end up with three versions:

1) Archive of the site in the full HTML form it has on the web (i.e. the pure & untouched files, but containing a lot of unnecessary or xanga.com-dependent code).
2) Functional and self-contained local site with unnecessary code redacted, internal linking changed to reflect local site file names, local scripts and files substituted for linked Xanga scripts and files, and anything else necessary to make the blog usable independently and with no internet connection.
3) Content of the posts extracted into a plaintext document and a PDF document.

I want to preserve the original site form, but in order to make it function offline I’ll have to do some major editing. This shouldn’t be a real problem once I determine the layout (see below). Because of that editing, I plan to keep a zipped copy of the original files so that a) I can start over if needed and b) the original is retained. I also realize that a representation of the site in its original form is probably not the most useful tool for someone who simply wants to read my mother’s stories. Therefore I’ve decided to copy and paste the post content into two document formats that should be fairly accessible.
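Keeping that zipped copy could be scripted. This is only a minimal sketch using Python's standard zipfile module; the function name zip_originals and the folder and archive names are my own placeholders, not anything Xanga provides.

```python
import zipfile
from pathlib import Path


def zip_originals(source_dir, archive_path):
    """Zip every file under source_dir, keeping relative paths.

    Assumption: archive_path lives outside source_dir, so the
    archive does not try to include itself.
    """
    source = Path(source_dir)
    written = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to the site folder, with forward slashes.
                arcname = path.relative_to(source).as_posix()
                zf.write(path, arcname)
                written.append(arcname)
    return written


if __name__ == "__main__":
    # Hypothetical folder and archive names, for illustration only.
    if Path("original-site").is_dir():
        print(zip_originals("original-site", "original-site.zip"))
```

Running it once, before any editing starts, leaves an untouched copy to fall back on.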

Putting it in a document form should take all of an hour or two. In fact, it’s what most people would do instead of archiving the site. However, I’d like to use this as an opportunity to learn about the processes and challenges of blog archival. As it’s a small blog, I hope not to be overwhelmed by the size of the project.

My Initial Steps

1) Identify all pages that need to be saved.

According to the blog archive, the blog contained 19 posts. However, a blog is more than its individual post pages.

First, there were the index page and “Next 5” pages. As is customary in blog format, each contained the full content of 5 posts in reverse chronological order (most recent posts on the front page, etc.). I saved all four instances of the index page (for 19 posts at 5 per page, this came to 3 pages of 5 posts and one of 4).

Then there were individual monthly archives pages. The format was: Main Archives Page (linked to on each page of the blog) -> Monthly Archives Page (containing links to individual posts). In order to create a holistic representation of the original site, I saved each of these instances as well.

2) Determine what code needs replacement and what can be deleted.

After I saved all the original site HTML, I started looking at the page structure. Some of the display is dependent on external files from xanga.com. Other code is unnecessary for display (such as tracking scripts or scripts which allow one to post comments).
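One way to get a first inventory of those external dependencies is to scan each saved page for script, stylesheet, and image references that point back at xanga.com. The sketch below uses Python's standard html.parser; the class and function names are my own, and the URLs in the usage example are invented for illustration rather than taken from the real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalRefFinder(HTMLParser):
    """Collect (tag, url) pairs for resources hosted on a given domain."""

    def __init__(self, domain="xanga.com"):
        super().__init__()
        self.domain = domain
        self.refs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "link", "img"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname or ""
                # Match the domain itself and any subdomain of it.
                if host == self.domain or host.endswith("." + self.domain):
                    self.refs.append((tag, value))


def find_external_refs(html, domain="xanga.com"):
    finder = ExternalRefFinder(domain)
    finder.feed(html)
    return finder.refs


if __name__ == "__main__":
    # Hypothetical markup, not copied from the actual blog.
    sample = (
        '<link rel="stylesheet" href="http://s.xanga.com/style.css">'
        '<script src="http://www.xanga.com/track.js"></script>'
        '<img src="photo.jpg">'
    )
    for tag, url in find_external_refs(sample):
        print(tag, url)
```

The list only says where the dependencies are. Deciding whether each one gets a local substitute or gets deleted is still a judgment call made by hand.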

At this point, I’m still determining the overall structure. I expect that my next step will be coming up with a plan for restructuring the files and possibly a step-by-step process to make it go faster while not missing anything important. I will also need to save any necessary scripts and CSS files which are stored on Xanga itself.

The Initial Challenges

1) File naming conventions:

Xanga follows a postid/post-slug/ format. I needed an easy way to name the HTML files so that I could look at an internal blog link and know how to link it to the appropriate .html file on the local site.

I used the format postid-post-slug.html to name the files so that I could easily alter internal links by just deleting the domain and slightly changing the format, then adding .html. Some posts weren’t titled externally and therefore didn’t have post slugs. If the post didn’t have a slug, it just became postid.html.
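The renaming rule is mechanical enough to write down as a small function. This is a sketch in Python under the assumption that a post link's path looks like /postid/post-slug/ (or just /postid/ for untitled posts); the function name and the example URL are made up for illustration.

```python
from urllib.parse import urlparse


def local_filename(url):
    """Turn a post link into the local file name.

    /postid/post-slug/  ->  postid-post-slug.html
    /postid/            ->  postid.html
    """
    # Drop the domain and split the path into its non-empty pieces.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        # Assumption: a bare link to the blog's root means the front page.
        return "index.html"
    return "-".join(parts[:2]) + ".html"


if __name__ == "__main__":
    # Hypothetical post URL, not a real one from the blog.
    print(local_filename("http://example.xanga.com/123456789/a-family-story/"))
```

With 19 posts the links can be changed by hand just as easily, but having the rule written out makes it harder to apply inconsistently.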

For the index pages, I used index.html to indicate the blog’s front page at the time of archiving and index-2.html, etc., for the subsequent pages.
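As a quick check on the index naming and the page count, here is a short Python sketch. The function name is my own; the only inputs are the 19 posts and 5 posts per page described earlier.

```python
import math


def index_pages(total_posts, per_page=5):
    """Return (file name, number of posts) for each index page."""
    page_count = math.ceil(total_posts / per_page)
    pages = []
    for n in range(1, page_count + 1):
        # The front page is index.html; later pages are index-2.html, etc.
        name = "index.html" if n == 1 else f"index-{n}.html"
        posts_on_page = min(per_page, total_posts - (n - 1) * per_page)
        pages.append((name, posts_on_page))
    return pages


if __name__ == "__main__":
    for name, count in index_pages(19):
        print(name, count)
```

For 19 posts this gives index.html through index-4.html, with the last page holding 4 posts.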

2) Defining necessary elements of the page:

Part of step 2 involves making judgment calls on what’s really a part of the site and what’s extraneous data which can be lost. Unseen and database-reliant scripts can easily be removed, but what about elements like a comment box? I plan to retain comments left by the friends and family who visited it, but is the comment box necessary as an example of how comments were entered? It doesn’t serve the family members, but might it serve people down the road if someone were to look at this site as a sample of a 2005-2006 blog? Or would that person be looking at the original files anyway and not need it to display here?

What about elements like the Xanga sign-in link? It’s no longer necessary or functional on the site. Or what about elements which were probably added later by changes to some of the scripts, like Twitter or Facebook sign-in links in the comment section? Those were certainly added after the blog was abandoned in 2006.

I’m over-thinking this part, mostly because I’m weighing how I might do things differently if I were saving the blog for an archive with a broad user base instead of a small family group. In fact, after working on Goal 2 for a while, I decided I should probably skip ahead to Goal 3 the next time I have a chance to work on it and extract the post content into usable files.

To be continued, as the semester allows…