Thursday, March 31, 2011

Digitization Final Project

I'm about midway through my final project at the Austin History Center. I'm working on digitizing photographs from the Congress Avenue collection, the earliest of which date from the mid-1800s. I started with the rare and fragile photos, which are kept separately from the rest because of their condition. Many of these are among the oldest photos in the collection -- I think the earliest one dated from the 1880s.

The first step of this project, as I suppose with any digitization initiative, was deciding what to digitize. The first round was easy -- everything kept in rare and fragile that hadn't already been digitized. Now that I've completed that step, things get more complicated. I'd like to digitize a portion of the collection that is representative of the wide span of time it covers, so I'm aiming to do about five photos per decade over about a 100-year period, with an end result of around 60 photos.
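As a rough illustration of the per-decade sampling idea, here is a minimal Python sketch. It assumes each candidate photo is represented as a (photo ID, year) pair; the IDs, the years, and the function names are all hypothetical. It also simply keeps the first few photos it sees in each decade, whereas the real selection is a judgment call about which images are most representative.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year such as 1887 to the start of its decade, 1880."""
    return (year // 10) * 10


def pick_per_decade(photos, per_decade=5):
    """Group (photo_id, year) pairs by decade, keeping at most
    `per_decade` photos in each group, in the order they are given."""
    by_decade = defaultdict(list)
    for photo_id, year in photos:
        bucket = by_decade[decade_of(year)]
        if len(bucket) < per_decade:
            bucket.append(photo_id)
    # Sort by decade so the result reads chronologically.
    return dict(sorted(by_decade.items()))


if __name__ == "__main__":
    # Hypothetical candidates, not real items from the collection.
    candidates = [("photo-a", 1885), ("photo-b", 1889), ("photo-c", 1901)]
    for decade, ids in pick_per_decade(candidates, per_decade=1).items():
        print(decade, ids)
```

A grouping like this also makes gaps visible: any decade with no candidates simply does not appear in the result.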

Originally, I had just been scanning everything into one general folder for this project. But as I started needing to do things with the files, I quickly realized I would need a more cohesive system of project management.

The workflow is essentially as follows:

1. Scan the image
2. Scan the back of the image (for use in metadata entry later)
3. Make adjustments to the image (rotation, cropping, etc.)
4. Create a print derivative
5. Create a Web derivative
6. Create a thumbnail derivative
7. Transfer the files from my working folder to the AHC main archive folder
8. Input metadata for each image into the AHC's Visual Resources Database

This would not be so hard to keep track of if it were practical to do each task in turn for each image. But it's much less time-consuming to scan a batch of images, then automate the derivative creation process, and then do the metadata afterwards. That makes it VERY IMPORTANT to keep track of which tasks you've done for which files, so you don't end up scanning something and forgetting to create metadata, or losing track of which files you've adjusted and which you haven't. So I created a spreadsheet that included each task, listed each photo as I scanned it, and marked when I had finished each task. I also subdivided my working folder into (backs), (print), (web), (thumb), (to adjust) and (to transfer), to further help me keep track of things visually.
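For anyone who would rather track this in code than in a spreadsheet, the same idea can be sketched in a few lines of Python. This is only an illustration, not what I actually used: the task names are my own shorthand for the eight workflow steps, and the photo IDs are made up.

```python
from dataclasses import dataclass, field

# Shorthand names for the eight workflow steps (illustrative only).
TASKS = [
    "scan_front", "scan_back", "adjust", "print_derivative",
    "web_derivative", "thumbnail", "transfer", "metadata",
]


@dataclass
class PhotoRecord:
    photo_id: str  # hypothetical identifier, e.g. a file name stem
    done: set = field(default_factory=set)

    def mark(self, task):
        """Record that a task has been finished for this photo."""
        if task not in TASKS:
            raise ValueError(f"unknown task: {task}")
        self.done.add(task)

    def pending(self):
        """Return the unfinished tasks, in workflow order."""
        return [t for t in TASKS if t not in self.done]


def next_batch(records, task):
    """Return the IDs of photos that still need the given task."""
    return [r.photo_id for r in records if task not in r.done]


if __name__ == "__main__":
    records = [PhotoRecord("photo-001"), PhotoRecord("photo-002")]
    for record in records:
        record.mark("scan_front")
        record.mark("scan_back")
    records[0].mark("adjust")
    print(next_batch(records, "adjust"))
    print(records[0].pending())
```

The point of `next_batch` is the same as scanning down one column of the spreadsheet: it answers "which files still need this step?" before a batch operation starts.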

Each photo was taking a while at first, but I think I've really hit my groove now. There are some pretty interesting shots I'm getting to see -- it's especially cool to see the evolution of a familiar area that I know in the present day.

Thursday, February 17, 2011

OCR Diary

For the text-based part of our digitization portfolio, I had one abstract, five news articles, and two transcripts to work on. Of the formats, the news articles were easily the hardest. While the transcripts and the abstract were, for the most part, reasonably clear typewritten pages, the news articles had been cut out from larger papers and glued down to letter-sized paper, but they were coming unglued and had to be placed carefully on the scanner. One of them scanned at a slight angle, which the OCR had major problems with; I had to mostly retype that section.

I have experience as a professional proofreader, so working with OCR came pretty easily. I'm something of a freak and actually enjoy this kind of detail-oriented work. I did wonder about the best way to tackle it, however. Because the newspaper articles were often messy and had a lot of hyphenated line breaks, I opted to read each of these with unrecognized characters highlighted, rather than go through the spell checker -- I didn't trust it to pick up all the line breaks, and some of the text was so messy that it was just easier to go through line by line. After an initial scan of the transcripts and abstract, it was clear that the spell checker was picking up on most of the problems, so I went through those that way to correct OCR errors. Each file had its quirks I had to pick up on. For example, one of the transcripts had an interviewer whose initial was O.; however, the OCR often interpreted this abbreviation as a zero and a lowercase o (0o). Depending on how they were spaced, the OCR may or may not have picked up on this, so I did a find and replace for each one.
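The find and replace for the misread initial could also be done with a single pattern instead of one pass per spacing variant. Here is a minimal Python sketch using a regular expression. It assumes the misreading appears as "0o" or "0 o", sometimes followed by a period; the sample transcript lines are invented for illustration.

```python
import re

# Assumption: the interviewer's initial "O." was misread as a zero plus a
# lowercase o, with or without a space between them and with or without
# the trailing period. The word boundaries keep the pattern from matching
# inside numbers ("100o") or ordinary phrases ("10 of them").
MISREAD_INITIAL = re.compile(r"\b0\s?o\b\.?")


def fix_initial(text, replacement="O."):
    """Replace misread initials and report how many were changed."""
    fixed, count = MISREAD_INITIAL.subn(replacement, text)
    return fixed, count


if __name__ == "__main__":
    # Invented sample lines, not taken from the actual transcript.
    sample = "0o Where were you born?\nA. In Austin.\n0 o. And when did you move?"
    fixed, count = fix_initial(sample)
    print(fixed)
    print(count, "replacements")
```

Returning the count is a small safety check: if it is much higher or lower than expected, the pattern is probably catching the wrong thing and the text is worth rereading by hand.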

Monday, January 31, 2011

My History of Digitization

My interest in digitization came on in a rather roundabout fashion. In fact, it's only within the last five years or so that I've really had much interest in technology at all. I graduated college in 2007 with a degree in journalism -- as part of the first graduating class at my college to be offered a concentration in online journalism, in fact. But I concentrated my studies in the more traditional, long-form magazine journalism, avoiding all that online stuff. Of course, a few months later, after my summer internship had come to an end and I needed to find a real job, I was ruing the fact that my technology skills were sorely lacking, as that's what all the job ads I saw were looking for.

Eventually, I was able to land a public relations job at a national nonprofit. Even though the organization itself was large, our staff was quite small, and when I was hired we were in the last months of a Web redesign. It was all hands on deck, especially where the communications department was concerned. I picked up HTML quickly, and discovered I really liked working on the Web site. I was able to pursue a lot of different technological interests, from putting up Web videos to making press materials available online, but I wanted to learn more.

So that's the tech side. The library/information side has pretty much been there since I got my first library card and realized they had all the stories I could ever want to read, for free. My first paying job was at that same library branch, and as my undergrad drew to a close, I began considering getting the graduate degree ... but I knew I wasn't ready to go straight back to school. I wanted to work on the content side of information for a while, before delving into the organizational structures behind it.

When I did finally decide to go back to school, I found myself especially drawn to the archives track, specifically the intersection it offers between history and modernity. I have always been drawn to the history of the people and places that give context to the events of the present, and after my work on the Web in my last job, I had become very interested in how usage of information adapts to new technologies.

Both interests converged a few months before I left my nonprofit job. We were moving offices, and since the organization had been around since the 1960s, there was a lot of history to be moved with us. Unpacking a box of reel-to-reel tapes, I found one labeled, “Johnny Cash, 1975 PSA.” I’d read that the famous country singer had been a spokesman for us, but still a little thrill went down the back of my spine on discovering this little slice of our history. Unfortunately, we had no way to access the tape, lacking the technology to play it and the budget to have it converted.

I see digitization as the epitome of that intersection of history and modernity that so fascinates me. It isn't something I have had much history with as such, at least not on the back end, but it's a practice I wanted to learn more about. I strongly believe in the idea of democratic access that digitization affords, but I also understand that this entails some trade-offs. It's my goal in this class to become more educated not only about the specific practices of digitization, but also about the overarching issues it raises for management and preservation.