Thursday, February 17, 2011

OCR Diary

For the text-based part of our digitization portfolio, I had one abstract, five news articles, and two transcripts to work on. Of the formats, the news articles were easily the hardest. While the transcripts and the abstract were, for the most part, reasonably clear typewritten pages, the news articles had been cut out of larger newspaper pages and glued down to letter-sized paper; they were coming unglued and had to be placed carefully on the scanner. In one case an article had scanned at a slight angle, which the OCR had major problems with; I had to retype most of that section.

I have experience as a professional proofreader, so working with OCR came pretty easily. I'm something of a freak and actually enjoy this kind of detail-oriented work. I did wonder about the best way to tackle it, however. Because the newspaper articles were often messy and had a lot of hyphenated line breaks, I opted to read each of them with unrecognized characters highlighted rather than go through the spell checker. I didn't trust it to pick up all the line breaks, and some of the text was so messy that it was just easier to go through line by line. For the transcripts and abstract, an initial pass made it clear that the spell checker was picking up on most of the problems, so I used it to correct the OCR errors in those files. Each file had its quirks I had to pick up on. For example, one of the transcripts had an interviewer whose initial was O., but the OCR often interpreted this abbreviation as a zero and a lowercase o (0o). Depending on how the characters were spaced, the OCR may or may not have picked up on this, so I did a find and replace for each variant.
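The last two cleanups (hyphenated line breaks and the misread initial) could also be scripted. What follows is only a minimal Python sketch, not what was actually done here, which was a find and replace on each file. It assumes the OCR output is plain text, that the misread initial shows up as a standalone "0o" or "0 o", and that a hyphen at the end of a line always marks a split word. The function names are hypothetical.

```python
import re

def rejoin_hyphenated_breaks(text: str) -> str:
    """Join words split across a line break, e.g. 'digiti-\\nzation'.

    Caveat: this also joins genuine compounds such as 'detail-\\noriented',
    so the result still needs a proofreading pass.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def fix_misread_initial(text: str) -> str:
    """Replace a standalone '0o' or '0 o' with the initial 'O.'.

    The lookarounds keep the pattern from firing inside longer tokens
    such as '100 or' or '0 of'.
    """
    return re.sub(r"(?<!\w)0 ?o(?!\w)", "O.", text)

sample = "0o When did you start the digiti-\nzation work?\n0 o And after that?"
cleaned = fix_misread_initial(rejoin_hyphenated_breaks(sample))
print(cleaned)
```

Handling both spacings in one pattern covers the case where the misreading is spaced differently from line to line, which would otherwise mean running a separate find and replace for each variant.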
