We got good results from a setup almost identical to Dan's - HP Scanjet 4c
and page feeder with Omnipage 8.0. If I had the original clean typewritten
copy I could get results approaching 100%, maybe only one or two mistakes
per page. We did not expect to be able to get such good accuracy. One of
the things that helped was that I created a "training" file for Omnipage by
associating suspect characters with the correct ones. This did take some
time, but with training the system did get more accurate. Omnipage features
an edit mode after scanning that flags suspect words and words not found in
its spell-checking dictionary, which was also helpful. Names were a problem
for this, of course, so everything still had to be completely proofed.
The things that produced very poor text were multi-generation copies and
dot-matrix printing. These were so bad that re-keying was a better solution
than trying to edit the text. On a positive note, as an experiment I pulled
cards for several collections from our extensive card catalog, photocopied
them and OCR'd the copies. These took more work (for instance, I had to
manually "zone" the areas for OCR) but produced results good enough to show
that there is the potential for transferring typewritten text from old and
heavily used cards. I think the key overall was to use the cleanest,
hopefully darkest originals available, and use the training files to better
the accuracy rate.
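
For anyone weighing accuracy figures like the 95% Michael mentions below,
here is a rough back-of-the-envelope sketch in Python. It is only an
illustration: it assumes character errors are independent of one another, an
average word of about 5 characters, and about 2,000 characters per typed
page. Those numbers and the function names are my own assumptions, not
anything measured here or reported by Omnipage.

```python
def word_error_rate(char_accuracy: float, chars_per_word: float = 5.0) -> float:
    """Chance that a word contains at least one misread character.

    Assumes each character is misread independently (a simplification).
    """
    return 1.0 - char_accuracy ** chars_per_word


def words_per_error(char_accuracy: float, chars_per_word: float = 5.0) -> float:
    """Average number of words read per word that contains a typo."""
    return 1.0 / word_error_rate(char_accuracy, chars_per_word)


def errors_per_page(char_accuracy: float, chars_per_page: int = 2000) -> float:
    """Expected number of misread characters on one page.

    chars_per_page=2000 is an assumed figure for a typed page.
    """
    return (1.0 - char_accuracy) * chars_per_page


if __name__ == "__main__":
    for accuracy in (0.95, 0.99, 0.999):
        print(
            f"{accuracy:.1%} characters correct: "
            f"about 1 bad word in {words_per_error(accuracy):.1f}, "
            f"about {errors_per_page(accuracy):.0f} bad characters per page"
        )
```

Under those assumptions, 95% character accuracy works out to roughly one bad
word in every four or five, which matches Michael's estimate, while one or
two mistakes per page corresponds to character accuracy of about 99.9% or
better.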
From: Fox, Michael <[log in to unmask]>
To: Multiple recipients of list EAD <[log in to unmask]>
Date: Friday, April 23, 1999 9:26 AM
Subject: Re: HTML or ASCII?
>I would be curious to hear the experiences of others on the list with
>scanning of existing finding aids. I have read several reports of very
>good experiences like the one Dan reports but have had quite poor results
>ourselves (using various OCR software including Omnipage).
>Perhaps it's because so many of our inventories are older and were produced
>on manual and electric typewriters.
>95% accuracy rates sound good until you realize that this means that every
>fourth or fifth word has a typo. And the spell checker doesn't seem to
>help much with inventories that contain a lot of names.
>Head of Processing
>Minnesota Historical Society
>345 Kellogg Blvd West
>St. Paul MN 55102-1906
>**NOTE NEW AREA CODE EFFECTIVE JULY 12, 1998**
>> From: Dan Linke[SMTP:[log in to unmask]]
>> Sent: Wednesday, April 21, 1999 4:27 PM
>> To: Multiple recipients of list EAD
>> Subject: Re: HTML or ASCII?
>> You did not mention in what form you gave the finding aids to APEX. I am
>> assuming that you gave them paper copies which they scanned, OCRed, and
>> converted. If they are going to all that effort, they could surely give
>> an electronic version without the EAD encoding, and the cost cannot be
>> much greater since it's only a matter of one of the steps along the way
>> toward encoding. I would pursue that route before attempting to strip
>> coding out manually at your end.
>> However, related to this and for future planning, we are scanning
>> paper-based finding aids with an HP scanner (cost of $700) with Caere's
>> Omnipage 8.0 (an OCR program, cost about $400, less if it's an upgrade
>> from software included with the scanner; we paid only $100) with very good
>> results. It saves the file as a Word file and then also converts it to
>> HTML. (You could also save it as an ASCII file once it was in Word.)
>> Last summer we employed a student to do this nearly fulltime and she was
>> able to scan, OCR, and convert nearly 100 pages per day. This rate varies
>> according to each finding aid's "page density" but is based on 20 finding
>> aids totalling over 3500 pages which were completed in 37 working days. I
>> mention this because the cost of the student labor is significantly less
>> than APEX's hourly rate, I suspect, and if you provide them with an
>> electronic version rather than paper, you may stretch your grant dollars
>> that much more. Of course, you need to find a good student who is
>> Hope this helps.
>> David Delorenzo wrote:
>> > Colleagues--
>> > I need your advice on a problem I have encountered with our project to
>> > convert and encode our finding aids.
>> > None of our 3,000+ finding aids are available in electronic form. I
>> > received a grant which I hope can kill several birds with one stone. My
>> > goals for the project are: 1) convert the finding aids into electronic
>> > form, 2) acquire an electronic text version (ASCII or something else
>> > that I can manipulate in word processing software (MS WORD) and an
>> > HTML writer/editor (Netscape Composer)), and 3) acquire an EAD-encoded
>> > version.
>> > We have hired Apex (as we are an RLG member) to convert and encode the
>> > finding aids. We plan to send the EAD versions to Archival Resources.
>> > Because I want more flexibility for future uses of the finding aids
>> > (whatever they later may be given advances in technology), I would like
>> > to maintain locally a text version (which for now could be manipulated
>> > using MS WORD). Because Archival Resources is available only
>> > fee-for-service, and I don't have the technical support necessary to
>> > maintain SGML documents, I am also planning on maintaining at our WEB
>> > site an HTML encoded version meeting our specifications for structure,
>> > etc.
>> > Here is the problem. Apex has provided me with a first batch of one
>> > hundred EAD encoded finding aids. I had hoped to be able to use the
>> > encoded versions in other ways by stripping them of the coding BUT, as
>> > with most grand ideas, I have been unsuccessful! Of course, for more
>> > money (which I'd like to spend on other issues), I am sure Apex would
>> > be happy to resolve this matter for me. Before pursuing this option with
>> > the remaining 2,900 finding aids, however, I wanted to know if there
>> > was a "de-babble-izer" that I could purchase to magically remove the
>> > encoding.
>> > I am happy to pay the vendor for the deliverables I need but I wanted
>> > to check with you all first! I look forward to hearing from you.
>> > David
>> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > David de Lorenzo                 201 West Monument Street
>> > Library Director                 Baltimore, MD 21201-4674
>> > Maryland Historical Society      (410) 685-3750 Ext. 309
>> > Library of Maryland History      FAX: (410) 385-2105
>> > http://www.mdhs.org
>> Dan Linke
>> Assistant Archivist for Technical Services
>> Seeley G. Mudd Manuscript Library
>> 65 Olden Street
>> Princeton, NJ 08544
>> 609-258-6345 (v) 609-258-3385 (fax)
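
On David's original question about a "de-babble-izer": a minimal sketch in
Python, assuming the goal is only to drop the tags and keep the text between
them. It uses the standard library's lenient HTMLParser rather than a real
SGML or EAD tool, and I have not tried it on Apex's files, so treat it as an
illustration and not a tested converter. The class name, function name, and
sample record are all hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text: str) -> str:
    """Return the text content of a tagged document as one plain string."""
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Join pieces with spaces so adjacent elements do not run together,
    # then collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


# Made-up sample record, only loosely modelled on EAD element names.
SAMPLE = (
    "<ead><eadheader><titleproper>Smith Family Papers</titleproper>"
    '</eadheader><archdesc level="collection"><did>'
    "<unittitle>Letters &amp; diaries</unittitle> "
    "<unitdate>1850-1860</unitdate></did></archdesc></ead>"
)

print(strip_markup(SAMPLE))
```

Note that this throws away all structure (series, boxes, folders) along with
the tags, and joining pieces with spaces can split a word that has inline
markup in the middle of it. If the layout of the finding aid matters, asking
the vendor for the pre-encoding text, as Dan suggests, is probably the safer
route.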