Susan:
This may be an indirect route, but it certainly works:
1. Open the Word document in OpenOffice <http://www.openoffice.org>
2. Save the file as an *.sxw file (OpenOffice's native word-processing format)
3. Open the *.sxw file with a zip utility (such as WinZip) and extract the content.xml file; an *.sxw file is really just a zip archive
In content.xml, all characters outside the plain (7-bit) ASCII range are encoded in Unicode. You can process the content.xml file using XSLT, or copy and paste from that file. Of course, since the characters are Unicode, they will look like gibberish in an editor that is not Unicode-compliant (such as NoteTab); don't worry, Unicode-compliant systems will know how to render them.
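If you would rather script steps 3 onward than use a zip utility by hand, something along these lines might do. It is only a rough sketch, not a tested routine: it assumes content.xml is stored as UTF-8, and the function names (read_content_xml, to_numeric_entities) are my own inventions for illustration. It pulls content.xml out of the *.sxw archive and rewrites every non-ASCII character as a decimal numeric character reference.

```python
import zipfile


def read_content_xml(sxw_file):
    """Return the text of content.xml from an .sxw file (a zip archive).

    Assumption: content.xml is encoded as UTF-8.
    """
    with zipfile.ZipFile(sxw_file) as archive:
        return archive.read("content.xml").decode("utf-8")


def to_numeric_entities(text):
    """Replace each non-ASCII character with a decimal reference such as &#225;."""
    # The standard "xmlcharrefreplace" error handler does the substitution.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

For example, with a hypothetical file name, to_numeric_entities(read_content_xml("finding_aid.sxw")) would return the XML with a name such as "Seán" written as "Se&#225;n". Note that the result still contains all of the OpenOffice markup, so you would still need XSLT or copy and paste to get at the text itself.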
Hope this helps some,
St.
Stephen Yearl
Systems Archivist
Yale University Library::Manuscripts and Archives
At 11:21 AM 4/2/2003 -0500, you wrote:
>Emory University and Boston College are working on a joint project to
>encode our collections relating to Irish literature. We have
>encountered a number of names, etc which contain special
>characters/diacritics. We are wondering if anyone has developed a way
>(routine) to take special characters from MSWord and translate them to
>unicode entities? If yes, would you be willing to share? Many thanks!
>
>Susan McDonald
>Head of Technical Services
>Special Collections and Archives Department
>Emory University