Here's another method that may work for dealing with special characters, from an LC colleague (Jim Godwin):

>>
Here's a stripped-down MS Word macro that does nothing except convert a document's Unicode codepoints to hexadecimal numeric character references and back. It's probably okay to send outside to someone if you want. You might want to Zip it before attaching it to an external email message.

This function might be useful to someone who has been having trouble saving the entire range of Unicode characters in an XML or HTML or other document. Most browsers and HTML and XML editors accept the hexadecimal numeric character reference form.

To install the macro, save attachment JLGUNI.dot in the folder C:\Program Files\Microsoft Office\Office\Startup. If you do not have such a folder on your computer, check Tools/Options/File Locations to determine the location of your Startup folder. This will make JLGUNI available in a MS Word global template, and the macros will be available through the Tools/Macro/Macros menu.

To convert all of a document's Unicode characters (that is, those outside of the common ASCII characters) to hexadecimal Numeric Character References, click on Tools/Macro/Macros, then select UniToXNCR, and click on Run.

To convert all of a document's hexadecimal Numeric Character References to Unicode characters, click on Tools/Macro/Macros, then select XNCRToUni, and click on Run.

- - Jim
>>

Rather than sending an attachment to the list, these instructions and the file can be found at http://lcweb.loc.gov/ead/practices/technical/unicode.html.

Thanks!
Mary Lacy

On Wed, 2 Apr 2003 13:00:49 -0500, Stephen Yearl wrote:

>Susan:
>
>This may be an indirect route, but it certainly works:
>
>1. open the word document in Open Office <http://www.openoffice.org>
>2. save the file as an *.sxw file (Open Office's native "Word" format)
>3. open the *.sxw file with a zip utility (such as winzip), and extract the content.xml file
>
>All characters outside of 8-bit ASCII are here encoded in unicode. You can process the content.xml file using XSLT, or copy and paste from that file. Of course, since the characters are unicode they will look like gibberish in a non-unicode-compliant editor (such as Note Tab); don't worry, unicode-compliant systems will know how to render them.
>
>Hope this helps some,
>
>St.
>
>Stephen Yearl
>Systems Archivist
>Yale University Library::Manuscripts and Archives
>
>At 11:21 AM 4/2/2003 -0500, you wrote:
>>Emory University and Boston College are working on a joint project to
>>encode our collections relating to Irish literature. We have
>>encountered a number of names, etc which contain special
>>characters/diacritics. We are wondering if anyone has developed a way
>>(routine) to take special characters from MSWord and translate them to
>>unicode entities? If yes, would you be willing to share? Many thanks!
>>
>>Susan McDonald
>>Head of Technical Services
>>Special Collections and Archives Department
>>Emory University
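
For anyone who would rather script these steps than work through Word or a zip utility by hand, here is a rough Python sketch of the two ideas in this thread: the round trip between non-ASCII characters and hexadecimal numeric character references that Jim Godwin's note describes, and pulling content.xml out of an *.sxw file as in Stephen Yearl's steps. It is not the JLGUNI macro and has not been checked against its output. The function names (to_hex_ncr, from_hex_ncr, read_content_xml) are made up for illustration, and the sketch assumes content.xml is stored as UTF-8.

```python
import re
import zipfile

# Matches hexadecimal numeric character references such as &#xE9;
_HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def to_hex_ncr(text):
    """Replace each character outside ASCII with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


def from_hex_ncr(text):
    """Replace each hexadecimal NCR with the character it names."""
    def _decode(match):
        code = int(match.group(1), 16)
        # References beyond the Unicode range are left as they are.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return _HEX_NCR.sub(_decode, text)


def read_content_xml(sxw):
    """Return content.xml from an .sxw file (a zip archive) as text.

    Assumption: the file is encoded as UTF-8.
    """
    with zipfile.ZipFile(sxw) as archive:
        return archive.read("content.xml").decode("utf-8")


print(to_hex_ncr("Se\u00e1n \u00d3 R\u00edord\u00e1in"))
# prints: Se&#xE1;n &#xD3; R&#xED;ord&#xE1;in
```

One caution: from_hex_ncr decodes every hexadecimal reference it finds, including ones for markup characters such as &#x26; (the ampersand), so running it over XML or HTML that relies on those references could leave the file no longer well-formed.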