Here's another method that may work for dealing with special characters,
from an LC colleague (Jim Godwin):

>>
 Here's a stripped down MS Word macro that does nothing except
convert a document's Unicode codepoints to hexadecimal numeric character
references and back.  It's probably okay to send outside to someone if you
want.  You might want to Zip it before attaching it to an external email
message.

 This function might be useful to someone who has been having
trouble saving the entire range of Unicode characters in an XML or HTML or
other document.  Most browsers and HTML and XML editors accept the
hexadecimal numeric character reference form.

 To install the macro, save attachment JLGUNI.dot in the folder
C:\Program Files\Microsoft Office\Office\Startup.  If you do not have such
a folder on your computer, check Tools/Options/File Locations to determine
the location of your Startup folder.  This will make JLGUNI available in a
MS Word global template, and the macros will be available through the
Tools/Macro/Macros menu.

 To convert all of a document's Unicode characters (that is, those
outside of the common ASCII characters) to hexadecimal Numeric Character
References, click on Tools/Macro/Macros, then select UniToXNCR, and click
on Run.

 To convert all of a document's hexadecimal Numeric Character
References to Unicode characters, click on Tools/Macro/Macros, then select
XNCRToUni, and click on Run.

- - Jim

>>
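
For readers without MS Word, the same round trip can be sketched in a few
lines of Python. This is only an illustration of the conversion Jim
describes, not the code of his macro: the function names uni_to_xncr and
xncr_to_uni are hypothetical, chosen to echo the macro names, and the
sketch assumes that "Unicode characters" means everything above U+007F.

```python
import re

# Matches hexadecimal numeric character references such as &#xE9; or &#X2019;
_XNCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def uni_to_xncr(text: str) -> str:
    """Replace each character above U+007F with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


def xncr_to_uni(text: str) -> str:
    """Replace each hexadecimal NCR with the character it names."""
    def decode(match: re.Match) -> str:
        codepoint = int(match.group(1), 16)
        # Leave references beyond the Unicode range untouched.
        if codepoint > 0x10FFFF:
            return match.group(0)
        return chr(codepoint)

    return _XNCR.sub(decode, text)


if __name__ == "__main__":
    encoded = uni_to_xncr("Seán Ó Ríordáin")
    print(encoded)               # Se&#xE1;n &#xD3; R&#xED;ord&#xE1;in
    print(xncr_to_uni(encoded))  # Seán Ó Ríordáin
```

Plain ASCII text passes through both functions unchanged, so they can be
run over a whole file that is already mostly markup.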

Rather than sending an attachment to the list, these instructions and the
file can be found at
http://lcweb.loc.gov/ead/practices/technical/unicode.html.
Thanks!
Mary Lacy




On Wed, 2 Apr 2003 13:00:49 -0500, Stephen Yearl wrote:

>Susan:
>
>This may be an indirect route, but it certainly works:
>
>1. Open the Word document in OpenOffice <http://www.openoffice.org>
>2. Save the file as an *.sxw file (OpenOffice's native "Word" format)
>3. Open the *.sxw file with a zip utility (such as WinZip), and extract the
>   content.xml file
>
>All characters outside of 7-bit ASCII are encoded in Unicode in that file.
>You can process the content.xml file using XSLT, or copy and paste from it.
>Of course, since the characters are Unicode they will look like gibberish
>in an editor that is not Unicode compliant (such as NoteTab); don't worry,
>Unicode compliant systems will know how to render them.
>
>Hope this helps some,
>
>St.
>
>Stephen Yearl
>Systems Archivist
>Yale University Library::Manuscripts and Archives
>
>At 11:21 AM 4/2/2003 -0500, you wrote:
>>Emory University and Boston College are working on a joint project to
>>encode our collections relating to Irish literature.  We have
>>encountered a number of names, etc which contain special
>>characters/diacritics.  We are wondering if anyone has developed a way
>>(routine) to take special characters from MSWord and translate them to
>>unicode entities?  If yes, would you be willing to share?  Many thanks!
>>
>>Susan McDonald
>>Head of Technical Services
>>Special Collections and Archives Department
>>Emory University
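
Step 3 of Stephen Yearl's OpenOffice route quoted above (pulling content.xml
out of the *.sxw file) can also be done without a zip utility. The following
is a minimal Python sketch, not part of his message. The function name
read_content_xml and the file name in the usage lines are hypothetical, and
the sketch assumes content.xml is stored as UTF-8, which is what its XML
declaration normally states.

```python
import zipfile


def read_content_xml(sxw_path: str) -> str:
    """Return the text of content.xml from an OpenOffice .sxw file."""
    # An .sxw file is an ordinary zip archive; content.xml holds the
    # document body.
    with zipfile.ZipFile(sxw_path) as archive:
        return archive.read("content.xml").decode("utf-8")


if __name__ == "__main__":
    # "finding_aid.sxw" is a placeholder path for illustration only.
    xml_text = read_content_xml("finding_aid.sxw")
    print(xml_text[:200])
```

The returned string can then be handed to an XSLT processor or searched
directly for the names that carry diacritics.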