On Wed, 12 Apr 2000, Mark & Erika Reichert wrote:
> ...
> Of the thousands of CJK characters that actually have EACC equivalents at
> charts.unicode.org, roughly 2/3 or more (I don't have the actual numbers in
> front of me right now--but this figure is in the ballpark) of the codes are
> not unique; i.e., there are many Unicode CJK characters that map to the same
> EACC character. Can anyone briefly explain why this is and whether there's
> an algorithm for choosing the correct mapping? Perhaps the answer will
> become obvious reading the NISO CJK standard, which we currently don't
> have--we need to get that one of these days. Or perhaps my parsing of the
> Unihan database a few days ago was incorrect--leading to the duplicate
> mappings--but I don't think so.
There is a fundamental difference in the approach taken by the two systems.
It has to do with the fact that a given CJK symbol can have different "variant
forms", based on the language it is being used in. That is, a particular
symbol, with consistent meaning across languages, may be rendered differently
in Korean, Japanese, Traditional Chinese, and/or Simplified Chinese.
Unicode gives these variant forms one value. That is, only one numeric value
or "codepoint" exists in Unicode to cover these (up to) 4 variants. How that
codepoint gets rendered on the screen or printer may be affected by what font
is chosen; a font that is designed for Japanese may render it differently than
one that is designed for Korean. If you're in a computing environment (such as
a library) that stores multilingual data, and there is no "language
hinting" being delivered, the choice of fonts (and therefore the rendering
style) is ambiguous. This is an area of ongoing discussion within the Unicode
arena.
EACC, however, treats these as different numeric values, bearing a numerical
relationship to each other. This is explained in ANSI/NISO Z39.64-1989, "East
Asian Character Code for Bibliographic Use" (which, by the way, is available
only in microfiche form). EACC codes are represented in hex, such as
"xxyyzz". The "xx" is referred to as the "plane". Planes 21 to 26 are
traditional Chinese, 27 to 2C are simplified Chinese, and 2D through 68 are
"other forms". Variants of a character share the same "yyzz" and sit in planes
whose "xx" differs by a multiple of hex 06. Thus a character at 34yyzz could
have variants at 22yyzz, 28yyzz, 2Eyyzz, 3Ayyzz, 40yyzz, 46yyzz, 4Cyyzz,
52yyzz, 58yyzz, 5Eyyzz, and/or 64yyzz. Note, though, that not all planes are
currently being used.
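
To make the plane arithmetic concrete, here is a minimal Python sketch. The
function name variant_candidates is my own invention, and it assumes an EACC
code is held as a 24-bit integer 0xXXYYZZ. It only lists candidate codes; it
says nothing about which of them are actually assigned in the standard.

```python
def variant_candidates(eacc):
    """List the EACC codes that could be variants of `eacc`.

    Assumes `eacc` is a 24-bit integer 0xXXYYZZ. Candidates keep the same
    yyzz and use every plane in 0x21..0x68 that differs from the original
    plane by a multiple of 6. Not every candidate is an assigned code.
    """
    plane = eacc >> 16
    yyzz = eacc & 0xFFFF
    if not 0x21 <= plane <= 0x68:
        return []  # planes 69 and up are outside the variant scheme
    # Lowest plane in the same "column" of the 6-plane groups.
    first = 0x21 + (plane - 0x21) % 6
    return [(p << 16) | yyzz for p in range(first, 0x69, 6) if p != plane]


# Planes of the candidates for a character at 34yyzz (yyzz = 1234 is arbitrary).
print([f"{c >> 16:02X}" for c in variant_candidates(0x341234)])
# ['22', '28', '2E', '3A', '40', '46', '4C', '52', '58', '5E', '64']
```
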
Other planes (69 and higher) cover Japanese kana, kokuji, Korean hangul, and
new simplified Chinese.
All this means that there is NOT a one-to-one relationship between EACC and
Unicode. When translating from EACC to Unicode, if the plane is 21 to 68, you
have to look for a codepoint that matches the EACC value directly or, failing
that, one that matches a variant of it (same "yyzz", plane differing by a
multiple of hex 06). When translating the other way, I have no good answer,
but I suspect this is less often needed.
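
The EACC-to-Unicode lookup just described might be sketched like this in
Python. Everything here is an assumption for illustration: the function name,
the idea that the mapping table is a dict from EACC integer (0xXXYYZZ) to
Unicode codepoint, and the order in which related planes are tried (lowest
plane first). The table entry in the usage lines is a placeholder, not a real
EACC mapping.

```python
def eacc_to_unicode(eacc, table):
    """Map an EACC code to a Unicode codepoint, or None if nothing matches.

    `table` is a hypothetical dict {eacc_int: unicode_codepoint}. A direct
    hit wins; otherwise, for planes 0x21..0x68, try the codes with the same
    yyzz whose plane differs by a multiple of 6.
    """
    if eacc in table:
        return table[eacc]
    plane = eacc >> 16
    yyzz = eacc & 0xFFFF
    if 0x21 <= plane <= 0x68:
        first = 0x21 + (plane - 0x21) % 6
        # Search order is an assumption: ascending plane number.
        for p in range(first, 0x69, 6):
            candidate = (p << 16) | yyzz
            if candidate in table:
                return table[candidate]
    return None


# Placeholder data: pretend only the plane-22 form is in the table.
demo_table = {0x22AAAA: 0x4E00}
print(eacc_to_unicode(0x34AAAA, demo_table) == 0x4E00)  # True, found via plane 22
print(eacc_to_unicode(0x35AAAA, demo_table))            # None, plane 35 is unrelated
```
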
I hope this helps. For more details, please contact me offline.
--
Regards,
....Bob Rasmussen, President, Rasmussen Software, Inc.
personal e-mail: [log in to unmask]
company e-mail: [log in to unmask]
voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
fax: (US) 503-624-0760
web: http://www.anzio.com