MARBI CHARACTER SET SUBCOMMITTEE
Interim Report to MARBI
January 22, 1996
The MARBI Character Set Subcommittee was appointed in June
1994 following MARBI discussion of Discussion Paper #73 with
the following charge:
1. To review the character set issues related to mapping
between USMARC and Unicode.
2. To formulate a proposal for review and comments by LC,
MARBI and the USMARC Advisory Group.
3. To identify other issues related to character sets which
should be addressed by MARBI and/or the library community.
The members of the Subcommittee are:
Joan Aliprand - RLG and Secretary of the Unicode Consortium
Randy Barry - LC
Candy Bogar - DRA
John Espley - VTLS
Robyn Greenlund - Microlif
Sally McCallum - LC
Gary Smith - OCLC
Paul Weiss - University of New Mexico
Larry Woods - University of Iowa, Chair
The previous Interim Report was posted on MARC-L in July,
1995.
The Character Set Subcommittee continued its deliberations
during the latter part of 1995. The following topics were
discussed via e-mail:
1. Mapping for the USMARC Latin character F8 "RIGHT
CEDILLA".
2. General mappings for Cyrillic, Hebrew and Arabic.
3. Mapping for the Hebrew "HOLAM" and Hebrew-specific
punctuation marks.
4. Mapping for the "ASCII Clones" in the USMARC Hebrew,
Cyrillic and Arabic sets.
5. The addition of nine characters to Unicode Standard/ISO
10646 from the USMARC Arabic set.
6. Representation in the MARC record of the presence of
Unicode characters.
F8 RIGHT CEDILLA
We were fortunate to locate a Thai linguistics expert from
the University of Wisconsin, Robert Bickner, who confirmed
Joan Aliprand's hypothesis that the "RIGHT CEDILLA" was an
artifact created through transcription and is similar to the
International Phonetic Alphabet (IPA) symbol for an open
vowel, which is depicted in the Unicode Standard as 031C
COMBINING LEFT HALF RING BELOW. This gives us unique
mappings for all four "comma below" characters in the
USMARC Latin Set:
F0  CEDILLA        0327  COMBINING CEDILLA
F1  RIGHT HOOK     0328  COMBINING OGONEK
F7  LEFT HOOK      0326  COMBINING COMMA BELOW
F8  RIGHT CEDILLA  031C  COMBINING LEFT HALF RING BELOW
This completes our mapping table for USMARC Latin. USMARC
Greek Symbols, Subscripts and Superscripts are also
complete.
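The four mappings above can be written down as a small lookup table. The following is a minimal Python sketch, not part of the Subcommittee's work: the names USMARC_LATIN_BELOW_MARKS, UNICODE_TO_USMARC and describe are hypothetical, and the code points are copied from the table above.

```python
import unicodedata

# USMARC Latin byte value -> Unicode code point, copied from the table above.
# The dictionary and function names are hypothetical, chosen for illustration.
USMARC_LATIN_BELOW_MARKS = {
    0xF0: 0x0327,  # CEDILLA
    0xF1: 0x0328,  # RIGHT HOOK
    0xF7: 0x0326,  # LEFT HOOK
    0xF8: 0x031C,  # RIGHT CEDILLA
}

# Because every USMARC value has its own Unicode value, the table can be
# inverted without losing information.
UNICODE_TO_USMARC = {u: m for m, u in USMARC_LATIN_BELOW_MARKS.items()}


def describe(marc_byte: int) -> str:
    """Return 'XX -> YYYY UNICODE NAME' for one USMARC Latin byte value."""
    code_point = USMARC_LATIN_BELOW_MARKS[marc_byte]
    return f"{marc_byte:02X} -> {code_point:04X} {unicodedata.name(chr(code_point))}"


for marc_byte in sorted(USMARC_LATIN_BELOW_MARKS):
    print(describe(marc_byte))
```

Since the four Unicode values are distinct, inverting the table yields four entries, which is what makes these mappings unique in both directions.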
45 HOLAM
The issue concerned the recommended mapping for the USMARC
Hebrew character 45 HOLAM. USMARC has only a single
character HOLAM, which should have been listed as
HOLAM/RIGHT SIN DOT. There are two distinct characters -- HOLAM
and SIN DOT -- in both the Unicode Standard and ISO/IEC 10646.
The discussion was about whether to map it contextually
to Unicode/UCS HOLAM or Unicode/UCS SIN DOT. This
has now been resolved, and the recommended mapping will be:
45 HOLAM 05B9 HEBREW POINT HOLAM
We will ask that the USMARC tables be updated to read "45
HOLAM/RIGHT SIN DOT".
MAPPINGS FOR "ASCII CLONES"
The USMARC Cyrillic, Hebrew and Arabic character sets
include punctuation and digits which replicate those of
ASCII. The Unicode Standard has only a single set of these
characters. The Committee is discussing whether to map
these "ASCII clones" in the Hebrew, Cyrillic and Arabic
sets to their Latin equivalents in the Unicode Standard or
to values in the Private Use Area. Discussion is continuing
about the challenges that mapping them to their Latin
equivalents poses for "round trip mapping" and for
programming in general, especially where directionality
of display is concerned. We are surveying the vendors to see
what the impact of mapping these to the Private Use Area
would be, particularly on generic display and print
drivers.
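The round-trip concern can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: each USMARC character is modelled as a (set name, byte value) pair, the byte value 0x21 for the exclamation mark and the Private Use Area code points E000 to E002 are chosen for the example and are not values proposed by the Subcommittee, and the function name is_round_trip is hypothetical.

```python
# Each USMARC character is modelled as (set name, byte value).
# The byte value and the Private Use Area code points below are
# illustrative assumptions, not values proposed in this report.
EXCLAMATION = 0x21

# Option 1: map every clone to its Latin equivalent.
to_latin = {
    ("latin", EXCLAMATION): 0x0021,
    ("hebrew", EXCLAMATION): 0x0021,
    ("cyrillic", EXCLAMATION): 0x0021,
    ("arabic", EXCLAMATION): 0x0021,
}

# Option 2: map each clone to its own Private Use Area value.
to_private_use = {
    ("latin", EXCLAMATION): 0x0021,
    ("hebrew", EXCLAMATION): 0xE000,
    ("cyrillic", EXCLAMATION): 0xE001,
    ("arabic", EXCLAMATION): 0xE002,
}


def is_round_trip(mapping: dict) -> bool:
    """True when no two USMARC characters share a Unicode value."""
    return len(set(mapping.values())) == len(mapping)


print(is_round_trip(to_latin))        # False: the source set is lost
print(is_round_trip(to_private_use))  # True: each clone stays distinct
```

Under the first option the reverse mapping cannot tell which USMARC set a character came from. The second option keeps that information, at the cost of code points that generic display and print drivers may not render.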
ADDITION OF NINE ARABIC SCRIPT CHARACTERS TO UNICODE
The following nine script characters from the USMARC Arabic
Set have no corresponding equivalents in the Unicode Standard:
A1 DOUBLE ALEF WITH HAMZA ABOVE
B2 TCHEH WITH DOT ABOVE
C9 SHEEN WITH DOT BELOW
CC DAD WITH DOT BELOW
CF GHAIN WITH DOT BELOW
E7 LAM WITH THREE DOTS BELOW
EC NOON WITH DOT BELOW
FD SHORT E
FE SHORT U
The official adoption of USMARC Arabic as an ISO standard
will be our primary justification for getting these
characters added to the Unicode Standard/ISO 10646. We assume
this will be routine and will list the mapping as "in process,
pending addition to Unicode/UCS" or a similar phrase.
We will not map them to the Private Use Area.
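One way a mapping table could record this status is sketched below in Python. It is an assumption about representation only: the names PENDING, USMARC_ARABIC_UNMAPPED and map_arabic are hypothetical, and the sketch covers only the nine characters listed above.

```python
from typing import Optional, Tuple

# Status phrase taken from the report; the constant name is hypothetical.
PENDING = "in process, pending addition to Unicode/UCS"

# The nine USMARC Arabic characters listed above, keyed by byte value.
USMARC_ARABIC_UNMAPPED = {
    0xA1: "DOUBLE ALEF WITH HAMZA ABOVE",
    0xB2: "TCHEH WITH DOT ABOVE",
    0xC9: "SHEEN WITH DOT BELOW",
    0xCC: "DAD WITH DOT BELOW",
    0xCF: "GHAIN WITH DOT BELOW",
    0xE7: "LAM WITH THREE DOTS BELOW",
    0xEC: "NOON WITH DOT BELOW",
    0xFD: "SHORT E",
    0xFE: "SHORT U",
}


def map_arabic(marc_byte: int) -> Tuple[Optional[int], str]:
    """Return (code point, status); the code point is None while pending."""
    if marc_byte in USMARC_ARABIC_UNMAPPED:
        # No Private Use Area value is substituted for a pending character.
        return None, PENDING
    raise KeyError(f"byte {marc_byte:02X} is outside this sketch")


print(map_arabic(0xFD))
```

Returning None rather than a placeholder code point keeps the pending characters clearly separate from characters that already have a Unicode value.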
REPRESENTATION OF THE PRESENCE OF UNICODE CHARACTERS IN MARC
RECORDS
Discussion of this is continuing. We need to show when a
record contains only Unicode values and when it contains Unicode
values as well as USMARC characters.
MAPPINGS FOR THE EAST ASIAN CHARACTER CODE (EACC)
The Subcommittee recommends that a new committee be
appointed to handle these mappings, because the Subcommittee
felt it lacked the expertise to deal with East Asian
scripts. The Subcommittee further recommends that at least
one member of the present Subcommittee be named to the EACC
Mapping Group, and that the Library of Congress also be
represented. We also recommend that the new group observe
the same set of Working Principles that this Subcommittee
put into place.
WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS
FROM USMARC TO UNICODE
The following Working Principles were established by the
Subcommittee and continue to inform its mapping decisions:
1. Round-trip mapping will be provided between USMARC
characters and Unicode characters wherever possible.
2. Transliteration tables will remain unchanged unless there
is no Unicode equivalent for a diacritical mark, in which
case a change to the transliteration table may be considered
by the Library of Congress.
3. Accented letters (and vocalized consonants in Hebrew and
Arabic) will continue to be encoded as a base letter and non-
spacing marks. Use of precomposed accented letters is not
sanctioned at this stage.
4. Codes in the Private Use Area will be used only if
necessary to facilitate round-trip mapping.
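Principle 3 can be illustrated with a short Python sketch. It uses canonical decomposition (NFD) from Python's standard unicodedata module, which the report does not mention; it is used here only as a convenient way to show a precomposed letter expressed as a base letter followed by a non-spacing mark. The function name to_base_plus_marks is hypothetical.

```python
import unicodedata


def to_base_plus_marks(text: str) -> str:
    """Decompose precomposed letters into base letter plus combining marks.

    Uses Unicode canonical decomposition (NFD), an assumption of this
    sketch rather than a procedure described in the report.
    """
    return unicodedata.normalize("NFD", text)


precomposed = "\u00E7"  # LATIN SMALL LETTER C WITH CEDILLA
decomposed = to_base_plus_marks(precomposed)

# Show each resulting code point with its Unicode name.
for ch in decomposed:
    print(f"{ord(ch):04X} {unicodedata.name(ch)}")
```

The single precomposed character becomes two code points: the base letter 0063 followed by 0327 COMBINING CEDILLA, the same non-spacing mark to which USMARC F0 CEDILLA is mapped.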
OUTSTANDING ISSUES
Mapping of ASCII clones in the USMARC Arabic, Hebrew and
Cyrillic sets.
Representation of Unicode data in a MARC record.
New committee to handle EACC.
********************************************************
Sally H. McCallum, Chief, Network Development and
MARC Standards Office, Library of Congress
Washington, DC 20540 USA
(Fax: 1-202-707 0115) (Voice: 1-202-707 5119)
********************************************************