
        MARBI CHARACTER SET SUBCOMMITTEE

                   Interim Report to MARBI

                      January 22, 1996


The MARBI Character Set Subcommittee was appointed in June
1994 following MARBI discussion of Discussion Paper #73 with
the following charge:

1. To review the character set issues related to mapping
between USMARC and Unicode.
2. To formulate a proposal for review and comments by LC,
MARBI and the USMARC Advisory Group.
3. To identify other issues related to character sets which
should be addressed by MARBI and/or the library community.

The members of the Subcommittee are:

Joan Aliprand - RLG and Secretary of the Unicode Consortium
Randy Barry - LC
Candy Bogar - DRA
John Espley - VTLS
Robyn Greenlund - Microlif
Sally McCallum - LC
Gary Smith - OCLC
Paul Weiss - University of New Mexico
Larry Woods - University of Iowa, Chair

The previous Interim Report was posted on MARC-L in July,
1995.

The Character Set Subcommittee continued its deliberations
during the latter part of 1995. The following topics were
discussed via e-mail:

1. Mapping for the USMARC Latin character F8 "RIGHT
CEDILLA".

2. General mappings for Cyrillic, Hebrew and Arabic.

3. Mapping for the Hebrew "HOLAM" and Hebrew-specific
punctuation marks.

4. Mapping for the "ASCII Clones" in the USMARC Hebrew,
Cyrillic and Arabic sets.

5. The addition of nine characters to Unicode Standard/ISO
10646 from the USMARC Arabic set.

6. Representation in the MARC record of the presence of
Unicode characters.

F8 RIGHT CEDILLA

We were fortunate to locate a Thai linguistics expert from
the University of Wisconsin, Robert Bickner, who confirmed
Joan Aliprand's hypothesis that the "RIGHT CEDILLA" was an
artifact created through transcription and is similar to the
International Phonetic Alphabet (IPA) symbol for an open
vowel, which is depicted in the Unicode Standard as 031C
COMBINING LEFT HALF RING BELOW. This gives us unique
mappings for all four "comma below" characters in the
USMARC Latin Set:

F0     CEDILLA          0327     COMBINING CEDILLA
F1     RIGHT HOOK       0328     COMBINING OGONEK
F7     LEFT HOOK        0326     COMBINING COMMA BELOW
F8     RIGHT CEDILLA    031C     COMBINING LEFT HALF RING BELOW
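
As an illustration only, the four mappings above can be written as a
small lookup table. The Python sketch below is not part of the
Subcommittee's work: the names USMARC_LATIN_BELOW_MARKS,
UNICODE_TO_USMARC and to_unicode are hypothetical, and the only data
used are the byte values and code points listed in the table above.

```python
import unicodedata

# USMARC Latin byte value -> Unicode code point, copied from the table above.
# The dictionary name is a hypothetical label chosen for this sketch.
USMARC_LATIN_BELOW_MARKS = {
    0xF0: 0x0327,  # CEDILLA       -> COMBINING CEDILLA
    0xF1: 0x0328,  # RIGHT HOOK    -> COMBINING OGONEK
    0xF7: 0x0326,  # LEFT HOOK     -> COMBINING COMMA BELOW
    0xF8: 0x031C,  # RIGHT CEDILLA -> COMBINING LEFT HALF RING BELOW
}

# Every target code point is distinct, so the table can be inverted
# without losing information.
UNICODE_TO_USMARC = {cp: byte for byte, cp in USMARC_LATIN_BELOW_MARKS.items()}


def to_unicode(usmarc_byte: int) -> str:
    """Return the Unicode character for one of the four USMARC byte values."""
    return chr(USMARC_LATIN_BELOW_MARKS[usmarc_byte])


if __name__ == "__main__":
    for byte, cp in USMARC_LATIN_BELOW_MARKS.items():
        # unicodedata.name() reports the character name known to Python.
        print(f"{byte:02X} -> {cp:04X} {unicodedata.name(chr(cp))}")
```

Because the inverted table has the same number of entries as the
original, each of the four characters survives a conversion in both
directions.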

This completes our mapping table for USMARC Latin. USMARC
Greek Symbols, Subscripts and Superscripts are also
complete.

45 HOLAM

The issue concerned the recommended mapping for the USMARC
Hebrew character 45 HOLAM. USMARC has only a single
character HOLAM, which should have been listed as
HOLAM/RIGHT SIN DOT. There are two distinct characters -- HOLAM
and SIN DOT -- in both the Unicode Standard and ISO/IEC 10646.
The discussion was about whether to map contextually
to Unicode/UCS HOLAM or Unicode/UCS SIN DOT. This
has now been resolved and the recommended mapping will be:

45 HOLAM       05B9 HEBREW POINT HOLAM

We will ask that the USMARC tables be updated to read "45
HOLAM/RIGHT SIN DOT".


MAPPINGS FOR "ASCII CLONES"

The USMARC Cyrillic, Hebrew and Arabic character sets
include punctuation and digits which replicate those of
ASCII. The Unicode Standard has only a single set of these
characters. The Subcommittee is discussing whether to map
these "ASCII clones" in the Hebrew, Cyrillic and Arabic
sets to their Latin equivalents in the Unicode Standard or
to values in the Private Use Area. Discussion is continuing
about what challenges mapping them to their Latin
equivalents poses for "round trip mapping" and
programming in general, especially as far as directionality
of display is concerned. We are surveying the vendors to see
what the impact of mapping these to the Private Use Area
would be, particularly as far as generic display and print
drivers are concerned.
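
The round-trip concern can be sketched in a few lines of Python. This
is a hedged illustration, not a proposed mapping: the byte value 2E
for the clones, the PUA_BASE offsets, and all function names are
assumptions made up for the example, and the Subcommittee has not
chosen either option.

```python
from collections import defaultdict

# Hypothetical (character set, byte value) pairs for a full stop and its
# "ASCII clones". The byte values are illustrative assumptions only.
SOURCE_CHARS = [
    ("latin", 0x2E),
    ("hebrew", 0x2E),
    ("cyrillic", 0x2E),
    ("arabic", 0x2E),
]


def map_to_latin_equivalent(charset: str, byte: int) -> int:
    """Option A: every clone collapses onto the single Unicode character."""
    return byte  # U+002E FULL STOP regardless of the source set


# Hypothetical Private Use Area offsets (the BMP PUA is U+E000..U+F8FF).
PUA_BASE = {"hebrew": 0xE000, "cyrillic": 0xE100, "arabic": 0xE200}


def map_to_private_use(charset: str, byte: int) -> int:
    """Option B: clones get their own code points in the Private Use Area."""
    if charset == "latin":
        return byte
    return PUA_BASE[charset] + byte


def is_round_trippable(mapping) -> bool:
    """True if no two source characters share a target code point."""
    sources_by_target = defaultdict(set)
    for source in SOURCE_CHARS:
        sources_by_target[mapping(*source)].add(source)
    return all(len(sources) == 1 for sources in sources_by_target.values())


if __name__ == "__main__":
    print(is_round_trippable(map_to_latin_equivalent))  # False
    print(is_round_trippable(map_to_private_use))       # True
```

Under option A the reverse conversion cannot tell, without extra
context, which USMARC set a full stop came from. Option B keeps the
distinction, but only software that knows the private assignments can
display or print the characters correctly.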


ADDITION OF NINE ARABIC SCRIPT CHARACTERS TO UNICODE

The following nine script characters from the USMARC Arabic
Set have no equivalents in the Unicode Standard:

A1 DOUBLE ALEF WITH HAMZA ABOVE
B2 TCHEH WITH DOT ABOVE
C9 SHEEN WITH DOT BELOW
CC DAD WITH DOT BELOW
CF GHAIN WITH DOT BELOW
E7 LAM WITH THREE DOTS BELOW
EC NOON WITH DOT BELOW
FD SHORT E
FE SHORT U

Since USMARC Arabic has been officially adopted as an ISO
standard, this will be our primary justification for getting
them added to the Unicode Standard/ISO 10646. We assume this
will be routine and will list the mapping as "in process,
pending addition to Unicode/UCS" or a similar phrase.
We will not map them to the Private Use Area.

REPRESENTATION OF THE PRESENCE OF UNICODE CHARACTERS IN MARC
RECORDS

Discussion of this is continuing. We need to show when a
record contains only Unicode values and when it contains Unicode
values as well as USMARC characters.

MAPPINGS FOR THE EAST ASIAN CHARACTER CODE (EACC)

The Subcommittee recommends that a new committee be
appointed to handle these mappings, because the Subcommittee
felt it lacked the expertise to deal with East Asian
scripts. The Subcommittee further recommends that at least
one member of the present Subcommittee be named to the EACC
Mapping Group, and that the Library of Congress also be
represented. We also recommend that the new group observe
the same Working Principles that this Subcommittee put into
place.

WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS
FROM USMARC TO UNICODE

The following Working Principles were established by the
Subcommittee and continue to inform its mapping decisions:

1. Round-trip mapping will be provided between USMARC
characters and Unicode characters wherever possible.

2. Transliteration tables will remain unchanged unless there
is no Unicode equivalent for a diacritical mark, in which
case a change to the transliteration table may be considered
by the Library of Congress.

3. Accented letters (and vocalized consonants in Hebrew and
Arabic) will continue to be encoded as a base letter and non-
spacing marks. Use of precomposed accented letters is not
sanctioned at this stage.

4. Codes in the Private Use Area will be used only if
necessary to facilitate round-trip mapping.
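
Principles 3 and 4 can be illustrated with a short Python sketch. It
uses canonical decomposition (normalization form NFD, as provided by
Python's unicodedata module) as a stand-in for "base letter plus
non-spacing marks"; the report itself does not prescribe any
mechanism, and the function names below are hypothetical.

```python
import unicodedata


def to_base_plus_marks(text: str) -> str:
    """Split precomposed letters into a base letter and combining marks.

    NFD is used here only as a convenient stand-in for the
    "base letter and non-spacing marks" encoding of principle 3.
    """
    return unicodedata.normalize("NFD", text)


def uses_private_use_area(text: str) -> bool:
    """Report whether any character lies in the BMP Private Use Area."""
    return any(0xE000 <= ord(ch) <= 0xF8FF for ch in text)


if __name__ == "__main__":
    precomposed = "\u00E7"  # LATIN SMALL LETTER C WITH CEDILLA
    decomposed = to_base_plus_marks(precomposed)
    # Show the resulting code points: the base letter, then the mark.
    print([f"{ord(ch):04X}" for ch in decomposed])  # ['0063', '0327']
    print(uses_private_use_area(decomposed))        # False
```

A check such as uses_private_use_area could flag records in which
Private Use Area codes appear, so that their use stays limited to the
cases principle 4 allows.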


OUTSTANDING ISSUES

Mapping of ASCII clones in the USMARC Arabic, Hebrew and
Cyrillic sets.
Representation of Unicode data in a MARC record.
New committee to handle EACC.



********************************************************
Sally H. McCallum, Chief, Network Development and
MARC Standards Office, Library of Congress
Washington, DC 20540   USA
(Fax: 1-202-707 0115) (Voice: 1-202-707 5119)
********************************************************