John Clews <[log in to unmask]> asked for information on
character set coding in MARC records.
The key question needs to be answered first.
>8. What steps are being made to implement UTF-8 in USMARC?
Approved by MARBI in June 1998. Minutes of MARBI meetings are
available on the USMARC Web site. http://lcweb.loc.gov/marc/
But this approval is only part of what must be a complete
specification before records can be exchanged. Keep reading this
list for future developments.
>1. How much is ISO 6630 actually used already in library systems?
>I have yet to come across any great number that are using it.
Some of the ISO 6630 control characters are valid for UNIMARC. The
actual use depends entirely on the data, and also on the filing
conventions of the library in the case of NSB/NSE.
>2. What provisions are being made to avoid confusion between the
>characters in columns 8 and 9 of ISO 6630, vs. similar characters in
>UTF-8 implementations of UCS (ISO/IEC 10646 and Unicode)? I can
>foresee some problems in unaware systems that can cope with UTF-8,
>with combinations of 1. a character from ANSI Z39.47:1993 and 2. a
>non-filing or similar character from ISO 6630 being adjacent in the
>data stream, and looking like a 2-octet UTF-8 sequence.
Potential confusion between 8-bit data and UTF-8 records was
addressed in the MARBI proposal that covered use of UTF-8 (you can
find it through the Minutes of the June 1998 MARBI Meeting). It was
discussed at length by the Task Force that prepared the proposal.
When (in the future) records can be in either 8-bit character sets
or UTF-8, systems will *have* to check on the record's type of
encoding. An "unaware system[s] that can cope with UTF-8" suggests
to me a system that assumes that *all* records are in UTF-8. Not a
good design decision for the near term, when UTF-8 systems will have
to be able to accept "legacy data" (i.e., records in 8-bit character
sets).
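
To make the concern concrete, here is a minimal Python sketch. It
assumes the usual code assignments (0xC3 for the copyright mark in
ANSEL, 0x88 for NSB in ISO 6630). The `declared` argument and the
`read_text` function are hypothetical stand-ins for whatever encoding
indicator the finished specification puts in the record; neither comes
from the MARBI proposal.

```python
# Two adjacent 8-bit codes: an ANSEL graphic from column C followed by
# an ISO 6630 control from column 8 (assumed assignments, see above).
ANSEL_COPYRIGHT = 0xC3
ISO6630_NSB = 0x88
legacy = bytes([ANSEL_COPYRIGHT, ISO6630_NSB]) + b"The title"

# A system that assumes everything is UTF-8 accepts the first two
# octets as one well-formed 2-octet sequence and silently misreads them.
misread = legacy.decode("utf-8")
print(repr(misread[0]))  # one character, U+00C8, instead of two codes


def read_text(data: bytes, declared: str):
    """Dispatch on the record's declared encoding instead of guessing.

    `declared` is a hypothetical label: "utf-8" or "8-bit".
    """
    if declared == "utf-8":
        return data.decode("utf-8")
    if declared == "8-bit":
        # Keep the raw octets for a table-driven 8-bit converter.
        return list(data)
    raise ValueError(f"unknown encoding label: {declared!r}")
```

With the declared type checked first, `read_text(legacy, "8-bit")`
keeps 0xC3 and 0x88 as two separate codes, so the copyright mark and
the NSB control survive for proper conversion.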
>3. I suppose conversion systems should be aware when data is being
>migrated to Unicode in the future, but what steps have been made, by
>vendors, or bibliographic utilities, to do appropriate
>transformations?
Transformations are already being done by systems that use Unicode
internally.
The Library of Congress has posted UCS/Unicode mappings for the
alphabetic USMARC character sets on the USMARC Web site.
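
As a rough sketch of what a table-driven transformation of this kind
involves, the Python fragment below converts a few ANSEL-coded octets
to Unicode. The two dictionaries are a tiny illustrative excerpt
written for this example, not the Library of Congress tables, and the
function name is hypothetical. It also assumes the ANSEL convention
that a nonspacing diacritic precedes its base letter, whereas a Unicode
combining mark follows it.

```python
# Illustrative excerpt only; the real mappings are the posted tables.
COMBINING = {0xE1: "\u0300", 0xE2: "\u0301", 0xE8: "\u0308"}  # grave, acute, diaeresis
SPACING = {0xC3: "\u00a9", 0xA2: "\u00d8"}  # copyright sign, O with stroke


def ansel_to_unicode(data: bytes) -> str:
    out = []
    pending = []  # diacritics seen before their base letter
    for octet in data:
        if octet in COMBINING:
            pending.append(COMBINING[octet])
            continue
        if octet < 0x80:
            char = chr(octet)
        elif octet in SPACING:
            char = SPACING[octet]
        else:
            raise ValueError(f"octet 0x{octet:02X} is not in this excerpt")
        out.append(char)
        out.extend(pending)  # Unicode order: base first, then marks
        pending.clear()
    if pending:
        raise ValueError("diacritic without a base letter")
    return "".join(out)


print(ansel_to_unicode(b"caf\xe2e"))
```

The output here is left in decomposed form (base letter plus combining
mark); whether to compose it afterwards is a separate design choice.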
>4. The basic USMARC character set left columns C and D blank. How
>much is ANSI Z39.47:1993 (which adds five characters to this)
>actually used already in library systems?
These characters have been legal for USMARC since 1994, when its
Latin set was aligned with ANSEL (Z39.47). So the question is:
How many systems have implemented these additional characters?
But maybe the count varies depending on the characters. For
example, RLIN has always had "script l" (which is converted to a
regular lower case l in record output).
>5. How many systems in the USA now use ISO/IEC 8859-1 and/or the ANSI
>codepage, which is a subset of it?
If by "ANSI codepage" you mean Windows codepage 1252, it's a
SUPERset of ISO/IEC 8859-1 because it encodes graphic characters in
the second control code (C1) area.
As for how many: there were some a few years ago (the last time I
checked), but they were systems intended for smaller libraries.
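
The superset relationship mentioned above can be checked with a short
Python sketch. It relies only on Python's built-in "cp1252" and
"latin-1" codecs, used here as stand-ins for Windows codepage 1252 and
ISO/IEC 8859-1.

```python
# Octets 0xA0-0xFF decode identically in both encodings.
high = bytes(range(0xA0, 0x100))
same_upper_half = high.decode("cp1252") == high.decode("latin-1")
print(same_upper_half)

# In the C1 area (0x80-0x9F) they diverge: ISO/IEC 8859-1 leaves
# control codes there, while codepage 1252 assigns graphic characters.
octet = bytes([0x93])
as_latin1 = octet.decode("latin-1")  # C1 control code, U+0093
as_cp1252 = octet.decode("cp1252")   # left double quotation mark, U+201C
print(repr(as_latin1), repr(as_cp1252))
```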
>6. How many people see occasional "garbage characters" through coding
>ambiguities in MARC records?
What does "coding ambiguities" mean? One way you'd get unknown
characters is through mal-transformation of data from another
character set. Another cause is an inadequate or wrong font used
for display. Since there are at least two possible causes (and very
likely more), the question isn't very useful.
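
The first cause, mal-transformation, is easy to reproduce. The Python
sketch below is only an illustration with made-up sample text: it
encodes a word as UTF-8 and then reads the octets back as if they were
Windows codepage 1252.

```python
original = "Caf\u00e9"  # "Café"

# Write the text as UTF-8, then read it back with the wrong character set.
octets = original.encode("utf-8")
garbled = octets.decode("cp1252")
print(garbled)

# Reading the same octets with the right character set is lossless.
restored = octets.decode("utf-8")
print(restored == original)
```

The accented letter turns into two unrelated characters, which is what
a user would report as "garbage", even though no data was lost.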
>7. How many records use characters from columns C and D for the
>subscript and superscript characters used by OCLC (and others)?
These are legitimate USMARC characters (albeit encoded differently
in the USMARC exchange format). Their encoding in the base 8-bit
set is an internal device. Potentially any records could contain
these characters: it all depends on the source of information that
was transcribed.
"How many systems include these characters in their internal 8-bit
set?" is a different question. Since the use is internal, does it
matter to anyone except the technical people maintaining that
system?
-- Joan Aliprand
Senior Analyst, RLG