On Wed, 6 Dec 2006 18:02:09 EST, James Agenbroad wrote:
[This is such a long post that I am replying to only the last portion here.]
>More specific comments on responses follow:
>
>1. Joan wrote, "It is pointless to try to devise a MARC-specific list of
>characters that are allowed or forbidden. Users will enter whatever is
>available to them." I disagree as above. Not all rules are always followed,
>but awareness of rules reduces the frequency and consequences of proscribed
>activities. Most of us have jaywalked at one time or another. Recently an
>11-year-old in my town died doing so. At ALA in San Diego a policeman gave
>me a warning for doing so. We can reasonably expect that system developers
>will not make available characters defined as undesirable. For example, the
>current MARC Specifications say that U+001B, the escape character, is
>unlikely to occur in UCS/Unicode records--and with good reason, Unicode
>was designed to prevent the need for escape sequences.
The MARC 21 Specifications mention the control code U+001B because the Basic
Latin (ASCII) mapping table gives it as the Unicode equivalent for the ASCII
C0 control character used in MARC-8 data. Although the use of escape
sequences in the context of Unicode does not make sense, the MARC 21
Specifications do not explicitly forbid use of U+001B.
My point is that there must be solid reasons for any prohibitions. (As for
the alternative of explicitly allowed characters, the scope of Unicode is
too vast.) With respect to U+001B, the MARC 21 Specifications do the right
thing: there is no requirement to remove U+001B in the unlikely event that
it occurs in a MARC record.
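
To illustrate "no requirement to remove", here is a minimal Python sketch that only reports where U+001B occurs and leaves the data untouched. The function name and the sample string are hypothetical, chosen for illustration; nothing here comes from the MARC 21 Specifications.

```python
ESC = "\u001b"  # U+001B, the Unicode equivalent of the ASCII escape control


def find_escape_positions(field_text):
    """Return the index of every U+001B in field_text.

    The text itself is not modified: the sketch reports, it does not strip.
    """
    return [i for i, ch in enumerate(field_text) if ch == ESC]


# Hypothetical field content with a stray escape character in it.
sample = "Title" + ESC + " remainder"
positions = find_escape_positions(sample)
print(positions)
```

A system could log such positions for review without rejecting or altering the record.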
>2. Joan wrote, "Both the fill character and numeric character references
>are equivalents for characters in the source record." The fill character
>is just a place marker, it's far from the equivalent of the missing
>character. A series of fill characters is far from the equivalent of a Thai
>script title. Numeric character references are intended to allow the record
>recipients to recreate the missing character(s) later when they get around
>to converting to Unicode; I would not expect these references to be
>displayed for the public so they are not equivalents either in any useful
>sense of the word equivalent.
OK, here is a more explicit re-write:
Both the fill character and the numeric character reference show the
location of a character in the source record that could not be converted to
a MARC-8 equivalent.
The relevant user community, MicroLIF, preferred to either drop the
unmappable character or use the fill character. The NCR option was to meet
OCLC's need for a lossless solution.
http://www.loc.gov/marc/marbi/minutes/mw-06.html
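
The three options discussed above (drop, fill character, NCR) can be sketched in Python as follows. This is an illustration under stated assumptions, not the conversion specified by MARBI: `is_mappable` is a hypothetical stand-in that treats only ASCII as having a MARC-8 equivalent, the fill character is taken to be the vertical bar, and the NCR is written in the `&#xXXXX;` hexadecimal form.

```python
import re

FILL = "|"  # assumed fill character: vertical bar, hex 7C


def is_mappable(ch):
    # Hypothetical stand-in for a real MARC-8 table lookup:
    # in this sketch only ASCII characters count as mappable.
    return ord(ch) < 0x80


def convert(text, strategy):
    """Convert text, handling unmappable characters by the chosen strategy."""
    if strategy not in ("drop", "fill", "ncr"):
        raise ValueError("unknown strategy: " + strategy)
    out = []
    for ch in text:
        if is_mappable(ch):
            out.append(ch)
        elif strategy == "fill":
            out.append(FILL)  # marks the location, loses the identity
        elif strategy == "ncr":
            out.append("&#x{:04X};".format(ord(ch)))  # keeps the code point
        # strategy == "drop": emit nothing
    return "".join(out)


def restore_ncrs(text):
    """Turn &#xXXXX; references back into the characters they name."""
    return re.sub(
        r"&#x([0-9A-Fa-f]{4,6});",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


original = "\u0e01 a"  # THAI CHARACTER KO KAI, space, "a"
print(convert(original, "drop"))
print(convert(original, "fill"))
print(convert(original, "ncr"))
```

Only the NCR output can be turned back into the original string, which is the sense in which that option is lossless; the fill character records that something was there, and dropping records nothing.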
>3. Joan wrote: "Yes, use of 'a' in leader position is preferable to use of
>the Byte Order Mark." I agree and would exclude BOM from MARC records.
>Then she says, "A system encountering the BOM sequence could not tell
>whether the first byte EF represented the ANSEL character candrabindu or
>was the first byte of a BOM." Isn't the ANSEL code EF only used in MARC-8
>records? Am I missing something? I would hope the BOM was not converted
>unaltered into MARC-8 records.
I was writing about examination of UTF-8 data by a MARC-8 system, but
omitted "MARC-8". A system that was UTF-8 aware would, of course, interpret
hex EF as the beginning of a UTF-8 sequence.
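
A small Python sketch of the point about the leader versus the BOM. It assumes the usual MARC 21 convention that Leader/09 is "a" for UCS/Unicode and blank for MARC-8; the function name, the return labels, and the sample leaders are hypothetical.

```python
import codecs


def sniff_encoding(record):
    """Guess the character coding scheme of a raw MARC 21 record (bytes)."""
    if record.startswith(codecs.BOM_UTF8):
        # EF BB BF sits in front of the leader. A reader that knows only
        # MARC-8 has no way to tell this EF from the ANSEL code EF.
        return "utf-8 (BOM present)"
    leader = record[:24]
    if len(leader) < 24:
        raise ValueError("record is shorter than a 24-byte leader")
    # Leader/09: "a" means UCS/Unicode, blank means MARC-8.
    return "utf-8" if leader[9:10] == b"a" else "marc-8"


unicode_leader = b"00714cam a2200205 a 4500"
marc8_leader = b"00714cam  2200205 a 4500"

print(sniff_encoding(unicode_leader))
print(sniff_encoding(marc8_leader))
print(sniff_encoding(codecs.BOM_UTF8 + unicode_leader))
```

With the leader flag, the decision is made from a fixed position in the record; with a BOM, the leader itself is pushed three bytes out of place for any system that does not expect it.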
>4. Joan wrote, "Since MARBI has already approved use of certain private
>use code points in MARC 21 records, there seems no good reason to expressly
>prohibit the use of any other private use code points." That MARBI has
>approved use of a few PUA characters seems a very good reason to prohibit
>the use of the rest--all but the approved 61 characters. Do we want MARC
>systems to need to seek every corporate logo with a PUA code that might
>through error get into a MARC record? I think not.
I don't understand why a MARC system would need to seek the meaning of a
private use code that was not one of those sanctioned by MARBI. Such a
private use code that through error got into a MARC record would just show
up as the "no glyph available" image supplied by system software (in the
case of Windows, for example, as a box).
We should bear in mind that MARC 21 is an international format. In the
implementation of MARC 21 that we are familiar with, we do not plan to use
any more private use code points. But we do not know whether implementers
using MARC 21 in other parts of the world may wish to use private use code
points to meet the needs of their own constituency. It is more flexible to
have the restriction on use controlled via MARBI approval, than to block
further use anywhere absolutely.
(There is no guarantee that a MARC 21 implementation will conform to what LC
does. Even eight-bit MARC 21 allows the use of character sets other than the
MARC-8 ones. Because of this flexibility, MARC 21 is a truly international
format.)
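
As a sketch of how a system could notice private use code points without looking up what they mean, the Python below relies on the Unicode general category "Co" (private use). The `approved` parameter is a hypothetical placeholder for whatever set an implementation sanctions; the actual MARBI-approved code points are not listed here.

```python
import unicodedata


def unapproved_private_use(text, approved=frozenset()):
    """Return the private use characters in text that are not in approved.

    approved is an illustrative, implementation-supplied set of characters;
    by default nothing is approved.
    """
    return [
        ch
        for ch in text
        if unicodedata.category(ch) == "Co" and ch not in approved
    ]


# U+E000 is the first code point of the BMP Private Use Area.
sample = "a\ue000b"
print([hex(ord(ch)) for ch in unapproved_private_use(sample)])
print(unapproved_private_use(sample, approved=frozenset({"\ue000"})))
```

Because the approved set is a parameter, the same check serves an implementation that sanctions additional private use code points and one that sanctions none.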
>5. Joan wrote, "Noncharacters by definition may not be interchanged. They
>may be used internally by an application but cannot be exported in MARC 21
>exchange." I agree we should exclude them. See item 5 above.
>As always comments are most welcome.
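
For reference, the noncharacters mentioned in item 5 are a fixed set defined by Unicode: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes. A minimal Python check, with hypothetical function names, might look like this:

```python
def is_noncharacter(cp):
    """True if the code point cp is one of the Unicode noncharacters."""
    # U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def contains_noncharacter(text):
    """True if any character in text is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)


# Count the noncharacters across the whole code space.
total = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
print(total)
```

An export routine could run such a check before a record leaves the application.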
-- Joan Aliprand