On Nov 9, 2011, at 3:39 PM, Riley, Charles wrote:
> Bibliographic data is largely built on the MARC-8 character set, in essence a subset of UTF-8; thus a loss of data for the preponderance of materials in non-Latin scripts has already occurred by the time data becomes bibliographic.
I don't think MARC-8 is properly a "subset" of UTF-8; I'm not sure what that would mean. As I understand it, MARC-8 is closer to ISO 2022, in which a single text stream can switch between multiple character sets. UTF-8 is an encoding form of Unicode, which is a different beast entirely.
I would hope that Unicode would be used for any future bibliographic representation: the choice of encoding then depends on the particular serialization format used. There is little we can do if the original data has been lost, but having the foundation to represent the world's current and historical scripts is a vital requirement, and Unicode fits the bill here.
In addition to specifying language (I have no preference between ISO 639-2/B and 639-3), we should also consider specifying script details. ISO 15924 works well for this, e.g., to distinguish a title in Simplified Chinese (Hans) from one in Traditional Chinese (Hant).
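To make the language-plus-script idea concrete, here is a minimal Python sketch. The TaggedTitle class and its field names are hypothetical, not taken from any existing bibliographic format. It assumes ISO 639-3 for the language code, and it checks only the shape of each code, not whether the code is actually registered.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedTitle:
    """Hypothetical record: a title with a language code and a script code."""
    text: str
    language: str  # assumed ISO 639-3, e.g. "zho" for Chinese
    script: str    # ISO 15924, e.g. "Hans" or "Hant"

    def __post_init__(self):
        # Shape checks only; a real system would consult the code registries.
        if not re.fullmatch(r"[a-z]{3}", self.language):
            raise ValueError(f"not a three-letter language code: {self.language!r}")
        if not re.fullmatch(r"[A-Z][a-z]{3}", self.script):
            raise ValueError(f"not a four-letter script code: {self.script!r}")


# The same title, same language, but written in two different scripts.
simplified = TaggedTitle("红楼梦", language="zho", script="Hans")
traditional = TaggedTitle("紅樓夢", language="zho", script="Hant")

print(simplified.language == traditional.language)
print(simplified.script == traditional.script)
```

With only a language code the two records would be indistinguishable; the script code is what separates them.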
P.S. All opinions are my own and do not necessarily represent my employer.
Principal Software Engineer --- Search
10 Estes Street
Ipswich, MA 01938, USA
Phone: +1-978-356-6500 x2185