Misty,

This is a very interesting problem.  More generally: how do you record
paths that use characters which have no Unicode representation?
Unfortunately, I do not have an answer.

> This filename has some characters which are not valid in UTF-8 or ASCII

Which characters end up being unrepresented?  I thought the thing you
couldn't guarantee was the round trip from Shift_JIS to Unicode and back to
Shift_JIS.  I was under the impression that the one-way conversion
Shift_JIS -> Unicode would work.  I understand that this breaks your
ability to unambiguously look up the files on the legacy system.
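
One way to find out which names are affected is to test each one directly.
The following is a minimal sketch, assuming the raw filename bytes are
available and that Python's built-in "shift_jis" codec is an acceptable
stand-in for the legacy system's mapping (vendor variants such as cp932
differ in places). The function name round_trips is hypothetical.

```python
def round_trips(raw: bytes, encoding: str = "shift_jis") -> bool:
    """Return True if raw decodes to Unicode and re-encodes to the same bytes."""
    try:
        # A strict decode raises on bytes the codec cannot map to Unicode.
        text = raw.decode(encoding)
        # Re-encoding must reproduce the original bytes exactly.
        return text.encode(encoding) == raw
    except UnicodeError:
        return False
```

Running this over the byte-string names returned by os.listdir() on a bytes
path would give the list of names that need special handling.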

Does this have full coverage of Shift_JIS?  If not, you can disregard the
rest of my response.
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

I do not think it is possible to faithfully record the original paths of
a Shift_JIS filesystem in the FLocat element.  Doing a liftover to a newer
file system seems like a roundabout way of dealing with your issue, but it
would permit representation of the new paths.

You could use FContent instead of FLocat, and list the MIMETYPE attribute
of the file element as Shift_JIS.  This would allow you to encode the
contents of the file, but not the original path.  So the original problem
remains.

I would run a translation to UTF-8 and see if there end up being collisions
in the new representation of the legacy paths.  Without collisions, you can
at least maintain a lookup table between the current and legacy
representations.  However, I'm not sure how you would reference an external
table in a METS document.
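
The collision check could be sketched roughly as follows. This assumes the
legacy names are available as byte strings and that the source encoding is
known; the function names build_lookup and collisions are hypothetical, and
the use of errors="replace" is a choice made for this sketch, not a
recommendation.

```python
from collections import defaultdict


def build_lookup(legacy_names, encoding="shift_jis"):
    """Map each transcoded name to the legacy byte strings that produce it."""
    table = defaultdict(list)
    for raw in legacy_names:
        # errors="replace" turns undecodable bytes into U+FFFD so that every
        # name gets some Unicode form; those names are likely to collide.
        table[raw.decode(encoding, errors="replace")].append(raw)
    return dict(table)


def collisions(table):
    """Return only the transcoded names shared by more than one legacy name."""
    return {name: raws for name, raws in table.items() if len(raws) > 1}
```

If collisions(build_lookup(names)) comes back empty, the table is a usable
one-to-one mapping between the transcoded and legacy representations.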

> As well, since the source encoding may be unknown

Without explicitly knowing the encoding, you're in a bit of a bind.  There
are some interesting methods for attempting to infer the encoding.  I would
suggest using some sort of heuristic scan of the document to make an
educated guess based on a priori knowledge or assumptions.  Perhaps looking
for runs of bytes that would be unlikely in the ASCII or Latin-1
representation of the document's language.
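
A simpler relative of that idea is elimination by strict decoding: try a
short list of candidate encodings and discard the ones that reject the
bytes outright. The sketch below does only that, not the byte-run analysis
described above. The candidate list and the name plausible_encodings are
assumptions for illustration; note that Latin-1 accepts every byte
sequence, so it can never be ruled out this way.

```python
CANDIDATES = ("ascii", "utf-8", "shift_jis", "euc_jp", "latin-1")


def plausible_encodings(raw: bytes, candidates=CANDIDATES):
    """Return the candidates under which raw decodes without error."""
    survivors = []
    for enc in candidates:
        try:
            raw.decode(enc)  # strict mode: any invalid byte raises
        except UnicodeDecodeError:
            continue
        survivors.append(enc)
    return survivors
```

Whatever survives still has to be ranked using outside knowledge, such as
the system the files came from or the language you expect the names to be
in.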




/* Colin Gross
 * Application Programmer
 * Digital Library Production Service
 * University of Michigan Library
*/


On Fri, Dec 13, 2013 at 1:15 PM, Misty De Meo <[log in to unmask]> wrote:

> Hi,
>
> I've run into an issue representing certain file paths in certain METS
> fields, for instance FLocat links.
>
> Some files I'm describing in a METS document were created on legacy
> operating systems and filesystems, and their filenames are in non-Unicode
> encodings. For example, one file I've been looking at contains Shift-JIS
> characters in its name. This filename has some characters which are not
> valid in UTF-8 or ASCII - e.g., characters above ordinal 127. Other
> filenames could potentially contain characters which are not valid in
> UTF-16.
>
> Unfortunately, since XML requires that all strings be Unicode or ASCII,
> I'm not sure how to represent these paths in the document. My understanding
> is that these fields are meant to represent actual paths, so
> base64-encoding the original data is out. As well, since the source
> encoding may be unknown or may contain characters that are unrepresentable
> in Unicode, transcoding the strings into Unicode before writing the METS is
> out. (That would also make it difficult to associate the paths with the
> files on disk, since the Unicode-transcoded version wouldn't match what the
> filesystem is storing.)
>
> Have any other METS users run into this issue? Any suggestions?
>
> Best,
> Misty De Meo
>
> --
> Misty De Meo
> Software Developer / Systems Analyst
> Artefactual Systems
> www.artefactual.com
>