Misty,

This is a very interesting problem.  More generally: "how do you record paths that use characters which do not have a Unicode representation?"  Unfortunately, I do not have an answer.

> This filename has some characters which are not valid in UTF-8 or ASCII

Which characters end up being unrepresented?  I thought the problem was that you can't guarantee a round-trip conversion from Shift_JIS to Unicode and back to Shift_JIS.  I was under the impression that the one-way conversion Shift_JIS -> Unicode would work.  I understand that a lossy round trip breaks your ability to unambiguously look up the files on the legacy system.
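
One way to find out which names are affected is to test the round trip directly. This is only a rough sketch, assuming Python and its built-in "shift_jis" codec, which may not match the variant your legacy system actually used (cp932, for example). The helper name round_trips is made up for illustration.

```python
def round_trips(raw: bytes, encoding: str = "shift_jis") -> bool:
    """Return True if raw decodes to Unicode and re-encodes to the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # Either the bytes are not valid in this encoding, or the decoded
        # text cannot be encoded back again.
        return False
```

Running this over the raw byte filenames (os.listdir(b".") returns bytes on POSIX systems) would give you the list of names that do not survive the trip.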

Does the following mapping table have full coverage of Shift_JIS?  If not, you can disregard the rest of my response.
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

I do not think it is possible to faithfully record the original paths of a Shift_JIS filesystem in the FLocat element.  Migrating the files to a newer file system seems like a roundabout way of sidestepping your issue, but it would permit representation of the new paths.

You could use FContent instead of FLocat, and list the MIMETYPE attribute of the file element as Shift_JIS.  This would allow you to encode the contents of the file, but not the original path.  So the original problem remains.

I would run a translation to UTF-8 and see whether there end up being collisions in the new representation of the legacy paths.  Without collisions, you can at least maintain a lookup table between the current and legacy representations.  However, I'm not sure how you would reference an external table in a METS document.
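
A rough sketch of that collision check, again assuming Python and assuming you can read the legacy paths as raw bytes. The function name build_lookup and the choice of errors="replace" are my own assumptions, not anything METS prescribes.

```python
from collections import defaultdict


def build_lookup(legacy_paths, encoding="shift_jis"):
    """Map each transcoded path back to the legacy byte path(s) it came from.

    Returns (table, collisions): table maps the Unicode path to a list of
    legacy byte paths; collisions is the subset with more than one entry.
    """
    table = defaultdict(list)
    for raw in legacy_paths:
        # errors="replace" turns undecodable bytes into U+FFFD, which is
        # exactly where distinct legacy names can collapse into one.
        table[raw.decode(encoding, errors="replace")].append(raw)
    collisions = {new: old for new, old in table.items() if len(old) > 1}
    return dict(table), collisions
```

If collisions comes back empty for your set of distinct paths, the table is a usable one-to-one mapping between the transcoded and legacy names.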

> As well, since the source encoding may be unknown

Without explicitly knowing the encoding, you're in a bit of a bind. There are some interesting methods for attempting to infer the encoding. I would suggest using some sort of heuristic scan of the document to make an educated guess based on a priori knowledge or assumptions.  Perhaps look for runs of byte values that would be unlikely in the ASCII or Latin-1 representation of the document's language.
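
As a very rough illustration of that kind of scan (a sketch only, assuming Python; the function names and the candidate list are hypothetical), you could pull out the runs of non-ASCII bytes and see which candidate encodings accept the name at all:

```python
import re


def high_byte_runs(raw: bytes):
    """Return the runs of bytes >= 0x80, i.e. everything that is not ASCII."""
    return re.findall(rb"[\x80-\xff]+", raw)


def plausible_encodings(raw: bytes,
                        candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings under which raw decodes without error."""
    plausible = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible
```

Latin-1 is deliberately left out of the candidates because every byte sequence decodes under it, so it can never be ruled out this way. A clean decode is only weak evidence; it narrows the guess rather than confirming it.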




/* Colin Gross
 * Application Programmer
 * Digital Library Production Service
 * University of Michigan Library
*/


On Fri, Dec 13, 2013 at 1:15 PM, Misty De Meo wrote:
Hi,

I've run into an issue representing certain file paths in certain METS fields, for instance FLocat links.

Some files I'm describing in a METS document were created on legacy operating systems and filesystems, and their filenames are in non-Unicode encodings. For example, one file I've been looking at contains Shift-JIS characters in its name. This filename has some characters which are not valid in UTF-8 or ASCII - e.g., characters above ordinal 127. Other filenames could potentially contain characters which are not valid in UTF-16.

Unfortunately, since XML requires that all strings be Unicode or ASCII, I'm not sure how to represent these paths in the document. My understanding is that these fields are meant to represent actual paths, so base64-encoding the original data is out. As well, since the source encoding may be unknown or may contain characters that are unrepresentable in Unicode, transcoding the strings into Unicode before writing the METS is out. (That would also make it difficult to associate the paths with the files on disk, since the Unicode-transcoded version wouldn't match what the filesystem is storing.)

Have any other METS users run into this issue? Any suggestions?

Best,
Misty De Meo

--
Misty De Meo
Software Developer / Systems Analyst
Artefactual Systems
www.artefactual.com