Hello!
> may contain characters that are unrepresentable in Unicode,
I do not know of any character unrepresentable in Unicode that is natively
recognized by any legacy OS (and used in file names). I would be
interested to learn about these if such characters exist and I believe the
Unicode consortium would also be interested in that.
Using (for example) an underscore "_" in place of each character not
representable in Unicode and converting all other characters to UTF-8, you
could create symlinks with the converted names that point to those files,
assuming your system supports symlinks.
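
To make the symlink idea concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the function names and the choice of "_" are hypothetical, the legacy encoding of each name is taken as already known, and the link is created in the same directory as the original file.

```python
import os


def unicode_alias(raw_name: bytes, legacy_encoding: str) -> str:
    """Decode a legacy file name, writing "_" for anything undecodable."""
    # errors="replace" yields U+FFFD for each undecodable byte sequence;
    # we then swap that marker for a plain underscore.
    decoded = raw_name.decode(legacy_encoding, errors="replace")
    return decoded.replace("\ufffd", "_")


def link_with_unicode_name(directory: bytes, raw_name: bytes,
                           legacy_encoding: str) -> bytes:
    """Create a UTF-8 named symlink next to the legacy-named file."""
    alias = unicode_alias(raw_name, legacy_encoding).encode("utf-8")
    link_path = os.path.join(directory, alias)
    # Skip names that are already identical and never overwrite anything.
    if alias != raw_name and not os.path.lexists(link_path):
        # Relative target: the link resolves within its own directory.
        os.symlink(raw_name, link_path)
    return link_path
```

Note that two different legacy names can collapse to the same alias once several characters become "_", so a real script would need some way to disambiguate them.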
> attempting to infer the coding. I would suggest using some sort of heuristic scan of the document
This could perhaps be useful if only VERY FEW encodings are used in
your files. Otherwise, it is extremely risky. (Personally, I have many
thousands of text files encoded in dozens of different encodings and I've
never found any reliable automatic heuristic to infer their encodings.
Each time I need any of those pre-Unicode files more than sporadically, I
convert it to UTF-8, using Private Use Area (PUA) code points where needed,
because some of the alphabets I used are not yet available in Unicode.)
In other words, you will probably need to
check each file (or at least each group of files) "manually".
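
A small Python sketch of why automatic inference is risky: the same bytes decode without any error under several single-byte encodings, so "it decodes cleanly" tells you very little. The candidate list and the sample word are just my assumptions for illustration.

```python
# Hypothetical candidate list; a real collection may need many more.
CANDIDATES = ["utf-8", "cp1250", "cp1251", "cp1252", "iso8859_2", "koi8_r"]


def plausible_decodings(data: bytes, candidates=CANDIDATES) -> dict:
    """Return every candidate encoding that decodes `data` without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out for these bytes
    return results


# A short word written in ISO 8859-2, as a pre-Unicode file might store it.
sample = "koš".encode("iso8859_2")
for encoding, text in plausible_decodings(sample).items():
    print(f"{encoding:10} {text!r}")
```

Only UTF-8 is ruled out mechanically here; choosing among the remaining candidates needs knowledge of the language and the origin of the file, which is the "manual" part.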
Regards!
Saašha,