Hello!

> may contain characters that are unrepresentable in Unicode,
I do not know of any character unrepresentable in Unicode that is natively 
recognized by any legacy OS (and used in file names). I would be 
interested to learn about these if such characters exist and I believe the 
Unicode consortium would also be interested in that.

Using (for example) an underscore "_" for characters not representable in 
Unicode and converting all other characters to UTF-8, you could create 
symlinks with the converted names that point to those files, assuming your 
system supports symlinks.
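
A rough sketch of the symlink idea in Python, assuming a POSIX-like system where file names are raw bytes and assuming you already know the legacy encoding of the names. The function names, the directory arguments and the choice of "_" as the substitute are only illustrative:

```python
import os

def unicode_link_name(raw_name: bytes, encoding: str) -> str:
    """Decode a legacy file name, using "_" for anything that cannot be decoded."""
    # errors="replace" turns each undecodable byte sequence into U+FFFD,
    # which is then swapped for the underscore.
    return raw_name.decode(encoding, errors="replace").replace("\ufffd", "_")

def link_with_unicode_names(src_dir: bytes, dst_dir: bytes, encoding: str) -> None:
    """Create, in dst_dir, UTF-8-named symlinks to every entry of src_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    # Passing bytes to os.listdir returns the names as raw bytes, undecoded.
    for raw_name in os.listdir(src_dir):
        target = os.path.abspath(os.path.join(src_dir, raw_name))
        link_name = unicode_link_name(raw_name, encoding).encode("utf-8")
        link = os.path.join(dst_dir, link_name)
        # Two legacy names can collapse to the same converted name;
        # this sketch simply keeps the first one and skips the rest.
        if not os.path.lexists(link):
            os.symlink(target, link)
```

A call would look like link_with_unicode_names(b"/data/legacy", b"/data/utf8-links", "cp1252"), where both paths are made up. The original files are left untouched, so a wrong guess about the encoding only costs you a directory of bad links.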

> attempting to infer the coding. I would suggest using some sort of heuristic scan of the document
Maybe. This could be useful, but only if VERY FEW encodings are used in 
your files. Otherwise, it is extremely risky. (Personally, I have many
thousands of text files encoded in dozens of different encodings and I've 
never found any reliable automatic heuristic to infer their encodings. 
Each time I need any of those pre-Unicode files more than sporadically, I 
convert it to UTF-8, using Private Use Area (PUA) code points where needed, 
because some of the alphabets I used are not yet available in Unicode.) In 
other words, you will probably need to 
check each file (or at least each group of files) "manually".
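
To illustrate why I consider automatic inference risky, here is a small Python sketch (the function name and the candidate list are my own assumptions). Strict decoding can only rule encodings out. Single-byte encodings accept almost any byte sequence, so several candidates usually survive and a human still has to choose:

```python
def decodable_with(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings under which data decodes without error."""
    matches = []
    for encoding in candidates:
        try:
            data.decode(encoding)  # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        matches.append(encoding)
    return matches

# The same four bytes are valid Latin-1 and valid Windows-1251;
# only UTF-8 can be excluded mechanically.
survivors = decodable_with(b"caf\xe9", ["utf-8", "latin-1", "cp1251"])
print(survivors)  # ['latin-1', 'cp1251']
```

Anything that leaves more than one survivor (or none) goes on the list for checking by hand.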

Regards!

Saašha,