[Ohrrpgce] Unicodification (Was: SVN: teeemcee/8200 Add a couple test files with unicode filenames)

Fri Nov 4 06:20:57 PDT 2016

A little more playing around and a lot of reading Windows documentation
reveals:
-according to James, svn translates the filenames to UTF16 when checking
out on Windows 7
-when you scan a directory with findfiles/DIR (which calls FindFirstFileA) it
converts the filenames to the current codepage, possibly lossily. If the
conversion is lossy then the filename is useless: it can't be used to open
or otherwise refer to the file. (I couldn't find this fact stated anywhere
in the Windows documentation!) But a filename that isn't 7-bit ASCII still
works if every character is in the ANSI codepage, like héllo.txt on a
Latin-1 (US/Western European) machine.
-Windows doesn't even turn combining characters into precomposed characters
when converting to Latin1, e.g. a decomposed À (A plus a combining grave
accent) turns into A` instead of the precomposed À. This doesn't matter,
since the filename would be broken either way
-If you ask findfiles() for subdirectories it does an additional check,
which fails for unencodable filenames because the filename is invalid. So
those subdirectories won't even display in the browser
-There are exactly two possible ways to allow reading files with unicode
characters on Windows:
--replace DIR with a modified version using FindFirstFileW instead, and
encode each filename into a transport encoding (namely UTF8) so that we can
store it in strings everywhere, and even properly display it in the engine
by calling decode_filename**. Then modify every function that gives Windows
or FB a filename to convert the filename back to UTF16 first. That's not a
huge amount of work, because we already have wrappers for nearly all FB file
IO commands and use them nearly everywhere. You wouldn't be able to use
OPEN, KILL, MKDIR directly anymore.
--replace DIR with a modified version using FindFirstFileW, and then if the
filename can't be encoded in the current codepage, get the short 8.3
compatibility filename for the file and pass that back instead.
Unfortunately this means that a filename like ЦЧШЩN.txt becomes something
completely nonsensical like NFE9D~1.TXT.
(Some other sources of filenames also need to be handled, like
cmdline/drag-dropping .rpg files in a funky path, and I know nothing about
those)
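
To make the codepage problem and the "transport encoding" idea a bit more
concrete, here is a rough Python sketch (not engine code, which is
FreeBASIC). It assumes the ANSI codepage is Windows-1252 and uses Python's
cp1252 codec as a stand-in for the conversion Windows does; the real
conversion also applies best-fit mappings, which this does not model. All
function names are made up for illustration.

```python
# Illustrative sketch only: Python codecs stand in for the Windows
# conversions, and every function name here is hypothetical.

def survives_codepage(name, codepage="cp1252"):
    """Return True if `name` round-trips through `codepage` unchanged."""
    try:
        return name.encode(codepage).decode(codepage) == name
    except UnicodeEncodeError:
        # At least one character has no representation in the codepage.
        return False

def to_transport(wide_name):
    """UTF-16LE bytes (what a wide API hands back) -> UTF-8 bytes."""
    return wide_name.decode("utf-16-le").encode("utf-8")

def from_transport(utf8_name):
    """UTF-8 bytes -> UTF-16LE bytes, ready to hand to a wide API."""
    return utf8_name.decode("utf-8").encode("utf-16-le")

if __name__ == "__main__":
    names = [
        "hello.txt",          # plain ASCII
        "h\u00e9llo.txt",     # precomposed e-acute, present in cp1252
        "A\u0300.txt",        # A + combining grave, not in cp1252
        "\u0426\u0427\u0428\u0429N.txt",  # Cyrillic, not in cp1252
    ]
    for name in names:
        print(ascii(name), survives_codepage(name))

    # The UTF-8 transport form converts back to the same UTF-16 name.
    wide = names[3].encode("utf-16-le")
    assert from_transport(to_transport(wide)) == wide
```

Only the first two names survive the codepage, while the UTF-8 form of any
name can be stored in an ordinary byte string and converted back to UTF16
just before calling a wide API, which is the appeal of the first option.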

** it's my intention to switch to UTF8 internally, and potentially even
fall back to a .ttf font to render characters not in the font. Making
everything UTF8-aware is the same thing as making everything aware of text
markup codes. Most strings in the engine can continue to be treated as 8bit
at first because they can't contain either filenames or markup, but when we
further want to support fonts larger than 256 characters, we will need to
shift most of the rest over. That means needing to write replacements for
all string functions like LEFT, MID, etc. Heck, we can just #undef the
builtin versions and redefine them! wstrings won't be used anywhere.
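
As a rough idea of what UTF8-aware replacements for LEFT and MID would have
to do, here is a Python sketch operating on raw bytes (again not engine
code; the names utf8_len, utf8_left and utf8_mid are hypothetical). It
assumes the input is valid UTF8 and counts code points, so a base letter
plus a combining mark still counts as two characters.

```python
# Illustrative sketch only: hypothetical UTF-8-aware LEFT/MID on byte
# strings. Assumes valid UTF-8 input; counts code points, not graphemes.
from typing import List

def _char_starts(data: bytes) -> List[int]:
    """Byte offsets where a code point starts (non-continuation bytes)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

def utf8_len(data: bytes) -> int:
    """Number of code points in `data`."""
    return len(_char_starts(data))

def utf8_left(data: bytes, n: int) -> bytes:
    """First `n` code points of `data`."""
    starts = _char_starts(data)
    if n <= 0:
        return b""
    if n >= len(starts):
        return data
    return data[:starts[n]]

def utf8_mid(data: bytes, start: int, n: int) -> bytes:
    """`n` code points beginning at 1-based position `start`."""
    starts = _char_starts(data) + [len(data)]
    count = len(starts) - 1
    if start < 1 or start > count or n <= 0:
        return b""
    end = min(start - 1 + n, count)
    return data[starts[start - 1]:starts[end]]

if __name__ == "__main__":
    s = "h\u00e9llo".encode("utf-8")   # 6 bytes, 5 code points
    print(s[:2])             # byte-wise slice cuts the e-acute in half
    print(utf8_left(s, 2))   # keeps the whole two-byte character
    print(utf8_mid(s, 2, 3))
```

A plain byte-wise slice can split a multi-byte character, which is why the
builtin 8-bit functions can't be used unchanged on UTF8 strings.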

However, I am aware that Unicodification is NOT a priority; it should be a
very low priority. I think I remember just one bug report about non-ASCII
characters. In fact, even in my previous work on unicode I never noticed
that on Windows we can't open files with non-ANSI filenames. But it's nice
to finally understand how it all works, and I do still want to support
larger fonts and text markup everywhere, which is a far more useful goal.

On 4 November 2016 at 05:19, <subversion at hamsterrepublic.com> wrote:

> teeemcee
> 2016-11-03 09:19:46 -0700 (Thu, 03 Nov 2016)
> 698
> Add a couple test files with unicode filenames
>
> On my system I have UTF8 encoded filenames; I don't quite know
> what will happen after going through git and svn and ending up on
> a different system.
>
> The .rpg is just a blank file, with a Cyrillic filename, so shows
> up as just a bunch of ?????'s in Custom but works just fine
> under GNU/Linux. The filename appears fine when viewed in a Windows
> VM, but Custom under Windows can't load it.
>
> The .txt contains exported textboxes, and its filename has lots
> of pairs of side-by-side pre-composed Latin-1 characters and
> characters decomposed into ASCII base character plus modifier.
> It appears and works perfectly in Custom under GNU/Linux but
> not Windows.
> ---
> A   wip/testgame/unicodetest/
> A   wip/testgame/unicodetest/ÀÀÁÁÂÂÃÃÄÄÅÅÇÇÈÈÉÉÊÊËËÌÌÍÍÎÎÏÏÑÑÒÒÓÓÔÔÕÕ.txt
> A   wip/testgame/unicodetest/Καλημέρα/
> A   wip/testgame/unicodetest/Καλημέρα/κόσμε!.rpg