[Ohrrpgce] utf8 plan

Mon Apr 22 06:24:32 PDT 2019

This is a plan for switching to Unicode with UTF-8 encoding, to allow larger
fonts in future.  This certainly isn't a priority, but I'm already quite
clear about how we can achieve this, and it came up, so I thought I'd
describe it.  (I want to allow importing .ttf fonts and converting them to
bitmaps, so using Unicode should be practical.)

In memory, strings will be stored in UTF-8, which is very convenient.  Rather
little code will need to be updated, because you can use string
concatenation, MID, INSTR, FORMAT, strprintf, and most of our utility
functions on UTF-8 strings.  Code that iterates over strings by character
but only cares about characters < 128 doesn't need to be updated either,
such as the code that processes embed codes, since every byte of a
multi-byte UTF-8 sequence is >= 128.  I was thinking of doing "TYPE ustring
as string" and annotating which strings contain UTF-8, but now it seems to
make much more sense to just say that all strings and zstrings do, unless
clearly marked otherwise.
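
A minimal sketch of why byte-wise scanning for ASCII characters stays safe,
written in Python rather than FreeBASIC purely for illustration.  The
function name and the "${H0}" sample text are hypothetical, not taken from
the engine source.

```python
def find_ascii_markers(data: bytes, marker: int = ord("$")) -> list:
    """Return byte offsets of an ASCII marker, scanning byte by byte."""
    return [i for i, b in enumerate(data) if b == marker]


text = "Héros: ${H0} a trouvé 5€"
data = text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so a byte < 128 is
# always a complete ASCII character, never a fragment of another character.
offsets = find_ascii_markers(data)

# Slicing at the offset of an ASCII byte therefore never cuts a character
# in half, so the prefix is still valid UTF-8.
head = data[:offsets[0]].decode("utf-8")
```

Note that the offset is a byte offset (8 here, because "é" takes two bytes),
not a character index.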

Most code that wants to know the length of a string in characters needs to
be updated anyway to use textwidth() and measure things in pixels instead,
for variable-width fonts.  This part will probably be much more work than
everything else.  It looks like there are about 310 uses of LEN (excluding
"IF LEN(") which would need to be investigated, including 44 FOR loops over
the characters of a string, nearly all of which look fine.
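
To illustrate why those LEN uses need a look: on a UTF-8 string a byte
length and a character count can differ.  The sketch below (Python, with a
hypothetical helper name) counts code points by skipping continuation
bytes.  Neither number is a pixel width, which is what textwidth() would be
needed for.

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points: every byte that is not a continuation byte
    (bit pattern 10xxxxxx) starts a new one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


name = "Épée".encode("utf-8")
byte_length = len(name)             # what a byte-oriented LEN would report
char_count = utf8_char_count(name)  # number of characters a player sees
```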

In UTF-8, characters 128-255 take two bytes to encode, which means strings
that currently fit in a fixed-width text field may no longer fit.  Therefore,
to avoid having to replace all lump file formats with RELOAD-based ones, I
propose that fixed-width string fields can store either 8-bit (extended
ASCII) or UTF-8 strings.  The encoding can be stored as an extra bit in the
length field of the text (which is either 1, 2 or 4 bytes).  There are a few
text data fields which aren't prefixed by the length but instead are
zero-padded.  The encoding bits for these would have to be stored elsewhere,
or a byte (codepoint between 1 and 31) could be prepended to the string to
indicate UTF-8.
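
A rough sketch of the length-field idea, in Python for illustration only.
Everything specific here is an assumption: a 2-byte little-endian length
with the top bit used as the UTF-8 flag, the helper names, and Latin-1
standing in for the engine's 8-bit character set.

```python
import struct

UTF8_FLAG = 0x8000  # hypothetical: top bit of a 2-byte length field


def pack_field(text: str, width: int) -> bytes:
    """Pack text as a 2-byte length prefix plus `width` zero-padded bytes."""
    try:
        # Prefer the 8-bit encoding: one byte per character.
        payload, flag = text.encode("latin-1"), 0
    except UnicodeEncodeError:
        # Fall back to UTF-8 and record that in the length field.
        payload, flag = text.encode("utf-8"), UTF8_FLAG
    if len(payload) > width:
        raise ValueError("text does not fit in the field")
    return struct.pack("<H", len(payload) | flag) + payload.ljust(width, b"\0")


def unpack_field(field: bytes) -> str:
    """Read the length prefix, strip the flag bit, and decode accordingly."""
    (raw_len,) = struct.unpack_from("<H", field)
    length = raw_len & 0x7FFF
    payload = field[2:2 + length]
    return payload.decode("utf-8" if raw_len & UTF8_FLAG else "latin-1")


old_style = pack_field("café", 8)  # stays 8-bit, flag bit clear
new_style = pack_field("日本", 8)   # needs UTF-8, flag bit set
```

Existing files never have the flag bit set (as long as no stored length
reaches it), so they would keep decoding as 8-bit without any conversion
step.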

strgrabber will need to be updated to treat the maxlength as the number of
bytes available for encoding the string as either 8-bit or UTF-8.  The
resulting shorter length limits are unfortunate, but good enough.  I think
only textboxes and item names will badly need to be expanded, and both lumps
are badly overdue for replacement anyway.

RELOAD documents will gain a new node type for UTF-8 encoded strings.

Aside from data files there aren't many ways that strings enter or leave the
engine, but all of them will need to be updated.  These include filenames
(OPENFILE can handle this, and we should just replace all remaining
instances of OPEN with OPENFILE; also findfiles), reading/writing .txt files
(the FB file functions support reading/writing Unicode files, but OPENFILE
doesn't expose this yet), strings and script names exported by hspeak
(hspeak already supports Unicode, but dumbs it down to ASCII when writing a
.hs file), and a few places in the gfx backends (eg setwindowtitle) or the
os_* modules (mainly winapi calls).

I've already been working towards UTF-8 support for a while, eg. hspeak is
done, we have routines to convert from/to UTF-8, the clipboard routines use
UTF-8, and we already treat filenames as being UTF-8 most of the time (see
decode_filename()).