[Ohrrpgce] utf8 plan

Tue Apr 23 04:19:15 PDT 2019

Put it on the wiki, where it belonged:
https://rpg.hamsterrepublic.com/ohrrpgce/Plan_for_unicode_support

(Though I find most of the Plans on the wiki not terribly helpful; most of
them are largely obsolete.)

On Tue, 23 Apr 2019 at 01:24, Ralph Versteegen <teeemcee at gmail.com> wrote:

>
> This is a plan for switching to unicode with utf8 encoding, to allow larger
> fonts in future.  This certainly isn't a priority, but I'm already quite clear
> about how we can achieve this, and it came up, so I thought I'd describe it.
> (I want to allow importing .ttf fonts and converting them to bitmaps, so using
> unicode should be practical.)
>
> In memory, strings will be stored in UTF8, which is very convenient. Rather
> little code will need to be updated, because you can use string concatenation,
> MID, INSTR, FORMAT, strprintf, and most of our utility functions on utf8
> strings.  Code that iterates over strings by character but only cares about
> characters < 128 doesn't need to be updated either, such as code to process
> embed codes.  I was thinking of doing "TYPE ustring as string" and annotating
> which strings contain utf8, but now it seems to make much more sense to just
> say that all strings and zstrings do, unless clearly marked otherwise.
>
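
To illustrate why byte-wise code that only looks for characters < 128 keeps
working, here is a minimal sketch. It is written in Python purely for
illustration (it is not engine code), and find_ascii and the sample text are
made up for the example. It relies on the fact that every byte of a multibyte
UTF-8 sequence has its high bit set.

```python
# Illustrative sketch only; find_ascii is a hypothetical helper.
text = "Café ${C1} naïve"
data = text.encode("utf-8")


def find_ascii(data: bytes, ch: str) -> int:
    """Return the byte offset of an ASCII character, scanning byte by byte."""
    target = ord(ch)
    assert target < 128, "only safe for ASCII characters"
    for i, b in enumerate(data):
        # Bytes inside multibyte UTF-8 sequences are always >= 128,
        # so a match here can never be part of another character.
        if b == target:
            return i
    return -1


pos = find_ascii(data, "$")
embed_code = data[pos:pos + 5]  # the five ASCII bytes starting at "$"
```

The offset found is a byte offset, not a character index, which is fine as
long as the code keeps slicing the same byte string.
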
> Most code that wants to know the length of a string in characters needs to be
> updated anyway to use textwidth() and measure things in pixels instead, for
> variable-width fonts.  This part will probably be much more work than
> everything else.  Looks like there are about 310 cases of LEN (excluding
> "IF LEN(") which would need to be investigated, including 44 FOR loops over
> the characters of a string, nearly all of which look fine.
>
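
For the LEN cases that really do want a character count, the byte length and
the character count diverge once a string contains anything >= 128. A small
sketch of the difference, again in Python for illustration only;
utf8_char_count is a hypothetical name, not an existing routine:

```python
def utf8_char_count(data: bytes) -> int:
    """Count codepoints in valid UTF-8 by skipping continuation bytes.

    Continuation bytes always look like 10xxxxxx, so every other byte
    starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "naïve".encode("utf-8")
byte_length = len(sample)             # what a byte-based LEN would report
char_length = utf8_char_count(sample)
```
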
> Using UTF8, characters 128-255 take two bytes to encode, which means strings
> that currently fit in a fixed-width text field may no longer fit.  Therefore,
> to avoid having to replace all lump file formats with RELOAD-based ones, I
> propose that fixed-width string fields can store either 8-bit (extended ascii)
> or utf8 strings. The encoding can be stored as an extra bit in the length
> field of the text (which is either 1, 2 or 4 bytes). There are a few text data
> fields which aren't prefixed by the length but instead are zero-padded. The
> encoding bits for these would have to be stored elsewhere, or a byte
> (codepoint between 1 and 31) could be prepended to the string to indicate
> utf8.
>
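
A rough sketch of the length-field idea, in Python for illustration only.
Everything specific here is an assumption: a 2-byte little-endian length, the
top bit used as the utf8 flag, Latin-1 standing in for the existing 8-bit
encoding, and the names pack_field/unpack_field. The actual lump formats may
differ.

```python
import struct

UTF8_FLAG = 0x8000  # assumption: top bit of a 16-bit length marks utf8


def pack_field(text: str, capacity: int) -> bytes:
    """Store text in a fixed-width field, preferring the 8-bit encoding."""
    try:
        payload = text.encode("latin-1")  # stand-in for the 8-bit encoding
        flag = 0
    except UnicodeEncodeError:
        payload = text.encode("utf-8")
        flag = UTF8_FLAG
    if len(payload) > capacity:
        raise ValueError("string does not fit in field")
    header = struct.pack("<H", len(payload) | flag)
    # Zero-pad so the field always occupies 2 + capacity bytes.
    return header + payload.ljust(capacity, b"\0")


def unpack_field(raw: bytes) -> str:
    """Read a field written by pack_field, honouring the encoding bit."""
    (header,) = struct.unpack_from("<H", raw)
    length = header & ~UTF8_FLAG
    encoding = "utf-8" if header & UTF8_FLAG else "latin-1"
    return raw[2:2 + length].decode(encoding)


old_style = pack_field("café", 8)  # fits as 8-bit, flag stays clear
new_style = pack_field("日本", 8)   # needs utf8, flag gets set
```

Existing files never have the flag set, so they keep reading as 8-bit
strings without any conversion step.
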
> strgrabber will need to be updated to treat the maxlength as the number of
> bytes available for encoding the text as either 8-bit or utf8.  Shorter length
> limits are unfortunate, but good enough. I think only textboxes and item names
> will badly need to be expanded, and both lumps are badly overdue for
> replacement anyway.
>
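
The kind of check strgrabber would need can be sketched as follows, in Python
for illustration only. encoded_size and can_append are hypothetical names, and
Latin-1 again stands in for the current 8-bit encoding.

```python
def encoded_size(text: str) -> int:
    """Bytes needed using 8-bit if possible, otherwise utf8."""
    try:
        return len(text.encode("latin-1"))  # stand-in for the 8-bit encoding
    except UnicodeEncodeError:
        return len(text.encode("utf-8"))


def can_append(current: str, ch: str, max_bytes: int) -> bool:
    """Would the string still fit in the field after typing ch?"""
    return encoded_size(current + ch) <= max_bytes


fits_8bit = can_append("caf", "é", 4)     # 4 bytes as 8-bit
fits_small = can_append("café", "日", 6)   # whole string becomes utf8: 8 bytes
fits_large = can_append("café", "日", 8)
```

Note that typing a single character outside the 8-bit range switches the
whole string to utf8, so the "é" already in the field grows to two bytes as
well.
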
> RELOAD documents will gain a new node type for utf8 encoded strings.
>
> Aside from data files there aren't many ways that strings enter or leave the
> engine, all of which will need to be updated. These include filenames
> (OPENFILE can handle this, and we should just replace all remaining instances
> of OPEN with OPENFILE; also findfiles), reading/writing .txt files (the FB
> file functions support reading/writing unicode files, but OPENFILE doesn't
> expose this yet), strings and script names exported by hspeak (hspeak already
> supports unicode, but dumbs it down to ascii when writing a .hs file), and a
> few places in the gfx backends (eg setwindowtitle) or the os_* modules (mainly
> winapi calls).
>
> I've already been working towards utf8 support for a while, eg. hspeak is
> done, we have routines to convert from/to utf8, the clipboard routines use
> utf8, and we already treat filenames as being utf8 most of the time (see
> decode_filename()).
>
>