[Ohrrpgce] utf8 plan

James Paige Bob at hamsterrepublic.com
Tue Apr 23 06:49:08 PDT 2019


I read the plan and I like it!

As for old plans on the wiki, we should probably delete the really obsolete
ones, and mark the others with {{obsolete}}.

I have been really bad about checking the wiki lately.

On Tuesday, April 23, 2019, Ralph Versteegen <teeemcee at gmail.com> wrote:

> Put it on the wiki, where it belonged:
> https://rpg.hamsterrepublic.com/ohrrpgce/Plan_for_unicode_support
>
> (Though I find most of the Plans on the wiki not terribly helpful; most of
> them are largely obsolete.)
>
> On Tue, 23 Apr 2019 at 01:24, Ralph Versteegen <teeemcee at gmail.com> wrote:
>
>>
>> This is a plan for switching to unicode with utf8 encoding, to allow larger
>> fonts in future.  This certainly isn't a priority, but I'm already quite
>> clear about how we can achieve this, and it came up, so I thought I'd
>> describe it.  (I want to allow importing .ttf fonts, converting them to
>> bitmaps, so using unicode should be practical.)
>>
>> In memory, strings will be stored in UTF8, which is very convenient.
>> Rather little code will need to be updated, because you can use string
>> concatenation, MID, INSTR, FORMAT, strprintf, and most of our utility
>> functions on utf8 strings.  Code that iterates over strings by character
>> but only cares about characters < 128 doesn't need to be updated either,
>> such as code to process embed codes.  I was thinking of doing "TYPE ustring
>> as string" and annotating which strings contain utf8, but now it seems to
>> make much more sense to just say that all strings and zstrings do, unless
>> clearly marked otherwise.
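
A minimal sketch of why byte-wise scanning for characters < 128 stays safe on
utf8 data: every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII
byte can never appear inside one. This is Python rather than FreeBASIC, purely
for illustration; find_markers and the "${...}" marker syntax are assumptions
made for the example, not engine code.

```python
# Sketch: byte-wise scan of UTF-8 data for ASCII-only markers.
# The "${...}" marker syntax and find_markers() are illustrative assumptions.

def find_markers(data: bytes) -> list[bytes]:
    """Return the contents of every ${...} marker found in UTF-8 bytes."""
    found = []
    i = 0
    while i < len(data):
        # Bytes of a multi-byte UTF-8 sequence are all >= 0x80, so a byte
        # equal to ord("$") can only be a genuine ASCII "$".
        if data[i] == ord("$") and data[i + 1:i + 2] == b"{":
            end = data.find(b"}", i + 2)
            if end == -1:
                break  # unterminated marker: stop scanning
            found.append(data[i + 2:end])
            i = end + 1
        else:
            i += 1
    return found


text = "Héllo ${H0}, you have ${V12} gold → ok"
markers = find_markers(text.encode("utf-8"))
print(markers)
```

The loop never decodes anything; it only compares individual bytes against
ASCII values, which is the kind of code the paragraph above says can be left
alone.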
>>
>> Most code that wants to know the length of a string in characters needs to
>> be updated anyway to use textwidth() and measure things in pixels instead,
>> for variable-width fonts.  This part will probably be much more work than
>> everything else.  Looks like there are about 310 cases of LEN (excluding
>> "IF LEN(") which would need to be investigated, including 44 FOR loops over
>> the characters of a string, nearly all of which look fine.
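
To illustrate why each LEN call needs a look, here is a small Python sketch
(not engine code) contrasting the byte length of a utf8 string with its
character count. utf8_char_count is a hypothetical helper; it counts code
points by skipping continuation bytes. Neither number is a pixel width, which
is what textwidth() would report.

```python
# Sketch: byte length versus character count for UTF-8 data.
# utf8_char_count() is a hypothetical helper written for this example.

def utf8_char_count(data: bytes) -> int:
    """Count code points by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


name = "Café"
encoded = name.encode("utf-8")
byte_len = len(encoded)               # what a byte-oriented LEN would report
char_len = utf8_char_count(encoded)   # number of characters a reader sees
print(byte_len, char_len)
```

Code that uses LEN only to test for an empty string, or to size a buffer in
bytes, gives the same answer either way; code that uses it for layout does not.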
>>
>> Using UTF8, characters 128-255 take two bytes to encode, which means
>> strings that currently fit in a fixed-width text field may no longer fit.
>> Therefore, to avoid having to replace all lump file formats with
>> RELOAD-based ones, I propose that fixed-width string fields can store
>> either 8-bit (extended ascii) or utf8 strings.  The encoding can be stored
>> as an extra bit in the length field of the text (which is either 1, 2 or 4
>> bytes).  There are a few text data fields which aren't prefixed by the
>> length but instead are zero-padded.  The encoding bits for these would have
>> to be stored elsewhere, or a byte (codepoint between 1 and 31) could be
>> prepended to the string to indicate utf8.
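
One way the length-field flag could look, as a Python sketch under stated
assumptions: a 2-byte little-endian length prefix whose top bit marks utf8,
followed by a zero-padded payload. The bit position, the field layout, the
names pack_field/unpack_field, and the use of latin-1 as a stand-in for the
8-bit font encoding are all assumptions for illustration; no lump format is
specified this way in the plan.

```python
import struct

# Hypothetical layout: 16-bit length word, top bit set when payload is UTF-8.
UTF8_FLAG = 0x8000
LENGTH_MASK = 0x7FFF


def pack_field(text: str, field_size: int) -> bytes:
    """Pack text as a 2-byte length word plus field_size payload bytes."""
    try:
        payload = text.encode("latin-1")  # stand-in for the 8-bit encoding
        flag = 0
    except UnicodeEncodeError:
        payload = text.encode("utf-8")    # fall back when 8-bit cannot hold it
        flag = UTF8_FLAG
    if len(payload) > field_size:
        raise ValueError("text does not fit in field")
    header = struct.pack("<H", len(payload) | flag)
    return header + payload.ljust(field_size, b"\0")


def unpack_field(raw: bytes) -> str:
    """Inverse of pack_field: read the flag bit, then decode accordingly."""
    (length_word,) = struct.unpack_from("<H", raw, 0)
    payload = raw[2:2 + (length_word & LENGTH_MASK)]
    encoding = "utf-8" if length_word & UTF8_FLAG else "latin-1"
    return payload.decode(encoding)


old_style = pack_field("Café", 8)   # fits as 8-bit text, flag clear
new_style = pack_field("日本", 8)    # needs UTF-8, flag set
print(unpack_field(old_style), unpack_field(new_style))
```

Existing files never have the flag bit set, so in this sketch they would keep
decoding as 8-bit text without any conversion step.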
>>
>> strgrabber will need to be updated to treat the maxlength as the number of
>> bytes available for encoding as either 8-bit or utf8.  Shorter length
>> limits are unfortunate, but good enough.  I think only textboxes and item
>> names will badly need to be expanded, both of which are lumps badly overdue
>> for replacement anyway.
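
A rough Python sketch of a byte-budget check of the kind strgrabber would
need: accept a typed character only if the result still fits in the field
under whichever encoding applies. encoded_size and try_append are hypothetical
names, and latin-1 again stands in for the engine's 8-bit encoding; this is
not how strgrabber is actually written.

```python
# Sketch: enforce a maximum length measured in bytes rather than characters.
# encoded_size() and try_append() are hypothetical helpers for illustration.

def encoded_size(text: str) -> int:
    """Bytes needed using the 8-bit encoding if possible, else UTF-8."""
    try:
        return len(text.encode("latin-1"))  # stand-in for the 8-bit encoding
    except UnicodeEncodeError:
        return len(text.encode("utf-8"))


def try_append(current: str, ch: str, max_bytes: int) -> str:
    """Return current + ch if it still fits in max_bytes, else current."""
    candidate = current + ch
    return candidate if encoded_size(candidate) <= max_bytes else current


name = try_append("Café", "s", 5)      # stays 8-bit: 5 bytes, accepted
after_wide = try_append(name, "Ω", 8)  # forces UTF-8: 6 + 2 = 8 bytes, accepted
rejected = try_append(name, "Ω", 7)    # same text, but only 7 bytes available
print(name, after_wide, rejected)
```

Note how adding one character outside the 8-bit range makes the earlier "é"
cost two bytes as well, which is where the shorter effective limits come from.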
>>
>> RELOAD documents will gain a new node type for utf8 encoded strings.
>>
>> Aside from data files there aren't many ways that strings enter or leave
>> the engine, which will all need to be updated, including filenames
>> (OPENFILE can handle this, and we should just replace all remaining
>> instances of OPEN with OPENFILE; also findfiles), reading/writing .txt
>> files (FB's file functions support reading/writing unicode files, but
>> OPENFILE doesn't expose this yet), strings and script names exported by
>> hspeak (hspeak already supports unicode, but dumbs it down to ascii when
>> writing a .hs file), and a few places in the gfx backends (eg
>> setwindowtitle) or the os_* modules (mainly winapi calls).
>>
>> I've already been working towards utf8 support for a while, eg. hspeak is
>> done, we have routines to convert from/to utf8, the clipboard routines use
>> utf8, and we already treat filenames as being utf8 most of the time (see
>> decode_filename()).
>>
>>