<div dir="ltr"><div dir="ltr"><br>This is a plan for switching to unicode with utf8 encoding, to allow larger<br>fonts in future.  This certainly isn't a priority, but I'm already quite clear<br>about how we can achieve this, and it came up, so I thought I'd describe it.  (I<br>want to allow importing .ttf fonts and converting them to bitmaps, so using<br>unicode should be practical.)<br><br>In memory, strings will be stored in utf8, which is very convenient. Rather<br>little code will need to be updated, because you can use string concatenation,<br>MID, INSTR, FORMAT, strprintf, and most of our utility functions on utf8<br>strings.  Code that iterates over strings by character but only cares about<br>characters &lt; 128, such as code to process embed codes, doesn't need to be<br>updated either, because every byte of a multi-byte utf8 sequence is 128 or<br>above.  I was thinking of doing "TYPE ustring as string" and annotating<br>which strings contain utf8, but now it seems to make much more sense to just say<br>that all strings and zstrings do, unless clearly marked otherwise.<br><br>Most code that wants to know the length of a string in characters needs to be<br>updated anyway to use textwidth() and measure things in pixels instead, for<br>variable-width fonts.  This part will probably be much more work than everything<br>else.  Looks like there are about 310 cases of LEN (excluding "IF LEN(") which<br>would need to be investigated, including 44 FOR loops over the characters of a<br>string, nearly all of which look fine.<br><br>Using utf8, characters 128-255 take two bytes to encode, which means strings<br>that currently fit in a fixed-width text field may no longer fit. Therefore, to<br>avoid having to replace all lump file formats with RELOAD-based ones, I propose<br>that fixed-width string fields can store either 8-bit (extended ascii) or utf8<br>strings. The encoding can be stored as an extra bit in the length field of the<br>text (which is either 1, 2 or 4 bytes). 
There are a few text data fields which<br>aren't prefixed by the length but instead are zero-padded. The encoding bits for<br>these would have to be stored elsewhere, or a byte (codepoint between 1 and 31)<br>could be prepended to the string to indicate utf8.<br><br>strgrabber will need to be updated to treat the maxlength as the number of bytes<br>available for encoding as either 8-bit or utf8.  Shorter length limits are<br>unfortunate, but good enough. I think only textboxes and item names will badly<br>need to be expanded, and both lumps are badly overdue for replacement anyway.<br><br>RELOAD documents will gain a new node type for utf8 encoded strings.<br><br>Aside from data files there aren't many ways that strings enter or leave the<br>engine, but they will all need to be updated, including filenames (OPENFILE can<br>handle this, and we should just replace all remaining instances of OPEN with<br>OPENFILE; also findfiles), reading/writing .txt files (FB file functions support<br>reading/writing unicode files, but OPENFILE doesn't expose this yet), strings<br>and script names exported by hspeak (hspeak already supports unicode, but dumbs<br>it down to ascii when writing a .hs file), and a few places in the gfx backends<br>(eg setwindowtitle) or the os_* modules (mainly winapi calls).<br><br>I've already been working towards utf8 support for a while, eg. hspeak is done,<br>we have routines to convert from/to utf8, the clipboard routines use utf8, and<br>we already treat filenames as being utf8 most of the time (see<br>decode_filename()).<br><br></div></div>
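<div dir="ltr">To make the length-field idea concrete, here is a rough sketch in Python<br>(not the engine's FreeBASIC) of a fixed-width string field whose length prefix<br>carries an encoding flag. Everything in it is an assumption for illustration:<br>the names pack_field, unpack_field and UTF8_FLAG are hypothetical, the prefix<br>is taken to be 2 bytes little-endian with the top bit as the flag, and latin-1<br>merely stands in for whatever 8-bit character set the engine really uses.<br><br><pre>
```python
UTF8_FLAG = 0x8000  # hypothetical: top bit of a 2-byte length field marks utf8


def pack_field(text, max_bytes):
    """Return a 2-byte length prefix followed by a zero-padded payload.

    max_bytes must stay below 0x8000 so the length never collides with the flag.
    """
    try:
        # Prefer the 8-bit form: one byte per character for codepoints below 256.
        # latin-1 is only a stand-in for the engine's actual 8-bit character set.
        payload, flag = text.encode("latin-1"), 0
    except UnicodeEncodeError:
        # Anything outside the 8-bit range falls back to utf8 and sets the flag.
        payload, flag = text.encode("utf-8"), UTF8_FLAG
    if len(payload) > max_bytes:
        raise ValueError("string does not fit in the field")
    header = (len(payload) | flag).to_bytes(2, "little")
    return header + payload.ljust(max_bytes, b"\0")


def unpack_field(data):
    """Decode a field written by pack_field, honouring the encoding flag."""
    raw_len = int.from_bytes(data[:2], "little")
    length = raw_len & ~UTF8_FLAG  # strip the flag bit to get the byte count
    payload = data[2:2 + length]
    return payload.decode("utf-8" if raw_len & UTF8_FLAG else "latin-1")
```
</pre>
With this layout, "café" still fits in a 4-byte field in its 8-bit form even<br>though it needs 5 bytes as utf8, and only strings that use characters above 255<br>pay the utf8 size cost.<br><br></div>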