Character encoding related rambling

2020-07-31

If you think ECMA-35 is solely about character-code switching, you do not understand it.

(Incidentally, stop using ECMA as an abbreviation for ECMAScript. ECMA-262 is not the only ECMA specification out there, or even the only important one.)

While it is true that a decent portion of the length of ECMA-35 is taken up by the nitty-gritty of character-code switching, and that it is most infamous for contributing the code-switching escapes used in traditional Japanese email systems (where it is usually known under its other designations of ISO 2022 and JIS X 0202), ECMA-35’s scope is broader than that: even ISO 8859-1 (ECMA-94-L1, which does not allow for code switching) explicitly conforms to ECMA-35 via ECMA-43. It is better thought of as the “hand” to ECMA-48’s “glove”, ECMA-48 being at the fundamental level moulded to ECMA-35.

By the way, while the original ISO-8859-1 and EUC-KR conform to ECMA-35, this is not true of the descendants which go by their names under current HTML standards (Windows-1252 and Unified Hangul Code respectively, though the latter is still listed with the preferred label “EUC-KR” due to UHC lacking a truly well supported label which isn’t merely poached from EUC-KR or, worse, KS C 5601 itself).

Although ECMA-48 does include simple summaries of concepts such as C1 control functions and independent control functions, giving just enough information to be implementable without reference to ECMA-35, it is not possible to truly understand ECMA-48 at the deep level without a thorough understanding of ECMA-35.

ECMA-48 is best known as the standard for “ANSI escapes”. In fact, the general syntax for escape sequences is defined by ECMA-35. For the most part, the specific sequences except for ECMA-35 designations and announcements are defined by ECMA-48, although ECMA-48’s jurisdiction is somewhat more limited when an alternate C1 set is selected. To wit, ECMA-35 divides escape sequences firstly by the byte immediately following the 0x1B:

If it is between 0x20 and 0x2F inclusive, it is a type nF escape sequence. More specifically, if it is 0x20, it is a type 0F escape sequence, and so forth until 15F.
If it is between 0x30 and 0x3F inclusive, it is a type Fp escape sequence. These are private-use control functions.
If it is between 0x40 and 0x5F inclusive, it is a type Fe escape sequence. These are associated with C1 control codes. This is often, but not always, the C1 control set defined by ECMA-48.
If it is between 0x60 and 0x7E inclusive, it is a type Fs escape sequence. These are control functions with specific meanings defined by ECMA-35 and/or ECMA-48, independently of which C1 control code set is selected.
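
As a rough illustration of the four-way split above, here is a minimal Python sketch. The function names are my own invention, not anything from ECMA-35, and the Fe-to-C1 mapping (add 0x40 to the second byte to get the 8-bit form) is my reading of the standard rather than something spelled out above.

```python
def classify_escape(second_byte: int) -> str:
    """Classify an escape sequence by the byte immediately following ESC (0x1B)."""
    if 0x20 <= second_byte <= 0x2F:
        # 0x20 gives "0F", 0x21 gives "1F", and so on up to "15F".
        return f"{second_byte - 0x20}F"
    if 0x30 <= second_byte <= 0x3F:
        return "Fp"  # private-use control function
    if 0x40 <= second_byte <= 0x5F:
        return "Fe"  # stands in for a C1 control code
    if 0x60 <= second_byte <= 0x7E:
        return "Fs"  # standardised single control function
    raise ValueError(f"0x{second_byte:02X} cannot follow ESC in an escape sequence")


def fe_to_c1(second_byte: int) -> int:
    """Map the second byte of a type Fe sequence to its 8-bit C1 equivalent."""
    if classify_escape(second_byte) != "Fe":
        raise ValueError("not a type Fe escape sequence")
    return second_byte + 0x40


print(classify_escape(ord("[")))   # Fe (ESC [ is the 7-bit form of CSI)
print(hex(fe_to_c1(ord("["))))     # 0x9b
print(classify_escape(ord("$")))   # 4F
```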

Type nF escape sequences may then include zero or more additional bytes between 0x20 and 0x2F, followed by a final “F” byte between 0x30 and 0x7E inclusive. If it is between 0x30 and 0x3F inclusive, the sequence is a type nFp sequence, otherwise it is a type nFt sequence. Generally speaking, the abstract function of an nF escape for a given n is defined by ECMA-35, and the specific nFt escapes are registered in the ISO-IR registry (organised with the involvement of ECMA and IPSJ, the latter of which hosts it). The nFp escapes are private use, and in practice defined by protocol specifications (e.g. ARIB, MARC) or by vendors (e.g. DEC).
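
To make the nF structure concrete, a small parser sketch follows. The name parse_nf and its return shape are hypothetical; the two sample sequences are the familiar ASCII-to-G0 designation (ESC ( B) and DEC’s private special graphics designation (ESC ( 0), picked purely as examples.

```python
def parse_nf(seq: bytes):
    """Split a complete type nF escape sequence into (intermediates, final, kind)."""
    if len(seq) < 3 or seq[0] != 0x1B or not 0x20 <= seq[1] <= 0x2F:
        raise ValueError("not a type nF escape sequence")
    i = 1
    # The first 0x20-0x2F byte fixes n; any further ones are additional intermediates.
    while i < len(seq) and 0x20 <= seq[i] <= 0x2F:
        i += 1
    if i != len(seq) - 1 or not 0x30 <= seq[i] <= 0x7E:
        raise ValueError("missing or misplaced final byte")
    final = seq[i]
    kind = "nFp" if final <= 0x3F else "nFt"  # private use versus registered
    return seq[1:i], final, kind


print(parse_nf(b"\x1b(B"))  # (b'(', 66, 'nFt')
print(parse_nf(b"\x1b(0"))  # (b'(', 48, 'nFp')
```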

An escape sequence (started with ESC) should not be confused with a control sequence (started with CSI), even though CSI (as a C1 control) might itself be represented as an escape sequence. To wit, out of the following, ESC [ is an escape sequence, while the entire thing is a control sequence:

ESC [  1  m
1B  5B 31 6D

CSI sequences are defined entirely by ECMA-48, although they do not conflict with ECMA-35 (for a control function to change the interpretation of a sequence of non-control bytes following it is entirely acceptable). Essentially, they can afaict be used whereëver CSI is available (and although CSI’s function and the individual CSI functions are defined by ECMA-48, CSI itself is available in one or two of the ITU T.101 C1 sets, not just in the ECMA-48 one).

Although they occupy completely different namespaces, the syntax of a CSI sequence is broadly similar, but not identical, to that of an ESC sequence. The biggest differences are that the presence, absence or identity of a first 0x20–0x2F intermediate byte does not fundamentally change the type of CSI sequence, and that the intermediate and final bytes (identifying the specific CSI function) may be preceded by a sequence of parameter bytes between 0x30 and 0x3F inclusive (i.e. including digits and a small subset of the punctuation characters, notably the semicolon). Thus, the 0x30–0x3F range cannot be used for final bytes of a CSI sequence, hence the private use range for final bytes of CSI sequences is 0x70–0x7E instead (and certain parameter bytes are also reserved for denoting private behaviour).
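
A sketch of that CSI syntax, for comparison with the escape sequence structure. The function name and the returned dictionary are made up for illustration. Treating a parameter string that starts with 0x3C–0x3F as private is my understanding of which parameter bytes ECMA-48 reserves, so take that part as an assumption.

```python
def parse_csi(seq: bytes) -> dict:
    """Split a complete CSI sequence into parameter, intermediate and final bytes."""
    if seq[:2] == b"\x1b[":      # 7-bit form: CSI represented as ESC [
        body = seq[2:]
    elif seq[:1] == b"\x9b":     # 8-bit form: CSI as a single C1 byte
        body = seq[1:]
    else:
        raise ValueError("does not start with CSI")
    i = 0
    while i < len(body) and 0x30 <= body[i] <= 0x3F:   # parameter bytes
        i += 1
    j = i
    while j < len(body) and 0x20 <= body[j] <= 0x2F:   # intermediate bytes
        j += 1
    if j != len(body) - 1 or not 0x40 <= body[j] <= 0x7E:
        raise ValueError("missing or misplaced final byte")
    params, final = body[:i], body[j]
    # Assumption: private if the final byte is 0x70-0x7E or the parameter
    # string opens with one of 0x3C-0x3F ("<", "=", ">", "?").
    private = final >= 0x70 or (len(params) > 0 and 0x3C <= params[0] <= 0x3F)
    return {"params": params, "intermediates": body[i:j],
            "final": chr(final), "private": private}


print(parse_csi(b"\x1b[1m"))    # params b'1', final 'm', not private
print(parse_csi(b"\x1b[?25h"))  # params b'?25', final 'h', private
```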

Back on the topic of ECMA-35. Much of ECMA-35 is actually permitted inside of UCS/Unicode, in that C0 and C1 control codes and escape sequences can still exist. Obviously the character-code switching is forbidden (there are no G0/G1/G2/G3 sets to speak of when literally inside Unicode), but switching C0 and C1 sets seems to be permitted to an extent (though sticking to ECMA-48 unless a protocol requires otherwise seems to be recommended, and the C0 format effectors (except BS), the C0 information separators and ECMA-48’s NEL have special Unicode properties for whitespaceness, line breaking and/or bidi behaviour). The main difference is that the escape sequence bytes have to be padded to the size of the code unit, and therefore do not necessarily match their ECMA-35 specified format at the byte level.
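
To show what that padding looks like in practice, here is the ESC [ 1 m sequence from earlier put through Python’s standard codecs (big-endian, no byte order mark, just to keep the dumps readable):

```python
seq = "\x1b[1m"

utf8 = seq.encode("utf-8")
utf16 = seq.encode("utf-16-be")
utf32 = seq.encode("utf-32-be")

# In UTF-8 the bytes are exactly those ECMA-35 specifies; in the wider
# forms each one is zero-padded out to the size of the code unit.
print(utf8.hex(" "))   # 1b 5b 31 6d
print(utf16.hex(" "))  # 00 1b 00 5b 00 31 00 6d
print(utf32.hex(" "))  # 00 00 00 1b 00 00 00 5b 00 00 00 31 00 00 00 6d
```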

Similarly, Unicode’s category Cc codepoints correspond to the ECMA-35 CL and CR regions having undergone the padding in question. When in UTF-16 or UTF-32, that is. While it is certainly possible for a terminal designed for thorough UTF-8 support from square one to recognise the UTF-8 codes for the CR region as C1 controls, this isn’t necessarily the case (nor would it be especially helpful, since they would be the same number of bytes as the escape sequences anyway). While some terminals might recognise isolated UTF-8 continuation bytes as C1 controls (which is not, of course, strictly valid UTF-8), the recommended representation is (like in non-ECMA-35 encodings such as DOS-850 or Windows-1252) the 7-bit escape code, in practice.
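
The byte-count point can be checked quickly. The helper below is hypothetical, and relies on the usual ECMA-35 correspondence between a C1 byte and ESC followed by that byte minus 0x40.

```python
import unicodedata


def c1_representations(code_point: int) -> dict:
    """Return the competing encodings of one C1 control (0x80-0x9F)."""
    if not 0x80 <= code_point <= 0x9F:
        raise ValueError("not in the CR region")
    return {
        "raw_8bit": bytes([code_point]),              # lone byte, not valid UTF-8
        "utf8": chr(code_point).encode("utf-8"),      # padded code point, as UTF-8
        "escape": bytes([0x1B, code_point - 0x40]),   # 7-bit type Fe escape sequence
    }


nel = c1_representations(0x85)
print(unicodedata.category("\x85"))            # Cc
print(nel["utf8"].hex(" "))                    # c2 85
print(nel["escape"].hex(" "))                  # 1b 45
print(len(nel["utf8"]) == len(nel["escape"]))  # True: nothing saved over ESC E
```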

Unrelatedly, generating a chart from the available mapping data for the mysterious KPS 10721 raises more questions than answers. Its trail bytes occupy the full range 0x00–0xFF, while its lead bytes occupy the range 0x34–0x92. Both are confusing, but it is the lead byte range which is the most confusing.

While it does not seem to overlap with 0xA1–0xFE, and can therefore co-exist with the main plane of KPS 9566 when it’s encoded over that region (excluding the UHC-style additions from 2003 and 2011, which it presumably predates since it is from 2000), it cannot co-exist with ASCII. While not co-existing with ASCII isn’t necessarily truly bizarre, KPS 9566 was certainly designed to (hence its usual invocation over 0xA1–0xFE, even in its Unihan source references) and 0x34 is a very odd place to start… unless there are non-hanja rows before it for which the mapping and charts are not available outside of North Korea.

It’s not even clear whether the C0 controls (assuming they’re even used) are encoded with one or two bytes, since most encodings which retain the use of one-byte C0 controls avoid using them as trail bytes altogether, and the trail byte range is clearly 0x00–0xFF (in a seemingly seamless fashion, so this does not seem to be an obvious extension of something earlier, unlike e.g. KPS 9566-2011 versus KPS 9566-97).

It could, of course, be like its South Korean counterparts which, while defining codepoints, are not encodeable in any existing encoding nor very useful in isolation (due to only including relatively uncommon characters), and hence are mainly notable as Unicode sources alone.

2020-09-02

WHATWG’s encoder for Windows-1258 is not fit for purpose. Its decoder’s probably fine though.

Given that this is apparently now established without anyone saying anything, I can only assume that Vietnamese-language pages accepting form submissions either don’t use Windows-1258, or use backends built to fix characters mangled into numerical character entities. Unless it’s conventional for Unicode-equipped Vietnamese IMEs to use some weird not-in-any-normalised-form representation for Windows-1258 legacy reasons, which I suppose is also possible.

Windows-1258 is indeed a Windows-125x encoding (like Windows-1251 or Windows-1252), but it is also a single-byte Vietnamese Latin (Quốc ngữ; ASCIIfies as e.g. Quo^'c ngu*~, Quoocs ngwx) encoding. This brings its own unique issues.

Vietnamese requires 134 non-ASCII letters in total, due to letters taking both modifier diacritics (to create additional letters) and tone diacritics (which might be applied to whichever vowel, including Y and modified vowel letters, in whichever letter case). This number exceeds the:

94 characters available to a 94-character GR set, like the JIS X 0201 kana.
96 characters available to a 96-character GR set, like the ISO 8859 right-hand sides.
128 characters available to an extended-ASCII SBCS which structures its right-hand side in a non-ECMA-35 manner, such as Windows-1252.
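
For what it’s worth, the figure of 134 can be reproduced by brute force: twelve vowel letters (counting the modified ones), six tones including the unmarked one, two letter cases, minus the twelve plain ASCII vowels, plus Đ and đ. A quick sketch using the standard unicodedata module (the variable names are mine):

```python
import unicodedata

# Base and modified vowel letters, normalised so each is one code point.
VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# Unmarked, grave, hook above, tilde, acute, dot below.
TONES = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]

letters = {
    unicodedata.normalize("NFC", vowel + tone)
    for vowel in VOWELS + VOWELS.upper()
    for tone in TONES
}
letters |= {"đ", "Đ"}

# Every one of them has a precomposed Unicode code point.
assert all(len(letter) == 1 for letter in letters)

non_ascii = {letter for letter in letters if ord(letter) > 0x7F}
print(len(non_ascii))  # 134
```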

Accordingly, legacy Vietnamese encodings employ a variety of different tricks to allow this to fit, depending on the particular encoding:

Only including non-toned versions of uppercase letters, requiring the use of lowercase letters in an all-capital font in contexts where toned uppercase letters are needed (TCVN/VSCII level 3).
Replacing those C0 control codes least likely to (a) be in use as format effectors or (b) choke gateways which interpret them (VPS, VISCII and TCVN/VSCII level 1).
Replacing the ECMA-6 nationalisable characters as well as adding a 128-character extension (VNI for DOS).
Using combining marks for all diacritics, possibly (not necessarily) with a few exceptions for combinations which would be problematic to implement as combining marks with a non-advanced-typography font (ANSEL, VNI for Windows and VNI for Macintosh).
Using combining marks for tone marks, whilst possibly including some tone-marked letters already, and encoding the modified vowels as atomic characters as opposed to compositions (Windows-1258 and TCVN/VSCII levels 2 and 1).

(VSCII and VISCII are both abbreviated from “Vietnamese Standard Code for Information Interchange” but are unrelated, independent efforts. The latter was first published under that name slightly earlier to the best I can tell, but the former is the Vietnam national (TCVN) standard, namely TCVN 5712. In practice, VSCII seems mostly known as “TCVN”, while its custom-encoded fonts are apparently conventionally marked as “.VN” and known as “ABC fonts”. TCVN 5712 defines three levels, with the higher numbered levels including fewer of its non-ASCII characters: level 1 contains all characters, level 2 can be used within ECMA-35, and level 3 excludes combining characters and toned bicamerality. VPS and VNI are the manufacturers of influential input methods, and as such their own proprietary encodings gained some note, although I gather they do support other encodings. Finally, ANSEL is the legacy charset which was adopted for Latin letters in Library of Congress records.)

Anyway, you will hopefully have noticed that Windows-1258 uses a hybrid approach of pre-composed characters (for characters with modifier diacritics, and also for an arbitrary selection of characters with tone diacritics, namely those that existed in Windows-1252 already) and combining characters (for toned letters which don’t exist in Windows-1252, and also a valid alternative representation of those which do).

So for example, take the ố (lowercase circumflexed O with rising tone; the third letter in the term “Quốc ngữ”):

Code page 1258: 0xF4 0xEC (circumflexed lowercase O, combining acute).
Table-mapped to Unicode: U+00F4+0301.
- For reference, this is 0xAB 0xB3 in TCVN/VSCII (levels 1 and 2).
NFC form of the above: U+1ED1.
- Cannot map directly to Windows-1258: comes out as &#7889; with html error mode.
- For reference, this is 0xD3 in VPS, 0xAF in VISCII, and 0xE8 in TCVN/VSCII (all levels).
NFD form of the above: U+006F+0302+0301 (lowercase O, combining circumflex, combining acute).
- U+0302 cannot map directly to Windows-1258, so the whole thing comes out as o&#770; followed by 0xEC (the combining acute).
- For reference, this is 0x6F 0xE1 in VNI for Windows (since it has a separate character for U+0302+0301, namely 0xE1 in lowercase and 0xC1 in uppercase), 0x6F 0x87 in VNI for Macintosh (basically equivalent to the same put through a Windows-1252→MacRoman routine), and 0xE2 0xE3 0x6F in ANSEL (order reversed, since ANSEL diacritics precede what they modify).
- It also so happens to be 0xC2 0xC3 0x6F (again, reversed compared to Unicode) in ITU T.51 and ITU T.61, although it is not part of the ISO/IEC 6937 repertoire (also ITU T.51 Annex A), and therefore would not be considered valid in all contexts. Furthermore, T.51/T.61 includes an insufficient set of diacritics to properly support Vietnamese in the first place.
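
The Windows-1258 parts of the above can be reproduced with Python’s built-in cp1258 codec. Note that I am using the xmlcharrefreplace error handler as a stand-in for WHATWG’s html error mode; the two are close enough for this purpose, but they are not the same implementation.

```python
import unicodedata

table_mapped = b"\xf4\xec".decode("cp1258")        # U+00F4 U+0301
nfc = unicodedata.normalize("NFC", table_mapped)   # U+1ED1
nfd = unicodedata.normalize("NFD", table_mapped)   # U+006F U+0302 U+0301

# Only the non-normalised, table-mapped form survives the trip back.
print(table_mapped.encode("cp1258"))               # b'\xf4\xec'

# The NFC form has no Windows-1258 byte, so it degrades to an entity.
print(nfc.encode("cp1258", "xmlcharrefreplace"))   # b'&#7889;'

# In the NFD form, only the combining circumflex fails to map.
print(nfd.encode("cp1258", "xmlcharrefreplace"))   # b'o&#770;\xec'
```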

One would presume that an input method actually designed for Unicode would produce one of the normalised forms (in practice, NFC is preferred over NFD, with HFS+ filenames being an exception iirc), and conversely that it cannot be relied upon to produce the unusual, non-normalised form U+00F4+0301.

Hence, WHATWG’s (and Python’s, for that matter) encoder for Windows-1258 is insufficient for anything besides round-tripping non-normalised data decoded from Windows-1258 to begin with.
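
For illustration only, here is roughly what a more forgiving encoder might do. This is my own sketch on top of Python’s cp1258 codec, not anything WHATWG or Python provides, and the function names are hypothetical. The idea is to normalise to NFC first, then, for any character which doesn’t map directly, fold the modifier diacritic (circumflex, breve or horn) back into the base letter and leave the tone mark combining.

```python
import unicodedata

# Diacritics which Windows-1258 only offers as part of an atomic letter.
MODIFIERS = {"\u0302", "\u0306", "\u031b"}


def to_cp1258_form(ch: str) -> str:
    """Re-split one precomposed letter into modified base plus combining tone."""
    base, *marks = unicodedata.normalize("NFD", ch)
    modifiers = "".join(m for m in marks if m in MODIFIERS)
    tones = "".join(m for m in marks if m not in MODIFIERS)
    return unicodedata.normalize("NFC", base + modifiers) + tones


def encode_cp1258(text: str, errors: str = "strict") -> bytes:
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        try:
            out += ch.encode("cp1258")        # direct hit, e.g. ASCII or U+00F4
        except UnicodeEncodeError:
            out += to_cp1258_form(ch).encode("cp1258", errors)
    return bytes(out)


print(encode_cp1258("\u1ed1"))          # b'\xf4\xec', from the NFC form
print(encode_cp1258("o\u0302\u0301"))   # b'\xf4\xec', from the NFD form
```

Characters which genuinely have no Windows-1258 representation still raise (or are handled per the errors argument), so this only widens what gets accepted, as opposed to papering over everything.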