Character encoding related rambling


If you think ECMA-35 is solely about character-code switching, you do not understand it.

(Incidentally, stop using ECMA as an abbreviation for ECMAScript. ECMA-262 is not the only ECMA specification out there, or even the only important one.)

While it is true that a decent portion of the length of ECMA-35 is taken up by the nitty gritty of character-code switching, and that it is most infamous for contributing the code switching escapes used in traditional Japanese email systems (usually so known under its other designations of ISO 2022 and JIS X 0202), ECMA-35’s scope is broader than that, and even ISO 8859-1 (ECMA-94-L1, which does not allow for code switching) explicitly conforms to ECMA-35 via ECMA-43. It is better thought of as the “hand” to ECMA-48’s “glove”, ECMA-48 being at the fundamental level moulded to ECMA-35.

By the way, while the original ISO-8859-1 and EUC-KR conform to ECMA-35, this is not true for the descendants which go by their names under current HTML standards (Windows-1252 and Unified Hangul Code respectively, though the latter is still listed with the preferred label “EUC-KR” due to UHC lacking a truly well supported label which isn’t merely poached from EUC-KR or, worse, KS C 5601 itself).

Although ECMA-48 does include simple summaries of concepts such as C1 control functions and independent control functions, giving just enough information to be implementable without reference to ECMA-35, it is not possible to truly understand ECMA-48 at a deep level without a thorough understanding of ECMA-35.

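ECMA-48 is best known as the standard for “ANSI escapes”. In fact, the general syntax for escape sequences is defined by ECMA-35. For the most part, the specific sequences, except for ECMA-35 designations and announcements, are defined by ECMA-48, although ECMA-48’s jurisdiction is somewhat more limited when an alternate C1 set is selected. To wit, ECMA-35 divides escape sequences firstly by the byte immediately following the 0x1B: 0x20–0x2F starts a type nF sequence (n being 0 to 15, according to which of those bytes it is), 0x30–0x3F makes a type Fp sequence (private use), 0x40–0x5F makes a type Fe sequence (the 7-bit representation of a C1 control function), and 0x60–0x7E makes a type Fs sequence (a single standardised control function).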

Type nF escape sequences may then include zero or more additional bytes between 0x20 and 0x2F, followed by a final “F” byte between 0x30 and 0x7E inclusive. If it is between 0x30 and 0x3F inclusive, the sequence is a type nFp sequence, otherwise it is a type nFt sequence. Generally speaking, the abstract function of an nF escape for a given n is defined by ECMA-35, and the specific nFt escapes are registered in the ISO-IR registry (organised with the involvement of ECMA and IPSJ, the latter of which hosts it). The nFp escapes are private use, and in practice defined by protocol specifications (e.g. ARIB, MARC) or by vendors (e.g. DEC).
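As a rough illustration of that classification, here is a minimal Python sketch. The function name classify_escape is made up for this post, and it assumes a complete escape sequence in its 7-bit form. The two-byte types (Fp, Fe, Fs) follow ECMA-35's division of the byte after ESC into 0x30–0x3F, 0x40–0x5F and 0x60–0x7E respectively.

```python
def classify_escape(seq: bytes) -> str:
    """Name the ECMA-35 type of one complete 7-bit escape sequence (hypothetical helper)."""
    if len(seq) < 2 or seq[0] != 0x1B:
        raise ValueError("not an escape sequence")
    first = seq[1]
    if 0x20 <= first <= 0x2F:
        # Type nF: n is given by which intermediate byte comes first.
        if len(seq) < 3:
            raise ValueError("nF sequence lacks a final byte")
        n = first - 0x20
        extra, final = seq[2:-1], seq[-1]
        if any(not 0x20 <= b <= 0x2F for b in extra):
            raise ValueError("non-intermediate byte before the final byte")
        if not 0x30 <= final <= 0x7E:
            raise ValueError("invalid final byte")
        # Final bytes 0x30-0x3F are private use (nFp), the rest registered (nFt).
        return f"{n}F" + ("p" if final <= 0x3F else "t")
    if len(seq) != 2:
        raise ValueError("two-byte escape sequence expected")
    if 0x30 <= first <= 0x3F:
        return "Fp"
    if 0x40 <= first <= 0x5F:
        return "Fe"
    if 0x60 <= first <= 0x7E:
        return "Fs"
    raise ValueError("invalid byte after ESC")


print(classify_escape(b"\x1b$B"))  # 4Ft (0x24 is the fifth intermediate byte, so n = 4)
print(classify_escape(b"\x1b(0"))  # 8Fp (final byte 0x30 is in the private range)
print(classify_escape(b"\x1b["))   # Fe  (the 7-bit form of CSI)
```

It deliberately says nothing about what each sequence does, only which namespace it falls into.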

An escape sequence (started with ESC) should not be confused with a control sequence (started with CSI), even though CSI (as a C1 control) might itself be represented as an escape sequence. To wit, out of the following, ESC [ is an escape sequence, while the entire thing is a control sequence:

ESC [  1  m
1B  5B 31 6D
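To make the relationship concrete, here is a small Python sketch of the mapping between a type Fe escape sequence and the 8-bit C1 byte it stands for (the second byte plus 0x40). The names fe_to_c1 and to_8bit are invented for illustration, and the sketch assumes an 8-bit ECMA-35 environment where the C1 bytes are actually available. It is a naive byte-by-byte pass which does not track control strings or anything else stateful.

```python
ESC = 0x1B


def fe_to_c1(seq: bytes) -> int:
    """Return the 8-bit C1 byte represented by a two-byte type Fe escape sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return seq[1] + 0x40


def to_8bit(data: bytes) -> bytes:
    """Replace every ESC Fe pair in data with the corresponding C1 byte (naive sketch)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and 0x40 <= data[i + 1] <= 0x5F:
            out.append(data[i + 1] + 0x40)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(hex(fe_to_c1(b"\x1b[")))  # 0x9b, i.e. CSI as a single C1 byte
print(to_8bit(b"\x1b[1m"))      # b'\x9b1m', the same control sequence in 8-bit form
```

Either way, the control sequence is CSI followed by 1 and m; only the representation of CSI itself differs.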

CSI sequences are defined entirely by ECMA-48, although they do not conflict with ECMA-35 (for a control function to change the interpretation of a sequence of non-control bytes following it is entirely acceptable). Essentially, they can afaict be used whereëver CSI is available (and although CSI’s function and the individual CSI functions are defined by ECMA-48, CSI itself is available in one or two of the ITU T.101 C1 sets, not just in the ECMA-48 one).

Although they occupy completely different namespaces, the syntax of a CSI sequence is broadly similar, but not identical, to that of an ESC sequence. The biggest differences are that the presence, absence or identity of a first 0x20–0x2F intermediate byte does not fundamentally change the type of CSI sequence, and that the intermediate and final bytes (identifying the specific CSI function) may be preceded by a sequence of parameter bytes between 0x30 and 0x3F inclusive (i.e. including digits and a small subset of the punctuation characters, notably the semicolon). Thus, the range 0x30–0x3F cannot be used for final bytes of a CSI sequence, hence the private use range for final bytes of CSI sequences is 0x70–0x7E instead (and certain parameter bytes are also reserved for denoting private behaviour).
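The structure just described can be sketched as a small Python parser. The name parse_csi and the shape of the returned dictionary are made up for this example. It assumes one complete control sequence, introduced either by ESC [ or by the single byte 0x9B, and it treats a parameter string starting with a byte in 0x3C–0x3F as private, per my reading of ECMA-48.

```python
def parse_csi(seq: bytes) -> dict:
    """Split one complete CSI sequence into parameter, intermediate and final bytes."""
    if seq[:2] == b"\x1b[":
        rest = seq[2:]
    elif seq[:1] == b"\x9b":
        rest = seq[1:]
    else:
        raise ValueError("does not start with CSI")
    i = 0
    while i < len(rest) and 0x30 <= rest[i] <= 0x3F:  # parameter bytes
        i += 1
    j = i
    while j < len(rest) and 0x20 <= rest[j] <= 0x2F:  # intermediate bytes
        j += 1
    # Exactly one byte must remain, and it must be a valid final byte.
    if j != len(rest) - 1 or not 0x40 <= rest[j] <= 0x7E:
        raise ValueError("malformed or incomplete control sequence")
    params, final = rest[:i], rest[j]
    private = final >= 0x70 or (len(params) > 0 and 0x3C <= params[0] <= 0x3F)
    return {
        "parameters": params,
        "intermediates": rest[i:j],
        "final": chr(final),
        "private": private,
    }


print(parse_csi(b"\x1b[1;31m"))
# {'parameters': b'1;31', 'intermediates': b'', 'final': 'm', 'private': False}
print(parse_csi(b"\x1b[?25h")["private"])  # True, since the parameters start with 0x3F
```

Note that the parameter bytes are returned unparsed: splitting on the semicolon is a convention of the individual functions, not of the outer syntax.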

Back on the topic of ECMA-35. Much of ECMA-35 is actually permitted inside of UCS/Unicode, in that C0 and C1 control codes and escape sequences can still exist. Character-code switching is obviously forbidden (there are no G0/G1/G2/G3 sets to speak of when literally inside Unicode), but switching C0 and C1 sets seems to be permitted to an extent. That said, sticking to ECMA-48 unless a protocol requires otherwise seems to be recommended, and the C0 format effectors (except BS), the C0 information separators and ECMA-48’s NEL have special Unicode properties for whitespaceness, line breaking and/or bidi behaviour. The main difference is that the escape sequence bytes have to be padded to the size of the code unit, and therefore do not necessarily match their ECMA-35 specified format at the byte level.
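The padding is easy to see by encoding the same control sequence in different Unicode encoding forms. A quick Python sketch, where code_units is a helper invented for this post and the big-endian, BOM-less codecs are chosen purely to keep the output readable:

```python
def code_units(text: str, encoding: str, width: int) -> list:
    """Encode text and return its code units as hex strings, width bytes apiece."""
    raw = text.encode(encoding)
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]


SGR_BOLD = "\x1b[1m"  # ESC [ 1 m

print(code_units(SGR_BOLD, "ascii", 1))      # ['1b', '5b', '31', '6d']
print(code_units(SGR_BOLD, "utf-16-be", 2))  # ['001b', '005b', '0031', '006d']
print(code_units(SGR_BOLD, "utf-32-be", 4))  # ['0000001b', '0000005b', '00000031', '0000006d']
```

The sequence of code unit values is the same each time, but only the first is byte-for-byte what ECMA-35 describes.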

Similarly, Unicode’s category Cc codepoints correspond to the ECMA-35 CL and CR regions (plus DEL) having undergone the padding in question. When in UTF-16 or UTF-32, that is. While it is certainly possible for a terminal designed for thorough UTF-8 support from square one to recognise the UTF-8 codes for the CR region as C1 controls, this isn’t necessarily the case (nor would it be especially helpful, since they would be the same number of bytes as the escape sequences anyway). While some terminals might recognise isolated UTF-8 continuation bytes as C1 controls (which is not, of course, strictly valid UTF-8), in practice the recommended representation is the 7-bit escape sequence, just as in non-ECMA-35 encodings such as DOS-850 or Windows-1252.
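The point about byte counts can be checked directly. In this Python sketch, c1_representations is a made-up helper which returns both the UTF-8 encoding of a C1 codepoint and the equivalent 7-bit escape sequence (ESC followed by the codepoint minus 0x40):

```python
import unicodedata


def c1_representations(c1: int) -> tuple:
    """Return (UTF-8 bytes, 7-bit escape sequence) for a C1 codepoint in 0x80-0x9F."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not in the CR region")
    utf8 = chr(c1).encode("utf-8")
    escape = bytes([0x1B, c1 - 0x40])
    return utf8, escape


utf8, escape = c1_representations(0x85)   # NEL
print(utf8, escape)                       # b'\xc2\x85' b'\x1bE'
print(unicodedata.category(chr(0x85)))    # Cc

# Both forms take two bytes for every C1 control.
print(all(len(u) == len(e) == 2
          for u, e in map(c1_representations, range(0x80, 0xA0))))  # True
```

So moving from ESC E to the UTF-8 form of U+0085 saves nothing, which is rather the point.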

Unrelatedly, generating a chart from the available mapping data for the mysterious KPS 10721 raises more questions than answers. Its trail bytes occupy the full range 0x00–0xFF, while its lead bytes occupy the range 0x34–0x92. Both are confusing, but it is the lead byte range which is the most confusing.

While it does not seem to overlap with 0xA1–0xFE, and can therefore co-exist with the main plane of KPS 9566 when it’s encoded over that region (excluding the UHC-style additions from 2003 and 2011, which it presumably predates since it is from 2000), it cannot co-exist with ASCII. While not co-existing with ASCII isn’t necessarily truly bizarre, KPS 9566 was certainly designed to (hence its usual invocation over 0xA1–0xFE, even in its Unihan source references) and 0x34 is a very odd place to start… unless there are non-hanja rows before it for which the mapping and charts are not available outside of North Korea.
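The range arithmetic behind that is trivial, but for the record, here is a Python sketch. The lead byte range is the one observed from the mapping data above, and the constant and function names are invented for this post:

```python
KPS_10721_LEAD = (0x34, 0x92)  # lead bytes observed in the mapping data
GR_94 = (0xA1, 0xFE)           # where KPS 9566's main plane is usually invoked
ASCII_GRAPHIC = (0x21, 0x7E)   # the ASCII graphical characters


def overlap(a: tuple, b: tuple):
    """Return the inclusive intersection of two inclusive ranges, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


print(overlap(KPS_10721_LEAD, GR_94))  # None
shared = overlap(KPS_10721_LEAD, ASCII_GRAPHIC)
print([hex(b) for b in shared])        # ['0x34', '0x7e']
print(chr(shared[0]), chr(shared[1]))  # 4 ~
```

That is, every ASCII character from the digit 4 onwards would collide with a lead byte.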

It’s not even clear whether the C0 controls (assuming they’re even used) are encoded with one or two bytes, since most encodings which retain the use of one-byte C0 controls avoid using them as trail bytes altogether, and the trail byte range is clearly 0x00–0xFF (in a seemingly seamless fashion, so this does not seem to be an obvious extension of something earlier, unlike e.g. KPS 9566-2011 versus KPS 9566-97).

It could, of course, be like its South Korean counterparts which, while defining codepoints, are not encodable in any existing encoding nor very useful in isolation (due to only including relatively uncommon characters), and hence are mainly notable as Unicode sources alone.