JIS Character Sets Explained
ISO 646, JIS X 0201 and early representations
The ISO-646 7-bit character set standard is best known in what is actually its United States variant (ISO-646-US, more often called ASCII, or American Standard Code for Information Interchange), but there was previously a plethora of national variants. Most characters were to be supported by all variants, but certain codepoints could be changed so that more nationally relevant symbols, which would otherwise be missed out, could be supported. For example, the UK version occupied 0x23 with a pound sign (£, for the GBP currency) and 0x7E with an overline (‾) or spacing macron (¯). But in US-ASCII, these are respectively the octothorpe (#) and simple tilde (~).
The Japanese variant (ISO-646-JP, also called JISCII or JIS-Roman) treated 0x23 like in ASCII and 0x7E like in ISO-646-GB, but occupied 0x5C (used for the backslash \ in both UK and US variants) with the Latin-based yen sign ¥. So, it differs (minorly in practice) from ASCII at 0x7E, and it differs significantly from ASCII at 0x5C. Keep note of that; it becomes important later.
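To make the differences concrete, here is a minimal sketch of the three variants as override tables on top of ASCII. The names `VARIANTS`, `decode_646` and `differences` are my own for illustration, only the code points mentioned above are listed, and I have arbitrarily picked the overline (rather than the spacing macron) for 0x7E.

```python
# Overrides relative to ASCII, limited to the code points discussed above.
# Assumption: 0x7E is shown as the overline U+203E in both GB and JP.
VARIANTS = {
    "ISO-646-US": {},
    "ISO-646-GB": {0x23: "\u00a3", 0x7E: "\u203e"},
    "ISO-646-JP": {0x5C: "\u00a5", 0x7E: "\u203e"},
}

def decode_646(data: bytes, variant: str) -> str:
    """Decode 7-bit bytes, applying the named variant's overrides."""
    table = VARIANTS[variant]
    return "".join(table.get(b, chr(b)) for b in data)

def differences(a: str, b: str) -> list:
    """Code points at which two of the listed variants disagree."""
    keys = set(VARIANTS[a]) | set(VARIANTS[b])
    return sorted(k for k in keys if VARIANTS[a].get(k) != VARIANTS[b].get(k))

print(decode_646(b"#\\~", "ISO-646-JP"))
print([hex(c) for c in differences("ISO-646-US", "ISO-646-JP")])
```

The same three bytes come out differently under each table, which is exactly the interoperability problem the national variants created.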
The macron is used to indicate long vowels in certain systems of romanising Japanese: such a system might write Tōkyō, with alternatives including Tôkyô (using the circumflex), Toukyou (following Hiragana orthography), Tookyoo (doubling long vowels), Tohkyoh (using the “oh” digraph) or Tokyo (discarding vowel length). The spacing overline is not of much use in modern-day computing, but on a typewriter/teletype it could be combined with backspace to be overstamped onto the letter.
But a 7-bit stateless form of JISCII can only represent Japanese in romanised form, because neither Kanji/Hiragana nor even Katakana has adequate encoding space within the nationalisable code points. There are two ways of working around that; I’ll get to these and their applications in due course.
Compare that with Morse code, where the prosign (control character) -..--- (N+J or D+O run together, meaning “next Japanese”) was used to switch to Kana mode. The ...-. (S+N or V+E) control was used to switch back. The Kana morse codes most probably are not relevant to your life, so I will not list them all out here, but the curious can check Wikipedia. Hiragana and Katakana were not distinguished by this encoding, so Katakana alone was used to transcribe the message.
ASCII was designed as a 7-bit character set, with the eighth bit being either absent, a checksum or unused (ASCII itself didn’t specify). This contrasted with EBCDIC (Extended Binary Coded Decimal Interchange Code), which was an 8-bit character set with twice as many codepoints. But EBCDIC was not well designed: capital and lowercase letters were not in continuous ranges, which made casemapping routines needlessly complicated. EBCDIC was also not even slightly compatible with ISO 646, and mutual compatibility of anything other than English letters and digits between EBCDIC variants was limited (in at least one unusually divergent case, even lowercase letters don’t correspond). Unless you work on IBM mainframes, EBCDIC is not relevant to your life either (except in places you don’t get to see and really don’t want to have to see), so I’ll leave out further details on that.
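The non-contiguous letter ranges are easy to see for yourself. The following sketch uses Python’s built-in `cp037` codec, which is just one EBCDIC variant (chosen here only because it ships with Python); `code_gaps` is a hypothetical helper name.

```python
import string

def code_gaps(letters: str, codec: str) -> list:
    """Return (prev, next) letter pairs whose byte values are not adjacent."""
    codes = letters.encode(codec)
    return [
        (letters[i], letters[i + 1])
        for i in range(len(letters) - 1)
        if codes[i + 1] - codes[i] != 1
    ]

print(code_gaps(string.ascii_lowercase, "ascii"))   # no gaps
print(code_gaps(string.ascii_lowercase, "cp037"))   # gaps after i and r
```

In ASCII a range check such as `'a' <= c <= 'z'` is enough to identify a lowercase letter; in EBCDIC the same check would also admit the non-letter codes sitting in those gaps.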
Combining the advantages of ASCII and EBCDIC, however, was Extended ASCII, which used ASCII for bytes with the high (eighth) bit unset, and an extended encoding for bytes with it set. A particularly influential such setup in the Occident was the DEC Multinational Character Set (or MCS), which stratified into two 16-code rows for extended control characters (mirroring ASCII’s two such rows), two rows of uncased characters (mainly punctuation), and two rows for each letter case. This elegant design meant that casemapping routines could be almost as simple as in ASCII itself. In theory. Subsequent standards and variants messed that up completely. The most common single-byte variant today is Windows-1252, also called WinLatin1 or “Western European (Windows)”, which likely contributed to ISO-646-GB being lost to oblivion due to including the £ with its own codepoint.
Extension was possible on septet-byte machines too, since the Shift Out control character would switch to the non-646 set (equivalent to mapping 0x20–0x7F to 0xA0–0xFF). But it only reached 96 extended characters (0xA0–0xFF), with the use of 0xA0 and 0xFF being potentially technically problematic due to being mapped to the space and delete characters, so many extended ASCII sets were designed to keep the extended characters within the 0xA1–0xFE or 0xA0–0xFF range. Shift In switched back. This is a massive simplification; if you want the details, I refer you to ECMA-35.
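As a rough illustration of the equivalence described above, this sketch rewrites a 7-bit stream that uses Shift Out and Shift In as the corresponding 8-bit bytes. It is deliberately as simplified as the paragraph itself: it ignores everything else in ECMA-35 (designation escapes, single shifts and so on), and `seven_to_eight_bit` is a name I made up.

```python
SO, SI = 0x0E, 0x0F  # Shift Out / Shift In control characters

def seven_to_eight_bit(data: bytes) -> bytes:
    """Rewrite a 7-bit SO/SI stream as 8-bit bytes, setting the high bit
    on everything in 0x20-0x7F that was sent while shifted out."""
    out = bytearray()
    shifted = False
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        elif shifted and 0x20 <= b <= 0x7F:
            out.append(b | 0x80)  # 0x20-0x7F -> 0xA0-0xFF
        else:
            out.append(b)         # controls and unshifted bytes pass through
    return bytes(out)

print(seven_to_eight_bit(b"A\x0e\x36\x45\x0fB").hex())
```

Because the shift is locking, a decoder has to remember the current state, which is what makes such streams awkward to search or truncate.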
Needless to say, Extended ASCII, or rather extended ISO-646, did catch on in Japan too, but with an entirely different extended encoding. Four rows were dedicated to kana, without distinguishing Hiragana (so, dedicated to Katakana), with the Japanese-style comma (、), full stop (。) and speech marks (「」) and the interpunct (・) getting their own codepoints. Diacritic forms were not included separately; the diacritic characters were given separate codepoints and were inserted after the respective kana. The first two rows of the extended area were left open, so the characters were in the reach of Shift Out (and which also theoretically allowed extended control characters in that space), and the last two were unused, which becomes important later. The ISO-646 encoding used in the lower half (meaning, with the high bit unset) was of course Japan’s own variant (ISO-646-JP). Both ISO-646-JP and the single-byte kana set, as well as the 7-bit and 8-bit encodings thereof, were originally standardised by JISC as JIS C 6220, which is now called JIS X 0201.
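The separate diacritic codepoints can be sketched as follows. The offset mapping from 0xA1–0xDF to U+FF61–U+FF9F is the usual Unicode mapping for the upper half of JIS X 0201; the function name `decode_jis_x_0201_kana` is hypothetical.

```python
import unicodedata

def decode_jis_x_0201_kana(data: bytes) -> str:
    """Decode bytes 0xA1-0xDF as halfwidth forms U+FF61-U+FF9F."""
    chars = []
    for b in data:
        if not 0xA1 <= b <= 0xDF:
            raise ValueError(f"not in the JIS X 0201 kana range: {b:#04x}")
        chars.append(chr(0xFF61 + (b - 0xA1)))
    return "".join(chars)

ga = decode_jis_x_0201_kana(b"\xb6\xde")   # KA followed by the voicing mark
print(ga, len(ga))                          # two code points for one syllable
print(unicodedata.normalize("NFKC", ga))   # one precomposed fullwidth kana
```

So a voiced syllable costs two bytes even in this single-byte set, and comparing or sorting text means handling the trailing mark.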
ISO 2022, JIS X 0208 and the Double Byte Character Set
The lack of Kanji (and Hiragana) in JIS X 0201 allowed only a simplified writing system which, while suitable for phonetically transcribing the language to an extent (albeit leaving near homophones distinguished by speech inflection ambiguous), did not retain the semantic and other distinctions achieved through Kanji. Furthermore, the order of the kana codepoints did not make for sensible sorting.
JIS X 0208, originally called JIS C 6226, was an entirely new encoding. Control characters and the ASCII space were represented like in ASCII. The remainder were represented with ASCII printing codepoints in pairs. This meant 94 possible first bytes and 94 possible second bytes, so the character set was arranged into 94 “ku” (rows or wards), each with 94 “ten” (cells). The first byte (minus 32) would specify the row (the rows being numbered 1 to 94, yes, 1-indexed) and the second would similarly specify the cell within that row. A character might be specified by the said bytes (e.g. 0x2121 for the ideographic space).
As well as being represented by their indexing bytes, a form also exists with the codepoints indexed by pairs of 1-indexed numbers (i.e. from 1 to 94), called a “kuten”. The WHATWG encoding standard introduces a further codepoint representation where a singular number counting from zero, called a “pointer”, indexes the characters, such that the first row includes pointers 0–93, the second includes pointers 94–187 and so forth. Neither of these is a character encoding, though; they are academic or internal representations. More character encodings of JIS X 0208 also exist; I’ll come to those later.
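The three representations are related by simple arithmetic. A minimal sketch, with function names of my own choosing:

```python
def bytes_to_kuten(b1: int, b2: int) -> tuple:
    """7-bit JIS X 0208 byte pair -> 1-indexed (ku, ten)."""
    if not (0x21 <= b1 <= 0x7E and 0x21 <= b2 <= 0x7E):
        raise ValueError("both bytes must be in 0x21-0x7E")
    return b1 - 0x20, b2 - 0x20

def kuten_to_pointer(ku: int, ten: int) -> int:
    """1-indexed (ku, ten) -> zero-based pointer."""
    return (ku - 1) * 94 + (ten - 1)

def pointer_to_kuten(pointer: int) -> tuple:
    """Zero-based pointer -> 1-indexed (ku, ten)."""
    ku0, ten0 = divmod(pointer, 94)
    return ku0 + 1, ten0 + 1

ku, ten = bytes_to_kuten(0x21, 0x21)       # the ideographic space
print(ku, ten, kuten_to_pointer(ku, ten))
```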
While not ASCII-compatible or even ISO-646 compatible, this layout allowed it to be used with ISO/IEC 2022 (ECMA-35 or JIS X 0202), which defines (amongst other things) a means of switching between 7-bit encodings with certain structural constraints using sequences with the ESC (escape) control character. A profile of this system, starting off in ASCII and switching to and from JIS X 0208, never using the high bit, became the pre-Unicode standard for Japanese-language e-mails (at the time, e-mails had to be 7-bit clean) and is still relevant today, much to the frustration of the WHATWG (which I’ll come to later). The label “ISO-2022-JP” is given to this usage, formally codified as a profile in RFC 1468, and later included by JISC themselves in the 1997 edition of JIS X 0208. Due to it using the JIS X 0208 two-byte representations unmodified, and due to usage of pure JIS X 0208 being rare, it’s also often simply called the “JIS encoding”, a name which also references the Japanese-language definition of the mechanism being in JIS X 0202. Indeed, JIS X 0202 implementations which support 7-bit JIS X 0208 but exceed (or indeed predate) the relatively conservative profile defined by RFC 1468 may still be referred to as “JIS Encoding”, as a broader or blanket term.
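Python’s standard `iso2022_jp` codec implements this profile, so the escape-sequence switching can be observed directly. This is only a quick demonstration; the sample string is arbitrary.

```python
text = "JIS: 日本語"
encoded = text.encode("iso2022_jp")

print(encoded)                                 # shows the ESC sequences
print(all(b < 0x80 for b in encoded))          # 7-bit clean
print(encoded.decode("iso2022_jp") == text)    # round trips
```

The output starts in ASCII, switches to JIS X 0208 with ESC $ B before the kanji, and returns to ASCII with ESC ( B at the end, as RFC 1468 requires.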
But JIS X 0208 could also be used alone; it included punctuation and letters in addition to kanji and kana, which also meant that Roman letters could be specified in ISO-2022-JP using either ASCII or JIS X 0208. It did not however attempt to sort out a lossless conversion of ASCII data to its own charset (e.g., lacking straight quotation marks and disunifying the hyphen from the minus sign). All of this had other consequences later.
Although there were theoretically 94 rows, the last 10 rows were not yet used, being left out for future expansion. The first two rows were used mainly for punctuation and symbols, followed by one containing Roman letters and Euro-Arabic digits, another containing Hiragana, another containing identically ordered Katakana, another containing Greek letters, another containing Cyrillic letters and an eighth row containing box drawing characters (initially unallocated but allocated in the 1983 version). The following seven rows were left out for future non-Kanji expansion, with the sixteenth row being the start of the Kanji block. These seventeen unallocated rows became important later. The kana ordering differed from the JIS X 0201 order in order to make more sense for sorting, and diacritic kana were provided pre-composed.
A neat feature worth mentioning while we’re still on this topic: unlike the Greek and Cyrillic characters which start at the beginnings of their respective rows, the Roman letters start some distance into their row, skipping out several spaces at the beginning, so the second representing byte matches that letter in ISO 646, so purely Roman text in JIS X 0208 remains fairly readable if misinterpreted as ASCII or as JIS X 0201 (somewhat like #H#e#l#l#o!!#W#o#r#l#d!*).
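That misreading can be reproduced with a few lines, here by borrowing Python’s `iso2022_jp` codec to obtain the 7-bit JIS X 0208 byte pairs and then throwing the escape sequences away. `misread_as_ascii` is a hypothetical helper, and it assumes the input consists only of JIS X 0208 characters.

```python
def misread_as_ascii(text: str) -> str:
    """Encode as ISO-2022-JP, drop the two escape sequences, and read the
    remaining 7-bit JIS X 0208 byte pairs as if they were ASCII."""
    data = text.encode("iso2022_jp")
    data = data.replace(b"\x1b$B", b"").replace(b"\x1b(B", b"")
    return data.decode("ascii")

print(misread_as_ascii("Ｈｅｌｌｏ　Ｗｏｒｌｄ！"))
```

Each fullwidth letter shows up as `#` (the row 3 lead byte, 0x23) followed by the matching ASCII letter, while the ideographic space and fullwidth exclamation mark come from row 1 and so start with `!`.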
This encoding has gone through several versions since its release in 1978. In 1983, several more characters were added (mostly non-kanji symbols in the second row), whereas several level 1 kanji were changed to simpler synoglyphic variants (as “extended shinjitai”); in the process of doing so, some swapped codepoints with existing extended shinjitai in level 2, and a small number had their original unsimplified forms appended to the end of level 2. All of this was significant enough to warrant the new version being assigned a separate ISO-2022 switching code (although often both get treated like the later version). More on that later.
※ I should perhaps link here to a table of the JIS X 0208 editions’ kanji changes. Also lists collisions with Microsoft extensions for the variants’ JIS X 0213 (more on that later) codepoints.
Shift_JIS
Microsoft, in the process of expanding out into Japan, developed connections with the publishers of a microcomputer magazine by the name of (confusingly enough) ASCII (a forerunner of the current ASCII Media Works). One outcome of this association was figuring out an encoding of JIS X 0208 which retained compatibility with JIS X 0201 (and, so, with ISO-646), which has since turned out to be far more important than ISO-2022 compatibility (which it does not retain, on account of (for example) repurposing the C1 area).
The way it worked was that each Shift_JIS ward constituted a pair of consecutive JIS X 0208 wards, which meant that only 47 initial bytes were required. As the first two and last two rows of the upper half of JIS X 0201 were not used by that standard, that meant that 64 initial bytes were available. The first such byte (0x80) was skipped, leaving 63 possible initial bytes or a total of 126 representable JIS X 0208 rows (or 120 in practice, since 0xFD through 0xFF usually weren’t used as lead bytes). This was several more than the 94 rows present in the JIS X 0208 standard, which becomes important later.
As the second byte would have to index within a two-row ward, not an individual row, 188 such bytes were required. But a second byte would be clearly a second byte, assuming no initial truncation, so it was possible to reuse halfwidth katakana and even ASCII characters there. The ASCII controls, including the first two rows and the DEL character (0x7F), were nonetheless skipped. The second two ASCII rows were also skipped, including the digits, ASCII space and 21 out of 32 of the punctuation marks, including all 19 invariant punctuation marks such as the ", ' and <> which are sensitive to HTML syntax, which leaves 191 possible second bytes, of which only the first 188 are needed.
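Putting the lead-byte and trail-byte rules together gives the usual arithmetic conversion from a JIS X 0208 kuten to Shift_JIS. This is a sketch of the standard 94-row case only, under my own function name `kuten_to_shift_jis`; it says nothing about which cells are actually allocated.

```python
def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    """Map a 1-indexed JIS X 0208 (ku, ten) to its two Shift_JIS bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must be in 1-94")
    ward = (ku - 1) // 2                 # two JIS rows share one lead byte
    # Lead bytes run 0x81-0x9F, then jump over 0xA0-0xDF to continue at 0xE0.
    lead = ward + (0x81 if ward < 31 else 0xC1)
    if ku % 2 == 1:                      # first row of the pair
        trail = ten + 0x3F               # 0x40-0x7E ...
        if trail >= 0x7F:
            trail += 1                   # ... then 0x80-0x9E, skipping DEL
    else:                                # second row of the pair
        trail = ten + 0x9E               # 0x9F-0xFC
    return bytes([lead, trail])

print(kuten_to_shift_jis(1, 1).hex())    # ideographic space
print(kuten_to_shift_jis(16, 1).hex())   # first kanji
```

The trail byte never falls below 0x40, which is why the digits, the space and most ASCII punctuation can never appear as the second half of a character.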
Since that left two different ways of encoding Katakana and most of ASCII, the one-byte forms were distinguished from the two-byte forms by their width, with the two-byte forms being displayed roughly the same width as a Kanji, and the one-byte forms being more narrow (roughly half as wide), which avoided breaking anything that assumed the physical length of a piece of text to be proportional to the number of bytes. This became known as “fullwidth” versus “halfwidth”.
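The bytes-equals-columns property survives in Unicode as the East Asian Width property, which makes it easy to check. In this sketch, `columns` is a hypothetical helper that counts Wide and Fullwidth characters as two columns and everything else as one; that is an approximation that suits this sample, not a general display-width algorithm.

```python
import unicodedata

def columns(text: str) -> int:
    """Approximate display width: W and F count double, others single."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

sample = "ABCｱｲｳアイウＡ"   # ASCII, halfwidth kana, fullwidth kana, fullwidth A
print(columns(sample), len(sample.encode("shift_jis")))
```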
This encoding, “shift-coded” JIS or “Shift_JIS” as it became known, became the basis for encoding Japanese on Windows and Apple computers and was eventually standardised in the 1997 edition of JIS X 0208. The potentially confusing name refers to the “JIS” (x0208) codes being “shifted” around the existing codes, not to the use of actual shift-codes which it does not involve (and which I’ll come to later).
Host Data
IBM uses four different encoding types for Japanese characters: JIS (ISO 2022 form), EUC, PC data (meaning Shift_JIS) and Host Data (or DBCS-Host). The latter is analogous to an ISO 2022 DBCS, but the reserved single byte characters are controls at 0x00–0x3F (not 0x00–0x1F) and 0xFF (not 0x7F), and the space at 0x40 (not 0x20), with the lead/trail bytes being 0x41–0xFE (rather than 0x21–0x7E). To put it very simply, host code is to EBCDIC as ISO 2022 is to ASCII, and accessed using locking shift codes.
※ Strictly speaking, 0x40 in a DBCS-Host code is required to be used in pairs, as a double-width space character (0x4040, U+3000). Any other use of the byte 0x40 while in DBCS-Host is invalid/undefined.
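The structural rules just described fit in a few lines. This sketch only checks the byte ranges given above; it knows nothing about which codes are actually assigned in any particular DBCS-Host code page, and `classify_dbcs_host_pair` is an illustrative name.

```python
def classify_dbcs_host_pair(b1: int, b2: int) -> str:
    """Classify a byte pair seen while in DBCS-Host mode."""
    if (b1, b2) == (0x40, 0x40):
        return "double-width space"
    if 0x41 <= b1 <= 0xFE and 0x41 <= b2 <= 0xFE:
        return "double-byte character"
    return "invalid"

# Values available per byte: 0x21-0x7E for ISO 2022, 0x41-0xFE for DBCS-Host.
ISO_2022_RANGE = range(0x21, 0x7F)
DBCS_HOST_RANGE = range(0x41, 0xFF)
print(len(ISO_2022_RANGE), len(DBCS_HOST_RANGE))
```

The wider byte range gives a host code roughly four times as many cells as a 94×94 set.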
The IBM DBCS-Host code for Japanese (IBM-300) predates JIS X 0208, although it has added characters so that all JIS X 0208 characters are supported, and conversion between them requires a mapping table. For the most part, you probably don’t care about it unless you’re using an EBCDIC mainframe. However, it is important to note that because it predated JIS X 0208, it includes characters which do not appear in JIS X 0208 (both non-kanji and kanji), and this becomes very important later. In any case, its characters can be browsed here
※ Hitachi and Fujitsu designed their own DBCS-Host codes, but those ones are based on an encoding of JIS X 0208 with the high bit set, plus their own extensions in an expanded trail byte range, hence they are very different. They also use different shift codes to IBM.
※ Japanese EBCDIC encompasses two incompatible single-byte variants, a “Latin” one which keeps the Roman lowercase letters in their usually invariant positions (IBM-1027) and a “Katakana” one which moves them out of the way to make the single byte katakana layout match the trail bytes of the double byte katakana layout (IBM-290). The IBM DBCS-Host code can be paired with IBM-1027 (making IBM-939, while IBM-1399 is an updated version) or with IBM-290 (IBM-930 or IBM-1390). Finally, IBM-931 pairs IBM-300 with IBM-8229, i.e. the common subset of IBM-037 and IBM-1027. Similar pairings exist for the alternative host codes from Fujitsu and Hitachi, although the layout of the lowercase in the “Katakana” page and of the katakana in the “Latin” page appears to differ between e.g. Hitachi’s and IBM’s versions for some reason. The “Katakana” variants of the single-byte component seem to be nicknamed EBCDIK, a term sometimes used contrastively with the “Latin” variants as EBCDIC, but the term EBCDIK is apparently used for a lowercase-preserving variant by HP.
※ Microsoft apparently tried to implement IBM-290 and IBM-1027 as Windows-20290 and Windows-21027, presumably respectively. In the case of at least Windows-21027, “tried” is the operative term, and they seem to have shipped it with Windows in an unfinished and unusable state, not that many people besides Microsoft developers seem to have noticed its existence. To the best of my knowledge, Microsoft made no attempt to implement the DBCS-Host components.
Beyond Shift_JIS: IBM, NEC and Windows-31J
Microsoft’s own Windows “shift_jis” differs from the generic standard definition in certain ways.
Firstly, its mapping is based on ASCII rather than ISO-646-JP. To be fair, that was necessary due to the reliance of DOS and Windows on the backslash as the primary path separator. (This was in turn for backward compatibility: DOS would consider a command name to end before the first forward-slash even if a space was not inserted, because a forward-slash was used for command-line option syntax. In every other context, Windows accepts either slash as equivalent but renders paths using the backslash.) Japanese fonts nonetheless render the code which became mapped to the Unicode backslash as a yen sign for ISO-646-JP compatibility, which does interesting things to the rendering of Windows paths.
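The effect on paths can be simulated without a Japanese font. In this sketch the glyph substitution is faked with a translation table (in reality it happens in the font, not in the data), and `as_rendered_by_japanese_font` is a hypothetical name.

```python
# What a JIS-Roman-style font draws for the code point Unicode calls a
# backslash. Only the glyph changes; the stored character does not.
JIS_ROMAN_GLYPHS = str.maketrans({"\\": "\u00a5"})

def as_rendered_by_japanese_font(text: str) -> str:
    return text.translate(JIS_ROMAN_GLYPHS)

path = r"C:\Windows\System32"
print(path.encode("cp932"))                  # same bytes as ASCII
print(as_rendered_by_japanese_font(path))    # C:¥Windows¥System32
```

The bytes on disk are still 0x5C, so the path works; it merely looks as though it is separated by currency symbols.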
Secondly, it includes extensions, both within the 94 rows defined by the JIS X 0208 standard and comprising 26 rows beyond it. Extending JIS X 0208 was not unusual or unique to Microsoft: for example, cellular carriers developed their own proprietary extensions (which included the original emoji). In fact, the extensions themselves originated from NEC and IBM, not from Microsoft, who basically defined their encoding as a superset of the IBM variant incorporating certain NEC extensions.
IBM’s version, which IBM call “Japanese DBCS–PC” (note that that term does not include the single-byte codepage which it’s used with, called “Japan PC-Data SB”), does not allocate codepoints within the 94 rows defined by the JIS X 0208 standard but does so within the 26 JIS rows/wards (or 13 Shift_JIS wards / lead bytes) beyond it. Its extensions to the basic Shift JIS code actually derive from the IBM Japanese Host Code repertoire, allowing for two-way conversion between Japanese EBCDIC and Shift_JIS.
IBM and Microsoft mark out rows 95 through 114 (Shift_JIS wards 0xF0-0xF9) for what IBM call User Defined Characters (UDC), and which Microsoft calls End User Defined Characters (EUDC). In other words, it’s a private-use area.
IBM’s rows 115 through 119 (Shift_JIS wards 0xFA-0xFC) contain IBM extensions.
Row 115 starts with 28 codepoints dedicated to “non-Kanji” characters: lowercase Roman Numerals i-x, uppercase Roman Numerals, a not-sign, a broken pipe, single and double straight quotation marks, a Kanji-derived “kabushiki kaisha” symbol (which apparently still counts as “non-Kanji”), square Roman-derived “No.” and “Tel” symbols and a “because” sign.
The remainder of rows 115-118 (Shift_JIS wards 0xFA-0xFB) and the first twelve cells of row 119 (at the beginning of Shift_JIS ward 0xFC) are occupied by 360 IBM-selected Kanji characters.
Microsoft’s variant also includes the following extensions from NEC’s version of JIS X 0208, which appears not to have been designed to be necessarily used with Shift_JIS. So in contrast to IBM’s extensions, these do not occupy rows beyond JIS X 0208, but rather occupy unallocated rows within that standard.
NEC’s row 13 (in Shift_JIS ward 0x87) contains circled Euro-Arabic numerals, uppercase Roman Numerals, Katakana-derived and Roman-derived square symbols for e.g. units, Kanji-derived symbols (circled Kanji, composed Kanji for era names) and mathematical symbols (several of the latter have had standard codepoints in JIS X 0208 row 2 since the 1983 edition).
NEC’s rows 89 through 92 (Shift_JIS wards 0xED and 0xEE) contain all characters which are present in IBM’s rows 115 through 119 but absent from NEC’s row 13, including and starting with all the Kanji, with said subset of the non-Kanji (lowercase Roman numerals, not-sign, broken pipe, straight quotation marks) being placed at the end of row 92.
This, of course, means that rows 115 through 119 are entirely redundant to rows 89 through 92 and parts of row 13 in Microsoft’s version.
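The row-to-ward figures quoted in this section can be checked with the same lead-byte arithmetic Shift_JIS uses for the standard rows, extended past row 94. A small sketch, with `shift_jis_lead_byte` being my own helper name:

```python
def shift_jis_lead_byte(ku: int) -> int:
    """Lead byte for a (possibly extended) 1-indexed row number."""
    ward = (ku - 1) // 2
    return ward + (0x81 if ward < 31 else 0xC1)

for label, rows in [("UDC", (95, 114)), ("IBM extensions", (115, 119)),
                    ("NEC row 13", (13, 13)), ("NEC selection", (89, 92))]:
    first, last = (shift_jis_lead_byte(r) for r in rows)
    print(f"{label}: rows {rows[0]}-{rows[1]} -> 0x{first:02X}-0x{last:02X}")

# Kanji count: rest of row 115, all of rows 116-118, 12 cells of row 119.
ibm_kanji = (94 - 28) + 3 * 94 + 12
print(ibm_kanji)
```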
NEC themselves included further extensions in rows 9 through 12, which were not included by Microsoft. NEC’s rows 9 and 10 contain JIS X 0201, with row 10 also including halfwidth composed diacritic katakana forms (it appears that NEC were trying to put 0201 in 0208, the exact opposite of what Microsoft were doing). NEC’s row 11 contains halfwidth box drawing characters (in Unicode order) and extended halfwidth punctuation including curly quotation marks and lenticular brackets. NEC’s row 12 includes fullwidth box drawing characters. It should be noted that NEC were extending the 1978 edition of JIS X 0208 so e.g. many of the fullwidth box drawing characters are redundant to codepoints added in the 1983 versions, although there’s no collision between the NEC version and the 1983 or 1990 version.
※ It’s also worth noting that NEC also have their own, different, extensions to single-byte JIS X 0201. Unlike the extensions present in row 10, these were not focused on katakana, but added progress bar characters and box drawing characters in the C1 range and assorted things after the halfwidth katakana, including playing card symbols and a small number of kanji for giving dates and times and yen prices.
I previously commented upon the odd nature of NEC’s row 88 but, upon closer inspection of the characters in question and upon identifying them to be absent in another PC98 font file which I checked, I conclude it to have simply been junk data in the specific font file in question.
IANA call the combined Microsoft variant “Windows-31J”, a name which Microsoft does not use. It’s also called “MS_Kanji” e.g. by Python, although IANA treat “MS_Kanji” as an identifier for standard Shift_JIS. It has the codepage number 932 on Windows, the same as IBM’s number for a version of their variant.
Windows-932 differs from IBM-932 in that IBM-932 does not include the NEC codepoints. Also, IBM-932 does not follow the character variant swaps made in 1983, preferring to retain greater backward compatibility with the 1978 edition of JIS X 0208 (while nonetheless including most of the codepoints added in later editions). Also worth noting: Windows-932 uses ASCII for its lower half, while IBM-932 uses an extended ISO-646-JP with box drawing characters in the first two rows, at least in theory (which can be switched to/from controls using controls, but get mapped to controls in Unicode). IBM offers Microsoft’s variant of the double-byte codes (with NEC extensions) in code page 943 (“IBM-943”), which also incorporates their own ISO-646-JP extensions.
Note that IBM’s C0 controls arrangement does not entirely match ANSI/ECMA/ISO standards: they put File Separator in the Control-Z position, Substitution Character in the seven-ones position and Delete in the position vacated by FS. But Microsoft’s version follows the standards as regards control character mapping.
IBM later extended the single-byte codes of IBM-932, adding the cent sign, pound sign, not sign, backslash and tilde to 0x80, 0xA0, 0xFD, 0xFE and 0xFF: the result was called IBM-942. But those codes collide with unrelated Apple extensions (and Windows noncommittally maps them to private use). ICU includes two Unicode mappings for IBM-943 (one also called IBM-943C and ASCII based, one not). Also, the ICU mapping for IBM-942 is ASCII-based, resulting in duplicate single-byte encoding of the backslash and tilde.
Microsoft’s documentation and APIs simply label their version Shift_JIS, so it became prevalent on the web simply as “Shift_JIS”, which the W3C/WHATWG encoding standard used by HTML5 takes into account by incorporating the IBM and NEC extensions to JIS X 0208 and Shift_JIS into its respective definitions. Encoding to Shift_JIS in particular per that standard avoids rows 89 through 94, so preferring the original IBM codepoints for the extended Kanji.
Apple, MacJapanese and the CORPCHAR extensions
Like Microsoft, Apple had meanwhile added their own (incompatible) extensions. Actually, Apple’s variant of Shift_JIS (sometimes called MacJapanese or KanjiTalk) is best thought of as three incompatible Shift_JIS variants in a trenchcoat.
The newest and best-documented of these, sometimes called the “KanjiTalk7” encoding, was introduced in 1992 with the release of version 7.1 of KanjiTalk (the Japanese edition of the classic Mac OS), although some fonts shipped with KanjiTalk 7.1 still used older variants. It includes more special characters in rows 8–15, and vertical presentation forms in rows 85–89 (at 84 rows down from their normalised forms). This also includes the backslash, required space, copyright sign, trademark sign and halfwidth (i.e. low) horizontal ellipsis in the vacant single byte space (0x80, 0xA0, 0xFD, 0xFE, 0xFF). The tilde is present rather than an overline. As with NEC, no non‑user-defined double byte assignments were added beyond the JIS X 0208 space, only within it. Whilst the repertoire demonstrates a not-insignificant overlap, the layout of the KanjiTalk7 extensions is unlike that in any of the PC versions.
The other variants, sometimes called “KanjiTalk6”, fall into two categories: those which encode the vertical presentation forms 10 rows down, and those which encode them 84 rows down like the KanjiTalk7 set; the latter is sometimes called the “PostScript” variant. KanjiTalk6 fonts can include some subset of the NEC non‑IBM-selected extensions, but what subset can vary (and might even vary between print and screen versions of the same font). Since it is the punctuation and small kana (hence, within rows 1, 4 and 5) which require vertical presentation forms, and the NEC non‑IBM-selected extensions did not use rows 14 and 15, only NEC row 11 is actually clobbered by the row+10 vertical forms; NEC row 13, in particular, seems to have been sometimes considered part of the set, although its kana and era ligatures could, of course, only be given their own vertical forms over the row+84 range. In the PostScript variants, anything up to the entire NEC rows 9–13 could have been included, though the Carbon framework seems only to consider rows 12 and 13 there.
To summarise:
Location | Mac Row+10 | Mac Row+84 KanjiTalk7 | Mac Row+84 PostScript | Windows |
---|---|---|---|---|
Rows 9-10 | Nothing, potentially NEC halfwidth extensions | Apple extensions | Nothing, or NEC halfwidth extensions | Nothing |
Row 11 | Vertical forms | Apple extensions | Nothing, or NEC halfwidth extensions | Nothing |
Row 12 | Nothing, potentially NEC fullwidth box drawing | Apple extensions | NEC fullwidth box drawing | Nothing |
Row 13 | Nothing, or NEC special characters | Apple extensions | NEC special characters | NEC special characters |
Row 14 | Vertical forms | Apple extensions | Nothing | Nothing |
Row 15 | Vertical forms | Apple extensions | Nothing | Nothing |
Rows 85-94 | Nothing | Vertical forms | Vertical forms, including NEC special characters | NEC selection of IBM extensions |
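The row numbers in the “Vertical forms” cells follow directly from the two offsets. A tiny sketch, restricted to the three source rows named in the text (rows 1, 4 and 5); the names are illustrative only.

```python
ROWS_NEEDING_VERTICAL_FORMS = (1, 4, 5)  # punctuation, hiragana, katakana

def vertical_rows(offset: int) -> list:
    """Rows where the vertical forms land for a given row offset."""
    return [row + offset for row in ROWS_NEEDING_VERTICAL_FORMS]

print(vertical_rows(10))  # [11, 14, 15]
print(vertical_rows(84))  # [85, 88, 89]
```

With the +10 offset the forms land among the NEC extension rows, whereas +84 puts them in the unused rows after the kanji.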
Some characters did not exist in Unicode at the time. Some still don’t, while some have been added in the interim. Apple firstly solved this by mapping them onto the Unicode Private Use Area. But that turned out to be bad for interoperability, so Apple switched to using combining sequences where possible, and otherwise using a compatibility decomposition (failing that, a close substitute) of the character combined with private use characters (either functioning as variation or presentation form selectors, or marking a sequence of codepoints as representing one MacJapanese character).
Similarly, Apple originally mapped the single-byte ellipsis to the normal horizontal ellipsis and the double byte one to the mathematical vertically centred horizontal ellipsis, but switched to mapping the double byte one to the normal one and using a private use marker on the single byte one, since mapping to the mathematical one was apparently not handled well in the opposite direction by Windows (i.e. bestfit932).
Apple provides to Unicode a file named CORPCHAR.TXT which details all of their private use mappings, including those used for MacJapanese as well as those used for their other character sets and East Asian charset variants.
Apple’s published mappings still map to Unicode 2.1 and so use these markers, even for characters for which actually matching codepoints have since been added. But, comments in the mapping file provide mappings to Unicode 4.0 for characters that had been added by that point.
JIS X 0212 and the Extended Unix Code (EUC)
In 1990, JISC put out two standards that are relevant here. One was a further revision to the already-established JIS X 0208 standard, which merely added two disunified variants of existing kanji in response to changes in the legal list of naming kanji and, due to this time being a strict superset of the previous (1983) version, did not warrant another new ISO 2022 code (although it did get assigned a prefix for the designation code to indicate the use of an upwardly compatible revised version, this was not used in practice, and the 1990 version is expected to be indicated without it in the ISO-2022-JP profile).
The other was JIS X 0212. This was an entirely separate 94×94-cell character set, a second or supplementary “plane”, which was not of much use on its own, containing only characters which were absent from JIS X 0208 (5801 Kanji and 245 non-Kanji). The idea was presumably that ISO 2022 mechanisms would be used to switch between the two character sets as necessary (a setup referred to as “ISO-2022-JP-1”). Since both JIS X 0208 and JIS X 0212 were 7-bit, the high bit could theoretically be used for that too.
As with JIS X 0208, the first fifteen rows contained non-Kanji characters (though they were mostly unallocated). The sixteenth through seventy-seventh rows contained Kanji and the remainder was unallocated.
While JIS X 0208 and JIS X 0212 are separate 94×94 sets with mostly-colliding Kanji allocations, the non-Kanji allocations avoided codepoints that were used in JIS X 0208, leaving them unallocated. Hence, it would be possible to e.g. support only the non-Kanji portion of JIS X 0212 without a mechanism for switching between them, or implement them such that a switching mechanism has effect only on the Kanji portion of the sets, or scan for a punctuation sequence without tracking which one is active (though one would still need to track that double-byte mode, not e.g. ASCII, is active). Whilst I have found no evidence that this provision was ever directly useful to anyone, the unallocated rows which resulted (and the empty rows after the Kanji) did become important later (as with JIS X 0208 and Shift-JIS, but not quite for the same reasons).
JIS X 0212 subsequently became somewhat of an embarrassment for those responsible for the JIS character sets, since its authors neglected to properly document the characters, making it difficult to tell why they were added or what they matched or were unifiable with. It was also criticised for not properly honouring JIS X 0208’s unification criteria; in fact, some of its characters matched subsequently revised reference glyphs from the original 1978 edition of JIS X 0208. That being said, these disunifications (separating traditional characters from their extended‑shinjitai forms) were not unjustifiable.
While use of JIS X 0212 in most encodings simply did not catch on, even where it was possible (it does not fit in Shift_JIS, but one IBM EUC code page gives PC Data (that is, Shift_JIS) mappings for the mere subset of the JIS X 0212 characters it includes), there was one exception, itself with a catch.
The 8-bit Extended Unix Code (EUC) works as follows. The lower half of the encoding (with the high bit unset) gets assigned to an ISO-2022-compliant encoding, usually an ISO-646 variant such as ASCII or ISO-646-JP, and those bytes are not used for any other purpose. The first two rows of the upper half of the encoding are used for control characters (including two single-shifts) and one, two or three bytes (as appropriate) from the remaining rows represent a character from another ISO 2022 compatible 7-bit encoding (with the high bit set). The control characters in the upper half include 0x8E and 0x8F (single shifts), which are used for indicating additional ISO-2022-compliant encodings and are followed by one, two or three bytes from the non-control rows of the upper half.
EUC coding is popular on UNIX, where most other ways of encoding DBCSs would not be POSIX-compatible, but it can also be used elsewhere. The names of the Mainland Chinese (GB2312) and Korean (KS X 1001) counterparts to JIS X 0208 have become used almost interchangeably with their EUC forms (EUC-CN and EUC-KR), and their Windows encodings (GBK and UHC) are supersets of their EUC encodings, although they are not themselves valid EUC. EUC-JP, the EUC form of JIS X 0208, did not become nearly as popular: non-Unix systems tended to adopt Shift_JIS instead.
The presence of single-shifts in EUC meant that it was possible to represent more sets by preceding representations with a single shift. The 0x8E single shift was already used to precede a character from the upper half of JIS X 0201, so the 0x8F single shift was adopted for JIS X 0212 characters, but only by non-Microsoft software. Microsoft’s system locales assumed for a long time that a character would not be more than two bytes (like in Shift_JIS, GBK, UHC and Big5), which is why UTF-8 only became a system locale very recently.
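The EUC-JP layout described above can be sketched as a small tokenizer. This is only an illustration under simplifying assumptions: the function name and the code-set labels are made up for the example, and any upper-half control byte other than the two single shifts is treated as an error rather than passed through.

```python
def split_euc_jp(data: bytes):
    """Split EUC-JP bytes into (code_set, raw_bytes) tokens without decoding them."""
    tokens = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Lower half: one byte from the ISO 646 set (ASCII or ISO-646-JP).
            name, length = "iso646", 1
        elif b == 0x8E:
            # Single shift: one following byte from the upper half of JIS X 0201.
            name, length = "jisx0201-kana", 2
        elif b == 0x8F:
            # Single shift: two following bytes from JIS X 0212.
            name, length = "jisx0212", 3
        elif 0xA1 <= b <= 0xFE:
            # No single shift: two bytes from JIS X 0208.
            name, length = "jisx0208", 2
        else:
            # Simplification: other upper-half control bytes are rejected here.
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
        chunk = data[i:i + length]
        # Bytes after the first must come from the non-control rows of the upper half.
        if len(chunk) != length or any(not 0xA1 <= c <= 0xFE for c in chunk[1:]):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        tokens.append((name, chunk))
        i += length
    return tokens


for name, chunk in split_euc_jp(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"):
    print(name, chunk.hex())
```

Note that the JIS X 0212 token is three bytes long, which is exactly what a locale model limited to two bytes per character cannot accommodate.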
IBM’s version of EUC-JP occupied rows 83 and 84 of JIS X 0212 with a selection of their vendor extensions, apparently serving to allow all characters representable in IBM-932 to be represented in their version of EUC-JP without the NEC extensions (this seems to use a different layout in at least one IBM EUC codepage versus eucJP-open though, despite encoding the same characters).
Like ISO-2022-JP, EUC-JP is an ISO-2022 mechanism, albeit one where the character sets are pre-arranged rather than loaded using unique codes (also pre-arranging which half of the encoding is used with the single shifts, specifically the upper half) and which hence cannot be arbitrarily mingled with other national ISO 2022 formats.
Cellular Emoji
Bit of a diversion here, but it concerns other extensions to JIS character sets.
In the late 1990s, pictograms (or emoji in Japanese) started appearing on Japanese mobile devices.
Today, the term “emoji” is often used by chat apps to refer to any small image which may be used as a reaction, sent as a message sticker, and/or embedded into the main text of a message. From a technical perspective, this usage is overly broad, and encompasses several things which are quite different under the hood.
An emoji in a technical, character encoding sense is a regular encoded character, represented in much the same way a letter or a Chinese character might be, but depicting an image or symbol rather than writing; they may sometimes appear with colourful and/or animated presentation in a font which supports this. This generally excludes “symbols” in the sense of mathematical operators (e.g. +), but it may for historical reasons include stylised versions of them (e.g. ). Today, characters newly encoded for use as emoji are expected to be pictographic in nature.
There are three main legacy emoji character sets, each with more than one encoding. These include the set from DoCoMo (of i-mode fame), the set from SoftBank Mobile (formerly Vodafone Japan, formerly J-Phone), and the set from KDDI (trading in the mobile phone industry as “au”).
There seem to have been several ways these were encoded:
- Using an ESC $ (thing) sequence, where (thing) is a byte between 0x43 (C) and 0x7E, to change the ASCII printing characters to a page of emoji, then using the Shift In control character to switch back. The only vendor using this version was SoftBank (over 2G communication), which used E, F, G, O, P and Q. (Note: this is not conformant to JIS X 0202 / ECMA-35, before anyone asks.)
- In Shift_JIS, after the JIS X 0208 section. Here, the KDDI, DoCoMo and IBM extensions manage to coëxist in separate ranges. The SoftBank extensions manage to collide with all three; they map each page so it starts at the trail byte 0x41 or 0xA1 (as applicable). These are not all one after the other; I’m not sure what the logic here is as to the ranges used.
- In JIS X 0208, after the kanji block. Obviously, this would tend to collide with the various other extensions also in this area; however, the providers which used this style of encoding seem to have tried to make their sets vaguely compatible with one another as opposed to merely transposing their Shift_JIS sets. More specifically, the KDDI set can be mapped between the JIS and Shift_JIS representations in two ranges (Shift_JIS 0xF640 to 0xF7FC then 0xF340 to 0xF493 map to JIS 0x7521 to 0x7B73, within usual trail byte ranges), while the other two vendors try to map their emoji to similar KDDI codepoints in this range where possible. Where not possible, DoCoMo and SoftBank mostly occupy separate ranges, except for two which DoCoMo apparently manages to encode off-by-one from their SoftBank locations.
- As an image link: all three have schemes of image URLs for their emoji sets, allowing them to be represented as images instead of as emoji characters, as a perhaps more portable solution.
- In the Unicode Private Use Area (PUA). DoCoMo’s use of this corresponds to the IBM/Microsoft mapping of the Shift_JIS range which it uses to the PUA. KDDI used two different such mappings (one in general from U+E468 to U+E5DF then U+EA80 to U+EB8E in JIS order, plus one in the web browser which merely amounts to the Shift_JIS code as a big endian short minus 0x700). SoftBank simply mapped each “page” of emoji to its own PUA range, with the low 8 bits 0x01 for the first emoji of each page (although the pages aren’t in exactly the same order as in the Shift_JIS and 2G encodings); the last of these pages collides with the regular (not web) KDDI scheme, although this is the only collision between the four schemes.
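As a rough sketch of the KDDI arithmetic in the list above: the function names and the lead-byte table are illustrative, and the table assumes the second Shift_JIS range starts at lead byte 0xF3, which is what the stated endpoints (0xF493 landing on JIS 0x7B73) imply when the usual trail byte layout is applied.

```python
# Lead byte -> the odd-numbered JIS row it starts on (each lead byte covers two rows).
# Assumption: 0xF6/0xF7 cover rows 85-88 and 0xF3/0xF4 continue at rows 89-92.
KDDI_LEAD_TO_ROW = {0xF6: 85, 0xF7: 87, 0xF3: 89, 0xF4: 91}


def kddi_sjis_to_jis(code: int) -> int:
    """Map a KDDI Shift_JIS emoji code to its JIS X 0208-style code."""
    lead, trail = code >> 8, code & 0xFF
    row = KDDI_LEAD_TO_ROW[lead]
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x3F            # cells 1-63 of the odd row
    elif 0x80 <= trail <= 0x9E:
        cell = trail - 0x40            # cells 64-94 of the odd row (0x7F is skipped)
    elif 0x9F <= trail <= 0xFC:
        row, cell = row + 1, trail - 0x9E  # cells 1-94 of the even row
    else:
        raise ValueError(f"invalid trail byte 0x{trail:02X}")
    return ((row + 0x20) << 8) | (cell + 0x20)


def kddi_sjis_to_web_pua(code: int) -> str:
    """The web-browser PUA scheme: the Shift_JIS code as a big-endian short, minus 0x700."""
    return chr(code - 0x700)


for sjis in (0xF640, 0xF7FC, 0xF340, 0xF493):
    print(f"{sjis:04X} -> JIS {kddi_sjis_to_jis(sjis):04X}, "
          f"web PUA U+{ord(kddi_sjis_to_web_pua(sjis)):04X}")
```

The four sample codes are the range endpoints quoted above, so the output can be compared directly against the JIS figures in the text.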
Later, Google produced a system of Supplementary Private Use Area mappings of all three vendors’ emoji, in addition to a few unique to GMail. Most of these, excluding a few that were seen as near-duplicate or corporate logos, were subsequently assigned standard Unicode codepoints. Prior to that, the most common scheme for representing cellular emoji in Unicode outside of Japan had been the SoftBank Private Use Area mapping, due to Apple collaborating with SoftBank.
All of this being said, these were not the only source of emoji characters. Several characters had already been present since early Unicode, sourced from the Zapf Dingbats font. Several characters had been added only a year before the cellular emoji sets were, from the ARIB extension to JIS X 0208 used by broadcasters in Japan. Several characters were added a few years later, from the Webdings font and the Wingdings series of fonts (for just one example, people including a (national park) pictogram in e-mail footers asking for the email not to be printed unless absolutely necessary due to environmental considerations would use Webdings, showing up as a letter P on systems where Webdings was not present). The simple smiley face (☺ or ☺️) had been available in MS-DOS.
In all cases where a set (ARIB, or Japanese Cellular, or Wingdings) was added, characters which were considered to already exist in Unicode were unified with the existing character, meaning that several individual emoji codepoints were originally assigned earlier. Hence, although Zapf Dingbats, ARIB, the three Japanese cellular sets and the four Wingdings/Webdings sets are the ones listed as major sources of emoji (non-exhaustively: a single emoji can be associated with multiple such sources), several were technically first added from other sets than those (such as KPS 9566).
JIS X 0213 and why arbitrary proprietary assignments are a bad idea
In 1997, a new edition of JIS X 0208 was published. This informally deprecated JIS X 0212, made efforts to re-unify the character variants from 1978 that had been disunified by JIS X 0212 (apparently to the point of listing some characters with two reference glyphs), and opposed the use of unallocated ranges for vendor or private extensions (comments about stable doors seem rather apt in this context). It also finally made Shift_JIS an appendix to the standard itself.
In 2000, JISC released a new standard, JIS X 0213, intended to be the successor to JIS X 0208. In addition to JIS X 0208 characters, it included 2743 of the Kanji from JIS X 0212, 952 additional Kanji and many additional non-Kanji such as Roman letters with macrons or additional small Katakana used in Ainu.
Whereas JIS X 0208 had defined one 94×94 plane (or “men”) and JIS X 0212 had defined another, JIS X 0213 took a different approach. Nominally, it defined two planes. Plane 1 remained more-or-less compatible with existing standard JIS X 0208 codepoints, insofar as any revision of that standard had, but with many additions to the point of all rows being mostly or entirely occupied. It was, however, assigned its own ISO-2022 escape sequence. Although it also mostly retained compatibility with the NEC/Microsoft row 13 (deällocating some duplicate codepoints), it did not do so for rows 89 through 92, which it used for a different set of Kanji than the ones present in the NEC/Microsoft use of those rows.
In addition to the 94-row Plane 1, JIS X 0213 defined 26 more rows in plane 2, containing Kanji only. In reality, for compatibility with all three established encodings, these were encoded in two different ways. As far as Shift_JIS was concerned, they were mapped without intervening empty rows after the end of Plane 1, so colliding with the unrelated IBM/Microsoft extensions in that region and likely with proprietary cellular variants. This was done in the order 1, 8, 3, 4, 5, 12–15 and 78–94, with the placement of 8 between 1 and 3 ensuring the alternating odd-even numbers that Shift_JIS encoding algorithms might rely on (i.e. it preserves the property that the lower-valued half of the trail byte range is used for odd-numbered rows).
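The row ordering can be checked mechanically. The sketch below pairs the Plane 2 rows up in the stated order and assigns one Shift_JIS lead byte per pair; the helper names are illustrative, and the starting lead byte 0xF0 is an assumption following from Plane 1 ending at lead byte 0xEF with no gap afterwards.

```python
# Plane 2 rows in the order they are laid out in Shift_JIS.
PLANE2_ROWS = [1, 8, 3, 4, 5, 12, 13, 14, 15] + list(range(78, 95))


def plane2_lead_bytes(first_lead: int = 0xF0) -> dict:
    """Assign one lead byte to each consecutive pair of rows."""
    return {row: first_lead + index // 2 for index, row in enumerate(PLANE2_ROWS)}


def plane2_to_sjis(row: int, cell: int) -> bytes:
    """Encode a Plane 2 (row, cell) pair using the usual trail byte layout."""
    lead = plane2_lead_bytes()[row]
    if row % 2 == 1:
        # Odd rows take the lower half of the trail byte range, skipping 0x7F.
        trail = cell + 0x3F if cell <= 63 else cell + 0x40
    else:
        # Even rows take the upper half.
        trail = cell + 0x9E
    return bytes([lead, trail])


# Every pair is (odd row, even row), which is what placing 8 between 1 and 3 achieves.
pairs = list(zip(PLANE2_ROWS[0::2], PLANE2_ROWS[1::2]))
print(all(a % 2 == 1 and b % 2 == 0 for a, b in pairs))
print(plane2_to_sjis(1, 1).hex(), plane2_to_sjis(8, 1).hex(), plane2_to_sjis(94, 94).hex())
```

Had the rows been laid out in plain numerical order (1, 3, 4, 5, 8, ...), the first pair would have been two odd rows, and the parity test in `plane2_to_sjis` would no longer pick the correct half of the trail byte range.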
As for why those rows specifically were allocated, while Plane 2 could be accessed from the ISO/IEC 2022 system using its own escape sequence (a setup called “ISO-2022-JP-3”), the arrangement of these rows was designed to deliberately avoid colliding with any of the codepoints assigned by JIS X 0212, so allowing JIS X 0213 to be used within EUC-JP without changing the meaning of any existing standard-compliant content. Shame about the IBM extensions. One further consequence of this: JIS X 0213 and JIS X 0212 can be unambiguously used in a single EUC document (if we ignore the IBM extensions). The use of JIS X 0213 within EUC-JP is called “EUC-JISx0213”, while the use thereof within Shift_JIS is termed “Shift_JISx0213”, although these labels are sometimes (not always) used specifically to refer to variants encoding the first (2000) edition of JIS X 0213.
Further defined were simple 7-bit and 8-bit formats using both planes, the former differentiating using 0x0E (shift out) to move to the second plane and 0x0F (shift in) to switch to the first, and the latter using the high bit. Since it would give access to the entirety of Plane 2, not just the JIS X 0213 rows of it, it could hypothetically be used for JIS X 0212 also, but whether that’s advisable is another question.
Overall, though, what could have been a standard enhancement to Shift_JIS and to a lesser extent EUC-JP, retaining compatibility with the earlier standards, actually ended up defining yet another incompatible variant due to colliding with established extensions. It did not catch on much, perhaps due to Microsoft’s variant of Shift_JIS (with NEC and IBM extensions) having become the de facto standard version (and for reasons I’ll come to shortly, the actual standard version in certain contexts), and perhaps because, since this was not the nineties anymore, there was not much will to implement new changed revisions of non-Unicode character sets anymore. That being said, the JIS X 0213 Shift_JIS and EUC-JP variants are informatively rather than normatively defined, and using the JIS X 0213 repertoire from Unicode tends to be preferred.
In 2004, a new revision of JIS X 0213 was released, which changed the recommended renderings of a number of characters and disunified a small number, affecting only the first plane. The same EUC and Shift_JIS encodings were retained, but often called “EUC-JIS-2004” and “Shift_JIS-2004” to distinguish them from the 2000 versions. The revised first plane received a new ISO-2022 code, used when the newly assigned codes are used, resulting in “ISO-2022-JP-2004”.
Unicode, UTF-8, the web, WHATWG and the future
Both JIS X 0208 and JIS X 0212 were used as character sources for Unicode and its ISO-10646 (or “JIS X 0221”) Universal Character Set. They were mapped to Unicode in their entireties. The characters added in JIS X 0213 that were not already present in JIS X 0212 (and hence Unicode) were added to Unicode in version 3.2. Unicode lacks many of the problems encountered with JIS X 0208: more or less any character encoding can be converted to Unicode and, while extensions such as CSUR (and Apple’s CORPCHAR as noted above) do exist, they map onto explicitly designated private-use areas, which will never be used for standard mapping (and collectively include 137 468 private use codepoints, compare with the only 17 672 total possible codepoints in both JIS X 0208 and JIS X 0212 combined).
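For anyone wanting to check the two figures in that comparison, they can be re-derived from the sizes of the Unicode private-use ranges and of a 94×94 set. The variable names below are just for illustration.

```python
# Private Use Area in the Basic Multilingual Plane: U+E000..U+F8FF.
bmp_pua = 0xF8FF - 0xE000 + 1

# Planes 15 and 16 are private use apart from their last two (non-character) codepoints:
# U+F0000..U+FFFFD and U+100000..U+10FFFD.
supplementary_pua = 2 * (0xFFFD - 0x0000 + 1)

total_private_use = bmp_pua + supplementary_pua

# Two 94x94 sets (JIS X 0208 and JIS X 0212), counting every cell whether allocated or not.
jis_cells = 2 * 94 * 94

print(total_private_use, jis_cells)
```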
Unicode Transformation Format—8-bit, better known as UTF-8, is a very well designed character encoding. It’s very ASCII compatible, in that it uses ASCII bytes for and only for ASCII characters. Its multi-byte sequences follow a very regular pattern which could in principle represent up to 68 719 476 736 codepoints, although most of these are invalid since there are only 1 114 112 theoretical codepoints present in the entirety of Unicode (which are mostly unallocated and include non-character and private-use codepoints as well as non-codepoint surrogate values reserved for allowing UTF-16 to work). Initial bytes of multi-byte sequences only ever get used as initial bytes and continuation bytes only ever get used as continuation bytes; this alleviates many of the issues associated with seeking and truncation as well as making it quite unlikely for a file to be coïncidentally valid UTF-8. It nonetheless may start with an optional unique signature (the three bytes 0xEF 0xBB 0xBF). It likewise lacks most of the issues associated with the older multi-byte encodings.
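A short sketch of the properties just described. The helper names are illustrative; the 36-bit figure assumes the byte pattern is carried all the way through to a 0xFE lead byte followed by six continuation bytes, which is one way of arriving at the number quoted above.

```python
# Six continuation bytes carrying six payload bits each.
MAX_PAYLOAD_BITS = 6 * 6
assert 2 ** MAX_PAYLOAD_BITS == 68_719_476_736
assert 17 * 0x10000 == 1_114_112  # 17 planes of 65,536 codepoints


def byte_role(b: int) -> str:
    """Classify a byte purely by its value; no surrounding context is needed."""
    if b < 0x80:
        return "ascii"
    if b < 0xC0:
        return "continuation"
    return "lead"  # simplification: also covers values that valid UTF-8 never uses


def next_boundary(data: bytes, offset: int) -> int:
    """From an arbitrary offset, skip forward to the start of the next character."""
    while offset < len(data) and byte_role(data[offset]) == "continuation":
        offset += 1
    return offset


data = "日本".encode("utf-8")
start = next_boundary(data, 1)       # offset 1 is in the middle of the first character
print(start, data[start:].decode("utf-8"))
print("\ufeff".encode("utf-8").hex())  # the optional signature
```

Resynchronising like this is not possible in Shift_JIS, where a trail byte can have the same value as a lead byte or an ASCII character.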
With the advent of HTML5, based on the WHATWG HTML Living Standard, the WHATWG encoding standard became the relevant standard for character encodings used within HTML. This had several relevant consequences.
Microsoft’s version of Shift_JIS is now also the HTML standard version, with the same applying regarding the Windows subset of the NEC extensions and both of the other encodings of JIS X 0208 (EUC-JP and ISO-2022-JP). JIS X 0213 is not supported. WHATWG’s ISO-2022-JP treats both JIS X 0208 escape codes as equivalent and all of the JIS X 0212 and JIS X 0213 escapes as error conditions, although WHATWG’s EUC-JP supports JIS X 0212 for decoding only.
Overall, the WHATWG discourage use of any encodings other than UTF-8, although they specify them to standardise compatibility with existing content and interfaces, so they are unlikely to adopt support for JIS X 0213 in the foreseeable future. In particular, they would much rather have ISO-2022-JP removed from the standard and mapped in its entirety to U+FFFD (like its Chinese and Korean counterparts) due to its ASCII incompatibility (when in JIS X 0208 mode) posing a potential XSS risk, but are not able to due to it still being relevant to current content and software.
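To make the ASCII-incompatibility concrete, here is a minimal illustration using Python's own `iso2022_jp` codec (which is not the WHATWG decoder, so edge-case behaviour may differ). The sample bytes are chosen for the example: between the two escape sequences, bytes that look like the ASCII text "0!" are actually one JIS X 0208 character.

```python
# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
payload = b"\x1b$B0!\x1b(B"

# A byte-level filter sees only printable ASCII and escape characters here...
looks_like_ascii = all(b < 0x80 for b in payload)

# ...but the decoder reads the pair 0x30 0x21 as a single kanji.
decoded = payload.decode("iso2022_jp")

print(looks_like_ascii, decoded, f"U+{ord(decoded):04X}")
```

The same mismatch works in the other direction: markup-significant bytes such as 0x3C can sit inside double-byte mode without meaning "<", so a sanitiser and a decoder that disagree about the current mode will disagree about where the markup is.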
Further comments on Unicode
Note on CJK unification: the Unicode standard has been somewhat controversial in that it encodes logographic characters used in both Chinese and Japanese only once, rather than encoding the languages separately. This sometimes proves problematic, but not for the reason that some assume.
The sometimes assumed reason is seeing it as equivalent to unifying, say, Latin, Greek and Cyrillic, or even as an implied act of linguistic or cultural assimilation. However, this is not where the actual problem arises. The Japanese word “kanji” literally means “kan” (Han Chinese) “ji” (characters), the Korean word “hanja” being cognate. Hence, their treatment as Chinese characters is reasonable considering, quite simply, that that’s what the respective languages call them. As for the distinctly Japanese kana and distinctly Korean hangul (and the zhuyin used mainly in Taiwan), they are encoded separately. Note that the coïncidental kana-zhuyin homoglyphs are not unified, since they are not simply encoded by looks.
The Unicode Consortium, being an American organisation, is sometimes perceived as imperialistically telling other countries how their language works. However, the work on Chinese characters specifically is primarily the responsibility of the Ideographic Research Group (IRG) under the ISO, which includes experts from all interested territories, although the Consortium may rubber-stamp it.
※ Official representatives from Japan (since 2019) and North Korea (for roughly two decades) have not been actively attending recent IRG meetings. Japan’s withdrawal from active participation was a decision by the Japanese national body itself, a decision which is if anything viewed as somewhat of an annoyance by the rest of the IRG.
※ The Unicode Consortium does participate in the IRG, nominally representing the United States. Since Chinese characters are generally required to be submitted by an IRG member body rather than an individual, the participation of the Consortium as an IRG member body primarily serves to allow Chinese characters, usually in small numbers, to be proposed for inclusion (with evidence supplied) via normal Unicode proposal documents.
Note also that English and German using Roman letters does not implicitly make them treated as dialects of Italian. Users of the Roman alphabet do not recognise Greek as the same alphabet, but recognise their alphabet to be ultimately Latin; an English speaker would consider German to be written in the same alphabet, more letters (e.g. ß, ü) notwithstanding, but would not consider Russian or Bulgarian to be. So Greek gets encoded once, separately from Russian, Ukrainian, Belarusian and Bulgarian (which are encoded together) and also separate from English, French, German and Swedish (which are encoded together). None of this is where the actual problem comes from.
The individual national standards such as JIS X 0208 applied “unification criteria” to kanji in the source material to limit the number of glyphs encoded: multiple minor variants would be unified with one standard variant (this necessitates formalising what constitutes a spelling variant versus merely a handwriting variant, which is much less clear-cut for Chinese characters than it is for an alphabet). Unicode continued this process, but with the IRG considering character sets from all four territories.
※ Strictly speaking, the South Korean standard deliberately included several hanja with multiple readings twice, or even as many as four times, with the same glyph. This was not the usual approach; it was the only one of the national sets to deliberately do this.
In actual fact, any two character variants which one of the national standards such as JIS X 0208 treated as separate characters (i.e. by including them both separately), even if they would usually be the same character under the unification criteria used, were also mapped to separate characters by Unicode, either to a compatibility character or even to a separate full-fledged canonical character (so-called “source separation”; in cases where multiple of the latter would otherwise have been unified, they are referred to as “Z-variants”). Hence, the only place where unification occurred for characters which already existed in the established national standards was between a character in one set (e.g. JIS X 0208) versus one in another (e.g. GB 2312 for Simplified Chinese).
The problem with this is that which variant of the character gets considered standard may have minor differences between, say, Japan and Taiwan. Note that this falls along country or territory boundaries, not language boundaries (North Korea and South Korea are not identical in this respect, and neither are Taiwan and Hong Kong), and furthermore, what is normative in print does not necessarily match what is normative in handwriting even within a single country or territory. To be clear, drastically different simplification levels are still encoded separately, but more minor variations where mutual legibility can still be assumed are unified. As I mentioned, this did not originate with Unicode, but prior to Unicode the codepage would limit the available fonts so it was not likely for Japanese to end up accidentally in a Taiwanese font (instead, it would have been completely illegible on a computer in Taiwan, due to misinterpretation of Shift_JIS data as Big5).
Further confusion comes from how some software keeps strings in Shift_JIS rather than decoding it to Unicode so as to guarantee round-tripping. This has nothing to do with CJK unification, and everything to do with the deliberate duplication between the IBM and NEC extensions as explained above (and likewise the characters duplicated between these extensions and the non-kanji added in 1983). Contrary to misconception, using Shift_JIS does not avoid CJK unification (not least because all font rendering is done via Unicode nowadays anyway).
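The round-tripping problem can be seen with Python's `cp932` codec, used here as a stand-in for Microsoft's variant of Shift_JIS. The two byte pairs are examples picked for this sketch: one symbol present both in standard JIS X 0208 and in NEC row 13, and one kanji present both in the IBM extensions and in NEC's selection of them.

```python
DUPLICATES = {
    "U+2252 (JIS X 0208 vs NEC row 13)": (b"\x81\xe0", b"\x87\x90"),
    "U+7E8A (IBM vs NEC selection of IBM)": (b"\xfa\x5c", b"\xed\x40"),
}


def survives_round_trip(raw: bytes) -> bool:
    """True if decoding to Unicode and re-encoding gives back the same bytes."""
    return raw.decode("cp932").encode("cp932") == raw


for label, (first, second) in DUPLICATES.items():
    same_character = first.decode("cp932") == second.decode("cp932")
    print(label, same_character, survives_round_trip(first), survives_round_trip(second))
```

Since both byte sequences in a pair decode to the same Unicode character, at most one of them can come back unchanged, whatever the encoder's preference is. None of these duplicates involves two different glyph variants, which is why this is a separate matter from CJK unification.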
Unicode have a number of solutions such as variation selectors for cases where a specific glyph variant is needed for display, but ultimately it would complicate text searching/lookup too much to disunify them all, especially given that content will already exist under the original unification criteria.
※ Compatibility codepoints, with normalisation mappings to the canonical forms, are included either where exact duplicates existed in the original primary source-separation character set standards, or where multiple variants existed in some other specifications which needed round-trip compatibility but were not among the original primary source-separation authorities. They were designed as a mechanism for round-tripping exact duplicates, and so they convert to the corresponding canonical forms under any Unicode normalisation operation (such as would be used to make representations of “á” as a single character, or as an “a” followed by a non‑spacing acute accent, compare equal). As such, they were always poorly suited as a mechanism for identifying specific character variants, and their use for that purpose is now obsolete in favour of corresponding variation selector sequences having been defined.
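A minimal demonstration of that behaviour with Python's standard `unicodedata` module. The compatibility ideograph U+F900 is chosen as an example because it is one of the duplicate codepoints in the compatibility block.

```python
import unicodedata

compat = "\uf900"       # CJK COMPATIBILITY IDEOGRAPH-F900
composed = "\u00e1"     # "á" as a single character
decomposed = "a\u0301"  # "a" followed by a combining acute accent

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalised = unicodedata.normalize(form, compat)
    print(form, f"U+{ord(normalised):04X}")

# The same operation is what makes the two spellings of "á" compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

All four normalisation forms replace the compatibility codepoint, so any text pipeline that normalises at all will lose the distinction.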
JIS Coded Character sets to Unicode mapping for multiple variants
Selected further reading
- Ken Lunde’s formidable CJKV Information Processing, and precursor/companion CJK.INF. A couple of versions of CJK.INF (2.1, 1.9) are available online, and up to date to the mid-to-late nineties. Referencing the (much more recent) second edition of the print work, chapters 3 and 4 are most relevant here. Supplementary examples and appendix data not included in the print book itself (as of the current edition) are instead distributed online here and here.
- SLJ FAQ on encodings – generally informative as a basic overview.
- Overview on JIS editions by Kazushi Marukawa
- Wikipedia articles on the individual encodings. Which I have been to an extent working on improving.
- Various mapping data files are available here from the Unicode Consortium, and here from ICU (originally under IBM, now also under the Unicode Consortium).
- The standard for encodings in HTML5 is thataway.
- IBM’s Character Data Representation Architecture (CDRA) documenting several code pages used to be available on their website, but has now mostly been taken down.