JIS Character Sets Explained

ISO 646, JIS X 0201 and early representations

The ISO-646 7-bit character set standard is best known through its United States variant (ISO-646-US, more often called ASCII, or American Standard Code for Information Interchange), but there was previously a plethora of national variants. Most characters were to be supported by all variants, but certain codepoints could be changed so that more nationally relevant symbols, which would otherwise be missed out, could be supported. For example, the UK version occupied 0x23 with a pound sign (£, for the GBP currency) and 0x7E with an overline (‾) or spacing macron (¯). But in US-ASCII, these are respectively the octothorpe (#) and simple tilde (~).

The Japanese variant (ISO-646-JP, also called JISCII or JIS-Roman) treated 0x23 like in ASCII and 0x7E like in ISO-646-GB, but occupied 0x5C (used for the backslash \ in both UK and US variants) with the Latin-based yen sign ¥. So, it differs (minorly in practice) from ASCII at 0x7E, and it differs significantly from ASCII at 0x5C. Keep note of that; it becomes important later.
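The differences described so far are small enough to write down as a lookup table. What follows is a minimal Python sketch rather than a real codec: the function name `decode_iso646` and the variant labels "us", "gb" and "jp" are made up for illustration, and only the codepoints mentioned above are covered.

```python
# Codepoints where each ISO 646 variant departs from ASCII (ISO-646-US).
# Only the differences mentioned in the text are listed; the labels are
# illustrative, not registered charset names.
ISO646_OVERRIDES = {
    "us": {},
    "gb": {0x23: "\u00a3", 0x7E: "\u203e"},  # pound sign, overline
    "jp": {0x5C: "\u00a5", 0x7E: "\u203e"},  # yen sign, overline
}


def decode_iso646(data: bytes, variant: str = "us") -> str:
    """Decode 7-bit bytes, applying the national replacements for `variant`."""
    overrides = ISO646_OVERRIDES[variant]
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"not a 7-bit byte: {byte:#04x}")
        chars.append(overrides.get(byte, chr(byte)))
    return "".join(chars)


if __name__ == "__main__":
    sample = b"C:\\DOS\\~tmp #1"
    for variant in ("us", "gb", "jp"):
        print(variant, decode_iso646(sample, variant))
```

The same bytes come out with backslashes under "us" and "gb" but with yen signs under "jp", which is the divergence at 0x5C noted above.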

The macron is used to indicate long vowels in certain systems of romanising Japanese (which might give Tōkyō; alternatives include Tôkyô (using the circumflex), Toukyou (following Hiragana orthography), Tookyoo (doubling long vowels), Tohkyoh (using the “oh” digraph) or Tokyo (discarding vowel length)). The spacing overline is not of much use in modern-day computing, but on a typewriter/teletype it could be combined with backspace to be overstamped onto the letter.

But a 7-bit stateless form of JISCII can only represent Japanese in romanised form, because neither Kanji/Hiragana nor even Katakana has adequate encoding space within the nationalisable code points. There are two ways of working around that; I’ll get to these and their applications in due course.

Compare that with Morse code, where the prosign (control character) -..--- (N+J or D+O run together, meaning “next Japanese”) was used to switch to Kana mode. The ...-. (S+N or V+E) control was used to switch back. The Kana Morse codes are most probably not relevant to your life, so I will not list them all out here, but the curious can check Wikipedia. Hiragana and Katakana were not distinguished by this encoding, so Katakana alone was used to transcribe the message.

ASCII was designed as a 7-bit character set, with the eighth bit being either absent, a checksum or unused (ASCII itself didn’t specify). This contrasted with EBCDIC (Extended Binary Coded Decimal Interchange Code), which was an 8-bit character set with twice as many codepoints. But EBCDIC was not well designed: capital and lowercase letters were not in continuous ranges, which made casemapping routines needlessly complicated. EBCDIC was also not even slightly compatible with ISO 646, and mutual compatibility of anything other than English letters and digits between EBCDIC variants was limited (in at least one unusually divergent case, even lowercase letters don’t correspond). Unless you work on IBM mainframes, EBCDIC is not relevant to your life either, so I’ll leave out further details on that.

Combining the advantages of ASCII and EBCDIC, however, was Extended ASCII, which used ASCII for bytes with the high (eighth) bit unset, and an extended encoding for bytes with it set. A particularly influential such setup in the Occident was the DEC Multinational Character Set (or MCS), which stratified into two 16-code rows for extended control characters (mirroring ASCII’s two such rows), two rows of uncased characters (mainly punctuation), and two rows for each letter case. This elegant design meant that casemapping routines could be almost as simple as in ASCII itself. In theory. Subsequent standards and variants messed that up completely. The most common single-byte variant today is Windows-1252, also called WinLatin1 or “Western European (Windows)”, which likely contributed to ISO-646-GB being lost to oblivion due to including the £ with its own codepoint.

Extension was possible on septet-byte machines too, since the Shift Out control character would switch to the non-646 set (equivalent to mapping 0x20–0x7F to 0xA0–0xFF). But it only reached 96 extended characters (0xA0–0xFF), with the use of 0xA0 and 0xFF being potentially technically problematic due to being mapped to the space and delete characters, so many extended ASCII sets were designed to keep the extended characters within the 0xA1–0xFE or 0xA0–0xFF range. Shift In switched back. This is a massive simplification; if you want the details, I refer you to ECMA-35.

Needless to say, Extended ASCII, or rather extended ISO-646, did catch on in Japan too, but with an entirely different extended encoding. Four rows were dedicated to kana, without distinguishing Hiragana (so, dedicated to Katakana), with the Japanese-style comma (、), full stop (。) and speech marks (「」) and the interpunct (・) getting their own codepoints. Diacritic forms were not included separately; the diacritic characters were given separate codepoints and were inserted after the respective kana. The first two rows of the extended area were left open, so the characters were in the reach of Shift Out (and which also theoretically allowed extended control characters in that space), and the last two were unused, which becomes important later. The ISO-646 encoding used in the lower half (meaning, with the high bit unset) was of course Japan’s own variant (ISO-646-JP). Both ISO-646-JP and the single-byte kana set, as well as the 7-bit and 8-bit encodings thereöf, were originally standardised by JISC as JIS C 6220, which is now called JIS X 0201.
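To make the 7-bit and 8-bit forms concrete, here is a rough Python sketch of a JIS X 0201 decoder. It assumes the usual Unicode mapping of the kana codes 0xA1–0xDF onto the halfwidth forms at U+FF61–U+FF9F, treats the 7-bit form as the 8-bit code with the high bit stripped and bracketed by Shift Out / Shift In, and uses made-up names (`decode_jisx0201`, `compose_kana`); it is an illustration, not a validated implementation.

```python
import unicodedata

SO, SI, DEL = 0x0E, 0x0F, 0x7F  # Shift Out, Shift In, Delete
ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # yen sign, overline
KANA_OFFSET = 0xFF61 - 0xA1  # assumed mapping: 0xA1..0xDF -> U+FF61..U+FF9F


def decode_jisx0201(data: bytes) -> str:
    """Decode JIS X 0201 given in its 8-bit form or its 7-bit SO/SI form."""
    out = []
    shifted = False  # True while Shift Out is in effect
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        elif byte <= 0x20 or byte == DEL:
            out.append(chr(byte))  # controls and space are the same in both sets
        elif shifted:
            if not 0x21 <= byte <= 0x5F:
                raise ValueError(f"unassigned kana code: {byte:#04x}")
            out.append(chr((byte | 0x80) + KANA_OFFSET))
        elif byte < 0x80:
            out.append(ROMAN_OVERRIDES.get(byte, chr(byte)))
        elif 0xA1 <= byte <= 0xDF:
            out.append(chr(byte + KANA_OFFSET))
        else:
            raise ValueError(f"unassigned byte: {byte:#04x}")
    return "".join(out)


def compose_kana(text: str) -> str:
    """Fold halfwidth kana plus trailing diacritic into ordinary Katakana."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(decode_jisx0201(b"\xb6\xc5"))          # 8-bit form
    print(decode_jisx0201(b"\x0e\x36\x45\x0f"))  # 7-bit form of the same text
    print(compose_kana(decode_jisx0201(b"\xb6\xde")))
```

The last line shows the diacritic convention: the voicing mark is its own code following the kana, and only a later normalisation step fuses the pair into a single character.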

ISO 2022, JIS X 0208 and the Double Byte Character Set

The lack of Kanji (and Hiragana) in JIS X 0201 allowed only a simplified writing system which, while suitable for phonetically transcribing the language to an extent (albeit leaving near homophones distinguished by speech inflection ambiguous), did not retain the semantic and other distinctions achieved through Kanji. Furthermore, the order of the kana codepoints did not make for sensible sorting.

JIS X 0208, originally called JIS C 6226, was an entirely new encoding. Control characters and the ASCII space were represented like in ASCII. The remainder were represented with ASCII printing codepoints in pairs. This meant 94 possible first bytes and 94 possible second bytes, so the character set was arranged into 94 “ku” (rows or wards), each with 94 “ten” (cells). The first byte (minus 32) would specify the row (the rows being numbered 1 to 94, yes, 1-indexed) and the second would similarly specify the cell within that row. A character might be specified by the said bytes (e.g. 0x2121 for the ideographic space).

As well as being represented by their indexing bytes, the codepoints can also be indexed by pairs of 1-indexed numbers (i.e. from 1 to 94), called a “kuten”. The WHATWG encoding standard introduces a further codepoint representation where a single number counting from zero, called a “pointer”, indexes the characters, such that the first row includes pointers 0–93, the second includes pointers 94–187 and so forth. Neither of these is a character encoding, though; they are academic or internal representations. More character encodings of JIS X 0208 also exist; I’ll come to those later.
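The three representations are related by simple arithmetic, which can be sketched in Python as follows. The function names are mine, not taken from any standard, and the range checks only enforce the 94×94 grid, not whether a cell is actually allocated.

```python
def bytes_to_kuten(first: int, second: int) -> tuple:
    """Convert a JIS X 0208 byte pair (0x21-0x7E each) to 1-indexed (ku, ten)."""
    for byte in (first, second):
        if not 0x21 <= byte <= 0x7E:
            raise ValueError(f"not a JIS X 0208 byte: {byte:#04x}")
    return first - 0x20, second - 0x20


def kuten_to_bytes(ku: int, ten: int) -> bytes:
    """Convert 1-indexed (ku, ten) back to the two indexing bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError(f"kuten out of range: {ku}-{ten}")
    return bytes((ku + 0x20, ten + 0x20))


def kuten_to_pointer(ku: int, ten: int) -> int:
    """Flatten (ku, ten) into a zero-based pointer: row 1 is 0-93, row 2 is 94-187."""
    return (ku - 1) * 94 + (ten - 1)


def pointer_to_kuten(pointer: int) -> tuple:
    """Inverse of kuten_to_pointer."""
    ku0, ten0 = divmod(pointer, 94)
    return ku0 + 1, ten0 + 1


if __name__ == "__main__":
    ku, ten = bytes_to_kuten(0x21, 0x21)  # the ideographic space
    print(ku, ten, kuten_to_pointer(ku, ten))
```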

While not ASCII-compatible or even ISO-646 compatible, this layout allowed it to be used with ISO/IEC 2022 (ECMA-35 or JIS X 0202), which defines (amongst other things) a means of switching between 7-bit encodings with certain structural constraints using sequences with the ESC (escape) control character. A profile of this system defined by RFC 1468, starting off in ASCII and switching to and from JIS X 0208, never using the high bit, became the pre-Unicode standard for Japanese-language e-mails (at the time, e-mails had to be 7-bit clean) and is still relevant today, much to the frustration of the WHATWG (which I’ll come to later). The label “ISO-2022-JP” is given to this usage, first specified in RFC 1468 and later included by JISC themselves in the 1997 edition of JIS X 0208. Due to it using the JIS X 0208 two-byte representations unmodified, and due to usage of pure JIS X 0208 being rare, it’s also often simply called the “JIS encoding”, a name which also references the Japanese-language definition of the mechanism being in JIS X 0202.
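To see the escape sequences at work, the sketch below encodes a short string with Python’s built-in `iso2022_jp` codec and then splits the result into runs by charset. The table of escape sequences is the set I understand RFC 1468 to define; `split_iso2022jp` is a made-up helper that does no validation, so treat it as an illustration only.

```python
# Escape sequences (the bytes following ESC) as I understand RFC 1468 to list them.
ESCAPES = {
    b"(B": "ASCII",
    b"(J": "JIS X 0201 Roman",
    b"$@": "JIS X 0208-1978",
    b"$B": "JIS X 0208-1983",
}


def split_iso2022jp(data: bytes) -> list:
    """Split ISO-2022-JP bytes into (charset name, raw bytes) runs."""
    runs = []
    charset = "ASCII"  # the profile starts off in ASCII
    start = i = 0
    while i < len(data):
        if data[i] == 0x1B:  # ESC introduces a three-byte switching sequence
            if i > start:
                runs.append((charset, data[start:i]))
            charset = ESCAPES[data[i + 1:i + 3]]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        runs.append((charset, data[start:]))
    return runs


if __name__ == "__main__":
    encoded = "A\u3042B".encode("iso2022_jp")  # Latin A, Hiragana A, Latin B
    print(encoded)
    print(all(byte < 0x80 for byte in encoded))  # 7-bit clean throughout
    for charset, raw in split_iso2022jp(encoded):
        print(charset, raw)
```

The Hiragana character appears as the unmodified JIS X 0208 byte pair, wrapped in a switch into JIS X 0208 and a switch back to ASCII.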

But JIS X 0208 could also be used alone; it included punctuation and letters in addition to kanji and kana, which also meant that Roman letters could be specified in ISO-2022-JP using either ASCII or JIS X 0208. It did not however attempt to sort out a lossless conversion of ASCII data to its own charset (e.g., lacking straight quotation marks and disunifying the hyphen from the minus sign). All of this had other consequences later.

Although there were theoretically 94 rows, the last 10 rows were not yet used, being left out for future expansion. The first two rows were used mainly for punctuation and symbols, followed by one containing Roman letters and Euro-Arabic digits, another containing Hiragana, another containing identically ordered Katakana, another containing Greek letters, another containing Cyrillic letters and an eighth row containing box drawing characters (initially unallocated but allocated in the 1983 version). The following seven rows were left out for future non-Kanji expansion, with the sixteenth row being the start of the Kanji block. These seventeen unallocated rows became important later. The kana ordering differed from the JIS X 0201 order in order to make more sense for sorting, and diacritic kana were provided pre-composed.

A neat feature worth mentioning while we’re still on this topic: unlike the Greek and Cyrillic characters which start at the beginnings of their respective rows, the Roman letters start some distance into their row, skipping out several spaces at the beginning, so the second representing byte matches that letter in ISO 646, so purely Roman text in JIS X 0208 remains fairly readable if misinterpreted as ASCII or as JIS X 0201 (somewhat like #H#e#l#l#o!!#W#o#r#l#d!*).
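That sample can be reproduced in a couple of lines of Python. The sketch leans on two assumptions of mine: that stripping the high bit from the output of Python’s `euc_jp` codec yields the raw JIS X 0208 byte pairs, and that the input consists only of characters which that codec treats as JIS X 0208 (the helper `to_fullwidth` converts printable ASCII to its fullwidth counterpart first). Both helper names are made up.

```python
def to_fullwidth(text: str) -> str:
    """Map printable ASCII (0x21-0x7E) to fullwidth forms and space to U+3000."""
    return "".join(
        "\u3000" if ch == " " else chr(ord(ch) + 0xFEE0) for ch in text
    )


def raw_jisx0208(text: str) -> bytes:
    """Encode JIS X 0208 text and clear the high bit of every byte."""
    return bytes(byte & 0x7F for byte in text.encode("euc_jp"))


if __name__ == "__main__":
    # Misread the raw JIS X 0208 bytes as ASCII, as described in the text.
    print(raw_jisx0208(to_fullwidth("Hello World!")).decode("ascii"))
```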

This encoding has gone through several versions since its release in 1978. In 1983, several more characters were added (mostly non-kanji symbols in the second row), and several synoglyphic Kanji variants were added and/or swapped in order to sort out the level 1 / level 2 strata, which was significant enough to warrant the new version being assigned a separate ISO-2022 switching code (although often both get treated like the later version). More on that later.

 ※ I should perhaps link here to a table of the JIS X 0208 editions’ kanji changes. Also lists collisions with Microsoft extensions for the variants’ JIS X 0213 (more on that later) codepoints.


Microsoft, in the process of expanding out into Japan, developed connections with the publishers of a microcomputer magazine by the name of (confusingly enough) ASCII (a forerunner of the current ASCII Media Works). One outcome of this association was figuring out an encoding of JIS X 0208 which retained compatibility with JIS X 0201 (and, so, with ISO-646), which has since turned out to be far more important than ISO-2022 compatibility (which it does not retain, on account of (for example) repurposing the C1 area).

The way it worked was that each Shift_JIS ward constituted a pair of consecutive JIS X 0208 wards, which meant that only 47 initial bytes were required. As the first two and last two rows of the upper half of JIS X 0201 were not used by that standard, that meant that 64 initial bytes were available. The first such byte (0x80) was skipped, leaving 63 possible initial bytes or a total of 126 representable JIS X 0208 rows, out of only 94 present in the JIS X 0208 standard, which becomes important later.

As the second byte would have to index within a two-row ward, not an individual row, 188 such bytes were required. But a second byte would be clearly a second byte, assuming no initial truncation, so it was possible to reuse halfwidth katakana and even ASCII characters there. The ASCII controls, including the first two rows and the DEL character (0x7F), were nonetheless skipped. The second two ASCII rows were also skipped including the digits, ASCII space and 21 out of 32 of the punctuation marks, including all 19 invariant punctuation marks such as the ", ' and <> which are sensitive to HTML syntax, which leaves 191 possible second bytes, of which only the first 188 are needed.
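The packing just described can be sketched as a transformation from kuten to Shift_JIS bytes. The specific offsets used below (lead bytes starting at 0x81 and resuming at 0xE0 after the halfwidth katakana, trail bytes 0x40–0x7E and 0x80–0xFC) are the customary Shift_JIS values rather than anything spelled out above, and `kuten_to_sjis` is a made-up name. As an assumption of this sketch, rows up to 120 are accepted, which is as far as lead byte 0xFC reaches.

```python
def kuten_to_sjis(ku: int, ten: int) -> bytes:
    """Convert a 1-indexed (ku, ten) to a two-byte Shift_JIS code."""
    if not (1 <= ku <= 120 and 1 <= ten <= 94):
        raise ValueError(f"kuten out of range: {ku}-{ten}")
    ward = (ku - 1) // 2  # each Shift_JIS ward holds two consecutive rows
    # Wards 0-30 use lead bytes 0x81-0x9F; later wards resume at 0xE0,
    # stepping over the halfwidth katakana range.
    lead = ward + (0x81 if ward < 31 else 0xC1)
    if ku % 2 == 1:
        trail = ten + 0x3F  # odd row: lower half of the trail range, from 0x40
        if trail >= 0x7F:
            trail += 1      # step over DEL (0x7F)
    else:
        trail = ten + 0x9E  # even row: upper half, 0x9F-0xFC
    return bytes((lead, trail))


if __name__ == "__main__":
    for ku, ten in ((1, 1), (4, 2), (16, 1)):
        print(ku, ten, kuten_to_sjis(ku, ten).hex())
    # Cross-check one value against Python's own codec.
    print(kuten_to_sjis(4, 2) == "\u3042".encode("shift_jis"))
```

Counting the distinct trail bytes produced across one odd and one even row gives the 188 mentioned above.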

Since that left two different ways of encoding Katakana and most of ASCII, the one-byte forms were distinguished from the two-byte forms by their width, with the two-byte forms being displayed roughly the same width as a Kanji, and the one-byte forms being more narrow (roughly half as wide), which avoided breaking anything that assumed the physical length of a piece of text to be proportional to the number of bytes. This became known as “fullwidth” versus “halfwidth”.

This encoding, “shift-coded” JIS or “Shift_JIS” as it became known, became the basis for encoding Japanese on Windows and Apple computers and was eventually standardised in the 1997 edition of JIS X 0208. The potentially confusing name refers to the “JIS” (X 0208) codes being “shifted” around the existing codes, not to the use of actual shift-codes which it does not involve (and which I’ll come to later).

Beyond Shift_JIS: IBM, NEC and Windows-31J

Following Microsoft’s general policy of not following their own standards, Microsoft’s own Windows “shift_jis” differs from the standard in certain ways.

Firstly, it’s based on ASCII rather than ISO-646-JP. To be fair, that was necessary due to the reliance of DOS and Windows on the backslash as the primary path separator. (This was in turn for backward compatibility: DOS would consider a command name to end before the first forward-slash even if a space was not inserted, because a forward-slash was used for command-line option syntax. In every other context, Windows accepts either slash as equivalent but renders paths using the backslash.) Japanese fonts nonetheless render the backslash as a yen sign for ISO-646-JP compatibility, which does interesting things to the rendering of Windows paths.

Secondly, it includes extensions, both within the 94 rows defined by the JIS X 0208 standard and comprising 29 rows beyond it. Extending JIS X 0208 was not unusual or unique to Microsoft: for example, cellular carriers developed their own proprietary extensions (which included the original Emoji). In fact, the extensions themselves originated from NEC and IBM, not from Microsoft, who basically defined their encoding as a superset of the IBM variant incorporating certain NEC extensions.

IBM’s version, which IBM call “Japanese DBCS–PC” (note that that term does not include the single-byte codepage which it’s used with, called “Japan PC-Data SB”), does not allocate codepoints within the 94 rows defined by the JIS X 0208 standard but does so within the 32 rows or 16 wards beyond it.

IBM and Microsoft mark out rows 95 through 114 (Shift_JIS wards 0xF0-0xF9) for what IBM call User Defined Characters (UDC), and which Microsoft calls End User Defined Characters (EUDC). In other words, it’s a private-use area.

IBM’s rows 115 through 119 (Shift_JIS wards 0xFA-0xFC) contain IBM extensions.

Row 115 starts with 28 codepoints dedicated to “non-Kanji” characters: lowercase Roman Numerals i-x, uppercase Roman Numerals, a not-sign, a broken pipe, single and double straight quotation marks, a Kanji-derived “kabushiki kaisha” symbol (which apparently still counts as “non-Kanji”), square Roman-derived “No.” and “Tel” symbols and a “because” sign. (IBM themselves, however, currently list the cells for the not-sign and “because” sign as unallocated, presumably due to the said signs presently having separate codepoints in standard JIS X 0208).

The remainder of rows 115-118 (Shift_JIS wards 0xFA-0xFB) and the first twelve cells of row 119 (at the beginning of Shift_JIS ward 0xFC) are occupied by 360 IBM-selected Kanji characters.

Microsoft’s variant also includes the following extensions from NEC’s version of JIS X 0208, which appears not to have been designed to be used with Shift_JIS. So in contrast to IBM’s extensions, these do not occupy rows beyond JIS X 0208, but rather occupy unallocated rows within that standard.

NEC’s row 13 (in Shift_JIS ward 0x87) contains circled Euro-Arabic numerals, uppercase Roman Numerals, Katakana-derived and Roman-derived square symbols for e.g. units, Kanji-derived symbols (circled Kanji, composed Kanji for era names) and mathematical symbols (several of the latter have had standard codepoints in JIS X 0208 row 2 since the 1983 edition).

NEC’s rows 89 through 92 (Shift_JIS wards 0xED and 0xEE) contain all characters which are present in IBM’s rows 115 through 119 but absent from NEC’s row 13, including and starting with all the Kanji, with said subset of the non-Kanji (lowercase Roman numerals, not-sign, broken pipe, straight quotation marks) being placed at the end of row 92.

This, of course, means that rows 115 through 119 are entirely redundant to rows 89 through 92 and parts of row 13 in Microsoft’s version.
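The redundancy is easy to observe with Python’s `cp932` codec, which (as far as I can tell) decodes both the IBM and the NEC position of an extension Kanji to the same Unicode character. The two byte pairs below are the first IBM-selected Kanji (row 115, cell 29) and its NEC copy (row 89, cell 1); which of the two the codec emits when encoding is left for the script to report rather than assumed. The helper name `roundtrip` is made up.

```python
def roundtrip(raw: bytes) -> tuple:
    """Decode Windows-style Shift_JIS bytes and encode the result again."""
    text = raw.decode("cp932")
    return text, text.encode("cp932")


if __name__ == "__main__":
    ibm_form = b"\xfa\x5c"  # row 115, cell 29: first IBM-selected Kanji
    nec_form = b"\xed\x40"  # row 89, cell 1: NEC's copy of the same Kanji
    for label, raw in (("IBM", ibm_form), ("NEC", nec_form)):
        text, encoded = roundtrip(raw)
        print(label, raw.hex(), f"U+{ord(text):04X}", encoded.hex())
```

Since two byte sequences decode to one character, only one of them can survive a round trip, so the duplicates are not interchangeable in byte-exact contexts.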

NEC themselves included further extensions in rows 9 through 12, which were not included by Microsoft. NEC’s rows 9 and 10 contain JIS X 0201, with row 10 also including halfwidth composed diacritic katakana forms (it appears that NEC were trying to put 0201 in 0208, the exact opposite of what Microsoft were doing). NEC’s row 11 contains halfwidth box drawing characters (in Unicode order) and extended halfwidth punctuation including curly quotation marks and lenticular brackets. NEC’s row 12 includes fullwidth box drawing characters. It should be noted that NEC were extending the 1978 edition of JIS X 0208 so e.g. many of the fullwidth box drawing characters are redundant to codepoints added in the 1983 versions, although there’s no collision between the NEC version and the 1983 or 1990 version.

 ※ It’s also worth noting that NEC also have their own, different, extensions to single-byte JIS X 0201. Unlike the extensions present in row 10, these were not focused on katakana, but added progress bar characters and box drawing characters in the C1 range and assorted things after the halfwidth katakana, including playing card symbols and a small number of kanji for giving dates and times and yen prices.

I previously commented upon the odd nature of NEC’s row 88 but, upon closer inspection of the characters in question and upon identifying them to be absent in another PC98 font file which I checked, I conclude it to have simply been junk data in the specific font file in question.

IANA call the combined Microsoft variant “Windows-31J”, a name which Microsoft does not use. It’s also called “MS_Kanji” e.g. by Python, although IANA treat “MS_Kanji” as an identifier for standard Shift_JIS. It has the codepage number 932 on Windows, the same as IBM’s number for a version of their variant.
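The labelling can be checked from Python itself. This short sketch looks up what the `ms_kanji` alias resolves to and then compares Python’s `shift_jis` and `cp932` codecs on the circled digit one (U+2460), one of the NEC row 13 characters; the helper `can_encode` is a made-up convenience.

```python
import codecs


def can_encode(text: str, encoding: str) -> bool:
    """Report whether `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Which codec does Python mean by "ms_kanji"?
    print(codecs.lookup("ms_kanji").name)
    circled_one = "\u2460"  # an NEC row 13 extension character
    for encoding in ("shift_jis", "cp932"):
        print(encoding, can_encode(circled_one, encoding))
```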

Windows-932 differs from IBM-932 in that IBM-932 does not include the NEC codepoints. Also, IBM-932 does not follow the character variant swaps made in 1983, preferring to retain greater backward compatibility with the 1978 edition of JIS X 0208 (while nonetheless including most of the codepoints added in later editions). Also worth noting: Windows-932 uses ASCII for its lower half, while IBM-932 uses an extended ISO-646-JP with box drawing characters in the first two rows, at least in theory (the box drawing characters and the controls they displace can be switched between using control codes, but the codepoints get mapped to controls in Unicode). IBM offers Microsoft’s variant of the double-byte codes (with NEC extensions) in code page 943 (“IBM-943”), which also incorporates their own ISO-646-JP extensions.

Note that IBM’s C0 controls arrangement does not entirely match ANSI/ECMA/ISO standards: they put File Separator in the Control-Z position, Substitution Character in the seven-ones position and Delete in the position vacated by FS. But Microsoft’s version follows the standards as regards control character mapping.

IBM later extended the single-byte codes of IBM-932, adding the cent sign, pound sign, not sign, backslash and tilde to 0x80, 0xA0, 0xFD, 0xFE and 0xFF: the result was called IBM-942. But those codes collide with unrelated Apple extensions (and Windows noncommittally maps them to private use). ICU includes two Unicode mappings for IBM-943 (one also called IBM-943C and ASCII based, one not). Also, the ICU mapping for IBM-942 is ASCII-based, resulting in duplicate single-byte encoding of the backslash and tilde.

Microsoft’s documentation and APIs simply label their version Shift_JIS, so it became prevalent on the web simply as “Shift_JIS”, which the W3C/WHATWG encoding standard used by HTML5 takes into account by incorporating the IBM and NEC extensions to JIS X 0208 and Shift_JIS into its respective definitions. Encoding to Shift_JIS in particular per that standard avoids rows 89 through 94, so preferring the original IBM codepoints for the extended Kanji.

Apple, MacJapanese and the CORPCHAR extensions

Like Microsoft, Apple had meanwhile added their own (incompatible) extensions. These included more special characters in rows 8–14, and vertical presentation forms in rows 85–89 (at 84 rows down from their normalised forms). This also included the backslash, required space, copyright sign, trademark sign and halfwidth (i.e. low) horizontal ellipsis in the vacant single byte space (0x80, 0xA0, 0xFD, 0xFE, 0xFF). The tilde was present rather than an overline. Like NEC, no new double byte assignments were added beyond the JIS X 0208 space, only within it, for some reason.

Whilst the repertoire demonstrated a not-insignificant overlap, the layout of these extensions was not like those in any of the PC versions. But certain fonts did not implement these extensions, and some implemented an alternative “PostScript” variant with incompatible extensions, including a different set of special characters, taken from the non-IBM-selected fullwidth NEC extensions. More characters were available in the printer versions of those fonts than in the screen versions. Besides PostScript fonts, this was apparently also the ordinary version prior to KanjiTalk 7.

Earlier, in System 7.1, the vertical forms in certain fonts (those which included neither extended special character set) were 10 rows down from their normalised forms. They were subsequently moved to their current location 84 rows down.

To summarise (x-mac-japanese and windows-31j are established labels, the rest I made up):

Rows 9-10  | Nothing        | Apple extensions | Nothing                                         | Nothing
Row 11     | Vertical forms | Apple extensions | Nothing                                         | Nothing
Row 12     | Nothing        | Apple extensions | NEC fullwidth box drawing                       | Nothing
Row 13     | Nothing        | Apple extensions | NEC special characters                          | NEC special characters
Row 14     | Vertical forms | Apple extensions | Nothing                                         | Nothing
Row 15     | Vertical forms | Nothing          | Nothing                                         | Nothing
Rows 85-94 | Nothing        | Vertical forms   | Vertical forms, including NEC special characters | NEC selection of IBM extensions

Some characters did not exist in Unicode at the time. Some still don’t, while some have been added in the interim. Apple firstly solved this by mapping them onto the Unicode Private Use Area. But that turned out to be bad for interoperability, so Apple switched to using combining sequences where possible, and otherwise using a compatibility normalisation (failing that, a close substitute) of the character combined with private use characters (either functioning as variation or presentation form selectors, or marking a sequence of codepoints as representing one MacJapanese character).

Similarly, Apple originally mapped the single-byte ellipsis to the normal horizontal ellipsis and the double byte one to the mathematical vertically centred horizontal ellipsis, but switched to mapping the double byte one to the normal one and using a private use marker on the single byte one, since mapping to the mathematical one was apparently not handled well in the opposite direction by Windows (i.e. bestfit932).

Apple provides to Unicode a file named CORPCHAR.TXT which details all of their private use mappings, including those used for MacJapanese as well as those used for their other character sets and East Asian charset variants.

Apple’s published mappings still map to Unicode 2.1 and so use these markers, even for characters for which actually matching codepoints have since been added. But, comments in the mapping file provide mappings to Unicode 4.0 for characters that had been added by that point.

JIS X 0212 and the Extended Unix Code (EUC)

In 1990, JISC put out two standards that are relevant here. One was a further revision to the already-established JIS X 0208 standard, which merely disunified two existing kanji and, due to this time being a strict superset of the previous (1983) version, did not warrant another new ISO 2022 code. The other was JIS X 0212. This was an entirely separate 94×94-cell character set, which was not of much use on its own, containing only characters which were absent from JIS X 0208 (5801 Kanji and 245 non-Kanji). The idea was presumably that ISO 2022 mechanisms would be used to switch between the two character sets as necessary (a setup referred to as “ISO-2022-JP-1”). Since both JIS X 0208 and JIS X 0212 were 7-bit, the high bit could theoretically be used for that too.

As with JIS X 0208, the first fifteen rows contained non-Kanji characters (though they were mostly unallocated). The sixteenth through seventy-seventh rows contained Kanji and the remainder was unallocated.

While JIS X 0208 and JIS X 0212 are separate 94×94 sets with mostly-colliding Kanji allocations, the non-Kanji allocations avoided codepoints that were used in JIS X 0208, leaving them unallocated. Hence, it would be possible to support only the non-Kanji portion of JIS X 0212 without a mechanism for switching between them. Whilst I have found no evidence that this provision was ever directly useful to anyone, the unallocated rows which resulted (and the empty rows after the Kanji) did become important later (as with JIS X 0208 and Shift-JIS, but not quite for the same reasons).

JIS X 0212 subsequently became somewhat of an embarrassment for those responsible for the JIS character sets, since its authors neglected to properly document the characters, making it difficult to tell why they were added or what they matched or were unifiable with. It was also criticised for not properly honouring JIS X 0208’s unification criteria; in fact, some of its characters matched subsequently revised reference glyphs from the original 1978 edition of JIS X 0208. That being said, these disunifications (mainly separating traditional characters from their extended‑shinjitai forms) were not unjustifiable.

While use of JIS X 0212 in most encodings simply did not catch on, even where it was possible (it does not fit in Shift_JIS, but one IBM EUC code page gives PC Data (that is, Shift_JIS) mappings for the mere subset of the JIS X 0212 characters it includes), there was one exception, itself with a catch.

The 8-bit Extended Unix Code (EUC) works as follows. The lower half of the encoding (with the high bit unset) gets assigned to an ISO-2022-compliant encoding, usually an ISO-646 variant such as ASCII or ISO-646-JP, and those bytes are not used for any other purpose. The first two rows of the upper half of the encoding are used for control characters (including two single-shifts) and one, two or three bytes (as appropriate) from the remaining rows represent a character from another ISO 2022 compatible 7-bit encoding (with the high bit set). The control characters in the upper half include 0x8E and 0x8F (single shifts), which are used for indicating additional ISO-2022-compliant encodings and are followed by one, two or three bytes from the non-control rows of the upper half.

EUC coding is popular on UNIX, where most other ways of encoding DBCSs would not be POSIX-compatible, but it can also be used elsewhere. The names of the Mainland Chinese (GB2312) and Korean (KS X 1001) counterparts to JIS X 0208 have become used almost interchangeably with their EUC forms (EUC-CN and EUC-KR), and their Windows encodings (GBK and UHC) are supersets of their EUC encodings, although they are not themselves valid EUC. EUC-JP, the EUC form of JIS X 0208, did not become nearly as popular: non-Unix systems tended to adopt Shift_JIS instead.

The presence of single-shifts in EUC meant that it was possible to represent more sets by preceding representations with a single shift. The 0x8E single shift was already used to precede a character from the upper half of JIS X 0201, so the 0x8F single shift was adopted for JIS X 0212 characters, but only by non-Microsoft software. Microsoft’s system locales assumed for a long time that a character would not be more than two bytes (like in Shift_JIS, GBK, UHC and Big5), which is why UTF-8 only became a system locale very recently.
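Put together, the EUC-JP byte structure can be sketched as a small tokenizer that labels each character by the set it comes from. This is an illustrative Python sketch under the layout described above: the name `parse_euc_jp` and the set labels are mine, it does no lookup of actual characters, and it does not guard against truncated input.

```python
SS2, SS3 = 0x8E, 0x8F  # the two single shifts


def parse_euc_jp(data: bytes) -> list:
    """Split EUC-JP bytes into (set label, code) tokens.

    Double-byte codes are reported as 1-indexed (ku, ten) pairs; single-byte
    codes are reported as the byte value.
    """
    tokens = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:  # lower half: the ISO 646 set, one byte
            tokens.append(("ISO 646", byte))
            i += 1
        elif byte == SS2:  # single shift + one byte of JIS X 0201 kana
            tokens.append(("JIS X 0201 kana", data[i + 1]))
            i += 2
        elif byte == SS3:  # single shift + two bytes of JIS X 0212
            tokens.append(("JIS X 0212", (data[i + 1] - 0xA0, data[i + 2] - 0xA0)))
            i += 3
        elif 0xA1 <= byte <= 0xFE:  # JIS X 0208 with the high bit set
            tokens.append(("JIS X 0208", (byte - 0xA0, data[i + 1] - 0xA0)))
            i += 2
        else:
            raise ValueError(f"unexpected byte {byte:#04x} at offset {i}")
    return tokens


if __name__ == "__main__":
    sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
    for token in parse_euc_jp(sample):
        print(token)
```

The three-byte JIS X 0212 form is the one that falls foul of the two-bytes-per-character assumption mentioned above.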

IBM’s version of EUC-JP occupied rows 83 and 84 of JIS X 0212 with a selection of their vendor extensions, apparently serving to allow all characters representable in IBM-932 to be represented in their version of EUC-JP without the NEC extensions (this seems to use a different layout in at least one IBM EUC codepage versus eucJP-open though, despite encoding the same characters).

Like ISO-2022-JP, EUC-JP is an ISO-2022 mechanism, albeit one where the character sets are pre-arranged rather than loaded using unique codes (also pre-arranging which half of the encoding is used with the single shifts, specifically the upper half) and which hence cannot be arbitrarily mingled with other national ISO 2022 formats.

Host Data

IBM uses four different encoding types for Japanese characters: JIS (ISO 2022 form), EUC, PC data (meaning Shift_JIS) and Host Data. The latter is analogous to an ISO 2022 DBCS, but the reserved single byte characters are controls at 0x00–0x3F (not 0x00–0x1F) and 0xFF (not 0x7F), and the space at 0x40 (not 0x20), with the lead/trail bytes being 0x41–0xFE (rather than 0x21–0x7E). To put it very simply, host code is to EBCDIC as ISO 2022 is to ASCII.

Again, you probably don’t care about that unless you’re using an IBM mainframe.

JIS X 0213 and why arbitrary proprietary assignments are a bad idea

In 2000, JISC released a new standard, JIS X 0213, intended to be the successor to JIS X 0208. In addition to JIS X 0208 characters, it included 2743 of the Kanji from JIS X 0212, 952 additional Kanji and many additional non-Kanji such as Roman letters with macrons or additional small Katakana used in Ainu.

Whereas JIS X 0208 had defined one 94×94 plane (or “men”) and JIS X 0212 had defined another, JIS X 0213 took a different approach. Nominally, it defined two planes. Plane 1 remained more-or-less compatible with existing standard JIS X 0208 codepoints, insofar as any revision of that standard had, but with many additions to the point of all rows being mostly or entirely occupied. It was, however, assigned its own ISO-2022 escape sequence. Although it also mostly retained compatibility with the NEC/Microsoft row 13 (deällocating some duplicate codepoints), it did not do so for rows 89 through 92, which it used for a different set of Kanji than the ones present in the NEC/Microsoft use of those rows.

In addition to the 94-row Plane 1, JIS X 0213 defined 26 more rows in plane 2, containing Kanji only. In reality, for compatibility with all three established encodings, these were encoded in two different ways. As far as Shift_JIS was concerned, they were mapped without intervening empty rows after the end of Plane 1, so colliding with the unrelated IBM/Microsoft extensions in that region and likely with proprietary cellular variants. This was done in the order 1, 8, 3, 4, 5, 12–15 and 78–94, with the placement of 8 between 1 and 3 ensuring the alternating odd-even numbers that Shift_JIS encoding algorithms might rely on (i.e. it preserves the property that the lower-valued half of the trail byte range is used for odd-numbered rows).
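The row shuffling is easier to follow as code. The sketch below lists the 26 plane 2 rows in the order given above, assigns them to consecutive rows starting at 95, and checks the odd/even property; the lead-byte formula is the customary Shift_JIS one rather than something stated in the text, and the function names are made up.

```python
# Plane 2 rows in their Shift_JIS order: 1, 8, 3, 4, 5, 12-15 and 78-94.
PLANE2_ROWS = [1, 8, 3, 4, 5, *range(12, 16), *range(78, 95)]


def plane2_to_sjis_row(row: int) -> int:
    """Map a JIS X 0213 plane 2 row onto the row it occupies after plane 1."""
    return 95 + PLANE2_ROWS.index(row)


def sjis_lead_byte(row: int) -> int:
    """Lead byte for a (possibly extended) row, two rows per lead byte."""
    ward = (row - 1) // 2
    return ward + (0x81 if ward < 31 else 0xC1)


if __name__ == "__main__":
    for row in PLANE2_ROWS:
        mapped = plane2_to_sjis_row(row)
        same_parity = row % 2 == mapped % 2
        print(f"plane 2 row {row:2d} -> row {mapped} "
              f"(lead byte {sjis_lead_byte(mapped):#04x}, parity kept: {same_parity})")
```

Swapping 8 back to its numerical position would flip the parity of every row before 12, which is exactly what the ordering is there to avoid.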

As for why those rows specifically were allocated, while Plane 2 could be accessed from the ISO/IEC 2022 system using its own escape sequence (a setup called “ISO-2022-JP-3”), the arrangement of these rows was designed to deliberately avoid colliding with any of the codepoints assigned by JIS X 0212, so allowing JIS X 0213 to be used within EUC-JP without changing the meaning of any existing standard-compliant content. Shame about the IBM extensions. One further consequence of this: JIS X 0213 and JIS X 0212 can be unambiguously used in a single EUC document (if we ignore the IBM extensions). The use of JIS X 0213 within EUC-JP is called “EUC-JISx0213”, while the use thereof within Shift_JIS is termed “Shift_JISx0213”, although these labels are sometimes (not always) used specifically to refer to variants encoding the first (2000) edition of JIS X 0213.
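A quick way to see the non-collision property is to compare the two sets of rows directly. The sketch below makes one assumption that the text above does not state: that JIS X 0212 occupies rows 2, 6, 7 and 9 to 11 for non-Kanji and rows 16 to 77 for Kanji. It also assumes the usual EUC-JP convention of prefixing the supplementary plane with the single shift byte 0x8F. The helper name euc_supplementary is invented for illustration.

```python
# Rows allocated in Plane 2 of JIS X 0213, as listed in the text.
JISX0213_PLANE2_ROWS = {1, 3, 4, 5, 8, *range(12, 16), *range(78, 95)}

# Assumption (not from the text): JIS X 0212 allocates rows 2, 6, 7, 9-11
# for non-Kanji and rows 16-77 for Kanji.
JISX0212_ROWS = {2, 6, 7, 9, 10, 11, *range(16, 78)}

SS3 = 0x8F  # EUC-JP single shift 3: prefix for the supplementary plane


def euc_supplementary(row, cell):
    """Encode a row/cell pair (each 1-94) as a three-byte EUC sequence."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([SS3, 0xA0 + row, 0xA0 + cell])


# No row is claimed by both standards, so no EUC byte sequence is ambiguous.
print(JISX0213_PLANE2_ROWS & JISX0212_ROWS)  # set()
print(euc_supplementary(1, 1).hex())         # 8fa1a1
```

Under that assumption the two row sets are disjoint, which is why a decoder can tell from the second byte alone which standard a three-byte sequence belongs to.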

Further defined were simple 7-bit and 8-bit formats using both planes, the former differentiating using 0x0E (shift out) to move to the second plane and 0x0F (shift in) to switch to the first, and the latter using the high bit. Since it would give access to the entirety of Plane 2, not just the JIS X 0213 rows of it, it could hypothetically be used for JIS X 0212 also, but whether that’s advisable is another question.

Overall, though, what was supposed to be a major standard enhancement to Shift_JIS and to a lesser extent EUC-JP, retaining compatibility with the earlier standards, actually ended up defining yet another incompatible variant due to colliding with established extensions. It did not catch on at nearly the intended rate, perhaps due to Microsoft’s variant of Shift_JIS (with NEC and IBM extensions) having become the de facto standard version (and for reasons I’ll come to shortly, the actual standard version in certain contexts).

In 2004, a new revision of JIS X 0213 was released, which changed the recommended renderings of a number of characters and disunified a small number, affecting only the first plane. The same EUC and Shift encodings were retained, but often called “EUC-JIS-2004” and “Shift_JIS-2004” to distinguish them from the 2000 versions. The revised first plane received a new ISO-2022 code, used when the newly assigned codes are used, resulting in “ISO-2022-JP-2004”.

Unicode, UTF-8, the web, WHATWG and the future

Both JIS X 0208 and JIS X 0212 were used as character sources for Unicode and its ISO-10646 (or “JIS X 0221”) Universal Character Set. They were mapped to Unicode in their entireties. The characters added in JIS X 0213 that were not already present in JIS X 0212 (and hence Unicode) were added to Unicode in version 3.2. Unicode lacks many of the problems encountered with JIS X 0208: more or less any character encoding can be converted to Unicode and, while extensions such as CSUR (and Apple’s CORPCHAR as noted above) do exist, they map onto explicitly designated private-use areas, which will never be used for standard mapping (and collectively include 137 468 private-use codepoints; compare that with the mere 17 672 possible codepoints in JIS X 0208 and JIS X 0212 combined).
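For anyone wanting to check the two totals quoted above, the arithmetic can be reproduced in a few lines of Python. The range boundaries are the private-use areas as I understand them (one in the Basic Multilingual Plane, plus planes 15 and 16 minus their final two non-characters); the variable names are just for this example.

```python
# Private-use ranges in Unicode (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Basic Multilingual Plane private use area
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
]

private_use_total = sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)

# Two 94x94 planes: one for JIS X 0208 and one for JIS X 0212.
jis_total = 2 * 94 * 94

print(private_use_total)  # 137468
print(jis_total)          # 17672
```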

Unicode Transformation Format—8-bit, better known as UTF-8, is a very well designed character encoding. It’s very ASCII compatible, in that it uses ASCII bytes for and only for ASCII characters. Its multi-byte sequences follow a very regular pattern which could in principle represent up to 68 719 476 736 codepoints, although most of these are invalid as there are only 1 114 112 theoretical codepoints present in the entirety of Unicode (which are mostly unallocated and include non-character and private-use codepoints as well as non-codepoint surrogate values reserved for allowing UTF-16 to work). Initial bytes of multi-byte sequences only ever get used as initial bytes and continuation bytes only ever get used as continuation bytes; this alleviates many of the issues associated with seeking and truncation as well as making it quite unlikely for a file to be coïncidentally valid UTF-8. It nonetheless may start with an optional unique signature (the three bytes 0xEF 0xBB 0xBF). It likewise lacks most of the issues associated with the older multi-byte encodings.
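Some of these properties are easy to observe with Python's built-in UTF-8 codec. The following is only a small demonstration: the byte_kind helper is invented for the purpose, and it classifies by bit pattern alone, so it does not check whether a given lead byte is actually permitted in modern UTF-8.

```python
def byte_kind(b):
    """Classify a byte value by the role its bit pattern gives it in UTF-8."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: always a complete ASCII character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: never starts a sequence
    return "lead"              # 11xxxxxx: lead-byte pattern (not all are valid)


text = "JIS 漢字 ¥"
encoded = text.encode("utf-8")
kinds = [byte_kind(b) for b in encoded]
print(kinds)

# ASCII bytes appear for, and only for, the ASCII characters in the text.
ascii_chars = [c for c in text if ord(c) < 0x80]
ascii_bytes = [b for b in encoded if b < 0x80]
print(len(ascii_chars) == len(ascii_bytes))  # True

# Truncating in the middle of a multi-byte sequence is detectable.
try:
    encoded[:5].decode("utf-8")
    truncation_detected = False
except UnicodeDecodeError:
    truncation_detected = True
print(truncation_detected)  # True

# The figures quoted above, and the optional signature.
pattern_limit = 2 ** 36           # 68 719 476 736
unicode_codespace = 17 * 0x10000  # 1 114 112
signature = "\ufeff".encode("utf-8")
print(pattern_limit, unicode_codespace, signature.hex())
```

Because a continuation byte can never be mistaken for a lead byte, a decoder that lands in the middle of a file can skip forward to the next non-continuation byte and resume from there.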

With the advent of HTML5, based on the WHATWG HTML Living Standard, the WHATWG encoding standard became the relevant standard for character encodings used within HTML. This had several relevant consequences.

Microsoft’s version of Shift_JIS is now also the HTML standard version, with the same applying regarding the Windows subset of the NEC extensions and both of the other encodings of JIS X 0208 (EUC-JP and ISO-2022-JP). JIS X 0213 is not supported. WHATWG’s ISO-2022-JP treats both JIS X 0208 escape codes as equivalent and all of the JIS X 0212 and JIS X 0213 escapes as error conditions, although WHATWG’s EUC-JP supports JIS X 0212 for decoding only.

Overall, the WHATWG discourage use of any encodings other than UTF-8, although they specify them to standardise compatibility with existing content and interfaces, so they are unlikely to adopt support for JIS X 0213 in the foreseeable future. In particular, they would much rather have ISO-2022-JP removed from the standard and mapped in its entirety to U+FFFD (like its Chinese and Korean counterparts) due to its ASCII incompatibility (when in JIS X 0208 mode) posing a potential XSS risk, but are not able to due to it still being relevant to current content and software.

Further comments on Unicode

Note on CJK unification: the Unicode standard has been somewhat controversial in that it encodes logographic characters used in both Chinese and Japanese only once, rather than encoding the languages separately. This sometimes proves problematic, but not for the reason that some assume.

The sometimes assumed reason is seeing it as equivalent to unifying, say, Latin, Greek and Cyrillic, or even as an implied act of linguistic or cultural assimilation. It’s not, and this is not the actual problem. The Japanese word “kanji” literally means “kan” (Han Chinese) “ji” (characters), the Korean word “hanja” being cognate. So they are treated as Chinese characters because, quite simply, that’s what the respective languages call them. As for the distinctly Japanese kana and distinctly Korean hangul (and the Taiwanese zhuyin), they are encoded separately. Note well that the coïncidental kana-zhuyin homoglyphs are not unified, so they’re not simply encoded by looks.

Note also that no language erasure is implied here, any more than English and German using Roman letters makes them dialects of Italian. Users of the Roman alphabet do not recognise Greek as the same alphabet, but recognise their alphabet to be ultimately Latin: an English speaker might call their alphabet Roman; an English speaker would consider German to be written in the same alphabet, more letters (ß, ü) notwithstanding, but would not consider Russian to be. So Greek gets encoded once, separately from Russian and Ukrainian and Belarusian (which are encoded together) and also separately from English, French, German and Swedish (which are encoded together). None of this is the actual problem.

The individual national standards such as JIS X 0208 applied “unification criteria” to kanji in the source material to limit the number of glyphs encoded: multiple minor variants would be unified with one standard variant. Unicode continued this process, but with character sets from all four territories, and consulting a group of expert representatives for each territory.

The problem with this is that which variant of the character gets considered standard may have minor differences between, say, Japan and Taiwan. We’re not talking about, say, drastically different simplification levels (which are encoded separately) but more minor variations where mutual legibility can still be assumed. As I mentioned, this did not originate with Unicode, but prior to Unicode the codepage would limit the available fonts so it was not likely for Japanese to end up accidentally in a Taiwanese font.

 ※ Further reading from someone with more hands-on knowledge of the process than I do.

Unicode have a number of solutions such as variation selectors for cases where a specific glyph variant is needed for display, but ultimately it would complicate text searching too much to disunify them all, especially given that content will already exist under the original unification criteria (compatibility codepoints with NFC mappings to the canonical forms are included where multiple variants or even exact duplicates existed in source-separation authorities).

JIS Coded Character sets to Unicode mapping for multiple variants

See here.

Further reading