JIS Character Sets Explained

ISO 646, JIS X 0201 and early representations

The ISO-646 7-bit character set standard is best known in what is actually its United States variant (ISO-646-US, better known as ASCII, or American Standard Code for Information Interchange), but there were previously a plethora of national variants. Most characters were required to be common to all variants, whereas certain codepoints could be occupied by more nationally relevant symbols. For example, the UK version occupied 0x23 with a pound sign (£, for the GBP currency) and 0x7E with an overline (‾) or spacing macron (¯). In US-ASCII, these are respectively the octothorpe (#) and the simple tilde (~).

The Japanese variant (ISO-646-JP, also referred to as JISCII or JIS-Roman) treated 0x23 as in ASCII and 0x7E as in ISO-646-GB, but occupied 0x5C (used for the backslash \ in both UK and US variants) with the Latin-based yen sign ¥. It accordingly differs (minorly in practice) from ASCII at 0x7E, and differs significantly from ASCII at 0x5C. Keep note of this; it becomes important later.

The macron is used to indicate long vowels in some systems of romanising Japanese (such a system might write Tōkyō; alternatives include Tôkyô (using the circumflex), Toukyou (following Hiragana orthography), Tookyoo (doubling long vowels), Tohkyoh (using the oh digraph) or Tokyo (discarding vowel length)). The spacing overline is not of much use in modern-day computing, but on a typewriter/teletype it could be used with backspace to be overstamped onto the letter, thus allowing such use.

It is worth noting that a 7-bit stateless form of JISCII can only represent Japanese in romanised form, as neither Kanji/Hiragana nor even Katakana has adequate encoding space within the nationalisable code points. There are two ways of working around this; I’ll get to these and their applications in due course.

Compare this with Morse code, where the prosign (control character) -..--- (N+J or D+O run together, meaning next Japanese) was used to switch to Kana mode. The ...-. (S+N or V+E) control was used to switch back. I will not list all the Kana morse codes here as they most probably are not relevant to your life, but the curious can check Wikipedia. Hiragana and Katakana were not distinguished by this encoding, hence Katakana alone was used to transcribe the message.

ASCII was designed as a 7-bit character set, with the eighth bit being either absent, a checksum or unused (ASCII itself didn’t specify). This contrasted with EBCDIC (Extended Binary Coded Decimal Interchange Code), which was an 8-bit character set with twice as many codepoints. EBCDIC was, however, not well designed: capital and lowercase letters were not in continuous ranges, which made casemapping routines needlessly complicated. EBCDIC was also not even slightly compatible with ISO 646, and mutual compatibility of anything other than English letters and digits between EBCDIC variants was limited (in at least one unusually divergent case, even lowercase letters don’t correspond). Unless you work on IBM mainframes, EBCDIC is not relevant to your life either, so I’ll leave it at that.

Combining the advantages of ASCII and EBCDIC, however, was Extended ASCII. This used ASCII for bytes with the high (eighth) bit unset, and an extended encoding for bytes with it set. A particularly influential such setup in the Occident was the DEC Multinational Character Set (or MCS). This stratified into two 16-code rows for extended control characters (mirroring ASCII’s two such rows), two rows of uncased characters (mainly punctuation), and two rows for each letter case. This elegant design meant that casemapping routines could be almost as simple as in ASCII itself. In theory. Subsequent standards and variants messed that up completely. The most common single-byte variant today is Windows-1252, also called WinLatin1 or Western European (Windows), whose inclusion of the £ at its own codepoint likely contributed to ISO-646-GB being lost to oblivion.
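
To make the contrast concrete, here is a minimal sketch (in Python, purely illustrative) of why aligned case ranges matter: in ASCII, and equally in MCS where the two letter-case blocks are likewise exactly 0x20 apart, case mapping is a single bit operation, whereas EBCDIC’s non-contiguous letter runs force range checks or a lookup table.

```python
# Case mapping where the two cases are exactly 0x20 apart (ASCII;
# the same trick extends to the MCS letter rows):
def ascii_toupper(b: int) -> int:
    if 0x61 <= b <= 0x7A:        # 'a'..'z'
        return b & ~0x20         # clear bit 5, giving 'A'..'Z'
    return b

assert ascii_toupper(ord('q')) == ord('Q')

# EBCDIC letters sit in three separate runs (a-i, j-r, s-z), so an
# EBCDIC casemapper needs several range checks or a 256-entry table.
```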

Extension was possible on septet-byte machines too, as the Shift Out control character would switch to the non-646 set (equivalent to mapping 0x21–0x7E to 0xA1–0xFE). But this only reached 94 extended characters (0xA1–0xFE), not the full 128 (0x80–0xFF): many extended ASCII sets were designed to keep the extended characters within this range. Shift In switched back.
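
As a minimal sketch of that mechanism (assuming Python, and leaning on the fact that Python’s shift_jis codec decodes the single bytes 0xA1–0xDF as halfwidth katakana, just as 8-bit JIS X 0201 does):

```python
# Between SO (0x0E) and SI (0x0F), each byte 0x21-0x7E stands for the
# byte 0xA1-0xFE of the 8-bit form, i.e. the same byte with the high bit set.
SO, SI = 0x0E, 0x0F

def seven_bit_to_eight_bit(data: bytes) -> bytes:
    out, shifted = bytearray(), False
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        elif shifted and 0x21 <= b <= 0x7E:
            out.append(b | 0x80)     # shift into the 0xA1-0xFE range
        else:
            out.append(b)
    return bytes(out)

# 0x31 shifted becomes 0xB1, halfwidth katakana A in JIS X 0201:
raw = bytes([ord('a'), SO, 0x31, SI, ord('b')])
print(seven_bit_to_eight_bit(raw).decode('shift_jis'))  # -> aｱb
```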

Needless to say, Extended ASCII, or rather extended ISO-646, did catch on in Japan too, but with an entirely different extended encoding. Four rows were dedicated to kana, without distinguishing Hiragana (so, dedicated to Katakana), with the Japanese-style comma (、), full stop (。), speech marks (「」) and the interpunct (・) getting their own codepoints. Diacritic forms were not included separately; the diacritic characters were given separate codepoints and were inserted after the respective kana. The first two rows of the extended area were left open to keep the characters in the reach of Shift Out (and, to a probably lesser extent, allowing extended control characters), and the last two were unused. This becomes important later. The ISO-646 encoding used in the lower half (that is to say, with the high bit unset) was of course ISO-646-JP, not ASCII. This encoding, originally standardised by JISC as JIS C 6220, is now known as JIS X 0201.

ISO 2022, JIS X 0208 and the Double Byte Character Set

The lack of Kanji (and Hiragana) in JIS X 0201 allowed only a simplified writing system which, while suitable for phonetically transcribing the language to an extent (albeit leaving near-homophones distinguished by speech inflection ambiguous), did not retain the semantic and other distinctions achieved through Kanji. Furthermore, the order of the kana codepoints did not make for sensible sorting.

JIS X 0208, originally called JIS C 6226, was an entirely new encoding. Control characters and the ASCII space were represented as in ASCII. The remainder were represented with ASCII printing codepoints in pairs. This meant 94 possible first bytes and 94 possible second bytes, hence the character set was arranged into 94 ku (rows or wards), each with 94 ten (cells). The first byte (minus 32) would specify the row (the rows being numbered 1 to 94, yes, 1-indexed) and the second would similarly specify the cell within that row. A character might be specified by the said bytes (e.g. 0x2121 for the ideographic space).

As well as being represented by their indexing bytes, a form also exists where the codepoints are indexed by pairs of 1-indexed numbers (i.e. from 1 to 94), called a kuten. The WHATWG encoding standard introduces a further codepoint representation where the characters are indexed by a singular number counting from zero, called a pointer, such that the first row includes pointers 0–93, the second includes pointers 94–187 and so forth. Neither of these are character encodings, but rather academic or internal representations. Additional character encodings of JIS X 0208 also exist; I’ll come to those later.
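
A short sketch of how the three indexing schemes relate (hypothetical helper names, in Python):

```python
# The three ways of indexing a JIS X 0208 character: the 7-bit byte
# pair, the 1-indexed kuten (row, cell), and the zero-based WHATWG pointer.
def kuten_to_bytes(ku: int, ten: int) -> bytes:
    return bytes([ku + 0x20, ten + 0x20])    # rows/cells 1-94 -> 0x21-0x7E

def kuten_to_pointer(ku: int, ten: int) -> int:
    return (ku - 1) * 94 + (ten - 1)

def pointer_to_kuten(pointer: int) -> tuple[int, int]:
    return pointer // 94 + 1, pointer % 94 + 1

# The ideographic space is kuten 1-1, i.e. bytes 0x21 0x21 and pointer 0.
assert kuten_to_bytes(1, 1) == b'\x21\x21'
assert kuten_to_pointer(1, 1) == 0
assert pointer_to_kuten(187) == (2, 94)      # last cell of the second row
```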

While not ASCII-compatible or even ISO-646 compatible, this layout allowed it to be used with ISO 2022 (or JIS X 0202), which switches between 7-bit encodings with certain layout constraints using ANSI escapes. A subset of this system defined by RFC 1468, starting off in ASCII and switching to and from JIS X 0208, never using the high bit, became the pre-Unicode standard for Japanese-language e-mails (at the time, e-mails had to be 7-bit clean) and is still relevant today, much to the frustration of the WHATWG (which I’ll come to later). The label ISO-2022-JP is given to this usage, first specified in RFC 1468 and later included by JISC themselves in the 1997 edition of JIS X 0208. Due to this using the JIS X 0208 two-byte representations unmodified, and due to usage of pure JIS X 0208 being rare, it is also often simply called the JIS encoding, a name which also references the Japanese-language definition of the mechanism being in JIS X 0202.
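
Python’s built-in iso2022_jp codec implements this RFC 1468 profile, which makes the escape-sequence switching easy to see:

```python
# The encoder starts in ASCII, switches to JIS X 0208 with ESC $ B,
# and switches back to ASCII with ESC ( B before the end of the stream.
encoded = 'Hi こんにちは'.encode('iso2022_jp')
print(encoded)
# b'Hi \x1b$B$3$s$K$A$O\x1b(B'
#        ^^^^^ ESC $ B   ^^^^ ESC ( B
print(encoded.decode('iso2022_jp'))  # round-trips to the original
```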

JIS X 0208 could, however, be used alone; it included punctuation and letters in addition to kanji and kana. This also meant that Roman letters could be specified in ISO-2022-JP using either ASCII or JIS X 0208. It did not, however, attempt to establish a lossless conversion of ASCII data to its own charset (e.g., lacking straight quotation marks and disunifying the hyphen from the minus sign). All of this had other consequences later.

Although there were theoretically 94 rows, the last 10 rows were not yet used, being left open for future expansion. The first two rows were used mainly for punctuation and symbols, followed by one containing Roman letters and Euro-Arabic digits, another containing Hiragana, another containing identically ordered Katakana, another containing Greek letters, another containing Cyrillic letters and an eighth row containing box drawing characters (initially unallocated but allocated in the 1983 version). The following seven rows were left unallocated for further expansion, with the sixteenth row being the start of the Kanji block. These seventeen unallocated rows became important later. The kana ordering differed from the JIS X 0201 order in order to make more sense for sorting, and diacritic kana were provided pre-composed.

A neat feature worth mentioning while we’re still on this topic is that, unlike the Greek and Cyrillic characters which start at the beginnings of their respective rows, the Roman letters start some distance into their row. This is so the second representing byte matches that letter in ISO 646, thus making purely Roman text in JIS X 0208 still fairly readable if misinterpreted as ASCII or as JIS X 0201 (somewhat like #H#e#l#l#o!!#W#o#r#l#d!*).
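
This is easy to demonstrate with the same iso2022_jp codec (fullwidth input assumed, purely illustrative):

```python
# The fullwidth Roman letters of JIS X 0208 share their second byte
# with the corresponding ASCII letter, so a mis-decode stays legible.
jis = 'Ｈｅｌｌｏ'.encode('iso2022_jp')   # fullwidth H-e-l-l-o
print(jis)                                # b'\x1b$B#H#e#l#l#o\x1b(B'
# Stripped of the escapes, the payload reads "#H#e#l#l#o" as ASCII.
```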

This encoding has gone through several versions since its release in 1978. In 1983, several characters were added (mostly non-kanji symbols in the second row), whereas some synoglyphic Kanji variants were added and/or swapped (so as to adjust the level 1 / level 2 strata). This was apparently significant enough to warrant the new version being assigned a separate ISO-2022 switching code (although often both are treated as the later version). More on this later.

Shift_JIS

Microsoft, in the process of expanding into Japan, developed connections with the publishers of a microcomputer magazine by the name of (confusingly enough) ASCII (a forerunner of the current ASCII Media Works). One outcome of this association was an encoding of JIS X 0208 which retained compatibility with JIS X 0201 and hence with ISO-646, which has since proven far more important than ISO-2022 compatibility (which it does not retain, on account of (for example) repurposing the C1 area).

The way this worked was that each Shift_JIS ward constituted a pair of consecutive JIS X 0208 wards. This meant that only 47 initial bytes were required. As the first two and last two rows of the upper half of JIS X 0201 were not used by that standard, that meant that 64 initial bytes were available. The first such byte (0x80) was skipped, leaving 63 possible initial bytes or a total of 126 representable JIS X 0208 rows, out of only 94 present in the JIS X 0208 standard. This becomes important later.

As the second byte would have to index within a two-row ward, not an individual row, 188 such bytes were required. However, as a second byte would be clearly a second byte, assuming no initial truncation, it was possible to reuse halfwidth katakana and even ASCII characters there. The ASCII controls, including the first two rows and the DEL character (0x7F), were nonetheless skipped. The second two ASCII rows were also skipped, including the digits, the ASCII space and 21 out of 32 of the punctuation marks, including all 19 invariant punctuation marks, such as the ", ' and <> which are sensitive to HTML syntax. This leaves 191 possible second bytes, of which only the first 188 are needed.
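
The resulting arithmetic, converting a 1-indexed kuten to a Shift_JIS byte pair, can be sketched as follows (the formulas are the standard ones implied by the layout above):

```python
# Pairs of JIS X 0208 rows share a lead byte; the trail byte skips 0x7F.
def kuten_to_sjis(ku: int, ten: int) -> bytes:
    if ku <= 62:
        lead = 0x81 + (ku - 1) // 2                  # wards 0x81-0x9F
    else:
        lead = 0xE0 + (ku - 63) // 2                 # wards 0xE0-0xEF
    if ku % 2:                                       # odd row: first half of ward
        trail = ten + (0x3F if ten <= 63 else 0x40)  # 0x40-0x7E, 0x80-0x9E
    else:                                            # even row: second half
        trail = ten + 0x9E                           # 0x9F-0xFC
    return bytes([lead, trail])

# Kuten 1-1 (the ideographic space) comes out as the familiar 0x8140.
assert kuten_to_sjis(1, 1) == b'\x81\x40'
assert kuten_to_sjis(1, 1).decode('shift_jis') == '\u3000'
```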

As this left two different ways of encoding Katakana and most of ASCII, the one-byte forms were distinguished from the two-byte forms by their width, with the two-byte forms being displayed roughly as wide as a Kanji, and the one-byte forms being narrower (roughly half as wide). This also avoided breaking anything that assumed the physical length of a piece of text to be proportional to the number of bytes. This became known as fullwidth versus halfwidth.

This encoding, shift-coded JIS or Shift_JIS as it became known, became the basis for encoding Japanese on Windows and Apple computers and was standardised in the 1997 edition of JIS X 0208. The potentially confusing name refers to the JIS (X 0208) codes being shifted around the existing codes, not to the use of actual shift-codes, which it does not involve (and which I’ll come to later).

Beyond Shift_JIS: IBM, NEC and Windows-31J

Following Microsoft’s general policy of not following their own standards, Microsoft’s own Windows Shift_JIS differs from the standard in certain ways.

Firstly, it is based on ASCII rather than ISO-646-JP. To be fair, this was necessary due to the reliance of DOS and Windows on the backslash as the primary path separator. (This was in turn for backward compatibility: DOS would consider a command name to end before the first forward-slash even if a space was not inserted, because a forward-slash was used for command-line option syntax. In every other context, Windows accepts either slash as equivalent but renders paths using the backslash.) Japanese fonts nonetheless render the backslash as a yen sign for ISO-646-JP compatibility, which does interesting things to the rendering of Windows paths.

Secondly, it includes extensions, both within the 94 rows defined by the JIS X 0208 standard and in a further 29 rows beyond it. This was not unusual or unique to Microsoft: cellular carriers developed their own proprietary extensions (which included the original Emoji). In fact, the extensions themselves originated from NEC and IBM, not from Microsoft, who basically defined their encoding as a superset of the IBM variant incorporating certain NEC extensions.

IBM’s version, which IBM refer to as Japanese DBCS–PC (note that this term does not include the single-byte codepage used alongside it, which is called Japan PC-Data SB), does not allocate codepoints within the 94 rows defined by the JIS X 0208 standard but does so within the 32 rows or 16 wards beyond it.

IBM’s rows 95 through 114 (Shift_JIS wards 0xF0-0xF9) are reserved as what IBM refers to as User Defined Characters (UDC), and which Microsoft refers to as End User Defined Characters (EUDC). In other words, it is a private-use area.

IBM’s rows 115 through 119 (Shift_JIS wards 0xFA-0xFC) contain IBM extensions.

Row 115 starts with 28 codepoints dedicated to non-Kanji characters: lowercase Roman Numerals i-x, uppercase of the same, the not-sign, the broken pipe, single and double straight quotation marks, a Kanji-derived kabushiki kaisha symbol (which apparently still counts as non-Kanji), square Roman-derived No. and Tel symbols and the because sign. (IBM themselves, however, currently list the cells for the not-sign and the because sign as unallocated, presumably due to the said signs presently having separate codepoints in standard JIS X 0208).

The remainder of rows 115-118 (Shift_JIS wards 0xFA-0xFB) and the first twelve cells of row 119 (at the beginning of Shift_JIS ward 0xFC) are occupied by 360 IBM-selected Kanji characters.

Microsoft’s variant also includes the following extensions from NEC’s version of JIS X 0208, which appears not to have been designed to be used with Shift_JIS. Hence, in contrast to IBM’s extensions, these do not occupy rows beyond JIS X 0208, but rather occupy unallocated rows within that standard.

NEC’s row 13 (in Shift_JIS ward 0x87) contains circled Euro-Arabic numerals, uppercase Roman Numerals, Katakana-derived and Roman-derived square symbols for e.g. units, Kanji-derived symbols (circled Kanji, composed Kanji for era names) and mathematical symbols (some of the latter have had standard codepoints in JIS X 0208 row 2 since the 1983 edition).

NEC’s rows 89 through 92 (Shift_JIS wards 0xED and 0xEE) contain all characters which are present in IBM’s rows 115 through 119 but absent from NEC’s row 13, including and starting with all the Kanji, with said subset of the non-Kanji (lowercase Roman numerals, not-sign, broken pipe, straight quotation marks) being placed at the end of row 92.

This, of course, means that rows 115 through 119 are entirely redundant to rows 89 through 92 and parts of row 13 in Microsoft’s version.
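
This redundancy is visible in Python’s cp932 codec (assuming, as documented, that it follows Microsoft’s published Windows-31J mapping):

```python
# The first IBM extension kanji is reachable both at its IBM position
# (ward 0xFA) and at its NEC duplicate position (row 89, ward 0xED).
ibm = b'\xfa\x5c'.decode('cp932')
nec = b'\xed\x40'.decode('cp932')
print(ibm, nec, ibm == nec)   # the same character (U+7E8A) twice
# An encoder must pick one of the duplicates; the WHATWG Shift_JIS
# encoder avoids rows 89-94 and so always emits the IBM positions.
```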

NEC themselves included further extensions in rows 9 through 12, which were not included by Microsoft. NEC’s rows 9 and 10 contain JIS X 0201, with row 10 also including halfwidth composed diacritic katakana forms (it appears that NEC were trying to put 0201 in 0208, the exact opposite of what Microsoft were doing). NEC’s row 11 contains halfwidth box drawing characters (in Unicode order) and extended halfwidth punctuation such as curly quotation marks and lenticular brackets. NEC’s row 12 includes fullwidth box drawing characters. It should be noted that NEC were extending the 1978 edition of JIS X 0208 so e.g. many of the fullwidth box drawing characters are redundant to codepoints added in the 1983 versions, although there is no collision between the NEC version and the 1983 or 1990 version. It is also worth noting that NEC also have their own extensions to JIS X 0201, unrelated to the extensions present in row 10, which added progress bar characters and box drawing characters in the C1 range and assorted things after the halfwidth katakana, including playing card symbols and a small number of kanji for giving dates and times and yen prices.

I previously commented upon the odd nature of NEC’s row 88 but, upon closer inspection of the characters in question and upon identifying them to be absent in another PC98 font which I have obtained, I conclude this to have simply been junk data in the specific font file in question.

IANA refer to the combined Microsoft variant as Windows-31J, a name which Microsoft does not use. It is also called MS_Kanji e.g. by Python, although IANA treat MS_Kanji as an identifier for standard Shift_JIS. It has the codepage number 932 on Windows, the same as IBM’s number for a version of their variant.

Windows-932 differs from IBM-932 in that IBM-932 does not include the NEC codepoints. Also, IBM-932 does not follow the character variant swaps made in 1983, preferring to retain greater backward compatibility with the 1978 edition of JIS X 0208 (while nonetheless including most of the codepoints added in later editions). Also worth noting is that Windows-932 uses ASCII as its lower half, while IBM-932 uses an extended ISO-646-JP with box drawing characters in the first two rows, at least in theory (these can be switched to and from controls using control codes, but are mapped to controls in Unicode). IBM offers Microsoft’s variant of the double-byte codes (with NEC extensions) in code page 943 (IBM-943), which also incorporates their own ISO-646-JP extensions.

Note that IBM’s C0 controls arrangement does not entirely match ANSI/ISO standards: File Separator is in the Control-Z position, Substitution Character is in the seven-ones position and Delete is in the position vacated by FS. Microsoft’s version follows the standards in this regard.

IBM later extended the single-byte codes of IBM-932, adding the cent sign, pound sign, not sign, backslash and tilde to 0x80, 0xA0, 0xFD, 0xFE and 0xFF: this was called IBM-942. This collides with unrelated Apple extensions; Windows maps these codepoints to private use. ICU includes two Unicode mappings for IBM-943 (one of which, also called IBM-943C, is ASCII-based; the other is not). Also, the ICU mapping for IBM-942 is ASCII-based, resulting in duplicate single-byte encoding of the backslash and tilde.

Microsoft traditionally referred to their version simply as Shift_JIS, hence it became prevalent on the web simply as Shift_JIS. This is taken into account by the W3C/WHATWG encoding standard used by HTML5, which incorporates the IBM and NEC extensions to JIS X 0208 and Shift_JIS into its respective definitions. Encoding to Shift_JIS in particular per that standard avoids rows 89 through 94, thus preferring the original IBM codepoints for the extended Kanji.

Apple, MacJapanese and the CORPCHAR extensions

Apple, meanwhile, had added their own, incompatible extensions. These included additional special characters in rows 9–14, and vertical presentation forms in rows 85–94 (at 84 rows down from their canonical forms). This also included the backslash, required space, copyright sign, trademark sign and halfwidth (i.e. low) horizontal ellipsis in the vacant single-byte space (0x80, 0xA0, 0xFD, 0xFE, 0xFF). The tilde was present rather than an overline. Notably, no new double-byte assignments were added beyond the JIS X 0208 space, only within it, for some reason.

Whilst the repertoire demonstrated a not-insignificant overlap, the layout of these extensions was not related to those in any of the PC versions. However, some fonts implemented an alternative PostScript variant which had a different set of special characters, taken from the non-IBM-selected fullwidth NEC extensions. More characters were available in the printer versions of those fonts than in the screen versions.

Earlier, in System 7.1, the vertical forms were 10 rows down from their canonical forms. They were subsequently moved to their final location 84 rows down.

To summarise (x-mac-japanese and windows-31j are established labels, the rest I made up):

Location   | x-mac-japanese-7_1 | x-mac-japanese   | x-mac-japanese-postscript                        | windows-31j
Rows 9-10  | Nothing            | Apple extensions | Nothing                                          | Nothing
Row 11     | Vertical forms     | Apple extensions | Nothing                                          | Nothing
Row 12     | Nothing            | Apple extensions | NEC fullwidth box drawing                        | Nothing
Row 13     | Nothing            | Apple extensions | NEC special characters                           | NEC special characters
Row 14     | Vertical forms     | Apple extensions | Nothing                                          | Nothing
Row 15     | Vertical forms     | Nothing          | Nothing                                          | Nothing
Rows 85-94 | Nothing            | Vertical forms   | Vertical forms, including NEC special characters | NEC selection of IBM extensions

Some characters did not exist in Unicode at the time. Some still don’t, while some have been added in the interim. Apple firstly solved this by mapping them onto the Unicode Private Use Area. This proved bad for interoperability, so Apple switched to using combining sequences where possible, and otherwise using a canonical normalisation (failing that, a close substitute) of the character combined with private use characters (either functioning as variation or presentation form selectors, or marking a sequence of codepoints as representing one MacJapanese character).

Similarly, Apple originally mapped the single-byte ellipsis to the normal horizontal ellipsis and the double byte one to the mathematical vertically centred horizontal ellipsis, but switched to mapping the double byte one to the normal one and using a private use marker on the single byte one, as mapping to the mathematical one was apparently not handled well in the opposite direction by Windows (i.e. bestfit932).

Apple provides to Unicode a file named CORPCHAR.TXT which details all of their private use mappings, including those used for MacJapanese as well as those used for their other character sets and East Asian charset variants.

Apple’s published mappings still map to Unicode 2.1 and hence use these markers, even for characters for which actually matching codepoints have since been added. Mappings to Unicode 4.0 for characters added by that point are provided in comments, however.

JIS X 0212 and the Extended Unix Code (EUC)

In 1990, JISC released two standards that are relevant here. One was a further revision to the already-established JIS X 0208 standard, which this time apparently did not warrant another new ISO 2022 code. The other was JIS X 0212. This was an entirely separate 94×94-cell character set, which was not of much use on its own, containing only characters which were absent from JIS X 0208 (5801 Kanji and 245 non-Kanji). The idea, presumably, was that ISO 2022 mechanisms would be used to switch between the two character sets as necessary (a setup referred to as ISO-2022-JP-1). As both JIS X 0208 and JIS X 0212 were 7-bit, the high bit could theoretically be used for this also; however, this was not clarified.

As with JIS X 0208, the first fifteen rows contained non-Kanji characters (though they were mostly unallocated). The sixteenth through seventy-seventh rows contained Kanji and the remainder was unallocated.

While JIS X 0208 and JIS X 0212 are separate 94×94 sets whose Kanji allocations mostly collide, JIS X 0212’s non-Kanji allocations avoided the codepoints used by JIS X 0208, leaving those unallocated. Thus, it would be possible to support the non-Kanji portion of JIS X 0212 alone without a mechanism for switching between the two sets. Whilst I have found no evidence that this provision was ever directly useful to anyone, the unallocated rows that resulted from this (and the empty rows after the Kanji) did become important later (as with JIS X 0208 and Shift-JIS, but not quite for the same reasons).

JIS X 0212 subsequently became somewhat of an embarrassment for those responsible for the JIS character sets, as it included several characters which were already unified with JIS X 0208 characters under their already-established unification criteria (some, in fact, matched reference glyphs from older releases of JIS X 0208), and also neglected to properly document the characters themselves, making it very difficult to tell what they, in turn, matched or were unifiable with.

While use of JIS X 0212 in most encodings simply did not catch on, even where it was possible (it does not fit in Shift_JIS, but one IBM EUC code page gives PC Data (that is, Shift_JIS) mappings for the JIS X 0212 characters it includes, which is only a subset), there was one exception, itself with a catch.

The 8-bit Extended Unix Code (EUC) works as follows. The lower half of the encoding (with the high bit unset) is assigned to an ISO-2022-compliant encoding, usually an ISO-646 variant such as ASCII or ISO-646-JP, and those bytes are not used for any other purpose. The first two rows of the upper half are used for control characters, while one, two or three bytes (as appropriate) from the remaining rows represent a character from another ISO-2022-compatible 7-bit encoding (with the high bit set). The upper-half controls include the single shifts 0x8E and 0x8F, each of which indicates an additional ISO-2022-compliant encoding and is followed by one, two or three bytes from the non-control rows of the upper half.

EUC coding is popular on UNIX, where most other ways of encoding DBCSs would not be POSIX-compatible, but it can also be used elsewhere. The names of the Mainland Chinese (GB2312) and Korean (KS X 1001) counterparts to JIS X 0208 have become used almost interchangeably with their EUC forms (EUC-CN and EUC-KR), and their Windows encodings (GBK and UHC) are supersets of their EUC encodings, although they are not themselves valid EUC. EUC-JP, the EUC form of JIS X 0208, did not become nearly as popular: non-Unix systems tended to adopt Shift_JIS instead.

The presence of single-shifts in EUC meant that it was possible to represent additional sets by preceding representations with a single shift. The 0x8E single shift was already used to precede a character from the upper half of JIS X 0201, so the 0x8F single shift was adopted for JIS X 0212 characters, but only by non-Microsoft software.
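
Python’s euc_jp codec, being in the non-Microsoft camp, handles all three shapes, so the structure can be sketched as follows:

```python
# The three code-set shapes of EUC-JP:
print(b'\xa4\xa2'.decode('euc_jp'))      # JIS X 0208: two high-bit bytes
print(b'\x8e\xb1'.decode('euc_jp'))      # SS2 (0x8E) + JIS X 0201 katakana
print(b'\x8f\xb0\xa1'.decode('euc_jp'))  # SS3 (0x8F) + two JIS X 0212 bytes
# 0xA4 0xA2 is hiragana あ; 0x8E 0xB1 is halfwidth ｱ; and 0x8F 0xB0 0xA1
# is JIS X 0212 kuten 16-01, the first of its kanji.
```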

IBM’s version of EUC-JP occupied rows 83 and 84 of JIS X 0212 with a selection of their vendor extensions, apparently serving to allow all characters representable in IBM-932 to be represented in their version of EUC-JP without the NEC extensions (this seems to use a different layout in at least one IBM EUC codepage versus eucJP-open though, despite encoding the same characters).

Like ISO-2022-JP, EUC-JP is an ISO-2022 mechanism, albeit one where the character sets are pre-arranged rather than loaded using unique codes (also pre-arranging which half of the encoding is used with the single shifts, specifically the upper half) and which hence cannot be arbitrarily mingled with other national ISO 2022 formats.

Host Data

IBM uses four different encoding types for Japanese characters: JIS (ISO 2022 form), EUC, PC data (meaning Shift_JIS) and Host Data. The latter is analogous to an ISO 2022 DBCS, but the reserved single byte characters are controls at 0x00–0x3F (not 0x00–0x1F) and 0xFF (not 0x7F), and the space at 0x40 (not 0x20), with the lead/trail bytes being 0x41–0xFE (rather than 0x21–0x7E). To put it very simply, host code is to EBCDIC as ISO 2022 is to ASCII.

Again, you probably don’t care about this unless you’re using an IBM mainframe.

JIS X 0213 and why arbitrary proprietary assignments are a bad idea

In 2000, JISC released a new standard, JIS X 0213, intended to be the successor to JIS X 0208. In addition to JIS X 0208 characters, it included 2743 of the Kanji from JIS X 0212, 952 additional Kanji and many additional non-Kanji such as Roman letters with macrons or additional small Katakana used in Ainu.

Whereas JIS X 0208 had defined one 94×94 plane (or men) and JIS X 0212 had defined another, JIS X 0213 took a different approach. Nominally, it defined two planes. Plane 1 remained more-or-less compatible with existing standard JIS X 0208 codepoints, insofar as any revision of that standard had, but with many additions to the point of all rows being mostly or entirely occupied. It was, however, assigned its own ISO-2022 escape sequence. Whilst it also mostly retained compatibility with the NEC/Microsoft row 13 (deällocating some duplicate codepoints), it did not do so for rows 89 through 92, which it used for a different set of Kanji than the ones present in the NEC/Microsoft use of those rows.

In addition to the 94-row Plane 1, JIS X 0213 defined 26 new rows in Plane 2, containing Kanji only. In reality, for compatibility with all three established encodings, these were encoded in two different ways. As far as Shift_JIS was concerned, they were mapped without intervening empty rows after the end of Plane 1, thus colliding with the unrelated IBM/Microsoft extensions in that region and likely with proprietary cellular variants. This was done in the order 1, 8, 3, 4, 5, 12–15 and 78–94, with the placement of 8 between 1 and 3 ensuring the alternating odd-even numbers that some Shift_JIS encoding algorithms rely on.
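
From that stated order, the ward (lead byte) for a given Plane 2 row can be reconstructed with a small sketch (hypothetical helper, derived only from the ordering above):

```python
# Plane 2 rows packed into Shift_JIS wards 0xF0-0xFC, two rows per ward,
# in the order the text gives: 1, 8, 3, 4, 5, 12-15, 78-94.
PLANE2_ROWS = [1, 8, 3, 4, 5] + list(range(12, 16)) + list(range(78, 95))

def plane2_lead_byte(ku: int) -> int:
    index = PLANE2_ROWS.index(ku)   # position in the packed sequence
    return 0xF0 + index // 2        # consecutive pairs share a lead byte

assert plane2_lead_byte(1) == 0xF0  # row 1 shares ward 0xF0 with row 8
assert plane2_lead_byte(8) == 0xF0
assert plane2_lead_byte(94) == 0xFC # the last usable ward
```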

As for why those rows specifically were allocated, while Plane 2 could be accessed from the ISO-2022 system using its own escape sequence (a setup referred to as ISO-2022-JP-3), the arrangement of these rows was designed to deliberately avoid colliding with any of the codepoints assigned by JIS X 0212, thus allowing JIS X 0213 to be used within EUC-JP without changing the meaning of any existing standard-compliant content. Shame about the IBM extensions. A further consequence of this is that JIS X 0213 and JIS X 0212 can be unambiguously used in a single EUC document (if we ignore the IBM extensions). The use of JIS X 0213 within EUC-JP is referred to as EUC-JISx0213, while the use thereof within Shift_JIS is termed Shift_JISx0213.

Further defined were simple 7-bit and 8-bit formats using both planes, the former differentiating using 0x0E (shift out) to move to the second plane and 0x0F (shift in) to switch to the first, and the latter using the high bit. Because this would give access to the entirety of Plane 2, not just the JIS X 0213 rows of it, this could hypothetically be used for JIS X 0212 also, but whether this is advisable is another question.

Overall, though, what was supposed to be a major standard enhancement to Shift_JIS and to a lesser extent EUC-JP, retaining compatibility with the earlier standards, actually ended up defining yet another incompatible variant due to colliding with established extensions. It did not catch on at nearly the intended rate, perhaps due to Microsoft’s variant of Shift_JIS (with NEC and IBM extensions) having become the de facto standard version (and for reasons I’ll come to shortly, the actual standard version in certain contexts).

In 2004, a new revision of JIS X 0213 was released. This changed the recommended renderings of a number of characters and disunified a small number, affecting only the first plane. The same EUC and shift-coded encodings were retained, but often referred to as EUC-JIS-2004 and Shift_JIS-2004 to distinguish them from the 2000 versions. The revised first plane received a new ISO-2022 code, resulting in ISO-2022-JP-2004.

Unicode, UTF-8, the web, WHATWG and the future

Both JIS X 0208 and JIS X 0212 were used as character sources for Unicode and its ISO-10646 (or JIS X 0221) Universal Character Set. They were mapped to Unicode in their entireties. The characters added in JIS X 0213 that were not already present in JIS X 0212 (and hence Unicode) were added to Unicode in version 3.2. Unicode lacks many of the problems encountered with JIS X 0208: more or less any character encoding can be converted to Unicode and, while extensions such as CSUR (and Apple’s CORPCHAR as noted above) do exist, they map onto explicitly designated private-use areas, which will never be used for standard mapping (and collectively include 137 468 private use codepoints, compared with only 17 672 total possible codepoints in JIS X 0208 and JIS X 0212 combined).

Unicode Transformation Format—8-bit, better known as UTF-8, is a very well designed character encoding. It is very ASCII compatible, in that it uses ASCII bytes for and only for ASCII characters. Its multi-byte sequences follow a very regular pattern which is in principle capable of representing up to 68 719 476 736 codepoints, although most of these are invalid as there are only 1 114 112 theoretical codepoints present in the entirety of Unicode (which are mostly unallocated and include non-character and private-use codepoints as well as non-codepoint surrogate values reserved for allowing UTF-16 to work). Initial bytes of multi-byte sequences are only ever used as initial bytes and continuation bytes are only ever used as continuation bytes; this alleviates many of the issues associated with seeking and truncation as well as making it quite unlikely for a file to be coïncidentally valid UTF-8. It nonetheless may start with an optional unique signature (the three bytes 0xEF 0xBB 0xBF). It likewise lacks most of the issues associated with the older multi-byte encodings.
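
The pattern is easy to see by printing the bits of a few encoded characters (Python, purely illustrative):

```python
# Lead bytes declare the sequence length in their high bits;
# continuation bytes always start with the bits 10.
for ch in 'aぁ😀':
    enc = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {[f"{b:08b}" for b in enc]}')
# U+0061  -> ['01100001']                                      (ASCII, as-is)
# U+3041  -> ['11100011', '10000001', '10000001']              (3-byte lead 1110)
# U+1F600 -> ['11110000', '10011111', '10011000', '10000000']  (4-byte lead 11110)
```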

With the advent of HTML5, based on the WHATWG HTML Living Standard, the WHATWG encoding standard became the relevant standard for character encodings used within HTML. This had several relevant consequences.

Microsoft’s version of Shift_JIS is now also the HTML standard version, with the same applying regarding the Windows subset of the NEC extensions and both of the other encodings of JIS X 0208 (EUC-JP and ISO-2022-JP). JIS X 0213 is not supported. WHATWG’s ISO-2022-JP treats both JIS X 0208 escape codes as equivalent and all of the JIS X 0212 and JIS X 0213 escapes as error conditions, although WHATWG’s EUC-JP supports JIS X 0212 for decoding only.

Overall, the WHATWG discourage use of any encodings other than UTF-8, although they specify them to standardise compatibility with existing content and interfaces, so they are unlikely to adopt support for JIS X 0213 in the foreseeable future. In particular, they would much rather have ISO-2022-JP removed from the standard and mapped in its entirety to U+FFFD (like its Chinese and Korean counterparts) due to its ASCII incompatibility (when in JIS X 0208 mode) posing a potential XSS risk, but are unable to due to it still being relevant to current content and software.

Further comments on Unicode

Note on CJK unification: the Unicode standard has been somewhat controversial in that it encodes logographic characters used in both Chinese and Japanese only once, rather than encoding the languages separately. This is sometimes problematic, but not for the reason that is sometimes assumed.

The sometimes assumed reason is seeing it as equivalent to unifying, say, Latin, Greek and Cyrillic, or even as an implied act of linguistic or cultural assimilation. It isn’t, and this is not the actual problem. The Japanese word kanji literally means kan (Han Chinese) ji (characters), the Korean word hanja being cognate. Hence they are treated as Chinese characters because, quite simply, that’s what the respective languages call them. As for the distinctly Japanese kana and the distinctly Korean hangul (and the Taiwanese zhuyin), they are encoded separately. Note well that the coïncidental kana-zhuyin homoglyphs are not unified, so they’re not simply encoded by looks.

Note also that no language erasure is implied here, any more than English and German using Roman letters makes them dialects of Italian. Users of the Roman alphabet do not recognise Greek as the same alphabet, but do recognise their own alphabet to be ultimately Latin: an English speaker might call their alphabet Roman, and would consider German to be written in the same alphabet, additional letters (ß, ü) notwithstanding, but would not consider Russian to be. Hence, Greek is encoded once, separately from Russian, Ukrainian and Belarusian (which are encoded together) and also separately from English, French, German and Swedish (which are encoded together). None of this is the actual problem.

The individual national standards such as JIS X 0208 applied unification criteria to kanji in the source material to limit the number of glyphs encoded: multiple minor variants would be unified with one standard variant. Unicode continued this process, but with character sets from all four territories, and consulting a group of expert representatives for each territory.

The problem with this is that which variant counts as the standard form of a character may differ in minor ways between, say, Japan and Taiwan. We’re not talking about, say, drastically different simplification levels (which are encoded separately) but more minor variations where mutual legibility can still be assumed. As I mentioned, this did not originate with Unicode, but prior to Unicode the codepage would limit the available fonts, so it was not likely for Japanese to end up accidentally in a Taiwanese font.

Unicode have a number of solutions such as variant selectors for cases where a specific glyph variant is needed for display, but ultimately it would complicate text searching too much to disunify them all, especially given that content will already exist under the original unification criteria (compatibility codepoints with NFKC mappings to the canonical forms are, by contrast, comparatively fine, and some do exist).

JIS Coded Character sets to Unicode mapping for multiple variants

See here.

Further reading