Whence Unicode’s Halfwidth and Fullwidth forms?
Unicode’s Halfwidth and Fullwidth Forms block encodes equivalents to other Unicode characters, differing solely in being intended to display wider or narrower than their compatibility-normalised equivalents. This usually harks back to East Asian multiple-byte character sets (MBCSs), which combined a single byte character set (SBCS) such as ASCII or another ISO 646 variant with a (possibly reärranged) double byte character set (DBCS), often of separate origin. This resulted in duplicate encoding of some characters, where the single byte versions often rendered narrower than the double byte ones (duospace typesetting).
That being said, the origin of the individual characters is not always obvious. Since not everyone is motivated to read code pages unless they have to, I have decided to write this up.
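To make "compatibility-normalised equivalent" concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `width_class` and `fold_width` are my own, purely for illustration. Note that NFKC folds many other compatibility characters too, not just the width variants.

```python
import unicodedata

def width_class(ch: str) -> str:
    """Return the East Asian Width property value, e.g. 'F' (fullwidth) or 'H' (halfwidth)."""
    return unicodedata.east_asian_width(ch)

def fold_width(text: str) -> str:
    """Apply NFKC, which (among other things) replaces width variants with their equivalents."""
    return unicodedata.normalize("NFKC", text)

fullwidth_a = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_ka = "\uff76"  # HALFWIDTH KATAKANA LETTER KA

for ch in (fullwidth_a, halfwidth_ka):
    folded = fold_width(ch)
    print(f"U+{ord(ch):04X} [{width_class(ch)}] -> U+{ord(folded):04X} [{width_class(folded)}]")
```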
So… blame for Halfwidth and Fullwidth forms, by subheading?
(Most characters in this block have been present since Unicode 1.0.0; exceptions are noted below.)
Fullwidth ASCII variants (U+FF01–FF5E)
Pretty much every East Asian MBCS is responsible for at least some of it. For example, the GB character set underlying EUC-CN (GBK, its superset, didn’t exist at the time) encodes the entirety of an ISO 646 variant in row 3 (with Yuan/Yen not Dollar, and Overline/Macron not Tilde) and the remaining ASCII characters (Tilde and Dollar) in row 1. The Unicode layout here copies ASCII, though.
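Because the Unicode range follows ASCII order, converting between the two is a fixed offset (U+FF01 minus U+0021). A minimal sketch, with hypothetical helper names `to_fullwidth` and `to_ascii`; the space is left alone since its wide counterpart (the ideographic space, U+3000) lives outside this block:

```python
FULLWIDTH_OFFSET = 0xFF01 - 0x21  # U+FF01 FULLWIDTH EXCLAMATION MARK lines up with '!'

def to_fullwidth(text: str) -> str:
    """Shift printable ASCII (0x21-0x7E) into U+FF01-FF5E; leave everything else untouched."""
    return "".join(
        chr(ord(c) + FULLWIDTH_OFFSET) if 0x21 <= ord(c) <= 0x7E else c
        for c in text
    )

def to_ascii(text: str) -> str:
    """Inverse of to_fullwidth for characters in U+FF01-FF5E."""
    return "".join(
        chr(ord(c) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(c) <= 0xFF5E else c
        for c in text
    )

print(to_fullwidth("Hello, world!"))
```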
JIS X 0208 row 3, by contrast, only includes the letters and numbers, though most of the ASCII punctuation is included in row 1. It is, however, missing straight quotes (added in some vendor extensions), has a “wave dash” not a wide tilde for all that they differ (Microsoft still treat it as a wide tilde), and has separate hyphen and minus sign assignments (Microsoft treat the minus sign as a wide hyphen-minus). JIS X 0213 didn’t exist at the time.
※ The fullwidth tilde is the fullwidth form of the ASCII tilde, which is in itself an ambiguously defined character from the pre-Unicode ASCII era. So in principle, the fullwidth tilde could be a mathematical tilde operator centred within an em-square, a tilde accent in the upper centre of an em square, or (yes) a wave/swung dash. In practice, it is a wave dash, matching the intent of most of the East Asian character sets it retains compatibility with, with the notable exception of South Korean Wansung.
※ A separate wave dash codepoint was added to Unicode for use for the JIS X 0208 character (which is officially not considered a tilde). Its reference glyph was incorrectly mirrored, however, and Microsoft initially took that as gospel, (a) not using it as the mapping of the SJIS character and (b) using the mirrored glyph in fonts for Windows XP. This glyph error was fixed in fonts introduced in later versions of Windows (although the existing fonts such as MS PMincho remained the same) and, subsequently, in later versions of the Unicode charts.
※ JIS X 0213 displays 01-02-18 as a tilde accent, and maps it onto the ASCII tilde in Shift_JIS and the fullwidth tilde in EUC-JP. In the latter case, mapping onto the Unicode small tilde (which is found in Windows-1252 and is necessarily a spacing accent) might be more practical in reality, in order to ensure a distinct glyph from the wave dash (and to avoid colliding with Microsoft’s mappings). To the best of my knowledge, noöne does this in JIS, though. South Korean Wansung is a different story, where 02-06 variously gets mapped to the small tilde or the fullwidth tilde depending on vendor.
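The JIS versus Microsoft disagreement over the wave dash and the minus sign can be observed directly with Python's bundled codecs, where `shift_jis` follows the JIS X 0208 mappings and `cp932` follows Microsoft's. This is only a sketch of how those two particular codecs behave; other implementations may map differently. The helper name `compare` is mine.

```python
# Shift_JIS byte pairs for JIS X 0208 01-33 (wave dash) and 01-61 (minus sign).
SAMPLES = {
    "wave dash vs fullwidth tilde": b"\x81\x60",
    "minus sign vs fullwidth hyphen-minus": b"\x81\x7c",
}

def compare(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes with the JIS-style and the Microsoft-style codec."""
    return raw.decode("shift_jis"), raw.decode("cp932")

for label, raw in SAMPLES.items():
    jis, ms = compare(raw)
    print(f"{label}: shift_jis gives U+{ord(jis):04X}, cp932 gives U+{ord(ms):04X}")
```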
Fullwidth brackets (U+FF5F–FF60)
“White” (hollow or doubled) parentheses. Not needed for round trip reasons, but included due to differing formatting / rendering requirements for East Asian and mathematical versions of the graphemes in question. Added in Unicode 3.2, making them the most recent characters to be added to this block.
Halfwidth CJK punctuation, Halfwidth Katakana variants (U+FF61–FF9F)
Shift_JIS is ultimately responsible. Also supported in most EUC-JP and some extended ISO-2022-JP, probably for Shift_JIS round trip reasons, given that they are not present in standard ISO-2022-JP (though the extensions are still just taken from ISO 2022, and are present in some standardised supersets of ISO-2022-JP), and that they are still two bytes (like the fullwidth ones) in EUC-JP.
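As a quick illustration of the byte lengths involved, and of the Unicode layout following the Shift_JIS single byte range (0xA1–DF), here is a small sketch against Python's bundled `shift_jis` and `euc_jp` codecs. The function names are hypothetical.

```python
def kana_bytes(ch: str) -> dict[str, bytes]:
    """Encode one halfwidth kana in Shift_JIS (single byte) and EUC-JP (two bytes)."""
    return {codec: ch.encode(codec) for codec in ("shift_jis", "euc_jp")}

def layout_matches_shift_jis() -> bool:
    """Check that bytes 0xA1-0xDF decode to U+FF61-FF9F in the same order."""
    return all(
        bytes([b]).decode("shift_jis") == chr(0xFF61 + (b - 0xA1))
        for b in range(0xA1, 0xE0)
    )

print(kana_bytes("\uff71"))  # HALFWIDTH KATAKANA LETTER A
print(layout_matches_shift_jis())
```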
Halfwidth Hangul variants (U+FFA0–FFDC)
Used in IBM-1364 (and its subset IBM-933), an IBM EBCDIC code for Korean including compatibility jamo but also allowing locking shifting to a double byte host code (actually Johab, only with a non-syllable (non-hanja and hanja) area with what would have been ASCII lead bytes, a private use area in place of what would have been the non-Hangul area in the ASCII-based Johab, and with IBM-933 not including all possible syllable clusters, while IBM-1364 does). Also in IBM-944, an old predecessor to IBM-949 which used a proprietary 94×94 (or 123×94 with extensions) plane rather than the KS X 1001 one, and used the trail byte range and two-rows-per-lead-byte format from Shift_JIS (differing only in not skipping the 0xA0–DF lead bytes, since the single-byte codes are in 0xC0–FC instead). Note that the layout of the jamo, including reserved codepoints, matches the layout in IBM-944, similarly to how the layout of the kana matches Shift_JIS.
The layout in IBM-944 appears to be basically a transposition of the Hangul consonant and Hangul vowel polygons from IBM-933 onto an 8-by-32 extended ASCII grid, resulting in otherwise inexplicable positioning of the empty space. That being said, the original KS C 5601-1974 (before Wansung) is basically the letter subset of the second half of IBM-891, which is a subset of IBM-1040, the single-byte set of IBM-944. I’m not entirely sure which came first.
※ EBCDIC being EBCDIC, things tend to be laid out in polygons drawn on a 16-by-16 grid rather than in ranges.
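The reserved positions inherited from that layout are still visible in Unicode itself. A minimal sketch listing the unassigned codepoints within U+FFA0–FFDC, using the general category from `unicodedata` (the helper name `unassigned` is my own):

```python
import unicodedata

def unassigned(first: int, last: int) -> list[int]:
    """Return the codepoints in [first, last] whose general category is Cn (unassigned)."""
    return [
        cp for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == "Cn"
    ]

gaps = unassigned(0xFFA0, 0xFFDC)
print(" ".join(f"U+{cp:04X}" for cp in gaps))
```

The gaps fall between the consonants and the vowels, and then after every sixth vowel.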
※ Noting here because I don’t have a better place to put it: the Hangul Filler serves two purposes: firstly, it marks the start of a jamo composition sequence in KS X 1001 (whereas the KS X 1001 jamo will otherwise appear as standalone characters, like the corresponding Unicode compatibility jamo but unlike the regular Unicode jamo); secondly, it stands in for an unused position in such a sequence (e.g. if there is no final consonant, the filler will be inserted in its place). The compatibility jamo in Unicode itself (including the filler) stand for isolated characters; the sequences may be processed by the decoder, but often aren’t (e.g. the UHC code (Windows-949, WHATWG EUC-KR) does not, as all the supported Hangul syllables are provided precomposed anyway due to extensions).
Fullwidth symbol variants (U+FFE0–FFE6)
These are present for a variety of reasons.
- Yen sign and overline: have both single byte and double byte representations in Shift_JIS and in some variants of EUC-JP. Why the compatibility mapping is to Macron when it’s used as the fullwidth form of Overline in these contexts, I do not know, and I guess it doesn’t matter all that much in practice (the Unicode code chart notes “sometimes treated as fullwidth overline”).
- Won sign: has both single byte and double byte representations in, for example, at least some variants of EUC-KR.
- Pound, cent and not signs and broken vertical bar: used for the standard representations of those characters in encodings where the normalised mappings are used for IBM’s single byte extension characters (e.g. IBM-942 variant of Shift_JIS). Also used in encodings trying to retain compatibility with them (e.g. Windows-932 variant of Shift_JIS, modified from IBM-932 which is a subset of IBM-942; further, MS-932 / Windows-932 is in turn copied by OSF’s eucJP-ms and by WHATWG’s Shift_JIS and EUC-JP). The cent sign, not sign and broken vertical bar are very common inclusions in EBCDIC (both single-byte and double-byte), hence IBM encodings (including non-EBCDIC ones trying to support the repertoire of the EBCDIC ones) often include them, sometimes with both single-byte and double-byte forms.
Mappings often disagree on the question of whether a character with a fullwidth variant form, and a halfwidth compatibility normalised form, should be mapped to the normalised or the fullwidth codepoint if it’s the only representation in the encoding, but is double byte.
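To illustrate that disagreement with the cent, pound and not signs: Python's `shift_jis` codec decodes the double byte codes to the normalised codepoints, while its `cp932` codec decodes them to the fullwidth ones. The sketch below also checks, via `unicodedata.decomposition`, that each pair is linked by a `<wide>` compatibility mapping. The helper names are mine, and the behaviour shown is that of these two codecs only.

```python
import unicodedata

# Shift_JIS byte pairs for the JIS X 0208 cent sign, pound sign and not sign.
SYMBOLS = {"cent": b"\x81\x91", "pound": b"\x81\x92", "not": b"\x81\xca"}

def mappings(raw: bytes) -> tuple[str, str]:
    """Decode the same double byte code with the JIS-style and Microsoft-style codecs."""
    return raw.decode("shift_jis"), raw.decode("cp932")

def is_wide_form_of(wide: str, base: str) -> bool:
    """True if `wide` has a <wide> compatibility decomposition to `base`."""
    return unicodedata.decomposition(wide) == f"<wide> {ord(base):04X}"

for label, raw in SYMBOLS.items():
    normalised, fullwidth = mappings(raw)
    print(label, f"U+{ord(normalised):04X}", f"U+{ord(fullwidth):04X}",
          is_wide_form_of(fullwidth, normalised))

# The fullwidth macron (U+FFE3) decomposes to the macron (U+00AF), not the overline.
print(is_wide_form_of("\uffe3", "\u00af"))
```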
Halfwidth symbol variants (U+FFE8–FFEE)
Used as C0 replacement graphics in several IBM East Asian MBCSs, including their variants of Shift_JIS (IBM-932, IBM-942, IBM-943), IBM-936 (a GB variant) and IBM-944 (see above). (As these are ambiguous control/graphic characters, as is par for the course on DOS, and the ICU mapping is to the control meanings, this is only obvious upon reading the code pages.) Accompanied by some other C0 replacements, but those ones don’t also have double byte forms (e.g. the double byte box drawing is single lined, while the C0 box drawing is double lined (excepting the lone light vertical), so they have different mappings anyway). Added in Unicode 1.0.1.