
Whence Unicode’s Halfwidth and Fullwidth forms?

Unicode’s Halfwidth and Fullwidth Forms block encodes compatibility equivalents of other Unicode characters, differing only in being intended to display wider or narrower than their counterparts. This usually harks back to East Asian multiple-byte character sets (MBCSs), which combined a single byte character set (SBCS) such as ASCII or another ISO 646 variant with a (possibly reärranged) double byte character set (DBCS), often of separate origin. This resulted in duplicate encoding of some characters, where the single byte versions often rendered narrower than the double byte ones (duospace typesetting).
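As a rough illustration, the relationship between these characters and their ordinary counterparts can be inspected with Python’s standard `unicodedata` module, which exposes the `<wide>` and `<narrow>` decomposition tags from the Unicode Character Database. The helper name `describe` and the sample characters are my own choices for this sketch, not anything defined by Unicode.

```python
import unicodedata

def describe(ch):
    """Return (decomposition tag, counterpart character) for a width variant.

    Hypothetical helper for illustration; assumes ch has a single-character
    <wide> or <narrow> decomposition, as the characters in this block do.
    """
    tag, target = unicodedata.decomposition(ch).split()
    return tag, chr(int(target, 16))

# One sample from each part of the block.
for ch in ["\uff21", "\uff71", "\uffa1", "\uffe5", "\uffe8"]:
    tag, counterpart = describe(ch)
    width = unicodedata.east_asian_width(ch)  # "F" = fullwidth, "H" = halfwidth
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {tag} U+{ord(counterpart):04X} (width {width})")
```

The East Asian Width property ("F" or "H") is what distinguishes these characters from their counterparts; the decomposition only records which character each one is a variant of.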

That said, the origin of the individual characters is not always obvious. Since not everyone is motivated to read code pages unless they have to, I have decided to write this up.

So… blame for Halfwidth and Fullwidth forms, by subheading?

(Most characters in this block have been present since Unicode 1.0.0; exceptions are noted below.)

Fullwidth ASCII variants (U+FF01–FF5E)

Pretty much every East Asian MBCS is responsible for at least some of it. Take EUC-CN, for example (GBK, its superset, didn’t exist at the time): its underlying GB character set encodes the entirety of an ISO 646 variant in row 3 (with Yuan/Yen not Dollar, Overline/Macron not Tilde) and the remaining ASCII (Tilde and Dollar) in row 1. The Unicode layout here copies ASCII, though.
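Because the Unicode layout copies ASCII, converting between the two is a fixed offset of 0xFEE0 (U+FF01 minus U+0021). A minimal sketch follows; the function names are hypothetical, and the space is deliberately left alone, since its wide counterpart (the ideographic space, U+3000) lives outside this block.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 - U+0021

def to_fullwidth(text):
    """Map U+0021..U+007E onto U+FF01..U+FF5E; leave everything else alone."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )

def to_ascii(text):
    """Map U+FF01..U+FF5E back onto U+0021..U+007E; leave everything else alone."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )

print(to_fullwidth("Hello!"))
print(to_ascii(to_fullwidth("Hello!")))
```

For this range, NFKC normalisation (`unicodedata.normalize("NFKC", ...)`) gives the same result as `to_ascii`, but it also folds many other compatibility characters, so the explicit offset is the narrower tool.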

JIS X 0208 row 3, by contrast, only includes the letters and numbers, though most of the ASCII punctuation is included in row 1. It is, however, missing straight quotes (added in some vendor extensions), has a wave dash not a wide tilde for all that they differ (Microsoft still treat it as a wide tilde), and has separate hyphen and minus sign assignments (Microsoft treat the minus sign as a wide hyphen-minus). JIS X 0213 didn’t exist at the time.

 ※ The fullwidth tilde is the fullwidth form of the ASCII tilde, which is in itself an ambiguously defined character from the pre-Unicode ASCII era. So in principle, the fullwidth tilde could be a mathematical tilde operator centred within an em-square, a tilde accent in the upper centre of an em square, or (yes) a wave/swung dash. In practice, it is a wave dash, matching the intent of most of the East Asian character sets it retains compatibility with.

 ※ A separate wave dash codepoint was added to Unicode for use for the JIS X 0208 character (which is officially not considered a tilde). Its reference glyph was incorrectly mirrored, however, and Microsoft initially took that as gospel, (a) not using it as the mapping of the SJIS character and (b) using the mirrored glyph in fonts for Windows XP. This glyph error was fixed in later versions of Windows and, subsequently, later versions of Unicode.

 ※ JIS X 0213 displays 1-2-18 as a tilde accent, and maps it onto the ASCII tilde in Shift_JIS and the fullwidth tilde in EUC-JP. In the latter case, mapping onto the Unicode small tilde (which is found in Windows-1252 and is necessarily a spacing accent) might be more practical in reality, in order to ensure a distinct glyph from the wave dash (and to avoid colliding with Microsoft’s mappings). To the best of my knowledge, noöne does this, though.
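The wave dash and minus sign disagreements described above can be observed with the codecs bundled with CPython, assuming (as is the case in current CPython releases) that `shift_jis` follows the JIS-style mapping while `cp932` follows Microsoft’s. The constant and function names are mine, for illustration only.

```python
WAVE_DASH_BYTES = b"\x81\x60"  # JIS X 0208 row 1 cell 33
MINUS_BYTES = b"\x81\x7c"      # JIS X 0208 row 1 cell 61

def decode_both(raw):
    """Decode the same Shift_JIS bytes with the JIS-style and Microsoft-style codecs."""
    return raw.decode("shift_jis"), raw.decode("cp932")

for label, raw in [("wave dash", WAVE_DASH_BYTES), ("minus sign", MINUS_BYTES)]:
    jis, ms = decode_both(raw)
    print(f"{label}: shift_jis -> U+{ord(jis):04X}, cp932 -> U+{ord(ms):04X}")
```

A practical consequence is that text decoded with one codec may fail to encode with the other, since each codec only knows its own choice of codepoint for these two characters.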

Fullwidth brackets (U+FF5F–FF60)

White (hollow or doubled) parentheses. Not needed for round trip reasons, but included due to differing formatting / rendering requirements for East Asian and mathematical versions of the graphemes in question. Added in Unicode 3.2, making them the most recent characters to be added to this block.

Halfwidth CJK punctuation, Halfwidth Katakana variants (U+FF61–FF9F)

Shift_JIS is ultimately responsible. Also supported in most EUC-JP and some extended ISO-2022-JP, probably for Shift_JIS round trip reasons, given that they are not present in standard ISO-2022-JP (though the extensions are still just taken from ISO 2022, and are present in some standardised supersets of ISO-2022-JP), and that they are still two bytes (like the fullwidth ones) in EUC-JP.
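To make the byte-level difference concrete, here is a small sketch using CPython’s bundled `shift_jis` and `euc_jp` codecs. In Shift_JIS the halfwidth katakana occupy the single bytes 0xA1 to 0xDF, in the same order as U+FF61 to U+FF9F; in EUC-JP the same byte is prefixed with the single shift 0x8E, making it two bytes long. The variable names are my own.

```python
# Shift_JIS: each single byte 0xA1..0xDF decodes to the matching
# codepoint in U+FF61..U+FF9F, in the same order.
for byte in range(0xA1, 0xE0):
    assert bytes([byte]).decode("shift_jis") == chr(0xFF61 + byte - 0xA1)

halfwidth_a = "\uff71"  # HALFWIDTH KATAKANA LETTER A
fullwidth_a = "\u30a2"  # KATAKANA LETTER A

sjis_half = halfwidth_a.encode("shift_jis")   # one byte
sjis_full = fullwidth_a.encode("shift_jis")   # two bytes
eucjp_half = halfwidth_a.encode("euc_jp")     # two bytes: 0x8E plus the Shift_JIS byte
eucjp_full = fullwidth_a.encode("euc_jp")     # two bytes

for name, raw in [("Shift_JIS half", sjis_half), ("Shift_JIS full", sjis_full),
                  ("EUC-JP half", eucjp_half), ("EUC-JP full", eucjp_full)]:
    print(f"{name}: {raw.hex()} ({len(raw)} byte(s))")
```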

Halfwidth Hangul variants (U+FFA0–FFDC)

Used in IBM-933, an IBM EBCDIC code for Korean including compatibility jamo but also allowing locking shifting to a double byte host code. Also in IBM-944, which appears analogous to Shift_JIS but for Korean, and is more obscure (IBM-933 has a mapping in ICU, IBM-944 does not). Note that the layout of the jamo, including reserved codepoints, matches the layout in IBM-944, similarly to how the layout of the kana matches Shift_JIS (although the layout in IBM-944 is basically a transposition of the Hangul consonant and Hangul vowel polygons from IBM-933 onto an 8-by-32 extended ASCII grid, hence the positioning of the empty space).

 ※ EBCDIC being EBCDIC, things tend to be laid out in polygons drawn on a 16-by-16 grid rather than in ranges.

 ※ Noting here because I don’t have a better place to put it: the Hangul Filler serves two purposes. Firstly, it marks the start of a jamo composition sequence in KS X 1001 (whereas the KS X 1001 jamo will otherwise appear as standalone characters, like the corresponding Unicode compatibility jamo but unlike the regular Unicode jamo); secondly, it stands in for an unused position in such a sequence (e.g. if there is no final consonant, the filler will be inserted in its place). The compatibility jamo in Unicode itself (including the filler) stand for isolated characters; the sequences may be processed by the decoder, but often aren’t (e.g. the UHC code (Windows-949, WHATWG EUC-KR) does not, as all the supported Hangul syllables are provided precomposed anyway due to extensions).

Fullwidth symbol variants (U+FFE0–FFE6)

These are present for a variety of reasons.

Let’s … not go into the disagreements over whether a character with a fullwidth variant form, and a halfwidth canonical form, should be mapped to the canonical or variant codepoint if it’s the only representation in the encoding, but is double byte.

Halfwidth symbol variants (U+FFE8–FFEE)

Used as C0 replacement graphics in several IBM East Asian MBCSs, including their variants of Shift_JIS (IBM-932, IBM-942, IBM-943), IBM-936 (a GB variant) and IBM-944 (see above). (As these are ambiguous control/graphic characters, as is par for the course on DOS, and the ICU mapping is to the control meanings, this is only obvious upon reading the code pages.) Accompanied by some other C0 replacements, but those ones don’t also have double byte forms (e.g. the double byte box drawing is single lined, the C0 box drawing is double lined (excepting the lone light vertical), so they have different mappings anyway). Added in Unicode 1.0.1.