Legacy Document

Important: The information in this document is obsolete and should not be used for new development.


Programming With the Text Encoding Conversion Manager

Packing Schemes for Multiple Character Sets

Packing schemes use a sequence of 8-bit values, so they are generally not suitable for mail (although they are often used on the Web). In these schemes, certain characters function as a local shift that controls the interpretation of the next 1-3 bytes.

The best-known packing scheme is probably Shift-JIS, which was originally developed by Microsoft for use with MS-DOS. It includes the following:

The characters from JIS X0201, represented as single bytes, with the same code points as in JIS X0201: 0x00-0x7F and 0xA1-0xDF.
The characters from JIS X0208, represented as 2 bytes, with the first byte in the range 0x81-0x9F or 0xE0-0xEF and the second byte in the range 0x40-0x7E or 0x80-0xFC.
Space for 2444 user-defined characters, represented as 2 bytes, with the first byte in the range 0xF0-0xFC, and the second byte in the range 0x40-0x7E or 0x80-0xFC.
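The byte ranges listed above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the Text Encoding Conversion Manager; the names SINGLE_BYTE, X0208_LEAD, USER_LEAD, TRAIL, and classify_first_byte are hypothetical and chosen only for illustration. It classifies a first byte and confirms that the user-defined area holds 13 x 188 = 2444 code positions.

```python
# Byte ranges taken from the Shift-JIS description above.
# All names here are illustrative assumptions, not an existing API.
SINGLE_BYTE = [range(0x00, 0x80), range(0xA1, 0xE0)]   # JIS X0201
X0208_LEAD = [range(0x81, 0xA0), range(0xE0, 0xF0)]    # first byte, JIS X0208
USER_LEAD = [range(0xF0, 0xFD)]                        # first byte, user-defined
TRAIL = [range(0x40, 0x7F), range(0x80, 0xFD)]         # second byte of any pair


def in_any(byte, ranges):
    """Return True if byte falls in any of the given ranges."""
    return any(byte in r for r in ranges)


def classify_first_byte(byte):
    """Classify a byte that is known to start a Shift-JIS unit."""
    if in_any(byte, SINGLE_BYTE):
        return "single"
    if in_any(byte, X0208_LEAD):
        return "x0208-lead"
    if in_any(byte, USER_LEAD):
        return "user-lead"
    return "unused"


def count(ranges):
    """Total number of byte values covered by the ranges."""
    return sum(len(r) for r in ranges)


# 13 lead bytes times 188 trail bytes.
user_defined_slots = count(USER_LEAD) * count(TRAIL)
```

Because the three first-byte categories do not overlap, a single table lookup on the first byte is enough to decide the length of a unit, provided the scan starts at a known character boundary.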

The 2-byte units all begin with byte values that are not used for JIS X0201, so it is possible to distinguish them if the text is processed serially from the beginning of a buffer. However, the second bytes of 2-byte units use values that can be confused either with the first byte of a 2-byte unit or with a single-byte code point from JIS X0201; when pointing into an arbitrary location in the middle of Shift-JIS text, it may be impossible to determine character boundaries. Figure B-4 shows this with a somewhat pathological Shift-JIS byte sequence using only two different byte values (the corresponding character images are also shown).

Figure B-4 Shift-JIS byte sequence
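The boundary problem can be illustrated with a short Python sketch. The function split_units and the four-byte sample are hypothetical; the sample uses only the two byte values 0x83 and 0x41 and is not necessarily the sequence shown in Figure B-4. Scanning from the start of the buffer gives the intended units, while scanning from offset 1 gives a different but equally plausible segmentation.

```python
def is_lead(byte):
    """True if byte can start a 2-byte Shift-JIS unit (ranges from the text)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_units(data, start=0):
    """Split Shift-JIS bytes into 1- and 2-byte units, scanning from start.

    The result is only meaningful if start is a real character boundary.
    A lead byte at the very end of the buffer is returned as a 1-byte unit.
    """
    units = []
    i = start
    while i < len(data):
        if is_lead(data[i]) and i + 1 < len(data):
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


# Hypothetical sample: the 2-byte unit 0x83 0x41, twice.
sample = bytes([0x83, 0x41, 0x83, 0x41])

from_start = split_units(sample)        # two 2-byte units
from_middle = split_units(sample, 1)    # 0x41 alone, then one 2-byte unit
```

Starting at offset 1, the byte 0x41 is read as a single-byte JIS X0201 character even though it is really the second byte of a 2-byte unit. Nothing in the bytes themselves signals the mistake, which is why a Shift-JIS parser generally has to scan forward from a known boundary.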

Moreover, Shift-JIS contains multiple representations of the Katakana and basic Latin repertoires, which are available in 1-byte form via JIS X0201, and in 2-byte form via JIS X0208. Shift-JIS has a well-deserved reputation as a troublesome encoding scheme.
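As an illustration of the duplicate representations, the following Python sketch decodes 1-byte and 2-byte forms of the same letters using Python's built-in shift_jis codec (an assumption of this example; it is not the Text Encoding Conversion Manager). The variable names and the helper same_after_folding are hypothetical. The codec maps the two forms to distinct half-width and full-width Unicode characters, which become equal only after compatibility normalization.

```python
import unicodedata

# Katakana "a": 1-byte form from JIS X0201, 2-byte form from JIS X0208.
katakana_1byte = bytes([0xB1]).decode("shift_jis")
katakana_2byte = bytes([0x83, 0x41]).decode("shift_jis")

# Latin "A": 1-byte form from JIS X0201, 2-byte form from JIS X0208.
latin_1byte = bytes([0x41]).decode("shift_jis")
latin_2byte = bytes([0x82, 0x60]).decode("shift_jis")


def same_after_folding(a, b):
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

A plain byte-wise or code-point comparison treats the two forms as different characters, so searching and sorting Shift-JIS text usually requires some folding step of this kind.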

The EUC (Extended UNIX Code) packing schemes were originally developed for UNIX systems; they use units of 1 to 4 bytes.

EUC-JP (Japanese) combines JIS-Roman, the JIS X0201 Katakana and related punctuation, JIS X0208, and JIS X0212:

Character Set	Range of Corresponding EUC Sequence
JIS-Roman	0x21-0x7E (same as JIS-Roman code point)
JIS X0208	0xA1A1-0xFEFE (X0208 code point + 0x8080)
JIS X0201 Katakana, etc.	0x8EA1-0x8EDF (0x8E, then X0201 code point)
JIS X0212	0x8FA1A1-0x8FFEFE (0x8F, then X0212 code point + 0x8080)
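The table can be read as three small conversion rules. The sketch below, in Python, builds the EUC-JP byte sequence from a code point in each source character set; the function names are hypothetical, and the range checks assume the usual 94 x 94 layout (each byte 0x21-0x7E) for JIS X0208 and JIS X0212. JIS-Roman needs no function because its code points are used unchanged.

```python
def _check_94x94(code_point):
    """Raise ValueError unless both bytes of code_point are in 0x21-0x7E."""
    high, low = code_point >> 8, code_point & 0xFF
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError(f"not a 94x94 code point: {code_point:#06x}")


def euc_jp_from_x0208(code_point):
    """JIS X0208 code point + 0x8080, as 2 bytes."""
    _check_94x94(code_point)
    return (code_point + 0x8080).to_bytes(2, "big")


def euc_jp_from_x0201_katakana(code_point):
    """0x8E, then the JIS X0201 code point (0xA1-0xDF)."""
    if not 0xA1 <= code_point <= 0xDF:
        raise ValueError(f"not a JIS X0201 Katakana code point: {code_point:#04x}")
    return bytes([0x8E, code_point])


def euc_jp_from_x0212(code_point):
    """0x8F, then the JIS X0212 code point + 0x8080, as 3 bytes total."""
    _check_94x94(code_point)
    return bytes([0x8F]) + (code_point + 0x8080).to_bytes(2, "big")
```

For example, the first and last JIS X0208 positions, 0x2121 and 0x7E7E, map to the ends of the range in the table, 0xA1A1 and 0xFEFE. Unlike Shift-JIS, every byte of a multi-byte EUC-JP unit has its high bit set, so such bytes cannot be mistaken for JIS-Roman characters.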

EUC-CN (simplified Chinese) combines ASCII and GB 2312 (adds 0x8080 to the GB code point).
EUC-KR (Korean) combines ASCII and KSC 5601-1987 (adds 0x8080 to the KSC code point).
EUC-TW (traditional Chinese) combines ASCII and all 16 planes of CNS 11643-1992. The 16 planes are encoded as 0x8E, then the plane number + 0xA0, then the CNS code point + 0x8080. In addition, Plane 1 is redundantly encoded as simply the CNS code point + 0x8080.
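The EUC-TW rule, including the redundant short form for Plane 1, can be sketched in Python as follows. The function euc_tw_encode and its short_form parameter are hypothetical names used only for this example.

```python
def euc_tw_encode(plane, code_point, short_form=True):
    """Encode a CNS 11643 code point in the given plane as EUC-TW bytes.

    Long form:  0x8E, plane number + 0xA0, then code point + 0x8080.
    Short form: code point + 0x8080 only (available for Plane 1).
    """
    if not 1 <= plane <= 16:
        raise ValueError(f"plane out of range: {plane}")
    two_bytes = (code_point + 0x8080).to_bytes(2, "big")
    if plane == 1 and short_form:
        return two_bytes
    return bytes([0x8E, 0xA0 + plane]) + two_bytes
```

With this rule, Plane 1 code point 0x2121 can appear either as the 2-byte sequence 0xA1A1 or as the 4-byte sequence 0x8EA1A1A1, so a comparison routine should treat the two forms as equal.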

The Big 5 encoding is a special case. It is not a national standard, but a de facto encoding used for traditional Chinese. It combines ASCII (represented as 1-byte units) with 2-byte units that represent Hanzi, CJK punctuation and symbols, and other characters. There is no separate specification for the set of characters represented by the 2-byte units, although the Hanzi repertoire closely matches that of CNS 11643 Planes 1 and 2. For the 2-byte units, the first byte is in the range 0xA1-0xFE, and the second byte is in the range 0x40-0x7E or 0xA1-0xFE.
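The Big 5 byte ranges given above translate into a simple forward scanner. The following Python sketch uses hypothetical names (is_big5_lead, is_big5_trail, split_big5) and treats every byte below 0xA1 that starts a unit as a 1-byte unit, which is an assumption of this example. It also computes how many 2-byte code positions the ranges allow.

```python
def is_big5_lead(byte):
    """First byte of a 2-byte Big 5 unit: 0xA1-0xFE."""
    return 0xA1 <= byte <= 0xFE


def is_big5_trail(byte):
    """Second byte of a 2-byte Big 5 unit: 0x40-0x7E or 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def split_big5(data):
    """Split Big 5 bytes into units, scanning from the start of the buffer."""
    units = []
    i = 0
    while i < len(data):
        if is_big5_lead(data[i]):
            if i + 1 >= len(data) or not is_big5_trail(data[i + 1]):
                raise ValueError(f"incomplete or invalid 2-byte unit at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


# 94 possible first bytes times 157 possible second bytes.
two_byte_positions = (
    sum(is_big5_lead(b) for b in range(256))
    * sum(is_big5_trail(b) for b in range(256))
)
```

As with Shift-JIS, the second-byte range 0x40-0x7E overlaps ASCII, so the scan is reliable only when it starts at a known character boundary.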

The acronym MBCS (multi-byte character set) is used for encoding schemes that mix character units of different byte lengths (as in the packing schemes mentioned above), in contrast to SBCS (single-byte character set). The acronym DBCS (double-byte character set) is sometimes used for pure two-byte encodings such as JIS X0208, and sometimes used synonymously with MBCS.