Unicode is generally considered the native encoding for Mac OS X and should be used in nearly all situations. Earlier versions of Mac OS supported file encodings such as MacRoman, but most modern Mac OS X libraries support Unicode inherently. If you use Cocoa or Core Foundation routines, you will probably never need to worry about other file encodings. If your software imports legacy file formats, however, you might need to consider file encoding issues. The following sections describe some of the issues related to Unicode support and legacy file encodings.
File Systems and Unicode Support
Getting Canonical Strings
Carbon and QuickDraw Issues
Cocoa Issues
Different file systems in Mac OS X have different levels of Unicode support:
Mac OS Extended (HFS+) uses canonically decomposed Unicode 3.2 in UTF-16 format, which consists of a sequence of 16-bit code units. (Characters in the ranges U+2000-U+2FFF, U+F900-U+FA6A, and U+2F800-U+2FA1D are not decomposed.)
The UFS file system allows any character from Unicode 2.1 or later, but uses the UTF-8 format, in which ASCII characters occupy a single byte and all other characters are encoded as multibyte sequences. (Characters in the ranges U+2000-U+2FFF, U+F900-U+FA6A, and U+2F800-U+2FA1D are not decomposed.)
Mac OS Standard (HFS) does not support Unicode and instead uses legacy Mac encodings, such as MacRoman.
Locking the canonical decomposition to a particular version of Unicode does not prevent the use of characters defined in a newer version of Unicode. Because the Unicode Consortium has guaranteed not to add any more precomposed characters, applications can expect to store characters defined in future versions of Unicode without compatibility issues.
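The HFS+ rule described above (decompose everything except the listed ranges, using Unicode 3.2 data) can be approximated in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, Python's bundled Unicode 3.2 database stands in for the tables the file system actually uses, and the handling of combining marks next to an excluded character is simplified.

```python
import unicodedata

# Ranges that HFS+ leaves undecomposed, as listed above (inclusive).
EXCLUDED_RANGES = (
    (0x2000, 0x2FFF),
    (0xF900, 0xFA6A),
    (0x2F800, 0x2FA1D),
)

# Python ships a copy of the Unicode 3.2 database next to the current one.
# Characters unassigned in 3.2 pass through it unchanged.
UCD_3_2 = unicodedata.ucd_3_2_0


def is_excluded(ch: str) -> bool:
    """Return True if the character falls in a range that is not decomposed."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES)


def hfs_style_decompose(name: str) -> str:
    """Decompose runs of ordinary characters; copy excluded characters as-is.

    Hypothetical helper: an approximation, not the file system's own code.
    """
    out = []
    run = []
    for ch in name:
        if is_excluded(ch):
            out.append(UCD_3_2.normalize("NFD", "".join(run)))
            run.clear()
            out.append(ch)
        else:
            run.append(ch)
    out.append(UCD_3_2.normalize("NFD", "".join(run)))
    return "".join(out)


def to_utf16_units(name: str) -> list:
    """Return the 16-bit code units of the name, as stored in UTF-16."""
    data = name.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+00E9 becomes two code units: U+0065 followed by U+0301.
print([hex(u) for u in to_utf16_units(hfs_style_decompose("\u00e9"))])
```

The Angstrom sign (U+212B) shows the effect of the exclusion: standard canonical decomposition would split it into A plus a combining ring, but because it lies in U+2000-U+2FFF this sketch leaves it untouched.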
Note: Because of implementation differences, erroneous Unicode in filenames on HFS+ volumes may display correctly when entered on Mac OS 9 but appear garbled on Mac OS X. Similarly, erroneous Unicode entered on Mac OS X may appear garbled in Mac OS 9.
All BSD system functions expect their string parameters to be in UTF-8 encoding and nothing else. Code that calls BSD system routines should ensure that the contents of all const char * parameters are in canonical UTF-8 encoding. In a canonical UTF-8 string, all decomposable characters are decomposed; for example, é (U+00E9) is represented as e (U+0065) followed by the combining acute accent (U+0301). To put strings into canonical UTF-8 encoding, use the “file-system representation” interfaces defined in Cocoa and Carbon (including Core Foundation).
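To make the difference between precomposed and canonical UTF-8 concrete, the following Python sketch prints the bytes for both forms of the same word. It assumes that standard NFD normalization is an adequate stand-in for the decomposition described here; the helper name canonical_utf8 is hypothetical, and on Mac OS X the file-system representation interfaces remain the appropriate tool.

```python
import unicodedata


def canonical_utf8(text: str) -> bytes:
    """Decompose the text (NFD), then encode it as UTF-8.

    Hypothetical helper; plain NFD only approximates what the
    file-system representation interfaces produce.
    """
    return unicodedata.normalize("NFD", text).encode("utf-8")


precomposed = "caf\u00e9"  # "café" with a single precomposed é

# Precomposed é (U+00E9) is the two bytes C3 A9.
print(precomposed.encode("utf-8"))  # b'caf\xc3\xa9'

# Decomposed: e (65) followed by U+0301, which is the two bytes CC 81.
print(canonical_utf8(precomposed))  # b'cafe\xcc\x81'
```

The two byte strings display as the same text but do not compare equal, which is why a string built elsewhere should be converted before it is handed to a routine that expects the canonical form.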
Both Cocoa and Core Foundation provide routines for accessing canonical and non-canonical Unicode strings. Cocoa string manipulations are all handled through the NSString class and its subclasses. In Core Foundation, you can use the CFStringGetCString and CFStringGetCStringPtr functions to obtain a C string with the desired encoding.
If you have existing QuickDraw code and want to draw text, you should be aware that the QuickDraw Text routines do not directly support Unicode. The Carbon File Manager has some file-system calls that return Mac encodings and others that return Unicode. If you pass this Unicode text directly to a QuickDraw routine, you may run into problems. Similarly, if you retrieve text in a Mac encoding and want to use it with Cocoa or with Carbon’s Apple Type Services for Unicode Imaging (ATSUI) API, you must convert the text to Unicode first.
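The conversion step mentioned above amounts to decoding the legacy bytes with the encoding they were written in. The following Python sketch illustrates the idea with MacRoman; it uses Python's built-in mac_roman codec rather than any Carbon or Cocoa API, and the helper name is hypothetical.

```python
def mac_roman_to_unicode(raw: bytes) -> str:
    """Decode bytes stored in the MacRoman encoding into a Unicode string."""
    return raw.decode("mac_roman")


# In MacRoman, the byte 0x8E is é.
legacy = b"caf\x8e"
text = mac_roman_to_unicode(legacy)
print(text)  # café

# Interpreting the same bytes with the wrong encoding silently yields
# different text: in Latin-1, 0x8E is a control character, not é.
wrong = legacy.decode("latin-1")
print(wrong == text)  # False
```

The second half shows why the source encoding must be known: the bytes alone do not identify it, and decoding with the wrong table produces no error, only wrong characters.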
Generally, the encoding that is used depends upon the API you use and not on the font. Fonts are not necessarily limited to particular encodings. TrueType fonts, for example, declare the set of glyphs they implement and provide encoding tables that map those glyphs to character values in particular encodings. PostScript fonts have similar encoding tables. Various parts of the operating system know how to map characters from one encoding to another. Cocoa and ATSUI use Unicode as the “destination” mapping for a font. QuickDraw Text in Carbon uses the Mac encodings, selected according to the script that the ‘FOND’ resource of the font corresponds to.
The fonts that are installed with Mac OS X have large character sets supporting a wide range of encodings and scripts. For example, Lucida Grande, the system font, supports extended Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai. But if you draw text through QuickDraw Text, you have access only to the MacRoman repertoire. To access the rest, you must use Cocoa or ATSUI. Similarly, the Hiragino fonts have a large repertoire of characters beyond that supported by MacJapanese, and these are accessible only through Cocoa or ATSUI. Both Cocoa and ATSUI also substitute glyphs from other fonts when the requested one isn't available; however, their algorithms for font substitution are different.
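One way to see the limit of the MacRoman repertoire is to test whether a given string survives conversion to that encoding. The following Python sketch does so with Python's mac_roman codec, which is assumed here to be a reasonable proxy for the repertoire available to QuickDraw Text; the function name fits_mac_roman is hypothetical.

```python
import unicodedata


def fits_mac_roman(text: str) -> bool:
    """Return True if every character in the text has a MacRoman encoding.

    The text is composed (NFC) first, because MacRoman contains precomposed
    letters such as é but no combining marks.
    """
    composed = unicodedata.normalize("NFC", text)
    try:
        composed.encode("mac_roman")
    except UnicodeEncodeError:
        return False
    return True


print(fits_mac_roman("caf\u00e9"))   # True: é is in MacRoman
print(fits_mac_roman("cafe\u0301"))  # True: composes to é first
print(fits_mac_roman("Привет"))      # False: Cyrillic is outside MacRoman
```

Text for which this check fails is the kind that, per the discussion above, needs to be drawn through Cocoa or ATSUI instead.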
For information on file encodings in the context of multiscript support, see “Guidelines for Adding MultiScript Support.”
Cocoa employs Unicode for character encoding, making any Cocoa application capable of displaying most human languages. Although Cocoa supports vertical and bidirectional text, the NSTypesetter class only supports layout for horizontal text. If you want to lay out vertical text, you need to define your own custom typesetter class.
© 2003, 2009 Apple Inc. All Rights Reserved. (Last updated: 2009-01-06)