Converting to Precomposed Unicode


Q: How do I convert a Unicode string to its precomposed form?

A: It is possible to convert a string to precomposed Unicode using APIs introduced in Mac OS X 10.2. The rest of this Q&A explains the difference between precomposed and decomposed Unicode, why you might want to convert to precomposed Unicode, and how to do so.

Precomposed versus Decomposed

Certain Unicode characters can be encoded in more than one way. For example, an Á (A acute) can be encoded either precomposed, as U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041 U+0301 (LATIN CAPITAL LETTER A followed by a COMBINING ACUTE ACCENT). Precomposed characters are more common in the Windows world, whereas decomposed characters are more common on the Mac.
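
To make the two encodings concrete, here is a minimal sketch in plain C. The Utf16Unit type and the SameCodeUnits helper are illustrative names, not part of any Apple API. It shows that a naive comparison of code units treats the two forms of Á as different strings, which is why text should be compared through system-provided APIs or after normalization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* UTF-16 code units, the same width as a Carbon UniChar. */
typedef uint16_t Utf16Unit;

/* A acute, precomposed: a single code unit. */
static const Utf16Unit kAAcutePrecomposed[] = { 0x00C1 };

/* A acute, decomposed: base letter followed by a combining mark. */
static const Utf16Unit kAAcuteDecomposed[]  = { 0x0041, 0x0301 };

/* Hypothetical helper: compares raw code units with no normalization.
   Returns 1 if both buffers hold identical code units, 0 otherwise. */
static int SameCodeUnits(const Utf16Unit *a, size_t aCount,
                         const Utf16Unit *b, size_t bCount)
{
    assert(a != NULL);
    assert(b != NULL);

    return (aCount == bCount)
        && (memcmp(a, b, aCount * sizeof(Utf16Unit)) == 0);
}
```

Calling SameCodeUnits(kAAcutePrecomposed, 1, kAAcuteDecomposed, 2) returns 0, even though both buffers represent the same character to a user.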

When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode. This isn't a problem as long as you use system-provided APIs to process text. Apple's APIs correctly handle both precomposed and decomposed Unicode.

However, you may need to convert to precomposed Unicode when you interact with other platforms. The following are all valid reasons to do so.

  • You implement a network protocol that is defined to use precomposed Unicode.
  • You create a cross-platform file (or volume) whose specification dictates precomposed Unicode.
  • You incorporate a large body of cross-platform code that expects precomposed Unicode into your application.

IMPORTANT:
Do not convert to precomposed Unicode in an attempt to simplify your text processing. Precomposed Unicode can still contain composite characters. For example, there is no precomposed equivalent of U+0065 U+030A (LATIN SMALL LETTER E followed by COMBINING RING ABOVE), so even after conversion your code must still handle combining character sequences.
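
The point above can be illustrated with a small sketch in plain C. The ContainsCombiningDiacritic helper and the Utf16Unit type are hypothetical names introduced here. The helper checks only the Combining Diacritical Marks block (U+0300 through U+036F); Unicode defines combining characters in other blocks as well, so this is an illustration rather than a complete test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-16 code units, the same width as a Carbon UniChar. */
typedef uint16_t Utf16Unit;

/* Hypothetical helper: returns 1 if the buffer contains any code unit in
   the Combining Diacritical Marks block (U+0300..U+036F), 0 otherwise.
   Other blocks of combining characters are deliberately not checked. */
static int ContainsCombiningDiacritic(const Utf16Unit *text, size_t count)
{
    size_t i;

    assert(text != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (text[i] >= 0x0300 && text[i] <= 0x036F) {
            return 1;
        }
    }
    return 0;
}
```

Because U+0065 U+030A has no precomposed equivalent, a buffer holding that sequence still makes this helper return 1 after conversion to precomposed form.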

You can find much more information about Unicode on the Unicode Consortium web site. Of particular interest is Unicode Standard Annex #15, Unicode Normalization Forms. As used in this Q&A, the terms decomposed and precomposed correspond to Unicode Normalization Forms D (NFD) and C (NFC), respectively.

Converting to Precomposed on Mac OS X 10.2

Mac OS X 10.2 introduced two APIs to convert a Unicode string to its precomposed form. The easiest to use is CFStringNormalize. Listing 1 shows the prototype for this function. You can learn more about CFStringNormalize by reading the comments in the <CoreFoundation/CFString.h> header file.



typedef enum {
    kCFStringNormalizationFormD = 0,
    kCFStringNormalizationFormKD,
    kCFStringNormalizationFormC,
    kCFStringNormalizationFormKC
} CFStringNormalizationForm;

void CFStringNormalize(CFMutableStringRef theString,
                       CFStringNormalizationForm theForm);

Listing 1. Prototype for CFStringNormalize



In addition, the Unicode Converter in Mac OS X 10.2 can convert a Unicode string to its precomposed form. The code in Listing 2 shows how to do this (assuming you pass true to the precomposed parameter).



#include <assert.h>
#include <CoreServices/CoreServices.h>

static OSStatus ConvertUnicodeToCanonical(
            Boolean precomposed,
            const UniChar *inputBuf, ByteCount inputBufLen,
            UniChar *outputBuf, ByteCount outputBufSize,
            ByteCount *outputBufLen)
    /* As is standard with the Unicode Converter,
    all lengths are in bytes. */
{
    OSStatus            err;
    OSStatus            junk;
    TextEncodingVariant variant;
    UnicodeToTextInfo   uni;
    UnicodeMapping      map;
    ByteCount           junkRead;

    assert(inputBuf     != NULL);
    assert(outputBuf    != NULL);
    assert(outputBufLen != NULL);

    if (precomposed) {
        variant = kUnicodeCanonicalCompVariant;
    } else {
        variant = kUnicodeCanonicalDecompVariant;
    }
    map.unicodeEncoding = CreateTextEncoding(kTextEncodingUnicodeDefault,
                                             kUnicodeNoSubset,
                                             kTextEncodingDefaultFormat);
    map.otherEncoding   = CreateTextEncoding(kTextEncodingUnicodeDefault,
                                             variant,
                                             kTextEncodingDefaultFormat);
    map.mappingVersion  = kUnicodeUseLatestMapping;

    uni = NULL;

    err = CreateUnicodeToTextInfo(&map, &uni);
    if (err == noErr) {
        err = ConvertFromUnicodeToText(uni, inputBufLen, inputBuf,
                                       kUnicodeDefaultDirectionMask,
                                       0, NULL, NULL, NULL,
                                       outputBufSize, &junkRead,
                                       outputBufLen, outputBuf);
    }

    if (uni != NULL) {
        junk = DisposeUnicodeToTextInfo(&uni);
        assert(junk == noErr);
    }

    return err;
}

Listing 2. Using the Unicode Converter to create precomposed Unicode


There are three things to note about this code.

  • You can create decomposed Unicode by passing false to the precomposed parameter.
  • The code uses ConvertFromUnicodeToText, not ConvertFromTextToUnicode. You can't convert directly from a non-Unicode encoding to precomposed Unicode.
  • The code uses the low-level Unicode Converter, not the Text Encoding Converter. TEC does not support conversion to precomposed Unicode.

To convert an arbitrarily encoded non-Unicode string to precomposed Unicode, you must first A) convert the string to Unicode (using the Unicode Converter or TEC) and then B) convert that Unicode to precomposed Unicode (using the code shown above).

Note:
When converting to Unicode, TEC will preserve the precomposed/decomposed nature of the source encoding. For example, MacRoman does not support decomposed characters, so TEC will, by default, produce precomposed Unicode. On the other hand, GB 18030 does support decomposed characters, which TEC preserves as decomposed Unicode. Therefore, if you know the nature of the encoding of your source text, you can use this to avoid step B above.

Converting to Precomposed on Earlier Systems

Neither of these solutions is present on earlier versions of Mac OS. If you need to convert to precomposed Unicode on Mac OS X 10.1.x and earlier, you will have to write your own code to do so. You might consider using the normalization functions provided in International Components for Unicode (IBM's open source code for internationalization with Unicode).


[Feb 7 2003]

