
Text Encodings in VFS


Q: I'm writing a file system (VFS) plug-in for Mac OS X. How do I handle text encodings correctly?

A: In Mac OS X's VFS API, file names are, by definition, canonically decomposed Unicode encoded as UTF-8. This raises a number of interesting issues.

Precomposed versus Decomposed

This Q&A assumes that you're familiar with the terms precomposed and decomposed Unicode. If that's not the case, there's a short explanation in DTS Q&A 1235 Converting to Precomposed Unicode.

IMPORTANT:
The terms used in this Q&A, decomposed and precomposed, roughly correspond to Unicode Normal Forms D and C, respectively. However, most volume formats do not follow the exact specification for these normal forms. For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed (this avoids problems with round trip conversions from old Mac text encodings). It's likely that your volume format has similar oddities.
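As an illustration of such an oddity, the HFS Plus exclusion ranges listed above can be expressed as a simple range test that a decomposition routine consults before it touches a character. This is a minimal sketch; the function name is hypothetical and is not part of any Apple API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the code point lies in one of the three ranges that,
 * according to the note above, HFS Plus leaves undecomposed.
 * The name is made up for this example. */
static bool hfsplus_skips_decomposition(uint32_t cp)
{
    return (cp >= 0x2000  && cp <= 0x2FFF) ||
           (cp >= 0xF900  && cp <= 0xFAFF) ||
           (cp >= 0x2F800 && cp <= 0x2FAFF);
}
```

A decomposition loop would call this first and copy the character through unchanged when it returns true. For example, U+212B lies inside the first range, whereas U+00E9 does not.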

Volume Format

Your target volume format should define whether it uses precomposed or decomposed Unicode. For example, HFS Plus uses decomposed Unicode whereas UDF and SMB use precomposed Unicode.

Unfortunately, some volume formats (for example, NFS) have no accepted standard. This presents additional challenges, which I'll cover below.

Note:
Throughout this Q&A I use the term volume format to indicate either an on-disk volume format or a network protocol.

Returning Names

When returning names to higher layers (for example, from your VOP_READDIR entry point), you should always return decomposed names. If your underlying volume format uses precomposed names, you should convert any precomposed characters to their decomposed equivalents before returning them to the system.
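The conversion described above might look roughly like the following sketch. It is not Apple's implementation. It assumes the name has already been decoded from UTF-8 into an array of code points, it uses a four-entry table purely for illustration, and it ignores multi-level decompositions and the canonical ordering of combining marks, all of which a real file system must handle. Every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One precomposed character and its two-character decomposition. */
struct decomp_entry {
    uint32_t precomposed;
    uint32_t base;
    uint32_t combining;
};

/* Illustrative subset only; a real table covers every character that
 * the volume format requires to be decomposed. */
static const struct decomp_entry k_decomp_table[] = {
    { 0x00C5, 0x0041, 0x030A },  /* A with ring above */
    { 0x00E9, 0x0065, 0x0301 },  /* e with acute */
    { 0x00F1, 0x006E, 0x0303 },  /* n with tilde */
    { 0x00FC, 0x0075, 0x0308 },  /* u with diaeresis */
};

#define DECOMP_COUNT (sizeof k_decomp_table / sizeof k_decomp_table[0])

static const struct decomp_entry *find_decomposition(uint32_t cp)
{
    for (size_t i = 0; i < DECOMP_COUNT; i++) {
        if (k_decomp_table[i].precomposed == cp)
            return &k_decomp_table[i];
    }
    return NULL;
}

/* Decomposes src[0..src_len) into dst. Returns 0 and stores the output
 * length in *out_len on success, or -1 if dst is too small. */
static int decompose_name(const uint32_t *src, size_t src_len,
                          uint32_t *dst, size_t dst_cap, size_t *out_len)
{
    size_t out = 0;
    for (size_t i = 0; i < src_len; i++) {
        const struct decomp_entry *hit = find_decomposition(src[i]);
        size_t need = hit ? 2 : 1;
        if (dst_cap - out < need)
            return -1;
        if (hit) {
            dst[out++] = hit->base;
            dst[out++] = hit->combining;
        } else {
            dst[out++] = src[i];
        }
    }
    *out_len = out;
    return 0;
}
```

Because a decomposed name can be longer than its precomposed form, the caller must size the output buffer accordingly and be prepared for the conversion to fail.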

Accepting Names

In most cases, high-level software will pass decomposed names to your file system. However, this is not guaranteed. There are a variety of circumstances (some discussed below) where your file system is passed precomposed names. Regardless of their incoming state, you should always convert names to the encoding scheme required by your underlying volume format. Thus, if your underlying volume format requires precomposed names, you should convert names to their precomposed variant before writing them to disk. Similarly, if your volume format requires decomposed names, you should decompose any precomposed characters.
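For a volume format that requires precomposed names, the incoming conversion could be sketched as follows. The same caveats apply as for any toy example: the table is a tiny illustrative subset, the input is assumed to be already decoded into code points, and sequences with more than one combining mark are not handled. The function and table names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A base character, a combining mark, and the precomposed result. */
struct compose_entry {
    uint32_t base;
    uint32_t combining;
    uint32_t precomposed;
};

/* Illustrative subset only. */
static const struct compose_entry k_compose_table[] = {
    { 0x0041, 0x030A, 0x00C5 },  /* A with ring above */
    { 0x0065, 0x0301, 0x00E9 },  /* e with acute */
    { 0x006E, 0x0303, 0x00F1 },  /* n with tilde */
    { 0x0075, 0x0308, 0x00FC },  /* u with diaeresis */
};

#define COMPOSE_COUNT (sizeof k_compose_table / sizeof k_compose_table[0])

/* Composes src[0..src_len) into dst. Characters that are already
 * precomposed, or that have no table entry, pass through unchanged.
 * Returns 0 on success, or -1 if dst is too small. */
static int compose_name(const uint32_t *src, size_t src_len,
                        uint32_t *dst, size_t dst_cap, size_t *out_len)
{
    size_t out = 0;
    size_t i = 0;
    while (i < src_len) {
        uint32_t cp = src[i];
        size_t used = 1;
        if (i + 1 < src_len) {
            for (size_t t = 0; t < COMPOSE_COUNT; t++) {
                if (k_compose_table[t].base == src[i] &&
                    k_compose_table[t].combining == src[i + 1]) {
                    cp = k_compose_table[t].precomposed;
                    used = 2;
                    break;
                }
            }
        }
        if (out >= dst_cap)
            return -1;
        dst[out++] = cp;
        i += used;
    }
    *out_len = out;
    return 0;
}
```

Note that the routine accepts both forms on input, which matches the advice above: normalize every incoming name regardless of the state in which it arrives.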

This raises the question of what to do if your underlying volume format does not define a standard. There is no good solution here. You can choose to pass through names unchanged, which is what Apple's NFS implementation does, or provide some user interface for the user to choose (similar to the "Character Set" popup in the AppleShare login dialog). Either way the user experience will not be ideal.

Implementation

The Unicode Converter provides a mechanism to convert between precomposed and decomposed Unicode. See DTS Q&A 1235 Converting to Precomposed Unicode for more details on doing this. However, the Unicode Converter is not callable from inside the kernel, where your VFS plug-in resides. Therefore, you will probably need to roll your own code to compose and decompose Unicode. The basics for doing this are covered in the table contained in DTS Technote 1150 HFS Plus Volume Format.

If you agree to the Apple Public Source License you can also look at the code Apple uses in our HFS Plus implementation.

IMPORTANT:
The presence of the __APPLE_API_UNSTABLE guard at the top of "utfconv.h" indicates that, if your kernel code calls the functions declared there directly, it may suffer binary compatibility problems in the future.

I strongly recommend that you look through this source, even if you don't choose to reuse it in your own file system. Converting between precomposed and decomposed Unicode text is a complicated process and our implementation will give you an understanding of the scope of the problem.

You can find more information about this issue on the Unicode Consortium web site. You may also want to look at the normalization functions provided in International Components for Unicode (IBM's open source code for internationalization with Unicode).

Compatibility

In theory, the techniques described above can cause compatibility problems for applications. For example, if an application creates a file using a precomposed name and then iterates through the directory looking for that file using a simple binary string comparison, it won't find the file. In practice this is rarely a problem. Don't forget that the primary file system, HFS Plus, works this way, so any program that's incompatible with your file system will also be incompatible with HFS Plus.
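To make the failure concrete, here is a small sketch of the mismatch. The two strings are the UTF-8 encodings of the same visible name, "café", first with a precomposed U+00E9 and then with U+0065 followed by U+0301. The identifiers are hypothetical and the example runs in user space; it only demonstrates the comparison, not any file system call.

```c
#include <string.h>

/* The name as the application supplied it: precomposed U+00E9. */
static const char k_requested[] = "caf\xC3\xA9";

/* The name as a directory iteration would return it: U+0065 U+0301. */
static const char k_returned[] = "cafe\xCC\x81";

/* A naive binary comparison of two names; returns nonzero on a match. */
static int names_match_binary(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Calling names_match_binary(k_requested, k_returned) reports no match, even though both strings denote the same file name. An application that needs to compare names reliably should normalize both sides to the same form first.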

Most of Apple's built-in file systems use the techniques described above. Two notable exceptions are NFS and UFS. Of these, NFS is the more troublesome because NFS volumes can be shared with non-Mac clients that create files with precomposed characters in their names, and the Mac OS X NFS client does not decompose them before returning them to applications. If the user copies files from an NFS volume to your volume using a naive copy program (like the cp command line tool), the copy program will copy the files without decomposing the names. Thus your file system will be asked to create files with precomposed names. Your file system must be prepared to handle this, as described above.


[Feb 10 2003]

