Q: I'm writing a file system (VFS) plug-in for Mac OS X. How do I handle text encodings correctly?

A: In Mac OS X's VFS API, file names are, by definition, canonically decomposed Unicode, encoded using UTF-8. This raises a number of interesting issues.

Precomposed versus Decomposed

This Q&A assumes that you're familiar with the terms precomposed and decomposed Unicode. If that's not the case, there's a short explanation in DTS Q&A 1235 Converting to Precomposed Unicode.
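To make the distinction concrete, here is a minimal user-space sketch in C showing the two UTF-8 byte sequences for the single character é. It is not kernel code and is not taken from any Apple source; the names kPrecomposed, kDecomposed, and names_equal_binary are hypothetical, chosen purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9, encoded in UTF-8. */
static const unsigned char kPrecomposed[] = { 0xC3, 0xA9 };

/* "é" canonically decomposed: U+0065 (e) followed by U+0301
   (combining acute accent), encoded in UTF-8. */
static const unsigned char kDecomposed[] = { 0x65, 0xCC, 0x81 };

/* Hypothetical helper: returns 1 if the two byte strings are
   identical byte for byte, 0 otherwise. No normalization is done. */
static int names_equal_binary(const unsigned char *a, size_t aLen,
                              const unsigned char *b, size_t bLen)
{
    return aLen == bLen && memcmp(a, b, aLen) == 0;
}
```

Calling names_equal_binary(kPrecomposed, sizeof kPrecomposed, kDecomposed, sizeof kDecomposed) returns 0: both sequences display as the same character, but they differ in length and content, so a byte-for-byte comparison treats them as different names.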
Volume Format

Your target volume format should define whether it uses precomposed or decomposed Unicode. For example, HFS Plus uses decomposed Unicode whereas UDF and SMB use precomposed Unicode. Unfortunately, some volume formats (for example, NFS) have no accepted standard. This presents additional challenges, which I'll cover below.
Returning Names

When returning names to higher layers (for example,
from your

Accepting Names

In most cases, high-level software will pass decomposed names to your file system. However, this is not guaranteed. There are a variety of circumstances (some discussed below) where your file system is passed precomposed names. Regardless of their incoming state, you should always convert names to the encoding scheme required by your underlying volume format. Thus, if your underlying volume format requires precomposed names, you should convert names to their precomposed variant before writing them to disk. Similarly, if your volume format requires decomposed names, you should decompose any precomposed characters.

This raises the question of what to do if your underlying volume format does not define a standard. There is no good solution here. You can choose to pass through names unchanged, which is what Apple's NFS implementation does, or provide some user interface for the user to choose (similar to the "Character Set" popup in the AppleShare login dialog). Either way the user experience will not be ideal.

Implementation

The Unicode Converter provides a mechanism to convert between precomposed and decomposed Unicode. See DTS Q&A 1235 Converting to Precomposed Unicode for more details on doing this.

However, the Unicode Converter is not callable from inside the kernel, where your VFS plug-in resides. Therefore, you will probably need to roll your own code to compose and decompose Unicode. The basics for doing this are covered in the table contained in DTS Technote 1150 HFS Plus Volume Format. If you agree to the Apple Public Source License you can also look at the code Apple uses in our HFS Plus implementation.
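As a rough illustration of what table-driven decomposition involves, the following C sketch maps a handful of precomposed characters to a base character plus one combining mark. Everything in it is an assumption made for illustration: the DecompEntry type, the kDecompTable contents (five entries only), and the sketch_decompose function are hypothetical and are not Apple's code. It also assumes the name has already been decoded from UTF-8 into 32-bit code points. It does not handle characters that decompose in several steps, the canonical ordering of multiple combining marks, or Hangul, all of which full canonical decomposition requires.

```c
#include <stddef.h>
#include <stdint.h>

/* One canonical decomposition: a precomposed code point and the
   base character + combining mark that replace it. */
typedef struct {
    uint32_t composed;
    uint32_t base;
    uint32_t mark;
} DecompEntry;

/* Hypothetical, deliberately tiny table for illustration only. */
static const DecompEntry kDecompTable[] = {
    { 0x00C9, 0x0045, 0x0301 },  /* É = E + combining acute     */
    { 0x00E0, 0x0061, 0x0300 },  /* à = a + combining grave     */
    { 0x00E9, 0x0065, 0x0301 },  /* é = e + combining acute     */
    { 0x00F1, 0x006E, 0x0303 },  /* ñ = n + combining tilde     */
    { 0x00FC, 0x0075, 0x0308 },  /* ü = u + combining diaeresis */
};

/* Linear search; returns NULL if 'cp' has no entry in the table. */
static const DecompEntry *sketch_lookup(uint32_t cp)
{
    size_t count = sizeof kDecompTable / sizeof kDecompTable[0];
    for (size_t i = 0; i < count; i++) {
        if (kDecompTable[i].composed == cp)
            return &kDecompTable[i];
    }
    return NULL;
}

/* Copies 'in' to 'out', replacing each code point found in the table
   with its base + mark pair. Returns the number of code points
   written, or (size_t)-1 if 'out' (capacity outCap) is too small. */
static size_t sketch_decompose(const uint32_t *in, size_t inLen,
                               uint32_t *out, size_t outCap)
{
    size_t n = 0;
    for (size_t i = 0; i < inLen; i++) {
        const DecompEntry *e = sketch_lookup(in[i]);
        size_t need = e ? 2 : 1;
        if (outCap - n < need)
            return (size_t)-1;       /* output buffer too small */
        if (e) {
            out[n++] = e->base;
            out[n++] = e->mark;
        } else {
            out[n++] = in[i];        /* no mapping: pass through */
        }
    }
    return n;
}
```

Given the four code points of "café" with a precomposed é (U+00E9), sketch_decompose writes five code points, ending in U+0065 U+0301. A table this small is for illustration only; the HFS Plus code mentioned above handles the full set of mappings.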
I strongly recommend that you look through the HFS Plus source, even if you don't choose to reuse it in your own file system. Converting between precomposed and decomposed Unicode text is a complicated process and our implementation will give you an understanding of the scope of the problem. You can find more information about this issue on the Unicode consortium web site. You may also want to look at the normalization functions provided in International Components for Unicode (IBM's open source code for internationalization with Unicode).

Compatibility

In theory the techniques described above can cause compatibility problems for applications. For example, if an application creates a file using a precomposed name and then iterates through the directory looking for that file using a simple binary string comparison, it won't find the file, because the name it reads back is decomposed. In practice this is rarely a problem. Don't forget that the primary file system, HFS Plus, works this way, so any program that's incompatible with your file system will also be incompatible with HFS Plus.

Most of Apple's built-in file systems use the techniques described above. Two notable exceptions are NFS and UFS. Of these, NFS is the more troublesome because NFS volumes can be shared with non-Mac clients that create files with precomposed characters in their names, and the Mac OS X NFS client does not decompose them before returning them to applications. If the user copies files from an NFS volume to your volume using a naive copy program (like the
[Feb 10 2003]