Legacy Document

Important: The information in this document is obsolete and should not be used for new development.


Programming With the Text Encoding Conversion Manager



Identifying Character Encodings on the Internet

In many Internet protocols, a charset parameter may be used in certain contexts to specify both a character set and a character encoding scheme. The value of the charset parameter is a case-insensitive string limited to the characters A-Z, a-z, 0-9, hyphen-minus, underscore, period, and colon. The character encoding names specified for this parameter are generally expressed in US-ASCII octet values.
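The character restrictions and case-insensitivity described above can be illustrated with a small check. This is only a sketch: the function names `is_valid_charset_value` and `same_charset_name` are hypothetical and are not part of the Text Encoding Conversion Manager or of any Internet protocol library.

```python
import re

# Characters the text above permits in a charset value:
# A-Z, a-z, 0-9, hyphen-minus, underscore, period, and colon.
_CHARSET_VALUE = re.compile(r"[A-Za-z0-9\-_.:]+")


def is_valid_charset_value(value: str) -> bool:
    """Return True if value is non-empty and uses only the permitted characters."""
    return _CHARSET_VALUE.fullmatch(value) is not None


def same_charset_name(a: str, b: str) -> bool:
    """Compare two charset values without regard to case."""
    return a.lower() == b.lower()
```

For instance, `is_valid_charset_value("ISO-8859-1")` is true, while a value containing a space is rejected, and `same_charset_name("Shift_JIS", "shift_jis")` treats the two spellings as the same name.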

The character encoding name may be an experimental name beginning with x-; if it is not an experimental name, it must be a name registered with the Internet Assigned Numbers Authority (IANA) that corresponds to a character encoding with a formal specification. Most character encodings in the registry have more than one name. The IANA registry is updated periodically; for example, the name EUC-JP was added to it in January.

Table C-1 identifies character encodings for various languages, gives some of their common Internet names, and tells when each character encoding was first supported by the Text Encoding Converter and the Unicode Converter. To preview the style of character set name used on the Internet, here are a few sample names:

ISO-8859-1 latin1 UNICODE-1-1-UTF-7 Shift_JIS X-EUC-CN
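As a rough illustration of the experimental-name rule, the sketch below separates the sample names above into experimental and non-experimental names. The helper `is_experimental_name` is a hypothetical name chosen for this example; checking a non-experimental name against the actual IANA registry is not attempted here.

```python
def is_experimental_name(name: str) -> bool:
    """Return True if the name carries the experimental x- prefix.

    The comparison ignores case, because charset values are case-insensitive.
    """
    return name.lower().startswith("x-")


# The sample names listed above.
samples = ["ISO-8859-1", "latin1", "UNICODE-1-1-UTF-7", "Shift_JIS", "X-EUC-CN"]

experimental = [name for name in samples if is_experimental_name(name)]
non_experimental = [name for name in samples if not is_experimental_name(name)]
```

Of the five samples, only X-EUC-CN is classified as experimental; the others would have to be checked against the registry.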

Many of the character encodings in use on the Internet are not registered with IANA and do not have official Internet names, although they may have names that have become de facto standards. Moreover, even when an encoding is registered, the name specified by IANA may not be the one that is actually used on the Internet. For example, EUC-JP has been registered for some time with the unwieldy name Extended_UNIX_Code_Packed_Format_for_Japanese, but the name actually used is the unofficial X-EUC-JP. Similarly, Shift_JIS is an official name, but the names commonly used in its stead are x-shift-jis and x-sjis. In many cases, mail and browser software recognizes only the unofficial names, not the official ones.
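Because official and unofficial names coexist, software that reads charset values typically needs some form of alias table. The sketch below uses only the names mentioned in the paragraph above; the table `_ALIASES`, the function `canonical_charset`, and the choice of EUC-JP and Shift_JIS as the canonical labels are assumptions made for illustration, not a description of how the Text Encoding Converter resolves names.

```python
from typing import Optional

# Hypothetical alias table. Keys are lowercase because charset values are
# case-insensitive; the canonical labels on the right are this example's choice.
_ALIASES = {
    "euc-jp": "EUC-JP",
    "x-euc-jp": "EUC-JP",
    "extended_unix_code_packed_format_for_japanese": "EUC-JP",
    "shift_jis": "Shift_JIS",
    "x-shift-jis": "Shift_JIS",
    "x-sjis": "Shift_JIS",
}


def canonical_charset(name: str) -> Optional[str]:
    """Map an official or unofficial name to one label, or None if unknown."""
    return _ALIASES.get(name.strip().lower())
```

With this table, `canonical_charset("X-SJIS")` and `canonical_charset("Shift_JIS")` resolve to the same label, and an unrecognized name returns `None` so the caller can decide how to handle it.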

In some cases, the names for unregistered encodings follow a pattern established by other, registered encodings. For example, some IBM/Microsoft code pages are registered with names consisting of cp followed by the code page number: cp437, cp850, cp852. Code page 874 is not registered, but by this pattern the name cp874 would be expected. Most Windows code pages are registered with names consisting of windows- followed by the code page number: windows-1250, windows-1251. Windows Latin-1 is, oddly enough, not registered as either windows-1252 or cp1252, although both forms are in use.
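The two naming patterns described above (cp plus a number, and windows- plus a number) lend themselves to a simple pattern match. The following is a minimal sketch under that assumption; `code_page_number` is a hypothetical helper, and it says nothing about whether a given name is actually registered.

```python
import re
from typing import Optional

# Matches "cp" or "windows-" followed by a decimal code page number,
# ignoring case (for example cp437, CP874, windows-1252).
_CODE_PAGE = re.compile(r"(?:cp|windows-)([0-9]+)", re.IGNORECASE)


def code_page_number(name: str) -> Optional[int]:
    """Return the code page number embedded in the name, or None."""
    match = _CODE_PAGE.fullmatch(name)
    return int(match.group(1)) if match else None
```

Both `code_page_number("windows-1252")` and `code_page_number("cp1252")` yield 1252, which reflects the point that either form may appear for the same code page.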


© 1999 Apple Computer, Inc. – (Last Updated 13 Dec 99)
