Important: The information in this document is obsolete and should not be used for new development.
Characters are the atomic units of content for text data; they include letters, digits, punctuation, and symbols. A character is an abstract entity without any particular appearance. A coded character is a character together with its numeric representation in a particular coded character set (CCS).
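The distinction between a character and a coded character can be illustrated with a minimal Python sketch. It is only an illustration, not part of any API described in this document: the function name coded_values and the choice of three encodings are assumptions made for the example, and Python itself identifies the abstract character by its Unicode code point.

```python
# One abstract character, several coded characters.
CHARACTER = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

def coded_values(character, encodings=("latin-1", "mac_roman", "cp437")):
    """Return the single-byte value each encoding assigns to the character.

    Hypothetical helper for illustration; it assumes the character is
    representable as exactly one byte in every listed encoding.
    """
    return {name: character.encode(name)[0] for name in encodings}

for name, value in coded_values(CHARACTER).items():
    print(f"{name}: 0x{value:02X}")
```

The same abstract character receives a different numeric representation in each coded character set, which is why a numeric value is meaningful only together with the CCS that assigns it.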
A text element is a group of one or more characters that is treated as a single entity for a particular process such as collation, display, or transcoding. The way that characters are grouped into text elements depends on the process; each process may group characters differently.
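As a rough sketch of how two processes may group the same coded characters differently, the following Python example builds text elements for display and for collation. Both functions are hypothetical simplifications: the display rule only attaches combining marks to the preceding base character, and the collation rule assumes traditional Spanish sorting, in which "ch" and "ll" are treated as single units.

```python
import unicodedata

def display_elements(text):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch      # attach the mark to the previous element
        else:
            elements.append(ch)     # start a new element
    return elements

def collation_elements(text, digraphs=("ch", "ll")):
    """Group assumed digraphs (traditional Spanish) as single sorting units."""
    elements = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() in digraphs:
            elements.append(text[i:i + 2])
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements

word = "chico"
print(display_elements(word))     # five elements, one per character
print(collation_elements(word))   # four elements, "ch" grouped together
```

The input is identical in both calls; only the process, and therefore the grouping into text elements, differs.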
Glyph images are the visual elements used to represent characters; aspects of text presentation such as font and style apply to glyph images, not to characters. The mapping from a sequence of coded characters to a sequence of glyph images on a display device is complex. In general there is not a one-to-one mapping from character to glyph image; a particular glyph image may correspond to several characters (as in a ligature) or to only part of a character. Figure B-1 shows glyphs and their associated characters.
Figure B-1 Some glyph images for representing characters
A script is a collection of related characters, subsets of which are required to write a particular language. Some examples of scripts are Latin, Greek, Hiragana, Katakana, and Han. A writing system consists of a set of characters from one or more scripts that are used to write a particular language and the rules that govern the presentation of those characters. Punctuation, digits, and symbols that are shared across many writing systems can be considered as one or more separate pseudo-scripts. For example, the Japanese writing system includes a Kanji subset of Han characters, plus Hiragana, Katakana, some Latin, and various punctuation and symbols, some of which are specific to CJK (Chinese, Japanese, and Korean) or even just to Japanese, and some of which are more general.
The term presentation form is generally used to mean a kind of abstract shape that represents a standard way to display a particular character or group of characters in a particular context as specified by a particular writing system. The term glyph by itself may refer either to presentation forms or to glyph images. This appendix assumes the latter convention. Figure B-2 shows some examples of presentation forms.
The determination of what is a character in a CCS should be based on what is best for implementing the range of text processes for which that CCS will be used. The characters in a CCS need not correspond to what a user or linguist might consider a character. In fact, if the CCS will be used for more than one writing system, such a correspondence may be impossible, since each writing system has its own notion of what constitutes a natural character. Well-designed software should provide users with the behavior they expect or prefer, regardless of the details of the underlying character encoding, and without exposing users to those details.
Some character sets that were intended primarily for display using less sophisticated display software have encoded presentation forms as characters. For example, the DOS Arabic character set (code page 864) encodes Arabic contextual forms and ligatures instead of abstract letters.
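The relationship between encoded presentation forms and abstract letters can be sketched in Python. This example does not use code page 864 itself; it assumes the Unicode Arabic Presentation Forms-B block as a stand-in for a character set that encodes contextual forms, and it uses Unicode compatibility normalization (NFKC) as one possible way to fold such forms back to the abstract letter. The name abstract_letters is hypothetical.

```python
import unicodedata

# Four contextual presentation forms of the Arabic letter BEH
# (isolated, final, initial, medial) from Arabic Presentation Forms-B.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def abstract_letters(text):
    """Fold presentation forms to abstract characters using NFKC."""
    return unicodedata.normalize("NFKC", text)

for form in BEH_FORMS:
    folded = abstract_letters(form)
    print(unicodedata.name(form), "->", unicodedata.name(folded))
```

All four coded presentation forms fold to the single abstract letter U+0628. The same normalization also expands a ligature such as U+FB01 into its component letters, which shows that a presentation form may stand for more than one abstract character.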