Important: This document is part of the Legacy section of the ADC Reference Library. This information should not be used for new development.

Current information on this Reference Library topic can be found here:

How to Construct Word-Break Tables

This technical note describes how to construct auxiliary break tables for use with the FindWord routine in the Script Manager.

[Nov 01 1987]






Constructing break tables

The FindWord algorithm finds word boundaries by determining where words should not be broken. For example, "re-do" is one word: it should not be broken at the hyphen. In other words, a sequence of the form (letter, hyphen, letter) should not be broken between the first and second characters or between the second and third. This is called a continuation sequence. The algorithm used by the FindWord routine allows for continuation sequences of lengths one, two, and three. Examples of sequences of length two include (letter, letter) and (number, number). For a length of one, there is only one sequence, consisting of the characters of type nonBreaking: these characters are never separated from preceding or following characters.

For most scripts, this information about continuation sequences is packed into a table for use by the FindWord algorithm. (For complex scripts like Japanese, a different algorithm is used for portions of the script.) The default break table for a given script can be overridden by passing a user-specified breakTable parameter, but such an override should only be used for known scripts. That is, before supplying a breakTable of its own, the program should first check the script of the current font.

A break table consists of two sections, a 256 byte character type table followed by a character triple table.

[Figure: layout of a break table, with the Character Type Table followed by the Character Triple Table]

The character type table is indexed by the character's ASCII code and contains one type value for each character. The character types in the table are limited to values between 1 and 31. There are two distinguished values. The type nonBreaking (= 1) indicates that the character is non-breaking; it always continues a word. The type wild (= 0) never appears in the character type table; it is used only in the character triple table, where it indicates that the character in that position may or may not break, as described below. Otherwise, the choice of numbers to represent character types is completely arbitrary.

For example, the following MPW Assembler listing defines character types for use in a word-selection break table, then sets up a character type table using an assembly macro (setByte) to store character type values in an array. (Note that the character types could have been defined with equate definitions (EQU), rather than using the record structure.) Writing the setByte macro is left as an exercise for the reader. Note that break is the default type for every character that is not set explicitly. This value is not distinguished, but it should not appear in any continuation sequence.

;============================================================
charWordRec        record        0
wild            ds.b        1    ; constant! not in char table.
nonbreak        ds.b        1    ; constant! non-breaking space.
letter            ds.b        1    ; letters.
number            ds.b        1    ; digits.
break            ds.b        1    ; always breaks.
midLetter        ds.b        1    ; a'a.
midLetNum        ds.b        1    ; a'a 1'1.
preNum            ds.b        1    ; $, etc.
postNum        ds.b        1    ; %, etc.
midNum            ds.b        1    ; 1,1.
preMidNum        ds.b        1    ; .1234.
blank            ds.b        1    ; spaces and tabs.
cr            ds.b        1    ; add carriage return
            endr
;============================================================
    with        charWordRec
wordTable
    dcb.b        256,break
    setByte    wordTable,nonBreak,$ca
    setByte    wordTable,letter,('A','Z'),('a','z'),('Ä','Ü')
    setByte    wordTable,letter,'Æ','Ø','æ','ø',('À','œ'),'Ÿ'
    setByte    wordTable,midLetter,'-'
    setByte    wordTable,midLetNum,$27,'’'
    setByte    wordTable,number,('0','9')
    setByte    wordTable,preNum,'$','¢','£','¥'
    setByte    wordTable,postNum,'%'
    setByte    wordTable,midNum,','
    setByte    wordTable,preMidNum,'.'
    setByte    wordTable,blank,$00,' ',$09
    setByte    wordTable,cr,$0d
    endWith
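
Since setByte is not shown, the following Python sketch illustrates one way the same character type table could be generated. It is a minimal illustration, not the note's macro: set_byte, NAMES, and T are hypothetical names, the numeric type codes simply follow the field order of charWordRec, and only the ASCII entries are filled in to avoid assuming a particular encoding for the accented characters.

```python
# Type codes follow the field order of charWordRec (wild = 0, nonbreak = 1, ...).
NAMES = ["wild", "nonbreak", "letter", "number", "break", "midLetter",
         "midLetNum", "preNum", "postNum", "midNum", "preMidNum", "blank", "cr"]
T = {name: code for code, name in enumerate(NAMES)}


def set_byte(table, char_type, *items):
    """Hypothetical stand-in for setByte.

    Each item is either a single character code or a (first, last) range.
    """
    for item in items:
        first, last = item if isinstance(item, tuple) else (item, item)
        for code in range(first, last + 1):
            table[code] = char_type


# Equivalent of "dcb.b 256,break": every character defaults to break.
word_table = bytearray([T["break"]] * 256)
set_byte(word_table, T["nonbreak"], 0xCA)
set_byte(word_table, T["letter"], (ord("A"), ord("Z")), (ord("a"), ord("z")))
set_byte(word_table, T["midLetter"], ord("-"))
set_byte(word_table, T["midLetNum"], 0x27)
set_byte(word_table, T["number"], (ord("0"), ord("9")))
set_byte(word_table, T["preNum"], ord("$"))
set_byte(word_table, T["postNum"], ord("%"))
set_byte(word_table, T["midNum"], ord(","))
set_byte(word_table, T["preMidNum"], ord("."))
set_byte(word_table, T["blank"], 0x00, ord(" "), 0x09)
set_byte(word_table, T["cr"], 0x0D)
```

Any character that is never passed to set_byte keeps the default type break, which mirrors the behavior of the assembly listing.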

The character triple table is a coded representation of a list of continuation sequences. It consists of a length word followed by a list of triples, each packed into one 16-bit word. The length word contains the number of triples minus one. Each triple contains three character types, each of which is either a type from the character type table or the special type wild (= 0). The three types in a triple are packed into fields of five bits apiece, with the most significant bit of the word cleared. The first type in the triple occupies the leftmost field.
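
The packing can be sketched in Python as follows. This is an illustration under one stated assumption: "leftmost" is read as the most significant five-bit field (bits 14-10), so the third type lands in bits 4-0 and bit 15 stays clear. The function names pack_triple and unpack_triple are hypothetical.

```python
def pack_triple(first, middle, last):
    """Pack three 5-bit character types into one 16-bit word (bit 15 clear)."""
    for value in (first, middle, last):
        if not 0 <= value <= 31:
            raise ValueError("character types must fit in five bits")
    return (first << 10) | (middle << 5) | last


def unpack_triple(word):
    """Recover the three character types from a packed word."""
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F


# Example with letter = 2 and midLetter = 5 (values taken from the record order).
word = pack_triple(2, 5, 2)
```

With these values, word is $08A2, and unpack_triple(word) returns the original three types.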

A continuation sequence of length three (xyz) is represented by entering three triples into the triple list: xyz, *xy, and yz* (where '*' stands for the type wild, which is always zero).

[Figure: a sequence of length three and its permutations with the wild card]

A continuation sequence of length two (xy) is represented by entering two triples into this list: *xy, and xy*. A continuation sequence of length one has no entry in the triple list: the character type is simply nonBreaking.

[Figure: a sequence of length two and its wild card permutations]

Note that the type wild cannot appear as the middle element of a triple. The words in the triple table must be sorted in ascending numerical order for future compatibility.
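
The expansion and sorting rules above can be sketched in Python. This is a minimal illustration, not a definitive implementation: expand_sequence and build_triple_words are hypothetical names, the bit layout assumes the first type occupies the most significant field, and dropping duplicate triples is an assumption (the note does not say whether duplicates are permitted).

```python
WILD = 0  # the special type wild


def pack_triple(first, middle, last):
    """Pack three 5-bit types into one word, first type in the top field."""
    return (first << 10) | (middle << 5) | last


def expand_sequence(seq):
    """Return the triples that represent one continuation sequence."""
    if len(seq) == 3:
        x, y, z = seq
        return [(x, y, z), (WILD, x, y), (y, z, WILD)]
    if len(seq) == 2:
        x, y = seq
        return [(WILD, x, y), (x, y, WILD)]
    raise ValueError("only sequences of length two or three have triples")


def build_triple_words(sequences):
    """Expand all sequences, drop duplicates (an assumption), sort ascending."""
    triples = set()
    for seq in sequences:
        triples.update(expand_sequence(seq))
    return sorted(pack_triple(*t) for t in triples)


LETTER, MID_LETTER = 2, 5  # values taken from the record order
words = build_triple_words([(LETTER, LETTER), (LETTER, MID_LETTER, LETTER)])
```

Because wild is only ever placed in the first or last position by expand_sequence, the middle field of every generated word is nonzero, which matches the rule that wild cannot be the middle element.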

The following is an example of how a character triple table could be coded. The defSeq macro takes a continuation sequence as a parameter, and enters a set of triples into an internal array. The dumpSeq macro sorts the triples, and stores them in the proper order with dc.w commands. Once again, writing the macros defSeq and dumpSeq is left as an exercise for the reader.

;============================================================
    with        charWordRec
    defSeq        letter,letter
    defSeq        letter,preMidNum,letter
    defSeq        letter,midLetter,letter
    defSeq        letter,midLetNum,letter

    defSeq        number,number
    defSeq        number,letter
    defSeq        number,midNum,number
    defSeq        number,midLetNum,number
    defSeq        number,preMidNum,number
    defSeq        number,postNum
    defSeq        preNum,number
    defSeq        preMidNum,number

    defSeq        blank,blank
    defSeq        blank,cr
    endWith

;============================================================
    dc.w    ((wordEnd-wordBegin)/2)-1    ; length word.
wordBegin
    dumpSeq
wordEnd
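
For comparison, the complete break table (type table, length word, sorted triples) could be assembled as bytes in Python. This is a sketch under stated assumptions: assemble_break_table is a hypothetical name, words are written big-endian as on the 68000, and an empty triple list is rejected because this sketch does not attempt to handle that case.

```python
import struct


def assemble_break_table(type_table, triple_words):
    """Concatenate the 256-byte type table, the length word, and the triples."""
    if len(type_table) != 256:
        raise ValueError("the character type table must be exactly 256 bytes")
    if not triple_words:
        raise ValueError("this sketch does not handle an empty triple list")
    words = sorted(triple_words)                  # ascending numerical order
    header = struct.pack(">H", len(words) - 1)    # number of triples minus one
    body = struct.pack(">%dH" % len(words), *words)
    return bytes(type_table) + header + body


# Example: every character is break (4); two triples for (letter, letter).
table = assemble_break_table(bytes([4]) * 256, [2112, 66])
```

The length word here plays the same role as the dc.w expression in the listing: with two triples it holds the value 1.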

A series of blanks should generally select as a single word. Make certain, however, that a carriage return does not continue a word to the right (this is why it has a character type separate from blank); otherwise, word selection and wrapping do not work properly across paragraphs.
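
One way to guard against this mistake is to check the list of continuation sequences before building the table. The Python sketch below is only an illustration: continues_right is a hypothetical helper, and it interprets "continues a word to the right" as the type appearing anywhere other than the last position of a continuation sequence.

```python
def continues_right(sequences, char_type):
    """Return the sequences in which char_type is followed by another type."""
    return [seq for seq in sequences if char_type in seq[:-1]]


# The blank-related sequences from the listing above, by name.
good = [("blank", "blank"), ("blank", "cr")]

# A deliberately wrong list, in which a carriage return continues into a blank.
bad = good + [("cr", "blank")]

problems_in_good = continues_right(good, "cr")
problems_in_bad = continues_right(bad, "cr")
```

For the sequences in the listing, problems_in_good is empty; the extra sequence in bad is the only one reported.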

Extensions

The values 16-31 in the character type table entry for null ($00) (the first byte in the character type table) are reserved by Apple for future expansion. The use of one of these values indicates the presence of a supplementary table after the triple table.
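
A table generator can check for accidental use of the reserved range. The following Python sketch is illustrative only; uses_reserved_null_type is a hypothetical name, and the check simply tests the first byte of the table against the range stated above.

```python
def uses_reserved_null_type(break_table):
    """True if the type stored for null ($00) is in the reserved range 16-31."""
    return 16 <= break_table[0] <= 31


# Example: a type table whose null entry is blank (11), as in the listing.
ordinary = bytes([11]) + bytes([4]) * 255
reserved = bytes([16]) + bytes([4]) * 255
```

Here uses_reserved_null_type(ordinary) is False and uses_reserved_null_type(reserved) is True, so a generator could refuse to emit the second table.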

References

The Script Manager
