Precomposed character

A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that is encoded as a single code point but can also be expressed as a sequence of other characters. A precomposed character typically represents a letter with a diacritical mark, such as é (U+00E9 é LATIN SMALL LETTER E WITH ACUTE).

The same character can also be created using a sequence of code points, one for each of the parts that make up the character. For example, in Unicode terms, the letter é can be represented directly using U+00E9, or it can be decomposed into an equivalent string consisting of the base letter e (U+0065 e LATIN SMALL LETTER E) followed by the combining form of the acute accent (U+0301 ◌́ COMBINING ACUTE ACCENT). Similarly, precomposed ligatures are precompositions of their constituent letters or graphemes; an example is U+0133 ĳ LATIN SMALL LIGATURE IJ, used in Dutch.
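The relationship between the two forms can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the constant names and the code_points helper are illustrative and not part of any standard.

```python
import unicodedata

E_ACUTE = "\u00e9"        # precomposed: LATIN SMALL LETTER E WITH ACUTE
E_PLUS_ACUTE = "e\u0301"  # base letter e followed by COMBINING ACUTE ACCENT
IJ_LIGATURE = "\u0133"    # LATIN SMALL LIGATURE IJ


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The Unicode Character Database records a decomposition mapping per character.
print(unicodedata.decomposition(E_ACUTE))      # canonical mapping
print(unicodedata.decomposition(IJ_LIGATURE))  # compatibility mapping, tagged <compat>

# NFD decomposes canonically; NFC recomposes where a precomposed form exists.
print(code_points(unicodedata.normalize("NFD", E_ACUTE)))
print(code_points(unicodedata.normalize("NFC", E_PLUS_ACUTE)))

# The ij ligature only has a compatibility decomposition, so NFD leaves it
# unchanged, while NFKD splits it into the letters i and j.
print(code_points(unicodedata.normalize("NFD", IJ_LIGATURE)))
print(code_points(unicodedata.normalize("NFKD", IJ_LIGATURE)))
```

Note that the ligature behaves differently from é: its decomposition is a compatibility mapping rather than a canonical one, so only the compatibility normalization forms (NFKD, NFKC) replace it.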

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they were included for compatibility with early encoding systems such as the various components of ISO 8859 or other kinds of "extended ASCII". More recent Unicode policy has been to resist creation of new precomposed characters if the character can be produced using combining forms.

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in the two alternative ways: first with a precomposed Å (U+00C5) and ö (U+00F6), and then with a decomposed base letter A (U+0041) followed by a combining ring above (U+030A) and an o (U+006F) followed by a combining diaeresis (U+0308).

  1. Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
  2. Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

The two solutions are canonically equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, because they are not included in all computer fonts. To work around these problems, some applications simply attempt to replace the decomposed characters with the equivalent precomposed characters.
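Replacing decomposed sequences with their precomposed equivalents corresponds to Unicode normalization form NFC. The sketch below, using Python's standard unicodedata module, shows that the two spellings of Åström differ as raw code point sequences but compare equal once both are normalized. The function name canonically_equal is an illustrative choice.

```python
import unicodedata

PRECOMPOSED = "\u00c5str\u00f6m"   # Å s t r ö m (6 code points)
DECOMPOSED = "A\u030astro\u0308m"  # A + ring above, s t r, o + diaeresis, m (8 code points)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison looks at code points, so the two spellings differ.
print(PRECOMPOSED == DECOMPOSED)
print(len(PRECOMPOSED), len(DECOMPOSED))

# After normalization the strings are identical.
print(canonically_equal(PRECOMPOSED, DECOMPOSED))

# NFC turns the decomposed spelling into the precomposed one.
print(unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED)
```

Normalizing to NFD on both sides would work equally well for comparison; NFC is used here because it matches the replacement strategy described above.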

With an incomplete font, however, precomposed characters may also be problematic – especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):

  1. ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
  2. ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics on the first line may render as unrecognized characters, or their typographical appearance may differ markedly from that of the final letter n, which has no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.
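The separation between base letters and combining diacritics in the decomposed form can be made explicit in code. The following sketch decomposes the precomposed word with NFD and then splits it using unicodedata.combining, which returns a non-zero canonical combining class for combining marks. The function name split_base_and_marks is hypothetical, and the sketch only approximates what survives when a font lacks the combining marks; it does not model any particular renderer.

```python
import unicodedata

PIE_DOG = "\u1e31\u1e77\u1e53n"  # precomposed k, u and o with diacritics, then n


def split_base_and_marks(text: str) -> tuple[str, list[str]]:
    """Decompose text and separate base characters from combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    marks = [f"U+{ord(ch):04X}" for ch in decomposed if unicodedata.combining(ch) != 0]
    return base, marks


base, marks = split_base_and_marks(PIE_DOG)
print(base)   # the base letters alone
print(marks)  # the combining marks, in order of appearance
```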

OpenType has the ccmp "feature tag" to define graphemes that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with Chinese character description languages. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and would require more bytes of encoding per document. One particular challenge would be the many-to-many mapping between sequences of decomposed components and precomposed characters: one precomposed character may be decomposed into several different sets of components, while one set of components may combine into several different precomposed characters. There are no strict constraints on the relative position of components within a character, on the variant forms and transformations (narrowing, widening, stretching, rotation, etc.) applied to components, or on the number of times each component occurs.
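The many-to-many relationship can be illustrated with a toy table. The sketch below assumes Unicode Ideographic Description Sequences as the description language, which is only one of several possibilities, and the table itself is a small hand-written example for illustration, not data taken from any standard file. The names DECOMPOSITIONS, components and build_index are hypothetical.

```python
from collections import defaultdict

# Hand-written illustrative table: character -> possible description sequences.
# The leading symbols (such as U+2FF0 and U+2FF1) describe layout, not components.
DECOMPOSITIONS = {
    "杏": ["⿱木口"],             # 木 above 口
    "呆": ["⿱口木"],             # 口 above 木
    "想": ["⿱相心", "⿱⿰木目心"],  # 相 may itself be split into 木 and 目
}


def components(sequence: str) -> tuple[str, ...]:
    """Return the sorted components of a sequence, ignoring layout operators.

    Code points in the Ideographic Description Characters block
    (U+2FF0 to U+2FFF) are treated as operators. Sorting discards position
    but keeps multiplicity.
    """
    return tuple(sorted(ch for ch in sequence if not 0x2FF0 <= ord(ch) <= 0x2FFF))


def build_index(table: dict[str, list[str]]) -> dict[tuple[str, ...], set[str]]:
    """Map each bag of components to every character it can form."""
    index: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for char, sequences in table.items():
        for seq in sequences:
            index[components(seq)].add(char)
    return dict(index)


index = build_index(DECOMPOSITIONS)

# One bag of components, two different characters.
print(index[components("⿱木口")])

# One character, two different decompositions.
print([components(seq) for seq in DECOMPOSITIONS["想"]])
```

Because the same components can be arranged in different positions, a search index keyed on components alone cannot distinguish 杏 from 呆; the layout operators have to be taken into account as well.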
