Khmer Encoding Structure

The main goal of this encoding revision is to ensure one visual form maps to one encoding, avoiding multiple encodings for the same visual representation. This prevents user confusion, typing inconsistencies, and challenges in searching and sorting.

Why Do We Need Consistency in Encoding?
The current situation for Khmer is that users are expected to have a relatively deep understanding of how Khmer is encoded in Unicode in order to type correctly. They are expected to type coengs, vowels, and other diacritics and signs in the right order, with nothing to help them visually.
To make matters worse, there is often more than one way to encode a Khmer phrase and have it render perfectly, yet only one of them is the “correct” encoding. For those who are very experienced in how the Khmer script is stored in Unicode and who have strong linguistic awareness, this poses no problem. For many other users, it does.
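As a small illustration of the problem, the sketch below builds two different code point sequences that may look the same in some fonts: the single vowel sign U+17BE versus the pair U+17C1 followed by U+17B8. The choice of this pair and the helper names `code_points` and `same_encoding` are assumptions made for this example. It also shows that standard Unicode NFC normalization does not unify the two.

```python
import unicodedata

KA = "\u1780"  # KHMER LETTER KA

# Two encodings that may be drawn alike by some fonts (an assumption;
# actual rendering depends on the font and shaping engine).
form_a = KA + "\u17BE"             # KA + VOWEL SIGN OE
form_b = KA + "\u17C1" + "\u17B8"  # KA + VOWEL SIGN E + VOWEL SIGN II


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def same_encoding(a, b):
    """Compare two strings after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(form_a))
print(code_points(form_b))
print(same_encoding(form_a, form_b))
```

Because the two strings stay different after NFC, a search for one would not find the other unless a Khmer-specific normalization step is applied first.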

Mitigations

One Visual Form, One Encoding
If there is only one way to encode something, it is easier to build a system that works with that one way. Thus, if there is only one way to store a word, an input method can accept the various orders in which a user might type that word and normalize them into the single correct encoding. If, on the other hand, there are multiple ways to encode a word, the input method cannot do this, and the user is expected to resolve the ambiguity and pick the right ordering.
In the technical context where a keyboard only allows a user to type single code points (or short sequences), there has been no other option. Users want to be able to type in a variety of orders and therefore have made use of the existing ambiguities in the encoding. But modern keyboard applications are far more sophisticated and are capable of allowing different typing orders and outputting a single correct order.
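The idea of accepting several typing orders and emitting one stored order can be sketched as follows. This is a minimal illustration, not the actual keyboard logic: the ordering used here (coeng pairs, then dependent vowels, then other signs) is a simplified assumption, and the names `normalize_order` and `reorder_cluster` are hypothetical.

```python
import re

COENG = "\u17D2"
CONSONANT = "[\u1780-\u17A2]"
VOWEL = "[\u17B6-\u17C5]"   # dependent vowel signs
SIGN = "[\u17C6-\u17D1]"    # other signs (simplified set)

# A base consonant followed by any mix of coeng pairs, vowels and signs.
CLUSTER = re.compile(f"({CONSONANT})((?:{COENG}{CONSONANT}|{VOWEL}|{SIGN})+)")
PART = re.compile(f"{COENG}{CONSONANT}|{VOWEL}|{SIGN}")


def rank(part):
    """Assumed storage order: coeng pairs, then vowels, then signs."""
    if part.startswith(COENG):
        return 0
    if re.fullmatch(VOWEL, part):
        return 1
    return 2


def reorder_cluster(match):
    base, tail = match.group(1), match.group(2)
    parts = PART.findall(tail)
    parts.sort(key=rank)  # stable sort keeps the typed order within a class
    return base + "".join(parts)


def normalize_order(text):
    """Rewrite every cluster in text into the assumed storage order."""
    return CLUSTER.sub(reorder_cluster, text)


# The same word typed in two different orders yields one stored sequence.
typed_vowel_first = "\u1780\u17B6\u17D2\u179A"  # KA, AA, COENG, RO
typed_coeng_first = "\u1780\u17D2\u179A\u17B6"  # KA, COENG, RO, AA
print(normalize_order(typed_vowel_first) == normalize_order(typed_coeng_first))
```

A real input method would need the full syllable structure and would run this kind of reordering as the user types, but the principle is the same: many input orders, one output order.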

Tools
With a well-formed regular expression that defines the orthographic syllable structure, we can put the idea of “one visual form, one encoding” into practice. Fonts should expose invalid character sequences so that users can notice and correct their mistakes manually, or keyboards should be smart enough to transform an invalid string into a valid one. Developers can use a normalizer that converts Modern Khmer text to the encoding defined previously while keeping it visually identical to the input text. If the original contains a spelling error (for example, inappropriate multiple vowels), the normalizer fixes the error and returns only the corresponding valid strings.
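A regex-based well-formedness check of this kind might look like the sketch below. The syllable pattern is a deliberately simplified assumption (a base, up to two coeng pairs, at most one dependent vowel, then any signs) and is not the full structure defined by the revision; `invalid_runs` is a hypothetical helper name.

```python
import re

BASE = "[\u1780-\u17B3]"        # consonants and independent vowels
CONSONANT = "[\u1780-\u17A2]"
COENG = "\u17D2"
VOWEL = "[\u17B6-\u17C5]"       # dependent vowel signs
SIGN = "[\u17C6-\u17D1]"        # other signs (simplified set)

# Simplified syllable: base, up to two coeng pairs, one vowel, any signs.
SYLLABLE = re.compile(
    f"{BASE}(?:{COENG}{CONSONANT}){{0,2}}{VOWEL}?{SIGN}*"
)
KHMER_RUN = re.compile("[\u1780-\u17D3]+")


def invalid_runs(text):
    """Return the Khmer runs in text that are not a sequence of valid syllables."""
    bad = []
    for run in KHMER_RUN.finditer(text):
        chunk = run.group()
        pos = 0
        while pos < len(chunk):
            match = SYLLABLE.match(chunk, pos)
            if match is None:
                bad.append(chunk)
                break
            pos = match.end()
    return bad


print(invalid_runs("\u1780\u17B6"))        # one vowel: accepted
print(invalid_runs("\u1780\u17B6\u17B8"))  # two vowels: reported
```

A font could use a check like this to decide when to show a visible error marker, and a normalizer could use it to confirm that its output contains only valid sequences.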

Khmer Busra Font
Ensures that the same word does not render identically under different encodings.

Khmer Angkor Keyboard
Eases the typing experience and ensures consistent output.

Normalizer Code
Normalizes text data and ensures there are no ambiguous strings in the mix.
