Monday, 30 April 2012

Unicode normalization forms in Android and iOS

1. Introduction
English is one of the easiest languages for computers to represent, because its words have few diacritics and no special letters such as the Spanish "ñ".

But many other languages have special characters that computers do not always represent correctly.



The example above shows how Vietnamese text is rendered in Android. You don't have to know Vietnamese to notice that something went wrong: some characters carry more than one accent, some have a horizontal line across the top, etc.

One solution to this problem is Unicode normalization, which replaces "equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text." (Wikipedia - Unicode equivalence)

Unicode defines two notions of equivalence between code point sequences: canonical equivalence and compatibility. The first covers sequences that have the same appearance and meaning when printed or displayed, while the second covers sequences that have the same meaning in some contexts but may look different.

For each notion, the text can be fully composed, where multiple code points are replaced by a single code point wherever possible, or fully decomposed, where single code points are split into multiple ones.

So, there are four Unicode normalization forms:
  • NFD:    Normalization Form Canonical Decomposition
  • NFC:    Normalization Form Canonical Composition
  • NFKD: Normalization Form Compatibility Decomposition
  • NFKC: Normalization Form Compatibility Composition
For more information, see this report from the Unicode Consortium:
http://unicode.org/reports/tr15/ and this Wikipedia article:
http://en.wikipedia.org/wiki/Unicode_equivalence
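
To make the difference between the four forms concrete, here is a minimal sketch that prints the code points produced by each form. The class name NormalizationForms, the helper methods and the sample input are only illustrative assumptions; the only library calls used are from java.text.Normalizer.

```java
import java.text.Normalizer;

public class NormalizationForms {
    // Formats the code points of s as "U+XXXX U+XXXX ..." so they can be inspected.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Normalizes input to the given form and returns its code points as text.
    static String describe(String input, Normalizer.Form form) {
        return codePoints(Normalizer.normalize(input, form));
    }

    public static void main(String[] args) {
        // Sample input (an assumption for this sketch):
        // U+00E9 is a precomposed "é", U+FB01 is the "fi" ligature.
        String input = "\u00e9\ufb01";
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + describe(input, form));
        }
    }
}
```

With this input, the canonical forms leave the ligature untouched (NFD gives U+0065 U+0301 U+FB01), while the compatibility forms also split it into "f" and "i" (NFKC gives U+00E9 U+0066 U+0069).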

2. Normalization in Android
In Android, texts come from two places:
  • Statically, from an XML file in "res/values/", where each text has the format: <string name="app_name">SampleAppName</string>
  • Dynamically, because the text can change at run time or is unknown before the app runs, for example when it comes from a server.
In the first case, the text can be normalized beforehand with a dedicated tool, for example charlint by the W3C.

In the second case, Android works the same way as Java, through the class java.text.Normalizer. This class provides two static methods:
  • boolean isNormalized(CharSequence src, Normalizer.Form form)
    • This method checks whether a char sequence is already normalized to a specific normalization form.
  • String normalize(CharSequence src, Normalizer.Form form)
    • This method normalizes a char sequence to a specific normalization form.
The second parameter of both methods is a Normalizer.Form, which is documented here:
http://developer.android.com/reference/java/text/Normalizer.Form.html

Here is an example of usage:

String textNormalized = Normalizer.normalize(text, Normalizer.Form.NFD);

Source: Stack Overflow
Source2: Daniel Lew's Coding Thoughts
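
As a slightly longer sketch of how the two methods can work together, the code below normalizes text only when isNormalized reports that it is needed, and then compares a precomposed and a decomposed spelling of the same Vietnamese word. The class name NormalizedCompare and the helpers toNfc and canonicallyEqual are hypothetical names chosen for this example, and NFC is just one reasonable choice of target form.

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Returns text in NFC, skipping the conversion when it is already normalized.
    static String toNfc(CharSequence text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text.toString();
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Two strings are treated as equal if their NFC forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        return toNfc(a).equals(toNfc(b));
    }

    public static void main(String[] args) {
        // "Việt" with a single precomposed code point U+1EC7.
        String composed = "Vi\u1ec7t";
        // The same word as "e" + combining dot below + combining circumflex.
        String decomposed = "Vie\u0323\u0302t";

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true
    }
}
```

Text that arrives from a server can be passed through a helper like toNfc once, right after it is received, so that the rest of the application always sees the same form.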

3. Normalization in iOS
In Objective-C, normalization is provided by the class NSString.
The methods are:
  • - (NSString *)decomposedStringWithCanonicalMapping
    • Equivalent to NFD in Java/Android
  • - (NSString *)decomposedStringWithCompatibilityMapping
    • Equivalent to NFKD in Java/Android
  • - (NSString *)precomposedStringWithCanonicalMapping
    • Equivalent to NFC in Java/Android
  • - (NSString *)precomposedStringWithCompatibilityMapping
    • Equivalent to NFKC in Java/Android
Here is an example of the usage:

NSString *textNormalized = [text decomposedStringWithCanonicalMapping];

Source: JongAm's blog
