Chinese Character Codes
What makes this world extremely interesting is the variety
of standards. Nobody seems willing to settle on a unified
way of doing things. We can see this in the languages that
we speak, the food we eat, the houses that we build and many
other examples.
There are several ways of representing Chinese characters on
a computer. The following is a list of existing standards. I
have tabulated the Chinese codes for easy referencing.
Unicode
This character encoding standard defines 20902 CJK characters.
The advantage of using this standard is that you can display
Simplified Chinese characters, Traditional Chinese characters,
Korean characters and Japanese characters on the same HTML
page. No other encoding standard supports that at the
moment.
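As a small illustration, the figure of 20902 matches the size of
the code point range U+4E00 to U+9FA5, which I assume is the block
meant here. The Python sketch below counts that range and builds
one string mixing the four scripts. The helper name is
hypothetical, not part of any library.

```python
# Assumed range of the basic CJK ideograph block (U+4E00..U+9FA5).
CJK_FIRST = 0x4E00
CJK_LAST = 0x9FA5


def is_basic_cjk(ch):
    """Return True if ch is one character inside U+4E00..U+9FA5."""
    return len(ch) == 1 and CJK_FIRST <= ord(ch) <= CJK_LAST


# Number of code points in the assumed range.
cjk_count = CJK_LAST - CJK_FIRST + 1

# One string holding a Simplified Chinese character, its Traditional
# form, Japanese kana and Korean hangul.
mixed = "体體かな한글"

# The whole string survives a round trip through a Unicode encoding.
round_trip = mixed.encode("utf-8").decode("utf-8")

# Only the two Chinese characters fall inside the ideograph range.
flags = [is_basic_cjk(c) for c in mixed]
```

Kana and hangul live in other Unicode ranges, which is why only
the first two flags come out true.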
GB Code
GB (Guo Biao) Code is defined by China. It is the encoding standard
used to represent Simplified Chinese characters. It defines
6763 Chinese characters (excluding all symbols). Countries
such as China, Singapore and Malaysia use this encoding
standard.
Every Chinese character is represented by a two-byte code. The
MSB (most significant bit) of both the first and second bytes is
set. Thus, they can be easily identified in documents that contain
both GB characters and regular ASCII characters.
GBK Code
The Chinese authorities soon realized that they could not ignore traditional
Chinese characters. Thus, they defined GBK (Guo Biao Kuozhan) to include all the
traditional Chinese characters defined in Big 5. GBK is claimed to be
synchronized with version 1.1 of the Unicode standard.
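One way to see the difference between GB and GBK is to ask each
codec whether it can encode a traditional character. The sketch
below uses Python's built-in "gb2312", "gbk" and "big5" codecs.
The helper name is hypothetical, and I am assuming that the sample
character is absent from GB but present in the other two.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "体"
traditional = "體"   # traditional form of the character above

# The simplified form is covered by GB.
simplified_in_gb = can_encode(simplified, "gb2312")

# The traditional form is assumed to be missing from GB but
# covered by GBK and by Big 5.
traditional_in_gb = can_encode(traditional, "gb2312")
traditional_in_gbk = can_encode(traditional, "gbk")
traditional_in_big5 = can_encode(traditional, "big5")
```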
Big 5 Code
Big 5 is the character encoding standard most commonly used for
traditional Chinese characters. Regions and countries such as Taiwan,
Hong Kong and Malaysia use this encoding standard.
Every Chinese character is represented by a two-byte code. The first
byte ranges from 0xA1 to 0xF9, while the second byte ranges from
0x40 to 0x7E or from 0xA1 to 0xFE.
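Note that the MSB of the first byte is always set, whereas the second
byte may fall in the ASCII range (0x40 to 0x7E). Thus, in a
document that contains Chinese characters and regular ASCII characters,
the ASCII characters are still represented with a single byte, but the
text must be read from the start to tell a second byte from an ASCII
character.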
CNS-11643-1992
CNS-11643-1992 is sometimes referred to as Chinese Standard
Interchange Code. It is a Chinese character encoding standard
defined by Taiwan in 1992. It has 16 planes. Each plane contains
94*94 = 8836 locations. Each location is supposed to be filled
with a Chinese character. However, a lot of the locations are
left blank.
Every Chinese character is represented with two 7-bit
codes. Each 7-bit code is a printable ASCII character ranging from
0x21 to 0x7E. This implies that the first character in every plane
starts with code 0x2121.
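The arithmetic behind these numbers can be sketched as follows.
The 1-based "row" and "cell" position names and the function names
are my own assumptions for illustration. The only facts used are
the 94 by 94 layout and the 0x21 to 0x7E byte range given above.

```python
ROWS = 94
CELLS = 94


def position_to_code(row, cell):
    """Map a 1-based (row, cell) position in a plane to its two-byte code."""
    if not (1 <= row <= ROWS and 1 <= cell <= CELLS):
        raise ValueError("row and cell must be in 1..94")
    # Adding 0x20 moves 1..94 into the printable range 0x21..0x7E.
    return ((0x20 + row) << 8) | (0x20 + cell)


def code_to_position(code):
    """Inverse of position_to_code."""
    high, low = code >> 8, code & 0xFF
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError("both bytes must be in 0x21..0x7E")
    return high - 0x20, low - 0x20


locations_per_plane = ROWS * CELLS
first_code = position_to_code(1, 1)
last_code = position_to_code(94, 94)
```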
This encoding standard encompasses many more characters than
Unicode, GB or Big 5, although many of them are very rarely used.
However, this encoding scheme is less popular than Big 5. It
is used in the Chinese paging (pagers, beepers) industry,
which uses only the first plane due to memory constraints in
such devices.
Since characters in different planes may have the same
code, an escape sequence is necessary to switch between character
planes.
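To show why plane switching needs state, here is a minimal sketch
of a decoder that remembers the current plane. The escape format
used here (the ESC byte followed by one letter, 'A' for plane 1 up
to 'P' for plane 16) is purely hypothetical and made up for this
example. Real protocols define their own sequences.

```python
ESC = 0x1B   # lies outside 0x21..0x7E, so it cannot be part of a code


def decode_planes(data, default_plane=1):
    """Return (plane, code) pairs from a stream of two-byte codes.

    Hypothetical escape format: ESC followed by one letter,
    'A' selecting plane 1 through 'P' selecting plane 16.
    """
    plane = default_plane
    pairs = []
    i = 0
    while i < len(data):
        if i + 1 >= len(data):
            raise ValueError("truncated input")
        if data[i] == ESC:
            new_plane = data[i + 1] - 0x40
            if not 1 <= new_plane <= 16:
                raise ValueError("unknown plane selector")
            plane = new_plane
        else:
            pairs.append((plane, (data[i] << 8) | data[i + 1]))
        i += 2
    return pairs


# The same code 0x2121 appears twice but belongs to different planes.
stream = (b"\x21\x21"
          + bytes([ESC]) + b"B" + b"\x21\x21"
          + bytes([ESC]) + b"A" + b"\x44\x21")
pairs = decode_planes(stream)
```

Without the escape sequences the two 0x2121 codes in the stream
would be indistinguishable.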