LINFO

CJKV: A Brief Introduction

CJKV is a collective term for the Chinese, Japanese, Korean and Vietnamese languages with regard to their relationship to computers.

All of these languages have (or had) writing systems based entirely or partly on Chinese characters. The term CJK is also used, to refer to just Chinese, Japanese and Korean, because Chinese characters are not utilized for writing contemporary Vietnamese, but are employed only for historical, religious and cultural purposes.

Chinese characters (which are referred to as hanzi in Mandarin Chinese¹, kanji in Japanese and hanja in Korean) are the characters (i.e., symbols) that were originally developed to write the Chinese languages² at least 3,500 years ago and possibly several thousand years earlier³.

The terms CJKV and CJK are used mainly in the context of the internationalization and localization of software and communications. Internationalization refers to the addition of a framework for support for multiple languages and cultures. Localization refers to the adaptation of language, content and design to specific countries, regions or cultures.

Special Characteristics of CJKV Writing Systems

A special term exists for languages using Chinese characters because of their common characteristics and the fact that there are important differences between them and all other writing systems. It is also a consequence of the vast numbers of people that use such languages. They are spoken by roughly 1.5 billion people, or approximately a quarter of the world's population, and Mandarin Chinese, the most widely spoken language in the world, alone has more than 800 million native speakers (well in excess of twice the number of native speakers of English)⁴. Such characteristics and differences must be considered when developing software, mainly programs and user interfaces (e.g., keyboard input techniques and screen displays), that are to be used internationally.

These differences include (1) the existence of thousands of characters (while keyboards have generally been designed for only a few dozen characters), (2) the extreme complexity of many of the characters⁵, (3) the vertical, right to left layout frequently used by CJKV text, (4) the open-ended nature of the number of characters (i.e., there is no clearly defined upper limit), (5) the fact that there is no clearly defined order for the characters (in contrast to, for example, the alphabetic order for letters in English and other European languages), (6) the existence of multiple, conflicting versions of character sets and (7) the occasional creation by authors of new characters on the fly.

Huge Number of Characters

One of the most challenging tasks with regard to CJKV writing systems is coping with the vast number of characters. Whereas most languages written with the Roman alphabet have only a few dozen characters and most or all of them can be accommodated by the 256 character encodings (i.e., assigning a unique integer to represent each character) that are possible with a single byte, there are several tens of thousands of Chinese characters, and some estimates place the total number at more than 80,000⁶.
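The single-byte limit can be made concrete with a minimal Python sketch. The sample characters, and the choice of Latin-1 and UTF-8 as representative single-byte and multi-byte encodings, are assumptions made for illustration and are not taken from the text above.

```python
# One byte has 8 bits, so it can distinguish 2**8 = 256 values.
SINGLE_BYTE_VALUES = 2 ** 8


def fits_in_single_byte(ch: str) -> bool:
    """Return True if ch can be encoded in Latin-1, a typical single-byte encoding."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples: a Roman letter, an accented letter and three CJK characters.
samples = ["A", "é", "中", "語", "한"]

for ch in samples:
    # Print the code point rather than the character so the output is plain ASCII.
    print(f"U+{ord(ch):04X}", fits_in_single_byte(ch), len(ch.encode("utf-8")))
```

The two Roman-alphabet samples fit in a single-byte encoding, whereas each of the CJK samples is rejected by Latin-1 and occupies three bytes in UTF-8.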

Although only a relatively small percentage of the Chinese characters are commonly used, even this percentage represents a far larger absolute number than the characters used in non-CJKV languages. For example, it is widely said that only about 3,000 characters are needed for basic literacy in Chinese (such as to be able to read a newspaper), but that a well-educated person will be familiar with 4,000 or 5,000 characters. Likewise, a well-educated Japanese person may know upwards of 3,500 Chinese characters (in addition to the two phonetic syllabaries that are used to supplement them in modern Japanese).

However, many of the less commonly used characters still play an important role, and their absence could cause considerable inconvenience. Some of them are used mainly for place names, names of people and historical terms. Others, however, are merely obscure variations of more commonly used characters.

Differences Among CJKV Writing Systems

Adding to the complexity of dealing with the CJKV languages in a computer context is the fact that there are also some big differences among the writing systems of these languages despite the fact that they all use (or used, in the case of Vietnam and North Korea) Chinese characters.

These differences include (1) the number of characters that are used, (2) which characters are used, (3) the visual representations (i.e., glyphs) of the characters that are used, (4) whether the characters are used mixed together with other types of characters or not and (5) how the characters are used together with other types of characters.

For example, software intended for use with any of the Chinese languages requires the availability of tens of thousands of characters, even if most of them are rarely used, whereas software for Japanese and Korean generally needs a substantially smaller number (because of the extensive use of purely phonetic characters, which are frequently employed in place of rare Chinese characters).

Some of the characters that are used in Cantonese, which is the second most widely spoken Chinese language after Mandarin, are different from those that are commonly used in Mandarin, Japanese and Korean. Likewise, classical Vietnamese contains numerous characters that were created from standard components (referred to as radicals) of conventional Chinese characters but which are assembled in unique combinations. In addition, several hundred unique characters of this type (referred to as kokuji) were developed in Japan, some of which are still in common use in that country.

Adding to the complexity is the fact that there are often multiple versions of the same character. An example is the dichotomy of the so-called traditional characters and the simplified characters⁷. The former have been in use for thousands of years and are still standard in Taiwan, Hong Kong and many overseas Chinese communities. The latter were introduced in Mainland China several decades ago in an attempt to promote literacy.

The situation is still further complicated by the fact that the Chinese characters that are now used in Japan represent an intermediate simplification between the traditional Chinese characters and the extremely simplified versions of some characters that were introduced on the Chinese mainland. South Korea continues to use the traditional characters.

Another complication is the fact that modern Chinese is written using solely Chinese characters, as were the classical forms of all CJKV languages. (A major exception is that numbers in modern Chinese can be written with either Chinese characters or Arabic numerals, depending on the context.) In modern Japanese and Korean, however, text consists of a mixture of Chinese characters and indigenous phonetic characters (hiragana and katakana in Japanese and hangul in Korean)⁸.
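The mixing of scripts described above can be observed programmatically. The following is a rough sketch that classifies each character by the prefix of its Unicode character name, using Python's standard unicodedata module. The function names and the sample sentences are hypothetical, and the prefix test is a simplification (for instance, it does not count CJK compatibility ideographs as Chinese characters).

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by the prefix of its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    for prefix in ("HIRAGANA", "KATAKANA", "HANGUL"):
        if name.startswith(prefix):
            return prefix.lower()
    return "other"


def script_counts(text: str) -> dict:
    """Count how many characters of each script occur in text."""
    return dict(Counter(script_of(ch) for ch in text))


# Illustrative samples (assumptions, not taken from the article).
japanese = "私はカメラを買う"  # Chinese characters mixed with hiragana and katakana
korean = "한국어"              # hangul only
chinese = "我买照相机"          # Chinese characters only

for label, text in (("japanese", japanese), ("korean", korean), ("chinese", chinese)):
    print(label, script_counts(text))
```

Under these assumptions the Japanese sample contains three scripts, while the Chinese and Korean samples each contain only one.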

Adding still further to this already incredible complexity is the fact that the phonetic character systems used regularly in Japanese and Korean and for special purposes in Chinese (mainly in elementary education for teaching the standard characters) are all completely different from each other as well as from alphabets or syllabaries (i.e., phonetic writing systems consisting of symbols representing syllables) used in other languages. For example, the number of characters can be much larger, and they can be written both horizontally and vertically as well as in miniaturized versions alongside (rather than in sequence with) Chinese characters to indicate pronunciation (e.g., furigana in Japanese).

The Future of Chinese Character-based Writing Systems

In the middle decades of the twentieth century it was fashionable to argue that Chinese characters were obsolete and should be abolished in favor of Roman letters. In fact, the simplification of characters that began on the Chinese mainland in the 1950s was designed as a first step towards that goal.

One of the main reasons given for this position was the belief that Chinese characters were not compatible with modern technology such as typewriters and computers⁹. It was also felt that the large amount of time that was required to learn thousands of complex characters hindered the development of literacy and took time away from studying other subjects, such as science and foreign languages.

However, the situation has changed remarkably in recent decades, and today it is rare to hear any calls for the abolition of Chinese characters. One reason for this is that it has been found that such characters are indeed highly compatible with advanced technology. In particular, it is now easy for computers to store, retrieve and display virtually any number of unique characters as a result of dramatic improvements in the performance and reductions in the cost of memory, storage (e.g., hard disk drives) and microprocessors in recent decades. This has been accompanied by the development of highly efficient character input systems for use with keyboards resembling those used for English and other Western languages.

Another reason is that there has been a growing appreciation of the fact that learning Chinese characters, while time consuming, can offer a variety of educational, cultural and possibly psychological benefits. One of these is that it facilitates access to hundreds, or even thousands, of years of literature in its original form. Another is the preservation of traditional culture, in which the appreciation of the forms of characters plays an important role. There is also a widely held belief that the discipline of learning to read and write the traditional characters is important for brain development and acquiring a good aesthetic sense.

There is yet another powerful force that has helped to preserve the Chinese character-based writing system in China. It is the fact that it has been widely recognized in that country that such characters have been a major factor in the continuing struggle to hold the country together in the face of the strong centrifugal forces that have existed throughout most of Chinese history. This is because Chinese characters can be read (and understood) by literate Chinese speakers regardless of their language or dialect, despite the fact that the spoken languages (and even some dialects) are mutually unintelligible and the characters are often pronounced very differently in the different languages. Were the hanzi to be replaced by Roman letters, text written in the various languages and dialects would also become mutually unintelligible.

Thus, there is now little doubt that Chinese characters will remain central to the writing systems of roughly a quarter of the world's population for decades or centuries to come. Moreover, it is likely that their use will actually increase as educational standards continue to rise in China and large numbers of people move beyond just basic literacy.

Unicode and Character Encoding Issues

Clearly, one of the most challenging tasks with regard to maximizing the efficiency of the computerization of CJKV languages has been that of character encoding. This has been a tedious and controversial task, not only because of the vast numbers of characters involved but also because of such factors as differences in characters among the various languages using them and various cultural and political considerations.

The basic solution has been to represent each character by two or more bytes instead of a single byte. This results in an expansion of the character space from 256 to 65,536 with two bytes and to more than 16 million with three, which is more than sufficient to accommodate all characters in existing and extinct human languages (the great majority of which are Chinese characters). However, complications and controversy then arise as to how characters should be encoded in this space.
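The arithmetic behind the multi-byte approach is simple: each additional byte multiplies the number of distinguishable values by 256. The short Python sketch below works this out. The function name is hypothetical, and the comparison with the size of the Unicode code space (as exposed by Python's sys.maxunicode) is added for reference only.

```python
import sys


def code_space(num_bytes: int) -> int:
    """Number of distinct values representable with num_bytes bytes."""
    return 256 ** num_bytes


for n in (1, 2, 3, 4):
    print(n, code_space(n))
# 1 256
# 2 65536
# 3 16777216
# 4 4294967296

# Unicode code points run from 0 to sys.maxunicode (0x10FFFF) inclusive.
UNICODE_CODE_POINTS = sys.maxunicode + 1
print(UNICODE_CODE_POINTS)  # 1114112
```

Real encodings do not use every possible byte combination, so the usable space of any particular scheme is smaller than these upper bounds.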

A number of character encoding schemes have been developed by individual countries and organizations for East Asian languages, examples of which include Big5 (which is used by Taiwan and Hong Kong), GB2312 (a mainland Chinese standard for simplified characters), GB18030 (the newer mainland Chinese standard), Shift-JIS (a Japanese encoding developed by ASCII Corporation in conjunction with Microsoft) and ISO-2022-JP (widely used in Japan). Although these standards have been very successful for use within individual East Asian countries or regions, for internationalization of software to be efficient, it is necessary to have a single, standardized encoding scheme.
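The practical consequence of having several national schemes is that the same character is stored as different bytes in each, and bytes read with the wrong scheme turn into a different character. The sketch below illustrates this with the codecs that ship with Python (big5, gb2312, gb18030, shift_jis and iso2022_jp, plus utf-8 for comparison). The sample character and the variable names are assumptions chosen for illustration.

```python
# Codec names as registered in Python's standard library.
CODECS = ["big5", "gb2312", "gb18030", "shift_jis", "iso2022_jp", "utf-8"]


def encode_everywhere(ch: str) -> dict:
    """Encode one character with every codec in CODECS."""
    return {codec: ch.encode(codec) for codec in CODECS}


sample = "中"  # U+4E2D, a character present in all of the schemes listed
encoded = encode_everywhere(sample)

for codec, raw in encoded.items():
    # Show the bytes as hexadecimal digits.
    print(f"{codec:11} {raw.hex()}")

# Reading Big5 bytes as if they were GB2312 yields some other character.
misread = encoded["big5"].decode("gb2312", errors="replace")
print(misread == sample)  # False
```

Each codec decodes its own bytes back to the original character, but the byte sequences are not interchangeable, which is why a document must declare or otherwise convey which scheme it uses.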

The most publicized such scheme that has been developed to date is Unicode¹⁰, which is a system that attempts to provide a unique encoding for every character used by all of the world's languages, both existing and extinct. Unicode has become highly successful and is widely employed, although it is not without controversy. While the issues might seem trivial to people who have little contact with East Asian languages and culture, they are nevertheless a serious concern in that region.

Perhaps the biggest point of contention with regard to Unicode has been its use of Han unification¹¹. This refers to assigning a single code to each of the characters in the basic subset of Chinese characters that is used in multiple languages, even if there are minor differences in their glyphs according to the language. The result is a substantial reduction in the number of encodings as compared with the alternative of having a separate encoding for what appears to be an identical or very similar character for each of the languages in which it is used. The resulting character repertoire is sometimes referred to as Unihan.
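What unification means at the level of codes can be sketched in a few lines of Python. The characters chosen here are assumptions for illustration: 直 is a commonly cited example of a unified character whose preferred glyph differs between Chinese and Japanese typography, while 廣, 广 and 広 are the traditional Chinese, simplified Chinese and Japanese forms of one character, which differ enough that they were not unified.

```python
import unicodedata

# A unified character: one code, whatever the language of the surrounding text.
unified = "直"
print(f"U+{ord(unified):04X}", unicodedata.name(unified))

# Forms that were NOT unified: each has its own code.
variants = {
    "traditional Chinese": "廣",
    "simplified Chinese": "广",
    "Japanese": "広",
}
code_points = {label: ord(ch) for label, ch in variants.items()}

for label, cp in code_points.items():
    print(f"{label:20} U+{cp:04X}")
```

For the unified character, which glyph appears on screen is decided by the font and language settings rather than by the code itself, and this is the source of the objections that such characters can look incorrect to some readers.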

A major criticism of the use of the unified characters by some native speakers of those languages is that they do not look correct, because of minor differences in appearance according to the language. Another objection is that it can create a perception that the languages themselves are somehow unified as well, whereas, in reality, East Asian languages (and cultures) are unique and very diverse (e.g., Japanese and Korean have grammars that are completely unrelated to that of the Chinese languages, and the same character can have different meanings in different languages). The resistance to Han unification has been particularly strong on the part of some Japanese scholars.

The alternative is to have a separate encoding for each Chinese character in each language, regardless of whether the glyph is very similar or identical to that used in another language. This approach has become increasingly practical as storage and memory costs have continued to drop. And it is the approach that has been adopted by TRON (The Real-time Operating system Nucleus), an operating system that is used extensively in Japanese and other Asian products, mainly for embedded systems (i.e., computer circuitry and software built into other products), and which is thus claimed to be the most widely deployed operating system in the world¹².

________
¹Mandarin Chinese, which is indigenous to the area around Beijing, is the dominant Chinese language both in Mainland China and Taiwan.

²Multiple, related languages, as well as various dialects of each, are in use in China, along with a number of unrelated languages spoken by minority populations. Often it is not clear what is a dialect and what is a distinct language, although a rule of thumb is that distinct languages have evolved so far apart that they are no longer mutually intelligible in their spoken forms. The situation is analogous to Europe, which is dominated by a number of related languages (i.e., Indo-European languages), each with multiple dialects, but in which also survive several totally unrelated languages (i.e., Basque, Finnish and Hungarian). The differences among the main Chinese languages are very roughly comparable to the differences among the Romance languages (i.e., Spanish, French, Italian, Portuguese and Romanian).

³The record is far from clear regarding the age of Chinese characters, but discoveries of early forms carved on bones and tortoise shells indicate that they could be as old as 8,000 years. In any event, Chinese characters are probably the oldest surviving writing system. In contrast, cuneiform (which was used for writing several languages in the Middle East) was apparently developed roughly 5,500 years ago and the last known cuneiform text was written in 75 AD.

⁴The total number of native speakers of the various Chinese languages (excluding languages unrelated to Chinese that are used by minorities in China) may be roughly 1.2 billion. Estimates by various sources of the number of native speakers of English generally fall in the range of 320 to 330 million. Some estimates place English as the second most widely spoken language, while others put it behind Hindi and on a par with Spanish.

⁵Chinese characters are written as a series of pen or brush strokes and are classified according to the numbers of these strokes. The most commonly used character in Chinese (which means "of") contains eight strokes, and the average for Chinese text as a whole is likely close to this. The smallest number of strokes for a Chinese character is one, which applies to the character representing the number one, and it is basically just a horizontal line. Characters written with more than ten strokes are very common, and some frequently used characters have more than 20 strokes. The character with the largest number of strokes that is still in use in China has 57 strokes (and represents a type of noodle). The largest number of strokes for any character historically developed in China may have been 64, although an 84 stroke character was developed in Japan.

⁶The character count varies considerably according to the particular dictionary and its comprehensiveness. For example, the Kangxi Zidian, which was published in 1716 and was the standard Chinese dictionary during the eighteenth and nineteenth centuries, lists about 40,000 characters, whereas the modern Zhonghua Zihai contains in excess of 80,000 supposedly unique characters.

⁷The issue of which character set is best and should be the standard remains highly controversial. Proponents of the simplified characters claim that they promote literacy. However, opponents of those characters claim that they dumb down the language and cut off new generations from thousands of years of Chinese literature and tradition. They also point out that literacy rates in countries and regions that use the traditional characters are substantially higher than in those that use the simplified characters. For example, literacy is roughly 95 percent for Taiwan and 99 percent for Japan, as compared with perhaps 85 percent for mainland China (and the definition of what constitutes literacy might also be lower there).

⁸This is a result of the very different structure of the Japanese and Korean languages as compared with Chinese. For example, verbs in Japanese and Korean (which have very similar grammars and appear to be distantly related) employ a complex system of suffixes to indicate tense and the level of politeness, and it is much easier to write these suffixes with phonetic characters, because of their far fewer numbers of strokes, than using Chinese characters (which was done in the past).

⁹Possibly the first typewriter for Chinese characters was one for the Japanese language that was developed in 1915. It contained a large tray of characters, which were picked up and printed individually by a lever-like device. Chinese character typewriters were much bulkier (to accommodate the larger number of characters), more expensive and slower than typewriters for languages which use phonetic characters, and they also required the operator to be highly trained. Thus most documents were written by hand, including many official documents. This technology remained basically unchanged until the development of electronic word processors in 1978 and their proliferation in the 1980s. Word processors for CJK languages were very similar in size, weight, price and ease of use to those for non-CJK languages. Moreover, they allowed a large increase in the number of characters (from the maximum of about 3,000 for typewriters to more than 6,800 on standard word processor models).

¹⁰Unicode's web site is located at http://www.unicode.org. For a good FAQ about issues related to the use of Unicode for East Asian languages, see http://www.unicode.org/faq/han_cjk.html.

¹¹Han in Chinese refers to the majority ethnic group within China and the largest single human ethnic group in the world. Han Chinese constitute about 91 percent of the population of mainland China and about 19 percent of the total world population. A synonym is ethnic Chinese. In the context of Han unification, it means unification of Chinese characters.

¹²TRON, a real-time operating system kernel, was begun by Dr. Ken Sakamura, a professor of electrical engineering at the University of Tokyo, in 1984. It subsequently became the world's most widely deployed operating system because of its use in vast numbers of electric and electronic products manufactured by Japanese companies as well as by their overseas subsidiaries and by other companies that use their semiconductor devices. However, there is little likelihood that the TRON character encoding will be adopted to any great extent internationally for standalone computers, because of the dominant position of and general satisfaction with software (including operating systems, application programs and programming languages) that utilizes Unicode.