LINFO

Characters: A Brief Introduction


Characters are the basic symbols that are used to write or print a language. For example, the characters used by the English language consist of the letters of the alphabet, numerals, punctuation marks and a variety of symbols (e.g., the ampersand, the dollar sign and the arithmetic symbols).

Characters are fundamental to computer systems. They are used (1) for input (e.g., through the keyboard or through optical scanning) and output (e.g., on the screen or on printed pages), (2) for writing programs in programming languages, (3) as the basis of some operating systems (such as Linux), which are largely collections of plain text (i.e., human-readable character) files, and (4) for the storage and transmission of non-character data (e.g., the transmission of images by e-mail using base64).
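As a minimal sketch of the fourth use, the following Python snippet encodes a few bytes of binary data as base64 text and decodes them again. The sample bytes (the first eight bytes of a PNG image file) and the variable names are chosen only for illustration.

```python
import base64

# Arbitrary binary data: the eight-byte signature that begins every PNG file.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# base64 maps each group of three input bytes to four characters drawn
# from a 64-character subset of ASCII, padding the end with "=".
encoded = base64.b64encode(raw).decode("ascii")

# Decoding the text recovers the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)         # iVBORw0KGgo=
print(decoded == raw)  # True
```

The encoded form is roughly one third larger than the original data, which is the price paid for being able to send it through channels that only handle plain text.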

Issues regarding characters and their use with computers are relatively simple if dealing with a single language, such as English, which has a small number of characters. However, they become quite complex when dealing with internationalization and localization because of the diverse array of writing systems and vast number of characters in use throughout the world. Internationalization is the addition of a framework for support for multiple languages and cultures; localization is the adjustment of language, content and design to specific countries, regions or cultures.


Character Sets

A character set is the collection of characters that is used to write a particular language. Most languages have a single character set, and similar character sets are often used by a number of languages (e.g., variants of the Roman alphabet are used to write English, Spanish, Finnish, Dutch, etc.).

A few languages have, or have had, more than one character set. For example, the Japanese language uses three character sets: the main one is Chinese characters (i.e., the characters that are used to write the Chinese language), but it is supplemented with two syllabaries (called hiragana and katakana). The Korean language is now written mainly with a unique alphabet (called Hangul), but Chinese characters are still occasionally used.

Mongolia is attempting to restore its traditional alphabet, which was replaced by the Cyrillic alphabet (used to write Russian) in the 1940s while the country was under strong Soviet influence, and thus both character sets are currently in use. Turkey used an Arabic alphabet until 1928, at which time it was replaced by an alphabet based on the Roman alphabet as part of a political decision to become more westernized.


Characters and Glyphs

Characters should not be confused with glyphs (although they sometimes are). A glyph is a visual representation (i.e., appearance) of a character and is determined by the typeface and style in which the character is printed. In general, any character can have a number of glyphs, with the number depending on the language.

A typeface is a specific, coordinated design for the entire set of characters that is used to write a language or languages. Some typefaces are available in several styles, such as most of those used to write English and other Western European languages, which are usually available in plain, bold and italic.

Different writing systems use different typefaces, and the number of typefaces varies according to the writing system and language. Thousands of typefaces have been developed for use by English and other Western European languages, and they range all the way from the very simple sans serif Geneva and the monospaced Courier (which was widely used for typewriters) to Times (which is frequently used in printing periodicals and books) to the highly ornate Gothic (which is used mainly for decorative purposes). Characters written in sans serif typefaces lack serifs, the small finishing strokes at the ends of the main strokes, which are widely believed to make text easier to read.

Some characters in some languages can look very different according to the combination of typeface and style that is used to write them, and in some cases they may closely resemble other characters. Yet, it is only the glyph of a character that resembles another character, and the character itself (including its meaning and usage) is distinct.
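The distinction can be illustrated with a short Python sketch that uses the standard unicodedata module to look up names in Unicode (which is described in the final section). The three letters chosen here are only an example; they have nearly identical glyphs in most typefaces but are separate characters.

```python
import unicodedata

# Three capital letters that usually look the same on screen or paper.
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic

# Each one has its own code number and its own official name.
for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Although the glyphs match, the characters never compare as equal.
all_distinct = len(set(lookalikes)) == len(lookalikes)
print(all_distinct)  # True
```

Software that compares or sorts text works on the characters, not the glyphs, which is why a word typed with a Cyrillic letter in place of a similar-looking Latin one will not match a search for the Latin spelling.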


Classification of Characters

Most writing systems can be broadly classified into one of three categories: alphabetic, syllabic and logographic. The vast majority of written languages that exist today use alphabets.

An alphabet is the complete, ordered, standardized set of letters that is used to write or print a written language. Each letter represents one or more phonemes (i.e., the fundamental sounds of a spoken language) and/or is used in combination with other letters to represent a phoneme. Most alphabets in use today are based on the Roman alphabet, which was used by the ancient Romans to write their Latin language.

A syllabary is a set of characters that represent the syllables of a language, with one distinct character for each possible syllable. A syllable is the next largest unit of sound in a language after a phoneme; it consists of a vowel sound or a vowel-consonant combination. Syllabaries typically contain many more characters than do alphabets. They are best suited to languages with relatively simple syllable structures, such as Japanese, which has only about a hundred syllables. The English language, in contrast, contains a relatively large number of vowels and complex consonant clusters, resulting in thousands of syllables.

The third major type of writing system, logographic, uses characters that represent objects or abstract ideas. This type of writing system is popularly referred to as pictographic or ideographic. The most important modern logographic writing system by far is Chinese, whose characters are also used, with varying degrees of modification, in Japanese and Korean (as a supplement to Hangul). The ancient Sumerians, Egyptians and Mayans also used logographic systems.

These three categories are not rigid. For example, the Chinese writing system is not purely logographic. This is because individual characters are often compounds which consist of an element that represents the meaning and an element that represents the pronunciation. Also, combinations of characters are sometimes used mainly for their phonetic values to represent proper nouns (e.g., names of people or places) from other languages.

Likewise, alphabetic and syllabic scripts frequently make some use of logograms and logographic values. The most common example is Arabic numerals, each of which has the same meaning regardless of which language or dialect it is used in and how it is pronounced. Other examples are symbols such as the ampersand and dollar sign. Also, individual letters sometimes have more than just a phonetic value: for example, in the English language the letter A often indicates high quality and the letter X sometimes indicates the unknown or an adult rating.


Origin of Characters

The oldest known writing system is cuneiform (named after the wedge-like shapes of the characters that were formed in clay tablets with reed styluses), which emerged in Sumer (in the southern part of what is now Iraq) more than 5,000 years ago. It was followed closely by the development of writing in Egypt and the Indus valley (in what is now Pakistan and northwestern India).

Chinese characters were apparently invented independently of characters used in the Middle East. They first appeared more than three thousand years ago, and they have been in use continuously in basically the same form ever since.

Most scholars believe that the first alphabets originated in the Near East, perhaps evolving from, or at least being influenced by, cuneiform or Egyptian hieroglyphics. The first widely used alphabet appears to have been that of the Phoenicians (who originated in what is now Lebanon), which was in use by at least 1200 BC. That alphabet contained 22 letters for consonant sounds and had no letters for vowels (as is the case with the Hebrew and Arabic alphabets, which descended from it). The Phoenicians spread their alphabet around the Mediterranean, including to the Greeks and the Etruscans (who preceded the Romans in Italy).

The Roman alphabet was adapted mainly from the Etruscan alphabet during the 7th century BC. It had only upper case (i.e., capital) letters, and there were neither punctuation marks nor spaces between words. Numbers were written with seven letters of the alphabet (i.e., Roman numerals) rather than with Arabic numerals.

Arabic numerals are today by far the most commonly used characters to represent numbers, although there are also other systems for writing numerals that are still in use, including Chinese and Thai. Arabic numerals were originally derived from an Indian system of writing numerals, and there is some speculation that the Indian numerals, in turn, originally came from Chinese characters.
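The coexistence of several numeral systems can be seen in a small Python sketch. It assumes the standard unicodedata module and uses the digit seven from four systems as sample data; the labels in the dictionary are informal descriptions, not official terms.

```python
import unicodedata

# The number seven as written in four different numeral systems.
sevens = {
    "Western Arabic": "7",
    "Eastern Arabic": "\u0667",
    "Thai": "\u0E57",
    "Chinese": "\u4E03",
}

# Different characters, same numeric value.
for label, ch in sevens.items():
    print(label, unicodedata.name(ch), unicodedata.numeric(ch))

# Python's int() also accepts decimal digits from other scripts,
# such as the Thai digits four and two used here.
thai_forty_two = int("\u0E54\u0E52")
print(thai_forty_two)  # 42
```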

Characters were also apparently invented independently in the Americas. In particular, the Mayans had a highly developed writing system that contained a large number of complex, logographic characters.


Numbers of Characters

The size of a character set varies widely according to the language. Languages written with alphabets usually have the fewest characters and those using logographic writing systems have the most. Among the former, the language with the smallest alphabet (and thus the smallest total number of characters) is the Rotokas language (spoken on Bougainville, an island in eastern Papua New Guinea), whose alphabet contains only eleven letters, and that with the largest alphabet is Armenian, with 39 letters.

The Chinese language has by far the largest number of characters of any writing system that has ever existed, and it accounts for the vast bulk of the characters in use in the world today. Chinese contains more than 40,000 characters, and some estimates place the total at close to 60,000. However, most of these are rarely used, and well-educated people generally know only about 5,000.

The Japanese language ranks second in terms of the number of characters because it makes heavy use of Chinese characters. Approximately 2,000 such characters are taught during primary and secondary school, and a well-educated person will know at least 3,500 characters. Hiragana and katakana, the two syllabaries that are used to supplement the Chinese characters, each contain 46 characters.

In South Korea, middle and high school students study 1,800 to 2,000 Chinese characters, but most people use Hangul almost exclusively in their daily lives. Chinese characters are used mainly for personal and place names, for calligraphy and for clarification of some terms written in Hangul.


Characters and Computers

The vast number of characters and the great diversity of writing systems in use around the world present some major challenges for the development of software. This has become an increasingly important issue as a result of the rapid growth in the use of computers in countries that do not use European languages.

ASCII (an acronym for American Standard Code for Information Interchange and pronounced ask-ee) is the de facto encoding (i.e., set of code numbers) used by computers and communications equipment to represent text. It is a single-byte encoding system (i.e., it uses one byte, or eight bits, to represent each character), but only seven of those bits carry the character code, which allows it to represent a maximum of 128 characters. ASCII is based on the characters used to write the English language (including both upper and lower case letters). Extended versions (which utilize the eighth bit to provide a maximum of 256 characters) have been developed for use with other character sets.
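The seven-bit limit can be demonstrated with a brief Python sketch. The sample strings and variable names are arbitrary; the built-in "ascii" codec is used to show what happens when a character falls outside the 128 defined codes.

```python
# Every character in this string belongs to ASCII.
text = "Linux & GNU"

# ord() returns the code number of a character; all are below 128.
codes = [ord(ch) for ch in text]
print(codes[:3])  # [76, 105, 110]

# Each code fits in seven bits, and each character occupies one byte.
print([format(code, "07b") for code in codes[:3]])
encoded = text.encode("ascii")
print(len(encoded) == len(text))  # True

# A character outside the 128 ASCII codes, here an accented e,
# cannot be represented and raises an error.
try:
    "caf\u00e9".encode("ascii")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)  # True
```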

Although ASCII is one of the most successful software standards ever developed, its limitations have become increasingly apparent as a result of the growing internationalization and localization of software. It is suitable for use only with languages that have very small character sets, and is not well suited for computer systems which simultaneously use multiple character sets.

Consequently, Unicode was developed as a means of allowing computers to deal with the full range of characters used by human languages. It has a goal of providing a unique encoding for every character that currently exists or that has ever existed (but not for their variant glyphs). This is accomplished by assigning each character a unique number (called a code point) and storing that number in one or more bytes, with the exact count depending on the encoding form that is used (e.g., one to four bytes in UTF-8), thus vastly increasing the total number of possible unique character encodings. Unicode version 2.0 (released in 1996) listed 38,885 characters, version 3.0 (released in 2000) listed 49,194 and version 4.0 (released in 2003) listed 96,382. Although Unicode has achieved considerable success, it remains a work in progress.
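As a rough illustration, the Python sketch below prints the Unicode code number of four sample characters together with the number of bytes each occupies in two common Unicode encoding forms, UTF-8 and UTF-16. The helper function byte_lengths is hypothetical and exists only for this example; the sample characters are a Latin letter, an accented letter, a Chinese character and a cuneiform sign.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-be" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

samples = ["A", "\u00E9", "\u4E2D", "\U00012000"]

for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")

lengths = [byte_lengths(ch) for ch in samples]
print(lengths)  # [(1, 2), (2, 2), (3, 2), (4, 4)]
```

UTF-8 stores the 128 ASCII characters in a single byte each, so plain English text takes the same space as it does in ASCII, while characters from other writing systems take two, three or four bytes.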

A number of issues with regard to the use of characters and writing systems by computers have yet to be completely resolved. They include (1) controversies in the case of some Chinese characters regarding what is the underlying character and what is the variant glyph, (2) efficient keyboard input systems for languages that use large numbers of characters, (3) software that will allow easy input and display of characters that are arranged other than horizontally from left to right (e.g., right to left or vertically), (4) political and nationalistic controversies about characters, (5) characters that can have multiple forms according to where they are used in words and (6) languages that use multiple character sets.
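Issues (3) and (5) can be glimpsed with Python's standard unicodedata module. The sketch below is only illustrative: it reads the writing direction that Unicode records for three sample letters, and it shows that the four positional forms of the Arabic letter beh, which Unicode includes as separate compatibility characters, all reduce to one underlying letter.

```python
import unicodedata

# Writing direction recorded for a Latin, a Hebrew and an Arabic letter:
# "L" is left to right, "R" is right to left, "AL" is Arabic right to left.
letters = ["A", "\u05D0", "\u0628"]
directions = [unicodedata.bidirectional(ch) for ch in letters]
print(directions)  # ['L', 'R', 'AL']

# Isolated, final, initial and medial forms of the Arabic letter beh.
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
for form in forms:
    print(unicodedata.name(form))

# Compatibility normalization maps every form back to the same letter.
base_letters = {unicodedata.normalize("NFKC", form) for form in forms}
print(base_letters == {"\u0628"})  # True
```

In practice, text is normally stored using only the base letter, and the display software selects the appropriate form from the letter's position in the word.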




Created August 12, 2004. Updated October 12, 2007.
Copyright © 2004 - 2007 The Linux Information Project. All Rights Reserved.