LINFO

Plain Text Definition

Plain text refers to any string (i.e., finite sequence of characters) that consists entirely of printable characters (i.e., human-readable characters) and, optionally, a very few specific types of control characters (e.g., characters indicating a tab or the start of a new line).

A character is any letter, symbol or mark employed in writing or printing a written language (i.e., a language used by humans for which a writing system has been developed). The characters used to write the English language are the 26 lower case (i.e., small) and the 26 upper case (i.e., capital) letters of the English alphabet, the Arabic numerals, punctuation marks and a variety of other symbols (e.g., the ampersand, the equals sign, the tilde and the at symbol). An alphabet is the ordered, standardized set of letters that is used to write or print a written language.

Plain text usually refers to text that consists entirely of the ASCII printable characters and a few of its control characters. ASCII, an acronym for American Standard Code for Information Interchange, is based on the characters used to write the English language as it is used in the U.S. It is the de facto standard for the character encoding (i.e., representing characters by numbers) that is utilized by computers and communications equipment to represent text, and it (or some compatible extension of it) is used on most computers, including nearly all personal computers and workstations.

The term printable characters refers to the 95 ASCII characters (inclusive of the space character, which occupies a single space) that are actually human readable when displayed on a computer screen or when printed on paper. ASCII also contains 33 non-printable characters (i.e., control characters) that were originally intended to control devices (e.g., printers) that make use of ASCII. It does not contain any characters that represent the formatting of text, such as those that indicate the typeface, the font, underlining, margins, etc.
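The split between printable and control characters can be checked directly. The following is a minimal Python sketch; the variable names are chosen only for illustration.

```python
# Printable ASCII runs from code 32 (the space) through code 126 (the tilde).
printable = [chr(code) for code in range(32, 127)]

# The control characters are codes 0 through 31 plus code 127 (delete).
control = [chr(code) for code in list(range(0, 32)) + [127]]

print(len(printable))      # 95
print(len(control))        # 33
print("".join(printable))  # every printable ASCII character, space first
```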

A typeface is a specific design for the entire set of characters that is used to write a language or languages. Among the most popular of the thousands that are available for the English language are Helvetica, Times Roman and Courier. Each typeface contains numerous fonts. A font is an implementation of a typeface for a specific size and style (e.g., plain, bold or italic) of type.

Plain text can also be defined in terms of other character encoding systems. For example, plain Unicode text is a sequence of Unicode characters. Unicode is a newer system that attempts to provide a unique encoding for every character used by the world's languages and which incorporates ASCII as a subset. Thus, plain Unicode text could include human-readable characters from almost any language or combination thereof (e.g., a mixture of Chinese, Russian and English characters as might be used in a trilingual dictionary).
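That ASCII is a subset of Unicode can be illustrated with a short Python sketch. It assumes UTF-8, one widely used encoding of Unicode, and the sample strings are invented for the example.

```python
english = "plain text"
mixed = "text 文本 текст"   # English, Chinese and Russian in one string

# ASCII text produces exactly the same bytes when encoded as UTF-8.
same_bytes = english.encode("ascii") == english.encode("utf-8")

# Characters outside ASCII need more than one byte each in UTF-8.
utf8_bytes = mixed.encode("utf-8")

print(same_bytes)                   # True
print(len(mixed), len(utf8_bytes))  # 13 22
```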

Source code (i.e., the original form of any computer program) is typed into a computer in ASCII plain text by humans using any of thousands of programming languages (among the most common of which are C, C++ and Java). When the source code files have been converted into object code by a compiler, they are no longer plain text, but rather binary files. A binary file is a file that contains at least some data that is not plain text and is thus generally not readable by humans; object code is a binary file whose contents are intended to be read by a computer's CPU (central processing unit).

The only formatting possible for plain text is that which can be created with the space, tab and new line characters. Thus, for example, new lines and new paragraphs can be created, and vertical spaces can be added between lines and between paragraphs. There is no variation in the typeface or font, no underlining, no italic or bold characters and no superscripts or subscripts. Likewise, plain text does not contain any images or hyperlinks (i.e., automated cross-references to other documents).

However, plain text can contain instructions that are written in plain text for formatting, for adding images, for creating hyperlinks, etc. that can be used by programs that convert plain text into other forms. That is, it can contain tags (i.e., instructions or indicators that are written in plain text) that tell a word processor, web browser or other program to format it in a certain way, including which typefaces and fonts to use, how to set the margins, where to underline the text and where to use bold or italic characters.

HTML (hypertext markup language) and XML (extensible markup language) are good examples of the use of instructions that (1) are used to convert plain text into some form of formatted text, (2) are written in plain text and (3) are embedded in the plain text documents that they are used to format. For example, the HTML tags <b> and </b>, although written in plain text, instruct any web browser that reads a file containing them to render (i.e., display) any plain text located between them in bold characters. Among the many other things that HTML can tell browsers are where to create hyperlinks, how to set margins, which images to use and where to insert them, which typefaces and fonts to use and where to render text in italics or underlined characters.
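As a small illustration of tags being ordinary plain text, the sketch below uses the HTMLParser class from Python's standard library to pick out the text between <b> and </b>. The class name BoldFinder and the sample sentence are invented for this example, and a real browser does far more than this.

```python
from html.parser import HTMLParser


class BoldFinder(HTMLParser):
    """Collects the text that appears between <b> and </b> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <b> tags are currently open
        self.bold_text = []   # text found inside <b> ... </b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.bold_text.append(data)


document = "Plain text can carry <b>formatting instructions</b> as tags."
finder = BoldFinder()
finder.feed(document)
finder.close()

print(document.isascii())  # True: the markup itself is plain ASCII text
print(finder.bold_text)    # ['formatting instructions']
```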

Rich text, also referred to as styled text, consists of plain text plus additional information (often stored in a binary format) about such things as fonts, language identifiers and margins.

Plain text should not be confused with plaintext (a single word instead of two). The latter is a term used in cryptography (i.e., the converting of information into an unreadable format) that refers to a plain text message prior to encryption or after decryption, that is, a message in human-readable form.

Important Advantages

Plain text offers some important advantages over other ways of storing and manipulating data. They revolve around the fact that it is the most flexible and portable format for data. That is, everything can be done with plain text that could be done with any binary format, and some things can be done with plain text that cannot easily (if at all) be done with some binary formats. This is because plain text is supported by nearly every application program on every operating system and on every type of CPU and allows information to be manipulated (including searching, sorting and updating) both manually and programmatically using virtually every text processing tool in existence.

This flexibility and portability make plain text the best format for storing data persistently (i.e., for years, decades, or even millennia). That is, plain text provides insurance against the obsolescence of any application programs that are needed to create, read, modify and extend data. Human-readable forms of data (including data in self-describing formats such as HTML and XML) will most likely survive longer than all other forms of data and the application programs that created them. In other words, as long as the data itself survives, it will be possible to use it even if the original application programs have long since vanished.

For example, it is very easy to read a data file from a legacy system (i.e., an antiquated program or operating system) or convert it to some other format even if there is little or no information about the original program that was used to create it, if that data file is written in plain text. If it is written in some binary format, such as by a proprietary (i.e., commercial) word processor or spreadsheet program, it might be very difficult or impossible to read or use it.

Plain text is not necessarily unstructured text. Programming languages as well as SGML (standard generalized markup language) and its modern descendants, most notably HTML (hypertext markup language) and XML (extensible markup language), are examples of plain text formats that have well-defined structures. These formats have the important advantage of making plain text easier for computers to read, reorganize and modify while keeping it relatively readable by humans.

Plain Text and the Unix Philosophy

The use of plain text is an important part of the Unix philosophy, and thus of the Linux philosophy (which incorporates the Unix philosophy). Consequently, in contrast to other types of operating systems, Linux and other Unix-like operating systems attempt to use plain text as much as possible and to minimize the use of binary code.

For example, programs are designed to produce plain text output to the extent practical. An obvious example of a type of program whose primary output cannot be plain text is a compiler, because its purpose is to translate plain text (i.e., source code) into binary code (i.e., runnable programs that can be read directly by the CPU).

All filters use plain text input and produce plain text output. Filters, which are among the most important programs in Unix-like operating systems, are small and (usually) specialized programs that transform plain text data in some meaningful way. They are designed to be linked together using pipes (represented in commands by the vertical bar character) to form pipelines of commands that can have great power and flexibility.
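The idea of chaining filters can be modelled in a few lines of Python, with each filter written as a function that takes lines of plain text and returns lines of plain text. This is only a sketch of the concept; the function names are hypothetical stand-ins and are not the real Unix utilities, although the first two stages behave roughly like the pipeline sort | uniq.

```python
def sort_lines(lines):
    """Filter: emit the input lines in sorted order."""
    return iter(sorted(lines))


def unique_adjacent(lines):
    """Filter: drop a line when it repeats the line just before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def to_upper(lines):
    """Filter: convert every line to upper case."""
    for line in lines:
        yield line.upper()


source = ["pear", "apple", "pear", "fig"]

# Each filter consumes the output of the one before it, as in a pipeline.
pipeline = to_upper(unique_adjacent(sort_lines(source)))
result = list(pipeline)
print(result)  # ['APPLE', 'FIG', 'PEAR']
```

Because every stage reads and writes the same thing (lines of plain text), the stages can be rearranged or replaced without changing the others.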

Also, Unix-like operating systems use plain text files (i.e., files that contain only plain text and no binary data) for system and application configuration information. A major advantage of this approach is ease of access and modification, which can be particularly useful when repairing a crashed or otherwise damaged system. Examples of plain text configuration files include /etc/fstab (which lists the currently mountable filesystems), /etc/passwd (which holds user account data) and /etc/httpd.conf (which is the configuration file for the highly popular Apache web server, although its location varies among systems).
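Because such files are plain text, a few lines of code are enough to read them. The sketch below parses text in the format used by /etc/passwd, in which each line holds seven fields separated by colons. The sample accounts are invented, and the sketch works on a string so that it does not depend on any particular system.

```python
# Hypothetical sample data in /etc/passwd format (not from a real system).
SAMPLE = """root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice Example:/home/alice:/bin/sh
"""

FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")


def parse_passwd(text):
    """Return one dictionary per non-empty, non-comment line."""
    entries = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        entries.append(dict(zip(FIELDS, line.split(":"))))
    return entries


accounts = parse_passwd(SAMPLE)
for account in accounts:
    print(account["name"], account["uid"], account["shell"])
```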

Some operating systems, such as Solaris, the most popular proprietary Unix-like operating system, maintain a binary version of certain system databases in addition to the plain text version as a means of optimizing system performance. The plain text version is retained as a human interface to the binary version, i.e., in order to be able to easily read and modify it.

Few Disadvantages

Despite these advantages, the use of plain text for configuration files and data storage varies greatly according to the operating system and application program.

One disadvantage that has sometimes been claimed for plain text is that it can consume more storage (e.g., hard disk or magnetic tape) space than would a compressed binary format. Another is that it might be computationally more expensive (i.e., require more CPU time) to interpret and process than binary files. However, both of these supposed disadvantages have declined in importance as a result of the rapid reduction in the cost of storage and the rise in processing speeds.

Developers have sometimes expressed concern that keeping metadata (i.e., data about data, including formatting information) in plain text form could expose it to accidental or malicious damage by users. However, although binary data is certainly far more obscure (i.e., difficult to read) than plain text, it is not necessarily more secure. Indeed, there are very effective ways of protecting metadata while still using plain text, such as employing a secure hash of the data and including it in the plain text as a checksum. Plain text can, of course, also be made very obscure if desired through the use of encryption.

A hash, also called a hash function, is an algorithm (i.e., a set of precise, unambiguous rules that specify how to solve some problem or perform some task) or mathematical formula that converts data of any length into a short, fixed-length string of plain text characters, known as a hash value or message digest, that is for practical purposes unique to that data. A hash function is a one-way function; that is, it easily calculates a hash value but, conversely, it is extremely difficult or impossible to reverse the process and reproduce the original data from the hash value.
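The checksum approach can be sketched with Python's hashlib module. The choice of SHA-256, the "checksum: " line format and the function names are assumptions made for this example and are not prescribed by anything above.

```python
import hashlib

MARKER = "checksum: "  # hypothetical label for the checksum line


def add_checksum(body):
    """Append a line holding the SHA-256 digest of the body."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"{body}{MARKER}{digest}\n"


def verify(document):
    """Recompute the digest of the body and compare it with the stored one."""
    body, _, stored = document.rpartition(MARKER)
    actual = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return actual == stored.strip()


body = "title: report\nauthor: alice\n"
signed = add_checksum(body)
tampered = signed.replace("alice", "mallory")

print(len(hashlib.sha256(b"").hexdigest()))  # 64, whatever the input length
print(verify(signed), verify(tampered))      # True False
```

A checksum of this kind reveals changes to the body as long as the checksum line itself is not also rewritten; the whole file remains readable plain text.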

Proprietary software generally does not use plain text for configuration files and for storing data, but rather it almost always uses some binary format. This is often an attempt by software developers and vendors to lock in existing users of such software, i.e., to make it difficult and costly for them to convert their existing data to any competing file format, including plain text.

Programs For Producing Plain Text

There is a large number and a great variety of programs that can produce output in plain text form. The simplest and most basic are text editors, which are small programs that are designed specifically to create, read and edit plain text.

A pure text editor deals only with plain text and, in contrast to a word processor, is not designed to format text. At least one free text editor is included as a basic part of virtually every operating system. Among the most popular on Unix-like operating systems are vi, gedit and kedit. Emacs, which is often preferred by programmers, has a text editor function, but it also has a number of advanced capabilities that allow it to even be used for compiling programs and browsing the Web.

Examples of free text editors for other operating systems are SimpleText, which was included with the Macintosh prior to OS X, and Notepad, which is included with the Microsoft Windows operating systems. Caution should be exercised regarding the use of text editors for other operating systems because some of them, such as older versions of Notepad, do not treat Unix-style text files correctly and thus can cause programs for Unix-like operating systems to malfunction.
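One common source of such trouble, assumed here to be the relevant one, is the line-ending convention: Windows editors traditionally end each line with a carriage return followed by a line feed, whereas Unix-like systems use a line feed alone. A minimal Python sketch of the conversion follows; the function name and the sample script are invented for illustration.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CR LF pairs (and any lone CR) to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# A shell script as it might be saved with Windows-style line endings.
windows_text = b"#!/bin/sh\r\necho hello\r\n"
unix_text = to_unix_newlines(windows_text)

print(unix_text)  # b'#!/bin/sh\necho hello\n'
```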

Word processors can also be used to read, write and edit plain text. But they can additionally be used with a variety of other text formats, including various proprietary formats (both those that are native to that particular word processor and those that are compatible with other word processors) and rich text. Several word processors can also create PDF (portable document format) documents.

There is a trend towards using plain text as an output format for all types of programs, not just text-oriented programs. This trend is even affecting some art programs, particularly free art programs which store their output as scalable vector graphics (SVG). SVG is an XML language (and thus plain text, as is XML) for describing two-dimensional vector graphics (both static and animated) that makes it much easier to modify such graphics, including selectively transforming and regrouping parts of them, than do most conventional graphics formats. This is another example of the great flexibility of plain text.
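Because an SVG image is plain text, it can be modified with general-purpose text and XML tools. The sketch below uses the ElementTree module from Python's standard library to change the colour and size of a circle; the drawing itself is a made-up example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A tiny, hypothetical SVG drawing held in an ordinary string.
svg_text = (
    f'<svg xmlns="{SVG_NS}" width="100" height="100">'
    '<circle cx="50" cy="50" r="40" fill="red"/>'
    "</svg>"
)

root = ET.fromstring(svg_text)

# Locate the circle element (qualified by the SVG namespace) and edit it.
circle = root.find(f"{{{SVG_NS}}}circle")
circle.set("fill", "blue")
circle.set("r", "20")

# Serialize back to plain text, keeping SVG as the default namespace.
ET.register_namespace("", SVG_NS)
modified = ET.tostring(root, encoding="unicode")
print(modified)
```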

Binary files are often converted into a plain text representation in order to improve their survivability during transit over the Internet or other networks. This is accomplished using encoding schemes such as Base64, which is used to convert non-text e-mail data (e.g., images and attachments) into a 65-character subset of ASCII (64 data characters plus a padding character); the data is then converted back into its original form after arrival.
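The round trip can be sketched with Python's base64 module. The sample bytes are arbitrary and merely stand in for an image or other attachment.

```python
import base64
import string

# The 64 data characters of standard Base64 plus "=" for padding.
ALPHABET = set(string.ascii_letters + string.digits + "+/=")

# Arbitrary non-text bytes standing in for binary data.
binary = bytes([0, 255, 137, 80, 78, 71])

encoded = base64.b64encode(binary)   # plain ASCII, safe to send as text
decoded = base64.b64decode(encoded)  # the original bytes, recovered

print(encoded.decode("ascii"))
print(set(encoded.decode("ascii")) <= ALPHABET)  # True
print(decoded == binary)                         # True
```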