File Formats: A Brief Introduction

A file format is any specific way of encoding digital data to create a file.

A file is a named collection of related data that appears to the user as a single, contiguous block of information that can be retained in storage and accessed by its file name. Storage refers to computer devices or media that can retain data for relatively long periods of time (e.g., years or decades), such as hard disk drives (HDDs), CD-ROMs and magnetic tape.

File formats are necessary because computers store, manipulate and communicate data in binary (i.e., sequences of zeros and ones) form, and thus it is necessary to have some system of converting data into binary form and back into a form that is easily comprehensible to humans. For example, some sequence of zeros and ones is necessary to represent each character in a language (e.g., a letter of the alphabet or a Chinese character) and to represent the color of each pixel (i.e., dot) in an image.
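The conversion described above can be sketched in a few lines of Python. This is only an illustration: the helper name to_bits is hypothetical, and the three-byte red, green, blue layout for a pixel is an assumption, since real image formats differ in how they store color.

```python
def to_bits(data: bytes) -> str:
    """Return the bytes as a space-separated string of 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in data)

# The letter "A" stored as a single byte (value 65).
letter_bits = to_bits("A".encode("ascii"))

# A pure red pixel stored as three bytes: red, green, blue (an assumed layout).
pixel_bits = to_bits(bytes([255, 0, 0]))

print(letter_bits)  # 01000001
print(pixel_bits)   # 11111111 00000000 00000000
```

Reading such data back simply reverses the process: the program must know which format was used in order to interpret the bits correctly.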

A large variety of file formats have been developed and are in common use. There are numerous file formats for each type of information, including text, still images, moving images, sound and executable (i.e., ready-to-run) programs, each with its own special characteristics and capabilities. File formats often differ with regard to their specificity; that is, some can be used with only a very particular type of data whereas others can accommodate a wider variety of data. Some file formats compete with others, and the difficulty of converting data from one format into another varies according to the specific formats involved.

Text File Formats

Text files are the most common type of file. The simplest, and often the most useful, format for text files is plain text, which consists solely of binary codes for readable characters (such as letters of an alphabet, numerals and punctuation marks) together with a few types of control characters (e.g., characters that are used to indicate new lines and tabs).

One of the biggest advantages of plain text file formats is that they are human readable and are thus an excellent computer-human interface. It is easy to quickly examine and search their contents, even in the absence of specialized software. Also, they are, or can be, used by numerous programs, including text editors and word processors, and they are used by programmers when writing programs. Moreover, it is relatively easy to convert many other file formats into plain text.

The most commonly used format and de facto standard for plain text files is ASCII (American Standard Code for Information Interchange), which uses seven bits to encode a maximum of 128 different characters. They include the unaccented letters of the Latin alphabet as used in English (both lower case and upper case), Arabic numerals, punctuation marks, common symbols (such as the ampersand, arithmetic signs and the dollar sign) and control characters. Eight-bit extensions of ASCII, such as ISO 8859-1, use the additional bit to encode up to 256 characters, including accented letters.
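As a minimal sketch of how ASCII maps characters to numeric codes, the following uses Python's built-in str.encode method. Standard ASCII code values lie in the range 0 to 127. The helper names ascii_codes and is_ascii are hypothetical and chosen only for this illustration.

```python
def ascii_codes(text: str) -> list:
    """Return the ASCII code of each character in text.

    Raises UnicodeEncodeError if text contains a non-ASCII character.
    """
    return list(text.encode("ascii"))

def is_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

codes = ascii_codes("Hi!")
print(codes)                  # [72, 105, 33]
print(is_ascii("plain"))      # True
print(is_ascii("caf\u00e9"))  # False: the accented letter is outside ASCII
```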

Another, increasingly common format for plain text files is Unicode, which is a system that attempts to provide a unique encoding for every character used by all of the world's languages, both existing and extinct. It accomplishes this by allowing multiple bytes to be used to represent each character, rather than just one byte as is the case with ASCII.
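The use of multiple bytes per character can be illustrated with UTF-8, one widely used way of encoding Unicode. UTF-8 is not named in the text above and is chosen here only as an example; other Unicode encodings use different byte counts. The helper name utf8_length is hypothetical.

```python
def utf8_length(character: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(character.encode("utf-8"))

# A Latin letter, an accented letter, a Chinese character and an emoji,
# written as escapes so the example does not depend on the file's own encoding.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for character in samples:
    encoded = character.encode("utf-8")
    print(f"U+{ord(character):04X} uses {len(encoded)} byte(s): {encoded.hex()}")
```

In this encoding, characters that are also in ASCII still occupy a single byte, which is why a pure ASCII file is also a valid UTF-8 file.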

Among plain text files, regardless of whether they have been created in ASCII, Unicode or some other encoding, it is possible to have further specificity of file formats. A common example is HTML (hypertext markup language) files, whose plain text must obey certain additional rules (e.g., the use of pre-defined tags to indicate paragraphs, type sizes and color) in order to be both recognized and interpreted correctly by web browsers.

Most commercial word processors by default save their output in proprietary file formats rather than ASCII or Unicode. Such formats can have some advantages, such as reducing storage space because they allow data compression. However, there are also some disadvantages. One is that it can be difficult to convert saved documents into other proprietary formats. Also, it can become difficult and costly to access data at all in the event that a developer stops supporting a format.

Image File Formats

There are numerous file formats for image data, the most commonly used of which are JPEG (Joint Photographic Experts Group), GIF (graphics interchange format) and PNG (portable network graphics). One reason that so many have been developed is that each has a different set of capabilities and limitations. For example, JPEG is well suited for photographs and other images that contain large numbers of colors, whereas GIF is preferable for images with small numbers of colors and can also be used for simple moving images. Also, PNG was developed as a free substitute for GIF, which was encumbered with patents.

Identifying File Formats

The way that computers identify file formats can differ according to the operating system. For example, the various Microsoft operating systems determine the format of a file on the basis of its filename extension, which consists of several letters to the right of a period following the file name. In contrast, Linux and other Unix-like operating systems typically determine the file format by the file's magic number, which is a number embedded at or near the beginning of a file.
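The two approaches can be contrasted with a short sketch. The signature table below lists the leading bytes of three image formats and is deliberately tiny; real systems consult far larger databases. The function names format_from_extension and format_from_magic are hypothetical and are not part of any operating system.

```python
import os

# A few well-known signatures (leading bytes) and the format each indicates.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]

def format_from_extension(filename: str) -> str:
    """Return the lowercased extension without the period, or '' if there is none."""
    return os.path.splitext(filename)[1].lstrip(".").lower()

def format_from_magic(header: bytes) -> str:
    """Return the format whose signature begins the header, or 'unknown'."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"

# The first bytes of a PNG image that has been saved under a misleading name.
header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

print(format_from_extension("photo.jpg"))  # jpg
print(format_from_magic(header))           # PNG
```

The example shows why the two methods can disagree: the extension is only part of the name and can be changed freely, whereas the magic number is part of the file's contents.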

There are several ways for users to identify the format of a file. When working in a GUI (graphical user interface), often the easiest is to look at its icon (i.e., small image). However, this does not work in some cases, particularly for less common file formats or those not recognized by default by the particular operating system. Another way that can be used either in a GUI or at the command line is to look at the extension. However, problems can arise if there is no extension or if the extension is incorrect.

A very useful technique that is available in Unix-like operating systems is the command named file, which attempts to classify any specified filesystem object (i.e., file, directory or link) by probing it with three types of tests (filesystem tests, magic number tests and language tests), performed in that order, until one succeeds.
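The file command can also be called from a program. The following is a minimal sketch that assumes the widely distributed implementation of file, whose -b (brief) option omits the file name from the output. The function name describe_with_file is hypothetical, and the function returns None on systems where the command is not installed. No sample output is shown because the exact wording varies between versions.

```python
import shutil
import subprocess
import tempfile

def describe_with_file(path: str):
    """Return the description printed by `file -b`, or None if `file` is unavailable."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "-b", path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()

# Create a small plain text file with a misleading extension and classify it.
with tempfile.NamedTemporaryFile(suffix=".jpg") as temporary:
    temporary.write(b"This is plain text, not an image.\n")
    temporary.flush()
    description = describe_with_file(temporary.name)

print(description)
```

Because file examines the contents rather than the name, its report for this example is based on the text inside the file and not on the .jpg extension.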