LINFO

Files: A Brief Introduction

A file is a named collection of related data that appears to the user as a single, contiguous block of information and that is retained in storage.

Storage refers to computer devices or media which can retain data for relatively long periods of time (e.g., years or decades), such as hard disk drives (HDDs), CD-ROMs and magnetic tape. This contrasts with memory, which consists of RAM (random access memory) chips whose contents can be accessed (i.e., read and written) at extremely high speeds but are retained only temporarily (i.e., only while in use or while the power supply remains on).

User View Versus Computer View of Files

There is a substantial difference between the way that files appear to users and the way that they are actually stored on a computer. To the user, a file is seen as the smallest unit of logical storage: that is, data cannot be written to storage unless it is written in a file. Each file is identified by a name that is unique within the directory in which the file is located.

To the computer, however, individual files are identified by numbers rather than by their names and their locations in directories. Also, individual files are usually not stored as contiguous blocks of data; rather, they are often stored as multiple fragments scattered in various locations on an HDD, or even on multiple HDDs.

File Types

Files can be classified in several ways, including according to (1) the operating system on which they were created, (2) the application program that created them, (3) the type of contents they contain, (4) whether they are text files or binary files and (5) how they are classified by any of several commands (e.g., ls -l and file).

Every operating system uses files as a means of organizing data. However, there are often differences in the structures of files created by different operating systems, and thus it can be difficult for files created on one operating system to be read, written to or executed (i.e., run as a program) on another operating system. For example, transferring text files among Linux (or other Unix-like operating systems), Macintosh and Microsoft Windows can require conversion because they use different characters to signify line breaks: Unix-like systems use a line feed, Windows uses a carriage return followed by a line feed, and the classic Mac OS used a carriage return alone.
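
The line break difference can be illustrated with a short Python sketch. It assumes the three conventions in common use (a line feed on Unix-like systems, a carriage return plus line feed on Windows, and a lone carriage return on the classic Mac OS); the function name normalize_newlines is hypothetical and chosen only for this example.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac OS (CR) line breaks to Unix (LF)."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


unix_text = b"first\nsecond\n"
windows_text = b"first\r\nsecond\r\n"
classic_mac_text = b"first\rsecond\r"

# Both comparisons are expected to hold after normalization.
print(normalize_newlines(windows_text) == unix_text)
print(normalize_newlines(classic_mac_text) == unix_text)
```

The sketch works on raw bytes rather than decoded text so that nothing else in the file is altered.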

Different application programs also produce different types of files. Some programs produce file types, also called file formats, that are specific only to that program and are difficult or virtually impossible to read or manipulate with other programs. Others produce files that conform to industry standards and thus can be easily read and manipulated by a variety of other programs.

Any type of information can be stored in files, including text, sound, still images, moving images, database data, source code, executable programs and compressed data, and different types of data are generally not stored together in the same file. An executable program is a ready-to-run program. Source code is the version of a program as it is originally written (i.e., typed into a computer) by a human in plain text (i.e., human readable alphanumeric characters) and before it is converted by a compiler into machine code, which can be read directly by a computer's CPU (central processing unit).

In Linux, everything is configured or treated as a file, or as a process. Thus, the extent of file types is further broadened to include directories, partitions and hardware devices. A partition is a logically independent section of an HDD. A process is an instance of a program in execution, whereas a program itself is merely one or more executable files.

Text Files and Binary Files

One of the simplest and most common ways of classifying files is to divide them into the two categories of text files and binary files. A text file (also referred to as a plain text file) is a file that contains only human-readable characters plus a few types of control characters, such as those used to indicate line breaks and tabs. Text files were traditionally encoded in ASCII (American Standard Code for Information Interchange), long the most commonly used character set for computers; today they are often encoded in UTF-8, which is a superset of ASCII.

The term binary file is commonly used to refer to any file that does not consist entirely of plain text, that is, any file that contains at least some binary data. All data is ultimately stored as sequences of zeros and ones, but in binary data those sequences do not represent human-readable text characters; instead, they are meant to be interpreted directly by a program or, in the case of machine code, by a computer's CPU. Binary files can be image files, sound files, compressed files (including compressed plain text), executable programs, etc.
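
The distinction can be approximated in code. The following Python sketch is only a heuristic, not the method used by any particular tool: it assumes that a text file decodes as UTF-8 (which includes plain ASCII) and contains no control characters other than tabs and line breaks. The function name looks_like_text is hypothetical.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic guess at whether a sequence of bytes is plain text."""
    try:
        text = data.decode("utf-8")  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    for ch in text:
        # Code points below 32 are control characters.
        if ord(ch) < 32 and ch not in allowed_controls:
            return False
    return True


print(looks_like_text(b"Hello, world\n"))        # plain text
print(looks_like_text(b"\x89PNG\r\n\x1a\n"))     # first bytes of a PNG image
print(looks_like_text(bytes([0, 1, 2, 3])))      # raw control bytes
```

A real classifier would typically examine only the first part of a file and would recognize more encodings.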

The files for executable programs are also called executable files or executables. They contain step-by-step instructions in machine code that a computer can understand (but that humans cannot read). When a user enters the name of the executable file at the keyboard, the commands in the file are executed. This type of file is also sometimes referred to as an image (which has no relationship to the other use of this term that refers to pictures or graphics).

Files can be, and frequently are, converted from text to binary and vice versa. For example, the source code for programs is written in plain text (using a programming language) but it is then converted into binary form by a compiler so that it can be run by the computer. (Thus, compiled applications are often simply referred to as binaries.)

Likewise, web pages, which are written in HTML (hypertext markup language), XHTML (extensible hypertext markup language) or XML (extensible markup language), are text files, but they are converted to binary form by web browsers so that they can be displayed on users' monitor screens.

Conversely, binary files are often converted into printable text characters, usually in order to improve survivability during transit through systems that were designed to handle only text. This is accomplished using encoding schemes such as Base64, which represents the data using letters, digits and a few punctuation characters. Such encoding is employed extensively for e-mail attachments, such as graphic images. The resulting files are human readable in the sense that they consist entirely of printable characters, although they are not meaningful to humans because the sequences of characters are not recognizable words.
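
As a minimal sketch of such a conversion, the following Python example uses the standard library's base64 module to encode a few arbitrary bytes (chosen only for illustration) as printable characters and then decode them back to the original data.

```python
import base64

# Six arbitrary bytes, including values that are not printable characters.
binary_data = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

encoded = base64.b64encode(binary_data)   # bytes holding only printable ASCII
decoded = base64.b64decode(encoded)       # the original binary data again

print(encoded.decode("ascii"))
print(decoded == binary_data)
```

Base64 represents every three bytes of input as four characters of output, so the encoded form is roughly one third larger than the original.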

Metadata

Each file is given a number of attributes (i.e., information about the file) when it is created, with the exact numbers and definitions of the attributes differing according to the operating system. On Unix-like systems they include the file name, inode number (i.e., numerical identifier), type of file, location of the data on the HDD, size, timestamps (traditionally the times of the most recent access, modification and status change), ownership (which is by default the user that created it) and permissions (i.e., whether the owner, the group and others may read, write and execute the file). This metadata (i.e., data about data) is generally stored elsewhere in the filesystem rather than in the files themselves: most of it is kept in the file's inode, while the file name is kept in the directory that contains the file.
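
Several of these attributes can be inspected from a program. The sketch below uses Python's standard os.stat function on a temporary file created only for this example; the exact values printed will differ from system to system.

```python
import os
import stat
import tempfile
import time

# Create a small temporary file to inspect.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    path = tmp.name

info = os.stat(path)

print("inode number: ", info.st_ino)
print("regular file: ", stat.S_ISREG(info.st_mode))
print("size in bytes:", info.st_size)
print("permissions:  ", stat.filemode(info.st_mode))
print("owner user ID:", info.st_uid)
print("last modified:", time.ctime(info.st_mtime))

os.remove(path)  # clean up the temporary file
```

Note that the structure returned by os.stat contains no file name; the name had to be supplied in order to look the file up.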

Whereas users identify files by their names, Unix-like operating systems identify them by their inodes. An inode is a data structure that holds a file's metadata; inodes are stored in inode tables in the filesystem, are identified by numbers, and exist for all types of files. They are necessary because a single file can have multiple names, or even no name. A file name in a Unix-like operating system is just an entry in a directory that points to an inode number. Inode numbers are unique only within a filesystem, which means that an inode with the same number can exist on another filesystem in the same computer.
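
That a single file can have more than one name can be demonstrated with a hard link. The following Python sketch assumes a Unix-like system and a filesystem that supports hard links; the file names original.txt and alias.txt are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    first_name = os.path.join(directory, "original.txt")
    second_name = os.path.join(directory, "alias.txt")

    with open(first_name, "w") as f:
        f.write("same data\n")

    # Give the existing file a second name (a hard link).
    os.link(first_name, second_name)

    first_info = os.stat(first_name)
    second_info = os.stat(second_name)
    same_inode = first_info.st_ino == second_info.st_ino
    link_count = first_info.st_nlink  # number of names referring to the inode

    # Removing one name does not remove the data while another name remains.
    os.remove(first_name)
    with open(second_name) as f:
        surviving_text = f.read()

print("same inode:", same_inode)
print("link count:", link_count)
print("data still readable:", surviving_text == "same data\n")
```

Both names resolve to the same inode number, which is why the data remains available under the second name after the first has been removed.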