LINFO

String Definition


In computer science, a string is any finite sequence of characters (i.e., letters, numerals, symbols and punctuation marks).

An important characteristic of each string is its length, which is the number of characters in it. The length can be any natural number (i.e., zero or any positive integer). A particularly useful string for some programming applications is the empty string, which is a string containing no characters and thus having a length of zero. A substring is any contiguous sequence of characters in a string.
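The ideas of length, the empty string and substrings can be illustrated with a short sketch. Python is used here purely for convenience, and the variable names and sample text are assumptions made for the example.

```python
# Length, the empty string and substrings, using Python's built-in str type.
text = "Linux kernel"

length = len(text)              # number of characters in the string
empty = ""                      # the empty string
empty_length = len(empty)       # zero characters, so the length is zero

# A substring is a contiguous run of characters; slicing extracts one.
substring = text[0:5]           # the characters at positions 0 through 4
is_substring = "kernel" in text

# 'L' and 'k' both occur in text, but not next to each other,
# so "Lk" is not a substring.
not_contiguous = "Lk" in text

print(length, empty_length, substring, is_substring, not_contiguous)
```

Note that the empty string counts as a substring of every string, which is one reason it is convenient as a starting value when building strings piece by piece.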


The String Data Type

Data types are widely used in programming languages and database systems as a way of categorizing data and thereby facilitating error prevention, modularity, documentation and system optimization. Data types can differ according to the programming language or database system, but strings are such an important and useful data type that they are implemented in some way in virtually every programming language.

How strings and string data types are represented depends largely on the character set (e.g., an alphabet) for which they are defined and on the method of character encoding (i.e., how the characters are represented by bits on a computer). Formerly, string implementations were usually designed to work with ASCII (long the de facto standard for the character encoding used by computers and communications equipment to represent text) or with its later extensions (particularly the ISO 8859 series, which allows representation of many national alphabets in addition to the U.S. English alphabet covered by the original ASCII). In recent years, however, the trend has been to implement strings with Unicode, which attempts to provide character codes for all existing and extinct written languages.
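The effect of the character encoding can be seen by converting the same string to bytes in different ways. The following is a minimal sketch in Python; the sample word is an assumption, and UTF-8 is used as one common encoding of Unicode (it is not the only one).

```python
# The same four-character string under three encodings.
text = "caf\u00e9"   # "café": the last character is outside ASCII

latin1 = text.encode("iso-8859-1")   # é becomes the single byte 0xE9
utf8 = text.encode("utf-8")          # é becomes the two bytes 0xC3 0xA9

# Plain ASCII has no code for é, so encoding fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(len(text), len(latin1), len(utf8), ascii_failed)
```

The string has four characters in every case, but it occupies four bytes in ISO 8859-1 and five in UTF-8, which is why the length of a string in characters and its size in bytes must be kept distinct.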

Several programming languages have been specifically designed to facilitate the development of programs for processing strings. They include awk, Perl, sed and Tcl. Perl takes a particularly flexible approach to its string data type by allowing it to contain any kind of data, even binary (i.e., non-character) data.

The C programming language, which is probably the most widely used systems development language (i.e., a language used to write operating systems) and the language that is used to write most of the Linux kernel, takes a very different approach to strings. It has no special data type for strings; rather, a string is treated as an array of char data whose end is marked by a null character (i.e., a character with the value zero). An array is an ordered sequence of data elements of a single data type. The char data type is used for storing individual characters (e.g., letters and punctuation marks) and is classified as an integer data type because it actually stores small integers representing the character codes, typically the ASCII values (e.g., the lower case letter b is stored as 98).
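The C representation can be imitated from Python with the standard ctypes module, which allocates a C-style char array. This is only a sketch of the idea, not C code: the sample word is an assumption, and the final zero is the null character that C places after the last character of a string to mark where it ends.

```python
import ctypes

# Roughly what the C declaration  char buf[] = "bash";  would create.
buf = ctypes.create_string_buffer(b"bash")

codes = list(buf.raw)        # the integers actually stored in the array
size = ctypes.sizeof(buf)    # four characters plus the terminating null
first = buf.raw[0]           # integer code stored for the letter b
as_char = chr(first)         # the same integer interpreted as a character

print(codes, size, first, as_char)
```

The array holds five elements for a four-character string, and the first element is the integer 98, the ASCII value of the lower case letter b.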


String Manipulation

There are numerous algorithms (i.e., sets of precise, unambiguous rules designed to solve specific problems or perform specific tasks) for processing strings, including algorithms for searching, sorting, comparing and transforming them. Even the C programming language, which was not designed for string processing, provides several standard string operations in its library, including finding length (strlen), concatenating, i.e., joining (strcat), comparing (strcmp) and copying (strcpy).
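The basic operations mentioned above can be sketched briefly. Python's built-in string methods are used here as stand-ins for the underlying algorithms; the sample data and variable names are assumptions made for the example.

```python
text = "the quick brown fox"
words = ["grep", "awk", "sed", "Perl"]

# Searching: find returns the position of a substring, or -1 if absent.
position = text.find("brown")
missing = text.find("cat")

# Sorting: the default order compares character codes, so upper case
# letters sort before lower case ones; a key function changes that.
by_code = sorted(words)
ignoring_case = sorted(words, key=str.lower)

# Comparing: strings are compared character by character.
same = "awk" == "awk"
before = "awk" < "sed"

# Transforming and concatenating.
shouted = text.upper()
replaced = text.replace("quick", "slow")
joined = "data" + "base"

print(position, missing, by_code, ignoring_case, same, before, joined)
```

The difference between the two sorted lists shows that even a simple operation such as sorting depends on how characters are encoded as numbers.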

In Linux and other Unix-like operating systems, there are numerous utilities for processing strings. These are the same utilities that are used for manipulating text files (i.e., files that contain only text and no binary data), because in such operating systems text files and strings are considered to be essentially the same thing. Moreover, these utilities can be combined using pipes (which send the output of one utility to another utility to use as its input) and, in some cases, can easily be programmed to provide powerful (i.e., very flexible and efficient) string processing algorithms.

Among these utilities are cat, used for concatenating; sort, used for sorting; expand, used to convert tabs to spaces; grep, used for searching; split, used for cutting a file into pieces; tr, used to translate or delete characters; unexpand, used to convert spaces to tabs; and wc, used for counting lines, words or characters.
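The way such utilities are chained together with pipes can be imitated in a few lines. The sketch below is a loose Python analogy, not the real commands: the functions grep, tr, sort and wc_l are hypothetical stand-ins written for this example, each supporting only a tiny fraction of what the utility of the same name does.

```python
def grep(lines, pattern):
    """Keep only the lines that contain the literal text in pattern."""
    return (line for line in lines if pattern in line)

def tr(lines, source, target):
    """Replace each character of source with the matching one in target."""
    table = str.maketrans(source, target)
    return (line.translate(table) for line in lines)

def sort(lines):
    """Return the lines in sorted order."""
    return sorted(lines)

def wc_l(lines):
    """Count the lines, as wc does when given its -l option."""
    return sum(1 for _ in lines)

lines = ["tabs and spaces", "a pipe", "spaces only", "another pipe"]

# Comparable in spirit to:  grep pipe | tr ' ' '_' | sort
result = sort(tr(grep(lines, "pipe"), " ", "_"))

# Comparable in spirit to:  grep spaces | wc -l
count = wc_l(grep(lines, "spaces"))

print(result, count)
```

Each function takes a sequence of lines and hands a sequence of lines to the next one, which is the same design that lets the real utilities be combined freely.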

Regular expressions are a powerful technique for searching text for strings (i.e., searching strings for substrings) that match a specified pattern. They pervade the programming tools in Unix-like operating systems and are also used by many text editors and some programming languages. Several regular expression syntaxes have been developed, but the Perl-compatible regular expressions (PCRE) are often regarded as having the richest and most predictable syntax and thus the greatest flexibility and ease of use.
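A few typical uses of regular expressions are sketched below with Python's standard re module. Its syntax is Perl-style and close to PCRE for simple patterns like these, but it is a separate implementation; the sample log text is an assumption made for the example.

```python
import re

log = "error 404 at 10:15, error 500 at 10:42, ok 200 at 11:00"

# Find every three-digit number that follows the word "error".
codes = re.findall(r"error (\d{3})", log)

# Find the first substring that looks like a time (two digits, colon, two digits).
match = re.search(r"\d{2}:\d{2}", log)
first_time = match.group() if match else None

# Replace every such time with a placeholder.
masked = re.sub(r"\d{2}:\d{2}", "HH:MM", log)

print(codes, first_time, masked)
```

A pattern describes a whole family of substrings at once, which is what distinguishes regular expression searching from searching for one fixed substring.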

The strings command is designed to extract strings of printable characters (by default, sequences at least four characters long) from binary files. Binary files are files that contain at least some non-character data (i.e., binary data) but which can (and usually do) also contain some character data; they include executable (i.e., runnable) programs, output files from many application programs (e.g., word processing and spreadsheet programs) and image files.
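The basic idea behind the strings command can be sketched as a search for runs of printable characters. The function below is a simplified illustration and not the actual implementation of the utility: the name extract_strings is hypothetical, only the printable ASCII range (codes 32 to 126) is considered, and the minimum run length of four is taken as the default.

```python
import re

def extract_strings(data, min_length=4):
    """Return the runs of printable ASCII bytes in data that are at
    least min_length characters long, in order of appearance."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]

# Made-up bytes standing in for the contents of a binary file.
data = b"\x00\x01hello\x02ab\xffworld!\x00\x7fELF"

found = extract_strings(data)
found_short = extract_strings(data, min_length=2)

print(found, found_short)
```

The minimum length matters because random binary data frequently contains one or two bytes that happen to fall in the printable range; requiring longer runs filters most of these out.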




Created January 29, 2005. Updated June 17, 2007.
Copyright © 2005 - 2007 The Linux Information Project. All Rights Reserved.