103.7 Lesson 1

Certificate: LPIC-1
Version: 5.0
Topic: 103 GNU and Unix Commands
Objective: 103.7 Search text files using regular expressions
Lesson: 1 of 2

Introduction

String-searching algorithms are so widely used in data-processing tasks that Unix-like operating systems have their own ubiquitous implementation: regular expressions, often abbreviated to REs. Regular expressions consist of character sequences that make up a generic pattern used to locate and sometimes modify a corresponding sequence in a larger string of characters. Regular expressions greatly expand the ability to:

Write parsing rules to requests in HTTP servers, nginx in particular.
Write scripts that convert text-based datasets to another format.
Search for occurrences of interest in journal entries or documents.
Filter markup documents, keeping semantic content.

The simplest regular expression contains at least one atom. An atom, so named because it’s the basic element of a regular expression, is just a character that may or may not have special meaning. Most ordinary characters are unambiguous and retain their literal meaning, while others have special meaning:

. (dot): Atom matches with any character.
^ (caret): Atom matches with the beginning of a line.
$ (dollar sign): Atom matches with the end of a line.

For example, the regular expression bc, composed of the literal atoms b and c, can be found in the string abcd, but cannot be found in the string a1cd. On the other hand, the regular expression .c can be found in both strings abcd and a1cd, as the dot . matches any character.

The caret and dollar sign atoms are used when only matches at the beginning or at the end of the string are of interest. For that reason they are also called anchors. For example, cd can be found in abcd, but ^cd cannot. Similarly, ab can be found in abcd, but ab$ cannot. The caret ^ is a literal character except when at the beginning of the regular expression, and $ is a literal character except when at the end.
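The examples above can be checked on the command line. The following is a minimal sketch that assumes GNU grep is available; the try function is a hypothetical helper introduced only for this illustration and is not part of any standard tool.

```shell
# Hypothetical helper: prints "match" if basic RE $1 is found in string $2.
try() {
    if printf '%s\n' "$2" | grep -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

try 'bc'  'abcd'    # match: the literal atoms b and c appear in sequence
try 'bc'  'a1cd'    # no match
try '.c'  'a1cd'    # match: the dot stands for the character 1
try '^cd' 'abcd'    # no match: cd is not at the beginning
try 'ab$' 'abcd'    # no match: ab is not at the end
try 'cd$' 'abcd'    # match: cd is at the end
try 'a^b' 'a^b'     # match: ^ is literal when not at the beginning
```

Single quotes around each pattern keep the shell from interpreting characters such as $ before grep receives them.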

Bracket Expression

There is another type of atom named bracket expression. Although not a single character, brackets [] (including their content) are considered a single atom. A bracket expression is usually just a list of literal characters enclosed by [], making the atom match any single character from the list. For example, the expression [1b] can be found in both strings abcd and a1cd. To specify characters the atom should not correspond to, the list must begin with ^, as in [^1b]. It is also possible to specify ranges of characters in bracket expressions. For example, [0-9] matches digits 0 to 9 and [a-z] matches any lowercase letter. Ranges must be used with caution, as they might not be consistent across distinct locales.

Bracket expression lists also accept classes instead of just single characters and ranges. Traditional character classes are:

[:alnum:]: Represents an alphanumeric character.
[:alpha:]: Represents an alphabetic character.
[:ascii:]: Represents a character that fits into the ASCII character set.
[:blank:]: Represents a blank character, that is, a space or a tab.
[:cntrl:]: Represents a control character.
[:digit:]: Represents a digit (0 through 9).
[:graph:]: Represents any printable character except space.
[:lower:]: Represents a lowercase character.
[:print:]: Represents any printable character including space.
[:punct:]: Represents any printable character which is not a space or an alphanumeric character.
[:space:]: Represents white-space characters: space, form-feed (\f), newline (\n), carriage return (\r), horizontal tab (\t), and vertical tab (\v).
[:upper:]: Represents an uppercase letter.
[:xdigit:]: Represents hexadecimal digits (0 through 9, a through f, and A through F).

Character classes can be combined with single characters and ranges, but may not be used as an endpoint of a range. Also, character classes may be used only in bracket expressions, not as an independent atom outside the brackets.
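Bracket expressions can be tried out in the same way. This is a sketch under the assumption that GNU grep is installed; try is again a hypothetical helper, and LC_ALL=C is set only to make the range example independent of the current locale.

```shell
# Hypothetical helper: prints "match" if basic RE $1 is found in string $2.
# LC_ALL=C pins the locale so that ranges such as [0-9] behave predictably.
try() {
    if printf '%s\n' "$2" | LC_ALL=C grep -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

try '[1b]'  'abcd'                    # match: b is in the list
try '[1b]'  'a1cd'                    # match: 1 is in the list
try '[^1b]' '1b1b'                    # no match: every character is excluded
try '[0-9]' 'abcd'                    # no match: the string has no digit
try '[[:digit:]]' 'a1cd'              # match: a class inside a bracket expression
try '[[:upper:][:digit:]_]' 'abc_'    # match: classes combined with the single character _
```

Note the double brackets in [[:digit:]]: the outer pair delimits the bracket expression and the inner pair belongs to the class name.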

Quantifiers

The reach of an atom, either a single character atom or a bracket atom, can be adjusted using an atom quantifier. Atom quantifiers define atom sequences, that is, matches occur when a contiguous repetition of the atom is found in the string. The substring corresponding to the match is called a piece. However, quantifiers and other features of regular expressions are treated differently depending on which standard is being used.

As defined by POSIX, there are two forms of regular expressions: “basic” regular expressions and “extended” regular expressions. Most text related programs in any conventional Linux distribution support both forms, so it is important to know their differences in order to avoid compatibility issues and to pick the most suitable implementation for the intended task.

The * quantifier has the same function in both basic and extended REs (the atom occurs zero or more times), and it is a literal character if it appears at the beginning of the regular expression or if it is preceded by a backslash \. In extended regular expressions, the plus sign quantifier + selects pieces containing one or more atom matches in sequence. With the question mark quantifier ?, a match occurs if the corresponding atom appears once or does not appear at all. If preceded by a backslash \, their special meaning is not considered. Basic regular expressions also support the + and ? quantifiers, but they need to be preceded by a backslash. Unlike in extended regular expressions, + and ? by themselves are literal characters in basic regular expressions.
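The difference between the two forms can be observed with grep, which uses basic regular expressions by default and extended ones with -E. The sketch below assumes GNU grep, whose -o option prints only the matched piece; the variable names are chosen for this illustration only.

```shell
# * means "zero or more" in both basic and extended REs.
star=$(echo 'abbbc' | grep -o 'ab*')                 # abbb

# A * at the beginning of a basic RE is a literal character.
star_literal=$(echo '2*3' | grep -o '*3')            # *3

# + as a quantifier: bare in extended REs, backslash-escaped in basic REs.
plus_ere=$(echo 'caaat' | grep -E -o 'a+')           # aaa
plus_bre=$(echo 'caaat' | grep -o 'a\+')             # aaa

# A bare + in a basic RE is a literal plus sign.
plus_literal=$(echo 'a+b caaat' | grep -o 'a+')      # a+

# ? makes the preceding atom optional.
optional=$(echo 'colour' | grep -E -o 'colou?r')     # colour
missing=$(echo 'color' | grep -E -o 'colou?r')       # color

echo "$star $star_literal $plus_ere $plus_bre $plus_literal $optional $missing"
```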

Bounds

A bound is an atom quantifier that, as the name implies, allows a user to specify precise quantity boundaries for an atom. In extended regular expressions, a bound may appear in three forms:

{i}: The atom must appear exactly i times (i an integer number). For example, [[:blank:]]{2} matches with exactly two blank characters.
{i,}: The atom must appear at least i times (i an integer number). For example, [[:blank:]]{2,} matches with any sequence of two or more blank characters.
{i,j}: The atom must appear at least i times and at most j times (i and j integer numbers, j greater than or equal to i). For example, xyz{2,4} matches the xy string followed by two to four of the z character.

In any case, if a substring matches with a regular expression and a longer substring starting at the same point also matches, the longer substring will be considered.

Basic regular expressions also support bounds, but the delimiters must be preceded by \: \{ and \}. By themselves, { and } are interpreted as literal characters. A \{ followed by a character other than a digit is a literal character, not the beginning of a bound.
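The following sketch illustrates bounds in both forms. It assumes GNU grep (for the -o option), and the sample strings and variable names are invented for the illustration.

```shell
# Extended RE: exactly three a characters are taken from a longer run.
ere_exact=$(echo 'aaaa' | grep -E -o 'a{3}')           # aaa

# Extended RE: xy followed by two to four z characters. The input has five,
# and the longest piece allowed by the bound is selected.
ere_range=$(echo 'xyzzzzz' | grep -E -o 'xyz{2,4}')    # xyzzzz

# Extended RE: count the lines holding two or more consecutive blanks.
blank_lines=$(printf 'one  two\nthree four\n' | grep -E -c '[[:blank:]]{2,}')   # 1

# Basic RE: the same kind of bound needs escaped braces.
bre_min=$(echo 'xyzzz' | grep -o 'xyz\{2,\}')          # xyzzz

# Basic RE: unescaped braces are literal characters.
bre_literal=$(echo 'a{2}' | grep -o 'a{2}')            # a{2}

echo "$ere_exact $ere_range $blank_lines $bre_min $bre_literal"
```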

Branches and Back References

Basic regular expressions also differ from extended regular expressions in another important aspect: an extended regular expression can be divided into branches, each one an independent regular expression. Branches are separated by | and the combined regular expression will match anything that corresponds to any of the branches. For example, he|him will match if either substring he or him is found in the string being examined. Basic regular expressions interpret | as a literal character. However, most programs supporting basic regular expressions will allow branches with \|.

An extended regular expression enclosed in () can be used in a back reference. For example, ([[:digit:]])\1 will match any digit that is immediately followed by the same digit, as in 11 or 77, because the \1 in the expression is the back reference to the piece matched by the first parenthesized subexpression. If more than one parenthesized subexpression exists in the regular expression, they can be referenced with \2, \3 and so on.

For basic REs, subexpressions must be enclosed by \( and \), with ( and ) by themselves being ordinary characters. The back reference indicator is used as in extended regular expressions.
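Branches and back references can be compared side by side with grep. This is a sketch assuming GNU grep, which accepts \| in basic regular expressions and back references in extended ones; the sample strings and variable names are made up for the illustration.

```shell
# Extended RE: two branches separated by |.
branch_ere=$(echo 'tell him' | grep -E -o 'he|him')                  # him

# Basic RE: the same branches written with \|.
branch_bre=$(echo 'tell him' | grep -o 'he\|him')                    # him

# Basic RE: a bare | is a literal character.
pipe_literal=$(echo 'he|him' | grep -o 'he|him')                     # he|him

# Extended RE: a digit followed by a copy of itself.
backref_ere=$(echo 'room 1337' | grep -E -o '([[:digit:]])\1')       # 33

# Basic RE: the same back reference with escaped parentheses.
backref_bre=$(echo 'room 1337' | grep -o '\([[:digit:]]\)\1')        # 33

echo "$branch_ere $branch_bre $pipe_literal $backref_ere $backref_bre"
```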

Searching with Regular Expressions

The immediate benefit offered by regular expressions is to improve searches on filesystems and in text documents. The -regex option of the find command makes it possible to test every path in a directory hierarchy against a regular expression. For example,

$ find $HOME -regex '.*/\..*' -size +100M

searches for files greater than 100 megabytes (100 units of 1048576 bytes), but only in paths inside the user’s home directory that contain a match with .*/\..*, that is, a /. surrounded by any number of other characters. In other words, only hidden files or files inside hidden directories will be listed, regardless of the position of /. in the corresponding path. For case insensitive regular expressions, the -iregex option should be used instead:

$ find /usr/share/fonts -regextype posix-extended -iregex '.*(dejavu|liberation).*sans.*(italic|oblique).*'
/usr/share/fonts/dejavu/DejaVuSansCondensed-BoldOblique.ttf
/usr/share/fonts/dejavu/DejaVuSansCondensed-Oblique.ttf
/usr/share/fonts/dejavu/DejaVuSans-BoldOblique.ttf
/usr/share/fonts/dejavu/DejaVuSans-Oblique.ttf
/usr/share/fonts/dejavu/DejaVuSansMono-BoldOblique.ttf
/usr/share/fonts/dejavu/DejaVuSansMono-Oblique.ttf
/usr/share/fonts/liberation/LiberationSans-BoldItalic.ttf
/usr/share/fonts/liberation/LiberationSans-Italic.ttf

In this example, the regular expression contains branches (written in extended style) to list only specific font files under the /usr/share/fonts directory hierarchy. Extended regular expressions are not supported by default, but find allows for them to be enabled with -regextype posix-extended or -regextype egrep. The default RE standard for find is findutils-default, which is virtually a basic regular expression clone.
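Both options can be tried safely in a scratch directory. The sketch below assumes GNU find (for -regextype and -iregex) and mktemp; the file and directory names are invented for the illustration.

```shell
# Scratch directory so that the example does not touch real files.
dir=$(mktemp -d)
mkdir "$dir/.cache" "$dir/docs"
touch "$dir/.profile" "$dir/.cache/data" "$dir/docs/notes.txt" "$dir/docs/Draft.TXT"

# Hidden files, or files inside hidden directories: the path must contain "/.".
# The search starts from "." so that the scratch path itself plays no role.
hidden=$(cd "$dir" && find . -type f -regex '.*/\..*' | LC_ALL=C sort)

# Case-insensitive search using extended-style branches.
texts=$(cd "$dir" && find . -type f -regextype posix-extended -iregex '.*\.(txt|md)' | LC_ALL=C sort)

echo "$hidden"    # ./.cache/data and ./.profile
echo "$texts"     # ./docs/Draft.TXT and ./docs/notes.txt

rm -r "$dir"
```

The regular expression is tested against the whole path, not just the file name, which is why both patterns begin with .* here.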

It is often necessary to pass the output of a program to the less command when it doesn’t fit on the screen. The less command splits its input into pages, one screenful at a time, allowing the user to easily navigate the text up and down. In addition, less allows a user to perform regular expression based searches. This feature is particularly important because less is the default paginator for many everyday tasks, like inspecting journal entries or consulting manual pages. When reading a manual page, for instance, pressing the / key opens a search prompt. This is a typical scenario in which regular expressions are useful, as command options are listed just after the page margin in the general manual page layout. The same option might appear many times throughout the text, however, which makes literal searches impractical. Typing ^[[:blank:]]*-o (or, more simply, ^ *-o) in the search prompt and pressing Enter will jump immediately to the section for the -o option (if it exists), allowing one to consult an option description more rapidly.
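Because searching in less is interactive, the effect of the expression can be previewed with grep instead. In this sketch, grep stands in for the less search prompt, and the text in the page variable is a made-up fragment imitating the layout of a manual page.

```shell
# Made-up text laid out like a manual page; -o appears on three lines.
page=$(printf '%s\n' \
    'SYNOPSIS' \
    '       cmd [-o file] input' \
    'OPTIONS' \
    '       -o file' \
    '              Write output to file; see also -o above.')

# Only the line where -o follows the page margin matches; -n shows its number.
hit=$(printf '%s\n' "$page" | grep -n '^[[:blank:]]*-o')

echo "$hit"    # the fourth line of the sample text
```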

Guided Exercises

What extended regular expression would match any email address, like info@example.org?
What extended regular expression would only match any IPv4 address in the standard dotted-quad format, like 192.168.15.1?
How can the grep command be used to list the contents of file /etc/services, discarding all comments (lines starting with #)?
The file domains.txt contains a list of domain names, one per line. How would the egrep command be used to list only .org or .com domains?

Explorational Exercises

From the current directory, how would the find command use an extended regular expression to search for all files not containing a standard file suffix (file names not ending in .txt or .c, for example)?
Command less is the default paginator for displaying long text files in the shell environment. By typing /, a regular expression can be entered in the search prompt to jump to the first corresponding match. In order to stay in the current document position and only highlight the corresponding matches, what key combination should be entered at the search prompt?
In less, how would it be possible to filter the output so only lines which match a regular expression get displayed?

Summary

This lesson covers the general Linux support for regular expressions, a widely used standard whose pattern matching capabilities are supported by most text related programs. The lesson goes through the following steps:

What a regular expression is.
The main components of a regular expression.
The differences between basic and extended regular expressions.
How to perform simple text and file searches using regular expressions.

Answers to Guided Exercises

What extended regular expression would match any email address, like info@example.org?
```
egrep "\S+@\S+\.\S+"
```
What extended regular expression would only match any IPv4 address in the standard dotted-quad format, like 192.168.15.1?
```
egrep "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
```
How can the grep command be used to list the contents of file /etc/services, discarding all comments (lines starting with #)?
```
grep -v '^#' /etc/services
```
The file domains.txt contains a list of domain names, one per line. How would the egrep command be used to list only .org or .com domains?
```
egrep '\.org$|\.com$' domains.txt
```

Answers to Explorational Exercises

From the current directory, how would the find command use an extended regular expression to search for all files not containing a standard file suffix (file names not ending in .txt or .c, for example)?
```
find . -type f -regextype egrep -not -regex '.*\.[[:alnum:]]{1,}$'
```
Command less is the default paginator for displaying long text files in the shell environment. By typing /, a regular expression can be entered in the search prompt to jump to the first corresponding match. In order to stay in the current document position and only highlight the corresponding matches, what key combination should be entered at the search prompt?

Pressing Ctrl+K before entering the search expression.

In less, how would it be possible to filter the output so only lines which match a regular expression get displayed?

By pressing & and entering the search expression.