103.7 Lesson 2

Certificate:

LPIC-1

Version:

5.0

Topic:

103 GNU and Unix Commands

Objective:

103.7 Search text files using regular expressions

Lesson:

2 of 2

Introduction

Streaming data through a chain of piped commands makes it possible to apply compound filters based on regular expressions. Regular expressions are an important technique not only in system administration, but also in data mining and related areas. Two commands are especially well suited to manipulating files and text data with regular expressions: grep and sed. grep is a pattern finder and sed is a stream editor. They are useful by themselves, but they stand out when working together with other processes.

The Pattern Finder: grep

One of the most common uses of grep is to facilitate the inspection of long files, using the regular expression as a filter applied to each line. It can be used to show only the lines starting with a certain term. For example, grep can be used to investigate a configuration file for kernel modules, listing only option lines:

$ grep '^options' /etc/modprobe.d/alsa-base.conf
options snd-pcsp index=-2
options snd-usb-audio index=-2
options bt87x index=-2
options cx88_alsa index=-2
options snd-atiixp-modem index=-2
options snd-intel8x0m index=-2
options snd-via82xx-modem index=-2

The pipe | character can be employed to redirect the output of a command directly to grep's input. The following example uses a bracket expression to select lines from fdisk -l output, starting with Disk /dev/sda or Disk /dev/sdb:

# fdisk -l | grep '^Disk /dev/sd[ab]'
Disk /dev/sda: 320.1 GB, 320072933376 bytes, 625142448 sectors
Disk /dev/sdb: 7998 MB, 7998537728 bytes, 15622144 sectors

The mere selection of lines with matches may not be appropriate for a particular task, requiring adjustments to grep's behavior through its options. For example, option -c or --count tells grep to show how many lines had matches:

# fdisk -l | grep '^Disk /dev/sd[ab]' -c
2

The option can be placed before or after the regular expression. Other important grep options are:

-c or --count: Instead of displaying the matching lines, display only the number of lines that matched in each file.
-i or --ignore-case: Make the search case-insensitive.
-f FILE or --file=FILE: Read the regular expressions to use from FILE, one per line.
-n or --line-number: Prefix each output line with its line number in the input.
-v or --invert-match: Select every line, except those containing matches.
-H or --with-filename: Also print the name of the file containing the line.
-z or --null-data: Treat input and output as sequences of records terminated by a null (zero) byte instead of a newline. When grep receives output from the find command run with its -print0 option, the -z or --null-data option should be used so that the stream is processed in the same manner.
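The options above can be tried on throwaway data. The following is a minimal sketch, assuming GNU grep and GNU find; the sample lines, the file names and the variable names are hypothetical and only serve the illustration:

```shell
# Hypothetical sample data written to a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'Alpha one' 'beta two' 'ALPHA three' 'gamma four' > "$dir/sample.txt"
touch "$dir/report.txt" "$dir/notes.md"

# -i: ignore case; -n: prefix each match with its line number.
numbered=$(grep -i -n 'alpha' "$dir/sample.txt")
echo "$numbered"

# -v: keep only the lines that do NOT match.
inverted=$(grep -i -v 'alpha' "$dir/sample.txt")
echo "$inverted"

# -c: count the matching lines instead of printing them.
count=$(grep -i -c 'alpha' "$dir/sample.txt")
echo "$count"

# find -print0 ends each path with a null byte; grep -z reads and writes
# null-terminated records, and tr turns them back into lines for display.
txt_files=$(find "$dir" -type f -print0 | grep -z '\.txt$' | tr '\0' '\n' | sort)
echo "$txt_files"

rm -r "$dir"
```

With -z, a file name containing a newline would still be handled as a single record, which is the reason for pairing it with -print0.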

Although activated by default when multiple file paths are given as input, the option -H is not activated for single files. That may be critical in special situations, like when grep is called directly by find, for instance:

$ find /usr/share/doc -type f -exec grep -i '3d modeling' "{}" \; | cut -c -100
artistic aspects of 3D modeling. Thus this might be the application you are
This major approach of 3D modeling has not been supported
oce is a C++ 3D modeling library. It can be used to develop CAD/CAM softwares, for instance [FreeCad

In this example, find lists every file under /usr/share/doc and passes each one to grep, which in turn performs a case-insensitive search for 3d modeling inside the file. The pipe to cut is there just to limit output length to 100 columns. Note, however, that there is no way of knowing which file each line came from. This issue is solved by adding -H to grep:

$ find /usr/share/doc -type f -exec grep -i -H '3d modeling' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support

Now it is possible to identify the files where each match was found. To make the listing even more informative, the lines immediately before and after each match can be included in the output:

$ find /usr/share/doc -type f -exec grep -i -H -1 '3d modeling' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md-application Blender), OpenSCAD focuses on the CAD aspects rather t
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/openscad/README.md-looking for when you are planning to create 3D models of machine p
/usr/share/doc/opencsg/doc/publications.html-3D graphics library for Constructive Solid Geometry (CS
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support
/usr/share/doc/opencsg/doc/publications.html-by real-time computer graphics until recently.

The option -1 instructs grep to include one line before and one line after when it finds a line with a match. These extra lines are called context lines and are identified in the output by a minus sign after the file name. The same result can be obtained with -C 1 or --context=1 and other context line quantities may be indicated.
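The equivalence between -1 and -C 1 is easy to confirm on a small input. This is a minimal sketch with a hypothetical five-line text; because the input comes from a pipe and -H is not given, no file name prefix appears:

```shell
# Hypothetical input with a single matching line in the middle.
text=$(printf '%s\n' 'line 1' 'line 2' 'match here' 'line 4' 'line 5')

# -1 and -C 1 both add one context line before and after the match.
short=$(printf '%s\n' "$text" | grep -1 'match')
both=$(printf '%s\n' "$text" | grep -C 1 'match')
echo "$both"

# A larger quantity widens the context on both sides.
wide=$(printf '%s\n' "$text" | grep -C 2 'match')
echo "$wide"
```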

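There are two complementary programs to grep: egrep and fgrep. The program egrep is equivalent to the command grep -E, which interprets the pattern as an extended regular expression instead of a basic one. Recent GNU grep releases mark both egrep and fgrep as obsolescent and recommend grep -E and grep -F instead, although the older names still appear in many scripts. With egrep it is possible to use extended regular expression features, like branching: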

$ find /usr/share/doc -type f -exec egrep -i -H -1 '3d (modeling|printing)' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md-application Blender), OpenSCAD focuses on the CAD aspects rather t
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/openscad/README.md-looking for when you are planning to create 3D models of machine p
/usr/share/doc/openscad/RELEASE_NOTES.md-* Support for using 3D-Mouse / Joystick / Gamepad input dev
/usr/share/doc/openscad/RELEASE_NOTES.md:* 3D Printing support: Purchase from a print service partne
/usr/share/doc/openscad/RELEASE_NOTES.md-* New export file formats: SVG, 3MF, AMF
/usr/share/doc/opencsg/doc/publications.html-3D graphics library for Constructive Solid Geometry (CS
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support
/usr/share/doc/opencsg/doc/publications.html-by real-time computer graphics until recently.

In this example either 3D modeling or 3D printing will match the expression, case-insensitive. To display only the parts of a text stream that match the expression used by egrep, use the -o option.
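As a brief illustration of -o, the sketch below runs the same branching expression over a few hypothetical lines. It uses grep -E, the equivalent of egrep; each match is printed on its own line, even when two matches share an input line:

```shell
# Hypothetical sample lines; the second one contains no match.
text=$(printf '%s\n' \
  'Guide to 3D modeling and 3D printing' \
  'No match on this line' \
  'Intro to 3d Printing')

# -o prints only the matched parts, -i ignores case.
matches=$(printf '%s\n' "$text" | grep -E -i -o '3d (modeling|printing)')
echo "$matches"
```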

The program fgrep is equivalent to grep -F, that is, it does not parse regular expressions. It is useful in simple searches where the goal is to match a literal expression. Therefore, special characters like the dollar sign and the dot will be taken literally and not by their meanings in a regular expression.
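The difference between a literal search and a regular expression search shows up as soon as the pattern contains a special character. A minimal sketch, using hypothetical price lines and grep -F as the equivalent of fgrep:

```shell
# Hypothetical input: only the first line contains the literal text $5.00.
prices=$(printf '%s\n' 'cost: $5.00' 'cost: 5000' 'cost: $5x00')

# -F: the dollar sign and the dot are ordinary characters.
literal=$(printf '%s\n' "$prices" | grep -F '$5.00')
echo "$literal"

# Without -F, the dot matches any character, so all three lines are selected.
regex=$(printf '%s\n' "$prices" | grep '5.00')
echo "$regex"
```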

The Stream Editor: sed

The purpose of the sed program is to modify text-based data in a non-interactive way. It means that all the editing is made by predefined instructions, not by arbitrarily typing directly into a text displayed on the screen. In modern terms, sed can be understood as a template parser: given a text as input, it places custom content at predefined positions or when it finds a match for a regular expression.

sed, as the name implies, is well suited for text streamed through pipelines. Its basic syntax is sed -f SCRIPT when editing instructions are stored in the file SCRIPT or sed -e COMMANDS to execute COMMANDS directly from the command line. If neither -f nor -e is present, sed uses the first non-option parameter as the script itself, that is, as the editing commands to run. It is also possible to use a file as the input just by giving its path as an argument to sed.
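The three ways of passing instructions to sed can be compared side by side. The sketch below assumes a hypothetical three-line input file and a script file holding the single instruction 2d, which deletes the second line:

```shell
# Hypothetical input and script files in a temporary directory.
work=$(mktemp -d)
printf '%s\n' 'one' 'two' 'three' > "$work/input.txt"
printf '%s\n' '2d' > "$work/script.sed"

# Instructions given with -e, input read from a pipe.
from_e=$(cat "$work/input.txt" | sed -e '2d')

# Instructions read from a file with -f, input file given as an argument.
from_f=$(sed -f "$work/script.sed" "$work/input.txt")

# Neither -e nor -f: the first non-option argument is taken as the instructions.
from_arg=$(sed '2d' "$work/input.txt")

echo "$from_arg"
rm -r "$work"
```

All three invocations produce the same two lines, one and three.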

sed instructions are composed of a single character, possibly preceded by an address and followed by one or more options, and are applied to one line at a time. Addresses can be a single line number, a regular expression, or a range of lines. For example, the first line of a text stream can be deleted with 1d, where 1 specifies the line where the delete command d will be applied. To clarify sed's usage, take the output of the command factor `seq 12`, which returns the prime factors for numbers 1 to 12:

$ factor `seq 12`
1:
2: 2
3: 3
4: 2 2
5: 5
6: 2 3
7: 7
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3

Deleting the first line with sed is accomplished by 1d:

$ factor `seq 12` | sed 1d
2: 2
3: 3
4: 2 2
5: 5
6: 2 3
7: 7
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3

A range of lines can be specified with a separating comma:

$ factor `seq 12` | sed 1,7d
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3

More than one instruction can be used in the same execution, separated by semicolons. In this case, however, it is important to enclose them in quotes so the semicolon is not interpreted by the shell:

$ factor `seq 12` | sed "1,7d;11d"
8: 2 2 2
9: 3 3
10: 2 5
12: 2 2 3

In this example, two deletion instructions were executed, first on lines ranging from 1 to 7 and then on line 11. An address can also be a regular expression, so only lines with a match will be affected by the instruction:

$ factor `seq 12` | sed "1d;/:.*2.*/d"
3: 3
5: 5
7: 7
9: 3 3
11: 11

The regular expression :.*2.* matches any occurrence of the digit 2 anywhere after a colon, causing the deletion of lines corresponding to numbers with 2 as a factor. With sed, anything placed between slashes (/) is considered a regular expression, and by default the full basic regular expression syntax is supported. For example, sed -e "/^#/d" /etc/services shows the contents of the file /etc/services without the lines beginning with # (comment lines).
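Since the contents of /etc/services vary between systems, the comment-stripping instruction can be checked against a few hypothetical lines written in the same style:

```shell
# Hypothetical excerpt in the style of /etc/services.
services=$(printf '%s\n' '# Network services' 'ssh 22/tcp' '# web' 'http 80/tcp')

# /^#/ is the address (lines starting with #) and d deletes them.
no_comments=$(printf '%s\n' "$services" | sed -e '/^#/d')
echo "$no_comments"
```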

The delete instruction d is only one of the many editing instructions provided by sed. Instead of deleting a line, sed can replace it with a given text:

$ factor `seq 12` | sed "1d;/:.*2.*/c REMOVED"
REMOVED
3: 3
REMOVED
5: 5
REMOVED
7: 7
REMOVED
9: 3 3
REMOVED
11: 11
REMOVED

The instruction c REMOVED simply replaces a line with the text REMOVED. In the example’s case, every line with a substring matching the regular expression :.*2.* is affected by instruction c REMOVED. Instruction a TEXT adds the text indicated by TEXT as a new line after the line with a match. The instruction r FILE does the same, but adds the contents of the file indicated by FILE. Instruction w FILE does the opposite of r: instead of reading from the file, it writes each matching line to the file indicated by FILE.
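The instructions a, r and w can be sketched on a small input. The file names below are hypothetical, and the one-line form a TEXT is assumed to be run with GNU sed, as in the c REMOVED example:

```shell
# Hypothetical files in a temporary directory.
work=$(mktemp -d)
printf '%s\n' 'first' 'second' 'third' > "$work/input.txt"
printf '%s\n' '--- from note.txt ---' > "$work/note.txt"

# a TEXT: add TEXT as a new line after each line matching the address.
appended=$(sed -e '/second/a ADDED' "$work/input.txt")
echo "$appended"

# r FILE: add the contents of FILE after each matching line.
inserted=$(sed -e "/second/r $work/note.txt" "$work/input.txt")
echo "$inserted"

# w FILE: write each matching line to FILE; -n suppresses the normal output.
sed -n -e "/ir/w $work/matches.txt" "$work/input.txt"
written=$(cat "$work/matches.txt")
echo "$written"

rm -r "$work"
```

The address /ir/ matches first and third, so only those two lines end up in matches.txt.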

By far the most used sed instruction is s/FIND/REPLACE/, which is used to replace a match to the regular expression FIND with text indicated by REPLACE. For example, the instruction s/hda/sda/ replaces a substring matching the literal RE hda with sda. Only the first match found in the line will be replaced, unless the flag g is placed after the instruction, as in s/hda/sda/g.
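The effect of the g flag is easiest to see on a line with several matches. A minimal sketch with a hypothetical input line:

```shell
# Hypothetical line containing three occurrences of hda.
line='hda1 and hda2 are on hda'

# Without g, only the first match in the line is replaced.
first_only=$(echo "$line" | sed -e 's/hda/sda/')
echo "$first_only"   # sda1 and hda2 are on hda

# With g, every match in the line is replaced.
all=$(echo "$line" | sed -e 's/hda/sda/g')
echo "$all"          # sda1 and sda2 are on sda
```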

A more realistic case study will help to illustrate sed's features. Suppose a medical clinic wants to send text messages to its customers, reminding them of their scheduled appointments for the next day. A typical implementation scenario relies on a professional instant message service, which provides an API to access the system responsible for delivering the messages. These messages usually originate from the same system that runs the application controlling customers' appointments, triggered by a specific time of the day or some other event. In this hypothetical situation, the application could generate a file called appointments.csv containing tabulated data with all the appointments for the next day, which is then used by sed to render the text messages from a template file called template.txt. CSV files are a standard way of exporting data from database queries, so sample appointments could be given as follows:

$ cat appointments.csv
"NAME","TIME","PHONE"
"Carol","11am","55557777"
"Dave","2pm","33334444"

The first line holds the labels for each column, which will be used to match the tags inside the sample template file:

$ cat template.txt
Hey <NAME>, don't forget your appointment tomorrow at <TIME>.

The less than < and greater than > signs were put around labels just to help identify them as tags. The following Bash script parses all enqueued appointments using template.txt as the message template:

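#! /bin/bash

# Load the message template into a variable
TEMPLATE=`cat template.txt`
# Column names from the first CSV line, one array element per tag
TAGS=(`sed -ne '1s/^"//;1s/","/\n/g;1s/"$//p' appointments.csv`)
# Remaining CSV lines (-s 1 skips the header), one appointment per element
mapfile -t -s 1 ROWS < appointments.csv
for (( r = 0; r < ${#ROWS[*]}; r++ ))
do
  MSG=$TEMPLATE
  # Split the current appointment into its separate values
  VALS=(`sed -e 's/^"//;s/","/\n/g;s/"$//' <<<${ROWS[$r]}`)
  for (( c = 0; c < ${#TAGS[*]}; c++ ))
  do
    # Replace each <TAG> in the message with the value from the same column
    MSG=`sed -e "s/<${TAGS[$c]}>/${VALS[$c]}/g" <<<"$MSG"`
  done
  # echo prints the curl command instead of running it
  echo curl --data message=\"$MSG\" --data phone=\"${VALS[2]}\" https://mysmsprovider/api
done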

An actual production script would also handle authentication, error checking and logging, but the example has the basic functionality to start with. The first instructions executed by sed are applied only to the first line, as selected by the address 1 in 1s/^"//;1s/","/\n/g;1s/"$//p. They remove the leading and trailing quotes (1s/^"// and 1s/"$//) and replace the field separators with newline characters (1s/","/\n/g). Only the first line is needed for loading column names, so the default output is suppressed by option -n, and the flag p is placed after the last sed command to print the edited first line. The tags are then stored in the TAGS variable as a Bash array. Another Bash array variable is created by the command mapfile to store the lines containing the appointments in the array variable ROWS.
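The header-parsing step can be tried in isolation. The sketch below feeds two lines modeled on appointments.csv through the same sed expression; it assumes GNU sed, which interprets \n in the replacement as a newline, and the variable names are hypothetical:

```shell
# Two lines modeled on appointments.csv: the header and one appointment.
csv=$(printf '%s\n' '"NAME","TIME","PHONE"' '"Carol","11am","55557777"')

# Address 1 restricts every substitution to the header line; -n together
# with the p flag prints only that edited line.
tags=$(printf '%s\n' "$csv" | sed -ne '1s/^"//;1s/","/\n/g;1s/"$//p')
echo "$tags"
```

The result is one column name per line (NAME, TIME and PHONE), which is the form that the shell splits into array elements in the script.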

A for loop is employed to process each appointment line found in ROWS. The appointment, held in ${ROWS[$r]} and passed to sed as a here string, has its quotes and separators replaced in the same way as in the command used to load the tags. The separated values for the appointment are then stored in the array variable VALS, where array subscripts 0, 1 and 2 correspond to values for NAME, TIME and PHONE.

Finally, a nested for loop walks through the TAGS array and replaces each tag found in the template with its corresponding value in VALS. The MSG variable holds a copy of the rendered template, updated by the substitution command s/<${TAGS[$c]}>/${VALS[$c]}/g on every loop pass through TAGS.
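The tag replacement can also be followed for a single appointment without the CSV file. This is a simplified sketch, not the script itself: the tag and value pairs are hard-coded as hypothetical NAME=value words instead of being read into the TAGS and VALS arrays.

```shell
# Template text taken from template.txt.
template="Hey <NAME>, don't forget your appointment tomorrow at <TIME>."
msg=$template

# Hypothetical tag=value pairs standing in for one row of the CSV file.
for pair in NAME=Carol TIME=11am PHONE=55557777
do
  tag=${pair%%=*}   # text before the first =
  val=${pair#*=}    # text after the first =
  # Replace every <TAG> with its value; tags absent from the template
  # (PHONE here) simply leave the message unchanged.
  msg=$(printf '%s\n' "$msg" | sed -e "s/<$tag>/$val/g")
done
echo "$msg"
```

Values containing a slash or an ampersand would need escaping before being placed in the substitution, since both characters have a special meaning there.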

This results in a rendered message like: "Hey Carol, don’t forget your appointment tomorrow at 11am." The rendered message can then be sent as a parameter through an HTTP request with curl, as a mail message or by any other similar method.

Combining grep and sed

Commands grep and sed can be used together when more complex text mining procedures are required. As a system administrator, you may want to inspect all the login attempts to a server, for example. The file /var/log/wtmp records all logins and logouts, whilst the file /var/log/btmp records the failed login attempts. They are written in a binary format, which can be read by the commands last and lastb, respectively.

The output of lastb shows not only the username used in the bad login attempt, but its IP address as well:

# lastb -d -a -n 10 --time-format notime
user     ssh:notty       (00:00)     81.161.63.251
nrostagn ssh:notty       (00:00)     vmd60532.contaboserver.net
pi       ssh:notty       (00:00)     132.red-88-20-39.staticip.rima-tde.net
pi       ssh:notty       (00:00)     132.red-88-20-39.staticip.rima-tde.net
pi       ssh:notty       (00:00)     46.6.11.56
pi       ssh:notty       (00:00)     46.6.11.56
nps      ssh:notty       (00:00)     vmd60532.contaboserver.net
narmadan ssh:notty       (00:00)     vmd60532.contaboserver.net
nominati ssh:notty       (00:00)     vmd60532.contaboserver.net
nominati ssh:notty       (00:00)     vmd60532.contaboserver.net

Option -d translates the IP address to the corresponding hostname. The hostname may provide clues about the ISP or hosting service used to perform these bad login attempts. Option -a puts the hostname in the last column, which facilitates the filtering yet to be applied. Option --time-format notime suppresses the time when the login attempt occurred. Command lastb can take some time to finish if there were too many bad login attempts, so the output was limited to ten entries with the option -n 10.

Not all remote IPs have a hostname associated with them, so reverse DNS lookups return nothing for those addresses and their entries can be dismissed. Although you could write a regular expression to match the expected format for a hostname at the end of the line, it is probably simpler to write a regular expression that matches a single digit at the end of the line and to discard the lines it selects. This works because an IP address always ends with a digit, whereas the hostnames in this listing end with the letters of a top-level domain. The following example shows how the command grep takes the listing at its standard input and removes the lines without hostnames:

# lastb -d -a --time-format notime | grep -v '[0-9]$' | head -n 10
nvidia   ssh:notty       (00:00)     vmd60532.contaboserver.net
n_tonson ssh:notty       (00:00)     vmd60532.contaboserver.net
nrostagn ssh:notty       (00:00)     vmd60532.contaboserver.net
pi       ssh:notty       (00:00)     132.red-88-20-39.staticip.rima-tde.net
pi       ssh:notty       (00:00)     132.red-88-20-39.staticip.rima-tde.net
nps      ssh:notty       (00:00)     vmd60532.contaboserver.net
narmadan ssh:notty       (00:00)     vmd60532.contaboserver.net
nominati ssh:notty       (00:00)     vmd60532.contaboserver.net
nominati ssh:notty       (00:00)     vmd60532.contaboserver.net
nominati ssh:notty       (00:00)     vmd60532.contaboserver.net

The grep option -v shows only the lines that don’t match the given regular expression. A regular expression matching any line ending with a digit (i.e. [0-9]$) will capture only the entries without a hostname. Therefore, grep -v '[0-9]$' will show only the lines ending with a hostname.
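Because /var/log/btmp differs on every system, the filter can be checked against a few lines copied from the sample lastb output above. A minimal sketch, with hypothetical variable names:

```shell
# Three entries copied from the sample listing: two IP addresses, one hostname.
attempts=$(printf '%s\n' \
  'user     ssh:notty       (00:00)     81.161.63.251' \
  'nps      ssh:notty       (00:00)     vmd60532.contaboserver.net' \
  'pi       ssh:notty       (00:00)     46.6.11.56')

# -v discards the lines ending with a digit, leaving only the hostname entry.
with_hostname=$(printf '%s\n' "$attempts" | grep -v '[0-9]$')
echo "$with_hostname"
```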

The output can be filtered even further, by keeping only the domain name and removing the other parts from each line. Command sed can do it with a substitution command to replace the whole line with a back-reference to the domain name in it:

# lastb -d -a --time-format notime | grep -v '[0-9]$' | sed -e 's/.* \(.*\)$/\1/' | head -n 10
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
132.red-88-20-39.staticip.rima-tde.net
132.red-88-20-39.staticip.rima-tde.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net

The escaped parentheses in .* \(.*\)$ tell sed to remember that part of the line, that is, the part between the last space character and the end of the line. In the example, this part is referenced with \1 and used to replace the entire line.
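The back-reference can be tested on a single entry taken from the sample listing. The sketch also shows the same substitution written as an extended regular expression with option -r, where the parentheses are not escaped:

```shell
# One entry copied from the sample lastb listing.
entry='nps      ssh:notty       (00:00)     vmd60532.contaboserver.net'

# Basic regular expression: the group is delimited by \( and \).
host=$(echo "$entry" | sed -e 's/.* \(.*\)$/\1/')
echo "$host"   # vmd60532.contaboserver.net

# Extended regular expression (-r): plain parentheses delimit the group.
host_ere=$(echo "$entry" | sed -r -e 's/.* (.*)$/\1/')
echo "$host_ere"
```

The leading .* is greedy, so it consumes everything up to the last space and leaves only the final column for the group.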

It’s clear that most remote hosts try to log in more than once, thus the same domain name repeats itself. Command uniq only suppresses repeated lines that are adjacent, so to suppress the repeated entries they first need to be sorted (with command sort) and then passed to the command uniq:

# lastb -d -a --time-format notime | grep -v '[0-9]$' | sed -e 's/.* \(.*\)$/\1/' | sort | uniq | head -n 10
116-25-254-113-on-nets.com
132.red-88-20-39.staticip.rima-tde.net
145-40-33-205.power-speed.at
tor.laquadrature.net
tor.momx.site
ua-83-226-233-154.bbcust.telenor.se
vmd38161.contaboserver.net
vmd60532.contaboserver.net
vmi488063.contaboserver.net
vmi515749.contaboserver.net

This shows how different commands can be combined to produce the desired outcome. The hostname list can then be used to write blocking firewall rules or to take other measures to enforce the security of the server.
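The role of sort in the last pipeline can be seen on a tiny list. In this sketch the host names are hypothetical placeholders, and the repeated entry is deliberately not adjacent to its first occurrence:

```shell
# Hypothetical host list in which b.example.net appears twice, but not in a row.
hosts=$(printf '%s\n' 'b.example.net' 'a.example.net' 'b.example.net')

# uniq alone compares only neighboring lines, so nothing is removed here.
unsorted=$(printf '%s\n' "$hosts" | uniq)
echo "$unsorted"

# Sorting first places the duplicates next to each other.
deduped=$(printf '%s\n' "$hosts" | sort | uniq)
echo "$deduped"
```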

Guided Exercises

Command last shows a listing of last logged in users, including their origin IPs. How would the egrep command be used to filter last output, showing only occurrences of an IPv4 address, discarding any additional information in the corresponding line?
What option should be given to grep in order to correctly filter the output generated by command find executed with option -print0?
Command uptime -s shows the last date when the system was powered on, as in 2019-08-05 20:13:22. What will be the result of command uptime -s | sed -e 's/(.*) (.*)/\1/'?
What option should be given to grep so it counts matching lines instead of displaying them?

Explorational Exercises

The basic structure of an HTML file starts with elements html, head and body, for example:
```
<html>
<head>
  <title>News Site</title>
</head>
<body>
  <h1>Headline</h1>
  <p>Information of interest.</p>
</body>
</html>
```
Describe how addresses could be used in sed to display only the body element and its contents.
What sed expression will remove all tags from an HTML document, keeping only the rendered text?
Files with extension .ovpn are very popular to configure VPN clients as they contain not only the settings, but also the contents of keys and certificates for the client. These keys and certificates are originally in separate files, so they need to be copied into the .ovpn file. Given the following excerpt of a .ovpn template:
```
client
dev tun
remote 192.168.1.155 1194
<ca>
ca.crt
</ca>
<cert>
client.crt
</cert>
<key>
client.key
</key>
<tls-auth>
ta.key
</tls-auth>
```
Assuming files ca.crt, client.crt, client.key and ta.key are in the current directory, how would the template configuration be modified by sed to replace each filename by its content?

Summary

This lesson covers the two most important Linux commands related to regular expressions: grep and sed. Scripts and compound commands rely on grep and sed to perform a wide range of text filtering and parsing tasks. The lesson goes through the following steps:

How to use grep and its variations such as egrep and fgrep.
How to use sed and its internal instructions to manipulate text.
Examples of regular expression applications using grep and sed.

Answers to Guided Exercises

Command last shows a listing of last logged in users, including their origin IPs. How would the egrep command be used to filter last output, showing only occurrences of an IPv4 address, discarding any additional information in the corresponding line?

last -i | egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
What option should be given to grep in order to correctly filter the output generated by command find executed with option -print0?

The option -z or --null-data, as in find . -print0 | grep -z expression.
Command uptime -s shows the last date when the system was powered on, as in 2019-08-05 20:13:22. What will be the result of command uptime -s | sed -e 's/(.*) (.*)/\1/'?

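An error will occur. By default, sed uses basic regular expressions, in which parentheses must be escaped, as in \(.*\), to create the groups that back-references refer to. Since the unescaped parentheses are taken literally, no group exists for \1 to refer to.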
What option should be given to grep so it counts matching lines instead of displaying them?

Option -c.
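The answer about uptime -s can be verified without depending on the machine's boot time. This sketch replaces the command's output with the sample timestamp from the question and compares the failing expression with two working forms:

```shell
# Sample timestamp from the exercise, standing in for the output of uptime -s.
stamp='2019-08-05 20:13:22'

# Unescaped parentheses in a basic regular expression: sed exits with an error.
if echo "$stamp" | sed -e 's/(.*) (.*)/\1/' 2>/dev/null
then
  status=ok
else
  status=error
fi
echo "$status"

# Escaped parentheses create the groups, so \1 holds the date.
date_bre=$(echo "$stamp" | sed -e 's/\(.*\) \(.*\)/\1/')

# With -r (extended regular expressions) the original expression works as is.
date_ere=$(echo "$stamp" | sed -r -e 's/(.*) (.*)/\1/')
echo "$date_bre"   # 2019-08-05
```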

Answers to Explorational Exercises

The basic structure of an HTML file starts with elements html, head and body, for example:
```
<html>
<head>
  <title>News Site</title>
</head>
<body>
  <h1>Headline</h1>
  <p>Information of interest.</p>
</body>
</html>
```
Describe how addresses could be used in sed to display only the body element and its contents.

To only show body, the addresses should be /<body>/,/<\/body>/, as in sed -n -e '/<body>/,/<\/body>/p'. Option -n is given to sed so it doesn’t print lines by default, hence the command p at the end of sed expression to print matching lines.
What sed expression will remove all tags from an HTML document, keeping only the rendered text?

The sed expression s/<[^>]*>//g will replace any content enclosed in <> by an empty string.
Files with extension .ovpn are very popular to configure VPN clients as they contain not only the settings, but also the contents of keys and certificates for the client. These keys and certificates are originally in separate files, so they need to be copied into the .ovpn file. Given the following excerpt of a .ovpn template:
```
client
dev tun
remote 192.168.1.155 1194
<ca>
ca.crt
</ca>
<cert>
client.crt
</cert>
<key>
client.key
</key>
<tls-auth>
ta.key
</tls-auth>
```
Assuming files ca.crt, client.crt, client.key and ta.key are in the current directory, how would the template configuration be modified by sed to replace each filename by its content?

The command
```
sed -r -e 's/(^[^.]*)\.(crt|key)$/cat \1.\2/e' < client.template > client.ovpn
```
replaces any line ending in .crt or .key with the content of the file whose name equals the line. Option -r tells sed to use extended regular expressions, whilst the flag e at the end of the expression, a GNU sed extension, tells sed to execute the result of the substitution as a command and to replace the line with its output, here the output of cat \1.\2. The backreferences \1 and \2 correspond to the filename and extension found in the match.
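The three answers above can be tried on throwaway files. The following is a minimal sketch assuming GNU sed (the e flag is not available elsewhere); the certificate and key contents are hypothetical placeholders and the template is shortened to two entries:

```shell
work=$(mktemp -d)

# Sample HTML page from the exercise.
printf '%s\n' '<html>' '<head>' '  <title>News Site</title>' '</head>' \
  '<body>' '  <h1>Headline</h1>' '  <p>Information of interest.</p>' \
  '</body>' '</html>' > "$work/page.html"

# Range address: print from the line with <body> to the line with </body>.
body=$(sed -n -e '/<body>/,/<\/body>/p' "$work/page.html")
echo "$body"

# Remove every tag, leaving only the text (and some empty lines).
text=$(sed -e 's/<[^>]*>//g' "$work/page.html")
echo "$text"

# Shortened template and placeholder files for the .ovpn exercise.
printf '%s\n' 'client' '<ca>' 'ca.crt' '</ca>' '<key>' 'client.key' '</key>' \
  > "$work/client.template"
printf '%s\n' 'CA CERTIFICATE PLACEHOLDER' > "$work/ca.crt"
printf '%s\n' 'CLIENT KEY PLACEHOLDER' > "$work/client.key"

# Run inside the directory so that cat finds the files named in the template.
rendered=$(cd "$work" && sed -r -e 's/(^[^.]*)\.(crt|key)$/cat \1.\2/e' < client.template)
echo "$rendered"

rm -r "$work"
```

Because the e flag runs whatever command the substitution produces, it is best reserved for templates whose contents are trusted.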