103.7 Lesson 2
Certificate: | LPIC-1 |
---|---|
Version: | 5.0 |
Topic: | 103 GNU and Unix Commands |
Objective: | 103.7 Search text files using regular expressions |
Lesson: | 2 of 2 |
Introduction
Streaming data through a chain of piped commands allows for the application of compound filters based on regular expressions. Regular expressions are an important technique used not only in system administration, but also in data mining and related areas. Two commands are especially well suited to manipulating files and text data using regular expressions: grep and sed. grep is a pattern finder and sed is a stream editor. They are useful by themselves, but it is when working together with other processes that they stand out.
The Pattern Finder: grep
One of the most common uses of grep is to facilitate the inspection of long files, using the regular expression as a filter applied to each line. It can be used to show only the lines starting with a certain term. For example, grep can be used to investigate a configuration file for kernel modules, listing only option lines:
$ grep '^options' /etc/modprobe.d/alsa-base.conf
options snd-pcsp index=-2
options snd-usb-audio index=-2
options bt87x index=-2
options cx88_alsa index=-2
options snd-atiixp-modem index=-2
options snd-intel8x0m index=-2
options snd-via82xx-modem index=-2
The pipe character | can be employed to redirect the output of a command directly to grep's input. The following example uses a bracket expression to select the lines of fdisk -l output that start with Disk /dev/sda or Disk /dev/sdb:
# fdisk -l | grep '^Disk /dev/sd[ab]'
Disk /dev/sda: 320.1 GB, 320072933376 bytes, 625142448 sectors
Disk /dev/sdb: 7998 MB, 7998537728 bytes, 15622144 sectors
The mere selection of lines with matches may not be appropriate for a particular task, requiring adjustments to grep's behavior through its options. For example, option -c or --count tells grep to show how many lines had matches:
# fdisk -l | grep '^Disk /dev/sd[ab]' -c
2
The option can be placed before or after the regular expression. Other important grep options are:
-c or --count: Instead of displaying the matching lines, display only the number of lines that match in each file.
-i or --ignore-case: Make the search case-insensitive.
-f FILE or --file=FILE: Indicate a file containing the regular expression to use.
-n or --line-number: Show the line number of each matching line.
-v or --invert-match: Select every line, except those containing matches.
-H or --with-filename: Also print the name of the file containing the line.
-z or --null-data: Treat input and output data as sequences of lines terminated by a null byte instead of a newline. When grep processes the output of the find command run with its -print0 option, the -z or --null-data option should be used so that the stream is processed in the same manner.
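The options listed above can be tried on a throwaway file. The following is only a sketch: the file names and their contents are made up for illustration, and GNU grep is assumed.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'Error: disk full' 'warning: low memory' 'error: timeout' 'info: ok' > "$tmp/app.log"
printf '%s\n' '^info' > "$tmp/patterns.txt"

# -c with -i: count the lines starting with "error", ignoring case.
count=$(grep -c -i '^error' "$tmp/app.log")
# -n: prefix each matching line with its line number.
numbered=$(grep -n 'warning' "$tmp/app.log")
# -v: select the lines that do NOT match.
inverted=$(grep -v -i 'error' "$tmp/app.log")
# -f: read the regular expression from a file.
from_file=$(grep -f "$tmp/patterns.txt" "$tmp/app.log")
# -H: print the file name even though only one file is searched.
with_name=$(cd "$tmp" && grep -H 'timeout' app.log)
# -z: records end with a null byte instead of a newline, as produced by
# find -print0. tr turns the null bytes into newlines only for display.
nul_match=$(printf 'a.txt\000b.log\000c.txt\000' | grep -z '\.txt$' | tr '\000' '\n')

printf '%s\n' "$count" "$numbered" "$inverted" "$from_file" "$with_name" "$nul_match"
rm -r "$tmp"
```

Each result is kept in its own variable so that the effect of a single option can be inspected separately.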
Although -H is activated by default when multiple file paths are given as input, it is not activated for a single file. That may be critical in special situations, such as when grep is called directly by find:
$ find /usr/share/doc -type f -exec grep -i '3d modeling' "{}" \; | cut -c -100
artistic aspects of 3D modeling. Thus this might be the application you are
This major approach of 3D modeling has not been supported
oce is a C++ 3D modeling library. It can be used to develop CAD/CAM softwares, for instance [FreeCad
In this example, find lists every file under /usr/share/doc and passes each one to grep, which in turn performs a case-insensitive search for 3d modeling inside the file. The pipe to cut is there just to limit output length to 100 columns. Note, however, that there is no way of knowing which file each line came from. This issue is solved by adding -H to grep:
$ find /usr/share/doc -type f -exec grep -i -H '3d modeling' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support
Now it is possible to identify the files where each match was found. To make the listing even more informative, leading and trailing lines can be added to lines with matches:
$ find /usr/share/doc -type f -exec grep -i -H -1 '3d modeling' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md-application Blender), OpenSCAD focuses on the CAD aspects rather t
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/openscad/README.md-looking for when you are planning to create 3D models of machine p
/usr/share/doc/opencsg/doc/publications.html-3D graphics library for Constructive Solid Geometry (CS
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support
/usr/share/doc/opencsg/doc/publications.html-by real-time computer graphics until recently.
The option -1 instructs grep to include one line before and one line after each line with a match. These extra lines are called context lines and are identified in the output by a minus sign after the file name. The same result can be obtained with -C 1 or --context=1, and other numbers of context lines may be indicated.
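Context lines are easier to see on a tiny input. The sketch below uses made-up lines and assumes GNU grep. Besides -C, it uses the related options -B and -A, which are not covered in the text above and restrict the context to the lines before or after the match.

```shell
# Hypothetical five-line input; only the third line matches.
sample() {
  printf '%s\n' 'line one' 'line two' 'match here' 'line four' 'line five'
}

# -C 1 (same as -1 or --context=1): one line before and one line after.
around=$(sample | grep -C 1 'match')
# -B 1: only the leading context line.
before=$(sample | grep -B 1 'match')
# -A 1: only the trailing context line.
after=$(sample | grep -A 1 'match')

printf '%s\n---\n' "$around" "$before" "$after"
```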
There are two complementary programs to grep: egrep and fgrep. The program egrep is equivalent to the command grep -E, which enables extended regular expressions in addition to the basic ones. For example, with egrep it is possible to use extended regular expression features, like branching:
$ find /usr/share/doc -type f -exec egrep -i -H -1 '3d (modeling|printing)' "{}" \; | cut -c -100
/usr/share/doc/openscad/README.md-application Blender), OpenSCAD focuses on the CAD aspects rather t
/usr/share/doc/openscad/README.md:artistic aspects of 3D modeling. Thus this might be the applicatio
/usr/share/doc/openscad/README.md-looking for when you are planning to create 3D models of machine p
/usr/share/doc/openscad/RELEASE_NOTES.md-* Support for using 3D-Mouse / Joystick / Gamepad input dev
/usr/share/doc/openscad/RELEASE_NOTES.md:* 3D Printing support: Purchase from a print service partne
/usr/share/doc/openscad/RELEASE_NOTES.md-* New export file formats: SVG, 3MF, AMF
/usr/share/doc/opencsg/doc/publications.html-3D graphics library for Constructive Solid Geometry (CS
/usr/share/doc/opencsg/doc/publications.html:This major approach of 3D modeling has not been support
/usr/share/doc/opencsg/doc/publications.html-by real-time computer graphics until recently.
In this example, either 3D modeling or 3D printing will match the expression, regardless of case. To display only the parts of a text stream that match the expression used by egrep, use the -o option.
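A short sketch of the -o option combined with branching, using made-up input lines. It calls grep -E rather than egrep, because recent GNU grep releases print a warning that the egrep and fgrep commands are obsolescent.

```shell
# Hypothetical input lines.
sample() {
  printf '%s\n' \
    'Intro to 3D modeling and 3D printing' \
    'No match on this line' \
    'Notes on 3d Printing'
}

# -E enables extended regular expressions (branching with |),
# -i ignores case and -o prints each match on a line of its own.
matches=$(sample | grep -E -i -o '3d (modeling|printing)')
printf '%s\n' "$matches"
```

The first input line contains two matches, so it contributes two output lines.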
The program fgrep is equivalent to grep -F, that is, it does not parse regular expressions. It is useful in simple searches where the goal is to match a literal expression. Therefore, special characters like the dollar sign and the dot will be taken literally and not by their meanings in a regular expression.
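The difference between a regular expression and a literal search can be sketched with a few made-up price lines. The example uses grep -F, the equivalent of fgrep.

```shell
# Hypothetical input lines.
sample() {
  printf '%s\n' 'price: $5.00' 'price: 5000' 'total: $5x00'
}

# As a regular expression the dot matches any character, so the
# pattern 5.00 also matches "5000" and "5x00": three lines in total.
as_regex=$(sample | grep -c '5.00')
# With -F the pattern is a literal string, so only $5.00 itself matches.
as_literal=$(sample | grep -F '$5.00')

printf '%s\n' "$as_regex" "$as_literal"
```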
The Stream Editor: sed
The purpose of the sed program is to modify text-based data in a non-interactive way. This means that all the editing is done by predefined instructions, not by arbitrarily typing directly into a text displayed on the screen. In modern terms, sed can be understood as a template parser: given a text as input, it places custom content at predefined positions or when it finds a match for a regular expression.
As the name implies, sed is well suited to text streamed through pipelines. Its basic syntax is sed -f SCRIPT when editing instructions are stored in the file SCRIPT, or sed -e COMMANDS to execute COMMANDS directly from the command line. If neither -f nor -e is present, sed uses the first non-option parameter as the script itself, that is, as the editing commands. It is also possible to use a file as the input just by giving its path as an argument to sed.
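The different ways of passing instructions and input to sed can be compared side by side. This is a minimal sketch in which input.txt and script.sed are hypothetical files created just for the demonstration.

```shell
# Hypothetical input and script files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'first' 'second' 'third' > "$tmp/input.txt"
printf '%s\n' '2d' > "$tmp/script.sed"   # one instruction: delete line 2

# Instructions given on the command line with -e.
with_e=$(sed -e '2d' "$tmp/input.txt")
# Instructions read from a script file with -f.
with_f=$(sed -f "$tmp/script.sed" "$tmp/input.txt")
# Neither -e nor -f: the first non-option argument is the script.
bare=$(sed '2d' "$tmp/input.txt")
# No input file: sed reads from its standard input.
piped=$(sed '2d' < "$tmp/input.txt")

printf '%s\n' "$with_e"
rm -r "$tmp"
```

All four invocations produce the same two lines, first and third.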
sed instructions are composed of a single character, possibly preceded by an address or followed by one or more options, and are applied to one line at a time. Addresses can be a single line number, a regular expression, or a range of lines. For example, the first line of a text stream can be deleted with 1d, where 1 specifies the line where the delete command d will be applied. To clarify sed's usage, take the output of the command factor `seq 12`, which returns the prime factors for numbers 1 to 12:
$ factor `seq 12`
1:
2: 2
3: 3
4: 2 2
5: 5
6: 2 3
7: 7
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3
Deleting the first line with sed is accomplished by 1d:
$ factor `seq 12` | sed 1d
2: 2
3: 3
4: 2 2
5: 5
6: 2 3
7: 7
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3
A range of lines can be specified with a separating comma:
$ factor `seq 12` | sed 1,7d
8: 2 2 2
9: 3 3
10: 2 5
11: 11
12: 2 2 3
More than one instruction can be used in the same execution, separated by semicolons. In this case, however, it is important to enclose them in quotes so the semicolon is not interpreted by the shell:
$ factor `seq 12` | sed "1,7d;11d"
8: 2 2 2
9: 3 3
10: 2 5
12: 2 2 3
In this example, two deletion instructions were executed, first on lines ranging from 1 to 7 and then on line 11. An address can also be a regular expression, so only lines with a match will be affected by the instruction:
$ factor `seq 12` | sed "1d;/:.*2.*/d"
3: 3
5: 5
7: 7
9: 3 3
11: 11
The regular expression :.*2.* matches any occurrence of the digit 2 anywhere after a colon, causing the deletion of lines corresponding to numbers with 2 as a factor. With sed, anything placed between slashes (/) is considered a regular expression, and basic regular expressions are supported by default. For example, sed -e "/^#/d" /etc/services shows the contents of the file /etc/services without the lines beginning with # (comment lines).
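The comment-stripping command can be tried without touching /etc/services. The sketch below feeds sed a few made-up lines in a similar layout and also shows the complementary selection, which prints only the comment lines with option -n and instruction p.

```shell
# Hypothetical lines resembling a services file.
sample() {
  printf '%s\n' '# Network services' 'ssh 22/tcp' '# mail' 'smtp 25/tcp'
}

# Delete every line matching the address /^#/.
no_comments=$(sample | sed -e '/^#/d')
# -n suppresses automatic printing, p prints only the addressed lines.
only_comments=$(sample | sed -n -e '/^#/p')

printf '%s\n' "$no_comments" "$only_comments"
```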
The delete instruction d is only one of the many editing instructions provided by sed. Instead of deleting a line, sed can replace it with a given text:
$ factor `seq 12` | sed "1d;/:.*2.*/c REMOVED"
REMOVED
3: 3
REMOVED
5: 5
REMOVED
7: 7
REMOVED
9: 3 3
REMOVED
11: 11
REMOVED
The instruction c REMOVED simply replaces a line with the text REMOVED. In the example, every line with a substring matching the regular expression :.*2.* is affected by instruction c REMOVED. Instruction a TEXT copies the text indicated by TEXT to a new line after the line with a match. The instruction r FILE does the same, but copies the contents of the file indicated by FILE. Instruction w does the opposite of r, that is, the matching line is written to the indicated file.
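The instructions a, r and w can be sketched with a few temporary files. All file names and contents below are hypothetical, and GNU sed is assumed, since the one-line form of the a instruction is a GNU extension.

```shell
# Hypothetical files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$tmp/input.txt"
printf '%s\n' 'from file' > "$tmp/extra.txt"

# a TEXT: add a new line after each line matching /beta/.
appended=$(sed -e '/beta/a AFTER BETA' "$tmp/input.txt")
# r FILE: add the contents of a file after each matching line.
inserted=$(sed -e "/beta/r $tmp/extra.txt" "$tmp/input.txt")
# w FILE: write the matching lines to a file; -n keeps stdout empty.
sed -n -e "/^[ab]/w $tmp/saved.txt" "$tmp/input.txt"
saved=$(cat "$tmp/saved.txt")

printf '%s\n---\n' "$appended" "$inserted" "$saved"
rm -r "$tmp"
```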
By far the most used sed instruction is s/FIND/REPLACE/, which is used to replace a match to the regular expression FIND with the text indicated by REPLACE. For example, the instruction s/hda/sda/ replaces a substring matching the literal RE hda with sda. Only the first match found in the line will be replaced, unless the flag g is placed after the instruction, as in s/hda/sda/g.
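The effect of the g flag is easiest to see on a line with two matches. The input line in this sketch is made up for illustration.

```shell
# Hypothetical line containing the substring hda twice.
line='/dev/hda1 and /dev/hda2'

# Without g, only the first match in the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/hda/sda/')
# With g, every match in the line is replaced.
every=$(printf '%s\n' "$line" | sed -e 's/hda/sda/g')

printf '%s\n' "$first_only" "$every"
```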
A more realistic case study will help to illustrate sed's features. Suppose a medical clinic wants to send text messages to its customers, reminding them of their scheduled appointments for the next day. A typical implementation scenario relies on a professional instant message service, which provides an API to access the system responsible for delivering the messages. These messages usually originate from the same system that runs the application controlling the customers' appointments, triggered by a specific time of the day or some other event. In this hypothetical situation, the application could generate a file called appointments.csv containing tabulated data with all the appointments for the next day, which is then used by sed to render the text messages from a template file called template.txt. CSV files are a standard way of exporting data from database queries, so sample appointments could be given as follows:
$ cat appointments.csv
"NAME","TIME","PHONE"
"Carol","11am","55557777"
"Dave","2pm","33334444"
The first line holds the labels for each column, which will be used to match the tags inside the sample template file:
$ cat template.txt
Hey <NAME>, don't forget your appointment tomorrow at <TIME>.
The less-than (<) and greater-than (>) signs were put around the labels just to help identify them as tags. The following Bash script parses all enqueued appointments using template.txt as the message template:
#! /bin/bash
TEMPLATE=`cat template.txt`
TAGS=(`sed -ne '1s/^"//;1s/","/\n/g;1s/"$//p' appointments.csv`)
mapfile -t -s 1 ROWS < appointments.csv
for (( r = 0; r < ${#ROWS[*]}; r++ ))
do
  MSG=$TEMPLATE
  VALS=(`sed -e 's/^"//;s/","/\n/g;s/"$//' <<<${ROWS[$r]}`)
  for (( c = 0; c < ${#TAGS[*]}; c++ ))
  do
    MSG=`sed -e "s/<${TAGS[$c]}>/${VALS[$c]}/g" <<<"$MSG"`
  done
  echo curl --data message=\"$MSG\" --data phone=\"${VALS[2]}\" https://mysmsprovider/api
done
An actual production script would also handle authentication, error checking and logging, but the example has basic functionality to start with. The first instructions executed by sed are applied only to the first line, as indicated by the address 1 in 1s/^"//;1s/","/\n/g;1s/"$//p. They remove the leading and trailing quotes (1s/^"// and 1s/"$//) and replace the field separators with a newline character (1s/","/\n/g). Only the first line is needed for loading column names, so all other lines are suppressed by option -n, which requires the flag p to be placed after the last sed command to print the processed first line. The tags are then stored in the TAGS variable as a Bash array. Another Bash array variable is created by the command mapfile to store the lines containing the appointments in the array variable ROWS.
A for loop is employed to process each appointment line found in ROWS. Then, quotes and separators in the appointment (held in the variable ${ROWS[$r]} and passed to sed as a here string) are replaced by sed, similarly to the commands used to load the tags. The separated values for the appointment are then stored in the array variable VALS, where array subscripts 0, 1 and 2 correspond to values for NAME, TIME and PHONE.
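The parsing steps described so far can be isolated in a short Bash sketch. It is a variation on the script rather than a copy of it: the CSV content is held in a hypothetical variable instead of a file, and the arrays are filled with mapfile instead of unquoted backticks. Bash 4 or later and GNU sed (for \n in the replacement) are assumed.

```shell
#!/bin/bash
# Hypothetical CSV content, the same sample rows as appointments.csv.
csv=$(printf '%s\n' '"NAME","TIME","PHONE"' '"Carol","11am","55557777"' '"Dave","2pm","33334444"')

# Header: -n suppresses automatic printing, the address 1 restricts every
# command to the first line and the final p prints the processed result.
mapfile -t TAGS < <(sed -n -e '1s/^"//;1s/","/\n/g;1s/"$//p' <<<"$csv")

# Data rows: -s 1 skips the header line.
mapfile -t -s 1 ROWS <<<"$csv"

# First data row, split the same way (no address, so no -n or p needed).
mapfile -t VALS < <(sed -e 's/^"//;s/","/\n/g;s/"$//' <<<"${ROWS[0]}")

printf 'tags: %s\n' "${TAGS[*]}"
printf 'rows: %s\n' "${#ROWS[@]}"
printf 'vals: %s\n' "${VALS[*]}"
```

Reading one array element per line with mapfile keeps values intact even if they contain spaces, which word splitting of a backtick substitution would not.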
Finally, a nested for loop walks through the TAGS array and replaces each tag found in the template with its corresponding value in VALS. The MSG variable holds a copy of the rendered template, updated by the substitution command s/<${TAGS[$c]}>/${VALS[$c]}/g on every loop pass through TAGS.
This results in a rendered message like: "Hey Carol, don't forget your appointment tomorrow at 11am." The rendered message can then be sent as a parameter through an HTTP request with curl, as a mail message or by any other similar method.
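The rendering loop can be run on its own for a single appointment. In this Bash sketch the template, tags and values are written directly into the variables as sample data, instead of being loaded from template.txt and appointments.csv.

```shell
#!/bin/bash
# Sample data hard-coded for the demonstration.
TEMPLATE="Hey <NAME>, don't forget your appointment tomorrow at <TIME>."
TAGS=(NAME TIME PHONE)
VALS=(Carol 11am 55557777)

MSG=$TEMPLATE
for (( c = 0; c < ${#TAGS[@]}; c++ ))
do
  # Replace every occurrence of the current tag with its value.
  MSG=$(sed -e "s/<${TAGS[$c]}>/${VALS[$c]}/g" <<<"$MSG")
done

echo "$MSG"
```

Because the values are pasted into the sed instruction, a value containing a slash or an ampersand would change the meaning of the substitution, so real data would need to be escaped first.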
Combining grep and sed
Commands grep and sed can be used together when more complex text mining procedures are required. As a system administrator, you may want to inspect all the login attempts to a server, for example. The file /var/log/wtmp records all logins and logouts, whilst the file /var/log/btmp records the failed login attempts. They are written in a binary format, which can be read by the commands last and lastb, respectively.
The output of lastb shows not only the username used in the bad login attempt, but also the IP address it came from:
# lastb -d -a -n 10 --time-format notime
user ssh:notty (00:00) 81.161.63.251
nrostagn ssh:notty (00:00) vmd60532.contaboserver.net
pi ssh:notty (00:00) 132.red-88-20-39.staticip.rima-tde.net
pi ssh:notty (00:00) 132.red-88-20-39.staticip.rima-tde.net
pi ssh:notty (00:00) 46.6.11.56
pi ssh:notty (00:00) 46.6.11.56
nps ssh:notty (00:00) vmd60532.contaboserver.net
narmadan ssh:notty (00:00) vmd60532.contaboserver.net
nominati ssh:notty (00:00) vmd60532.contaboserver.net
nominati ssh:notty (00:00) vmd60532.contaboserver.net
Option -d translates the IP address to the corresponding hostname. The hostname may provide clues about the ISP or hosting service used to perform these bad login attempts. Option -a puts the hostname in the last column, which facilitates the filtering yet to be applied. Option --time-format notime suppresses the time when the login attempt occurred. Command lastb can take some time to finish if there were too many bad login attempts, so the output was limited to ten entries with the option -n 10.
Not all remote IPs have a hostname associated with them, so reverse DNS does not apply to them and they can be dismissed. Although you could write a regular expression to match the expected format for a hostname at the end of the line, it is probably simpler to test whether the last character of the line is a letter or a digit. The following example shows how the command grep takes the listing at its standard input and removes the lines without hostnames:
# lastb -d -a --time-format notime | grep -v '[0-9]$' | head -n 10
nvidia ssh:notty (00:00) vmd60532.contaboserver.net
n_tonson ssh:notty (00:00) vmd60532.contaboserver.net
nrostagn ssh:notty (00:00) vmd60532.contaboserver.net
pi ssh:notty (00:00) 132.red-88-20-39.staticip.rima-tde.net
pi ssh:notty (00:00) 132.red-88-20-39.staticip.rima-tde.net
nps ssh:notty (00:00) vmd60532.contaboserver.net
narmadan ssh:notty (00:00) vmd60532.contaboserver.net
nominati ssh:notty (00:00) vmd60532.contaboserver.net
nominati ssh:notty (00:00) vmd60532.contaboserver.net
nominati ssh:notty (00:00) vmd60532.contaboserver.net
The grep option -v shows only the lines that do not match the given regular expression. A regular expression matching any line ending with a digit (i.e. [0-9]$) will capture only the entries without a hostname. Therefore, grep -v '[0-9]$' will show only the lines ending with a hostname.
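Since lastb needs root privileges and a populated /var/log/btmp, the filter can be tested on a few lines copied from the listing above. This is only a sketch of the grep step, not of lastb itself.

```shell
# Lines copied from the sample lastb listing.
sample() {
  printf '%s\n' \
    'user ssh:notty (00:00) 81.161.63.251' \
    'pi ssh:notty (00:00) 132.red-88-20-39.staticip.rima-tde.net' \
    'pi ssh:notty (00:00) 46.6.11.56' \
    'nps ssh:notty (00:00) vmd60532.contaboserver.net'
}

# -v drops the lines ending with a digit, that is, with a bare IP address.
with_hostname=$(sample | grep -v '[0-9]$')
printf '%s\n' "$with_hostname"
```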
The output can be filtered even further, by keeping only the domain name and removing the other parts from each line. Command sed can do it with a substitution command to replace the whole line with a back-reference to the domain name in it:
# lastb -d -a --time-format notime | grep -v '[0-9]$' | sed -e 's/.* \(.*\)$/\1/' | head -n 10
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
132.red-88-20-39.staticip.rima-tde.net
132.red-88-20-39.staticip.rima-tde.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
vmd60532.contaboserver.net
The escaped parentheses in .* \(.*\)$ tell sed to remember that part of the line, that is, the part between the last space character and the end of the line. In the example, this part is referenced with \1 and used to replace the entire line.
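The back-reference can be checked on a single line taken from the listing. The sketch also shows the same substitution written with extended regular expressions, where the parentheses are not escaped. Option -E is assumed to be available, as it is in current GNU sed.

```shell
# One line copied from the sample lastb listing.
line='nps ssh:notty (00:00) vmd60532.contaboserver.net'

# Basic regular expressions: the group is delimited by \( and \).
host=$(printf '%s\n' "$line" | sed -e 's/.* \(.*\)$/\1/')
# Extended regular expressions (-E): plain parentheses delimit the group.
host_ere=$(printf '%s\n' "$line" | sed -E -e 's/.* (.*)$/\1/')

printf '%s\n' "$host" "$host_ere"
```

The leading .* is greedy, so it consumes everything up to the last space and leaves only the final field for the group.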
It is clear that most remote hosts try to log in more than once, thus the same domain name repeats itself. To suppress the repeated entries, first they need to be sorted (with command sort) and then passed to the command uniq:
# lastb -d -a --time-format notime | grep -v '[0-9]$' | sed -e 's/.* \(.*\)$/\1/' | sort | uniq | head -n 10
116-25-254-113-on-nets.com
132.red-88-20-39.staticip.rima-tde.net
145-40-33-205.power-speed.at
tor.laquadrature.net
tor.momx.site
ua-83-226-233-154.bbcust.telenor.se
vmd38161.contaboserver.net
vmd60532.contaboserver.net
vmi488063.contaboserver.net
vmi515749.contaboserver.net
This shows how different commands can be combined to produce the desired outcome. The hostname list can then be used to write blocking firewall rules or to take other measures to enforce the security of the server.
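The reason for sorting before calling uniq is that uniq only removes repeated lines when they are adjacent. A small sketch with made-up hostnames illustrates this.

```shell
# Hypothetical hostnames with non-adjacent repetitions.
hosts() {
  printf '%s\n' 'b.example.net' 'a.example.net' 'b.example.net' 'a.example.net' 'c.example.net'
}

# Without sorting, no two equal lines are adjacent: nothing is removed.
unsorted_count=$(hosts | uniq | grep -c '')
# After sorting, the repeated lines are adjacent and uniq suppresses them.
unique=$(hosts | sort | uniq)

printf '%s\n' "$unsorted_count" "$unique"
```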
Guided Exercises
- Command last shows a listing of last logged in users, including their origin IPs. How would the egrep command be used to filter last output, showing only occurrences of an IPv4 address, discarding any additional information in the corresponding line?
- What option should be given to grep in order to correctly filter the output generated by command find executed with option -print0?
- Command uptime -s shows the last date when the system was powered on, as in 2019-08-05 20:13:22. What will be the result of command uptime -s | sed -e 's/(.*) (.*)/\1/'?
- What option should be given to grep so it counts matching lines instead of displaying them?
Explorational Exercises
- The basic structure of an HTML file starts with elements html, head and body, for example:
  <html>
  <head>
    <title>News Site</title>
  </head>
  <body>
    <h1>Headline</h1>
    <p>Information of interest.</p>
  </body>
  </html>
  Describe how addresses could be used in sed to display only the body element and its contents.
- What sed expression will remove all tags from an HTML document, keeping only the rendered text?
- Files with extension .ovpn are very popular to configure VPN clients as they contain not only the settings, but also the contents of keys and certificates for the client. These keys and certificates are originally in separate files, so they need to be copied into the .ovpn file. Given the following excerpt of a .ovpn template:
  client
  dev tun
  remote 192.168.1.155 1194
  <ca>
  ca.crt
  </ca>
  <cert>
  client.crt
  </cert>
  <key>
  client.key
  </key>
  <tls-auth>
  ta.key
  </tls-auth>
  Assuming files ca.crt, client.crt, client.key and ta.key are in the current directory, how would the template configuration be modified by sed to replace each filename by its content?
Summary
This lesson covers the two most important Linux commands related to regular expressions: grep and sed. Scripts and compound commands rely on grep and sed to perform a wide range of text filtering and parsing tasks. The lesson goes through the following steps:
- How to use grep and its variations such as egrep and fgrep.
- How to use sed and its internal instructions to manipulate text.
- Examples of regular expression applications using grep and sed.
Answers to Guided Exercises
- Command last shows a listing of last logged in users, including their origin IPs. How would the egrep command be used to filter last output, showing only occurrences of an IPv4 address, discarding any additional information in the corresponding line?
  last -i | egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
- What option should be given to grep in order to correctly filter the output generated by command find executed with option -print0?
  The option -z or --null-data, as in find . -print0 | grep -z expression.
- Command uptime -s shows the last date when the system was powered on, as in 2019-08-05 20:13:22. What will be the result of command uptime -s | sed -e 's/(.*) (.*)/\1/'?
  An error will occur. With basic regular expressions, which sed uses by default, parentheses must be escaped, as in \( and \), to create the groups that back-references refer to.
- What option should be given to grep so it counts matching lines instead of displaying them?
  Option -c.
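The first and third answers can be checked without depending on the local login history or boot time. The sketch below uses made-up lines resembling last -i output and the sample date from the exercise. It calls grep -E, the equivalent of egrep, and assumes that sed accepts option -E, as current GNU sed does.

```shell
# Hypothetical lines resembling the output of "last -i".
sample() {
  printf '%s\n' \
    'carol    pts/0        192.168.1.10     Mon Aug  5 20:13   still logged in' \
    'dave     pts/1        10.0.0.7         Mon Aug  5 19:02 - 19:40  (00:38)'
}

# -o prints only the part of each line that matches the IPv4 pattern.
ips=$(sample | grep -E -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')

# Sample value in the format printed by "uptime -s".
stamp='2019-08-05 20:13:22'

# Unescaped parentheses with basic regular expressions: sed reports an error.
if printf '%s\n' "$stamp" | sed -e 's/(.*) (.*)/\1/' >/dev/null 2>&1
then
  unescaped=ok
else
  unescaped=error
fi

# Escaped parentheses, or extended regular expressions with -E, both work.
date_bre=$(printf '%s\n' "$stamp" | sed -e 's/\(.*\) \(.*\)/\1/')
date_ere=$(printf '%s\n' "$stamp" | sed -E -e 's/(.*) (.*)/\1/')

printf '%s\n' "$ips" "$unescaped" "$date_bre" "$date_ere"
```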
Answers to Explorational Exercises
- The basic structure of an HTML file starts with elements html, head and body, for example:
  <html>
  <head>
    <title>News Site</title>
  </head>
  <body>
    <h1>Headline</h1>
    <p>Information of interest.</p>
  </body>
  </html>
  Describe how addresses could be used in sed to display only the body element and its contents.
  To only show body, the addresses should be /<body>/,/<\/body>/, as in sed -n -e '/<body>/,/<\/body>/p'. Option -n is given to sed so it doesn't print lines by default, hence the command p at the end of the sed expression to print matching lines.
- What sed expression will remove all tags from an HTML document, keeping only the rendered text?
  The sed expression s/<[^>]*>//g will replace any content enclosed in <> by an empty string.
- Files with extension .ovpn are very popular to configure VPN clients as they contain not only the settings, but also the contents of keys and certificates for the client. These keys and certificates are originally in separate files, so they need to be copied into the .ovpn file. Given the following excerpt of a .ovpn template:
  client
  dev tun
  remote 192.168.1.155 1194
  <ca>
  ca.crt
  </ca>
  <cert>
  client.crt
  </cert>
  <key>
  client.key
  </key>
  <tls-auth>
  ta.key
  </tls-auth>
  Assuming files ca.crt, client.crt, client.key and ta.key are in the current directory, how would the template configuration be modified by sed to replace each filename by its content?
  The command
  sed -r -e 's/(^[^.]*)\.(crt|key)$/cat \1.\2/e' < client.template > client.ovpn
  replaces any line terminating in .crt or .key with the content of a file whose name equals the line. Option -r tells sed to use extended regular expressions, whilst e at the end of the expression tells sed to replace matches with the output of command cat \1.\2. The backreferences \1 and \2 correspond to the filename and extension found in the match.
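The three answers can be tried on small stand-in files. This is a sketch under stated assumptions: the HTML is the sample from the exercise, the certificate file holds placeholder text instead of a real certificate, the template is shortened to a single block, and GNU sed is required for the e flag.

```shell
# Sample HTML from the exercise.
html() {
  printf '%s\n' '<html>' '<head>' '<title>News Site</title>' '</head>' \
    '<body>' '<h1>Headline</h1>' '<p>Information of interest.</p>' '</body>' '</html>'
}

# Range address: print from the line with <body> to the line with </body>.
body=$(html | sed -n -e '/<body>/,/<\/body>/p')
# Remove every tag, then delete the lines left empty (extra step for display).
text=$(html | sed -e 's/<[^>]*>//g' -e '/^$/d')

# Hypothetical stand-ins for a certificate and a shortened template.
tmp=$(mktemp -d)
printf '%s\n' 'PLACEHOLDER CA DATA' > "$tmp/ca.crt"
printf '%s\n' '<ca>' 'ca.crt' '</ca>' > "$tmp/client.template"

# The e flag runs the substituted line (cat ca.crt) as a command and
# replaces the line with the output of that command.
rendered=$(cd "$tmp" && sed -r -e 's/(^[^.]*)\.(crt|key)$/cat \1.\2/e' < client.template)

printf '%s\n---\n' "$body" "$text" "$rendered"
rm -r "$tmp"
```

Because the e flag executes whatever the line becomes, it should only be used on templates whose content is trusted.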