3.2 Lesson 2
Certificate: |
Linux Essentials |
---|---|
Version: |
1.6 |
Topic: |
3 The Power of the Command Line |
Objective: |
3.2 Searching and Extracting Data from Files |
Lesson: |
2 of 2 |
Introduction
In this lesson, we are going to look at tools that are used to manipulate text. These tools are frequently used by system administrators or programs to automatically monitor or identify specific recurring information.
Searching within Files with grep
The first tool that we will discuss in this lesson is the grep
command. grep
is the abbreviation of the phrase “global regular expression print” and its main functionality is to search within files for the specified pattern. The command outputs the line containing the specified pattern highlighted in red.
$ grep bash /etc/passwd root:x:0:0:root:/root:/bin/bash user:x:1001:1001:User,,,,:/home/user:/bin/bash
grep
, as most commands, can also be tweaked by using options. Here are the most common ones:
-i
-
the search is case insensitive
-r
-
the search is recursive (it searches into all files within the given directory and its subdirectories)
-c
-
the search counts the number of matches
-v
-
invert the match, to print lines that do not match the search term
-E
-
turns on extended regular expressions (needed by some of the more advanced meta-characters like
|
,+
and?
)
grep
has many other useful options. Consult the man page to find out more about it.
Regular Expressions
The second tool is very powerful. It is used to describe bits of text within files, also called regular expressions. Regular expressions are extremely useful in extracting data from text files by constructing patterns. They are commonly used within scripts or when programming with high level languages, such as Perl or Python.
When working with regular expressions, it is very important to keep in mind that every character counts and the pattern is written with the purpose of matching a specific sequence of characters, known as a string. Most patterns use the normal ASCII symbols, such as letters, digits, punctuation or other symbols, but it can also use Unicode characters in order to match any other type of text.
The following list explains the regular expressions meta-characters that are used to form the patterns.
.
-
Match any single character (except newline)
[abcABC]
-
Match any one character within the brackets
[^abcABC]
-
Match any one character except the ones in the brackets
[a-z]
-
Match any character in the range
[^a-z]
-
Match any character except the ones in the range
sun|moon
-
Find either of the listed strings
^
-
Start of a line
$
-
End of a line
All functionalities of the regular expressions can be implemented through grep
as well. You can see that in the example above, the word is not surrounded by double quotes. To prevent the shell from interpreting the meta-character itself, it is recommended that the more complex pattern be between double quotes (" "). For the purpose of practice, we will be using double quotes when implementing regular expressions. The other quotation marks keep their normal functionality, as discussed in previous lessons.
The following examples emphasize the functionality of the regular expressions. We will need data within the file, therefore the next set of commands just appends different strings to the text.txt
file.
$ echo "aaabbb1" > text.txt $ echo "abab2" >> text.txt $ echo "noone2" >> text.txt $ echo "class1" >> text.txt $ echo "alien2" >> text.txt $ cat text.txt aaabbb1 abab2 noone2 class1 alien2
The first example is a combination of searching through the file without and with regular expressions. In order to fully understand regular expressions, it is very important to show the difference. The first command searches for the exact string, anywhere in the line, whereas the second command searches for sets of characters that contain any of the characters between the brackets. Therefore, the results of the commands are different.
$ grep "ab" text.txt aaabbb1 abab2 $ grep "[ab]" text.txt aaabbb1 abab2 class1 alien2
The second set of examples shows the application of the beginning and the end of the line meta-character. It is very important to specify the need to put the 2 characters at the right place in the expression. When specifying the beginning of the line, the meta-character needs to be before the expression, whereas, when specifying the end of the line, the meta-character needs to be after the expression.
$ grep "^a" text.txt aaabbb1 abab2 alien2 $ grep "2$" text.txt abab2 noone2 alien2
On top of the previous explained meta-characters, regular expressions also have meta-characters that enable multiplication of the previously specified pattern:
*
-
Zero or more of the preceding pattern
+
-
One or more of the preceding pattern
?
-
Zero or one of the preceding pattern
For the multiplier meta-characters, the command below searches for a string that contains ab
, a single character and one or more of the characters previously found. The result shows that grep
found the aaabbb1
string, matching the abbb
part as well as abab2
. Since the +
character is an extended regular expression character, we need to pass the -E
option to the grep
command.
$ grep -E "ab.+" text.txt aaabbb1 abab2
Most of the meta-characters are self-explanatory, but they can become tricky when used for the first time. The previous examples represent a small part of the regular expressions’ functionality. Try all meta-characters from the above table to understand more on how they work.
Guided Exercises
Using grep
and the /usr/share/hunspell/en_US.dic
file, find the lines that match the following criteria:
-
All lines containing the word
cat
anywhere on the line. -
All lines that do not contain any of the following characters:
sawgtfixk
. -
All lines that start with any 3 letters and the word
dig
. -
All lines that end with at least one
e
. -
All lines that contain one of the following words:
org
,kay
ortuna
. -
Number of lines that start with one or no
c
followed by the stringati
.
Explorational Exercises
-
Find the regular expression that matches the words in the “Include” line and doesn’t match the ones in the “Exclude” line:
-
Include:
pot
,spot
,apot
Exclude:
potic
,spots
,potatoe
-
Include:
arp99
,apple
,zipper
Exclude:
zoo
,arive
,attack
-
Include:
arcane
,capper
,zoology
Exclude:
air
,coper
,zoloc
-
Include:
0th/pt
,3th/tc
,9th/pt
Exclude:
0/nm
,3/nm
,9/nm
-
Include:
Hawaii
,Dario
,Ramiro
Exclude:
hawaii
,Ian
,Alice
-
-
What other useful command is commonly used to search within the files? What additional functionalities does it have?
-
Thinking back at the previous lesson, use one of the examples and try to look for a specific pattern within the output of the command, with the help of
grep
.
Summary
In this lab you learned:
-
Regular expressions meta-characters
-
How to create patterns with regular expressions
-
How to search within the files
Commands used in the exercises:
grep
-
Searches for characters or strings within a file
Answers to Guided Exercises
Using grep
and the /usr/share/hunspell/en_US.dic
file, find the lines that match the following criteria:
-
All lines containing the word
cat
anywhere on the line.$ grep "cat" /usr/share/hunspell/en_US.dic Alcatraz/M Decatur/M Hecate/M ...
-
All lines that do not contain any of the following characters:
sawgtfixk
.$ grep -v "[sawgtfixk]" /usr/share/hunspell/en_US.dic 49269 0/nm 1/n1 2/nm 2nd/p 3/nm 3rd/p 4/nm 5/nm 6/nm 7/nm 8/nm ...
-
All lines that start with any 3 letters and the word
dig
.$ grep "^...dig" /usr/share/hunspell/en_US.dic cardigan/SM condign predigest/GDS ...
-
All lines that end with at least one
e
.$ grep -E "e+$" /usr/share/hunspell/en_US.dic Anglicize Anglophobe Anthropocene ...
-
All lines that contain one of the following words:
org
,kay
ortuna
.$ grep -E "org|kay|tuna" /usr/share/hunspell/en_US.dic Borg/SM George/MS Tokay/M fortunate/UY ...
-
Number of lines that start with one or no
c
followed by the stringati
.$ grep -cE "^c?ati" /usr/share/hunspell/en_US.dic 3
Answers to Explorational Exercises
-
Find the regular expression that matches the words in the “Include” line and doesn’t match the ones in the “Exclude” line:
-
Include:
pot
,spot
,apot
Exclude:
potic
,spots
,potatoe
Answer:
pot$
-
Include:
arp99
,apple
,zipper
Exclude:
zoo
,arive
,attack
Answer:
p+
-
Include:
arcane
,capper
,zoology
Exclude:
air
,coper
,zoloc
Answer:
arc|cap|zoo
-
Include:
0th/pt
,3th/tc
,9th/pt
Exclude:
0/nm
,3/nm
,9/nm
Answer:
[0-9]th.+
-
Include:
Hawaii
,Dario
,Ramiro
Exclude:
hawaii
,Ian
,Alice
Answer:
^[A-Z]a.*i+
-
-
What other useful command is commonly used to search within the files? What additional functionalities does it have?
The
sed
command. The command can find and replace characters or sets of characters within a file. -
Thinking back at the previous lesson, use one of the examples and try to look for a specific pattern within the output of the command, with the help of
grep
.I took one of the answers from the Explorational Exercises and looked for the line that has read, write and execute as the group permissions. Your answer might be different, depending on the command that you chose and the pattern that you created.
$ cat contents.txt | tr -s " " | grep "^....rwx"
This exercise is to show you that
grep
can also receive input from different commands and it can help in filtering generated information.