3.2 Lesson 2

Certificate:

Linux Essentials

Version:

1.6

Topic:

3 The Power of the Command Line

Objective:

3.2 Searching and Extracting Data from Files

Lesson:

2 of 2

Introduction

In this lesson, we are going to look at tools that are used to manipulate text. These tools are frequently used by system administrators or programs to automatically monitor or identify specific recurring information.

Searching within Files with `grep`

The first tool that we will discuss in this lesson is the grep command. grep is the abbreviation of the phrase “global regular expression print” and its main functionality is to search within files for the specified pattern. The command outputs the line containing the specified pattern highlighted in red.

$ grep bash /etc/passwd
root:x:0:0:root:/root:/bin/bash
user:x:1001:1001:User,,,,:/home/user:/bin/bash

grep, as most commands, can also be tweaked by using options. Here are the most common ones:

-i: the search is case insensitive
-r: the search is recursive (it searches into all files within the given directory and its subdirectories)
-c: the search counts the number of matches
-v: invert the match, to print lines that do not match the search term
-E: turns on extended regular expressions (needed by some of the more advanced meta-characters like | , + and ?)

grep has many other useful options. Consult the man page to find out more about it.

Regular Expressions

The second tool is very powerful. It is used to describe bits of text within files, also called regular expressions. Regular expressions are extremely useful in extracting data from text files by constructing patterns. They are commonly used within scripts or when programming with high level languages, such as Perl or Python.

When working with regular expressions, it is very important to keep in mind that every character counts and the pattern is written with the purpose of matching a specific sequence of characters, known as a string. Most patterns use the normal ASCII symbols, such as letters, digits, punctuation or other symbols, but it can also use Unicode characters in order to match any other type of text.

The following list explains the regular expressions meta-characters that are used to form the patterns.

.: Match any single character (except newline)
[abcABC]: Match any one character within the brackets
[^abcABC]: Match any one character except the ones in the brackets
[a-z]: Match any character in the range
[^a-z]: Match any character except the ones in the range
sun|moon: Find either of the listed strings
^: Start of a line
$: End of a line

All functionalities of the regular expressions can be implemented through grep as well. You can see that in the example above, the word is not surrounded by double quotes. To prevent the shell from interpreting the meta-character itself, it is recommended that the more complex pattern be between double quotes (" "). For the purpose of practice, we will be using double quotes when implementing regular expressions. The other quotation marks keep their normal functionality, as discussed in previous lessons.

The following examples emphasize the functionality of the regular expressions. We will need data within the file, therefore the next set of commands just appends different strings to the text.txt file.

$ echo "aaabbb1" > text.txt
$ echo "abab2" >> text.txt
$ echo "noone2" >> text.txt
$ echo "class1" >> text.txt
$ echo "alien2" >> text.txt
$ cat text.txt
aaabbb1
abab2
noone2
class1
alien2

The first example is a combination of searching through the file without and with regular expressions. In order to fully understand regular expressions, it is very important to show the difference. The first command searches for the exact string, anywhere in the line, whereas the second command searches for sets of characters that contain any of the characters between the brackets. Therefore, the results of the commands are different.

$ grep "ab" text.txt
aaabbb1
abab2
$ grep "[ab]" text.txt
aaabbb1
abab2
class1
alien2

The second set of examples shows the application of the beginning and the end of the line meta-character. It is very important to specify the need to put the 2 characters at the right place in the expression. When specifying the beginning of the line, the meta-character needs to be before the expression, whereas, when specifying the end of the line, the meta-character needs to be after the expression.

$ grep "^a" text.txt
aaabbb1
abab2
alien2
$ grep "2$" text.txt
abab2
noone2
alien2

On top of the previous explained meta-characters, regular expressions also have meta-characters that enable multiplication of the previously specified pattern:

*: Zero or more of the preceding pattern
+: One or more of the preceding pattern
?: Zero or one of the preceding pattern

For the multiplier meta-characters, the command below searches for a string that contains ab, a single character and one or more of the characters previously found. The result shows that grep found the aaabbb1 string, matching the abbb part as well as abab2. Since the + character is an extended regular expression character, we need to pass the -E option to the grep command.

$ grep -E "ab.+" text.txt
aaabbb1
abab2

Most of the meta-characters are self-explanatory, but they can become tricky when used for the first time. The previous examples represent a small part of the regular expressions’ functionality. Try all meta-characters from the above table to understand more on how they work.

Guided Exercises

Using grep and the /usr/share/hunspell/en_US.dic file, find the lines that match the following criteria:

All lines containing the word cat anywhere on the line.
All lines that do not contain any of the following characters: sawgtfixk.
All lines that start with any 3 letters and the word dig.
All lines that end with at least one e.
All lines that contain one of the following words: org , kay or tuna.
Number of lines that start with one or no c followed by the string ati.

Explorational Exercises

Find the regular expression that matches the words in the “Include” line and doesn’t match the ones in the “Exclude” line:
- Include: pot, spot, apot
  
  Exclude: potic, spots, potatoe
- Include: arp99, apple, zipper
  
  Exclude: zoo, arive, attack
- Include: arcane, capper, zoology
  
  Exclude: air, coper, zoloc
- Include: 0th/pt, 3th/tc, 9th/pt
  
  Exclude: 0/nm, 3/nm, 9/nm
- Include: Hawaii, Dario, Ramiro
  
  Exclude: hawaii, Ian, Alice
What other useful command is commonly used to search within the files? What additional functionalities does it have?
Thinking back at the previous lesson, use one of the examples and try to look for a specific pattern within the output of the command, with the help of grep.

Summary

In this lab you learned:

Regular expressions meta-characters
How to create patterns with regular expressions
How to search within the files

Commands used in the exercises:

grep: Searches for characters or strings within a file

Answers to Guided Exercises

Using grep and the /usr/share/hunspell/en_US.dic file, find the lines that match the following criteria:

All lines containing the word cat anywhere on the line.

$ grep "cat" /usr/share/hunspell/en_US.dic
Alcatraz/M
Decatur/M
Hecate/M
...

All lines that do not contain any of the following characters: sawgtfixk.

$ grep -v "[sawgtfixk]" /usr/share/hunspell/en_US.dic
49269
0/nm
1/n1
2/nm
2nd/p
3/nm
3rd/p
4/nm
5/nm
6/nm
7/nm
8/nm
...

All lines that start with any 3 letters and the word dig.

$ grep "^...dig" /usr/share/hunspell/en_US.dic
cardigan/SM
condign
predigest/GDS
...

All lines that end with at least one e.

$ grep -E "e+$" /usr/share/hunspell/en_US.dic
Anglicize
Anglophobe
Anthropocene
...

All lines that contain one of the following words: org , kay or tuna.

$ grep -E "org|kay|tuna" /usr/share/hunspell/en_US.dic
Borg/SM
George/MS
Tokay/M
fortunate/UY
...

Number of lines that start with one or no c followed by the string ati.
```
$ grep -cE "^c?ati" /usr/share/hunspell/en_US.dic
3
```

Answers to Explorational Exercises

Find the regular expression that matches the words in the “Include” line and doesn’t match the ones in the “Exclude” line:
- Include: pot, spot, apot
  
  Exclude: potic, spots, potatoe
  
  Answer: pot$
- Include: arp99, apple, zipper
  
  Exclude: zoo, arive, attack
  
  Answer: p+
- Include: arcane, capper, zoology
  
  Exclude: air, coper, zoloc
  
  Answer: arc|cap|zoo
- Include: 0th/pt, 3th/tc, 9th/pt
  
  Exclude: 0/nm, 3/nm, 9/nm
  
  Answer: [0-9]th.+
- Include: Hawaii, Dario, Ramiro
  
  Exclude: hawaii, Ian, Alice
  
  Answer: ^[A-Z]a.*i+
What other useful command is commonly used to search within the files? What additional functionalities does it have?

The sed command. The command can find and replace characters or sets of characters within a file.
Thinking back at the previous lesson, use one of the examples and try to look for a specific pattern within the output of the command, with the help of grep.

I took one of the answers from the Explorational Exercises and looked for the line that has read, write and execute as the group permissions. Your answer might be different, depending on the command that you chose and the pattern that you created.
```
$ cat contents.txt | tr -s " " | grep "^....rwx"
```
This exercise is to show you that grep can also receive input from different commands and it can help in filtering generated information.