Linux Professional Institute Learning Logo.
Skip to main content
  • Home
    • All Resources
    • LPI Learning Materials
    • Become a Contributor
    • Publishing Partners
    • Become a Publishing Partner
    • About
    • FAQ
    • Contributors
    • Roadmap
    • Contact
  • LPI.org
3.1 Lesson 1
Topic 1: The Linux Community and a Career in Open Source
1.1 Linux Evolution and Popular Operating Systems
  • 1.1 Lesson 1
1.2 Major Open Source Applications
  • 1.2 Lesson 1
1.3 Open Source Software and Licensing
  • 1.3 Lesson 1
1.4 ICT Skills and Working in Linux
  • 1.4 Lesson 1
Topic 2: Finding Your Way on a Linux System
2.1 Command Line Basics
  • 2.1 Lesson 1
  • 2.1 Lesson 2
2.2 Using the Command Line to Get Help
  • 2.2 Lesson 1
2.3 Using Directories and Listing Files
  • 2.3 Lesson 1
  • 2.3 Lesson 2
2.4 Creating, Moving and Deleting Files
  • 2.4 Lesson 1
Topic 3: The Power of the Command Line
3.1 Archiving Files on the Command Line
  • 3.1 Lesson 1
3.2 Searching and Extracting Data from Files
  • 3.2 Lesson 1
  • 3.2 Lesson 2
3.3 Turning Commands into a Script
  • 3.3 Lesson 1
  • 3.3 Lesson 2
Topic 4: The Linux Operating System
4.1 Choosing an Operating System
  • 4.1 Lesson 1
4.2 Understanding Computer Hardware
  • 4.2 Lesson 1
4.3 Where Data is Stored
  • 4.3 Lesson 1
  • 4.3 Lesson 2
4.4 Your Computer on the Network
  • 4.4 Lesson 1
Topic 5: Security and File Permissions
5.1 Basic Security and Identifying User Types
  • 5.1 Lesson 1
5.2 Creating Users and Groups
  • 5.2 Lesson 1
5.3 Managing File Permissions and Ownership
  • 5.3 Lesson 1
5.4 Special Directories and Files
  • 5.4 Lesson 1
  1. Topic 3: The Power of the Command Line
  2. 3.1 Archiving Files on the Command Line
  3. 3.1 Lesson 1

3.1 Lesson 1

Certificate:

Linux Essentials

Version:

1.6

Topic:

3 The Power of the Command Line

Objective:

3.1 Archiving Files on the Command Line

Lesson:

1 of 1

Introduction

Compression is used to reduce the amount of space a specific set of data consumes. Compression is commonly used for reducing the amount of space that is needed to store a file. Another common use is to reduce the amount of data sent over a network connection.

Compression works by replacing repetitive patterns in data. Suppose you have a novel. Some words are extremely common but have multiple characters, such as the word “the”. You could reduce the size of the novel significantly if you were to replace these common multi character words and patterns with single character replacements. E.g., replace “the” with a greek letter that is not used elsewhere in the text. Data compression algorithms are similar to this but more complex.

Compression comes in two varieties, lossless and lossy. Things compressed with a lossless algorithm can be decompressed back into their original form. Data compressed with a lossy algorithm cannot be recovered. Lossy algorithms are often used for images, video, and audio where the quality loss is imperceptible to humans, irrelevant to the context, or the loss is worth the saved space or network throughput.

Archiving tools are used to bundle up files and directories into a single file. Some common uses are backups, bundling software source code, and data retention.

Archive and compression are commonly used together. Some archiving tools even compress their contents by default. Others can optionally compress their contents. A few archive tools must be used in conjunction with stand-alone compression tools if you wish to compress the contents.

The most common tool for archiving files on Linux systems is tar. Most Linux distributions ship with the GNU version of tar, so it is the one that will be covered in this lesson. tar on its own only manages the archiving of files but does not compress them.

There are lots of compression tools available on Linux. Some common lossless ones are bzip2, gzip, and xz. You will find all three on most systems. You may encounter an old or very minimal system where xz or bzip is not installed. If you become a regular Linux user, you will likely encounter files compressed with all three of these. All three of them use different algorithms, so a file compressed with one tool can’t be decompressed by another. Compression tools have a trade off. If you want a high compression ratio, it will take longer to compress and decompress the file. This is because higher compression requires more work finding more complex patterns. All of these tools compress data but can not create archives containing multiple files.

Stand-alone compression tools aren’t typically available on Windows systems. Windows archiving and compression tools are usually bundled together. Keep this in mind if you have Linux and Windows systems that need to share files.

Linux systems also have tools for handling .zip files commonly used on Windows system. They are called zip and unzip. These tools are not installed by default on all systems, so if you need to use them you may have to install them. Fortunately, they are typically found in distributions' package repositories.

Compression Tools

How much disk space is saved by compressing files depends on a few factors. The nature of the data you are compressing, the algorithm used to compress the data, and the compression level. Not all algorithms support different compression levels.

Let’s start with setting up some test files to compress:

$ mkdir ~/linux_essentials-3.1
$ cd ~/linux_essentials-3.1
$ mkdir compression archiving
$ cd compression
$ cat /etc/* > bigfile 2> /dev/null

Now we create three copies of this file:

$ cp bigfile bigfile2
$ cp bigfile bigfile3
$ cp bigfile bigfile4
$ ls -lh
total 2.8M
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile2
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile3
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile4

Now we are going to compress the files with each aforementioned compression tool:

$ bzip2 bigfile2
$ gzip bigfile3
$ xz bigfile4
$ ls -lh
total 1.2M
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 170K Jun 23 08:08 bigfile2.bz2
-rw-r--r-- 1 emma emma 179K Jun 23 08:08 bigfile3.gz
-rw-r--r-- 1 emma emma 144K Jun 23 08:08 bigfile4.xz

Compare the sizes of the compressed files to the uncompressed file named bigfile. Also notice how the compression tools added extensions to the file names and removed the uncompressed files.

Use bunzip2, gunzip, or unxz to decompress the files:

$ bunzip2 bigfile2.bz2
$ gunzip bigfile3.gz
$ unxz bigfile4.xz
$ ls -lh
total 2.8M
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile2
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile3
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile4

Notice again that now the compressed file is deleted once it is decompressed.

Some compression tools support different compression levels. A higher compression level usually requires more memory and CPU cycles, but results in a smaller compressed file. The opposite is true for a lower level. Below is a demonstration with xz and gzip:

$ cp bigfile bigfile-gz1
$ cp bigfile bigfile-gz9
$ gzip -1 bigfile-gz1
$ gzip -9 bigfile-gz9
$ cp bigfile bigfile-xz1
$ cp bigfile bigfile-xz9
$ xz -1 bigfile bigfile-xz1
$ xz -9 bigfile bigfile-xz9
$ ls -lh bigfile bigfile-* *
total 3.5M
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 205K Jun 23 13:14 bigfile-gz1.gz
-rw-r--r-- 1 emma emma 178K Jun 23 13:14 bigfile-gz9.gz
-rw-r--r-- 1 emma emma 156K Jun 23 08:08 bigfile-xz1.xz
-rw-r--r-- 1 emma emma 143K Jun 23 08:08 bigfile-xz9.xz

It is not necessary to decompress a file every time you use it. Compression tools typically come with special versions of common tools used to read text files. For example, gzip has a version of cat, grep, diff, less, more, and a few others. For gzip, the tools are prefixed with a z, while the prefix bz exists for bzip2 and xz exists for xz. Below is an example of using zcat to read display a file compressed with gzip:

$ cp /etc/hosts ./
$ gzip hosts
$ zcat hosts.gz
127.0.0.1	localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Archiving Tools

The tar program is probably the most widely used archiving tool on Linux systems. In case you are wondering why it is named how it is, it as an abbreviation for “tape archive”. Files created with tar are often called tar balls. It is very common for applications distributed as source code to be in tar balls.

The GNU version of tar that Linux distributions ship with has a lot of options. This lesson is going to cover the most commonly used subset.

Let’s start off by creating an archive of the files used for compression:

$ cd ~/linux_essentials-3.1
$ tar cf archiving/3.1.tar compression

The c option instructs tar to create a new archive file and the f option is the name of the file to create. The argument immediately following the options is always going to be the name of the file to work on. The rest of the arguments are the paths to any files or directories you wish to add to, list, or extract from the file. In the example, we are adding the directory compression and all of its contents to the archive.

To view the contents of a tar ball, use the t option of tar:

$ tar -tf 3.1.tar
compression/
compression/bigfile-xz1.xz
compression/bigfile-gz9.gz
compression/hosts.gz
compression/bigfile2
compression/bigfile
compression/bigfile-gz1.gz
compression/bigfile-xz9.xz
compression/bigfile3
compression/bigfile4

Notice how the options are preceded with -. Unlike most programs, with tar, the - isn’t required when specifying options, although it doesn’t cause any harm if it is used.

Note

You can use the -v option to let tar output the names of files it operates on when creating or extracting an archive.

Now let’s extract the file:

$ cd ~/linux_essentials-3.1/archiving
$ ls
3.1.tar
$ tar xf 3.1.tar
$ ls
3.1.tar  compression

Suppose you only need one file out of the archive. If this is the case, you can specify it after the archive’s file name. You can specify multiple files if necessary:

$ cd ~/linux_essentials-3.1/archiving
$ rm -rf compression
$ ls
3.1.tar
$ tar xvf 3.1.tar compression/hosts.gz
compression/
compression/bigfile-xz1.xz
compression/bigfile-gz9.gz
compression/hosts.gz
compression/bigfile2
compression/bigfile
compression/bigfile-gz1.gz
compression/bigfile-xz9.xz
compression/bigfile3
compression/bigfile4
$ ls
3.1.tar  compression
$ ls compression
hosts.gz

With the exception of absolute paths (paths beginning with /), tar files preserve the entire path to files when they are created. Since the file 3.1.tar was created with a single directory, that directory will be created relative to your current working directory when extracted. Another example should clarify this:

$ cd ~/linux_essentials-3.1/archiving
$ rm -rf compression
$ cd ../compression
$ tar cf ../tar/3.1-nodir.tar *
$ cd ../archiving
$ mkdir untar
$ cd untar
$ tar -xf ../3.1-nodir.tar
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz
Tip

If you wish to use the absolute path in a tar file, you must use the P option. Be aware that this may overwrite important files and might cause errors on your system.

The tar program can also manage compression and decompression of archives on the fly. tar does so by calling one of the compression tools discussed earlier in this section. It is as simple as adding the option appropriate to the compression algorithm. The most commonly used ones are j, J, and z for bzip2, xz, and gzip, respectively. Below are examples using the aforementioned algorithms:

$ cd ~/linux_essentials-3.1/compression
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz
$ tar -czf gzip.tar.gz bigfile bigfile2 bigfile3
$ tar -cjf bzip2.tar.bz2 bigfile bigfile2 bigfile3
$ tar -cJf xz.tar.xz bigfile bigfile2 bigfile3
$ ls -l | grep tar
-rw-r--r-- 1 emma emma  450202 Jun 27 05:56 bzip2.tar.bz2
-rw-r--r-- 1 emma emma  548656 Jun 27 05:55 gzip.tar.gz
-rw-r--r-- 1 emma emma  147068 Jun 27 05:56 xz.tar.xz

Notice how in the example the .tar files have different sizes. This shows that they were successfully compressed. If you create compressed .tar archives, you should always add a second file extension denoting the algorithm you used. They are .xz, .bz, and .gz for xz, bzip2, and gzip, respectively. Sometimes shortened extensions such as .tgz are used.

It is possible to add files to already existing uncompressed tar archives. Use the u option to do this. If you attempt to add to a compressed archive, you will get an error.

$ cd ~/linux_essentials-3.1/compression
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  bzip2.tar.bz2  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz  gzip.tar.gz    xz.tar.xz
$ tar cf plain.tar bigfile bigfile2 bigfile3
$ tar tf plain.tar
bigfile
bigfile2
bigfile3
$ tar uf plain.tar bigfile4
$ tar tf plain.tar
bigfile
bigfile2
bigfile3
bigfile4
$ tar uzf gzip.tar.gz bigfile4
tar: Cannot update compressed archives
Try 'tar --help' or 'tar --usage' for more information.

Managing ZIP files

Windows machines often don’t have applications to handle tar balls or many of the compression tools commonly found on Linux systems. If you need to interact with Windows systems, you can use ZIP files. A ZIP file is an archive file similar to a compressed tar file.

The zip and unzip programs can be used to work with ZIP files on Linux systems. The example below should be all you need to get started using them. First we create a set of files:

$ cd ~/linux_essentials-3.1
$ mkdir zip
$ cd zip/
$ mkdir dir
$ touch dir/file1 dir/file2

Now we use zip to pack these files into a ZIP file:

$ zip -r zipfile.zip dir
  adding: dir/ (stored 0%)
  adding: dir/file1 (stored 0%)
  adding: dir/file2 (stored 0%)
$ rm -rf dir

Finally, we unpack the ZIP file again:

$ ls
zipfile.zip
$ unzip zipfile.zip
Archive:  zipfile.zip
   creating: dir/
 extracting: dir/file1
 extracting: dir/file2
$ find
.
./zipfile.zip
./dir
./dir/file1
./dir/file2

When adding directories to ZIP files, the -r option causes zip to include a directory’s contents. Without it, you would have an empty directory in the ZIP file.

Guided Exercises

  1. According to the extensions, which of the following tools were used to create these files?

    Filename tar gzip bzip2 xz

    archive.tar

    archive.tgz

    archive.tar.xz

  2. According to the extensions, which of these files are archives and which are compressed?

    Filename Archive Compressed

    file.tar

    file.tar.bz2

    file.zip

    file.xz

  3. How would you add a file to a gzip compressed tar file?

  4. Which tar option instructs tar to include the leading / in absolute paths?

  5. Does zip support different compression levels?

Explorational Exercises

  1. When extracting files, does tar support globs in the file list?

  2. How can you make sure a decompressed file is identical to the file before it was compressed?

  3. What happens if you try to extract a file from a tar archive that already exists on your filesystem?

  4. How would you extract the file archive.tgz without using the tar z option?

Summary

Linux systems have several compression and archiving tools available. This lesson covered the most common ones. The most common archiving tool is tar. If interacting with Windows systems is necessary, zip and unzip can create and extract ZIP files.

The tar command has a few options that are worth memorizing. They are x for extract, c for create, t for view contents, and u to add or replace files. The v option lists the files which are processed by tar while creating or extracting an archive.

The typical Linux distribution’s repository has many compression tools. The most common are gzip, bzip2, and xz. Compression algorithms often support different levels that allow you to optimize for speed or file size. Files can be decompressed with gunzip, bunzip2, and unxz.

Compression tools commonly have programs that behave like common text file tools, with the difference being they work on compressed files. A few of them are zcat, bzcat, and xzcat. Compression tools typically ship with programs with the functionality of grep, more, less, diff, and cmp.

Commands used in the exercises:

bunzip2

Decompress a bzip2 compressed file.

bzcat

Output the contents of a bzip compressed file.

bzip2

Compress files using the bzip2 algorithm and format.

gunzip

Decompress a gzip compressed file.

gzip

Compress files using the gzip algorithm and format.

tar

Create, update, list and extract tar archives.

unxz

Decompress a xz compressed file.

unzip

Decompress and extract content from a ZIP file.

xz Compress files using the xz algorithm and format.

zcat

Output the contents of a gzip compressed file.

zip

Create and compress ZIP archives.

Answers to Guided Exercises

  1. According to the extensions, which of the following tools were used to create these files?

    Filename tar gzip bzip2 xz

    archive.tar

    X

    archive.tgz

    X

    X

    archive.tar.xz

    X

    X

  2. According to the extensions, which of these files are archives and which are compressed?

    Filename Archive Compressed

    file.tar

    X

    file.tar.bz2

    X

    X

    file.zip

    X

    X

    file.xz

    X

  3. How would you add a file to a gzip compressed tar file?

    You would decompress the file with gunzip, add the file with tar uf, and then compress it with gzip

  4. Which tar option instructs tar to include the leading / in absolute paths?

    The -P option. From the man page:

    -P, --absolute-names
            Don't strip leading slashes from file names when creating  archives
  5. Does zip support different compression levels?

    Yes. You would use -#, replacing # with a number from 0-9. From the man page:

    -#
    (-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
           Regulate the speed of compression using the specified  digit  #,
           where  -0  indicates  no compression (store all files), -1 indi‐
           cates the fastest compression speed (less  compression)  and  -9
           indicates  the  slowest  compression speed (optimal compression,
           ignores the suffix list). The default compression level is -6.
    
           Though still being worked, the intention is  this  setting  will
           control  compression  speed  for  all compression methods.  Cur‐
           rently only deflation is controlled.

Answers to Explorational Exercises

  1. When extracting files, does tar support globs in the file list?

    Yes, you would use the --wildcards option. --wildcards must be placed right after the tar file when using the no dash style of options. For example:

    $ tar xf tarfile.tar --wildcards dir/file*
    $ tar --wildcards -xf tarfile.tar dir/file*
  2. How can you make sure a decompressed file is identical to the file before it was compressed?

    You don’t need to do anything with the tools covered in this lesson. All three of them include checksums in their file format that is verified when they are decompressed.

  3. What happens if you try to extract a file from a tar archive that already exists on your filesystem?

    The file on your filesystem is overwritten with the version that is in the tar file.

  4. How would you extract the file archive.tgz without using the tar z option?

    You would decompress it with gunzip first.

    $ gunzip archive.tgz
    $ tar xf archive.tar

© 2020 Linux Professional Insitute Inc. All rights reserved. Visit the Learning Materials website: https://learning.lpi.org
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Next Lesson

3.2 Searching and Extracting Data from Files (3.2 Lesson 1)

Read next lesson

© 2020 Linux Professional Insitute Inc. All rights reserved. Visit the Learning Materials website: https://learning.lpi.org
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

LPI is a non-profit organization.

Linux Professional Institute (LPI) is the global certification standard and career support organization for open source professionals. With more than 200,000 certification holders, it's the world’s first and largest vendor-neutral Linux and open source certification body. LPI has certified professionals in over 180 countries, delivers exams in multiple languages, and has hundreds of training partners.

Our purpose is to enable economic and creative opportunities for everybody by making open source knowledge and skills certification universally accessible.

  • LinkedIn
  • flogo-RGB-HEX-Blk-58 Facebook
  • Twitter
  • Contact Us
  • Privacy and Cookie Policy

Spot a mistake or want to help improve this page? Please let us know.

© Copyright 1999-2020 The Linux Professional Institute Inc. All rights reserved.