3.1 Lesson 1
Certificate: | Linux Essentials |
---|---|
Version: | 1.6 |
Topic: | 3 The Power of the Command Line |
Objective: | 3.1 Archiving Files on the Command Line |
Lesson: | 1 of 1 |
Introduction
Compression is used to reduce the amount of space a specific set of data consumes. Compression is commonly used for reducing the amount of space that is needed to store a file. Another common use is to reduce the amount of data sent over a network connection.
Compression works by replacing repetitive patterns in data with shorter representations. Suppose you have a novel. Some words are extremely common but consist of multiple characters, such as the word “the”. You could reduce the size of the novel significantly if you were to replace these common multi-character words and patterns with single-character replacements, for example by replacing “the” with a Greek letter that is not used elsewhere in the text. Data compression algorithms work on a similar principle but are more complex.
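The effect of repetition can be sketched with gzip, one of the tools introduced later in this lesson. The example below is only an illustration: the file names and the scratch directory are made up, and the exact sizes will vary from system to system. It compresses 100000 bytes of a repeated phrase and 100000 bytes of random data and prints the resulting sizes.

```shell
#!/bin/sh
# Sketch: repetitive data compresses far better than random data.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

# 5000 lines of the same 20-byte phrase: 100000 bytes full of repeated patterns.
awk 'BEGIN { for (i = 0; i < 5000; i++) print "the quick brown fox" }' > repetitive

# 100000 random bytes: almost no repeated patterns to exploit.
head -c 100000 /dev/urandom > random

# -c writes the compressed data to standard output and keeps the input files.
repetitive_size=$(gzip -c repetitive | wc -c)
random_size=$(gzip -c random | wc -c)

echo "repetitive: 100000 -> $repetitive_size bytes"
echo "random:     100000 -> $random_size bytes"
# Remove the scratch directory when finished: rm -rf "$workdir"
```

The repetitive file should shrink to a small fraction of its original size, while the random file stays roughly the same size or even grows slightly.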
Compression comes in two varieties, lossless and lossy. Data compressed with a lossless algorithm can be decompressed back into its exact original form. Data compressed with a lossy algorithm cannot be restored exactly, because some of the original information is discarded. Lossy algorithms are often used for images, video, and audio where the quality loss is imperceptible to humans, irrelevant to the context, or worth the saved space or network throughput.
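The lossless property can be checked with a simple round trip. The sketch below assumes gzip (covered later in this lesson) and uses made-up file names. It compresses a file, decompresses the result into a new file, and compares the two with cmp.

```shell
#!/bin/sh
# Sketch: a lossless round trip returns exactly the original bytes.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

seq 1 5000 > original                 # some sample text data
gzip -c original > original.gz        # compress, keeping the original
gzip -dc original.gz > restored       # decompress into a separate file

# cmp -s is silent and only reports through its exit status.
if cmp -s original restored; then
    result="identical"
else
    result="different"
fi
echo "original and restored are $result"
# Remove the scratch directory when finished: rm -rf "$workdir"
```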
Archiving tools are used to bundle up files and directories into a single file. Some common uses are backups, bundling software source code, and data retention.
Archiving and compression are commonly used together. Some archiving tools even compress their contents by default. Others can optionally compress their contents. A few archive tools must be used in conjunction with stand-alone compression tools if you wish to compress the contents.
The most common tool for archiving files on Linux systems is tar. Most Linux distributions ship with the GNU version of tar, so it is the one that will be covered in this lesson. tar on its own only manages the archiving of files but does not compress them.
There are lots of compression tools available on Linux. Some common lossless ones are bzip2, gzip, and xz. You will find all three on most systems. You may encounter an old or very minimal system where xz or bzip2 is not installed. If you become a regular Linux user, you will likely encounter files compressed with all three of these. All three of them use different algorithms, so a file compressed with one tool can’t be decompressed by another. Compression tools involve a trade-off. If you want a high compression ratio, it will take longer to compress and decompress the file. This is because higher compression requires more work finding more complex patterns. All of these tools compress data but cannot create archives containing multiple files.
Stand-alone compression tools aren’t typically available on Windows systems. Windows archiving and compression tools are usually bundled together. Keep this in mind if you have Linux and Windows systems that need to share files.
Linux systems also have tools for handling the .zip files commonly used on Windows systems. They are called zip and unzip. These tools are not installed by default on all systems, so if you need to use them you may have to install them. Fortunately, they are typically found in distributions' package repositories.
Compression Tools
How much disk space is saved by compressing files depends on a few factors: the nature of the data you are compressing, the algorithm used to compress the data, and the compression level. Not all algorithms support different compression levels.
Let’s start with setting up some test files to compress:
$ mkdir ~/linux_essentials-3.1
$ cd ~/linux_essentials-3.1
$ mkdir compression archiving
$ cd compression
$ cat /etc/* > bigfile 2> /dev/null
Now we create three copies of this file:
$ cp bigfile bigfile2
$ cp bigfile bigfile3
$ cp bigfile bigfile4
$ ls -lh
total 2.8M
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile2
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile3
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile4
Now we are going to compress the files with each aforementioned compression tool:
$ bzip2 bigfile2
$ gzip bigfile3
$ xz bigfile4
$ ls -lh
total 1.2M
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 170K Jun 23 08:08 bigfile2.bz2
-rw-r--r-- 1 emma emma 179K Jun 23 08:08 bigfile3.gz
-rw-r--r-- 1 emma emma 144K Jun 23 08:08 bigfile4.xz
Compare the sizes of the compressed files to the uncompressed file named bigfile. Also notice how the compression tools added extensions to the file names and removed the uncompressed files.
Use bunzip2, gunzip, or unxz to decompress the files:
$ bunzip2 bigfile2.bz2
$ gunzip bigfile3.gz
$ unxz bigfile4.xz
$ ls -lh
total 2.8M
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile2
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile3
-rw-r--r-- 1 emma emma 712K Jun 23 08:20 bigfile4
Notice again that each compressed file is deleted once it has been decompressed.
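If you would rather keep the input file, there are two common approaches. The sketch below uses gzip with made-up file names. It assumes a gzip version that supports the -k (--keep) option, which was added in gzip 1.6; bzip2 and xz accept -k as well. The -c form, which writes to standard output, is the older and more portable alternative.

```shell
#!/bin/sh
# Sketch: compress a file without losing the uncompressed original.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

printf 'some sample text\n' > notes

# -k (--keep) leaves "notes" in place and creates "notes.gz" next to it.
gzip -k notes

# Portable alternative: -c writes to standard output, so the input file
# is never touched and you choose the output name yourself.
gzip -c notes > notes-copy.gz

ls
# Remove the scratch directory when finished: rm -rf "$workdir"
```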
Some compression tools support different compression levels. A higher compression level usually requires more memory and CPU cycles, but results in a smaller compressed file. The opposite is true for a lower level. Below is a demonstration with xz and gzip:
$ cp bigfile bigfile-gz1
$ cp bigfile bigfile-gz9
$ gzip -1 bigfile-gz1
$ gzip -9 bigfile-gz9
$ cp bigfile bigfile-xz1
$ cp bigfile bigfile-xz9
$ xz -1 bigfile-xz1
$ xz -9 bigfile-xz9
$ ls -lh bigfile bigfile-*
-rw-r--r-- 1 emma emma 712K Jun 23 08:08 bigfile
-rw-r--r-- 1 emma emma 205K Jun 23 13:14 bigfile-gz1.gz
-rw-r--r-- 1 emma emma 178K Jun 23 13:14 bigfile-gz9.gz
-rw-r--r-- 1 emma emma 156K Jun 23 08:08 bigfile-xz1.xz
-rw-r--r-- 1 emma emma 143K Jun 23 08:08 bigfile-xz9.xz
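To compare levels without creating a copy for each one, you can write the compressed data to standard output and count the bytes. This is a minimal sketch with gzip only; the sample file is generated with seq and is purely illustrative, so the sizes you see will differ from the listing above.

```shell
#!/bin/sh
# Sketch: compare gzip compression levels without creating extra files.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

seq 1 200000 > sample     # illustrative text data

# -c sends the compressed data to standard output; wc -c counts the bytes.
for level in 1 6 9; do
    size=$(gzip -"$level" -c sample | wc -c)
    echo "gzip -$level: $size bytes"
done

# Keep the two extremes in variables for a direct comparison.
fast_size=$(gzip -1 -c sample | wc -c)
best_size=$(gzip -9 -c sample | wc -c)
echo "saved by -9 over -1: $((fast_size - best_size)) bytes"
# Remove the scratch directory when finished: rm -rf "$workdir"
```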
It is not necessary to decompress a file every time you use it. Compression tools typically come with special versions of common tools used to read text files. For example, gzip has a version of cat, grep, diff, less, more, and a few others. For gzip, the tools are prefixed with a z, while the prefix bz exists for bzip2 and xz exists for xz. Below is an example of using zcat to display a file compressed with gzip:
$ cp /etc/hosts ./
$ gzip hosts
$ zcat hosts.gz
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
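The other prefixed tools work the same way. The sketch below searches a compressed file with zgrep. It assumes that zgrep is installed alongside gzip, which is usually but not always the case, and it uses a made-up sample file rather than your real /etc/hosts.

```shell
#!/bin/sh
# Sketch: search inside a gzip-compressed file without decompressing it first.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

# A small stand-in for /etc/hosts.
printf '127.0.0.1 localhost\n::1 ip6-loopback\n' > hosts-sample
gzip hosts-sample         # produces hosts-sample.gz

# zgrep accepts the usual grep options; -c counts the matching lines.
matches=$(zgrep -c 'ip6' hosts-sample.gz)
echo "lines mentioning ip6: $matches"

# zcat can feed any other tool through a pipe.
line_count=$(zcat hosts-sample.gz | wc -l)
echo "total lines: $line_count"
# Remove the scratch directory when finished: rm -rf "$workdir"
```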
Archiving Tools
The tar program is probably the most widely used archiving tool on Linux systems. In case you are wondering about the name, it is an abbreviation for “tape archive”. Files created with tar are often called tar balls. It is very common for applications distributed as source code to be in tar balls.
The GNU version of tar that Linux distributions ship with has a lot of options. This lesson is going to cover the most commonly used subset.
Let’s start off by creating an archive of the files used for compression:
$ cd ~/linux_essentials-3.1
$ tar cf archiving/3.1.tar compression
The c option instructs tar to create a new archive file and the f option specifies the name of the file to create. The argument immediately following the options is always going to be the name of the file to work on. The rest of the arguments are the paths to any files or directories you wish to add to, list, or extract from the file. In the example, we are adding the directory compression and all of its contents to the archive.
To view the contents of a tar ball, use the t option of tar:
$ tar -tf archiving/3.1.tar
compression/
compression/bigfile-xz1.xz
compression/bigfile-gz9.gz
compression/hosts.gz
compression/bigfile2
compression/bigfile
compression/bigfile-gz1.gz
compression/bigfile-xz9.xz
compression/bigfile3
compression/bigfile4
Notice how the options are preceded with -. Unlike most programs, with tar, the - isn’t required when specifying options, although it doesn’t cause any harm if it is used.
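The different ways of writing tar options can be compared side by side. The sketch below assumes GNU tar and uses made-up file names. It creates the same archive three times, once without a dash, once with a dash, and once with the long option names --create, --file, and --list, and then checks that the listings match.

```shell
#!/bin/sh
# Sketch: three equivalent ways of passing options to GNU tar.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

mkdir project
echo "hello" > project/readme

tar cf old-style.tar project                   # no dash
tar -cf short-style.tar project                # short options with a dash
tar --create --file=long-style.tar project     # long options

old_listing=$(tar tf old-style.tar)
short_listing=$(tar -tf short-style.tar)
long_listing=$(tar --list --file=long-style.tar)

echo "$old_listing"
if [ "$old_listing" = "$short_listing" ] && [ "$short_listing" = "$long_listing" ]; then
    echo "all three archives list the same contents"
fi
# Remove the scratch directory when finished: rm -rf "$workdir"
```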
Note
|
You can use the |
Now let’s extract the file:
$ cd ~/linux_essentials-3.1/archiving
$ ls
3.1.tar
$ tar xf 3.1.tar
$ ls
3.1.tar compression
Suppose you only need one file out of the archive. If this is the case, you can specify it after the archive’s file name. You can specify multiple files if necessary:
$ cd ~/linux_essentials-3.1/archiving
$ rm -rf compression
$ ls
3.1.tar
$ tar xvf 3.1.tar compression/hosts.gz
compression/hosts.gz
$ ls
3.1.tar compression
$ ls compression
hosts.gz
With the exception of absolute paths (paths beginning with /), tar files preserve the entire path to files when they are created. Since the file 3.1.tar was created with a single directory, that directory will be created relative to your current working directory when extracted. Another example should clarify this:
$ cd ~/linux_essentials-3.1/archiving
$ rm -rf compression
$ cd ../compression
$ tar cf ../archiving/3.1-nodir.tar *
$ cd ../archiving
$ mkdir untar
$ cd untar
$ tar -xf ../3.1-nodir.tar
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz
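Because extraction happens relative to the current working directory, the examples above change directory before running tar. GNU tar also has a -C (--directory) option, not otherwise covered in this lesson, that tells tar to change to a given directory itself. The following is a minimal sketch with made-up names.

```shell
#!/bin/sh
# Sketch: extract into another directory with -C instead of using cd.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

mkdir source target
echo "data" > source/file1

tar cf bundle.tar source          # stored paths: source/ and source/file1
tar tf bundle.tar                 # show the paths recorded in the archive

# -C makes tar change to "target" before extracting, so the stored
# relative paths are recreated underneath it.
tar xf bundle.tar -C target

ls target/source
# Remove the scratch directory when finished: rm -rf "$workdir"
```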
Tip
|
If you wish to use the absolute path in a |
The tar program can also manage compression and decompression of archives on the fly. tar does so by calling one of the compression tools discussed earlier in this section. It is as simple as adding the option appropriate to the compression algorithm. The most commonly used ones are j, J, and z for bzip2, xz, and gzip, respectively. Below are examples using the aforementioned algorithms:
$ cd ~/linux_essentials-3.1/compression
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz
$ tar -czf gzip.tar.gz bigfile bigfile2 bigfile3
$ tar -cjf bzip2.tar.bz2 bigfile bigfile2 bigfile3
$ tar -cJf xz.tar.xz bigfile bigfile2 bigfile3
$ ls -l | grep tar
-rw-r--r-- 1 emma emma 450202 Jun 27 05:56 bzip2.tar.bz2
-rw-r--r-- 1 emma emma 548656 Jun 27 05:55 gzip.tar.gz
-rw-r--r-- 1 emma emma 147068 Jun 27 05:56 xz.tar.xz
Notice how in the example the .tar files have different sizes. This shows that they were successfully compressed. If you create compressed .tar archives, you should always add a second file extension denoting the algorithm you used. They are .xz, .bz2, and .gz for xz, bzip2, and gzip, respectively. Sometimes shortened extensions such as .tgz are used.
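When reading an archive from a file, GNU tar normally recognizes the compression format by itself, so the compression option is only needed when creating the archive. The sketch below assumes GNU tar and gzip and uses made-up file names. It lists a gzip-compressed archive without the z option and uses gzip -t, which tests the integrity of compressed data, to confirm that the file really is gzip data.

```shell
#!/bin/sh
# Sketch: GNU tar detects the compression format when reading an archive.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

echo "sample" > file1
tar -czf sample.tar.gz file1      # z is required when creating

# No z here: tar inspects the file and decompresses it on the fly.
listing=$(tar -tf sample.tar.gz)
echo "$listing"

# gzip -t exits successfully only for intact gzip data.
if gzip -t sample.tar.gz; then
    format_check="valid gzip data"
else
    format_check="not gzip data"
fi
echo "$format_check"
# Remove the scratch directory when finished: rm -rf "$workdir"
```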
It is possible to add files to already existing uncompressed tar archives. Use the u option to do this. If you attempt to add to a compressed archive, you will get an error.
$ cd ~/linux_essentials-3.1/compression
$ ls
bigfile   bigfile3  bigfile-gz1.gz  bigfile-xz1.xz  bzip2.tar.bz2  hosts.gz
bigfile2  bigfile4  bigfile-gz9.gz  bigfile-xz9.xz  gzip.tar.gz    xz.tar.xz
$ tar cf plain.tar bigfile bigfile2 bigfile3
$ tar tf plain.tar
bigfile
bigfile2
bigfile3
$ tar uf plain.tar bigfile4
$ tar tf plain.tar
bigfile
bigfile2
bigfile3
bigfile4
$ tar uzf gzip.tar.gz bigfile4
tar: Cannot update compressed archives
Try 'tar --help' or 'tar --usage' for more information.
Managing ZIP files
Windows machines often don’t have applications to handle tar balls or many of the compression tools commonly found on Linux systems. If you need to interact with Windows systems, you can use ZIP files. A ZIP file is an archive file similar to a compressed tar file.
The zip and unzip programs can be used to work with ZIP files on Linux systems. The example below should be all you need to get started using them. First we create a set of files:
$ cd ~/linux_essentials-3.1
$ mkdir zip
$ cd zip/
$ mkdir dir
$ touch dir/file1 dir/file2
Now we use zip to pack these files into a ZIP file:
$ zip -r zipfile.zip dir
  adding: dir/ (stored 0%)
  adding: dir/file1 (stored 0%)
  adding: dir/file2 (stored 0%)
$ rm -rf dir
Finally, we unpack the ZIP file again:
$ ls
zipfile.zip
$ unzip zipfile.zip
Archive:  zipfile.zip
   creating: dir/
 extracting: dir/file1
 extracting: dir/file2
$ find
.
./zipfile.zip
./dir
./dir/file1
./dir/file2
When adding directories to ZIP files, the -r option causes zip to include a directory’s contents. Without it, you would have an empty directory in the ZIP file.
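The difference the -r option makes can be seen by listing two archives with unzip -l. This is only a sketch: it assumes the zip and unzip programs are installed (it skips the demonstration if they are not), and the archive names are made up.

```shell
#!/bin/sh
# Sketch: compare a ZIP file created with and without -r.
set -u
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    zip_available=yes
    workdir=$(mktemp -d)  # hypothetical scratch directory
    cd "$workdir"

    mkdir dir
    touch dir/file1 dir/file2

    zip -q flat.zip dir             # without -r: only the directory entry
    zip -q -r recursive.zip dir     # with -r: the directory and its contents

    # unzip -l lists the entries; count the lines naming files inside dir/.
    flat_files=$(unzip -l flat.zip | grep -c 'dir/file' || true)
    recursive_files=$(unzip -l recursive.zip | grep -c 'dir/file' || true)

    echo "files stored without -r: $flat_files"
    echo "files stored with -r:    $recursive_files"
    # Remove the scratch directory when finished: rm -rf "$workdir"
else
    zip_available=no
    echo "zip or unzip is not installed; skipping the demonstration"
fi
```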
Guided Exercises
- According to the extensions, which of the following tools were used to create these files?
Filename | tar | gzip | bzip2 | xz |
---|---|---|---|---|
archive.tar | | | | |
archive.tgz | | | | |
archive.tar.xz | | | | |
- According to the extensions, which of these files are archives and which are compressed?
Filename | Archive | Compressed |
---|---|---|
file.tar | | |
file.tar.bz2 | | |
file.zip | | |
file.xz | | |
- How would you add a file to a gzip compressed tar file?
- Which tar option instructs tar to include the leading / in absolute paths?
- Does zip support different compression levels?
Explorational Exercises
- When extracting files, does tar support globs in the file list?
- How can you make sure a decompressed file is identical to the file before it was compressed?
- What happens if you try to extract a file from a tar archive that already exists on your filesystem?
- How would you extract the file archive.tgz without using the tar z option?
Summary
Linux systems have several compression and archiving tools available. This lesson covered the most common ones. The most common archiving tool is tar. If interacting with Windows systems is necessary, zip and unzip can create and extract ZIP files.
The tar command has a few options that are worth memorizing. They are x for extract, c for create, t for view contents, and u to add or replace files. The v option lists the files which are processed by tar while creating or extracting an archive.
The typical Linux distribution’s repository has many compression tools. The most common are gzip, bzip2, and xz. Compression algorithms often support different levels that allow you to optimize for speed or file size. Files can be decompressed with gunzip, bunzip2, and unxz.
Compression tools commonly have programs that behave like common text file tools, with the difference being they work on compressed files. A few of them are zcat, bzcat, and xzcat. Compression tools typically ship with programs with the functionality of grep, more, less, diff, and cmp.
Commands used in the exercises:
bunzip2 - Decompress a bzip2 compressed file.
bzcat - Output the contents of a bzip2 compressed file.
bzip2 - Compress files using the bzip2 algorithm and format.
gunzip - Decompress a gzip compressed file.
gzip - Compress files using the gzip algorithm and format.
tar - Create, update, list and extract tar archives.
unxz - Decompress an xz compressed file.
unzip - Decompress and extract content from a ZIP file.
xz - Compress files using the xz algorithm and format.
zcat - Output the contents of a gzip compressed file.
zip - Create and compress ZIP archives.
Answers to Guided Exercises
- According to the extensions, which of the following tools were used to create these files?
Filename | tar | gzip | bzip2 | xz |
---|---|---|---|---|
archive.tar | X | | | |
archive.tgz | X | X | | |
archive.tar.xz | X | | | X |
- According to the extensions, which of these files are archives and which are compressed?
Filename | Archive | Compressed |
---|---|---|
file.tar | X | |
file.tar.bz2 | X | X |
file.zip | X | X |
file.xz | | X |
- How would you add a file to a gzip compressed tar file?
  You would decompress the file with gunzip, add the file with tar uf, and then compress it with gzip.
- Which tar option instructs tar to include the leading / in absolute paths?
  The -P option. From the man page:
  -P, --absolute-names
  Don't strip leading slashes from file names when creating archives
- Does zip support different compression levels?
  Yes. You would use -#, replacing # with a number from 0-9. From the man page:
  -# (-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
  Regulate the speed of compression using the specified digit #, where -0 indicates no compression (store all files), -1 indicates the fastest compression speed (less compression) and -9 indicates the slowest compression speed (optimal compression, ignores the suffix list). The default compression level is -6.
  Though still being worked, the intention is this setting will control compression speed for all compression methods. Currently only deflation is controlled.
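The sequence described in the third answer (decompress, update, recompress) can be sketched as follows. The file and archive names are made up for illustration.

```shell
#!/bin/sh
# Sketch: add a file to a gzip-compressed tar archive in three steps.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

echo "one" > file1
echo "two" > file2
tar -czf bundle.tar.gz file1      # starting point: a compressed archive

gunzip bundle.tar.gz              # step 1: leaves the plain bundle.tar
tar -uf bundle.tar file2          # step 2: add the new file
gzip bundle.tar                   # step 3: back to bundle.tar.gz

listing=$(tar -tzf bundle.tar.gz)
echo "$listing"
# Remove the scratch directory when finished: rm -rf "$workdir"
```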
Answers to Explorational Exercises
- When extracting files, does tar support globs in the file list?
  Yes, you would use the --wildcards option. --wildcards must be placed right after the tar file when using the no-dash style of options. For example:
  $ tar xf tarfile.tar --wildcards dir/file*
  $ tar --wildcards -xf tarfile.tar dir/file*
- How can you make sure a decompressed file is identical to the file before it was compressed?
  You don’t need to do anything with the tools covered in this lesson. All three of them include checksums in their file formats, which are verified when the files are decompressed.
- What happens if you try to extract a file from a tar archive that already exists on your filesystem?
  The file on your filesystem is overwritten with the version that is in the tar file.
- How would you extract the file archive.tgz without using the tar z option?
  You would decompress it with gunzip first.
  $ gunzip archive.tgz
  $ tar xf archive.tar
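A variation on the last answer avoids creating the intermediate archive.tar file by piping the decompressed data straight into tar. In this sketch, gunzip -c writes to standard output and the file name - tells tar to read the archive from standard input. The directory and file names are made up.

```shell
#!/bin/sh
# Sketch: extract a .tgz file through a pipe, without the tar z option.
set -eu
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir"

mkdir payload
echo "content" > payload/file1
tar -czf archive.tgz payload      # build a sample archive
rm -r payload                     # remove the originals

# gunzip -c leaves archive.tgz untouched; "f -" reads from standard input.
gunzip -c archive.tgz | tar xf -

ls payload
# Remove the scratch directory when finished: rm -rf "$workdir"
```

Unlike the two-step approach, this leaves archive.tgz in place in its compressed form.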