6  More advanced commands

Prior experience

You can skip this section and proceed directly to the exercises if you are already familiar with commands like grep, sort and cut.

Tip

Remember that we have provided a list of helpful tips and hints in the appendix: Section A.1.

6.1 Searching in files: grep

Being able to search through (long) text files is incredibly useful in a wide range of scenarios. For this, we make use of the grep command. It is a very powerful and complex command, with many different options to tweak its behaviour, but even just the basic version can already be a lifesaver. The basic syntax is

grep "PATTERN" <path/to/target_file>

For example, we can look for the word “evolution” in the Origin of Species:

$ grep "evolution" long.txt
evolutionists that mammals are descended from a marsupial form; and if
At the present day almost all naturalists admit evolution under some
evolutionists; but there is no need, as it seems to me, to invoke any
Everyone who believes in slow and gradual evolution, will of course
of gradual evolution, through the preservation of a large number of
a strong disbeliever in evolution, but he appears to have been so much
historian will recognise as having produced a revolution in natural
the fact would be fatal to the theory of evolution through natural
the revolution in our palæontological knowledge effected by the
opposed to the admission of such prodigious geographical revolutions
has thus been arrived at; and the belief in the revolution of the earth
subject of evolution, and never once met with any sympathetic
agreement. It is probable that some did then believe in evolution, but
evolution. There are, however, some who still think that species have
we can dimly foresee that there will be a considerable revolution in

It might not be obvious from the above snippet, but inside your own terminal, the matching words in the search results will be highlighted. As you can see, grep will return each line that contains a match. Also note how partial matches like revolution are returned as well.

When you run grep "species" long.txt, you will indeed find many occurrences of this word. However, we are missing all the occurrences of “Species”. Try running grep "Species" long.txt to compare the results. Lastly, try running the command grep -i "species" long.txt. This option will

Remember, capitalization matters to grep (and to Unix as a whole)!

Aside from the -i option explained in the box above, there exist several other flags to improve your search results.

Option Effect
-i Case insensitive (i.e., ACTG = actg)
-n Also print the line number of the result
-c Count the number of matches
-5 Print 5 lines surrounding each match
-e/-E/-P Use regular or extended regular expressions
-r Recursive search through all files in a folder

The -# option is particularly useful to learn more about the context around you search result. You can supply any number of lines here, which will get printed both before and after each match. The example below, you can see how it helps us find the identifier of a DNA read that contains a particular sequence (AACCGGGGT):

$ grep -3 AACCGGGGT PF0512_S47_L001_R1_001.fastq
+
CCCCCGGCFGGGGFGGGGDGCEFGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGFFGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGD+,A=FG,:<?F5FGGGD>F9BDBDE9FEDAFGAFGGB=EDF,,>@FCCF;@;D;CF;,5;DEGG84,@@,3@ED<@,,@7,;@,7EC2=E>C977=
@M05795:43:000000000-CFLMP:1:2106:16840:21815 1:N:0:47
AGCCATACCAAGACCACAATTCTGAAGAGGAAACAAAACAAAAAAAAAAAAATAATTAAAAAAAAAAAAATTTAAAATTAAAAAAAAAATTTTTTTATTAAAATAATAAATATTAATTTTTATAATATAAATAAAATCCTATTTTACCCCACCAACCGGGGTTCATCCCCGGGCTCTTATACACATTTCCTAACCCACAAAAAGTACGAACAACACGCAACCCCCCTTCTGCCTTAAAAAAAAAACAAATCAAAATACACAAATATATCGAACATACAGCAACTACAAATGAAGATGTGGT
+
CCCCCGGFGGGFGGGGCFG9@@C<FGDG8EGGGFGGGGGGGGGGGGGGGGGD,:C,96CFD,8>FG7FGG,,,,B<B9,9CFFGGCG+3,,8BF@@,=8,8,=,C,,3@,,,3,3@,DGGFC,,,,,,7,,,,7>C,,3,,,7@@@:9,6**5,,**4*1*=8,,+5>=:****3/+>2;;92++2+++4++29*/:**9*1**3*0*/*/95)1*1))))29*05)202.:96*/-@=<(7:(.)00()))))))).-,)(())--)-(((.().,.:)8..(,).).))-4))..)))(
@M05795:43:000000000-CFLMP:1:2106:14940:21820 1:N:0:47

The -c options makes grep return the total number of matches it found. This method of counting is useful to complement the wc command, in case you are presented with a file that does not have such an orderly structure as the FASTQ format we saw earlier. To demonstrate, consider the penguins.csv file, which contains morphological data on three different penguin species.1 We can count the number of Adelie penguin records via:

1 Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

grep -c "Adelie" penguins.csv
152

In some situations we want to search through multiple files simultaneously. This is where the -r/--recursive flag comes in. It allows us to target a directory and search through all of its contents (including subdirectories). Let us try searching for the same DNA sequence as before, but this time targeting all the files in the unix-demo directory:

$ grep -r "AACCGGGGT" .
./PF0512_S47_L001_R1_001.fastq:AGCCATACCAAGACCACAATTCTGAAGAGGAAACAAAACAAAAAAAAAAAAATAATTAAAAAAAAAAAAATTTAAAATTAAAAAAAAAATTTTTTTATTAAAATAATAAATATTAATTTTTATAATATAAATAAAATCCTATTTTACCCCACCAACCGGGGTTCATCCCCGGGCTCTTATACACATTTCCTAACCCACAAAAAGTACGAACAACACGCAACCCCCCTTCTGCCTTAAAAAAAAAACAAATCAAAATACACAAATATATCGAACATACAGCAACTACAAATGAAGATGTGGT
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:AATAAAACCGGGGTTGATACCACCACTTCCAGGTTCCCACATTCCAAGTCCCCTCAGCCA
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:CTGGAGTCAGGACGTGAGCCGACTTGCTTAAAAATAAATCCACATGGCTGAACCGGGGTT
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:ACAGCTAACCGGGGTTTTAGTATATGTGCCACATCTCTGTAAATGTTCACTTCCTAGGCA
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:TATGATCGTGCCACTGCACTTCAACCGGGGTGACAAAGCGAAAACCGTGTCTCTAAAAAA

Instead of using the -r flag, we can also rely on globbing again (see Section 5.3.2). To search for the string “needle” in all .txt files in a particular folder, we can do the following:

grep "needle" path/to/directory_with_txt_files/*.txt

We already mentioned regular expressions in the previous section: they allow you to search for particular patterns that can match more than one exact string of text. This is tremendously useful, but we will not dive deeply into how they work during this course. If you are interested, you can check out an excellent tutorial here.

One special pattern that we will introduce is the start or end of line anchor. You can search for patterns that start at the beginning of a line by prefixing the pattern with a ^ symbol. Similarly, you can search for patterns that occur at the end of a line by using a $ symbol. For example, if we search for grep ">@" read.fastq in a FASTQ file, we will only retrieve lines that start with the @ symbol, ignoring any @ symbols that are used elsewhere on a line.

As we saw in Section 5.1.3.1, every read in a FASTQ file is represented by four lines, the first of which is the header and which always starts with an @ symbol. So far so good, but the fourth line, which contains the quality score of each base in the sequence, can also start with an @ symbol, because @ is just another score value that could happen to be occur for the first base in the sequence.

We will see more elaborate use cases for grep when we introduce the Unix concepts of piping and redirection.

6.2 Tabular data: cut

We already encountered tabular data (the penguin dataset in .csv format) when talking about grep. Tabular data files like .csv are a very common format, and not just in bioinformatics.

Tabular data and .csv files

Tabular data files are usually plain text files, where each row corresponds to a record (e.g., an individual penguin), and each column represents a particular field (e.g., species, flipper length, body mass, etc.). The columns can be separated by different field delimiters or separators. In .csv files, these are usually commas (comma separated values), but they can also be TABS (.tsv) or semicolons (;).

A particularly useful Unix tool for manipulating tabular data files, is cut. It allows us to extract particular columns from these files. The syntax is as follows:

cut [OPTIONS] target_file
Option Effect
-d ","/--delimiter ";" Change the default delimiter (TAB) to another character like ,
-f 1 Select the first column
-f 2,3 Select the second and third column
-f 1-3,6 Select columns one through three and columns six
--complement -f 1 Select all columns except for the first one
-r Recursive search through all files in a folder

6.2.1 SAM file format

Aside from .csv and .tsv files, there are many bioinformatics file formats that also follow a tabular lay-out. One of these is the SAM file format.

SAM: Sequence Alignment/Map Format

The SAM file format is a tab-delimited text file that stores information about the alignment of sequence reads to a reference genome.

When considering the steps that are taken during the variant calling analysis of sequence reads, SAM files result when the reads inside a FASTQ file are processed by an alignment tool like bwa and minimap. These tools take each individual read corresponding to a particular sample and try to map them to their most likely position in a reference genome.

SAM files consists of an optional header section followed by an alignment section.

The header lines always start with an @ symbol and contain information about e.g., the reference genome that was used and the sorting of the alignments. This section is not yet tab-delimited.

The alignment section contains a tab-delimited line for each sequence read that was aligned to the reference. There are 11 mandatory fields, containing information on the position of the read in the reference genome (i.e., where it was mapped), the sequence itself, its quality score ( = the quality of each base pair of the sequence during sequence calling, cf. FASTQ format), its mapping score ( = how well the read aligned to the reference genome), etc.

Col Field Type Brief description
1 QNAME String Query template NAME
2 FLAG Int bitwise FLAG
3 RNAME String References sequence NAME
4 POS Int 1- based leftmost mapping POSition
5 MAPQ Int MAPping Quality
6 CIGAR String CIGAR string
7 RNEXT String Ref. name of the mate/next read
8 PNEXT Int Position of the mate/next read
9 TLEN Int observed Template LENgth
10 SEQ String segment SEQuence
11 QUAL String ASCII of Phred-scaled base QUALity+33

We will not dive into the details of each of these fields for now, but if you are interested you can check out the official documentation or this useful write-up.

A typical SAM file will look something like this:

@SQ     SN:ref  LN:45
@SQ     SN:ref2 LN:40
r001    163     ref     7       30      8M4I4M1D3M      =       37      39      TTAGATAAAGAGGATACTG     *       XX:B:S,12561,2,20,112
r002    0       ref     9       30      1S2I6M1P1I1P1I4M2I      *       0       0       AAAAGATAAGGGATAAA       *
r003    0       ref     9       30      5H6M    *       0       0       AGCTAA  *
r004    0       ref     16      30      6M14N1I5M       *       0       0       ATAGCTCTCAGC    *
r003    16      ref     29      30      6H5M    *       0       0       TAGGC   *
r001    83      ref     37      30      9M      =       7       -39     CAGCGCCAT       *
x1      0       ref2    1       30      20M     *       0       0       aggttttataaaacaaataa    ????????????????????
x2      0       ref2    2       30      21M     *       0       0       ggttttataaaacaaataatt   ?????????????????????
x3      0       ref2    6       30      9M4I13M *       0       0       ttataaaacAAATaattaagtctaca      ??????????????????????????
x4      0       ref2    10      30      25M     *       0       0       CaaaTaattaagtctacagagcaac       ?????????????????????????
x5      0       ref2    12      30      24M     *       0       0       aaTaattaagtctacagagcaact        ????????????????????????
x6      0       ref2    14      30      23M     *       0       0       Taattaagtctacagagcaacta ???????????????????????

This example has two header lines denoting the reference genome contigs and their lengths, followed by 12 aligned reads.

Now inspect the SAM file in ./training/unix-demo/PF0302_S20.sort.sam and try to identify the different sections that make up the file.

6.3 File sizes: du

We already saw that the ls -lh can be used to figure out the file size of files in a particular directory. du is another tool to do this, but it operates on individual files or directories directly. Like ls, it also provides the -h/--human-readable option to return file sizes in KB/MB/GB, so it is generally recommended to always use this option. When used on a file, it will simply return its size, but when used on a directory, it will output information for all files, as well as the total file size of the entire directory (the final line of the output).

# targetting an individual file
$ du -h Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa
5.9M    Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa

# targetting a directory
$ du -h training/data/
284M    training/data/fastq
62M     training/data/reference
346M    training/data/

6.4 Compressed files: gzip

We already introduced the concept of file compression when talking about the FASTQ files in the training/data/fastq directory. As a reminder, compressed files are binary files (as opposed to human-readable plain text files) that are used to reduce the file size for more efficient storage. Many of the files that we use in bioinformatics tend to be compressed. Some of the tools we use, will not work on compressed files (e.g., try to cat a compressed file and see what happens), so we either need to 1) use specialized tools that expect compressed files as their input, or 2) decompress or extract the files first.

For gzipped (.gz) files specifically, we can do this via the gzip and gunzip commands. The former allows us to create a gzip-compressed version of the file, whereas the latter will extract one back to a plain text file. The basic syntax is gzip/gunzip <path/to/file>, but a very useful option is the -k/--keep flag. Without it, compressing a file would replace the uncompressed file with the new compressed one (vice versa: extracting would replace the compressed version with the extracted one), but when using the flag both files will be retained.

$ du -h PF0512_S47_L001_R1_001.fastq
67M     PF0512_S47_L001_R1_001.fastq
$ gzip --keep PF0512_S47_L001_R1_001.fastq
$ du -h PF0512_S47_L001_R1_001.fastq.gz
17M     PF0512_S47_L001_R1_001.fastq.gz

After compressing the FASTQ file with gzip, it shrunk to less than a third of its original size.

Do note that there exist other types of file compression besides gzip, like .zip/.7zip. In unix we also often make use of tar (which technically is not a compression tool, but a file archiver). File compression and tar can even be combined, leading to files with suffixes like .tar.gz. This allows us to compress entire directories, instead of only individual files.

To extract these so called tarballs, we need to use the tar command:

# extract .tar.gz archive
$ tar -xzvf tar_archive.tar.gz
tar_archive/
tar_archive/3
tar_archive/2
tar_archive/1

$ ls tar_archive
1  2  3

# create .tar.gz archive
$ tar -czvf new_archive.tar.gz <path/to/target_directory>

This command is notorious for how arcane its option flags are, but you can either try to remember it using a mnemonic (“eXtract/Compress Ze Vucking Files”, pronounced like a B-movie vampire) or the meaning of the individual flags (z tells tar that we are using gzip compression, -v stands for --verbose to make the command show more information and output, -c/x switches between compression and extraction mode, -f is the last option and points to the tar file) . And of course, the correct syntax is only a google/tldr search or tar --help call away.

6.4.1 BAM file format

A final example of a binary file that you might encounter in the bioinformatic analysis of AmpliSeq sequencing is the BAM file format:

BAM: Binary Alignment Map

A BAM file is nothing more than a compressed, binary representation of a SAM file. It is used to reduce the file size of SAM files and also improve the speed of specific operations like sorting or retrieving information from the file.

Many bioinformatics tools can handle BAM file natively. However, similar to gzipped files, BAM files are binary and we can no longer preview them using tools like cat and less.

One of the tools that is often used to convert between the two types of alignment file formats is samtools, which is maintained by the same group of people who manage the format specification for SAM/BAM files.

Aside from BAM, you might also encounter the CRAM file format. This is a more recent and even more highly optimized compressed alternative for sequence alignments, but it is not as commonly used (yet).

samtools is a commonly used program to manage SAM/BAM files, which we will introduce later when discussing the structure of variant calling pipelines.

6.5 Downloading files: wget

wget is a command that allows you to download files from a particular web address or URL and place them in your working directory. While there are several optional flags, in its most basic form the syntax is simply: wget URL. This command is not only useful when automating certain tasks, but also crucial if you ever find yourself in a Unix environment that does not have a GUI at all (e.g., compute clusters or cloud servers).

  • At the top of the page, click on Data -> Download data files.
  • Search for falciparum 3D7 and then narrow down your search by selecting the most recent release and the FASTA file format. Alternatively, you can click on the Download Archive link in the top and navigate the file directory to the current release.
  • The file name of the 3D7 reference genome in FASTA format is PlasmoDB-66_Pfalciparum3D7_Genome.fasta.
  • Right click the file and copy its URL to your clipboard: https://plasmodb.org/common/downloads/release-66/Pfalciparum3D7/fasta/data/PlasmoDB-66_Pfalciparum3D7_Genome.fasta.
  • Create and navigate to the output directory (mkdir -p ./training/data/results and cd ./training/data/results)
  • Download the file here using the command: wget https://plasmodb.org/common/downloads/release-66/Pfalciparum3D7/fasta/data/PlasmoDB-66_Pfalciparum3D7_Genome.fasta.

Two optional flags that you might find useful are: 1) -o allows you to rename the download file, and 2) -P <path/to/directory saves the file in a directory of your choice instead of the current working directory. Of course, these are just small convenient timesavers, since you can always cd to a particular location and use mv to rename the file afterwards.

Lastly, an alternative to wget that you might encounter at some point is curl. On the whole, it acts quite similar to wget for the most part.

6.6 Retrieving file names: basename

basename is a rather simple command: if you give it a long file path, it will return the final section (i.e., the file name).

# starting in the `training` directory
$ pwd
/home/bioinfo50/Desktop/Fa/FA5-bioinformatics/training

# get the file name for the reference genome we just downloaded
$ basename data/reference/PlasmoDB-65_Pfalciparum3D7_Genome.fasta
PlasmoDB-65_Pfalciparum3D7_Genome.fasta

We can also use this command to remove a particular suffix from a filename:

$ basename PF0512_S47_L001_R1_001.fastq .fastq
PF0512_S47_L001_R1_001

At this point in time, it might not seem particularly useful to be able to extract the file name of a file, but when we introduce the concept of for loops and bash scripting, it will become more clear why this can be so useful.

6.7 Sorting and removing duplicates: sort and uniq

The final two commands that we will introduce are yet again tools to manipulate plain text files. The first is sort, which does exactly that you expect it to. It can sort all the rows in a text file. Its syntax is:

sort [OPTIONS] <./path/to/file>`

There are optional flags that allow you to choose the type of order to use (e.g., numerical -n/--numeric-sort instead of alphabetical ), reverse the order of the sort (-r/--reverse) or ignore capitals (-f/--ignore-case).

The second command, uniq, is used to remove duplicate lines in a file. It also offers the option to count the frequency of each unique line.

# given the following file
$ cat file_with_duplicates.txt
a
a
b
b
b
b
c

$ uniq -c file_with_duplicates.txt
2 a
4 b
1 c

These two commands are often used in conjunction, because uniq on its own is not capable of filtering out identical lines that are not adjacent. So to truly remove all duplicate lines in a file, we would first need to sort it: sort unsorted_file.txt | uniq. Also note that the sort command also offers a flag -u, which will make it return only unique lines after sorting.

In the next section, we will introduce a method of combining these commands in a more convenient way than running them one by one and without needing to create any intermediary files.

6.8 Other commands

Of course, there exist many more Unix commands than the ones we introduced here. We will end this section by briefly mentioning two that you might run into at some point awk and sed. Both of them allow you to search and replace patterns in text files, and with awk you can even perform more complex operations including calculations. We will not dive into them here, but keep them in the back of your mind for the future.


6.9 Exercises

  1. Report the file size of the Plasmodium vivax P01 reference genome sequence in FASTA format. It can be found under ./training/data/reference/PvivaxP01.genome.fasta.
  2. Find out how many lines of text the file contains.
  3. Search through the file for the > character, which is used to denote every chromosome/contig. Use a single command to count them.
  4. Compress the FASTQ file PF0512_S47_L001_R1_001.fastq in the unix-demo directory using gzip.
  5. Navigate to the directory ./training/data/fastq and in a single command, extract the forward (PF0097_S43_L001_R1_001.fastq.gz) and reverse (PF0097_S43_L001_R2_001.fastq.gz) of the PF0097_S43 sample, without removing the compressed files. Hint: use globbing!
  6. Search both FASTQ files for the read fragment with identifier @M05795:43:000000000-CFLMP:1:1101:21518:5740 2:N:0:43 using a single command.
  7. Compare the file sizes of the two compressed and uncompressed FASTQ files.
  8. Extract the columns containing the island and flipper length of each penguin from the ./training/unix-demo/penguins.csv file.
  9. Count how often the sequence CATCATCATCATCAT occurs in the FASTA file ./training/unix-demo/Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa.
  10. Which command can be used to extract the name of the SAM file ./training/unix-demo/PF0302_S20.sort.sam without the .sam suffix?
  11. Which reference genome was used to create the SAM file?

6.10 Summary

Overview of concepts and commands
  • Tabular data: .csv, .tsv
  • The SAM/BAM file formats store sequence reads aligned to a reference genome.
Command Result
grep <pattern> <path/to/file> Search through a (very large) file for the supplied pattern
du <-h> <path/to/file_or_directory> Check how much space a file or directory occupies
cut -f [--delimiter ","] <path/to/file> Extract columns from tabular data using the specified delimiter
gzip / gunzip (-keep) <path/to/file> Compress or extract a gzip compressed file (.gz)
wget <url> Downloads a file from the URL to the current directory
basename <path/to/file> Returns the name of the file without the path prefix
basename <path/to/file> Returns the name of the file and remove the provided suffix