- Exercise material setup: download the exercises.zip archive file to your local computer and unzip it. This will unpack a directory named exercises, with all the data needed for the course exercises.
- Additional Tasks: at the end of each exercise, you will find a section named Additional Tasks. These sections contain tasks to complete if you still have time after finishing the main exercise. The Additional Tasks sections will not be corrected in class, but their solutions are given in this document.
- Exercise solutions: all exercises and "Additional Tasks" sections have their solution embedded in this document. Solutions are hidden by default, but you can reveal them by clicking on them. Here is an example:
  Exercise solution (click me)
  This reveals the answer.
  We encourage you not to look at the solutions too quickly, and to try to solve the exercises without them. Remember that you can always ask the course teachers for help.
- Tip: if you are viewing these instructions on the GitHub web interface, you can display a table of contents (outline) of this page by clicking on the small icon (it looks like a bulleted list) near the top-right of this page.
Objective: get familiar with navigating the directory tree and listing the content of directories.
- Print your current working directory with the pwd command. This will show you where you are currently located in the directory tree.
- Navigate to the exercises/ directory (the one you unpacked from the zip archive file), then enter the exercise_1 subdirectory.
- Try to run the commands cd . and cd .. What happens? What do . and .. stand for?
- List the content of the exercise_1/ directory with ls, ls -l, ls -lh, and ls -lha.
  - Question: what do the -l, -h and -a options do?
  - Hint: you can use man ls to display the help for the ls command. To exit the help, simply type q on your keyboard.
  - Notes:
    - One-letter options can be grouped together, so ls -lha is the same as ls -l -h -a.
    - Some options have both a "short" and a "long" form. E.g. ls -ah is the short form for ls --all --human-readable.
- List the content of the directory in chronological order (oldest file first) and in reverse chronological order (newest file first).
Tip: a very handy feature of the shell is its ability to auto-complete file/directory names. Simply start typing the name of a file/directory, then press TAB on your keyboard:
- The shell will auto-complete the file name as much as possible.
- If several file names match the characters you have typed, the shell stops the auto-completion at the point where the names diverge.
- To continue auto-completion, type additional characters and press TAB again. You can also press TAB again to display all the matches that remain at this point.
You can try this functionality to auto-complete the name of the file a_regular_file_with_a_really_long_name.md:
- Start by typing ls a_r, then press TAB. You will see that the shell auto-completes up to ls a_regular_file.
- At this point there are 2 possible matches: a_regular_file.txt and a_regular_file_with_a_really_long_name.md. To disambiguate between them, enter the additional character _ and then press TAB again.
- The full name of the file a_regular_file_with_a_really_long_name.md should now have auto-completed.
Exercise solution
- Printing the current working directory:
    pwd
- Navigate to exercises/exercise_1:
    cd /path/to/directory/exercises
    ls -l
    cd exercise_1
    pwd
    ls -l
  Changing directory to exercise_1 can of course also be done in a single command:
    cd /path/to/directory/exercises/exercise_1
- The . symbol is a shortcut for the current directory. So running cd . has no effect, since it simply changes to the directory we are already in. The . shortcut is useful in some situations. E.g. if you want to copy a file to the current directory you can do cp /file/to/copy ., or you can run an executable located in the current directory with ./run_me.sh.
  The .. symbol is a shortcut to the parent directory. These shortcuts can be combined, so e.g. cd ../.. will go up two levels in the directory tree.
- Listing the content of the exercise_1/ directory with different ls options. The effect of the different options is described in the comments of the code block.
    ls         # Prints the names of files and directories.
    ls -l      # Lists the content of the directory in "long listing" format. This
               # provides additional details for each file/directory, such as
               # its permissions, its size and its last modified date.
    ls -lh     # Adding the "-h" option displays file sizes in "human readable"
               # format. File sizes are shown in kB, MB, GB, instead of
               # their size in bytes (octets).
    ls -lha    # Adding the "-a" option additionally displays hidden files and
               # directories. These are files/directories whose name starts with
               # a dot ".".
               # Hidden files are often used to store program configurations.
- List the content of the directory in chronological and reverse chronological order.
    ls -lht    # The "-t" option sorts by time, newest file first.
    ls -lhtr   # The "-r" option reverses the order of sorting.
  Some other useful ls options and shortcuts:
    ls -a      # --all, also show hidden files.
    ls -R      # --recursive, list subdirectories recursively.
    cd .       # Does nothing, we stay in the same directory.
    cd ..      # Go to parent directory.
    cd /       # Go to root directory.
    cd ~       # Go to user's home directory, on Linux: /home/<user name>.
    cd -       # Go back to the previous directory.
    cd         # With no argument, cd brings you back to your home directory.
- Try the cd ~ and cd - shortcuts. What do they do?
- Create an alias named ll that runs the following command: ls -lh --group-directories-first --color=auto.
  Notes:
  - On some Linux systems, an ll alias may already exist.
  - To list your currently defined aliases, type alias to list them all, or alias <name of alias> to list a specific one (e.g. alias ll).
  - Aliases only live as long as your current shell session. To make aliases permanent, they must be defined inside a configuration file, such as ~/.bashrc, so that they get loaded each time a new shell is spawned.
  - To remove an alias from the current shell, use unalias <alias name>.
  - To remove a permanent alias, remove it from the config file (e.g. .bashrc) where it is defined.
- Compute the size of a directory. To display the size of a directory, the command du -sh <directory> can be used. Try it on the directories found in exercise_1.
- Let's look at a detail of how the bash shell displays file sizes.
  Go into the directory a_directory and list its content using the following commands, and look at how the file size is indicated:
  - ls -l: lists the file size in bytes/octets.
  - ls -lh (you can also use your new ll alias!): the -h option (the short form of --human-readable) lists the file size in a more readable format, using the k, M, G, ... unit abbreviations for kB (kilobyte), MB (megabyte), GB (gigabyte), etc.
  Note: in everyday language, the term kilobyte (abbreviated kB) is used interchangeably for either 1000 bytes or 1024 bytes, because they represent almost the same quantity of bytes. To be precise, the proper name for a unit of 1024 bytes is a kibibyte (KiB), while a kilobyte designates 1000 bytes. Similarly, a megabyte is 1'000'000 bytes, and a mebibyte is 1024^2 bytes (same with gigabyte vs. gibibyte, terabyte vs. tebibyte, etc.).
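To make the difference between decimal and binary units concrete, the arithmetic can be checked directly in the shell. This is only an illustrative sketch using shell arithmetic; the variable names are arbitrary.

```shell
# Decimal (SI) units are powers of 1000, binary (IEC) units are powers of 1024.
kilobyte=1000
kibibyte=1024
megabyte=$((1000 * 1000))
mebibyte=$((1024 * 1024))

echo "1 kB  = ${kilobyte} bytes"
echo "1 KiB = ${kibibyte} bytes"
echo "1 MB  = ${megabyte} bytes"
echo "1 MiB = ${mebibyte} bytes"
echo "MiB - MB = $((mebibyte - megabyte)) bytes"
```

The gap grows with the size of the unit: about 2.4 % at the kilo scale and about 4.9 % at the mega scale.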
Additional tasks solution
- cd ~ and cd - shortcuts:
  - ~ is a shortcut for the "home directory", and therefore cd ~ is a shortcut to change directory to your home directory.
  - cd - changes back to the previous working directory. It is handy if you want to return to the directory you were in just before.
- Create an ll alias:
    # Create a new "ll" alias:
    alias ll='ls -lh --group-directories-first --color=auto'
  Here are some more useful commands for aliases:
    alias          # Lists the currently defined aliases.
    unalias ll     # Removes the alias from the current shell session.
    # The "type" command tells if a command is an alias:
    #  * If yes, the aliased command is shown.
    #  * If not, the path to the binary file is shown.
    type ll        # -> ll is aliased to `ls -lh --group-directories-first --color=auto'
    type bash      # -> bash is /usr/bin/bash
- Show the size of the directories:
    du -sh a_directory    # 20K (20 kilobytes)
    du -sh b_directory    # 4K (directory is empty, 4K is the size of an empty dir)
    # Using the ? wildcard character, we can also compute the size of both
    # directories in a single command.
    du -sh ?_directory
- Nothing to correct.
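As mentioned in the notes on aliases, a permanent alias is simply an alias definition stored in a file that the shell reads at startup. The sketch below illustrates this mechanism with a scratch file standing in for ~/.bashrc (the file name my_bashrc is hypothetical), so that your real configuration is left untouched.

```shell
# A scratch file stands in for ~/.bashrc here (hypothetical name), so that
# the real configuration file is not modified.
rc_file="$(mktemp -d)/my_bashrc"

# A "permanent" alias is an ordinary alias definition stored in a file...
echo "alias ll='ls -lh --group-directories-first --color=auto'" >> "$rc_file"

# ...that gets read by the shell. Here we read the file by hand with ".",
# similar to what bash does with ~/.bashrc when an interactive shell starts.
. "$rc_file"

alias ll        # Show the definition that was just loaded.
```

To apply this for real, the same alias line would be appended to ~/.bashrc instead of the scratch file.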
Objective: learn to use wildcard characters to match existing file names.
Notes:
- The technical term for the expansion of wildcard characters by the shell is filename expansion, but it is often referred to as globbing.
- Globbing only matches existing file/directory names: expansion will not happen if there is no matching file/directory. This is why its official name is filename expansion.
- Tip: if you don't want a specific wildcard character to expand, you can escape it by prefixing it with \. E.g. ls test_\*.md will try to list a file named exactly test_*.md.
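The notes above can be observed in a scratch directory. This is a minimal sketch assuming the default shell settings (in bash, options such as nullglob or failglob would change what happens to a pattern without matches); the two file names are sample names used only for the demonstration.

```shell
# Work in an empty scratch directory containing two sample files.
cd "$(mktemp -d)"
touch Indri_indri Homo_sapiens

matched=$(echo I*)        # The pattern matches a file: it expands to its name.
unmatched=$(echo Z*)      # No matching file: the pattern is left as typed.
escaped=$(echo I\*)       # Escaped wildcard: no expansion takes place.
quoted=$(echo 'I*')       # Quoting has the same effect as escaping.

echo "matched:   $matched"
echo "unmatched: $unmatched"
echo "escaped:   $escaped"
echo "quoted:    $quoted"
```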
To start this exercise, enter the directory exercise_2/RedList_mammals and list its content with the command ls.
You will see that it contains a large number of files, whose names are those of the critically endangered mammal species as listed in the International Union for Conservation of Nature (IUCN) Red List.
The species names are given in binomial nomenclature (i.e. Latin names), and each file name has the structure Genus_species. E.g. if there was a file for humans, it would be named Homo_sapiens.
Using ls and wildcard characters, perform the following tasks:
- List all files starting with the letter i (upper or lower case).
  Hint: you should have 1 match.
- List the files of Rhinoceros species (genus Rhinoceros, Dicerorhinus, and Diceros).
  Hint: you should have 3 matches.
- List the files of Gibbon species from the genus Nomascus whose species name ends with either r or i.
  Hint: you should have 2 matches.
- List the files of species that meet both of the following conditions:
  - The genus name contains the pattern "l + a single letter + a letter between a and h", e.g. lia or lug.
  - The species name starts with a g.
  For instance, Eubalaena glacialis, the North Atlantic right whale, would be a match, because its genus name Eubalaena contains the pattern lae and its species name glacialis starts with a g.
  Hint: you should have 3 matches.
Exercise solution
- There is only one file that starts with the letter i:
    cd exercise_2/RedList_mammals/
    ls -l I*    # Returns a single match: Indri_indri (a lemur species).
  Since all file names start with a capital letter, ls -l I* is sufficient to list all files starting with the letter i. If there were also files starting with lower case letters, we would use ls -l [iI]*.
  Please note that ls -l [iI]* and ls -l i* I* are not completely equivalent expressions: ls -l i* I* will return an error unless there are files starting with i as well as files starting with I (you can test it in your terminal).
- The critically endangered Rhino species are:
    ls -l Rhinoceros_* Dicerorhinus_* Diceros_*
    ls -l Rhinoceros* Dicero*    # Gives the same result.
    # Dicerorhinus_sumatrensis (Sumatran Rhinoceros).
    # Diceros_bicornis (Black Rhino).
    # Rhinoceros_sondaicus (Javan Rhinoceros).
  Since the genus names Dicerorhinus and Diceros both start with Dicero, we can use the pattern Dicero* to match both genuses at the same time.
  There exist 2 other Rhino species:
  - The White Rhino (Ceratotherium simum) is listed as "Near Threatened" by the IUCN. This species has two subspecies: the Northern and Southern White Rhino. The Northern White Rhino subspecies is critically endangered with only 2 female individuals remaining worldwide (living in semi-captivity in Kenya).
  - The Greater One-Horned Rhino (a.k.a. Indian Rhino), Rhinoceros unicornis, is listed as "Vulnerable" by the IUCN.
- Gibbon Nomascus species whose species name ends in r or i:
    ls -l Nomascus_*[ri]
    # Nomascus_concolor (Black crested gibbon).
    # Nomascus_siki (Southern white-cheeked gibbon).
  Note: in this specific case, using ls -l Nomascus_*[ri] or ls -l Nomascus*[ri] gives the same result, but in principle the former is safer to use because it will only match genus names corresponding to exactly Nomascus, while the latter could match any genus name starting with Nomascus.
- Species matching both conditions:
    ls -l *l?[a-h]*_g*
    # Eubalaena_glacialis (North Atlantic right whale).
    # Gorilla_gorilla (Western gorilla).
    # Plecturocebus_grovesi (Alta Floresta titi monkey - a new world monkey)
- List the files of species that satisfy both of the following conditions:
  - The genus name contains the pattern "a or o, followed by exactly 2 letters, followed by the letter x" (e.g. abix or onyx).
  - The species name ends either with an i or with the pattern ra.
  For instance, Pteralopex pulchra, the Montane monkey-faced bat, would be a match, because its genus name Pteralopex contains the pattern opex and its species name pulchra ends with the pattern ra.
  Hint: this cannot be matched in a single expression using only regular file globbing (i.e. filename expansion). You will need to either:
  - Use 2 expressions with regular globbing.
  - Use brace expansion.
  - Use pattern matching.
  Hint: you should have 4 matches.
- Try to add quotes (single or double) around a globbing pattern with wildcards, e.g. ls -l "I*":
  - What difference does it make (if any)?
  - Can you think of a use case for using quotes around a pattern with wildcards?
Additional tasks solution
- File names matching the requested criteria:
    ls -l *[ao]??x_*ra *[ao]??x_*i    # Solution using pure globbing. Requires some duplication.
    ls -l *[ao]??x_*@(ra|i)           # Solution using pattern matching.
    ls -l *[ao]??x_*{ra,i}            # Solution using both globbing and brace expansion.
    # Myosorex_eisentrauti (Eisentraut's mouse shrew).
    # Pteralopex_flanneryi (Greater monkey-faced bat).
    # Pteralopex_pulchra (Montane monkey-faced bat).
    # Sorex_sclateri (Sclater's shrew).
  To avoid duplicating the *[ao]??x_* part, we can use either pattern matching or brace expansion.
  - Pattern matching: here @(ra|i) matches either the pattern ra or i. In bash, this syntax is only available when the extglob shell option is enabled (shopt -s extglob).
  - Brace expansion: during the shell's processing, braces {} are expanded first (before globbing), and therefore:
      ls -l *[ao]??x_*{ra,i}
    is expanded into:
      ls -l *[ao]??x_*ra *[ao]??x_*i
    before the actual globbing is performed.
- Adding single or double quotes around the search pattern prevents the shell from performing filename expansion (globbing). Instead, the pattern is passed literally to the command. E.g. ls -l 'I*' in the example below will try to find a file named I*, instead of any file starting with the letter I.
    ls -l 'I*'
    # ls: cannot access 'I*': No such file or directory
  One use case for adding quotes is if we e.g. want to store the pattern to match as a shell variable (e.g. in a shell script):
    # We store the pattern "I*" in a variable named "search_pattern".
    search_pattern="I*"
    echo "${search_pattern}"    # -> prints the pattern itself: I*
    # Later we can use our stored pattern to match files:
    ls -l ${search_pattern}     # -> lists all files starting with "I".
  Note that quoting matters when the variable is used, rather than when it is created: bash does not perform filename expansion on the value of a variable assignment, so search_pattern=I* also stores the pattern I*. Globbing takes place when the variable is expanded without quotes, which is why the two echo commands below give different results.
    search_pattern=I*
    echo ${search_pattern}      # Unquoted expansion: globbing occurs, prints "Indri_indri".
    echo "${search_pattern}"    # Quoted expansion: prints the pattern "I*".
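The order of operations described for brace expansion in this solution can be observed directly. The sketch below assumes bash (brace expansion is not available in every shell) and uses a scratch directory; the empty files are stand-ins named after species from the exercise.

```shell
# Brace expansion is performed by bash before filename expansion.
cd "$(mktemp -d)"     # Empty scratch directory: no pattern can match yet.

# With no files present, the globs stay literal, which reveals the two
# patterns produced by the brace expansion.
before=$(echo *[ao]??x_*{ra,i})

# Create sample (empty) files named after species from the exercise.
touch Pteralopex_pulchra Sorex_sclateri Gorilla_gorilla

# Now each of the two patterns is expanded to the matching file names.
after=$(echo *[ao]??x_*{ra,i})

echo "before: $before"
echo "after:  $after"
```

The first line printed shows the two glob patterns side by side; the second shows the file names they expand to once matching files exist.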
Objective: learn to use the mkdir, cp and mv commands.
Enter the directory exercise_3/ and perform the following tasks:
- Create directories with the mkdir command:
  - In the directory exercise_3/, create 2 new sub-directories: species_by_genus and species_by_common_name.
  - In species_by_genus/, create a new sub-directory named Dendrolagus (tree-kangaroos).
  - In species_by_common_name/, create a new sub-directory named B.
  Tip: to avoid having to rewrite a command, remember that you can use the up arrow of your keyboard to go back in your terminal history. This allows you to re-use a command that you wrote earlier, while making changes to it if needed.
- Copy files using the cp command:
  - From the directory exercise_2/RedList_mammals, make a copy of all files of the genus Dendrolagus into species_by_genus/Dendrolagus.
  - From the directory exercise_2/RedList_mammals, copy the file for the Black Rhinoceros - Diceros bicornis - to the directory species_by_common_name.
- Move and rename files with the mv command:
  - Enter the species_by_common_name directory.
  - In this directory, move the file Diceros_bicornis into subdirectory B.
  - Rename the Diceros_bicornis file you just moved to the common name of the species: Black_rhinoceros.
- Copy, rename and delete directories:
  - Change directory to the root of the exercise_3/ directory.
  - Copy the entire directory species_by_genus/Dendrolagus - with all its content - to the root of exercise_3.
  - Rename the directory to Tree-kangaroos.
  - Delete the directory Tree-kangaroos and its content in a safe way.
Exercise solution
- Create the directories species_by_genus and species_by_common_name.
    cd exercise_3
    # Option 1: create one directory after the other.
    mkdir species_by_genus
    mkdir species_by_common_name
    # Option 2: create both directories with a single command.
    mkdir species_by_genus species_by_common_name
    # Option 3: use brace expansion to avoid repeating the common part
    # of the directory names.
    mkdir species_by_{genus,common_name}
  Create a Dendrolagus sub-directory:
    # Option 1: create the sub-directory from the root of the exercise_3
    # directory.
    mkdir species_by_genus/Dendrolagus
    # Option 2: enter the species_by_genus directory, then create the
    # "Dendrolagus" sub-directory.
    cd species_by_genus/
    mkdir Dendrolagus
    cd ..
  Create a B sub-directory:
    mkdir species_by_common_name/B
  Note: using the -p option of mkdir, it is possible to create multiple levels of directories in a single command. For example, we could create all the directories for this exercise in a single command:
    mkdir -p species_by_{genus/Dendrolagus,common_name/B}
  Tip: if you want to preview the output of a brace expansion (or a filename expansion), you can run the command prefixed with echo: it will print the command that would be executed to the terminal, without running it.
    echo mkdir -p species_by_{genus/Dendrolagus,common_name/B}
- Copy files for the Dendrolagus genus.
    cp ../exercise_2/RedList_mammals/Dendrolagus_* species_by_genus/Dendrolagus/
  Copy the file for the Black Rhinoceros:
    cp ../exercise_2/RedList_mammals/Diceros_bicornis species_by_common_name/
- Move and rename the Black Rhinoceros file.
    cd species_by_common_name/
    mv Diceros_bicornis B/                      # Move the file into its subdirectory.
    mv B/Diceros_bicornis B/Black_rhinoceros    # Rename the file to the common name of the species.
- Copy, rename and delete a directory.
    cd ..                                       # Change directory to the root of `exercise_3`.
    cp -r species_by_genus/Dendrolagus .        # Copy the directory and its content.
    mv Dendrolagus Tree-kangaroos               # Rename the directory.
    # Instead of the two commands above, the copying and renaming of the
    # directory can also be done in a single command.
    cp -r species_by_genus/Dendrolagus Tree-kangaroos
  To delete the directory in a safe way, we first delete all files inside it, and then delete the empty directory with rmdir. Note that rmdir will not delete a directory if it is not empty - this is a safety behavior to avoid deleting a large number of files by mistake.
    rm Tree-kangaroos/*
    rmdir Tree-kangaroos
  Note: the faster way to delete the directory and all of its content is to use the command: rm -rf Tree-kangaroos.
  This recursively deletes the directory, so be careful to delete the correct directory: you can otherwise very quickly delete large amounts of data by mistake, which is problematic as there is no command to undo file deletion.
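The safety behavior of rmdir described above can be checked in a scratch directory. This is a small sketch under the assumption that mktemp is available; the two file names inside Tree-kangaroos are hypothetical placeholders.

```shell
cd "$(mktemp -d)"
mkdir Tree-kangaroos
# Hypothetical placeholder files, only used for this demonstration.
touch Tree-kangaroos/Dendrolagus_a Tree-kangaroos/Dendrolagus_b

# rmdir refuses to remove a directory that still contains files.
if rmdir Tree-kangaroos 2> /dev/null; then
    first_attempt="removed"
else
    first_attempt="refused"
fi

# Safe two-step deletion: remove the files, then the now-empty directory.
rm Tree-kangaroos/*
rmdir Tree-kangaroos

if [ -d Tree-kangaroos ]; then
    final_state="still there"
else
    final_state="deleted"
fi

echo "first rmdir attempt: $first_attempt"
echo "final state:         $final_state"
```

Since * does not match hidden files by default, rmdir would also refuse to delete the directory if it still contained files whose name starts with a dot.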
- At the root of exercise_3/, create a new directory named species_by_binomial_name and enter it.
- Inside this directory, create sub-directories named A, B, C, ... Z (i.e. one directory for each letter of the alphabet).
  To avoid doing this tedious work manually, you can use a for loop very similar to this example:
    for x in {A..Z}; do echo ${x}; done
  Try to run the above code in your shell (it will only print things to the screen without creating anything on disk). Then adapt the for loop (or the brace expansion) so that it creates the directories for A to Z.
  Note: in this specific case, a for loop is not even necessary. We can simply use brace expansion: mkdir {A..Z}.
- Using a similar for loop as above, copy all files from exercise_2/RedList_mammals into their correct subdirectory, i.e. the subdirectory that corresponds to the first letter of the genus name. For example: Marmota_vancouverensis should go into sub-directory M because the first letter of the genus name is M.
  Note that when running the for loop, you will get some warning messages, because the genuses present in RedList_mammals do not cover all letters of the alphabet. However, this is not a problem here because it does not prevent the for loop from running to the end.
Additional tasks solution
# Create and enter the new directory.
mkdir species_by_binomial_name
cd species_by_binomial_name/
# Create directories "A" to "Z" with a for loop.
for x in {A..Z}; do mkdir ${x}; done
# Copy species file names into the correct directory.
# Note that letters that do not have any matching genus will print a warning
# to the terminal, but this does not prevent the loop from completing.
for x in {A..Z}; do cp ../../exercise_2/RedList_mammals/${x}* ${x}/; done
ls -l ./*
Note: the task of creating all the directories could also be done using brace expansion, like so:
mkdir {A..Z}
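If the warnings printed for letters without any matching genus are unwanted, the loop can check whether the pattern matched anything before copying. The following is one possible sketch, assuming bash; it runs in a scratch directory where RedList_sample is a hypothetical stand-in for exercise_2/RedList_mammals, filled with three empty sample files.

```shell
# Scratch setup: a small stand-in for the RedList_mammals directory.
cd "$(mktemp -d)"
mkdir RedList_sample species_by_binomial_name
touch RedList_sample/Indri_indri \
      RedList_sample/Marmota_vancouverensis \
      RedList_sample/Diceros_bicornis

for x in {A..Z}; do
    mkdir "species_by_binomial_name/${x}"
    # Expand the pattern into the positional parameters. If nothing matches,
    # the pattern stays literal and "$1" is not an existing file.
    set -- RedList_sample/"${x}"*
    if [ -e "$1" ]; then
        cp "$@" "species_by_binomial_name/${x}/"
    fi
done

# Count the copied files and show the content of one sub-directory.
copied=$(find species_by_binomial_name -type f | wc -l)
echo "files copied: ${copied}"
ls species_by_binomial_name/M
```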
Objective: get familiar with shell commands that display text file content: head, tail, cat and less.
Enter the directory exercise_4/ and list its content. You should see that it contains a single file named protein_sequences.fasta.
Perform the following tasks on the protein_sequences.fasta file:
- Display the start/end of the file using the head and tail commands:
  - Display the first 10 lines of the file.
  - Display the last 5 lines of the file.
- Count the number of lines and words in the file using the wc command:
  - Count only the number of lines in the file.
  - Count only the number of words in the file.
- Display the content of the file using the cat command:
  - Why is this not the most suitable program here?
  - Can you name another usage of cat?
- Display, navigate and search the file with less:
  - Open the file using less.
  - Add line numbers to the display using the -N option.
  - Navigate the file using the space bar and arrows.
  - Search for the pattern isoform using the command /<search term>, then navigate through the matches with the keys n and N.
  - Close the file with q.
Exercise solution
- Display the first 10 and last 5 lines.
    cd exercise_4/
    head protein_sequences.fasta         # No need to specify "-n 10", as 10 is the default value.
    tail -n 5 protein_sequences.fasta
  Tips:
  - If you want to display the entire file except for the last X lines you can use head -n-X (replace X by the number of lines you want to skip at the end of the file).
  - Conversely, tail -n+X starts printing at line X, i.e. it skips the first X-1 lines and then prints all remaining lines until the end of the file.
- Count the number of lines and words in the file.
    wc -l protein_sequences.fasta    # 19222 lines.
    wc -w protein_sequences.fasta    # 51914 words.
- Display the content of the file with cat. As you can see, this is not an ideal solution for this file because it is large.
    cat protein_sequences.fasta
  One usage of cat is to concatenate 2 or more files together (this is where the command got its name from). cat concatenates files by pasting their content one after another.
  Here is an example:
    # Create 2 files to concatenate:
    head -n5 protein_sequences.fasta > file_1
    tail -n5 protein_sequences.fasta > file_2
    # Concatenate the 2 files into a new file named "file_3".
    cat file_1 file_2 > file_3
    cat file_[12] > file_3    # Same as above, but using filename globbing.
                              # (file_? would also match file_3 itself once it exists.)
  Bonus: we could also create file_3 of the example above without creating any intermediate file. This is done using a method called process substitution, which allows the output of a command to be treated as an input file. The syntax of process substitution is <( ).
  Example:
    cat <( head -n5 protein_sequences.fasta ) <( tail -n5 protein_sequences.fasta ) > file_3
  To concatenate multiple files by columns, use the paste command.
- Display the content of the file with less. Remember that to exit less, you must press the q key on your keyboard.
    less protein_sequences.fasta
    less -N protein_sequences.fasta    # Line numbers can also be added/removed
                                       # after a file was opened with "-N" + "enter".
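The difference between concatenating by rows (cat) and by columns (paste), mentioned in the solution above, is easiest to see on two tiny files. This is an illustrative sketch in a scratch directory; the file names and contents are made up.

```shell
cd "$(mktemp -d)"
printf 'A1\nA2\n' > file_1
printf 'B1\nB2\n' > file_2

# cat stacks the files one after the other (row-wise concatenation).
cat file_1 file_2 > by_rows

# paste joins the files line by line, separated by a tab (column-wise).
paste file_1 file_2 > by_columns

echo "--- by_rows ---"
cat by_rows
echo "--- by_columns ---"
cat by_columns
```

The by_rows file has 4 lines with one value each, while by_columns has 2 lines with two tab-separated values each.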
Display only line 100 of the file protein_sequences.fasta by using a combination of head and tail.
For this you will need to use the | (pipe) operator, which redirects the output of one command into another command.
Additional tasks solution
head and tail can be combined to display any section of a file. Here we print line 100 of the file:
head -n100 protein_sequences.fasta | tail -n1 # Print the 100th line.
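To check that such a pipeline selects the intended lines, it helps to run it on a file where each line contains its own line number. The sketch below assumes the seq and xargs utilities are available (they are on most Linux and macOS systems); numbers.txt is a made-up file name, and the sed command is shown only as an alternative for comparison.

```shell
cd "$(mktemp -d)"
seq 1 200 > numbers.txt        # Line N of this file contains the number N.

# Line 100 only: keep the first 100 lines, then the last one of those.
line_100=$(head -n 100 numbers.txt | tail -n 1)

# Same result with sed, printing only line 100.
same_line=$(sed -n '100p' numbers.txt)

# Lines 98 to 100: keep the first 100 lines, then the last 3 of those.
lines_98_to_100=$(head -n 100 numbers.txt | tail -n 3 | xargs)

# "tail -n +198" starts printing AT line 198 (it skips the first 197 lines).
from_198=$(tail -n +198 numbers.txt | xargs)

echo "line 100:        $line_100"
echo "line 100 (sed):  $same_line"
echo "lines 98-100:    $lines_98_to_100"
echo "from line 198:   $from_198"
```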
In this exercise, we will work with a copy of the file exercise_4/protein_sequences.fasta. This file is a so-called FASTA file. FASTA is a text-based format to represent nucleotide or protein sequences.
- FASTA files can contain one or more sequences.
- Each new sequence starts with a sequence header line, which starts with the character >. A sequence header is always on a single line.
- Each sequence header is followed by one or more lines that contain the nucleotide or amino acid sequence itself.
Here is an example of a section of a FASTA file:
>sp|P18823|ACCD_PEA Carboxyl transferase OS=Pisum sativum GN=accD PE=1 SV=3
MINEDPSSLTDMDNNIDSWKNNSENSSYSHADSLADVSNIDNLLSDKIFSIRDSNSNIYD
IYYAYDTNDTNITKYKWTNNINRCIESYLRSQICEDIDFNSDICDKVQRTIIILIRSTND
NDISDTNDISDTNDTNDTNAIYDPFDISDTNDTN
>sp|P09339|ACON_BACSU Aconitate hydratase OS=Bacillus subtilis GN=citB PE=1 SV=4
MANEQKTAAKDVFQARKTFTTNGKTYHYYSLKALEDSGIGKVSKLPYSIKVLLESVLRQV
DGFVIKKEHVENLAKWGTAELKDIDVPFKPSRVILQDFTGVPAVVDL
Enter the directory exercise_5 and make a copy of the file exercise_4/protein_sequences.fasta in that directory. Name the copy of the file sequences.fasta.
If you are on Linux or macOS, you may also create a symlink instead of copying the file:
ln -s ../exercise_4/protein_sequences.fasta sequences.fasta
- A symlink creates a pointer to a file, without making an actual copy of it.
- Symlinks are not supported on Windows (except if using WSL and working on a non-windows partition).
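The fact that a symlink is a pointer rather than a copy can be verified with a small experiment. This sketch recreates the exercise_4/exercise_5 layout inside a scratch directory, with a tiny made-up FASTA file standing in for the real one.

```shell
cd "$(mktemp -d)"
mkdir exercise_4 exercise_5
# Tiny stand-in for the real file (made-up content).
printf '>seq_1 demo\nMINEDPSS\n' > exercise_4/protein_sequences.fasta

cd exercise_5
ln -s ../exercise_4/protein_sequences.fasta sequences.fasta

ls -l sequences.fasta    # The listing shows where the link points to.
if [ -L sequences.fasta ]; then is_link="yes"; else is_link="no"; fi

# Reading the link reads the original file.
first_line=$(head -n 1 sequences.fasta)

# Changes to the original are visible through the link: nothing was copied.
printf '>seq_2 demo\nMANEQKTA\n' >> ../exercise_4/protein_sequences.fasta
header_count=$(grep -c '^>' sequences.fasta)

echo "is a symlink: $is_link"
echo "first line:   $first_line"
echo "headers seen through the link: $header_count"
```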
Have a look at the sequences.fasta file - e.g. using the less command - then answer the following questions using the grep command:
- How many sequences are there in the file? Hint: count the number of header lines in the file.
- How many entries are from Staphylococcus?
- Which header lines are not from Staphylococcus? Display them.
Here is a reminder of some of the grep options:
- -i: case insensitive search.
- -c: suppress normal output; instead print the count of matching lines.
- -o: print only matching content, not the entire line.
- -n: add the line number in front of printed output.
- -v: inverted search - print lines that do not match the pattern.
Part A solution
cd exercise_5/
cp ../exercise_4/protein_sequences.fasta sequences.fasta
# Count the number of sequences in the file:
grep -c "^>" sequences.fasta # -> 3325 sequences.
# Count the number of sequences from Staphylococcus
# Note the use of the `-i` option of `grep` ("case insensitive search").
grep -ci "os=staphylococcus " sequences.fasta # 141 sequences.
# Display the header lines of sequences that are not from Staphylococcus.
grep "^>" sequences.fasta | grep -vi "os=staphylococcus "
grep "^>" sequences.fasta | grep -vi "os=staphylococcus " | wc -l # The sequence count is 3184.
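To see what each grep option does without scrolling through thousands of lines, the same commands can be run on a miniature FASTA file. In the sketch below, the first two headers are taken from the example shown earlier (with shortened sequences), while the third entry is invented for the demonstration.

```shell
cd "$(mktemp -d)"
# Miniature FASTA file: the last entry is invented for this demonstration.
printf '%s\n' \
    '>sp|P18823|ACCD_PEA Carboxyl transferase OS=Pisum sativum GN=accD PE=1 SV=3' \
    'MINEDPSSLTDMDNNIDSWKNNSENSSYSHADSLADVSNIDNLLSDKIFSIRDSNSNIYD' \
    '>sp|P09339|ACON_BACSU Aconitate hydratase OS=Bacillus subtilis GN=citB PE=1 SV=4' \
    'MANEQKTAAKDVFQARKTFTTNGKTYHYYSLKALEDSGIGKVSKLPYSIKVLLESVLRQV' \
    '>demo|X00000|DEMO_STAAU Invented entry OS=Staphylococcus aureus GN=demo PE=1 SV=1' \
    'MKKLLVAGSTAS' > mini.fasta

n_headers=$(grep -c "^>" mini.fasta)                   # -c: count matching lines.
n_staph=$(grep -ci "os=staphylococcus " mini.fasta)    # -i: ignore case.
n_other=$(grep "^>" mini.fasta | grep -vic "os=staphylococcus ")  # -v: invert.

grep -n "^>" mini.fasta             # -n: prefix each match with its line number.
grep -o "OS=[A-Za-z]*" mini.fasta   # -o: print only the matching part.

echo "headers: $n_headers, Staphylococcus: $n_staph, others: $n_other"
```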
In the second part of this exercise, your task is to display the 10 most frequent genuses in the sequences of the sequences.fasta file, along with their frequency (i.e. the number of sequences for each of the 10 most frequent genuses in the file).
Here is a suggested way to perform this task:
- Isolate the header lines.
- Isolate the genus name from each line. To do this, you can take advantage of the controlled vocabulary in the file: the organism's name is always prefixed with OS=.
- Sort the genus names, compute their frequency and keep only a single instance of each genus name.
- Sort the genuses by frequency and keep only the 10 most frequent.
Hints:
- The steps above are best done as part of a pipeline: use the | (pipe) operator to pipe the output of one command into the next.
- When building the pipeline and doing tests, you can end your pipeline with | head so that you avoid printing the whole file each time.
Additional hints:
click to show more hints, if needed
Here are some commands and their options that are useful for this exercise:
- uniq -c: the -c/--count option prefixes each line with the number of occurrences.
- sort -nr: -n/--numeric-sort sorts numerically instead of alphabetically. -r/--reverse sorts in decreasing order.
- grep -o: the -o/--only-matching option returns only the matching part of a line instead of the entire line (the default grep behavior).
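One detail worth knowing when combining these commands is that uniq only merges identical lines that are adjacent, which is why the input is sorted first. The following sketch shows the difference on a short made-up list of names.

```shell
# uniq only merges duplicate lines that are next to each other.
unsorted_groups=$(printf 'Mus\nHomo\nMus\nHomo\nMus\n' | uniq -c | wc -l)
sorted_groups=$(printf 'Mus\nHomo\nMus\nHomo\nMus\n' | sort | uniq -c | wc -l)

# Full counting pattern: sort, count, then rank by decreasing count.
top=$(printf 'Mus\nHomo\nMus\nHomo\nMus\n' \
      | sort | uniq -c | sort -nr | head -n 1 | awk '{print $2, $1}')

echo "groups without sorting: $unsorted_groups"
echo "groups with sorting:    $sorted_groups"
echo "most frequent:          $top"
```

Without sorting, each of the 5 lines forms its own group; after sorting, only 2 groups remain.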
Part B solution
There are multiple ways to perform this task, here are a few possibilities.
Note: some pipelines make use of the grep option -o, which instructs grep to only output the actual matching pattern instead of the entire line on which the match is found.
grep "^>" sequences.fasta | cut -f2 --delim="=" | cut -f1 --delim=" " | sort | uniq -c | sort -nr | head
grep -o "OS=[a-zA-Z]*" sequences.fasta | cut -f2 --delim="=" | sort | uniq -c | sort -nr | head
# Same as above, but using the "[[:alpha:]]" syntax to indicate we only want
# to match alphabetic letters and not e.g. spaces (or numbers).
grep -o "OS=[[:alpha:]]*" sequences.fasta | cut -f2 --delim="=" | sort | uniq -c | sort -nr | head
# Output of the pipe: the 10 most frequent genus and their frequency in the file.
168 Arabidopsis
166 Escherichia
163 Bacillus
152 Homo
141 Staphylococcus
134 Mus
111 Oryza
84 Salmonella
83 Rattus
72 Mycobacterium
Here is another solution that makes use of a more complicated regular expression to directly isolate the genus name. For this we must:
- Use grep with "Perl"-style regular expressions by adding the -P option.
- Use a lookbehind match: (?<=OS=) requires the match to be immediately preceded by the pattern OS=, without including OS= in the output.
grep -oP "(?<=OS=)[a-zA-Z]+ " sequences.fasta | sort | uniq -c | sort -nr | head
Regular expressions are a powerful tool to do sophisticated pattern matching. However they are beyond the scope of this course.
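The -P option is a GNU grep extension and may not be available on every system. As a possible alternative (not part of the original solutions), the genus name can also be isolated with a sed substitution. The sketch below runs on a miniature FASTA file in which the first two headers come from the example shown earlier and the third entry is invented.

```shell
cd "$(mktemp -d)"
# Miniature FASTA file: the last entry is invented for this demonstration.
printf '%s\n' \
    '>sp|P18823|ACCD_PEA Carboxyl transferase OS=Pisum sativum GN=accD PE=1 SV=3' \
    'MINEDPSS' \
    '>sp|P09339|ACON_BACSU Aconitate hydratase OS=Bacillus subtilis GN=citB PE=1 SV=4' \
    'MANEQKTA' \
    '>demo|X00000|DEMO_BACCE Invented entry OS=Bacillus cereus GN=demo PE=1 SV=1' \
    'MKKLLVAG' > mini.fasta

# On header lines, replace the whole line by the word that follows "OS=",
# then count and rank the extracted genus names.
ranking=$(sed -n 's/^>.*OS=\([A-Za-z]*\).*/\1/p' mini.fasta \
          | sort | uniq -c | sort -nr)

top_genus=$(echo "$ranking" | head -n 1 | awk '{print $2}')
top_count=$(echo "$ranking" | head -n 1 | awk '{print $1}')

echo "$ranking"
echo "most frequent genus: $top_genus ($top_count sequences)"
```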
This is not an easy one, but it's the last!
Our objective is to write a short for loop that performs the task of copying each species file found in the exercise_2/RedList_mammals directory into the correct directory for its genus in a species_by_genus directory.
So basically, instead of only doing it manually for 2 genuses as we did in exercise 3, we want to have it done automatically for all genuses.
Hints: this task is more difficult and uses a few concepts that were not presented in the course, such as:
- for loops, which repeat a number of instructions multiple times while iterating over a range of values. In our case, we want to iterate over the list of genuses.
- Variables: in bash, variables can be:
  - Created using variable_name=value.
  - Accessed using ${variable_name}.
Here is a scaffold of one possible solution to get you started.
# Enter the "exercise_7" directory and create a new "species_by_genus"
# directory.
cd exercise_7/
mkdir species_by_genus
# Save the "RedList_mammals" directory location in a variable, so it will be
# easy to access later.
red_list_dir=../exercise_2/RedList_mammals
# Loop through all genus values and copy the files for each in the correct
# sub-directory of "species_by_genus".
for genus in $( <pipeline that returns the list of genus> ); do
mkdir species_by_genus/${genus} # Create directory for genus.
cp ${red_list_dir}/... ... # Copy files for genus.
done
# List all the copied files to see if the result is correct.
ls species_by_genus/*
What you have left to do in the code above is to:
- Replace <pipeline that returns the list of genus> with a series of commands that will produce the list of unique genuses present in RedList_mammals.
- Replace cp ... with the proper command to copy all files for a given genus.
Additional tasks solution
# Enter the "exercise_7" directory and create a new "species_by_genus"
# directory.
cd exercise_7/
mkdir species_by_genus
# Save the "RedList_mammals" directory location in a variable, so it will be
# easy to access later.
red_list_dir=../exercise_2/RedList_mammals
# Loop through all genus values and copy the files for each in the correct
# sub-directory of "species_by_genus".
for genus in $( ls ${red_list_dir} | cut -f1 --delim="_" | sort | uniq ); do
mkdir species_by_genus/${genus} # Create dir for genus.
cp ${red_list_dir}/${genus}_* species_by_genus/${genus} # Copy files for genus.
done
# List all the copied files to see if the result is correct.
ls species_by_genus/*
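As a variation on the solution above, the genus can also be derived from each file name with shell parameter expansion, which avoids parsing the output of ls. This is only a sketch of an alternative approach, not the course's reference solution; it runs in a scratch directory where RedList_sample is a hypothetical stand-in for RedList_mammals.

```shell
# Scratch setup: a small stand-in for the RedList_mammals directory.
cd "$(mktemp -d)"
mkdir RedList_sample species_by_genus
touch RedList_sample/Diceros_bicornis \
      RedList_sample/Nomascus_concolor \
      RedList_sample/Nomascus_siki

red_list_dir=RedList_sample

for file in "${red_list_dir}"/*_*; do
    file_name=${file##*/}     # Strip the directory part of the path.
    genus=${file_name%%_*}    # Keep what precedes the first "_".
    # -p: no error if the genus directory was already created earlier.
    mkdir -p "species_by_genus/${genus}"
    cp "${file}" "species_by_genus/${genus}/"
done

# List all the copied files to see if the result is correct.
ls species_by_genus/*
```

Here the loop iterates over files rather than over genuses, so mkdir -p is used to tolerate a genus directory that already exists.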