
slides: fix a few typos and other minor changes

# Shell scripting course - exam questions
Please follow these instructions to submit your answers:
* Answer all questions in **a single text file**. The scripts you will be asked to write are
short, so they can be included in the same file too.
* **Number your answers according to the questions**, e.g.:
> Question 2
>
> Your answer here...
* **Name your file** using the pattern **`<LAST NAME>_<First_name>_exam`**. For instance
`SMITH_Alice_exam.md` or `SPONGE_Bob_exam.txt`.
* Submit your answers by email to `robin.engler@sib.swiss` and `thomas.junier@sib.swiss`.
*Note:* all files needed for the exam questions are found in the **`exam/`** directory - the same
directory that also contains the present file.
<br>
## Question 1 - [1 point]
Consider the following variable declarations:
```bash
# Case 1
name="Dendroaspis"
# Case 2
name="Dendroaspis angusticeps"
```
**Questions:**
* In which (if any) of the above cases are the quotes really needed?
* What happens if the quotes are omitted?
<br>
## Question 2 - [1 point]
We would like to surround direct speech with double quotes, as in:
> I said, "This will never work."
However, when we type the following, it doesn't work (the quotes are missing):
```bash
$ echo I said, "This will never work"
```
**Task:** give two different ways to fix this problem.
<br>
## Question 3 [3 points]
The file `sequences_mammalia.fasta` contains genomic sequences for different mammal species,
amongst which are [*Vulpes vulpes*](https://en.wikipedia.org/wiki/Red_fox) - the red fox,
[*Vulpes lagopus*](https://en.wikipedia.org/wiki/Arctic_fox) - the Arctic fox, and
[*Vulpes pallida*](https://en.wikipedia.org/wiki/Pale_fox) - the Pale fox.
**Reminder:** each sequence in a Fasta file starts with a **header line** that begins with
a **`>`** character.
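
As a small illustration of this reminder, the sketch below builds a hypothetical Fasta file (`demo.fasta`, with invented headers and sequences that are not part of the exam material) and counts its header lines. It assumes that each header contains the species name, which may not match the exact header layout of the exam files.

```shell
# Create a tiny, made-up Fasta file (hypothetical content, for illustration only).
demo_file="demo.fasta"
printf '%s\n' \
    '>seq1 Canis lupus' \
    'ACGTACGT' \
    'TTGACA' \
    '>seq2 Felis catus' \
    'GGCATT' \
    '>seq3 Canis lupus' \
    'ACGTTT' > "$demo_file"

# Each sequence has exactly one header line, so counting the lines
# that start with '>' counts the sequences.
total_count=$(grep -c '^>' "$demo_file")
echo "Total sequences: $total_count"

# Count only the header lines that mention a given species.
wolf_count=$(grep '^>' "$demo_file" | grep -c 'Canis lupus')
echo "Canis lupus sequences: $wolf_count"

# Remove the demo file again.
rm "$demo_file"
```

Note that one sequence may span several lines (as `seq1` does here), which is why the header lines are counted rather than all lines.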
**Tasks:**
* Write a Bash command that stores the number of sequences found in the file for the red fox
in a variable named `red_fox_sq_count`.
* Write a **`for`** loop that prints the name of the species and the number of sequences found
in the file for the Red, Arctic and Pale fox. Your output on the terminal should look
something like this:
```
<species 1> sequence count: x
<species 2> sequence count: y
<species 3> sequence count: z
```
<br>
## Question 4 [4 points]
Using the same `sequences_mammalia.fasta` file as in the previous exercise, write a script that
**prints the number of sequences per species** to a tab-delimited text file named
`seq_count_per_species.txt`.
The output file should have a header line (column names), followed by data for each species
(name and sequence count), like so:
```
species count
Balaenoptera musculus 2
Hippopotamus amphibius 3
Vulpes lagopus 3
...
...
```
**Important:**
* The names of the species should be retrieved programmatically, not manually.
* The names of the input and output files should each appear only once in the script
(i.e. DRY principle - don't repeat yourself), so they are easy to change.
**Hints:** here is a suggestion of how your script could proceed:
1. Extract the list of unique species names from the `sequences_mammalia.fasta` input file and
store them in a temporary file named `tmp.txt`.
2. Loop over the species stored in `tmp.txt`, compute the sequence count for each of them and
add it to the output file.
3. Delete the temporary `tmp.txt` file.
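
The three steps above follow a common temporary-file pattern. The sketch below applies that pattern to an unrelated toy input (`fruits.txt`, invented for illustration) rather than to the exam data. All file and variable names are hypothetical, and each file name is stored once in a variable so that it is easy to change.

```shell
#!/usr/bin/env bash
# File names are defined once (all names here are hypothetical).
input_file="fruits.txt"
output_file="fruit_counts.txt"
tmp_file="tmp_fruits.txt"

# Toy input: one item per line, with repeats.
printf '%s\n' apple pear apple fig pear apple > "$input_file"

# Step 1: store the unique values in a temporary file.
sort -u "$input_file" > "$tmp_file"

# Step 2: write a header line, then loop over the unique values
# and append one tab-delimited "item<TAB>count" line per value.
printf 'item\tcount\n' > "$output_file"
while read -r item; do
    # -x matches whole lines only, -F treats the item as a fixed string.
    count=$(grep -c -x -F -- "$item" "$input_file")
    printf '%s\t%s\n' "$item" "$count" >> "$output_file"
done < "$tmp_file"

# Step 3: delete the temporary file.
rm "$tmp_file"

cat "$output_file"
```

The header is written with `>` (which creates or overwrites the file) and the data lines with `>>` (which appends), so running the script twice does not accumulate duplicate lines.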
<br>
## Question 5 [4 points]
In Fasta sequences with nucleotides, the **`N`** character is used to indicate
**unidentified nucleotides** (i.e. nucleotides for which the reading from the sequencer was
of too poor quality to be assigned a specific nucleotide value).
**Task:**
* Write a program named `detect-Ns.sh` that takes the _output_ of `exam/fasta2tsv-exam.sh` as input
(NOT an original Fasta file!) and **keeps only those sequences** in which the sequence field has
**at least one `N` character**. This is essentially a detector of poor-quality sequences.
* Test your script on the file `exam/test_sequences.fasta`. It should produce the following
result:
```bash
$ ./src/fasta2tsv-exam.sh < ./data/Q4_test.fasta | ./detect-Ns.sh
Test_sequence_1 (has 1 N); TGGCCTTAGATGACGCGTTGGGTGNCGGCGCCTGAAAGTTCAGGTAAAACGACCGTGGCA
Test_sequence_3 (has 4 N); AGGGGCGATTATGCNNATGGGTGACGCTGCCCTGAAAGTTCANGTAAAACGACCGTGGCN
Test_sequence_5 (has 5 N); AGGGGCGATTATGCNNATGGGTGACGCTGCNNTGAAAGTTCAGGTAAAACGACCGTGGCN
```
**Notes and Hints:**
* The `exam/fasta2tsv-exam.sh` file is a script similar to the one we developed together in the
course: it takes a Fasta file as input and converts it to a tab-delimited file
in which each line corresponds to one Fasta sequence (header in the first field and sequence in
the second field).
* To set the IFS to TAB, you can use ANSI-C quoting, e.g. `IFS=$'\t'`. See the
code in `fasta2tsv-exam.sh` for an example.
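
As a sketch of the IFS hint, the loop below reads two made-up tab-delimited records (invented for illustration, not the actual output of `fasta2tsv-exam.sh`) and splits each line into a header field and a sequence field. The file name `demo.tsv` and the variable names are assumptions.

```shell
#!/usr/bin/env bash
# Two made-up tab-delimited records: header <TAB> sequence (hypothetical data).
demo_tsv="demo.tsv"
printf 'seq A\tACGT\nseq B\tGGNCC\n' > "$demo_tsv"

record_count=0
last_header=""
# Setting IFS for the read command only makes it split each line on tabs,
# so the spaces inside the header field are preserved.
while IFS=$'\t' read -r header sequence; do
    echo "header=[$header] sequence length=${#sequence}"
    record_count=$((record_count + 1))
    last_header=$header
done < "$demo_tsv"

# Remove the demo file again.
rm "$demo_tsv"
```

Because `IFS` is set on the same line as `read`, the change applies only to that command and does not affect the rest of the script.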
<br>
## Question 6 [3 points]
Write a program that converts the TSV (tab-separated values) produced by `fasta2tsv-exam.sh`
back into Fasta format. Call it **`tsv2fasta.sh`**.
**Hint:**
* Fasta does not require sequences to be on multiple lines (it just allows it). It's therefore
OK if the nucleotide sequence is on a single line (the header, on the other hand,
**must be on a line of its own _and_ start with a `>`**).
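
To illustrate why both layouts are acceptable, the sketch below takes a made-up record whose sequence is split over two lines and joins it into the single-line form. The record content and variable names are hypothetical, and the `awk` one-liner is only one possible way to do the joining.

```shell
# A made-up Fasta record whose sequence is split over two lines.
multi_line=$(printf '>seq1 demo\nACGT\nTTGA\n')

# Print header lines as they are, and glue all sequence lines of a record together.
single_line=$(printf '%s\n' "$multi_line" | awk '
    /^>/ { if (NR > 1) printf "\n"; print; next }   # header: end the previous sequence, print the header
         { printf "%s", $0 }                         # sequence line: append without a newline
    END  { printf "\n" }                             # end the last sequence
')

echo "$single_line"
```

Both forms describe the same record: the header stays on its own line and the nucleotides are unchanged, only the line breaks inside the sequence differ.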
<br>
## Question 7 [2 points]
Show how `fasta2tsv-exam.sh` can be combined with `detect-Ns.sh` and `tsv2fasta.sh` to filter
an input Fasta file and **keep only the poor-quality sequences** (those containing at least one
`N`).
**Tasks:**
* Write an expression/command that filters the file `sequences_mammalia.fasta` by keeping only
the poor-quality sequences.
* Write a second expression/command that additionally filters the `sequences_mammalia.fasta` file
to only keep the poor-quality sequences that belong to
[*Vultur gryphus*](https://en.wikipedia.org/wiki/Andean_condor) - the Andean condor.
**Hint:**
* You don't need to write a full script for this exercise: an expression on the command
line should be enough.
<br>