Robin Engler · Robin Engler · 22c69e7a · 0bd2dcb4 · 95a0a7ed · fab49928
--- a/exam/exam_questions.md 0 → 100644

+++ b/exam/exam_questions.md 0 → 100644

+# Shell scripting course - exam questions
+
+Please follow these instructions to submit your answers:
+* Answer all questions in **a single text file**. The scripts you will be asked to write are
+  short, so they can be included in the same file too.
+* **Number your answers according to the questions**, e.g.:
+  > Question 2
+  >
+  > Your answer here...
+* **Name your file** using the pattern **`<LAST NAME>_<First_name>_exam`**. For instance
+  `SMITH_Alice_exam.md` or `SPONGE_Bob_exam.txt`.
+* Submit your answers by email to `robin.engler@sib.swiss` and `thomas.junier@sib.swiss`.
+
+*Note:* all files needed for the exam questions are found in the **`exam/`** directory - the same
+directory that also contains the present file.
+
+
+<br>
+
+## Question 1 - [1 point]
+Consider the following variable declarations:
+
+```bash
+# Case 1
+name="Dendroaspis"
+
+# Case 2
+name="Dendroaspis angusticeps"
+```
+
+**Questions:**
+* In which (if any) of the above cases are the quotes really needed?
+* What happens if the quotes are omitted?
+
+
+<br>
+
+## Question 2 - [1 point]
+We would like to surround direct speech with double quotes, as in:
+
+> I said, "This will never work."
+
+However, typing the following doesn't work (the quotes are missing from the output):
+
+```bash
+$ echo I said, "This will never work"
+```
+
+**Task:** give two ways to fix this problem.
+
+
+<br>
+
+## Question 3 - [3 points]
+The file `sequences_mammalia.fasta` contains genomic sequences for different mammal species,
+amongst which are [*Vulpes vulpes*](https://en.wikipedia.org/wiki/Red_fox) - the red fox,
+[*Vulpes lagopus*](https://en.wikipedia.org/wiki/Arctic_fox) - the Arctic fox, and
+[*Vulpes pallida*](https://en.wikipedia.org/wiki/Pale_fox) - the pale fox.
+
+**Reminder:** each sequence in a Fasta file begins with a **header line** that starts with
+a **`>`** character.
+
+**Tasks:**
+* Write a Bash command that stores the number of sequences found in the file for the red fox
+  in a variable named `red_fox_sq_count`.
+* Write a **`for`** loop that prints the name of the species and the number of sequences found
+  in the file for the red, Arctic and pale fox. Your output on the terminal should look
+  something like this:
+  ```
+  <species 1> sequence count: x
+  <species 2> sequence count: y
+  <species 3> sequence count: z
+  ```
+
+
+<br>
+
+## Question 4 - [4 points]
+Using the same `sequences_mammalia.fasta` file as in the previous exercise, write a script that
+**prints the number of sequences per species** to a tab-delimited text file named
+`seq_count_per_species.txt`.  
+The output file should have a header line (column names), followed by data for each species
+(name and sequence count), like so:
+
+```
+species    count
+Balaenoptera musculus   2
+Hippopotamus amphibius  3
+Vulpes lagopus  3
+...
+...
+```
+
+**Important:**
+* The names of the species should be retrieved programmatically, not manually.
+* The names of the input and output files should each appear only once in the script
+  (i.e. DRY principle - don't repeat yourself), so they are easy to change.
+
+**Hints:** here is a suggestion for how your script could proceed:
+1. Extract the list of unique species names from the `sequences_mammalia.fasta` input file and
+   store them in a temporary file named `tmp.txt`.
+2. Loop over the species stored in `tmp.txt`, compute the sequence count for each of them and
+   add it to the output file.
+3. Delete the temporary `tmp.txt` file.
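The three steps above can be sketched on made-up data that is deliberately unrelated to the exam input (a list of colour names). This is only an illustration of the extract / loop / clean-up pattern, not a reference solution. The function name `count_per_value` and the demo data are hypothetical, and the sketch uses `mktemp` rather than a fixed `tmp.txt` name, which is a choice of this example and not a requirement of the exam.

```bash
#!/usr/bin/env bash
# Generic sketch: count how often each distinct line occurs in a file.
# count_per_value is a hypothetical helper name used only for this example.
count_per_value() {
    local input_file=$1 tmp_file value
    tmp_file=$(mktemp)                     # assumption: mktemp instead of a fixed file name
    sort -u "$input_file" > "$tmp_file"    # step 1: store the unique values
    while IFS= read -r value; do           # step 2: loop over them and count
        # -x matches whole lines only, -F treats the value as a fixed string
        printf '%s\t%d\n' "$value" "$(grep -c -x -F -- "$value" "$input_file")"
    done < "$tmp_file"
    rm -f "$tmp_file"                      # step 3: delete the temporary file
}

# Demo on made-up data.
demo_input=$(mktemp)
printf '%s\n' red blue red green blue red > "$demo_input"
count_per_value "$demo_input"
rm -f "$demo_input"
```

Because the input file name is passed once as an argument and stored in a single variable, it appears in only one place, which is the same idea as the DRY requirement above.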
+
+
+<br>
+
+## Question 5 - [4 points]
+In Fasta sequences of nucleotides, the **`N`** character is used to indicate
+**unidentified nucleotides** (i.e. nucleotides for which the read from the sequencer was
+of too poor quality to be assigned a specific nucleotide value).
+
+**Task:**
+* Write a program named `detect-Ns.sh` that takes the _output_ of `exam/fasta2tsv-exam.sh` as input
+  (NOT an original Fasta file!) and **keeps only those sequences** in which the sequence field has
+  **at least one `N` character**. This is essentially a detector of poor-quality sequences.
+* Test your script on the file `exam/test_sequences.fasta`. It should produce the following
+  result:
+  ```bash
+  $ ./src/fasta2tsv-exam.sh < ./data/Q4_test.fasta | ./detect-Ns.sh
+  Test_sequence_1 (has 1 N);	TGGCCTTAGATGACGCGTTGGGTGNCGGCGCCTGAAAGTTCAGGTAAAACGACCGTGGCA
+  Test_sequence_3 (has 4 N);	AGGGGCGATTATGCNNATGGGTGACGCTGCCCTGAAAGTTCANGTAAAACGACCGTGGCN
+  Test_sequence_5 (has 5 N);	AGGGGCGATTATGCNNATGGGTGACGCTGCNNTGAAAGTTCAGGTAAAACGACCGTGGCN
+  ```
+
+**Notes and Hints:**
+* The `exam/fasta2tsv-exam.sh` file is a script similar to what we developed together in the
+  course: it takes a Fasta file as input and converts it to a tab-delimited file in which each
+  line corresponds to one Fasta sequence (header in the first field, sequence in the second
+  field).
+* To set the IFS to TAB, you can use ANSI-C quoting, e.g. `IFS=$'\t'`. See the
+  code in `fasta2tsv-exam.sh` for an example.
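As a minimal sketch of the `IFS=$'\t'` hint, the loop below splits each input line into two tab-separated fields. The function name `print_fields` and the two sample records are made up for this illustration; the loop only prints the fields and does no filtering.

```bash
#!/usr/bin/env bash
# Read two tab-separated fields per line and report them.
# print_fields is a hypothetical name; the records further down are invented.
print_fields() {
    local header sequence
    # Setting IFS only for the read command keeps spaces inside the header intact.
    while IFS=$'\t' read -r header sequence; do
        printf 'header=[%s] sequence_length=%d\n' "$header" "${#sequence}"
    done
}

printf 'seq_A\tACGT\nseq_B (made up)\tACGTNN\n' | print_fields
# prints:
#   header=[seq_A] sequence_length=4
#   header=[seq_B (made up)] sequence_length=6
```

With the default IFS, the second header would be split at its spaces, which is why the tab-only setting matters here.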
+
+
+<br>
+
+## Question 6 - [3 points]
+Write a program that converts the TSV (tab-separated values) produced by `fasta2tsv-exam.sh`
+back into Fasta format. Call it **`tsv2fasta.sh`**.
+
+**Hint:**
+* Fasta does not require sequences to be on multiple lines (it just allows it). It's therefore
+  OK if the nucleotide sequence is on a single line (the header, on the other hand,
+  **must be on a line of its own _and_ start with a `>`**).
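To illustrate the point about line layout, the sketch below writes the same made-up record twice, once with the sequence wrapped over two lines and once on a single line, and checks that both carry the same sequence. The record and the helper name `bare_sequence` are assumptions of this example.

```bash
#!/usr/bin/env bash
# The same invented record in two layouts: wrapped sequence vs single-line sequence.
wrapped=$(printf '%s\n' '>demo_seq hypothetical' 'ACGTAC' 'GTTGCA')
single=$(printf '%s\n' '>demo_seq hypothetical' 'ACGTACGTTGCA')

# Drop header lines and join the remaining lines to recover the bare sequence.
bare_sequence() {
    grep -v '^>' | tr -d '\n'
}

if [ "$(printf '%s\n' "$wrapped" | bare_sequence)" = "$(printf '%s\n' "$single" | bare_sequence)" ]; then
    echo "same sequence"
fi
```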
+
+
+<br>
+
+## Question 7 - [2 points]
+Show how `fasta2tsv-exam.sh` can be combined with `detect-Ns.sh` and `tsv2fasta.sh` to filter
+an input Fasta file and **keep only the poor-quality sequences** (those containing at least one
+`N`).
+
+**Tasks:**
+* Write an expression/command that filters the file `sequences_mammalia.fasta` by keeping only
+  the poor-quality sequences.
+* Write a second expression/command that additionally filters the `sequences_mammalia.fasta` file
+  to only keep the poor-quality sequences that belong to
+  [*Vultur gryphus*](https://en.wikipedia.org/wiki/Andean_condor) - the Andean condor.
+
+**Hint:**
+* You don't need to write a full script for this exercise; an expression on the command
+  line should be enough.
+
+
+<br>