BINF90002 Semester 1 2020 Final exam

Academic Integrity Declaration

By commencing and/or submitting this assessment I agree that I have read and understood the University’s policy on academic integrity. I also agree that:

1. Unless paragraph 2 applies, the work I submit will be original and solely my own work (cheating);
2. I will not seek or receive any assistance from any other person (collusion) except where the work is for a designated collaborative task, in which case the individual contributions will be indicated; and,
3. I will not use any sources without proper acknowledgment or referencing (plagiarism).
4. Where the work I submit is a computer program or code, I will ensure that:
   a. any code I have copied is clearly noted by identifying the source of that code at the start of the program or in a header file or, that comments inline identify the start and end of the copied code; and
   b. any modifications to code sourced from elsewhere will be commented upon to show the nature of the modification.

This exam opens at 9.00 AM Australian Eastern Standard Time (AEST) on Thursday 02/07/2020 in Canvas (lms.unimelb.edu.au). The exam must be completed by 3.30 PM AEST on Thursday 02/07/2020.

This exam has 30 minutes of reading time, and 120 minutes of writing time. You have a 6 hour window in which to complete and submit the exam.

Number of pages: This paper has 7 pages, including this cover page

Authorised Materials: This is an open book exam. All material delivered during the teaching period and student notes are permitted.

Instructions to Students: The total number of marks for the examination is 100. It accounts for 50% of your final result for the subject. You should attempt all questions. Write your answers in a separate Word document and upload your document as a Word or PDF document as your submission in the Assignment for the Semester 1 exam.
You may include scanned drawings to illustrate your answer(s) if you wish but they must be appropriately embedded within the document that is your final submission.

Section 1. Short-answer questions (50 marks total)

1. 5 possible variants have been summarised by a variant finding program. The locations (marked in red vertical lines) of the variants and the affected codon are shown on the diagram above. For each variant (1-5), use the VCF entry and codon table to comment on the type of mutation that may have occurred, and the effect the variant may have on the gene transcript or its protein product. (5 marks)

2. In your work for the Pathology department in a major public cancer hospital, you have been asked to design a new whole genome sequencing test that will be applied to every cancer patient who visits the hospital.
   a) How deep will you sequence the tumour and matched normal samples?
   b) Would you check the data for sample swaps and, if so, how?
   c) What types of mutations should the assay cover? For each mutation type, what signal in the sequencing reads would you expect the caller to use?
   d) What other downstream analyses might you include?
(5 marks)

3. You have been tasked with developing two diagnostic tests to identify bacterial pathogens.

Test_1 must be able to detect the bacterial pathogen species B. pathogenesis in patient throat swab samples. The samples are expected to contain many bacterial species due to host microbiome contamination. B. pathogenesis has a very limited accessory genome compared to the other species which may be present in a sample. There is a reasonable evolutionary distance in terms of SNPs between B. pathogenesis and other bacteria.

Test_2 must be able to detect and differentiate between strains of a pathogenic species B. nauseous. Different strains of B. nauseous have high levels of sequence homology, but each strain has unique genetic regions compared to the other strains within a large accessory genome.
For each of the two tests above state:
   a) which of the following metagenomics sequencing approaches would be the most appropriate: 16S amplicon, MLST, or WGS.
   b) whether short-read or long-read sequencing would be most suited to that test, and explain why.
(4 marks)

4. A study was performed to look for gene expression changes in cultured cells when treated with a new drug, melbuximab. Bulk RNA-seq was performed on samples from control (C) and drug treated (T) cell cultures of an immortalized cell line to detect differentially expressed genes. The experiment was performed using cell cultures prepared on three different days. The library sizes for each sample after NGS sequencing are shown below.

A standard RNA-seq protocol was used, consisting of 1) sample preparation, 2) mRNA capture & size selection, 3) reverse transcription and PCR amplification, 4) short read Illumina sequencing, and 5) differential expression (DE) analysis. For each of the 5 steps, describe one possible source of bias or error that could impact the final results of the DE analysis. (5 marks)

5. The figure above shows a view of the UCSC genome browser showing some of the Tabula Muris single-cell RNA-seq expression data. The data has been collapsed into cell types and the coverage plots for a few selected cell types are shown. The region of the mouse genome shown is the location of the Interleukin 4 gene (IL4) and the data ranges for the coverage plots are autoscaled for each cell type.
   a) From examination of the features in the gene models, what can you deduce about the IL4 gene, its transcripts and possible protein products from this image?
   b) From examination of the coverage plots, what can you deduce about expression of IL4 transcripts in these cell types?
(6 marks)

6. Cell clustering is an important technique for interpreting scRNA-seq data. Discuss why clustering is necessary, the visual representation which is produced, and how the output can be informative.
Mention the name of a commonly used scRNA-seq clustering tool, and the name of a common normalisation method. (5 marks)

7. 3D structure prediction is an important technique for characterising the functionality that a novel protein may possess. Discuss when 3D structure prediction is the best option for functional annotation over DNA or amino acid sequence homology. Additionally, mention 2 considerations about the 3D structure which indicate whether the predicted protein may perform the same function as the most similar annotated protein. (5 marks)

8. What are post-translational modifications? Briefly describe how the potential presence of post-translational modifications contributes to the complexity in assigning peptide sequences from fragmentation data in mass spectrometry experiments. (5 marks)

9. In clinical genomics, deciding whether a variant is likely to be pathogenic involves multiple stages. Discuss 5 considerations when deciding if a variant is likely to be pathogenic. (5 marks)

10. Benchmarking is important to understand the strengths and weaknesses of bioinformatics tools, and to compare their performance in an unbiased manner. You are writing a review paper concerned with comparing software tools which perform cell clustering from scRNA-seq data. Discuss how you would benchmark a group of tools, commenting on:
   a) Why benchmarking is necessary for scRNA-seq analysis.
   b) The kinds of datasets which are appropriate for use in benchmarking.
   c) Which metrics you would choose to assess tool performance.
(5 marks)

Section 2. Long-answer question (50 marks total)

Part A: (20 marks)

Bowel cancer is one of the leading causes of cancer related mortality among young people. Recent studies have shown that some genotoxic bacteria found in the intestinal microbiome produce a toxin that is associated with risk of developing bowel cancer. Genotoxins are toxins that can damage the DNA of a cell.
Of particular interest is a specific strain of Escherichia coli that produces the toxin colibactin, encoded by the clb gene. The clb gene is contained within the 50 kb pks genomic island, which is part of the bacteria’s accessory genome. The colibactin-producing strain of E. coli is therefore called pks+ E. coli. The pks+ E. coli reference genome is available from public databases.

For this study, 1000 tumour specimens are available from young people with bowel cancer. The first aim of the study is to determine the frequency of pks+ E. coli presence in the tumour environment. DNA was extracted from each of the tumour samples and subjected to whole genome sequencing for a metagenomics analysis using Illumina reads. You have been provided with the FASTQ read sets for each tumour sample.

Describe a bioinformatics workflow to investigate the presence of pks+ E. coli in the tumour samples using these FASTQ reads. Your answer should describe, in detail, important steps for processing the data, tools used, and expected outputs of the analysis.

Part B: (20 marks)

Your analysis determined that approximately 10% of the tumours were associated with the presence of pks+ E. coli. However, it is likely that not all these tumours are caused by pks+ E. coli. The genotoxic activity of pks+ E. coli is known to result in a specific pattern of somatic mutations in the tumour. To further investigate which of these 100 tumours are likely to be a result of colibactin toxin, you must design a further experiment. For this aim you are provided with Illumina FASTQ sequencing reads from both the tumours and matched normal (non-cancerous) tissue from the same patient.

1. Describe a bioinformatics workflow to identify somatic mutations in the tumours using this data. Your answer should detail important steps for processing the data, tools used, and expected outputs.
2. Suggest how you could compare the patterns of mutations between the samples.
Part C: (10 marks)

From the analysis, you confirmed that some of the tumour samples had somatic DNA mutations characteristic of colibactin toxin (e.g. T>A base pair substitution), and that these patterns were more prevalent in tumours associated with pks+ E. coli. Given this collection of data, explore how genomics tools (e.g. metagenomics and microbiome analysis) can be used to identify people with a greater risk of developing bowel cancer.

Section 2 is an opportunity for you to demonstrate your overall understanding of the subject material. You should try to draw upon as many different parts of the course material as you can. Creative (but scientifically motivated) uses of the course material are welcome.

END OF EXAMINATION