Class exercise
Single-nucleotide polymorphism
● A single-nucleotide polymorphism (SNP, pronounced "snip") is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present in more than 1% of the population.
● For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, while in a minority of individuals the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations, C or A, are said to be alleles for this position.
Objective
● Write a program that, given a sequence of SNP records, computes for each chromosome:
  ● the number of transitions (A vs. G or C vs. T) within the data
  ● the number of transversions (any substitution that is not a transition) within the data
● BUT FIRST YOU HAVE TO DESIGN THE SOFTWARE BY DEFINING CRC CARDS AND UML CLASS DIAGRAMS
Input data
● The dataset is a VCF file representing a random sampling of SNPs from three people (a mother, a father, and their daughter) compared to the reference human genome.
● VCF is a tabular format similar to CSV.
● Each row of the dataset describes one SNP.
Input data sample
[Figure: sample rows of the VCF file with the columns annotated as: chromosome #, SNP's position in the chromosome, SNP's ID, reference base at this position, alternative base found]
What to do: SNP class (1)
● Implement a SNP class whose objects hold the relevant information about a single line in the VCF file.
● The SNP class is a derived class of AlleleVariation, which is an abstract class.
● AlleleVariation provides two abstract methods:
  ● .isTransition() should return True if the variation is a transition and False otherwise, by looking at the two allele instance variables.
  ● .isTransversion() should return True if the variation is not a transition and False otherwise.
● Instances of SNP include the following private attributes:
  ● the reference allele (a one-character string in column 4, e.g., "A")
  ● the alternative allele (a one-character string in column 5, e.g., "G")
  ● the name of the chromosome on which it exists (a string in column 1, e.g., "1")
  ● the reference position (an integer in column 2, e.g., 799739)
  ● the ID of the SNP (a string in column 3, e.g., "rs57181708" or ".")
● Because we'll be parsing lines one at a time, all of this information can be provided in the constructor.
What to do: SNP class (2)
● SNP objects should be able to answer questions:
  ● isTransition() should return True if the SNP is a transition and False otherwise, by looking at the two allele instance variables. A transition is A/G, G/A, C/T, or T/C.
  ● isTransversion() should return True if the SNP is not a transition and False otherwise.
● Use inheritance and overriding for these methods, and encapsulation to hide all attributes of SNP.
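One possible shape for these two classes is sketched below. It is not a reference solution: the constructor parameter order, the `_TRANSITIONS` set, and the `getChromosomeName()` accessor are assumptions made for illustration, and the sample values are simply the column examples quoted above rather than a row known to be in the file.

```python
from abc import ABC, abstractmethod


class AlleleVariation(ABC):
    """Abstract base class: declares the two questions a variation can answer."""

    @abstractmethod
    def isTransition(self):
        ...

    @abstractmethod
    def isTransversion(self):
        ...


class SNP(AlleleVariation):
    # The four transition pairs, written as (reference, alternative).
    _TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def __init__(self, chrname, pos, snpid, refallele, altallele):
        # Leading double underscores trigger name mangling, which is the
        # closest Python gets to private attributes.
        self.__chrname = chrname
        self.__pos = int(pos)  # column 2 arrives as text when read from a file
        self.__snpid = snpid
        self.__refallele = refallele
        self.__altallele = altallele

    def getChromosomeName(self):
        # Hypothetical accessor, not required by the exercise text.
        return self.__chrname

    def isTransition(self):
        return (self.__refallele, self.__altallele) in SNP._TRANSITIONS

    def isTransversion(self):
        return not self.isTransition()


snp = SNP("1", "799739", "rs57181708", "A", "G")
print(snp.isTransition(), snp.isTransversion())  # True False
```

Defining isTransversion() as the negation of isTransition() keeps the two answers consistent by construction, which matches the definition of a transversion as "anything not being a transition".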
What to do: Chromosome class
● Implement a Chromosome class that provides four methods:
  ● count_transitions(), which returns the number of transition SNPs
  ● count_transversions(), which returns the number of transversion SNPs
  ● addSNP(), which adds a SNP object to the list of SNPs associated with the current Chromosome
  ● getName(), which returns the string representing the name of the Chromosome
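A minimal sketch of a class with these four methods follows, assuming the SNPs are kept in a private Python list. So that the block runs on its own, it uses `_StubSNP`, a hypothetical stand-in that only mimics the isTransition()/isTransversion() interface; in the actual exercise, real SNP objects would be added instead.

```python
class Chromosome:
    def __init__(self, name):
        self.__name = name
        self.__snps = []  # SNP objects located on this chromosome

    def getName(self):
        return self.__name

    def addSNP(self, snp):
        self.__snps.append(snp)

    def count_transitions(self):
        return sum(1 for snp in self.__snps if snp.isTransition())

    def count_transversions(self):
        return sum(1 for snp in self.__snps if snp.isTransversion())


class _StubSNP:
    """Hypothetical stand-in for SNP, used only to make this example runnable."""

    def __init__(self, transition):
        self._transition = transition

    def isTransition(self):
        return self._transition

    def isTransversion(self):
        return not self._transition


chr1 = Chromosome("1")
for flag in (True, True, False):
    chr1.addSNP(_StubSNP(flag))

print(chr1.getName(), chr1.count_transitions(), chr1.count_transversions())  # 1 2 1
```

Chromosome never inspects the alleles itself; it only asks each stored object whether it is a transition, so any AlleleVariation subclass could be counted the same way.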
Where to get the dataset
● The dataset can be downloaded here: https://raw.githubusercontent.com/anuzzolese/genomics-unibo/master/2019-2020/data/trio.sample.vcf
How to read the dataset:

    import csv

    with open('trio.sample.vcf') as csv_file:
        # VCF columns are separated by tabs
        csv_reader = csv.reader(csv_file, delimiter='\t')
        for row in csv_reader:
            # Skip blank lines and VCF header lines (they start with '#')
            if not row or row[0].startswith('#'):
                continue
            chromosomeName = row[0]
            snpPosition = row[1]
            snpId = row[2]
            refAllele = row[3]
            altAllele = row[4]
            print(chromosomeName + ", " + snpPosition + ", " + snpId + ", " + refAllele + ", " + altAllele)

https://github.com/anuzzolese/genomics-unibo/blob/master/2019-2020/exercises/trio-sample-vcf-reader.py
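As a variation on this reader, the sketch below wraps the parsing in a generator, converts the position to an integer (as the SNP class expects), and groups the rows by chromosome name. The function name `read_snp_rows` and the dictionary `rows_by_chromosome` are assumptions for illustration. The input is a small in-memory sample with made-up rows, so the block runs without downloading the file; passing an open file handle for trio.sample.vcf should work the same way.

```python
import csv
import io


def read_snp_rows(handle):
    """Yield (chromosome, position, snp_id, ref, alt) for each data row."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and VCF header lines, which start with '#'.
        if not row or row[0].startswith("#"):
            continue
        chromosome, position, snp_id, ref, alt = row[:5]
        yield chromosome, int(position), snp_id, ref, alt


# Made-up rows for illustration only; they are not copied from trio.sample.vcf.
sample = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t799739\trs57181708\tA\tG\n"
    "1\t800100\t.\tC\tA\n"
    "2\t12345\t.\tT\tC\n"
)

# Group the parsed rows by chromosome name (column 1).
rows_by_chromosome = {}
for chromosome, position, snp_id, ref, alt in read_snp_rows(sample):
    rows_by_chromosome.setdefault(chromosome, []).append((position, snp_id, ref, alt))

for name, rows in rows_by_chromosome.items():
    print(name, len(rows))
```

Grouping by the first column is the step that lets one Chromosome object be created per distinct name, with each row turned into a SNP and passed to addSNP().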