Biology & CS
Philip Chan

Biology: Different Levels
• Evolution: organisms over time
• Ecology: interactions among organisms and environment
• Individual organisms: anatomy, physiology
• Cell biology: cells
• Molecular biology: chemical molecules

Molecular Biology
• DNA
  • Stands for? Deoxyribonucleic acid
  • Double helix structure
  • Watson and Crick, 1953
  • Nobel Prize in Physiology or Medicine, 1962

Genome
• Chromosomes
  • Inside where? Inside the cell nucleus
  • How many pairs?
Genome
• Chromosomes
  • inside the cell nucleus
  • 23 pairs (one pair determines sex)
  • contain genetic information
  • copied during cell division
  • made of DNA
• Genes
  • (roughly) segments of DNA that encode proteins
• Genome
  • Human: 20,000-25,000 genes

DNA to Protein
• Transcription
  • DNA -> RNA
• Translation
  • RNA -> Protein
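The transcription step (DNA -> RNA) can be sketched in a few lines. This is a minimal illustration, not part of the original slides. It assumes the DNA is given as its coding strand, so the RNA has the same letters with U in place of T. The function name `transcribe` is hypothetical, and translation (RNA -> Protein) is not attempted because it needs a codon table.

```python
def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding-strand sequence.

    Assumption: the input is the coding strand, so transcription
    reduces to replacing every T (thymine) with U (uracil).
    """
    dna = dna.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected symbols in DNA sequence: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATTGCTA"))
```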
DNA Encoding for Proteins
• DNA
  • Sequence of nucleotides
  • 4 possible nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
  • [Thymine (T) becomes Uracil (U) in RNA]
• Protein
  • Sequence of amino acids
  • 20 possible amino acids
• How many nucleotides are needed to encode one amino acid?

Sequencing Human Genome
• Human Genome Project
  • International (governments/universities)
• Celera Corporation (US)
• Many short sequences
  • Algorithms to merge them into longer sequences
• Complete genome sequence in ~2003

Why Study the Genome?
• Understanding how genes, proteins, … interact with each other
• Understanding diseases
  • Mistakes in copying DNA
  • Mutations cause changes in DNA

Comparing Genes
• After a gene is found
  • Biologist might not know its function
  • Find "similarities" with genes of known function

Cancer (1984)
• Cancer-causing gene is similar to a normal growth gene
• Cancer might be caused by a normal growth gene being switched on at the wrong time
  • A good gene doing the right thing at the wrong time

Cystic Fibrosis (1989)
• Cystic Fibrosis is a fatal disease associated with abnormal secretions (clogs in lungs).
• A segment of the Cystic Fibrosis gene is similar to the sequence for ATP binding proteins.
  • These proteins affect cell membrane and secretions
Similarity/Distance of Sequences
• Position by position
  • ACACAC
  • CACACA
  • Hamming distance = 6
• Shift the second sequence by one character
  • ACACAC_
  • _CACACA
  • Distance = 2

Subsequence
• Sequence of characters, in their original order, that might NOT be consecutive
• ATTGCTA
  • TTGC -> subsequence
  • AGCA -> subsequence
  • ATTA -> subsequence
  • TGTT -> not a subsequence
  • TCG -> not a subsequence

Longest Common Subsequence
Problem 1

Common Subsequence
• Given two sequences
  • ATCTGAT
  • TGCATA
• Common subsequences?
  • TCTA
  • TA
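The examples on these slides can be checked with a short sketch. The function names `hamming_distance` and `is_subsequence` are illustrative choices, not from the slides. Treating the shift as padding both strings with `_` is an assumption taken from how the shifted example is written.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def is_subsequence(candidate: str, text: str) -> bool:
    """True if candidate's characters appear in text in order (gaps allowed)."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds ch (or runs out),
    # so each later character is only searched to the right of the previous one.
    return all(ch in remaining for ch in candidate)


print(hamming_distance("ACACAC", "CACACA"))    # position by position
print(hamming_distance("ACACAC_", "_CACACA"))  # second sequence shifted by one
print(is_subsequence("TTGC", "ATTGCTA"), is_subsequence("TGTT", "ATTGCTA"))
```

A common subsequence is then any string for which `is_subsequence` holds against both inputs, for example TCTA against ATCTGAT and TGCATA.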
Longest Common Subsequence (LCS)
• Many different common subsequences
• Want to find the longest
• Length of LCS helps determine similarity of two sequences/genes

Problem Formulation
• Given (input)
  • Two sequences v, w
• Find (output)
  • Longest common substring of v and w (simpler problem: the characters must be consecutive)

Algorithm
• Any ideas?

Algorithm 1
• Find common substrings of length 1
• Find common substrings of length 2
• …
• What is the time complexity?
• Are we repeating unnecessary work?
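One possible reading of Algorithm 1 is sketched below: try every length in turn and compare every substring of v against every substring of w. The function name is hypothetical and the slides do not prescribe this exact loop structure. It is deliberately naive so the repeated comparisons are easy to see.

```python
def longest_common_substring_naive(v: str, w: str) -> str:
    """Brute force: for each length, compare all substrings of v and w."""
    best = ""
    for length in range(1, min(len(v), len(w)) + 1):
        for i in range(len(v) - length + 1):          # start position in v
            for j in range(len(w) - length + 1):      # start position in w
                # Each comparison re-reads characters already compared
                # when shorter lengths were tried.
                if v[i:i + length] == w[j:j + length]:
                    best = v[i:i + length]
    return best


print(longest_common_substring_naive("ATCTGAT", "TGCATA"))
```

When several substrings share the maximum length, this sketch returns whichever one it finds last.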
Algorithm 2
• Observation:
  • If a common substring of length L+1 exists
  • A common substring of length L must also exist
• Idea?
  • Use common substrings of length L to find common substrings of length L+1
• Time complexity?

Algorithm ?
• Tree Search
  • What would be the nodes and branches?
• Could recursion help?
• Time complexity?

Algorithm 3
• Consider
  • String v, indexed by i
  • String w, indexed by j
• LCS(i, j) returns the length of the LCS (here, the longest common substring) ending at i, j
• LCS(i, j) =
  • LCS(i - 1, j - 1) + 1   if v[i] = w[j]
  • 0                       otherwise
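The recurrence for Algorithm 3 translates almost directly into a recursive function. The sketch below uses 0-based indices and returns 0 once either index runs off the front of its string. That boundary rule is an assumption, since the recurrence as written does not state a base case. Function names are illustrative.

```python
def lcs_ending_at(v: str, w: str, i: int, j: int) -> int:
    """Length of the longest common substring ending at v[i] and w[j]."""
    # Assumed base case: no characters left, or the current pair differs.
    if i < 0 or j < 0 or v[i] != w[j]:
        return 0
    return lcs_ending_at(v, w, i - 1, j - 1) + 1


def longest_common_substring_length(v: str, w: str) -> int:
    """Try every (i, j) pair as the end point and keep the best result."""
    return max(
        (lcs_ending_at(v, w, i, j) for i in range(len(v)) for j in range(len(w))),
        default=0,
    )


print(longest_common_substring_length("ABAB", "BABA"))
```

Each call to `lcs_ending_at` walks back along one diagonal, so different starting pairs on the same diagonal recompute the same values.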
Algorithm 3
• LCS(i, j) is evaluated for different initial i, j pairs
• Any redundant work?

Algorithm 3
• Dynamic programming
  • Eliminate redundant work
  • By storing partial answers
• LCS[] is a table
  • LCS[i, j] is the length of the LCS ending at i, j
• LCS[i, j] =
  • LCS[i - 1, j - 1] + 1   if v[i] = w[j]
  • 0                       otherwise

Initial table (row 0 and column 0 are all zeroes):

        A  B  A  B
     0  0  0  0  0
  B  0
  A  0
  B  0
  A  0

After filling the first two rows:

        A  B  A  B
     0  0  0  0  0
  B  0  0  1  0  1
  A  0  1  0  2  0
  B  0
  A  0
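The table-filling version can be sketched as follows. This is an illustration under stated assumptions rather than the author's code: the function name is hypothetical, the first argument labels the rows and the second labels the columns, and row 0 and column 0 stand for "no characters yet" so that the strings can stay 0-indexed in Python.

```python
def common_substring_table(rows: str, cols: str) -> list[list[int]]:
    """table[i][j] = length of the common substring ending at rows[i-1], cols[j-1]."""
    table = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            if rows[i - 1] == cols[j - 1]:
                # Extend the match that ended one step up and to the left.
                table[i][j] = table[i - 1][j - 1] + 1
            # Otherwise the entry stays 0.
    return table


table = common_substring_table("BABA", "ABAB")
for row in table:
    print(row)
print("longest common substring length:", max(max(row) for row in table))
```

Each entry is computed once from a single stored neighbor, which is where the saving over the recursive formulation comes from.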
Algorithm 3

Completed table:

        A  B  A  B
     0  0  0  0  0
  B  0  0  1  0  1
  A  0  1  0  2  0
  B  0  0  2  0  3
  A  0  1  0  3  0

Problem Formulation
• Given (input)
  • Two sequences v, w
• Find (output)
  • Longest common subsequence of v and w
  • Skipping character(s) is allowed

Problem
• String editing
  • Transform one string to another by keeping/adding/deleting characters
• Can also be viewed as aligning two strings
• Any ideas?

  --  T   G   C   A   T   --  A   --  C
  A   T   --  C   --  T   G   A   T   C

LCS: Example

  i coords:       0   0   1   2   3   4   5   5   6   6   7
  elements of v:    --  T   G   C   A   T   --  A   --  C
  elements of w:    A   T   --  C   --  T   G   A   T   C
  j coords:       0   1   2   2   3   3   4   5   6   7   8

(0,0) -> (0,1) -> (1,2) -> (2,2) -> (3,3) -> (4,3) -> (5,4) -> (5,5) -> (6,6) -> (6,7) -> (7,8)

• Matches shown in red (the columns where v and w have the same letter)
• positions in v: 1 < 3 < 5 < 6 < 7
• positions in w: 2 < 3 < 4 < 6 < 8
• Every common subsequence is a path in 2-D grid

Edit Graph for LCS Problem

[Figure: edit graph, a 2-D grid with the elements of w (A T C T G A T C, j = 0..8) across the top and the elements of v (T G C A T A C, i = 0..7) down the side]
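The link between an alignment and a path in the grid can be made concrete with a small sketch. It reads the two aligned rows from the LCS example and rebuilds the (i, j) coordinates and the matched positions. The helper names and the `GAP` constant are illustrative, not from the slides.

```python
GAP = "--"  # gap symbol as written in the alignment


def alignment_path(v_row: list[str], w_row: list[str]) -> list[tuple[int, int]]:
    """Turn two aligned rows into the list of (i, j) grid nodes they trace."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != GAP:
            i += 1  # consumed one element of v (move down)
        if b != GAP:
            j += 1  # consumed one element of w (move right)
        path.append((i, j))
    return path


def matched_positions(v_row: list[str], w_row: list[str]) -> tuple[list[int], list[int]]:
    """1-based positions in v and w where a column holds the same letter twice."""
    i = j = 0
    v_pos, w_pos = [], []
    for a, b in zip(v_row, w_row):
        i += a != GAP
        j += b != GAP
        if a == b and a != GAP:  # diagonal step that is a match
            v_pos.append(i)
            w_pos.append(j)
    return v_pos, w_pos


v_row = "-- T G C A T -- A -- C".split()
w_row = "A T -- C -- T G A T C".split()
path = alignment_path(v_row, w_row)
print(path)
print(matched_positions(v_row, w_row))
```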
Edit Graph for LCS Problem

[Figure: edit graph for v = TGCATAC (rows, i = 0..7) and w = ATCTGATC (columns, j = 0..8)]

• Every path is a common subsequence.
• Every diagonal edge adds an extra element to the common subsequence.
• LCS Problem: find a path with the maximum number of diagonal edges.

Computing LCS

The length of LCS(v_i, w_j) is computed by:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + 0 \\
s_{i,j-1} + 0 \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$

[Figure: one grid cell, with edges of weight 0 from (i-1, j) and (i, j-1) into (i, j), and a diagonal edge of weight 1 from (i-1, j-1)]

Dynamic Programming Example

Initialize 1st row and 1st column to be all zeroes. Or, to be more precise, initialize 0th row and 0th column to be all zeroes.

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \quad \text{(value from NW)} \\
s_{i-1,j} & \text{(value from North, top)} \\
s_{i,j-1} & \text{(value from West, left)}
\end{cases}
$$
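The recurrence above can be sketched as a table-filling routine. This is a minimal illustration with hypothetical function names. It keeps the 0th row and 0th column at zero and maps the 1-based v_i, w_j of the formula onto Python's 0-based `v[i - 1]`, `w[j - 1]`.

```python
def lcs_length_table(v: str, w: str) -> list[list[int]]:
    """s[i][j] = length of the longest common subsequence of v[:i] and w[:j]."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]  # 0th row/column stay 0
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            best = max(s[i - 1][j],      # value from North (top)
                       s[i][j - 1])      # value from West (left)
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)  # value from NW, plus the match
            s[i][j] = best
    return s


def lcs_length(v: str, w: str) -> int:
    """Bottom-right entry of the table: the length of an LCS of v and w."""
    return lcs_length_table(v, w)[len(v)][len(w)]


print(lcs_length("TGCATAC", "ATCTGATC"))
```

For the two strings of the edit graph, the result agrees with the five matches on the path shown in the LCS example.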