CSE182-L16 Non-coding RNA
Biol. Data analysis: Review • Assembly • Protein Sequence Analysis • Sequence Analysis • Gene Finding
Much other analysis is possible • Assembly • Genomic Analysis/Pop. Genetics • Protein Sequence Analysis • Sequence Analysis • Gene Finding • ncRNA
A Static picture of the cell is insufficient • Each Cell is continuously active, – Genes are being transcribed into RNA – RNA is translated into proteins – Proteins are post-translationally modified and transported – Proteins perform various cellular functions • Can we probe the Cell dynamically? [Figure labels: Gene Regulation, Transcript profiling, Proteomic profiling]
ncRNA gene finding • The gene is transcribed but not translated. • What are the clues to non-coding genes? – Look for signals selecting start of transcription and translation. Many non-coding genes (for example, tRNAs) are transcribed by Pol III. – Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure. • Structure: Given a sequence, what is the structure into which it can fold with minimum energy?
tRNA structure
RNA structure: Basics • Key: RNA is single-stranded. Think of a string over 4 letters, A, C, G, and U. • The complementary bases form pairs. • Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots
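The non-crossing condition can be checked mechanically. The sketch below is illustrative only: the function name `crossing_pairs` and the representation of a structure as a list of 1-based position pairs are assumptions, not part of the slides. Two base-pairs (i, j) and (k, l) cross when i < k < j < l.

```python
def crossing_pairs(pairs):
    """Return every pair of base-pairs ((i, j), (k, l)) with i < k < j < l.

    `pairs` is a list of (position, position) tuples; an empty result means
    the structure is non-crossing, i.e. it contains no pseudoknot.
    """
    # Normalise each pair so the smaller position comes first, then sort.
    ordered = sorted((min(p), max(p)) for p in pairs)
    crossings = []
    for idx, (i, j) in enumerate(ordered):
        for (k, l) in ordered[idx + 1:]:
            # After sorting, i <= k, so this is the only crossing pattern.
            if i < k < j < l:
                crossings.append(((i, j), (k, l)))
    return crossings


nested = [(1, 10), (2, 9), (4, 7)]   # properly nested stem
knotted = [(1, 6), (4, 9)]           # the second pair starts inside the first and ends outside
```

Here `crossing_pairs(nested)` is empty, while `crossing_pairs(knotted)` reports the single crossing.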
RNA structure prediction • Any set of non-crossing base-pairs defines a secondary structure. • Abstract Question: – Given an RNA string find a structure that maximizes the number of non-crossing base-pairs – Incorporate the true energetics of folding – Incorporate pseudoknots
A combinatorial problem • Input: • A string over A,C,G,U • A pairs with U, C pairs with G • Output: • A subset of possible base-pairs of maximum size such that • No two base-pairs intersect • How can we compute this set efficiently?
RNA structure: Nussinov’s algorithm
1. Score B for every base-pair. No penalty for loops. No pseudoknots.
2. Let W(i,j) be the score of the best structure of the subsequence from i to j, with W(i,j) = 0 whenever j ≤ i.
for i = n down to 1 {
  for j = i+1 to n {
    $$W(i,j) = \max \begin{cases} B(r_i, r_j) + W(i+1, j-1), \\ W(i, j-1), \\ W(i+1, j), \\ \max_{i \le k < j} \{ W(i,k) + W(k+1,j) \} \end{cases}$$
  }
}
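A minimal Python sketch of the table fill follows. It assumes B = 1 for A-U and C-G pairs and 0 otherwise (the slide leaves the scores unspecified), uses 0-based indices, and the names `pair_score` and `nussinov_score` are illustrative.

```python
def pair_score(a, b):
    """B(a, b): 1 for a complementary pair (A-U or C-G), else 0. Assumed scoring."""
    return 1 if {a, b} in ({"A", "U"}, {"C", "G"}) else 0


def nussinov_score(rna):
    """Return the maximum number of non-crossing base-pairs in `rna`."""
    n = len(rna)
    if n == 0:
        return 0
    # W[i][j] is the best score for rna[i..j]; cells with j <= i stay 0.
    W = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best = max(
                pair_score(rna[i], rna[j]) + W[i + 1][j - 1],  # i pairs with j
                W[i][j - 1],                                   # j unpaired
                W[i + 1][j],                                   # i unpaired
            )
            for k in range(i, j):                              # split into two parts
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]
```

The three nested loops give O(n^3) time and O(n^2) space. Note that this formulation allows adjacent bases to pair; a minimum hairpin-loop length is one of the energetic details it ignores.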
Obtaining RNA structure
for i = n downto 1 {
  for j = i+1 to n {
    $$W(i,j) = \max \begin{cases} B(r_i, r_j) + W(i+1, j-1) & (1) \\ W(i, j-1) & (2) \\ W(i+1, j) & (3) \\ \max_{i \le k < j} \{ W(i,k) + W(k+1,j) \} & (4) \end{cases}$$
    if (1) { S(i,j) = / }
    else if (2) { S(i,j) = | }
    else if (3) { S(i,j) = - }
    else { S(i,j) = k }
  }
}
Obtaining RNA Structure
Procedure print_RNA(i,j) {
  if (j <= i) { return; }
  if (S(i,j) = /) { print “(i,j)”; print_RNA(i+1,j-1); }
  else if (S(i,j) = -) { print_RNA(i+1,j); }
  else if (S(i,j) = |) { print_RNA(i,j-1); }
  else { k = S(i,j); print_RNA(i,k); print_RNA(k+1,j); }
}
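The fill and traceback can be combined into one runnable sketch. Assumptions here: complementary pairs (A-U, C-G) score 1 and nothing else scores; ties are broken in the order (1) to (4); and case (1) is only offered when the two bases can actually pair, so the traceback never reports a non-complementary pair. The function names are illustrative.

```python
def can_pair(a, b):
    """True for the complementary pairs A-U and C-G (assumed scoring: 1 each)."""
    return {a, b} in ({"A", "U"}, {"C", "G"})


def fold(rna):
    """Return (score, pairs): the max pair count and one optimal set of 1-based pairs."""
    n = len(rna)
    if n == 0:
        return 0, []
    W = [[0] * n for _ in range(n)]
    S = [[None] * n for _ in range(n)]   # traceback moves: "/", "|", "-", or split point k
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            options = []
            if can_pair(rna[i], rna[j]):
                options.append((1 + W[i + 1][j - 1], "/"))               # case (1)
            options.append((W[i][j - 1], "|"))                           # case (2)
            options.append((W[i + 1][j], "-"))                           # case (3)
            options += [(W[i][k] + W[k + 1][j], k) for k in range(i, j)]  # case (4)
            # max() keeps the first best option, so earlier cases win ties.
            W[i][j], S[i][j] = max(options, key=lambda opt: opt[0])

    pairs = []

    def trace(i, j):
        if j <= i:
            return
        move = S[i][j]
        if move == "/":
            pairs.append((i + 1, j + 1))   # report 1-based positions
            trace(i + 1, j - 1)
        elif move == "-":
            trace(i + 1, j)
        elif move == "|":
            trace(i, j - 1)
        else:
            trace(i, move)
            trace(move + 1, j)

    trace(0, n - 1)
    return W[0][n - 1], sorted(pairs)
```

For example, `fold("ACGAUU")` returns the score together with the list of paired positions; other optimal structures with the same score may exist, and the tie-breaking order decides which one is reported.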
RNA structure: example • Sequence: A C G A U U (positions 1 2 3 4 5 6) [Figure: the W(i,j) table for this sequence, rows i = 1..6, columns j = 1..6]
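As a worked check of the recurrence on this sequence, assume B = 1 for A-U and C-G pairs and 0 otherwise (an assumption; the slide does not state the scores), with W(i,j) = 0 for j ≤ i. The key cells are:

```latex
\begin{align*}
W(2,3) &= B(r_2, r_3) + W(3,2) = B(C,G) + 0 = 1 \\
W(4,5) &= B(r_4, r_5) + W(5,4) = B(A,U) + 0 = 1 \\
W(2,5) &= W(2,3) + W(4,5) = 1 + 1 = 2 \qquad (\text{split at } k = 3) \\
W(1,6) &= B(r_1, r_6) + W(2,5) = B(A,U) + 2 = 3
\end{align*}
```

Under these assumptions the best score is 3, achieved by the base-pairs (1,6), (2,3) and (4,5).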
RNA Structure: Details
Base-pairing & Loops • RNA is single-stranded. • Base-pairs arise from complementary nucleotides. • A stack is when 2 base-pairs are contiguous. • Loops arise when there are unpaired bases. They are characterized by the number of base-pairs that close them. – Hairpin: closed by 1 base-pair – Bulge/Interior loops: closed by 2 base-pairs – Multi-loops: closed by k base-pairs
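The classification by number of closing base-pairs can be sketched as follows. This is an illustration under assumptions: the structure is given as a non-crossing list of 1-based (i, j) pairs with i < j, and the function name and label strings are made up for the example. Each pair is labelled by the loop it closes on its inner side.

```python
def classify_loops(pairs):
    """Map each base-pair (i, j) to the kind of loop it closes.

    The loop closed by (i, j) is also closed by every pair lying directly
    inside it (inside (i, j) but not inside any other enclosed pair).
    """
    pairs = sorted(pairs)
    labels = {}
    for (i, j) in pairs:
        inside = [(a, b) for (a, b) in pairs if i < a and b < j]
        # Direct children: enclosed pairs not nested within another enclosed pair.
        children = [(a, b) for (a, b) in inside
                    if not any(c < a and b < d for (c, d) in inside)]
        closing = 1 + len(children)   # number of base-pairs closing this loop
        if closing == 1:
            labels[(i, j)] = "hairpin"
        elif closing == 2:
            # Two contiguous base-pairs form a stack; otherwise unpaired bases intervene.
            is_stack = children[0] == (i + 1, j - 1)
            labels[(i, j)] = "stack" if is_stack else "bulge/interior"
        else:
            labels[(i, j)] = "multi-loop"
    return labels


example = [(1, 20), (2, 19), (3, 8), (10, 17), (11, 15)]
```

In `example`, the pair (2, 19) encloses two separate stems and therefore closes a multi-loop, while (3, 8) and (11, 15) close hairpins.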
Scoring Loops, multi-loops • Zuker-Turner Energy Rules: http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html • Stacking Energies • Energy for Bulges and Interior Loops • Energy for Multi-loops
Other tricks for obtaining structure • Alignment and Covariance
RNA: unsolved problems • The structure problem is still unsolved. – De novo prediction does not work well. – Covariance models require a prior alignment. • Many undiscovered non-coding genes – miRNAs and others have only recently been discovered. – It is very hard to detect a signal for these genes. – Random sequence also folds into low-energy structures.
Other ncRNA: miRNA • ncRNA ~22 nt in length • Pairs to sites within the 3’ UTR, specifying translational repression. • Similar to siRNA (involved in RNAi) • Unlike siRNA, miRNA do not need perfect base complementarity • Until recently, no computational techniques to predict miRNA • Most predictions based on cloning small RNAs from size fractionated samples
Gene Regulation
Gene expression • The expression of transcripts and protein in the cell is not static. It changes in response to signals. • The expression can be measured using microarrays. • What causes the change in expression?
Transcriptional machinery • RNA polymerase II (Pol II) scans the genome, initiating transcription, and terminating it. • The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.
TF binding • Other transcription factors interact with the core machinery and upstream DNA to provide specificity. • TFs bind to TF binding sites which are clustered in upstream enhancer and promoter elements. • The enhancer elements may be located many kb upstream of the core promoter. [Figure labels: Transcription factors, Upstream elements]
TF binding sites • TF binding sites are a weak signal (about 10 bp with 5 bp conserved). • If two genes are co-regulated, they are likely to share binding sites. • Discovery of binding site motifs is an important research problem. • Example sites upstream of five genes: g1 TCAGGAG, g2 TGAGGAG, g3 TCAGGTG, g4 TGAGGTG, g5 TCAGGTG
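One simple way to summarise a set of aligned sites is a per-column count and a consensus string. The sketch below uses the five example sites from this slide; the function names are illustrative, and real motif discovery is harder because the sites are not given pre-aligned.

```python
from collections import Counter

# The five 7-bp example sites from the slide, keyed by gene label.
sites = {
    "g1": "TCAGGAG",
    "g2": "TGAGGAG",
    "g3": "TCAGGTG",
    "g4": "TGAGGTG",
    "g5": "TCAGGTG",
}


def column_counts(seqs):
    """Count the bases seen in each column of equal-length, aligned sequences."""
    return [Counter(column) for column in zip(*seqs)]


def consensus(seqs):
    """Most frequent base in each column."""
    return "".join(counts.most_common(1)[0][0] for counts in column_counts(seqs))


def conserved_positions(seqs):
    """0-based columns in which every sequence has the same base."""
    return [i for i, counts in enumerate(column_counts(seqs)) if len(counts) == 1]


motif = consensus(sites.values())
fixed = conserved_positions(sites.values())
```

For these sites, five of the seven columns are identical across all genes, which is the sense in which the signal is weak: only a few conserved positions distinguish a site from background sequence.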
http://www.gene-regulation.com/pub/databases.html#transfac
Discovering TF binding sites • Identification of these TF binding sites/switches is critical. • Requires identification of co-regulated genes (genes containing the same set of switches). • How do we find co-regulated genes?
Idea1: Use orthologous genes from different species
1. The species are too close (EX: humans and chimps). Binding & non-binding sites are both conserved.
   ACGGCAGCTCGCCGCCGCGC
   ACGGC-GGGCGCCGCCCCGC
2. The species are distant. Binding sites are conserved but not other sequence.
   ACGGCAGCTCGCCGCCGC-C
   AGTGC-GGGCGCCGCCTCAT
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
   ACGGC-GC-TCGCCGCCGCGC
   AT-ACGAAGTAGCGG-ATGGT
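The three cases can be compared with a simple identity measure over the aligned sequences shown on this slide. This is a sketch: `fraction_identical` is an illustrative name, and counting gap columns as mismatches is one convention among several.

```python
def fraction_identical(a, b):
    """Fraction of alignment columns where both sequences carry the same base.

    `a` and `b` are rows of a pairwise alignment ('-' marks a gap); columns
    containing a gap count as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


# The three aligned pairs from the slide.
close = fraction_identical("ACGGCAGCTCGCCGCCGCGC", "ACGGC-GGGCGCCGCCCCGC")
distant = fraction_identical("ACGGCAGCTCGCCGCCGC-C", "AGTGC-GGGCGCCGCCTCAT")
very_distant = fraction_identical("ACGGC-GC-TCGCCGCCGCGC", "AT-ACGAAGTAGCGG-ATGGT")
```

Identity drops from case 1 to case 3. In case 2 the remaining matches are concentrated in a contiguous conserved block, which is what makes moderately distant species informative for locating binding sites.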
Idea2: Measure expression of genes • Northern Blot: – Quantitative expression of a few genes
Microarray • Expression level of all genes
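Once expression levels are measured across several conditions, one common heuristic for proposing co-regulated genes is to look for highly correlated expression profiles. The sketch below is an assumption-laden illustration: the gene names and expression values are hypothetical, and Pearson correlation with a fixed threshold is just one possible choice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coregulated_pairs(profiles, threshold=0.9):
    """Gene pairs whose profiles correlate at or above `threshold` (hypothetical cutoff)."""
    genes = sorted(profiles)
    return [(g, h)
            for idx, g in enumerate(genes)
            for h in genes[idx + 1:]
            if pearson(profiles[g], profiles[h]) >= threshold]


# Hypothetical expression levels over four conditions (made-up numbers).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
candidates = coregulated_pairs(profiles)
```

With these made-up profiles only geneA and geneB rise together; their upstream regions would then be the place to search for shared binding sites.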
Protein Expression using MS
Pathways • Proteins interact to transduce signal, catalyze reactions, etc. • The interactions can be captured in a database. • Queries on this database are about looking for interesting sub-graphs in a large graph.
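As a small illustration of a graph query, the sketch below stores interactions as an adjacency list and looks for the shortest chain of interactions between two proteins using breadth-first search. The protein names and edges are hypothetical, and a path is only the simplest kind of sub-graph one might search for.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical signalling interactions (directed: upstream -> downstream).
interactions = {
    "receptor": ["kinase1"],
    "kinase1": ["kinase2", "phosphatase"],
    "kinase2": ["tf"],
    "phosphatase": [],
    "tf": [],
}
route = shortest_path(interactions, "receptor", "tf")
```

Here `route` traces the signal from the receptor to the transcription factor; a pathway database would answer the same kind of question over a much larger graph.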
Biological databases in NAR • http://www3.oup.co.uk/nar/database/c • 548 databases in various categories • Examples: Genbank, Rfam, SwissProt, PDB, Kegg, dbSNP/OMIM/seattleSNPs, Stanford microarray db, SWISS 2D-page
Summary • Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data. • While database technology (XML, relational and OO databases, text formats) is used to store this data, its use is (often) transparent for Bioinformatics people. • In this course, we looked at various data-streams, and pointed to databases that store these data-streams. • Nucleic Acids Research brings out a database issue every January (2004: 548 databases).