introduction to epd



  1. Introduction to EPD
     General goal: to provide the best possible guess about the location of the transcription start sites (TSS) of a gene, based on multiple lines of evidence.
     Leading concepts:
     • Promoters are defined as transcription initiation regions
     • No redundancy
     • Data selection according to consistent criteria defined in the user manual
     • Independent evaluation of published results
     • Dynamic entries potentially based on multiple sources
     • Definition of promoter sequence through positional pointers
     • Cross-referencing between related entries
     • Definition of a subset of phylogenetically independent promoters for comparative sequence analysis
     Promoter Definition
     Three definitions for E. coli promoters:
     1. Transcription start regions
     2. DNA sequences essential for accurate and efficient RNA chain initiation
     3. RNA polymerase binding sites
     How the term promoter is used in the literature:
     Dickson et al. (1975), Science 187, 27: "... the (lac) promoter can be divided into two functional units, the CAP interaction site and the RNA polymerase interaction site."
     Harley & Reynolds (1987), Nucl. Acids Res. 15, 2343: "Promoters are DNA sequences which affect the frequency and location of transcription initiation through interaction with RNA polymerase."

  2. EPD: Admission criteria
     In order to be included in EPD, a promoter must be:
     a) recognized by eukaryotic RNA POL II
     b) active in a higher eukaryote (viral promoters are acceptable)
     c) experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter
     d) biologically functional (no promoters of transcribed pseudogenes)
     e) available in the current sequence database (no longer a problem)
     f) distinct from other promoters in EPD (no redundancy; one copy for tandemly repeated gene clusters, active retrotransposons, etc.)
     Recent developments: acceptance of "low quality" entries based on weak evidence.
     EPD entries are positional sequence features
     Positional sequence features have a center position, but no experimentally defined borders.
     Examples of positional features:
     • promoters (DNA)
     • splice junctions (DNA/RNA)
     • polyadenylation sites (RNA)
     • translation start sites (RNA)
     • catalytic residues (proteins)
     • post-translational modification sites
     How sequence analysis software can deal with positional features:
     • by extracting fixed-length sequence segments around the positions
     • the relative 5' and 3' borders may be specified on the fly
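The idea of extracting fixed-length segments around a positional feature can be sketched as follows. This is a minimal illustration, not EPD software: the function names, the 1-based `position`, and the convention that offset 0 is the site itself are assumptions made for the example.

```python
# Hypothetical helper: extract a fixed-length segment around a positional feature.
# Assumptions (not from the slides): `position` is 1-based, offsets are relative
# to the site, and offset 0 denotes the site itself.

COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_segment(sequence, position, five_prime, three_prime, strand="+"):
    """Return the segment from `five_prime` to `three_prime` around `position`.

    Borders are given on the fly as signed offsets (e.g. -49 and +10).
    Returns None when the requested window runs off the sequence.
    """
    if strand == "+":
        start = position + five_prime
        end = position + three_prime
    else:
        # On the minus strand the 5' border lies at higher coordinates.
        start = position - three_prime
        end = position - five_prime
    if start < 1 or end > len(sequence) or start > end:
        return None
    segment = sequence[start - 1:end]
    return segment if strand == "+" else reverse_complement(segment)


if __name__ == "__main__":
    toy = "AAAACGTTTT"
    print(extract_segment(toy, 5, -2, 2))        # plus-strand site at base 5
    print(extract_segment(toy, 6, -2, 2, "-"))   # minus-strand site at base 6
```

Because the borders are arguments rather than stored with the entry, the same position set can be reused with different window sizes.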

  3. EPD format: Example of an EPD entry
     ID LE_1A12 standard; single; PLN.
     XX
     AC EP35029;
     XX
     DT ??-JUN-1993 (Rel. 35, created)
     DT 07-OCT-2002 (Rel. 72, Last annotation update).
     XX
     DE 1-aminocyclopropane-1-carboxylic acid synthase 2
     OS Lycopersicon esculentum (tomato).
     XX
     HG none.
     AP none.
     NP none.
     XX
     DR EMBL; X59139.1; [-2883, 4361].
     DR SWISS-PROT; P18485; 1A12_LYCES.
     XX
     RN [1]
     RX MEDLINE; 1762159.
     RA Rottmann W.H., Peter G.F., Oeller P.W., Keller J.A., Shen N.F.,
     RA Nagy B.P., Taylor L.P., Campbell A.D., Theologis A.;
     RT "The 1-aminocyclopropane-1-carboxylate synthase in tomato is
     RT encoded by a multigene family whose transcription is induced
     RT during fruit and floral senescence";
     RL J. Mol. Biol. 222:937-961(1991).
     ...
     EPD format: Example of an EPD entry (continuation)
     ME Nuclease protection with homologous sequence ladder [1].
     ME Primer extension with homologous sequence ladder [1].
     XX
     SE acttcagtctttccccttatatatatccctcacattccttaattctcttACACCATAACA
     XX
     TX 1. Plant promoters
     TX 1.1. Chromosomal genes
     TX 1.1.4. Enzymes
     TX 1.1.4.6. Ethylene synthesis
     XX
     KW Fruit ripening, Ethylene biosynthesis, Lyase, Multigene family.
     XX
     FP Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029.
     XX
     DO Experimental evidence: 3h,6h
     DO Expression/Regulation: +fruit ripening;+wounding
     RF JMB222:937
     //
     Note: The line starting with the code FP defines the position of the TSS:
     EM:X59139.1   sequence identifier
     1             topology (1 = linear, 0 = circular)
     +             strand (+/-)
     2884          position within sequence
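As a rough sketch of how the FP line could be read by a program, the snippet below parses the example line with a regular expression. The pattern is inferred only from the single FP line shown in this entry, so it is an assumption rather than the official EPD grammar; the field names (`flags`, `entry_number`, and so on) are hypothetical labels chosen for the example, and the `:+S` field is kept as an opaque string.

```python
import re

# Pattern inferred from the one FP line shown in the example entry; real EPD
# files may use a stricter, column-based layout (assumption).
FP_PATTERN = re.compile(
    r"^FP\s+(?P<description>.*?)\s*:(?P<flags>\S+)\s+"
    r"(?P<database>[A-Z]+):(?P<seq_id>\S+)\s+"
    r"(?P<topology>[01])(?P<strand>[+-])\s+"
    r"(?P<position>-?\d+);\s*(?P<entry_number>\d+)\.\s*$"
)


def parse_fp_line(line):
    """Parse an FP line into a dict, or return None if it does not match."""
    match = FP_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["topology"] = "linear" if fields["topology"] == "1" else "circular"
    fields["position"] = int(fields["position"])
    return fields


if __name__ == "__main__":
    example = "FP Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029."
    print(parse_fp_line(example))
```

The parsed sequence identifier, strand, and position are the three values a segment-extraction step would need.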

  4. Signal Search Analysis Essentials
     History: Signal Search Analysis is an ancient method developed by myself in the early eighties in Max Birnstiel's lab in Zurich (first published in 1984).
     Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
     Recent event: adaptation of software to new environment, SSA web server, application to promoters and translational start sites.
     Note the difference: SSA programs serve to characterize
     • motifs that occur at constrained distances from sites,
     • not motifs that are over-represented within sequence sets.
     There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!
     Early comparative analysis of E. coli promoter sequences
     FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.
     "Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence
     5' T-A-T-Pu-A-T-G 3'
     3' A-T-A-Py-T-A-C 5'
     is implicated in the formation of a tight binary complex with RNA polymerase."
     Text and figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.

  5. SSA Signal Search Analysis
     Giovanna Ambrosini, ISREC (Swiss Institute for Experimental Cancer Research)
     • History: Signal Search Analysis is a method developed by P. Bucher in the early eighties (Bucher, P. and Bryan B., E.N.; Nucleic Acids Res., v.12 (1 Pt 1): 287–305)
     • Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
     • Signal search analysis programs:
     1. CPR: generates a "constraint profile" for the neighborhood of a functional site
     2. SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
     3. OProf: generates a "signal occurrence profile" for a particular motif
     4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif
     • Recent events: adaptation of software to new environment, SSA web server, application to promoters and translational start sites
     Signal Search Analysis: Sequence via a functional position set
     • Input data structure
       Primary experimental data (Functional Position Set): annotated functional positions in DNA sequences stored in a database
     • Work data
       A DNA sequence matrix: a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position

  6. Generating signal search data from a DNA sequence matrix Computing a constraint profile from signal search data

  7. Generating a constraint profile for plant promoters Input parameters for constraint profile for plant promoters.

  8. Input menu for signal search data
     Special collections: each line expands to all combinations of bases. For instance:
     NNXXNN -> NNAANN NNACNN NNAGNN NNATNN NNCANN NNCCNN NNCGNN NNCTNN NNGANN NNGCNN NNGGNN NNGTNN NNTANN NNTCNN NNTGNN NNTTNN
     Signal Occurrence Profile
     Is the trinucleotide TAT over-represented or under-represented in one of the windows? For the answer, see the next slide.
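The expansion of a special collection such as NNXXNN can be reproduced with a few lines of code. This is only an illustrative sketch: the function name is hypothetical, and it assumes that X marks the positions to enumerate while every other character (including N) is copied unchanged.

```python
from itertools import product


def expand_collection(pattern, wildcard="X", alphabet="ACGT"):
    """Expand every `wildcard` position into all combinations of bases."""
    slots = pattern.count(wildcard)
    expanded = []
    for combo in product(alphabet, repeat=slots):
        bases = iter(combo)
        # Replace wildcards left to right; copy all other characters as-is.
        expanded.append(
            "".join(next(bases) if ch == wildcard else ch for ch in pattern)
        )
    return expanded


if __name__ == "__main__":
    print(" ".join(expand_collection("NNXXNN")))
```

With two X positions the pattern expands to 4 × 4 = 16 motifs, in the same order as listed on the slide.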

  9. How to compute the local occurrence frequency?
     Note: Each "signal" occurrence (TAT) is counted only once per sequence. Windows containing N's do not add to the sample size.
     Here, occurrence frequencies are computed for non-overlapping windows. Signal occurrence profiles are usually computed from overlapping windows.
     Making a Signal Occurrence Profile for the eukaryotic TATA-box: Input data and parameters
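The counting rules in the note can be sketched as a small function. This is not the OProf program: the function name is hypothetical, and the sketch assumes that a signal counts for a window only when it lies entirely inside that window, which may differ from the convention used by the SSA software.

```python
def occurrence_profile(segments, motif, window, step=None):
    """Compute local occurrence frequencies of `motif` over aligned segments.

    `segments` are equal-length strings aligned on the functional site.
    Returns a list of (window_start, hits, sample_size, frequency) tuples,
    with 0-based window starts. `step` defaults to `window`, which gives
    non-overlapping windows; a smaller step gives overlapping windows.
    """
    if step is None:
        step = window
    length = len(segments[0])
    profile = []
    for start in range(0, length - window + 1, step):
        hits = 0
        sample_size = 0
        for segment in segments:
            piece = segment[start:start + window].upper()
            if "N" in piece:
                continue  # windows containing N do not add to the sample size
            sample_size += 1
            if motif in piece:
                hits += 1  # counted once per sequence, however often it occurs
        frequency = hits / sample_size if sample_size else None
        profile.append((start, hits, sample_size, frequency))
    return profile


if __name__ == "__main__":
    toy_segments = ["TATAAACC", "GGGGTATA", "TATNCCCC"]
    for row in occurrence_profile(toy_segments, "TAT", window=4):
        print(row)
```

In the toy input, the third segment is excluded from the first window because that window contains an N, so the frequency there is computed from two sequences rather than three.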

  10. Making Signal Occurrence Profile for the TATA-box for Eukaryotic Promoters: Result Concept of a locally over-represented sequence motif

  11. Definition of a Locally Over-represented Sequence Motif
     • Concept
       A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position.
       Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites.
     • Components of the formal motif description
     1. A weight matrix or consensus sequence defining the motif
     2. A cut-off value determining which subsequence constitutes a motif match
     3. A preferred region of occurrence defined by 5' and 3' borders relative to a functional site, e.g. a transcription initiation site
     The PATOP algorithm optimizes a locally over-represented sequence motif
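The three components of the formal motif description (weight matrix, cut-off, preferred region) can be combined in a short scanning routine. This is a sketch under stated assumptions, not the PATOP algorithm: the function names are hypothetical, the toy matrix values are invented for illustration and are not the published TATA-box weights, and the region borders are taken to enclose the whole match.

```python
# Toy weight matrix for the word TATA (invented values, illustration only):
# +1 for the preferred base in each column, -1 for any other base.
TOY_MATRIX = [
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
]


def score_window(window, matrix):
    """Sum the column weights of the bases in `window`."""
    return sum(column[base] for base, column in zip(window, matrix))


def find_matches(segment, site_index, matrix, cutoff, five_prime, three_prime):
    """Return (relative_start, score) for matches inside the preferred region.

    `site_index` is the 0-based index of the functional site in `segment`;
    `five_prime` and `three_prime` are signed offsets relative to that site.
    """
    width = len(matrix)
    first = max(site_index + five_prime, 0)
    last = min(site_index + three_prime - width + 1, len(segment) - width)
    matches = []
    for start in range(first, last + 1):
        window = segment[start:start + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = score_window(window, matrix)
        if score >= cutoff:
            matches.append((start - site_index, score))
    return matches


if __name__ == "__main__":
    toy_segment = "GGGGTATAGGGGGGGGGGGG"  # site assumed at index 15
    print(find_matches(toy_segment, 15, TOY_MATRIX, 4, -12, -8))
```

A sequence then counts as containing the motif if `find_matches` returns at least one hit; varying the matrix, the cut-off, and the two borders is what an optimizer would adjust.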

  12. A weight matrix definition for the TATA-box motif
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.
     Weight matrix for the Initiator (Cap-signal)
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.

  13. Positional distribution of "site-selector" promoter elements
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.
     Weight matrix definition for the CCAAT-box motif
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.

  14. A weight matrix definition for the GC-box motif
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.
     Positional distributions of the promoter upstream elements
     See also: Bucher 1990, J. Mol. Biol. 212, 563-578.
