EMBnet Course: PERL for biomedical researchers
Command line tools scripting
Basel, 11 September 2008
Lorenza Bordoli, Swiss Institute of Bioinformatics

Outline
• Combining different programs with Perl: writing a short pipeline in Perl
• UniProt and its controlled vocabulary
• Swissknife library
• Overview of the programs of the pipeline: Blast, seqret (EMBOSS), ClustalW, Tree-Puzzle
• Details of the Perl script

Combining different programs with Perl

[Figure: input → program1 → output → program2 → output → program3 → output, all embedded in a single Perl script]

www.bc2.unibas.ch
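
The idea in the figure can be sketched in a few lines of Perl. This is only a minimal illustration, not the course script: `run_step` and `run_pipeline` are hypothetical helper names, and because program1 to program3 are placeholders on the slide, the Perl interpreter itself (`$^X`) stands in for them so that the sketch runs without any bioinformatics tools installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Run one external command; stop the whole pipeline if it fails.
# The list form of system() bypasses the shell, so file names are
# passed to the program unchanged.
sub run_step {
    my @cmd = @_;
    system(@cmd) == 0 or die "Step failed (exit status $?): @cmd\n";
    return 1;
}

# Stand-in for program1..program3 (assumption: the real tools are not
# available here). It copies its input file to its output file and
# prefixes every line with the step name.
my $stand_in = 'open my $in, "<", $ARGV[0] or die; '
             . 'open my $out, ">", $ARGV[1] or die; '
             . 'print $out "$ARGV[2]: $_" while <$in>;';

# Write the input to a file, then let each step read the file
# produced by the previous step.
sub run_pipeline {
    my ($input_text) = @_;
    my $dir     = tempdir(CLEANUP => 1);
    my $current = "$dir/input.txt";

    open my $fh, '>', $current or die "Cannot write $current: $!\n";
    print $fh $input_text;
    close $fh;

    for my $name (qw(program1 program2 program3)) {
        my $next = "$dir/$name.out";
        run_step($^X, '-e', $stand_in, $current, $next, $name);
        $current = $next;
    }

    # Return the content of the last output file
    open my $res, '<', $current or die "Cannot read $current: $!\n";
    local $/;
    my $output = <$res>;
    close $res;
    return $output;
}

print run_pipeline("sequence\n");
```

In a real pipeline each `run_step` call would name an actual program and its options instead of the stand-in; checking the exit status after every step keeps a failed program from silently feeding an empty file to the next one.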

[Figure: pipeline overview. Mycobacterium tuberculosis (MT) protein in Swiss-Prot → swissknife → kinase domain → Blast against the Swiss-Prot protein sequence DB (H. sapiens) → MT protein + human homologous sequences → seqret → sequences in multi-FASTA file → ClustalW → multiple sequence alignment of MT and human kinase homologues → Tree-Puzzle → phylogenetic tree]

UniProt
www.uniprot.org

UniProt
• The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.
• The UniProt Knowledgebase consists of two sections:
  – Swiss-Prot: a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and
  – TrEMBL: a section with computationally analyzed records that await full manual annotation.

TrEMBL
• TrEMBL is the computer-annotated section of the UniProt Knowledgebase. It contains translations of all coding regions in the DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProtKB, which are not yet integrated into Swiss-Prot.
• TrEMBL allows these sequences to be made publicly available quickly without diluting the high-quality annotation found in Swiss-Prot.
• The information in a TrEMBL entry is initially derived directly from the underlying DDBJ/EMBL/GenBank nucleotide entry, and the quality of the data depends directly on the information provided by the submitter of the nucleotide entry. This information may be enhanced later by automatic annotation procedures; if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

Swiss-Prot
• Swiss-Prot is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively since 1987 by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation, the European Bioinformatics Institute (EBI)). The Swiss-Prot Protein Knowledgebase consists of sequence entries. Sequence entries are composed of different line types, each with its own format.
• Swiss-Prot distinguishes itself by four criteria:
  1. Annotations
  2. Minimal redundancy
  3. Integration with other databases
  4. Documentation

Swiss-Prot – 1. Annotations
In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation.
1. For each sequence entry the core data consists of:
  • The sequence data;
  • The citation information (bibliographical references);
  • The taxonomic data (description of the biological source of the protein).
2. The annotation consists of the description of the following items:
  • Function(s) of the protein;
  • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor;
  • Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains and kringle;
  • Secondary structure, e.g. alpha helix, beta sheet;
  • Quaternary structure, e.g. homodimer, heterotrimer, etc.;
  • Similarities to other proteins;
  • Disease(s) associated with any number of deficiencies in the protein;
  • Sequence conflicts, variants, etc.

UniProt – Structure of a sequence entry
Each sequence entry is composed of lines. Different types of lines, each with its own format, are used to record the various data that make up the entry:

    ID   GRAA_HUMAN              Reviewed;         262 AA.
    AC   P12544; Q6IB36;
    DT   01-OCT-1989, integrated into UniProtKB/Swiss-Prot.
    DT   01-OCT-1989, sequence version 1.
    DT   10-JUN-2008, entry version 103.
    DE   RecName: Full=Granzyme A;
    DE            EC=3.4.21.78;
    DE   AltName: Full=Granzyme-1;
    DE   AltName: Full=Cytotoxic T-lymphocyte proteinase 1;
    DE   AltName: Full=Hanukkah factor;
    DE            Short=H factor;
    DE            Short=HF;
    DE   AltName: Full=CTL tryptase;
    DE   AltName: Full=Fragmentin-1;
    DE   Flags: Precursor;
    GN   Name=GZMA; Synonyms=CTLA3, HFSP;
    OS   Homo sapiens (Human).
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    OC   Catarrhini; Hominidae; Homo.
    OX   NCBI_TaxID=9606;
    RN   [1]
    RP   NUCLEOTIDE SEQUENCE [MRNA].
    […]

UniProt – Sequence entry lines
http://www.expasy.org/sprot/userman.html
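
Because every line starts with a two-letter line code, the entry format is easy to take apart by hand. The following is a minimal sketch of that idea, not part of the course material: `parse_entry` is a hypothetical helper that groups the lines of one entry by line code, assuming the usual flat-file layout in which the code is followed by three blanks. The sample text is a shortened copy of the GRAA_HUMAN entry shown above.

```perl
use strict;
use warnings;

# Group the lines of one flat-file entry by their two-letter line code.
# Returns a hash reference: line code => array of line contents.
sub parse_entry {
    my ($text) = @_;
    my %lines;
    for my $line (split /\n/, $text) {
        # Line code = two capital letters followed by three blanks;
        # anything else (e.g. sequence data lines) is skipped.
        next unless $line =~ /^([A-Z]{2}) {3}(.*)$/;
        push @{ $lines{$1} }, $2;
    }
    return \%lines;
}

# Shortened copy of the example entry
my $entry = <<'END';
ID   GRAA_HUMAN              Reviewed;         262 AA.
AC   P12544; Q6IB36;
DE   RecName: Full=Granzyme A;
GN   Name=GZMA; Synonyms=CTLA3, HFSP;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
END

my $lines = parse_entry($entry);

# The AC line holds a semicolon-separated list; the first is the primary one
my @accessions = map { split /;\s*/, $_ } @{ $lines->{AC} };

print "Primary accession: $accessions[0]\n";
print "Organism: $lines->{OS}[0]\n";
```

A hand-written parser like this is fine for a quick look at one or two line types; for anything that has to respect the finer points of each line format, a dedicated library is the safer choice.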

UniProt – Sequence entry
• The entries in the UniProt Knowledgebase are structured so as to be usable by human readers as well as by computer programs.
• The explanations, descriptions, classifications and other comments are in ordinary English.
• Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.
• Example: http://www.uniprot.org/uniprot/P65726

Swiss-Prot – Swissknife
• You can write your own parser to extract information from the UniProt database, or use:
• Swissknife, an object-oriented Perl library to handle Swiss-Prot entries
• http://swissknife.sourceforge.net/docs/

Swiss-Prot – Swissknife
SWISS::Entry
• Main module to handle SWISS-PROT entries. One Entry object represents one SWISS-PROT entry and provides an API for its modification.

    use SWISS::Entry;

    # Read an entire record at a time
    # (records are terminated by a line containing only "//")
    $/ = "\n//\n";

    while (<>) {
        my $entry = SWISS::Entry->fromText($_);
        print $entry->AC, "\n";
    }

Swiss-Prot – Swissknife

    use SWISS::Entry;
    use SWISS::OCs;

    # Read an entire record at a time
    local $/ = "\n//\n";

    while (<>) {
        # Read the entry
        my $entry = SWISS::Entry->fromText($_);

        # Print the primary accession number of each entry
        print $entry->AC, ":\n";

        # Print the organism classification (OC) elements of each entry
        my @OC = $entry->OCs->elements();
        foreach my $oc (@OC) {
            print "$oc\t";
        }
        print "\n\n";
    }

Swiss-Prot – Swissknife

    use SWISS::Entry;
    use SWISS::OCs;
    use SWISS::FTs;

    […]

    # Print the FT lines of type "DOMAIN" of the entry
    foreach my $ft ( $entry->FTs->get('DOMAIN') ) {
        my $FTkey  = $ft->[0];    # feature key
        my $FTfrom = $ft->[1];    # start position
        my $FTto   = $ft->[2];    # end position
        my $FTdes  = $ft->[3];    # description
        print "FT: $FTdes $FTkey from: $FTfrom to: $FTto\n";
    }

Blast

    $ blastall -p blastp -d sprot -i sequence.txt -m 9

blastall 2.2.16 arguments:

    -p  Program Name [String]
    -d  Database [String], default = nr
    -i  Query File [File In], default = stdin
    -e  Expectation value (E) [Real], default = 10.0
    -m  alignment view options [Integer]:
          0 = pairwise,
          1 = query-anchored showing identities,
          2 = query-anchored no identities,
          3 = flat query-anchored, show identities,
          4 = flat query-anchored, no identities,
          5 = query-anchored no identities and blunt ends,
          6 = flat query-anchored, no identities and blunt ends,
          7 = XML Blast output,
          8 = tabular,
          9 = tabular with comment lines,
         10 = ASN, text,
         11 = ASN, binary
    […] and more options.

Blast

    $ blastall -p blastp -d sprot -i sequence.txt -m 9

    Program   Query                       Database
    blastp    protein                 vs  protein
    blastn    nucleotide              vs  nucleotide
    blastx    nucleotide → protein    vs  protein
    tblastn   protein                 vs  nucleotide → protein
    tblastx   nucleotide → protein    vs  nucleotide → protein
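
The tabular views (`-m 8` and `-m 9`) are the easiest ones to read from a script. The sketch below is an illustration under stated assumptions rather than the course script: `parse_tabular` is a hypothetical helper, the field names in `@FIELDS` are labels chosen here for the twelve tab-separated columns of the tabular format, and the sample lines use made-up identifiers and values so that the code runs without a Blast installation.

```perl
use strict;
use warnings;

# Labels (chosen for this sketch) for the 12 tab-separated columns
# of blastall's tabular output
my @FIELDS = qw(query subject identity length mismatches gap_openings
                q_start q_end s_start s_end evalue bit_score);

# Parse tabular Blast output and keep the hits whose E-value does not
# exceed $max_evalue. Comment lines (-m 9) start with "#" and are skipped.
sub parse_tabular {
    my ($text, $max_evalue) = @_;
    my @hits;
    for my $line (split /\n/, $text) {
        next if $line =~ /^#/ or $line !~ /\S/;
        my %hit;
        @hit{@FIELDS} = split /\t/, $line;
        push @hits, \%hit if $hit{evalue} <= $max_evalue;
    }
    return @hits;
}

# Made-up sample output (NOT real Blast results), used only for illustration
my $sample = join "\n",
    "# made-up comment line",
    join("\t", qw(QUERY1 HUMAN_A 45.20 250 130 4 10 255 20 268 3e-60 228)),
    join("\t", qw(QUERY1 HUMAN_B 28.10 120 80 3 15 130 40 158 0.5 30.1)),
    "";

# With a Blast installation, the text could instead be read from the
# program itself, for example:
#   open my $blast, '-|', qw(blastall -p blastp -d sprot -i sequence.txt -m 9)
#       or die "Cannot run blastall: $!\n";
#   my $text = do { local $/; <$blast> };

for my $hit (parse_tabular($sample, 1e-5)) {
    print "$hit->{subject}\t$hit->{evalue}\n";
}
```

With the sample above and a threshold of 1e-5, only the first made-up hit passes the filter. The subject identifiers collected this way are what a later step would need in order to fetch the homologous sequences.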