Programming for Bioinformatics Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de
Organization
§ Lectures: Thursday 9:20 – 10:50, Auditorium right CRTD
  Prof. Michael Schroeder: michael.schroeder@tu-dresden.de
  Predoc Melissa Adasme: melissa.adasme@tu-dresden.de
§ Labs: Thursday 11:10 – 12:40, E030 (PC Pool) BIOTEC
  Predoc Negin Malekian: negin.malekian@tu-dresden.de
§ Group projects: 31st January, 9:20 – 10:50 (task description will be released 3 weeks before)
The module…
§ will teach you basic programming skills relevant to bioinformatics, which will enable you to actively develop bioinformatics tools.
§ will take a problem-driven approach.
§ will present bioinformatics problems and show how to solve them using existing online tools and how to implement such tools.
§ will revisit some of the problems and databases discussed in applied bioinformatics.
§ will be a very practical, hands-on introduction to basic computer science tools such as working on the command line, programming in Python, and using relational databases.
Objectives
§ You will be able to automate simple repetitive information retrieval tasks
§ You will be able to write simple programs in Python
§ You will be able to work with relational databases
§ You will appreciate the principles, limits, and possibilities of programming
§ You will be able to formulate biological questions as information processing problems
§ You will understand when and how programming can help to automate bioinformatics problems
Module Structure
■ Introduction
■ Databases
  § Introduction to SQL
  § A Little Science
■ Introduction to Python
■ Programming concepts
  § Data types and loops
  § Sequences and lists
  § Patterns and functions
  § Dictionaries & More Concepts
  § Data Visualization
■ More Python
  § MySQL Database Connection
  § REST Queries
  § Dyn. Progr. & Clustering
  § A Little Exercise
■ Introduction to PyMOL
  § Commands and Scripting
  § PyMOL Movie Project
■ Revision Class
Resources
§ Online resources for Python and MySQL are available on the course web page:
  http://www.biotec.tu-dresden.de/de/forschung/schroeder/teaching/programming-for-bioinformatics.html
Resources: Python
§ Python in a Nutshell, Alex Martelli (O’Reilly)
§ Python Cookbook *, David Beazley (O’Reilly)
  The publisher O’Reilly has many good general programming (e-)books on Linux, Python, etc.
§ Learn Python the Hard Way *, Zed A. Shaw
§ Think Python: How to Think Like a Computer Scientist *, Allen B. Downey
* free HTML version
Resources: MySQL
§ W3Schools SQL (interactive online tutorial)
§ MySQL Cookbook, Paul DuBois (O'Reilly)
§ Jump Start MySQL, Timothy Boronczyk (O'Reilly)
§ MySQL Reference Manual (includes tutorials)
Lab Exercises
§ Each week you will get exercises to do during the lab (recommended) or to finish on your own during the week
§ Results will be discussed in the lab the following week. Questions on exercises can be asked at the beginning of the lecture or the labs
§ Using the machines in the PC pool is recommended
  § Access to databases
  § Availability of Python modules
§ No marks for the exercises
§ Doing all exercises each week makes the exam easier
§ You should try yourself before asking others
Programming Projects
§ Goal I: Demonstrate your ability to use SQL and Python
§ Goal II: Apply your skills to a real-world problem
§ You will work in a team on a biological problem
  § Implementation of small workflows
  § Integration of data from various sources
  § Visualization of data
  § Explanation of your approach to others (5-minute presentation)
§ Possible tasks:
  § What is the largest ligand that can bind a metalloprotease?
Suggestions for tasks?
Motivation: Databases
In the last term,
§ we accessed most information online via the web
§ we interacted directly and manually with databases and tools
§ we had to manually submit queries, interpret results, select interesting results, cut and paste them, and submit queries again, …
Pro:
§ Reasonably easy to get hold of information
Con:
§ Not possible to ask many queries
§ Queries limited by the interface provided by the web page
§ Difficult or impossible to integrate information from different sites
In this term,
§ we will look at the databases underlying the online front ends
§ How is the data stored internally?
§ How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries and large queries, and integrate different systems?
What actually happens
You are limited by what the web server allows you to ask.
Example CATH: you can search by
• PDB ID,
• CATH code, or
• general text
But you cannot ask:
• In how many different PDB structures is there a P-loop domain?
• Is there a PDB entry with a P-loop and a DNA-binding domain?
• How many different superfamilies does the largest structure in PDB have?
With direct access to the underlying database you could answer all these questions (and many more).
Motivation: SCOP as Relational Database
§ We worked with SCOP, the Structural Classification of Proteins
§ Family: >30% sequence identity
§ Superfamily: similar structure and function (possibly below 30% sequence identity)
[Figure labels: Structure similarity, Sequence identity]
Motivation: Databases
We wish to answer the following questions:
§ How many families and superfamilies are there?
§ Do all superfamilies have roughly the same number of families?
§ How many families does the immunoglobulin superfamily have?
§ Which superfamily has the most families, and how many?
§ What percentage of superfamilies have only one family?
§ Which PDB structure has the largest number of distinct superfamilies?
§ What percentage of PDB structures have only one type of superfamily, and what percentage have at least two?
§ Which is the most popular superfamily?
§ Are all superfamilies equally likely to co-occur, or do they have preferences?
§ Which superfamily has the most co-occurrence partners?
§ Are the number of co-occurrence partners and the frequency of a superfamily correlated?
Can we do it with the knowledge you have so far?
What is a Database?
§ SCOP contains relevant information, but we cannot answer the above questions through the web interface of SCOP
§ The problem is that we do not have access to the underlying database
What is a database?
A database provides…
§ Logical organization of data
  § Data models, schema design, dictionaries
§ Physical organization of data
  § Fast retrieval, indexing, compact storage of data
Relational Database
Central Idea: Data as relations in a table
§ E.g. Employee
+-------+------+---------+----------+
| id    | name | salary  | role     |
+-------+------+---------+----------+
| 46457 | pete | 50.000  | director |
| 46458 | jane | 60.000  | nurse    |
| 46459 | asif | 70.000  | driver   |
+-------+------+---------+----------+
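A minimal sketch of the Employee relation as an actual table. It uses Python's built-in sqlite3 module with an in-memory database so that it runs without a server; this is a stand-in for the MySQL setup used in the course, not the course's own configuration. The salaries are read as whole numbers (50.000 taken to mean 50000), which is an assumption, and the query itself is only an illustration.

```python
import sqlite3

# In-memory SQLite database; stands in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, role TEXT)"
)

# One tuple per row of the relation shown on the slide.
# Salaries such as 50.000 are interpreted as 50000 (assumption).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (46457, "pete", 50000, "director"),
        (46458, "jane", 60000, "nurse"),
        (46459, "asif", 70000, "driver"),
    ],
)

# A query selects rows (WHERE) and columns (SELECT) of the relation.
high_earners = conn.execute(
    "SELECT name, role FROM employee WHERE salary >= ? ORDER BY salary",
    (60000,),
).fetchall()
print(high_earners)
```

The `?` placeholders let the database driver insert the values, which avoids building SQL strings by hand.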
Relational Database
Central Idea: Data as relations in a table
§ E.g. SCOP, Structural Classification of Proteins
+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A:                              |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A:                              |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A:                              |
+-------+------+---------+---------+--------------------------------------+
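Once the SCOP rows sit in a table, counting questions become short queries. The sketch below loads only the ten rows shown above into an in-memory SQLite table (again a stand-in for MySQL). The table name `des` and the trick of matching families to a superfamily by the prefix of the `sccs` code are assumptions made for this illustration; the real database may offer a more direct way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE des ("
    "id INTEGER PRIMARY KEY, type TEXT, sccs TEXT, sid TEXT, description TEXT)"
)

# The ten example rows from the slide, nothing more.
rows = [
    (46457, "cf", "a.1", "-", "Globin-like"),
    (46458, "sf", "a.1.1", "-", "Globin-like"),
    (46459, "fa", "a.1.1.1", "-", "Truncated hemoglobin"),
    (46460, "dm", "a.1.1.1", "-", "Truncated hemoglobin"),
    (46461, "sp", "a.1.1.1", "-", "Ciliate (Paramecium caudatum)"),
    (14982, "px", "a.1.1.1", "d1dlwa_", "1dlw A:"),
    (46462, "sp", "a.1.1.1", "-", "Green alga (Chlamydomonas eugametos)"),
    (14983, "px", "a.1.1.1", "d1dlya_", "1dly A:"),
    (63437, "sp", "a.1.1.1", "-", "Mycobacterium tuberculosis"),
    (62301, "px", "a.1.1.1", "d1idra_", "1idr A:"),
]
conn.executemany("INSERT INTO des VALUES (?, ?, ?, ?, ?)", rows)

# How many entries are there of each type (cf, sf, fa, dm, sp, px)?
type_counts = dict(
    conn.execute("SELECT type, COUNT(*) FROM des GROUP BY type").fetchall()
)

# How many families does each superfamily have?
# Assumption: a family belongs to the superfamily whose sccs code
# is a prefix of its own (a.1.1.1 belongs to a.1.1).
families_per_sf = conn.execute(
    """
    SELECT sf.sccs, COUNT(fa.id)
    FROM des AS sf
    LEFT JOIN des AS fa
           ON fa.type = 'fa' AND fa.sccs LIKE sf.sccs || '.%'
    WHERE sf.type = 'sf'
    GROUP BY sf.sccs
    """
).fetchall()

print(type_counts)
print(families_per_sf)
```

The LEFT JOIN keeps superfamilies that have no family row at all, so they would be reported with a count of 0 instead of disappearing from the result.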
SCOP database
SCOP Tables
Cla
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
Des
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+
Astral
+---------+---------+-----------------------------------------------------------+
| sid     | sccs    | seq                                                       |
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+
Subchain
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
| 1  | 14982 | A        |       |      |
+----+-------+----------+-------+------+
Do you see any relation between tables?
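One possible reading of the example rows: the numeric columns `cl` to `px` in Cla appear to hold `id` values from Des, Subchain.px matches Cla.px, and Astral shares `sid` with Cla. The sketch below follows two of these links with a join, using in-memory SQLite as a stand-in for MySQL. The lower-case table names and the column types are assumptions, and only the rows shown above are loaded (Astral is left out because its sequence is truncated on the slide).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "begin" and "end" are SQL keywords, so they are quoted as identifiers.
conn.executescript(
    """
    CREATE TABLE cla (
        sid TEXT PRIMARY KEY, pdb_id TEXT, sccs TEXT,
        cl INTEGER, cf INTEGER, sf INTEGER, fa INTEGER,
        dm INTEGER, sp INTEGER, px INTEGER
    );
    CREATE TABLE des (
        id INTEGER PRIMARY KEY, type TEXT, sccs TEXT,
        sid TEXT, description TEXT
    );
    CREATE TABLE subchain (
        id INTEGER PRIMARY KEY, px INTEGER, chain_id TEXT,
        "begin" INTEGER, "end" INTEGER
    );
    """
)

conn.execute(
    "INSERT INTO cla VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("d1dlwa_", "1dlw", "a.1.1.1",
     46456, 46457, 46458, 46459, 46460, 46461, 14982),
)
conn.execute(
    "INSERT INTO des VALUES (?, ?, ?, ?, ?)",
    (46456, "cl", "a", "-", "All alpha proteins"),
)
# begin and end are empty on the slide, stored here as NULL.
conn.execute("INSERT INTO subchain VALUES (1, 14982, 'A', NULL, NULL)")

# Follow cla.cl -> des.id to get the class name,
# and cla.px -> subchain.px to get the chain.
domain_info = conn.execute(
    """
    SELECT cla.sid, cla.pdb_id, subchain.chain_id, des.description
    FROM cla
    JOIN des      ON des.id = cla.cl
    JOIN subchain ON subchain.px = cla.px
    """
).fetchall()
print(domain_info)
```

Joining on a different Cla column (for example `cla.sf` instead of `cla.cl`) would look up the superfamily description in the same way, provided the matching Des row is present.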