WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008
A joint project between Nile University, Microsoft Egypt, and Cairo Microsoft Innovation Center
Mohamed Abouelhoda, Nile University
Nile University
• Established in 2006 as a first non-profit research university
• Specialized in
  • Information and Communication Technology and related fields and their applications
• Research centers
  • Center for Informatics Sciences (CIS)
  • Center for Wireless Intelligent Networks (WINC)
  • Center for Innovation & Competitiveness (CIC)
• Modern Master Programs
  • 9 Master programs in IT, Micro-electronics, Management, Business, Transportation systems, and construction management
• Recent undergraduate program
  • Engineering and management programs
Research Groups
• Established in June 2008
• 9 senior scientists, 36 junior scientists
• Mission: Address information-rich problems of importance to the region and Egypt
State of the art
[Figure: Scientific Discovery & Business Insights. Scientists and knowledge workers (bioinformatics, medical imaging, data mining) use data analysis, decision making and collaboration tools on top of data management and integration tools. These draw on local resources (computing, data and information, HPC, software tools) and, via ubiquitous networking, on distributed and remote resources (distributed computing, distributed data sources such as SQL, web sources, images and text, sensors and devices, software access).]
Infrastructure of CIS
Local CIS resources (first phase):
• 21 servers with 160 AMD/Intel cores and 1 TB total RAM
• 24 TB total storage
Extensible resources via partners
• Microsoft, Imperial College, Bridge Project
Shared middleware: standardized SOA interfaces, service composition, utility-based computing, ….
[Figure: bioinformatics applications at Nile University connected through the shared middleware to Imperial College London, Microsoft CMIC, Bibliotheca Alexandrina, the Bridge Project, and other resources]
Group Leader: Mohamed Abouelhoda
Co-Workers: 7 RAs
Projects and Research:
• NUBIOS: Nile University Bioinformatics Server
• Plant, animal, bacterial, and virus computational genomics
• Cancer Bioinformatics
• High Performance Computing for Bioinformatics Applications
Collaborators:
Academic
• Imperial College, Prof. Hani Gabra
• National Cancer Institute, Egypt
• Bielefeld University, Prof. Robert Giegerich
• Agriculture Research Institute
Industry
• Cairo Microsoft Innovation Center (CMIC), Egypt
• IBM
http://www.bioinf.nileu.edu.eg
WinBioinfTools: Bioinformatics Tools for Windows High Performance Computing Server 2008
Motivation
Bioinformatics tools are essential for current molecular biology research.
Obstacles:
• Open-source bioinformatics tools are usually written for Unix/Linux, which is not widely used in the life science community
• Data sizes have become prohibitively large to analyze on an ordinary PC
Project Objectives
Provide the biological community with WinBioinfTools, a tool set that
- runs under MS Windows
- runs on a computer cluster (Windows HPC Server 2008)
Primary focus on sequence analysis and comparative genomics
- Distributed Sequence Alignment
- Distributed BLAST (Basic Local Alignment Search Tool)
- CoCoNUT (Computational Comparative GeNomics Utilities Toolkit)
Compare the performance of the Windows-based versions of these tools with the corresponding Linux-based versions.
Resources
Human Resources
o Mohamed Abouelhoda, Hisham Mohamed (Nile University)
o Mohamed Zahran (collaborator, New York City University)
o Tamer Shaalan (CMIC)
CMIC Lab:
• Cluster of 4 nodes (2 quad-core 2.6 GHz processors, 16 GB RAM, 250 GB HD)
• 1 Gigabit Ethernet network
• Windows HPC Server 2008, with HPC Pack 2008
Why Sequence Analysis First?
- We focused on sequence analysis tools:
  1. Comparing short sequences: Parallel Sequence Alignment
  2. Comparing large genomic sequences: Parallel CoCoNUT
  3. Database search: Parallel BLAST
- Sequence analysis helps in elucidating the function and structure of genomic regions
- An example pipeline used in practice is HAVANA (Human And Vertebrate Analysis aNd Annotation)
[Figure: database search, genome comparison, sequence alignment]
Cluster Modes of Operation
1. Load balancing: task-level parallelism
   - Most bioinformatics problems can be solved well in this mode because their data is decomposable
2. (High-performance) compute cluster: instruction-level parallelism
   - Problems in this category are very critical and form a bottleneck
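The load-balancing mode can be illustrated with a minimal Python sketch. It is not part of WinBioinfTools: the per-sequence task (`gc_content`), the function `run_tasks`, and the use of a thread pool as a stand-in for cluster nodes and a job scheduler are all assumptions made for illustration. The point is only that independent tasks over decomposable data can be handed to whichever worker is free.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq: str) -> float:
    """Fraction of G/C characters in one sequence (a stand-in for any per-sequence task)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def run_tasks(sequences, workers=4):
    """Process every sequence independently on a pool of workers.

    The thread pool plays the role of the cluster's job scheduler here
    (an assumption for illustration): no task depends on another, so
    each one goes to a free worker. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


if __name__ == "__main__":
    print(run_tasks(["GGCC", "ATAT", "GATC"]))
```

Because the tasks share no state, adding workers needs no synchronization beyond collecting the results, which is what distinguishes this mode from the compute-cluster mode.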
Basic features of Windows (HPC) Server 2008
• High performance: 64-bit version with access to large memory (16, 32, 64, 128 GB RAM)
• Cluster and multi-core support
• Cluster management and monitoring tools
• Load balancing: job scheduler
• Parallel computing: MS-MPI
• Interoperability: SUA (Subsystem for UNIX-based Applications); Cygwin also works
• Virtualization: Hyper-V for virtual machine support
Sequence Alignment
Sequence Alignment
S1 = TACAATCAA
S2 = TCACTCAC

S1: T_ACAATCAA
S2: TCAC__TCAC

The alignment contains a mismatch (A/C in the last column) and insertions/deletions (the columns with a gap, marked _).
Dynamic programming algorithms take $O(n^k)$ time ($k$ = number of genomes, $n$ = average genome length), i.e. $O(n^2)$ for two sequences (Needleman-Wunsch, 1970)
Dynamic Programming Algorithm
Sequence alignment aims at maximizing the similarities between sequences.
Optimal sequence alignment can be computed using dynamic programming.
For two sequences, the best alignment is computed by filling a 2D matrix, where the score at cell $(i, j)$ is computed as follows:

\[
\mathrm{score}(i, j) = \min
\begin{cases}
\mathrm{score}(i-1, j-1) + 1, & \text{if } S_1[i] \neq S_2[j] \\
\mathrm{score}(i-1, j-1), & \text{if } S_1[i] = S_2[j] \\
\mathrm{score}(i-1, j) + 1 & \text{(character deletion cost)} \\
\mathrm{score}(i, j-1) + 1 & \text{(character deletion cost)}
\end{cases}
\]
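The matrix filling can be sketched in Python as follows. This is a minimal illustration, not the WinBioinfTools implementation: the function name `alignment_cost`, the unit costs, and the boundary initialization (aligning a prefix against the empty string costs its length) are assumptions chosen to match the min-cost form of the recurrence.

```python
def alignment_cost(s1: str, s2: str) -> int:
    """Fill the 2D score matrix row by row and return the optimal alignment cost.

    Assumes unit costs: 1 for a mismatch, 1 for each inserted or deleted
    character, 0 for a match.
    """
    n, m = len(s1), len(s2)
    # score[i][j] = cost of aligning the first i characters of s1
    # with the first j characters of s2
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i  # i characters of s1 aligned against gaps
    for j in range(1, m + 1):
        score[0][j] = j  # j characters of s2 aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            score[i][j] = min(
                diagonal,              # match or mismatch
                score[i - 1][j] + 1,   # gap in s2
                score[i][j - 1] + 1,   # gap in s1
            )
    return score[n][m]


if __name__ == "__main__":
    print(alignment_cost("TACAATCAA", "TCACTCAC"))
```

For the two example sequences TACAATCAA and TCACTCAC, an alignment with three gap columns and one mismatch has cost 4 under these unit costs, so the optimal cost returned by the sketch is at most 4.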
Parallelization of the DP Algorithm
• The cluster nodes cooperate in filling the matrix (Compute Cluster Model)
• The filling proceeds diagonal-wise, and the master node synchronizes the filling
• The complexity reduces to $O(n^2/k + t k')$, where $t$ is the communication time, $k$ is the number of cores, and $k'$ is the number of cluster nodes
[Figure: the DP matrix partitioned among node 1 to node 4, annotated with the recurrence for score(i, j) and a synchronizing line, synchronized by the master node]
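The diagonal-wise filling can be sketched in Python as below. This is a sequential simulation under stated assumptions, not the project's MPI code: the function name `alignment_cost_wavefront`, the round-robin split of each anti-diagonal among `workers`, and the unit costs are all hypothetical. It shows why the scheme works: every cell on anti-diagonal i + j = d depends only on cells of diagonals d - 1 and d - 2, so the cells of one diagonal can be computed independently, with one synchronization per diagonal.

```python
def alignment_cost_wavefront(s1: str, s2: str, workers: int = 4) -> int:
    """Fill the DP matrix one anti-diagonal at a time (unit costs assumed)."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i
    for j in range(m + 1):
        score[0][j] = j

    for d in range(2, n + m + 1):
        # All interior cells (i, j) with i + j == d
        cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        # Round-robin split: chunk w would be sent to worker w on a cluster.
        # Here the chunks are processed one after another (simulation only).
        chunks = [cells[w::workers] for w in range(workers)]
        for chunk in chunks:
            for i, j in chunk:
                diagonal = score[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
                score[i][j] = min(diagonal, score[i - 1][j] + 1, score[i][j - 1] + 1)
        # Synchronization point: diagonal d must be complete before d + 1 starts.
    return score[n][m]


if __name__ == "__main__":
    print(alignment_cost_wavefront("TACAATCAA", "TCACTCAC", workers=4))
```

The result does not depend on the number of workers, since the order of cells within a diagonal is irrelevant. On a real cluster the per-diagonal synchronization is where the communication term of the complexity comes from.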
Experimental Results
The running times (in seconds) for pairwise sequence alignment on one node and on 4 nodes.

                    Time on 4 nodes
Sequence Length     Communication time   Processing time   Total      Time on one node
100 X 100           0.03623              0.000665          0.001765   0.0034
1000 X 1000         0.152653             0.005             0.014      0.04
5000 X 5000         0.142311             0.3               1          3.9
10000 X 10000       1.19                 1.1               2.6        8.4
20000 X 20000       3.679                2                 8          18
30000 X 30000       4                    11                15         40

- The first column lists the sequence sizes; 100 X 100, for example, means that we aligned two sequences of 100 characters each.
Experimental Results
- The x-axis lists the sequence sizes; 100x100, for example, means that we aligned two sequences of 100 characters each.
Database Search
Querying Biological Databases using BLAST
Biological database formatting and querying:
1. formatting
2. query
3. results

Example result (alignment of RBP and lactoglobulin):

 1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
   . ||| | . |. . . | : .||||.:| :
 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
   : | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
   || ||. | :.|||| | . .|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
Large Scale Application of BLAST
• BLAST (Basic Local Alignment Search Tool): given a biological sequence, it searches for similar (sub)regions in the database (Altschul et al. 1997)
• The database size is extremely large
• The search time is proportional to the database length
• A computer cluster provides an ideal solution for speeding up BLAST searches
[Figure: Internet queries, institution, enterprise]
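Since search time grows with database length, one common way to use a cluster is to partition the database, search each part on a different node, and merge the hits. The slides do not say how the distributed BLAST in this project is organized, so the Python sketch below is only an illustration of that general idea. It does not run BLAST: the scoring (number of shared k-mers) is a toy stand-in, and `kmers`, `search_chunk`, `distributed_search`, and the thread pool standing in for cluster nodes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq: str, k: int) -> set:
    """All substrings of length k in seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_chunk(query: str, chunk, k: int = 3):
    """Toy search of one database part: score = number of k-mers shared with the query."""
    query_kmers = kmers(query, k)
    hits = []
    for name, seq in chunk:
        shared = len(query_kmers & kmers(seq, k))
        if shared:
            hits.append((shared, name))
    return hits


def distributed_search(query: str, database, nodes: int = 4, k: int = 3):
    """Split the database into `nodes` parts, search them concurrently, merge the hits."""
    chunks = [database[w::nodes] for w in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partial = list(pool.map(lambda chunk: search_chunk(query, chunk, k), chunks))
    merged = [hit for hits in partial for hit in hits]
    # Best score first; ties broken by sequence name
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))


if __name__ == "__main__":
    db = [("a", "ACGTACGT"), ("b", "TTTTTTTT"), ("c", "ACGTTTTT")]
    print(distributed_search("ACGTAC", db, nodes=2))
```

Each part is searched without reference to the others, so the per-node work shrinks roughly in proportion to the number of parts, and only the final merge is shared.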