Automatic Synchronization and Distribution of Biological Databases and Software over Low-Bandwidth Networks among Developing Countries

P2P Node Setup Guide

Authored by:
Unitsa Sungket, Prince of Songkla University, Thailand
Darran Nathan, APBioNet
Background

Bioinformatics and the need for network bandwidth

Bioinformatics involves the collection, organization and analysis of large amounts of biological data, using networks of computers and databases. Bioinformatics centers around the world have to update their database repositories regularly with the latest releases. This is normally done by file transfer over FTP, but the large and growing sizes of these databases mean that high network bandwidth is required to ensure that new database releases are downloaded quickly and without failure. To assist with this, a network of database mirror sites was established in several countries worldwide in 1997, under the Bio-Mirror project.

Developing countries in the Asia-Pacific region are just moving into the field of bioinformatics, but the computational infrastructure and network bandwidth available in those countries are still at a primitive level compared with those in more developed countries. Network bandwidth within these countries is still very low, and the low reliability of connections means that breaks and aborts in downloads are common. So, in spite of the Bio-Mirror nodes being made available, many developing countries still face a major problem in regularly updating these databases. The problem will only get worse in the coming years, because the growth of the databases outstrips the rate at which bandwidth reaches the end user.

A revolution in file sharing technology

In the late 1990s, the Internet community witnessed the start of a major revolution in the way people share files: Peer-to-Peer (P2P) file exchange was introduced with the wildly popular Napster in 1999. Internet users used it to share mp3 music and video files throughout the world. P2P technology does not exchange files only between a central server and the multiple clients that connect to it; it focuses instead on using the clients to exchange files among one another.
The technology continued to evolve and improve, with the second-generation P2P FastTrack / Kazaa network in 2001. The BitTorrent protocol, first released in 2001, followed. This third-generation P2P technology was a major advance over previous P2P protocols. With BitTorrent, a large file to be distributed is broken up into smaller fragments, typically around a quarter of a megabyte each. These fragments are distributed to each peer, and among peers, in a random manner, and are reassembled at the requesting machine. The difference between traditional client/server distribution of files and third-generation P2P distribution is illustrated in Figures 1 and 2.
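The fragmenting and reassembly described above can be sketched in a few lines of Python. This is only an illustration, not the Azureus implementation: the function names are hypothetical, the whole file is held in memory, and the random arrival order is simulated with a shuffle. BitTorrent verifies each fragment with a SHA-1 hash, which is what lets a peer accept fragments from any other peer.

```python
import hashlib
import random

PIECE_LENGTH = 256 * 1024  # a quarter of a megabyte, as described above


def split_into_pieces(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Cut the payload into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """SHA-1 digest of each piece, used to verify a piece after download."""
    return [hashlib.sha1(p).digest() for p in pieces]


def reassemble(received: dict[int, bytes], hashes: list[bytes]) -> bytes:
    """Verify every received piece against its hash, then join in index order."""
    for index, expected in enumerate(hashes):
        if hashlib.sha1(received[index]).digest() != expected:
            raise ValueError(f"piece {index} failed verification")
    return b"".join(received[i] for i in range(len(hashes)))


def simulate_transfer(data: bytes, piece_length: int = PIECE_LENGTH, seed: int = 0) -> bytes:
    """Hypothetical helper: deliver the pieces in random order and rebuild the file."""
    pieces = split_into_pieces(data, piece_length)
    hashes = piece_hashes(pieces)
    order = list(range(len(pieces)))
    random.Random(seed).shuffle(order)  # pieces arrive in no particular order
    received = {i: pieces[i] for i in order}
    return reassemble(received, hashes)
```

Because each fragment carries its own index and hash, the order of arrival and the peer it came from do not matter to the requesting machine.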
Figure 1. Traditional Client / Server distribution of files

Figure 2. BitTorrent distribution of files

These figures illustrate the power of the concept introduced by third-generation P2P technology. As the number of downloading clients in the traditional distribution architecture increases, the bandwidth demands placed on the server increase and lead to a bottleneck. With the third-generation P2P architecture, however, the more peers there are, the more nodes are available to distribute fragments of the file. High demand actually leads to greater throughput, as more bandwidth from additional nodes becomes available to the group.

Using P2P technology in distributing biological data

From the comparison above, it can be seen that third-generation P2P technology offers to solve simultaneously the two major problems plaguing the distribution of biological data to developing countries:

1) Low international bandwidth
• With a P2P architecture, downloads need not come from a central server in another country. Every peer that connects to synchronize its databases or software, whether from the same institute, state, country or region, provides additional bandwidth that speeds up the overall download rate of all the peers.

2) Unreliable connections
• In the conventional server/client architecture, all downloads come from a single server. If this connection becomes very slow or unreliable, there can be no ‘failover’ to automatically continue downloading from another source.
• In the third-generation P2P architecture, however, downloads are automatically sourced from the peers with the best connections; if a connection experiences a bottleneck, downloads automatically continue from the next best connections.
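The bandwidth argument above can be made concrete with a deliberately idealized model. The sketch below assumes that upload capacity is shared equally, that every peer uploads at the same rate, and that protocol overhead and international link limits can be ignored. None of these assumptions come from measurements in this project, and the function names and figures are illustrative only.

```python
def client_server_total(server_upload: float, n_clients: int) -> float:
    """Total throughput when a single server feeds all clients (kB/s).

    The server's upload link is the only source, so the total never exceeds it.
    """
    return server_upload if n_clients > 0 else 0.0


def client_server_rate(server_upload: float, n_clients: int) -> float:
    """Per-client rate when the server's upload capacity is shared equally."""
    return server_upload / n_clients


def p2p_total(seed_upload: float, peer_upload: float, n_peers: int) -> float:
    """Total upload capacity of the swarm: the seed plus every peer (kB/s)."""
    return seed_upload + n_peers * peer_upload


def p2p_rate(seed_upload: float, peer_upload: float, peer_download: float, n_peers: int) -> float:
    """Idealized per-peer rate: swarm capacity shared equally, capped by the peer's own download link."""
    return min(peer_download, p2p_total(seed_upload, peer_upload, n_peers) / n_peers)


if __name__ == "__main__":
    for n in (10, 100):
        print(n, client_server_rate(1000, n), p2p_rate(1000, 50, 500, n))
```

Under these assumptions the total throughput of the client/server setup stays fixed at the server's upload rate, while the total for the swarm grows with every peer that joins. The per-peer rate in the swarm also levels off near each peer's upload rate instead of falling toward zero.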
P2P technology can be applied in three areas: the distribution of biological software, courseware, and databases.

Objectives

1) To develop a client application based on third-generation P2P protocols, or to select and extend an existing open-source one, for use in the distribution of biological software, courseware, and databases
2) To set up and test the performance of this P2P distribution network for biological software, courseware, and databases, with nodes in countries in the Asia-Pacific region, starting with Singapore, Thailand, and Korea, and extending beyond. These tests will include:
• Benchmarking performance against the more traditional rsync and FTP techniques
• Assessing the effect of bandwidth saturation when using P2P
• Identifying the P2P architecture and topology variations best suited to distributing datasets of different sizes

P2P Software Selected

After extensive analysis and trials of various available P2P software, the Azureus program was selected for this work for the following reasons:
• It is open-source and has a large, active development community
• It runs on Java, allowing it to be deployed on any OS that has a Java runtime
• It has a well-documented plug-in interface that makes it easy to develop any additional enhancements that may be necessary for this project

Setup of the P2P node

This section describes the steps needed to set up a P2P node. The steps assume that a server has been assigned by your institute and set up with the OS and the necessary supporting software, such as antivirus, firewall, and intrusion detection systems.

1. Installation of Azureus

a) Download and install the Azureus program from http://azureus.sourceforge.net/download.php
Linux users can view the installation details at http://azureus.sourceforge.net/howto_linux.php
Windows users can view the installation details at http://azureus.sourceforge.net/howto_win.php
2. Setting up Azureus

Azureus has two groups of settings that should be set up: ‘client’ settings and ‘server’ settings.

1) Client settings - for download of data from peers

1.1) Go to the Tools menu and choose Options. In the list on the left, click Connection. Pick a number between 49152 and 65534, and enter it in both the incoming TCP listen port and UDP listen port boxes, as shown in Figure 3. Then click Save to save this change. Ensure that you have opened this port in your firewall for both download and upload.

Figure 3. Setting the incoming TCP and UDP ports

1.2) To test the download of data from a Seed node, download a torrent from the KOBIC Tracker (http://ftp.kobic.re.kr:6969/) as shown in Figure 4. In the File menu, click Open and choose Add file to add the torrent that you have downloaded. This is shown in Figure 5. If everything has been set up correctly, the download should begin. Figure 6 shows the PSU node downloading the file go_200608-assocdb.rdf-xml.gz from the KOBIC node (torrent go_200608-assocdb.rdf-xml.gz.torrent) at a download speed of 12.4 kB/s.

Note: if the Health indicator on the torrent is red, your server is not connected to any peer. This may be because the tracker server is down, or because no Seed node is present.

Figure 4. KOBIC tracker
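When choosing the listen port in step 1.1, it can help to confirm that nothing else on the server already uses the number you picked. The following Python sketch picks a random port in the range given above and checks that it can be bound for both TCP and UDP. The function names are hypothetical, and the check covers the local machine only: it says nothing about whether your firewall lets the port through.

```python
import random
import socket

PORT_MIN, PORT_MAX = 49152, 65534  # range given in step 1.1


def port_is_free(port: int) -> bool:
    """Return True if both TCP and UDP can be bound to this port locally."""
    for kind in (socket.SOCK_STREAM, socket.SOCK_DGRAM):
        with socket.socket(socket.AF_INET, kind) as s:
            try:
                s.bind(("", port))
            except OSError:
                return False
    return True


def pick_listen_port(attempts: int = 50) -> int:
    """Try random ports in the range until one is free for TCP and UDP."""
    for _ in range(attempts):
        port = random.randint(PORT_MIN, PORT_MAX)  # both ends inclusive
        if port_is_free(port):
            return port
    raise RuntimeError("no free port found in the configured range")


if __name__ == "__main__":
    print("Suggested listen port:", pick_listen_port())
```

Run it before starting Azureus, since Azureus itself will occupy the port once it is configured.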
Figure 5. Opening the Torrent file

Figure 6. PSU node downloading data from KOBIC node.

2) Server settings - for setting up a Seed node to host and manage a database

If you want to upload or distribute your data to any peer, you must create a torrent for that data and host the torrent on a ‘tracker’.

2.1) Go to the Tools menu and choose Options. In the list on the left, click Tracker, then click Server. Enter your external IP address or server name. Select the HTTP port check box and enter a port such as 6969, as shown in Figure 7. Ensure that you have opened this port (6969 in this example) on your firewall.

2.2) In the list on the left, click Plugins, and then click Tracker Web. Select Publish torrent, enter the title of your tracker web as shown in Figure 8, and select all RSS feed options for automatic synchronization as shown in Figure 9.
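After opening the tracker port in step 2.1, one simple check is to attempt a TCP connection to it from another machine. The Python sketch below does only that. The function name is hypothetical, and the host name in the example is a placeholder for your own external IP address or server name. A successful connection shows that the port is reachable, not that the tracker is configured correctly.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host: replace with the address entered in the Tracker settings.
    print(tcp_port_open("tracker.example.org", 6969))
```

Running the check from outside your institute's network is the more useful test, because a connection from the server to itself usually bypasses the firewall.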
Figure 7. Tracker server settings

Figure 8. Tracker Web settings
Figure 9. RSS feed setting in Tracker Web

2.3) Create a torrent by clicking New Torrent in the File menu and selecting Embedded Tracker, as shown in Figure 10. Before clicking ‘Finish’ to create the torrent, check the boxes to Open the torrent and Host the torrent, as shown in Figure 11.
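Azureus builds the torrent file for you in step 2.3, so no code is needed to follow this guide. For readers who want to know what the file contains, the Python sketch below builds the metadata of a single-file torrent in the standard BitTorrent encoding ("bencode"). It is a minimal illustration, not the Azureus code: the function names and the announce URL are hypothetical, and the whole file is read into memory.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists and dicts in BitTorrent's bencode format."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name: str, data: bytes, announce: str, piece_length: int = 256 * 1024) -> dict:
    """Single-file torrent metadata: tracker URL plus the SHA-1 hash of every piece."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    return {
        "announce": announce,
        "info": {
            "length": len(data),
            "name": name,
            "piece length": piece_length,
            "pieces": pieces,  # 20 bytes per piece, concatenated
        },
    }


def info_hash(metainfo: dict) -> str:
    """SHA-1 of the bencoded 'info' dictionary; this identifies the torrent to the tracker."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()


if __name__ == "__main__":
    # Assumed announce URL; a real one points at your own tracker.
    meta = build_metainfo("example.gz", b"\0" * 600_000, "http://tracker.example.org:6969/announce")
    print(info_hash(meta), len(bencode(meta)), "bytes")
```

The torrent holds only hashes and the tracker address, so it stays small even for a very large database file. The data itself is served by the Seed node that hosts the torrent.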
Figure 10. Creating a new torrent

Figure 11. Options selected to create a new torrent