Linux for Biology DEDAN GITHAE, BIOINFORMATICIAN BECA-ILRI HUB
Importance of computers to biology
◦ Availability of vast research data shared online
◦ Automated analysis leading to generation of massive data
◦ Interaction with other research communities and shared databases
◦ Speed and efficiency in processing, storage and data mining
BIG Data: Volume, Variety, Velocity & Veracity
Volume:
◦ More content already generated and available over open access
◦ More content being generated per run as a result of technology advancement
◦ Costs getting cheaper over time
Velocity:
◦ Technology is making data generation faster and more efficient
Variety:
◦ Sequences, annotation, structures, image processing
Veracity:
◦ Some ambiguities, inconsistencies, incomplete data, model approximations
Other computational tasks: Analysis and interpretation
Biology activities:
◦ Prediction: functional and structural
◦ Pattern recognition: domains, homology
◦ Sequence alignments
◦ Statistical analysis
◦ Structural modelling
◦ Genetic diversity and interactions between organisms, between populations
Linux
What is Linux
A family
◦ of free
◦ and open-source software
◦ operating system
◦ distributions built around the Linux kernel.
What is Linux
A family (Ubuntu? Fedora? Mint? Debian? openSUSE?)
◦ of free: anyone is freely licensed to use, copy, study, and change the software in any way
◦ and open-source software: the source code is openly shared so that people are encouraged to voluntarily improve the design of the software
◦ operating system: system software that manages computer hardware and software resources and provides common services for computer programs
◦ distributions built around the Linux kernel: the part of the operating system that mediates access to system resources, e.g. input/output requests from software, translating them into data-processing instructions for the central processing unit
Kernel
Some applications to biological tasks
Repetitive tasks: processing several sequences
Automating analysis processes: scripts / piping to programs
Text processing
◦ Regex; grep; sed
◦ extracting fields using cut / awk
◦ We'll see more of this in the tutorial
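As a rough illustration of the text-processing tools named above, the sketch below runs grep, sed, cut and awk over a tiny made-up data set. The file names (demo.fa, demo.tsv), the sequence IDs and the table layout are all hypothetical and exist only for this example; real data will need different patterns and field numbers.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: a tiny FASTA file and a tab-separated table (ID, gene, length).
workdir=$(mktemp -d)
printf '>seq1 gene=abc\nACGT\n>seq2 gene=def\nGGCC\n>seq3 gene=ghi\nTTAA\n' > "$workdir/demo.fa"
printf 'seq1\tabc\t120\nseq2\tdef\t95\nseq3\tghi\t240\n' > "$workdir/demo.tsv"

# grep: count sequences by counting header lines (those starting with ">").
n_seqs=$(grep -c '^>' "$workdir/demo.fa")

# sed: from header lines only, keep the ID (text between ">" and the first space).
ids=$(sed -n 's/^>\([^ ]*\).*/\1/p' "$workdir/demo.fa")

# cut: extract the second tab-separated field (the gene name).
genes=$(cut -f2 "$workdir/demo.tsv")

# awk: print the ID of every row whose third field (length) is greater than 100.
long_ids=$(awk -F'\t' '$3 > 100 {print $1}' "$workdir/demo.tsv")

echo "sequences: $n_seqs"
echo "$ids"
echo "$genes"
echo "$long_ids"
```

The same commands can be chained with pipes or wrapped in a for loop over many files, which is where the time saving on repetitive tasks comes from.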
The ILRI High Performance Computing (HPC) Cluster
Users log into HPC (the master). To log in:
    ssh userX@hpc.ilri.cgiar.org
Then "jump" to the rest of the cluster (computing servers). To do this, type
    interactive
Software:
To check whether the software (and the version) you need is installed, type
    module avail
To use a piece of software, e.g. BLAST, type
    module load blast
To see which software is ready for use (loaded), type
    module list
SLURM: Simple Linux Utility for Resource Management
Interactive jobs have a time limit of 8 hours. If you are running a longer job, write a batch script to schedule it.
How do we write scripts?
Writing a Slurm script
◦ To see the available options, type sbatch -u [ see man sbatch for a detailed explanation of usage ]
Example of a batch script
    #!/usr/bin/env bash
    #SBATCH -p batch
    #SBATCH -J blastn
    #SBATCH -n 4

    # load the blast module
    module load blast/2.6.0+

    # run blastn with 4 CPU threads (cores), matching the 4 requested with -n above
    blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4
To run the script, type
    sbatch [ scriptname.sbatch ]
Best practice: overview
◦ Run the job on the computing node
    interactive
◦ Make a directory in the scratch space and "go" there
    mkdir -p /var/scratch/userX ; cd $_
◦ Create the script
◦ Run the script
    sbatch [scriptname.sbatch]
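The steps above could be strung together roughly as follows. This is only a sketch under stated assumptions: the SCRATCH_DIR variable, the temporary-directory fallback and the file name blastn.sbatch are hypothetical, and `-num_threads 4` is added here on the assumption that blastn should use the four CPUs the job requests. The sketch writes and syntax-checks the batch script but does not submit it.

```shell
#!/usr/bin/env bash
# Scratch location: on the cluster this would be /var/scratch/userX (from the slide).
# SCRATCH_DIR and the mktemp fallback are assumptions so the sketch runs on any machine.
scratch=${SCRATCH_DIR:-$(mktemp -d)}
mkdir -p "$scratch" && cd "$scratch" || exit 1

# Create the batch script. The quoted 'EOF' keeps ~ and $ unexpanded until the job runs.
script=blastn.sbatch   # hypothetical file name
cat > "$script" <<'EOF'
#!/usr/bin/env bash
#SBATCH -p batch
#SBATCH -J blastn
#SBATCH -n 4

module load blast/2.6.0+
blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4
EOF

# Check the script for shell syntax errors without executing it,
# then print the command that would submit it.
bash -n "$script" && echo "ready: sbatch $scratch/$script"
```

On the cluster, running the printed sbatch command from a computing node would hand the script to Slurm for scheduling.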
Enjoy!