Implementing iRODS for Next Generation Sequencing Data Management Gen-Tao Chiang Wellcome Trust Sanger Institute gtc@sanger.ac.uk ISGC, March 20, 2011
Outline 1. DNA Sequencing 2. Managing Data 3. iRODS 4. WTSI use case 5. Future Work
The Sanger Institute Funded by the Wellcome Trust. • 2nd largest research charity in the world. • More than 800 employees. • Based at the Wellcome Trust Genome Campus, Hinxton, Cambridge, UK (shared with EBI). • Most cited in the UK (Science Watch, 2008). Large-scale genomic research. • Sequenced 1/3 of the human genome (largest single contributor). • We have active cancer, malaria, pathogen and genomic variation / human health studies. • All data is made publicly available: websites, FTP, direct database access, programmatic APIs. By Guy Coates
Data Centre • Completed in 2005. • 1,000 square meters of floor space split equally into four rooms. • Capable of supporting up to 50,000 processors. • Currently about 10,000 cores and 10 petabytes of storage.
Managing Data
DNA Sequencing
Capillary Based • In 2001, in the era of the Human Genome Project (HGP), DNA sequencing technology used a capillary-based approach. • Each sequencer produced about 115 kbp (thousand base pairs) per day (Mardis, 2011).
Next Generation Sequencing The life sciences are drowning in data from our new sequencing machines. Traditional sequencing: • 96 sequencing reactions carried out per run. Next-generation sequencing: • 52 million reactions per run. Machines are cheap(ish) and small. • Small labs can afford one. • Big labs can afford lots of them.
Illumina HiSeq • Migrating to Illumina HiSeq since October 2010. • 5 times more data than Illumina GA2. • 20 machines on site. • Makes data management extremely difficult. http://www.illumina.com
E. R. Mardis. Nature 470, 198-203 (2011).
Output Trends Our peak “old generation” sequencing: • August 2007: 3.5 Gbases/month. Current output: • Jan 2010: 4 Tbases/month. 1000x increase in our sequencing output. • In August 2007, the total size of GenBank was 200 Gbases. Improvements in chemistry continue to increase the output of machines. [Chart: monthly output in Gbases, Capillary (3.5) vs Illumina (4000), Jan 2010]
Data Growth Current weekly sequencing: 3000 Gbase. Peak yearly capillary sequencing: 30 Gbase.
Managing Growth We have exponential growth in storage and compute. • Storage/compute doubles every 12 months. • 2009: ~7 PB raw. A gigabase of sequence ≠ a gigabyte of storage. • 16 bytes per base for sequence data. • Intermediate analysis typically needs 10x the disk space of the raw data. Moore's law will not save us. • Transistor/disk density: T_d = 18 months. • Sequencing cost: T_d = 12 months. By Guy Coates
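The arithmetic behind this slide can be sketched in a few lines. This is only an illustration: the constants come from the slides (16 bytes per base, 10x intermediate space, 3000 Gbase per week), T_d is assumed to be a doubling time, and the function names are made up for this example.

```python
# Figures quoted on the slides; everything else here is illustrative.
BYTES_PER_BASE = 16          # ~16 bytes stored per sequenced base
INTERMEDIATE_FACTOR = 10     # analysis needs ~10x the raw data on disk
WEEKLY_BASES = 3000 * 10**9  # current weekly sequencing: 3000 Gbase


def storage_bytes(bases, bytes_per_base=BYTES_PER_BASE):
    """Raw storage needed for a given number of sequenced bases."""
    return bases * bytes_per_base


def growth_factor(years, doubling_months):
    """Growth after `years` for a quantity doubling every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)


raw = storage_bytes(WEEKLY_BASES)
scratch = raw * INTERMEDIATE_FACTOR
print(f"raw per week:          {raw / 10**12:.0f} TB")
print(f"intermediate per week: {scratch / 10**12:.0f} TB")

# Assumed reading of the slide: sequencing output doubles every 12 months,
# transistor/disk density only every 18 months.
for years in (1, 3, 6):
    seq = growth_factor(years, 12)
    disk = growth_factor(years, 18)
    print(f"after {years} y: sequencing x{seq:.1f}, disk density x{disk:.1f}")
```

Under these assumptions the gap between sequencing output and disk density itself doubles every three years, which is the sense in which Moore's law does not help.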
Economic Trends: The Human Genome Project: • 13 years. • 23 labs. • $500 million. A human genome today: • 3 days. • 1 machine. • $10,000. • Large centres are now doing studies with 1000s and 10,000s of genomes. Changes in sequencing technology are going to continue this trend. • “Next-next” generation sequencers are on their way. • One Pacific Biosciences RS test machine at WTSI now. • A $500 genome is probable within 5 years.
Managing Data
Bulk Data Data size per genome, running from structured data (databases) to unstructured data (flat files): • Individual features (3 MB). • Variation data (1 GB). • Alignments (200 GB). • Sequence + quality data (500 GB). • Intensities / raw data (2 TB). [Figure label: Sequencing informatics specialists] By Guy Coates
Bulk Data Management We thought we were really good at it. • All samples that come through the sequencing lab are bar-coded and tracked (Laboratory Information Systems). • Sequencing machines feed into an automated analysis pipeline. • All the data was tracked, analysed and archived appropriately. Strict metadata controls. • Experiments do not start in the wet lab until the investigator has supplied all the required data privacy and archiving information. Anonymised data → straight into the archive. Identifiable data → private/controlled archives. Some data held back until journal publication.
[Figure: data flow diagram. Labels: Mainly for QC pipeline; SRF; SRA; fastq; Analysis, alignment, assembly; Further analysis; Ensembl annotation.]
We had been focused on the sequencing pipeline. • For many investigators, data coming off the end of the sequencing pipeline is where they start. • Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”. Huge explosion of data and disk use all over the institute. • We had no idea what people were doing with their data.
Alignment Find the best match of fragments to a known genome / genomes. • “grep” for DNA sequences. • Use more sophisticated algorithms that can do fuzzy matching. Real DNA has insertions, deletions and mutations. Typical algorithms are maq, bwa, ssaha, blast. Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC.... Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCT[A]GGTCATCACCAGCA (the bracketed A differs from the C in the reference). Look for differences: • Single base pair differences (SNP). • Larger insertions/deletions/mutations. Typical experiment: • Compare cancer cell genomes with healthy ones. By Guy Coates
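To make the "fuzzy grep" idea concrete, here is a toy sketch that slides the query from the slide along the reference and counts mismatches. It is not how maq, bwa, ssaha or blast work (they use indexes and handle insertions and deletions); it only handles substitutions, and the function names are invented for this example.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hit(reference, query):
    """Slide the query along the reference; return (offset, mismatches)
    for the window with the fewest mismatches (first one wins on ties)."""
    best_offset, best_mismatches = -1, len(query) + 1
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        mismatches = hamming(window, query)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Sequences taken from the slide (leading/trailing dots dropped).
REFERENCE = ("TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTC"
             "GGTCATCACCAGCATTCTC")
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCT" + "A" + "GGTCATCACCAGCA"

offset, mismatches = best_ungapped_hit(REFERENCE, QUERY)

# List each differing position as (query index, reference base, query base).
differences = [
    (i, REFERENCE[offset + i], QUERY[i])
    for i in range(len(QUERY))
    if REFERENCE[offset + i] != QUERY[i]
]
print("offset:", offset, "mismatches:", mismatches)
print("differences:", differences)
```

The single reported difference is the candidate SNP. A plain exact-match search would have found nothing, which is why fuzzy matching is needed.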
Assembly Assemble fragments into a complete genome. • Typical experiment: produce a reference genome for a new species. “De novo” assembly. • Assemble fragments with no external data. • Harder than it looks: non-uniform coverage, low depth, non-unique sequence (repeats). By Guy Coates
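A toy illustration of de novo assembly is greedy overlap merging: repeatedly join the two fragments with the longest suffix/prefix overlap. This is a sketch only, with made-up fragments and function names; production assemblers use far more elaborate graph-based methods and must cope with sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no overlap of at least `min_len` bases exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain.
    Returns the list of resulting contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Hypothetical fragments covering a short made-up sequence.
contigs = greedy_assemble(["ACGTTGC", "TTGCAAG", "CAAGGCT"])
print(contigs)
```

The example works because every overlap is unique. With repeats, several fragments share the same overlap and the greedy choice can join the wrong pair, which is one reason assembly is harder than it looks.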
Analysing Cancer Genomes Cancer genomes contain a lot of genetic damage. • Many of the mutations in cancer are incidental. • Initial mutation disrupts the normal DNA repair/replication processes. • Corruption spreads through the rest of the genome. Today: find the “driver” mutations amongst the thousands of “passengers”. • Identifying the driver mutations will give us new targets for therapies. Tomorrow: analyse the cancer genome of every patient in the clinic. • Variations in the genetic makeup of a patient and of their cancer play a major role in how effective a particular drug will be. • Clinicians will use this information to tailor therapies.
Accidents waiting to happen... From: <User A> (who left 12 months ago) I find the <project> directory is removed . The original directory is "/scratch/ <User B> (who left 6 months ago) " ..where is it ? If this problem cannot be solved ,I am afraid that <project> cannot be released.
We need a file tracking system for unstructured data! • Investigators could not keep track of where their results were. • Problem exacerbated by student turnover (summer students, PhD students, visiting researchers on rotation). Big wins with little effort. • Disk space usage dropped by 2/3. Lots of individuals were keeping copies of the same data set “so I know where it is”. • Team leaders are happy that their data are where they think they are. Important stuff is on file systems that are backed up, etc. But: • Systems are ad hoc, quick hacks. • We want an institute-wide, standardised system. Invest in people to maintain/develop it.
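One of the quick wins mentioned here, finding redundant copies of the same data set, can be sketched by grouping files by content checksum. This is an illustrative script, not the tooling actually used at the institute; the function names are invented, and on petabyte-scale file systems one would normally compare sizes first to avoid hashing everything.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Walk `root` and return {digest: [paths]} for content seen more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only hash regular files.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def report(root):
    """Print each group of identical files found under `root`."""
    for digest, paths in find_duplicates(root).items():
        print(digest[:12], *paths, sep="\n  ")
```

Calling `report` on a scratch directory lists each set of identical files, which gives a starting point for deciding which copy is the one that should be kept and backed up.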
Data Grid • Many science fields today must deal with large and geographically distributed data sets. The size of these data sets has scaled up from terabytes to petabytes. • Several issues combine: – large datasets, – distributed data, – computationally intensive analysis. • Data grid: a unified environment which allows users to deal with all of the above issues. • SRB, dCache, CASTOR, etc.
iRODS Architecture
iRODS • iRODS: Integrated Rule-Oriented Data System. • Produced by the DICE (Data Intensive Cyber Environments) group at the University of North Carolina at Chapel Hill. • Successor to SRB.
Important Features • Catalogue: maps logical file names to physical locations. • Metadata: metadata can be attached to each file. • Rule engine: – Manipulate files or DB. For example, replicate data to multiple resources. – Implement policies. • Easy-to-use client tools: – icommands. – Web interface. – API. • Federation.
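The three core ideas above (a catalogue of logical names, per-file metadata, and rules that fire on events) can be illustrated with a small in-memory model. This is a conceptual sketch only: it is not the iRODS API, and every class, function, resource name and path below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str
    replicas: dict = field(default_factory=dict)   # resource name -> physical path
    metadata: dict = field(default_factory=dict)   # attribute -> value


class ToyCatalogue:
    """In-memory stand-in for a catalogue with a rule hook on registration."""

    def __init__(self):
        self._objects = {}
        self._rules = []

    def add_rule(self, rule):
        """Register a callable(catalogue, obj) run after each registration."""
        self._rules.append(rule)

    def register(self, logical_path, resource, physical_path, **metadata):
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas[resource] = physical_path
        obj.metadata.update(metadata)
        for rule in self._rules:
            rule(self, obj)
        return obj

    def replicate(self, logical_path, resource, physical_path):
        self._objects[logical_path].replicas[resource] = physical_path

    def query(self, **criteria):
        """Logical paths whose metadata matches every given attribute."""
        return sorted(
            o.logical_path for o in self._objects.values()
            if all(o.metadata.get(k) == v for k, v in criteria.items())
        )


def replicate_bam_to_archive(catalogue, obj):
    """Example policy: every BAM file gets a second copy on 'archive'."""
    if obj.metadata.get("type") == "bam" and "archive" not in obj.replicas:
        catalogue.replicate(obj.logical_path, "archive",
                            "/archive" + obj.logical_path)


catalogue = ToyCatalogue()
catalogue.add_rule(replicate_bam_to_archive)
lane = catalogue.register("/seq/run1/lane1.bam", "scratch",
                          "/scratch/run1/lane1.bam", type="bam", study="cancer")
print(lane.replicas)
print(catalogue.query(study="cancer"))
```

Users refer to the file only by its logical path; where the replicas physically live, and which policy created them, is the catalogue's concern.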
What are we doing with it? Piloting it for internal use. • Help groups keep track of their data. • Move files between different storage pools. Fast scratch space ↔ warehouse disk ↔ offsite DR centre. • Link metadata back to our LIMS/tracking databases. We need to share data with other institutions. • Public data is easy: FTP/HTTP. • Controlled data is hard: • Encrypt files and place them on private FTP dropboxes. • Cumbersome to manage and insecure.
First Stage: A preservation system
[Figure labels: BAM; Multiple NFS Partitions]