Barcode Sequence Alignment and Statistical Analysis (Barcas) tool
2016.10.05
Mun, Jihyeob and Kim, Seon-Young
Korea Research Institute of Bioscience and Biotechnology
Barcode-Sequencing
- Genome-wide screening method based on sequencing and counting tens of thousands of individual tags (barcodes) assigned to each gene, under a given condition
- Originally developed for yeast deletion libraries such as those of Saccharomyces cerevisiae and Schizosaccharomyces pombe
- Now applied to genome-wide siRNA or shRNA screening to measure the effects of gene knock-down
- Also applied, using CRISPR-Cas9, to genome-wide sgRNA screening to measure the effects of gene knock-out
Examples of genome-wide barcode-sequencing libraries

| Contents | Organism | # of genes | # of barcodes | References |
|---|---|---|---|---|
| Yeast deletion consortium | S. cerevisiae | 6,343 | 2 (UP and DN) | www-sequence.Stanford.edu/group/ |
| Bioneer pombe collection | S. pombe | 4,836 | 2 (UP and DN) | http://us.bioneer.com/ |
| MISSION shRNA (human) | H. sapiens | 20,018 | 129,696 shRNA | http://sigmaaldrich.com |
| MISSION shRNA (mouse) | M. musculus | 21,171 | 118,072 shRNA | http://sigmaaldrich.com |
| TRC1 (human) shRNA | H. sapiens | 16,019 | 80,717 shRNA | https://portals.broadinstitute.org/gpp/trc1/ |
| TRC1 (mouse) shRNA | M. musculus | 15,960 | 77,819 shRNA | https://portals.broadinstitute.org/gpp/trc1/ |
| Human DECIPHER (shRNA) | H. sapiens | 15,377 | 5+ shRNAs | https://www.cellecta.com |
| Mouse DECIPHER (shRNA) | M. musculus | 9,145 | 5+ shRNAs | https://www.cellecta.com |
| Cellecta Genome-wide shRNA | H. sapiens | 19,276 | 8 shRNAs | https://www.cellecta.com |
| Cellecta Genome-wide CRISPR | H. sapiens | 19,001 | 8 sgRNAs | https://www.cellecta.com |
| Human GeCKO v2 | H. sapiens | 19,050 | 123,411 sgRNA | https://www.addgene.org/ |
| Mouse GeCKO v2 | M. musculus | 20,611 | 130,209 sgRNA | https://www.addgene.org/ |
| Mouse genome-wide v1 (yusa) | M. musculus | 19,150 | 87,897 sgRNA | https://www.addgene.org/ |
| Oxford fly | Drosophila | 13,501 | 40,279 sgRNA | https://www.addgene.org/ |
| CRISPRa | H. sapiens | 15,977 | 198,810 sgRNA | https://www.addgene.org/ |
| CRISPRi | H. sapiens | 11,219 | 206,421 sgRNA | https://www.addgene.org/ |
Workflow: barcoded yeast deletion strains
Workflow: genome-wide shRNA screening
Basic format of barcode-seq data
MID (Multiplexing Index, 4-6 bp) | Universal Primer (20-25 bp) | Barcode (20-30 bp)
Steps of barcode-seq data analysis
- Pre-processing and QC
  • Trim index
  • Trim primer
  • Read layout: Multiplex Index (4-6 bp) | Universal Primer (20-bp) | Barcode (20-30 bp)
- Map and count each TAG

| | sample1 | sample2 | sample3 |
|---|---|---|---|
| tag1 | 3400 | 2500 | 2983 |
| tag2 | 120 | 199 | 739 |
| tag3 | 29920 | 3544 | 2232 |
| tag4 | 4300 | 3433 | 3344 |
| ... | ... | ... | ... |

- Normalization
- Statistical Analyses
- Visualization
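The trim, map, and count steps can be sketched in a few lines of Python. This is a minimal illustration with exact matching only, not the Barcas implementation; the index, primer, and barcode sequences and the names `INDEX_TO_SAMPLE`, `PRIMER`, `LIBRARY`, and `count_tags` are all hypothetical.

```python
from collections import Counter

# Hypothetical example values; real index, primer, and barcode sequences
# come from the library design and the sequencing protocol.
INDEX_TO_SAMPLE = {"ACGT": "sample1", "TGCA": "sample2"}
PRIMER = "GATGTCCACGAGGTCTCT"
LIBRARY = {"TTAGCCGATACGGATTCAGA": "tag1", "CCGATTAGGCATTCGAATCG": "tag2"}
INDEX_LEN = 4
BARCODE_LEN = 20


def count_tags(reads):
    """Return {sample: Counter({tag: count})} using exact matching only."""
    counts = {sample: Counter() for sample in INDEX_TO_SAMPLE.values()}
    for read in reads:
        # Trim index: the first INDEX_LEN bases identify the sample.
        sample = INDEX_TO_SAMPLE.get(read[:INDEX_LEN])
        rest = read[INDEX_LEN:]
        # Trim primer: skip reads with an unknown index or no primer.
        if sample is None or not rest.startswith(PRIMER):
            continue
        barcode = rest[len(PRIMER):len(PRIMER) + BARCODE_LEN]
        # Map and count: look the barcode up in the library.
        tag = LIBRARY.get(barcode)
        if tag is not None:
            counts[sample][tag] += 1
    return counts
```

The resulting per-sample counters correspond to the columns of a tag count table; reads with sequencing errors in the barcode are simply dropped here, which is the limitation that imperfect matching is meant to address.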
Current tools and methods for barcode-seq data analysis

| Tool (or method) | Pre-processing | QC | Normalization | Statistical Analysis | Visualization | Software format | Ref. |
|---|---|---|---|---|---|---|---|
| Barcas | O | O | O | O | O | Java GUI | Mun 2016 BMC Bioinfo |
| Barcode Deconvoluter | O | X | X | X | X | Windows or Mac GUI software | www.decipherproject.net/ |
| BiNGS!LS-seq & edgeR | O | O | O | O | X | R package | Kim 2012 Method Mol Biol |
| edgeR | O | X | O | O | X | R package | Dai 2014 F1000 Res |
| HiTSelect | X | X | X | Multi-objective ranking | O | Matlab runtime | Diaz 2015 Nuc Acids Res |
| MAGeCK | O | O | O | O | X | Python, C source code | Li 2014 Genome Bio |
| MAGeCK-VISPR | O | O | O | Robust rank aggregation | O | Python script | Li 2015 Genome Bio |
| RIGER | X | X | X | RNAi Gene Enrichment Ranking | O | GENE-E (=> Morpheus) Java GUI | Luo 2008 PNAS |
| RSA | X | X | X | Iterative hypergeometric P-value | X | Windows GUI (C#), R, Perl | Konig 2007 Nat Methods |
| ScreenBEAM | X | X | X | Pooled scoring | X | R package | Yu 2015 Bioinformatics |
| shALIGN & shRNAseq | O | O | O | O | X | Perl and R script | Sims 2011 Genome Bio |
Barcas (Barcode sequence Alignment and Statistical Analysis)
- Barcas is an all-in-one program for the analysis of multiplexed barcode sequencing (barcode-seq) data
- Available at http://medical-genome.kribb.re.kr/barseq/
Input: Barcode-seq data
  • Genome-wide shRNAs (Cellecta, TRC, Sigma-Aldrich, etc.)
  • Genome-wide sgRNAs (Addgene, Cellecta, etc.)
  • Barcoded yeast deletion strains: S. cerevisiae or S. pombe
- Preprocessing & Mapping
  • Filtering, trimming, and mapping with mismatches and indels
- Quality Control (of barcodes and samples)
- Normalization
- Statistical Analysis
  • Two-condition comparison, multiple time points
- Visualization
  • Various graphs and heatmaps
All-in-one package with user-friendly GUI
- Step 1: Pre-processing & Mapping
- Step 2: QC of data quality
- Step 3: Design experiment
- Step 4: Statistical analysis

Step 1: Data preprocessing and mapping
- De-multiplexing and trimming (universal primers)
- Mapping with imperfect matches (mismatches and indels)
- Searching for individual tag sequences

Step 2: Data quality evaluation
- Sequence level: overall sequence quality
- Sample level: mapping counts and percentage, etc.
- Barcode (or tag) level: mapping counts and percentage, etc.

Step 3: Experimental design
- Comparison of two conditions
- Across several different time points
Step 4: Statistical analysis and Visualization
- Calculates a z-score and p-value for each barcode
- Ranks each barcode by z-score
- Plots a z-score graph
- Plots a time-dependent intensity heat map
- Allows searching for individual target genes
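The slide does not say how Barcas derives its z-scores, so the following is only one plausible sketch for a two-condition comparison. It assumes library-size normalization, a pseudocount, a log2 ratio per barcode, a z-score taken across all barcodes, and a two-sided normal p-value; the function name `zscores_and_pvalues` and all parameters are hypothetical.

```python
import math
from statistics import mean, stdev


def zscores_and_pvalues(control, treated, pseudocount=1.0):
    """control, treated: {barcode: raw count}.

    Returns [(barcode, z, p)] ranked by z-score (most depleted first).
    Assumed method, not the documented Barcas one.
    """
    total_c = sum(control.values())
    total_t = sum(treated.values())
    barcodes = sorted(control)

    # log2 ratio of library-size-normalized counts for each barcode
    lfc = {}
    for b in barcodes:
        c = (control[b] + pseudocount) / total_c
        t = (treated.get(b, 0) + pseudocount) / total_t
        lfc[b] = math.log2(t / c)

    mu = mean(lfc.values())
    sd = stdev(lfc.values())

    results = []
    for b in barcodes:
        z = (lfc[b] - mu) / sd
        # Two-sided p-value under a standard normal distribution
        p = math.erfc(abs(z) / math.sqrt(2))
        results.append((b, z, p))

    results.sort(key=lambda r: r[1])  # rank by z-score
    return results
```

With only a handful of barcodes the normal approximation is not meaningful; in a genome-wide screen the distribution is estimated from thousands of barcodes.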
Novel functions of Barcas for data pre-processing and QC
- Flexible mapping with support for both substitutions and indels
- Detection of erroneous barcodes in the library
- Checking similarity among barcodes in the library collection
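A similarity check among library barcodes can be illustrated with a pairwise edit-distance scan. This is a simple quadratic sketch, not how Barcas performs the check; `edit_distance`, `similar_pairs`, and the `max_distance` threshold are hypothetical names chosen for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def similar_pairs(library, max_distance=2):
    """library: {name: barcode}. Return [(name1, name2, distance)] for close pairs."""
    pairs = []
    for (n1, s1), (n2, s2) in combinations(sorted(library.items()), 2):
        d = edit_distance(s1, s2)
        if d <= max_distance:
            pairs.append((n1, n2, d))
    return pairs
```

Pairs reported by such a scan are the barcodes most likely to be confused with each other once mismatches and indels are tolerated during mapping.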
Existing tools for data preprocessing

| Name | Mismatches | Shifts of the position | Indel | Backend tool | Ref. |
|---|---|---|---|---|---|
| BiNGS!LS-seq | O | X | X | bowtie | Kim (2012) Methods Mol Bio |
| shALIGN | O | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio |
| edgeR | O | O | X | edgeR | Dai (2014) F1000Res |
| Barcas | O | O | O | Trie data structure | Mun (2016) BMC Bioinfo |

Read layout: MID | Universal Primer | Barcode (shRNA)

Original barcode  TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Perfect match     TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Mismatches        TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Position shift    TCAAAGATAGTCACGCGACC-ATCGACGAGCTACC
Indel             TCAAAGATAGTCACGCGACCTCATCGA--AGCTACC
Trie data structure
- Data structure based on a prefix tree
- Useful for storing a dynamic set or associative array in which the keys are usually strings
- More efficient than a hash table (or dictionary) or lists in terms of look-up speed and memory

1:M sequence matching processing
  • Algorithm: Tree based
  • Maximum time: N (N: read count)
1:1 sequence matching processing
  • Algorithm: List based
  • Maximum time: N * M (N: read count, M: reference count)

[Figure: reads matched against the library reference, stored as a prefix tree versus as a list]
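A prefix tree for barcode look-up can be sketched as follows. This is a minimal exact-match illustration, not the Barcas data structure itself; `TrieNode`, `build_trie`, and `lookup` are hypothetical names, and the short sequences in the usage note are made up.

```python
class TrieNode:
    """One node per base; `name` is set where a library barcode ends."""
    __slots__ = ("children", "name")

    def __init__(self):
        self.children = {}
        self.name = None


def build_trie(library):
    """library: {name: barcode sequence}. Returns the root node."""
    root = TrieNode()
    for name, seq in library.items():
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.name = name
    return root


def lookup(root, read):
    """Return the name of the first library barcode that is a prefix of
    `read`, or None. Work per read depends on the barcode length, not on
    the number of barcodes in the library."""
    node = root
    for base in read:
        node = node.children.get(base)
        if node is None:
            return None
        if node.name is not None:
            return node.name
    return None
```

For example, with `root = build_trie({"bc1": "AGCT", "bc2": "GCCAA"})`, the call `lookup(root, "AGCTGGG")` walks four nodes and returns `"bc1"` without comparing the read against `bc2` at all, whereas a list-based scan would compare each read with every reference.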
1. Data structure of Barcas for mapping
- Based on the trie data structure, Barcas supports imperfect matching, allowing mismatches, base shifting, and indels
- Dynamic sequence lengths
- Dynamic start positions
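Imperfect matching on a trie is commonly done by walking the tree while carrying one row of an edit-distance table per node and pruning branches that already exceed the error budget. The sketch below shows that general technique under the assumption that the read has already been trimmed to the barcode region; it is not taken from the Barcas source, and `build_trie`, `search`, and `max_errors` are hypothetical names.

```python
def build_trie(library):
    """Nested-dict trie; the key "$" marks the end of a barcode and holds its name."""
    root = {}
    for name, seq in library.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node["$"] = name
    return root


def search(root, read, max_errors=1):
    """Return [(name, distance)] for barcodes within `max_errors` edits
    (mismatches, insertions, deletions) of `read`, best hits first."""
    hits = []
    first_row = list(range(len(read) + 1))

    def walk(node, prev_row):
        for base, child in node.items():
            if base == "$":
                continue
            # One edit-distance row for the barcode prefix ending in `base`
            row = [prev_row[0] + 1]
            for j in range(1, len(read) + 1):
                row.append(min(
                    row[j - 1] + 1,                           # extra base in read
                    prev_row[j] + 1,                          # base missing from read
                    prev_row[j - 1] + (read[j - 1] != base),  # match or mismatch
                ))
            if "$" in child and row[-1] <= max_errors:
                hits.append((child["$"], row[-1]))
            # Prune: row minima never decrease deeper in the tree
            if min(row) <= max_errors:
                walk(child, row)

    walk(root, first_row)
    return sorted(hits, key=lambda h: (h[1], h[0]))
```

Because barcodes that share a prefix share the table rows computed for that prefix, the cost of tolerating errors grows far more slowly than comparing each read with every barcode separately.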
Comparison of speed and mapping rate of Barcas with bowtie and the edgeR package of R
Data
  • 215 million reads were mapped to 4,832 heterozygous diploid deletion strains in S. pombe.
Option
  • 45-bp sequences were used as the barcode library.
Result
  • Barcas was 1.7 times faster than bowtie and 13 times faster than edgeR.
  • Owing to indel mapping, Barcas mapped 8-12% more reads than the other two programs.
2. Detection of erroneous barcodes from the genome-wide barcode library
- It is tempting to assume that barcode sequences in the library match the original design without error
- However, errors can creep into the barcodes at many steps, including
  • barcode synthesis,
  • random mutations during library maintenance,
  • erroneous incorporation of barcodes into the genome, in the case of yeast strains.
Erroneous barcodes in the yeast library
- Smith et al (2009) Quantitative phenotyping via deep barcode sequencing. Genome Res 19:1836-42
- Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. PNAS 101(30):11046-51

| | U1 | UpTag | U2 | D2 | DnTag | D1 |
|---|---|---|---|---|---|---|
| # correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867 |
| % correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7% |
| # correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095 |
| % correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7% |
| % Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92% |