SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM 10/23/2017 Microbiome : Analysis of NGS Data 1
Outline
• Background
• Wet Lab!
• Raw reads
• Quality Assessment
• Quality Control
• Merging and Filtering
• OTU picking
• Decontamination, Annotation and BIOM

WET LAB
“Garbage in, garbage out”
Good lab practice is needed to produce reliable data for downstream processing.
Mistakes made in the wet lab cannot be corrected in the dry lab.

1. PCR and Sequencing flow chart
Stage 1: PCR
Stage 2: QC analysis
Stage 3: Sequencing

Why the V4 16S rRNA region?
Pros
• Well-established protocols
• Full overlap of forward and reverse reads
• Less error during assembling
• Highly reduced sequencing noise
Cons
• Hypervariable regions only
• Less information
• Limited resolution in Bacillus *

2. Raw Sequence Reads Quality Assessment
Raw sequences
A FASTQ file has 4 lines per sequence record.
✓ The first line shows the sequence ID and an optional description.
✓ The second line contains the sequence of nucleotides.
✓ The third line generally holds only a “+” symbol and, occasionally, the same ID and description as the first line.
✓ The fourth line holds the quality score of each nucleotide on the second line.
The quality score encodes the probability of a sequencing error at each nucleotide position.

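The four-line layout above can be sketched in Python. This is a minimal illustration rather than a production parser: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes each record occupies exactly four lines, with the header starting with `@` as in standard FASTQ.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str      # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one character per nucleotide


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped records)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = io.StringIO("@read1 sample=Dog8\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

The length check reflects the rule that the fourth line carries exactly one quality character per nucleotide.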
• The quality score Q is related to the error probability p by Q = -10 x log10(p).
• For example, if the probability of an error (p) equals 0.01 (1 in 100), then the corresponding quality score is 20; if p = 0.001, then Q = 30.
• Quality values are encoded as single ASCII characters, so each score takes one symbol rather than a two- or three-digit number.

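The relationship between error probability and quality score can be checked with a few lines of Python. The sketch assumes the Phred+33 ASCII encoding, which is an assumption here because the slides do not state the offset; the function names are illustrative.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; the slides do not state the offset


def error_prob_to_q(p: float) -> float:
    """Quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def q_to_error_prob(q: float) -> float:
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def char_to_q(ch: str) -> int:
    """Decode one quality character into its integer score."""
    return ord(ch) - PHRED_OFFSET
```

Under this assumed offset, the character `5` decodes to Q = 20 (p = 0.01) and `I` decodes to Q = 40 (p = 0.0001).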
VISUALIZE FASTQ FILE SEQUENCE QUALITY
FastQC Package (Andrews S, 2010)
fastqc_base/fastqc --extract $fastq -f fastq -o $out_dir -t $fastqc_threads
fastqc --extract -f fastq -o $fastqc_dir -t 6 $raw_reads_dir/*
fastqc_combine_base/fastqc_combine.pl -v --out $out_dir --skip --files "$out_dir/*_fastqc"

Raw Sequences: Sample Dog8_R1
Raw Sequences: Sample Dog8_R2
What about this quality?
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html

3. Processing of 16S rRNA NGS data
Some tools available
CBIO-PIPELINE integrates some tools from UPARSE and QIIME to process NGS microbiome data.

3.1 Merging Paired End reads
The UPARSE pipeline uses USEARCH commands (Edgar, 2010).
usearch9 -fastq_mergepairs; maxdiff=3
R1 (300bp): ATGGATCCC G GAGG G GCGCGAAAAGAGAGAGATTCTCC ....
R2 (300bp): …..ATGGATCCC T GAGG C GCGCGAAAGGAGAGAGATCTCTCC
Merged: ATGGATCCC T GAGG G GCGCGAAA G GAGAGAGATCTCTCC
✓ If a base differs between R1 and R2, the base kept in the merged sequence must have a quality score at least 3 times that of the other; otherwise the position becomes N (ambiguous call).
✓ If R1 and R2 differ at more than 3 positions, the pair is rejected.

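The mismatch rules on this slide can be sketched as follows. This is a simplified illustration of the rules as stated in the course, not the USEARCH implementation: it assumes the overlapping parts of R1 and R2 are already aligned and in the same orientation, and the function name and parameters (`merge_overlap`, `max_diff`, `ratio`) are hypothetical.

```python
from typing import List, Optional


def merge_overlap(r1: str, q1: List[int], r2: str, q2: List[int],
                  max_diff: int = 3, ratio: int = 3) -> Optional[str]:
    """Merge two pre-aligned overlaps; return None if the pair is rejected."""
    if not (len(r1) == len(r2) == len(q1) == len(q2)):
        raise ValueError("reads and quality lists must have equal length")
    merged = []
    diffs = 0
    for b1, s1, b2, s2 in zip(r1, q1, r2, q2):
        if b1 == b2:
            merged.append(b1)
            continue
        diffs += 1
        if s1 >= ratio * s2:
            merged.append(b1)   # R1 base is clearly better supported
        elif s2 >= ratio * s1:
            merged.append(b2)   # R2 base is clearly better supported
        else:
            merged.append("N")  # neither base wins: ambiguous call
    if diffs > max_diff:
        return None             # too many differences: reject the pair
    return "".join(merged)
```

For example, a mismatch of G (Q30) against C (Q5) keeps G, while the same mismatch with both bases at Q30 yields N.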
Merged summary output
• Fwd /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R1.fastq
• Rev /researchdata/fhgfs/cbio/cbio/courses/IBS5003Z/samson/uparse/renamed/Dog10_R2.fastq
• Totals:
• 79342 Pairs (79.3k)
• 70287 Merged (70.3k, 88.59%)
• 49910 Alignments with zero diffs (62.90%)
• 8990 Too many diffs (> 3) (11.33%)
• 0 Fwd tails Q <= 2 trimmed (0.00%)
• 174 Rev tails Q <= 2 trimmed (0.22%)
• 0 Fwd too short (< 64) after tail trimming (0.00%)
• 38 Rev too short (< 64) after tail trimming (0.05%)
• 27 No alignment found (0.03%)
• 0 Alignment too short (< 16) (0.00%)
• 79141 Staggered pairs (99.75%) merged & trimmed
• 252.65 Mean alignment length
• 252.65 Mean merged length
• 0.29 Mean fwd expected errors
• 2.23 Mean rev expected errors
• 0.03 Mean merged expected errors

3.2 Filtering Merged Reads
Generally, filtering involves three steps:
✓ Filtering based on the error contribution of each nucleotide base (maxee)
✓ Primer stripping (nowadays primers are stripped by the sequencing platform)
✓ Length truncation
Filtering based on maximum expected error (maxee = 0.1)
uparse_filter_fastq_maxee=0.1
This is the maximum expected error per nucleotide in a DNA sequence.
✓ Thus, a 100bp sequence is rejected only if its total expected error > 10 [0.1 x 100]
✓ A 250bp sequence is rejected if its total expected error > 25.
What if maxee = 0.5?
✓ A 250bp sequence is rejected only if its total expected error > 250 x 0.5 = 125!

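The arithmetic above can be made concrete with a short sketch. It assumes the per-nucleotide interpretation of maxee given on this slide and the Phred+33 encoding (the slides do not state the offset); the total expected error of a read is the sum of the per-base error probabilities. The function names are illustrative.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_filter(quality: str, maxee_per_base: float = 0.1,
                  offset: int = 33) -> bool:
    """Keep the read if total expected error <= maxee_per_base * read length."""
    return expected_errors(quality, offset) <= maxee_per_base * len(quality)
```

With 100 bases at Q8 (character `)` under the assumed offset), the total expected error is about 15.8, so the read fails at maxee = 0.1 (limit 10) but passes at maxee = 0.5 (limit 50), which shows how permissive the larger value is.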
Was Quality Control Effective?
3.3 FastQC of Merged, Trimmed and Filtered Reads
4. Uparse_downstream
4.1. De-replication
Full-length de-replication is done to find the set of unique sequences. Sequences are compared letter by letter.
Sample result
>524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>4ea702c69516c467927860701b5d1a3b59b5d9c6;size=1;
AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

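Full-length de-replication amounts to counting exact duplicates, which can be sketched in Python. The labels in the sample result are 40 hexadecimal characters long, so this sketch assumes they are SHA-1 digests of the sequence; that is a guess, not something the slides confirm. The function names are illustrative.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Tuple


def dereplicate(sequences: Iterable[str]) -> Dict[str, Tuple[str, int]]:
    """Map a label to (unique sequence, abundance) using exact string matches."""
    counts = Counter(seq.upper() for seq in sequences)
    return {
        # Assumption: label = SHA-1 of the sequence (matches the 40-hex-char labels).
        hashlib.sha1(seq.encode("ascii")).hexdigest(): (seq, n)
        for seq, n in counts.items()
    }


def to_fasta(uniques: Dict[str, Tuple[str, int]]) -> str:
    """Write records in the '>label;size=N;' style shown in the sample result."""
    return "".join(f">{label};size={n};\n{seq}\n"
                   for label, (seq, n) in uniques.items())
```

Because sequences are compared letter by letter, two reads that differ at a single base end up as separate unique sequences, as the three sample records illustrate.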
4.2. Sort sequences by size
usearch9 -sortbysize command, min_size=2
Sample result
>2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
AAACAGG
>dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
AAACAGG

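Sorting by size with a minimum size of 2 means dropping singletons and ordering the remaining unique sequences from most to least abundant. A minimal sketch, assuming the abundances are already available as a mapping from sequence to count (the function name is illustrative):

```python
from typing import Dict, List, Tuple


def sort_by_size(uniques: Dict[str, int], min_size: int = 2) -> List[Tuple[str, int]]:
    """Drop sequences with abundance below min_size, then sort by decreasing abundance."""
    kept = [(seq, size) for seq, size in uniques.items() if size >= min_size]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

For example, `sort_by_size({"A": 5, "B": 1, "C": 9})` keeps only `C` and `A`, in that order.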
4.3. De novo OTU picking
usearch9 -cluster_otus, otu_radius_pct=3
Performs 97% OTU clustering using the UPARSE-OTU algorithm (Edgar, R.C., 2013). A radius of 3% corresponds to a 97% identity threshold.

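To illustrate what a 3% radius means, here is a generic greedy centroid clustering sketch. It is not the UPARSE-OTU algorithm, which aligns sequences and also handles chimeras during clustering. The sketch assumes equal-length sequences compared position by position, and input already sorted by decreasing abundance; all names are hypothetical.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; unequal or empty sequences score 0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sorted_seqs: List[str],
                   radius_pct: float = 3.0) -> Tuple[List[str], Dict[str, str]]:
    """Assign each sequence to the first centroid within the radius, else start a new OTU."""
    threshold = 1.0 - radius_pct / 100.0   # 3% radius -> 97% identity
    centroids: List[str] = []
    assignment: Dict[str, str] = {}
    for seq in sorted_seqs:
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                assignment[seq] = centroid
                break
        else:
            centroids.append(seq)          # no centroid close enough: new OTU
            assignment[seq] = seq
    return centroids, assignment
```

Processing the most abundant sequences first means they become the OTU centroids, which is why the sort-by-size step precedes clustering.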
4.4. Chimera detection and removal
usearch9 -uchime2_ref, gold_db
✓ Chimeric sequences are detected and removed
Sample output:
  59Mb 100.0% Reading /scratch/DB/bio/qiime/uchime/gold.fa
  26Mb 100.0% Converting to upper case
  27Mb 100.0% Word stats
  27Mb 100.0% Alloc rows
  86Mb 100.0% Build index
  93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)

4.5. OTUs - table generation
De-dereplication and QIIME-compatible otu_table
usearch9 -usearch_global
The usearch_global command counts how many times each OTU appears in each sample and then generates a QIIME-compatible otu_table.
OTUId Dog10/1 Dog15/1 Dog16/1 Dog17/1 Dog1/1 Dog22/1 Dog24/1 Dog29/1 Dog2/1
OTU_19 2961 151 25 569 212 967 64 330 2691 257 567 374
OTU_2 14004 7549 13826 14747 8370 5715 33064 658 11497 21298 44
OTU_1 10077 29178 11913 9804 10362 33473 22356 25381 8320 13869
OTU_12 1276 586 1185 1258 1906 476 3418 128 1247 998 1510
OTU_4 5932 11319 4568 5609 8082 14859 9988 6492 6135 8157 12908

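The counting step behind the table can be sketched as follows. This is an illustration only: it assumes each mapped read has already been reduced to a (sample, OTU) pair, and the function names are hypothetical rather than part of USEARCH or QIIME.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_otu_table(hits: Iterable[Tuple[str, str]]) -> Tuple[List[str], Dict[str, List[int]]]:
    """hits: one (sample, otu_id) pair per mapped read. Returns (columns, rows)."""
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample, otu in hits:
        table[otu][sample] += 1
        samples.add(sample)
    columns = sorted(samples)
    rows = {otu: [counts.get(s, 0) for s in columns] for otu, counts in table.items()}
    return columns, rows


def format_table(columns: List[str], rows: Dict[str, List[int]]) -> str:
    """Tab-separated text with an 'OTUId' header column, as in the table above."""
    lines = ["\t".join(["OTUId"] + columns)]
    for otu in sorted(rows):
        lines.append("\t".join([otu] + [str(n) for n in rows[otu]]))
    return "\n".join(lines)
```

Samples in which an OTU was never observed get a count of 0, so every row has one value per sample column.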
5. Decontamination
5.1 Overview of the Decontamination
✓ We need to identify OTUs that may come from contamination: reagents used for sampling, DNA extraction and purification, and the environments and personnel involved in DNA extraction.
✓ This is very critical, especially in clinical samples. Why?
✓ These OTUs must be subtracted from the biological samples to retain a true representation of the OTUs from the sample of interest.
✓ To achieve this, reagents / blanks [controls] are spiked with known bacteria at the same DNA concentrations as those used in the samples under study.
