Analysing re-sequencing samples Malin Larsson - PowerPoint PPT Presentation

Analysing re-sequencing samples Malin Larsson Malin.larsson@scilifelab.se WABI / SciLifeLab

Re-sequencing Reference genome assembly ...GTGCGTAGACTGCTAGATCGAAGA... �

Re-sequencing IND 1 � IND 2 � IND 3 � IND 4 � GTAGACT � TGCGTAG � TAGACTG � AGATCGA � AGATCGG � ATCGAAG � GATCGAA � GTAGACT � GCGTAGT AGACTGC GACTGCT GATCGAA Reference genome assembly � ...GTGCGTAGACTGCTAGATCGAAGA... �

Re-sequencing IND 1 � IND 2 � IND 3 � IND 4 � GTAGACT � TGCGTAG � TAGACTG � AGATCGA � AGATCGG � ATCGAAG � GATCGAA � GTAGACT � GCGTAGT AGACTGC GACTGCT GATCGAA Reference genome assembly � ...GTGCGTAGACTGCTAGATCGAAGA... � IND 1 � IND 2 � IND 3 � IND 4 � 5 GTAGACT � 16 TGCGTAG � 24 AGTTCGA � 8 AGATCGA � 12 AGTTCGG � 6 ATCGAAG � 5 GTAGACT � 19 GTAGGCT � 3 GCGTAGT 7 AAACTGC 18 GATCGAA 2 GATCGAA

Rare variants in human

Exome sequencing in trios to detect de novo coding variants

Population genetics – speciation, adaptive evolution Darwin Finches

Population genetics – speciation, adaptive evolution Darwin Finches Heliconius Butterflies

Population genetics – speciation, adaptive evolution Darwin Finches Heliconius Butterflies Lake Victoria cechlid fishes

Paired end sequencing

Pair-end reads • Two .fastq files containing the reads are created • The order in the files are identical and naming of reads are the same with the exception of the end • The naming of reads is changing and depends on software version ID_R1_001.fastq ID_R2_001.fastq @HISEQ:100:C3MG8ACXX: @HISEQ:100:C3MG8ACXX: 5:1101:1160:2197 1:N:0:ATCACG � 5:1101:1160:2197 2:N:0:ATCACG � CAGTTGCGATGAGAGCGTTGAGAAGTATAATAGG CTTCGTCCACTTTCATTATTCCTTTCATACATG AGTTAAACTGAGTAACAGGATAAGAAATAGTGAG CTCTCCGGTTTAGGGTACTCTTGACCTGGCCTT ATATGGAAACGTTGTGGTCTGAAAGAAGATGT � TTTTCAAGACGTCCCTGACTTGATCTTGAAACG � + � + � B@CFFFFFHHHHHGJJJJJJJJJJJFHHIIIIJJ CCCFFFFFHHHHHJJJJIJJJJJJJJJJJJJJJ JIHGIIJJJJIJIJIJJJJIIJJJJJIIEIHHIJ JJJJJJJIJIJGIJHBGHHIIIJIJJJJJJJJI HGHHHHHDFFFEDDDDDCDDDCDDDDDDDCDC � JJJHFFFFFFDDDDDDDDDDDDDDDEDCCDDDD �

Paired end sequencing Unknown sequence Read 2 Read 1 Insert size

Adapter trimming Module load cutadapt � 3' Adapter When the adaptor has been read in sequencing, it is present in reads or and needs to be removed prior to mapping 5' Adapter or Read Anchored 5' adapter Adapter Removed sequence

Basic quality control - FASTQC Module load FastQC �

Genome Analysis Tool Kit (GATK) Mapping Alignment refinement Variant discovery Callset refinement 17

When in doubt, google it!

Steps in resequencing analysis Setup programs, data map reads to a reference find best placement of reads bam file realign indels remove duplicates process alignments recalibrate base quality bam file statistical algorithms identify/call variants to detect true variants vcf file

Mapping to reference genome 21

brute force TCGATCC � x � GACCTCATCGATCCCACTG �

brute force TCGATCC � ||x � GACCTCATCGATCCCACTG �

brute force TCGATCC � x � GACCTCATCGATCCCACTG �

brute force TCGATCC � ||||||| � GACCTCATCGATCCCACTG �

hash tables build an index of the reference sequence for fast access 0 5 10 15 � � GACCTCATCGATCCCACTG � seed length 7 GACCTCA � à chromosome 1, pos 0 � ACCTCAT � à chromosome 1, pos 1 � � à chromosome 1, pos 2 � CCTCATC � à chromosome 1, pos 3 � CTCATCG � � � à chromosome 1, pos 4 � TCATCGA � � � à chromosome 1, pos 5 � CATCGAT � � à chromosome 1, pos 6 � ATCGATC � � à chromosome 1, pos 7 � TCGATCC � � à chromosome 1, pos 8 � CGATCCC GATCCCA � à chromosome 1, pos 9 �

hash tables build an index of the reference sequence for fast access TCGATCC ? � 0 5 10 15 � � GACCTCATCGATCCCACTG � � � à chromosome 1, pos 0 � GACCTCA � � à chromosome 1, pos 1 � ACCTCAT CCTCATC � � à chromosome 1, pos 2 � � à chromosome 1, pos 3 � CTCATCG � � � à chromosome 1, pos 4 � TCATCGA � � � à chromosome 1, pos 5 � CATCGAT � � à chromosome 1, pos 6 � ATCGATC � � à chromosome 1, pos 7 � TCGATCC � � à chromosome 1, pos 8 � CGATCCC GATCCCA � à chromosome 1, pos 9 �

hash tables build an index of the reference sequence for fast access TCGATCC = chromosome 1, pos 7 � 0 5 10 15 � � GACCTCATCGATCCCACTG � � � à chromosome 1, pos 0 � GACCTCA � � à chromosome 1, pos 1 � ACCTCAT CCTCATC � � à chromosome 1, pos 2 � � à chromosome 1, pos 3 � CTCATCG � � � à chromosome 1, pos 4 � TCATCGA � � � à chromosome 1, pos 5 � CATCGAT � � à chromosome 1, pos 6 � ATCGATC � � à chromosome 1, pos 7 � TCGATCC � � à chromosome 1, pos 8 � CGATCCC GATCCCA � à chromosome 1, pos 9 �

Burroughs-Wheeler Aligner algorithm used in computer science for file compression original sequence can be reconstructed BWA ( module add bwa ) Burroughs-Wheeler Aligner

Input to mapping Reference genome Sample data Reference.fasta R1.fastq R2.fastq Reference.fai >Potra000002 � @HISEQ:100:C3MG8ACXX: CACGAGGTTTCATCATGGACTTGGCACCATAAAA 5:1101:1160:2197 1:N:0:ATCACG � GTTCTCTTTCATTATATTCCCTTTAGGTAAAATG CAGTTGCGATGAGAGCGTTGAGAAGTATAATAGG ATTCTCGTTCATTTGATAATTTTGTAATAACCGG AGTTAAACTGAGTAACAGGATAAGAAATAGTGAG CCTCATTCAACCCATGATCCGACTTGATGGTGAA ATATGGAAACGTTGTGGTCTGAAAGAAGATGT � TACTTGTGTAATAACTGATAATTTACTGTGATTT + � ATATAACTATCTCATAATGGTTCGTCAAAATCTT B@CFFFFFHHHHHGJJJJJJJJJJJFHHIIIIJJ TTAAAAGATAAAAAAAACCTTTATCAATTATCTA JIHGIIJJJJIJIJIJJJJIIJJJJJIIEIHHIJ TATAAATTCAAATTTGTACACATTTACTAGAAAT HGHHHHHDFFFEDDDDDCDDDCDDDDDDDCDC � TACAACTCAGCAATAAAATTGACAAAATATAAAA @HISEQ:100:C3MG8ACXX: CAGAACCGTTAAATAAGCTATTATTTATTTCATC 5:1101:1448:2164 1:N:0:ATCACG � ACAAAACATCTAAGTCAAAAATTTGACATAAGTT NAGATTGTTTGTGTGCCTAAATAAATAAATAAAT TCATCAATTTACAAACAAACACAATTTTACAAAA AAAAATGATGATGGTCTTAAAGGAATTTGAAATT TCTCAACCAAACCATAACATGTACAAATTATAAA AAGATTGAGATATTGAAAAAGCAGATGTGGTC � TATCAACAATATTGTTTGAGAAAAAAACTATAAC + � ACAAGTAAATACCAAAAAAAATACATATACTACA #1=DDFFEHHDFHHJGGIJJJJGIHIGIJJJJJI AAACAATATATAAAAAATTAACATTTTAAAATTG IJJJJIJJJFIJJF? TGTTCAAATAAAAAATTAGATTTGCTTACTTAAG FHHHIIJJIIJJIGIIJJJIJIGHGHIIJJIHGH GTGGAGAATTCTCAATAAAATTTGAATTAGAACA GHGHFFFEDEEE>CDDD �

Output from mapping

Output - SAM format HEADER SECTION � @SQ SN:17 LN:81195210 � @PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa sampe human_17_v37.fasta NA06984.ILLUMINA.low_coverage.17_1.sai NA06984.ILL � UMINA.low_coverage.17_2.sai /proj/g2016008/labs/gatk/fastq/wgs/NA06984.ILLUMINA.low_coverage.17q_1.fq /proj/g2016008/labs/ � gatk/fastq/wgs/NA06984.ILLUMINA.low_coverage.17q_2.fq � � ALIGNMENT SECTION SRR035026.5316211 83 17 43500121 15 76M = 43500094 -103 CATCTCTATCAGAATTAG � AGTAAAAGACCCCTGCCCCCAAGCAAAGGATACAAAGGAAATGAAAGTTTGAATAATA ?@@?;@@ABAB8@@<?B@B;A@@@B@@A>A@>>:<8A@@B@@@@B@@AAA@@@B@@=@ � A?@=:@?@BB@@B@@AA@ XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:17,-62767526, � 76M,0; � SRR035026.5316211 163 17 43500094 23 76M = 43500121 103 AATGTGAGAGGAAGGTTT � AACATAACACATCTCTATCAGAATTAGAGTAAAAGACCCCTGCCCCCAAGCAAAGGAT >BA@>=@?<@@AA@A?@@@@;@AAB;A?AA@A<A<A<@?>A@@A@>?3=>A;?@0>>@ � A@>@@@############ XT:A:U NM:i:0 SM:i:23 AM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:17,+62767499, � 76M,1; � SRR035022.26046929 99 17 43499955 60 76M = 43500177 298 TAAAGAGGGACACCACGT � AATGATAGAAAAGCACAATTTGTAACGAAAGAACGCTCGAAATCTGCATCCTCCTGAC @AABABAAAA?B?AA>9AABA@BA@@BBAB@@A?ABA@@@@AB?9BAB@BA?9@B@9B � BAA>B@>BA??A?@A?A> XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 � S � Read name Chr Start position Sequence Quality

Analysing re-sequencing samples Malin Larsson - PowerPoint PPT Presentation

Analysing re-sequencing samples Malin Larsson Malin.larsson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly ...GTGCGTAGACTGCTAGATCGAAGA... Re-sequencing IND 1 IND 2 IND 3 IND 4 GTAGACT TGCGTAG

Analysing re-sequencing samples Anna Johansson Anna.johansson@scilifelab.se WABI / SciLifeLab

Sequencing technology and assembly Sanger sequencing Sanger sequencing with radioactivity

Week 5 Kullmann Analysing BFS Depth-first search Depth-first search Analysing DFS

Genomics Sequencing tech Sequencing tech: next generation What do we get from sequencing? How

Analysing re-sequencing samples Anna Johansson WABI / SciLifeLab What is resequencing? You

Analysing re-sequencing samples Anna Johansson WABI /

Samples Advertising of samples and handing out samples Advertising Education and Assurance

-Samples [AB98] Hyp: domain S is a smooth curve or surface. S 1 -Samples [AB98] Hyp:

Business Statistics CONTENTS Comparing two samples Comparing two unrelated samples Comparing

Apicomplexan Genome Sequencing in Sanger Arnab Pain, The Pathogen Sequencing Unit (PSU) 2 nd

Next Next Generation Sequencing: an overview of Generation Sequencing: an overview of

Sequencing Technologies Benchtop Production-Scale Illumina: Sequencing Platforms

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

The Massive Parallel Sequencing era: "Global sequencing" Richard Christen CNRS UMR

Outline Introduction Background Progress on the implementation of the MTSF

Monitoring and analysing multilingual media reports Monitoring and analysing multilingual media

Advances in Visualizing Categorical Data Using the vcd, gnm and vcdExtra Packages in R Michael

Computing Support for the Enterprise John P. Glaser, Ph.D. Vice President and CIO Partners

1 Welcome! My name is Marilyn Denegre- Rumbin, and Im the Director of Payer and Reimbursement

Uncertain Trajectories Best practices from start to finish Alex Mancini, Pan London Lead Nurse

Individual Income Tax 2 KRS 141.010, enacted October 1, 1942 Flat 5% rate on taxable

Fully Commutative Elements and Lattice Walks Philippe Nadeau (CNRS / Universit e Lyon 1)

The R User Conference 2009 Rennes Hup et al. (Institut Curie, Paris, France) Analysis of DNA

Learning penalties for change-point detection using max-margin interval regression Presented as a

Sambuz

Useful Links

Newsletter

Mail Us

Analysing re-sequencing samples Malin Larsson - PowerPoint PPT Presentation

Analysing re-sequencing samples Malin Larsson Malin.larsson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly ...GTGCGTAGACTGCTAGATCGAAGA... Re-sequencing IND 1 IND 2 IND 3 IND 4 GTAGACT TGCGTAG

Analysing re-sequencing samples Anna Johansson Anna.johansson@scilifelab.se WABI / SciLifeLab

Sequencing technology and assembly Sanger sequencing Sanger sequencing with radioactivity

Week 5 Kullmann Analysing BFS Depth-first search Depth-first search Analysing DFS

Genomics Sequencing tech Sequencing tech: next generation What do we get from sequencing? How

Analysing re-sequencing samples Anna Johansson WABI / SciLifeLab What is resequencing? You

Analysing re-sequencing samples Anna Johansson WABI /

Samples Advertising of samples and handing out samples Advertising Education and Assurance

-Samples [AB98] Hyp: domain S is a smooth curve or surface. S 1 -Samples [AB98] Hyp:

Business Statistics CONTENTS Comparing two samples Comparing two unrelated samples Comparing

Apicomplexan Genome Sequencing in Sanger Arnab Pain, The Pathogen Sequencing Unit (PSU) 2 nd

Next Next Generation Sequencing: an overview of Generation Sequencing: an overview of

Sequencing Technologies Benchtop Production-Scale Illumina: Sequencing Platforms

Introduction to Bioinformatics Genome sequencing &amp; assembly Genome sequencing &amp; assembly

The Massive Parallel Sequencing era: &quot;Global sequencing&quot; Richard Christen CNRS UMR

Outline Introduction Background Progress on the implementation of the MTSF

Monitoring and analysing multilingual media reports Monitoring and analysing multilingual media

Advances in Visualizing Categorical Data Using the vcd, gnm and vcdExtra Packages in R Michael

Computing Support for the Enterprise John P. Glaser, Ph.D. Vice President and CIO Partners

1 Welcome! My name is Marilyn Denegre- Rumbin, and Im the Director of Payer and Reimbursement

Uncertain Trajectories Best practices from start to finish Alex Mancini, Pan London Lead Nurse

Individual Income Tax 2 KRS 141.010, enacted October 1, 1942 Flat 5% rate on taxable

Fully Commutative Elements and Lattice Walks Philippe Nadeau (CNRS / Universit e Lyon 1)

The R User Conference 2009 Rennes Hup et al. (Institut Curie, Paris, France) Analysis of DNA

Learning penalties for change-point detection using max-margin interval regression Presented as a

Sambuz

Useful Links

Newsletter

Mail Us

Introduction to Bioinformatics Genome sequencing & assembly Genome sequencing & assembly

The Massive Parallel Sequencing era: "Global sequencing" Richard Christen CNRS UMR