Where are the indels coming from?
• Allowing a small number of indels, we assigned 3,748,614 (over 99.5%) of the alignments to genomic fragments flanked by pairs of identical 4-cutter sites
• Only 1% of the fragments attributed to the sticky-end 4-cutter, GATC, contained indels, whereas the three blunt-end 4-cutters (GTAC, GGCC, AGCT) had much higher indel rates: 29.5-54%
• If a base is lost at the sticky end, ligation is significantly compromised:
  double stranded vector with ligated insert
  5’ GATC GATC 3’
  3’ CTAG CTAG 5’
• However, a lost bp at the blunt end has no effect on ligation:
  double stranded vector with ligated insert
  5’ CC GG 3’
  3’ GG CC 5’
• Consistently, GATC libraries gave much lower cloning efficiencies than the 3 blunt cutters
• The indels are most likely not generated during sequencing, but then why haven’t we observed them when using Sanger sequencing?
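The assignment of an alignment to a fragment flanked by identical 4-cutter sites can be sketched roughly as follows. This is only an illustration, not the script used for the talk: the function names, the `tol` tolerance standing in for "a small number of indels", and the convention that a fragment spans from the start of one site to the end of the next are all assumptions.

```python
import re

# The four recognition sites named on the slide.
FOUR_CUTTERS = ("GATC", "GTAC", "GGCC", "AGCT")


def site_positions(genome, site):
    """0-based start of every (possibly overlapping) occurrence of site."""
    return [m.start() for m in re.finditer(f"(?={site})", genome)]


def flanking_site(genome, start, end, tol=2):
    """Return the site found near both ends of genome[start:end], else None.

    Assumed convention: the fragment runs from the first base of one site
    occurrence to the last base of a later occurrence of the same site;
    tol is the slack (in bases) allowed at each end.
    """
    for site in FOUR_CUTTERS:
        pos = site_positions(genome, site)
        left = [p for p in pos if abs(p - start) <= tol]
        right = [p for p in pos if abs(p + len(site) - end) <= tol]
        # Require two different occurrences, the left one upstream.
        if any(a < b for a in left for b in right):
            return site
    return None
```

With a per-fragment site label in hand, the indel rate per enzyme is a simple tally of fragments with and without indels for each label.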
contigs and functional cores
• Filtering out low quality reads and genomic fragments supported by a single read pair, we ended up with 720 overlapping genomic fragments
• Assembled into 366 contigs, each containing 1-5 fragments
• The minimal functional element, or the “functional core”, is in principle better approximated by the intersection of all the contig’s fragments
• However, due to erroneous (FP) fragments the intersection might be empty
• A dynamic programming script finds the smallest number of fragments that need to be omitted so that the intersection of the remaining fragments is at least 50bp long
• Median lengths in bps: fragment 702, contig 1002, core 387
• While there are cases of FPs or non-functional cores, in all such cases that we checked the contig is also non-functional, so the cores seem well defined
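The core search described above can be illustrated with a small sketch. It is not the dynamic programming script from the talk: it uses a plain enumeration over candidate left boundaries, which gives the same "fewest fragments omitted" answer for a single contig. The name `best_core` and the half-open `(start, end)` representation are assumptions.

```python
def best_core(fragments, min_len=50):
    """Keep as many fragments as possible with a common overlap >= min_len.

    fragments: list of (start, end) half-open intervals on one contig.
    Returns (core, n_omitted); core is (start, end), or None when no
    fragment is at least min_len long.
    """
    best = None
    # The left edge of any intersection is the start of some fragment,
    # so it is enough to try each distinct start as the left boundary.
    for left in sorted({s for s, _ in fragments}):
        kept = [(s, e) for s, e in fragments
                if s <= left and e - left >= min_len]
        if kept and (best is None or len(kept) > len(best)):
            best = kept
    if best is None:
        return None, len(fragments)
    core = (max(s for s, _ in best), min(e for _, e in best))
    return core, len(fragments) - len(best)
```

For example, with three overlapping fragments and one stray fragment at the far end of the contig, the stray one is omitted and the core is the overlap of the other three.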
Comparison with gold standard
Ivan Liachko
• Able to identify ~85% of S. cerevisiae ARSs in a single experiment
• Fairly low FP rate: ~12.5%
• Confirmed >50 likely ARSs and discovered a handful of new ones
Which part of the ARS is necessary for function?
[Figure: ARS constructs tested: full (231bp), -40L, -80L, -101L, -120L, -40R, -80R, -120R, -40L-40R, -80L-40R, -120L-40R, -101L-40R (min)]
Ivan Liachko
miniARS-seq: Defining essential ARS regions
[Workflow figure, plasmids drawn as ARS + URA3. ARS-seq: construct genomic libraries in ARS-less vector → ARS screen on selective media → purify and sequence ARS plasmids. miniARS-seq: amplify ARS-seq inserts → shear and clone ARS sub-fragments → ARS screen → isolate and sequence miniARS plasmids]
Ivan Liachko
Works great except...
• The miniARS-seq experiment created quite a few inexplicable FPs
• Additional technical and biological replicates (4 sequencing runs altogether) did not solve the problem:
• We observed clearly non-functional fragments with repeated substantial read counts
• How can that be?
• We’ll get back to it, but in the meantime, what’s with all the reads we can’t map?
Persistent mapping
[Diagram: 5’ arsseq vector, ARSseq insert, 3’ arsseq vector; DNAseI cuts release miniARS inserts, each cloned between the 5’ mini vector (S1 primer) and the 3’ mini vector (S2 primer) and sequenced as an S1 read and an S2 read]
Stats for 1 of 4 runs: 9M 101bp read pairs
• 3.6M aligned by BT
• Trim 3’ ends of reads: 300K pairs aligned by BT
• Trim 5’ prefixes of reads matching 5’ suffixes of vector: 400K pairs aligned by BT
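The last rescue step, trimming a read prefix that matches a suffix of the vector, might look something like the sketch below. It assumes exact matching and a hypothetical `min_overlap` cutoff; the talk does not say whether mismatches were tolerated or what minimum length was used, and a real pipeline would trim the quality string by the same amount.

```python
def trim_vector_prefix(read, vector, min_overlap=4):
    """Strip the longest read prefix that equals a suffix of vector.

    Exact matches only; prefixes shorter than min_overlap are ignored
    so that chance matches of a few bases do not shorten the read.
    Returns the read unchanged when no such prefix exists.
    """
    longest = min(len(read), len(vector))
    for k in range(longest, min_overlap - 1, -1):
        if vector.endswith(read[:k]):
            return read[k:]
    return read
```

The trimmed read can then be passed back to the aligner together with its mate.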
Persistence pays
• Still, BT fails to align 4.7M (52%) of the reads, many with good quality scores
• 2M of those read pairs turn out to be confirmed double inserts: the two reads were mapped to distinct parts of the concatenated genome
  [Diagram: 5’ mini vector, S1 primer, miniARS insert 1, miniARS insert 2, S2 primer, 3’ mini vector; S1 read and S2 read]
• And probably quite a few more are of this type:
  [Diagram: a second double-insert layout with the same elements]
• But we didn’t pursue this further, as at this point we realized we had a solution to the more important question of the inexplicable FP mini inserts
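One way to picture the "confirmed double insert" call is to align each mate on its own and then compare the two hits. The sketch below is an assumption-laden illustration: the `Hit` record, the `max_insert` threshold and the rule "opposite strands and close together means one insert" are not taken from the talk.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    pos: int     # 0-based coordinate in the concatenated genome
    strand: str  # "+" or "-"


def classify_pair(s1: Optional[Hit], s2: Optional[Hit],
                  max_insert: int = 1000) -> str:
    """Label a read pair from the independent alignments of its mates."""
    if s1 is None or s2 is None:
        return "unmapped"
    # Mates from one insert face each other and lie close together.
    if s1.strand != s2.strand and abs(s1.pos - s2.pos) <= max_insert:
        return "single_insert"
    # Both mates map, but to distinct parts of the genome.
    return "double_insert"
```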
Silent double insert may mask the functional and reveal only the non-functional insert
[Diagram: ARSseq insert 1 (with ACS), between the 5’ and 3’ arsseq vector with S1 and S2 primers, is cut by DNAseI into a non-functional miniARS insert. ARSseq insert 2 (with ACS), drawn with the primers in the order S2, S1, is cut by DNAseI into a functional miniARS insert that contains the ACS. Both end up in one plasmid: 5’ miniARS vector, S1 primer, non-functional miniARS insert, functional miniARS insert (ACS), S2 primer, 3’ miniARS vector]
This is the part that will get sequenced
Nice hypothesis, but where’s the evidence?
• It’s all circumstantial
• There is a substantial number of observed double inserts: 22-24% of all reads
• We observe quite a few miniARS inserts starting with the ARS-seq vector (when it’s not long enough to serve as a primer for the mini sequencing reaction)
• Of the 611 miniARS fragments sharing an end with the parent ARS-seq insert, 465 also share the arsseq orientation relative to the vector (p-value < 2.2e-16)
• Filtering out the mini fragments that share an end with the parent ARS-seq insert removes most of the suspected FPs
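The filter in the last bullet can be sketched as below. The function names and the `tol` parameter are hypothetical, and for simplicity every ARS-seq fragment is treated as a potential parent rather than only the one containing the mini fragment.

```python
def shares_end(mini, parent, tol=0):
    """True if mini and parent start or end at (nearly) the same base."""
    return (abs(mini[0] - parent[0]) <= tol
            or abs(mini[1] - parent[1]) <= tol)


def drop_suspected_double_inserts(minis, parents, tol=0):
    """Keep only mini fragments that share no end with any parent.

    minis, parents: lists of (start, end) genomic intervals.
    """
    return [m for m in minis
            if not any(shares_end(m, p, tol) for p in parents)]
```

A mini fragment produced by two DNAseI cuts inside the parent insert shares neither end with it and so survives the filter.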
miniARS contigs and inferred cores
• After filtering out suspected double inserts, we assembled the remaining 12,338 miniARS genomic fragments (median 148bp) into 181 unique contigs
• Defining the cores using the same procedure as for arsseq was not optimal: an average of 68 miniARS vs. 2 ARSseq fragments per contig
• We added a statistical aspect to the combinatorial approach:
• A mini contig’s core is defined essentially by dropping the 5% rightmost fragment starts as well as the 5% leftmost fragment ends
• If it is shorter than 50bp, the DP approach removes additional fragments
• The contig median length is 230 bp, whereas the core’s is 92 bp
• We have no evidence of incorrectly defined cores
• FP rate is estimated at 3.9%: 8 of 181 contigs
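The percentile-style core can be sketched as follows. This is a guess at what "essentially" covers: the name `trimmed_core`, the integer rounding of the 5% and the half-open `(start, end)` intervals are assumptions, and the 50bp fallback to the combinatorial step is left out.

```python
def trimmed_core(fragments, drop_pct=5):
    """Core of a mini contig after trimming extreme fragment ends.

    Drops the drop_pct% rightmost starts and the drop_pct% leftmost
    ends, then returns (core_start, core_end). The result may be empty
    or inverted (core_end <= core_start) if fragments barely overlap.
    """
    n = len(fragments)
    k = n * drop_pct // 100          # number of values dropped per side
    starts = sorted(s for s, _ in fragments)
    ends = sorted(e for _, e in fragments)
    core_start = starts[n - 1 - k]   # rightmost start that is kept
    core_end = ends[k]               # leftmost end that is kept
    return core_start, core_end
```

With 20 fragments, one start and one end are dropped, so a single stray fragment no longer collapses the core.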
[Figure: tracks ACS, ARS-seq, miniARS-seq and OriDB around ARS419, with genes YOS9, TGL2 and UBC5; coordinates 566000-569000]
To boldly go where others have gone before...
71 “left skewed” mini cores vs. 8 “right skewed” mini cores
[Diagram: for each class, the ACS drawn within the core with a ≤ 5bp mark]
• 2-sided binomial test p-value = 9.7e-14
• More information 5’ than 3’ of the oriented ACS
• Still, few of the 8 are functional: is the 33bp ACS sufficient for ARS function?
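The quoted p-value can be reproduced with an exact two-sided binomial test, assuming the null hypothesis is that left and right skew are equally likely (probability 0.5 each). The function name is hypothetical, and the tail is simply doubled, which is valid here only because the null distribution is symmetric.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    m = min(k, n - k)                       # size of the smaller class
    tail = sum(comb(n, i) for i in range(m + 1))
    # Doubling the smaller tail covers both extremes of a symmetric null.
    return min(1.0, 2 * tail / 2 ** n)


# 71 left-skewed vs. 8 right-skewed mini cores, 79 in total.
p_value = two_sided_binomial_p(8, 71 + 8)
print(f"{p_value:.2g}")
```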