Understanding Nothing: Zeros in scRNASeq Tallulah Andrews, 27 Sept - - PowerPoint PPT Presentation

understanding nothing zeros in scrnaseq
SMART_READER_LITE
LIVE PREVIEW

Understanding Nothing: Zeros in scRNASeq Tallulah Andrews, 27 Sept - - PowerPoint PPT Presentation

Understanding Nothing: Zeros in scRNASeq Tallulah Andrews, 27 Sept 2016 Single-cell vs bulk RNASeq Cell Library Expression RNA cDNA Amplification Sequencing Isolation Preparation Matrix ATTCG 0 10 0 20 TCACT 13 2 0 8 TCGGA 11


slide-1
SLIDE 1

Understanding Nothing: Zeros in scRNASeq

Tallulah Andrews, 27 Sept 2016

slide-2
SLIDE 2

Single-cell vs bulk RNASeq

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing Expression Matrix Computational Analysis Enables:

  • Unbiased cell-type identification/tissue composition
  • Elucidation of cell-fate decisions & development
  • Detection of heterogeneity of cellular responses
  • Investigation of stochastic gene expression
slide-3
SLIDE 3

Single-cell vs bulk RNASeq

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing Expression Matrix Bulk RNASeq: 100 ng Single cell RNASeq: ~10 pg Computational Analysis

slide-4
SLIDE 4

Zeros Dominate scRNASeq

Dataset Type No. Cells No. Genes Prop Zero Buettner mouse ESCs 279 17,231 51.2% Shalek mouse bone marrow 324 12,474 66.4% Deng mouse embryo 255 17,406 50.2% Usoskin mouse neuron 530 15,585 72.5% Kirschner mouse ESCs 2,448 23,729 62.5% Linnarsson mouse brain 2,542 17,867 76.9% Pollen human neural 301 19,624 60.3% Zhong mouse embryo 49 20,558 38.0%

*Cells with > 2,000 detected genes **Genes seen in >3 cells

slide-5
SLIDE 5

Source of Zeros

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing Expression Matrix Computational Analysis

slide-6
SLIDE 6

Source of Zeros

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing Expression Matrix Computational Analysis Under ~1 million reads/cell Svensson et al. (2016)

slide-7
SLIDE 7

Source of Zeros

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing Expression Matrix Computational Analysis ~66% Efficiency >95% Efficiency Reiter et al. (2011) & Bengtsson et al. (2008)

slide-8
SLIDE 8

RT failure propagates downstream

n0 = 1 n0 = 5

slide-9
SLIDE 9

Reverse Transcription = Michaelis-Menten

mRNA DNA dNTP Reverse Transcriptase To model probability: Vmax = 1 Detection probability

slide-10
SLIDE 10

MM vs Other Models

Michaelis-Menten Modelling of Dropouts (M3Drop)

  • Pdropout = 1- [s]/(K+[s])
  • For Deng: K = 9.5

log10(expression)

slide-11
SLIDE 11

MM vs Other Models

log10(expression)

Michaelis-Menten Modelling of Dropouts (M3Drop)

  • Pdropout = 1- [s]/(K+[s])
  • For Deng: K = 9.5

Zero Inflated Factor Analysis (ZIFA)

  • Dimensionality Reduction for scRNASeq
  • Pdropout = e-ƛ[s][s]
  • For Deng: λ = 0.0075
slide-12
SLIDE 12

log10(expression)

Michaelis-Menten Modelling of Dropouts (M3Drop)

  • Pdropout = 1- [s]/(K+[s])
  • For Deng: K = 9.5

Zero Inflated Factor Analysis (ZIFA)

  • Dimensionality Reduction for scRNASeq
  • Pdropout = e-ƛ[s][s]
  • For Deng: λ = 0.0075

Single Cell Differential Expression (SCDE)

  • Pdropout = 1/(1+e-(a+b*log([s])))
  • For Deng: a = 1.5, b = -0.75

MM vs Other Models

slide-13
SLIDE 13

Michaelis-Menten fits diverse datasets.

Buettner - CPM Linnarsson - UMI CPM Shalek - FPKM

slide-14
SLIDE 14

Error

Michaelis-Menten fits diverse datasets.

M3Drop SCDE ZIFA

slide-15
SLIDE 15

Differentially Expressed Genes are Outliers

Dropout Rate Expression Average across mixture

P2

Log Expression Dropout Rate

P1

(P1+P2)

2

slide-16
SLIDE 16

Outlier/DE gene detection

Michaelis-Menten: Pdropout = 1- S/(K+S) Rearrange to solve for K: K = P / (1-P) * S

1. Calculate Kj for each gene 2. Propagate errors in estimates for S (mean expression) and P (observed dropout rate) to get error for Kj 3. Estimate error of global KM 4. Test whether Kj is significantly larger than KM fit across all genes using a Z-test combining errors of (2) & (3)

slide-17
SLIDE 17

Highly Variable Genes

In general: f(variance) = g(mean)

1. Fit a relationship between variance and mean expression

a. May use all genes or only spike-ins in fitting

2. Identify points above this relationship Brennecke et al. (2013) : 1. CV2 = a1/μ + α0 2. Significant outliers detected using 2-test

slide-18
SLIDE 18

DE Simulations - Dropouts vs Variance.

slide-19
SLIDE 19

DE Simulations - Dropouts vs Variance.

μ = 100, n = 100

slide-20
SLIDE 20

Deng

Applying M3Drop to Early Mouse Development

slide-21
SLIDE 21

Identification of TE and ICM

slide-22
SLIDE 22

What are outliers to the left?

Buettner - CPM Log Expression Dropout Rate DE Genes Highly Variable Genes

Under measured expression Mismapping reads

slide-23
SLIDE 23

Genome

Processed Pseudogenes = True Negatives

Processed mRNA cDNA Randomly inserted into genome

  • Identical sequence to original transcript
  • Lacks introns
  • Lacks promoters & regulatory sequences
  • Assumed to not be transcribed
  • >3,000 identified in the mouse genome
  • nly 150 have confirmed expression
slide-24
SLIDE 24

Processed Pseudogenes - Mismapping Reads

Processed Pseudogenes Left shifted by 1.4 (p ~ 0)

Gene Processed Pseudogene

Truth Observed

Gene Processed Pseudogene

~4% 1% sequencing error rate x 100bp reads: 4% of reads have 3+ sequencing errors

slide-25
SLIDE 25

Under-Measured Expression

Duplication node: Mus musculus Left shifted by 0.66 (p < 10-40)

multimapping reads = under counting

CDS < 300 n.t. Left shifted by 0.21 (p < 10-45)

fewer unique fragments = fewer unique reads Paralogs Short Genes

slide-26
SLIDE 26

Tophat2 maps more reads to processed pseudogenes

STAR Tophat2 Kallisto

slide-27
SLIDE 27

Unique Molecular Identifiers (UMIs)

ATTCG TCACT TCGGA

0 10 0 20 13 2 0 8 11 30 0 0

Cell Isolation RNA cDNA Amplification Library Preparation Sequencing UMI count Matrix Enables:

  • Correction for PCR duplicates (amplification noise)
slide-28
SLIDE 28

None of the proposed models fit corrected UMIs

slide-29
SLIDE 29

Cell-specific detection rates obscure true relationship

Downsample to 2122 UMIs/cell Saturation of Detected genes

p(0) = e-λ λ = mean gene expression * a

slide-30
SLIDE 30

The PoissonUMIs Model

Mij ~ Poisson(λ) λ = mi*mj*total*α Mij = Molecules of gene j in cell i mi = proportion of molecules in cell i mj = proportion of molecules for gene j total = total detected molecules α = scaling factor

Account for different counting methods

slide-31
SLIDE 31

Poisson model accounting for differences in read depth

α fixed at 1 α fixed at 1

slide-32
SLIDE 32

Fitted alpha reflects quantification method

α = 0.90 α = 0.64 Unique UMIs Corrected UMIs Reads α = 0.016

slide-33
SLIDE 33

Fitting the model to other UMI datasets

Linnarsson α = 0.65 Kirschner α = 0.90

slide-34
SLIDE 34

Fitting the model to other UMI datasets

Linnarsson α = 0.65 Kirschner α = 0.90 Removed singleton UMIs Corrected for 2 mismatches

slide-35
SLIDE 35

Summary

Amplification noise

slide-36
SLIDE 36

Summary

Amplification noise Mismapping / Miscounting

slide-37
SLIDE 37

Summary

Amplification noise Differential Expression Mismapping / Miscounting

slide-38
SLIDE 38

Acknowledgements

Wellcome Trust Sanger Institute Martin Hemberg Vladimir Kiselev EMBL Rome Christophe Lancrin Isabelle Bergiers Availability M3Drop : https://github.com/tallulandrews/M3Drop PoissonUMIs: https://github.com/tallulandrews/PoissonUMIs