
GeneQC Statistical Model



  1. GeneQC Statistical Model

  2. General Idea • Reads can be mapped to multiple gene loci • This leads to varying degrees of mapping uncertainty • Potentially causes issues with inferences based on read counts, such as differentially expressed genes, co-expression patterns, and various network analyses

  3. Options • Exclude ambiguous reads • Multiple assignment • Random assignment • Probabilistic assignment • Only considering local information

  4. Co-expressed Genes • Co-expressed genes provide an additional level of information • Global data allow a more solid statistical evaluation

  5. Goal • Create a statistically sound model for the assignment of ambiguous reads • Use co-expression of genes • Develop a method that produces a p-value or probability score for each ambiguous read assignment • Provide a p-value signifying the confidence of each gene’s read count

  6. Previous Publications • Faulkner, G.J., et al., A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics, 2008. 91(3): p. 281-288. • Hashimoto, T., et al., Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, 2009. 25(19): p. 2613-2614. • Wang, J., Huda, A., Lunyak, V.V., & Jordan, I.K., A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics, 2010. 26(20): p. 2501-2508.

  7. Overall Direction • Assign all unambiguous reads • Use co-expression information of unambiguous reads to make first probabilistic assignment of ambiguous reads • Based on assignments, recalculate probabilities for ambiguous reads • Continue iterative procedure until no/minimal changes occur

  8. Additional parameters • Similarity between a given read and each potential gene locus (differences are generally very minute) • Co-expression rate between each candidate gene and its co-expressed genes

  9. Concerns & Limitations • Requires accurate co-expression information • Limited sample size of co-expression information could skew the probability distribution • Potentially highly computationally intensive • The iteration may converge to a local optimum • Does not currently consider dependence between read assignments

  10. Our Future Plans • Collect test data to verify increased performance using the statistical model • Run the model with various validated probability assumptions (normal, Poisson, etc.) • Develop an R package implementing the statistical model
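
The iterative procedure outlined under "Overall Direction" can be illustrated with a minimal Python sketch. This is not the GeneQC model itself: the slides do not specify a weighting formula, so the sketch swaps in a simple EM-style proportional reweighting. The function names (`support`, `assign_ambiguous`), the blending weight `alpha`, the `pseudocount`, and the input layout (a dict of unambiguous counts, a list of candidate-gene lists, and a dict of co-expressed partners) are all assumptions made for illustration.

```python
from collections import defaultdict


def support(gene, counts, partners, alpha=0.5):
    """Evidence for a gene: its own count blended with the mean count
    of its co-expressed partners (alpha is a hypothetical blending weight)."""
    own = counts.get(gene, 0.0)
    neighbours = partners.get(gene, [])
    if not neighbours:
        return own
    mean_partner = sum(counts.get(p, 0.0) for p in neighbours) / len(neighbours)
    return (1 - alpha) * own + alpha * mean_partner


def assign_ambiguous(unambiguous, ambiguous, partners, alpha=0.5,
                     pseudocount=1e-6, tol=1e-6, max_iter=100):
    """Iteratively assign ambiguous reads to candidate genes.

    unambiguous: dict gene -> count of uniquely mapped reads
    ambiguous:   list of candidate-gene lists, one per ambiguous read
    partners:    dict gene -> list of co-expressed genes (assumed given)
    Returns (per-read probability dicts, expected counts per gene).
    """
    # Step 1: start from the unambiguous reads only.
    counts = dict(unambiguous)
    probs = []
    for _ in range(max_iter):
        # Step 2: probabilistic assignment of each ambiguous read.
        new_probs = []
        for candidates in ambiguous:
            weights = [support(g, counts, partners, alpha) + pseudocount
                       for g in candidates]
            total = sum(weights)
            new_probs.append({g: w / total for g, w in zip(candidates, weights)})
        # Step 3: recompute expected counts from the new assignment.
        new_counts = defaultdict(float, unambiguous)
        for read_probs in new_probs:
            for g, p in read_probs.items():
                new_counts[g] += p
        # Step 4: stop once the counts no longer change appreciably.
        genes = set(counts) | set(new_counts)
        delta = max((abs(new_counts.get(g, 0.0) - counts.get(g, 0.0))
                     for g in genes), default=0.0)
        counts, probs = dict(new_counts), new_probs
        if delta < tol:
            break
    return probs, counts


if __name__ == "__main__":
    # Toy data: ten reads that map equally well to genes A and B.
    unambiguous = {"A": 90, "B": 10, "C": 80}
    partners = {"A": ["C"], "B": []}
    ambiguous = [["A", "B"]] * 10
    probs, counts = assign_ambiguous(unambiguous, ambiguous, partners)
    print(probs[0])
    print(counts)
```

In this toy setup, gene A has both a higher unambiguous count and a highly expressed co-expression partner, so each ambiguous read receives a larger probability for A than for B. Because the update is a fixed-point iteration, it is subject to the local-optimum concern noted in the slides, and it treats reads independently.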
