Using Disco and MapReduce to study mRNA complexity
Dan Williams
SciPy 2011 Lightning Talk
7/14/2011 | Life Technologies Proprietary & Confidential
Disco
• MapReduce framework written in Python and Erlang
  − useful for dealing with massive data
• Users specify map and reduce operations as Python functions, then chain them together to get stuff done
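To make "map and reduce as Python functions" concrete, here is a minimal local sketch. It does not call Disco itself: `run_local` is a hypothetical single-process stand-in for a job runner, and the function names and toy inputs are illustrative assumptions. The `(record, params)` signature only loosely mirrors the style of Disco's word-count tutorial.

```python
from itertools import groupby
from operator import itemgetter


def count_map(line, params=None):
    """Map: emit (nucleotide, 1) for every base in one input record."""
    for base in line.strip().upper():
        yield base, 1


def total_map(pair, params=None):
    """Map for a second stage: re-key every (base, count) pair to 'total'."""
    _, count = pair
    yield "total", count


def sum_reduce(pairs, params=None):
    """Reduce: sum the values that share a key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_local(inputs, map_fn, reduce_fn):
    """Hypothetical stand-in for a MapReduce job, run in one process."""
    mapped = (pair for record in inputs for pair in map_fn(record))
    return list(reduce_fn(mapped))


# Chain two stages: the output of the first job is the input of the second.
stage1 = run_local(["ACGT", "AAGT"], count_map, sum_reduce)
stage2 = run_local(stage1, total_map, sum_reduce)
print(stage1)  # [('A', 3), ('C', 1), ('G', 2), ('T', 2)]
print(stage2)  # [('total', 8)]
```

On a real cluster the same map and reduce functions would be handed to the framework, which distributes the records across workers instead of looping over them locally.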
mRNA molecules contain three distinct regions (5′ UTR, coding sequence, 3′ UTR):

AAATGACGACAACGGTGAGGGTTCTCGGGCGGGGCCTGGGACAGGCAGCTCCGGGGTCCGCGGTTTCACATCGGAAACAAAACAGCGG
CTGGTCTGGAAGGAACCTGAGCTACGAGCCGCGGCGGCAGCGGGGCGGCGGGGAAGCGTATACCTAATCTGGGAGCCTGCAAGTGACA
ACAGCCTTTGCGGTCCTTAGACAGCTTGGCCTGGAGGAGAACACATGAAAGAAAGAACCTCAAGAGGCTTTGTTTTCTGTGAAACAGT
ATTTCTATACAGTTGCTCCAATGACAGAGTTACCTGCACCGTTGTCCTACTTCCAGAATGCACAGATGTCTGAGGACAACCACCTGAG
CAATACTGTACGTAGCCAGAATGACAATAGAGAACGGCAGGAGCACAACGACAGACGGAGCCTTGGCCACCCTGAGCCATTATCTAAT
GGACGACCCCAGGGTAACTCCCGGCAGGTGGTGGAGCAAGATGAGGAAGAAGATGAGGAGCTGACATTGAAATATGGCGCCAAGCATG
TGATCATGCTCTTTGTCCCTGTGACTCTCTGCATGGTGGTGGTCGTGGCTACCATTAAGTCAGTCAGCTTTTATACCCGGAAGGATGG
GCAGCTAATCTATACCCCATTCACAGAAGATACCGAGACTGTGGGCCAGAGAGCCCTGCACTCAATTCTGAATGCTGCCATCATGATC
AGTGTCATTGTTGTCATGACTATCCTCCTGGTGGTTCTGTATAAATACAGGTGCTATAAGGTCATCCATGCCTGGCTTATTATATCAT
CTCTATTGTTGCTGTTCTTTTTTTCATTCATTTACTTGGGGGAAGTGTTTAAAACCTATAACGTTGCTGTGGACTACATTACTGTTGC
ACTCCTGATCTGGAATTTTGGTGTGGTGGGAATGATTTCCATTCACTGGAAAGGTCCACTTCGACTCCAGCAGGCATATCTCATTATG
ATTAGTGCCCTCATGGCCCTGGTGTTTATCAAGTACCTCCCTGAATGGACTGCGTGGCTCATCTTGGCTGTGATTTCAGTATATGATT
TAGTGGCTGTTTTGTGTCCGAAAGGTCCACTTCGTATGCTGGTTGAAACAGCTCAGGAGAGAAATGAAACGCTTTTTCCAGCTCTCAT
TTACTCCTCAACAATGGTGTGGTTGGTGAATATGGCAGAAGGAGACCCGGAAGCTCAAAGGAGAGTATCCAAAAATTCCAAGTATAAT
GCAGAAAGCACAGAAAGGGAGTCACAAGACACTGTTGCAGAGAATGATGATGGCGGGTTCAGTGAGGAATGGGAAGCCCAGAGGGACA
GTCATCTAGGGCCTCATCGCTCTACACCTGAGTCACGAGCTGCTGTCCAGGAACTTTCCAGCAGTATCCTCGCTGGTGAAGACCCAGA
GGAAAGGGGAGTAAAACTTGGATTGGGAGATTTCATTTTCTACAGTGTTCTGGTTGGTAAAGCCTCAGCAACAGCCAGTGGAGACTGG
AACACAACCATAGCCTGTTTCGTAGCCATATTAATTGGTTTGTGCCTTACATTATTACTCCTTGCCATTTTCAAGAAAGCATTGCCAG
CTCTTCCAATCTCCATCACCTTTGGG

Research question: Do the three mRNA regions generally differ in information content?
Method:
• Calculate the Shannon entropy of each 21-nucleotide segment of each mRNA from a well-known database.
• Group results by region and compare.
• MapReduce with Disco speeds up the computation (across ~30k mRNA sequences).
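The per-segment calculation can be sketched as follows. The slide does not say how the entropy was estimated or whether segments overlap, so this sketch assumes entropy over single-nucleotide frequencies within each segment and a sliding window with step 1. The names `shannon_entropy` and `kmers` are illustrative.

```python
import math
from collections import Counter

K = 21  # segment length given on the slide


def shannon_entropy(segment):
    """Shannon entropy in bits per nucleotide, from base frequencies.

    Assumption: probabilities are the single-base frequencies inside
    the segment itself; the talk may have used a different estimator.
    """
    total = len(segment)
    counts = Counter(segment)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def kmers(sequence, k=K):
    """Yield every overlapping k-nucleotide window (assumed step of 1)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# A homopolymer carries no information; four equally frequent bases give 2 bits.
print(shannon_entropy("A" * K))  # 0.0
print(shannon_entropy("ACGT"))   # 2.0

# Toy input (not from the talk): entropy of each 21-mer in a 24-base sequence.
toy = "ACGTACGTACGTAAAAAAAAAAAA"
print([round(shannon_entropy(s), 3) for s in kmers(toy)])
```

Under these assumptions the value ranges from 0 bits (one base repeated) to 2 bits (all four bases equally frequent), which gives the regions a common scale for comparison.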
• Map 21-mer segments and regions to 1
• Reduce to remove duplicates
• Map Shannon entropy of 21-mer segment to region
• Reduce to get a boxplot for each region
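The four steps above can be sketched as two chained map/reduce stages in plain Python. This is a local illustration under stated assumptions, not the talk's Disco job: the `(region, sequence)` input format, the region labels, and all function names are hypothetical, and the final reduce returns only min, median, and max as a simplified stand-in for the boxplot.

```python
import math
from collections import Counter, defaultdict
from statistics import median

K = 21  # segment length given on the slide


def entropy(segment):
    """Shannon entropy (bits) from single-base frequencies (an assumption)."""
    total = len(segment)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(segment).values())


def stage1_map(record):
    """Emit ((segment, region), 1) for every 21-mer of a labelled sequence."""
    region, sequence = record
    for i in range(len(sequence) - K + 1):
        yield (sequence[i:i + K], region), 1


def stage1_reduce(pairs):
    """Remove duplicates: keep each (segment, region) key exactly once."""
    for key in sorted({key for key, _ in pairs}):
        yield key


def stage2_map(key):
    """Emit (region, entropy of the segment)."""
    segment, region = key
    yield region, entropy(segment)


def stage2_reduce(pairs):
    """Group entropies by region and summarise them (boxplot stand-in)."""
    by_region = defaultdict(list)
    for region, value in pairs:
        by_region[region].append(value)
    for region, values in sorted(by_region.items()):
        yield region, (min(values), median(values), max(values))


# Toy records (not from the talk); the second record duplicates the first.
records = [("CDS", "ACGT" * 6), ("CDS", "ACGT" * 6), ("3UTR", "A" * 25)]

unique = list(stage1_reduce(p for r in records for p in stage1_map(r)))
summary = dict(stage2_reduce(p for k in unique for p in stage2_map(k)))
print(len(unique))  # 5
print(summary)
```

Deduplicating before the entropy step means a segment repeated across many transcripts contributes once per region, so highly duplicated sequences do not dominate the per-region distribution.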
Thank you!