
Efficient processing of Hi-C data and application to cancer



  1. Efficient processing of Hi-C data and application to cancer. Nicolas Servant, PhD. Institut Curie, INSERM U900, Mines ParisTech, PSL-Research University. 4 December 2019, Hi-C days, Toulouse

  2. Spatial organization of the genome. How are 2 meters of DNA packed into a 10 µm diameter nucleus? Nucleosome (histone/DNA), H1 histone, chromatin fibre. Adapted from Annunziato et al. 2008

  3. Different levels of spatial organization. Nucleus

  4. Hi-C captures the chromatin conformation within the nucleus. Cells population. Lieberman-Aiden et al. 2009, Rao et al. 2014. [Figure: whole genome map (chr1 to chr6 …) and intrachromosomal contact maps of chr6 at different resolutions; contact frequencies from low to high; each contact between bins A, B, C adds +1 to the matching cell, for a given bin size]
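
The "+1 per bin pair" idea on this slide can be sketched in a few lines. This is only an illustration, not the speaker's code: the function name `bin_contacts`, the toy coordinates and the single-chromosome restriction are all assumptions made for the example.

```python
from collections import Counter

def bin_contacts(pairs, bin_size):
    """Count contacts per pair of genomic bins on one chromosome.

    pairs: iterable of (pos_a, pos_b), 0-based positions of the two read mates.
    Returns a Counter keyed by (bin_i, bin_j) with bin_i <= bin_j.
    """
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Integer division maps a position to its bin index.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1  # each read pair adds +1 to one cell
    return counts

# Hypothetical read pairs, binned at two resolutions.
pairs = [(1_200, 48_000), (47_500, 900), (120_000, 130_000), (5_000, 7_000)]
coarse = bin_contacts(pairs, bin_size=100_000)
fine = bin_contacts(pairs, bin_size=40_000)
```

The same read pairs give a coarser or finer map depending only on `bin_size`, which is why one dataset can be viewed at several resolutions.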

  5. Genome organization and Hi-C. [Figure: chromosomes chr1 to chr4 within the nucleus] Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

  6. Genome organization and Hi-C. [Figure: structures at the ~100 Kb and ~1 Mb scales] Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

  7. Topologically associating domains (TADs). TADs have been described as the functional units of genome organization, able to promote enhancer/promoter interactions. [Figure: contact probability]

  8. 'Hi-C'-based experiments

     | Method | Main features | References |
     |---|---|---|
     | Hi-C | For mapping whole-genome chromatin interaction in a cell population; proximity ligation is carried out in a large volume | Lieberman-Aiden et al. (2009) |
     | TCC | Similar to Hi-C, except that proximity ligation is carried out on solid phase-immobilized proteins | Kalhor et al. (2011) |
     | Single-cell Hi-C | For mapping chromatin interactions at the single-cell level | Nagano et al. (2013) |
     | In situ Hi-C | Proximity ligation is carried out in the intact nucleus | Rao et al. (2014) |
     | Capture-C | Combines 3C with a DNA capture technology; equivalent to high-throughput 4C | Hughes et al. (2014) |
     | DNase Hi-C | Chromatin is fragmented with DNase I; proximity ligation is carried out on a solid gel | Ma et al. (2015) |
     | Targeted DNase Hi-C | Combines DNase or in situ DNase Hi-C with a capture technology | Ma et al. (2015) |
     | Micro-C | Chromatin is fragmented with micrococcal nuclease | Hsieh et al. (2015) |
     | In situ DNase Hi-C | Chromatin is fragmented with DNase I; proximity ligation is carried out in the intact nucleus | Deng et al. (2015) |
     | Capture Hi-C | Combines 3C with a DNA capture technology; equivalent to high-throughput 5C | Mifsud et al. (2015) |
     | HiChIP | Detects genome-wide chromatin interaction mediated by a particular protein; equivalent to ChIA-PET | Mumbach et al. (2016) |

  9. Ready-to-use Hi-C kits: https://www.qiagen.com https://arimagenomics.com/kit https://www.phasegenomics.com

  10. Which approach for which purpose?

  11. Which approach for which purpose? Capture Hi-C protocol (Franke et al. 2016), i.e. a Hi-C library combined with capture of a dedicated genomic region

  12. Questions?
      1. How to efficiently process Hi-C data?
      2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

  13. What does Hi-C data look like? Illumina paired-end sequencing. [Figure: Hi-C fragments, PE sequencing]

  14. Challenges in Hi-C data processing. [Figure: Hi-C fragments, PE sequencing, IMR90 chr6 contact map, bin size = 500 Kb] How to process Hi-C data in an easy and efficient way, taking into account:
      - The huge amount of data
      - The evolution of protocols
      - The computational resources

  15. Reads mapping strategy. Hi-C fragments, PE sequencing.
      - Step 1: end-to-end genome alignment, giving aligned reads and unmapped reads
      - Step 2: unmapped reads are trimmed at the 3' RS site, then go through a second end-to-end genome alignment, again giving aligned reads and unmapped reads
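
The trimming step before the second alignment can be pictured as follows. This is a minimal sketch under stated assumptions, not the HiC-Pro implementation: the function name, the `min_length` parameter and the toy read are hypothetical, and the junction sequence shown is the one expected for a HindIII digestion followed by fill-in and ligation.

```python
def trim_at_ligation_site(read, ligation_site, min_length=5):
    """Return the part of `read` located upstream of the first ligation site.

    Returns None when no site is found, or when the kept part is shorter
    than `min_length` (a hypothetical threshold for this sketch).
    """
    cut = read.find(ligation_site)
    if cut == -1:
        return None  # nothing to trim: the read stays unmapped
    kept = read[:cut]  # drop the ligation site and everything 3' of it
    return kept if len(kept) >= min_length else None

# Assumed junction for HindIII (A^AGCTT) after fill-in and blunt ligation.
HINDIII_JUNCTION = "AAGCTAGCTT"

# Toy chimeric read: 10 bases from one locus, the junction, 10 from another.
read = "ACGTACGTAC" + HINDIII_JUNCTION + "GGGTTTCCCA"
trimmed = trim_at_ligation_site(read, HINDIII_JUNCTION)
```

A read spanning the junction cannot align end-to-end because its two halves come from different loci; the trimmed 5' part can.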

  16. Detection of valid interaction products. [Figure: Hi-C fragment]
      - Valid pairs: FR, RF, FF, RR (R = Reverse / F = Forward)
      - Invalid pairs: singleton, dangling end, self circle, dumped pairs
      - Plus filtering on: insert size, restriction fragment size, MAPQ, etc.
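
A simplified way to picture this classification is given below. It is a sketch under explicit assumptions, not the tool's actual logic: both mates are taken to lie on the same chromosome, each mate is a hypothetical `(fragment_id, position, strand)` tuple, and the rule "same fragment facing inward is a dangling end, facing outward is a self circle" is the usual convention rather than something stated on the slide. The extra filters (insert size, fragment size, MAPQ) are left out.

```python
def classify_pair(mate1, mate2):
    """Classify a read pair from the restriction fragments of its mates.

    Each mate is (fragment_id, position, strand) with strand "+" or "-",
    or None when the mate could not be aligned.
    """
    if mate1 is None and mate2 is None:
        return "dumped"
    if mate1 is None or mate2 is None:
        return "singleton"  # only one mate aligned
    # Read the orientation from the leftmost mate to the rightmost one.
    first, second = sorted((mate1, mate2), key=lambda m: m[1])
    orientation = "".join("F" if m[2] == "+" else "R" for m in (first, second))
    if first[0] != second[0]:
        return "valid_" + orientation  # two different restriction fragments
    if orientation == "FR":
        return "dangling_end"  # same fragment, mates facing each other
    if orientation == "RF":
        return "self_circle"  # same fragment, mates facing outward
    return "dumped"  # same fragment, same strand

examples = {
    "valid": classify_pair((1, 100, "+"), (5, 9000, "-")),
    "dangling": classify_pair((3, 500, "+"), (3, 700, "-")),
    "circle": classify_pair((3, 700, "+"), (3, 500, "-")),
}
```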

  17. Building contact maps. There is currently no consensus about how to (efficiently) store the contact maps. A Hi-C contact map is:
      • Usually very sparse
      • Symmetric

      We therefore propose to use a standard triplet sparse format to store only half of the non-zero contact values.

     | Bin size | Dense (MB) | Sparse complete (MB) | Sparse symmetric (MB) |
     |---|---|---|---|
     | 1 Mb | 25 | 98 | 49 |
     | 500 Kb | 77 | 363 | 182 |
     | 150 Kb | 818 | 1 900 | 934 |
     | 40 Kb | 12 000 | 3 800 | 1 900 |
     | 20 Kb | 45 000 | 5 300 | 2 700 |
     | 5 Kb | >100 000 ?? | 8 600 | 4 300 |
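
The triplet idea can be illustrated with a small round trip between a dense matrix and its upper-triangle triplets. This is a toy sketch: the function names and the 3 x 3 matrix are made up for the example, and no on-disk file layout is implied.

```python
def to_sparse_symmetric(dense):
    """Keep only non-zero cells of the upper triangle as (i, j, count)."""
    n = len(dense)
    return [
        (i, j, dense[i][j])
        for i in range(n)
        for j in range(i, n)  # j >= i: one half of the symmetric map
        if dense[i][j] != 0
    ]

def to_dense(triplets, n):
    """Rebuild the full symmetric n x n matrix from the triplets."""
    dense = [[0] * n for _ in range(n)]
    for i, j, count in triplets:
        dense[i][j] = count
        dense[j][i] = count  # mirror across the diagonal
    return dense

contact_map = [
    [4, 1, 0],
    [1, 3, 0],
    [0, 0, 2],
]
triplets = to_sparse_symmetric(contact_map)
restored = to_dense(triplets, n=3)
```

Here 4 triplets replace 9 dense cells. The saving grows with resolution because the number of bin pairs increases quadratically while the number of observed contacts does not.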

  18. Hi-C formats

      .hic files (Juicer, Juicebox)
      • Contact matrices in multiple resolutions and summary statistics stored in one file
      • Java and C bindings
      • Command line tools
      • Extant suite of analysis tools
      • Extant visualization tool

      .cool files (cooler, higlass)
      • Flexibility to store one or multiple matrices with varying bin sizes
      • Python library
      • Command line tools
      • HDF5, which has native bindings in practically all languages
      • Out-of-memory iterative matrix balancing that can work on very large matrices

  19. Hi-C data normalization. All high-throughput techniques are subject to technical and experimental biases. The iterative correction (ICE) method is a widely used approach for Hi-C data normalization. This method is based on the assumption that each locus should have the same probability of interaction genome-wide, and is in theory able to correct for any bias in the contact maps.
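
The equal-visibility assumption can be turned into a short matrix-balancing loop. What follows is a minimal pure-Python sketch of that idea, not the implementation used by any particular tool: the function name, the stopping rule and the toy matrix are assumptions, and real implementations work on sparse matrices and filter low-coverage bins first.

```python
def ice_normalize(matrix, max_iter=100, tol=1e-10):
    """Balance a symmetric contact matrix so that all covered rows have equal sums.

    Returns (balanced_matrix, bias), where
    balanced[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    balanced = [[float(value) for value in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        row_sums = [sum(row) for row in balanced]
        covered = [s for s in row_sums if s > 0]
        if not covered:
            break  # empty matrix: nothing to balance
        mean_sum = sum(covered) / len(covered)
        # Bins seen more often than average get a factor above 1.
        factors = [s / mean_sum if s > 0 else 1.0 for s in row_sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
        if max(abs(f - 1.0) for f in factors) < tol:
            break  # row sums are already equal
    return balanced, bias

# Toy symmetric map in which the second bin is over-represented.
raw = [
    [10.0, 10.0, 1.0],
    [10.0, 40.0, 5.0],
    [1.0, 5.0, 2.5],
]
balanced, bias = ice_normalize(raw)
balanced_row_sums = [sum(row) for row in balanced]
```

After convergence every row of `balanced` has the same sum, and the product of the accumulated `bias` values maps the balanced counts back to the raw ones.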

  20. HiC-Pro: processing of Hi-C/HiChIP data. Main steps: dedicated reads mapping strategy, detection of valid interaction products, quality controls, generation of raw and normalized contact maps.
      • Easy-to-use
      • Optimized and scalable
      • Flexible
      • Supports most protocols
      • Open to contribution
      • Compatible with many downstream analysis software

      Highly used in the last years. Available at https://github.com/nservant/HiC-Pro. Forum and discussion at https://groups.google.com/forum/#!forum/hic-pro

  21. Building Efficient and Reproducible Workflows
      • For facilities: highly optimised pipelines with excellent reporting. Validated releases ensure reproducibility.
      • For users: portable, documented and easy to use workflows. Pipelines that you can trust.
      • For developers: companion templates and tools help to validate your code and simplify common tasks.

  22. Analysis pipelines:
      • Nextflow-based pipelines
      • High level of reproducibility
      • Strict guidelines
      • 17 released pipelines
      • 19 under development

      Community: 29 organisations over the world, more than 90 contributors

  23. First version of the nf-core Hi-C pipeline released! V1.1.0 = Nextflow HiC-Pro version
      • Automatic installation
      • Natively supports most schedulers
      • Natively compatible with conda, docker, singularity
      • Efficient task management
      • Reads can be automatically split into chunks to speed up the processing

  24. Plans for the next versions:
      • TADs calling (which methods?)
      • Compartment calling
      • Detection of significant contacts
      • Specific pipelines for Hi-C based assembly? Cancer Hi-C?

      Contribution is welcome!

  25. Questions?
      1. How to efficiently process Hi-C data?
      2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

  26. Hi-C on cancer data. So far, most studies were dedicated to normal cells … and a few have started to investigate the chromatin structure of breast and prostate cancer using Hi-C

  27. Alterations in cancer (epi)genomics

  28. TADs are biologically relevant
      • TAD disruption leads to new enhancer/promoter contacts
      • Abnormal enhancer/promoter contacts can have strong phenotypic impacts
      • Structural variants can disrupt TAD structure

      Lupiáñez et al. 2015, Franke et al. 2016

  29. Organization of cancer genomes? Non-coding DNA mutation, structural variants. Valton & Dekker, 2016

  30. Hi-C, a good tool to study CNVs?

  31. Challenges in Hi-C cancer data? [Diagram: copy number variants and Hi-C: impacts? Simulation? Normalization: impacts?]

  32. Hi-C: what do we count? [Figure: whole genome map, intrachromosomal contact maps and topological domains around the HOX genes cluster on chr6, with loci i and j] In the context of a diploid genome:
      • If i and j belong to the same chromosome, $C_{ij} = 2\,\mathrm{cis} + 2\,\mathrm{transH}$
      • If i and j belong to different chromosomes, $C_{ij} = 4\,\mathrm{trans}$

  33. Generalization to polyploid genomes
      • $N_i = N_j = 1$: if chr i = chr j, $C_{ij} = 1\,\mathrm{cis}$; if chr i ≠ chr j, $C_{ij} = 1\,\mathrm{trans}$
      • $N_i = N_j = 2$: if chr i = chr j, $C_{ij} = 2\,\mathrm{cis} + 2\,\mathrm{transH}$; if chr i ≠ chr j, $C_{ij} = 4\,\mathrm{trans}$
      • $N_i = N_j = 3$: if chr i = chr j, $C_{ij} = 3\,\mathrm{cis} + 6\,\mathrm{transH}$; if chr i ≠ chr j, $C_{ij} = 9\,\mathrm{trans}$
      • General case with $N_i = N_j$: if chr i = chr j, $C_{ij} = N_i\,\mathrm{cis} + N_i (N_j - 1)\,\mathrm{transH}$; if chr i ≠ chr j, $C_{ij} = N_i N_j\,\mathrm{trans}$
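
The counting rule on this slide can be checked numerically. This is a small sketch in which the function name and the returned dictionary layout are assumptions; the same-chromosome formula is the one the slide gives for the case $N_i = N_j$.

```python
def contact_contributions(n_i, n_j, same_chromosome):
    """Number of cis, transH and trans terms contributing to C_ij.

    n_i, n_j: copy numbers of loci i and j.
    same_chromosome: True when i and j lie on the same chromosome.
    """
    if same_chromosome:
        # Each copy contacts itself (cis) and the other homologous copies (transH).
        return {"cis": n_i, "transH": n_i * (n_j - 1), "trans": 0}
    # Every copy of i can contact every copy of j.
    return {"cis": 0, "transH": 0, "trans": n_i * n_j}

haploid = contact_contributions(1, 1, same_chromosome=True)
diploid = contact_contributions(2, 2, same_chromosome=True)
triploid = contact_contributions(3, 3, same_chromosome=True)
triploid_trans = contact_contributions(3, 3, same_chromosome=False)
```

In every case the terms add up to `n_i * n_j`, the number of possible copy pairs, which is a quick sanity check on the formula.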

  34. Extension to cancer genome. If i and j belong to the same chromosomal segment, $C_{ij} = N_i\,\mathrm{cis} + N_i (N_j - 1)\,\mathrm{transH}$. [Figure: chromosomal segments with N = 4 and N = 2]
