Efficient processing of Hi-C data and application to cancer

Efficient processing of Hi-C data and application to cancer
Nicolas Servant, PhD
Institut Curie, INSERM U900, Mines ParisTech, PSL-Research University
4 December 2019, Hi-C days, Toulouse

Spatial organization of the genome
How are 2 meters of DNA packed into a 10 µm diameter nucleus?
[Figure: nucleosome (histone/DNA), H1 histone, chromatin fibre. Adapted from Annunziato et al. 2008]

Different levels of spatial organization
[Figure: nucleus]

Hi-C captures the chromatin conformation within the nucleus
Lieberman-Aiden et al. 2009, Rao et al. 2014
[Figure: from a population of cells to contact frequencies (low to high). Each contact between two bins (A, B, C) adds +1 to the matching cell of the map, for a given bin size. Panels: whole-genome map (chr1 to chr6 …); intrachromosomal contact maps at different resolutions]

Genome organization and Hi-C
[Figure: chr1, chr2, chr3, chr4]
Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

Genome organization and Hi-C
[Figure: scale labels ~100 Kb and ~1 Mb]
Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

Topologically Associating Domains (TADs)
The topological domains (TADs) have been described as the functional units of genome organization, able to promote enhancer/promoter interactions.
[Figure: contact probability]

'Hi-C'-based experiments
Method (reference): main features
• Hi-C (Lieberman-Aiden et al. 2009): for mapping whole-genome chromatin interaction in a cell population; proximity ligation is carried out in a large volume
• TCC (Kalhor et al. 2011): similar to Hi-C, except that proximity ligation is carried out on solid-phase-immobilized proteins
• Single-cell Hi-C (Nagano et al. 2013): for mapping chromatin interactions at the single-cell level
• In situ Hi-C (Rao et al. 2014): proximity ligation is carried out in the intact nucleus
• Capture-C (Hughes et al. 2014): combines 3C with a DNA capture technology; equivalent to high-throughput 4C
• DNase Hi-C (Ma et al. 2015): chromatin is fragmented with DNase I; proximity ligation is carried out on a solid gel
• Targeted DNase Hi-C (Ma et al. 2015): combines DNase or in situ DNase Hi-C with a capture technology
• Micro-C (Hsieh et al. 2015): chromatin is fragmented with micrococcal nuclease
• In situ DNase Hi-C (Deng et al. 2015): chromatin is fragmented with DNase I; proximity ligation is carried out in the intact nucleus
• Capture-Hi-C (Mifsud et al. 2015): combines 3C with a DNA capture technology; equivalent to high-throughput 5C
• HiChIP (Mumbach et al. 2016): detects genome-wide chromatin interaction mediated by a particular protein; equivalent to ChIA-PET

Ready-to-use Hi-C kits
https://www.qiagen.com
https://arimagenomics.com/kit
https://www.phasegenomics.com

Which approach for which purpose?

Which approach for which purpose?
Capture Hi-C protocol (Franke et al. 2016), i.e. a Hi-C library combined with capture of a dedicated genomic region

Questions?
1. How to efficiently process Hi-C data?
2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

What does Hi-C data look like?
Illumina paired-end sequencing
[Figure: Hi-C fragments, PE sequencing]

Challenges in Hi-C data processing
[Figure: from Hi-C fragments and PE sequencing to the IMR90 chr6 contact map, bin size = 500 Kb]
How to process Hi-C data in an easy and efficient way, taking into account:
- The huge amount of data
- The evolution of protocols
- The computational resources

Reads mapping strategy
1. Hi-C fragments are sequenced (PE sequencing).
2. End-to-end genome alignment gives aligned reads and unmapped reads.
3. Unmapped reads: trim 3' RS site.
4. End-to-end genome alignment of the trimmed reads gives aligned reads and unmapped reads.
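
The trimming step of this strategy could look like the sketch below. It is a minimal illustration under stated assumptions, not HiC-Pro's actual code: the function name `trim_at_ligation_site` and the `min_length` cut-off are hypothetical, and the junction motif `GATCGATC` assumes a DpnII/MboI digestion followed by fill-in and blunt-end ligation. Keeping only the sequence before the junction is also a simplifying choice.

```python
def trim_at_ligation_site(read, junction="GATCGATC", min_length=20):
    """Return the 5' part of `read` located before the first ligation junction.

    Returns None when no junction is found, or when the remaining 5' part
    is shorter than `min_length` (assumed too short to realign reliably).
    """
    position = read.upper().find(junction)
    if position == -1:
        return None  # no junction: nothing to rescue by trimming
    trimmed = read[:position]
    return trimmed if len(trimmed) >= min_length else None


# A chimeric read: 5' part from one locus, junction, 3' part from another locus.
five_prime = "ACGTTGCAAGGCTTAACCGGTTAA"
chimeric = five_prime + "GATCGATC" + "TTGGCCAATT"
print(trim_at_ligation_site(chimeric))
```

The trimmed sequence would then go through a second end-to-end alignment, as in step 4.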

Detection of valid interaction products
[Figure: Hi-C fragment]
Valid pairs: FR, RF, FF, RR
Invalid pairs: singleton, dangling end, self circle, dumped pairs
R = Reverse / F = Forward
+ Filtering on:
• Insert size
• Restriction fragment size
• MAPQ
• etc.
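
The categories above can be sketched as a small classifier. This is a simplified illustration, not the HiC-Pro implementation: the `Mate` type, the genome-wide `fragment` index and the rule sending every other same-fragment pair to "dumped" are assumptions made for the example, and the filters on insert size, fragment size and MAPQ are left out.

```python
from typing import NamedTuple, Optional


class Mate(NamedTuple):
    chrom: str
    pos: int
    strand: str    # "+" (forward, F) or "-" (reverse, R)
    fragment: int  # hypothetical genome-wide restriction fragment index


def classify_pair(mate1: Optional[Mate], mate2: Optional[Mate]) -> str:
    """Classify a read pair; an unaligned mate is passed as None."""
    if mate1 is None and mate2 is None:
        return "dumped"
    if mate1 is None or mate2 is None:
        return "singleton"
    # Read the orientation from the leftmost mate to the rightmost one.
    first, second = sorted((mate1, mate2), key=lambda m: (m.chrom, m.pos))
    letters = {"+": "F", "-": "R"}
    orientation = letters[first.strand] + letters[second.strand]
    if first.fragment != second.fragment:
        return "valid_" + orientation  # FR, RF, FF or RR
    if orientation == "FR":
        return "dangling_end"  # mates face each other on one fragment
    if orientation == "RF":
        return "self_circle"   # mates point outwards on one fragment
    return "dumped"


print(classify_pair(Mate("chr1", 100, "+", 1), Mate("chr1", 5000, "-", 7)))
```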

Building contact maps
There is currently no consensus about how to (efficiently) store the contact maps.
A Hi-C contact map is:
• Usually very sparse
• Symmetric
We therefore propose to use a standard triplet sparse format to store only half of the non-zero contact values.
Bin size | Dense (MB) | Sparse complete (MB) | Sparse symmetric (MB)
1Mb | 25 | 98 | 49
500Kb | 77 | 363 | 182
150Kb | 818 | 1 900 | 934
40Kb | 12 000 | 3 800 | 1 900
20Kb | 45 000 | 5 300 | 2 700
5Kb | >100 000 ?? | 8 600 | 4 300
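
A minimal sketch of the symmetric triplet idea follows. It assumes contacts on a single chromosome given as pairs of genomic positions; the function name `build_sparse_contact_map` is hypothetical and the exact file layout used by any given tool is not implied.

```python
from collections import Counter


def build_sparse_contact_map(pairs, bin_size):
    """Bin position pairs and return sorted (bin_i, bin_j, count) triplets.

    Only the upper triangle (bin_i <= bin_j) is stored, since the map is
    symmetric, and bin pairs without any contact are never stored.
    """
    counts = Counter()
    for pos_i, pos_j in pairs:
        bin_i, bin_j = pos_i // bin_size, pos_j // bin_size
        if bin_i > bin_j:
            bin_i, bin_j = bin_j, bin_i  # fold onto the upper triangle
        counts[(bin_i, bin_j)] += 1
    return sorted((i, j, c) for (i, j), c in counts.items())


contacts = [(100, 2500), (2600, 150), (1200, 1300), (100, 2400)]
for triplet in build_sparse_contact_map(contacts, bin_size=1000):
    print(*triplet, sep="\t")
```

Folding each pair onto the upper triangle is what halves the storage compared with the complete sparse form in the table.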

Hi-C formats
.hic files (Juicer, Juicebox)
• Contact matrices in multiple resolutions and summary statistics stored in one file
• Java and C bindings
• Command line tools
• Extant suite of analysis tools
• Extant visualization tool
.cool files (cooler, higlass)
• Flexibility to store one or multiple matrices with varying bin sizes
• Python library
• Command line tools
• HDF5, which has native bindings in practically all languages
• Out-of-memory iterative matrix balancing, which can work on very large matrices

Hi-C data normalization
All high-throughput techniques are subject to technical and experimental biases.
The iterative correction (ICE) method is a widely used approach for Hi-C data normalization. This method is based on the assumption that each locus should have the same probability of interaction genome-wide, and is in theory able to correct for any bias in the contact maps.
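
The equal-visibility assumption can be illustrated with a small matrix-balancing sketch. It is written in plain Python on a dense list of lists for readability, which is an assumption of this example: real implementations work on sparse matrices and usually filter low-coverage bins first. The function name `ice_normalize` and the fixed iteration count are hypothetical.

```python
def ice_normalize(matrix, n_iter=50):
    """Iteratively rescale a symmetric contact matrix towards equal row sums.

    Returns the balanced matrix and the accumulated per-bin bias vector.
    """
    n = len(matrix)
    balanced = [[float(value) for value in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in balanced]
        covered = [s for s in sums if s > 0]
        if not covered:
            break
        mean = sum(covered) / len(covered)
        # Relative coverage of each bin; empty bins are left untouched.
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return balanced, bias


# Toy map: a matrix with equal row sums, distorted by per-bin biases 1, 2, 4.
raw = [[2, 2, 4], [2, 8, 8], [4, 8, 32]]
normalized, bias = ice_normalize(raw)
print([round(sum(row), 3) for row in normalized])
```

In this toy case the three row sums become equal and the relative contact values of the undistorted matrix are recovered.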

HiC-Pro – processing of Hi-C/HiChIP data
Steps: dedicated reads mapping strategy; detection of valid interaction products; generation of raw and normalized contact maps; quality controls
• Easy-to-use
• Optimized and scalable
• Flexible
• Supports most protocols
• Open to contribution
• Compatible with many downstream analysis software
Highly used in the last years
Available at https://github.com/nservant/HiC-Pro
Forum and discussion at https://groups.google.com/forum/#!forum/hic-pro

Building Efficient and Reproducible Workflows
For facilities: highly optimised pipelines with excellent reporting. Validated releases ensure reproducibility.
For users: portable, documented and easy to use workflows. Pipelines that you can trust.
For developers: companion templates and tools help to validate your code and simplify common tasks.

Analysis pipelines:
• Nextflow-based pipelines
• High level of reproducibility
• Strict guidelines
• 17 released pipelines
• 19 under development
Community:
• 29 organisations over the world
• More than 90 contributors

First version of the nf-core Hi-C pipeline released!
V1.1.0 = Nextflow HiC-Pro version
• Automatic installation
• Native support for most schedulers
• Natively compatible with conda, docker, singularity
• Efficient task management
• Reads can be automatically split into chunks to speed up the processing
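
The read-splitting idea in the last bullet can be sketched as below. This is only an illustration of chunking, not how the nf-core pipeline does it: the function name `split_fastq` is hypothetical and the sketch assumes standard four-line FASTQ records.

```python
from itertools import islice


def split_fastq(lines, reads_per_chunk):
    """Yield lists of FASTQ lines holding at most `reads_per_chunk` reads each."""
    iterator = iter(lines)
    while True:
        # One FASTQ record spans four lines: header, sequence, "+", qualities.
        chunk = list(islice(iterator, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


# Five toy reads split into chunks of two reads.
fastq_lines = []
for index in range(5):
    fastq_lines += [f"@read{index}", "ACGT", "+", "IIII"]
chunk_sizes = [len(chunk) // 4 for chunk in split_fastq(fastq_lines, 2)]
print(chunk_sizes)
```

For paired-end data, both mate files would need the same chunk size so that mates stay in the same chunk index, and each chunk can then be mapped as an independent task.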

Plans for the next versions:
• TADs calling (which methods?)
• Compartment calling
• Detection of significant contacts
• Specific pipelines for Hi-C based assembly? Cancer Hi-C?
Contribution is welcome!

Questions?
1. How to efficiently process Hi-C data?
2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

Hi-C on cancer data
So far, most studies have been dedicated to normal cells, and a few have started to investigate the chromatin structure of breast and prostate cancer using Hi-C.

Alterations in cancer (epi)genomics

TADs are biologically relevant
• TAD disruption leads to new enhancer/promoter contacts
• Abnormal enhancer/promoter contacts can have strong phenotypic impacts
• Structural variants can disrupt TAD structure
Lupiáñez et al. 2015, Franke et al. 2016

Organization of cancer genomes?
[Figure: non-coding DNA mutation, structural variants. Valton & Dekker, 2016]

Hi-C, a good tool to study CNVs?

Challenges in Hi-C cancer data?
[Diagram: copy number variants. Impacts on Hi-C? Impacts on normalization? Simulation?]

Hi-C – What do we count?
[Figure: whole-genome map, intrachromosomal contact maps and topological domains (chr6, HOX genes cluster), with two loci i and j]
In the context of a diploid genome:
If i and j belong to the same chromosome, $C_{ij} = 2\,\mathrm{cis} + 2\,\mathrm{transH}$
If i and j belong to different chromosomes, $C_{ij} = 4\,\mathrm{trans}$

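Generalization to polyploid genomes
$N_i = N_j = 1$: if $\mathrm{chr}_i = \mathrm{chr}_j$, $C_{ij} = 1\,\mathrm{cis}$; if $\mathrm{chr}_i \neq \mathrm{chr}_j$, $C_{ij} = 1\,\mathrm{trans}$
$N_i = N_j = 2$: if $\mathrm{chr}_i = \mathrm{chr}_j$, $C_{ij} = 2\,\mathrm{cis} + 2\,\mathrm{transH}$; if $\mathrm{chr}_i \neq \mathrm{chr}_j$, $C_{ij} = 4\,\mathrm{trans}$
$N_i = N_j = 3$: if $\mathrm{chr}_i = \mathrm{chr}_j$, $C_{ij} = 3\,\mathrm{cis} + 6\,\mathrm{transH}$; if $\mathrm{chr}_i \neq \mathrm{chr}_j$, $C_{ij} = 9\,\mathrm{trans}$
General case, $N_i = N_j$: if $\mathrm{chr}_i = \mathrm{chr}_j$, $C_{ij} = N_i\,\mathrm{cis} + N_i (N_j - 1)\,\mathrm{transH}$; if $\mathrm{chr}_i \neq \mathrm{chr}_j$, $C_{ij} = N_i N_j\,\mathrm{trans}$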
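
The general formula can be checked against the three listed cases with a few lines of code. The function name `contact_composition` and the dictionary layout are hypothetical; the sketch only counts how many cis, transH and trans terms contribute to $C_{ij}$ and says nothing about their relative magnitudes.

```python
def contact_composition(n_i, n_j, same_chromosome):
    """Count the cis, transH and trans terms contributing to C_ij.

    `n_i` and `n_j` are the copy numbers of loci i and j. The slide states
    the formulas for N_i = N_j, so that equality is enforced in every case.
    """
    if n_i != n_j:
        raise ValueError("the formulas are stated for N_i = N_j")
    if same_chromosome:
        return {"cis": n_i, "transH": n_i * (n_j - 1), "trans": 0}
    return {"cis": 0, "transH": 0, "trans": n_i * n_j}


for copies in (1, 2, 3):
    print(copies,
          contact_composition(copies, copies, same_chromosome=True),
          contact_composition(copies, copies, same_chromosome=False))
```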

Extension to cancer genome
[Figure: chromosomal segments with copy number N = 4 and N = 2]
If i and j belong to the same chromosomal segment, $C_{ij} = N_i\,\mathrm{cis} + N_i (N_j - 1)\,\mathrm{transH}$
