Data Mining for Genomic-Phenomic Correlations
Joyce C. Niland, Ph.D., Associate Director & Chair, Information Sciences
Rebecca Nelson, Ph.D., Lead, Data Mining Section
City of Hope National Medical Center, Duarte, California, USA
City of Hope National Medical Center, Duarte, California
City of Hope National Medical Center
• Founded in 1913
• State-of-the-art care to patients with cancer & other life-threatening diseases (e.g. diabetes)
• Leading edge research into the causes, prevention, and cure of such diseases
• Promising new therapeutic agents being taken from “bench” to “bedside” (translational research)
• Over 400 ongoing clinical trials, 1/3 initiated at City of Hope
• Human genome has expanded scope and objectives of translational research
  o Improved diagnosis of disease
  o Earlier detection of genetic risk
  o Repairing defective genes with healthy ones
  o New drugs based on information about genes
Recent Examples of Genomic-Phenomic Data Mining Supported by Biostatistics
• Predictors of Genetic Susceptibility to Heart Disease Post-Bone Marrow Transplant
• Pathogenesis of Radiation-induced Breast Cancer
• Gene Expression in Prostate Cancer Tumors
• Prognostic Biomarkers for Stage I-III Renal Cell Carcinoma
• Validation of Biomarkers for Tumor Initiating Cells in Brain Cancer
• Expression of DNA Repair in Normal versus Tumor Cell Genes
The 3 (4?) I’s of Data Mining Process
• Integrate Data Sources: Data Warehouse
• Include Appropriate Samples: Cohort Assembly
• Identify (Infer?) Subjects’ Wishes: Honest Broker
Integrating Data Systems to Support Biomedical Research
• Biomedical research is an increasingly complex collaborative undertaking
• Requires integration of data, rules, processes, and vocabularies from many different source systems
• Most information systems developed independently
  o Operational systems, created to meet different functional and departmental needs of an institution
Operational Systems Versus Data Warehousing
• On-Line Transaction Processing (OLTP):
  o Focuses on an organization’s day-to-day business needs (electronic medical record, financial systems, clinical trial management systems)
• On-Line Analytic Processing (OLAP):
  o Retrieves, analyzes, reports, and shares data from disparate systems, vendors & departments (DW)
Data Warehousing Concept
• Large, centralized, and longitudinal store of data to facilitate organization-wide consolidated reporting and analysis
  o Multiple source databases
  o Central coordination and management via metadata repository
  o Multiple target “data marts” (aggregated datasets for efficient querying and analysis)
City of Hope (COH) Data Warehouse Overview
[Figure: architecture diagram of the COH Data Warehouse. Source data systems (OLTP) named in the diagram include Trendstar (financials), Eclypsis, the Surgical Info System (SIS), the Radiology Info System (RIS), SafeTrace (transfusion medicine), CoPath, Sunquest, the Cancer Registry, MIDAS, Medidata Electronic Data Capture (EDC) for patients on clinical trials, Labware, Solexa, microarray analysis, and the Biospecimen Repository. Data pass through Extract, Transform & Load (ETL) with Data Validation & Quality Assurance (QA) into the COH Data Warehouse, which sits behind a metadata layer and holds phenomic core data on all COH patients alongside genomic / proteomic results from basic science. Outputs (OLAP) include OLAP “cubes” of financial data, QA reporting, disease cluster reporting & analysis, hypothesis generation & grant proposals, data mining, and genotype-phenotype correlative analyses. A marker in the diagram indicates ETL feeds still in progress.]
Utility of a Data Warehouse
While protecting personally identifiable information and proprietary research data:
  o Decision support to administrators
  o Screening of patients for eligibility
  o Measurement of quality of care & outcomes
  o Query capabilities to investigators
  o Data mining to generate new hypotheses, facilitate new discoveries
Technical & Business Metadata Directories (Metadata Repository)

Technical Metadata* (*Database Administrator Perspective)
• Data Sources
  o Technical name
  o Data type & length
  o Creation, expiration dates
  o Source system
  o Data ‘steward’
• Mappings
  o Rules for merging/filtering
  o Classification coding
• Validation Rules
  o Missing value fields
  o Data integrity, consistency
• Transformation Rules
  o Derivation of values
  o Data summaries

Business Metadata** (**Database User Perspective)
• Data Definition
  o Field names, aliases
  o Description of data meaning
• Data Directives
  o Instructions for data collection
  o Guidelines for data coding
• Queries
  o Synonyms
• Reports
  o List of reports that use term
• Security Information
  o Authorization to access
Cohort Assembly
• Inclusion of subjects with appropriate phenomic characteristics AND available tissue
• > 360,000 specimens logged in CoPath system, going back to 1955
• Critical to integrate tissue sample data into the data warehouse
• Broken down into “Class of Case” to describe type of specimen
CoPath Specimens by Year and Type
[Figure: chart of specimen counts per year, 1955 to 2008 (vertical axis 0 to 30,000), broken down by specimen type: Routine Surgical, Other, Outside Consults, Hematopathology, Cytology, Card Class, Blood.]
Integrate “Best” Source of Tissue Annotation Data
• Standardized formatted path reports now available from pathologist
• ‘Synoptic Report’ includes:
  o Path T Stage
  o Nodes Examined
  o Nodes Positive
  o Path N Stage
  o Path M Stage
  o Margins
  o Histology
  o Grade
Concordance Between CoPath Synoptic Reports & Cancer Registry
[Figure: bar chart of concordance by cancer site (vertical axis 0.00 to 1.00). Sites, in chart order: Prostate, Breast, Colon and Rectum, Kidney, Lung. Bar values, in chart order: 0.99, 0.91, 0.88, 0.84, 0.82.]
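The slides do not say how concordance was measured. As a minimal sketch, assuming simple proportion agreement between the two sources on one data element, it could be computed as below. The function name and the example stage pairs are hypothetical, and a chance-corrected measure such as kappa may have been used instead.

```python
def concordance(pairs):
    """Return the fraction of cases where both sources record the same value.

    `pairs` is a list of (synoptic_value, registry_value) tuples.
    Assumption: concordance means simple proportion agreement.
    """
    if not pairs:
        raise ValueError("no cases to compare")
    agree = sum(1 for synoptic, registry in pairs if synoptic == registry)
    return agree / len(pairs)


# Hypothetical pathologic T stage recorded for four cases in each source.
example_pairs = [("T1", "T1"), ("T2", "T2"), ("T2", "T3"), ("T3", "T3")]
rate = concordance(example_pairs)
print(f"{rate:.2f}")  # 3 of 4 pairs agree -> 0.75
```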
Comparing Path Data Sources: Synoptic Report vs. CNExT

Synoptic Report - Pathology
• Surgery specific
• Only 22% of cancer cases
• Non-strict reporting rules
• No confirmation by treating MD

CNExT - Cancer Registry
• Patient specific
• 100% of cancer cases
• Strict reporting rules
• Confirmed by treating MD
Text Mining to Identify Cases
• Text mining may be needed if neither data source is specific enough
• Example: Diagnosis of Unknown Primary Sites for Metastatic Tumors via 92-Gene RT-PCR
  o Needed to find tissue samples labeled as purely metastatic, with no link to original cancer site
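The kind of text search described above might look like the following sketch. It assumes free-text pathology reports are available as plain strings. The keyword patterns, the function name, and the sample reports are illustrative assumptions, not the rules actually used for the 92-gene RT-PCR study.

```python
import re

# Assumed keyword patterns; a production rule set would need pathologist review.
METASTATIC = re.compile(r"\bmetasta(?:tic|sis|ses)\b", re.IGNORECASE)
UNKNOWN_PRIMARY = re.compile(
    r"\b(?:unknown|occult|undetermined)\s+primary\b", re.IGNORECASE
)


def is_candidate(report_text):
    """Flag reports that mention metastasis and an unknown primary site."""
    return bool(METASTATIC.search(report_text)) and bool(
        UNKNOWN_PRIMARY.search(report_text)
    )


# Hypothetical report snippets.
reports = {
    "S1": "Metastatic adenocarcinoma, unknown primary.",
    "S2": "Metastatic carcinoma consistent with breast primary.",
    "S3": "Invasive ductal carcinoma, grade 2.",
}
candidates = [sid for sid, text in reports.items() if is_candidate(text)]
print(candidates)  # ['S1']
```

Simple keyword matching like this misses negation and spelling variants, so flagged specimens would still need manual review.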
Enabling Investigators to Conduct Cohort Searches via i2b2
• i2b2: Integrating Biology and the Bedside
• Facile user interface that lets investigators search for cohorts on their own
• Executes advanced queries against the meta-database to identify available subjects, tissue samples, or biospecimens matching the query criteria
• If a feasible number of cases is returned, the investigator then submits an IRB protocol for approval
• Biostatistics Division first needs to provide the ‘Honest Broker’ service to eliminate any dissenting patients
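To make the cohort search concrete, here is a minimal sketch in plain Python of the rule "appropriate phenomic characteristics AND available tissue". It does not use the i2b2 interface or API. The `Case` fields, the `find_cohort` function, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:
    patient_id: str
    diagnosis: str
    stage: str
    specimens_available: int  # count of banked specimens (assumed field)


def find_cohort(cases, diagnosis, stages):
    """Keep cases matching the phenomic criteria that also have tissue."""
    return [
        c
        for c in cases
        if c.diagnosis == diagnosis
        and c.stage in stages
        and c.specimens_available > 0
    ]


# Hypothetical warehouse extract.
cases = [
    Case("P001", "renal cell carcinoma", "II", 2),
    Case("P002", "renal cell carcinoma", "IV", 1),  # stage out of range
    Case("P003", "renal cell carcinoma", "I", 0),   # no tissue available
    Case("P004", "prostate cancer", "II", 3),       # different diagnosis
]
cohort = find_cohort(cases, "renal cell carcinoma", {"I", "II", "III"})
print([c.patient_id for c in cohort])  # ['P001']
```

The count of matching cases is what an investigator would use to judge feasibility before submitting an IRB protocol.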
Definition of “Honest Broker”
• Impartial party and process to determine whether a patient’s wishes would be violated by:
  o Analyzing their data arising from standard care, or
  o Studying their discard tissue samples
• Moving to a single “General Research Consent” for all patients going forward
• However, the many different consents used for various studies in the past must still be considered
Protocols Requiring “Honest Broker” Process
• Total Protocols: N=3,299
  o No Honest Broker: Protocols with Consent for Specific Interventions, N=2,314
  o Honest Broker Required: Non-intervention Studies, N=985
    - Consent Not Required: N=620
    - Consent Required: N=365
Patient Consent Status

Consent Status                   N        Percent
Consented                        70,133   92.5
Dissented                        5,637    7.4
Consent Withdrawn (Dissented)    25       < 1.0
Total                            75,795
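The percentages in the table follow directly from the counts. The short check below recomputes them. The variable names are illustrative, and the counts are taken from the table.

```python
# Counts copied from the Patient Consent Status table.
counts = {
    "Consented": 70133,
    "Dissented": 5637,
    "Consent Withdrawn (Dissented)": 25,
}

total = sum(counts.values())
percents = {status: round(100 * n / total, 1) for status, n in counts.items()}

# Patients an honest broker would need to exclude: dissented plus withdrawn.
excluded = counts["Dissented"] + counts["Consent Withdrawn (Dissented)"]

print(total)     # 75795
print(percents)  # Consented 92.5, Dissented 7.4, Withdrawn 0.0 (i.e. < 1.0)
print(excluded)  # 5662
```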
Honest Broker Algorithmic Approach
• Use computerized algorithms to evaluate:
  o Consent Type
  o Participation Type
• Any patient with a “no” response to a consent or participation type related to the objectives of the study is not included in the cohort
• Note: Consents change over time
  o May require different algorithms depending on protocol version
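The exclusion rule above can be sketched as a simple filter. This is an illustration under stated assumptions, not the City of Hope algorithm. The record layout, the item labels, and the function name are hypothetical. The slides also do not say how to treat patients with no recorded response, so this sketch keeps them, which a real process might not.

```python
from dataclasses import dataclass


@dataclass
class Response:
    patient_id: str
    item_type: str  # a consent type or participation type (labels assumed)
    answer: str     # "yes" or "no"


def honest_broker_filter(candidate_ids, responses, relevant_types):
    """Drop any candidate who answered "no" to an item relevant to the study.

    Assumption: patients without a recorded response are kept.
    """
    dissenting = {
        r.patient_id
        for r in responses
        if r.item_type in relevant_types and r.answer.strip().lower() == "no"
    }
    return [pid for pid in candidate_ids if pid not in dissenting]


# Hypothetical responses for a study that uses discard tissue.
responses = [
    Response("P001", "discard_tissue", "yes"),
    Response("P002", "discard_tissue", "no"),
    Response("P003", "genetic_research", "no"),  # not relevant to this study
]
cleared = honest_broker_filter(
    ["P001", "P002", "P003"], responses, {"discard_tissue"}
)
print(cleared)  # ['P001', 'P003']
```

Because consents change over time, `relevant_types` would likely need to be defined per protocol version rather than once per study.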