
Data Quality Challenges ACM JDIQ EiC Open Knowledge Networks - PowerPoint PPT Presentation

Data Quality Challenges. ACM JDIQ EiC; Open Knowledge Networks (Biomedicine); Data Science for Finance (DSfin). Louiqa Raschid, Smith School of Business, Computer Science and UMIACS.


  1. Data Quality Challenges
     ● ACM JDIQ EiC
     ● Open Knowledge Networks (Biomedicine)
     ● Data Science for Finance (DSfin)
     Louiqa Raschid, Smith School of Business, Computer Science and UMIACS

  2. Technical Review of Data Quality
     ● Provenance, cleaning, annotation
     ● Data cleaning infrastructure and tools:
       ○ Robust first generation.
       ○ Big data and scalability.
       ○ Human-in-the-loop (HumInt).
     ● Process:
       ○ Fitness to task.
       ○ Understanding workflows.

  3. First Gen methodologies and products

  4. Technical Review of Data Quality
     ● Provenance, cleaning, annotation
     ● Data cleaning infrastructure and tools:
       ○ Robust first generation.
       ○ Big data and scalability.
       ○ Human-in-the-loop (HumInt).
     ● Process:
       ○ Fitness to task.
       ○ Understanding workflows.

  5. Scenarios
     ● Lung cancer data (primary) generated by clinicians:
       ○ Patient entity identification in clinical notes, e.g., JM, J.M., etc. (cleaning)
       ○ Scale: Barthel 4
       ○ Stages: P0T0, Stage4, etc. (annotation)
     ● Analytics over (secondary) sources:
       ○ Drug-induced liver injury (DILI): the phenotype includes elevated levels of liver enzymes, etc.
       ○ (fitness to task; HumInt): There are many causes of elevated liver enzymes, including transplants, some infants, etc.
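The patient-identification item above (JM vs. J.M.) can be illustrated with a minimal Python sketch. It assumes that mentions are plain initials and that stripping punctuation, whitespace, and case is an acceptable canonical form. The function names and sample mentions are hypothetical and do not come from the slides.

```python
import re
from collections import defaultdict


def normalize_initials(mention: str) -> str:
    """Reduce a mention such as 'J.M.' or 'j. m.' to a canonical key like 'JM'."""
    # Assumption: only ASCII letters carry identity; everything else is noise.
    return re.sub(r"[^A-Za-z]", "", mention).upper()


def group_mentions(mentions):
    """Group raw mentions that share the same canonical key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize_initials(mention)].append(mention)
    return dict(groups)


# Hypothetical mentions, in the spirit of the 'JM, J.M.' example.
mentions = ["JM", "J.M.", "j. m.", "KL", "K.L"]
groups = group_mentions(mentions)
```

Two different patients with the same initials would collapse into one group under this rule, so a real pipeline would need more context (dates, record identifiers, or a human in the loop) before merging.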

  6. Scenarios
     ● Privacy-preserving data mining:
       ○ Entity linkage in the de-identified space.
       ○ Different entries contribute hashed identifiers, but they may be missing a variety of fields. (provenance; fitness to task)
     ● iASiS SEMANTIC Data Cleaning / Annotation Pipeline
     ● Finding patterns in OKN: DILI Case Study
       ○ (Provenance; fitness to task; HumInt; annotation.)
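To make the linkage scenario above concrete, here is a hedged Python sketch of matching records on hashed fields when some fields are missing. The salted SHA-256 scheme, the field names, and the `min_shared` threshold are all illustrative assumptions. The slides do not describe the actual method.

```python
import hashlib

SALT = b"shared-secret"  # hypothetical salt agreed on by all contributing sites


def hash_field(value):
    """Hash one normalized field value; a missing value stays missing (None)."""
    if value is None:
        return None
    normalized = value.strip().lower().encode("utf-8")
    return hashlib.sha256(SALT + normalized).hexdigest()


def deidentify(record):
    """Replace every field of a record with its salted hash."""
    return {field: hash_field(value) for field, value in record.items()}


def match_score(a, b):
    """Return (agreeing fields, fields present in both records)."""
    shared = [f for f in a.keys() & b.keys()
              if a[f] is not None and b[f] is not None]
    agree = sum(1 for f in shared if a[f] == b[f])
    return agree, len(shared)


def is_link(a, b, min_shared=2):
    """Link only if enough fields are present in both and all of them agree."""
    agree, shared = match_score(a, b)
    return shared >= min_shared and agree == shared


# Hypothetical records from three sites, with different fields missing.
site_a = deidentify({"name": "Jane Miller", "dob": "1970-01-02", "zip": None})
site_b = deidentify({"name": " jane miller ", "dob": "1970-01-02", "zip": "20742"})
site_c = deidentify({"name": "Jane Miller", "dob": None, "zip": None})
```

Here `site_a` and `site_b` link on two agreeing fields, while `site_a` and `site_c` share only one field and are left unlinked. Whether one shared field is enough is a fitness-to-task question, so `min_shared` is left as a parameter.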

  7. iASiS

  8. DILI Case Study
     ○ Given a knowledge graph and a DILI phenotype (keywords) ...
     ○ Create profiles, e.g., [Phenotype | Drug | Gene | Pathway]
     ○ Rank the drugs by risk of DILI.
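The three steps above can be sketched in Python under strong simplifying assumptions. The knowledge graph is a small list of triples, the predicate names (`has_phenotype`, `targets`, `in_pathway`) and all entities are invented for illustration, and drugs are ranked simply by how many matching profiles they appear in. The slides do not specify the ranking function, so this count is a stand-in.

```python
from collections import Counter, defaultdict

# Hypothetical toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("drugA", "has_phenotype", "elevated liver enzymes"),
    ("drugA", "targets", "gene1"),
    ("drugA", "targets", "gene2"),
    ("gene1", "in_pathway", "pathwayX"),
    ("gene2", "in_pathway", "pathwayX"),
    ("drugB", "has_phenotype", "elevated liver enzymes"),
    ("drugB", "targets", "gene1"),
    ("drugC", "has_phenotype", "headache"),
    ("drugC", "targets", "gene2"),
]


def index(triples):
    """Index triples as predicate -> subject -> list of objects."""
    out = defaultdict(lambda: defaultdict(list))
    for subj, pred, obj in triples:
        out[pred][subj].append(obj)
    return out


def build_profiles(triples, keywords):
    """Build (phenotype, drug, gene, pathway) profiles matching any keyword."""
    idx = index(triples)
    profiles = []
    for drug, phenotypes in idx["has_phenotype"].items():
        for phenotype in phenotypes:
            if not any(k in phenotype for k in keywords):
                continue
            for gene in idx["targets"].get(drug, []):
                for pathway in idx["in_pathway"].get(gene, []):
                    profiles.append((phenotype, drug, gene, pathway))
    return profiles


def rank_drugs(profiles):
    """Rank drugs by number of matching profiles (an assumed, naive score)."""
    return Counter(profile[1] for profile in profiles).most_common()


profiles = build_profiles(TRIPLES, ["liver enzymes"])
ranking = rank_drugs(profiles)
```

On this toy graph `drugA` ranks above `drugB` because it appears in two profiles, and `drugC` is excluded because its phenotype does not match the keywords. A count-based score ignores the confounders noted earlier (other causes of elevated liver enzymes), which is where human review would come in.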

  9. DILI Case Study

  10. DILI Case Study

  11. Tamr: Understanding Workflows

  12. Tamr: Understanding Workflows

  13. Tamr: Understanding Workflows

  14. Lessons learned
      ● First-generation tools work well.
      ● The next generation needs to focus on processes, workflows, and HumInt.
      ● Scientists still spend huge amounts of time on cleaning. How can we fix this problem?
      ● Is Open Knowledge Networks a solution?
      ● An unexpected case study ...
