
The Data Cleaning Problem: Some Key Issues & Practical Approaches

Ronald K. Pearson
Daniel Baugh Institute for Functional Genomics and Computational Biology
Department of Pathology, Anatomy, and Cell Biology
Thomas Jefferson University, Philadelphia, PA

DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data
November 3-4, 2003

Topics

1. Outliers: an important data anomaly
   - types and working assumptions
   - some real data examples
2. Detecting outliers
   - the popular 3σ edit rule
   - order statistics vs. moments
   - some alternative approaches
3. Other data anomalies
   - missing data
   - misalignments
   - noninformative variables
   - comparing performance
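
The contrast between moments and order statistics in topic 2 can be sketched in a few lines. The block below is a minimal illustration, not the method presented in the talk: it compares the 3σ edit rule (mean and standard deviation) with a median/MAD rule, often called the Hampel identifier, which is assumed here as one representative order-statistics alternative. The data values, function names, and the threshold t = 3 are all made up for illustration.

```python
import statistics

def three_sigma_flags(x, t=3.0):
    """Moment-based rule: flag points more than t standard deviations from the mean."""
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [abs(v - mean) > t * sd for v in x]

def hampel_flags(x, t=3.0):
    """Order-statistics rule: flag points more than t scaled MADs from the median."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    # 1.4826 makes the MAD comparable to the standard deviation for Gaussian data.
    scale = 1.4826 * mad
    return [abs(v - med) > t * scale for v in x]

# Hypothetical sequence of 15 values: 13 nominal points near 1.0 and 2 gross outliers.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9,
        1.0, 1.05, 0.95, 1.0, 1.1, 8.0, 9.0]

sigma_idx = [i for i, f in enumerate(three_sigma_flags(data)) if f]
hampel_idx = [i for i, f in enumerate(hampel_flags(data)) if f]

print("3-sigma rule flags:", sigma_idx)
print("Hampel identifier flags:", hampel_idx)
```

On this made-up sequence the two outliers inflate the mean and standard deviation enough that the 3σ rule flags nothing, while the median and MAD are barely affected and both outliers are flagged. If the MAD is zero (more than half the values identical), the median/MAD rule as written flags every point that differs from the median, so a real implementation would need to handle that case.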

Example 1: Outlier in a microarray data sequence

[Figure: dye swap average of log2 intensity ratios, gene 263. x-axis: Sample; y-axis: Log2 Intensity Ratio; groups labeled Control and EtOH; one point marked as an outlier.]

Example 2: Influence of outliers on a volcano plot

[Figure: log2 expression change vs. p-value, genes 201 to 300. x-axis: t-test P-value; y-axis: Log2 Expression Change.]

Example 3: Bivariate outlier in a simulated dataset

NOTE: The outlier is not extreme with respect to either x or y individually.

[Figure: scatter plot of y(k) value vs. x(k) value, both on a 0.0 to 1.0 scale, with one point marked as the outlier.]
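
The point of Example 3 can be reproduced with a small sketch. The slide does not say how the simulated dataset was generated or how the outlier would be detected, so everything below is an assumption: the points are made up (21 points close to the line y = x plus one off-line point), and the squared Mahalanobis distance with a chi-square cutoff of 9.21 (the 0.99 quantile for 2 degrees of freedom, valid under a Gaussian assumption) is used as one common way to flag a joint outlier.

```python
import statistics

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean,
    using the classical (non-robust) 2x2 sample covariance matrix."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Closed-form inverse of the 2x2 covariance matrix applied to each deviation.
    return [
        ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in points
    ]

# Hypothetical data: 21 points near y = x, then one point that breaks the pattern.
points = [(i / 20, i / 20 + (0.03 if i % 2 == 0 else -0.03)) for i in range(21)]
points.append((0.8, 0.3))  # assumed bivariate outlier; both coordinates lie inside [0, 1]

d2 = mahalanobis_sq(points)
CUTOFF = 9.21  # chi-square 0.99 quantile, 2 degrees of freedom (assumed threshold)
flagged = [i for i, d in enumerate(d2) if d > CUTOFF]

# Marginal z-scores of the last point, for comparison with a per-variable 3-sigma check.
xs = [p[0] for p in points]
ys = [p[1] for p in points]
zx = (points[-1][0] - statistics.fmean(xs)) / statistics.stdev(xs)
zy = (points[-1][1] - statistics.fmean(ys)) / statistics.stdev(ys)

print("flagged indices:", flagged)
print("marginal z-scores of last point:", round(zx, 2), round(zy, 2))
```

In this sketch the last point is within one standard deviation of the mean in each coordinate, so a per-variable check would miss it, yet it is the only point whose joint distance exceeds the cutoff. The classical covariance is itself sensitive to outliers, so with several joint outliers a robust covariance estimate would be a safer choice.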
