Measurement and Data Data describes the real world Data maps - PowerPoint PPT Presentation

Measurement and Data

Data describes the real world
• Data maps entities in the domain of interest to a symbolic representation by means of a measurement procedure
• Numerical relationships between variables capture relationships between objects
• The measurement process is crucial

Types of Measurement
• Ordinal, e.g., excellent=5, very good=4, good=3…
• Nominal, e.g., religion, profession
  – Need non-metric methods
• Ratio, e.g., weight
  – Has the concatenation property: two weights add to balance a third, 2+3 = 5
  – Changing scale (multiplying values by a constant) does not change ratios
• Interval, e.g., temperature, calendar time
  – Unit of measurement is arbitrary, as is the origin

Distance Measures
• Many data mining techniques (e.g., nearest-neighbor classification, cluster analysis) are based on similarity measures between objects
• s(i,j): similarity, d(i,j): dissimilarity
• Possible transformations: d(i,j) = 1 - s(i,j) or d(i,j) = sqrt(2*(1 - s(i,j)))
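The two transformations listed on this slide can be tried on a few values. This is a minimal sketch that assumes the similarity s is scaled to the range [0, 1]; the function names are illustrative and do not come from the slides.

```python
import math


def dissimilarity_linear(s):
    """d = 1 - s, for a similarity s assumed to lie in [0, 1]."""
    return 1.0 - s


def dissimilarity_sqrt(s):
    """d = sqrt(2 * (1 - s)), for a similarity s assumed to lie in [0, 1]."""
    return math.sqrt(2.0 * (1.0 - s))


# Identical objects (s = 1) get dissimilarity 0 under both transformations.
for s in (1.0, 0.5, 0.0):
    print(s, dissimilarity_linear(s), dissimilarity_sqrt(s))
```

Both maps are decreasing in s, so they preserve the ordering of object pairs from most to least similar.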

Metric Properties
1. d(i,j) ≥ 0 for all i and j, and d(i,j) = 0 if and only if i = j: Positivity
2. d(i,j) = d(j,i): Symmetry (commutativity)
3. d(i,j) ≤ d(i,k) + d(k,j): Triangle Inequality
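The three properties can be checked numerically on a small set of points. The sketch below is illustrative only: the helper name `is_metric_on` and the sample points are assumptions, and a check on a finite sample can refute the properties but cannot prove them in general.

```python
import itertools
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def squared_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def is_metric_on(points, d, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality on a finite sample."""
    for a, b in itertools.product(points, repeat=2):
        if d(a, b) < 0:
            return False
        if (d(a, b) <= tol) != (a == b):  # zero distance only for identical points
            return False
        if abs(d(a, b) - d(b, a)) > tol:
            return False
    for a, b, c in itertools.product(points, repeat=3):
        if d(a, b) > d(a, c) + d(c, b) + tol:
            return False
    return True


# Hypothetical sample points
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(is_metric_on(points, euclidean))
print(is_metric_on(points, squared_euclidean))
```

Squared Euclidean distance fails on this sample because it violates the triangle inequality: going from (0, 0) to (3, 4) directly costs 25, while the detour through (1, 1) costs 2 + 13 = 15.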

Euclidean Distance between vectors
$$d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}$$
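As a quick worked check of the formula, take two hypothetical 3-dimensional vectors x = (1, 2, 3) and y = (4, 6, 3), which are chosen for illustration and do not appear in the slides:

$$d_E(x, y) = \sqrt{(1-4)^2 + (2-6)^2 + (3-3)^2} = \sqrt{9 + 16 + 0} = 5$$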

Commensurability
• Euclidean distance assumes variables are commensurate
• E.g., each variable is a measure of length
• If one variable were a weight and another a length, there would be no obvious choice of units
• Altering the units would change which variables are important
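The effect of units can be seen in a small nearest-neighbor example. This is a sketch with made-up (height, weight) records; all names and numbers are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest(query, candidates):
    """Return the name of the candidate closest to query."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


def height_to_mm(person):
    """Convert (height in metres, weight in kg) to (height in mm, weight in kg)."""
    return (person[0] * 1000.0, person[1])


# Hypothetical people: (height in metres, weight in kilograms)
query_m = (1.80, 70.0)
people_m = {"A": (1.50, 70.5), "B": (1.79, 72.0)}

# The same people with height expressed in millimetres
query_mm = height_to_mm(query_m)
people_mm = {name: height_to_mm(p) for name, p in people_m.items()}

print(nearest(query_m, people_m))    # weight differences dominate
print(nearest(query_mm, people_mm))  # height differences dominate
```

With heights in metres the nearest neighbor is A, because the 30 cm height gap contributes only 0.3 to the distance. With heights in millimetres the same gap contributes 300 and the nearest neighbor becomes B.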

Standardizing the Data
• Divide each variable by its standard deviation
• Standard deviation for the k-th variable is
$$\sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \mu_k)^2 \right)^{1/2}$$
where
$$\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$
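A minimal sketch of this standardization, using the divide-by-n standard deviation from the slide. The data values and function names are hypothetical, and the code assumes no variable is constant (a zero standard deviation would cause a division by zero).

```python
import math


def column_std(column):
    """Standard deviation with the 1/n normalization used on the slide."""
    n = len(column)
    mu = sum(column) / n
    return math.sqrt(sum((x - mu) ** 2 for x in column) / n)


def standardize(rows):
    """Divide every variable (column) by its standard deviation."""
    columns = list(zip(*rows))
    sigmas = [column_std(col) for col in columns]
    return [[x / s for x, s in zip(row, sigmas)] for row in rows]


# Hypothetical data: 3 objects, 2 variables on very different scales
data = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
scaled = standardize(data)
for row in scaled:
    print(row)
```

After scaling, every column has standard deviation 1, so no variable dominates the distance purely because of its units.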

Weighted Euclidean Distance
• If we know the relative importance of the variables
$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k (x_k(i) - x_k(j))^2 \right)^{1/2}$$
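The weighted distance can be sketched as follows. The weights here are arbitrary illustrative values; the slides do not say how the weights should be chosen.

```python
import math


def weighted_euclidean(xi, xj, w):
    """Weighted Euclidean distance with one non-negative weight per variable."""
    if not (len(xi) == len(xj) == len(w)):
        raise ValueError("xi, xj and w must have the same length")
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, xi, xj)))


# Unit weights reduce to the ordinary Euclidean distance.
print(weighted_euclidean((0.0, 0.0), (3.0, 4.0), (1.0, 1.0)))
# A zero weight removes the second variable from the comparison.
print(weighted_euclidean((0.0, 0.0), (3.0, 4.0), (4.0, 0.0)))
```

Dividing each variable by its standard deviation before taking the Euclidean distance is the special case w_k = 1/σ_k².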

Need for Covariance in distance measure
• Suppose we measured a cup’s height 100 times and its diameter only once
• Clearly height will dominate, although 99 of the height measurements are not contributing anything
• They are very highly correlated
• To eliminate redundancy we need a data-driven method

Sample Covariance between X and Y
$$Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})$$
• Measure of how X and Y vary together
• Large positive value if large values of X tend to be associated with large values of Y and small values of X with small values of Y
• Large negative value if large values of X tend to be associated with small values of Y
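A small sketch of the sample covariance with the 1/n normalization from the slide. The data are hypothetical and chosen so that the sign behaviour described in the bullets is easy to see.

```python
def covariance(x, y):
    """Sample covariance with 1/n normalization."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n


x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]    # large x goes with large y
y_down = [8.0, 6.0, 4.0, 2.0]  # large x goes with small y

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative
```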

Correlation Coefficient
• The value of the covariance depends on the ranges of X and Y
• This dependence is removed by dividing the values of X by their standard deviation and the values of Y by their standard deviation
$$\rho(X, Y) = \frac{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\sigma_x \sigma_y}$$
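The difference between covariance and correlation under a change of units can be illustrated as follows. This is a sketch with hypothetical height and weight values; both statistics use the 1/n normalization.

```python
import math


def covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sigma_x = math.sqrt(covariance(x, x))
    sigma_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sigma_x * sigma_y)


# Hypothetical measurements
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50.0, 58.0, 63.0, 75.0]
heights_cm = [h * 100.0 for h in heights_m]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.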

Correlation Matrix

Mahalanobis Distance
$$d_M(i, j) = \left[ (x(i) - x(j))^T \, \Sigma^{-1} \, (x(i) - x(j)) \right]^{1/2}$$
where $\Sigma$ is the sample covariance matrix of the p variables
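A minimal sketch of this distance for the two-variable case, where the covariance matrix can be inverted in closed form. The function names and data are hypothetical, and the code assumes the covariance matrix is non-singular.

```python
import math


def covariance_matrix_2d(points):
    """2x2 sample covariance matrix (1/n normalization) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between 2-d points a and b for a 2x2 covariance matrix."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a symmetric 2x2 matrix
    inv = [[syy / det, -sxy / det], [-sxy / det, sxx / det]]
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


# Hypothetical, strongly correlated data
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
cov = covariance_matrix_2d(data)
print(mahalanobis_2d(data[0], data[3], cov))
```

With an identity covariance matrix the result equals the Euclidean distance; with a diagonal matrix it equals the Euclidean distance on standardized variables.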

Generalizing Euclidean Distance
• Minkowski or L_λ metric
$$\left( \sum_{k=1}^{p} |x_k(i) - x_k(j)|^{\lambda} \right)^{1/\lambda}$$
• λ = 2 gives the Euclidean metric

Minkowski metric
• λ = 1 is the Manhattan or city block metric
$$\sum_{k=1}^{p} |x_k(i) - x_k(j)|$$
• λ = infinity yields
$$\max_{k} |x_k(i) - x_k(j)|$$
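The three special cases can be compared with one small function. This is an illustrative sketch; the function name and the restriction to λ ≥ 1 are choices made here, not taken from the slides.

```python
import math


def minkowski(xi, xj, lam):
    """Minkowski (L_lambda) distance; lam may be math.inf for the max metric."""
    diffs = [abs(a - b) for a, b in zip(xi, xj)]
    if math.isinf(lam):
        return max(diffs)
    if lam < 1:
        raise ValueError("lam must be >= 1 for the triangle inequality to hold")
    return sum(d ** lam for d in diffs) ** (1.0 / lam)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))         # Manhattan: 3 + 4
print(minkowski(a, b, 2))         # Euclidean: sqrt(9 + 16)
print(minkowski(a, b, math.inf))  # maximum coordinate difference
```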

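Multivariate Binary Data
• S_ab counts the variables on which the first object has value a and the second object has value b
• The most obvious measure is the simple matching coefficient, the fraction of bits that agree (one minus the Hamming distance normalized by the number of bits)
$$\frac{S_{11} + S_{00}}{S_{11} + S_{10} + S_{01} + S_{00}}$$
• If we don’t care about irrelevant properties possessed by neither object, we have the Jaccard Coefficient
$$\frac{S_{11}}{S_{11} + S_{10} + S_{01}}$$
• The Dice Coefficient extends this argument: if 00 matches are irrelevant, then 10 and 01 mismatches should have half relevance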
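These coefficients can be computed from the four match counts. The sketch below uses hypothetical bit vectors. The Dice formula, 2·S11 / (2·S11 + S10 + S01), is the standard form and is supplied here because the slide text gives only the verbal description. The code assumes the vectors are not both all-zero, which would make the Jaccard and Dice denominators zero.

```python
def match_counts(x, y):
    """Return (S11, S10, S01, S00) for two equal-length 0/1 vectors."""
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return s11, s10, s01, s00


def simple_matching(x, y):
    s11, s10, s01, s00 = match_counts(x, y)
    return (s11 + s00) / (s11 + s10 + s01 + s00)


def jaccard(x, y):
    s11, s10, s01, _ = match_counts(x, y)
    return s11 / (s11 + s10 + s01)


def dice(x, y):
    s11, s10, s01, _ = match_counts(x, y)
    return 2 * s11 / (2 * s11 + s10 + s01)


# Hypothetical binary vectors
x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 1, 1, 0, 0, 0]
print(match_counts(x, y))
print(simple_matching(x, y), jaccard(x, y), dice(x, y))
```

For these vectors the many 0-0 agreements push the simple matching coefficient above the Jaccard and Dice values, which ignore them.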

Some Similarity/Dissimilarity Measures for N-dim Binary Vectors [table not recovered]

Weighted Dissimilarity Measures for Binary Vectors
• Unequal importance to ‘0’ matches and ‘1’ matches
• Multiply S_00 by β ∈ [0,1]
• Examples:
$$D_{sm}(X, Y) = 1 - \frac{S_{11} + \beta \cdot S_{00}}{N}$$
$$D_{rta}(X, Y) = \frac{2 (N - S_{11} - \beta \cdot S_{00})}{2N - S_{11} - \beta \cdot S_{00}}$$
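A sketch of the two weighted measures, assuming N is the length of the binary vectors. The function name and the example vectors are hypothetical.

```python
def weighted_dissimilarities(x, y, beta):
    """Return (D_sm, D_rta) with 0-0 matches down-weighted by beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    n = len(x)
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    agree = s11 + beta * s00
    d_sm = 1.0 - agree / n
    d_rta = 2.0 * (n - agree) / (2.0 * n - agree)
    return d_sm, d_rta


# Hypothetical binary vectors with S11 = 2 and S00 = 4
x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 1, 1, 0, 0, 0]
print(weighted_dissimilarities(x, y, 1.0))  # 0-0 matches count fully
print(weighted_dissimilarities(x, y, 0.5))  # 0-0 matches count half
```

Lowering β increases both dissimilarities, because shared absences contribute less to the agreement between the two vectors.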

Transforming the Data

V1 is non-linearly related to V2
V3 = 1/V2 is linearly related to V1
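The effect of the reciprocal transformation can be sketched with made-up data. The relation V2 = 1/(2·V1 + 1) is an assumption chosen for illustration; the slides do not give the actual data behind the plot.

```python
def slopes(xs, ys):
    """Slopes between consecutive points; constant slopes indicate a linear relation."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]


v1 = [1.0, 2.0, 3.0, 4.0, 5.0]
v2 = [1.0 / (2.0 * v + 1.0) for v in v1]  # hypothetical non-linear relation
v3 = [1.0 / v for v in v2]                # reciprocal transformation

print(slopes(v1, v2))  # slopes change from segment to segment
print(slopes(v1, v3))  # slopes are all (approximately) 2
```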

Variance increases (regression assumes variance is constant)
Square root transformation keeps the variance constant
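A small numerical sketch of variance stabilization. It assumes the variance grows in proportion to the mean, as is typical of count data; the slides do not state which data the plot shows, and the groups below are invented.

```python
import math


def variance(values):
    """Variance with 1/n normalization."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** 2 for v in values) / n


# Hypothetical groups whose spread equals sqrt(mean), so variance grows with the mean
groups = [
    [20.0, 25.0, 30.0],     # mean 25, spread 5
    [90.0, 100.0, 110.0],   # mean 100, spread 10
    [380.0, 400.0, 420.0],  # mean 400, spread 20
]

raw = [variance(g) for g in groups]
transformed = [variance([math.sqrt(v) for v in g]) for g in groups]
print(raw)          # increases with the group mean
print(transformed)  # roughly the same for every group
```

The raw variances differ by a factor of 16 between the first and last group, while the variances of the square roots are nearly equal.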

Form of Data

Data Matrix
• A set of p measurements on objects o(1)…o(n)
• n rows and p columns
• Also called standard data, data matrix or table

Multirelational Data
• Payroll database has
  – Employees table: name, department-name, age, salary
  – Department table: department-name, budget, manager
• The tables are connected to each other by the department-name field and by the fields name and manager
• Can be combined together, e.g., with fields name, department-name, age, salary, budget, manager
• Or create as many rows as department-names
• Flattening may require needless replication of values
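The flattening described above can be sketched as a join on department-name. The field names follow the slide; the records themselves and the helper name `flatten` are hypothetical.

```python
# Hypothetical payroll records
employees = [
    {"name": "Ann", "department-name": "Sales", "age": 41, "salary": 70000},
    {"name": "Bob", "department-name": "Sales", "age": 29, "salary": 48000},
    {"name": "Cy", "department-name": "R&D", "age": 35, "salary": 65000},
]
departments = [
    {"department-name": "Sales", "budget": 500000, "manager": "Ann"},
    {"department-name": "R&D", "budget": 900000, "manager": "Cy"},
]


def flatten(employees, departments):
    """Join the two tables into one row per employee."""
    by_dept = {d["department-name"]: d for d in departments}
    rows = []
    for e in employees:
        d = by_dept[e["department-name"]]
        rows.append({**e, "budget": d["budget"], "manager": d["manager"]})
    return rows


flat = flatten(employees, departments)
for row in flat:
    print(row)
```

The Sales budget and manager now appear once for every Sales employee, which is the replication of values the last bullet warns about.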

Data Quality

Outlier
