

Data Mining / Intelligent Data Analysis
Christian Borgelt
Dept. of Mathematics / Dept. of Computer Sciences
Paris Lodron University of Salzburg
Hellbrunner Straße 34, 5020 Salzburg, Austria
christian.borgelt@sbg.ac.at / christian@borgelt.net


Data Analysis / Data Mining Methods

• Artificial Neural Networks (multilayer perceptrons, radial basis function networks, learning vector quantization)
  tasks: classification, prediction, clustering
• Cluster Analysis (k-means and fuzzy clustering, Gaussian mixtures, hierarchical agglomerative clustering)
  tasks: segmentation, clustering
• Association Rule Induction (frequent item set mining, rule generation)
  tasks: association analysis
• Inductive Logic Programming (rule generation, version space, search strategies, declarative bias)
  tasks: classification, association analysis, concept description

Statistics

• Descriptive Statistics
  ◦ Tabular and Graphical Representations
  ◦ Characteristic Measures
  ◦ Principal Component Analysis
• Inductive Statistics
  ◦ Parameter Estimation (point and interval estimation, finding estimators)
  ◦ Hypothesis Testing (parameter test, goodness-of-fit test, dependence test)
  ◦ Model Selection (information criteria, minimum description length)
• Summary

Statistics: Introduction

Statistics is the art to collect, to display, to analyze, and to interpret data in order to gain new knowledge.
  "Applied Statistics" [Lothar Sachs 1999]

[...] statistics, that is, the mathematical treatment of reality, [...]
  Hannah Arendt [1906–1975] in "The Human Condition" (1958)

There are three kinds of lies: lies, damned lies, and statistics.
  Benjamin Disraeli [1804–1881] (attributed by Mark Twain, but disputed)

Statistics, n. Exactly 76.4% of all statistics (including this one) are invented on the spot. However, in 83% of cases it is inappropriate to admit it.
  The Devil's IT Dictionary

Basic Notions

• Object, Case
  Data describe objects, cases, persons etc.
• (Random) Sample
  The objects or cases described by a data set are called a sample, their number is the sample size.
• Attribute
  Objects and cases are described by attributes; patients in a hospital, for example, by age, sex, blood pressure etc.
• (Attribute) Value
  Attributes have different possible values. The age of a patient, for example, is a non-negative number.
• Sample Value
  The value an attribute has for an object in the sample is called a sample value.

Scale Types / Attribute Types

• Scale types and their possible operations:
  ◦ nominal (categorical, qualitative): test for equality; examples: sex/gender, blood group.
  ◦ ordinal (rank scale, comparative): test for equality, greater/less than; examples: exam grade, wind strength.
  ◦ metric (interval scale, quantitative): test for equality, greater/less than, difference, maybe ratio; examples: length, weight, time, temperature.
• Nominal scales are sometimes divided into dichotomic (binary, two values) and polytomic (more than two values).
• Metric scales may or may not allow us to form a ratio of values: weight and length do, temperature (in °C) does not; time as duration does, time as calendar time does not.
• Counts may be considered as a special type (e.g. number of children).

Descriptive Statistics

Tabular Representations: Frequency Table

• Given data set: x = (3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3)

    a_k   h_k   Σ_{i=1}^k h_i   r_k            Σ_{i=1}^k r_i
    1      2     2              2/25 = 0.08     2/25 = 0.08
    2      6     8              6/25 = 0.24     8/25 = 0.32
    3      9    17              9/25 = 0.36    17/25 = 0.68
    4      5    22              5/25 = 0.20    22/25 = 0.88
    5      3    25              3/25 = 0.12    25/25 = 1.00

• Absolute Frequency h_k (frequency of an attribute value a_k in the sample).
• Relative Frequency r_k = h_k / n, where n is the sample size (here n = 25).
• Cumulated Absolute/Relative Frequency Σ_{i=1}^k h_i and Σ_{i=1}^k r_i.
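The frequency table above can be reproduced with a few lines of code. The following is a minimal sketch in Python (variable names are chosen for illustration only):

    from collections import Counter

    data = [3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3]
    n = len(data)                      # sample size (here n = 25)

    counts = Counter(data)             # absolute frequencies h_k
    cum_h, cum_r = 0, 0.0
    print("a_k  h_k  cum_h  r_k   cum_r")
    for a_k in sorted(counts):
        h_k = counts[a_k]              # absolute frequency
        r_k = h_k / n                  # relative frequency
        cum_h += h_k                   # cumulated absolute frequency
        cum_r += r_k                   # cumulated relative frequency
        print(f"{a_k:3}  {h_k:3}  {cum_h:5}  {r_k:.2f}  {cum_r:.2f}")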

Tabular Representations: Contingency Tables

• Frequency tables for two or more attributes are called contingency tables.
• They contain the absolute or relative frequency of value combinations.

          a_1   a_2   a_3   a_4   Σ
    b_1    8     3     5     2    18
    b_2    2     6     1     3    12
    b_3    4     1     2     7    14
    Σ     14    10     8    12    44

• A contingency table may also contain the marginal frequencies, i.e., the frequencies of the values of individual attributes.
• Contingency tables for a higher number of dimensions (≥ 4) may be difficult to read.

Graphical Representations: Pole and Bar Chart

• Numbers, which may be, for example, the frequencies of attribute values, are represented by the lengths of poles/sticks (left) or the heights of bars (right).
  [figure: pole chart and bar chart of the example frequencies]
• Bar charts are the most frequently used and most comprehensible way of displaying absolute frequencies.
• A wrong impression can result if the vertical scale does not start at 0 (for frequencies or other absolute numbers).

Frequency Polygon and Line Chart

• Frequency polygon: the ends of the poles of a pole chart are connected by lines. (Numbers are still represented by lengths.)
  [figure: frequency polygon of the example data and a line chart with two series]
• If the attribute values on the horizontal axis are not ordered, connecting the ends of the poles does not make sense.
• Line charts are frequently used to display time series.

Area and Volume Charts

• Numbers may also be represented by geometric quantities other than lengths, like areas or volumes.
• Area and volume charts are usually less comprehensible than bar charts, because humans have more difficulty comparing and assessing the relative size of areas and especially of volumes than of lengths. (Exception: the represented numbers describe areas or volumes.)
  [figure: the values 1 to 5 shown as areas and as volumes]
• Sometimes the height of a two- or three-dimensional object is used to represent a number. The diagram then conveys a misleading impression.

Pie and Stripe Charts

• Relative numbers may be represented by angles (pie chart) or sections of a stripe (stripe chart).
  [figure: pie chart, stripe chart, and mosaic chart of the example frequencies]
• Mosaic charts can be used to display contingency tables.
• More than two attributes are possible, but then separation distances and color must support the visualization to keep it comprehensible.

Histograms

• Intuitively: Histograms are frequency bar charts for metric data.
• However: Since there are so many different values, values have to be grouped in order to arrive at a proper representation.
  Most common approach: form equally sized intervals (so-called bins) and count the frequency of sample values inside each interval.
• Attention: Depending on the size and the position of the bins the histogram may look considerably different.
• In sketches often only a rough outline of a histogram is drawn.

Histograms: Number of Bins

• Ages of customers of a supermarket/store (fictitious data); year of analysis: 2010.
• Depicted as a histogram with 10-year bins ([18–28], (28–38], ..., (88–98]), the data indicate a larger market share among younger people, but nothing is too conspicuous.
  [figure: age histogram with 10-year bins]
• The same data shown with much narrower 2-year bins from 18 to 98.
  [figure: age histogram with 2-year bins]

Histograms: Number of Bins

• Probability density function of a sample distribution, from which the data for the following histograms was sampled (1000 values).
  [figure: probability density over the attribute value range −3 to 6]
• A histogram with 11 bins, for the data that was sampled from the above distribution.
  [figure: 11-bin histogram]
• How should we choose the number of bins? What happens if we choose badly?

• A histogram with too few bins, for the same data as before.
  [figure: histogram with very wide bins]
• As a consequence of the low number of bins, the distribution looks unimodal (only one maximum), but skewed (asymmetric).

• A histogram with too many bins, for the same data as before.
  [figure: histogram with very narrow bins]
• As a consequence of the high number of bins, the shape of the distribution is not well discernible.

• A histogram with 11 bins, a number computed with Sturges' Rule:

    k = ⌈ log₂(n) + 1 ⌉,

  where n is the number of data points (here: n = 1000).
• Sturges' rule is tailored to data from normal distributions and data sets of moderate size (n ≤ 200).
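A minimal sketch of how Sturges' rule could be applied when building a histogram (plain Python, no plotting; the helper name is illustrative only):

    import math
    import random

    def sturges_bins(n):
        """Number of histogram bins according to Sturges' rule: k = ceil(log2(n) + 1)."""
        return math.ceil(math.log2(n) + 1)

    random.seed(0)
    data = [random.gauss(0.0, 1.0) for _ in range(1000)]   # 1000 sample values

    k = sturges_bins(len(data))                            # here: ceil(log2(1000) + 1) = 11
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        i = min(int((x - lo) / width), k - 1)              # put the maximum into the last bin
        counts[i] += 1
    print(k, counts)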

Histograms: Number of Bins

• Probability density function of a sample distribution, from which the data for the following histograms was sampled (1000 values).
  [figure: probability density over the attribute value range −3 to 6]
• A histogram with 17 bins, a number that was computed with

    k = ⌈ (max_i{x_i} − min_i{x_i}) / h ⌉,

  where the bin width h may be chosen as

    h = 3.5 · s · n^(−1/3)         (s: sample standard deviation)   or
    h = 2 · (Q₃ − Q₁) · n^(−1/3)   (Q₃ − Q₁: interquartile range).

  (A short code sketch of these two bin-width rules follows after this group of slides.)
  [figure: 17-bin histogram]

3-Dimensional Diagrams

• 3-dimensional bar charts may be used to display contingency tables (the 3rd dimension represents the value pair frequency).
• The 3rd spatial dimension may be replaced by a color scale. This type of chart is sometimes referred to as a heatmap.
  (In a 3-dimensional bar chart color may also code the z-value (redundantly).)
• Surface plots are 3-dimensional analogs of line charts.
  [figure: 3-dimensional bar chart, heatmap, and surface plot]

Scatter Plots

• Scatter plots are used to display 2- or 3-dimensional metric data sets.
• Sample values are the coordinates of a point (that is, numbers are represented by lengths).
  [figure: 3-dimensional scatter plot of the wine data; scatter plot of petal length vs. petal width (in cm) of the iris data]
• Scatter plots provide simple means to check for dependency.

How to Lie with Statistics

• Often the vertical axis of a pole or bar chart does not start at zero, but at some higher value. In such a case the conveyed impression of the ratio of the depicted values is completely wrong.
  [pictures not available in online version]
• This effect is used to brag about increases in turnover, speed etc.
• Sources of these diagrams and those on the following transparencies:
  Darrell Huff: How to Lie with Statistics, 1954.
  Walter Krämer: So lügt man mit Statistik, 1991.
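Returning to the two bin-width rules above, here is a minimal sketch (assuming the common attributions: the first rule is Scott's rule, the second the Freedman–Diaconis rule; the function name is illustrative):

    import numpy as np

    def bin_count(data, rule="fd"):
        """Number of histogram bins from a bin-width rule: k = ceil((max - min) / h)."""
        x = np.asarray(data, dtype=float)
        n = len(x)
        if rule == "scott":                       # h = 3.5 * s * n^(-1/3)
            h = 3.5 * x.std(ddof=1) * n ** (-1 / 3)
        else:                                     # h = 2 * (Q3 - Q1) * n^(-1/3)
            q1, q3 = np.percentile(x, [25, 75])
            h = 2.0 * (q3 - q1) * n ** (-1 / 3)
        return int(np.ceil((x.max() - x.min()) / h))

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0, 1, 700), rng.normal(3, 0.5, 300)])  # bimodal sample
    print(bin_count(data, "scott"), bin_count(data, "fd"))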

How to Lie with Statistics

[pictures not available in online version]

• Depending on the position of the zero line of a pole, bar, or line chart completely different impressions can be conveyed.
• Poles and bars are frequently replaced by (sketches of) objects in order to make the diagram more aesthetically appealing.
  However, objects are perceived as 2- or even 3-dimensional and thus convey a completely different impression of the numerical ratios.
• In the left diagram the areas of the barrels represent the numerical value. However, since the barrels are drawn 3-dimensional, a wrong impression of the numerical ratios is conveyed.
• The right diagram is particularly striking: an area measure is represented by the side length of a rectangle representing the apartment.

Good Data Visualization

[picture not available in online version]

• This is likely the most famous example of good data visualization.
• It is easy to understand and conveys information about several quantities, like number of people, location, temperature etc. [Charles Joseph Minard 1869]

Descriptive Statistics: Characteristic Measures

Idea: Describe a given sample by few characteristic measures and thus summarize the data.

• Localization Measures
  Localization measures describe, often by a single number, where the data points of a sample are located in the domain of an attribute.
• Dispersion Measures
  Dispersion measures describe how much the data points vary around a localization parameter and thus indicate how well this parameter captures the localization of the data.
• Shape Measures
  Shape measures describe the shape of the distribution of the data points relative to a reference distribution. The most common reference distribution is the normal distribution (Gaussian).

Localization Measures: Mode and Median

• Mode x*
  The mode is the attribute value that is most frequent in the sample.
  It need not be unique, because several values can have the same frequency.
  It is the most general measure, because it is applicable for all scale types.
• Median x̃
  The median minimizes the sum of absolute differences:

    Σ_{i=1}^n |x_i − x̃| = min,    and thus    Σ_{i=1}^n sgn(x_i − x̃) = 0.

  If x = (x_(1), ..., x_(n)) is a sorted data set, the median is defined as

    x̃ = x_((n+1)/2)                   if n is odd,
    x̃ = ½ ( x_(n/2) + x_(n/2+1) )     if n is even.

  The median is applicable to ordinal and metric attributes.
  (For non-metric attributes either x_(n/2) or x_(n/2+1) needs to be chosen for even n.)
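A minimal sketch of computing mode and median exactly as defined above (pure Python; ties for the mode are all returned):

    def mode(sample):
        """Most frequent value(s); the mode need not be unique."""
        counts = {}
        for x in sample:
            counts[x] = counts.get(x, 0) + 1
        best = max(counts.values())
        return [v for v, c in counts.items() if c == best]

    def median(sample):
        """Middle element of the sorted sample (mean of the two middle elements for even n)."""
        s = sorted(sample)
        n = len(s)
        if n % 2 == 1:
            return s[n // 2]                      # x_((n+1)/2) with 0-based indexing
        return 0.5 * (s[n // 2 - 1] + s[n // 2])  # (x_(n/2) + x_(n/2+1)) / 2

    data = [3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3]
    print(mode(data), median(data))               # mode [3], median 3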

Localization Measures: Arithmetic Mean

• Arithmetic Mean x̄
  The arithmetic mean minimizes the sum of squared differences:

    Σ_{i=1}^n (x_i − x̄)² = min,    and thus    Σ_{i=1}^n (x_i − x̄) = Σ_{i=1}^n x_i − n x̄ = 0.

  The arithmetic mean is defined as

    x̄ = (1/n) Σ_{i=1}^n x_i.

  The arithmetic mean is only applicable to metric attributes.
• Even though the arithmetic mean is the most common localization measure, the median is preferable if
  ◦ there are few sample cases,
  ◦ the distribution is asymmetric, and/or
  ◦ one expects that outliers are present.

How to Lie with Statistics

[pictures not available in online version]

Dispersion Measures: Range and Interquantile Range

  A man with his head in the freezer and feet in the oven is on the average quite comfortable. (old statistics joke)

• Range R
  The range of a data set is the difference between the maximum and the minimum value:

    R = x_max − x_min = max_{i=1..n} x_i − min_{i=1..n} x_i.

• Interquantile Range
  The p-quantile of a data set is a value such that a fraction p of all sample values are smaller than this value. (The median is the ½-quantile.)
  The p-interquantile range, 0 < p < ½, is the difference between the (1−p)-quantile and the p-quantile.
  The most common is the interquartile range (p = ¼).

Dispersion Measures: Average Absolute Deviation

• Average Absolute Deviation
  The average absolute deviation is the average of the absolute deviations of the sample values from the median or the arithmetic mean.
• Average Absolute Deviation from the Median

    d_x̃ = (1/n) Σ_{i=1}^n |x_i − x̃|.

• Average Absolute Deviation from the Arithmetic Mean

    d_x̄ = (1/n) Σ_{i=1}^n |x_i − x̄|.

• It is always d_x̃ ≤ d_x̄, since the median minimizes the sum of absolute deviations (see the definition of the median).
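The dispersion measures above in a short sketch (numpy; note that several quantile conventions exist, so the interquartile range below reflects numpy's default convention):

    import numpy as np

    def dispersion(sample):
        x = np.asarray(sample, dtype=float)
        data_range = x.max() - x.min()                  # range R
        q1, q3 = np.percentile(x, [25, 75])             # quartiles -> interquartile range
        med, mean = np.median(x), x.mean()
        d_med = np.mean(np.abs(x - med))                # average absolute deviation from the median
        d_mean = np.mean(np.abs(x - mean))              # ... from the arithmetic mean (>= d_med)
        return data_range, q3 - q1, d_med, d_mean

    data = [3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3]
    print(dispersion(data))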

Dispersion Measures: Variance and Standard Deviation

• (Empirical) Variance s²
  It would be natural to define the variance as the average squared deviation:

    v² = (1/n) Σ_{i=1}^n (x_i − x̄)².

  However, inductive statistics suggests that it is better defined as
  (Bessel's correction, after Friedrich Wilhelm Bessel, 1784–1846)

    s² = 1/(n−1) Σ_{i=1}^n (x_i − x̄)².

• (Empirical) Standard Deviation s
  The standard deviation is the square root of the variance, i.e.,

    s = √s² = √( 1/(n−1) Σ_{i=1}^n (x_i − x̄)² ).

• Special Case: Normal/Gaussian Distribution
  The variance/standard deviation provides information about the height of the mode and the width of the curve:

    f_X(x; μ, σ²) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) ).

  [figure: Gaussian density with mode height 1/√(2πσ²) and inflection points at μ ± σ]

  μ: expected value, estimated by the mean value x̄;
  σ²: variance, estimated by the (empirical) variance s²;
  σ: standard deviation, estimated by the (empirical) standard deviation s.
  (Details about parameter estimation are studied later.)

• Note that it is often more convenient to compute the variance using the formula that results from the following transformation:

    s² = 1/(n−1) Σ_{i=1}^n (x_i − x̄)²
       = 1/(n−1) Σ_{i=1}^n ( x_i² − 2 x_i x̄ + x̄² )
       = 1/(n−1) ( Σ_{i=1}^n x_i² − 2 x̄ Σ_{i=1}^n x_i + n x̄² )
       = 1/(n−1) ( Σ_{i=1}^n x_i² − 2 n x̄² + n x̄² )
       = 1/(n−1) ( Σ_{i=1}^n x_i² − n x̄² )
       = 1/(n−1) ( Σ_{i=1}^n x_i² − (1/n) ( Σ_{i=1}^n x_i )² ).

• Advantage: The sums Σ_{i=1}^n x_i and Σ_{i=1}^n x_i² can both be computed in the same traversal of the data, and from them both mean and variance can be computed.
  (A short sketch of this single-pass computation follows after this group of slides.)

Shape Measures: Skewness

• The skewness α₃ (or skew for short) measures whether, and if so, how much, a distribution differs from a symmetric distribution.
• It is computed from the 3rd moment about the mean, which explains the index 3:

    α₃ = 1/(n·v³) Σ_{i=1}^n (x_i − x̄)³ = (1/n) Σ_{i=1}^n z_i³,

  where z_i = (x_i − x̄)/v and v² = (1/n) Σ_{i=1}^n (x_i − x̄)².
• α₃ < 0: right steep;  α₃ = 0: symmetric;  α₃ > 0: left steep.
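A minimal sketch of the single-pass computation referenced above: accumulate Σ x_i and Σ x_i² in one traversal, then derive mean and (Bessel-corrected) variance.

    def mean_and_variance(sample):
        """One traversal: accumulate sum(x) and sum(x^2), then apply
        s^2 = (sum(x^2) - n*mean^2) / (n - 1)."""
        n, s1, s2 = 0, 0.0, 0.0
        for x in sample:
            n += 1
            s1 += x
            s2 += x * x
        mean = s1 / n
        var = (s2 - n * mean * mean) / (n - 1)
        return mean, var

    print(mean_and_variance([5, 15, 21, 29, 31, 43, 49, 51, 61, 65]))   # (37.0, 400.0)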

Shape Measures: Kurtosis

• The kurtosis or excess α₄ measures how much a distribution is arched, usually compared to a Gaussian distribution.
• It is computed from the 4th moment about the mean, which explains the index 4:

    α₄ = 1/(n·v⁴) Σ_{i=1}^n (x_i − x̄)⁴ = (1/n) Σ_{i=1}^n z_i⁴,

  where z_i = (x_i − x̄)/v and v² = (1/n) Σ_{i=1}^n (x_i − x̄)².
• α₄ < 3: platykurtic (flatter than a Gaussian);  α₄ = 3: Gaussian;  α₄ > 3: leptokurtic (more peaked than a Gaussian).

Moments of Data Sets

• The k-th moment of a data set is defined as

    m'_k = (1/n) Σ_{i=1}^n x_i^k.

  The first moment is the mean m'_1 = x̄ of the data set.
  Using the moments of a data set the variance can also be written as

    v² = m'_2 − (m'_1)²    and    s² = n/(n−1) ( m'_2 − (m'_1)² ).

• The k-th moment about the mean is defined as

    m_k = (1/n) Σ_{i=1}^n (x_i − x̄)^k.

  It is m_1 = 0 and m_2 = v² (i.e., the average squared deviation).
  The skewness is α₃ = m_3 / m_2^(3/2) and the kurtosis is α₄ = m_4 / m_2².

Visualizing Characteristic Measures: Box Plots

• A box plot is a common way to combine some important characteristic measures into a single graphical representation. From top to bottom it shows:
  ◦ outliers,
  ◦ the maximum (or max{x | x ≤ Q₃ + 1.5(Q₃ − Q₁)} or the 97.5% quantile),
  ◦ Q₃, the 3rd quartile,
  ◦ the arithmetic mean x̄,
  ◦ the median x̃ = Q₂ (2nd quartile),
  ◦ Q₁, the 1st quartile,
  ◦ the minimum (or min{x | x ≥ Q₁ − 1.5(Q₃ − Q₁)} or the 2.5% quantile).
• Often the central box is drawn constricted w.r.t. the arithmetic mean in order to emphasize its location.
• The "whiskers" are often limited in length to 1.5(Q₃ − Q₁). Data points beyond these limits are suspected to be outliers.
• Box plots are often used to get a quick impression of the distribution of the data by showing them side by side for several attributes or data subsets.

Box Plots: Examples

• Left top: two samples from a standard normal distribution.
• Left bottom: two samples from an exponential distribution.
• Right bottom: probability density function of the exponential distribution with λ = 1.
  [figure: box plots of the samples and the exponential density]
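A minimal sketch of computing the numbers a box plot displays (quartiles, whisker limits at 1.5 times the interquartile range, and the points flagged as suspected outliers); the quantile convention is numpy's default and may differ slightly from other software:

    import numpy as np

    def boxplot_stats(sample):
        x = np.asarray(sample, dtype=float)
        q1, q2, q3 = np.percentile(x, [25, 50, 75])
        iqr = q3 - q1
        lo_lim, hi_lim = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # whisker limits
        whisker_lo = x[x >= lo_lim].min()                      # lowest point inside the limits
        whisker_hi = x[x <= hi_lim].max()                      # highest point inside the limits
        outliers = x[(x < lo_lim) | (x > hi_lim)]              # suspected outliers
        return {"Q1": q1, "median": q2, "Q3": q3, "mean": x.mean(),
                "whiskers": (whisker_lo, whisker_hi), "outliers": outliers}

    rng = np.random.default_rng(1)
    print(boxplot_stats(rng.exponential(1.0, 100)))            # skewed sample: outliers on the right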

Multidimensional Characteristic Measures

General Idea: Transfer the characteristic measures to vectors.

• Arithmetic Mean
  The arithmetic mean for multi-dimensional data is the vector mean of the data points. For two dimensions it is

    (x̄, ȳ) = (1/n) Σ_{i=1}^n (x_i, y_i).

  For the arithmetic mean the transition to several dimensions only combines the arithmetic means of the individual dimensions into one vector.
• Other measures are transferred in a similar way. However, sometimes the transfer leads to new quantities, as for the variance, which requires adaptation due to its quadratic nature.

Excursion: Vector Products

General Idea: Transfer dispersion measures to vectors.
For the variance, the square of the difference to the mean has to be generalized.

• Inner Product (scalar product):

    v⊤ v = (v_1, v_2, ..., v_m) (v_1, v_2, ..., v_m)⊤ = Σ_{i=1}^m v_i².

• Outer Product (matrix product):

    v v⊤ = the m×m matrix with entries (v v⊤)_ij = v_i v_j.

• In principle both vector products may be used for a generalization.
• The second, however, yields more information about the distribution:
  ◦ a measure of the (linear) dependence of the attributes,
  ◦ a description of the direction dependence of the dispersion.

Covariance Matrix

• Compute the variance formula with vectors (square: outer product v v⊤):

    S = 1/(n−1) Σ_{i=1}^n ( (x_i, y_i)⊤ − (x̄, ȳ)⊤ ) ( (x_i, y_i)⊤ − (x̄, ȳ)⊤ )⊤
      = ( s_x²   s_xy )
        ( s_yx   s_y² ),

  where s_x² and s_y² are the variances and s_xy is the covariance of x and y:

    s_x² = s_xx = 1/(n−1) Σ_{i=1}^n (x_i − x̄)² = 1/(n−1) ( Σ_{i=1}^n x_i² − n x̄² ),
    s_y² = s_yy = 1/(n−1) Σ_{i=1}^n (y_i − ȳ)² = 1/(n−1) ( Σ_{i=1}^n y_i² − n ȳ² ),
    s_xy = s_yx = 1/(n−1) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = 1/(n−1) ( Σ_{i=1}^n x_i y_i − n x̄ ȳ ).

  (Using n−1 instead of n is called Bessel's correction, after Friedrich Wilhelm Bessel, 1784–1846.)
  (A short covariance-matrix sketch in code follows after this group of slides.)

Reminder: Variance and Standard Deviation

• Special Case: Normal/Gaussian Distribution
  The variance/standard deviation provides information about the height of the mode and the width of the curve:

    f_X(x; μ, σ²) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) ).

  [figure: Gaussian density with mode height 1/√(2πσ²) and inflection points at μ ± σ]

  μ: expected value, estimated by the mean value x̄;
  σ²: variance, estimated by the (empirical) variance s²;
  σ: standard deviation, estimated by the (empirical) standard deviation s.
  Important: the standard deviation has the same unit as the expected value.
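A minimal sketch of the covariance matrix as a sum of outer products of the centered data vectors (numpy; np.cov gives the same result and serves as a cross-check):

    import numpy as np

    def covariance_matrix(data):
        """S = 1/(n-1) * sum of outer products of (data point - mean vector)."""
        X = np.asarray(data, dtype=float)             # shape (n, m): n points, m attributes
        n = X.shape[0]
        Z = X - X.mean(axis=0)                        # center each attribute
        S = sum(np.outer(z, z) for z in Z) / (n - 1)  # outer-product form from the slide
        return S

    xy = np.column_stack([[5, 15, 21, 29, 31, 43, 49, 51, 61, 65],
                          [33, 35, 24, 21, 27, 16, 18, 10, 4, 12]])
    print(covariance_matrix(xy))
    print(np.cov(xy, rowvar=False))                   # same result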

Multivariate Normal Distribution

• A univariate normal distribution has the density function

    f_X(x; μ, σ²) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) ),

  μ: expected value, estimated by the mean value x̄;
  σ²: variance, estimated by the (empirical) variance s²;
  σ: standard deviation, estimated by the (empirical) standard deviation s.
• A multivariate normal distribution has the density function

    f_X(x; μ, Σ) = 1/√( (2π)^m |Σ| ) · exp( −½ (x − μ)⊤ Σ⁻¹ (x − μ) ),

  m: size of the vector x (it is m-dimensional);
  μ: expected value vector, estimated by the mean value vector x̄;
  Σ: covariance matrix, estimated by the (empirical) covariance matrix S;
  |Σ|: determinant of the covariance matrix Σ.

Interpretation of a Covariance Matrix

• The variance/standard deviation relates the spread of the distribution to the spread of a standard normal distribution (σ² = σ = 1).
• The covariance matrix relates the spread of the distribution to the spread of a multivariate standard normal distribution (Σ = 1, the unit matrix).
• Example: bivariate normal distribution.
  [figure: density surfaces of a standard and a general bivariate normal distribution]
• Question: Is there a multivariate analog of standard deviation?

• First insight:
  If the covariances vanish, Σ = ( σ_x²  0 ;  0  σ_y² ), the contour lines are axes-parallel ellipses.
  The contour ellipse shown is inscribed into the rectangle [−σ_x, σ_x] × [−σ_y, σ_y].
• Second insight:
  If the covariances do not vanish, Σ = ( σ_x²  σ_xy ;  σ_xy  σ_y² ), the contour lines are rotated ellipses.
  Still the ellipse is inscribed into the rectangle [−σ_x, σ_x] × [−σ_y, σ_y].
• Consequence: A covariance matrix describes a scaling and a rotation.

• A covariance matrix is always positive semi-definite.
  ◦ positive semi-definite: ∀ v ∈ ℝ^m: v⊤ S v ≥ 0;
    negative semi-definite: ∀ v ∈ ℝ^m: v⊤ S v ≤ 0.
  ◦ For any x ∈ ℝ^m the outer product x x⊤ yields a positive semi-definite matrix:

      ∀ v ∈ ℝ^m:   v⊤ x x⊤ v = (v⊤ x)² ≥ 0.

  ◦ If S_i, i = 1, ..., k, are positive (negative) semi-definite matrices, then S = Σ_{i=1}^k S_i is a positive (negative) semi-definite matrix:

      ∀ v ∈ ℝ^m:   v⊤ S v = v⊤ ( Σ_{i=1}^k S_i ) v = Σ_{i=1}^k v⊤ S_i v ≥ 0.

  ◦ A(n empirical) covariance matrix is computed as

      S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤   (up to the factor 1/(n−1)).

    As the sum of positive semi-definite matrices it is positive semi-definite itself.

Interpretation of a Covariance Matrix

• A covariance matrix is generally positive definite, unless all data points lie in a lower-dimensional (linear) subspace.
  ◦ positive definite: ∀ v ∈ ℝ^m − {0}: v⊤ S v > 0;
    negative definite: ∀ v ∈ ℝ^m − {0}: v⊤ S v < 0.
• A(n empirical) covariance matrix is computed as S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤.
• Let z_i = x_i − x̄, i = 1, ..., n, and suppose that

    ∃ v ∈ ℝ^m − {0}: ∀ i, 1 ≤ i ≤ n:   v⊤ z_i = 0   (implying v⊤ z_i z_i⊤ v = (v⊤ z_i)² = 0).

  Furthermore, suppose that the set {z_1, ..., z_n} of difference vectors spans ℝ^m.
  Then there exist α_1, ..., α_n ∈ ℝ such that v = α_1 z_1 + ... + α_n z_n.
  Hence v⊤ v = v⊤ z_1 α_1 + ... + v⊤ z_n α_n = 0 (each term vanishes by assumption), implying v = 0, contradicting v ≠ 0.
• Therefore, if the z_i, i = 1, ..., n, span ℝ^m, then S is positive definite.
  Only if the z_i, i = 1, ..., n, do not span ℝ^m, that is, if the data points lie in a lower-dimensional (linear) subspace, is S only positive semi-definite.

Cholesky Decomposition

• Intuitively: Compute an analog of standard deviation.
• Let S be a symmetric, positive definite matrix (e.g. a covariance matrix). Cholesky decomposition serves the purpose to compute a "square root" of S.
  ◦ symmetric: ∀ 1 ≤ i, j ≤ m: s_ij = s_ji, or S⊤ = S (S⊤ is the transpose of the matrix S);
  ◦ positive definite: for all m-dimensional vectors v ≠ 0 it is v⊤ S v > 0.
• Formally: Compute a left/lower triangular matrix L such that L L⊤ = S (L⊤ is the transpose of the matrix L):

    l_ii = ( s_ii − Σ_{k=1}^{i−1} l_ik² )^(1/2),
    l_ji = (1/l_ii) ( s_ij − Σ_{k=1}^{i−1} l_ik l_jk ),   j = i+1, i+2, ..., m.

• Special Case: Two Dimensions
  Covariance matrix and its Cholesky decomposition:

    Σ = ( σ_x²   σ_xy )        L = ( σ_x         0                            )
        ( σ_xy   σ_y² ),           ( σ_xy/σ_x    (1/σ_x) √(σ_x² σ_y² − σ_xy²) ).

  [figure: the unit circle and its image under the mapping v' = L v]

Eigenvalue Decomposition

• Eigenvalue decomposition also yields an analog of standard deviation.
• It is computationally more expensive than Cholesky decomposition.
• Let S be a symmetric, positive definite matrix (e.g. a covariance matrix).
  ◦ S can be written as S = R diag(λ_1, ..., λ_m) R⁻¹, where the λ_j, j = 1, ..., m, are the eigenvalues of S and the columns of R are the (normalized) eigenvectors of S.
  ◦ The eigenvalues λ_j, j = 1, ..., m, of S are all positive and the eigenvectors of S are orthonormal (→ R⁻¹ = R⊤).
• Due to the above, S can be written as S = T T⊤, where

    T = R diag( √λ_1, ..., √λ_m ).
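A minimal sketch that computes the two "square roots" discussed above for a 2×2 covariance matrix and checks that both reproduce S (numpy):

    import numpy as np

    S = np.array([[2.33, 1.44],
                  [1.44, 2.41]])                     # one of the example covariance matrices

    # Cholesky: lower triangular L with L @ L.T == S
    L = np.linalg.cholesky(S)

    # Eigenvalue decomposition: S = R diag(lam) R.T, T = R diag(sqrt(lam))
    lam, R = np.linalg.eigh(S)                       # eigh: for symmetric matrices
    T = R @ np.diag(np.sqrt(lam))

    print(np.allclose(L @ L.T, S), np.allclose(T @ T.T, S))   # True True
    print(L)                                         # compare with L ≈ (1.52 0; 0.95 1.22)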

Eigenvalue Decomposition

• Special Case: Two Dimensions
  Covariance matrix and its eigenvalue decomposition:

    Σ = ( σ_x²   σ_xy )        T = ( c   −s ) ( σ_1   0   )
        ( σ_xy   σ_y² ),           ( s    c ) ( 0     σ_2 ),

  where s = sin φ, c = cos φ, φ = ½ arctan( 2σ_xy / (σ_x² − σ_y²) ), and

    σ_1 = √( c² σ_x² + s² σ_y² + 2 s c σ_xy ),
    σ_2 = √( s² σ_x² + c² σ_y² − 2 s c σ_xy ).

  [figure: the unit circle mapped by v' = T v to a rotated ellipse with semi-axes σ_1, σ_2]

• Eigenvalue decomposition enables us to write a covariance matrix Σ as

    Σ = T T⊤   with   T = R diag( √λ_1, ..., √λ_m ).

  As a consequence we can write its inverse Σ⁻¹ as

    Σ⁻¹ = U⊤ U   with   U = diag( λ_1^(−1/2), ..., λ_m^(−1/2) ) R⊤.

  U describes the inverse mapping of T, i.e., it rotates the ellipse so that its axes coincide with the coordinate axes and then scales the axes to unit length. Hence:

    (x − y)⊤ Σ⁻¹ (x − y) = (x − y)⊤ U⊤ U (x − y) = (x' − y')⊤ (x' − y'),

  where x' = U x and y' = U y.
• Result: (x − y)⊤ Σ⁻¹ (x − y) is equivalent to the squared Euclidean distance in the properly scaled eigensystem of the covariance matrix Σ.

    d(x, y) = √( (x − y)⊤ Σ⁻¹ (x − y) )   is called the Mahalanobis distance.

  (A short sketch of this computation in code follows after this group of slides.)

• Eigenvalue decomposition also shows that the determinant of the covariance matrix Σ provides a measure of the (hyper-)volume of the (hyper-)ellipsoid. It is

    |Σ| = |R| |diag(λ_1, ..., λ_m)| |R⊤| = |diag(λ_1, ..., λ_m)| = Π_{i=1}^m λ_i,

  since |R| = |R⊤| = 1 as R is orthogonal with unit length columns, and thus

    √|Σ| = √( Π_{i=1}^m λ_i ),

  which is proportional to the (hyper-)volume of the (hyper-)ellipsoid.
  To be precise, the volume of the m-dimensional (hyper-)ellipsoid to which a (hyper-)sphere with radius r is mapped with the eigenvalue decomposition of a covariance matrix Σ is

    V_m(r) = π^(m/2) / Γ(m/2 + 1) · r^m · √|Σ|,

  where Γ(x) = ∫_0^∞ e^(−t) t^(x−1) dt for x > 0, with Γ(x+1) = x·Γ(x), Γ(½) = √π, Γ(1) = 1.

• Special Case: Two Dimensions
  Covariance matrix and its eigenvalue decomposition:

    Σ = ( σ_x²   σ_xy )        T = ( cos φ   −sin φ ) ( σ_1   0   )
        ( σ_xy   σ_y² ),           ( sin φ    cos φ ) ( 0     σ_2 ).

  [figure: the unit circle mapped by v' = T v to an ellipse with semi-axes σ_1, σ_2]

  The area of the ellipse, to which the unit circle (area π) is mapped, is

    A = π σ_1 σ_2 = π √|Σ|.
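A minimal sketch of the Mahalanobis distance defined above, computed once directly via Σ⁻¹ and once via the rescaled eigensystem (numpy):

    import numpy as np

    def mahalanobis(x, y, S):
        """d(x, y) = sqrt((x - y)^T S^-1 (x - y)) for covariance matrix S."""
        d = np.asarray(x, float) - np.asarray(y, float)
        return float(np.sqrt(d @ np.linalg.inv(S) @ d))

    S = np.array([[2.25, -1.93],
                  [-1.93, 2.23]])                    # example covariance matrix
    x, y = np.array([1.0, 2.0]), np.array([0.0, 0.0])

    # Equivalent route via the eigensystem: U = diag(1/sqrt(lam)) R^T,
    # then the Mahalanobis distance is the Euclidean distance of U x and U y.
    lam, R = np.linalg.eigh(S)
    U = np.diag(1.0 / np.sqrt(lam)) @ R.T
    print(mahalanobis(x, y, S), np.linalg.norm(U @ x - U @ y))   # identical values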

Covariance Matrices of Example Data Sets

• Four two-dimensional example data sets with their covariance matrices Σ and the lower triangular matrices L of their Cholesky decompositions (rows separated by semicolons):

    Σ ≈ ( 3.59  0.19 ;   0.19  3.54 ),   L ≈ ( 1.90  0 ;   0.10  1.88 )
    Σ ≈ ( 2.33  1.44 ;   1.44  2.41 ),   L ≈ ( 1.52  0 ;   0.95  1.22 )
    Σ ≈ ( 1.88  1.62 ;   1.62  2.03 ),   L ≈ ( 1.37  0 ;   1.18  0.80 )
    Σ ≈ ( 2.25  −1.93 ;  −1.93  2.23 ),  L ≈ ( 1.50  0 ;  −1.29  0.76 )

Covariance Matrix: Summary

• A covariance matrix provides information about the height of the mode and about the spread/dispersion of a multivariate normal distribution (or of a set of data points that are roughly normally distributed).
• A multivariate analog of standard deviation can be computed with Cholesky decomposition and eigenvalue decomposition. The resulting matrix describes the distribution's shape and orientation.
• The shape and the orientation of a two-dimensional normal distribution can be visualized as an ellipse (curve of equal probability density; similar to a contour line, i.e. a line of equal height, on a map).
• The shape and the orientation of a three-dimensional normal distribution can be visualized as an ellipsoid (surface of equal probability density).
• The (square root of the) determinant of a covariance matrix describes the spread of a multivariate normal distribution with a single value. It is a measure of the area or (hyper-)volume of the (hyper-)ellipsoid.

Correlation and Principal Component Analysis

Correlation Coefficient

• The covariance is a measure of the strength of linear dependence of the two quantities of which it is computed.
• However, its value depends on the variances of the individual dimensions.
  ⇒ Normalize to unit variance in the individual dimensions.
• Correlation Coefficient (more precisely: Pearson's Product Moment Correlation Coefficient or Bravais–Pearson Correlation Coefficient)

    ρ_xy = s_xy / (s_x s_y),    ρ_xy ∈ [−1, +1].

• ρ_xy measures the strength of linear dependence (of y on x):
  ρ_xy = −1: the data points lie perfectly on a straight line with negative slope;
  ρ_xy = +1: the data points lie perfectly on a straight line with positive slope;
  ρ_xy = 0: there is no linear dependence between the two attributes (but there may be a non-linear dependence!).

Correlation Coefficient

• ρ_xy exists whenever s_x > 0 and s_y > 0, and then we have −1 ≤ ρ_xy ≤ +1.
• In case of ρ_xy = 0, we call the sample (x_1, y_1), ..., (x_n, y_n) uncorrelated.
• ρ_xy is not a measure of dependence in general; it only measures linear dependence.
  ρ_xy = 0 only means that there is no linear dependence.
• Example: Suppose the data points lie symmetrically on a parabola; then ρ_xy = 0.
• Note that ρ_xy = ρ_yx (simply because s_xy = s_yx), which justifies that we merely write ρ in the following.

Correlation Coefficients of Example Data Sets

• no correlation (ρ ≈ 0.05), weak positive correlation (ρ ≈ 0.61), strong positive correlation (ρ ≈ 0.83), strong negative correlation (ρ ≈ −0.86).
  [figure: four scatter plots illustrating these correlation strengths]

Correlation Matrix

• Normalize Data (z-score normalization)
  Transform data to mean value 0 and variance/standard deviation 1:

    ∀ i, 1 ≤ i ≤ n:   x'_i = (x_i − x̄) / s_x,   y'_i = (y_i − ȳ) / s_y.

• Compute Covariance Matrix of Normalized Data
  Sum outer products of the transformed data vectors:

    Σ' = 1/(n−1) Σ_{i=1}^n (x'_i, y'_i)⊤ (x'_i, y'_i) = ( 1   ρ )
                                                        ( ρ   1 ).

  Subtraction of the mean vector is not necessary (because it is μ' = (0, 0)⊤).
  Diagonal elements are always 1 (because of unit variance in each dimension).
• Normalizing the data and then computing the covariances or computing the covariances and then normalizing them has the same effect.
  (A short sketch in code follows after this group of slides.)

Correlation Matrix: Interpretation

• Special Case: Two Dimensions
  Correlation matrix:

    Σ' = ( 1   ρ )      eigenvalues: σ_1², σ_2²;      correlation: ρ = (σ_1² − σ_2²) / (σ_1² + σ_2²).
         ( ρ   1 ),

  Side note: The (numerical) eccentricity ε of an ellipse (in geometry) satisfies

    ε² = |σ_1² − σ_2²| / max(σ_1², σ_2²).

• Eigenvalue decomposition:

    T = ( c   −s ) ( σ_1   0   )      with   s = sin(π/4) = 1/√2,   c = cos(π/4) = 1/√2,
        ( s    c ) ( 0     σ_2 ),            σ_1 = √(1 + ρ),   σ_2 = √(1 − ρ).

  [figure: the unit circle mapped by v' = T v to an ellipse whose major axis lies at angle π/4]
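A minimal sketch of z-score normalization and the resulting correlation matrix (numpy; np.corrcoef serves as a cross-check; the data are those of the later PCA example):

    import numpy as np

    x = np.array([5, 15, 21, 29, 31, 43, 49, 51, 61, 65], float)
    y = np.array([33, 35, 24, 21, 27, 16, 18, 10, 4, 12], float)

    # z-score normalization: mean 0, standard deviation 1 (with Bessel's correction)
    xs = (x - x.mean()) / x.std(ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)

    # covariance matrix of the normalized data = correlation matrix
    Z = np.column_stack([xs, ys])
    corr = (Z.T @ Z) / (len(x) - 1)
    print(corr)                      # off-diagonal element is rho = s_xy / (s_x s_y) = -0.92
    print(np.corrcoef(x, y))         # same matrix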

Correlation Matrix: Interpretation

• Via the ellipse that results from mapping a unit circle with an eigenvalue decomposition of a correlation matrix, correlation can be understood geometrically.
• In this view correlation is related to the (numerical) eccentricity of an ellipse (with a different normalization, though).
• Ellipse: Given the two focal points F_1 and F_2 and the length a of the semi-major axis, an ellipse is the set of points {P | |F_1 P| + |F_2 P| = 2a}.
  Linear eccentricity: e = √(a² − b²);  (numerical) eccentricity: ε = e/a = √(a² − b²) / a,
  where a = max(σ_1, σ_2) is the length of the semi-major axis and b the length of the semi-minor axis.
• Correlation:

    ρ = (σ_1² − σ_2²) / (σ_1² + σ_2²),

  so σ_2² → 0 ⇒ ρ → +1 and σ_1² → 0 ⇒ ρ → −1.
• Squared (numerical) eccentricity:

    ε² = |σ_1² − σ_2²| / max(σ_1², σ_2²).

• For two dimensions the eigenvectors of a correlation matrix are always

    v_1 = ( 1/√2, 1/√2 )   and   v_2 = ( −1/√2, 1/√2 )

  (or their opposites −v_1, −v_2, or exchanged).
  The reason is that the normalization transforms the data points in such a way that the ellipse, to which the unit circle is mapped by the "square root" of the covariance matrix of the normalized data, is always inscribed into the square [−1, 1] × [−1, 1]. Hence the ellipse's major axes are the square's diagonals.
• The situation is analogous in m-dimensional spaces: the eigenvectors are always m of the 2^(m−1) diagonals of the m-dimensional unit (hyper-)cube around the origin.

Attention: Correlation ⇏ Causation!

[pictures not available in online version]

• Always remember: An observed correlation may be purely coincidental!
• This is especially the case if the data come from processes that show relatively steady growth or decline (these are always correlated).
• In order to claim a causal connection between quantities, the actual mechanism needs to be discovered and confirmed!

• Data show a clear correlation between breast cancer death rates and fat intake.
• Is this evidence for a causal connection? Does high fat intake cause breast cancer? If at all, it is quite weak.
• Amount of fat in diet and amount of sugar are correlated. A plot of amount of sugar in diet and colon cancer death rates would look similar.
  [picture not available in online version]
• How rich a country is influences the amount of fat and sugar in the diet, but also a lot of other factors (e.g. life expectancy).

Attention: Correlation ⇏ Causation!

• Several further examples. [pictures not available in online version]

Attention: Correlation ⇏ Causation!

• Two further examples. [pictures not available in online version]

Regression Line

• Since the covariance/correlation measures linear dependence, it is not surprising that it can be used to define a regression line:

    (y − ȳ) = (s_xy / s_x²) (x − x̄)    or    y = (s_xy / s_x²) (x − x̄) + ȳ.

• The regression line can be seen as a conditional arithmetic mean: there is one arithmetic mean for the y-dimension for each x-value.
• This interpretation is supported by the fact that the regression line minimizes the sum of squared differences in y-direction.
  (Reminder: the arithmetic mean minimizes the sum of squared differences.)
• More information on regression and the method of least squares in the corresponding chapter (to be discussed later).

Principal Component Analysis

• Correlations between the attributes of a data set can be used to reduce the number of dimensions:
  ◦ Of two strongly correlated features only one needs to be considered.
  ◦ The other can be reconstructed approximately from the regression line.
  ◦ However, the feature selection can be difficult.
• Better approach: Principal Component Analysis (PCA)
  ◦ Find the direction in the data space that has the highest variance.
  ◦ Find the direction in the data space that has the highest variance among those perpendicular to the first.
  ◦ Find the direction in the data space that has the highest variance among those perpendicular to the first and second, and so on.
  ◦ Use the first directions to describe the data.

Principal Component Analysis: Physical Analog

• The rotation of a body around an axis through its center of gravity can be described by a so-called inertia tensor, which is a 3×3 matrix

    Θ = ( Θ_xx   Θ_xy   Θ_xz )
        ( Θ_xy   Θ_yy   Θ_yz )
        ( Θ_xz   Θ_yz   Θ_zz ).

• The diagonal elements of this tensor are called the moments of inertia. They describe the "resistance" of the body against being rotated.
• The off-diagonal elements are the so-called deviation moments. They describe forces vertical to the rotation axis.
• All bodies possess three perpendicular axes through their center of gravity, around which they can be rotated without forces perpendicular to the rotation axis. These axes are called principal axes of inertia. There are bodies that possess more than 3 such axes (example: a homogeneous sphere), but all bodies have at least three such axes.
  [figure: the principal axes of inertia of a box]
• The deviation moments cause "rattling" in the bearings of the rotation axis, which causes the bearings to wear out quickly.
• A car mechanic who balances a wheel carries out, in a way, a principal axes transformation. However, instead of changing the orientation of the axes, he/she adds small weights to minimize the deviation moments.
• A statistician who does a principal component analysis finds, in a way, the axes through a weight distribution with unit weights at each data point, around which it can be rotated most easily.

Principal Component Analysis: Formal Approach

• Normalize all attributes to arithmetic mean 0 and standard deviation 1:

    x' = (x − x̄) / s_x.

• Compute the correlation matrix Σ (i.e., the covariance matrix of the normalized data).
• Carry out a principal axes transformation of the correlation matrix, that is, find a matrix R such that R⊤ΣR is a diagonal matrix.
• Formal procedure:
  ◦ Find the eigenvalues and eigenvectors of the correlation matrix, i.e., find the values λ_i and vectors v_i such that Σ v_i = λ_i v_i.
  ◦ The eigenvectors indicate the desired directions.
  ◦ The eigenvalues are the variances in these directions.
• Select dimensions using the percentage of explained variance:
  ◦ The eigenvalues λ_i are the variances σ_i² in the principal dimensions.
  ◦ It can be shown that the sum of the eigenvalues of an m×m correlation matrix is m. Therefore it is plausible to define λ_i/m as the share the i-th principal axis has in the total variance.
  ◦ Sort the λ_i descendingly and find the smallest value k such that

      Σ_{i=1}^k λ_i / m ≥ α,

    where α is a user-defined parameter (e.g. α = 0.9).
  ◦ Select the corresponding k directions (given by the eigenvectors).
• Transform the data to the new data space by multiplying the data points with a matrix, the rows of which are the eigenvectors of the selected dimensions.
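A minimal sketch of this procedure (numpy): normalize, build the correlation matrix, take its eigenvalue decomposition, keep enough components to explain a fraction alpha of the total variance, and project. The threshold alpha = 0.9 is the example value from above; note that the projection below uses the unit-length eigenvectors, not the additional √2 scaling used in the worked example that follows.

    import numpy as np

    def pca_correlation(X, alpha=0.9):
        """PCA on the correlation matrix; keeps the smallest k with sum(lambda_i)/m >= alpha."""
        X = np.asarray(X, dtype=float)
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score normalization
        m = Z.shape[1]
        corr = (Z.T @ Z) / (len(Z) - 1)                    # correlation matrix
        lam, R = np.linalg.eigh(corr)                      # eigenvalues (ascending)
        order = np.argsort(lam)[::-1]                      # sort descendingly
        lam, R = lam[order], R[:, order]
        k = int(np.searchsorted(np.cumsum(lam) / m, alpha) + 1)
        W = R[:, :k]                                       # selected eigenvectors as columns
        return Z @ W, lam, k                               # projected data, variances, #components

    X = np.column_stack([[5, 15, 21, 29, 31, 43, 49, 51, 61, 65],
                         [33, 35, 24, 21, 27, 16, 18, 10, 4, 12]])
    scores, lam, k = pca_correlation(X)
    print(k, lam)                                          # k = 1, lam = [1.92, 0.08]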

Principal Component Analysis: Example

• Data set (n = 10):

    x:  5  15  21  29  31  43  49  51  61  65
    y: 33  35  24  21  27  16  18  10   4  12

  [figure: scatter plot of the data with a clear downward trend]
• Strongly correlated features ⇒ reduction to one dimension possible.
• The second dimension may be reconstructed from the regression line.

• Normalize to arithmetic mean 0 and standard deviation 1:

    x̄ = (1/10) Σ_{i=1}^{10} x_i = 370/10 = 37,
    ȳ = (1/10) Σ_{i=1}^{10} y_i = 200/10 = 20,
    s_x² = (1/9) ( Σ_{i=1}^{10} x_i² − 10 x̄² ) = (17290 − 13690)/9 = 400   ⇒ s_x = 20,
    s_y² = (1/9) ( Σ_{i=1}^{10} y_i² − 10 ȳ² ) = (4900 − 4000)/9  = 100   ⇒ s_y = 10.

    x': −1.6  −1.1  −0.8  −0.4  −0.3   0.3   0.6   0.7   1.2   1.4
    y':  1.3   1.5   0.4   0.1   0.7  −0.4  −0.2  −1.0  −1.6  −0.8

• Compute the correlation matrix (covariance matrix of the normalized data):

    Σ = (1/9) (  9      −8.28 )  =  (    1     −23/25 )
              ( −8.28    9    )     ( −23/25      1   ).

• Find the eigenvalues and eigenvectors, i.e., the values λ_i and vectors v_i, i = 1, 2, such that

    Σ v_i = λ_i v_i    or    (Σ − λ_i 1) v_i = 0,

  where 1 is the unit matrix.
• Here: Find the eigenvalues as the roots of the characteristic polynomial

    c(λ) = |Σ − λ 1| = (1 − λ)² − 529/625.

  For more than 3 dimensions, this method is numerically unstable and should be replaced by some other method (Jacobi transformation, Householder transformation to tridiagonal form followed by the QR algorithm, etc.).

Principal Component Analysis: Example

• The roots of the characteristic polynomial c(λ) = (1 − λ)² − 529/625 are

    λ_{1/2} = 1 ± √(529/625) = 1 ± 23/25,   i.e.   λ_1 = 48/25   and   λ_2 = 2/25.

• The corresponding eigenvectors are determined by solving, for i = 1, 2, the (underdetermined) linear equation system

    (Σ − λ_i 1) v_i = 0.

• The resulting eigenvectors (normalized to length 1) are

    v_1 = ( 1/√2, −1/√2 )   and   v_2 = ( 1/√2, 1/√2 ).

  (Note that for two dimensions always these two vectors result; reminder: directions of the eigenvectors of a correlation matrix.)
• Therefore the transformation matrix for the principal axes transformation is

    R = (  1/√2   1/√2 )      for which it is   R⊤ Σ R = ( λ_1   0   )
        ( −1/√2   1/√2 ),                                ( 0     λ_2 ).

• However, instead of R⊤ we use √2 R⊤ to transform the data:

    (x'', y'')⊤ = √2 · R⊤ · (x', y')⊤.

  Resulting data set (the additional factor √2 leads to nicer values):

    x'': −2.9  −2.6  −1.2  −0.5  −1.0   0.7   0.8   1.7   2.8   2.2
    y'': −0.3   0.4  −0.4  −0.3   0.4  −0.1   0.4  −0.3  −0.4   0.6

• y'' is discarded (s_y''² = 2λ_2 = 4/25) and only x'' is kept (s_x''² = 2λ_1 = 96/25).

The Iris Data

[pictures not available in online version]

• Collected by Edgar Anderson on the Gaspé Peninsula (Canada).
• First analyzed by Ronald Aylmer Fisher (famous statistician).
• 150 cases in total, 50 cases per Iris flower type (Iris setosa, Iris versicolor, Iris virginica).
• Measurements of sepal length and width and petal length and width (in cm).
• Most famous data set in pattern recognition and data analysis.
• Scatter plots of the iris data set for sepal length vs. sepal width (left) and for petal length vs. petal width (right); all quantities are measured in centimeters (cm).
  [figure: two scatter plots of the iris data, colored by flower type]

Principal Component Analysis: Iris Data

• Left: the first (solid line) and the second principal component (dashed line), drawn over normalized petal length and normalized petal width.
• Right: the iris data projected to the space that is spanned by the first and the second principal component (resulting from a PCA involving all four attributes).
  [figure: principal components and projected iris data]

Inductive Statistics

Inductive Statistics: Main Tasks

• Parameter Estimation
  Given an assumption about the type of distribution of the underlying random variable, the parameter(s) of the distribution function is (are) estimated.
• Hypothesis Testing
  A hypothesis about the data generating process is tested by means of the data.
  ◦ Parameter Test: test whether a parameter can have certain values.
  ◦ Goodness-of-Fit Test: test whether a distribution assumption fits the data.
  ◦ Dependence Test: test whether two attributes are dependent.
• Model Selection
  Among different models that can be used to explain the data the best fitting one is selected, taking the complexity of the model into account.

Inductive Statistics: Random Samples

• In inductive statistics probability theory is applied to make inferences about the process that generated the data. This presupposes that the sample is the result of a random experiment, a so-called random sample.
• The random variable yielding the sample value x_i is denoted X_i; x_i is called an instantiation of the random variable X_i.
• A random sample x = (x_1, ..., x_n) is an instantiation of the random vector X = (X_1, ..., X_n).
• A random sample is called independent if the random variables X_1, ..., X_n are (stochastically) independent, i.e. if

    ∀ c_1, ..., c_n ∈ ℝ:   P( ∧_{i=1}^n X_i ≤ c_i ) = Π_{i=1}^n P(X_i ≤ c_i).

• An independent random sample is called simple if the random variables X_1, ..., X_n have the same distribution function.

Parameter Estimation

Given:
• a data set and
• a family of parameterized distribution functions of the same type, e.g.
  ◦ the family of binomial distributions b_X(x; p, n) with the parameters p, 0 ≤ p ≤ 1, and n ∈ ℕ, where n is the sample size,
  ◦ the family of normal distributions N_X(x; μ, σ²) with the parameters μ (expected value) and σ² (variance).

Assumption:
• The process that generated the data can be described well by an element of the given family of distribution functions.

Desired:
• The element of the given family of distribution functions (determined by its parameters) that is the best model for the data.

Parameter Estimation

• Methods that yield an estimate for a parameter are called estimators.
• Estimators are statistics, i.e. functions of the values in a sample. As a consequence they are functions of (instantiations of) random variables and thus (instantiations of) random variables themselves. Therefore we can use all of probability theory to analyze estimators.
• There are two types of parameter estimation:
  ◦ Point Estimators
    Point estimators determine the best value of a parameter w.r.t. the data and certain quality criteria.
  ◦ Interval Estimators
    Interval estimators yield a region, a so-called confidence interval, in which the true value of the parameter lies with high certainty.

Inductive Statistics: Point Estimation

Not all statistics, that is, not all functions of the sample values, are reasonable and useful estimators. Desirable properties are:

• Consistency
  With growing data volume the estimated value should get closer and closer to the true value, at least with higher and higher probability.
  Formally: If T is an estimator for the parameter θ, it should be

    ∀ ε > 0:   lim_{n→∞} P(|T − θ| < ε) = 1,

  where n is the sample size.
• Unbiasedness
  An estimator should not tend to over- or underestimate the parameter. Rather it should yield, on average, the correct value. Formally this means

    E(T) = θ.

Point Estimation

• Efficiency
  The estimation should be as precise as possible, that is, the deviation from the true value should be as small as possible.
  Formally: If T and U are two estimators for the same parameter θ, then T is called more efficient than U if

    D²(T) < D²(U).

• Sufficiency
  An estimator should exploit all information about the parameter contained in the data. More precisely: two samples that yield the same estimate should have the same probability (otherwise there is unused information).
  Formally: an estimator T for a parameter θ is called sufficient iff for all samples x = (x_1, ..., x_n) with T(x) = t the expression

    f_{X_1}(x_1; θ) ··· f_{X_n}(x_n; θ) / f_T(t; θ)

  is independent of θ.

Point Estimation: Example

Given: a family of uniform distributions on the interval [0, θ], i.e.

    f_X(x; θ) = 1/θ   if 0 ≤ x ≤ θ,   and 0 otherwise.

Desired: an estimate for the unknown parameter θ.

• We will now consider two estimators for the parameter θ and compare their properties:
  ◦ T = max{X_1, ..., X_n},
  ◦ U = (n+1)/n · max{X_1, ..., X_n}.
• General approach:
  ◦ Find the probability density function of the estimator.
  ◦ Check the desirable properties by exploiting this density function.

To analyze the estimator T = max{X_1, ..., X_n}, we compute its density function:

    f_T(t; θ) = d/dt F_T(t; θ) = d/dt P(T ≤ t)
              = d/dt P( max{X_1, ..., X_n} ≤ t )
              = d/dt P( ∧_{i=1}^n X_i ≤ t )
              = d/dt Π_{i=1}^n P(X_i ≤ t)
              = d/dt ( F_X(t; θ) )^n = n · ( F_X(t; θ) )^(n−1) · f_X(t; θ),

where

    F_X(x; θ) = ∫_{−∞}^x f_X(x; θ) dx = 0 if x ≤ 0,   x/θ if 0 ≤ x ≤ θ,   1 if x ≥ θ.

Therefore it is

    f_T(t; θ) = n · t^(n−1) / θ^n   for 0 ≤ t ≤ θ,   and 0 otherwise.

27. Point Estimation: Example

• The estimator $U = \frac{n+1}{n} \max\{X_1, \ldots, X_n\}$ has the density function
  $f_U(u; \theta) = \frac{n^{n+1}\, u^{n-1}}{(n+1)^n\, \theta^n}$ for $0 \le u \le \frac{n+1}{n}\theta$, and $0$ otherwise.
• The estimator $U$ is consistent (without formal proof).
• It is unbiased:
  $E(U) = \int_{-\infty}^{\infty} u \cdot f_U(u; \theta)\, du = \int_0^{\frac{n+1}{n}\theta} u \cdot \frac{n^{n+1}\, u^{n-1}}{(n+1)^n\, \theta^n}\, du = \frac{n^{n+1}}{(n+1)^n\, \theta^n} \left[\frac{u^{n+1}}{n+1}\right]_0^{\frac{n+1}{n}\theta} = \frac{n^{n+1}}{(n+1)^{n+1}\, \theta^n} \cdot \left(\frac{n+1}{n}\right)^{n+1} \theta^{n+1} = \theta$

Christian Borgelt Data Mining / Intelligent Data Analysis 121

Point Estimation: Example

Given: a family of normal distributions $N_X(x; \mu, \sigma^2)$ with
  $f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Desired: estimates for the unknown parameters $\mu$ and $\sigma^2$.
• The median and the arithmetic mean of the sample are both consistent and unbiased estimators for the parameter $\mu$. The median is less efficient than the arithmetic mean.
• The function $V^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$ is a consistent, but biased estimator for the parameter $\sigma^2$ (it tends to underestimate the variance).
  The function $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$, however, is a consistent and unbiased estimator for $\sigma^2$ (this explains the definition of the empirical variance).

Christian Borgelt Data Mining / Intelligent Data Analysis 122

Point Estimation: Example

Given: a family of polynomial distributions
  $f_{X_1, \ldots, X_k}(x_1, \ldots, x_k; \theta_1, \ldots, \theta_k, n) = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k \theta_i^{x_i}$,
($n$ is the sample size, the $x_i$ are the frequencies of the different values $a_i$, $i = 1, \ldots, k$, and the $\theta_i$ are the probabilities with which the values $a_i$ occur.)
Desired: estimates for the unknown parameters $\theta_1, \ldots, \theta_k$.
• The relative frequencies $R_i = \frac{X_i}{n}$ of the different values $a_i$, $i = 1, \ldots, k$, are
  ◦ consistent,
  ◦ unbiased,
  ◦ most efficient, and
  ◦ sufficient
  estimators for the $\theta_i$.

Christian Borgelt Data Mining / Intelligent Data Analysis 123

Inductive Statistics: Finding Point Estimators

Christian Borgelt Data Mining / Intelligent Data Analysis 124

28. How Can We Find Estimators?

• Up to now we analyzed given estimators, now we consider the question how to find them.
• There are three main approaches to find estimators:
  ◦ Method of Moments
    Derive an estimator for a parameter from the moments of a distribution and its generating function. (We do not consider this method here.)
  ◦ Maximum Likelihood Estimation
    Choose the (set of) parameter value(s) that makes the sample most likely.
  ◦ Maximum A-posteriori Estimation
    Choose a prior distribution on the range of parameter values, apply Bayes' rule to compute the posterior probability from the sample, and choose the (set of) parameter value(s) that maximizes this probability.

Christian Borgelt Data Mining / Intelligent Data Analysis 125

Maximum Likelihood Estimation

• General idea: Choose the (set of) parameter value(s) that makes the sample most likely.
• If the parameter value(s) were known, it would be possible to compute the probability of the sample. With unknown parameter value(s), however, it is still possible to state this probability as a function of the parameter(s).
• Formally this can be described as choosing the value $\theta$ that maximizes
  $L(D; \theta) = f(D \mid \theta)$,
  where $D$ are the sample data and $L$ is called the Likelihood Function.
• Technically the estimator is determined by
  ◦ setting up the likelihood function,
  ◦ forming its partial derivative(s) w.r.t. the parameter(s), and
  ◦ setting these derivatives equal to zero (necessary condition for a maximum).

Christian Borgelt Data Mining / Intelligent Data Analysis 126

Brief Excursion: Function Optimization

Task: Find values $\vec{x} = (x_1, \ldots, x_m)$ such that $f(\vec{x}) = f(x_1, \ldots, x_m)$ is optimal.
Often feasible approach:
• A necessary condition for a (local) optimum (maximum or minimum) is that the partial derivatives w.r.t. the parameters vanish (Pierre Fermat).
• Therefore: (Try to) solve the equation system that results from setting all partial derivatives w.r.t. the parameters equal to zero.
Example task: Minimize $f(x, y) = x^2 + y^2 + xy - 4x - 5y$.
Solution procedure:
1. Take the partial derivatives of the objective function and set them to zero:
   $\frac{\partial f}{\partial x} = 2x + y - 4 = 0$,  $\frac{\partial f}{\partial y} = 2y + x - 5 = 0$.
2. Solve the resulting (here: linear) equation system: $x = 1$, $y = 2$.

Christian Borgelt Data Mining / Intelligent Data Analysis 127

Maximum Likelihood Estimation: Example

Given: a family of normal distributions $N_X(x; \mu, \sigma^2)$ with
  $f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Desired: estimators for the unknown parameters $\mu$ and $\sigma^2$.
The Likelihood Function, which describes the probability of the data, is
  $L(x_1, \ldots, x_n; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$.
To simplify the technical task of forming the partial derivatives, we consider the natural logarithm of the likelihood function, i.e.
  $\ln L(x_1, \ldots, x_n; \mu, \sigma^2) = -\frac{n}{2} \ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$.

Christian Borgelt Data Mining / Intelligent Data Analysis 128
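As a quick numerical check of the excursion's example (a sketch assuming NumPy; not part of the slides), the linear system obtained by setting the partial derivatives to zero can be solved directly:

import numpy as np

# Setting the partial derivatives of f(x, y) = x^2 + y^2 + x*y - 4*x - 5*y to zero
# yields the linear system  2x + y = 4  and  x + 2y = 5.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([4.0, 5.0])
x, y = np.linalg.solve(A, b)
print(x, y)                           # 1.0 2.0  -> the minimum lies at (1, 2)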

29. Maximum Likelihood Estimation: Example

• Estimator for the expected value $\mu$:
  $\frac{\partial}{\partial\mu} \ln L(x_1, \ldots, x_n; \mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \overset{!}{=} 0$
  $\Rightarrow\; \sum_{i=1}^n (x_i - \mu) = \sum_{i=1}^n x_i - n\mu \overset{!}{=} 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i$
• Estimator for the variance $\sigma^2$:
  $\frac{\partial}{\partial\sigma^2} \ln L(x_1, \ldots, x_n; \mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i - \mu)^2 \overset{!}{=} 0$
  $\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - \left(\frac{1}{n} \sum_{i=1}^n x_i\right)^2$  (biased!)

Christian Borgelt Data Mining / Intelligent Data Analysis 129

Maximum A-posteriori Estimation: Motivation

Consider the following three situations:
• A drunkard claims to be able to predict the side on which a thrown coin will land (heads or tails). On ten trials he always states the correct side beforehand.
• A tea lover claims that she is able to taste whether the tea or the milk was poured into the cup first. On ten trials she always identifies the correct order.
• An expert of classical music claims to be able to recognize from a single sheet of music whether the composer was Mozart or somebody else. On ten trials he is indeed correct every time.
Maximum likelihood estimation treats all situations alike, because formally the samples are the same. However, this is implausible:
• We do not believe the drunkard at all, despite the sample data.
• We highly doubt the tea drinker, but tend to consider the data as evidence.
• We tend to believe the music expert easily.

Christian Borgelt Data Mining / Intelligent Data Analysis 130

Maximum A-posteriori Estimation

• Background knowledge about the plausible values can be incorporated by
  ◦ using a prior distribution on the domain of the parameter and
  ◦ adapting this distribution with Bayes' rule and the data.
• Formally maximum a-posteriori estimation is defined as follows: find the parameter value $\theta$ that maximizes
  $f(\theta \mid D) = \frac{f(D \mid \theta)\, f(\theta)}{f(D)} = \frac{f(D \mid \theta)\, f(\theta)}{\int_{-\infty}^{\infty} f(D \mid \theta)\, f(\theta)\, d\theta}$
• As a comparison: maximum likelihood estimation maximizes $f(D \mid \theta)$.
• Note that $f(D)$ need not be computed: It is the same for all parameter values and since we are only interested in the value $\theta$ that maximizes $f(\theta \mid D)$ and not the value of $f(\theta \mid D)$, we can treat it as a normalization constant.

Christian Borgelt Data Mining / Intelligent Data Analysis 131

Maximum A-posteriori Estimation: Example

Given: a family of binomial distributions
  $f_X(x; \theta, n) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}$.
Desired: an estimator for the unknown parameter $\theta$.
a) Uniform prior: $f(\theta) = 1$, $0 \le \theta \le 1$.
  $f(\theta \mid D) = \gamma \binom{n}{x} \theta^x (1 - \theta)^{n-x} \cdot 1 \quad\Rightarrow\quad \hat{\theta} = \frac{x}{n}$
b) Tendency towards $\frac{1}{2}$: $f(\theta) = 6\theta(1 - \theta)$, $0 \le \theta \le 1$.
  $f(\theta \mid D) = \gamma \binom{n}{x} \theta^x (1 - \theta)^{n-x} \cdot \theta(1 - \theta) = \gamma \binom{n}{x} \theta^{x+1} (1 - \theta)^{n-x+1} \quad\Rightarrow\quad \hat{\theta} = \frac{x+1}{n+2}$

Christian Borgelt Data Mining / Intelligent Data Analysis 132
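The two maximum likelihood estimators derived above are easy to evaluate on a sample. The following sketch (assuming NumPy; the true parameters and the sample size are arbitrary choices) also contrasts the biased ML estimate of the variance with the unbiased empirical variance:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)           # sample from N(mu = 5, sigma^2 = 4)

mu_hat     = x.mean()                                  # ML estimate of mu (arithmetic mean)
sigma2_ml  = ((x - mu_hat) ** 2).mean()                # ML estimate of sigma^2 (divides by n, biased)
sigma2_emp = ((x - mu_hat) ** 2).sum() / (len(x) - 1)  # empirical variance (divides by n-1, unbiased)

print(mu_hat, sigma2_ml, sigma2_emp)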

30. Excursion: Dirichlet's Integral

• For computing the normalization factors of the probability density functions that occur with polynomial distributions, Dirichlet's Integral is helpful:
  $\int_{\theta_1} \cdots \int_{\theta_k} \prod_{i=1}^k \theta_i^{x_i}\, d\theta_1 \ldots d\theta_k = \frac{\prod_{i=1}^k \Gamma(x_i + 1)}{\Gamma(n + k)}$, where $n = \sum_{i=1}^k x_i$,
  and the $\Gamma$-function is the so-called generalized factorial:
  $\Gamma(x) = \int_0^{\infty} e^{-t}\, t^{x-1}\, dt$, $x > 0$,
  which satisfies
  $\Gamma(x + 1) = x \cdot \Gamma(x)$, $\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$, $\Gamma(1) = 1$.
• Example: the normalization factor $\alpha$ for the binomial distribution prior $f(\theta) = \alpha\, \theta^2 (1 - \theta)^3$ is obtained from
  $\frac{1}{\alpha} = \int_{\theta} \theta^2 (1 - \theta)^3\, d\theta = \frac{\Gamma(2 + 1)\, \Gamma(3 + 1)}{\Gamma(5 + 2)} = \frac{2!\, 3!}{6!} = \frac{12}{720}$, hence $\alpha = 60$.

Christian Borgelt Data Mining / Intelligent Data Analysis 133

Maximum A-posteriori Estimation: Example

[Figure: for the drunkard, the tea lover and the music expert, the priors $f(\theta)$ (a Dirac pulse at $\theta = \frac{1}{2}$, a prior $\alpha\, \theta^{10}(1-\theta)^{10}$, and a prior proportional to $\theta(1-\theta)$, respectively), the common likelihood $f(D \mid \theta)$ for ten correct answers in ten trials, and the resulting posteriors $f(\theta \mid D)$ with maximum a-posteriori estimates $\hat{\theta} = \frac{1}{2}$, $\hat{\theta} = \frac{2}{3}$ and $\hat{\theta} = \frac{11}{12}$.]

Christian Borgelt Data Mining / Intelligent Data Analysis 134

Inductive Statistics: Interval Estimation

Christian Borgelt Data Mining / Intelligent Data Analysis 135

Interval Estimation

• In general the estimated value of a parameter will differ from the true value.
• It is desirable to be able to make an assertion about the possible deviations.
• The simplest possibility is to state not only a point estimate, but also the standard deviation of the estimator:
  $t \pm D(T) = t \pm \sqrt{D^2(T)}$.
• A better possibility is to find intervals that contain the true value with high probability. Formally they can be defined as follows:
  Let $A = g_A(X_1, \ldots, X_n)$ and $B = g_B(X_1, \ldots, X_n)$ be two statistics with
  $P(\theta \le A) = \frac{\alpha}{2}$,  $P(\theta \ge B) = \frac{\alpha}{2}$,  $P(A < \theta < B) = 1 - \alpha$.
  Then the random interval $[A, B]$ (or an instantiation $[a, b]$ of this interval) is called $(1 - \alpha) \cdot 100\%$ confidence interval for $\theta$. The value $1 - \alpha$ is called confidence level.

Christian Borgelt Data Mining / Intelligent Data Analysis 136
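The normalization factor from the example can be checked with the $\Gamma$-function from Python's standard library (a small verification sketch, not part of the slides):

from math import gamma

# 1/alpha = Gamma(2+1) * Gamma(3+1) / Gamma(5+2) = 2! * 3! / 6! = 12/720
alpha = gamma(5 + 2) / (gamma(2 + 1) * gamma(3 + 1))
print(alpha)    # 60.0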

31. Interval Estimation

• This definition of a confidence interval is not specific enough: $A$ and $B$ are not uniquely determined.
• Common solution: Start from a point estimator $T$ for the unknown parameter $\theta$ and define $A$ and $B$ as functions of $T$: $A = h_A(T)$ and $B = h_B(T)$.
• Instead of $A \le \theta \le B$ consider the corresponding event w.r.t. the estimator $T$, that is, $A^* \le T \le B^*$.
• Determine $A = h_A(T)$ and $B = h_B(T)$ from the inverse functions $A^* = h_A^{-1}(\theta)$ and $B^* = h_B^{-1}(\theta)$.
Procedure:
  $P(A^* < T < B^*) = 1 - \alpha \;\Rightarrow\; P(h_A^{-1}(\theta) < T < h_B^{-1}(\theta)) = 1 - \alpha$
  $\Rightarrow\; P(h_A(T) < \theta < h_B(T)) = 1 - \alpha \;\Rightarrow\; P(A < \theta < B) = 1 - \alpha$.

Christian Borgelt Data Mining / Intelligent Data Analysis 137

Interval Estimation: Example

Given: a family of uniform distributions on the interval $[0, \theta]$, i.e.
  $f_X(x; \theta) = \begin{cases} \frac{1}{\theta}, & \text{if } 0 \le x \le \theta, \\ 0, & \text{otherwise.} \end{cases}$
Desired: a confidence interval for the unknown parameter $\theta$.
• Start from the unbiased point estimator $U = \frac{n+1}{n} \max\{X_1, \ldots, X_n\}$:
  $P(U \le B^*) = \int_0^{B^*} f_U(u; \theta)\, du = \frac{\alpha}{2}$,
  $P(U \ge A^*) = \int_{A^*}^{\frac{n+1}{n}\theta} f_U(u; \theta)\, du = \frac{\alpha}{2}$.
• From the study of point estimators we know
  $f_U(u; \theta) = \frac{n^{n+1}\, u^{n-1}}{(n+1)^n\, \theta^n}$.

Christian Borgelt Data Mining / Intelligent Data Analysis 138

Interval Estimation: Example

• Solving the integrals gives us
  $B^* = \sqrt[n]{\frac{\alpha}{2}} \cdot \frac{n+1}{n}\, \theta$ and $A^* = \sqrt[n]{1 - \frac{\alpha}{2}} \cdot \frac{n+1}{n}\, \theta$,
  that is,
  $P\left(\sqrt[n]{\frac{\alpha}{2}} \cdot \frac{n+1}{n}\, \theta \;<\; U \;<\; \sqrt[n]{1 - \frac{\alpha}{2}} \cdot \frac{n+1}{n}\, \theta\right) = 1 - \alpha$.
• Computing the inverse functions leads to
  $P\left(\frac{U}{\sqrt[n]{1 - \frac{\alpha}{2}}\, \frac{n+1}{n}} \;<\; \theta \;<\; \frac{U}{\sqrt[n]{\frac{\alpha}{2}}\, \frac{n+1}{n}}\right) = 1 - \alpha$,
  that is,
  $A = \frac{U}{\sqrt[n]{1 - \frac{\alpha}{2}}\, \frac{n+1}{n}}$ and $B = \frac{U}{\sqrt[n]{\frac{\alpha}{2}}\, \frac{n+1}{n}}$.

Christian Borgelt Data Mining / Intelligent Data Analysis 139

Inductive Statistics: Hypothesis Testing

Christian Borgelt Data Mining / Intelligent Data Analysis 140
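A small sketch of the resulting confidence interval (assuming NumPy; the true parameter, sample size and confidence level are arbitrary choices), following the formulas for $A$ and $B$ derived above:

import numpy as np

rng = np.random.default_rng(2)
theta, n, alpha = 2.0, 20, 0.05                          # true parameter (unknown in practice)
x = rng.uniform(0.0, theta, size=n)

U = (n + 1) / n * x.max()                                # unbiased point estimate
A = U / ((1 - alpha / 2) ** (1 / n) * (n + 1) / n)       # lower bound
B = U / ((alpha / 2) ** (1 / n) * (n + 1) / n)           # upper bound
print(U, A, B)                                           # 95% confidence interval [A, B] for theta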

32. Hypothesis Testing

• A hypothesis test is a statistical procedure with which a decision is made between two contrary hypotheses about the process that generated the data.
• The two hypotheses may refer to
  ◦ the value of a parameter (Parameter Test),
  ◦ a distribution assumption (Goodness-of-Fit Test),
  ◦ the dependence of two attributes (Dependence Test).
• One of the two hypotheses is preferred, that is, in case of doubt the decision is made in its favor. (One says that it gets the "benefit of the doubt".)
• The preferred hypothesis is called the Null Hypothesis $H_0$, the other hypothesis is called the Alternative Hypothesis $H_a$.
• Intuitively: The null hypothesis $H_0$ is put on trial. It is accused of being false. Only if the evidence is strong enough, it is convicted (that is, rejected). If there is (sufficient) doubt, however, it is acquitted (that is, accepted).

Christian Borgelt Data Mining / Intelligent Data Analysis 141

Hypothesis Testing

• The test decision is based on a test statistic, that is, a function of the sample values.
• The null hypothesis is rejected if the value of the test statistic lies inside the so-called critical region $C$.
• Developing a hypothesis test consists in finding the critical region for a given test statistic and significance level (see below).
• The test decision may be wrong. There are two possible types of errors:
  Type 1: The null hypothesis $H_0$ is rejected, even though it is correct.
  Type 2: The null hypothesis $H_0$ is accepted, even though it is false.
• Type 1 errors are considered to be more severe, since the null hypothesis gets the "benefit of the doubt".
• Hence one tries to limit the probability of a type 1 error to a certain maximum $\alpha$. This maximum value $\alpha$ is called significance level.

Christian Borgelt Data Mining / Intelligent Data Analysis 142

Parameter Test

• In a parameter test the contrary hypotheses refer to the value of a parameter, for example (one-sided test):
  $H_0: \theta \ge \theta_0$,  $H_a: \theta < \theta_0$.
• For such a test usually a point estimator $T$ is chosen as the test statistic.
• The null hypothesis $H_0$ is rejected if the value $t$ of the point estimator does not exceed a certain value $c$, the so-called critical value (that is, $C = (-\infty, c]$).
• Formally the critical value $c$ is determined as follows: We consider
  $\beta(\theta) = P_\theta(H_0 \text{ is rejected}) = P_\theta(T \in C)$,
  the so-called power $\beta$ of the test.
• The power must not exceed the significance level $\alpha$ for values $\theta$ satisfying $H_0$:
  $\max_{\theta:\, \theta \text{ satisfies } H_0} \beta(\theta) \le \alpha$.  (here: $\beta(\theta_0) \le \alpha$)

Christian Borgelt Data Mining / Intelligent Data Analysis 143

Parameter Test: Intuition

• The probability of a type 1 error is the area under the estimator's probability density function $f(T \mid \theta_0)$ to the left of the critical value $c$.
  (Note: This example illustrates $H_0: \theta \ge \theta_0$ and $H_a: \theta < \theta_0$.)
  [Figure: density $f(T \mid \theta_0)$ over $T$ with the critical region $C$ to the left of the critical value $c$; the shaded area is the probability of a type 1 error, $\beta(\theta)$.]
• Obviously the probability of a type 1 error depends on the location of the critical value $c$: higher values mean a higher error probability.
• Idea: Choose the location of the critical value so that the maximal probability of a type 1 error equals $\alpha$, the chosen significance level.

Christian Borgelt Data Mining / Intelligent Data Analysis 144

33. Parameter Test: Intuition

• What is so special about $\theta_0$ that we use $f(T \mid \theta_0)$?
  [Figure: densities $f(T \mid \theta)$ for several values $\theta$ satisfying $H_0$, with the critical region $C$ to the left of the critical value $c$; the area under each density over $C$ is the corresponding probability of a type 1 error $\beta(\theta)$.]
• In principle, all $\theta$ satisfying $H_0$ have to be considered, that is, all density functions $f(T \mid \theta)$ with $\theta \ge \theta_0$.
• Among these values $\theta$, the one with the highest probability of a type 1 error (that is, the one with the highest power $\beta(\theta)$) determines the critical value.
  Intuitively: we consider the worst possible case.

Christian Borgelt Data Mining / Intelligent Data Analysis 145

Parameter Test: Example

• Consider a one-sided test of the expected value $\mu$ of a normal distribution $N(\mu, \sigma^2)$ with known variance $\sigma^2$, that is, consider the hypotheses
  $H_0: \mu \ge \mu_0$,  $H_a: \mu < \mu_0$.
• As a test statistic we use the standard point estimator for the expected value
  $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$.
  This point estimator has the probability density
  $f_{\bar{X}}(x) = N\!\left(x; \mu, \frac{\sigma^2}{n}\right)$.
• Therefore it is (with the $N(0,1)$-distributed random variable $Z$)
  $\alpha = \beta(\mu_0) = P_{\mu_0}(\bar{X} \le c) = P\!\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \le \frac{c - \mu_0}{\sigma/\sqrt{n}}\right) = P\!\left(Z \le \frac{c - \mu_0}{\sigma/\sqrt{n}}\right)$.

Christian Borgelt Data Mining / Intelligent Data Analysis 146

Parameter Test: Example

• We have as a result that
  $\alpha = \Phi\!\left(\frac{c - \mu_0}{\sigma/\sqrt{n}}\right)$,
  where $\Phi$ is the distribution function of the standard normal distribution.
• The distribution function $\Phi$ is tabulated, because it cannot be represented in closed form. From such a table we retrieve the value $z_\alpha$ satisfying $\alpha = \Phi(z_\alpha)$.
• Then the critical value is
  $c = \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$.
  (Note that the value of $z_\alpha$ is negative due to the usually small value of $\alpha$. Typical values are $\alpha = 0.1$, $\alpha = 0.05$ or $\alpha = 0.01$.)
• $H_0$ is rejected if the value $\bar{x}$ of the point estimator $\bar{X}$ does not exceed $c$, otherwise it is accepted.

Christian Borgelt Data Mining / Intelligent Data Analysis 147

Parameter Test: Example

• Let $\sigma = 5.4$, $n = 25$ and $\bar{x} = 128$. We choose $\mu_0 = 130$ and $\alpha = 0.05$.
• From a standard normal distribution table we retrieve $z_{0.05} \approx -1.645$ and get
  $c_{0.05} \approx 130 - 1.645 \cdot \frac{5.4}{\sqrt{25}} \approx 128.22$.
  Since $\bar{x} = 128 < 128.22 = c$, we reject the null hypothesis $H_0$.
• If, however, we had chosen $\alpha = 0.01$, it would have been (with $z_{0.01} \approx -2.326$):
  $c_{0.01} \approx 130 - 2.326 \cdot \frac{5.4}{\sqrt{25}} \approx 127.49$.
  Since $\bar{x} = 128 > 127.49 = c$, we would have accepted the null hypothesis $H_0$.
• Instead of fixing a significance level $\alpha$ one may state the so-called p-value
  $p = \Phi\!\left(\frac{128 - 130}{5.4/\sqrt{25}}\right) \approx 0.032$.
  For $\alpha \ge p = 0.032$ the null hypothesis is rejected, for $\alpha < p = 0.032$ it is accepted.

Christian Borgelt Data Mining / Intelligent Data Analysis 148
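The computation in the example can be reproduced with a few lines of Python (a sketch using only the standard library; the value $z_{0.05} \approx -1.645$ is taken from a table, as on the slide):

from math import erf, sqrt

def phi(z):
    # distribution function of the standard normal distribution
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

sigma, n, xbar, mu0 = 5.4, 25, 128.0, 130.0
z_alpha = -1.645                                   # z-value for alpha = 0.05 (from a table)

c = mu0 + z_alpha * sigma / sqrt(n)                # critical value
p = phi((xbar - mu0) / (sigma / sqrt(n)))          # p-value
print(round(c, 2), round(p, 3))                    # 128.22 (reject H0, since 128 < c) and 0.032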

34. Parameter Test: p-value

• Let $t$ be the value of the test statistic $T$ that has been computed from a given data set.
  (Note: This example illustrates $H_0: \theta \ge \theta_0$ and $H_a: \theta < \theta_0$.)
  [Figure: density $f(T \mid \theta_0)$ over $T$; the area to the left of the observed value $t$ is the p-value of $t$.]
• The p-value is the probability that a value of $t$ or less can be observed for the chosen test statistic $T$.
• The p-value is a lower limit for the significance level $\alpha$ that may have been chosen if we wanted to reject the null hypothesis $H_0$.

Christian Borgelt Data Mining / Intelligent Data Analysis 149

Parameter Test: p-value

Attention: p-values are often misused or misinterpreted!
• A low p-value does not mean that the result is very reliable! All that matters for the test is whether the computed p-value is below the chosen significance level or not. (A low p-value could just be a chance event, an accident!)
• The significance level may not be chosen after computing the p-value, since we tend to choose lower significance levels if we know that they are met. Doing so would undermine the reliability of the procedure!
• Stating p-values is only a convenient way of avoiding a fixed significance level (since significance levels are a matter of choice and thus user-dependent).
  However: A significance level must still be chosen before a reported p-value is looked at.

Christian Borgelt Data Mining / Intelligent Data Analysis 150

Relevance of the Type-2 Error

• Reminder: There are two possible types of errors:
  Type 1: The null hypothesis $H_0$ is rejected, even though it is correct.
  Type 2: The null hypothesis $H_0$ is accepted, even though it is false.
• Type-1 errors are considered to be more severe, since the null hypothesis gets the "benefit of the doubt".
• However, type-2 errors should not be neglected completely:
  ◦ It is always possible to achieve a vanishing probability of a type-1 error: Simply accept the null hypothesis in all instances, regardless of the data.
  ◦ Unfortunately such an approach maximizes the type-2 error.
• Generally, type-1 and type-2 errors are complementary quantities: The lower we require the type-1 error to be (the lower the significance level), the higher will be the probability of a type-2 error.

Christian Borgelt Data Mining / Intelligent Data Analysis 151

Relationship between Type-1 and Type-2 Error

• Suppose there are only two possible parameter values $\theta_0$ and $\theta_1$ with $\theta_1 < \theta_0$. (That is, we have $H_0: \theta = \theta_0$ and $H_a: \theta = \theta_1$.)
  [Figure: the densities $f(T \mid \theta_1)$ and $f(T \mid \theta_0)$ over $T$ with the critical value $c$ between $\theta_1$ and $\theta_0$; the area of $f(T \mid \theta_0)$ to the left of $c$ is the probability of a type-1 error (red), the area of $f(T \mid \theta_1)$ to the right of $c$ is the probability of a type-2 error (blue).]
• Lowering the significance level $\alpha$ moves the critical value $c$ to the left: lower type-1 error (red), but higher type-2 error (blue).
• Increasing the significance level $\alpha$ moves the critical value $c$ to the right: higher type-1 error (red), but lower type-2 error (blue).

Christian Borgelt Data Mining / Intelligent Data Analysis 152

35. Inductive Statistics: Model Selection

Christian Borgelt Data Mining / Intelligent Data Analysis 153

Model Selection

• Objective: select the model that best fits the data, taking the model complexity into account. The more complex the model, the better it usually fits the data.
  [Figure: a set of data points in the x-y-plane; the black line is a regression line (2 free parameters), the blue curve is a 7th order regression polynomial (8 free parameters) that passes through every data point.]
• The blue curve fits the data points perfectly, but it is not a good model.

Christian Borgelt Data Mining / Intelligent Data Analysis 154

Information Criteria

• There is a tradeoff between model complexity and fit to the data.
  Question: How much better must a more complex model fit the data in order to justify the higher complexity?
• One approach to quantify the tradeoff: Information Criteria
  Let $M$ be a model and $\Theta$ the set of free parameters of $M$. Then:
  $\mathrm{IC}_\kappa(M, \Theta \mid D) = -2 \ln P(D \mid M, \Theta) + \kappa\, |\Theta|$,
  where $D$ are the sample data and $\kappa$ is a parameter. Special cases:
  ◦ Akaike Information Criterion (AIC): $\kappa = 2$
  ◦ Bayesian Information Criterion (BIC): $\kappa = \ln n$, where $n$ is the sample size
• The lower the value of these measures, the better the model.

Christian Borgelt Data Mining / Intelligent Data Analysis 155

Minimum Description Length

• Idea: Consider the transmission of the data from a sender to a receiver. Since the transmission of information is costly, the length of the message to be transmitted should be minimized.
• A good model of the data can be used to transmit the data with fewer bits. However, the receiver does not know the model the sender used and thus cannot decode the message. Therefore: if the sender uses a model, he/she has to transmit the model as well.
• description length = length of model description + length of data description
  (A more complex model increases the length of the model description, but reduces the length of the data description.)
• The model that leads to the smallest total description length is the best.

Christian Borgelt Data Mining / Intelligent Data Analysis 156
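A minimal sketch of how the information criteria are evaluated in practice (the log-likelihood values and parameter counts below are hypothetical, chosen only for illustration):

from math import log

def information_criterion(log_likelihood, num_params, kappa):
    # IC_kappa(M, Theta | D) = -2 ln P(D | M, Theta) + kappa * |Theta|
    return -2.0 * log_likelihood + kappa * num_params

n = 100                                                 # sample size (hypothetical)
for name, ll, k in [("simple model", -120.0, 2), ("complex model", -115.0, 8)]:
    aic = information_criterion(ll, k, kappa=2)         # Akaike Information Criterion
    bic = information_criterion(ll, k, kappa=log(n))    # Bayesian Information Criterion
    print(name, "AIC:", round(aic, 1), "BIC:", round(bic, 1))

With these hypothetical numbers the simple model wins under both criteria, because the better fit of the complex model does not outweigh its six additional parameters.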

36. Minimum Description Length: Example

• Given: a one-dimensional sample from a polynomial distribution.
• Question: are the probabilities of the attribute values sufficiently different to justify a non-uniform distribution model?
• Coding using no model (equal probabilities for all values):
  $l_1 = n \log_2 k$,
  where $n$ is the sample size and $k$ the number of attribute values.
• Coding using a polynomial distribution model:
  $l_2 = \underbrace{\log_2 \frac{(n + k - 1)!}{n!\,(k-1)!}}_{\text{model description}} + \underbrace{\log_2 \frac{n!}{x_1! \cdots x_k!}}_{\text{data description}}$
  (Idea: Use a codebook with one page per configuration, that is, frequency distribution (model) and specific sequence (data), and transmit the page number.)

Christian Borgelt Data Mining / Intelligent Data Analysis 157

Minimum Description Length: Example

Some details about the codebook idea:
• Model Description:
  There are $n$ objects (the sample cases) that have to be partitioned into $k$ groups (one for each attribute value). (Model: distribute $n$ balls on $k$ boxes.)
  Number of possible distributions: $\frac{(n + k - 1)!}{n!\,(k-1)!}$
  Idea: number of possible sequences of $n + k - 1$ objects ($n$ balls and $k - 1$ box walls) of which $n$ (the objects) and $k - 1$ (the box walls) are indistinguishable.
• Data Description:
  There are $k$ groups of objects with $n_i$, $i = 1, \ldots, k$, elements in them. (The values of the $n_i$ are known from the model description.)
  Number of possible sequences: $\frac{n!}{n_1! \cdots n_k!}$

Christian Borgelt Data Mining / Intelligent Data Analysis 158

Summary Statistics

Statistics has two main areas:
• Descriptive Statistics
  ◦ Display the data in tables or charts.
  ◦ Summarize the data in characteristic measures.
  ◦ Reduce the dimensionality of the data with principal component analysis.
• Inductive Statistics
  ◦ Use probability theory to draw inferences about the process that generated the data.
  ◦ Parameter Estimation (point and interval)
  ◦ Hypothesis Testing (parameter, goodness-of-fit, dependence)
  ◦ Model Selection (tradeoff between fit and complexity)

Christian Borgelt Data Mining / Intelligent Data Analysis 159

Principles of Modeling

Christian Borgelt Data Mining / Intelligent Data Analysis 160
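The two description lengths of the example can be computed directly (a sketch using Python's standard library; the value frequencies are hypothetical):

from math import comb, factorial, log2

x = [50, 30, 20]                                   # hypothetical value frequencies
n, k = sum(x), len(x)

l1 = n * log2(k)                                   # coding without a model

model_desc = log2(comb(n + k - 1, n))              # log2 of (n+k-1)! / (n! (k-1)!)
data_desc  = log2(factorial(n)) - sum(log2(factorial(xi)) for xi in x)   # log2 of n! / (x_1! ... x_k!)
l2 = model_desc + data_desc                        # coding with a polynomial distribution model

print(round(l1, 1), round(l2, 1))                  # the smaller description length wins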

37. Principles of Modeling

• The Data Mining step of the KDD Process consists mainly of model building for specific purposes (e.g. prediction).
• What type of model is to be built depends on the task, e.g.,
  ◦ if the task is numeric prediction, one may use a regression function,
  ◦ if the task is classification, one may use a decision tree,
  ◦ if the task is clustering, one may use a set of cluster prototypes,
  ◦ etc.
• Most data analysis methods comprise the following four steps:
  ◦ Select the Model Class (e.g. decision tree)
  ◦ Select the Objective Function (e.g. misclassification rate)
  ◦ Apply an Optimization Algorithm (e.g. top-down induction)
  ◦ Validate the Results (e.g. cross validation)

Christian Borgelt Data Mining / Intelligent Data Analysis 161

Model Classes

• In order to extract information from data, it is necessary to specify the general form the analysis result should have.
• We call this the model class or architecture of the analysis result.
• Attention: In Data Mining / Machine Learning the notion of a model class is considerably more general than, e.g., in statistics, where it reflects a structure inherent in the data or represents the process of data generation.
• Typical distinctions w.r.t. model classes:
  ◦ Type of Model (e.g. linear function, rules, decision tree, clusters etc.)
  ◦ Global versus Local Model (e.g. regression models usually cover the whole data space while rules are applicable only in the region where their antecedent is satisfied)
  ◦ Interpretable versus Black Box (rules and decision trees are usually considered as interpretable, artificial neural networks as black boxes)

Christian Borgelt Data Mining / Intelligent Data Analysis 162

Model Evaluation

• After a model has been constructed, one would like to know how "good" it is.
  ⇒ How can we measure the quality of a model?
• Desired: The model should generalize well and thus yield, on new data, an error (to be made precise) that is as small as possible.
• However, due to possible overfitting to the induction / training data (i.e. adaptations to features that are not regular, but accidental), the error on the training data is usually not too indicative.
  ⇒ How can we assess the (expected) performance on new data?
• General idea: Evaluate on a hold-out data set (validation data), that is, on data not used for building / training the predictor.
  ◦ It is (highly) unlikely that the validation data exhibits the same accidental features as the training data.
  ◦ Hence an evaluation on the validation data can provide a good indication of the performance on new data.

Christian Borgelt Data Mining / Intelligent Data Analysis 163

Fitting Criteria and Score / Loss Functions

• In order to find the best or at least a good model for the given data, a fitting criterion is needed, usually in the form of an objective function
  $f: \mathcal{M} \to \mathbb{R}$,
  where $\mathcal{M}$ is the set of considered models.
• The objective function $f$ may also be referred to as
  ◦ Score Function (usually to be maximized),
  ◦ Loss Function (usually to be minimized).
• Typical examples of objective functions are ($m \in \mathcal{M}$ is a model, $D$ the data)
  ◦ Mean squared error (MSE): $f(m, D) = \frac{1}{|D|} \sum_{(\vec{x}, y) \in D} (m(\vec{x}) - y)^2$
  ◦ Mean absolute error (MAE): $f(m, D) = \frac{1}{|D|} \sum_{(\vec{x}, y) \in D} |m(\vec{x}) - y|$
  ◦ Accuracy: $f(m, D) = \frac{1}{|D|} \sum_{(\vec{x}, y) \in D} \delta_{m(\vec{x}), y}$

Christian Borgelt Data Mining / Intelligent Data Analysis 164
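A minimal sketch of the three objective functions in Python, with a model represented as a function that maps an input to a prediction (the toy model and data below are hypothetical):

def mse(model, data):
    # mean squared error: average of (m(x) - y)^2 over all (x, y) in D
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def mae(model, data):
    # mean absolute error: average of |m(x) - y| over all (x, y) in D
    return sum(abs(model(x) - y) for x, y in data) / len(data)

def accuracy(model, data):
    # fraction of cases for which the prediction equals the true value
    return sum(1 for x, y in data if model(x) == y) / len(data)

data = [(1, 3.2), (2, 4.9), (3, 7.1)]              # hypothetical (x, y) pairs
m = lambda x: 2 * x + 1                            # hypothetical regression model
print(mse(m, data), mae(m, data))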

38. Classification Evaluation

• The most common loss function for classification is the misclassification rate
  $E(m, D) = \frac{1}{|D|} \sum_{(\vec{x}, y) \in D} (1 - \delta_{m(\vec{x}), y})$,
  and (alternatively) its dual, the score function accuracy
  $A(m, D) = \frac{1}{|D|} \sum_{(\vec{x}, y) \in D} \delta_{m(\vec{x}), y} = 1 - E(m, D)$.
• A confusion matrix displays the misclassifications in more detail. It is a table in which the rows represent the true classes and the columns the predicted classes.
• Each entry specifies how many objects from the true class of the corresponding row are classified as belonging to the class of the corresponding column.
• The accuracy is the sum of the diagonal entries divided by the sum of all entries. The misclassification rate (or simply error rate) is the sum of the off-diagonal entries divided by the sum of all entries.
• An ideal classifier has non-zero entries only on the diagonal.

Christian Borgelt Data Mining / Intelligent Data Analysis 165

Reminder: The Iris Data

[pictures not available in online version]
• Collected by Edgar Anderson on the Gaspé Peninsula (Canada).
• First analyzed by Ronald Aylmer Fisher (famous statistician).
• 150 cases in total, 50 cases per Iris flower type.
• Measurements of sepal length and width and petal length and width (in cm).
• Most famous data set in pattern recognition and data analysis.

Christian Borgelt Data Mining / Intelligent Data Analysis 166

Reminder: The Iris Data

[Figure: scatter plots of the iris data set for sepal length vs. sepal width (left) and for petal length vs. petal width (right), with the classes Iris setosa, Iris versicolor and Iris virginica marked by different colors. All quantities are measured in centimeters (cm).]

Christian Borgelt Data Mining / Intelligent Data Analysis 167

Classification Evaluation: Confusion Matrix

• The table below shows a possible confusion matrix for the Iris data set.

  true class        | predicted: Iris setosa | Iris versicolor | Iris virginica
  ------------------|------------------------|-----------------|---------------
  Iris setosa       |                     50 |               0 |              0
  Iris versicolor   |                      0 |              47 |              3
  Iris virginica    |                      0 |               2 |             48

• From this matrix, we can see that all cases of the class Iris setosa are classified correctly and no case of another class is wrongly classified as Iris setosa.
• A few cases of the other classes are wrongly classified: three cases of Iris versicolor are classified as Iris virginica, two cases of Iris virginica are classified as Iris versicolor.
• The misclassification rate is $E = \frac{2 + 3}{50 + 47 + 3 + 2 + 48} = \frac{5}{150} \approx 3.33\%$.
• The accuracy is $A = \frac{50 + 47 + 48}{150} = \frac{145}{150} \approx 96.67\%$.

Christian Borgelt Data Mining / Intelligent Data Analysis 168

39. Classification Evaluation: Two Classes

• For many classification problems there are only two classes that the classifier is supposed to distinguish.
• Let us call the two classes plus (or positive) and minus (or negative).
• The classifier can make two different kinds of mistakes:
  ◦ Cases of the class minus may be wrongly assigned to the class plus. These cases are called false positives (fp).
  ◦ Vice versa, cases of the class plus may be wrongly classified as minus. Such cases are called false negatives (fn).
• The cases that are classified correctly are called true positives (tp) and true negatives (tn), respectively.
• Confusion matrix:

  true class | predicted: plus | predicted: minus | total
  -----------|-----------------|------------------|------
  plus       |              tp |               fn |     p
  minus      |              fp |               tn |     n

• error rate: $E = \frac{fp + fn}{tp + fn + fp + tn}$
• accuracy: $A = \frac{tp + tn}{tp + fn + fp + tn}$

Christian Borgelt Data Mining / Intelligent Data Analysis 169

Classification Evaluation: Precision and Recall

• Sometimes one would like to capture not merely the overall classification accuracy, but how well the individual classes are recognized.
• Especially if the class distribution is skewed, that is, if there are large differences in the class frequencies, overall measures may give a wrong impression.
• For example, if in a two class problem
  ◦ one class occurs in 98% of all cases,
  ◦ while the other covers only the remaining 2%,
  a classifier that always predicts the first class reaches an impressive accuracy of 98%—without distinguishing between the classes at all.
• Such unpleasant situations are fairly common in practice, for example:
  ◦ illnesses are (fortunately) rare and
  ◦ replies to mailings are (unfortunately?) scarce.
  Hence: predict that everyone is healthy or a non-replier.
• However, such a classifier is useless to a physician or a product manager.

Christian Borgelt Data Mining / Intelligent Data Analysis 170

Classification Evaluation: Precision and Recall

• In such cases (skewed class distribution) higher error rates are usually accepted in exchange for a better coverage of the minority class.
• In order to allow for such a possibility, the following two measures may be used: [Perry, Kent & Berry 1955]
  ◦ Precision: $\pi = \frac{tp}{tp + fp}$
  ◦ Recall: $\rho = \frac{tp}{tp + fn}$
• In other words: precision is the ratio of true positives to all data points classified as positive; recall is the ratio of true positives to all actually positive data points.
• In yet other words: precision is the fraction of data points for which a positive classification is correct; recall is the fraction of positive data points that is identified by the classifier.
• Precision and recall are usually complementary quantities: higher precision may be obtained at the price of lower recall and vice versa.

Christian Borgelt Data Mining / Intelligent Data Analysis 171

Classification Evaluation: Other Quantities I

• recall, sensitivity, hit rate, or true positive rate (TPR)
  $\mathrm{TPR} = \frac{tp}{p} = \frac{tp}{tp + fn} = 1 - \mathrm{FNR}$
• specificity, selectivity or true negative rate (TNR)
  $\mathrm{TNR} = \frac{tn}{n} = \frac{tn}{tn + fp} = 1 - \mathrm{FPR}$
• precision or positive predictive value (PPV)
  $\mathrm{PPV} = \frac{tp}{tp + fp}$
• negative predictive value (NPV)
  $\mathrm{NPV} = \frac{tn}{tn + fn}$
• miss rate or false negative rate (FNR)
  $\mathrm{FNR} = \frac{fn}{p} = \frac{fn}{fn + tp} = 1 - \mathrm{TPR}$

Christian Borgelt Data Mining / Intelligent Data Analysis 172
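The quantities defined on these slides all derive from the four counts of a two-class confusion matrix; a small sketch in Python (the counts used below are hypothetical):

def two_class_measures(tp, fn, fp, tn):
    # basic quality measures derived from a two-class confusion matrix
    return {
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
        "error":     (fp + fn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),               # positive predictive value (PPV)
        "recall":    tp / (tp + fn),               # true positive rate (TPR), sensitivity
        "TNR":       tn / (tn + fp),               # specificity
        "FPR":       fp / (fp + tn),               # fall-out
        "FNR":       fn / (fn + tp),               # miss rate
    }

print(two_class_measures(tp=40, fn=10, fp=5, tn=45))   # hypothetical counts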

40. Classification Evaluation: Other Quantities II

• fall-out or false positive rate (FPR)
  $\mathrm{FPR} = \frac{fp}{n} = \frac{fp}{fp + tn} = 1 - \mathrm{TNR}$
• false discovery rate (FDR)
  $\mathrm{FDR} = \frac{fp}{fp + tp} = 1 - \mathrm{PPV}$
• false omission rate (FOR)
  $\mathrm{FOR} = \frac{fn}{fn + tn} = 1 - \mathrm{NPV}$
• accuracy (ACC)
  $\mathrm{ACC} = \frac{tp + tn}{p + n} = \frac{tp + tn}{tp + tn + fp + fn}$
• misclassification rate or error rate (ERR)
  $\mathrm{ERR} = \frac{fp + fn}{p + n} = \frac{fp + fn}{tp + tn + fp + fn}$

Christian Borgelt Data Mining / Intelligent Data Analysis 173

Classification Evaluation: F-Measure

• With precision and recall we have two numbers that assess the quality of a classifier.
• A common way to combine them into one number is to compute the F1 measure [Rijsbergen 1979], which is the harmonic mean of precision and recall:
  $F_1 = \frac{2}{\frac{1}{\pi} + \frac{1}{\rho}} = \frac{2\pi\rho}{\pi + \rho}$.
  In this formula precision and recall have the same weight.
• The generalized F measure [Rijsbergen 1979] introduces a mixing parameter. It can be found in several different, but basically equivalent versions, for example:
  $F_\alpha = \frac{1}{\frac{\alpha}{\pi} + \frac{1 - \alpha}{\rho}} = \frac{\pi\rho}{\alpha\rho + (1 - \alpha)\pi}$, $\alpha \in [0, 1]$,
  or
  $F_\beta = \frac{1 + \beta^2}{\frac{1}{\pi} + \frac{\beta^2}{\rho}} = \frac{(1 + \beta^2)\,\pi\rho}{\rho + \beta^2\pi}$, $\beta \in [0, \infty)$.
  Obviously, the standard $F_1$ measure results for $\alpha = \frac{1}{2}$ or $\beta = 1$, respectively.

Christian Borgelt Data Mining / Intelligent Data Analysis 174

Classification Evaluation: F-Measure

• The generalized F measure is [Rijsbergen 1979]:
  $F_\alpha = \frac{\pi\rho}{\alpha\rho + (1 - \alpha)\pi}$, $\alpha \in [0, 1]$,  $F_\beta = \frac{(1 + \beta^2)\,\pi\rho}{\rho + \beta^2\pi}$, $\beta \in [0, \infty)$.
• By choosing the mixing parameters $\alpha$ or $\beta$ it can be controlled whether the focus should be more on precision or on recall:
  ◦ For $\alpha > \frac{1}{2}$ or $\beta < 1$ the focus is more on precision; for $\alpha = 1$ or $\beta = 0$ we have $F_\alpha = F_\beta = \pi$.
  ◦ For $\alpha < \frac{1}{2}$ or $\beta > 1$ the focus is more on recall; for $\alpha = 0$ or $\beta \to \infty$ we have $F_\alpha = F_\beta = \rho$.
• However, this possibility is rarely used, presumably because precision and recall are usually considered to be equally important.
• Note that precision and recall and thus the generalized F measure as well as its special case, the $F_1$ measure, focus on one class (namely the positive class). Exchanging the two classes usually changes all of these measures.

Christian Borgelt Data Mining / Intelligent Data Analysis 175

Classification Evaluation: More Than Two Classes

• The misclassification rate (or error rate) and the accuracy can be used regardless of the number of classes (whether only two or more).
• In contrast, precision, recall and F measure are defined only for two classes.
• However, they can be generalized to more than two classes by computing them for each class separately and averaging the results.
• In this approach, each class in turn is seen as the positive class (plus) while all other classes together form the negative class (minus).
• Macro-averaging (1st possibility of averaging) [Sebastiani 2002]
  ◦ precision: $\pi_{\mathrm{macro}} = \frac{1}{c} \sum_{k=1}^c \pi_k = \frac{1}{c} \sum_{k=1}^c \frac{tp(k)}{tp(k) + fp(k)}$
  ◦ recall: $\rho_{\mathrm{macro}} = \frac{1}{c} \sum_{k=1}^c \rho_k = \frac{1}{c} \sum_{k=1}^c \frac{tp(k)}{tp(k) + fn(k)}$
  Here $c$ is the number of classes.

Christian Borgelt Data Mining / Intelligent Data Analysis 176
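A small sketch of the generalized F measure in the $\beta$ version given above (the precision and recall values are hypothetical):

def f_beta(precision, recall, beta=1.0):
    # generalized F measure: (1 + beta^2) * pi * rho / (beta^2 * pi + rho)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

pi, rho = 0.8, 0.5                       # hypothetical precision and recall
print(f_beta(pi, rho))                   # F1 (harmonic mean), about 0.615
print(f_beta(pi, rho, beta=0.5))         # focus more on precision, about 0.714
print(f_beta(pi, rho, beta=2.0))         # focus more on recall, about 0.541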

41. Classification Evaluation: More Than Two Classes

• Class-weighted averaging (2nd possibility of averaging)
  ◦ precision: $\pi_{\mathrm{wgt}} = \sum_{k=1}^c \frac{n_k}{n} \pi_k = \frac{1}{n} \sum_{k=1}^c \frac{tp(k) + fn(k)}{tp(k) + fp(k)} \cdot tp(k)$
  ◦ recall: $\rho_{\mathrm{wgt}} = \sum_{k=1}^c \frac{n_k}{n} \rho_k = \frac{1}{n} \sum_{k=1}^c tp(k)$
  Here $c$ is again the number of classes, $n_k$ is the number of cases belonging to class $k$, $k = 1, \ldots, c$, and $n$ is the total number of cases, $n = \sum_{k=1}^c n_k$.
• While macro-averaging treats each class as having the same weight (thus ignoring the (possibly skewed) class frequencies), class-weighted averaging takes the class frequencies into account.
• Note that class-weighted average recall is equivalent to accuracy, since $\sum_{k=1}^c tp(k)$ is simply the sum of the diagonal elements of the confusion matrix and $n$, the total number of cases, is the sum over all entries.

Christian Borgelt Data Mining / Intelligent Data Analysis 177

Classification Evaluation: More Than Two Classes

• Micro-averaging (3rd possibility of averaging) [Sebastiani 2002]
  ◦ precision: $\pi_{\mathrm{micro}} = \frac{\sum_{k=1}^c tp(k)}{\sum_{k=1}^c (tp(k) + fp(k))} = \frac{1}{n} \sum_{k=1}^c tp(k)$
  ◦ recall: $\rho_{\mathrm{micro}} = \frac{\sum_{k=1}^c tp(k)}{\sum_{k=1}^c (tp(k) + fn(k))} = \frac{1}{n} \sum_{k=1}^c tp(k)$
  Here $c$ is again the number of classes and $n$ is the total number of cases. This averaging renders precision and recall identical and equivalent to accuracy.
• As a consequence, micro-averaging is not useful in this setting, but it may be useful, e.g., for averaging results over different data sets.
• For all different averaging approaches, the $F_1$ measure may be computed as the harmonic mean of (averaged) precision and recall.
• Alternatively, the $F_1$ measure may be computed for each class separately and then averaged in analogy to the above methods.

Christian Borgelt Data Mining / Intelligent Data Analysis 178

Classification Evaluation: Misclassification Costs

• Misclassifications may also be handled via misclassification costs.
• Misclassification costs are specified in a matrix analogous to a confusion matrix, that is, as a table the rows of which refer to the true class and the columns of which refer to the predicted class.
  ◦ The diagonal of a misclassification cost matrix is zero (correct classifications).
  ◦ The off-diagonal elements specify the costs of a specific misclassification: entry $x_{i,j}$ specifies the costs of misclassifying class $i$ as class $j$.
• If a cost matrix $X = (x_{i,j})_{1 \le i,j \le c}$ is given, the expected loss is used as the objective function:
  $L(m, D) = \sum_{i=1}^c \sum_{j=1}^c p_{i,j} \cdot x_{i,j}$
  Here $p_{i,j}$ is the (relative) frequency with which class $i$ is misclassified as class $j$.

Christian Borgelt Data Mining / Intelligent Data Analysis 179

Classification Evaluation: Misclassification Costs

• Misclassification costs generalize the misclassification rate, which results as the special case of equal misclassification costs.
• With misclassification costs one can avoid the problems caused by skewed class distributions, because they make it possible to take into account that certain misclassifications have stronger consequences or higher costs than others.
  ◦ Misclassifying a sick patient as healthy has high costs (as this leaves the disease untreated).
  ◦ Misclassifying a healthy patient as sick has low costs (although the patient may have to endure additional tests, it will finally be revealed that he/she is healthy).
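A minimal sketch of the expected loss for a given confusion matrix and cost matrix (both matrices below are hypothetical two-class examples):

def expected_loss(confusion, costs):
    # confusion[i][j]: number of class-i cases predicted as class j
    # costs[i][j]:     cost of misclassifying class i as class j (diagonal = 0)
    n = sum(sum(row) for row in confusion)
    c = len(confusion)
    return sum(confusion[i][j] / n * costs[i][j] for i in range(c) for j in range(c))

confusion = [[90, 10],                   # hypothetical two-class confusion matrix
             [ 5, 95]]
costs     = [[0, 10],                    # misclassifying class 0 is ten times as costly
             [1,  0]]
print(expected_loss(confusion, costs))   # (10 * 10 + 5 * 1) / 200 = 0.525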
  ◦ Not sending an ad to a prospective buyer has high costs (because the seller loses the revenue from the sale).
  ◦ Sending an ad to a non-buyer has low costs (only the cost of the mailing, which may be very low, is lost).
• However, specifying proper costs can be tedious and time consuming.

Christian Borgelt Data Mining / Intelligent Data Analysis 180

42. Classification Evaluation: ROC Curves

• Some classifiers (can) yield for every case to be classified a probability or confidence for each class.
• In such a case it is common to assign a case to the class for which the highest confidence / probability is produced.
• In the case of two classes, plus and minus, one may assign a case to class plus if the probability for this class exceeds 0.5 and to class minus otherwise.
• However, one may also be more careful and assign a case to class plus only if the probability exceeds, e.g., $\tau = 0.8$, leading to fewer false positives.
• On the other hand, choosing a threshold $\tau < 0.5$ leads to more true positives.
• The trade-off between true positives and false positives is illustrated by the receiver operating characteristic curve (ROC curve) that shows the true positive rate versus the false positive rate.
• The area under the (ROC) curve (AUC) may be used as an indicator how well a classifier solves a given problem.

Christian Borgelt Data Mining / Intelligent Data Analysis 181

Classification Evaluation: ROC Curves

[Figure: ROC curves (true positive rate over false positive rate) for a perfect classifier (green), a random classifier (red diagonal), and a "real" classifier (blue).]
• ROC curve: For all choices of the threshold $\tau$ a new point is drawn at the respective coordinates of false positive rate and true positive rate.
• These dots are connected to form a curve.
• The curves in the figure are idealized; actual ROC curves are (usually) step functions.
• Diagonal segments may be used if the same probability is assigned to cases of different classes.
• An ideal ROC curve (green) jumps to 100% true positives without producing any false positives, then adds the remaining cases as false positives.
• A random classifier, which assigns random probability / confidence values to the cases, is represented by the (idealized) red ROC curve ("expected" ROC curve).
• An actual classifier may produce an ROC curve like the blue one.

Christian Borgelt Data Mining / Intelligent Data Analysis 182

Classification Evaluation: ROC Curves

[Figure: three diagrams of ROC curves: left, the idealized curves from the previous slide; middle and right, randomly generated ROC curves.]
• A random classifier, which assigns random probability / confidence values to the cases, is represented by the (idealized) red ROC curve ("expected" ROC curve).
• This line is idealized, because different random classifiers produce different ROC curves that scatter around this diagonal.
• middle: 50 positive, 50 negative cases; right: 250 positive, 250 negative cases; the diagrams show 100 random ROC curves each, with one highlighted in red.

Christian Borgelt Data Mining / Intelligent Data Analysis 183

Classification Evaluation: Area Under the (ROC) Curve

[Figure: left, the idealized ROC curves; middle, AUC version 1 (the area under the ROC curve down to the horizontal axis); right, AUC version 2 (the area between the ROC curve and the diagonal, doubled).]
• The Area Under the (ROC) Curve (AUC) may be defined in two ways:
  ◦ the area extends down to the horizontal axis (more common),
  ◦ the area extends down to the diagonal and is doubled (more intuitive).
• It is $\mathrm{AUC}_2 = 2\,(\mathrm{AUC}_1 - \frac{1}{2})$ and $\mathrm{AUC}_1 = \frac{1}{2}\,\mathrm{AUC}_2 + \frac{1}{2}$.
• For a random ROC curve it is $\mathrm{AUC}_1 \approx 0.5$, but $\mathrm{AUC}_2 \approx 0$.
  Note: $\mathrm{AUC}_2$ may become negative, $\mathrm{AUC}_1$ is always non-negative!

Christian Borgelt Data Mining / Intelligent Data Analysis 184
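AUC (version 1) can also be computed directly from the classifier's scores, as the fraction of positive/negative pairs that are ranked correctly (ties count one half); this rank-based view is equivalent to the area under the step-function ROC curve. A sketch with hypothetical scores:

def auc(scores_pos, scores_neg):
    # AUC (version 1) as the fraction of positive/negative pairs in which the
    # positive case receives the higher score (ties count one half)
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.7, 0.4]               # hypothetical scores for positive cases
neg = [0.6, 0.5, 0.3, 0.2]               # hypothetical scores for negative cases
auc1 = auc(pos, neg)
auc2 = 2 * (auc1 - 0.5)                  # AUC version 2
print(auc1, auc2)                        # 0.875 and 0.75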

43. Algorithms for Model Fitting

• An objective (score / loss) function does not tell us directly how to find the best or at least a good model.
• To find a good model, an optimization method is needed (i.e., a method that optimizes the objective function).
• Typical examples of optimization methods are
  ◦ Analytical / Closed Form Solutions
    Sometimes a solution can be obtained in a closed form (e.g. linear regression).
  ◦ Combinatorial Optimization
    If the model space is small, exhaustive search may be feasible.
  ◦ Gradient Methods
    If the objective function is differentiable, a gradient method (gradient ascent or descent) may be applied to find a (possibly only local) optimum.
  ◦ Random Search, Greedy Strategies and Other Heuristics
    For example, hill climbing, greedy search, alternating optimization, widening, evolutionary and swarm-based algorithms etc.

Christian Borgelt Data Mining / Intelligent Data Analysis 185

Causes of Errors

• One may distinguish between four types of errors:
  ◦ the pure / intrinsic / experimental / Bayes error,
  ◦ the sample / variance error or scatter,
  ◦ the lack of fit / model / bias error,
  ◦ the algorithmic error.
• Pure / Intrinsic / Experimental / Bayes Error
  ◦ Inherent in the data, impossible to overcome by any model.
  ◦ Due to noise, random variations, imprecise measurements, or the influence of hidden (unobservable) variables.
  ◦ For any data point usually several classes (if classification) or numeric values (if numeric prediction) have a non-vanishing probability. However, usually a predictor has to produce a single class / specific value; hence it cannot yield correct predictions all of the time.

Christian Borgelt Data Mining / Intelligent Data Analysis 186

Causes of Errors: Bayes Error

• The term "Bayes error" is usually used in the context of classification.
• It describes a situation in which for any data point more than one class is possible (multiple classes have a non-vanishing probability).
  [Figure: three two-class data sets showing perfect class separation, a more difficult problem, and strongly overlapping classes.]
• If classes overlap in the data space, no model can perfectly separate them.

Christian Borgelt Data Mining / Intelligent Data Analysis 187

Causes of Errors: Bayes Error

• In the (artificial) example on the previous slide, the samples of each class are drawn from two bivariate normal distributions (four in total).
• In all three cases, the means of the distributions are the same, only the variances differ, leading to a greater overlap of the classes the greater the variances.
• However, this is not necessarily the only (relevant) situation. Classes may have more or less identical means and rather differ in their variances.
• Example: Three darts players try to hit the center of the dartboard (the so-called bull's eye).
• Assume, one of the players is a professional, one is a hobby player and one is a complete beginner.
• Objective: Predict who has thrown the dart, given the point where the dart hit the dartboard.

Christian Borgelt Data Mining / Intelligent Data Analysis 188

44. Causes of Errors: Bayes Error

• Example: Three darts players try to hit the center of the dartboard (the so-called bull's eye).
• Assume, one of the players is a professional, one is a hobby player and one is a complete beginner.
• Objective: Predict who has thrown the dart, given the point where the dart hit the dartboard.
  [Figure: three dartboards with the hits of the professional, the hobby player and the beginner, scattered increasingly widely around the center.]

Christian Borgelt Data Mining / Intelligent Data Analysis 189

Causes of Errors: Bayes Error

• Example: Three darts players try to hit the center of the dartboard (the so-called bull's eye).
• Assume, one of the players is a professional, one is a hobby player and one is a complete beginner.
• Objective: Predict who has thrown the dart, given the point where the dart hit the dartboard.
  [Figure: probability densities of three one-dimensional normal distributions with equal means and different variances over the attribute value.]
• Simple classification rule: Assuming equal frequency of the three classes, assign the class with the highest likelihood (one-dimensional normal distributions, see figure).
• Attention: Do not confuse classification boundaries with class boundaries (which may not even exist).

Christian Borgelt Data Mining / Intelligent Data Analysis 190

Causes of Errors: Sample Error

• Sample / Variance Error or Scatter
  ◦ The sample error is caused by the fact that the given data is only an imperfect representation of the underlying distribution.
  ◦ According to the laws of large numbers, the sample distribution converges in probability to the true distribution when the sample size approaches infinity.
  ◦ However, a finite sample can deviate significantly from the true distribution, although the probability for such a deviation might be small.
  [Figure: bar chart of the frequencies of the numbers of pips when throwing a fair die 60 times.]
• The bar chart shows the result for throwing a fair die 60 times.
• In the ideal case, one would expect each of the numbers 1, ..., 6 to occur 10 times.
• But for this sample, the distribution does not look uniform.

Christian Borgelt Data Mining / Intelligent Data Analysis 191

Causes of Errors: Sample Error

• Another source for sample errors are measurements with limited precision and round-off errors in features derived by computations.
• Sometimes the sample is also (systematically) biased.
  ◦ Consider a bank that supplies loans to customers.
  ◦ Based on historical data available on customers who have obtained loans, the bank wants to estimate a new customer's credit-worthiness (i.e., the probability that the customer will pay back a loan).
  ◦ The collected data will be biased towards better customers, because customers with a more problematic financial status have not been granted loans.
  ◦ Therefore, no information is available for such customers whether they might have paid back the loan nevertheless.
  ◦ In statistical terms: the sample is not representative, but biased.
  ◦ (Cf. also e.g. the Yalemen example on exercise sheet 1.)

Christian Borgelt Data Mining / Intelligent Data Analysis 192

45. Causes of Errors: Model Error

• A large error might be caused by a high pure error, but it might also be due to a lack of fit.
• If the set of considered models is too simple for the structure inherent in the data, no model (from this set) will yield a small error.
• Such an error is also called model error or bias error. (Because an improper choice of the model class introduces a bias into the fit.)
  [Figure: a regression line fitted to data points that follow a quadratic function.]
• The chart shows a regression line fitted to data with no pure error.
• However, the data points originate from a quadratic and not from a linear function.
• As a consequence, there is a considerable lack of fit.

Christian Borgelt Data Mining / Intelligent Data Analysis 193

Reminder: Model Selection

• Objective: select the model that best fits the data, taking the model complexity into account. The more complex the model, the better it usually fits the data.
  [Figure: the data points with the black regression line (2 free parameters) and the blue 7th order regression polynomial (8 free parameters).]
• The blue curve fits the data points perfectly, but it is not a good model.
• On the other hand, too simple a model can lead to a lack of fit.

Christian Borgelt Data Mining / Intelligent Data Analysis 194

Causes of Errors: Algorithmic Error

• The algorithmic error is caused by the method that is used to fit the model or the model parameters.
• In the ideal case, if an analytical solution for the optimum of the objective function exists, the algorithmic error is zero or only caused by numerical problems.
• However, in many cases an analytical solution cannot be provided and heuristic strategies are needed to fit the model to the data.
• Even if a model exists with a very good fit—the global optimum of the objective function—the heuristic optimization strategy might only be able to find a local optimum with a much larger error.
• This error is neither caused by the pure error nor by the error due to the lack of fit (model error).
• Most of the time, the algorithmic error will not be considered and it is assumed that the heuristic optimization strategy is chosen well enough to find an optimum that is at least close to the global optimum.

Christian Borgelt Data Mining / Intelligent Data Analysis 195

Machine Learning Bias and Variance

• The four types of errors that were mentioned can be grouped into two categories.
• The algorithmic and the model error can be controlled to a certain extent, since we are free to choose a suitable model and algorithm. These errors are also called machine learning bias.
• On the other hand, we have no influence on the pure / intrinsic error or the sample error (at least if the data to be analyzed have already been collected). These errors are also called machine learning variance.
• Note that this decomposition differs from the one commonly known in statistics, where, for example, the mean squared error of an estimator $\hat{\theta}$ for a parameter $\theta$ can be decomposed in terms of the variance of the estimator and its bias:
  $\mathrm{MSE} = \mathrm{Var}(\hat{\theta}) + (\mathrm{Bias}(\hat{\theta}))^2$.
  Here the variance depends on the intrinsic error, i.e. on the variance of the random variable from which the sample is generated, but also on the choice of the estimator $\hat{\theta}$, which is considered part of the model bias in machine learning.

Christian Borgelt Data Mining / Intelligent Data Analysis 196

46. Learning Without Bias? No Free Lunch Theorem

• The different types of errors or biases have an interesting additional impact on the ability to find a suitable model for a given data set: If we have no model bias, we will not be able to generalize.
• The model bias is actually essential to put some sort of a-priori knowledge into the model learning process.
• Essentially this means that we need to constrain
  ◦ either the types of models that are available
  ◦ or the way we are searching for a suitable model (or both).
• The technical reason for this need is the No Free Lunch Theorem. [Wolpert and Macready 1997]
• Intuitively, this theorem states that if an algorithm (e.g. a machine learning or optimization algorithm) performs well on a certain class of problems, then it necessarily pays for that with degraded performance on the set of all remaining problems.

Christian Borgelt Data Mining / Intelligent Data Analysis 197

Model Validation

• Due to possible overfitting to the induction / training data (i.e. adaptations to features that are not regular, but accidental), the error on the training data is not too indicative of the error on new data.
• General idea of model validation: Evaluate on a hold-out data set (validation data), that is, on data not used for building / training the model.
  ◦ Split the data into two parts: training data and validation data (often recommended: training data 80%, validation data 20%).
  ◦ Train a model on the training data and evaluate it on the validation data.
• It is (highly) unlikely that the validation data exhibits the same accidental features as the training data.
• However, by chance, we might be lucky (unlucky) in that the validation data contains easy (difficult) examples, leading to an over-optimistic (-pessimistic) evaluation.
• Solution approach: repeat the split, the training and the evaluation.

Christian Borgelt Data Mining / Intelligent Data Analysis 198

Model Validation: Cross Validation

• General method to evaluate / to predict the performance of models.
• Serves the purpose of estimating the error (rate) on new example cases.
• Procedure of cross validation:
  ◦ Split the given data set into $n$ so-called folds of equal size (n-fold cross validation). Often recommended: $n = 10$.
  ◦ Combine $n - 1$ folds into a training data set, build a classifier, and test it on the $n$-th fold (the hold-out fold).
  ◦ Do this for all $n$ possible selections of $n - 1$ folds and average the error (rates).
• Special case: leave-1-out cross validation (a.k.a. jackknife method).
  (use as many folds as there are example cases)
• The final classifier is learned from the full data set (in order to exploit all available information).

Christian Borgelt Data Mining / Intelligent Data Analysis 199

Model Validation: Cross Validation

• Cross validation is also a method that may be used to determine good so-called hyper-parameters of a model building method.
• Distinction between parameters and hyper-parameters:
  ◦ parameter refers to parameters of a model as it is produced by an algorithm, for example, regression coefficients;
  ◦ hyper-parameter refers to the parameters of a model-building method, for example, the maximum height of a decision tree, the number of trees in a random forest, the learning rate for a neural network etc.
• Hyper-parameters are commonly chosen by running a cross validation for various choices of the hyper-parameter(s) and finally choosing the one that produced the best models (in terms of their evaluation on the validation data sets).
• A final model is then built on the whole data set, using the hyper-parameter values found in this way (to exploit all available information). Christian Borgelt Data Mining / Intelligent Data Analysis 200
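As an illustration of the n-fold cross validation procedure described above, here is a minimal sketch; the synthetic data set, the fold count n = 5, and the use of a least-squares line as the model are illustrative assumptions.

```python
# Minimal sketch of n-fold cross validation (here n = 5) for a simple
# least-squares line fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 0.75 + 7/12 * x + rng.normal(scale=0.5, size=x.size)

n_folds = 5
idx = rng.permutation(x.size)               # shuffle before splitting into folds
folds = np.array_split(idx, n_folds)

errors = []
for k in range(n_folds):
    test  = folds[k]                                   # hold-out fold
    train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    b, a  = np.polyfit(x[train], y[train], deg=1)      # fit on the other n-1 folds
    mse   = np.mean((a + b * x[test] - y[test]) ** 2)  # evaluate on the hold-out fold
    errors.append(mse)

print("cross-validated MSE:", np.mean(errors))
```

The same loop can be wrapped in an outer loop over candidate hyper-parameter values to select the value that yields the best average validation error.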

  47. Model Validation: Bootstrapping • Bootstrapping is a resampling technique from statistics that does not directly evaluate the model error, but aims at estimating the variance of the estimated model parameters. • Therefore, bootstrapping is suitable for models with real-valued parameters. • Like in cross-validation, the model is not only computed once, but multiple times. • For this purpose, k bootstrap samples, each of size n , are drawn randomly with replacement from the original data set with n records. • The model is fitted to each of these bootstrap samples, so that we obtain k estimates for the model parameters. • Based on these k estimates the empirical standard deviation can be computed for each parameter to provide an assessment how reliable its estimation is. • It is also possible to compute confidence intervals for the parameters based on bootstrapping. Christian Borgelt Data Mining / Intelligent Data Analysis 201 Model Validation: Bootstrapping • The figure on the right shows a data set with n = 20 data points from which 8 k = 10 bootstrap samples were drawn. 6 • For each of the bootstrap samples y the corresponding regression line is shown. 4 sample intercept slope 2 1 0.3801791 0.3749113 0 2 0.5705601 0.3763055 3 − 0 . 2840765 0.4078726 0 5 10 15 20 4 0.9466432 0.3532497 x 5 1.4240513 0.3201722 • The resulting parameter estimates for 6 0.9386061 0.3596913 the intercept and the slope of the regression line 7 0.6992394 0.3417433 are listed in the table on the left. 8 0.8300100 0.3385122 9 1.1859194 0.3075218 • The standard deviation for the slope 10 0.2496341 0.4213876 is much lower than for the intercept, so that mean 0.6940766 0.3601367 the estimation for the slope is more reliable. std. dev. 0.4927206 0.0361004 Christian Borgelt Data Mining / Intelligent Data Analysis 202 Regression Christian Borgelt Data Mining / Intelligent Data Analysis 203 Regression • General Idea of Regression ◦ Method of least squares • Linear Regression ◦ An illustrative example • Polynomial Regression ◦ Generalization to polynomial functional relationships • Multivariate Regression ◦ Generalization to more than one function argument • Logistic Regression ◦ Generalization to non-polynomial functional relationships • Logistic Classification ◦ Modelling 2-class problems with a logistic function • Robust Regression ◦ Dealing with outliers • Summary Christian Borgelt Data Mining / Intelligent Data Analysis 204
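The bootstrapping procedure for regression parameters described above can be sketched as follows; the data set (n = 20 points around an assumed line), the number of bootstrap samples k = 10, and the noise level are illustrative assumptions.

```python
# Minimal sketch of bootstrapping the parameters of a regression line:
# k bootstrap samples of size n are drawn with replacement, the line is
# fitted to each, and the empirical standard deviations are reported.
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.uniform(0, 20, size=n)
y = 0.7 + 0.36 * x + rng.normal(scale=0.8, size=n)

k = 10
params = []
for _ in range(k):
    i = rng.integers(0, n, size=n)                 # draw n indices with replacement
    slope, intercept = np.polyfit(x[i], y[i], deg=1)
    params.append((intercept, slope))

params = np.array(params)
print("mean      (intercept, slope):", np.round(params.mean(axis=0), 4))
print("std. dev. (intercept, slope):", np.round(params.std(axis=0, ddof=1), 4))
```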

  48. Regression: Method of Least Squares Regression is also known as Method of Least Squares . (Carl Friedrich Gauß) (also known as Ordinary Least Squares, abbreviated OLS) Given: • A dataset (( � x 1 , y 1 ) , . . . , ( � x n , y n )) of n data tuples (one or more input values and one output value) and • a hypothesis about the functional relationship between response and predictor values, e.g. Y = f ( X ) = a + bX + ε . Desired: • A parameterization of the conjectured function that minimizes the sum of squared errors (“best fit”). Depending on • the hypothesis about the functional relationship and • the number of arguments to the conjectured function different types of regression are distinguished. Christian Borgelt Data Mining / Intelligent Data Analysis 205 Reminder: Function Optimization Task: Find values � x = ( x 1 , . . . , x m ) such that f ( � x ) = f ( x 1 , . . . , x m ) is optimal. Often feasible approach: • A necessary condition for a (local) optimum (maximum or minimum) is that the partial derivatives w.r.t. the parameters vanish (Pierre de Fermat, 1607–1665). • Therefore: (Try to) solve the equation system that results from setting all partial derivatives w.r.t. the parameters equal to zero. f ( x, y ) = x 2 + y 2 + xy − 4 x − 5 y . Example task: Minimize Solution procedure: 1. Take the partial derivatives of the objective function and set them to zero: ∂f ∂f ∂x = 2 x + y − 4 = 0 , ∂y = 2 y + x − 5 = 0 . 2. Solve the resulting (here: linear) equation system: x = 1 , y = 2. Christian Borgelt Data Mining / Intelligent Data Analysis 206 Linear Regression: General Approach • A dataset (( x 1 , y 1 ) , . . . , ( x n , y n )) of n data tuples and Given: • a hypothesis about the functional relationship, e.g. Y = f ( X ) = a + bX + ε . Approach: Minimize the sum of squared errors, that is, n n � � ( f ( x i ) − y i ) 2 = a, ˆ a + ˆ bx i − y i ) 2 . F (ˆ b ) = (ˆ i =1 i =1 Necessary conditions for a minimum (also known as Fermat’s theorem, after Pierre de Fermat, 1607–1665): n ∂F � a + ˆ = 2(ˆ bx i − y i ) = 0 and ∂ ˆ a i =1 n ∂F � a + ˆ = 2(ˆ bx i − y i ) x i = 0 ∂ ˆ b i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 207 Linear Regression: Example of Error Functional 2 3 0 y b ) a, ˆ 2 0 1 F (ˆ 1 0 0 – 2 0 – 1 1 0 ˆ a 1 0 0 1 2 3 2 ˆ x b 3 1 – 4 • A very simple data set (4 points), 3 0 to which a line is to be fitted. b ) a, ˆ 2 0 F (ˆ • The error functional for linear regression n 1 0 � bx i − y i ) 2 a, ˆ a + ˆ F (ˆ b ) = (ˆ 0 i =1 1 (same function, two different views). 0 4 3 b ˆ 2 –1 1 0 1 ˆ a 2 – – Christian Borgelt Data Mining / Intelligent Data Analysis 208
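The example task from the function optimization reminder above can be checked directly: setting the two partial derivatives to zero yields a linear equation system, which the following minimal sketch solves.

```python
# Minimal sketch: minimize f(x, y) = x^2 + y^2 + x*y - 4x - 5y by setting the
# partial derivatives to zero and solving the resulting linear system.
import numpy as np

# df/dx = 2x +  y - 4 = 0
# df/dy =  x + 2y - 5 = 0
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([4.0, 5.0])

x_opt, y_opt = np.linalg.solve(A, b)
print(x_opt, y_opt)   # 1.0 and 2.0, matching the solution on the slide
```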

  49. Linear Regression: Normal Equations Result of necessary conditions: System of so-called normal equations , that is,   n n � �  ˆ  n ˆ a + x i b = y i , i =1 i =1     n n n � � � x 2  ˆ  ˆ  x i a +  b = x i y i . i i =1 i =1 i =1 a and ˆ • Two linear equations for two unknowns ˆ b . • System can be solved with standard methods from linear algebra. • Solution is unique unless all x -values are identical (vertical lines cannot be represented as y = a + bx ). • The resulting line is called a regression line . Christian Borgelt Data Mining / Intelligent Data Analysis 209 Linear Regression: Example Normal equations: x 1 2 3 4 5 6 7 8 36ˆ 8 ˆ a + b = 27 , y 1 3 2 3 4 3 5 6 a + 204ˆ 36 ˆ b = 146 . Assumption: Solution: y = 3 4 + 7 12 x. Y = a + bX + ε y y 6 6 5 5 4 4 3 3 2 2 1 1 x x 0 0 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 Christian Borgelt Data Mining / Intelligent Data Analysis 210 Side Note: Least Squares and Maximum Likelihood A regression line can be interpreted as a maximum likelihood estimator: Assumption: The data generation process can be described well by the model Y = a + Xx + ε, where ε is normally distributed with mean 0 and (unknown) variance σ 2 ( ε ∼ N (0 , σ 2 )) ( σ 2 independent of X , that is, same dispersion of Y for all X — homoscedasticity ). As a consequence we have � � − ( y − ( a + bx )) 2 1 f Y | X ( y | x ) = √ 2 πσ 2 · exp . 2 σ 2 With this expression we can set up the likelihood function a, ˆ b, σ 2 ) L (( x 1 , y 1 ) , . . . ( x n , y n ); ˆ   n n a + ˆ bx i )) 2 � � 1  − ( y i − (ˆ  . = f X ( x i ) f Y | X ( y i | x i ) = f X ( x i ) · √ 2 πσ 2 · exp 2 σ 2 i =1 i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 211 Side Note: Least Squares and Maximum Likelihood To simplify taking the derivatives, we compute the natural logarithm: a, ˆ b, σ 2 ) ln L (( x 1 , y 1 ) , . . . ( x n , y n ); ˆ   n a + ˆ bx i )) 2 � 1  − ( y i − (ˆ = ln f X ( x i ) · √ 2 πσ 2 · exp  2 σ 2 i =1 n n n � � 1 1 � a + ˆ bx i )) 2 = ln f X ( x i ) + ln √ 2 πσ 2 − ( y i − (ˆ 2 σ 2 i =1 i =1 i =1 a , ˆ b , and σ 2 ) From this expression it is clear that (provided f X ( x ) is independent of ˆ maximizing the likelihood function is equivalent to minimizing n � a, ˆ a + ˆ bx i )) 2 . F (ˆ b ) = ( y i − (ˆ i =1 Interpreting the method of least squares as a maximum likelihood estimator works also for the generalizations to polynomials and multivariate linear functions discussed next. Christian Borgelt Data Mining / Intelligent Data Analysis 212
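A minimal sketch that sets up and solves the normal equations for the example data set above; it reproduces the solution y = 3/4 + 7/12 x.

```python
# Minimal sketch: normal equations for the linear regression example
# (x = 1..8, y = 1, 3, 2, 3, 4, 3, 5, 6).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)
n = x.size

# normal equations:  n a       + (sum x)   b = sum y
#                    (sum x) a + (sum x^2) b = sum x y
A = np.array([[n,       x.sum()      ],
              [x.sum(), (x**2).sum() ]])
r = np.array([y.sum(), (x * y).sum()])

a, b = np.linalg.solve(A, r)
print(a, b)   # 0.75 and 0.5833..., i.e. y = 3/4 + 7/12 x as on the slide
```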

  50. Polynomial Regression Generalization to polynomials y = p ( x ) = a 0 + a 1 x + . . . + a m x m Approach: Minimize the sum of squared errors, that is, n n � � ( p ( x i ) − y i ) 2 = ( a 0 + a 1 x i + . . . + a m x m i − y i ) 2 F ( a 0 , a 1 , . . . , a m ) = i =1 i =1 Necessary conditions for a minimum: All partial derivatives vanish, that is, ∂F ∂F ∂F = 0 , = 0 , . . . , = 0 . ∂a 0 ∂a 1 ∂a m Christian Borgelt Data Mining / Intelligent Data Analysis 213 Polynomial Regression System of normal equations for polynomials     n n n � � �  a 1 + . . . + x m  a m = na 0 +  x i  y i i i =1 i =1 i =1       n n n n � � � �  a 0 + x 2  a 1 + . . . + x m +1  a m =  x i   x i y i i i i =1 i =1 i =1 i =1 . . . . . .       n n n n � � � � x m  a 0 + x m +1  a 1 + . . . + x 2 m  a m = x m    i y i , i i i i =1 i =1 i =1 i =1 • m + 1 linear equations for m + 1 unknowns a 0 , . . . , a m . • System can be solved with standard methods from linear algebra. • Solution is unique unless the coefficient matrix is singular. Christian Borgelt Data Mining / Intelligent Data Analysis 214 Multivariate Linear Regression Generalization to more than one argument z = f ( x, y ) = a + bx + cy Approach: Minimize the sum of squared errors, that is, n n � � ( f ( x i , y i ) − z i ) 2 = ( a + bx i + cy i − z i ) 2 F ( a, b, c ) = i =1 i =1 Necessary conditions for a minimum: All partial derivatives vanish, that is, n ∂F � = 2( a + bx i + cy i − z i ) = 0 , ∂a i =1 n ∂F � = 2( a + bx i + cy i − z i ) x i = 0 , ∂b i =1 n ∂F � = 2( a + bx i + cy i − z i ) y i = 0 . ∂c i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 215 Multivariate Linear Regression System of normal equations for several arguments     n n n � � �  b +  c = na +  x i  y i z i i =1 i =1 i =1       n n n n � � � � x 2   a +   b +   c = x i x i y i z i x i i i =1 i =1 i =1 i =1       n n n n � � � �  a +  b + y 2  c =    y i x i y i z i y i i i =1 i =1 i =1 i =1 • 3 linear equations for 3 unknowns a , b , and c . • System can be solved with standard methods from linear algebra. • Solution is unique unless all data points lie on a straight line. Christian Borgelt Data Mining / Intelligent Data Analysis 216
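A minimal sketch of polynomial regression via the system of normal equations; the data set and the degree m = 2 are illustrative assumptions. The result is compared with NumPy's built-in polynomial fit.

```python
# Minimal sketch of polynomial regression of degree m: build the matrix of
# powers x^0 .. x^m and solve the normal equations X^T X a = X^T y.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 4, 15)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

m = 2
X = np.vander(x, m + 1, increasing=True)     # columns: x^0, x^1, ..., x^m
a = np.linalg.solve(X.T @ X, X.T @ y)        # coefficients a_0, ..., a_m

print("normal equations:", np.round(a, 4))
print("np.polyfit      :", np.round(np.polyfit(x, y, m)[::-1], 4))  # same result
# (np.polyfit returns the coefficients with the highest power first, hence the reversal)
```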

  51. Multivariate Linear Regression General multivariate linear case: m � y = f ( x 1 , . . . , x m ) = a 0 + a k x k k =1 Approach: Minimize the sum of squared errors, that is, y ) ⊤ ( X � F ( � a ) = ( X � a − � a − � y ) , where       a 0 1 x 11 . . . x 1 m y 1    a 1   . . ... .   .  . . . .   X = . . .  , y = � .  , and � a =   .  .  .   1 x n 1 . . . x nm y n a m Necessary condition for a minimum: � a ) = � y ) ⊤ ( X � y ) = � ∇ ∇ a − � a − � a F ( � a ( X � 0 � � Christian Borgelt Data Mining / Intelligent Data Analysis 217 Multivariate Linear Regression • � ∇ a F ( � a ) may easily be computed by remembering that the differential operator � � ∂ � ∂ � ∇ a = , . . . , � ∂a 0 ∂a m behaves formally like a vector that is “multiplied” to the sum of squared errors. • Alternatively, one may write out the differentiation componentwise. With the former method we obtain for the derivative: � � y ) ⊤ ( X � ∇ a F ( � a ) = ∇ a (( X � a − � a − � y )) � � � � � ⊤ ( X � � y ) ⊤ � � �� ⊤ = ∇ a ( X � a − � y ) a − � y ) + ( X � a − � ∇ a ( X � a − � y ) � � � � � ⊤ ( X � � � � ⊤ ( X � = ∇ a ( X � a − � y ) a − � y ) + ∇ a ( X � a − � y ) a − � y ) � � = 2 X ⊤ ( X � a − � y ) = 2 X ⊤ X � a − 2 X ⊤ � ! = � y 0 Christian Borgelt Data Mining / Intelligent Data Analysis 218 Multivariate Linear Regression Necessary condition for a minimum therefore: � � y ) ⊤ ( X � ∇ a F ( � a ) = ∇ a ( X � a − � a − � y ) � � = 2 X ⊤ X � a − 2 X ⊤ � ! = � y 0 As a consequence we obtain the system of normal equations : X ⊤ X � a = X ⊤ � y This system has a solution unless X ⊤ X is singular. If it is regular, we have a = ( X ⊤ X ) − 1 X ⊤ � � y. ( X ⊤ X ) − 1 X ⊤ is called the (Moore-Penrose-) Pseudoinverse of the matrix X . With the matrix-vector representation of the regression problem an extension to multipolynomial regression is straightforward: Simply add the desired products of powers (monomials) to the matrix X . Christian Borgelt Data Mining / Intelligent Data Analysis 219 Mathematical Background: Logistic Function y Logistic function: 1 y max y = f ( x ) = 1 + e − a ( x − x 0 ) 1 2 Special case y max = a = 1, x 0 = 0: 1 x y = f ( x ) = 0 1 + e − x − 4 − 2 0 +2 +4 Application areas of the logistic function: • Can be used to describe saturation processes (growth processes with finite capacity/finite resources y max ). Derivation e.g. from a Bernoulli differential equation f ′ ( x ) = k · f ( x ) · ( y max − f ( x )) (yields a = ky max ) • Can be used to describe a linear classifier (especially for two-class problems, considered later). Christian Borgelt Data Mining / Intelligent Data Analysis 220
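The matrix form of multivariate linear regression can be sketched as follows; the data set is an illustrative assumption. The normal-equation solution is compared with the solution obtained via the Moore-Penrose pseudoinverse.

```python
# Minimal sketch of multivariate linear regression in matrix form:
# solve X^T X a = X^T y, equivalently a = pinv(X) y.
import numpy as np

rng = np.random.default_rng(4)
n  = 100
x1 = rng.uniform(0, 4, n)
x2 = rng.uniform(0, 4, n)
y  = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])    # leading 1 captures the intercept a_0

a_normal = np.linalg.solve(X.T @ X, X.T @ y) # system of normal equations
a_pinv   = np.linalg.pinv(X) @ y             # Moore-Penrose pseudoinverse

print(np.round(a_normal, 3))   # approx. [1.0, 2.0, -0.5]
print(np.round(a_pinv, 3))     # same solution
```

Adding columns with products of powers (monomials) to X extends this directly to multipolynomial regression, as noted above.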

  52. Mathematical Background: Logistic Function Example: two-dimensional logistic function 1 1 y = f ( � x ) = 1 + exp( − ( x 1 + x 2 − 4)) = � � − ((1 , 1)( x 1 , x 2 ) ⊤ − 4) 1 + exp x 2 4 3 0 1 . 9 0 . 8 2 0 . 7 0 . 6 y 0 . 4 5 0 . 4 0 3 . 3 1 0 . 2 x 2 2 0 0 . 4 1 x 1 3 1 0 2 x 1 1 0 1 2 3 4 0 0 The “contour lines” of the logistic function are parallel lines/hyperplanes. Christian Borgelt Data Mining / Intelligent Data Analysis 221 Mathematical Background: Logistic Function Example: two-dimensional logistic function 1 1 y = f ( � x ) = 1 + exp( − (2 x 1 + x 2 − 6)) = � � − ((2 , 1)( x 1 , x 2 ) ⊤ − 6) 1 + exp x 2 4 3 1 2 0.9 y 4 0.8 0.7 0.6 0.5 0.4 0.3 3 1 0.2 0.1 x 2 2 0 4 x 1 3 1 0 2 x 1 1 0 1 2 3 4 0 0 The “contour lines” of the logistic function are parallel lines/hyperplanes. Christian Borgelt Data Mining / Intelligent Data Analysis 222 Regression: Generalization, Logistic Regression Generalization of regression to non-polynomial functions . y = ax b Simple example: Idea: Find a transformation to the linear/polynomial case . Transformation for the above example: ln y = ln a + b · ln x. y ′ = ln y x ′ = ln x . ⇒ Linear regression for the transformed data and a ⊤ Special case: Logistic Function (mit a 0 = � � x 0 ) a ⊤ y = 1 + e − ( � � x + a 0 ) y max 1 y max − y a ⊤ = e − ( � � x + a 0 ) . ⇔ ⇔ y = a ⊤ 1 + e − ( � � x + a 0 ) y max y Result: Apply so-called Logit Transform � � y a ⊤ z = ln = � � x + a 0 . y max − y Christian Borgelt Data Mining / Intelligent Data Analysis 223 Logistic Regression: Example Data points: x 1 2 3 4 5 y 0.4 1.0 3.0 5.0 5.6 Apply the logit transform � � y z = ln , y max = 6 . y max − y Transformed data points: (for linear regression) x 1 2 3 4 5 z − 2 . 64 − 1 . 61 0.00 1.61 2.64 The resulting regression line and therefore the desired function are 6 6 z ≈ 1 . 3775 x − 4 . 133 y ≈ 1 + e − (1 . 3775 x − 4 . 133) ≈ and 1 + e − 1 . 3775( x − 3) . Attention: Note that the error is minimized only in the transformed space! Therefore the function in the original space may not be optimal! Christian Borgelt Data Mining / Intelligent Data Analysis 224
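A minimal sketch reproducing the logistic regression example above via the logit transform (y_max = 6); as stated on the slide, the squared error is minimized only in the transformed space.

```python
# Minimal sketch: logistic regression by logit transform + linear regression,
# reproducing the example data set from the slides.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0.4, 1.0, 3.0, 5.0, 5.6])
y_max = 6.0

z = np.log(y / (y_max - y))                   # logit transform
b, a = np.polyfit(x, z, deg=1)                # linear regression in the z-space
print(round(b, 4), round(a, 4))               # approx. 1.3775 and -4.133

y_fit = y_max / (1.0 + np.exp(-(a + b * x)))  # back-transformed logistic function
print(np.round(y_fit, 2))
```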

  53. Logistic Regression: Example z y Y = 6 4 6 3 5 2 4 1 x 0 3 1 2 3 4 5 − 1 2 − 2 1 − 3 x 0 − 4 0 1 2 3 4 5 The resulting regression line and therefore the desired function are 6 6 z ≈ 1 . 3775 x − 4 . 133 and y ≈ 1 + e − (1 . 3775 x − 4 . 133) ≈ 1 + e − 1 . 3775( x − 3) . Attention: Note that the error is minimized only in the transformed space! Therefore the function in the original space may not be optimal! Christian Borgelt Data Mining / Intelligent Data Analysis 225 Multivariate Logistic Regression: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 x 2 1 2 0 1 2 3 4 x 1 1 3 4 0 • Example data were drawn from a logistic function and noise was added. (The gray “contour lines” show the ideal logistic function.) • Reconstructing the logistic function can be reduced to a multivariate linear regres- sion by applying a logit transform to the y -values of the data points. Christian Borgelt Data Mining / Intelligent Data Analysis 226 Multivariate Logistic Regression: Example 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 1 2 x 2 2 0 1 2 3 4 x 1 1 3 0 4 • The black “contour lines” show the resulting logistic function. Is the deviation from the ideal logistic function (gray) caused by the added noise? • Attention: Note that the error is minimized only in the transformed space! Therefore the function in the original space may not be optimal! Christian Borgelt Data Mining / Intelligent Data Analysis 227 Logistic Regression: Optimization in Original Space Approach analogous to linear/polynomial regression Given: data set D = { ( � x 1 , y 1 ) , . . . , ( � x n , y n ) } with n data points, y i ∈ (0 , 1). x ∗ i = (1 , x i 1 , . . . , x im ) ⊤ und � a = ( a 0 , a 1 , . . . , a m ) ⊤ . Simplification: Use � x ∗ (By the leading 1 in � i the constant a 0 is captured.) Minimize sum of squared errors / deviations: � � 2 n � 1 ! y i − F ( � a ) = = min . a ⊤ x ∗ 1 + e − � � i i =1 Necessary condition for a minimum: ! � = � Gradient of the objective function F ( � a ) w.r.t. � a vanishes: ∇ a F ( � a ) 0 � Problem: The resulting equation system is not linear. Solution possibilities: • Gradient descent on objective function F ( � a ). • Root search on gradient � ∇ a F ( � a ). (e.g. Newton–Raphson method) � Christian Borgelt Data Mining / Intelligent Data Analysis 228

  54. Reminder: Gradient Methods for Optimization The gradient is a differential operator, that turns a scalar function into a vector field. � ∇ z | � p =( x 0 ,y 0 ) y 0 Illustration of the gradient of z ∂z ∂y | � ∂z p ∂x | � a real-valued function z = f ( x, y ) p at a point � p = ( x 0 , y 0 ). y x 0 � � � � � � It is � ∂z � x 0 , ∂z ∇ z | ( x 0 ,y 0 ) = . � y 0 ∂x ∂y x The gradient at a point shows the direction of the steepest ascent of the function at this point; its length describes the steepness of the ascent. Principle of gradient methods: Starting at a (possibly randomly chosen) initial point, make (small) steps in (or against) the direction of the gradient of the objective function at the current point, until a maximum (or a minimum) has been reached. Christian Borgelt Data Mining / Intelligent Data Analysis 229 Gradient Methods: Cookbook Recipe Idea: Starting from a randomly chosen point in the search space, make small steps in the search space, always in the direction of the steepest ascent (or descent) of the function to optimize, until a (local) maximum (or minimum) is reached. � � ⊤ x (0) = x (0) 1 , . . . , x (0) 1. Choose a (random) starting point � n x ( i ) : 2. Compute the gradient of the objective function f at the current point � � � ⊤ � � � � ∂ � ∂ � ∇ x f ( � x ) x ( i ) = ∂x 1 f ( � x ) , . . . , ∂x n f ( � x ) � � � x ( i ) � x ( i ) � n 1 3. Make a small step in the direction (or against the direction) of the gradient: � x ( i ) � + : gradient ascent x ( i +1) = � x ( i ) ± η ∇ � x f � . � − : gradient descent η is a step width parameter (“learning rate” in artificial neuronal networks) 4. Repeat steps 2 and 3, until some termination criterion is satisfied. (e.g., a certain number of steps has been executed, current gradient is small) Christian Borgelt Data Mining / Intelligent Data Analysis 230 Gradient Descent: Simple Example f ( x ) = 5 6 x 4 − 7 x 3 + 115 6 x 2 − 18 x + 6 , Example function: f ′ ( x i ) i x i f ( x i ) ∆ x i 6 0 0 . 200 3 . 112 − 11 . 147 0 . 111 1 0 . 311 2 . 050 − 7 . 999 0 . 080 starting point 5 2 0 . 391 1 . 491 − 6 . 015 0 . 060 4 − 4 . 667 3 0 . 451 1 . 171 0 . 047 4 0 . 498 0 . 976 − 3 . 704 0 . 037 3 5 0 . 535 0 . 852 − 2 . 990 0 . 030 6 0 . 565 0 . 771 − 2 . 444 0 . 024 2 7 0 . 589 0 . 716 − 2 . 019 0 . 020 global optimum 8 0 . 610 0 . 679 − 1 . 681 0 . 017 1 9 0 . 626 0 . 653 − 1 . 409 0 . 014 x 0 10 0 . 640 0 . 635 0 1 2 3 4 Gradient descent with initial value 0 . 2 and step width/learning rate 0 . 01. Due to a proper step width/learning rate, the minimum is approached fairly quickly. Christian Borgelt Data Mining / Intelligent Data Analysis 231 Logistic Regression: Gradient Descent 1 With the abbreviation f ( z ) = 1+ e − z for the logistic function it is n n � � i )) 2 = � a ) = � a ⊤ x ∗ a ⊤ x ∗ i )) · f ′ ( � a ⊤ x ∗ x ∗ ∇ a F ( � ∇ ( y i − f ( � � − 2 ( y i − f ( � � � i ) · � i . � a � i =1 i =1 Derivative of the logistic function: (cf. Bernoulli differential equation) � 1 + e − z � − 1 � 1 + e − z � − 2 � − e − z � d f ′ ( z ) = = − d z 1 + e − z − 1 � � 1 1 = = 1 − = f ( z ) · (1 − f ( z )) , (1 + e − z ) 2 1 + e − z 1 + e − z y y 1 1 1 1 2 2 1 4 x x 0 0 − 4 − 2 0 +2 +4 − 4 − 2 0 +2 +4 Christian Borgelt Data Mining / Intelligent Data Analysis 232
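A minimal sketch reproducing the one-dimensional gradient descent example above (initial value 0.2, step width / learning rate 0.01); the derivative is written out explicitly.

```python
# Minimal sketch of the gradient descent example:
# f(x) = 5/6 x^4 - 7 x^3 + 115/6 x^2 - 18 x + 6, start 0.2, step width 0.01.
def f(x):
    return 5/6 * x**4 - 7 * x**3 + 115/6 * x**2 - 18 * x + 6

def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

x, eta = 0.2, 0.01
for i in range(11):
    print(f"i={i:2d}  x={x:.3f}  f(x)={f(x):.3f}  f'(x)={f_prime(x):8.3f}")
    x -= eta * f_prime(x)          # step against the direction of the gradient
```

The printed values should match the table on the slide (x = 0.200, 0.311, 0.391, ...), approaching the minimum fairly quickly for this step width.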

  55. Logistic Regression: Gradient Descent Given: data set D = { ( � x 1 , y 1 ) , . . . , ( � x n , y n ) } with n data points, y i ∈ (0 , 1). i = (1 , x i 1 , . . . , x im ) ⊤ and � x ∗ a = ( a 0 , a 1 , . . . , a m ) ⊤ . Simplification: Use � Gradient descent on the objective function F ( � a ) : • Choose as the initial point � a 0 the result of a logit transform and a linear regression (or merely a linear regression). • Update of the parameters � a : a t − η 2 · � a t +1 = � � ∇ a F ( � a ) | � � a t n � a ⊤ x ∗ a ⊤ x ∗ a ⊤ x ∗ x ∗ = � a t + η · ( y i − f ( � t � i )) · f ( � t � i ) · (1 − f ( � t � i )) · � i , i =1 where η is a step width parameter to be chosen by a user (e.g. η = 0 . 05) (in the area of artificial neural networks also called “learning rate”). • Repeat the update step until convergence, e.g. until a t || < τ with a chosen threshold τ (z.B. τ = 10 − 6 ). || � a t +1 − � Christian Borgelt Data Mining / Intelligent Data Analysis 233 Multivariate Logistic Regression: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 x 2 1 2 0 1 2 3 4 x 1 1 3 4 0 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in original space. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 234 Reminder: Newton–Raphson Method • The Newton–Raphson method starting point is an iterative numeric algorithm to approximate a root of a function. • Idea: use slope/direction of a tangent to the function at a current point to find the next approximation. root f ( x ) • Formally: f ( � x t ) x t = − � x t +1 = � x t +∆ � x t ; ∆ � � � � ∇ x f ( � x ) � � � ∆ x x x t • The gradient describes the direction of steepest ascent of tangent (hyper-)planes to the function (in one dimension: the slope of tangents to the function). • In one dimension (see diagram): Solve ( x t +1 , 0) = ( x t , f ( x t )) ⊤ + k · (1 , d d x f ( x ) | x t ) ⊤ , k ∈ I R, for x t +1 . Since 0 = f ( x t ) + k · d d x f ( x ) | x t , it is k = − f ( x t ) / d d x f ( x ) | x t . Christian Borgelt Data Mining / Intelligent Data Analysis 235 Newton–Raphson Method for Finding Optima • The standard Newton-Raphson method finds roots of functions. • By apply ing it to the gradient of a function, it may be used to find optima (minima, maxima, or saddle points), because a vanishing gradient is a necessary condition for an optimum. • In this case the update formula is � � � − 1 � ∇ 2 � � � · � x t +1 = � � x t + ∆ � x t , ∆ � x t = − x f ( � x ) ∇ x f ( � x ) x t , � � � � � � x t where � ∇ 2 x f ( � x ) is the so-called Hessian matrix , that is, � the matrix of second-order partial derivatives of the scalar-valued function f . � • In one dimension: ∂f ( x ) � � x t ∂x − x t +1 = x t + ∆ x t , ∆ x t = . � ∂ 2 f ( x ) � � x t ∂x 2 • The Newton–Raphson method usually converges much faster than gradient descent (... and needs no step width parameter!). Christian Borgelt Data Mining / Intelligent Data Analysis 236
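A minimal sketch of logistic regression by gradient descent on the sum of squared errors, following the update rule above. The data set is an illustrative assumption, the initial point is obtained by a logit transform and linear regression (as suggested on the slides), and the step width is chosen smaller than the value mentioned there to keep the summed gradient stable for this particular data set.

```python
# Minimal sketch: logistic regression, gradient descent on the squared error.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
X = rng.uniform(0, 4, size=(100, 2))
y = logistic(X @ np.array([1.0, 1.0]) - 4.0)                          # ideal logistic targets
y = np.clip(y + rng.normal(scale=0.05, size=len(y)), 0.02, 0.98)      # add noise, keep y in (0, 1)

Xs = np.column_stack([np.ones(len(X)), X])                            # x* = (1, x_1, ..., x_m)
a0, *_ = np.linalg.lstsq(Xs, np.log(y / (1 - y)), rcond=None)         # initial point: logit + linear regression

a, eta = a0.copy(), 0.005          # step width chosen small enough for this (summed) gradient
for _ in range(30000):
    p = logistic(Xs @ a)
    step = eta * (((y - p) * p * (1 - p)) @ Xs)   # eta * sum_i (y_i - f) f (1 - f) x_i*
    if np.linalg.norm(step) < 1e-6:               # convergence criterion ||a_{t+1} - a_t|| < tau
        break
    a += step

print("logit transform + linear regression:", np.round(a0, 3))
print("after gradient descent             :", np.round(a, 3))
# both should be close to (-4, 1, 1) for these (mildly noisy) data
```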

  56. Logistic Regression: Newton–Raphson 1 With the abbreviation f ( z ) = 1+ e − z for the logistic function it is   n � � ∇ 2 � a ⊤ x ∗ a ⊤ x ∗ a ⊤ x ∗ x ∗ ∇  − 2 ( y i − f ( � i )) · f ( � i ) · (1 − f ( � i )) · �  a F ( � a ) = � � � � a � i i =1 n � � � = − 2 � a ⊤ x ∗ i ) − ( y i + 1) f 2 ( � a ⊤ x ∗ i ) + f 3 ( � a ⊤ x ∗ x ∗ ∇ y i f ( � � � � i ) · � a � i i =1 n � � � a ⊤ x ∗ i ) + 3 f 2 ( � a ⊤ x ∗ · f ′ ( � a ⊤ x ∗ x ∗ x ∗ ⊤ = − 2 y i − 2( y i + 1) f ( � � � i ) � i ) · � i � i i =1 where again f ′ ( z ) = f ( z ) · (1 − f ( z )) (as derived above). Thus we get for the update of the parameters � a : (note: no step width η ) � � � − 1 � ∇ 2 � � � · � a t +1 = � � a t − a F ( � a ) ∇ a F ( � a ) � � � � � � a t a t � � � � with � ∇ 2 a t as shown above and � ∇ a F ( � a ) � � a F ( � a ) � � a t as the expression in large parentheses. � � Christian Borgelt Data Mining / Intelligent Data Analysis 237 Logistic Classification: Two Classes 1 Logistic function with Y = 1: probability 1 y = f ( x ) = 1 class 0 1 + e − a ( x − x 0 ) 2 class 1 probability Interpret the logistic function x 0 as the probability of one class. x 0 − 4 x 0 − 2 x 0 + 2 x 0 + 4 x 0 a a a a a = ( a 0 , x 1 , . . . , x m ) ⊤ � x ∗ = (1 , x 1 , . . . , x m ) ⊤ • Conditional class probability is logistic function: � 1 P ( C = c 1 | � X = � x ) = p 1 ( � x ) = p ( � x ; � a ) = x ∗ . a ⊤ 1 + e − � � • With only two classes the conditional probability of the other class is: P ( C = c 0 | � X = � x ) = p 1 ( � x ) = 1 − p ( � x ; � a ) . � c 1 , if p ( � • Classification rule: x ; � a ) ≥ θ, C = θ = 0 . 5 . c 0 , if p ( � x ; � a ) < θ, Christian Borgelt Data Mining / Intelligent Data Analysis 238 Logistic Classification x 2 4 3 class 1 1 2 y 4 3 1 class 0 x 2 2 0 4 x 1 3 1 0 2 x 1 1 0 1 2 3 4 0 0 • The classes are separated at the “contour line” p ( � x ; � a ) = θ = 0 . 5 (inflection line). (The classification boundary is linear, therefore linear classification .) • Via the classification threshold θ , which need not be θ = 0 . 5, misclassification costs may be incorporated. Christian Borgelt Data Mining / Intelligent Data Analysis 239 Logistic Classification: Example 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 2 0 1 2 3 4 x 1 1 3 4 0 • In finance (e.g. when assessing the credit worthiness of businesses) logistic classification is often applied in discrete spaces, that are spanned e.g. by binary attributes and expert assessments. (e.g. assessments of the range of products, market share, growth etc.) Christian Borgelt Data Mining / Intelligent Data Analysis 240

  57. Logistic Classification: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 0 1 2 3 4 x 2 1 1 3 4 0 • In such a case multiple businesses may fall onto the same grid point. • Then probabilities may be estimated from observed credit defaults: x ) = #defaults( � x ) + γ ( γ : Laplace correction, e.g. γ ∈ { 1 p default ( � 2 , 1 } ) � #loans( � x ) + 2 γ Christian Borgelt Data Mining / Intelligent Data Analysis 241 Logistic Classification: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 x 2 1 2 0 1 2 3 4 x 1 1 3 4 0 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in original space. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 242 Logistic Classification: Example 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 1 2 x 2 2 0 1 2 3 4 x 1 1 3 0 4 • More frequent is the case in which at least some attributes are metric and for each point a class, but no class probability is available. • If we assign class 0: c 0 � = y = 0 and class 1: c 1 � = y = 1, the logit transform is not applicable. Christian Borgelt Data Mining / Intelligent Data Analysis 243 Logistic Classification: Example 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 2 0 1 2 3 4 x 1 1 3 4 0 • The logit transform becomes applicable by mapping the classes to � ǫ � � 1 − ǫ � � ǫ � c 1 = y = ln and c 0 = y = ln = − ln . � � 1 − ǫ ǫ 1 − ǫ • The value of ǫ ∈ (0 , 1 2 ) is irrelevant (i.e., the result is independent of ǫ and equivalent to a linear regression with c 0 � = y = 0 and c 1 � = y = 1). Christian Borgelt Data Mining / Intelligent Data Analysis 244

  58. Logistic Classification: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 0 1 2 3 4 x 2 1 1 3 4 0 • Logit transform and linear regression often yield suboptimal results: Depending on the distribution of the data points relativ to a(n optimal) separating hyperplane the computed separating hyperplane can be shifted and/or rotated. • This can lead to (unnecessary) misclassifications! Christian Borgelt Data Mining / Intelligent Data Analysis 245 Logistic Classification: Example 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 x 2 1 2 0 1 2 3 4 x 1 1 3 4 0 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in original space. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 246 Logistic Classification: Maximum Likelihood Approach A likelihood function describes the probability of observed data depending on the parameters � a of the (conjectured) data generating process. Here: logistic function to describe the class probabilities. a ⊤ x ∗ ), class y = 1 occurs with probability p 1 ( � x ) = f ( � � a ⊤ x ∗ ), class y = 0 occurs with probability p 0 ( � x ) = 1 − f ( � � x ∗ = (1 , x 1 , . . . , x m ) ⊤ and � 1 a = ( a 0 , a 1 , . . . , a m ) ⊤ . with f ( z ) = 1+ e − z and � Likelihood function for the data set D = { ( � x 1 , y 1 ) , . . . , � x n , y n ) } with y i ∈ { 0 , 1 } : n � x i ) y i · p 0 ( � x i ) 1 − y i L ( � a ) = p 1 ( � i =1 n � a ⊤ x ∗ i ) y i · (1 − f ( � a ⊤ x ∗ i )) 1 − y i = f ( � � � i =1 Maximum Likelihood Approach: Find the set of parameters � a , which renders the occurrence of the (observed) data most likely. Christian Borgelt Data Mining / Intelligent Data Analysis 247 Logistic Classification: Maximum Likelihood Approach Simplification by taking the logarithm: log likelihood function n � � � a ⊤ x ∗ a ⊤ x ∗ ln L ( � a ) = y i · ln f ( � � i ) + (1 − y i ) · ln(1 − f ( � � i )) i =1   a ⊤ x ∗ n e − ( � � i ) � 1  y i · ln i ) + (1 − y i ) · ln  = a ⊤ x ∗ a ⊤ x ∗ 1 + e − ( � � 1 + e − ( � � i ) i =1 � i ) �� n � � a ⊤ x ∗ a ⊤ x ∗ 1 + e − ( � � = ( y i − 1) · � � i − ln i =1 Necessary condition for a maximum: ! � = � Gradient of the objective function ln L ( � a ) w.r.t. � a vanishes: ∇ a ln L ( � a ) 0 � Problem: The resulting equation system is not linear. Solution possibilities: • Gradient descent on objective function ln L ( � a ). • Root search on gradient � ∇ a ln L ( � a ). (e.g. Newton–Raphson method) � Christian Borgelt Data Mining / Intelligent Data Analysis 248

  59. Logistic Classification: Gradient Ascent 1 Gradient of the log likelihood function: (mit f ( z ) = 1+ e − z ) � i ) �� n � � a ⊤ x ∗ � � a ⊤ x ∗ 1 + e − ( � � ∇ a ln L ( � a ) = ∇ ( y i − 1) · � � i − ln � a � i =1   a ⊤ x ∗ n e − ( � � i ) � x ∗ x ∗  ( y i − 1) · �  = i + i ) · � i a ⊤ x ∗ 1 + e − ( � � i =1 n � � � x ∗ a ⊤ x ∗ x ∗ = ( y i − 1) · � i + (1 − f ( � � i )) · � i i =1 n � � � a ⊤ x ∗ x ∗ = ( y i − f ( � � i )) · � i i =1 As a comparison: Gradient of the sum of squared errors / deviations: n � � a ⊤ x ∗ a ⊤ x ∗ a ⊤ x ∗ x ∗ ∇ a F ( � a ) = − 2 ( y i − f ( � � i )) · f ( � � i ) · (1 − f ( � � i )) · � � i � �� � i =1 additional factor: derivative of the logistic function Christian Borgelt Data Mining / Intelligent Data Analysis 249 Logistic Classification: Gradient Ascent Given: data set D = { ( � x 1 , y 1 ) , . . . , ( � x n , y n ) } with n data points, y ∈ { 0 , 1 } . i = (1 , x i 1 , . . . , x im ) ⊤ and � x ∗ a = ( a 0 , a 1 , . . . , a m ) ⊤ . Simplification: Use � Gradient ascent on the objective function ln L ( � a ) : • Choose as the initial point � a 0 the result of a logit transform and a linear regression (or merely a linear regression). • Update of the parameters � a : a t + η · � ∇ a ) | � � a t +1 = � a ln L ( � � a t n � a ⊤ x ∗ x ∗ = � a t + η · ( y i − f ( � t � i )) · � i , i =1 where η is a step width parameter to be chosen by a user (e.g. η = 0 . 01). a ⊤ x ∗ a ⊤ x ∗ i ) · (1 − f ( � (Comparison with gradient descent: missing factor f ( � t � t � i )).) • Repeat the update step until convergence, e.g. until a t || < τ with a chosen threshold τ (z.B. τ = 10 − 6 ). || � a t +1 − � Christian Borgelt Data Mining / Intelligent Data Analysis 250 Logistic Classification: Example 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 1 2 x 2 2 0 1 2 3 4 x 1 1 3 0 4 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in the original space. • Magenta “contour line”: gradient ascent on log likelihood function. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 251 Logistic Classification: No Gap Between Classes 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 2 0 1 2 3 4 x 1 1 3 4 0 • If there is no (clear) gap between the classes a logit transform and subsequent linear regression yields (unnecessary) misclassifications even more often. • In such a case the alternative methods are clearly preferable! Christian Borgelt Data Mining / Intelligent Data Analysis 252
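A minimal sketch of logistic classification by gradient ascent on the log likelihood, using the simplified gradient derived above; the two-class data set is an illustrative assumption, and the step width is chosen small enough for the summed gradient on this data set.

```python
# Minimal sketch: logistic classification, gradient ascent on ln L(a).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(9)
X  = rng.uniform(0, 4, size=(100, 2))                    # two numeric attributes
y  = (X[:, 0] + X[:, 1] - 4 + rng.normal(scale=0.5, size=100) > 0).astype(float)  # classes 0 / 1
Xs = np.column_stack([np.ones(len(X)), X])               # x* = (1, x_1, ..., x_m)

a, eta = np.zeros(Xs.shape[1]), 0.002    # step width chosen small enough for the summed gradient
for _ in range(50000):
    grad = Xs.T @ (y - logistic(Xs @ a))                 # gradient of ln L(a), see above
    if np.linalg.norm(eta * grad) < 1e-6:                # convergence criterion
        break
    a += eta * grad

print(np.round(a, 3))   # the inflection line p = 0.5 should lie roughly along x_1 + x_2 = 4
```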

  60. Logistic Classification: No Gap Between Classes 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 1 x 2 0 1 2 3 4 x 2 1 1 3 4 0 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in the original space. • Magenta “contour line”: gradient ascent on log likelihood function. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 253 Logistic Classification: Overlapping Classes 4 1 3 0.8 0.6 2 y 0.4 1 0.2 4 0 3 0 0 2 x 2 1 2 0 1 2 3 4 x 1 1 3 4 0 • Even more problematic is the situation if the classes overlap (i.e., there is no perfect separating line/hyperplane). • In such a case even the other methods cannot avoid misclassifications. (There is no way to be better than the pure or Bayes error.) Christian Borgelt Data Mining / Intelligent Data Analysis 254 Logistic Classification: Overlapping Classes 4 1 3 0.8 2 0.6 y 0.4 1 0.2 4 0 3 0 0 1 2 x 2 2 0 1 2 3 4 x 1 1 3 0 4 • Black “contour line”: logit transform and linear regression. • Green “contour line”: gradient descent on error function in the original space. • Magenta “contour line”: gradient ascent on log likelihood function. (For simplicity and clarity only the “contour lines” for y = 0 . 5 (inflection lines) are shown.) Christian Borgelt Data Mining / Intelligent Data Analysis 255 Logistic Classification: Newton–Raphson 1 With the abbreviation f ( z ) = 1+ e − z for the logistic function it is n � � � ∇ 2 � � a ⊤ x ∗ x ∗ ∇ ( y i − f ( � i )) · � a ln L ( � a ) = � � � a i i =1 n � � � � x ∗ a ⊤ x ∗ x ∗ = ∇ ( y i � i − f ( � � i ) · � i ) � a i =1 n n � � f ′ ( � a ⊤ x ∗ x ∗ x ∗ ⊤ a ⊤ x ∗ a ⊤ x ∗ x ∗ x ∗ ⊤ = − � i ) · � i � = − f ( � � i ) · (1 − f ( � � i )) · � i � i i i =1 i =1 where again f ′ ( z ) = f ( z ) · (1 − f ( z )) (as derived above). Thus we get for the update of the parameters � a : (note: no step width η ) � � − 1 � � � � ∇ 2 � · � � a t +1 = � a t − a ln L ( � a ) ∇ a ln L ( � a ) � � � � � � a t a t     − 1 n n � � a ⊤ x ∗ a ⊤ x ∗ x ∗ x ∗ ⊤ a ⊤ x ∗ x ∗     . = � a t + f ( � t � i ) · (1 − f ( � t � i )) · � i � · ( y i − f ( � t � i )) · � i i i =1 i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 256
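A minimal sketch of the Newton-Raphson update for logistic classification shown above (no step width parameter); the overlapping two-class data set is an illustrative assumption.

```python
# Minimal sketch: logistic classification, Newton-Raphson on ln L(a).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X  = rng.uniform(0, 4, size=(200, 2))
y  = (X[:, 0] + X[:, 1] - 4 + rng.normal(scale=0.5, size=200) > 0).astype(float)
Xs = np.column_stack([np.ones(len(X)), X])

a = np.zeros(Xs.shape[1])
for _ in range(25):
    p = logistic(Xs @ a)
    grad    = Xs.T @ (y - p)                    # gradient of the log likelihood
    hessian = -(Xs.T * (p * (1 - p))) @ Xs      # Hessian: -sum_i f (1 - f) x_i* x_i*^T
    a_new = a - np.linalg.solve(hessian, grad)  # Newton step, no step width parameter
    if np.linalg.norm(a_new - a) < 1e-6:
        a = a_new
        break
    a = a_new

print(np.round(a, 3))   # typically converges within a handful of iterations
```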

  61. Robust Regression • Solutions of (ordinary) least squares regression can be strongly affected by outliers. The reason for this is obviously the squared error function, which weights outliers fairly heavily (quadratically). • More robust results can usually be obtained by minimizing the sum of absolute deviations ( least absolute deviations, LAD ). • However, this approach has the disadvantage of not being analytically solvable (like least squares) and thus has to be addressed with iterative methods right from the start. • In addition, least absolute deviation solutions can be unstable in the sense that small changes in the data can lead to “jumps” (discontinuous changes) of the solution parameters. Instead, least squares solutions always changes “smoothly” (continuously). • Finally, severe outliers can still have a distorting effect on the solution. Christian Borgelt Data Mining / Intelligent Data Analysis 257 Robust Regression • In order to improve the robustness of the procedure, more sophisticated regression methods have been developed: robust regression , which include: ◦ M-estimation and S-estimation for regression and ◦ least trimmed squares (LTS) , which simply uses a subset of at least half the size of the data set that yields the smallest sum of squared errors. Here, we take a closer look at M-estimators. • We rewrite the error functional (that is, the sum of squared errors) to be minimized in the form n n � � x ⊤ F ρ ( a, b ) = ρ ( e i ) = ρ ( � i � a − y i ) i =1 i =1 where ρ ( e i ) = e 2 i and e i is the (signed) error of the regression function at the i th point, that is e i = e ( x i , y i ,� a ) = f � a ( x i ) − y i , where f is the conjectured regression function family with parameters � a . Christian Borgelt Data Mining / Intelligent Data Analysis 258 Robust Regression: M-Estimators • Is ρ ( e i ) = e 2 i the only reasonable choice for the function ρ ? Certainly not. • However, ρ should satisfy at least some reasonable restrictions. ◦ The function ρ should alway be positive, except for the case e i = 0. ◦ The sign of the error e i should not matter for ρ . ◦ ρ should be increasing when the absolute value of the error increases. • These requirements can formalized in the following way: ρ ( e ) ≥ 0 , ρ (0) = 0 , ρ ( e ) = ρ ( − e ) , ρ ( e i ) ≥ ρ ( e j ) if | e i | ≥ | e j | . • Parameter estimation (here the estimation of the parameter vector � a ) is based on an objective function of the form n n � � x ⊤ F ρ ( a, b ) = ρ ( e i ) = ρ ( � i � a − y i ) i =1 i =1 and an error measure satisfying the above conditions is called an M-estimator . Christian Borgelt Data Mining / Intelligent Data Analysis 259 Robust Regression: M-Estimators • Parameter estimation (here the estimation of the parameter vector � a ) based on an objective function of the form n n � � x ⊤ F ρ ( a, b ) = ρ ( e i ) = ρ ( � i � a − y i ) i =1 i =1 and an error measure satisfying the above conditions is called an M-estimator . • Examples of such estimators are: Method ρ ( e ) e 2 Least squares � 1 2 e 2 if | e | ≤ k, Huber k | e | − 1 2 k 2 if | e | > k .  � � 2 � 3 � � � e   k 2  1 − 1 − , if | e | ≤ k , 6 k Tukey’s bisquare   k 2  if | e | > k . 6 , Christian Borgelt Data Mining / Intelligent Data Analysis 260

  62. Robust Regression • In order to understand the more general setting of an error measure ρ , it is useful to consider the derivative ψ = ρ ′ . • Taking the derivatives of the objective function n n � � x ⊤ F ρ ( a, b ) = ρ ( e i ) = ρ ( � i � a − y i ) i =1 i =1 with respect to the parameters a i , we obtain a system of ( m + 1) equations n � x ⊤ x ⊤ ψ i ( � i � a − y i ) � i = 0 . i =1 • Defining w ( e ) = ψ ( e ) /e and w i = w ( e i ), the system of linear equations can be rewritten in the form n x ⊤ n � ψ i ( � i � a − y i ) � x ⊤ w i · ( y i − x ⊤ i b ) · x ⊤ · e i · � = = 0 . i i e i i =1 i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 261 Robust Regression • Solving this system of linear equations corresponds to solving � n i =1 w i e 2 a standard least squares problem with (non-fixed) weights in the form i . • However, the weights w i depend on the residuals e i , the residuals depend on the coefficients a i , and the coefficients depend on the weights. • Therefore, it is in general not possible to provide an explicit solution. • Instead of an analytical solution, the following iteration scheme is applied: a (0) , 1. Choose an initial solution � for instance the standard least squares solution setting all weights to w i = 1. 2. In each iteration step t , calculate the residuals e ( t − 1) and the corresponding � e ( t − 1) � weights w ( t − 1) = w determined by the previous step. � n i =1 w i e 2 3. Solve the weighted least squares problem i which leads to � � − 1 X ⊤ W ( t − 1) � a ( t ) = X ⊤ W ( t − 1) X � y, where W stands for a diagonal matrix with weights w i on the diagonal. Christian Borgelt Data Mining / Intelligent Data Analysis 262 Robust Regression • The error measures and the weights are related as follows: Method w ( e ) Least squares 1 � 1 , if | e | ≤ k , Huber k/ | e | , if | e | > k .  � � 2 � 2 � e   1 − , if | e | ≤ k, Tukey’s bisquare k   0 , if | e | > k . • Note that the weights are an additional result of the procedure (beyond the actually desired regression function). • They provide information which data points may be considered as outliers (those with low weights, as this indicates that they have not been fitted well). • Note also that the weights may be plotted even for high-dimensional data sets (using some suitable arrangement of the data points, e.g., sorted by weight). Christian Borgelt Data Mining / Intelligent Data Analysis 263 Robust Regression (ordinary) least squares Huber Tukey’s bisquare 8 40 3 6 30 ρ 1 . 5 ( e ) ρ 4 . 5 ( e ) 2 ρ ( e ) 4 20 1 2 10 0 0 0 –6 –4 –2 0 2 4 6 –6 –4 –2 0 2 4 6 –6 –4 –2 0 2 4 6 e e e error error error 1 1 1 0.8 0.8 0.8 w 1 . 5 ( e ) w 4 . 5 ( e ) 0.6 0.6 0.6 w 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 –6 –4 –2 0 2 4 6 –6 –4 –2 0 2 4 6 –6 –4 –2 0 2 4 6 e e e weight weight weight Christian Borgelt Data Mining / Intelligent Data Analysis 264
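A minimal sketch of the iteration scheme above (iteratively reweighted least squares) with Huber weights w(e) = 1 for |e| ≤ k and k/|e| otherwise; the data set, which contains one severe outlier, and the cut-off k are illustrative assumptions.

```python
# Minimal sketch: robust regression by iteratively reweighted least squares (Huber).
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(0, 5, 11)
y = 1.0 + 0.6 * x + rng.normal(scale=0.1, size=x.size)
y[7] -= 6.0                                       # a single severe outlier

X = np.column_stack([np.ones_like(x), x])
k = 1.5                                           # Huber cut-off (assumed)

def huber_weights(e):
    """Huber weight function w(e) = psi(e) / e."""
    ae = np.maximum(np.abs(e), 1e-12)             # avoid division by zero
    return np.where(ae <= k, 1.0, k / ae)

a_ols = np.linalg.solve(X.T @ X, X.T @ y)         # ordinary least squares for comparison

w = np.ones_like(y)                               # step 1: start from ordinary least squares
for _ in range(50):
    W = np.diag(w)
    a = np.linalg.solve(X.T @ W @ X, X.T @ W @ y) # weighted least squares step
    e = X @ a - y                                 # residuals of the current fit
    w_new = huber_weights(e)
    if np.max(np.abs(w_new - w)) < 1e-8:          # stop when the weights stabilize
        w = w_new
        break
    w = w_new

print("ordinary least squares (intercept, slope):", np.round(a_ols, 3))
print("robust IRLS with Huber weights           :", np.round(a, 3))
print("weights:", np.round(w, 2))   # the outlier receives a clearly reduced weight
```

The robust fit should be much less affected by the outlier than the ordinary least squares fit, and the weight vector identifies the outlier, as discussed above.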

  63. Robust Regression • The (ordinary) least squares error increases in a quadratic manner with increasing distance. The weights are always constant. This means that extreme outliers will have full influence on the regression coefficients and can corrupt the result completely. • In the more robust approach by Huber the change of the error measure ρ switches from a quadratic increase for small errors to a linear increase for larger errors. As a result, only data points with small errors will have full influence on the regression coefficients. For extreme outliers the weights tend to zero. • Tukey’s bisquare approach is even more drastic than Huber’s. For larger errors the error measure ρ does not increase at all, but remains constant. As a consequence, the weights for outliers drop to zero if they are too far away from the regression curve. This means that extreme outliers have no influence on the regression curve at all. Christian Borgelt Data Mining / Intelligent Data Analysis 265 Robust Regression: Example 4 1 regression weight 2 0.8 0 y 0.6 –2 0.4 –4 0.2 –6 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 4 5 data point index x • There is one outlier that leads to the red regression line that neither fits the outlier nor the other points. • With robust regression, for instance based on Huber’s ρ -function, we obtain the blue regression line that simply ignores the outlier. • An additional result are the computed weights for the data points. In this way, outliers can be identified by robust regression. Christian Borgelt Data Mining / Intelligent Data Analysis 266 Summary Regression • Minimize the Sum of Squared Errors ◦ Write the sum of squared errors as a function of the parameters to be determined. • Exploit Necessary Conditions for a Minimum ◦ Partial derivatives w.r.t. the parameters to determine must vanish. • Solve the System of Normal Equations ◦ The best fit parameters are the solution of the system of normal equations. • Non-polynomial Regression Functions ◦ Find a transformation to the multipolynomial case. ◦ Logistic regression can be used to solve two class classification problems. • Robust Regression ◦ Reduce the influence of outliers by using different error measures. Christian Borgelt Data Mining / Intelligent Data Analysis 267 Bayes Classifiers Christian Borgelt Data Mining / Intelligent Data Analysis 268

  64. Bayes Classifiers • Probabilistic Classification and Bayes’ Rule • Naive Bayes Classifiers ◦ Derivation of the classification formula ◦ Probability estimation and Laplace correction ◦ Simple examples of naive Bayes classifiers ◦ A naive Bayes classifier for the Iris data • Full Bayes Classifiers ◦ Derivation of the classification formula ◦ Comparison to naive Bayes classifiers ◦ A simple example of a full Bayes classifier ◦ A full Bayes classifier for the Iris data • Summary Christian Borgelt Data Mining / Intelligent Data Analysis 269 Probabilistic Classification • A classifier is an algorithm that assigns a class from a predefined set to a case or object, based on the values of descriptive attributes. • An optimal classifier maximizes the probability of a correct class assignment. ◦ Let C be a class attribute with dom( C ) = { c 1 , . . . , c n C } , which occur with probabilities p i , 1 ≤ i ≤ n C . ◦ Let q i be the probability with which a classifier assigns class c i . ( q i ∈ { 0 , 1 } for a deterministic classifier) ◦ The probability of a correct assignment is n C � P (correct assignment) = p i q i . i =1 ◦ Therefore the best choice for the q i is � 1 , if p i = max n C k =1 p k , q i = 0 , otherwise. Christian Borgelt Data Mining / Intelligent Data Analysis 270 Probabilistic Classification • Consequence: An optimal classifier should assign the most probable class . • This argument does not change if we take descriptive attributes into account. ◦ Let U = { A 1 , . . . , A m } be a set of descriptive attributes with domains dom( A k ), 1 ≤ k ≤ m . ◦ Let A 1 = a 1 , . . . , A m = a m be an instantiation of the attributes. ◦ An optimal classifier should assign the class c i for which P ( C = c i | A 1 = a 1 , . . . , A m = a m ) = max n C j =1 P ( C = c j | A 1 = a 1 , . . . , A m = a m ) • Problem: We cannot store a class (or the class probabilities) for every possible instantiation A 1 = a 1 , . . . , A m = a m of the descriptive attributes. (The table size grows exponentially with the number of attributes.) • Therefore: Simplifying assumptions are necessary. Christian Borgelt Data Mining / Intelligent Data Analysis 271 Bayes’ Rule and Bayes’ Classifiers • Bayes’ rule is a formula that can be used to “invert” conditional probabilities: Let X and Y be events, P ( X ) > 0. Then P ( Y | X ) = P ( X | Y ) · P ( Y ) . P ( X ) • Bayes’ rule follows directly from the definition of conditional probability: P ( Y | X ) = P ( X ∩ Y ) P ( X | Y ) = P ( X ∩ Y ) and . P ( X ) P ( Y ) • Bayes’ classifiers: Compute the class probabilities as P ( C = c i | A 1 = a 1 , . . . , A m = a m ) = P ( A 1 = a 1 , . . . , A m = a m | C = c i ) · P ( C = c i ) . P ( A 1 = a 1 , . . . , A m = a m ) • Looks unreasonable at first sight: Even more probabilities to store. Christian Borgelt Data Mining / Intelligent Data Analysis 272

  65. Naive Bayes Classifiers Naive Assumption: The descriptive attributes are conditionally independent given the class. Bayes’ Rule: P ( C = c i | ω ) = P ( A 1 = a 1 , . . . , A m = a m | C = c i ) · P ( C = c i ) ← p 0 P ( A 1 = a 1 , . . . , A m = a m ) Chain Rule of Probability: m P ( C = c i | ω ) = P ( C = c i ) � · P ( A k = a k | A 1 = a 1 , . . . , A k − 1 = a k − 1 , C = c i ) p 0 k =1 Conditional Independence Assumption: m P ( C = c i | ω ) = P ( C = c i ) � · P ( A k = a k | C = c i ) p 0 k =1 Christian Borgelt Data Mining / Intelligent Data Analysis 273 Reminder: Chain Rule of Probability • Based on the product rule of probability: P ( A ∧ B ) = P ( A | B ) · P ( B ) (Multiply definition of conditional probability with P ( B ).) • Multiple application of the product rule yields: P ( A 1 , . . . , A m ) = P ( A m | A 1 , . . . , A m − 1 ) · P ( A 1 , . . . , A m − 1 ) = P ( A m | A 1 , . . . , A m − 1 ) · P ( A m − 1 | A 1 , . . . , A m − 2 ) · P ( A 1 , . . . , A m − 2 ) = . . . m � = P ( A k | A 1 , . . . , A k − 1 ) k =1 • The scheme works also if there is already a condition in the original expression: m � P ( A 1 , . . . , A m | C ) = P ( A k | A 1 , . . . , A k − 1 , C ) i =1 Christian Borgelt Data Mining / Intelligent Data Analysis 274 Conditional Independence • Reminder: stochastic independence (unconditional) P ( A ∧ B ) = P ( A ) · P ( B ) (Joint probability is the product of the individual probabilities.) • Comparison to the product rule P ( A ∧ B ) = P ( A | B ) · P ( B ) shows that this is equivalent to P ( A | B ) = P ( A ) • The same formulae hold conditionally, i.e. P ( A ∧ B | C ) = P ( A | C ) · P ( B | C ) and P ( A | B, C ) = P ( A | C ) . • Conditional independence allows us to cancel some conditions . Christian Borgelt Data Mining / Intelligent Data Analysis 275 Conditional Independence: An Example ✻ t t t t t t t t t t Group 1 t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t Group 2 t t t t t t t t t t t t t t t t t t t ✲ Christian Borgelt Data Mining / Intelligent Data Analysis 276

  66. Conditional Independence: An Example ✻ t t t t t t t t t t Group 1 t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t ✲ Christian Borgelt Data Mining / Intelligent Data Analysis 277 Conditional Independence: An Example ✻ t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t Group 2 t t t t t t t t t t t t t t t t t t t t t t ✲ Christian Borgelt Data Mining / Intelligent Data Analysis 278 Naive Bayes Classifiers • Consequence: Manageable amount of data to store. Store distributions P ( C = c i ) and ∀ 1 ≤ k ≤ m : P ( A k = a k | C = c i ). • It is not necessary to compute p 0 explicitely, because it can be computed implicitly by normalizing the computed values to sum 1. Estimation of Probabilities: • Nominal/Symbolic Attributes P ( A k = a k | C = c i ) = #( A k = a k , C = c i ) + γ ˆ #( C = c i ) + n A k γ γ is called Laplace correction . γ = 0: Maximum likelihood estimation. Common choices: γ = 1 or γ = 1 2 . Christian Borgelt Data Mining / Intelligent Data Analysis 279 Naive Bayes Classifiers Estimation of Probabilities: • Metric/Numeric Attributes: Assume a normal distribution. � � − ( a k − µ k ( c i )) 2 1 √ f ( A k = a k | C = c i ) = 2 πσ k ( c i ) exp 2 σ 2 k ( c i ) • Estimate of mean value #( C = c i ) 1 � µ k ( c i ) = ˆ a k ( j ) #( C = c i ) j =1 • Estimate of variance #( C = c i ) � k ( c i ) = 1 σ 2 µ k ( c i )) 2 ˆ ( a k ( j ) − ˆ ξ j =1 ξ = #( C = c i ) : Maximum likelihood estimation ξ = #( C = c i ) − 1: Unbiased estimation Christian Borgelt Data Mining / Intelligent Data Analysis 280

  67. Naive Bayes Classifiers: Simple Example 1 P (Drug) A B No Sex Age Blood pr. Drug 0 . 5 0 . 5 1 male 20 normal A 2 female 73 normal B P (Sex | Drug) A B 3 female 37 high A male 0 . 5 0 . 5 4 male 33 low B female 0 . 5 0 . 5 5 female 48 high A P (Age | Drug) A B 6 male 29 normal A µ 36 . 3 47 . 8 7 female 52 normal B σ 2 161 . 9 311 . 0 8 male 42 low B P (Blood Pr. | Drug) A B 9 male 61 normal B 10 female 30 normal A low 0 0 . 5 11 female 26 low B normal 0 . 5 0 . 5 12 male 54 high A high 0 . 5 0 A simple database and estimated (conditional) probability distributions. Christian Borgelt Data Mining / Intelligent Data Analysis 281 Naive Bayes Classifiers: Simple Example 1 d (Drug A | male, 61, normal) c 1 · P (Drug A) · P (male | Drug A) · f (61 | Drug A) · P (normal | Drug A) = c 1 · 5 . 984 · 10 − 4 ≈ c 1 · 0 . 5 · 0 . 5 · 0 . 004787 · 0 . 5 = = 0 . 219 f (Drug B | male, 61, normal) = c 1 · P (Drug B) · P (male | Drug B) · f (61 | Drug B) · P (normal | Drug B) c 1 · 2 . 140 · 10 − 3 ≈ c 1 · 0 . 5 · 0 . 5 · 0 . 017120 · 0 . 5 = = 0 . 781 f (Drug A | female, 30, normal) = c 2 · P (Drug A) · P (female | Drug A) · f (30 | Drug A) · P (normal | Drug A) c 2 · 3 . 471 · 10 − 3 ≈ c 2 · 0 . 5 · 0 . 5 · 0 . 027703 · 0 . 5 = = 0 . 671 f (Drug B | female, 30, normal) = c 2 · P (Drug B) · P (female | Drug B) · f (30 | Drug B) · P (normal | Drug B) c 2 · 1 . 696 · 10 − 3 ≈ c 2 · 0 . 5 · 0 . 5 · 0 . 013567 · 0 . 5 = = 0 . 329 Christian Borgelt Data Mining / Intelligent Data Analysis 282 Naive Bayes Classifiers: Simple Example 2 • 100 data points, 2 classes • Small squares: mean values • Inner ellipses: one standard deviation • Outer ellipses: two standard deviations • Classes overlap: classification is not perfect Naive Bayes Classifier Christian Borgelt Data Mining / Intelligent Data Analysis 283 Naive Bayes Classifiers: Simple Example 3 • 20 data points, 2 classes • Small squares: mean values • Inner ellipses: one standard deviation • Outer ellipses: two standard deviations • Attributes are not conditionally independent given the class Naive Bayes Classifier Christian Borgelt Data Mining / Intelligent Data Analysis 284
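A minimal sketch of the naive Bayes classifier from simple example 1: it re-estimates the (conditional) probabilities from the twelve cases (no Laplace correction, unbiased variance estimate, matching the tables above) and reproduces the two classifications computed on the slide.

```python
# Minimal sketch: naive Bayes classifier for the drug example
# (nominal attributes: relative frequencies; Age: class-conditional normal).
import numpy as np

# the twelve example cases: (sex, age, blood pressure, drug)
data = [("male", 20, "normal", "A"), ("female", 73, "normal", "B"),
        ("female", 37, "high",   "A"), ("male",   33, "low",    "B"),
        ("female", 48, "high",   "A"), ("male",   29, "normal", "A"),
        ("female", 52, "normal", "B"), ("male",   42, "low",    "B"),
        ("male",   61, "normal", "B"), ("female", 30, "normal", "A"),
        ("female", 26, "low",    "B"), ("male",   54, "high",   "A")]

classes = ["A", "B"]
model = {}
for c in classes:
    rows = [r for r in data if r[3] == c]
    ages = np.array([r[1] for r in rows], dtype=float)
    model[c] = {
        "prior": len(rows) / len(data),
        "sex":   {v: sum(r[0] == v for r in rows) / len(rows) for v in ("male", "female")},
        "bp":    {v: sum(r[2] == v for r in rows) / len(rows) for v in ("low", "normal", "high")},
        "age":   (ages.mean(), ages.var(ddof=1)),   # unbiased variance, as on the slide
    }

def posterior(sex, age, bp):
    scores = {}
    for c in classes:
        m = model[c]
        mu, var = m["age"]
        dens = np.exp(-(age - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        scores[c] = m["prior"] * m["sex"][sex] * dens * m["bp"][bp]
    total = sum(scores.values())
    return {c: round(s / total, 3) for c, s in scores.items()}  # normalize instead of computing p_0

print(posterior("male", 61, "normal"))     # approx. {A: 0.219, B: 0.781}
print(posterior("female", 30, "normal"))   # approx. {A: 0.671, B: 0.329}
```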

  68. Reminder: The Iris Data pictures not available in online version • Collected by Edgar Anderson on the Gasp´ e Peninsula (Canada). • First analyzed by Ronald Aylmer Fisher (famous statistician). • 150 cases in total, 50 cases per Iris flower type. • Measurements of sepal length and width and petal length and width (in cm). • Most famous data set in pattern recognition and data analysis. Christian Borgelt Data Mining / Intelligent Data Analysis 285 Naive Bayes Classifiers: Iris Data • 150 data points, 3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue) • Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) • 6 misclassifications on the training data (with all 4 attributes) Naive Bayes Classifier Christian Borgelt Data Mining / Intelligent Data Analysis 286 Full Bayes Classifiers • Restricted to metric/numeric attributes (only the class is nominal/symbolic). • Simplifying Assumption: Each class can be described by a multivariate normal distribution. f ( A 1 = a 1 , . . . , A m = a m | C = c i ) � � 1 − 1 µ i ) ⊤ Σ − 1 = exp 2( � a − � i ( � a − � µ i ) � (2 π ) m | Σ i | � µ i : mean value vector for class c i Σ i : covariance matrix for class c i • Intuitively: Each class has a bell-shaped probability density. • Naive Bayes classifiers: Covariance matrices are diagonal matrices. (Details about this relation are given below.) Christian Borgelt Data Mining / Intelligent Data Analysis 287 Full Bayes Classifiers Estimation of Probabilities: • Estimate of mean value vector #( C = c i ) 1 � ˆ � µ i = � a ( j ) #( C = c i ) j =1 • Estimate of covariance matrix #( C = c i ) � � � � ⊤ Σ i = 1 � a ( j ) − ˆ a ( j ) − ˆ � � � µ i � � µ i ξ j =1 ξ = #( C = c i ) : Maximum likelihood estimation ξ = #( C = c i ) − 1: Unbiased estimation x ⊤ denotes the transpose of the vector � � x . x ⊤ is the so-called outer product or matrix product of � � x� x with itself. Christian Borgelt Data Mining / Intelligent Data Analysis 288
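A minimal sketch of a full Bayes classifier on assumed synthetic two-class data: each class is described by an estimated mean vector and full covariance matrix, and a case is assigned to the class with the highest prior times multivariate normal density (here compared on the log scale for numerical stability).

```python
# Minimal sketch: full Bayes classifier with one multivariate normal per class.
import numpy as np

rng = np.random.default_rng(6)
X0 = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.8], [0.8, 1.0]], size=100)
X1 = rng.multivariate_normal([5.0, 3.0], [[1.0, -0.5], [-0.5, 1.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = (len(Xc) / len(X),           # prior
                 Xc.mean(axis=0),            # mean value vector
                 np.cov(Xc, rowvar=False))   # covariance matrix (unbiased estimate)

def log_density(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def classify(x):
    return max(params, key=lambda c: np.log(params[c][0])
               + log_density(x, params[c][1], params[c][2]))

preds = np.array([classify(x) for x in X])
print("training accuracy:", np.mean(preds == y))   # should be high for these well-separated classes
```

Restricting the estimated covariance matrices to their diagonals reduces this to the naive Bayes classifier, in line with the comparison above.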

  69. Comparison of Naive and Full Bayes Classifiers Naive Bayes classifiers for metric/numeric data are equivalent to full Bayes classifiers with diagonal covariance matrices: f ( A 1 = a 1 , . . . , A m = a m | C = c i ) � � 1 − 1 µ i ) ⊤ Σ − 1 = � · exp 2( � a − � i ( � a − � µ i ) (2 π ) m | Σ i | � � � � 1 − 1 µ i ) ⊤ diag σ − 2 i, 1 , . . . , σ − 2 = · exp 2( � a − � ( � a − � µ i ) � (2 π ) m � m i,m k =1 σ 2 i,k   m ( a k − µ i,k ) 2 1  − 1 � = · exp  � � m 2 πσ 2 σ 2 2 k =1 i,k k =1 i,k   m  − ( a k − µ i,k ) 2 m � 1 � = · exp  = f ( A k = a k | C = c i ) , � � 2 πσ 2 2 σ 2 k =1 i,k k =1 i,k where f ( A k = a k | C = c i ) are the density functions of a naive Bayes classifier. Christian Borgelt Data Mining / Intelligent Data Analysis 289 Comparison of Naive and Full Bayes Classifiers Naive Bayes Classifier Full Bayes Classifier Christian Borgelt Data Mining / Intelligent Data Analysis 290 Full Bayes Classifiers: Iris Data • 150 data points, 3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue) • Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) • 2 misclassifications on the training data (with all 4 attributes) Full Bayes Classifier Christian Borgelt Data Mining / Intelligent Data Analysis 291 Tree-Augmented Naive Bayes Classifiers • A naive Bayes classifier can be seen as a special Bayesian network . • Intuitively, Bayesian networks are a graphical language for expressing conditional independence statements: A directed acyclic graph encodes, by a vertex separation criterion, which conditional independence statements hold in the joint probability distribution on the space spanned by the vertex attributes. Definition ( d -separation): Let � G = ( V, � E ) be a directed acyclic graph and X , Y , and Z three disjoint subsets of vertices. Z d-separates X and Y in � G , written � X | Z | Y � � G , iff there is no path from a vertex in X to a vertex in Y along which the following two conditions hold: 1. every vertex with converging edges (from its predecessor and its successor on the path) either is in Z or has a descendant in Z , 2. every other vertex is not in Z . A path satisfying the conditions above is said to be active , otherwise it is said to be blocked (by Z ); so separation means that all paths are blocked. Christian Borgelt Data Mining / Intelligent Data Analysis 292

  70. Tree-Augmented Naive Bayes Classifiers (figures: a star-shaped naive Bayes network and a tree-augmented network over the attributes A 1 , . . . , A n with the class C in the middle) • If in a directed acyclic graph all paths from a vertex set X to a vertex set Y are blocked by a vertex set Z (according to d -separation), this expresses that the conditional independence X ⊥⊥ Y | Z holds in the probability distribution that is described by a Bayesian network having this graph structure. • A star-like network, with the class attribute in the middle, represents a naive Bayes classifier: All paths are blocked by the class attribute C . • The strong conditional independence assumptions can be mitigated by allowing for additional edges between attributes. Restricting these edges to a (directed) tree allows for efficient learning ( tree-augmented naive Bayes classifiers ). Christian Borgelt Data Mining / Intelligent Data Analysis 293 Summary Bayes Classifiers • Probabilistic Classification : Assign the most probable class. • Bayes’ Rule : “Invert” the conditional class probabilities. • Naive Bayes Classifiers ◦ Simplifying Assumption: Attributes are conditionally independent given the class. ◦ Can handle nominal/symbolic as well as metric/numeric attributes. • Full Bayes Classifiers ◦ Simplifying Assumption: Each class can be described by a multivariate normal distribution. ◦ Can handle only metric/numeric attributes. • Tree-Augmented Naive Bayes Classifiers ◦ Mitigate the strong conditional independence assumptions. Christian Borgelt Data Mining / Intelligent Data Analysis 294 Decision and Regression Trees Christian Borgelt Data Mining / Intelligent Data Analysis 295 Decision and Regression Trees • Classification with a Decision Tree • Top-down Induction of Decision Trees ◦ A simple example ◦ The general algorithm ◦ Attribute selection measures ◦ Treatment of numeric attributes and missing values • Pruning Decision Trees ◦ General approaches ◦ A simple example • Regression Trees • Summary Christian Borgelt Data Mining / Intelligent Data Analysis 296

  71. A Very Simple Decision Tree Assignment of a drug to a patient: Blood pressure high low normal Drug A Age Drug B ≤ 40 > 40 Drug A Drug B Christian Borgelt Data Mining / Intelligent Data Analysis 297 Classification with a Decision Tree Recursive Descent: • Start at the root node. • If the current node is a leaf node : ◦ Return the class assigned to the node. • If the current node is an inner node : ◦ Test the attribute associated with the node. ◦ Follow the branch labeled with the outcome of the test. ◦ Apply the algorithm recursively. Intuitively: Follow the path corresponding to the case to be classified. Christian Borgelt Data Mining / Intelligent Data Analysis 298 Classification in the Example Assignment of a drug to a patient: Blood pressure high low normal Drug A Age Drug B ≤ 40 > 40 Drug A Drug B Christian Borgelt Data Mining / Intelligent Data Analysis 299 Classification in the Example Assignment of a drug to a patient: Blood pressure high low normal Drug A Age Drug B ≤ 40 > 40 Drug A Drug B Christian Borgelt Data Mining / Intelligent Data Analysis 300
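The recursive descent described above translates almost directly into code. The sketch below uses a hypothetical nested-tuple representation of the drug example tree; the node layout, the attribute names and the hard-coded age threshold are assumptions made only for illustration.

```python
# A test node is (attribute, branches); a leaf node is just a class label.
drug_tree = ("Blood pressure", {
    "high":   "Drug A",
    "low":    "Drug B",
    "normal": ("Age", {"<=40": "Drug A", ">40": "Drug B"}),
})

def classify(tree, case):
    if isinstance(tree, str):              # leaf node: return the assigned class
        return tree
    attribute, branches = tree             # inner node: test the attribute
    if attribute == "Age":                 # numeric test, threshold 40 (hard-coded here)
        value = "<=40" if case["Age"] <= 40 else ">40"
    else:
        value = case[attribute]
    return classify(branches[value], case)  # follow the branch recursively

print(classify(drug_tree, {"Blood pressure": "normal", "Age": 30}))  # Drug A
```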

  72. Classification in the Example Assignment of a drug to a patient: Blood pressure high low normal Drug A Age Drug B ≤ 40 > 40 Drug A Drug B Christian Borgelt Data Mining / Intelligent Data Analysis 301 Induction of Decision Trees • Top-down approach ◦ Build the decision tree from top to bottom (from the root to the leaves). • Greedy Selection of a Test Attribute ◦ Compute an evaluation measure for all attributes. ◦ Select the attribute with the best evaluation. • Divide and Conquer / Recursive Descent ◦ Divide the example cases according to the values of the test attribute. ◦ Apply the procedure recursively to the subsets. ◦ Terminate the recursion if – all cases belong to the same class – no more test attributes are available Christian Borgelt Data Mining / Intelligent Data Analysis 302 Induction of a Decision Tree: Example Patient database No Gender Age Blood pr. Drug • 12 example cases 1 male 20 normal A 2 female 73 normal B • 3 descriptive attributes 3 female 37 high A • 1 class attribute 4 male 33 low B 5 female 48 high A 6 male 29 normal A 7 female 52 normal B Assignment of drug 8 male 42 low B 9 male 61 normal B (without patient attributes) 10 female 30 normal A always drug A or always drug B: 11 female 26 low B 50% correct (in 6 of 12 cases) 12 male 54 high A Christian Borgelt Data Mining / Intelligent Data Analysis 303 Induction of a Decision Tree: Example Gender of the patient No Gender Drug • Division w.r.t. male/female. 1 male A 6 male A 12 male A 4 male B 8 male B 9 male B 3 female A Assignment of drug 5 female A male: 50% correct (in 3 of 6 cases) 10 female A 2 female B female: 50% correct (in 3 of 6 cases) 7 female B total: 50% correct (in 6 of 12 cases) 11 female B Christian Borgelt Data Mining / Intelligent Data Analysis 304
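The evaluation measure used in this example, the rate of correctly classified cases when every branch predicts its majority class, can be sketched as follows; the patient table is transcribed from the slides and the helper names are ad hoc.

```python
from collections import Counter

# (gender, age, blood pressure, drug) transcribed from the patient database
patients = [
    ("male", 20, "normal", "A"), ("female", 73, "normal", "B"),
    ("female", 37, "high", "A"), ("male", 33, "low", "B"),
    ("female", 48, "high", "A"), ("male", 29, "normal", "A"),
    ("female", 52, "normal", "B"), ("male", 42, "low", "B"),
    ("male", 61, "normal", "B"), ("female", 30, "normal", "A"),
    ("female", 26, "low", "B"), ("male", 54, "high", "A"),
]

def correct_rate(cases, attr_index, class_index=3):
    """Each attribute value predicts its majority class; count the hits."""
    groups = {}
    for case in cases:
        groups.setdefault(case[attr_index], []).append(case[class_index])
    hits = sum(Counter(classes).most_common(1)[0][1]
               for classes in groups.values())
    return hits / len(cases)

print(correct_rate(patients, 0))   # gender:         0.5
print(correct_rate(patients, 2))   # blood pressure: 0.75
```

This reproduces the 50% for gender and the 75% for blood pressure reported on the slides.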

  73. Induction of a Decision Tree: Example Age of the patient No Age Drug • Sort according to age. 1 20 A 11 26 B • Find best age split. 6 29 A here: ca. 40 years 10 30 A 4 33 B 3 37 A 8 42 B Assignment of drug 5 48 A ≤ 40: A 67% correct (in 4 of 6 cases) 7 52 B 12 54 A > 40: B 67% correct (in 4 of 6 cases) 9 61 B total: 67% correct (in 8 of 12 cases) 2 73 B Christian Borgelt Data Mining / Intelligent Data Analysis 305 Induction of a Decision Tree: Example Blood pressure of the patient No Blood pr. Drug • Division w.r.t. high/normal/low. 3 high A 5 high A 12 high A 1 normal A 6 normal A 10 normal A Assignment of drug 2 normal B high: A 100% correct (in 3 of 3 cases) 7 normal B 9 normal B normal: 50% correct (in 3 of 6 cases) 4 low B low: B 100% correct (in 3 of 3 cases) 8 low B total: 75% correct (in 9 of 12 cases) 11 low B Christian Borgelt Data Mining / Intelligent Data Analysis 306 Induction of a Decision Tree: Example Current Decision Tree: Blood pressure high low normal Drug A Drug B ? Christian Borgelt Data Mining / Intelligent Data Analysis 307 Induction of a Decision Tree: Example Blood pressure and gender No Blood pr. Gender Drug • Only patients 3 high A with normal blood pressure. 5 high A 12 high A • Division w.r.t. male/female. 1 normal male A 6 normal male A 9 normal male B 2 normal female B Assignment of drug 7 normal female B 10 normal female A male: A 67% correct (2 of 3) 4 low B female: B 67% correct (2 of 3) 8 low B total: 67% correct (4 of 6) 11 low B Christian Borgelt Data Mining / Intelligent Data Analysis 308

  74. Induction of a Decision Tree: Example Blood pressure and age No Blood pr. Age Drug • Only patients 3 high A with normal blood pressure. 5 high A 12 high A • Sort according to age. 1 normal 20 A • Find best age split. 6 normal 29 A here: ca. 40 years 10 normal 30 A 7 normal 52 B Assignment of drug 9 normal 61 B 2 normal 73 B ≤ 40: A 100% correct (3 of 3) 11 low B > 40: B 100% correct (3 of 3) 4 low B total: 100% correct (6 of 6) 8 low B Christian Borgelt Data Mining / Intelligent Data Analysis 309 Result of Decision Tree Induction Assignment of a drug to a patient: Blood pressure high low normal Drug A Age Drug B ≤ 40 > 40 Drug A Drug B Christian Borgelt Data Mining / Intelligent Data Analysis 310 Decision Tree Induction: Notation S a set of case or object descriptions C the class attribute A (1) , . . . , A ( m ) other attributes (index dropped in the following) dom( C ) = { c 1 , . . . , c n C } , n C : number of classes dom( A ) = { a 1 , . . . , a n A } , n A : number of attribute values N .. total number of case or object descriptions i.e. N .. = | S | N i. absolute frequency of the class c i N .j absolute frequency of the attribute value a j N ij absolute frequency of the combination of the class c i and the attribute value a j . It is N i. = � n A j =1 N ij and N .j = � n C i =1 N ij . relative frequency of the class c i , p i. = N i. p i. N .. relative frequency of the attribute value a j , p .j = N .j p .j N .. relative frequency of the combination of class c i and attribute value a j , p ij = N ij p ij N .. relative frequency of the class c i in cases having attribute value a j , p i | j = N ij N .j = p ij p i | j p .j Christian Borgelt Data Mining / Intelligent Data Analysis 311 Decision Tree Induction: General Algorithm function grow tree ( S : set of cases) : node; begin best v := WORTHLESS; for all untested attributes A do compute frequencies N ij , N i. , N .j for 1 ≤ i ≤ n C and 1 ≤ j ≤ n A ; compute value v of an evaluation measure using N ij , N i. , N .j ; if v > best v then best v := v ; best A := A ; end ; end if best v = WORTHLESS then create leaf node x ; assign majority class of S to x ; else create test node x ; assign test on attribute best A to x ; for all a ∈ dom( best A ) do x .child[ a ] := grow tree( S | best A = a ); end ; end ; return x ; end ; Christian Borgelt Data Mining / Intelligent Data Analysis 312
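The pseudocode of grow tree can be rendered as a compact Python sketch. Here cases are (attribute dictionary, class) pairs and the evaluation measure is passed in as a function that rates one attribute on a set of cases; a rating of zero plays the role of WORTHLESS. Numeric splits, missing values and pruning are omitted, and the names are not taken from the slides.

```python
from collections import Counter

def grow_tree(cases, attributes, evaluate):
    """cases: list of (attribute_dict, class_label); attributes: names still testable."""
    classes = [c for _, c in cases]
    best_value, best_attr = 0.0, None
    for attr in attributes:
        value = evaluate(cases, attr)            # e.g. information gain of attr
        if value > best_value:
            best_value, best_attr = value, attr
    if best_attr is None:                        # worthless: create a leaf node
        return Counter(classes).most_common(1)[0][0]   # majority class of S
    children = {}
    remaining = [a for a in attributes if a != best_attr]
    for val in set(a[best_attr] for a, _ in cases):    # divide and conquer
        subset = [(a, c) for a, c in cases if a[best_attr] == val]
        children[val] = grow_tree(subset, remaining, evaluate)
    return (best_attr, children)                 # test node on best_attr
```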

  75. Evaluation Measures • Evaluation measure used in the above example: rate of correctly classified example cases . ◦ Advantage: simple to compute, easy to understand. ◦ Disadvantage: works well only for two classes. • If there are more than two classes, the rate of misclassified example cases neglects a lot of the available information . ◦ Only the majority class—that is, the class occurring most often in (a subset of) the example cases—is really considered. ◦ The distribution of the other classes has no influence. However, a good choice here can be important for deeper levels of the decision tree. • Therefore: Study also other evaluation measures. Here: ◦ Information gain and its various normalizations. ◦ χ 2 measure (well-known in statistics). Christian Borgelt Data Mining / Intelligent Data Analysis 313 An Information-theoretic Evaluation Measure Information Gain (Kullback and Leibler 1951, Quinlan 1986) n � Based on Shannon Entropy H = − p i log 2 p i (Shannon 1948) i =1 I gain ( C, A ) = H ( C ) − H ( C | A ) � �� � � �� �   n C n A n C � � �  −  = − p i. log 2 p i. − p .j p i | j log 2 p i | j i =1 j =1 i =1 H ( C ) Entropy of the class distribution ( C : class attribute) H ( C | A ) Expected entropy of the class distribution if the value of the attribute A becomes known H ( C ) − H ( C | A ) Expected entropy reduction or information gain Christian Borgelt Data Mining / Intelligent Data Analysis 314 Interpretation of Shannon Entropy • Let S = { s 1 , . . . , s n } be a finite set of alternatives having positive probabilities � n P ( s i ), i = 1 , . . . , n , satisfying i =1 P ( s i ) = 1. • Shannon Entropy: n � H ( S ) = − P ( s i ) log 2 P ( s i ) i =1 • Intuitively: Expected number of yes/no questions that have to be asked in order to determine the obtaining alternative. ◦ Suppose there is an oracle, which knows the obtaining alternative, but responds only if the question can be answered with “yes” or “no”. ◦ A better question scheme than asking for one alternative after the other can easily be found: Divide the set into two subsets of about equal size. ◦ Ask for containment in an arbitrarily chosen subset. ◦ Apply this scheme recursively → number of questions bounded by ⌈ log 2 n ⌉ . Christian Borgelt Data Mining / Intelligent Data Analysis 315 Question/Coding Schemes P ( s 1 ) = 0 . 10 , P ( s 2 ) = 0 . 15 , P ( s 3 ) = 0 . 16 , P ( s 4 ) = 0 . 19 , P ( s 5 ) = 0 . 40 − � Shannon entropy: i P ( s i ) log 2 P ( s i ) = 2 . 15 bit/symbol Linear Traversal Equal Size Subsets s 1 , s 2 , s 3 , s 4 , s 5 s 1 , s 2 , s 3 , s 4 , s 5 s 2 , s 3 , s 4 , s 5 0.25 0.75 s 1 , s 2 s 3 , s 4 , s 5 s 3 , s 4 , s 5 0.59 s 4 , s 5 s 4 , s 5 0.10 0.15 0.16 0.19 0.40 0.10 0.15 0.16 0.19 0.40 s 1 s 2 s 3 s 4 s 5 s 1 s 2 s 3 s 4 s 5 1 2 3 4 4 2 2 2 3 3 Code length: 3.24 bit/symbol Code length: 2.59 bit/symbol Code efficiency: 0.664 Code efficiency: 0.830 Christian Borgelt Data Mining / Intelligent Data Analysis 316
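Following these definitions, Shannon entropy and information gain can be computed directly from the joint frequency table N_ij. A minimal sketch, using the blood pressure attribute of the drug example as test input:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(N):
    """N[i, j]: absolute frequency of class c_i together with attribute value a_j."""
    N = np.asarray(N, dtype=float)
    p_ij = N / N.sum()
    p_i  = p_ij.sum(axis=1)              # class distribution
    p_j  = p_ij.sum(axis=0)              # attribute value distribution
    H_C  = entropy(p_i)                  # H(C)
    H_C_given_A = sum(p_j[j] * entropy(p_ij[:, j] / p_j[j])
                      for j in range(N.shape[1]) if p_j[j] > 0)
    return H_C - H_C_given_A             # I_gain(C, A)

# blood pressure in the drug example: columns high / normal / low
print(information_gain([[3, 3, 0],       # drug A
                        [0, 3, 3]]))     # drug B  -> 0.5 bit
```

The result, 0.5 bit, is the entropy reduction obtained by splitting the 6/6 class distribution into the branches high (3/0), normal (3/3) and low (0/3).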

  76. Question/Coding Schemes • Splitting into subsets of about equal size can lead to a bad arrangement of the alternatives into subsets → high expected number of questions. • Good question schemes take the probability of the alternatives into account. • Shannon-Fano Coding (1948) ◦ Build the question/coding scheme top-down. ◦ Sort the alternatives w.r.t. their probabilities. ◦ Split the set so that the subsets have about equal probability (splits must respect the probability order of the alternatives). • Huffman Coding (1952) ◦ Build the question/coding scheme bottom-up. ◦ Start with one element sets. ◦ Always combine those two sets that have the smallest probabilities. Christian Borgelt Data Mining / Intelligent Data Analysis 317 Question/Coding Schemes P ( s 1 ) = 0 . 10 , P ( s 2 ) = 0 . 15 , P ( s 3 ) = 0 . 16 , P ( s 4 ) = 0 . 19 , P ( s 5 ) = 0 . 40 � Shannon entropy: − i P ( s i ) log 2 P ( s i ) = 2 . 15 bit/symbol Shannon–Fano Coding (1948) Huffman Coding (1952) s 1 , s 2 , s 3 , s 4 , s 5 s 1 , s 2 , s 3 , s 4 , s 5 0.41 0.59 0.60 s 1 , s 2 , s 3 s 4 , s 5 s 1 , s 2 , s 3 , s 4 0.25 0.25 0.35 s 1 , s 2 s 1 , s 2 s 3 , s 4 0.10 0.15 0.16 0.19 0.40 0.10 0.15 0.16 0.19 0.40 s 1 s 2 s 3 s 4 s 5 s 1 s 2 s 3 s 4 s 5 3 3 2 2 2 3 3 3 3 1 Code length: 2.25 bit/symbol Code length: 2.20 bit/symbol Code efficiency: 0.955 Code efficiency: 0.977 Christian Borgelt Data Mining / Intelligent Data Analysis 318 Question/Coding Schemes • It can be shown that Huffman coding is optimal if we have to determine the obtaining alternative in a single instance. (No question/coding scheme has a smaller expected number of questions.) • Only if the obtaining alternative has to be determined in a sequence of (indepen- dent) situations, this scheme can be improved upon. • Idea: Process the sequence not instance by instance, but combine two, three or more consecutive instances and ask directly for the obtaining combination of alternatives. • Although this enlarges the question/coding scheme, the expected number of ques- tions per identification is reduced (because each interrogation identifies the ob- taining alternative for several situations). • However, the expected number of questions per identification cannot be made ar- bitrarily small. Shannon showed that there is a lower bound, namely the Shannon entropy. Christian Borgelt Data Mining / Intelligent Data Analysis 319 Interpretation of Shannon Entropy P ( s 1 ) = 1 P ( s 2 ) = 1 P ( s 3 ) = 1 P ( s 4 ) = 1 P ( s 5 ) = 1 2 , 4 , 8 , 16 , 16 − � Shannon entropy: i P ( s i ) log 2 P ( s i ) = 1 . 875 bit/symbol Perfect Question Scheme If the probability distribution allows for a perfect Huffman code (code efficiency 1), s 1 , s 2 , s 3 , s 4 , s 5 the Shannon entropy can easily be inter- preted as follows: s 2 , s 3 , s 4 , s 5 � − P ( s i ) log 2 P ( s i ) s 3 , s 4 , s 5 i � 1 s 4 , s 5 = P ( s i ) · log 2 . P ( s i ) 1 1 1 1 1 i � �� � � �� � 2 4 8 16 16 s 1 s 2 s 3 s 4 s 5 occurrence path length 1 2 3 4 4 probability in tree In other words, it is the expected number Code length: 1.875 bit/symbol of needed yes/no questions. Code efficiency: 1 Christian Borgelt Data Mining / Intelligent Data Analysis 320
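Huffman's bottom-up construction (always merge the two sets with the smallest probabilities) can be sketched with a heap; the probabilities are the ones from the example above, and the sketch only derives code lengths, not the actual bit codes.

```python
import heapq

def huffman_code_lengths(probs):
    """Code length per symbol: repeatedly merge the two least probable sets."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # every merge adds one bit / question
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = [0.10, 0.15, 0.16, 0.19, 0.40]
lengths = huffman_code_lengths(probs)
print(lengths)                                        # [3, 3, 3, 3, 1]
print(sum(p * l for p, l in zip(probs, lengths)))     # 2.20 bit/symbol
```

The expected code length of 2.20 bit/symbol matches the value on the slide.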

  77. Other Information-theoretic Evaluation Measures Normalized Information Gain • Information gain is biased towards many-valued attributes. • Normalization removes / reduces this bias. Information Gain Ratio (Quinlan 1986 / 1993) I gr ( C, A ) = I gain ( C, A ) I gain ( C, A ) = � n A H A − j =1 p .j log 2 p .j Symmetric Information Gain Ratio (L´ opez de M´ antaras 1991) sgr ( C, A ) = I gain ( C, A ) sgr ( C, A ) = I gain ( C, A ) I (1) I (2) or H AC H A + H C Christian Borgelt Data Mining / Intelligent Data Analysis 321 Bias of Information Gain • Information gain is biased towards many-valued attributes , i.e., of two attributes having about the same information content it tends to select the one having more values. • The reasons are quantization effects caused by the finite number of example cases (due to which only a finite number of different probabilities can result in estima- tions) in connection with the following theorem: • Theorem: Let A , B , and C be three attributes with finite domains and let their joint probability distribution be strictly positive, i.e., ∀ a ∈ dom( A ) : ∀ b ∈ dom( B ) : ∀ c ∈ dom( C ) : P ( A = a, B = b, C = c ) > 0. Then I gain ( C, AB ) ≥ I gain ( C, B ) , with equality obtaining only if the attributes C and A are conditionally indepen- dent given B , i.e., if P ( C = c | A = a, B = b ) = P ( C = c | B = b ). (A detailed proof of this theorem can be found, for example, in [Borgelt and Kruse 2002], p. 311ff.) Christian Borgelt Data Mining / Intelligent Data Analysis 322 A Statistical Evaluation Measure χ 2 Measure • Compares the actual joint distribution with a hypothetical independent distribution . • Uses absolute comparison. • Can be interpreted as a difference measure. n C n A ( p i. p .j − p ij ) 2 � � χ 2 ( C, A ) = N .. p i. p .j i =1 j =1 • Side remark: Information gain can also be interpreted as a difference measure. n C n A � � p ij I gain ( C, A ) = p ij log 2 p i. p .j i =1 j =1 Christian Borgelt Data Mining / Intelligent Data Analysis 323 Treatment of Numeric Attributes General Approach: Discretization • Preprocessing I ◦ Form equally sized or equally populated intervals. • During the tree construction ◦ Sort the example cases according to the attribute’s values. ◦ Construct a binary symbolic attribute for every possible split (values: “ ≤ threshold” and “ > threshold”). ◦ Compute the evaluation measure for these binary attributes. ◦ Possible improvements: Add a penalty depending on the number of splits. • Preprocessing II / Multisplits during tree construction ◦ Build a decision tree using only the numeric attribute. ◦ Flatten the tree to obtain a multi-interval discretization. Christian Borgelt Data Mining / Intelligent Data Analysis 324
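Both the information gain ratio and the χ² measure are computed from the same frequency table as the information gain. A self-contained sketch (the symmetric variants are omitted; all marginal frequencies are assumed to be positive):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio_and_chi2(N):
    """N[i, j]: frequency of class c_i together with attribute value a_j."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    p_ij, p_i, p_j = N / n, N.sum(axis=1) / n, N.sum(axis=0) / n
    H_C, H_A = entropy(p_i), entropy(p_j)
    H_C_given_A = sum(p_j[j] * entropy(p_ij[:, j] / p_j[j])
                      for j in range(N.shape[1]) if p_j[j] > 0)
    gain = H_C - H_C_given_A
    gain_ratio = gain / H_A if H_A > 0 else 0.0                 # I_gain / H_A
    expected = np.outer(p_i, p_j)                               # p_i. * p_.j
    chi2 = n * np.sum((expected - p_ij) ** 2 / expected)        # chi^2 measure
    return gain_ratio, chi2

print(gain_ratio_and_chi2([[3, 3, 0],      # drug example, blood pressure
                           [0, 3, 3]]))
```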

  78. Treatment of Numeric Attributes • Problem: If the class boundary is oblique in the space spanned by two or more nu- meric attributes, decision trees construct a step function as the decision boundary. • Green: data points of class A Blue: data points of class B Yellow: actual class boundary Red: decision boundary built by a decision tree Gray: subdivision of the space used by a decision tree (threshold values) • Note: the complex decision boundary even produces an error! So-called “oblique” decision trees are able to find the yellow line. Christian Borgelt Data Mining / Intelligent Data Analysis 325 Treatment of Numeric Attributes • For the data set on the preceding slide a decision tree builds a proper step function as the decision boundary. Although sub- optimal, this may still be acceptable. • Unfortunately, other data point config- urations can lead to strange anomalies, which do not approximate the actual de- cision boundary well. • Green: data points of class A Blue: data points of class B Yellow: actual class boundary Red: decision boundary built by a decision tree So-called “oblique” decision trees Gray: subdivision of the space are able to find the yellow line. used by a decision tree Christian Borgelt Data Mining / Intelligent Data Analysis 326 Treatment of Missing Values Induction • Weight the evaluation measure with the fraction of cases with known values. ◦ Idea: The attribute provides information only if it is known. • Try to find a surrogate test attribute with similar properties (CART, Breiman et al. 1984) • Assign the case to all branches, weighted in each branch with the relative frequency of the corresponding attribute value (C4.5, Quinlan 1993). Classification • Use the surrogate test attribute found during induction. • Follow all branches of the test attribute, weighted with their relative number of cases, aggregate the class distributions of all leaves reached, and assign the majority class of the aggregated class distribution. Christian Borgelt Data Mining / Intelligent Data Analysis 327 Pruning Decision Trees Pruning serves the purpose • to simplify the tree (improve interpretability), • to avoid overfitting (improve generalization). Basic ideas: • Replace “bad” branches (subtrees) by leaves. • Replace a subtree by its largest branch if it is better. Common approaches: • Limiting the number of leaf cases • Reduced error pruning • Pessimistic pruning • Confidence level pruning • Minimum description length pruning Christian Borgelt Data Mining / Intelligent Data Analysis 328

  79. Limiting the Number of Leaf Cases • A decision tree may be grown until either the set of sample cases is class-pure or the set of descriptive attributes is exhausted. • However, this may lead to leaves that capture only very few, in extreme cases even just a single sample case. • Thus a decision tree may become very similar to a 1-nearest-neighbor classifier. • In order to prevent such results, it is common to let a user specify a minimum number of sample cases per leaf . • In such an approach, splits are usually limited to binary splits. (nominal attributes: usually one attribute value against all others) • A split is then adopted only if on both sides of the split at least the minimum number of sample cases are present. • Note that this approach is not an actual pruning method, as it is already applied during induction, not after. Christian Borgelt Data Mining / Intelligent Data Analysis 329 Reduced Error Pruning • Classify a set of new example cases with the decision tree. (These cases must not have been used for the induction!) • Determine the number of errors for all leaves. • The number of errors of a subtree is the sum of the errors of all of its leaves. • Determine the number of errors for leaves that replace subtrees. • If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf. • If a subtree has been replaced, recompute the number of errors of the subtrees it is part of. Advantage: Very good pruning, effective avoidance of overfitting. Disadvantage: Additional example cases needed. Christian Borgelt Data Mining / Intelligent Data Analysis 330 Pessimistic Pruning • Classify a set of example cases with the decision tree. (These cases may or may not have been used for the induction.) • Determine the number of errors for all leaves and increase this number by a fixed, user-specified amount r . • The number of errors of a subtree is the sum of the errors of all of its leaves. • Determine the number of errors for leaves that replace subtrees (also increased by r ). • If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf and recompute subtree errors. Advantage: No additional example cases needed. Disadvantage: Number of cases in a leaf has no influence. Christian Borgelt Data Mining / Intelligent Data Analysis 331 Confidence Level Pruning • Like pessimistic pruning, but the number of errors is computed as follows: ◦ See classification in a leaf as a Bernoulli experiment (error / no error). ◦ Estimate an interval for the error probability based on a user-specified confi- dence level α . (use approximation of the binomial distribution by a normal distribution) ◦ Increase error number to the upper level of the confidence interval times the number of cases assigned to the leaf. ◦ Formal problem: Classification is not a random experiment. Advantage: No additional example cases needed, good pruning. Disadvantage: Statistically dubious foundation. Christian Borgelt Data Mining / Intelligent Data Analysis 332
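One possible reading of the confidence level pruning rule is sketched below: the leaf's error rate is treated as a Bernoulli parameter, the upper limit of a confidence interval is obtained from the normal approximation of the binomial distribution, and that limit times the number of cases in the leaf is used as the error count. Whether the interval is one- or two-sided, and the use of SciPy's normal quantile function, are assumptions of this sketch, not statements of the slides.

```python
from math import sqrt
from scipy.stats import norm

def confidence_error(n_cases, n_errors, alpha=0.8):
    """Pessimistic error count: upper confidence limit times number of leaf cases."""
    p = n_errors / n_cases                        # observed error rate in the leaf
    z = norm.ppf(1 - (1 - alpha) / 2)             # quantile for confidence level alpha
    upper = p + z * sqrt(p * (1 - p) / n_cases)   # normal approximation of the binomial
    return min(upper, 1.0) * n_cases

# a leaf with 20 cases and 7 errors, confidence level alpha = 0.8
print(confidence_error(20, 7))
```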

  80. Pruning a Decision Tree: A Simple Example Pessimistic Pruning with r = 0 . 8 and r = 0 . 4 : leaf: 7.0 errors c 1 : 13, c 2 : 7 r = 0 . 8: 7.8 errors (prune subtree) r = 0 . 4: 7.4 errors (keep subtree) a 1 a 2 a 3 c 1 : 5, c 2 : 2 c 1 : 6, c 2 : 2 c 1 : 2, c 2 : 3 total: 6.0 errors 2.8 errors 2.8 errors 2.8 errors r = 0 . 8: 8.4 errors 2.4 errors 2.4 errors 2.4 errors r = 0 . 4: 7.2 errors Christian Borgelt Data Mining / Intelligent Data Analysis 333 Reminder: The Iris Data pictures not available in online version • Collected by Edgar Anderson on the Gasp´ e Peninsula (Canada). • First analyzed by Ronald Aylmer Fisher (famous statistician). • 150 cases in total, 50 cases per Iris flower type. • Measurements of sepal length and width and petal length and width (in cm). • Most famous data set in pattern recognition and data analysis. Christian Borgelt Data Mining / Intelligent Data Analysis 334 Decision Trees: An Example A decision tree for the Iris data (induced with information gain ratio, unpruned) Christian Borgelt Data Mining / Intelligent Data Analysis 335 Decision Trees: An Example A decision tree for the Iris data (pruned with confidence level pruning, α = 0 . 8, and pessimistic pruning, r = 2) • Left: 7 instead of 11 nodes, 4 instead of 2 misclassifications. • Right: 5 instead of 11 nodes, 6 instead of 2 misclassifications. • The right tree is “minimal” for the three classes. Christian Borgelt Data Mining / Intelligent Data Analysis 336
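The numbers in this pruning example can be reproduced with a few lines: each leaf's error count is increased by r, and the subtree is pruned if the adjusted error of the replacing leaf does not exceed the adjusted error of the subtree.

```python
def adjusted_errors(class_counts, r):
    """Errors of a leaf holding these class counts, increased by r."""
    return sum(class_counts) - max(class_counts) + r

def prune_decision(root_counts, child_counts, r):
    leaf_errors    = adjusted_errors(root_counts, r)
    subtree_errors = sum(adjusted_errors(c, r) for c in child_counts)
    return "prune" if leaf_errors <= subtree_errors else "keep"

root     = [13, 7]                      # c1: 13, c2: 7
children = [[5, 2], [6, 2], [2, 3]]     # branches a1, a2, a3
print(prune_decision(root, children, r=0.8))   # prune (7.8 <= 8.4)
print(prune_decision(root, children, r=0.4))   # keep  (7.4 >  7.2)
```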

  81. Regression Trees • Target variable is not a class, y but a numeric quantity. • Simple regression trees: predict constant values in leaves. (blue lines) x • More complex regression trees: 30 60 predict linear functions in leaves. x : input variable, y : target variable (red line) Christian Borgelt Data Mining / Intelligent Data Analysis 337 Regression Trees: Attribute Selection distributions of the target value a 1 a 2 split w.r.t. a test attribute • The variance / standard deviation is compared to the variance / standard deviation in the branches. • The attribute that yields the highest reduction is selected. Christian Borgelt Data Mining / Intelligent Data Analysis 338 Regression Trees: An Example A regression tree for the Iris data (petal width) (induced with reduction of sum of squared errors) Christian Borgelt Data Mining / Intelligent Data Analysis 339 Summary Decision and Regression Trees • Decision Trees are Classifiers with Tree Structure ◦ Inner node: Test of a descriptive attribute ◦ Leaf node: Assignment of a class • Induction of Decision Trees from Data (Top-Down Induction of Decision Trees, TDIDT) ◦ Divide and conquer approach / recursive descent ◦ Greedy selection of the test attributes ◦ Attributes are selected based on an evaluation measure , e.g. information gain, χ 2 measure ◦ Recommended: Pruning of the decision tree • Numeric Target: Regression Trees Christian Borgelt Data Mining / Intelligent Data Analysis 340
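Attribute selection for regression trees compares the sum of squared errors (equivalently, the variance) before and after a split and picks the split with the largest reduction. A sketch of finding the best threshold for a single numeric input attribute, with made-up data:

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (zero for an empty set)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return the threshold with the largest reduction of the sum of squared errors."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best_t, best_red = None, 0.0
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2               # candidate threshold between values
        reduction = sse(y) - sse(y[:i]) - sse(y[i:])
        if reduction > best_red:
            best_t, best_red = t, reduction
    return best_t, best_red

x = [1, 2, 3, 4, 30, 40, 50, 60, 70, 80]
y = [1.0, 1.1, 0.9, 1.0, 3.0, 3.2, 3.1, 5.0, 5.1, 4.9]
print(best_split(x, y))
```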

  82. k-Nearest Neighbors Christian Borgelt Data Mining / Intelligent Data Analysis 341 k-Nearest Neighbors • Basic Principle and Simple Examples • Ingredients of k-Nearest Neighbors ◦ Distance Metric ◦ Number of Neighbors ◦ Weighting Function for the Neighbors ◦ Prediction Function • Weighting with Kernel Functions • Locally Weighted Polynomial Regression • Implementation Aspects • Feature/Attribute Weights • Data Set Reduction and Prototype Building • Summary Christian Borgelt Data Mining / Intelligent Data Analysis 342 k-Nearest Neighbors: Principle • The nearest neighbor algorithm [Cover and Hart 1967] is one of the simplest and most natural classification and numeric prediction methods. • It derives the class labels or the (numeric) target values of new input objects from the most similar training examples, where similarity is measured by distance in the feature space. • The prediction is computed by a majority vote of the nearest neighbors or by averaging their (numeric) target values. • The number k of neighbors to be taken into account is a parameter of the algorithm, the best choice of which depends on the data and the prediction task. • In a basic nearest neighbor approach only one neighbor object, namely the closest one, is considered, and its class or target value is directly transferred to the query object. Christian Borgelt Data Mining / Intelligent Data Analysis 343 k-Nearest Neighbors: Principle • Constructing nearest neighbor classifiers and numeric predictors is a special case of instance-based learning [Aha et al. 1991]. • As such, it is a lazy learning method in the sense that it is not tried to construct a model that generalizes beyond the training data (as eager learning methods do). • Rather, the training examples are merely stored. • Predictions for new cases are derived directly from these stored examples and their (known) classes or target values, usually without any intermediate model construction. • (Partial) Exception: lazy decision trees construct from the stored cases the single path in the decision tree along which the query object is passed down. • This can improve on standard decision trees in the presence of missing values. • However, this comes at the price of higher classification costs. Christian Borgelt Data Mining / Intelligent Data Analysis 344
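The basic principle fits in a few lines: find the k training examples closest to the query point and combine their targets by a majority vote (classification) or an average (numeric prediction). The Euclidean distance and the variable names are obvious, but by no means the only, choices.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, numeric=False):
    """Predict by majority vote (classification) or average (numeric target)."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(query), axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    targets = [y_train[i] for i in nearest]
    if numeric:
        return float(np.mean(targets))            # numeric prediction: average
    return Counter(targets).most_common(1)[0][0]  # classification: majority vote

X = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, [1.1, 1.0], k=3))   # A
```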

  83. k-Nearest Neighbors: Simple Examples output input • In both example cases it is k = 1. • Classification works with a Voronoi tesselation of the data space. • Numeric prediction leads to a piecewise constant function. • Using more than one neighbor changes the classification/prediction. Christian Borgelt Data Mining / Intelligent Data Analysis 345 Delaunay Triangulations and Voronoi Diagrams • Dots represent data points • Left: Delaunay Triangulation The circle through the corners of a triangle does not contain another point. • Right: Voronoi Diagram / Tesselation Midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed data points (Voronoi cells). Christian Borgelt Data Mining / Intelligent Data Analysis 346 k-nearest Neighbors: Simple Examples output input • Note: neither the Voronoi tessellation nor the piecewise constant function are actually computed in the learning process; no model is built at training time. • The prediction is determined only in response to a query for the class or target value of a new input object, namely by finding the closest neighbor of the query object and then transferring its class or target value. learned models. Christian Borgelt Data Mining / Intelligent Data Analysis 347 Using More Than One Neighbor • A straightforward generalization of the nearest neighbor approach is to use not just the one closest, but the k nearest neighbors (usually abbreviated as k-NN ). • If the task is classification, the prediction is then determined by a majority vote among these k neighbors (breaking ties arbitrarily). • If the task is numeric prediction, the average of the target values of these k neighbors is computed. • Not surprisingly, using more than one neighbor improves the robustness of the algorithm, since it is not so easily fooled by individual training instances that are labeled incorrectly or are outliers for a class (that is, data points that have an unusual location for the class assigned to them). • Outliers for the complete data set, on the other hand, do not affect nearest neighbor predictors much, because they can only change the prediction for data points that should not occur or should occur only very rarely (provided the rest of the data is representative). Christian Borgelt Data Mining / Intelligent Data Analysis 348

  84. Using More Than One Neighbor • However, using too many neighbors can reduce the capability of the algorithm as it may smooth the classification boundaries or the interpolation too much to yield good results. • As a consequence, apart from the core choice of the distance function that determines which training examples are the nearest, the choice of the number of neighbors to consider is crucial. • Once multiple neighbors are considered, further extensions become possible. • For example, the (relative) influence of a neighbor on the prediction may be made dependent on its distance from the query point (distance weighted k -nearest neighbors). • Or the prediction may be computed from a local model that is constructed on the fly for a given query point (i.e. from its nearest neighbors) rather than by a simple majority or averaging rule. Christian Borgelt Data Mining / Intelligent Data Analysis 349 k-Nearest Neighbors: Basic Ingredients • Distance Metric The distance metric, together with a possible task-specific scaling or weighting of the attributes, determines which of the training examples are nearest to a query data point and thus selects the training example(s) used to compute a prediction. • Number of Neighbors The number of neighbors of the query point that are considered can range from only one (the basic nearest neighbor approach) through a few (like k -nearest neigh- bor approaches) to, in principle, all data points as an extreme case. • Weighting Function for the Neighbors If multiple neighbors are considered, it is plausible that closer (and thus more sim- ilar) neighbors should have a stronger influence on the prediction result. This can be expressed by a weighting function yielding higher values for smaller distances. • Prediction Function If multiple neighbors are considered, one needs a procedure to compute the pre- diction from the (generally differing) classes or target values of these neighbors, since they may differ and thus may not yield a unique prediction directly. Christian Borgelt Data Mining / Intelligent Data Analysis 350 k-Nearest Neighbors: More Than One Neighbor output 3-nearest neighbor predictor, using a simple averaging of the target values of the nearest neighbors. Note that the prediction is still a piecewise constant function. input • The main effect of the number k of considered neighbors is how much the class boundaries or the numeric prediction is smoothed. • If only one neighbor is considered, the prediction is constant in the Voronoi cells of the training data set and meeting the data points. Christian Borgelt Data Mining / Intelligent Data Analysis 351 k-Nearest Neighbors: Number of Neighbors • If only one neighbor is considered, the prediction is constant in the Voronoi cells of the training data set. • This makes the prediction highly susceptible to the deteriorating effects of incorrectly labeled instances or outliers w.r.t. their class, because a single data point with the wrong class spoils the prediction in its whole Voronoi cell. • Considering several neighbors ( k > 1) mitigates this problem, since neighbors having the correct class can override the influence of an outlier. • However, choosing a very large k is also not generally advisable, because it can prevent the classifier from being able to properly approximate narrow class regions or narrow peaks or valleys in a numeric target function. 
• The example of 3-nearest neighbor prediction on the preceding slide (using a simple averaging of the target values of these nearest neighbors) already shows the smoothing effect (especially at the borders of the input range). • The interpolation deviates considerably from the data points. Christian Borgelt Data Mining / Intelligent Data Analysis 352

  85. k-Nearest Neighbors: Number of Neighbors • A common method to automatically determine an appropriate value for the number k of neighbors is cross-validation . • The training data set is divided into r cross-validation folds of (approximately) equal size. • The fold sizes may differ by one data point, to account for the fact that the total number of training examples may not be divisible by r , the number of folds. • Then r classification or prediction experiments are performed: each combination of r − 1 folds is once chosen as the training set, with which the remaining fold is classified or the (numeric) target value is predicted, using all numbers k of neighbors from a user-specified range. • The classification accuracy or the prediction error is aggregated, for the same value k , over these experiments. • Finally the number k of neighbors that yields the lowest aggregated error is chosen. Christian Borgelt Data Mining / Intelligent Data Analysis 353 k-Nearest Neighbors: Weighting • Approaches that weight the considered neighbors differently based on their distance to the query point are known as distance-weighted k-nearest neighbor or (for numeric targets) locally weighted regression or locally weighted scatterplot smoothing ( LOWESS or LOESS ). • Such weighting is mandatory in the extreme case in which all n training examples are used as neighbors, because otherwise only the majority class or the global average of the target values can be predicted. • However, it is also recommended for k < n , since it can, at least to some degree, counteract the smoothing effect of a large k , because the excess neighbors are likely to be farther away and thus will influence the prediction less. • It should be noted, though, that distance-weighted k -NN is not a way of avoiding the need to find a good value for the number k of neighbors. Christian Borgelt Data Mining / Intelligent Data Analysis 354 k-Nearest Neighbors: Weighting • A typical example of a weighting function for the nearest neighbors is the so-called tricubic weighting function , which is defined as w ( s i , q, k ) = ( 1 − ( d ( s i , q ) / d max ( q, k ))^3 )^3 . • q is the query point, s i is (the input vector of) the i -th nearest neighbor of q in the training data set, k is the number of considered neighbors, d is the employed distance function, and d max ( q, k ) is the maximum distance between any two points from the set { q, s 1 , . . . , s k } , that is, d max ( q, k ) = max a,b ∈{ q,s 1 ,...,s k } d ( a, b ). • The function w yields the weight with which the target value of the i -th nearest neighbor s i of q enters the prediction computation. Christian Borgelt Data Mining / Intelligent Data Analysis 355 k-Nearest Neighbors: Weighting output 2-nearest neighbor predictor, using a distance-weighted averaging of the nearest neighbors. This ensures that the prediction meets the data points. input • Note that the interpolation is mainly linear (because two nearest neighbors are used), except for some small plateaus close to the data points. • These result from the weighting; in addition, there are certain jumps at points where the two closest neighbors are on the same side of the query point. Christian Borgelt Data Mining / Intelligent Data Analysis 356
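The tricubic weighting function, w(s_i, q, k) = (1 − (d(s_i, q)/d_max(q, k))^3)^3, combined with a distance-weighted average can be sketched as follows. For simplicity d_max is taken here as the distance from the query point to the farthest of the k neighbors, a slight simplification of the definition above (which takes the maximum over all pairs).

```python
import numpy as np

def tricubic_weights(dists):
    """w_i = (1 - (d_i / d_max)^3)^3 for the distances of the k nearest neighbors."""
    d_max = dists.max()
    if d_max == 0:
        return np.ones_like(dists)
    return (1.0 - (dists / d_max) ** 3) ** 3

def weighted_knn_predict(X_train, y_train, query, k=3):
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(query), axis=1)
    nearest = np.argsort(dists)[:k]
    w = tricubic_weights(dists[nearest])
    y = np.asarray(y_train, dtype=float)[nearest]
    return float(np.sum(w * y) / np.sum(w))      # distance-weighted average

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
print(weighted_knn_predict(X, y, [1.4], k=3))
```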

  86. k-Nearest Neighbors: Weighting • An alternative approach to distance-weighted k -NN consists in abandoning the requirement of a predetermined number of nearest neighbors. • Rather a data point is weighted with a kernel regression K that is defined on its distance d to the query point and that satisfies the following properties: (1) K ( d ) ≥ 0, (2) K (0) = 1 (or at least that K has its mode at 0), and (3) K ( d ) decreases monotonously for d → ∞ . • In this case all training examples for which the kernel function yields a non-vanishing value w.r.t. a given query point are used for the prediction. • Since the density of training examples may, of course, differ for different regions of the feature space, this may lead to a different number of neighbors being considered, depending on the query point. • If the kernel function has an infinite support (that is, does not vanish for any finite argument value), all data points are considered for any query point. Christian Borgelt Data Mining / Intelligent Data Analysis 357 k-Nearest Neighbors: Weighting • By using such a kernel function, we try to mitigate the problem of choosing a good value for the number K of neighbors, which is now taken care of by the fact that instances that are farther away have a smaller influence on the prediction result. • On the other hand, we now face the problem of having to decide how quickly the influence of a data point should decline with increasing distance, which is analogous to choosing the right number of neighbors and can be equally difficult to solve. • Examples of kernel functions with a finite support, given as a radius σ around the query point within which training examples are considered, are K rect ( d ) = τ ( d ≤ σ ) , K triangle ( d ) = τ ( d ≤ σ ) · (1 − d/σ ) , K tricubic ( d ) = τ ( d ≤ σ ) · (1 − d 3 /σ 3 ) 3 , where τ ( φ ) is 1 if φ is true and 0 otherwise. Christian Borgelt Data Mining / Intelligent Data Analysis 358 k-Nearest Neighbors: Weighting • A typical kernel function with infinite support is the Gaussian function � � − d 2 K gauss ( d ) = exp , 2 σ 2 where d is the distance of the training example to the query point and σ 2 is a parameter that determines the spread of the Gaussian function. • The advantage of a kernel with infinite support is that the prediction function is smooth (has no jumps) if the kernel is smooth, because then a training case does not suddenly enter the prediction if a query point is moved by an infinitesimal amount, but its influence rises smoothly in line with the kernel function. • One also does not have to choose a number of neighbors. • However, the disadvantage is, as already pointed out, that one has to choose an appropriate radius σ for the kernel function, which can be more difficult to choose than an appropriate number of neighbors. Christian Borgelt Data Mining / Intelligent Data Analysis 359 k-Nearest Neighbors: Weighting output Kernel weighted regression, using a Gaussian kernel function. This ensures that the prediction is smooth, though possibly not very close to the training data points. input • Note that the regression function is smooth, because the kernel function is smooth and always refers to all data points as neighbors, so that no jumps occur due to a change in the set of nearest neighbors. • The price one has to pay is an increased computational cost, since the kernel function has to be evaluated for all data points, not only for the nearest neighbors. Christian Borgelt Data Mining / Intelligent Data Analysis 360
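With a kernel function the prediction becomes a weighted average over all training points whose kernel value does not vanish. A sketch with the Gaussian and the tricubic kernel from above; the fallback to the global average when no point lies within a finite kernel's support is an ad-hoc choice of this sketch, not part of the slides.

```python
import numpy as np

def gauss_kernel(d, sigma):
    """Gaussian kernel: exp(-d^2 / (2 sigma^2)), infinite support."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def tricubic_kernel(d, sigma):
    """Tricubic kernel with finite support of radius sigma."""
    return np.where(d <= sigma, (1 - d ** 3 / sigma ** 3) ** 3, 0.0)

def kernel_regression(X_train, y_train, query, kernel=gauss_kernel, sigma=1.0):
    """Weight every training point by the kernel of its distance to the query."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(query), axis=1)
    w = kernel(dists, sigma)
    if w.sum() == 0:                  # no point within the kernel's support
        return float(np.mean(y_train))
    return float(np.sum(w * np.asarray(y_train)) / w.sum())

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
print(kernel_regression(X, y, [1.5], sigma=0.8))
```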

  87. k-Nearest Neighbors: Implementation • A core issue of implementing nearest neighbor prediction is the data structure used to store the training examples. • In a naive implementation they are simply stored as a list, which requires merely O ( n ) time, where n is the number of training examples. • However, though fast at training time, this approach has the serious drawback of being very slow at execution time, because a linear traversal of all training examples is needed to find the nearest neighbor(s), requiring O ( nm ) time, where m is the dimensionality of the data. • As a consequence, this approach becomes quickly infeasible with a growing number of training examples or for high-dimensional data. • Better approaches rely on data structures like a k d-tree (short for k -dimensional tree, where the k here refers to the number of dimensions, not the number of neighbors), an R- or R ∗ -tree, a UB-tree etc. Christian Borgelt Data Mining / Intelligent Data Analysis 361 k-Nearest Neighbors: Prediction Function • The most straightforward choices for the prediction function are a simple (weighted) majority vote for classification or a simple (weighted) average for numeric prediction. • However, especially for numeric prediction, one may also consider more complex prediction functions, like building a local regression model from the neighbors (usually with a linear function or a low-degree polynomial), thus arriving at locally weighted polynomial regression . • The prediction is then computed from this local model. • Not surprisingly, distance weighting may also be used in such a setting. • Such an approach should be employed with a larger number of neighbors, so that a change of the set of nearest neighbors leads to less severe changes of the local regression line. (Although there will still be jumps of the predicted value in this case, they are just less high.) Christian Borgelt Data Mining / Intelligent Data Analysis 362 k-Nearest Neighbors: Locally Weighted Regression output 4-nearest neighbor distance-weighted locally linear regression (using a tricubic weighting function). Although linear regression is used, the nearest neighbors do not enter with unit weight! input • Note how the distance weighting leads to deviations from straight lines between the data points. • Note also the somewhat erratic behavior of the resulting regression function (jumps at points where the set of nearest neighbors changes). This will be less severe, the larger the number of neighbors. Christian Borgelt Data Mining / Intelligent Data Analysis 363 k-Nearest Neighbors: Locally Weighted Regression • Locally weighted regression is usually applied with simple regression polynomials: ◦ most of the time linear, ◦ rarely quadratic, ◦ basically never any higher order. • The reason is that the local character of the regression is supposed to take care of the global shape of the function, so that the regression function is not needed to model it. • The advantage of locally weighted polynomial regression is that no global regression function, derived from some data generation model, needs to be found. • This makes the method applicable to a broad range of prediction problems. • Its disadvantages are that its prediction can be less reliable in sparsely sampled regions of the feature space, where the locally employed regression function is stretched to a larger area and thus may fit the actual target function badly. Christian Borgelt Data Mining / Intelligent Data Analysis 364
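Locally weighted linear regression fits a weighted least-squares line to the k nearest neighbors of each query point and evaluates it there. A sketch using tricubic neighbor weights and NumPy's least-squares solver; as in the sketch above, d_max is simplified to the distance of the farthest neighbor.

```python
import numpy as np

def locally_weighted_linear(X_train, y_train, query, k=4):
    X = np.asarray(X_train, float)
    y = np.asarray(y_train, float)
    q = np.asarray(query, float)
    dists = np.linalg.norm(X - q, axis=1)
    nearest = np.argsort(dists)[:k]
    d = dists[nearest]
    d_max = d.max() if d.max() > 0 else 1.0
    w = (1 - (d / d_max) ** 3) ** 3                  # tricubic neighbor weights
    A = np.hstack([X[nearest], np.ones((k, 1))])     # design matrix with intercept
    W = np.sqrt(w)[:, None]                          # weighted least squares
    coef, *_ = np.linalg.lstsq(A * W, y[nearest] * W.ravel(), rcond=None)
    return float(np.append(q, 1.0) @ coef)           # evaluate the local line at q

X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.0, 1.0, 4.0, 9.0, 16.0]
print(locally_weighted_linear(X, y, [2.5], k=4))
```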

  88. k-Nearest Neighbors: Implementation • With such data structures the query time can be reduced to O (log n ) per query data point. • The time to store the training examples (that is, the time to construct an efficient access structure for them) is, of course, worse than for storing them in a simple list. • However, with a good data structure and algorithm it is usually acceptably longer. • For example, a k d-tree is constructed by iterative bisections in different dimensions that split the set of data points (roughly) equally. • As a consequence, constructing it from n training examples takes O ( n log n ) time if a linear time algorithm for finding the median in a dimension is employed. • Whether such an approach pays off, depends also on the expected number of query points compared to the number of training data points. Christian Borgelt Data Mining / Intelligent Data Analysis 365 Feature Weights • It is crucial for the success of a nearest neighbor approach that a proper distance function is chosen. • A very simple and natural way of adapting a distance function to the needs of the prediction problem is to use distance weights, thus giving certain features a greater influence than others. • If prior information is available about which features are most informative w.r.t. the target, this information can be incorporated directly into the distance function. • However, one may also try to determine appropriate feature weights automatically. • The simplest approach is to start with equal feature weights and to modify them iteratively in a hill climbing fashion: ◦ apply a (small) random modification to the feature weights, ◦ check with cross validation whether this improves the prediction quality; ◦ if it does, accept the new weights, otherwise keep the old. ◦ Repeat until some termination criterion is met. Christian Borgelt Data Mining / Intelligent Data Analysis 366 Data Set Reduction and Prototype Building • A core problem of nearest neighbor approaches is to quickly find the nearest neigh- bors of a given query point. • This becomes an important practical problem if the training data set is large and predictions must be computed (very) quickly. • In such a case one may try to reduce the set of training examples in a preprocessing step, so that a set of relevant or prototypical data points is found, which yields basically the same prediction quality. • Note that this set may or may not be a subset of the training examples, depending on whether the algorithm used to construct this set merely samples from the training examples or constructs new data points if necessary. • Note also that there are usually no or only few actually redundant data points, which can be removed without affecting the prediction at all. • This is obvious for the numerical case and a 1-nearest neighbor classifier, but also holds for a k -nearest neighbor classifier with k > 1, because any removal of data points may change the vote at some point and potentially the classification. Christian Borgelt Data Mining / Intelligent Data Analysis 367 Data Set Reduction and Prototype Building • A straightforward approach is based on a simple iterative merge scheme: ◦ At the beginning each training example is considered as a prototype. ◦ Then successively two nearest prototypes are merged as long as the prediction quality on some hold-out test data set is not reduced. 
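In practice the neighbors are rarely found by a linear scan; a k d-tree, here the one provided by SciPy, gives the logarithmic query time mentioned above. Feature weights can be applied by simply rescaling the coordinates before building the tree (the data and weights below are arbitrary).

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10000, 4))                    # 10000 training points, 4 features
feature_weights = np.array([2.0, 1.0, 1.0, 0.5])

tree = cKDTree(X * feature_weights)           # build once: O(n log n)
query = rng.random(4)
dists, idx = tree.query(query * feature_weights, k=5)   # 5 nearest neighbors
print(idx, dists)
```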
◦ Prototypes can be merged, for example, by simply computing a weighted sum, with the relative weights determined by how many original training examples a prototype represents. ◦ This is similar to hierarchical agglomerative clustering. • More sophisticated approaches may employ, for example, genetic algorithms or any other method for solving a combinatorial optimization problem. • This is possible because the task of finding prototypes can be viewed as the task of finding a subset of the training examples that yields the best prediction quality (on a given test data set, not the training data set): finding best subsets is a standard combinatorial optimization problem. Christian Borgelt Data Mining / Intelligent Data Analysis 368

  89. Summary k-Nearest Neighbors • Predict with Target Values of k Nearest Neighbors ◦ classification: majority vote ◦ numeric prediction: average value • Special Case of Instance-based Learning ◦ method is easy to understand ◦ intuitive and plausible prediction principle ◦ lazy learning: no model is constructed • Ingredients of k-Nearest Neighbors ◦ Distance Metric / Feature Weights ◦ Number of Neighbors ◦ Weighting Function for the Neighbors ◦ Prediction Function Christian Borgelt Data Mining / Intelligent Data Analysis 369 Multi-layer Perceptrons Christian Borgelt Data Mining / Intelligent Data Analysis 370 Multi-layer Perceptrons • Biological Background • Threshold Logic Units ◦ Definition, Geometric Interpretation, Linear Separability ◦ Training Threshold Logic Units, Limitations ◦ Networks of Threshold Logic Units • Multilayer Perceptrons ◦ Definition of Multilayer Perceptrons ◦ Why Non-linear Activation Functions? ◦ Function Approximation ◦ Training with Gradient Descent ◦ Training Examples and Variants • Summary Christian Borgelt Data Mining / Intelligent Data Analysis 371 Biological Background Diagram of a typical myelinated vertebrate motoneuron (source: Wikipedia, Ruiz-Villarreal 2007), showing the main parts involved in its signaling activity like the dendrites , the axon , and the synapses . Christian Borgelt Data Mining / Intelligent Data Analysis 372

  90. Biological Background Structure of a prototypical biological neuron (simplified) terminal button synapse dendrites cell body nucleus (soma) axon myelin sheath Christian Borgelt Data Mining / Intelligent Data Analysis 373 Biological Background (Very) simplified description of neural information processing • Axon terminal releases chemicals, called neurotransmitters . • These act on the membrane of the receptor dendrite to change its polarization. (The inside is usually 70mV more negative than the outside.) • Decrease in potential difference: excitatory synapse Increase in potential difference: inhibitory synapse • If there is enough net excitatory input, the axon is depolarized. • The resulting action potential travels along the axon. (Speed depends on the degree to which the axon is covered with myelin.) • When the action potential reaches the terminal buttons, it triggers the release of neurotransmitters. Christian Borgelt Data Mining / Intelligent Data Analysis 374 (Personal) Computers versus the Human Brain Personal Computer Human Brain processing units 1 CPU, 2–10 cores 10 10 transistors 1–2 graphics cards/GPUs, 10 3 cores/shaders 10 10 transistors 10 11 neurons 10 10 bytes main memory (RAM) 10 11 neurons storage capacity 10 12 bytes external memory 10 14 synapses 10 − 9 seconds > 10 − 3 seconds processing speed 10 9 operations per second < 1000 per second 10 12 bits/second 10 14 bits/second bandwidth 10 6 per second 10 14 per second neural updates Christian Borgelt Data Mining / Intelligent Data Analysis 375 (Personal) Computers versus the Human Brain • The processing/switching time of a neuron is relatively large ( > 10 − 3 seconds), but updates are computed in parallel. • A serial simulation on a computer takes several hundred clock cycles per update. Advantages of Neural Networks : • High processing speed due to massive parallelism. • Fault Tolerance: Remain functional even if (larger) parts of a network get damaged. • “Graceful Degradation”: gradual degradation of performance if an increasing number of neurons fail. • Well suited for inductive learning (learning from examples, generalization from instances). It appears to be reasonable to try to mimic or to recreate these advantages by constructing artificial neural networks . Christian Borgelt Data Mining / Intelligent Data Analysis 376

  91. Threshold Logic Units A Threshold Logic Unit (TLU) is a processing unit for numbers with n inputs x 1 , . . . , x n and one output y . The unit has a threshold θ and each input x i is associated with a weight w i . A threshold logic unit computes the function  n  �   1 , if w i x i ≥ θ , y = i =1    0 , otherwise. x 1 w 1 y θ w n x n TLUs mimic the thresholding behavior of biological neurons in a (very) simple fashion. Christian Borgelt Data Mining / Intelligent Data Analysis 377 Threshold Logic Units: Geometric Interpretation Threshold logic unit for x 1 ∧ x 2 . x 1 1 3 1 x 2 0 y 4 0 2 x 2 x 1 0 1 Threshold logic unit for x 2 → x 1 . 0 x 1 1 2 1 x 2 y − 1 0 − 2 x 2 x 1 0 1 Christian Borgelt Data Mining / Intelligent Data Analysis 378 Threshold Logic Units: Limitations The biimplication problem x 1 ↔ x 2 : There is no separating line. 1 x 1 x 2 y 0 0 1 x 2 1 0 0 0 1 0 0 1 1 1 x 1 0 1 Formal proof by reductio ad absurdum : since (0 , 0) �→ 1: 0 ≥ θ, (1) since (1 , 0) �→ 0: w 1 < θ, (2) since (0 , 1) �→ 0: w 2 < θ, (3) since (1 , 1) �→ 1: w 1 + w 2 ≥ θ. (4) (2) and (3): w 1 + w 2 < 2 θ . With (4): 2 θ > θ , or θ > 0. Contradiction to (1). Christian Borgelt Data Mining / Intelligent Data Analysis 379 Linear Separability Definition : Two sets of points in a Euclidean space are called linearly separable , iff there exists at least one point, line, plane or hyperplane (depending on the dimension of the Euclidean space), such that all points of the one set lie on one side and all points of the other set lie on the other side of this point, line, plane or hyperplane (or on it). That is, the point sets can be separated by a linear decision function . Formally: R m are linearly separable iff � R m and θ ∈ I Two sets X, Y ⊂ I w ∈ I R exist such that w ⊤ � w ⊤ � ∀ � x ∈ X : ∀ � y ∈ Y : y ≥ θ. � x < θ and � • Boolean functions define two points sets, namely the set of points that are mapped to the function value 0 and the set of points that are mapped to 1. ⇒ The term “linearly separable” can be transferred to Boolean functions. • As we have seen, conjunction and implication are linearly separable (as are disjunction , NAND, NOR etc.). • The biimplication is not linearly separable (and neither is the exclusive or (XOR)). Christian Borgelt Data Mining / Intelligent Data Analysis 380
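A threshold logic unit is a one-liner. The weights w = (3, 2) and threshold θ = 4 used below are one choice that computes the conjunction x1 ∧ x2, consistent with the geometric interpretation above; the printed truth table verifies it.

```python
import numpy as np

def tlu(x, w, theta):
    """Threshold logic unit: output 1 iff the weighted input sum reaches theta."""
    return 1 if np.dot(w, x) >= theta else 0

# x1 AND x2 with w = (3, 2) and theta = 4
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), tlu([x1, x2], [3, 2], 4))   # 1 only for (1, 1)
```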

  92. Linear Separability Definition : A set of points in a Euclidean space is called convex if it is non-empty and connected (that is, if it is a region ) and for every pair of points in it every point on the straight line segment connecting the points of the pair is also in the set. Definition : The convex hull of a set of points X in a Euclidean space is the smallest convex set of points that contains X . Alternatively, the convex hull of a set of points X is the intersection of all convex sets that contain X . Theorem : Two sets of points in a Euclidean space are linearly separable if and only if their convex hulls are disjoint (that is, have no point in common). • For the biimplication problem, the convex hulls are the diagonal line segments. • They share their intersection point and are thus not disjoint. • Therefore the biimplication is not linearly separable. Christian Borgelt Data Mining / Intelligent Data Analysis 381 Threshold Logic Units: Limitations Total number and number of linearly separable Boolean functions (On-Line Encyclopedia of Integer Sequences, oeis.org , A001146 and A000609): inputs Boolean functions linearly separable functions 1 4 4 2 16 14 3 256 104 4 65,536 1,882 5 4,294,967,296 94,572 6 18,446,744,073,709,551,616 15,028,134 2 (2 n ) n no general formula known • For many inputs a threshold logic unit can compute almost no functions. • Networks of threshold logic units are needed to overcome the limitations. Christian Borgelt Data Mining / Intelligent Data Analysis 382 Networks of Threshold Logic Units Solving the biimplication problem with a network. Idea: logical decomposition x 1 ↔ x 2 ≡ ( x 1 → x 2 ) ∧ ( x 2 → x 1 ) computes y 1 = x 1 → x 2 − 2 x 1 − 1 computes y = y 1 ∧ y 2 2 2 y = x 1 ↔ x 2 3 2 2 x 2 − 1 − 2 computes y 2 = x 2 → x 1 Christian Borgelt Data Mining / Intelligent Data Analysis 383 Networks of Threshold Logic Units Solving the biimplication problem: Geometric interpretation g 2 0 1 g 3 g 1 1 ac d c 1 0 1 1 0 b = ⇒ y 2 x 2 a b 0 0 d x 1 y 1 0 1 0 1 • The first layer computes new Boolean coordinates for the points. • After the coordinate transformation the problem is linearly separable. Christian Borgelt Data Mining / Intelligent Data Analysis 384
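The two-layer network that solves the biimplication can be checked directly: the hidden units compute the two implications and the output unit their conjunction, with the weights and thresholds shown in the network above.

```python
import numpy as np

def tlu(x, w, theta):
    return 1 if np.dot(w, x) >= theta else 0

def biimplication(x1, x2):
    y1 = tlu([x1, x2], [-2, 2], -1)   # hidden unit: y1 = x1 -> x2
    y2 = tlu([x1, x2], [2, -2], -1)   # hidden unit: y2 = x2 -> x1
    return tlu([y1, y2], [2, 2], 3)   # output unit: y = y1 AND y2

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), biimplication(x1, x2))   # 1 exactly when x1 == x2
```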

  93. Representing Arbitrary Boolean Functions

Algorithm: Let y = f(x_1, ..., x_n) be a Boolean function of n variables.

(i) Represent the given function f(x_1, ..., x_n) in disjunctive normal form. That is, determine D_f = C_1 ∨ ... ∨ C_m, where all C_j are conjunctions of n literals, that is, C_j = l_{j1} ∧ ... ∧ l_{jn} with l_{ji} = x_i (positive literal) or l_{ji} = ¬x_i (negative literal).

(ii) Create a neuron for each conjunction C_j of the disjunctive normal form (having n inputs — one input for each variable), where

    w_{ji} = +2, if l_{ji} = x_i,
    w_{ji} = -2, if l_{ji} = ¬x_i,
    and θ_j = n - 1 + (1/2) Σ_{i=1}^n w_{ji}.

(iii) Create an output neuron (having m inputs — one input for each neuron that was created in step (ii)), where

    w_{(n+1)k} = 2, k = 1, ..., m,   and   θ_{n+1} = 1.

Remark: The weights are set to ±2 instead of ±1 in order to ensure integer thresholds.

Christian Borgelt Data Mining / Intelligent Data Analysis 385

Representing Arbitrary Boolean Functions

Example: ternary Boolean function:

    x_1  x_2  x_3  y    C_j
     0    0    0   0
     1    0    0   1    C_1 = x_1 ∧ ¬x_2 ∧ ¬x_3
     0    1    0   0
     1    1    0   0
     0    0    1   0
     1    0    1   0
     0    1    1   1    C_2 = ¬x_1 ∧ x_2 ∧ x_3
     1    1    1   1    C_3 = x_1 ∧ x_2 ∧ x_3

First layer (conjunctions): C_1, C_2, C_3 as above.
Second layer (disjunction): D_f = C_1 ∨ C_2 ∨ C_3.

One conjunction for each row where the output y is 1, with literals according to the input values.

Christian Borgelt Data Mining / Intelligent Data Analysis 386

Representing Arbitrary Boolean Functions

Example (continued): Resulting network of threshold logic units:

[Figure: three first-layer units, one per conjunction, each connected to all inputs x_1, x_2, x_3:
  C_1 = x_1 ∧ ¬x_2 ∧ ¬x_3 with weights (+2, -2, -2) and threshold 1,
  C_2 = ¬x_1 ∧ x_2 ∧ x_3 with weights (-2, +2, +2) and threshold 3,
  C_3 = x_1 ∧ x_2 ∧ x_3 with weights (+2, +2, +2) and threshold 5;
an output unit computes D_f = C_1 ∨ C_2 ∨ C_3 with weights (2, 2, 2) and threshold 1.]

One conjunction for each row where the output y is 1, with literals according to the input values.

Christian Borgelt Data Mining / Intelligent Data Analysis 387

Training Threshold Logic Units

• Geometric interpretation provides a way to construct threshold logic units with 2 and 3 inputs, but:
  ◦ It is not an automatic method (human visualization is needed).
  ◦ It is not feasible for more than 3 inputs.
• General idea of automatic training:
  ◦ Start with random values for the weights and the threshold.
  ◦ Determine the error of the output for a set of training patterns.
  ◦ The error is a function of the weights and the threshold: e = e(w_1, ..., w_n, θ).
  ◦ Adapt the weights and the threshold so that the error becomes smaller.
  ◦ Iterate the adaptation until the error vanishes.

Christian Borgelt Data Mining / Intelligent Data Analysis 388
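The construction in steps (i)–(iii) is mechanical enough to automate. Below is a small Python sketch (not from the slides) that builds the first-layer weights and thresholds from the rows of a truth table on which the function is 1, and verifies the resulting two-layer network on the ternary example above; the helper names dnf_network and tlu_output are made up for illustration.

```python
def tlu_output(weights, threshold, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def dnf_network(true_rows, n):
    """Build (weights, threshold) pairs for the conjunction neurons,
    one per input row on which the function is 1 (cf. slide 385)."""
    layer1 = []
    for row in true_rows:
        w = [2 if bit == 1 else -2 for bit in row]   # +2 for x_i, -2 for NOT x_i
        theta = n - 1 + sum(w) / 2                   # integer because weights are even
        layer1.append((w, theta))
    return layer1

# Ternary example of slide 386: f is 1 exactly for (1,0,0), (0,1,1), (1,1,1).
true_rows = [(1, 0, 0), (0, 1, 1), (1, 1, 1)]
layer1 = dnf_network(true_rows, n=3)

def f_net(x):
    hidden = [tlu_output(w, th, x) for w, th in layer1]   # conjunctions C_j
    return tlu_output([2] * len(hidden), 1, hidden)       # disjunction D_f

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            assert f_net((x1, x2, x3)) == int((x1, x2, x3) in true_rows)
```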

  94. Training Threshold Logic Units: Delta Rule

Formal Training Rule: Let x = (x_1, ..., x_n)^⊤ be an input vector of a threshold logic unit, o the desired output for this input vector and y the actual output of the threshold logic unit. If y ≠ o, then the threshold θ and the weight vector w = (w_1, ..., w_n)^⊤ are adapted as follows in order to reduce the error:

    θ^(new) = θ^(old) + Δθ   with   Δθ = -η (o - y),
    ∀ i ∈ {1, ..., n}: w_i^(new) = w_i^(old) + Δw_i   with   Δw_i = η (o - y) x_i,

where η is a parameter that is called learning rate. It determines the severity of the weight changes. This procedure is called Delta Rule or Widrow–Hoff Procedure [Widrow and Hoff 1960].

• Online Training: Adapt the parameters after each training pattern.
• Batch Training: Adapt the parameters only at the end of each epoch, that is, after a traversal of all training patterns.

Christian Borgelt Data Mining / Intelligent Data Analysis 389

Training Threshold Logic Units: Convergence

Convergence Theorem: Let L = {(x_1, o_1), ..., (x_m, o_m)} be a set of training patterns, each consisting of an input vector x_i ∈ ℝ^n and a desired output o_i ∈ {0, 1}. Furthermore, let L_0 = {(x, o) ∈ L | o = 0} and L_1 = {(x, o) ∈ L | o = 1}. If L_0 and L_1 are linearly separable, that is, if w ∈ ℝ^n and θ ∈ ℝ exist such that

    ∀ (x, 0) ∈ L_0: w^⊤ x < θ   and   ∀ (x, 1) ∈ L_1: w^⊤ x ≥ θ,

then online as well as batch training terminate.

• The algorithms terminate only when the error vanishes.
• Therefore the resulting threshold and weights must solve the problem.
• For problems that are not linearly separable the algorithms do not terminate (oscillation, repeated computation of the same non-solving w and θ).

Christian Borgelt Data Mining / Intelligent Data Analysis 390

Training Threshold Logic Units: Delta Rule

Turning the threshold value into a weight: A fixed extra input x_0 = +1 with weight w_0 = -θ replaces the explicit threshold, since

    Σ_{i=1}^n w_i x_i ≥ θ   ⟺   Σ_{i=1}^n w_i x_i - θ ≥ 0.

[Figure: the same unit drawn twice — once with threshold θ and inputs x_1, ..., x_n, once with threshold 0 and the additional constant input x_0 = +1 weighted with w_0 = -θ.]

Christian Borgelt Data Mining / Intelligent Data Analysis 391

Training Threshold Logic Units: Delta Rule

Formal Training Rule (with the threshold turned into a weight): Let x = (x_0 = 1, x_1, ..., x_n)^⊤ be an (extended) input vector of a threshold logic unit, o the desired output for this input vector and y the actual output of the threshold logic unit. If y ≠ o, then the (extended) weight vector w = (w_0 = -θ, w_1, ..., w_n)^⊤ is adapted as follows in order to reduce the error:

    ∀ i ∈ {0, ..., n}: w_i^(new) = w_i^(old) + Δw_i   with   Δw_i = η (o - y) x_i,

where η is a parameter that is called learning rate. It determines the severity of the weight changes. This procedure is called Delta Rule or Widrow–Hoff Procedure [Widrow and Hoff 1960].

• Note that with extended input and weight vectors there is only one update rule (no distinction of threshold and weights).
• Note also that the (extended) input vector may be x = (x_0 = -1, x_1, ..., x_n)^⊤ with the corresponding (extended) weight vector w = (w_0 = +θ, w_1, ..., w_n)^⊤.

Christian Borgelt Data Mining / Intelligent Data Analysis 392
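As a concrete illustration of the update rule with the threshold folded into the weight vector, here is a minimal Python sketch of online training (not taken from the slides); the function name, the training set for the conjunction, the learning rate η = 1, the all-zero starting weights and the epoch limit are arbitrary choices for the example.

```python
def train_tlu_online(patterns, n, eta=1.0, max_epochs=100):
    """Online delta-rule training of a TLU with extended weight vector
    w = (w0 = -theta, w1, ..., wn) and constant extra input x0 = 1."""
    w = [0.0] * (n + 1)                       # arbitrary start (here: all zeros)
    for _ in range(max_epochs):
        error_free = True
        for x, o in patterns:
            ext = (1.0,) + tuple(x)           # extended input vector
            y = 1 if sum(wi * xi for wi, xi in zip(w, ext)) >= 0 else 0
            if y != o:                        # adapt only when output is wrong
                error_free = False
                w = [wi + eta * (o - y) * xi for wi, xi in zip(w, ext)]
        if error_free:                        # error vanished -> terminate
            break
    return w

# Conjunction x1 AND x2 is linearly separable, so training terminates.
patterns = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]
w = train_tlu_online(patterns, n=2)
print("w0 = -theta:", w[0], " weights:", w[1:])
```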

  95. Training Networks of Threshold Logic Units

• Single threshold logic units have strong limitations: they can only compute linearly separable functions.
• Networks of threshold logic units can compute arbitrary Boolean functions.
• Training single threshold logic units with the delta rule is easy and fast and guaranteed to find a solution if one exists.
• Networks of threshold logic units cannot be trained (with the delta rule), because
  ◦ there are no desired values for the neurons of the first layer(s),
  ◦ the problem can usually be solved with several different functions computed by the neurons of the first layer(s) (non-unique solution).
• When this situation became clear, neural networks were first seen as a "research dead end".

Christian Borgelt Data Mining / Intelligent Data Analysis 393

General Neural Networks

Basic graph theoretic notions:

A (directed) graph is a pair G = (V, E) consisting of a (finite) set V of vertices or nodes and a (finite) set E ⊆ V × V of edges. We call an edge e = (u, v) ∈ E directed from vertex u to vertex v.

Let G = (V, E) be a (directed) graph and u ∈ V a vertex. Then the vertices of the set
    pred(u) = {v ∈ V | (v, u) ∈ E}
are called the predecessors of the vertex u and the vertices of the set
    succ(u) = {v ∈ V | (u, v) ∈ E}
are called the successors of the vertex u.

Christian Borgelt Data Mining / Intelligent Data Analysis 394

General Neural Networks

General definition of a neural network:

An (artificial) neural network is a (directed) graph G = (U, C), whose vertices u ∈ U are called neurons or units and whose edges c ∈ C are called connections. The set U of vertices is partitioned into

• the set U_in of input neurons,
• the set U_out of output neurons, and
• the set U_hidden of hidden neurons.

It is U = U_in ∪ U_out ∪ U_hidden, with U_in ≠ ∅, U_out ≠ ∅, and U_hidden ∩ (U_in ∪ U_out) = ∅.

Christian Borgelt Data Mining / Intelligent Data Analysis 395

General Neural Networks

Each connection (v, u) ∈ C possesses a weight w_uv and each neuron u ∈ U possesses three (real-valued) state variables:

• the network input net_u,
• the activation act_u, and
• the output out_u.

Each input neuron u ∈ U_in also possesses a fourth (real-valued) state variable,

• the external input ext_u.

Furthermore, each neuron u ∈ U possesses three functions:

• the network input function f_net^(u): ℝ^(2|pred(u)| + κ_1(u)) → ℝ,
• the activation function f_act^(u): ℝ^(κ_2(u)) → ℝ, and
• the output function f_out^(u): ℝ → ℝ,

which are used to compute the values of the state variables.

Christian Borgelt Data Mining / Intelligent Data Analysis 396
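The graph-based definition can be mirrored almost literally in a data structure. The following Python sketch is purely illustrative (not part of the slides): the class name NeuralGraph and its methods are made up, and the nested dict weights[u][v] stores w_uv for the connection (v, u), so that pred(u) falls out of it directly.

```python
from collections import defaultdict

class NeuralGraph:
    """Bare-bones container for the graph G = (U, C) of slide 395:
    units are partitioned into input, hidden and output neurons, and
    each connection (v, u) carries a weight w_uv."""

    def __init__(self, inputs, hidden, outputs):
        self.U_in, self.U_hidden, self.U_out = set(inputs), set(hidden), set(outputs)
        self.weights = defaultdict(dict)      # weights[u][v] = w_uv for (v, u) in C

    def connect(self, v, u, w_uv):
        self.weights[u][v] = w_uv             # edge directed from v to u

    def pred(self, u):
        """Predecessors of u: all v with (v, u) in C."""
        return set(self.weights[u].keys())

# Tiny example: 2 inputs, 2 hidden units, 1 output (structure of slide 383).
net = NeuralGraph(inputs=["x1", "x2"], hidden=["h1", "h2"], outputs=["y"])
net.connect("x1", "h1", -2); net.connect("x2", "h1", 2)
net.connect("x1", "h2", 2);  net.connect("x2", "h2", -2)
net.connect("h1", "y", 2);   net.connect("h2", "y", 2)
print(net.pred("y"))   # {'h1', 'h2'}
```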

  96. Structure of a Generalized Neuron

A generalized neuron is a simple numeric processor.

[Figure: the inputs in_{uv_1} = out_{v_1}, ..., in_{uv_n} = out_{v_n} enter the neuron u with weights w_{uv_1}, ..., w_{uv_n} and are combined by the network input function f_net^(u) (with parameters σ_1, ..., σ_l) into the network input net_u; the activation function f_act^(u) (with parameters θ_1, ..., θ_k and, for input neurons, the external input ext_u) computes the activation act_u; the output function f_out^(u) computes the output out_u, which is passed on to the successor neurons.]

Christian Borgelt Data Mining / Intelligent Data Analysis 397

General Neural Networks

Types of (artificial) neural networks:

• If the graph of a neural network is acyclic, it is called a feed-forward network.
• If the graph of a neural network contains cycles (backward connections), it is called a recurrent network.

Representation of the connection weights as a matrix (rows and columns indexed by the neurons u_1, ..., u_r):

            u_1          u_2         ...   u_r
    u_1  ( w_{u_1 u_1}  w_{u_1 u_2}  ...  w_{u_1 u_r} )
    u_2  ( w_{u_2 u_1}  w_{u_2 u_2}  ...  w_{u_2 u_r} )
    ...
    u_r  ( w_{u_r u_1}  w_{u_r u_2}  ...  w_{u_r u_r} )

Christian Borgelt Data Mining / Intelligent Data Analysis 398

Multi-layer Perceptrons

An r-layer perceptron is a neural network with a graph G = (U, C) that satisfies the following conditions:

(i) U_in ∩ U_out = ∅,

(ii) U_hidden = U_hidden^(1) ∪ ... ∪ U_hidden^(r-2), and ∀ 1 ≤ i < j ≤ r - 2: U_hidden^(i) ∩ U_hidden^(j) = ∅,

(iii) C ⊆ (U_in × U_hidden^(1)) ∪ (⋃_{i=1}^{r-3} U_hidden^(i) × U_hidden^(i+1)) ∪ (U_hidden^(r-2) × U_out),

or, if there are no hidden neurons (r = 2, U_hidden = ∅), C ⊆ U_in × U_out.

• A multi-layer perceptron is a feed-forward network with a strictly layered structure.

Christian Borgelt Data Mining / Intelligent Data Analysis 399

Multi-layer Perceptrons

General structure of a multi-layer perceptron:

[Figure: the inputs x_1, x_2, ..., x_n feed the input layer U_in, which is followed by the hidden layers U_hidden^(1), U_hidden^(2), ..., U_hidden^(r-2) and finally the output layer U_out, which produces the outputs y_1, y_2, ..., y_m.]

Christian Borgelt Data Mining / Intelligent Data Analysis 400
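To tie the strictly layered structure to the threshold logic units used throughout this chapter, here is a small Python sketch (not from the slides) of a forward pass through a layered perceptron, where each layer is given by a weight matrix and a threshold vector; the function names are made up, and the example simply recasts the biimplication network as a 3-layer perceptron (one hidden layer).

```python
def layer_forward(weight_matrix, thresholds, inputs):
    """One layer of threshold logic units: row i of weight_matrix holds the
    weights of unit i, thresholds[i] its threshold."""
    return [1 if sum(w * x for w, x in zip(row, inputs)) >= th else 0
            for row, th in zip(weight_matrix, thresholds)]

def mlp_forward(layers, inputs):
    """Forward pass through a strictly layered perceptron:
    the output of each layer is the input of the next."""
    for weight_matrix, thresholds in layers:
        inputs = layer_forward(weight_matrix, thresholds, inputs)
    return inputs

# Biimplication as a 3-layer perceptron (2 inputs, 2 hidden units, 1 output).
layers = [
    ([[-2, 2], [2, -2]], [-1, -1]),   # hidden layer: x1 -> x2 and x2 -> x1
    ([[2, 2]], [3]),                  # output layer: conjunction of the two
]
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mlp_forward(layers, [x1, x2]))   # output [1] iff x1 == x2
```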
