
Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering
Christian Röver and Gero Szepannek
Fachbereich Statistik, Universität Dortmund
roever@statistik.uni-dortmund.de, gero.szepannek@web.de
March 11, 2004

Overview
1. the problem
2. tackling the problem / methods
3. application to Dortmund data
4. conclusions

The Problem
• given: huge dataset (many variables)
• wanted: grouping of observations (clusters)
• reduce dimensionality to
  – avoid overfitting
  – exclude noise and redundant variables
  – keep data perceptible and interpretable
• use variable subsets (instead of, e.g., linear combinations) for interpretability
➜ what is the optimal subset of variables?

Quality requirements
• needed: comparable quality measure for variable subsets of
  – different scales and
  – varying subset size
• restriction: variable subset should be representative of complete data
➜ quality measure?
➜ what makes a variable subset representative?

Quality measure
• focus on fuzzy clustering: no fixed cluster assignments, but membership scores:

  Observation   Cluster 1   Cluster 2   Cluster 3
  1             0.95        0.02        0.03
  2             0.50        0.30        0.20
  ...           ...         ...         ...

• compute a measure from membership matrix U

• classification entropy:
  $$\mathrm{CE}(U) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} u_{ij} \cdot \log_2 u_{ij}$$
• CE(U) = 0 if all $u_{ij} \in \{0, 1\}$ (most crisp partitioning)
• CE(U) greatest if all $u_{ij} = \frac{1}{k}$ (fuzziest partitioning)
• minimize CE(U) for 'optimal' subset
• number of clusters (k) was fixed and model-based clustering¹ (fitting of a normal mixture model to data) was applied

¹ Fraley, C. and Raftery, A.E. (2002): mclust: Software for model-based clustering, density estimation and discriminant analysis. Technical Report, Department of Statistics, University of Washington. See http://www.stat.washington.edu/mclust.
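The classification entropy can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the function name `classification_entropy` is made up for illustration, terms with a membership of 0 are taken to contribute 0 (the limit of u · log u), and only the two rows shown in the membership table are used as example data.

```python
import math

def classification_entropy(U):
    """CE(U) = -(1/N) * sum_i sum_j u_ij * log2(u_ij).

    U is a list of N rows (one per observation), each holding k membership
    scores. Entries equal to 0 are skipped, i.e. 0 * log2(0) is taken as 0.
    """
    total = 0.0
    for row in U:
        for u in row:
            if u > 0.0:
                total -= u * math.log2(u)
    return total / len(U)

# The two example rows from the membership table.
U_example = [[0.95, 0.02, 0.03],
             [0.50, 0.30, 0.20]]

# Reference cases for k = 3 clusters.
crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # every u_ij in {0, 1}
fuzziest = [[1 / 3, 1 / 3, 1 / 3]] * 2           # every u_ij = 1/k

print("crisp:   ", classification_entropy(crisp))
print("fuzziest:", classification_entropy(fuzziest), "vs log2(k) =", math.log2(3))
print("example: ", classification_entropy(U_example))
```

For k clusters the measure ranges from 0 (crisp) to log₂ k (all memberships equal to 1/k), so values are only comparable between subsets when k is held fixed, as it was here.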

Representativeness
• variable subset should reflect certain aspects of data
• define subgroups of variables having to appear in a subset
  – manually (by meaning) or
  – systematically
• systematic selection: groups of correlated variables
• motivation: subgroups have a common source of variability; by picking from different groups, different sources are covered

• cluster variables by their correlation
• define distance between variables: $d(X, Y) = 1 - |\mathrm{Cor}(X, Y)|$
• apply agglomerative hierarchical clustering
• complete linkage: (absolute) correlation within group is bounded below
• single linkage: correlation between groups is bounded above
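The complete-linkage bound can be illustrated with a small, deliberately naive sketch in plain Python. It is not the authors' implementation: the names `correlation` and `group_variables`, the parameter `min_abs_cor` and the toy data are assumptions made for this example. Groups are merged only while the largest distance d = 1 − |Cor| inside the merged group stays below the cut-off, so every pair of variables within a group has absolute correlation of at least `min_abs_cor`.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_variables(columns, min_abs_cor):
    """Agglomerative complete-linkage clustering of variables.

    columns: dict mapping a variable name to its list of values.
    Merging stops once the closest pair of groups is further apart than
    1 - min_abs_cor in complete-linkage distance.
    """
    names = list(columns)
    d = {(a, b): 1.0 - abs(correlation(columns[a], columns[b]))
         for a in names for b in names if a != b}
    threshold = 1.0 - min_abs_cor
    groups = [[name] for name in names]

    def complete_distance(g, h):
        # Complete linkage: distance of the least similar pair.
        return max(d[a, b] for a in g for b in h)

    while len(groups) > 1:
        dist, i, j = min((complete_distance(g, h), i, j)
                         for i, g in enumerate(groups)
                         for j, h in enumerate(groups) if i < j)
        if dist > threshold:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# Hypothetical toy variables (not from the Dortmund data).
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],    # nearly proportional to a
    "c": [5, 4, 3, 2, 1],     # mirror image of a
    "d": [1, -1, 1, -1, 1],   # unrelated to a and c
}
print(group_variables(toy, min_abs_cor=0.9))
```

Replacing `max` with `min` in `complete_distance` would give single linkage, where the cut-off instead bounds the correlation between different groups from above. For real data sets a library routine (for instance SciPy's hierarchical clustering) would be the more practical choice.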

Optimization
• problem: minimize function f : M → ℝ where M has varying dimension and further restrictions
• use genetic optimization algorithm (applies principle of survival of the fittest):

  fitness                ↔  objective function
  genome                 ↔  variable subset
  mutation               ↔  change in subset
  recombination          ↔  combination of 2 subsets
  selection (survival)   ↔  comparison by objective function
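The correspondence above can be made concrete with a minimal genetic-algorithm sketch in plain Python. The slides do not specify the operators, so everything here is an assumption for illustration: the function names, the mutation and recombination rules, the elitist selection, the toy variable names and the `noise` scores. In particular `toy_objective` is only a stand-in for the real objective, which would be CE(U) of a clustering fitted on the chosen variables.

```python
import random

def feasible(subset, groups, required, max_size):
    """Restrictions: bounded size and at least one variable from every required group."""
    return 0 < len(subset) <= max_size and required <= {groups[v] for v in subset}

def random_subset(groups, required, max_size, rng):
    """One variable per required group plus a random number of extras."""
    subset = {rng.choice([v for v in groups if groups[v] == g]) for g in sorted(required)}
    extras = [v for v in groups if v not in subset]
    rng.shuffle(extras)
    return frozenset(subset | set(extras[:rng.randint(0, max_size - len(subset))]))

def mutate(subset, groups, required, max_size, rng):
    """Mutation = change in subset: add, drop or swap one variable."""
    for _ in range(20):                      # retry until a feasible change is found
        candidate = set(subset)
        move = rng.choice(["add", "drop", "swap"])
        if move in ("drop", "swap"):
            candidate.discard(rng.choice(sorted(candidate)))
        if move in ("add", "swap"):
            candidate.add(rng.choice(sorted(groups)))
        if feasible(candidate, groups, required, max_size):
            return frozenset(candidate)
    return subset

def recombine(a, b, groups, required, max_size, rng):
    """Recombination = combination of 2 subsets: inherit each variable with probability 1/2."""
    for _ in range(20):
        child = {v for v in sorted(a | b) if rng.random() < 0.5}
        if feasible(child, groups, required, max_size):
            return frozenset(child)
    return a

def evolve(objective, groups, required, max_size, pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    population = [random_subset(groups, required, max_size, rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            child = recombine(a, b, groups, required, max_size, rng)
            children.append(mutate(child, groups, required, max_size, rng))
        # Selection (survival): keep the subsets with the lowest objective value.
        population = sorted(population + children, key=objective)[:pop_size]
    return population[0]

# Hypothetical variables with group labels and made-up "noise" scores.
groups = {"age_0_6": "i", "age_60_65": "i", "birth_rate": "ii", "moves_in": "ii",
          "cars": "iii", "flats_per_house": "iv", "people_per_flat": "iv",
          "welfare": "v", "employment": "v"}
noise = {"age_0_6": 0.9, "age_60_65": 0.4, "birth_rate": 0.8, "moves_in": 0.3,
         "cars": 0.7, "flats_per_house": 0.2, "people_per_flat": 0.5,
         "welfare": 0.1, "employment": 0.6}
required = {"i", "ii", "iv", "v"}

def toy_objective(subset):
    # Stand-in for CE(U): mean of the made-up noise scores (lower is better).
    return sum(noise[v] for v in sorted(subset)) / len(subset)

best = evolve(toy_objective, groups, required, max_size=6)
print(sorted(best), toy_objective(best))
```

Infeasible candidates are simply rejected here, so every subset in the population always satisfies the restrictions; a penalty term in the objective would be an alternative design.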

Procedure
given: set of variables
  ↓
define: subgroups
  ↓
search: optimal composition out of subgroups
  ↓
return: best subgroup found

Application to Dortmund data
• raw data: 200 variables, 170 observations (subdistricts)
• constructed data set of 57 (scaled) variables
• 12 observations were considered outliers, e.g. districts containing
  – horse race track
  – steel plant being dismantled
  – university
  – ...
• systematic selection of variable subgroups proved to be impractical: either huge numbers of variable groups or correlation bounds of insignificant order

[Figure: Clustering of variables by correlation (complete linkage). Dendrogram of the 57 variables; vertical axis: (absolute) correlation, from 1 to 0.]

• variable groups:
  i. age distribution
  ii. births, deaths, migration
  iii. motoring
  iv. buildings, housing
  v. employment, welfare
  vi. some of above broken down by sex etc.
• final variable subset shall represent groups i, ii, iv and v and have at most 6 variables
• data exploration suggests presence of 4 clusters

Results
• variable set and cluster means:

  Variable                                    Group   Cluster 1   Cluster 2   Cluster 3   Cluster 4
  fraction of population of age 60–65         i.      0.065       0.064       0.057       0.083
  moves to district per inhabitant            ii.     0.054       0.035       0.075       0.025
  apartments per house                        iv.     7.831       5.331       3.367       2.524
  people per apartment                        iv.     1.877       2.029       1.676       2.216
  fraction of welfare recipients              v.      0.129       0.031       0.066       0.023
  fraction of immigrants of employed people   vi.     0.274       0.073       0.086       0.032

  (the slide highlights minimum and maximum values)

[Figure: Fuzziness (cluster 4); scale from 0.0 to 1.0.]

[Figure: Spatial distribution of the 4 clusters; legend: clusters 1 to 4.]

• cluster 1 (center N) is most different from cluster 4 (suburbs SE): cluster 1 has
  – few old inhabitants
  – many immigrants
  – many welfare recipients
  – much migration
  – many apartments per house
  while cluster 4 takes opposite extreme values
• clusters 2 and 3 lie mostly between these extremes and differ by their housing situation: cluster 3 (suburbs NW) has
  – fewer apartments per house
  – most people per apartment
  while cluster 2 (center S) has the fewest people per apartment.

Conclusions
➜ variable selection problem was expressed as a minimization problem by introducing a quality measure and certain restrictions
➜ an appropriate optimization algorithm was utilized to search for an optimal subset
➜ automatic generation of restrictions proved to be impractical for Dortmund data
➜ variable selection worked well, resulted in an interpretable variable set
