Binary attributes quantification with external information

Alfonso Iodice D’Enza
Università di Cassino (Italy)
iodicede@gmail.com

The R User Conference 2009, July 8-10, Agrocampus-Ouest, Rennes, France

Outline

1. Introduction
   - Importance of Binary data
2. Study of association
   - Association Rules: Support and Confidence
   - Open Issues in AR Mining
   - Binary data coding
3. Quantification of binary attributes
   - Advantages in attributes quantification
   - A suitable quantification
   - NSCA-based approaches
   - Problem statement
   - Exogenous vs Endogenous information
   - Related work
   - Exploited R functions
4. Applications on real world data set
   - The UniMC data

Importance of Binary data

Relevance of Binary Data

During the past decade, attention to binary data has increased quickly. Several reasons explain this interest; among others, binary data can be easily collected, stored and managed.

Applications in several fields:
- Gene expression data
- Text mining
- Web click-stream analysis
- Transactional databases

Association Rules: Support and Confidence

Association Rules: a short reminder

Consider a pair of attributes (or sets of attributes) A and B. A simple association rule based on these attributes is:

If A → B = { support = .2, confidence = .8 }

- Support: 20% of the sequences contain both items A and B;
- Confidence: 80% of the sequences containing item A also contain item B.

Interpretation
- the support measures the intensity of the association between A and B
- the confidence measures the strength of the logical dependence between A and B

Association rules can be easily generalised to itemsets with cardinality > 2.
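As a small illustration of the two measures, the sketch below computes support and confidence on made-up data chosen so that the rule A → B has support .2 and confidence .8. The function names `support` and `confidence` and the toy rows are assumptions for this example, not part of the talk.

```python
def support(rows, items):
    """Fraction of sequences that contain every item in `items`."""
    hits = sum(1 for row in rows if all(row[i] == 1 for i in items))
    return hits / len(rows)


def confidence(rows, antecedent, consequent):
    """Fraction of the sequences containing `antecedent` that also contain `consequent`.

    Assumes at least one sequence contains the antecedent.
    """
    return support(rows, antecedent + consequent) / support(rows, antecedent)


# Toy data (invented): 20 sequences over the two items A and B.
rows = (
    [{"A": 1, "B": 1}] * 4     # A and B together
    + [{"A": 1, "B": 0}] * 1   # A without B
    + [{"A": 0, "B": 1}] * 5   # B without A
    + [{"A": 0, "B": 0}] * 10  # neither item
)

sup = support(rows, ["A", "B"])        # 4 of 20 sequences
conf = confidence(rows, ["A"], ["B"])  # 4 of the 5 sequences with A
print(f"support = {sup:.2f}, confidence = {conf:.2f}")
```

Passing longer item lists to the same functions covers itemsets with cardinality greater than 2.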

Open Issues in AR Mining

Association Rules

AR mining is an NP problem. With large databases it soon becomes infeasible, because the number of rules increases exponentially. This causes:
- computational issues (not serious)
- interpretation difficulties (serious)
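To give a sense of the growth, the sketch below counts the candidate rules X → Y over P items, where X and Y are non-empty and disjoint. The closed form 3^P − 2^(P+1) + 1 is a standard counting argument (each item goes to X, to Y, or to neither) and is not taken from the slides; the function names are assumptions for this example.

```python
from itertools import product


def rule_count(p):
    """Closed-form number of rules X -> Y with X and Y non-empty and disjoint."""
    return 3**p - 2 ** (p + 1) + 1


def rule_count_brute_force(p):
    """Count the same rules by enumerating every assignment of p items
    to the antecedent ("X"), the consequent ("Y") or neither ("-")."""
    count = 0
    for assignment in product(("X", "Y", "-"), repeat=p):
        if "X" in assignment and "Y" in assignment:
            count += 1
    return count


for p in (2, 3, 5, 10, 20):
    print(p, rule_count(p))
```

The brute-force version is only practical for small P, but it is a convenient check on the closed form.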

Open Issues in AR Mining

Association study approaches

Brute force approach
- ARs with high or very high support are considered trivial rules and are discarded
- ARs with low support are considered uninteresting rules and are discarded
- defining the thresholds is a delicate problem:
  - loose thresholds produce a huge amount of output
  - tight thresholds may discard interesting association patterns

Trojan horse approach
An alternative approach is to mine ARs within homogeneous groups of items and/or sequences. Homogeneous subsets can be defined through:
- an exogenous criterion: groups are defined according to an external categorical variable
- an endogenous criterion: groups are defined via a suitable cluster analysis of the sequences
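A minimal sketch of both ideas on invented data: pairwise supports are first filtered with a lower and an upper threshold (the brute force approach), then recomputed within groups given by an external label (the exogenous criterion). The helper names, the thresholds and the toy rows are all assumptions for illustration; they are not the procedure used in the talk.

```python
from collections import defaultdict
from itertools import combinations


def pair_supports(rows, items):
    """Support of every pair of items: share of sequences containing both."""
    n = len(rows)
    return {
        (a, b): sum(1 for r in rows if r[a] == 1 and r[b] == 1) / n
        for a, b in combinations(items, 2)
    }


def keep_between(supports, low, high):
    """Brute-force filter: drop pairs with support below `low` or above `high`."""
    return {pair: s for pair, s in supports.items() if low <= s <= high}


def supports_by_group(rows, labels, items):
    """Exogenous criterion: compute pair supports separately within each group."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: pair_supports(members, items) for label, members in groups.items()}


# Toy data (invented): four sequences, three items, one external label each.
items = ["A", "B", "C"]
rows = [
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
]
labels = ["g1", "g1", "g2", "g2"]

overall = pair_supports(rows, items)
filtered = keep_between(overall, low=0.2, high=0.9)
grouped = supports_by_group(rows, labels, items)
print(filtered)
print(grouped)
```

In this toy example the pair (B, C) has a modest overall support but a higher one inside group g2, which is the kind of pattern that mining within homogeneous groups is meant to reveal.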

Binary data coding

Data structures

A multivariate data set is given by a set of n statistical units, named sequences. Each sequence is defined by a set of binary variables $\{I_1, I_2, \ldots, I_P\}$, which are called attributes or items. Binary variables can take values only in $\{0, 1\}$.

To arrange these data, two possibilities exist. The first is the presence/absence matrix S with n rows and P columns:

        I_1   I_2   ...   I_P
  1      0     1    ...    1
  2      1     1    ...    0
 ...    ...   ...   ...   ...
  n      1     0    ...    1

The second possibility is the disjunctive coded matrix Z with n rows and 2P columns, in which each item $I_j$ is followed by a column $\bar{I}_j$ marking its absence:

        I_1   Ī_1   I_2   Ī_2   ...   I_P   Ī_P
  1      0     1     1     0    ...    1     0
  2      1     0     1     0    ...    0     1
 ...    ...   ...   ...   ...   ...   ...   ...
  n      1     0     0     1    ...    0     1
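The step from a presence/absence matrix to its disjunctive coding can be sketched as follows. The function name `disjunctive_coding` and the small matrix are assumptions for this example; each item column is followed by its complement, which is one natural column ordering among several possible ones.

```python
def disjunctive_coding(S):
    """Return the disjunctive coded matrix with 2P columns.

    Each 0/1 value in S becomes a pair (value, 1 - value), so every item
    contributes a presence column followed by an absence column.
    """
    Z = []
    for row in S:
        z_row = []
        for value in row:
            z_row.extend([value, 1 - value])
        Z.append(z_row)
    return Z


# Toy presence/absence matrix (invented): n = 3 sequences, P = 3 items.
S = [
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
Z = disjunctive_coding(S)
for z_row in Z:
    print(z_row)
```

Every row of the coded matrix sums to P, since exactly one column of each presence/absence pair equals 1.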

Binary data coding

Association measures: a different point of view

The complete disjunctive coding of binary data turns out to be extremely useful when defining association measures. Consider two generic items of the matrix Z, $Z_j$ and $Z_{j'}$ (with $j, j' = 1, 2, \ldots, P$), each taken as its pair of presence and absence columns. The product $Z_j^{\top} Z_{j'}$ determines the following 2 × 2 matrix:

$$D = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

- a is the number of co-presences
- b and c are the numbers of non-matchings
- d is the number of co-absences

Using the set {a, b, c, d} it is possible to define all the dissimilarity/similarity measures for binary data. The tuple {a, b, c, d} can also be used to compute support, confidence and all of the AR interestingness measures (see [7] for a detailed overview).
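The 2 × 2 table and a few measures derived from it can be sketched as below. The choice of measures (support, confidence, Jaccard, simple matching) is only an illustrative selection, and the function names and toy columns are assumptions for this example.

```python
def contingency(x, y):
    """2 x 2 table [[a, b], [c, d]] for two 0/1 item columns x and y.

    a: both present, b: x present and y absent,
    c: x absent and y present, d: both absent.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return [[a, b], [c, d]]


def measures(D):
    """A few measures built from {a, b, c, d}.

    Assumes a + b > 0 and a + b + c > 0, so no denominator is zero.
    """
    (a, b), (c, d) = D
    n = a + b + c + d
    return {
        "support": a / n,                 # share of sequences with both items
        "confidence": a / (a + b),        # rule x -> y
        "jaccard": a / (a + b + c),       # similarity ignoring co-absences
        "simple_matching": (a + d) / n,   # similarity counting co-absences
    }


# Toy item columns (invented), one value per sequence.
x = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

D = contingency(x, y)
m = measures(D)
print(D)
print(m)
```

Counting the four cells directly gives the same table as multiplying the two-column disjunctive blocks of the two items, since each cell is the inner product of one presence or absence column with another.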
