Finding Outstanding Aspects and Contrast Subspaces Jian Pei School of Computing Science Simon Fraser University jpei@cs.sfu.ca
CHIRC • Computational Health Intelligence Research Centre – Population health powered by big data – Healthcare business intelligence – Predictive health analytics • A collaborative research initiative with industry leaders • Technology transferred to industry – Multi-million US dollars financial gain per year for industry partners J. Pei: Finding Outstanding Aspects and Contrast Subspaces 2
In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity? Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …
Fraud Suspect Analysis • An insurance analyst is investigating a suspicious claim • How does the claim compare with normal and fraudulent claims? – In what aspects is the suspicious case most similar to fraudulent cases and different from normal claims?
Don’t You Ever Google Yourself? • Big data helps people know themselves better • 57% of American adults search for themselves on the Internet – Good news: those people are better paid than those who haven’t done so! (Investors.com) • Egocentric analysis becomes more and more important with big data
Egocentric Analysis • How am I different from (more often than not, better than) others? • In what aspects am I good?
Contrast Subspace Finding • Given a set of labeled objects in two classes • For a query object q that is also labeled, the contrast subspace is the one where q is most likely to belong to the target class against the other class
Related Work • Finding patterns and models that manifest drastic differences from one class against the other – Example: emerging patterns • Subspace outlier detection – The query object may not be an outlier • Typicality queries do not consider subspaces
Problem Formulation
• Find subspaces maximizing
$$LC_S(q) = \frac{L_S(q \mid O_+)}{L_S(q \mid O_-)}$$
• To avoid triviality, consider only subspaces where $L_S(q \mid O_+) \geq \delta$
Density Estimation
• Density estimated by
$$L_S(q \mid O) = \hat{f}_S(q, O) = \frac{1}{|O| \sqrt{2\pi}\, h_S} \sum_{o \in O} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_S^2}}$$
• Then,
$$LC_S(q, O_+, O_-) = \frac{\hat{f}_S(q, O_+)}{\hat{f}_S(q, O_-)} = \frac{|O_-|\, h_{S^-}}{|O_+|\, h_{S^+}} \cdot \frac{\sum_{o \in O_+} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_{S^+}^2}}}{\sum_{o \in O_-} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_{S^-}^2}}}$$
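The two formulas above can be illustrated with a minimal Python sketch. It assumes objects are tuples of numbers, a subspace is a tuple of 0-based dimension indices, and the bandwidths `h_pos` and `h_neg` are supplied by the caller; the function names are hypothetical and the slides do not prescribe this implementation.

```python
import math


def dist_sq(q, o, subspace):
    """Squared Euclidean distance between q and o, restricted to `subspace`."""
    return sum((q[d] - o[d]) ** 2 for d in subspace)


def likelihood(q, objects, subspace, h):
    """Gaussian kernel estimate of L_S(q | O) with bandwidth h (assumed given)."""
    total = sum(math.exp(-dist_sq(q, o, subspace) / (2 * h * h)) for o in objects)
    return total / (len(objects) * math.sqrt(2 * math.pi) * h)


def likelihood_contrast(q, pos, neg, subspace, h_pos, h_neg):
    """LC_S(q, O+, O-): ratio of the two density estimates in `subspace`."""
    return likelihood(q, pos, subspace, h_pos) / likelihood(q, neg, subspace, h_neg)
```

A large ratio means q sits in a dense region of the target class and a sparse region of the other class when only the dimensions in `subspace` are considered.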
Complexity • MAX SNP-hard – Reduction from the emerging pattern mining problem • No polynomial-time approximation scheme is possible unless P = NP
A Monotonic Bound
• $L_S(q \mid O_+)$ is not monotonic in subspaces
• Develop an upper bound of $L_S(q \mid O_+)$, which is monotonic in subspaces
– Sort all the dimensions in descending order of standard deviation
– Let $\mathcal{S}$ be the set of children of $S$ in the subspace set enumeration tree using the standard deviation descending order
– Define
$$L^*_S(q \mid O_+) = \frac{1}{|O_+| \sqrt{2\pi}\, \sigma'_{min}\, h'_{opt\_min}} \sum_{o \in O_+} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 (\sigma_S h'_{opt\_max})^2}}$$
– $\sigma'_{min} = \min\{\sigma_{S'} \mid S' \in \mathcal{S}\}$, $h'_{opt\_min} = \min\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$, and $h'_{opt\_max} = \max\{h_{S'\_opt} \mid S' \in \mathcal{S}\}$
Monotonic Bound
For a query object $q$, a set of objects $O_+$, and subspaces $S_1$, $S_2$ such that $S_1$ is an ancestor of $S_2$ in the subspace set enumeration tree using the standard deviation descending order in $O_+$,
$$L^*_{S_1}(q \mid O_+) \geq L_{S_2}(q \mid O_+).$$
Baseline algorithm time complexity: $O(2^{|D|} \cdot (|O_+| + |O_-|))$
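The baseline cost comes from visiting every non-empty subspace and scanning both classes in each one. A minimal sketch of such an exhaustive scan follows; it assumes a single fixed bandwidth `h` for every subspace (a simplification not taken from the slides), tuples as objects, and a hypothetical function name. It omits the monotonic bound, which is what would allow pruning.

```python
import math
from itertools import combinations


def likelihood(q, objects, subspace, h):
    """Gaussian kernel estimate of L_S(q | O) restricted to `subspace`."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    total = sum(
        math.exp(-sum((q[d] - o[d]) ** 2 for d in subspace) / (2 * h * h))
        for o in objects
    )
    return total / norm


def best_contrast_subspace(q, pos, neg, delta, h=1.0):
    """Scan all 2^|D| - 1 non-empty subspaces; return (subspace, LC) or None."""
    n_dims = len(q)
    best = None
    for size in range(1, n_dims + 1):
        for subspace in combinations(range(n_dims), size):
            l_pos = likelihood(q, pos, subspace, h)
            if l_pos < delta:
                continue  # skip trivial subspaces where q is unlikely in O+
            l_neg = likelihood(q, neg, subspace, h)
            # Guard against floating-point underflow in the denominator.
            lc = l_pos / l_neg if l_neg > 0 else math.inf
            if best is None or lc > best[1]:
                best = (subspace, lc)
    return best
```

Each subspace costs one pass over both classes, which matches the $2^{|D|} \cdot (|O_+| + |O_-|)$ growth of the baseline.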
Bounding Using Neighborhoods
• Divide the objects into two parts: the $\epsilon$-neighborhood of $q$, $N^\epsilon_S(q) = \{o \in O \mid \mathrm{dist}_S(q, o) \leq \epsilon\}$, and the rest
• Then,
$$L_S(q \mid O) = L_{N^\epsilon_S}(q \mid O) + L^{rest}_S(q \mid O)$$
$$L_{N^\epsilon_S}(q \mid O) = \frac{1}{|O| \sqrt{2\pi}\, h_S} \sum_{o \in N^\epsilon_S(q)} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_S^2}}$$
$$L^{rest}_S(q \mid O) = \frac{1}{|O| \sqrt{2\pi}\, h_S} \sum_{o \in O \setminus N^\epsilon_S(q)} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_S^2}}$$
Bounding the Rest
• Let $\mathrm{dist}_S(q \mid O)$ be the maximum distance between $q$ and all objects in $O$ in subspace $S$
$$\frac{|O| - |N^\epsilon_S(q)|}{|O| \sqrt{2\pi}\, h_S} \cdot e^{-\frac{\mathrm{dist}_S(q \mid O)^2}{2 h_S^2}} \leq L^{rest}_S(q \mid O) \leq \frac{|O| - |N^\epsilon_S(q)|}{|O| \sqrt{2\pi}\, h_S} \cdot e^{-\frac{\epsilon^2}{2 h_S^2}}$$
Bounding
For a query object $q$, a set of objects $O$ and $\epsilon \geq 0$,
$$LL^\epsilon_S(q \mid O) \leq L_S(q \mid O) \leq UL^\epsilon_S(q \mid O)$$
where
$$LL^\epsilon_S(q \mid O) = \frac{1}{|O| \sqrt{2\pi}\, h_S} \left( \sum_{o \in N^\epsilon_S(q)} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_S^2}} + (|O| - |N^\epsilon_S(q)|)\, e^{-\frac{\mathrm{dist}_S(q \mid O)^2}{2 h_S^2}} \right)$$
and
$$UL^\epsilon_S(q \mid O) = \frac{1}{|O| \sqrt{2\pi}\, h_S} \left( \sum_{o \in N^\epsilon_S(q)} e^{-\frac{\mathrm{dist}_S(q, o)^2}{2 h_S^2}} + (|O| - |N^\epsilon_S(q)|)\, e^{-\frac{\epsilon^2}{2 h_S^2}} \right)$$
For a query object $q$, a set of objects $O_+$, a set of objects $O_-$, and $\epsilon \geq 0$,
$$LC_S(q) \leq \frac{UL^\epsilon_S(q \mid O_+)}{LL^\epsilon_S(q \mid O_-)}.$$
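The neighborhood bounds can be checked numerically with a short Python sketch. It assumes tuples as objects, a caller-supplied bandwidth `h`, and hypothetical function names; for clarity it computes every distance directly, whereas the point of the bounds is that only the ε-neighborhood and the maximum distance are needed.

```python
import math


def likelihood_bounds(q, objects, subspace, h, eps):
    """Return (LL, UL), the lower and upper bounds on L_S(q | O)."""
    d2 = [sum((q[d] - o[d]) ** 2 for d in subspace) for o in objects]
    near = [v for v in d2 if v <= eps * eps]   # objects in the eps-neighborhood
    n_rest = len(objects) - len(near)          # objects outside it
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    near_sum = sum(math.exp(-v / (2 * h * h)) for v in near)
    # Each remaining object contributes at least the kernel value at the
    # maximum distance and at most the kernel value at distance eps.
    lower = (near_sum + n_rest * math.exp(-max(d2) / (2 * h * h))) / norm
    upper = (near_sum + n_rest * math.exp(-eps * eps / (2 * h * h))) / norm
    return lower, upper


def likelihood(q, objects, subspace, h):
    """Exact Gaussian kernel estimate, used here only to check the bounds."""
    norm = len(objects) * math.sqrt(2 * math.pi) * h
    return sum(
        math.exp(-sum((q[d] - o[d]) ** 2 for d in subspace) / (2 * h * h))
        for o in objects
    ) / norm


def contrast_upper_bound(q, pos, neg, subspace, h, eps):
    """Upper bound on LC_S(q): UL over O+ divided by LL over O-."""
    _, upper_pos = likelihood_bounds(q, pos, subspace, h, eps)
    lower_neg, _ = likelihood_bounds(q, neg, subspace, h, eps)
    return upper_pos / lower_neg
```

If the upper bound on the contrast in a subspace is already below the best contrast found so far, the exact densities for that subspace need not be computed.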
Algorithm
Dimensionality of Inlying Contrast Subspaces
Dimensionality of Outlying Contrast Subspaces
Runtime
In Which Aspects Is Johnson Good?
[Figure: scatter plots pairing Points/game, Personal foul, and Assist, with the point labeled Joe marked in each panel]
Fraud Investigation • Given a set of claims in an insurance company • For a claim c, in which aspects is c most different from the other claims?
Outlying/Outstanding Aspect Mining • Given a set of objects in a multi-dimensional space • For an object q, find the subspaces where q is most unusual compared to the rest of the data
Differences from Outlier Detection • Outlier detection finds objects that are different from the rest of the data • The query object in outlying aspect finding may not be an outlier
Problem Formulation
• A set of objects $O$ in full space $D = \{D_1, \ldots, D_d\}$
• Query object $q$
• The density of $q$ measures how outlying (uncommon) $q$ is
– Density estimation
$$\hat{f}_h(o) = \frac{1}{n} \sum_{i=1}^{n} K_h(o - o_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{o - o_i}{h}\right)$$
• Find a subspace where the density of $q$ is lowest?
Why Rank Statistics?
• Densities in different subspaces are not comparable
• We compare the same set of objects in different subspaces
• Rank statistics
$$rank_S(o) = |\{o' \mid o' \in O,\ OutDeg(o') < OutDeg(o)\}| + 1$$
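The rank statistic can be sketched in a few lines of Python. The sketch assumes the outlyingness degrees of all objects in one subspace have already been computed and are passed in as a list; `rank_of` is a hypothetical name.

```python
def rank_of(out_deg, i):
    """Rank of object i: one plus the number of objects with a strictly smaller degree."""
    return sum(1 for value in out_deg if value < out_deg[i]) + 1
```

Whatever the scale of the degrees in a given subspace, the rank always lies between 1 and |O|, which is what makes ranks from different subspaces comparable. Tied objects share the same rank.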
Unsupervised Problem Formulation
Given a set of objects $O$ in a multidimensional space $D$, a query object $q \in O$ and a maximum dimensionality threshold $0 < \ell \leq |D|$, a subspace $S \subseteq D$ ($0 < |S| \leq \ell$) is called a minimal outlying subspace of $q$ if
1. (Rank minimality) there does not exist another subspace $S' \subseteq D$ ($S' \neq \emptyset$), such that $rank_{S'}(q) < rank_S(q)$; and
2. (Subspace minimality) there does not exist another subspace $S'' \subset S$ such that $rank_{S''}(q) = rank_S(q)$.
The problem of outlying aspect mining is to find the minimal outlying subspaces of $q$.
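The definition can be illustrated with an exhaustive search in Python. This is a sketch under stated assumptions, not the algorithm from the slides: it uses an unnormalized Gaussian density with one fixed bandwidth `h` as the outlyingness degree (lower density gives a smaller rank), it compares ranks only among subspaces of at most `max_dim` dimensions, and all function names are hypothetical.

```python
import math
from itertools import combinations


def quasi_density(q, objects, subspace, h=1.0):
    """Unnormalized Gaussian density of q in `subspace` (fixed bandwidth assumed)."""
    return sum(
        math.exp(-sum((q[d] - o[d]) ** 2 for d in subspace) / (2 * h * h))
        for o in objects
    )


def rank_in_subspace(objects, q_index, subspace, h=1.0):
    """One plus the number of objects whose density is strictly below q's."""
    dens = [quasi_density(o, objects, subspace, h) for o in objects]
    return sum(1 for v in dens if v < dens[q_index]) + 1


def minimal_outlying_subspaces(objects, q_index, max_dim, h=1.0):
    """Return (best rank, subspaces reaching it that have no winning proper subset)."""
    n_dims = len(objects[0])
    ranks = {}
    for size in range(1, min(max_dim, n_dims) + 1):
        for subspace in combinations(range(n_dims), size):
            ranks[subspace] = rank_in_subspace(objects, q_index, subspace, h)
    best = min(ranks.values())                       # rank minimality
    winners = [s for s, r in ranks.items() if r == best]
    minimal = [                                      # subspace minimality
        s for s in winners
        if not any(set(t) < set(s) for t in winners)
    ]
    return best, minimal
```

For example, with four objects that are close together in dimension 0 and a query object far from them only in that dimension, the search reports dimension 0 alone and drops the two-dimensional subspace that reaches the same rank.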
Density Estimation for Ranking
$$\hat{f}_S(q) \sim \tilde{f}_S(q) = \sum_{o \in O} e^{-\sum_{D_i \in S} \frac{(q.D_i - o.D_i)^2}{2 h_{D_i}^2}}$$
• Invariance
Given a set of objects $O$ in space $S = \{D_1, \ldots, D_d\}$, define a linear transformation $g(o) = (a_1\, o.D_1 + b_1, \ldots, a_d\, o.D_d + b_d)$ for any $o \in O$, where $a_1, \ldots, a_d$ and $b_1, \ldots, b_d$ are real numbers. Let $O' = \{g(o) \mid o \in O\}$ be the transformed data set. For any objects $o_1, o_2 \in O$ such that $\tilde{f}_S(o_1) > \tilde{f}_S(o_2)$ in $O$, $\tilde{f}_S(g(o_1)) > \tilde{f}_S(g(o_2))$ if the product kernel is used and the bandwidths are set using Härdle’s rule of thumb
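The invariance property can be checked numerically with a small Python sketch. It assumes the common form of Härdle's rule of thumb, $h = 1.06 \min\{\sigma, R/1.34\}\, n^{-1/5}$ with $R$ the interquartile range, computed here with the sample standard deviation and the default quartile method of `statistics.quantiles`; those choices and the function names are assumptions, and the data must not be degenerate (a zero bandwidth would divide by zero).

```python
import math
import statistics


def hardle_bandwidth(values):
    """Rule-of-thumb bandwidth for one dimension (assumed form, see lead-in)."""
    n = len(values)
    sigma = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 1.06 * min(sigma, (q3 - q1) / 1.34) * n ** (-1 / 5)


def quasi_density(q, objects, subspace):
    """Product-kernel quasi-density with one bandwidth per dimension."""
    h = {d: hardle_bandwidth([o[d] for o in objects]) for d in subspace}
    return sum(
        math.exp(-sum((q[d] - o[d]) ** 2 / (2 * h[d] ** 2) for d in subspace))
        for o in objects
    )


def transform(o, scales, shifts):
    """Per-dimension linear map g(o) = (a_i * o.D_i + b_i)."""
    return tuple(a * x + b for x, a, b in zip(o, scales, shifts))
```

Because each bandwidth scales with its dimension, the ratio $(q.D_i - o.D_i)^2 / h_{D_i}^2$ is unchanged by a positive rescaling and shift, so the quasi-densities and hence the ranking stay the same.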
Algorithm Framework