Approximating the Best-Fit Tree Under L_p Norms
Boulos Harb, Sampath Kannan and Andrew McGregor, UPenn
The Problem(s)

• Input: Distance matrix D[i,j] on n items
• Output: A tree metric (or an ultrametric) T[i,j]
• Goal: Minimize the cost of fit, either the L_p cost

    L_p(D, T) = ( Σ_{i,j} |D[i,j] − T[i,j]|^p )^{1/p}

  or (for ultrametrics) the relative cost

    L_rel(D, T) = max_{i,j} max{ D[i,j] / T[i,j], T[i,j] / D[i,j] }
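For concreteness, here is a minimal sketch (not from the paper) of how the two cost-of-fit measures would be computed, with D and T given as n×n nested Python lists:

```python
# Minimal sketch (not from the paper): evaluating the two cost-of-fit measures.
# D and T are n x n symmetric matrices given as nested lists; each unordered
# pair {i, j} is counted once.

def lp_cost(D, T, p):
    """L_p cost of fit: (sum over pairs of |D[i][j] - T[i][j]|^p)^(1/p)."""
    n = len(D)
    total = sum(abs(D[i][j] - T[i][j]) ** p
                for i in range(n) for j in range(i + 1, n))
    return total ** (1.0 / p)

def lrel_cost(D, T):
    """L_rel cost of fit: the worst multiplicative distortion over all pairs
    (assumes all off-diagonal entries are positive)."""
    n = len(D)
    return max(max(D[i][j] / T[i][j], T[i][j] / D[i][j])
               for i in range(n) for j in range(i + 1, n))
```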
Tree Metrics & Ultrametrics

• Tree metric: distances between the leaves of a weighted tree. Equivalently, the four-point condition holds:

    ∀ w, x, y, z ∈ [n]   T[w,x] + T[y,z] ≤ max{ T[w,y] + T[x,z], T[w,z] + T[x,y] }

• Ultrametric: distances between the leaves of a rooted weighted tree in which all leaves are equidistant from the root. Equivalently, the three-point condition holds:

    ∀ x, y, z ∈ [n]   T[x,y] ≤ max{ T[x,z], T[z,y] }

[Figure: an example weighted tree and a rooted ultrametric tree with integer edge weights.]
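These two conditions can be checked directly; below is a minimal sketch (assuming T is an n×n matrix, and checking only the stated inequalities rather than all metric axioms):

```python
# Minimal sketch: checking the stated three-point and four-point conditions
# on an n x n distance matrix T.
from itertools import combinations

def is_ultrametric(T):
    """Three-point condition: T[x][y] <= max(T[x][z], T[z][y]) for all x, y, z."""
    n = len(T)
    return all(T[x][y] <= max(T[x][z], T[z][y])
               for x in range(n) for y in range(n) for z in range(n))

def satisfies_four_point(T):
    """Four-point condition, checked over distinct quadruples: of the three
    pairings T[w][x]+T[y][z], T[w][y]+T[x][z], T[w][z]+T[x][y], the two
    largest must be equal (equivalent to the max-form stated on the slide)."""
    n = len(T)
    for w, x, y, z in combinations(range(n), 4):
        s = sorted([T[w][x] + T[y][z], T[w][y] + T[x][z], T[w][z] + T[x][y]])
        if s[1] != s[2]:
            return False
    return True
```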
Biological Motivation

• View the ultrametric as an evolutionary tree
• D[i,j] is an estimate of the time since species i and j diverged
• Goal: Reconcile contradictory estimates

[Figure: a cartoon evolutionary tree with leaves Shell Fish, Fish, Spider, Wasp, Bee, Orangutan, Chimp, Theorist, Computational Geometer.]
Previous Work

• Farach, Kannan & Warnow '95: Exact construction of the best-fit ultrametric under L_∞
• Agarwala, Bafna, Farach, Paterson & Thorup '99: 3-approximation of the best-fit tree under L_∞
• Ma, Wang & Zhang '99: n^{1/p}-approximation of the best-fit non-contracting ultrametric under L_p
• Dhamdhere '04: O(log^{1/p} n)-approximation of the best-fit line metric under L_p
Our Results

• Algorithm #1:
  - L_p: O((k log n)^{1/p})-approximation to the best-fit tree, where k is the number of distinct distances in D
  - L_rel: O(log^2 n)-approximation to the best-fit ultrametric
• Algorithm #2:
  - L_p: n^{1/p}-approximation to the best-fit tree
Algorithm #1
Restricting Splitting Distances

• Original distances are d_1 < d_2 < ... < d_k
• Lemma:
  a) There exists a best-fit (under L_1) ultrametric whose distances are a subset of {d_1, d_2, ..., d_k}
  b) There exists an ultrametric whose distances are a subset of {d_1, d_2, ..., d_k} whose cost of fit is at most twice optimal (under L_p)
  c) There exists an ultrametric with O(log n) distances whose cost of fit is at most twice optimal (under L_rel) [assuming d_k/d_1 is polynomial in n] (see the sketch below)
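One way to see part (c) is the usual geometric-rounding idea, sketched below under the stated assumption that d_k/d_1 is polynomial in n: restrict the splitting distances to powers of 2 between d_1 and d_k, which leaves only O(log n) levels, and snapping any distance down to a level changes it by less than a factor of 2, so the L_rel cost at most doubles.

```python
# Sketch of part (c): restrict splitting distances to powers of 2 in [d_1, d_k].
# When d_k / d_1 is polynomial in n there are only O(log n) such levels, and
# rounding a distance down to the nearest level changes it by less than a
# factor of 2, so the L_rel cost of fit at most doubles.

def power_of_two_levels(d_1, d_k):
    levels = []
    v = d_1
    while v <= d_k:
        levels.append(v)
        v *= 2
    return levels

def snap_down(d, levels):
    """Largest level that is <= d (assumes d >= levels[0])."""
    return max(l for l in levels if l <= d)
```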
[Figure: a rooted ultrametric tree with internal nodes at heights corresponding to splitting distances d_4 > d_3 > d_2 > d_1.]

"Splitting distance" of an internal node v = distance between the leaves of the subtree rooted at v
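To make the picture concrete, a small sketch (with an assumed representation: the tree as a parent array and a dict of splitting distances at internal nodes): the ultrametric distance between two distinct leaves is the splitting distance of their lowest common ancestor.

```python
# Sketch: an ultrametric is determined by the splitting distances of its
# internal nodes (non-increasing from the root toward the leaves); the
# distance between two distinct leaves i and j is the splitting distance of
# their lowest common ancestor.

def ultrametric_distance(i, j, parent, split):
    """parent[v]: parent of node v (None at the root);
    split[v]: splitting distance of internal node v.
    Naive LCA computation by walking up from both leaves."""
    ancestors = set()
    v = i
    while v is not None:
        ancestors.add(v)
        v = parent[v]
    v = parent[j]
    while v not in ancestors:     # first ancestor of i reached is the LCA
        v = parent[v]
    return split[v]
```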
Algorithm Outline

• Construct the top partition G → G_1, G_2, G_3, ...
  - Set the length of inter-cluster edges to d_k
  - All other lengths will be set to ≤ d_{k-1}
• Construct trees for G_1, G_2, G_3, ...
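A rough sketch of this recursive outline (not the paper's exact procedure); `top_partition` is a hypothetical stand-in for the correlation-clustering step described on the following slides.

```python
# Rough sketch of the recursive outline. `top_partition` is a hypothetical
# subroutine (the correlation-clustering step described next) that partitions
# the current items into clusters G_1, G_2, ...

def build_ultrametric(items, D, distances, T, top_partition):
    """distances: candidate splitting distances d_1 < ... < d_k restricted to
    the current sub-instance. Fills in T[i][j] for all pairs i, j in items."""
    if len(items) <= 1:
        return
    if len(distances) == 1:
        for i in items:                      # only one distance left: use it
            for j in items:
                if i != j:
                    T[i][j] = distances[0]
        return
    d_k = distances[-1]
    clusters = top_partition(items, D, distances)
    for a, A in enumerate(clusters):
        for b, B in enumerate(clusters):
            if a != b:
                for i in A:
                    for j in B:
                        T[i][j] = d_k        # inter-cluster edges get length d_k
    for A in clusters:
        # intra-cluster lengths will all end up <= d_{k-1}
        build_ultrametric(A, D, distances[:-1], T, top_partition)
```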
[Figure: the top-level partition of G into clusters G_1, G_2, G_3. Inter-cluster distances are set to T[i,j] = d_k; intra-cluster distances satisfy T[i,j] ≤ d_{k-1}.]
Correlation Clustering

• Input: Graph with positive and negative edge weights
• Output: A partitioning of the nodes
• Goal: Minimize

    Σ_{e : w_e > 0} ( |w_e| if e is split )  +  Σ_{e : w_e < 0} ( |w_e| if e is not split )

• O(log n) approximation [Charikar, Guruswami and Wirth '03]

[Figure: an example graph with edge weights +1, +1, +2, +3, −1, +2, −5, −5, −7.]
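For reference, a minimal sketch of the objective: positive-weight edges pay |w_e| when their endpoints are split across clusters, negative-weight edges pay |w_e| when their endpoints are kept together.

```python
# Minimal sketch: cost of a clustering under the correlation clustering
# objective above. `weights` maps an edge (u, v) to its signed weight;
# `label` maps each node to its cluster id.

def correlation_cost(weights, label):
    cost = 0
    for (u, v), w in weights.items():
        split = label[u] != label[v]
        if (w > 0 and split) or (w < 0 and not split):
            cost += abs(w)
    return cost
```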
Using Correlation Clustering

• Best-fit ultrametric instance (pairwise distances): 20, 11, 20, 17, 14, 20, 18, 18, 20, 20
• Possible splitting distances: 20, 18, 17, 14, 11
• Top-level clustering: increase some lengths to 20 and decrease some length-20 edges to 18
• Correlation clustering instance: the distances 20, 11, 20, 17, 14, 20, 18, 18, 20, 20 get edge weights −2, +9, −2, +3, +6, −2, +2, +2, −2, −2

[Figure: the top-level clustering splits one length-18 edge, which is then increased to 20.]

• Cost of length changes = cost of disagreements during clustering
• Recurse on each cluster: in the example, the remaining sub-instance has distances 11, 18, 17, 14 (in the figure, the 18 is subsequently adjusted to 14 in the recursion)
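The weights in the example appear to follow a simple pattern; the sketch below reverse-engineers it (an inferred illustration, not necessarily the paper's exact reduction). With top splitting distance d_k = 20 and next distance d_{k-1} = 18, an edge of length D < 20 gets positive weight 20 − D (what it costs to raise it to 20 if it is split), while a length-20 edge gets negative weight −(20 − 18) (what it costs to lower it to 18 if it is kept together).

```python
# Inferred sketch of the edge weights used in the example above; the paper's
# exact reduction may differ.

def edge_weight(D_ij, d_k, d_k_minus_1):
    if D_ij < d_k:
        return d_k - D_ij            # cost of raising the edge to d_k if it is split
    return -(D_ij - d_k_minus_1)     # cost of lowering it to d_{k-1} if it is kept

# Reproduces the example: distances 20, 11, 20, 17, 14, 20, 18, 18, 20, 20
# give weights [-2, 9, -2, 3, 6, -2, 2, 2, -2, -2].
print([edge_weight(d, 20, 18) for d in (20, 11, 20, 17, 14, 20, 18, 18, 20, 20)])
```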
Analysis (Outline)

• Let OPT be the cost of fit of the best-fit tree (under L_1)