Selection Detection and Two-Sample-Testing: Generalized Greenwood Statistics and their Applications
Ðan Daniel Erdmann-Pham, Jonathan Terhorst & Yun S. Song
University of California, Berkeley
July 9, 2019
SPA 2019
Motivation Framework Application
Two Problems
Generalized Greenwood Statistics
Population Genetics: Detecting Selective Pressure

Neutral Tree
◮ At each depth, leaf set sizes are approximately equidistributed

Tree with Selection
◮ Leaf set sizes are highly unbalanced close to the root

◮ Given a tree, how can we tell whether it was generated under selection or not?
◮ Data allows computation of the sum of squares of leaf set sizes
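To make the last bullet concrete, here is a minimal sketch of the sum-of-squares computation. The leaf set sizes below are hypothetical values chosen for illustration, not data from the talk, and the function name `sum_of_squares` is an assumption.

```python
def sum_of_squares(leaf_set_sizes):
    """Sum of squared leaf set sizes at one depth of a tree."""
    return sum(s * s for s in leaf_set_sizes)


# Hypothetical example: 12 leaves split among 4 lineages at some depth.
balanced = [3, 3, 3, 3]    # approximately equidistributed (neutral-like)
unbalanced = [9, 1, 1, 1]  # one lineage dominates (selection-like)

print(sum_of_squares(balanced))    # 36
print(sum_of_squares(unbalanced))  # 84
```

For a fixed total number of leaves, the statistic is smallest when the sizes are equal and grows as they become unbalanced, which is why it can separate the two kinds of tree.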
Two-Sample Tests: Comparing $\{X_k\}_{k \in [n]}$ and $\{Y_k\}_{k \in [m]}$

$X_k \sim Y_k$ (Null)
$\mathbb{E}[X_k] \neq \mathbb{E}[Y_k]$ (Alternative)
$\operatorname{Var}[X_k] \neq \operatorname{Var}[Y_k]$ (Alternative)

How can we test the hypothesis that $\{X_k\}$ and $\{Y_k\}$ are identically distributed?
Sampling uniformly from the $(k-1)$-dimensional simplex $\Delta^{k-1} \subset \mathbb{R}^k$
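One standard way to draw such a sample, sketched here under the assumption that the simplex is $\{x \in \mathbb{R}^k : x_i \ge 0, \sum_i x_i = 1\}$, is to normalise $k$ independent Exp(1) variables. The result is a Dirichlet(1, ..., 1) vector, which is uniform on the simplex. The function name `sample_simplex` is illustrative and does not come from the talk.

```python
import random


def sample_simplex(k, rng=random):
    """Draw one point uniformly from {x >= 0, sum(x) = 1} in R^k."""
    # Normalising k i.i.d. Exp(1) draws yields a Dirichlet(1, ..., 1)
    # vector, i.e. the uniform distribution on the simplex.
    e = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(e)
    return [x / total for x in e]


point = sample_simplex(4, random.Random(0))
print(point, sum(point))
```

Passing a seeded `random.Random` instance keeps the draw reproducible; the default uses the module-level generator.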
Balls and bins

[Figure: bins with counts $S^1_{n,k}, S^2_{n,k}, \dots, S^k_{n,k}$]

◮ $S_{n,k} \sim \mathcal{U}\left(n \cdot \Delta^{k-1} \cap \mathbb{Z}_+\right)$ (Bose-Einstein Distribution)
◮ Can we perform hypothesis testing based on $\|S_{n,k}\|_2^2$? What is the distribution of $\|S_{n,k}\|_2^2$?

Limit as $n \to \infty$ for fixed $k$

◮ Greenwood Statistic (Greenwood ’46)
◮ Some moments, CLT, statistical efficiency (Moran ’47, ’51, ’53)
◮ Geometry: intersection of $L^1$ and $L^2$ balls
  ◮ Up to $k = 3$ (Gardner ’52)
◮ Large deviations (Schechtner, Zinn ’00)
◮ Tabulation of $z$-scores up to $k = 20$ (Burrows ’79, Currie ’81, Stephens ’81)
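As a rough illustration of the balls-and-bins question, the sketch below draws $S_{n,k}$ uniformly from the non-negative integer vectors of length $k$ summing to $n$ (via the stars-and-bars bijection) and estimates a tail probability of $\|S_{n,k}\|_2^2$ by simulation. The Monte Carlo approach is swapped in here for illustration and is not the method developed in the talk; all function names are assumptions.

```python
import random


def sample_bose_einstein(n, k, rng=random):
    """Uniform draw from {s in Z_{>=0}^k : sum(s) = n} via stars and bars."""
    # Place k-1 bars among n+k-1 slots; the runs of stars between
    # consecutive bars are the bin counts.
    bars = sorted(rng.sample(range(n + k - 1), k - 1))
    edges = [-1] + bars + [n + k - 1]
    return [edges[i + 1] - edges[i] - 1 for i in range(k)]


def squared_norm(s):
    """Squared L2 norm, i.e. the sum of squared bin counts."""
    return sum(x * x for x in s)


def monte_carlo_p_value(observed, n, k, draws=10_000, rng=random):
    """Estimate P(||S_{n,k}||^2 >= observed) under the uniform null."""
    hits = sum(
        squared_norm(sample_bose_einstein(n, k, rng)) >= observed
        for _ in range(draws)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (draws + 1)


rng = random.Random(0)
print(monte_carlo_p_value(squared_norm([9, 1, 1, 1]), n=12, k=4, rng=rng))
```

Simulation of this kind only approximates the null distribution and becomes expensive for very small p-values, which is one reason an explicit description of the distribution of $\|S_{n,k}\|_2^2$ is of interest.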