Scalable Hashing-Based Network Discovery


  1. Scalable Hashing-Based Network Discovery. Tara Safavi, Chandra Sripada, Danai Koutra. University of Michigan, Ann Arbor

  2. Networks are everywhere… airport connections, Internet routing, paper citations

  3. …but are not always directly observed. [Figure: 1. fMRI scans → 2. time series → 3. brain network] See "Network Structure Inference: A Survey" (Brugere, Gallagher, Berger-Wolf).

  4. How to build this network? [Figure: 1. fMRI scans → 2. time series → 3. brain network]

  5. Network discovery: reconstructing networks from indirect, possibly noisy measurements with unobserved interactions. Examples: brain scans, gene sequences, stock patterns.

  6. Traditional method. [Figure: 1. N time series → all-pairs correlation → 2. fully-connected weighted network over nodes A, B, C with edge weights .8, .4, .3]

  7. Traditional method. [Figure: 1. N time series → all-pairs correlation → 2. fully-connected weighted network → drop edges below threshold θ → 3. sparse graph]

  8. Traditional method. Widely used in many domains and interpretable, but… [Figure: 1. N time series → all-pairs correlation → 2. fully-connected weighted network → drop edges below threshold θ → 3. sparse graph]

  9. Traditional method. Two drawbacks: O(N^2) pairwise comparisons, and how to set θ? [Figure: 1. N time series → 2. fully-connected weighted network → 3. sparse graph]
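The traditional pipeline on these slides (all-pairs correlation, then thresholding at θ) can be sketched in a few lines. `pearson` and `threshold_graph` are hypothetical helper names, and the toy series are illustrative only:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def threshold_graph(series, theta):
    """All-pairs correlation, then drop edges below threshold theta.

    The O(N^2) pairwise loop here is exactly the bottleneck the
    hashing pipeline is designed to avoid.
    """
    names = list(series)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            r = pearson(series[u], series[v])
            if r >= theta:
                edges[(u, v)] = r
    return edges

series = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.1, 2.1, 2.9, 4.2],   # strongly correlated with A
    "C": [4.0, 1.0, 3.5, 0.5],   # weakly/negatively correlated
}
graph = threshold_graph(series, theta=0.8)  # keeps only the (A, B) edge
```

Note that both drawbacks from the slide show up here: every pair is compared, and θ = 0.8 is an arbitrary choice.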

  10. New hashing-based method. Replace the all-pairs step with hashing: binarize each series, then bucket by pairwise similarity with a hash function. [Figure: 1. N time series → 2. hash series (binarize → hash function → buckets) → 3. sparse graph]

  11. Contributions
  • Network discovery via a new locality-sensitive hashing (LSH) family: quickly find similar pairs and circumvent wasteful extra computation
  • Novel similarity measure on sequences for LSH: quantifies time-consecutive similarity
  • The complementary distance measure is a metric: suitable for LSH!
  • Evaluation on real data in the neuroscience domain
  [Figure: traditional O(N^2) all-pairs pipeline vs. the hashing pipeline]

  12. Method. [Pipeline: 1. time series → 2. hash series (binarize → bucket pairwise similarity) → 3. sparse graph]

  13. Approximate time series representation. Binarize w.r.t. the series mean (the "clipped" representation¹).
  • Why? Capture the approximate fluctuation trend; preprocess for hashing.
  ¹ Ratanamahatana et al., 2005
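The clipped binarization can be sketched as follows. `clip` is a hypothetical name, and mapping values exactly equal to the mean to 0 is an assumption here, not something the slide specifies:

```python
def clip(series):
    """'Clipped' representation: 1 where the value is strictly above the
    series mean, 0 otherwise, keeping only the fluctuation trend."""
    mean = sum(series) / len(series)
    return [1 if v > mean else 0 for v in series]

clip([3.0, 5.0, 1.0, 6.0, 2.0])  # mean = 3.4 -> [0, 1, 0, 1, 0]
```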

  14. Approximate time series representation. Binarize w.r.t. the series mean (the "clipped" representation¹).
  • Why? Capture the approximate fluctuation trend; preprocess for hashing.
  • But binary sequences only have two possible values → emphasize consecutive similarity between sequences over pointwise comparison.
  ¹ Ratanamahatana et al., 2005

  15. ABC: Approximate Binary Correlation. Capture variable-length consecutive runs between binary sequences.
  • Example: x = 1 1 0 1 0 0 0, y = 1 1 1 1 0 0 1. The two runs of agreement contribute (1+α)^0 + (1+α)^1 and (1+α)^0 + (1+α)^1 + (1+α)^2.

  16. ABC: Approximate Binary Correlation. Capture variable-length consecutive runs between binary sequences.
  • The similarity score s is a sum of p geometric series, one per run of agreement, each of length k_i.
  • Common ratio (1+α), where 0 < α ≪ 1 is a consecutiveness weighting factor.
  • Example: x = 1 1 0 1 0 0 0, y = 1 1 1 1 0 0 1 → s = [(1+α)^0 + (1+α)^1] + [(1+α)^0 + (1+α)^1 + (1+α)^2].
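The score described above (one geometric series per maximal run of agreement) can be sketched directly; `abc_similarity` is a hypothetical name and α = 0.05 is an illustrative choice:

```python
def abc_similarity(x, y, alpha=0.05):
    """ABC similarity: each maximal run of agreeing positions of
    length k contributes (1+a)^0 + ... + (1+a)^(k-1), so longer
    consecutive agreements are weighted superlinearly."""
    s, run = 0.0, 0
    for a, b in zip(x, y):
        if a == b:
            s += (1 + alpha) ** run  # next term of the current run
            run += 1
        else:
            run = 0  # a disagreement ends the run
    return s

x = [1, 1, 0, 1, 0, 0, 0]
y = [1, 1, 1, 1, 0, 0, 1]
# runs of agreement of lengths 2 and 3, as in the slide's example
score = abc_similarity(x, y)
```

Two scattered agreements score less than two consecutive ones, which is exactly the time-consecutive emphasis the slides describe.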

  17. ABC: Approximate Binary Correlation. Empirically, a good estimator of the correlation coefficient r: similarity scores s correlate well with r. [Scatter plot: s vs. r]

  18. ABC: Approximate Binary Correlation.
  • Empirically a good estimator of the correlation coefficient r: similarity scores s correlate well with r.
  • Added benefit of time-aware hashing.
  • LSH requires a metric: a distance satisfying the triangle inequality.
  • We can show ABC distance is a metric (critical). [Diagram: triangle on x, y, z]

  19. ABC distance triangle inequality: sketch of proof. Induction on n, the sequence length.
  • Induction step: identify the feasible cases between sequence pairs when position n+1 is appended: disagreement (d), new run (n), or append to existing run (a).
  • Example: x = 1 1 1 0 0 1, y = 0 1 1 0 0 1, z = 1 0 1 0 0 1.
  • Compute all deltas → the triangle inequality holds!
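The metric claim can be probed empirically. This sketch assumes the complementary distance is d(x, y) = s(x, x) − s(x, y), where s(x, x) is the maximal score for length n (the paper's exact definition may differ, e.g. by a normalizing constant), and checks the triangle inequality over random triples of binary sequences:

```python
import itertools
import random

def abc_similarity(x, y, alpha=0.05):
    """ABC similarity: one geometric series per run of agreement."""
    s, run = 0.0, 0
    for a, b in zip(x, y):
        if a == b:
            s += (1 + alpha) ** run
            run += 1
        else:
            run = 0
    return s

def abc_distance(x, y, alpha=0.05):
    """Assumed complementary distance: maximal score minus score.
    (Hypothetical form; the paper proves its own variant is a metric.)"""
    s_max = sum((1 + alpha) ** j for j in range(len(x)))
    return s_max - abc_similarity(x, y, alpha)

random.seed(0)
seqs = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
holds = all(
    abc_distance(x, z) <= abc_distance(x, y) + abc_distance(y, z) + 1e-9
    for x, y, z in itertools.permutations(seqs, 3)
)
```

An exhaustive check over random triples is of course no substitute for the inductive proof the slide sketches; it only illustrates what the metric property asserts.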

  20. Locality-sensitive hashing (background). Hash data s.t. similar items are likely to collide.
  • Family of hash fns F: (d1, d2, p1, p2)-sensitive → control false negative/positive rates.
  • Parameters: b, the number of hash tables, increases p1; r, the number of hash functions to concatenate, lowers p2.
  [Figure: original data → hash signatures → hash table buckets. x = 1 1 1 0 0 1 and y = 0 1 1 0 0 1 share signature [1, 0] and a bucket; z = 1 0 1 0 0 1 gets [0, 0]. g = h2 & h4]
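The effect of b and r on collision probability follows the standard LSH amplification formula, which this sketch evaluates (the parameter values and base probabilities are illustrative, not taken from the slides):

```python
def amplified(p, r, b):
    """Collision probability after AND-ing r concatenated hash
    functions within each table and OR-ing across b tables:
    1 - (1 - p**r)**b."""
    return 1 - (1 - p ** r) ** b

# Similar pairs (high base collision probability p1) collide almost
# surely; dissimilar pairs (low p2) almost never: the LSH S-curve.
close = amplified(0.9, r=4, b=8)  # raised toward 1 by the b tables
far = amplified(0.2, r=4, b=8)   # driven toward 0 by the r ANDs
```

This is the sense in which b "increases p1" and r "lowers p2" on the slide.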

  21. Proposed: ABC-LSH window sampling. Hash each binary sequence by sampling length-k windows (here k = 2, g = h2 & h4).
  [Figure: x = 1 1 1 0 0 1 and y = 0 1 1 0 0 1 share signature [11, 00] and a bucket; z = 1 0 1 0 0 1 gets [01, 00].]
  • The family is (d1, d2, 1 − α·d1/((1+α)^n − 1), 1 − α·d2/((1+α)^n − 1))-sensitive.
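A minimal sketch of the bucketing step on the slide's three sequences, assuming each hash function simply reads a length-k window at a fixed start position (the paper's actual window-sampling rule is randomized and more involved); `window_hash` and `bucket` are hypothetical names:

```python
import collections

def window_hash(seq, start, k):
    """One hash function: a length-k window of the binary sequence
    beginning at `start` (hypothetical fixed-position sampling)."""
    return tuple(seq[start:start + k])

def bucket(seqs, starts, k=2):
    """Concatenate one window per start position into a signature
    (the AND construction) and group colliding sequences."""
    buckets = collections.defaultdict(list)
    for name, s in seqs.items():
        sig = tuple(window_hash(s, st, k) for st in starts)
        buckets[sig].append(name)
    return buckets

seqs = {
    "x": [1, 1, 1, 0, 0, 1],
    "y": [0, 1, 1, 0, 0, 1],
    "z": [1, 0, 1, 0, 0, 1],
}
# Windows at positions 1 and 3: x and y share signature (11, 00) and
# land in one bucket, z gets (01, 00), matching the slide's figure.
buckets = bucket(seqs, starts=[1, 3], k=2)
```

Only pairs sharing a bucket would then be compared exactly, which is how the O(N^2) all-pairs step is avoided.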

  22. Summary
  • Time-consecutive locality-sensitive hashing (LSH) family
  • Novel similarity measure + distance metric on sequences
  [Figure: traditional all-pairs pipeline (all-pairs correlation, drop edges below threshold θ) vs. the hashing pipeline (binarize, bucket pairwise similarity)]

  23. Evaluation

  24. Evaluation questions
  1. How efficient is our approach compared to baselines? Baseline: pairwise correlation. Proposed: pairwise ABC, ABC-LSH.
  2. How predictive are the output graphs in real applications? Can we predict brain health using graphs discovered with ABC-LSH?
  3. How robust is our method to parameter choices?
  Data: two publicly available fMRI datasets; synthetic data.

  25. Question 1: scalability. 2–15× speedup with 2k–20k nodes.

  26. Question 2: task-based evaluation. Brain networks: identify biomarkers of mental disease.
  • Extract commonly used features from generated brain networks: average weighted degree, average clustering coefficient, average path length, modularity, density; then feature selection.
  [Table: per-subject feature vectors F1–F5 with health labels, e.g. Healthy = 0 with features 6.5, .3, 1.4, .7, .03]
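Some of the listed features can be computed directly from an adjacency structure. This pure-Python sketch on a toy four-node graph is illustrative only (in practice a library such as NetworkX provides these; modularity additionally needs a community partition, omitted here):

```python
def avg_degree(adj):
    """Average (unweighted) degree over all nodes."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))

def clustering(adj, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i in range(k) for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))

# Toy "brain network": a triangle A-B-C plus a pendant node D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
features = [
    avg_degree(adj),                                  # 4 nodes, 4 edges
    sum(clustering(adj, v) for v in adj) / len(adj),  # average clustering
    density(adj),                                     # 4 of 6 possible edges
]
```

Each discovered graph yields one such feature vector, which becomes one row of the classification table on the slide.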

  27. Question 2: task-based evaluation. Logistic regression classifier with 10-fold stratified cross-validation. [Figure: labeled feature vectors → train/test folds → predicted health]
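The stratified CV step can be illustrated with a minimal fold splitter, a stand-in for e.g. scikit-learn's StratifiedKFold (the round-robin assignment below is a simplification, and `stratified_kfold` is a hypothetical name):

```python
import collections

def stratified_kfold(labels, k=10):
    """Yield (train_idx, test_idx) pairs such that each fold keeps
    roughly the original class proportions."""
    by_class = collections.defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal indices round-robin per class
    for t in range(k):
        test = folds[t]
        train = [i for f in range(k) if f != t for i in folds[f]]
        yield train, test

# 10 healthy subjects (label 0) and 10 patients (label 1): every
# test fold then contains one subject of each class.
labels = [0, 1] * 10
splits = list(stratified_kfold(labels, k=10))
```

A classifier (logistic regression on the slides) is then trained on each train split and scored on the held-out fold, and the ten accuracies are averaged.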

  28. Question 2: task-based evaluation. Total time: >1 hr (pairwise baseline) vs. 5 min (ABC-LSH). Average accuracy is the same; runtime is not!

  29. Conclusion. Pipeline for network discovery on time series.
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings

  30. Conclusion. Pipeline for network discovery on time series.
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate on brain networks
  • More experiments on robustness, scalability, parameter sensitivity

  31. Conclusion. Pipeline for network discovery on time series.
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate on brain networks
  • More experiments on robustness, scalability, parameter sensitivity
  • Impact: integrated into production systems
