
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Erich Schubert, Michael Weiler, Hans-Peter Kriegel. Institute of Informatics, Database Systems Group, Ludwig-Maximilians-Universität München


  1. SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds. Erich Schubert, Michael Weiler, Hans-Peter Kriegel. Institute of Informatics, Database Systems Group, Ludwig-Maximilians-Universität München

  2. Trend detection on streams should be early and accurate. [Figure: term frequency (axis 0 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp”. Source: Twitter Streaming API on Feb. 19th 2014.] Michael Weiler | SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds | Page 1/10

  3. Trend detection on streams should be early and accurate. [Figure: same plot as slide 2, with markers A and B added.]

  4. Trend detection on streams should be early and accurate. [Figure: same plot with markers A and B, plus a “?” marker.]

  5. Trend detection on streams should be early and accurate. [Figure: same plot with markers A and B, plus a third series for the pair {“Facebook”, “WhatsApp”}.]

  6. Trend detection on streams should be early and accurate. [Figure: same plot including the pair {“Facebook”, “WhatsApp”}, captioned with the event “Facebook bought WhatsApp”.]

  7. Problem description
     1. Statistical significance score: popular topics ≠ trending topics (e.g. Obama)
     2. Track interacting terms
        • Facebook bought WhatsApp
        • Edward Snowden traveled to Moscow
     3. Scalability: efficient calculation for all terms and pairs

  8. SigniTrend on textual streams, tracking both single terms and pairs
     A. Preprocessing (stopwords, stemming, duplicates)
     B. Trend detection cycle
        • Temporal slicing for statistical aggregation
        • Score all terms and pairs based on expectations from past slices
     C. Refinement with clustering
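The per-slice aggregation in step B can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents arrive already preprocessed as token lists, and that a term or pair is counted once per document. The function name `count_slice` and the tuple-based keys are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_slice(documents):
    """Count, for one time slice, in how many documents each term
    and each unordered term pair occurs (assumed counting scheme)."""
    counts = Counter()
    for tokens in documents:
        terms = sorted(set(tokens))            # count each term once per document
        counts.update((t,) for t in terms)     # single terms as 1-tuples
        counts.update(combinations(terms, 2))  # unordered pairs in sorted order
    return counts


# Hypothetical, already preprocessed documents of one time slice.
docs = [
    ["facebook", "bought", "whatsapp"],
    ["facebook", "whatsapp", "deal"],
    ["obama", "us"],
]
counts = count_slice(docs)
```

Sorting the terms before forming pairs gives every unordered pair a single canonical key, so {“Facebook”, “WhatsApp”} is counted under one entry regardless of word order.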

  9. Trend detection cycle. [Diagram: terms and pairs → count frequency → exceeds threshold? → trend candidates; at the end of each time slice: update statistics → new alerting thresholds.]

  10. Update statistics for time slice t and term or pair e
     • How many standard deviations is the current frequency x higher than its mean?
       $$z(x_{t,e}) := \frac{x_{t,e} - \mu_{t-1,e}}{\sigma_{t-1,e}}$$

  11. Update statistics for time slice t and term or pair e
     • How many standard deviations is the current frequency x higher than its mean?
       $$z(x_{t,e}) := \frac{x_{t,e} - \mathrm{EWMA}_{t-1,e}}{\sqrt{\mathrm{EWMVar}_{t-1,e}}}$$
     • Exponentially weighted moving average/variance for continuous estimation on a stream [Finch09]
       $$\Delta_{t,e} \leftarrow x_{t,e} - \mathrm{EWMA}_{t-1,e}$$
       $$\mathrm{EWMA}_{t,e} \leftarrow \mathrm{EWMA}_{t-1,e} + \alpha \cdot \Delta_{t,e}$$
       $$\mathrm{EWMVar}_{t,e} \leftarrow (1 - \alpha) \cdot \left(\mathrm{EWMVar}_{t-1,e} + \alpha \cdot \Delta_{t,e}^{2}\right)$$
     [Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009
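The update rules on this slide translate almost line by line into code. The sketch below is an illustration under stated assumptions: the class name `EwmaStats`, the zero initialisation of mean and variance, and the choice to return 0.0 while the variance is still zero are not taken from the slides.

```python
import math


class EwmaStats:
    """Exponentially weighted moving mean and variance for one term or pair."""

    def __init__(self, alpha):
        self.alpha = alpha  # smoothing factor, 0 < alpha < 1
        self.mean = 0.0     # EWMA, assumed to start at 0
        self.var = 0.0      # EWMVar, assumed to start at 0

    def zscore(self, x):
        """Significance of frequency x against the statistics of past slices."""
        if self.var == 0.0:
            return 0.0      # assumption: no score before any variance is observed
        return (x - self.mean) / math.sqrt(self.var)

    def update(self, x):
        """Fold the frequency x of the finished time slice into the statistics."""
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


stats = EwmaStats(alpha=0.5)
stats.update(10)       # first slice: mean becomes 5.0, variance 25.0
z = stats.zscore(20)   # (20 - 5) / sqrt(25) = 3.0
```

The score is computed before the update, so the current slice is compared only against expectations from earlier slices, matching the t−1 indices in the formulas.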

  12. Significance and frequency for term “Facebook”. [Figure: plot not recoverable from the transcript.]

  13. How to track statistics of all pairs efficiently?
     Problem: too many terms and pairs to track everything.

     2013 News Dataset
                 STEMMED TERMS    OBSERVED PAIRS
     TOTAL          56,661,782       660,430,059
     UNIQUES           300,141        71,289,359

     Therefore, we designed an efficient hashing scheme (based on Bloom Filters and Heavy Hitters) for probabilistic upper-bound statistics.

  14. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. Incoming: {WhatsApp}: 60. [Figure: hash table with buckets #1 to #7.]

  15. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60 is mapped by h1 and h2 to two buckets. [Figure: table contents 45 ± 30, 45 ± 30.]

  16. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60; {Facebook, WhatsApp}: 2 is mapped by h1 and h2 to two buckets. [Figure: table contents 45 ± 30, 2 ± 1, 45 ± 30, 2 ± 1.]

  17. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60; {Facebook, WhatsApp}: 2; {Obama, US}: 25 is mapped by h1 and h2, with a “?” at one target bucket. [Figure: table contents 45 ± 30, 45 ± 30, 2 ± 1, 2 ± 1; incoming 20 ± 10, 20 ± 10.]

  18. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60; {Facebook, WhatsApp}: 2; {Obama, US}: 25. Write MAX (upper bound). [Figure: table contents 45 ± 30, 2 ± 1, 45 ± 30, 2 ± 1, 20 ± 10.]

  19. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60; {Facebook, WhatsApp}: 2; {Obama, US}: 25. Write MAX (upper bound); read MIN (lowest collision). [Figure: table contents 45 ± 30, 2 ± 1, 45 ± 30, 2 ± 1, 20 ± 10.] Read {Obama, US}: min(45±30, 20±10) = 20±10. Upper-bound estimate for mean and its variance.

  20. Hashing scheme for efficient tracking. L=7 buckets, K=2 hash functions. {WhatsApp}: 60; {Facebook, WhatsApp}: 2; {Obama, US}: 25. Write MAX (upper bound); read MIN (lowest collision). [Figure: table contents 45 ± 30, 2 ± 1, 45 ± 30, 2 ± 1, 20 ± 10.] Read {Obama, US}: min(45±30, 20±10) = 20±10. Upper-bound estimate for mean and its variance. Performance on news dataset: 104 s/day on a Raspberry Pi.
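The write-MAX / read-MIN idea from slides 14 to 20 can be sketched in a few lines of Python. Several details here are assumptions for illustration and are not stated on the slides: buckets are compared by their mean only, with the ± value carried along; the hash functions are salted BLAKE2b digests; and the names `HashedStats`, `SlideExample`, and the bucket positions in `POSITIONS` are hypothetical.

```python
import hashlib


class HashedStats:
    """Fixed-size table of (mean, spread) buckets shared by all terms and pairs."""

    def __init__(self, num_buckets, num_hashes):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.table = [(0.0, 0.0)] * num_buckets

    def buckets(self, key):
        """One bucket index per hash function (salted BLAKE2b, an assumption)."""
        data = repr(key).encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest(),
                "big",
            ) % self.num_buckets
            for i in range(self.num_hashes)
        ]

    def read(self, key):
        """Read MIN: the bucket with the lowest mean has the lowest collision."""
        return min((self.table[b] for b in self.buckets(key)), key=lambda s: s[0])

    def write(self, key, mean, spread):
        """Write MAX: overwrite only buckets whose stored mean is smaller."""
        for b in self.buckets(key):
            if mean > self.table[b][0]:
                self.table[b] = (mean, spread)


class SlideExample(HashedStats):
    """Fixed bucket positions, chosen only to mirror the slide's collision."""

    POSITIONS = {
        "WhatsApp": [0, 3],
        "Facebook+WhatsApp": [1, 4],
        "Obama+US": [3, 5],  # bucket 3 collides with "WhatsApp"
    }

    def buckets(self, key):
        return self.POSITIONS[key]


demo = SlideExample(num_buckets=7, num_hashes=2)
demo.write("WhatsApp", 45, 30)
demo.write("Facebook+WhatsApp", 2, 1)
demo.write("Obama+US", 20, 10)
obama = demo.read("Obama+US")  # min over (45, 30) and (20, 10) by mean
```

Because a bucket is never lowered by a colliding write, each of a key's buckets holds at least that key's own value, and the minimum over them is an upper bound that is exact whenever at least one bucket is collision-free.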

  21. Artificial trends evaluation. Inject artificial words with frequency α, e.g. “Obama meets <X123> Netanyahu”. [Figure: recall of injected trends (0% to 100%) versus hash table bits ℓ (8 to 26), one curve for each α in 0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15.] Hash table size large enough → recall saturation.
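One way such an injection could be set up is sketched below. The slide does not define how “frequency α” is applied, so this sketch assumes α is the probability that a document of the slice receives the artificial token at a random position. The function name `inject_trend` and the seeding are hypothetical.

```python
import random


def inject_trend(documents, token, alpha, seed=0):
    """Return copies of the documents in which `token` was inserted,
    independently with probability alpha per document (assumed meaning of α)."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    injected = []
    for tokens in documents:
        tokens = list(tokens)  # copy, the input stays untouched
        if rng.random() < alpha:
            tokens.insert(rng.randint(0, len(tokens)), token)
        injected.append(tokens)
    return injected


docs = [["obama", "meets", "netanyahu"]] * 4
all_injected = inject_trend(docs, "<X123>", alpha=1.0)
none_injected = inject_trend(docs, "<X123>", alpha=0.0)
```

Recall is then the share of injected tokens that the detector reports as trends, which can be measured exactly because the ground truth is known by construction.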

  22. Refinement & clustering
     • Inverted index (Apache Lucene) to verify trend candidates and measure exactly (without hashing) for precise reporting (false positives can be eliminated)
     • Single Link clustering with Ward of remaining trends (similarity matrix is built with the exact significance of all pairs)
     • Future work: include topic modeling techniques (e.g. pLSI, LDA)

  23. Thank You! Questions?
