Detecting Changes and Anomalies in Noisy Text Streams, Jerry Wright

  1. Detecting Changes and Anomalies in Noisy Text Streams. Jerry Wright, Networking and Services Research Lab, AT&T Labs Research, 15 February 2010. Sections: CoCITe, Noise, Mixture Distributions, Results, Summary. Running title: Noisy Text Streams.

  2. Outline: CoCITe, Noise, Mixture Distributions, Results.

  3. Mining Text Streams for Changes
     - Text stream: time-stamped ASCII text, usually structured into documents (optionally tagged with metadata) and containing recurrent words.
     - Words may be tokenized: normalize case and punctuation; substitute tokens for named entities.
     - Frequency of words as a function of time: steps and bursts, cycles, trends.
     - “We’re seeing more of this and less of that, especially for these customers.”
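The tokenization and binning steps on this slide can be sketched as follows. This is a minimal illustration, not the CoCITe implementation: the `NUM` placeholder stands in for a real named-entity substitution rule, and the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

def tokenize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and replace digit runs with a NUM token."""
    raw = re.findall(r"[a-z]+|\d+", text.lower())
    # NUM is a hypothetical placeholder for a named-entity substitution.
    return ["NUM" if tok.isdigit() else tok for tok in raw]

def bin_word_counts(docs):
    """Map each hourly bin to a Counter of raw word occurrences."""
    bins = defaultdict(Counter)
    for stamp, text in docs:
        hour = stamp.replace(minute=0, second=0, microsecond=0)
        bins[hour].update(tokenize(text))
    return bins

# Hypothetical time-stamped documents, for illustration only.
docs = [
    (datetime(2010, 2, 15, 9, 5), "Error 404: disk FULL!"),
    (datetime(2010, 2, 15, 9, 40), "Disk error again."),
    (datetime(2010, 2, 15, 10, 2), "All clear."),
]
bins = bin_word_counts(docs)
print(bins[datetime(2010, 2, 15, 9)]["disk"])  # 2
```

Counting each word at most once per document instead would give the per-bin document counts needed for relative frequency.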

  4. Model-Based Approach
     - Binning: documents are binned and frequencies counted at regular intervals (typically hourly or daily). Assumption: documents are independent.
     - Absolute frequency (to track raw word count): the number of occurrences of a word in the bin at time $t$ is Poisson($\lambda_t$), where $\lambda_t$ is a piecewise-linear function of time with cyclic modulation.
     - Relative frequency (to track the proportion of documents containing a word): the number of documents in the bin at time $t$ that contain the word is Binomial($n_t$, $p_t$), where $n_t$ is the total number of documents in the bin at $t$ and $p_t$ is a piecewise-linear function of time with cyclic modulation.
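A minimal sketch of the two per-bin likelihoods described above. The slide does not say how the cyclic modulation enters the rate, so the multiplicative form in `modulated_rate` is an assumption, and the counts and coefficients are made up for illustration.

```python
import math

def poisson_logpmf(x: int, lam: float) -> float:
    """Log P(X = x) for X ~ Poisson(lam), lam > 0."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def binomial_logpmf(x: int, n: int, p: float) -> float:
    """Log P(X = x) for X ~ Binomial(n, p), 0 < p < 1."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    return log_choose + x * math.log(p) + (n - x) * math.log(1 - p)

def modulated_rate(t: int, intercept: float, slope: float,
                   cycle: list[float]) -> float:
    """One linear segment times a cyclic factor (assumed multiplicative)."""
    return (intercept + slope * t) * cycle[t % len(cycle)]

# Hypothetical word counts per bin and a 4-bin cycle.
counts = [3, 7, 5, 2, 4, 9, 6, 3]
cycle = [0.8, 1.4, 1.1, 0.7]
loglik = sum(poisson_logpmf(x, modulated_rate(t, 4.0, 0.1, cycle))
             for t, x in enumerate(counts))
print(f"Poisson log-likelihood: {loglik:.3f}")
```

Because documents are assumed independent, the log-likelihood of a whole stream is just the sum of the per-bin terms.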

  5. Optimization of Model
     - Piecewise-linear segmentation: dynamic programming algorithm to maximize likelihood.
     - Periodic model: periodicity test; number and assignment of modulation coefficients.
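The slide names dynamic programming over segmentations but gives no details. The sketch below illustrates the idea under simplifying assumptions that are not from the talk: segments are piecewise-constant rather than piecewise-linear, the number of segments `k` is fixed in advance, and there is no cyclic modulation. All names are hypothetical.

```python
import math

def segment_score(prefix, i, j):
    """Poisson log-likelihood of counts[i:j] at its best constant rate,
    omitting the log(x!) terms, which do not depend on the segmentation."""
    total = prefix[j] - prefix[i]
    if total == 0:
        return 0.0
    rate = total / (j - i)
    return total * math.log(rate) - total

def best_segmentation(counts, k):
    """Split counts into k contiguous segments (k <= len(counts)) that
    maximize total log-likelihood. Returns (score, segment start indices)."""
    n = len(counts)
    prefix = [0]
    for x in counts:
        prefix.append(prefix[-1] + x)
    neg_inf = float("-inf")
    # best[m][j]: best score for counts[:j] split into m segments.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                if best[m - 1][i] == neg_inf:
                    continue
                score = best[m - 1][i] + segment_score(prefix, i, j)
                if score > best[m][j]:
                    best[m][j] = score
                    back[m][j] = i
    # Walk the back-pointers from the end to recover segment starts.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return best[k][n], starts[::-1]

counts = [2, 3, 2, 3, 2, 12, 11, 13, 12, 10]  # hypothetical step at index 5
score, starts = best_segmentation(counts, 2)
print(starts)  # [0, 5]
```

The triple loop costs O(k n^2) segment evaluations, which is why prefix sums are used to score each candidate segment in constant time.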

  6. Stream Implementation
     - Condensed history: used for model re-optimization for each bin; consists mostly of geometrically-weighted totals.
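A geometrically-weighted total can be maintained in constant memory with one multiply-add per bin. The sketch below shows the general technique; which totals CoCITe actually keeps is not stated on the slide, so the class name, fields, and decay value are hypothetical.

```python
class DecayedTotals:
    """Running geometrically-weighted sums: S <- decay * S + x."""

    def __init__(self, decay: float) -> None:
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must lie strictly between 0 and 1")
        self.decay = decay
        self.weight = 0.0    # weighted count of observations
        self.total = 0.0     # weighted sum of x
        self.total_sq = 0.0  # weighted sum of x squared

    def update(self, x: float) -> None:
        """Age the existing totals by one bin, then add the new value."""
        d = self.decay
        self.weight = d * self.weight + 1.0
        self.total = d * self.total + x
        self.total_sq = d * self.total_sq + x * x

    def mean(self) -> float:
        return self.total / self.weight

    def variance(self) -> float:
        m = self.mean()
        return self.total_sq / self.weight - m * m

history = DecayedTotals(decay=0.5)
for value in [1, 2, 3]:
    history.update(value)
print(history.mean())
```

An observation that is `k` bins old carries weight `decay ** k`, so older bins fade without ever being stored individually.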

  7. Outline: CoCITe, Noise, Mixture Distributions, Results.

  8. Noise
     - Word occurrence frequencies are noisy (over-dispersed): in addition to steps, trends, and cycles, there is more bin-to-bin variation than the Poisson and Binomial models can account for.
     - Absolute frequency: for a Poisson, variance = mean. (Data from a threat management system.)
     - Relative frequency: for a Binomial, variance < mean. (Data from a CHI Scan customer care app.)
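A quick way to see over-dispersion is to compare the sample variance of the bin counts with their mean. The sketch below is only a crude diagnostic that assumes a stationary sequence with no steps, trends, or cycles; it is not one of the formal tests mentioned later in the talk, and the two count sequences are invented.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean.
    Roughly 1 for Poisson data; well above 1 suggests over-dispersion."""
    return variance(counts) / mean(counts)

# Two hypothetical sequences with the same mean (5 per bin).
steady = [4, 5, 6, 5, 4, 6, 5, 5]
bursty = [0, 1, 14, 2, 0, 19, 1, 3]

print(f"steady: {dispersion_index(steady):.2f}")
print(f"bursty: {dispersion_index(bursty):.2f}")
```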

  9. Impact On Change-Detection
     - Noise weakens significance.
     - The significance P-value governs the number of segments discovered and the ranking of alerts.
     - [Figure panels: high noise, low noise.]

  10. Approaches to Noise
      - Two approaches: adapt and mitigate, or filter and attenuate.
      - Trade-off: one is cheap but attenuates signal as well as noise; the other is expensive but gives a clearer perception of the desired signal.

  11. Outline: CoCITe, Noise, Mixture Distributions, Results.

  12. Gamma-Poisson Mixture (Negative Binomial)
      - Absolute frequency (to track raw word count): the number of occurrences of a word in the bin at time $t$ is Poisson($\Lambda_t$), where $\Lambda_t \sim \gamma(\mu_t/\theta_t,\ \theta_t)$.
      - $\mu_t$ is a piecewise-linear function of $t$ with cyclic modulation; $\theta_t$ controls dispersion (slowly varying).
      - Probability mass function:
        $$P(X = x) = \frac{\Gamma(\mu/\theta + x)\,\theta^{x}}{x!\,\Gamma(\mu/\theta)\,(1+\theta)^{\mu/\theta + x}}$$
      - Cumulative distribution (regularized incomplete beta function):
        $$P(X \le x) = I_{1/(1+\theta)}(\mu/\theta,\ x+1)$$
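The mass function above can be evaluated stably in log space with `math.lgamma`. This is a sketch using only the standard library; the CDF here is a direct sum of the mass function rather than the incomplete beta function, which is a substitution made to avoid extra dependencies. Function names and parameter values are hypothetical.

```python
import math

def negbin_logpmf(x: int, mu: float, theta: float) -> float:
    """Log P(X = x) for the gamma-Poisson mixture with mean mu and
    dispersion theta (mu > 0, theta > 0)."""
    r = mu / theta
    return (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
            + x * math.log(theta) - (r + x) * math.log1p(theta))

def negbin_cdf(x: int, mu: float, theta: float) -> float:
    """P(X <= x) by direct summation of the mass function."""
    return sum(math.exp(negbin_logpmf(k, mu, theta)) for k in range(x + 1))

mu, theta = 3.0, 0.5  # hypothetical parameters
pmf = [math.exp(negbin_logpmf(k, mu, theta)) for k in range(200)]
est_mean = sum(k * p for k, p in enumerate(pmf))
est_var = sum((k - est_mean) ** 2 * p for k, p in enumerate(pmf))
print(f"mean {est_mean:.4f}, variance {est_var:.4f}")
```

For this parameterization the mean is $\mu$ and the variance is $\mu(1+\theta)$, so $\theta > 0$ inflates the variance above the Poisson value.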

  13. Beta-Binomial Mixture
      - Relative frequency (to track the proportion of documents containing a word): the number of documents in the bin at time $t$ that contain the word is Binomial($n_t$, $P_t$), where $P_t \sim \beta(p_t/\theta_t,\ (1-p_t)/\theta_t)$.
      - $p_t$ is a piecewise-linear function of $t$ with cyclic modulation; $\theta_t$ controls dispersion (slowly varying).
      - Probability mass function, where $B(\cdot,\cdot)$ is the complete beta function:
        $$P(x) = \binom{n}{x}\,\frac{B(p/\theta + x,\ (1-p)/\theta + n - x)}{B(p/\theta,\ (1-p)/\theta)}$$
      - $P(X \le x)$ is ugly.
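The beta-binomial mass function can be computed the same way, using log-gamma for both the binomial coefficient and the beta functions. This is a standard-library sketch with hypothetical names and parameter values; the CDF is a plain sum over the mass function.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Log of the complete beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabin_logpmf(x: int, n: int, p: float, theta: float) -> float:
    """Log P(X = x) for the beta-binomial with mean proportion p and
    dispersion theta (0 < p < 1, theta > 0)."""
    a, b = p / theta, (1.0 - p) / theta
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    return log_choose + log_beta(a + x, b + n - x) - log_beta(a, b)

def betabin_cdf(x: int, n: int, p: float, theta: float) -> float:
    """P(X <= x) by direct summation of the mass function."""
    return sum(math.exp(betabin_logpmf(k, n, p, theta)) for k in range(x + 1))

n, p, theta = 20, 0.3, 0.1  # hypothetical parameters
pmf = [math.exp(betabin_logpmf(k, n, p, theta)) for k in range(n + 1)]
est_mean = sum(k * q for k, q in enumerate(pmf))
est_var = sum((k - est_mean) ** 2 * q for k, q in enumerate(pmf))
print(f"mean {est_mean:.4f}, variance {est_var:.4f}")
```

With this parameterization the mean is $np$ and the variance is $np(1-p)(1+n\theta)/(1+\theta)$, which exceeds the binomial variance whenever $n > 1$.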

  14. Goodness of Fit of Beta-Binomial: data from a CHI Scan customer care app; $\chi^2$ not significant.

  15. Goodness of Fit of Negative Binomial: data from a threat management system, scaled to an “iid” sequence using the periodic model; $\chi^2$ not significant.

  21. Implementation
      - Significance test: no standard tests and little prior art; must be efficient (~µs).
      - The test sits inside nested loops:
        For each bin
          For each metavalue
            For each word
              For each t
                For each number of segments
                  For each s
                    Is it significant?

  23. Implementation
      - Test for over-dispersion: Poisson, Dean-Lawless statistic (1989); Binomial, Tarone statistic (1979).
      - Likelihood: use the probability mass function.
      - Estimation of the over-dispersion parameter $\theta_t$: moments estimates using geometrically-weighted sums over the data; suitable for stream implementation.
      - Significance test: no standard tests and little prior art; must be efficient (~µs). CDFs are used to obtain upper and lower bounds on the P-value (allowing for variance of the nuisance parameter), which are then combined by a weighted geometric mean.
      - Measure of interest.
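One way a moments estimate of the dispersion parameter could be built from geometrically-weighted sums is sketched below for the negative binomial case. It uses the identity variance = $\mu(1+\theta)$, so $\theta \approx$ variance/mean $- 1$. It assumes a locally constant mean, which the talk's model does not (there $\mu_t$ varies with time), so treat it as an illustration of the technique rather than the estimator actually used. The class name, decay value, and data are hypothetical.

```python
class ThetaEstimator:
    """Streaming moments estimate of negative binomial dispersion,
    assuming a locally constant mean (a simplifying assumption)."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.w = 0.0   # geometrically-weighted count
        self.s1 = 0.0  # geometrically-weighted sum of x
        self.s2 = 0.0  # geometrically-weighted sum of x squared

    def update(self, x: float) -> None:
        d = self.decay
        self.w = d * self.w + 1.0
        self.s1 = d * self.s1 + x
        self.s2 = d * self.s2 + x * x

    def theta(self) -> float:
        """Return max(0, variance / mean - 1); 0 means no over-dispersion."""
        if self.w == 0.0 or self.s1 == 0.0:
            return 0.0
        m = self.s1 / self.w
        v = self.s2 / self.w - m * m
        return max(0.0, v / m - 1.0)

bursty = ThetaEstimator()
for x in [0, 1, 14, 2, 0, 19, 1, 3]:  # hypothetical bin counts
    bursty.update(x)
print(f"theta estimate: {bursty.theta():.2f}")
```

Each update costs a handful of multiply-adds and no history is stored, which is what makes this style of estimate suitable for a stream.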

  24. Significance Test for Two Beta-Binomials
      - Comparing two binomials (2 × 2 contingency table, Fisher’s exact test):

                    B = 1     B = 2     Total
          A = 1     n_11      n_12      n_10
          A = 2     n_21      n_22      n_20
          Total     n_01      n_02      n_00

        Using the unknown common $P(A = 1) = p$:
        $$P(\text{table}) = \binom{n_{01}}{n_{11}}\binom{n_{02}}{n_{12}}\,p^{n_{10}}(1-p)^{n_{20}}$$
        Conditioning on row totals, the nuisance parameter $p$ disappears:
        $$P(\text{table} \mid n_{10}, n_{20}) = \binom{n_{01}}{n_{11}}\binom{n_{02}}{n_{12}} \Big/ \binom{n_{00}}{n_{10}}$$
        Summing over tables with the same row totals that are no more likely than the actual one gives the P-value.
      - Comparing two beta-binomials: the table probability is the product of two beta-binomials with the same nuisance parameter $p$. Conditioning on row totals does not eliminate $p$. Could use Barnard’s test instead: for each $p$, sum over all tables no more likely than the actual one, then maximize over $p$ to obtain the P-value. Very slow!
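The conditional table probability for the two-binomial case translates directly into a short two-sided Fisher exact test. This sketch follows the slide's notation for the cell counts; the function name and the small relative tolerance for floating-point ties are additions for illustration.

```python
from math import comb

def fisher_exact_p(n11: int, n12: int, n21: int, n22: int) -> float:
    """Two-sided Fisher exact P-value for a 2 x 2 table: the total
    probability of all tables with the same margins that are no more
    likely than the observed one."""
    n01, n02 = n11 + n21, n12 + n22  # column totals
    n10 = n11 + n12                  # first row total
    n00 = n01 + n02                  # grand total
    denom = comb(n00, n10)

    def table_prob(a: int) -> float:
        # P(table | row totals) when the top-left cell equals a.
        return comb(n01, a) * comb(n02, n10 - a) / denom

    lo, hi = max(0, n10 - n02), min(n01, n10)
    p_obs = table_prob(n11)
    # The factor guards against floating-point ties (an added safeguard).
    return sum(table_prob(a) for a in range(lo, hi + 1)
               if table_prob(a) <= p_obs * (1 + 1e-9))

print(fisher_exact_p(3, 1, 1, 3))  # 34/70, about 0.4857
```

Only the top-left cell needs to be enumerated, because the margins fix the other three cells.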

  25. Fast Significance Test (Both Distributions)
      - Estimate the common mean from the data.
      - Allow for the variance of this estimate: if r.v. $Y$ is a function of r.v. $X$ then $\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}[E(Y \mid X)]$, and assume the same family.
      - One observation must then be larger than its expected mean and one smaller.
      - The critical region lies below the red contour (probability equal to that for the observed $(f_1, f_2)$).
      - The total mass of rectangular regions can be obtained quickly from the product of CDFs.
      - Lower bound on the P-value from the blue rectangle; upper bound from the difference between the purple and green rectangles.
      - Final value: weighted geometric mean of the upper bound and the (tighter) lower bound.
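Two ingredients of this slide are easy to sketch: the mass of a rectangle under two independent counts is a product of CDF differences, and the two bounds are combined by a weighted geometric mean. The code below illustrates only those two steps. It does not reproduce how the rectangles are chosen from the contour, a Poisson CDF stands in for the mixture CDFs, and the weight `w` and the bound values are hypothetical because the slide does not give them.

```python
import math
from typing import Callable

def poisson_cdf(x: int, lam: float) -> float:
    """P(X <= x) for X ~ Poisson(lam); a stand-in for the mixture CDFs."""
    if x < 0:
        return 0.0
    return sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
               for k in range(x + 1))

def rectangle_mass(cdf1: Callable[[int], float], cdf2: Callable[[int], float],
                   lo1: int, hi1: int, lo2: int, hi2: int) -> float:
    """P(lo1 < X1 <= hi1 and lo2 < X2 <= hi2) for independent X1, X2."""
    return (cdf1(hi1) - cdf1(lo1)) * (cdf2(hi2) - cdf2(lo2))

def weighted_geometric_mean(lower: float, upper: float, w: float) -> float:
    """Combine two positive bounds; w is the weight on the lower bound."""
    return lower ** w * upper ** (1.0 - w)

cdf = lambda x: poisson_cdf(x, 3.0)
mass = rectangle_mass(cdf, cdf, -1, 2, 3, 8)  # X1 in 0..2, X2 in 4..8
p_value = weighted_geometric_mean(0.01, 0.04, 0.5)  # hypothetical bounds
print(f"rectangle mass {mass:.4f}, combined P-value {p_value:.3f}")
```

Each rectangle costs four CDF evaluations regardless of its size, which is the source of the speed advantage over summing table probabilities cell by cell.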

  26. Outline: CoCITe, Noise, Mixture Distributions, Results.

  27. Example from Threat Management System Data
