Improvement of Log Pattern Extracting Algorithm Using Text Similarity

ZHAO Yining, Computer Network Information Center, Chinese Academy of Sciences. HPBDC18, 2018/05/21


  1. Improvement of Log Pattern Extracting Algorithm Using Text Similarity. ZHAO Yining, Computer Network Information Center, Chinese Academy of Sciences. HPBDC18, 2018/05/21

  2. Content
     - CNGrid & LARGE
     - Why Log Patterns & Extracting Algorithm
     - Algorithm of Identical Word Rate
     - Text Similarity Based Approach
       - Improved Extracting Formation & LCS
       - Experiment Result
     - Modified Log Comparing Model
     - Summary & Future Work

  3. CNGrid & LARGE
     - China National HPC Environment
       - 2 Operating Centers (Beijing / Hefei)
       - 19 Sites (200PF + 162PB)
       - Portal with Micro-Service Architecture
       - Application oriented Global Scheduling & Predicting
       - Resource Evaluation Standard & Comprehensive Evaluation Index

  4. CNGrid & LARGE
     - Log Analyzing fRamework in Grid Environment (LARGE)

  5. Log Patterns & Extracting Algorithm
     - We want to be alerted for logs in certain patterns, but:
       - there are too many logs for a human to read
       - patterns need to be summarized before alert rules can be defined
     - Set of log patterns in our context:
       - patterns are different from each other
       - the set covers all logs in the original set
       - the set is significantly smaller than the original set
     - The process of using log patterns:
       - filter and remove frequent normal logs
       - use log pattern extraction algorithms to get the set of patterns
       - manually check the set and pick out abnormal patterns
       - define rules to generate alerts for these patterns

  6. Algorithm of Identical Word Rate
     - Algorithm of identical word rate (IWR): a straightforward way
       - identical words
         - 2 words that are identical
         - and in the same position in the 2 original logs
       - identical word rate
         - IWR = (number of identical words) / (total words)
         - predefined threshold t
         - if IWR is greater than t, the two logs are in one pattern
     - Process of the IWR algorithm
       - set threshold t and an initially empty pattern set P
       - for each new incoming log, compute IWR with each pattern in P
       - if a pattern matches, skip to the next log; if none matches, add the log to P
     - Significant limitation
       - Logs of different lengths have an IWR of zero!
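
The IWR process on this slide can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: whitespace tokenization, the function names, and the choice to let the first log seen stand for its pattern are all assumptions.

```python
def identical_word_rate(log_a: str, log_b: str) -> float:
    """Fraction of positions holding the same word in both logs."""
    words_a, words_b = log_a.split(), log_b.split()
    # Per the limitation stated on the slide, logs of different
    # lengths get an IWR of zero. Empty logs are also scored zero here
    # (an assumption, to avoid dividing by zero).
    if len(words_a) != len(words_b) or not words_a:
        return 0.0
    identical = sum(1 for a, b in zip(words_a, words_b) if a == b)
    return identical / len(words_a)


def extract_patterns_iwr(logs, t):
    """Return the pattern set P built with threshold t."""
    patterns = []  # P starts empty
    for log in logs:
        # A log joins an existing pattern if its IWR exceeds t;
        # otherwise the log itself becomes a new pattern (assumption).
        if not any(identical_word_rate(log, p) > t for p in patterns):
            patterns.append(log)
    return patterns


logs = [
    "session opened for user alice",
    "session opened for user bob",
    "session closed for user alice by admin",
]
patterns = extract_patterns_iwr(logs, t=0.5)
```

In this made-up input the first two logs share 4 of 5 words and collapse into one pattern, while the third log has 7 words, so it scores zero against the first and is added as a separate pattern even though it looks closely related.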

  7. Text Similarity Based Approach (1)
     - Using text similarity to resolve the problem
       - $S = P \times O$
       - $S$: similarity, $P$: proportion of common words, $O$: order factor
     - For two logs $l_1$ and $l_2$, let $L_1$ and $L_2$ be their respective word sets
       - define $P$: $P(l_1, l_2) = \dfrac{2\,|L_1 \cap L_2|}{|L_1| + |L_2|}$
       - define $O$: $O(l_1, l_2) = \dfrac{\mathrm{SeqSim}(l_1, l_2)}{|L_1 \cap L_2|}$
       - hence $S$: $S(l_1, l_2) = \dfrac{2\,\mathrm{SeqSim}(l_1, l_2)}{|L_1| + |L_2|}$
     - In this way, logs of different lengths can be compared
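
A small worked example may help show how the factors combine. The two logs are made up for illustration: $l_1$ = "connection from host A closed" (5 words) and $l_2$ = "connection from host B closed by user" (7 words). They share the four words connection, from, host and closed, in the same order, so SeqSim is taken as 4 here, reading it as the length of the longest common subsequence as the next slide does.

```latex
\begin{align*}
|L_1| &= 5, \quad |L_2| = 7, \quad |L_1 \cap L_2| = 4, \quad \mathrm{SeqSim}(l_1, l_2) = 4 \\
P(l_1, l_2) &= \frac{2 \times 4}{5 + 7} = \frac{2}{3} \\
O(l_1, l_2) &= \frac{4}{4} = 1 \\
S(l_1, l_2) &= P(l_1, l_2) \times O(l_1, l_2) = \frac{2 \times 4}{5 + 7} = \frac{2}{3} \approx 0.67
\end{align*}
```

Under the IWR algorithm the same pair would score zero because the lengths differ.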

  8. Text Similarity Based Approach (2)
     - Using the Longest Common Subsequence (LCS) to define $\mathrm{SeqSim}(l_1, l_2)$
       - $S(l_1, l_2) = \dfrac{2\,|\mathrm{LCS}(l_1, l_2)|}{|L_1| + |L_2|}$
       - Same pattern if $S(l_1, l_2) \ge t$, where $t$ is the predefined threshold
     - The process of the improved log pattern extracting algorithm
       - set the threshold value $t$; set the initial log pattern set $P$ to be an empty set
       - for a new log $l$ from the input log set $L$, compute $S_i(l, p_i)$ between $l$ and every $p_i \in P$ using an LCS algorithm
       - if there is no $S_i(l, p_i) \ge t$, add $l$ to $P$
       - after all logs in $L$ have been checked, return $P$
     - Increases the time cost of a single comparison
       - but reduces the total number of comparisons
       - the extra cost can be offset by choosing a better LCS algorithm
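
The improved process can be sketched as below. It is an illustrative sketch under stated assumptions, not the author's code: logs are split on whitespace, $|L_1|$ and $|L_2|$ are taken as token counts, the LCS is the textbook dynamic-programming version (the slide suggests a faster LCS algorithm could be substituted), and all function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Standard O(len(a) * len(b)) dynamic programming, keeping one row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(log_a: str, log_b: str) -> float:
    """S = 2 * |LCS| / (|L1| + |L2|), with lengths as token counts."""
    a, b = log_a.split(), log_b.split()
    if not a and not b:
        return 1.0  # assumption: two empty logs count as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def extract_patterns(logs, t):
    """Return the pattern set P for threshold t."""
    patterns = []  # P starts empty
    for log in logs:
        # Add the log to P only if no existing pattern reaches S >= t.
        if not any(similarity(log, p) >= t for p in patterns):
            patterns.append(log)
    return patterns


a = "connection from host A closed"
b = "connection from host B closed by user"
patterns = extract_patterns([a, b, "disk quota exceeded"], t=0.6)
```

With these made-up logs, `a` and `b` have different lengths but still score 2/3, so at t = 0.6 they fall into one pattern, and only the unrelated third log starts a new one.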

  9. Text Similarity Based Approach (3)
     - Experiment result
       - numbers of extracted patterns

  10. Text Similarity Based Approach (3)
      - Experiment result
        - time costs of candidate algorithms (in milliseconds)

  11. Modified Pattern Comparing Model (1)
      - The original model has a high time cost when searching for patterns
        - it has to visit patterns one by one until a matching one is found
      - Use a hashmap to accelerate the matching
        - divide the pattern set into subsets by initial word
        - skip the majority of patterns, which sit in irrelevant subsets
      - Matching process:
        1. get the initial word of the log
        2. hash the word
        3. find the desired subset in the hashmap
        4. compare with the patterns in that subset
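
The four matching steps can be sketched with a Python dictionary, which performs the hashing and subset lookup (steps 2 and 3) internally. This is a hedged illustration: the `PatternIndex` class, its method names, and the use of the LCS-based similarity with threshold t as the comparison in step 4 are assumptions, not details given on the slide.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Longest common subsequence length of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """S = 2 * |LCS| / (|L1| + |L2|) over two non-empty token lists."""
    return 2 * lcs_length(a, b) / (len(a) + len(b))


class PatternIndex:
    """Pattern set divided into subsets keyed by initial word."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buckets = defaultdict(list)  # initial word -> patterns

    def add(self, pattern):
        words = pattern.split()
        self.buckets[words[0]].append((pattern, words))

    def match(self, log):
        words = log.split()
        if not words:
            return None
        # Steps 1-3: take the initial word and look up its subset.
        candidates = self.buckets.get(words[0], [])
        # Step 4: compare only against patterns in that subset.
        for pattern, pattern_words in candidates:
            if similarity(words, pattern_words) >= self.threshold:
                return pattern
        return None


index = PatternIndex(threshold=0.6)
index.add("sshd accepted password for alice")
index.add("cron job started")
hit = index.match("sshd accepted password for bob")
miss = index.match("kernel panic detected")
```

A log whose initial word has no subset, such as the last one here, is rejected without a single similarity computation, which is where the saving comes from.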

  12. Modified Pattern Comparing Model (2)
      - This approach cannot deal with patterns whose initial words are not fixed
        - build a separate unfixed pattern set for them
      - In the real system, the pattern set is split into 4 parts:
        - fixed alert pattern set
        - unfixed alert pattern set
        - fixed normal pattern set
        - unfixed normal pattern set
      - When a new log comes in, it is compared against the 4 sets in turn to decide how it is processed
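
One possible shape for the four-set lookup is sketched below. Several points are assumptions, because the slide does not specify them: the sets are checked in the order listed (fixed alert, unfixed alert, fixed normal, unfixed normal), fixed sets are dictionaries keyed by initial word while unfixed sets are plain lists scanned linearly, the comparison is the LCS-based similarity against a threshold, and a log that matches nothing is labelled "unknown". The `PatternSets` class and its methods are hypothetical names.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Longest common subsequence length of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """S = 2 * |LCS| / (|L1| + |L2|) over two non-empty token lists."""
    return 2 * lcs_length(a, b) / (len(a) + len(b))


class PatternSets:
    def __init__(self, threshold):
        self.threshold = threshold
        # Fixed sets: subsets keyed by initial word (hashmap lookup).
        self.fixed = {"alert": defaultdict(list), "normal": defaultdict(list)}
        # Unfixed sets: initial word varies, so they are scanned in full.
        self.unfixed = {"alert": [], "normal": []}

    def add(self, pattern, kind, fixed_initial=True):
        words = pattern.split()  # patterns are assumed non-empty
        if fixed_initial:
            self.fixed[kind][words[0]].append(words)
        else:
            self.unfixed[kind].append(words)

    def _hit(self, words, candidates):
        return any(similarity(words, p) >= self.threshold for p in candidates)

    def classify(self, log):
        words = log.split()
        if not words:
            return "unknown"
        for kind in ("alert", "normal"):  # assumed checking order
            if self._hit(words, self.fixed[kind].get(words[0], [])):
                return kind
            if self._hit(words, self.unfixed[kind]):
                return kind
        return "unknown"


sets = PatternSets(threshold=0.7)
sets.add("sshd failed password for root", "alert")
sets.add("cron job started", "normal")
sets.add("alice logged in from console", "normal", fixed_initial=False)
```

Here "bob logged in from console" would miss every fixed subset because no pattern starts with "bob", and is only caught by the linear scan of the unfixed normal set.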

  13. Modified Pattern Comparing Model (3)
      - Real time cost comparison between original & modified models
      - [Figure: four bar charts, one per log type (cron, maillog, secure, messages), each comparing the time cost in milliseconds of the original model and the modified model]

  14. Summary & Future Work
      - Log patterns: used to build log recognition
      - The IWR algorithm cannot match logs of different lengths
      - Text similarity and LCS are used to improve the algorithm
      - The log comparing model is modified to accelerate the process
      - Future work: log pattern based analyses in CNGrid
        - log pattern associations
        - log flow feature modeling
