An Algorithm for Message Type Discovery in Unstructured Log Data - PowerPoint PPT Presentation

An Algorithm for Message Type Discovery in Unstructured Log Data Daniel Tovarňák Masaryk University CSIRT-MU ICSOFT 2019, July 27

ICSOFT 2019 › Motivation 1 of 16 Motivation

ICSOFT 2019 › Motivation 2 of 16 Log Analysis via Complex Event Processing (CEP) Data stream processing: real-time data processing paradigm CEP: detection of complex patuerns in streams of data elements Event objects: actual representation of the elements in the stream ▶ commonly used to deal with high-velocity data ▶ visions for use in real-time log analysis, especially security monitoring ▶ as opposed to full-text indexing and column-based indexing of log data ▶ expected to be properly structured and described via an explicit data schema ▶ much like in RDBMS Unstructured log entries � event objects ▶ semi-structured log entries � event objects

ICSOFT 2019 › Motivation 3 of 16 Logging and Log Data – 5Vs of Big Data Traditional manifestation – log fjles with arbitrary text messages Value: widely-used source of monitoring information Veracity: poor-quality, unstructured nature, complicated analysis Variability: pervasive devel. practice spanning SW on all IT layers Velocity + Volume: can exceed 100,000 entries/sec, 1 MB/s per node ▶ debugging, troubleshooting, fault detection, security, forensics, compliance ▶ 2017-07-23T19:35:45Z [0] ERR!: Jack said he will take care of this! ▶ this stems from the way logs are generated – messages in natural language ▶ data source and data format heterogeneity ▶ HP company – 1 × 10 12 entries/day generated, 3 × 10 9 entries/day processed

ICSOFT 2019 › Motivation 4 of 16 Bridging the Gap by Normalization Data integration perspective: bridge the gap by normalization Normalization: unifjcation of data on any of its 4 layers Our Goal: Improve the way log data can be represented and accessed by normalizing them ▶ known patuern to improve interoperability ▶ missing structure is added via transformation and enrichment ▶ overall heterogeneity is eliminated thanks to a single canonical form ▶ data structures ▶ data types ▶ data representation ▶ transport into streams of event objects.

ICSOFT 2019 › Motivation SessionClosed() {ts=...425, host="147.251.20.110", user="alice", app="sshd"} WHERE attempts > 1000, success=false FROM UserLogin.win:time(30 sec) SELECT host, user, count(*) AS attempts CREATE MAP SCHEMA UserLogin(host string, success boolean, user string); ServiceCrash() {ts=...427, host="147.251.19.160", service="Billing", code=42} 5 of 16 UserLogin() {ts=...426, host="147.251.10.125", success=False, user="alice"} UserLogin() {ts=...424, host="147.251.11.100", success=True, user="bob"} 3.12.2016 10:03:47 147.251.19.160 [Super.java]: {service=Billing, status=0x2A} Dec 03 2016 10:03:46 [147.251.10.125] --- WARN: User alice failed to log in 2016-12-03T10:03:45Z 147.251.20.110 sshd[1551]: session closed for user alice Dec 03 2016 10:03:44 [147.251.11.100] --- INFO: User bob logged in Research Goal (Simplified) GROUP BY host, user ↓ NORMALIZATION ↓ ⇓ UserLogin ⇓ ⇓ SessionClosed ⇓ ⇓ ServiceCrash ⇓

ICSOFT 2019 › Reactive Normalization 6 of 16 Reactive Normalization

ICSOFT 2019 › Reactive Normalization › Log Abstraction User Ruth logged out Log abstraction is a two-tier procedure: User (?<user>\w+) logged (?<action>\w+) LOG.info("User {} logged {}", user, action); Service (\w+) started User (\w+) logged (\w+) Service * started : [$1] 7 of 16 User * logged * : [$1, $2] Service httpd started User Bob logged in Service sshd started User John logged out User Jack logged in Regular Expressions Log Messages Log Abstraction (Separation) ⇒ Message Types ⇒ ↓ Dec 03 2016 10:03:44 -- INFO: User bob logged in � ▶ message type discovery ▶ patuern-matching via regular expressions

ICSOFT 2019 › Reactive Normalization › Message Type Discovery 8 of 16 Message Type Discovery Manual discovery: tiresome process, which leads to errors Static code analysis: perfectly possible Data mining: use already generated log messages (historical data) ▶ automated approaches are necessary ▶ we were able to discover approx. 4500 message types in Hadoop source code ▶ source code is not always available (e.g. for network devices) ▶ 9 existing approaches were studied, e.g. SLCT, IPLoM, logSig, N-V, … Existing approaches have accuracy and usability issues

ICSOFT 2019 › Reactive Normalization › Message Type Discovery Complicated Parameterization Shortcomings of Existing Approaches Generation of Overlapping Message Types No Support for Multiple Token Delimiters 9 of 16 No Support for Multi-Word Variables ▶ User root logged * ▶ User * logged in ▶ User * logged * ▶ only a single delimiter for tokenization, e.g. ’space’ ▶ limited accuracy ▶ each dataset is difgerent and the algortihms sometimes need to be fjne-tuned ▶ some approaches use up to 5 unbounded parameters ▶ User foo bar logged in ▶ User root logged in ▶ User {1:2} logged {1:1}

ICSOFT 2019 › Reactive Normalization › Message Type Discovery 0 0 2 0 sshd 0 2 started 0 0 0 4 Method of n -th percentile: frequency table + percentile threshold Post-processing to improve accuracy and usability 1. eliminate overlapping message types by merging 10 of 16 httpd 0 Service httpd started | [4,2,4] Extended Nagappan-Vouk Algorithm Service sshd started | [4,2,4] Service httpd started | [4,2,4] Service sshd started 4 | [4,2,4] Service * started 1 2 3 Service 2. identify multi-word variable positions ▶ log messages are tokenized via a set of delimiters ▶ [4,2,4] in example is a log message score ▶ word is a variable, if it has a frequency lower than n -th percentile of score

ICSOFT 2019 › Reactive Normalization › Message Type Discovery BOOL: mt2: 'Start processing (%{WORD:var1}) %{MWRD_1_2:var2}=%{WORD:var3}' mt1: 'User %{WORD:var1} logged %{WORD:var2}' grp0: patterns: # patterns describing the message types "[^ \n\r]+([\\s][^ \n\r]+){0,1}"] [string, MWRD_1_2: "[^ \n\r]+"] [string, ARBITRARY: "[0-9a-zA-Z]+"] [string, WORD: [boolean, "\btrue\b|\bfalse\b"] [integer, "[0-9]+"] 11 of 16 INT: regexes: # regex tokens Start processing (nor) Root=user Service httpd:8080 started Start processing (xor) Tom Sawyer=user Start processing (xor) Thomas=user User Ruth logged out User Bob logged in Start processing (xor) Daniel=user User John logged out Service sshd:22 started Start processing (xor) Jen=user Discovered Pattern-Set Example mt3: 'Service %{WORD:var1}:%{INT:var2} started' ⇓ percentile=60, delimiters=' :=' ⇓

ICSOFT 2019 › Reactive Normalization › Message Type Discovery 0.57 0.64 0.99 0.94 0.90 0.892 LKE 0.67 0.17 0.78 IPLoM 0.81 0.600 LogSig 0.26 0.77 0.91 0.96 0.84 0.99 0.818 12 of 16 HDFS Evaluation Discovered message types partition the log messages into groups F-measure: common accuracy metric in IR, higher is betuer Ground truth: 5 real-life data-sets, MTs manually discovered BGL 0.89 HPC Zookeeer Proxifjer AVG SLCT 0.61 0.81 0.86 0.92 0.748 ▶ F = 2 · Precision · Recall Precision + Recall – how “close“ our grouping is to the ground truth ▶ P. He, et al. An Evaluation Study on Log Parsing and Its Use in Log Mining ▶ best average F-measure ( IPLoM ) – 0.892

ICSOFT 2019 › Reactive Normalization › Message Type Discovery 0.9985 0.9182 0.8220 0.86838 0.4949 0.9856 1.0000 0.9979 0.8547 13 of 16 0.9861 0.9861 1.0000 0.9999 1.0000 0.99690 Percentile threshold [ x ] F-measure BGL HPC HDFS Zookeeper 0.6965 0.86662 0.9191 0.7882 Results & Findings BGL HPC HDFS Zook. Proxif. AVG 0.8556 0.8778 1.0000 Proxifjer 0.9861 0.86756 0.9251 0.95316 0.8547 0.9999 1.0000 0.8162 n = 50 , d = space n = 50 , d = default n = 15 , d = default n = 85 , d = default n = 50 , d = best* 1 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 0 . 4 0 . 3 0 . 2 0 . 1 0 0 10 20 30 40 50 60 70 80 90 100

ICSOFT 2019 › Summary 14 of 16 Summary

ICSOFT 2019 › Summary 15 of 16 Summary & Future Work Tie designed algorithm has a very high accuracy on real-world data Logging code is constantly evolving How to switch to online (streaming) operation mode? How to switch to fully-automated mode? How to version the discovered patuern-sets?

ICSOFT 2019 › Acknowledgement 16 of 16 Acknowledgement

An Algorithm for Message Type Discovery in Unstructured Log Data - PowerPoint PPT Presentation

An Algorithm for Message Type Discovery in Unstructured Log Data Daniel Tovark Masaryk University CSIRT-MU ICSOFT 2019, July 27 ICSOFT 2019 Motivation 1 of 16 Motivation ICSOFT 2019 Motivation 2 of 16 Log Analysis via Complex

Skill discovery from unstructured demonstrations Skill discovery from unstructured demonstrations

Type Checking Grammar Rule Semantic Rule var-decl id : type-exp Insert (id.name, type-exp .

CFD General Notation System (CGNS) Usage for unstructured grids Edwin van der Weide Stanford

UNESCO Discovery Centre reference image of education space UNESCO Discovery Centre Discovery

COMP31212: Concurrency Topics 4.3: Message Passing Topic 4.3: Message Passing Outline Topic

GSM Short Message Service GSM Short Message Service GSM Short Message Service GSM Short Message

Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval

Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval

Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval

Data and Analysis Part III Unstructured Data Ian Stark February 2011 Part III: Unstructured

Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval

ROI HSP Design Clarification Recipient ID in Message must match the physical Message recipient

Lecture Notes: Message Management 1 Slide 1: Message Management Message Management A critical

Web Engineering HTTP-message = Request | Response generic-message = start-line *message-header

References Message Authentication Codes (MACs) Message Authentication Codes (MACs), Chapter

Message Passing Concepts Message Passing Model The message passing model is based on the

Integrating United States And Canadian Air Monitoring Data Bill Sukloff Computer Systems

The The SeETL RunTime RunTime SeETL Utilities Presentation Utilities Presentation

Form 1: Presentation Checklist Delete this italicized text, and add the title of your proposal

Beta Presentation 2020 Business in a Box The Capstone Experience Team DRIVEN-4 Betty Wu Brian

Easy Tracker EasyTracker: Automatic Transit Tracking, Mapping, and Arrival Time Prediction Using

2012 Dell Analyst Meeting June 13, 2012 Join the conversation #DellAM12 Follow @DellShares

Leading corporations to improve our communities: The future of corporate responsibility Erin

Internal Use - Confidential Journey April 2017 May 7, 2019 2001-2007 Merger Collaboration