  1. Automatic Event Log Abstraction to Support Forensic Investigation Hudan Studiawan, Ferdous Sohel, Christian Payne College of Science, Health, Engineering and Education Murdoch University, Perth, Australia The Australasian Information Security Conference (AISC 2020) Swinburne University of Technology, Melbourne, Victoria, Australia

  2. CORE Student Travel Award We acknowledge that we have received a CORE Student Travel Award.

  3. Outline
  • Introduction
  • Existing Methods
  • The Proposed Method
    • Event Log Preprocessing
    • Grouping based on Word Count
    • Graph Model for Log Messages
    • Grouping with Automatic Graph Clustering
    • Extraction of Event Log Abstraction
  • Experimental Results
  • Conclusion and Future Work

  4. Introduction
  • Abstraction of event logs is the creation of a template containing the most common words that represent all members of a group of event log entries
  • Abstraction helps forensic investigators obtain an overall view of the main events in a log file
  Input log file: auth.log
  Output abstractions:
  #1 Mar * * nssal * removing removable location: *
  #2 Mar 8 * nssal * Invalid user * from *
  #3 Mar 8 * nssal * Failed password for * from * port * ssh2
  …

  5. Existing Methods
  Existing log abstraction methods require user-supplied parameters, and identifying the best parameter values is time consuming.
  • SLCT (Vaarandi, 2003): one mandatory parameter and 14 optional
  • LogCluster (Vaarandi and Pihelgas, 2015): one mandatory parameter and 26 optional
  • IPLoM (Makanju et al., 2012): five mandatory parameters
  • LogSig (Tang et al., 2011): one mandatory parameter
  • Drain (He et al., 2017): three mandatory parameters
  • Model training required (Thaler et al., 2017)

  6. The Proposed Method
  Raw event logs → Automatic log preprocessing → Grouping based on word count → Refine grouping with automatic graph clustering → Get the event log abstraction per cluster

  7. Event Log Preprocessing
  • We parse the log files using nerlogparser, a log parsing tool based on named entity recognition
  • It supports fully automatic parsing because it provides a pre-trained model
  • We then extract unique messages from the log entries
  Input: Jan 18 09:31:32 victoria dhclient: DHCPACK from 10.0.2.2
  Process: automatic parsing with the nerlogparser tool
  Output:
  timestamp: Jan 18 09:31:32
  hostname: victoria
  service: dhclient
  message: DHCPACK from 10.0.2.2
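The field extraction on this slide can be sketched with a plain regular expression. Note this is only an illustrative stand-in that produces the same four fields: nerlogparser itself is NER-based with a pre-trained model, and its real API is not shown here.

```python
import re

# Hypothetical regex-based stand-in for nerlogparser's field extraction.
SYSLOG_RE = re.compile(
    r"(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<hostname>\S+)\s"
    r"(?P<service>[^:\s]+):\s"
    r"(?P<message>.*)"
)

def parse_entry(line: str) -> dict:
    """Split one syslog-style entry into timestamp, hostname, service, message."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else {"message": line}

fields = parse_entry("Jan 18 09:31:32 victoria dhclient: DHCPACK from 10.0.2.2")
```

The unique-message step would then collect the distinct `message` values across all parsed entries.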

  8. Grouping based on Word Count
  • We split the discovered unique messages on the space character and count the number of words
  • An abstraction is extracted from the always-occurring words in a group of log entries of the same length
  Cluster #1:
  Jan 18 09:31:32 victoria dhclient: DHCPACK from 10.0.2.2
  Jan 18 10:56:40 victoria dhclient: DHCPACK from 10.0.2.2
  Feb 6 13:31:12 victoria dhclient: DHCPACK from 10.0.2.5
  Abstraction #1: * * * victoria dhclient: DHCPACK from *
  Cluster #2:
  Feb 6 12:56:48 victoria init: Switching to runlevel: 0
  Jan 18 17:13:49 victoria init: Switching to runlevel: 6
  Feb 6 13:03:53 victoria init: Switching to runlevel: 6
  Abstraction #2: * * * victoria init: Switching to runlevel: *
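The grouping and extraction on this slide can be sketched as follows: group entries by token count, then keep each token that is identical across the whole group and wildcard the rest. This is a minimal illustration, not the paper's implementation.

```python
from collections import defaultdict

def group_by_word_count(entries):
    """Group entries by the number of space-separated tokens."""
    groups = defaultdict(list)
    for entry in entries:
        groups[len(entry.split())].append(entry)
    return groups

def extract_abstraction(group):
    """Keep tokens that are identical across the group; wildcard the rest."""
    token_rows = [entry.split() for entry in group]
    return " ".join(
        col[0] if len(set(col)) == 1 else "*"
        for col in zip(*token_rows)
    )

cluster1 = [
    "Jan 18 09:31:32 victoria dhclient: DHCPACK from 10.0.2.2",
    "Jan 18 10:56:40 victoria dhclient: DHCPACK from 10.0.2.2",
    "Feb 6 13:31:12 victoria dhclient: DHCPACK from 10.0.2.5",
]
print(extract_abstraction(cluster1))
# * * * victoria dhclient: DHCPACK from *
```

Grouping by word count first guarantees that `zip(*token_rows)` compares tokens position by position without alignment issues.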

  9. Graph Model for Log Messages
  The log entries have very diverse vocabularies, so we refine the discovered groups based on string similarity
  • We use automatic graph-based clustering
  • Vertex: a unique message; edge: the weighted Hamming similarity
  Example: the vertices DHCPACK from 10.0.2.2, DHCPACK from 10.0.2.5, and DHCPACK from 192.168.56.100 are connected by edges of weight 0.83, while Sending on Socket/fallback remains isolated
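Building the graph can be sketched with a plain (unweighted) token-wise Hamming similarity as a stand-in; the 0.83 edge weights on the slide come from the paper's weighting scheme, which is not reproduced here, so the weights below differ.

```python
def hamming_similarity(msg_a: str, msg_b: str) -> float:
    """Fraction of positions with identical tokens. A plain stand-in for
    the paper's weighted Hamming similarity."""
    a, b = msg_a.split(), msg_b.split()
    if len(a) != len(b):   # different lengths: treat as not comparable
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

messages = [
    "DHCPACK from 10.0.2.2",
    "DHCPACK from 10.0.2.5",
    "DHCPACK from 192.168.56.100",
    "Sending on Socket/fallback",
]
# The graph as a weighted edge list over unique-message vertices;
# zero-similarity pairs get no edge, leaving dissimilar vertices isolated.
edges = [
    (u, v, hamming_similarity(u, v))
    for i, u in enumerate(messages)
    for v in messages[i + 1:]
    if hamming_similarity(u, v) > 0.0
]
```

As on the slide, the three DHCPACK vertices end up mutually connected and the unrelated message stays isolated.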

  10. Grouping with Automatic Graph Clustering

  11. Building Micro-clusters

  12. Extraction of Abstraction: Merging Abstractions
  • We extract an abstraction from each micro-cluster
  • Merging is needed because the abstraction from one micro-cluster may be very similar to those from others
  • We form pair combinations (Ai, Aj) from all abstractions to be compared
  • Two abstractions Ai and Aj are checked for merging if the weighted Hamming similarity between them is sufficiently high

  13. Example of Merging Abstractions
  Example 1:
  Abstraction #1: Invalid user * from *
  Abstraction #2: Invalid user admin from *
  Example 2:
  Abstraction #1: Invalid user * from 200.27.148.45
  Abstraction #2: Invalid user * from *
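The merging illustrated on this slide can be sketched as a positional compatibility check: two same-length abstractions merge when every position either matches or has a wildcard on at least one side. This is a sketch only; the paper gates candidate pairs on their weighted Hamming similarity first.

```python
def try_merge(abs_a: str, abs_b: str):
    """Merge two abstractions when every token position is compatible.
    Returns the merged abstraction, or None when the pair is incompatible."""
    a, b = abs_a.split(), abs_b.split()
    if len(a) != len(b):
        return None
    merged = []
    for x, y in zip(a, b):
        if x == y:
            merged.append(x)
        elif x == "*" or y == "*":
            merged.append("*")   # wildcard absorbs the differing token
        else:
            return None          # genuinely different constants: no merge
    return " ".join(merged)

print(try_merge("Invalid user * from *", "Invalid user admin from *"))
# Invalid user * from *
```

Both examples on the slide collapse to the more general template "Invalid user * from *".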

  14. Extraction of Abstraction: Final Abstractions
  • In all previous steps, we consider only the message field of a log entry
  • In the final step, we also consider all other fields, such as timestamp, hostname, and service name
  Cluster #1:
  Jan 18 09:31:32 victoria dhclient: DHCPACK from 10.0.2.2
  Jan 18 10:56:40 victoria dhclient: DHCPACK from 10.0.2.2
  Feb 6 13:31:12 victoria dhclient: DHCPACK from 10.0.2.5
  Abstraction #1: * * * victoria dhclient: DHCPACK from *
  Cluster #2:
  Feb 6 12:56:48 victoria init: Switching to runlevel: 0
  Jan 18 17:13:49 victoria init: Switching to runlevel: 6
  Feb 6 13:03:53 victoria init: Switching to runlevel: 6
  Abstraction #2: * * * victoria init: Switching to runlevel: *

  15. Experimental Results: Datasets
  • For all datasets except DFRWS 2016, we recovered the directory /var/log/ from the forensic disk images
  • We retrieved some common log files such as authentication logs, kernel logs, and system logs

  16. Parameter Settings

  17. Comparison of Performance
  • IPLoM performs well because the bijective relationships in a group of log entries accurately capture the most frequently occurring words
  • LogSig's clustering is based on a local search algorithm and can get stuck in local optima, so it cannot cluster log messages precisely

  18. Comparison of Performance
  • Drain performs well because it treats the first few words of a log entry as contributing most significantly to its abstraction; these words are used to construct a fixed-depth tree
  • LogMine over-clusters all datasets because its clustering is incremental: if a log entry's similarity to an existing cluster representative meets the given threshold, the entry is grouped with that cluster; otherwise a new cluster is created
  • Spell employs the longest common subsequence (LCS) technique to obtain the abstractions; LCS cannot capture a potential abstraction whose constant parts appear as separate substrings
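Spell's core operation mentioned above can be illustrated with a standard dynamic-programming LCS over tokens; the example messages below are hypothetical, not from the paper's datasets.

```python
def lcs_tokens(a, b):
    """Longest common subsequence of two token lists (dynamic programming),
    the operation Spell uses to derive a shared template from two messages."""
    m, n = len(a), len(b)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + [a[i]]
            else:
                # carry forward the longer of the two partial subsequences
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

x = "Invalid user admin from 10.0.2.2".split()
y = "Invalid user guest from 10.0.2.5".split()
print(lcs_tokens(x, y))
# ['Invalid', 'user', 'from']
```

The LCS keeps the shared constant tokens in order; positions where the two messages diverge become the variable parts of the template.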

  19. Over-clustering vs Under-clustering
  • The most important step in discovering event log abstractions is clustering
  • If the clustering is performed well, good abstractions will be produced
  • We need to obtain the best cluster composition from the event logs
