SATTVA: SpArsiTy inspired classificaTion of malware VAriants
Lakshmanan Nataraj, S. Karthikeyan, B.S. Manjunath
Vision Research Lab, University of California, Santa Barbara
Sattva (सत्त्व) means Purity
Introduction
• The number of malware samples is increasing!
• In 2014, Kaspersky Lab reported that they process on average 325,000 new malicious files per day
• The main reason for such a deluge is malware mutation: the process of creating new malware from existing ones
http://usa.kaspersky.com/about-us/press-center/press-releases/kaspersky-lab-detecting-325000-new-malicious-files-every-day
Introduction
• Variants are created either by making small changes to the malware code or by changing the structure of the code using executable packers
• Based on their function, variants are classified into different malware families
• Identifying the family of a malware sample plays an important role in understanding and thwarting new attacks
Examples of malware variants
[Figure: malware images of variants from two families: Variants of Family Alueron.gen!J; Variants of Family Fakerean]
Problem Statement
• Consider a malware dataset comprising:
• N labelled malware samples
• L malware families
• P malware samples per family
• The problem is to identify the family of an unknown malware sample u
Related Work
• Static code analysis based features
• Disassembles the executable code and studies its control flow
• Suffers from obfuscation (packing)
• Dynamic analysis based features
• Executes malware in a virtual environment and studies its behavior
• Time consuming, and many recent malware are VM-aware
• Statistical and content based features
• Analyzes statistical patterns based on the malware content
• n-grams, fuzzy hashing, image similarity based features
Statistical and Content based Features
• n-grams
• n-grams are computed either on raw bytes or on disassembled instructions
• Typically n > 1, which makes this computationally expensive
• Fuzzy hashing (ssdeep, pehash)
• Fuzzy hashes are computed on raw bytes or on PE-parsed data
• Does not work well on packed malware
• Image similarity
• Malware binaries are converted to digital images
• Image similarity features (GIST) are computed on the malware images
Malware Images: Visualization and Automatic Classification, L. Nataraj, S. Karthikeyan, G. Jacob, B.S. Manjunath, VizSec 2011
Image Similarity based Features
[Figure: GIST feature pipeline: the malware image is resized, passed through sub-band filtering at k scales, each sub-band is sub-block averaged into an L-D feature vector, and the k vectors are concatenated into a kL-D feature vector]
Image Similarity based Features
• Pros
• Fast and compact
• Better than static code based analysis (works on both packed and unpacked malware)
• Comparable with dynamic analysis
• Cons
• Arbitrary column cutting and reshaping
• Images are resized to a small size for normalization, which introduces interpolation artifacts
• A large malware image, on resizing, loses a lot of information
Approach – Signal Representation
• Let x be the signal representation of a malware sample
• Every entry of x is a byte value of the sample in the range [0, 255]
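As a minimal sketch (in Python, assuming numpy; the helper name and file path are illustrative, not from the paper), the signal representation can be computed directly from the raw bytes:

```python
import numpy as np

def to_signal(path):
    """Read a binary file as a 1-D signal of byte values in [0, 255]."""
    with open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8).astype(np.float64)

# x = to_signal("sample.exe")  # x[i] is the i-th byte of the malware sample
```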
Variants in Signal Representation
[Figure: signal plots of Variant 1 and Variant 2, variants of the recently exposed Regin malware. They differ in only 7 out of 13,284 bytes (0.0527%)]
Approach – Dataset as a Matrix
• Since malware samples are of different sizes, the vectors are zero-padded so that all vectors are of length M, the number of bytes in the largest malware
• We now represent the dataset as an M × N matrix A, where every column of A is a malware sample
Approach – Dataset as a Matrix
• Further, for every family k (k = 1, 2, …, L), we define an M × P block matrix A_k:
A_k = [x_k1, x_k2, …, x_kP]
• A can now be represented as a concatenation of block matrices:
A = [A_1, A_2, …, A_L]
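A sketch of the zero-padding and dictionary assembly, assuming the samples arrive as a list of 1-D numpy arrays ordered family by family (the function name is illustrative):

```python
import numpy as np

def build_dictionary(signals):
    """Stack zero-padded byte signals as the columns of the M x N matrix A."""
    M = max(len(x) for x in signals)   # length of the largest malware sample
    A = np.zeros((M, len(signals)))
    for j, x in enumerate(signals):
        A[:len(x), j] = x              # shorter samples are zero-padded
    return A
```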
Approach – Sparse Linear Combination
• Let u ∈ R^M be an unknown malware test sample whose family is to be determined
• Then u can be represented as a sparse linear combination of the training samples:
u = Σ_{i=1..L} Σ_{j=1..P} α_ij x_ij = Aα
where α = [α_11, α_12, …, α_ij, …, α_LP]^T is the coefficient vector
Approach – Sparse Linear Combination
u = Aα
[ A_1 | A_2 | … | A_L ] (M × N) · α (N × 1) = u (M × 1)
where u is the unknown test sample, A is the matrix of training samples, and α is the sparse coefficient vector
Illustration
• Let the unknown malware belong to family 2. Then
u = α_21 x_21 + α_22 x_22
α = [0, 0, …, α_21, α_22, …, 0, 0]^T
i.e., the nonzero coefficients are concentrated on the training samples of family 2
Approach – Sparse Solution
• The sparsest solution can be obtained by Basis Pursuit, i.e., by solving the ℓ1-norm minimization problem:
α̂ = argmin_{α' ∈ R^N} ||α'||_1 subject to u = Aα'
where ||·||_1 denotes the ℓ1-norm
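Basis Pursuit can be recast as a linear program by splitting α into nonnegative parts. The sketch below uses scipy.optimize.linprog for illustration; the paper does not specify this solver, and a dedicated ℓ1 solver would typically be used in practice:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, u):
    """Solve min ||alpha||_1 subject to A @ alpha = u via an LP:
    alpha = p - q with p, q >= 0, minimizing sum(p) + sum(q)."""
    M, N = A.shape
    c = np.ones(2 * N)          # sum(p) + sum(q) equals ||alpha||_1 at the optimum
    A_eq = np.hstack([A, -A])   # enforces A @ (p - q) = u
    res = linprog(c, A_eq=A_eq, b_eq=u, bounds=(0, None))
    p, q = res.x[:N], res.x[N:]
    return p - q                # recovered coefficient vector alpha
```

Note that the equality constraint assumes an exact model u = Aα; noisy data would call for a relaxed (basis pursuit denoising) variant.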
Approach – Minimal Residue
• To estimate the family of u, we compute residues for every family in the training set and then choose the family with the minimal residue:
r_k(u) = ||u − A δ_k(α̂)||_2
ĉ = argmin_k r_k(u)
where δ_k(α̂) is the characteristic function that selects the coefficients of α̂ associated with family k and zeros out the rest, and ĉ is the index of the estimated family
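A sketch of the minimal-residue rule, assuming the columns of A are ordered family by family with P samples per family (so δ_k simply selects a contiguous block of coefficients):

```python
import numpy as np

def classify(A, alpha, u, L, P):
    """Return the index of the family with the smallest residue r_k(u)."""
    residues = []
    for k in range(L):
        delta_k = np.zeros_like(alpha)
        delta_k[k * P:(k + 1) * P] = alpha[k * P:(k + 1) * P]  # keep family k only
        residues.append(np.linalg.norm(u - A @ delta_k))       # r_k(u) = ||u - A delta_k||_2
    return int(np.argmin(residues))
```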
Random Projections
• The dimensionality M of the malware signals can be high
• We project all the malware to lower dimensions using random projections:
w = Ru = RAα
where R is a D × M pseudo-random matrix (D ≪ M) and w is a D × 1 lower dimensional vector
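One common choice for R, sketched below, is an i.i.d. Gaussian matrix generated from a fixed seed so that the same projection is applied to every sample; the paper's exact pseudo-random construction may differ:

```python
import numpy as np

def random_projection(D, M, seed=0):
    """Return a D x M pseudo-random projection matrix (D << M)."""
    rng = np.random.default_rng(seed)   # fixed seed: same R for all samples
    return rng.standard_normal((D, M)) / np.sqrt(D)

# R = random_projection(512, A.shape[0]); w = R @ u; W = R @ A
```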
Sparse Solution
• The system of equations is underdetermined and can be solved using ℓ1-norm minimization:
α̂ = argmin_{α' ∈ R^N} ||α'||_1 subject to w = RAα'
[ RA_1 | … | RA_L ] (D × N) · α (N × 1) = w (D × 1)
Complete Approach
[Figure: pipeline: malware → signal representation x → data matrix A = [A_1 A_2 … A_L] (M × N) → random projections (RA, D × N) → sparse modeling of the projected test sample w (D × 1)]
Modeling Malware Variants
• New variants are created from existing malware samples by making small changes, so the variant and the original share most of their code
• We model a malware variant as:
u' = u + e_u = Aα + e_u
where u' is the vector representing the malware variant and e_u is the error vector
Modeling Malware Variants
• This can be expressed in matrix form as:
u' = [A I_M] [α; e_u] = B_u s_u
where B_u = [A I_M] is an M × (N + M) matrix, I_M is an M × M identity matrix, and s_u = [α^T e_u^T]^T
• This ensures that the above system of equations is always underdetermined and sparse solutions can be obtained
Sparse Solutions in Lower Dimensions
ŝ_w = argmin_{s' ∈ R^(N+D)} ||s'||_1 subject to w' = B_w s'
r_k(w') = ||w' − B_w δ_k(ŝ_w)||_2
ĉ = argmin_k r_k(w')
where w' = w + e_w = Ru + e_w, B_w = [RA I_D] is a D × (N + D) matrix, I_D is a D × D identity matrix, and s_w = [α^T e_w^T]^T
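Combining the pieces for the projected variant model, a sketch (reusing the basis_pursuit helper from the earlier sketch; the function name is illustrative): B_w is formed by appending an identity block so the error e_w is recovered jointly with α:

```python
import numpy as np

def classify_variant(W, w_prime, L, P):
    """W = R @ A (D x N projected dictionary); w_prime: projected variant."""
    D, N = W.shape
    B_w = np.hstack([W, np.eye(D)])   # B_w = [RA  I_D], size D x (N + D)
    s = basis_pursuit(B_w, w_prime)   # s = [alpha; e_w], length N + D
    residues = []
    for k in range(L):
        delta_k = np.zeros_like(s)
        delta_k[k * P:(k + 1) * P] = s[k * P:(k + 1) * P]  # coefficients of family k
        residues.append(np.linalg.norm(w_prime - B_w @ delta_k))
    return int(np.argmin(residues))
```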
Experiments
• Two datasets: Malimg and Malheur
• Malimg dataset: 25 families, 80 samples per family, M = 840,960
• Malheur dataset: 23 families, 20 samples per family, M = 3,364,864
• We vary the randomly projected dimension D over {48, 96, 128, 256, 512}
• We compare with GIST features of the same dimensions
• Two classification methods: Sparse Representation based Classification (SRC) and Nearest Neighbor (NN) classifier
• 80% training / 20% testing split
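A hypothetical end-to-end harness for this protocol, chaining the earlier sketches (to_signal, build_dictionary, random_projection, basis_pursuit, classify); the split logic is an assumption for illustration, and nothing here reproduces the reported numbers:

```python
import numpy as np

def run_experiment(signals_by_family, D=512, seed=0):
    """signals_by_family: list of L lists of 1-D byte signals,
    assuming an equal number of samples per family as in Malimg/Malheur."""
    train, test = [], []
    for k, fam in enumerate(signals_by_family):
        split = int(0.8 * len(fam))           # 80% train / 20% test per family
        train += fam[:split]
        test += [(x, k) for x in fam[split:]]
    P = int(0.8 * len(signals_by_family[0]))  # training samples per family
    A = build_dictionary(train)               # M x N dictionary
    R = random_projection(D, A.shape[0], seed)
    W = R @ A                                 # projected dictionary, D x N
    correct = 0
    for x, k in test:
        u = np.zeros(A.shape[0])
        u[:min(len(x), len(u))] = x[:len(u)]  # zero-pad (or truncate) to length M
        alpha = basis_pursuit(W, R @ u)       # sparse coefficients in D dimensions
        correct += int(classify(W, alpha, R @ u, len(signals_by_family), P) == k)
    return correct / len(test)
```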
Results on Malimg Dataset
[Figure: classification accuracy (%) vs. projected dimension D (up to 512) for RP+NN, GIST+NN, GIST+SRC, and RP+SRC]
Results on Malimg Dataset
• Best classification accuracy of 92.83% for the combination of Random Projections (RP) + Sparse Representation based Classification (SRC) at D = 512
• Accuracies of GIST features for both classifiers are almost the same, in the range 88%–90%
• Lowest accuracy for the RP + Nearest Neighbor (NN) classifier
Results on Malheur Dataset
[Figure: classification accuracy (%) vs. projected dimension D (up to 512) for RP+NN, GIST+NN, GIST+SRC, and RP+SRC]
Results on Malheur Dataset
• Again, the best classification accuracy of 98.66% is for the combination of Random Projections (RP) + Sparse Representation based Classification (SRC) at D = 512
• Accuracies of GIST features for both classifiers are almost the same, at around 93%
• However, the combination of RP + Nearest Neighbor (NN) classifier also achieved a high accuracy of 96.06%, as the projections of the variants are closely packed