HELSINKI UNIVERSITY OF TECHNOLOGY
Mika Ilvesmäki, D.Sc. (Tech.)

Measurement analysis basics - I
Lecture slides for S-38.3183
16.3.2006
Mika Ilvesmäki
Networking laboratory

Contents
• Basic concepts (events, traces)
• Data preprocessing, sampling
• Basic statistics
  – ranges, averages, variations etc.
• Distributions
  – concepts, characteristics, parameterization
• Histograms

Mandatory reading
• Please download “Chapter 2” of the ‘hopefully someday to be published’ book from the course webpages
  – Chapter 2 contains a lot of information on statistics.
    • However, it is a draft and, therefore, full of typos, inconsistencies etc. Beware!
  – If you find errors etc., please let the course personnel know of them! Thank you!
  – The material in Chapter 2 has to be mastered in the exercises (and in the final exam).
  – http://www.netlab.tkk.fi/opetus/s383183/k06/draft/chapter2.pdf

Goals of this lecture
• After this lecture you should know
  – Basic concepts related to traffic measurements
    • Trace, sampling, mask, aggregation
  – What can be measured in a network
  – What is done in data preprocessing
  – Basic statistics and their meaning
  – How distributions/histograms are formed from measurements and how they can be interpreted/characterized

What is there to measure?
• The event itself
  – Count of packets
• The size or some other quantitative property of the event itself
  – Packet size, flow duration
• Inter-event relation
  – Frequency of events, the time between two events

Measurement file: TRACE
• A file that has a set of measured properties from the network is called a trace
• Trace has the following (relevant) property
  – Length, indicating the number of events (packets, flows or sessions etc.)
• Event entry consists of relevant event data
  – Packet nr, flow id., timestamp, addresses, ports, duration (flow), volume (flow) etc.
• If some event data is not available, you might be able to create it (e.g., packet timestamps and 5-tuple info result in 5-tuple flows)
  – Though this might not be very straightforward

Data preprocessing I
• Data normalization
  – Normalization is done to achieve comparability of data across two or more sets of measurements
  – Normalization is also a way to reduce variation in measurements (normalizing to a range)
  – Examples:
    • Min-Max method
    • Z-score method
    • Decimal scaling

Data preprocessing II
• Data cleaning
  – Caveat! Are you cleaning away the noise or a previously undetected phenomenon?
• Data integration
  – Different sources, same concept but values expressed differently
  – Careful measurement planning, coherent use of measurement equipment
• Data reduction
  – Methods to reduce the dataset to smaller representations of the original data.
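
The three normalization methods named under Data preprocessing I can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function names, the default target range [0, 1], and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-Max method: map the observed range linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no range to rescale.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Z-score method: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of ten that makes every |value| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical packet sizes in bytes, used only as example input.
packet_sizes = [40, 576, 1500]
print(min_max(packet_sizes))
print(z_score(packet_sizes))
print(decimal_scaling(packet_sizes))
```

Min-Max and decimal scaling both depend on the extreme values in the set, so a single outlier changes every normalized value. The Z-score depends on the mean and the standard deviation instead.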

Sampling
• With sampling one tries to form a picture of the whole by looking at a small(er) part
• Remember: Trace is a sample of the network
  – Rather than sampling the trace file, sample the network (obtain more traces)

When to sample
• If you have too much data
  – …to fit into memory/spreadsheet/given amount of processor cycles etc.
• Sample of sampling methods:
  – Sample packets with a fixed probability p and trace the headers of sampled packets. This is the approach used by Cisco Netflow.
  – Independent Sampling: Sample every packet independently with probability p. Difficult to implement. Easy to analyze.
  – Periodic Sampling: Sample every (1/p)th packet with probability 1. Easy to implement. Difficult to analyze.

Masking & Aggregating
• Regrouping packets by selected (parts of) header information -> obtaining sets of packets with common header value(s)
  – This new set is a network event that can be measured (size, nr of contained elements, etc.)
  – 5-tuple flow is one of the most common ones
• Iterative masking may be used to aggregate traffic further and provide reference points
  – Group by 5-tuple -> set of flows
  – Group flows by TCP/UDP Sport value -> set of flows originating from different Sports
  – OR group initially by TCP/UDP Sport value
    • What statistics remain the same?

Basic statistics – ranges and quantiles
• Statistical range indicates the range in which data lies
• Quantiles perform the division of data
  – Quartile -> four groups
  – Percentile -> 100 groups
  – Interquantile ranges (for instance, the range between 2nd and 3rd quartile).
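
The two packet sampling schemes listed under When to sample can be sketched as follows. This is only an illustration under stated assumptions: packets are represented as items of a Python list, the function names and the `offset` parameter are hypothetical, and periodic sampling rounds 1/p to the nearest integer step.

```python
import random


def independent_sample(packets, p, rng=None):
    """Independent sampling: keep each packet on its own with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    rng = rng or random.Random()
    return [pkt for pkt in packets if rng.random() < p]


def periodic_sample(packets, p, offset=0):
    """Periodic sampling: keep every (1/p)th packet, starting at index `offset`."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the range (0, 1]")
    step = round(1 / p)  # assumption: 1/p is (close to) an integer
    return packets[offset::step]


# Hypothetical trace: packet numbers 0..999 stand in for packet headers.
trace = list(range(1000))
print(len(independent_sample(trace, 0.1, random.Random(42))))
print(len(periodic_sample(trace, 0.1)))
```

Periodic sampling always returns the same number of packets for a given trace length, while the size of an independent sample varies from run to run around p times the trace length.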

Basic statistics – indications of average
• Arithmetic mean is one of the most common statistics
  – Uses all data available
    • Is affected by extreme values
  – instant, exponential
• Median
  – Is the middlemost value in an ordered set
  – Not affected by extreme values

Basic statistics – indications of variation
• Variance is a measure of absolute variation
  – Depends on values and scale of measurements
  – Standard deviation is the square root of variance
• Coefficient of Variation (CofV) is used to compare variation between several sets of data
• Mean deviation
  – Descriptive statistic, mean deviation from the mean
    • Uses absolute values, analytical calculations are harder to perform
    • Good, ”intuitive” measure of variation

Higher-moment statistics - Skewness
• Used to describe histograms
• Skewness
  – 3rd moment statistic
  – Measure of asymmetry of a frequency distribution
  – Towards the tail, larger skewness, longer tail
    • Large positive, right sided tail
    • Large negative, left sided tail

Higher-moment statistics - Kurtosis
• Kurtosis
  – 4th moment statistic
  – Measure of combined weight of tails in relation to the rest of the distribution
  – Heavy tails, larger kurtosis value
    • Peaked distribution, (excess) Kurtosis >0, Leptokurtic
    • Flat distribution, (excess) Kurtosis <0, Platykurtic
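
The statistics from the four slides above can be computed from a list of measurements roughly as follows. This is a sketch under assumptions: it uses population (not sample) moments, defines CofV as standard deviation divided by mean, and reports excess kurtosis (kurtosis minus 3, so that a normal distribution gives 0). The function name `describe` and the dictionary keys are hypothetical.

```python
from statistics import mean, median, pvariance


def describe(values):
    """Return basic and higher-moment statistics for a list of numbers.

    Assumes the mean is non-zero and the values are not all equal,
    otherwise CofV, skewness and kurtosis are undefined.
    """
    n = len(values)
    mu = mean(values)
    var = pvariance(values, mu)  # population variance
    sd = var ** 0.5
    m3 = sum((v - mu) ** 3 for v in values) / n  # 3rd central moment
    m4 = sum((v - mu) ** 4 for v in values) / n  # 4th central moment
    return {
        "mean": mu,
        "median": median(values),
        "variance": var,
        "std_dev": sd,
        "cofv": sd / mu,
        "mean_deviation": sum(abs(v - mu) for v in values) / n,
        "skewness": m3 / sd ** 3,
        "excess_kurtosis": m4 / sd ** 4 - 3.0,
    }


# Hypothetical flow durations in seconds; the last value is an extreme one.
durations = [1, 2, 3, 4, 100]
for name, value in describe(durations).items():
    print(name, value)
```

With the example input the single extreme value pulls the mean (22) far above the median (3) and gives a positive skewness, which matches the right sided tail described on the Skewness slide.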

Moving averages
• MAs are lagging indicators of trends in the dataset
  – If there are no trends, MAs are pretty useless
  – Otherwise MAs smooth the behavior and make it easy to follow trends
• Several types of MAs to choose from
  – Simple moving average (SMA)
  – Exponential moving average (EMA)
  – Smoothed moving average
  – Linear Weighted moving average

Arithmetic moving average
• Average value over a set number of observations
  – Determine
    • Window size (how many samples)
  – Longer windows produce more reliable results of trends but are not that sensitive to sudden changes
    • Move the start point as you get new samples
    • Better to identify long-term trend changes

Exponential moving average
• Reduces the lag of AMA by applying more weight to recent values
• The weight is determined by the window size
  – The shorter the window, the more weight on recent values
• Very sensitive to quick changes

Distributions - concepts
• A distribution gives the frequency (probability) of possible events
  – Sample space: individual IATs (inter-arrival times)
  – Events: intervals of IAT (0.01s-0.02s)
• In probability the distributions are completely described by distribution type and parameters
  – Infinite number of independent random events -> normal distribution
  – Rare events -> Poisson distribution
  – Reference point to statistical distributions
  – Verification of assumptions
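
The arithmetic (simple) and exponential moving averages from the slides above can be sketched as follows. The slides only state that the EMA weight is determined by the window size; the smoothing factor alpha = 2 / (window + 1) used here is one common convention and is an assumption, as are the function names and the choice to seed the EMA with the first observation.

```python
def simple_moving_average(values, window):
    """Arithmetic moving average: mean of the last `window` observations."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Move the start point: drop the observation that left the window.
            running -= values[i - window]
        if i >= window - 1:
            out.append(running / window)
    return out


def exponential_moving_average(values, window):
    """Exponential moving average with alpha = 2 / (window + 1) (assumed convention)."""
    alpha = 2.0 / (window + 1)
    out = [float(values[0])]  # seed with the first observation (assumption)
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


# Hypothetical per-second packet counts with a sudden level change.
counts = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]
print(simple_moving_average(counts, 3))
print(exponential_moving_average(counts, 3))
print(exponential_moving_average(counts, 9))
```

On the step in the example input, the EMA with window 3 ends closer to the new level than the EMA with window 9, which illustrates the statement that a shorter window puts more weight on recent values.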