Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
“For better or for worse, benchmarks shape a field” (David Patterson)
Inputs to file-system benchmarking
• Input: Benchmark workload (application level)
– Postmark, FileBench, Fstress, Bonnie, IOZone, TPCC, etc.
• Input: In-memory state
– Cold cache / warm cache
• Input: File-system image (FS logical organization, disk layout)
– Anything goes!
• Stack shown: Application, File System, Storage device
FS images in the past: use what is convenient
• Typical desktop file system w/ no description (SOSP 05)
• 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)
• Randomly generated files of several MB (FAST 08)
• 1000 files in 10 dirs w/ random data (SOSP 03)
• 188GB and 129GB volumes in Engg dept (OSDI 99)
• 10702 files from /usr/local, size 354MB (SOSP 01)
• 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)
Performance of find operation
[Figure: relative time taken by find under different file-system logical organizations and different disk layout & cache states]
Problem scope
• Characteristics of file-system images have a strong impact on performance
• We need to incorporate representative file-system images in benchmarking & design
• How to create representative file-system images?
Requirements for creating FS images
• Access to data on file systems and disk layout
– Properties of file-system metadata [Satyanarayanan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?
• A technique to create file-system images that is
– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus
Introducing Impressions
• Powerful statistical framework to generate file-system images
– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals
• Applying Impressions gives useful insights
– What is the impact on performance and storage size?
– How does an application behave on a real FS image?
Outline
• Introduction
• Generating realistic file-system images
• Applying Impressions: Desktop search
• Conclusion
Overview of Impressions
[Figure: overview diagram of Impressions]
Properties of file-system metadata
• “Five-year study of file-system metadata” [FAST07] (Agrawal, Bolosky, Douceur, Lorch)
• Used as exemplar for metadata properties in Impressions
Features of Impressions
• Modes of operation for different usages
– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs
• Thorough statistical machinery ensures accuracy
– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit
• Generates namespace, metadata, file content, and disk fragmentation using the above techniques
Creating valid metadata
• Creating file-system namespace
– Uses generative model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distribution of directory size and of directory depth
Creating namespace
[Figure: two panels comparing the dataset (D) with generated output (G): "Directories by Namespace Depth" (fraction of directories vs. namespace depth, bin size 1) and "Directories by Subdirectory Count" (cumulative % of directories vs. count of subdirectories)]
• Directory tree: Monte Carlo run
• Probability of parent selection ∝ Count(subdirs) + 2
• Incorporates dirs by depth and dirs by subdir count
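The Monte Carlo step above can be sketched as follows. This is a minimal illustration, not the Impressions implementation: the function names `generate_directory_tree` and `depth_of`, the list-based tree representation, and the seed handling are all assumptions; only the parent-selection weight Count(subdirs) + 2 comes from the slide.

```python
import random


def generate_directory_tree(num_dirs, seed=0):
    """Return parent[i] for every directory i; directory 0 is the root.

    Hypothetical sketch: each new directory picks its parent with
    probability proportional to Count(subdirs) + 2.
    """
    rng = random.Random(seed)
    parent = [None]      # the root has no parent
    subdir_count = [0]   # number of subdirectories of each directory
    for _ in range(1, num_dirs):
        weights = [count + 2 for count in subdir_count]
        chosen = rng.choices(range(len(parent)), weights=weights, k=1)[0]
        parent.append(chosen)
        subdir_count[chosen] += 1
        subdir_count.append(0)
    return parent


def depth_of(parent, directory):
    """Namespace depth of a directory (the root has depth 0)."""
    depth = 0
    while parent[directory] is not None:
        directory = parent[directory]
        depth += 1
    return depth
```

Histogramming `depth_of` over all directories, and the per-directory subdirectory counts, gives the two distributions that the slide compares against the dataset.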
Creating valid metadata
• Creating file-system namespace
• Creating files: stepwise process
– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations
Example: creating realistic file sizes
[Figure: "File Sizes": contribution to used space vs. containing file size (bytes, log scale), with lognormal and hybrid fits]
• Pure lognormal distribution no longer a good fit
• Hybrid model: lognormal body, Pareto tail
– Fits observed data more accurately; used to recreate file sizes in Impressions
Creating files
[Figure: "Files by Containing Bytes": fraction of bytes vs. file size (bytes, log scale), dataset (D) vs. generated (G); diagram shows files assigned sizes S1 to S9]
• File size model: lognormal body, Pareto tail
• Captures bimodal curve
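A hybrid size model of this shape can be sketched as a two-part mixture. Every parameter value below (`mu`, `sigma`, `tail_prob`, `x_min`, `alpha`) is an illustrative placeholder rather than a fitted value from the study, and the function name `sample_file_size` is hypothetical.

```python
import random


def sample_file_size(rng, mu=9.0, sigma=2.5, tail_prob=0.01,
                     x_min=512 * 2**20, alpha=1.0):
    """Draw one file size in bytes from a lognormal body with a Pareto tail.

    All default parameters are placeholders chosen for illustration.
    """
    if rng.random() < tail_prob:
        # Pareto tail: paretovariate() returns values >= 1, so the
        # result is always at least x_min.
        return int(x_min * rng.paretovariate(alpha))
    # Lognormal body, truncated below x_min by rejection.
    while True:
        size = rng.lognormvariate(mu, sigma)
        if size < x_min:
            return int(size)


rng = random.Random(1)
sizes = [sample_file_size(rng) for _ in range(10000)]
```

Most files land in the lognormal body, while the few tail draws hold a large share of the bytes, which is how a files-by-containing-bytes curve can end up bimodal.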
Creating files
[Figure: "Top Extensions by Count": fraction of files per extension (cpp, dll, exe, gif, h, htm, jpg, null, txt, others), desired vs. generated; diagram shows files with sizes S1 to S9 assigned extensions E1 to E9]
• File extensions: percentile values
• Top 20 extensions account for 50% of files and bytes
Creating files
[Figure: two panels, dataset (D) vs. generated (G): "Files by Namespace Depth" (fraction of files vs. namespace depth, bin size 1) and "Bytes by Namespace Depth" (mean bytes per file, log scale, vs. namespace depth, bin size 1); diagram shows files with sizes S1 to S9 and extensions E1 to E9 assigned depths D1 to D9]
• File depth: Poisson
• Multiplicative model along with bytes by depth
Creating files
[Figure: "Files by Namespace Depth (with Special Directories)": fraction of files vs. namespace depth (bin size 1), dataset (D) vs. generated (G)]
• Parent dir: inverse polynomial
• Satisfies distribution of dirs with file count
Resolving arbitrary constraints
[Figure: two panels of fraction of files vs. file size (bytes, log scale): original distribution (O) and constrained distribution (C), contrived for sum]
• Constraint: given count of files & size distribution, ensure sum of file sizes matches a desired total file system size
• Accurate both for the sum and the distribution
Resolving arbitrary constraints
• Constraints can be specified arbitrarily on file-system parameters
• Variant of the NP-complete “Subset Sum Problem”
– Approximation-algorithm-based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the desired sum by identifying the best fit in the current sample
• While constraints are satisfied, the constrained distribution also retains the original characteristics
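The oversampling and local-improvement idea can be illustrated with a simple swap-based heuristic. This is a sketch under stated assumptions, not the algorithm from the paper: the function name `resolve_sum_constraint`, the parameters `tolerance` and `extra_fraction`, and the "best single swap per round" rule are all hypothetical choices.

```python
import random


def resolve_sum_constraint(draw, count, target_sum, tolerance=0.05,
                           extra_fraction=1.0, max_rounds=1000, seed=0):
    """Pick `count` sizes whose sum is within `tolerance` of `target_sum`.

    `draw(rng)` returns one sample from the desired size distribution.
    Hypothetical heuristic: oversample a spare pool, then repeatedly
    apply the single sample/pool swap that best reduces the error.
    """
    rng = random.Random(seed)
    sample = [draw(rng) for _ in range(count)]
    pool = [draw(rng) for _ in range(int(count * extra_fraction))]
    for _ in range(max_rounds):
        error = target_sum - sum(sample)
        if abs(error) <= tolerance * target_sum:
            break
        best = None
        for i, s in enumerate(sample):
            for j, p in enumerate(pool):
                # Swapping s for p changes the sum by (p - s).
                new_error = abs(error - (p - s))
                if best is None or new_error < best[0]:
                    best = (new_error, i, j)
        if best is None or best[0] >= abs(error):
            break  # no swap improves the sum any further
        _, i, j = best
        sample[i], pool[j] = pool[j], sample[i]
    return sample
```

Because every value in the result was drawn from the same distribution, the constrained sample tends to keep the shape of the original, which matches the last bullet above.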
Interpolation and extrapolation
• Why don’t we just use available data values?
– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data
• Technique: piecewise interpolation
Interpolation technique & accuracy
[Figure: "Piecewise Interpolation": fraction of bytes vs. file size for 10 GB, 50 GB, and 100 GB file systems, with segment 19 highlighted; inset "Interpolation: Seg 19" plots segment value vs. file system size (GB)]
• Each distribution broken down into segments
• Data points within a segment used for curve fit
• Combine segment interpolations for new curve
[Figure: "Interpolation (75 GB)" and "Extrapolation (125 GB)": fraction of bytes vs. file size, real (R) vs. interpolated (I) and extrapolated (E) curves]
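The three steps above can be sketched as a per-segment fit. The slides do not say which curve is fitted within a segment, so the least-squares straight line used here is an assumption, as are the final renormalization, the function names, and the example numbers.

```python
def fit_line(xs, ys):
    """Least-squares straight line through (xs, ys): returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def piecewise_interpolate(known, target_size):
    """Estimate a distribution for `target_size` from known distributions.

    `known` maps a file-system size (GB) to a list of segment values,
    one per segment, with the same segments for every size.
    """
    sizes = sorted(known)
    num_segments = len(known[sizes[0]])
    curve = []
    for seg in range(num_segments):
        values = [known[size][seg] for size in sizes]
        slope, intercept = fit_line(sizes, values)
        curve.append(max(0.0, slope * target_size + intercept))
    total = sum(curve)
    # Renormalize so the combined segments form a distribution again.
    return [v / total for v in curve] if total > 0 else curve


# Made-up three-segment distributions for 10, 50, and 100 GB file systems.
known = {10: [0.5, 0.3, 0.2], 50: [0.4, 0.3, 0.3], 100: [0.275, 0.3, 0.425]}
interpolated = piecewise_interpolate(known, 75)    # inside the known range
extrapolated = piecewise_interpolate(known, 125)   # beyond the known range
```

The same function serves both cases: a target size inside the known range interpolates, and one outside it extrapolates.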
File content
• Files having natural language content
– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)
• Content for other files (mp3, gif, mpeg, etc.)
– Impressions generates valid header/footer
– Uses third-party libraries and software
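The two natural-language models can be combined roughly as follows. The vocabulary `TOP_WORDS`, the 1/rank popularity weights, the `LENGTH_WEIGHTS` table, and the `tail_prob` split are all placeholders invented for this sketch; the slides name the models but give no parameters.

```python
import random
import string

# Placeholder vocabulary and word-length frequencies, for illustration only.
TOP_WORDS = ["the", "of", "and", "to", "a", "in", "is", "that", "for", "it"]
LENGTH_WEIGHTS = {2: 1, 3: 3, 4: 4, 5: 4, 6: 3, 7: 2, 8: 1}


def generate_text(num_words, tail_prob=0.2, seed=0):
    """Generate text from a heavy-tailed popularity model plus a length model."""
    rng = random.Random(seed)
    # Heavy-tailed popularity: weight 1/rank for the known popular words.
    popularity = [1.0 / rank for rank in range(1, len(TOP_WORDS) + 1)]
    lengths = list(LENGTH_WEIGHTS)
    length_weights = list(LENGTH_WEIGHTS.values())
    words = []
    for _ in range(num_words):
        if rng.random() < tail_prob:
            # Long tail: synthesize a rare word of a plausible length.
            n = rng.choices(lengths, weights=length_weights, k=1)[0]
            words.append("".join(rng.choice(string.ascii_lowercase)
                                 for _ in range(n)))
        else:
            words.append(rng.choices(TOP_WORDS, weights=popularity, k=1)[0])
    return " ".join(words)


text = generate_text(100)
```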
Disk layout and fragmentation
• Simplistic technique
– Layout Score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations till desired layout score is achieved
[Figure: File 1 with 1 non-contiguous block (out of 8), File Layout Score = 7/8; File 2 with all blocks contiguous, File Layout Score = 1 (6/6)]
• In future more nuanced ways are possible
– Out-of-order file writes, writes with long delays
– Access to file-system specific interfaces
• FIBMAP in Linux, XFS_IOC_GETBMAP for XFS
– Perhaps a tool complementary to Impressions
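A per-file layout score can be computed from the list of a file's block addresses. This sketch assumes the convention that the first block is not counted, so a file with 9 blocks and one break scores 7/8; the function name `layout_score` and the handling of files with fewer than two blocks are also assumptions.

```python
def layout_score(block_addresses):
    """Fraction of a file's blocks that directly follow the previous block.

    Assumed convention: the first block is excluded from the count, and a
    file with fewer than two blocks is treated as perfectly laid out.
    """
    if len(block_addresses) < 2:
        return 1.0
    contiguous = sum(
        1
        for prev, cur in zip(block_addresses, block_addresses[1:])
        if cur == prev + 1
    )
    return contiguous / (len(block_addresses) - 1)


fragmented = layout_score([0, 1, 2, 3, 4, 10, 11, 12, 13])  # one break
contiguous = layout_score([20, 21, 22, 23, 24, 25, 26])     # no breaks
```

A create/delete loop like the one described above would recompute this score after each pair of operations and stop once the desired value is reached.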
Outline
• Introduction
• Generating realistic file-system images
• Applying Impressions: Desktop search
• Conclusion
Applying Impressions
• Case study: desktop search
– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest: size of index, time to build initial search index
– Identifying application bugs and policies
• GDL doesn’t index content beyond 10-deep
• Computing realistic rules of thumb
– Overhead of metadata replication?