Generating Realistic Impressions for File-System Benchmarking (PowerPoint PPT Presentation)



  1. Generating Realistic Impressions for File-System Benchmarking Nitin Agrawal Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

  2. “For better or for worse, benchmarks shape a field” (David Patterson) 2

  3. Inputs to file-system benchmarking [Diagram: application → file system (FS logical organization) → disk layout → storage device] • Input: benchmark workload (Postmark, FileBench, Fstress, Bonnie, IOZone, TPC-C, etc.) • Input: in-memory state (cold cache / warm cache) • Input: file-system image ("anything goes!") 3

  4. FS images in the past: use what is convenient • Typical desktop file system w/ no description (SOSP 05) • 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04) • Randomly generated files of several MB (FAST 08) • 1000 files in 10 dirs w/ random data (SOSP 03) • 188GB and 129GB volumes in an Engineering dept (OSDI 99) • 10702 files from /usr/local, size 354MB (SOSP 01) • 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)

  5. Performance of the find operation [Figure: relative time taken under varying disk layout, file-system logical organization, and cache state] 5

  6. Problem scope • Characteristics of file-system images have a strong impact on performance • We need to incorporate representative file-system images in benchmarking & design • How do we create representative file-system images? 6

  7. Requirements for creating FS images • Access to data on file systems and disk layout – Properties of file-system metadata [Satyanarayan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07] – Disk fragmentation [Smith97] – More such studies in future? • A technique to create file-system images that is – Representative: given a set of input distributions – Controllable: supply additional user constraints – Reproducible: control & report internal parameters – Easy to use: for widespread adoption and consensus 7

  8. Introducing Impressions • Powerful statistical framework to generate file-system images – Takes properties of file-system attributes as input – Works out underlying statistical details of the image – Mounted on a disk partition for real benchmarking – Satisfies the four design goals • Applying Impressions gives useful insights – What is the impact on performance and storage size? – How does an application behave on a real FS image? 8

  9. Outline • Introduction • Generating realistic file-system images • Applying Impressions: Desktop search • Conclusion 9

  10. Overview of Impressions [Figure: the Impressions framework] 10

  11. Properties of file-system metadata “Five-year study of file-system metadata” [FAST07] (Agrawal, Bolosky, Douceur, Lorch) Used as exemplar for metadata properties in Impressions 11

  12. Features of Impressions • Modes of operation for different usages – Basic mode: choose default settings for parameters – Advanced mode: several individually tunable knobs • Thorough statistical machinery ensures accuracy – Uses parameterized curve fits – Allows arbitrary user constraints – Built-in statistical tests for goodness-of-fit • Generates namespace, metadata, file content, and disk fragmentation using above techniques 12

  13. Creating valid metadata • Creating file-system namespace – Uses Generative Model proposed earlier [FAST 07] – Explains the process of directory tree creation – Accurately regenerates distribution of directory size and of directory depth 13

  14. Creating namespace [Figures: directories by namespace depth and directories by subdirectory count; dataset (D) vs. generated (G) distributions match closely] Directory tree built via a Monte Carlo run, incorporating dirs by depth and dirs by subdir count; probability of parent selection ∝ Count(subdirs) + 2 14
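The parent-selection rule on this slide can be sketched as a short Monte Carlo loop in Python. This is an illustrative sketch (function name and defaults are mine); only the rule itself, probability of parent selection proportional to count(subdirs) + 2, is taken from the slide.

```python
import random

def generate_tree(num_dirs, seed=None):
    """Grow a directory tree one directory at a time (illustrative sketch).

    Each new directory chooses its parent with probability proportional
    to count(subdirs of parent) + 2, the generative rule from the slide.
    """
    rng = random.Random(seed)
    children = {0: []}                      # dir id -> list of child ids; 0 is the root
    for new_dir in range(1, num_dirs):
        dirs = list(children)
        weights = [len(children[d]) + 2 for d in dirs]
        parent = rng.choices(dirs, weights=weights, k=1)[0]
        children[parent].append(new_dir)
        children[new_dir] = []
    return children

tree = generate_tree(1000, seed=42)
```

Because every directory starts with weight 2, fresh leaves still attract children, while directories that already have many subdirectories are preferentially chosen, reproducing the skew in directories by subdirectory count.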

  15. Creating valid metadata • Creating file-system namespace • Creating files: stepwise process – File size, file extension, file depth, parent directory – Uses statistical models & analytical approximations 15

  16. Example: creating realistic file sizes [Figure: contribution of file sizes to used space vs. containing file size (bytes, log scale); lognormal vs. hybrid fits] • A pure lognormal distribution is no longer a good fit • Hybrid model: lognormal body, Pareto tail – Fits observed data more accurately; used to recreate file sizes in Impressions 16
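A hybrid sampler of this shape can be sketched as follows; all parameter values (mu, sigma, tail cutoff, tail index) are placeholders, not the fits reported in the paper.

```python
import random

def sample_file_size(rng, mu=8.5, sigma=2.5,
                     tail_cut=32 * 1024 ** 2, alpha=1.1):
    """Draw one file size: lognormal body, Pareto tail (illustrative
    parameters). Sizes from the lognormal body that land beyond
    tail_cut are re-drawn from a Pareto distribution anchored at
    tail_cut, giving the heavy tail that dominates bytes."""
    size = rng.lognormvariate(mu, sigma)
    if size > tail_cut:
        # Inverse-CDF sample of Pareto(alpha), scaled by tail_cut.
        size = tail_cut / (1.0 - rng.random()) ** (1.0 / alpha)
    return int(size)

rng = random.Random(1)
sizes = [sample_file_size(rng) for _ in range(10000)]
```

The bulk of files stay small (lognormal body), while the rare Pareto-tail draws account for most of the used space, which is what the "contribution to used space" curve measures.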

  17. Creating files [Figure: files by containing bytes, desired (D) vs. generated (G); file size in bytes, log scale] File-size model: lognormal body with a Pareto tail; captures the bimodal curve 17

  18. Creating files [Figure: top extensions by count (cpp, dll, exe, gif, h, htm, jpg, null, txt, others); desired vs. generated] File extensions generated from percentile values; the top 20 extensions account for 50% of files and bytes 18

  19. Creating files [Figures: files by namespace depth and bytes by namespace depth (mean bytes per file, log scale); desired (D) vs. generated (G)] File depth modeled as Poisson; a multiplicative model, along with bytes by depth, captures mean bytes per file 19
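The Poisson depth model can be sketched with Knuth's multiplication algorithm, since Python's random module has no Poisson sampler. The lambda value here is illustrative, not the fitted parameter.

```python
import math
import random

def sample_depth(rng, lam=4.0):
    """Sample a file's namespace depth from Poisson(lam) using
    Knuth's multiplication algorithm. lam=4.0 is illustrative."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(7)
depths = [sample_depth(rng) for _ in range(20000)]
```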

  20. Creating files (with special directories) [Figure: files by namespace depth with special directories; desired (D) vs. generated (G)] Parent directory selected via an inverse-polynomial model; satisfies the distribution of dirs by file count 20

  21. Resolving arbitrary constraints 21

  22. Resolving arbitrary constraints [Figures: original (O) vs. constrained (C) file-size distributions, fraction of files vs. file size (bytes, log scale); the constrained distribution is contrived for the sum] Constraint: given a count of files and a size distribution, ensure the sum of file sizes matches a desired total file-system size. The result is accurate both for the sum and for the distribution 22

  23. Resolving arbitrary constraints • Arbitrarily specified on file system parameters • Variant of NP-complete “Subset Sum Problem” – Approximation algorithm based solution (in paper) – Oversampling to get additional sample values – Local improvement to iteratively converge to the desired sum by identifying best-fit in current sample • While constraints are satisfied, constrained distribution also retains original characteristics 23
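The oversampling + local-improvement idea can be sketched as: draw extra candidate values, pick an initial subset, then greedily swap members for unchosen pool values whenever the swap moves the sum toward the target. This is a toy version of the idea, not the paper's approximation algorithm.

```python
import random

def fit_sum(pool, k, target, rounds=5000, seed=0):
    """Pick k values from an oversampled pool so their sum converges
    toward target (sketch of oversampling + local improvement).

    Start from a random subset, then repeatedly propose swapping one
    chosen value for an unchosen pool value, keeping the swap only
    when it shrinks |sum - target|.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(pool)), k))   # indices into pool
    total = sum(pool[i] for i in chosen)
    for _ in range(rounds):
        out_i = rng.choice(sorted(chosen))
        in_i = rng.randrange(len(pool))
        if in_i in chosen:
            continue
        new_total = total - pool[out_i] + pool[in_i]
        if abs(new_total - target) < abs(total - target):
            chosen.remove(out_i)
            chosen.add(in_i)
            total = new_total
    return [pool[i] for i in chosen]

pool = list(range(1, 101)) * 3        # illustrative oversampled candidate sizes
picked = fit_sum(pool, 10, 500)
```

Because only improving swaps are kept, the gap to the target never grows; drawing the candidates from the original size distribution is what lets the constrained result retain that distribution's shape.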

  24. Interpolation and extrapolation • Why don’t we just use available data values? – Limited to empirical data in input dataset – “What-if” analysis beyond available dataset – Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data • Technique: Piecewise interpolation 24

  25. Interpolation technique & accuracy [Figures: piecewise interpolation of a segment's value across file-system sizes (10, 50, 100 GB); extrapolation to 125 GB and interpolation to 75 GB closely track the real fraction-of-bytes distributions] • Each distribution is broken down into segments • Data points within a segment are used for a curve fit • Segment interpolations are combined into the new curve 25
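Per-segment interpolation can be sketched as a blend between two known distributions followed by renormalization. The paper fits a curve per segment, so the straight-line blend below is a simplifying assumption, and the example distributions are made up.

```python
def interpolate_piecewise(dist_a, dist_b, x_a, x_b, x):
    """Interpolate each segment of a distribution between two
    file-system sizes, then renormalize so the fractions sum to 1.

    dist_a and dist_b are per-segment fractions observed at sizes
    x_a and x_b; linear blending per segment is a simplification of
    the paper's per-segment curve fits.
    """
    t = (x - x_a) / (x_b - x_a)
    raw = [a + t * (b - a) for a, b in zip(dist_a, dist_b)]
    total = sum(raw)
    return [v / total for v in raw]

# Illustrative two-segment distributions at 50 GB and 100 GB,
# interpolated to 75 GB.
dist_75 = interpolate_piecewise([0.2, 0.8], [0.4, 0.6], 50, 100, 75)
```

With t > 1 the same formula extrapolates beyond the largest observed file-system size, which is the 125 GB case on the slide.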

  26. File content • Files having natural language content – Word-popularity model (heavy tailed) – Word-length frequency model (for the long tail) • Content for other files (mp3, gif, mpeg etc) – Impressions generates valid header/footer – Uses third-party libraries and software 26
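A heavy-tailed word-popularity model can be sketched with Zipf-like weights, where the k-th most popular word is drawn with weight 1/k^s. The vocabulary and exponent below are illustrative, and the slide's separate word-length frequency model is omitted.

```python
import random

def generate_text(vocab, n_words, s=1.0, seed=0):
    """Generate natural-language-like content using a heavy-tailed
    (Zipf-like) word-popularity model: word at rank k is drawn with
    weight 1 / k**s. Sketch only; omits the word-length model."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(len(vocab))]
    return " ".join(rng.choices(vocab, weights=weights, k=n_words))

vocab = ["the", "of", "file", "system", "benchmark"]
text = generate_text(vocab, 50, seed=3)
```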

  27. Disk layout and fragmentation 27

  28. Disk layout and fragmentation • Simplistic technique – Layout score measures the degree of fragmentation [Smith97] – Pairs of file create/delete operations until the desired layout score is achieved [Figure: File 1 has 1 non-contiguous block out of 8, layout score = 7/8; File 2 has all 6 blocks contiguous, layout score = 1 (6/6)] • In the future, more nuanced ways are possible – Out-of-order file writes, writes with long delays – Access to file-system-specific interfaces • FIBMAP in Linux, XFS_IOC_GETBMAP for XFS – Perhaps a tool complementary to Impressions 28
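The layout score from the slide can be computed directly from a file's on-disk block addresses (for example, as obtained via the FIBMAP ioctl on Linux); a minimal sketch:

```python
def layout_score(block_addrs):
    """Layout score [Smith97]: the fraction of a file's blocks that
    are contiguous on disk. The first block counts as contiguous by
    convention; block i counts if its disk address immediately
    follows block i-1's."""
    if not block_addrs:
        return 1.0
    contiguous = 1 + sum(
        1 for prev, cur in zip(block_addrs, block_addrs[1:])
        if cur == prev + 1
    )
    return contiguous / len(block_addrs)

# Matches the slide: one non-contiguous block out of 8 -> 7/8;
# all 6 blocks contiguous -> 1.0.
score_frag = layout_score([10, 11, 12, 13, 14, 15, 16, 20])
score_full = layout_score([1, 2, 3, 4, 5, 6])
```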

  29. Outline • Introduction • Generating realistic file-system images • Applying Impressions: Desktop search • Conclusion 29

  30. Applying Impressions • Case study: desktop search – Google Desktop for Linux (GDL) and Beagle – Metrics of interest: • Size of index, time to build initial search index – Identifying application bugs and policies • GDL doesn't index content beyond 10 directory levels deep – Computing realistic rules of thumb • Overhead of metadata replication? 30
