Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis
Yanpei Chen, Kiran Srinivasan, Garth Goodson, Randy Katz
UC Berkeley AMP Lab, NetApp Inc.
Motivation – Understand data access patterns
• Client: How do apps access data? How do users access data?
• Server: How are files accessed? How are directories accessed?
Better insights → better storage system design
Improvements over prior work
• Minimize expert bias
  – Make fewer assumptions about system behavior
• Multi-dimensional analysis
  – Correlate many dimensions to describe access patterns
• Multi-layered analysis
  – Consider different semantic scoping
Example of multi-dimensional insight
Files with >70% sequential read or sequential write have no repeated reads or overwrites.
• Covers 4 dimensions:
  1. Read sequentiality
  2. Write sequentiality
  3. Repeated reads
  4. Overwrites
• Why is this useful?
  – Measuring one dimension is easier
  – Captures the other dimensions for free
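A minimal sketch of how this insight collapses four dimensions into one measurement; the 0.7 cutoff is the >70% figure above, and the function name is illustrative:

```python
# If a file is >70% sequentially read or written, the traces showed no
# repeated reads and no overwrites, so one measurement covers four dimensions.
def expect_no_rereads_or_overwrites(seq_read_ratio, seq_write_ratio):
    return max(seq_read_ratio, seq_write_ratio) > 0.7
```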
Outline
1. Observe: traces
   – Define semantic access layers
   – Extract data points for each layer
2. Analyze: identify access patterns
   – Select dimensions, minimize bias
   – Perform statistical analysis (k-means)
3. Interpret: draw design implications
   – Interpret statistical analysis
   – Translate from behavior to design implications
CIFS traces
• Traced CIFS (Windows FS protocol)
• Collected at a NetApp datacenter over three months
• One corporate dataset, one engineering dataset
• Results relevant to other enterprise datacenters
Scale of traces
• Corporate production dataset
  – 2 months, 1000 employees in marketing, finance, etc.
  – 3 TB active storage, Windows applications
  – 509,076 user sessions, 138,723 application instances
  – 1,155,099 files, 117,640 directories
• Engineering production dataset
  – 3 months, 500 employees in various engineering roles
  – 19 TB active storage, Windows and Linux applications
  – 232,033 user sessions, 741,319 application instances
  – 1,809,571 files, 161,858 directories
Covers several semantic access layers
• Semantic layer
  – A natural scope for grouping data accesses
  – E.g. a client's behavior ≠ its aggregate impact on the server
• Client: user sessions, application instances
• Server: files, directories
• CIFS allows us to identify these layers
  – Extract client-side info from the traces (users, apps)
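A hedged sketch of what this scoping might look like, assuming the traces are parsed into a flat stream of records; the field names (session_id, app_id, path) are illustrative, not the actual trace schema:

```python
from collections import defaultdict
import os

def scope_by_layer(records):
    # Group the same stream of CIFS trace events four ways, one per layer.
    layers = {name: defaultdict(list) for name in
              ("session", "application", "file", "directory")}
    for r in records:
        layers["session"][r.session_id].append(r)    # client-side scoping
        layers["application"][r.app_id].append(r)    # client-side scoping
        layers["file"][r.path].append(r)             # server-side scoping
        layers["directory"][os.path.dirname(r.path)].append(r)
    return layers
```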
Outline
1. Observe: traces
   – Define semantic access layers
   – Extract data points for each layer
2. Analyze: identify access patterns
   – Select dimensions, minimize bias
   – Perform statistical analysis (k-means)
3. Interpret: draw design implications
   – Interpret statistical analysis
   – Translate from behavior to design implications
Multi-dimensional analysis
• Many dimensions describe an access pattern
  – E.g. IO size, read/write ratio, …
  – The vector across these dimensions is a data point
• Multiple dimensions help minimize bias
  – Bias arises from designer assumptions
  – Assumptions influence the choice of dimensions
  – Start with many dimensions, use statistics to reduce them
• Discover complex behavior
  – Manual analysis is limited to 2 or 3 dimensions
  – Statistical clustering correlates across many dimensions
K-means clustering algorithm
1. Pick random initial cluster means
2. Assign each multi-D data point to the nearest mean
3. Re-compute the means using the new clusters
4. Iterate until the means converge
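A minimal k-means sketch of the four steps above, assuming the data points are rows of a NumPy array; this is not the authors' exact implementation:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial cluster means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each multi-D point to its nearest mean (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each mean from the points now assigned to it.
        new_means = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                              else means[c] for c in range(k)])
        # Step 4: iterate until the means converge.
        if np.allclose(new_means, means):
            return labels, new_means
        means = new_means
    return labels, means
```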
Applying k-means
• For each semantic layer:
  – Pick a large number of relevant dimensions
  – Extract values for each dimension from the trace
  – Run the k-means clustering algorithm
  – Interpret the resulting clusters
  – Draw design implications
Example – application layer analysis
• Selected 16 dimensions:
  1. Total IO size by bytes
  2. Read:write ratio by bytes
  3. Total IO requests
  4. Read:write ratio by requests
  5. Total metadata requests
  6. Avg. time between IO requests
  7. Read sequentiality
  8. Write sequentiality
  9. Repeated read ratio
  10. Overwrite ratio
  11. Tree connects
  12. Unique trees accessed
  13. File opens
  14. Unique files opened
  15. Directories accessed
  16. File extensions accessed
• 16-D data points: 138,723 for corp., 741,319 for eng.
• K-means identified 5 significant clusters for each
• Many dimensions were correlated
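An illustrative sketch of turning one application instance into a data point; the record fields are hypothetical and only the first 4 of the 16 dimensions are shown. Since k-means is scale-sensitive, the commented usage also standardizes each dimension before clustering:

```python
import numpy as np

# "app" is a hypothetical per-application-instance record; the field names
# are illustrative, not the trace schema.
def feature_vector(app):
    total_bytes = app.read_bytes + app.write_bytes
    total_reqs = app.read_reqs + app.write_reqs
    return np.array([
        total_bytes,                            # 1. total IO size by bytes
        app.read_bytes / max(total_bytes, 1),   # 2. read:write ratio by bytes
        total_reqs,                             # 3. total IO requests
        app.read_reqs / max(total_reqs, 1),     # 4. read:write ratio by requests
    ], dtype=float)

# Usage, reusing kmeans from the earlier sketch:
# X = np.stack([feature_vector(a) for a in apps])
# X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each dimension
# labels, means = kmeans(X, k=5)
```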
Example – application clustering results
[Chart: per-dimension profiles of Clusters 1–5]
But what do these clusters mean? Need additional interpretation …
Outline
1. Observe: traces
   – Define semantic access layers
   – Extract data points for each layer
2. Analyze: identify access patterns
   – Select dimensions, minimize bias
   – Perform statistical analysis (k-means)
3. Interpret: draw design implications
   – Interpret statistical analysis
   – Translate from behavior to design implications
Label application types
• Cluster 1: viewing app-generated content
• Cluster 2: supporting metadata
• Cluster 3: app-generated file updates
• Cluster 4: viewing human-generated content
• Cluster 5: content update
Design insights based on applications
(Clusters 1–5 labeled as above)
Observation: Apps with any sequential reads/writes have high sequentiality.
Implication: Clients can prefetch based on sequentiality alone (sketched below).
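A hedged sketch of this implication, assuming the client keeps a hypothetical per-file counter of sequential vs. random bytes; the 0.7 cutoff echoes the >70% sequentiality figure earlier in the talk:

```python
def should_prefetch(stats):
    # Single-dimension decision: prefetch only when accesses to this file
    # have been mostly sequential so far. "stats" is an assumed counter.
    total = stats.seq_bytes + stats.random_bytes
    return total > 0 and stats.seq_bytes / total > 0.7
```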
Design insights based on applications
(Clusters 1–5 labeled as above)
Observation: Small IO; a few files opened multiple times.
Implication: Clients should always cache the first few KB of every file, in addition to other cache policies (sketched below).
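A hedged sketch of this implication; HEAD_BYTES and read_from_server are assumptions, not part of the paper:

```python
HEAD_BYTES = 4096  # assumed tunable: "the first few KB"

class HeadCache:
    """Pins the head of every opened file, alongside normal cache policies."""
    def __init__(self):
        self.heads = {}  # path -> first HEAD_BYTES of the file

    def on_open(self, path, read_from_server):
        # Small-IO apps re-open the same files repeatedly, so cache the head
        # on the first open. read_from_server(path, offset, length) is assumed.
        if path not in self.heads:
            self.heads[path] = read_from_server(path, 0, HEAD_BYTES)

    def read(self, path, offset, length, read_from_server):
        head = self.heads.get(path)
        if head is not None and offset + length <= len(head):
            return head[offset:offset + length]      # hit: served locally
        return read_from_server(path, offset, length)  # miss: go to server
```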
Apply the identical method to engineering apps
• Cluster 1: compilation app
• Cluster 2: supporting metadata
• Cluster 3: content update – small
• Cluster 4: viewing human-generated content
• Cluster 5: content viewing – small
The identical method can find app types for other CIFS workloads.
Other design insights
• Consolidation: Clients can consolidate sessions based only on the read:write ratio.
• File delegation: Servers should delegate files to clients based only on access sequentiality.
• Placement: Servers can select the best storage medium for each file based only on access sequentiality.
Simple, threshold-based decisions on one dimension, with high confidence that it is the correct dimension (sketched below).
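Hedged sketches of these single-dimension policies; all threshold values are assumed placeholders, not numbers from the paper:

```python
def consolidate_sessions(a, b, tol=0.1):
    # Consolidation: merge sessions whose read:write ratios are close.
    return abs(a.read_ratio - b.read_ratio) < tol

def delegate_file(f, threshold=0.7):
    # File delegation: hand a file to the client if its access is sequential.
    return f.seq_ratio > threshold

def pick_medium(f, threshold=0.7):
    # Placement: sequential access suits disk; random access suits SSD.
    return "disk" if f.seq_ratio > threshold else "ssd"
```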
New knowledge – app types depend on IO behavior, not the software!
[Chart: fraction of application instances in each cluster, broken down by file extensions accessed (xls, doc, ppt, pdf, html, htm, lnk, ini, no file extension, no files opened); every cluster mixes many extensions]
n.f.e. = no file extension
Summary
• Contributions:
  – Multi-dimensional trace analysis methodology
  – Statistical methods that minimize designer bias
  – Analysis performed at 4 layers – results in the paper
  – 6 client and 6 server design implications derived
• Future work:
  – Optimizations using data content and working-set analysis
  – Implement the optimizations
  – Evaluate using workload replay tools
• Traces available from NetApp under license
Thanks!!!
Backup slides
How many clusters? – Enough to explain the variance
[Charts: fraction of data variance explained vs. number of clusters k (1–8), for the corporate and engineering datasets]
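A sketch of this selection rule: grow k until the fraction of variance explained stops improving meaningfully. The min_gain cutoff is an assumed placeholder, and kmeans refers to the earlier sketch:

```python
import numpy as np

def variance_explained(X, labels, means):
    # Fraction of total variance captured by the clustering.
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == c] - means[c]) ** 2).sum()
                 for c in range(len(means)))
    return 1.0 - within / total

def choose_k(X, k_max=8, min_gain=0.05):
    prev_ve, best_k = 0.0, 1
    for k in range(1, k_max + 1):
        labels, means = kmeans(X, k)   # kmeans from the earlier sketch
        ve = variance_explained(X, labels, means)
        if k > 1 and ve - prev_ve < min_gain:
            return best_k              # gain flattened: keep the previous k
        prev_ve, best_k = ve, k
    return best_k
```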
Behavior variation over time
[Chart 1: fraction of all app instances per application type (supporting metadata, app-generated file updates, content update app, viewing app-generated content, viewing human-generated content), weeks 0–7, log scale]
[Chart 2: sequentiality ratio per week for the content update app and for viewing human-generated content]