Introduction Methodology Analysis process Results Conclusions
Piotr Dobrowolski MIMUW/Distributed Systems Design Implications

Background & Needs: Analysis at different layers
- Storage is divided into clients, which create and view files through applications, and servers, which store the content.
- Clients operate at the user and application layers.
- In essence, client and server behavior defines the layers, for example: user sessions, application instances, files on the server, directories, ...
- (def) Access unit = layer.
- (def) An access pattern is a set of access-unit behaviors.

Background & Needs: At different dimensions
- An access unit has certain inherent characteristics.
- Let's call each of them a feature of the access unit.
- For example, for an application, the read size in bytes is a feature; the number of unique files accessed is another.
- Each feature represents an independent mathematical dimension, and the dimensions together describe the access unit.
- Dimension = feature = characteristic.

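As a rough illustration of treating features as dimensions, the sketch below represents one application instance as a point with one coordinate per feature. The class name `AppInstance` and the sample values are hypothetical; only the two features (read size in bytes, number of unique files accessed) come from the slide.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class AppInstance:
    """One access unit (an application instance) described by its features."""
    read_bytes: int     # feature 1: read size in bytes
    unique_files: int   # feature 2: number of unique files accessed

    def as_point(self) -> tuple:
        """Return the instance as a point: one coordinate per feature."""
        return astuple(self)


# Hypothetical instance; the values are made up for illustration.
instance = AppInstance(read_bytes=4096, unique_files=3)
point = instance.as_point()
print(point, "dimensions:", len(point))
```

Adding a feature to the class adds a dimension to every point, which is the sense in which dimension, feature and characteristic are used interchangeably here.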
Background & Needs: Example of multi-dimensional insight
- Files with > 70% sequential read or sequential write have no repeated reads or overwrites.
- This insight spans 4 dimensions:
  - Read sequentiality
  - Write sequentiality
  - Repeated reads
  - Overwrites
- Measuring one dimension at a time is easier, but this insight needs all four dimensions captured at the same time.

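A minimal sketch of how such a four-dimensional statement could be checked against per-file features. The record layout, file names and numbers are invented for illustration; only the 70% threshold and the four dimensions come from the slide.

```python
# Hypothetical per-file features; all numbers are invented for illustration.
files = [
    {"name": "a.log", "seq_read": 0.95, "seq_write": 0.10, "repeated_reads": 0, "overwrites": 0},
    {"name": "b.db", "seq_read": 0.20, "seq_write": 0.30, "repeated_reads": 7, "overwrites": 4},
    {"name": "c.iso", "seq_read": 0.05, "seq_write": 0.88, "repeated_reads": 0, "overwrites": 0},
]

THRESHOLD = 0.70  # "> 70% sequential" from the slide


def is_sequential(f):
    """True if the file is mostly read sequentially or mostly written sequentially."""
    return f["seq_read"] > THRESHOLD or f["seq_write"] > THRESHOLD


def insight_holds(records):
    """True if every highly sequential file has no repeated reads and no overwrites."""
    return all(
        f["repeated_reads"] == 0 and f["overwrites"] == 0
        for f in records
        if is_sequential(f)
    )


sequential_names = [f["name"] for f in files if is_sequential(f)]
print(sequential_names, insight_holds(files))
```

The check needs all four columns for the same file at once, which a one-dimension-at-a-time measurement would not provide.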
Background & Needs: In short
- The need for multi-layered and multi-dimensional insights motivates the authors' methodology.

Traces: Scale #1
- The analysis was performed on Common Internet File System (CIFS) traces.
- CIFS is better known as Server Message Block (SMB).
- Samba is an open-source implementation.
- Microsoft uses it in Windows, where it is also known as "Microsoft Windows Network".
- Tracing CIFS makes it possible to identify the layers.

Traces: Scale #2
- Data were collected over 3 months in 2007.
- 2 different datasets:
  - corporate
  - engineering

Traces: Scale #3
Corporate trace:
- 1000 employees in marketing, finance, etc.
- 3 TB active storage, Windows applications
- 509,076 user sessions, 138,723 application instances
- 1,155,099 files, 117,640 directories
Engineering trace:
- 500 employees in various engineering roles
- 19 TB active storage, Windows and Linux applications
- 232,033 user sessions, 741,319 application instances
- 1,809,571 files, 161,858 directories

Algorithm, centroids: k-means clustering
- Partition n observations into k clusters.
- Each observation belongs to the cluster with the nearest mean.
- Finding the optimal partition is an NP-hard problem.
- But there are good heuristics, which find a local optimum.

Algorithm, centroids: k-means
What we want is

$$
\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2
$$

where $c_i$ is the mean of the points in cluster $S_i$.

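A small worked example of the k-means objective (sum over clusters of squared distances to the cluster mean) may help. The four one-dimensional observations 1, 2, 9, 10 and the two candidate partitions are made up for illustration, with k = 2.

```latex
% Objective: J(S) = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2

% Partition A: S_1 = \{1, 2\}, S_2 = \{9, 10\}, so c_1 = 1.5, c_2 = 9.5
J_A = (1 - 1.5)^2 + (2 - 1.5)^2 + (9 - 9.5)^2 + (10 - 9.5)^2 = 4 \cdot 0.25 = 1

% Partition B: S_1 = \{1\}, S_2 = \{2, 9, 10\}, so c_1 = 1, c_2 = 7
J_B = 0 + (2 - 7)^2 + (9 - 7)^2 + (10 - 7)^2 = 25 + 4 + 9 = 38
```

Partition A has the smaller objective, so it is the better clustering of these points under this criterion.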
Algorithm, centroids: k-means illustrated (figures not reproduced)
1. Start with observations in 2 dimensions.
2. Choose k = 2 "means" randomly.
3. Divide the space by nearest mean.
4. Connect each observation to its center.
5. Compute the new means.
6. Once again: divide the space and compute the new means.
7. Repeat until the means stabilize.

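The illustrated iteration can be sketched in plain Python as follows. This is a minimal version of the standard heuristic (often called Lloyd's algorithm), not necessarily the implementation used in the presented study; the function names, the `seed` parameter and the sample points are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Heuristic k-means: returns (means, clusters); may stop in a local optimum."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # choose k initial "means" at random
    for _ in range(max_iter):
        # Divide the observations by nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Compute the new means; keep the old mean if a cluster became empty.
        new_means = [mean(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # stabilized: the assignment would not change
            break
        means = new_means
    return means, clusters


# Hypothetical observations in 2 dimensions: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, clusters = kmeans(points, k=2)
print(means)
print(clusters)
```

Because the initial means are random, different seeds can yield different local optima on harder data; on this toy input the two groups are far enough apart that the split is the same either way.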
Overview: Analysis step by step
1. Collect network storage system traces.
2. Define the access units (this needs domain knowledge about the storage system).
3. Extract multiple instances of each access unit, with their feature values.
4. Input those values into k-means.
5. Interpret the k-means output and derive access patterns.
6. Translate the access patterns into design insights.

Overview: The flow looks like this.
[Figure: flow diagram of the analysis process, not reproduced]

Identify access patterns via k-means: How it works
- For each access unit, extract its instances from the trace, e.g. session instances, application instances.
- For every instance, compute all numerical feature values (e.g. number of opened files).
- This gives a data array with one row per instance.
- The k-means algorithm produces clusters, which are interpreted as data access patterns.

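The extraction step can be sketched as below: trace records are grouped by instance and each instance becomes one row of numerical values. The record format `(session_id, file_path, bytes_read)` and the two computed features are hypothetical simplifications, not the actual CIFS trace format.

```python
from collections import defaultdict

# Hypothetical, simplified trace records: (session_id, file_path, bytes_read).
trace = [
    ("s1", "/docs/a.txt", 100),
    ("s1", "/docs/b.txt", 300),
    ("s2", "/src/main.c", 50),
    ("s1", "/docs/a.txt", 200),
    ("s2", "/src/util.c", 70),
    ("s3", "/tmp/x.bin", 4096),
]


def build_data_array(records):
    """Group records by session and compute one row of feature values per instance."""
    files = defaultdict(set)
    total_bytes = defaultdict(int)
    for session_id, path, nbytes in records:
        files[session_id].add(path)
        total_bytes[session_id] += nbytes
    instance_ids = sorted(files)
    # Columns: [number of unique files opened, total bytes read]
    rows = [[len(files[s]), total_bytes[s]] for s in instance_ids]
    return instance_ids, rows


ids, data = build_data_array(trace)
for session_id, row in zip(ids, data):
    print(session_id, row)
```

The resulting `data` list is the kind of array that would be passed to k-means; in practice the columns would likely need scaling first, since byte counts and file counts differ by orders of magnitude.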
Identify access patterns via k-means: How to determine the number of clusters
- In the heuristic k-means algorithm, k is fixed in advance.
- Reminder: k is tied to the number of access patterns we obtain.
- Optimization: search for the best k by running the algorithm for more than one value of k,
- using metrics that describe cluster "quality".

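One possible way to search for k is sketched below: run k-means for several values of k and compare a quality metric. The slides do not name the metric, so the within-cluster sum of squares and the 10% stopping rule used here are assumptions, as are the one-dimensional toy data and the deterministic initialization.

```python
def kmeans_1d(values, k, max_iter=100):
    """Tiny 1-D k-means with deterministic initial means spread over the sorted data."""
    data = sorted(values)
    means = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: (v - means[i]) ** 2)
            clusters[nearest].append(v)
        # Keep the old mean if a cluster became empty.
        new_means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:
            break
        means = new_means
    return means, clusters


def within_cluster_ss(means, clusters):
    """Quality metric (assumed): sum of squared distances to the cluster mean."""
    return sum((v - m) ** 2 for m, c in zip(means, clusters) for v in c)


def pick_k(scores, min_gain=0.10):
    """Smallest k after which one more cluster gains less than min_gain of the first score."""
    ks = sorted(scores)
    for k, next_k in zip(ks, ks[1:]):
        if scores[k] - scores[next_k] < min_gain * scores[ks[0]]:
            return k
    return ks[-1]


# Hypothetical feature values forming three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
scores = {}
for k in range(1, 6):
    means, clusters = kmeans_1d(values, k)
    scores[k] = within_cluster_ss(means, clusters)

best_k = pick_k(scores)
print(scores)
print("chosen k:", best_k)
```

The metric always tends to shrink as k grows, so the rule looks for the point where extra clusters stop paying off rather than for the minimum.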