Using Provenance to Extract Semantic File Attributes

Daniel Margo, Harvard University
Robin Smogor, Harvard University

Abstract

Rich, semantically descriptive file attributes are valuable in many contexts, such as semantic namespaces and desktop search. Descriptive attributes help users to find files placed in seemingly arbitrary locations by different applications. However, extracting semantic attributes from file contents is nontrivial. An alternative is to examine file provenance: how and when files are used, and the agents that use them.

We study the extraction of semantic attributes from file provenance by applying data mining and machine learning techniques to file metadata. We show that provenance and other metadata predict semantic attributes such as file extensions. This complements previous work, which has shown that file extensions predict access patterns.

1 Introduction

Semantic attributes, which describe an object in human-readable terms, are useful to many applications. For example, iTunes represents a music collection as a semantic namespace in which songs are located by attributes such as album, artist, and genre. Desktop search engines such as Google Desktop Search also locate data semantically and benefit from descriptive attributes.

One of the fundamental challenges in semantic applications is the problem of extracting attributes. A semantic application is not a useful tool unless it has rich, accurate attributes to work with. Unfortunately, manual labeling is an arduous task (akin to assigning a file multiple directories) and is intractable for importing extant systems. This “labeling problem” has been the subject of research, but is far from solved. Recent projects extract acoustic features from music [1] and summaries and other features from text documents [7]. However, these systems are necessarily limited in that they must understand how to read and interpret the contents of each type of file. Furthermore, they treat files as individuals and do not consider context: historical and other relationships between files.

We propose that a great deal of information can be obtained by examining file provenance: how and when files are used, and the agents that use them. For example, a file that is always opened by an application and is contained in that application’s directory is likely to be part of that application. Conversely, a file that is always opened by many applications but is not contained in any application’s directory is likely to be a library. A file that is occasionally opened by an application and is not contained in that application’s directory is likely to be content manipulated by that application. One can imagine a machine learning classification algorithm that matches file provenance to patterns such as “application component”.

There is significant prior work on using contextual file metadata in desktop search. Shah et al. have used provenance to improve desktop search [5], and Soules and Ganger have researched attribute propagation using content similarity and temporal context [6]. Temporal context can be thought of as a coarse approximation of provenance, because objects that share a provenance relationship must be active in the same timespan. True provenance records contain more data at a finer granularity, but this consequently introduces novel challenges.

Classifying provenance is challenging because provenance data is large and multi-dimensional. Provenance is generally represented as a graph, and the size of a graph can be quadratic in its number of nodes. Furthermore, each node and edge is labeled with metadata such as name and version. In contrast, a typical machine learning classifier can handle feature vectors on the order of tens or hundreds of features in length. Therefore, we must intelligently reduce the large graph to a few relevant features.

We use provenance collected by the Provenance-Aware Storage System v2 [4] to describe file history. Then, using a variety of clustering and machine learning algorithms, we classify files by their provenance and other metadata. Because this data is large, we explore different methods of distilling and extracting relevant features. We demonstrate that provenance and other metadata are predictive of extant semantic attributes, such as file extensions.

2 Design

Our high-level goal is to take the provenance of a set of files, and output semantically meaningful classifications. We approach this problem in three stages: collecting provenance, processing provenance, and feeding processed provenance to a machine learning classifier. The processing stage can be further broken down into a set of component techniques that assemble in various ways. In the following section, we broadly map out this design and its implications.

We collect provenance using PASSv2 [4], which captures provenance relationships at the Linux syscall interface. For example, PASSv2 captures and logs the source and destination of a process’s write to a file. The resulting output is a directed acyclic graph in which the process and file are nodes, and the write is an edge.

When a process writes to a file it has previously read, this creates a cyclic dependency that PASSv2 resolves by creating a new version of the file that implicitly depends on the old version. As a result, PASSv2 nodes do not correspond to files and processes, but rather to historical instances of files and processes. This distinction is typical in provenance systems, and is important both when we process provenance and when we interpret the results.

2.1 Processing Provenance

In the processing stage, we reduce the large, singular provenance graph to a small number of per-file features. Graph features can be divided into two categories, extrinsic and intrinsic. Extrinsic features are attributes and labels of nodes and edges that are particular to a given class of graphs (in this case, PASSv2 graphs). Conversely, intrinsic features are structures that depend solely on graph topology. Extrinsic features are often semantically meaningful, but intrinsic features are more generalizable.

De-version    Provenance Graph    Merge Names    Ancestors      Node Count
              File Graph          Don’t Merge    Descendants    Edge Count
              Process Graph                      Neighbors      Max Depth

Table 1: The feature extraction pipeline. Each step is chosen from left to right.

2.1.1 Extrinsic Features

Probably the most significant extrinsic feature in PASSv2 (and many other provenance systems) is node versioning. Recall that, due to versioning, there is no one-to-one correspondence between files and nodes. Therefore, we must reconcile multiple version nodes when generating per-file features. One simple solution is to de-version the graph by merging nodes that refer to a single versioned object. Alternatively, we could process the versioned graph, generate per-node features, and reconcile them in post-processing (for example, by averaging versions, or discarding all but the most recent version).

De-versioning emphasizes the relationship between different versions of a file and reduces graph sparsity, but also discards topological information and introduces cycles and false dependencies. Conversely, post-processing lets us retain and better manage topological information, but does not address sparsity and introduces further decisions with regard to how the versions are reconciled. In our initial work we have found sparsity to be a challenge and the decision space to be large, so to date we have only operated on de-versioned graphs. We can further explore this concept and reduce sparsity by merging nodes with identical pathnames. This ensures one node per file, and captures relationships such as instances of a process or an application’s temporary files, but can also introduce wholly false relationships. However, it is possible that small amounts of such noise will be “washed out” by the machine learner.

Other notable extrinsic features in PASSv2 include node types and edge timestamps. Node types are semantic labels such as “file”, “process”, etc. We can reduce a provenance graph to a file graph or a process graph by omitting all non-file or non-process objects, respectively. The former emphasizes dependencies between files, whereas the latter emphasizes workflow. These alternative representations of the graph are then interesting candidates for intrinsic feature analysis, as described below. We plan to incorporate edge timestamps into our processing pipeline in future work.

In addition, we collect features from file system metadata. Per-file features such as directory depth, last access time, etc. are readily available via a stat syscall. While these features are not provenance, they are consistent with our file content-ignorant approach.

2.1.2 Intrinsic Features

A traditional technique to summarize intrinsic graph topology is topological clustering. Typical clustering al-
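
As a rough illustration of the de-versioning step described in Section 2.1.1, the following Python sketch merges version nodes by object name and then counts ancestors per merged node. The (name, version) node encoding, the edge direction (from a dependent node to the node it depends on), the toy edges, and the function names deversion and ancestors are all assumptions made for illustration; they are not PASSv2's actual data format or the pipeline used in this work.

```python
def deversion(edges):
    """Merge every version of an object into a single node keyed by name.

    `edges` is an iterable of ((name, version), (name, version)) pairs,
    pointing from a dependent node to the node it depends on. This
    encoding is hypothetical, chosen only to keep the sketch small.
    """
    merged = {}
    for (src, _), (dst, _) in edges:
        merged.setdefault(src, set())
        merged.setdefault(dst, set())
        if src != dst:  # version-to-version edges collapse to nothing
            merged[src].add(dst)
    return merged


def ancestors(graph, start):
    """Return all nodes reachable from `start`, excluding `start` itself.

    A visited set is required because de-versioning can introduce cycles.
    """
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    seen.discard(start)
    return seen


# Toy history: an editor process reads notes.txt (version 1) and a library,
# then writes notes.txt, which creates version 2 of the file.
edges = [
    (("editor", 1), ("notes.txt", 1)),
    (("editor", 1), ("libc.so", 1)),
    (("notes.txt", 2), ("editor", 1)),
    (("notes.txt", 2), ("notes.txt", 1)),  # implicit dependency on old version
]

graph = deversion(edges)
features = {name: len(ancestors(graph, name)) for name in graph}
print(features)
```

In this toy example the merged graph contains a cycle between editor and notes.txt even though the versioned input was acyclic, which mirrors the trade-off noted above: de-versioning reduces sparsity but can introduce cycles and false dependencies.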