understanding the structure of programs is difficult
play

Understanding the Structure of Programs is Difficult Developers - PDF document

Understanding the Structure of Programs is Difficult Developers create sophisticated applications that Software Clustering are complex and involve a large number of Decomposing a large software system into meaningful subsystems


  1. Understanding the Structure of Programs is Difficult • Developers create sophisticated applications that Software Clustering are complex and involve a large number of Decomposing a large software system into meaningful subsystems interconnected components. • Result : Program understanding is difficult • Goal : Use automated techniques to help developers understand the structure of software systems. Common Problems Solutions • Creating a good mental model of the structure of a complex system. • Automatic : Use software clustering techniques to decompose the structure of software systems into • Keeping a mental model consistent with changes meaningful subsystems. that occur as the system evolves. • Subsystems help developers navigate through the numerous software components and their interconnections. • These problems are exacerbated by: • Non-existent or inconsistent design documentation • Manual : Use notations such as UML to specify • High rate of turnover among IT professionals the software structure. • Assumption : Understanding the structure of a software system is valuable for maintainers. Why is clustering useful? Example (before) • Helps new developers create a mental model of the software structure. • Especially useful in the absence of experts or accurate design documentation. • Helps developers understand the structure of legacy software. • Enables developers to compare the documented structure with the automatically created (actual) structure.

  2. Example (after) Software Clustering Challenges • There are many ways to partition a set of entities into clusters. • How do we create efficient algorithms to find partitions that are representative of a system’s structure? • How do we distinguish between good and bad partitions? How Hard is this Problem? Some solutions • The number of partitions of n objects into k • Enumerating every possible partition of the clusters is: software structure graph is not practical. � k k � S n , k = 1 � ( − 1 ) k − j j n • Heuristics can be used to reduce the number of j k ! partitions: j = 0 • Searching algorithms • The number of ways to partition a set of n objects • Knowledge about the source code is: B n = � n k = 1 S n , k • Names, directory structure, designer input • This function grows exponentially with respect to • Remove entities that provide little structural value n . Some values: • Libraries, omnipresent nodes 1 5 10 15 20 • Result is sub-optimal, but often adequate. 1 52 115,975 1,382,958,545 51,724,158,235,372 Software Clustering Research Clustering Techniques • There are many different clustering techniques, • Clustering Procedures/Functions into Modules but they all need to consider: • Clustering Modules/Classes into Subsystems • Representation : The entities and relationships to be clustered • Similarity : What determines the degree of similarity between the • Evaluating clustering algorithms software entities • Measuring distance between partitions • Algorithms : Algorithms that use the similarity measurement to • Algorithm stability make clustering decisions

  3. Representation Similarity • Similarity measurements are used to determine • There are many choices based on the desired the degree of similarity between a pair of entities granularity of recovered system design • Different types: • Entities may be variables/procedures or modules/classes. • Association coefficients : Based on common features that exist • What types of relationships will be considered? (or do not exist) between a pair of entities • Will the relationships be weighted? • Most common type of similarity measurement • Distance measures : Measure of the degree of dissimilarity between entities. Similarity Measurements Similarity Measurements • Assume that every entity is expressed in terms of • We can also include information about who binary features, 1 denoting the existence of a developed what file, and where each file is located feature, 0 its absence. f 1 f 2 f 3 u 1 u 2 Alice Bob p 1 p 2 p 3 f 1 f 2 f 3 u 1 u 2 f 1 0 1 1 1 1 1 0 0 1 0 f 1 0 1 1 1 1 f 2 1 0 1 1 1 0 1 1 0 0 f 2 1 0 1 1 1 f 3 1 1 0 1 1 0 1 0 1 0 f 3 1 1 0 1 1 u 1 1 1 1 0 0 1 0 0 0 1 u 1 1 1 1 0 0 u 2 1 1 1 0 0 0 1 0 0 1 u 2 1 1 1 0 0 For two entities i and j, we can define... Association Coefficients • Association co-efficients can be defined based on • a : Number of features present in both entities these values: • b : Number of features unique to entity i a + d Simple Matching coefficient a + b + c + d • c : Number of features unique to entity j a Jaccard coefficient a + b + c • d : Number of features absent in both entities 2 a Sorensen coefficient 2 a + b + c

  4. Agglomerative hierarchical algorithm Dendrogram example • Start by creating one cluster for each object • Join the two most similar objects into one cluster • Continue joining the two most similar objects/clusters until everything is in one cluster • What you get is a dendrogram... Cut height Update rule • By choosing to “cut” the dendrogram at a • How to determine the similarity between two particular height, we can create a partition of the already formed clusters (or an object and a set of objects, e.g. a cut height of 0.45 in the cluster) previous example would give us 3 clusters • Many possibilities • Finding an appropriate cut height is a tough • Minimum of all pair-wise similarities problem • Maximum of all pair-wise similarities • Heuristics, such as the number of clusters, are • Weighted or unweighted averages usually employed Assignment tool: aa input.rsf output.mbd call f1 f2 f1 f2 f3 u1 u2 • The aa tool allows to run any version of the call f1 f3 f2 f1 f3 u1 u2 agglomerative algorithms described before call f2 f3 f3 f1 f2 u1 u2 call f1 u1 u1 f1 f2 f3 • It requires input in “market basket data” form. You call f1 u2 u2 f1 f2 f3 can transform from RSF to MBD with: call f2 u1 unitrans input.rsf output.mbd call f2 u2 call f3 u1 call f3 u2

  5. Assignment tool: aa Pattern-based software clustering • Example: aa input.mbd contain.rsf • Manual decompositions of large pieces of -c0.4 -s1 -a2 software often contain certain types of • Cluster the objects in input.mbd using a cut-height of 0.4, the subsystems Simple Matching Coefficient, and the Weighted Average Algorithm • A software clustering algorithm that creates • Output: clusters based on these patterns would have a contain ss5 u1 better chance of creating a decomposition that contain ss5 u2 can help system comprehension contain ss3 f3 • These clusters can also have better names contain ss3 f1 (based on the pattern they were derived from) as contain ss3 f2 well as a more manageable number of contents The ACDC algorithm Example pattern: Subgraph Dominator • A skeleton of the decomposition is created based on the identified patterns • Entities not clustered this way are assigned to the cluster that they exhibit the largest connectivity to • Experiments with large systems have shown that the skeleton usually contains at least half the system entities Assignment tool: acdc Optimization-based Clustering • If one can express the desired properties of a • The acdc tool is an implementation of this clustering as a formula, then the problem of algorithm clustering is reduced to that of finding the decomposition that optimizes the value of the • Example: formula acdc input.rsf output.rsf -l25 • Cluster the objects in input.rsf with a maximum size of 25 for • A typical goal is to maximize cohesion and the Subgraph Dominator pattern minimize coupling

  6. Bunch Bunch • Bunch attempts to maximize the value of the MQ function • Finding the optimal clustering based on this � k � � k i , j = 1 E i , j i = 1 A i − k > 1 formula is impractical k ( k − 1 ) MQ = k 2 • Exhaustive search is not recommended for more than 15 entities A 1 k = 1 � • Bunch employs hill climbing and genetic 0 i = j where A i = µ i i and E i , j = ǫ i , j algorithms to find approximate solutions N 2 i � = j 2 N i N j N i : the number of entities in cluster i µ i : the number of intra-edges in cluster i ǫ i , j : the number of inter-edges between clusters i and j Assignment tool: bunch Other ideas • The literature contains many more ideas for • Bunch is an interactive tool written in Java clustering algorithms • Input is in a format that is exactly like RSF except • Data mining techniques as well as mathematical that the first token is missing, i.e. only one type of tools such as concept analysis have been used for relationship is assumed clustering purposes • Output is in a format called SIL that can be • Using naming or ownership information has also translated to RSF (see webpage) been shown to improve clustering results

Recommend


More recommend