Using Data Fusion and Web Mining to Support Feature Location in Software


  1. Using Data Fusion and Web Mining to Support Feature Location in Software SEMERU

  2. Feature: a requirement that a user can invoke and that has an observable behavior.

  3. Feature Location Impact Analysis

  4. Existing Feature Location Work
     [figure: a Venn diagram organizing existing techniques along static, textual, and dynamic dimensions: ASDGs, SUADE, SNIAFL, FCA, DORA, Cerberus, LSI, NLP, Software Reconnaissance, SPR, PROMESIR, and SITIR]
     Meghan Revelle and Denys Poshyvanyk, "Feature Location in Source Code: A Taxonomy and Survey", submission to the Journal of Software Maintenance and Evolution: Research and Practice.

  5. Textual Feature Location
     • Information Retrieval (IR): searching for relevant documents, or for relevant information within documents.
     • First used for feature location by Marcus et al. in 2004*, via Latent Semantic Indexing (LSI)**.
     • Utilized by many existing approaches: PROMESIR, SITIR, HIPIKAT, etc.
     * Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J., "An Information Retrieval Approach to Concept Location in Source Code", in Proc. of Working Conference on Reverse Engineering, 2004, pp. 214-223.
     ** Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, Jan. 1990, pp. 391-407.

  6. Applying LSI to Source Code
     • Corpus creation: choose a granularity (e.g., one document per method).
     • Preprocessing: stop word removal, identifier splitting, stemming.
     • Indexing: build a term-by-document matrix and reduce it via Singular Value Decomposition.
     • Querying: the user formulates a query (e.g., "print test result").
     • Generate results: a ranked list of methods.
     Example corpus document:

         synchronized void print(TestResult result, long runTime) {
             printHeader(runTime);
             printErrors(result);
             printFailures(result);
             printFooter(result);
         }

     After splitting and stemming, this method yields terms such as "print test result run time head error fail foot".
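
To make the pipeline concrete, here is a minimal LSI sketch using numpy; the toy corpus, query, and choice of k = 2 are invented for illustration and are not the paper's setup:

    import numpy as np

    # Toy corpus: each "document" is the preprocessed text of one method.
    docs = {
        "m1": "print test result run time head error fail foot",
        "m2": "pars token stream read charact",
    }
    query = "print test result"

    # Term-by-document matrix of raw term counts.
    terms = sorted({t for text in docs.values() for t in text.split()})
    A = np.array([[text.split().count(t) for text in docs.values()]
                  for t in terms], dtype=float)

    # LSI: truncate the SVD to k dimensions (the latent semantic space).
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Fold the query into the latent space: q_k = S_k^-1 * U_k^T * q.
    q = np.array([query.split().count(t) for t in terms], dtype=float)
    qk = np.diag(1.0 / sk) @ Uk.T @ q

    # Rank documents by cosine similarity to the query.
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    ranking = sorted(((name, cos(qk, Vtk[:, j])) for j, name in enumerate(docs)),
                     key=lambda pair: -pair[1])
    print(ranking)  # m1 should rank above m2 for this query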

  7. Dynamic Feature Location
     • Software Reconnaissance*: compares execution traces of scenarios that invoke the feature with traces of scenarios that do not; methods appearing only in the feature-invoking traces are reported as feature-specific.
     • Scenario-based Probabilistic Ranking (SPR)**: ranks methods by how strongly their execution correlates with the feature being invoked.
     * Wilde, N. and Scully, M., "Software Reconnaissance: Mapping Program Features to Code", Software Maintenance: Research and Practice, vol. 7, no. 1, Jan.-Feb. 1995, pp. 49-62.
     ** Antoniol, G. and Guéhéneuc, Y. G., "Feature Identification: An Epidemiological Metaphor", IEEE Trans. on Software Engineering, vol. 32, no. 9, Sept. 2006, pp. 627-641.
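
At its core, Software Reconnaissance is a set difference over traces; a minimal sketch, with invented traces and method names:

    # Each trace is the set of methods executed in one scenario.
    feature_traces = [  # scenarios that invoke the feature
        {"m1", "m2", "m6", "m15"},
        {"m1", "m6", "m15", "m47"},
    ]
    other_traces = [  # scenarios that do not invoke the feature
        {"m1", "m2", "m3"},
    ]

    invoked = set.intersection(*feature_traces)   # executed in every feature scenario
    not_invoked = set.union(*other_traces)        # executed without the feature

    candidates = invoked - not_invoked            # likely feature-specific methods
    print(candidates)  # {'m6', 'm15'}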

  8. Hybrid Feature Location
     • PROMESIR* (Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval): combines each method's LSI score and SPR score into a single PROMESIR score, e.g.:

         LSI score      SPR score      PROMESIR score
         m15  0.91      m52  0.80      m6   0.715
         m16  0.88      m47  0.66      m47  0.70
         m2   0.85      m6   0.64      m52  0.70
         m6   0.79      m2   0.53      m2   0.69
         m47  0.74      m15  0.37      m15  0.64
         m52  0.60      m16  0.34      m16  0.61

     • SITIR** (SIngle Trace and Information Retrieval): filters the LSI ranking through a single execution trace (e.g., main, m1, m2, m6, m15, m3, m47, ...), keeping only the methods that were executed.
     * Poshyvanyk, D., Guéhéneuc, Y. G., Marcus, A., Antoniol, G., and Rajlich, V., "Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval", IEEE Trans. on Software Engineering, vol. 33, no. 6, June 2007, pp. 420-432.
     ** Liu, D., Marcus, A., Poshyvanyk, D., and Rajlich, V., "Feature Location via Information Retrieval based Filtering of a Single Scenario Execution Trace", in Proc. of International Conference on Automated Software Engineering, 2007, pp. 234-243.
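
PROMESIR's combination step can be sketched as a weighted sum of the two expert scores; with λ = 0.5 the sketch below reproduces the PROMESIR column above (the paper treats the weight as tunable):

    # Expert scores per method, taken from the slide's example.
    lsi = {"m15": 0.91, "m16": 0.88, "m2": 0.85, "m6": 0.79, "m47": 0.74, "m52": 0.60}
    spr = {"m52": 0.80, "m47": 0.66, "m6": 0.64, "m2": 0.53, "m15": 0.37, "m16": 0.34}

    lam = 0.5  # relative weight of the two experts
    promesir = {m: lam * lsi[m] + (1 - lam) * spr[m] for m in lsi}

    for m, score in sorted(promesir.items(), key=lambda pair: -pair[1]):
        print(m, round(score, 3))  # m6 0.715, m47 0.7, m52 0.7, m2 0.69, ...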

  9. Data Fusion Example: Global Positioning System (GPS) vs. Inertial Navigation System (INS)
     • GPS: discrete measurements, meter accuracy, and noisy, but it does not drift.
     • INS: continuous measurements, centimeter accuracy, and low noise, but it drifts over time.
     • Fusing the two sources yields a better position estimate than either provides alone.

  10. Data Fusion for Feature Location
     • Combining information from multiple sources will yield better results than using the data separately.
     • Previous: textual, dynamic, and static information (i.e., Cerberus).
     • Current: textual information from IR, execution information from dynamic tracing, and web mining.

  11. Web Mining
     [figure: a directed graph over methods m1 through m20, to be mined like a graph of web pages and links]

  12. Web Mining Algorithms: PageRank
     • Measures the relative importance of a web page.
     • Used by the Google search engine.
     • A link from page X to page Y is a vote by X for Y.
     • A node's PageRank depends on the number of incoming links and on the PageRank of the nodes that link to it.
     Image source: http://en.wikipedia.org/wiki/Pagerank
     Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proc. of 7th International Conference on World Wide Web, Brisbane, Australia, 1998, pp. 107-117.
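
A minimal power-iteration sketch of PageRank (the toy graph, the damping factor of 0.85, and the 50 iterations are illustrative choices):

    # edges: caller -> callees, interpreted like web links ("votes").
    edges = {"m1": ["m2", "m3"], "m2": ["m3"], "m3": ["m1"], "m4": ["m3"]}
    nodes = sorted(set(edges) | {v for outs in edges.values() for v in outs})

    d = 0.85                          # damping factor, as in Brin and Page
    pr = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(50):               # power iteration until (approximate) convergence
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, outs in edges.items():
            for dst in outs:          # src spreads its rank over its out-links
                new[dst] += d * pr[src] / len(outs)
        pr = new

    print(sorted(pr.items(), key=lambda pair: -pair[1]))  # m3 collects the most votes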

  13. Web Mining Algorithms: HITS
     • Hyperlink-Induced Topic Search.
     • Identifies hub and authority pages.
     • Hubs point to many good authorities.
     • Authorities are pointed to by many hubs.
     Kleinberg, J. M., "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, vol. 46, no. 5, 1999, pp. 604-632.
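
A matching sketch of HITS on the same toy graph, alternating the hub and authority updates:

    edges = {"m1": ["m2", "m3"], "m2": ["m3"], "m3": ["m1"], "m4": ["m3"]}
    nodes = sorted(set(edges) | {v for outs in edges.values() for v in outs})

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(50):
        # A node's authority is the sum of the hub scores pointing at it...
        auth = {n: sum(hub[s] for s, outs in edges.items() if n in outs) for n in nodes}
        # ...and its hub score is the sum of the authorities it points to.
        hub = {n: sum(auth[v] for v in edges.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded.
        an = sum(v * v for v in auth.values()) ** 0.5
        hn = sum(v * v for v in hub.values()) ** 0.5
        auth = {n: v / an for n, v in auth.items()}
        hub = {n: v / hn for n, v in hub.items()}

    print("authorities:", sorted(auth.items(), key=lambda pair: -pair[1]))
    print("hubs:", sorted(hub.items(), key=lambda pair: -pair[1]))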

  14. Probabilistic Program Dependence Graph (PPDG)*
     [figure: a dependence graph over methods m1 through m20 with fractional edge weights]
     – Derived from a feature-specific trace.
     – Edge weights are either binary or based on execution frequency.
     * Baah, G. K., Podgurski, A., and Harrold, M. J., "The Probabilistic Program Dependence Graph and its Application to Fault Diagnosis", in Proc. of International Symposium on Software Testing and Analysis, 2008.
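
One plausible reading of the two weighting schemes, sketched over an invented list of caller→callee events (the paper's PPDG construction is more involved than this):

    from collections import Counter

    # A trace as a flat list of (caller, callee) events; invented for illustration.
    trace = [("m1", "m2"), ("m1", "m3"), ("m1", "m2"), ("m2", "m4")]

    counts = Counter(trace)
    out_totals = Counter(caller for caller, _ in trace)

    # Frequency weights: fraction of a caller's calls that went to each callee.
    freq = {(c, e): n / out_totals[c] for (c, e), n in counts.items()}
    # Binary weights: 1 for any edge that appeared at all.
    binary = {edge: 1.0 for edge in counts}

    print(freq)    # m1->m2: 2/3, m1->m3: 1/3, m2->m4: 1/1
    print(binary)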

  15. Incorporating Web Mining with Feature Location
     [figure: PageRank scores computed over the trace (m15: 0.14, m16: 0.09, m20: 0.07, m13: 0.04, ..., m17: 0.001) are used to filter the LSI ranking (m15: 0.91, m16: 0.88, m2: 0.85, m6: 0.79, m47: 0.74, m52: 0.60)]
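
A sketch of that filtering step, assuming a fixed fraction of methods is dropped based on their web-mining score; the 25% cutoff and the PageRank scores for m2, m6, m47, and m52 are invented:

    lsi_dyn = ["m15", "m16", "m2", "m6", "m47", "m52"]   # LSI ranking, executed only
    pagerank = {"m15": 0.14, "m16": 0.09, "m2": 0.02, "m6": 0.03,
                "m47": 0.001, "m52": 0.005}

    def prune(ranked, scores, fraction=0.25, end="bottom"):
        """Drop the methods with the highest ('top') or lowest ('bottom')
        web-mining scores from an LSI+Dyn ranking."""
        k = int(len(ranked) * fraction)
        by_score = sorted(ranked, key=lambda m: scores[m])
        doomed = set(by_score[-k:] if end == "top" else by_score[:k])
        return [m for m in ranked if m not in doomed]

    print(prune(lsi_dyn, pagerank, end="bottom"))  # drops m47 (lowest PageRank)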

  16. Feature Location Techniques Evaluated
     • LSI & dynamic analysis: LSI — use LSI to rank methods; LSI+Dyn (baseline) — use LSI to rank methods and prune unexecuted ones.
     • Web mining alone: PR(bin), PR(freq), HITS(h, bin), HITS(h, freq), HITS(a, bin), HITS(a, freq) — use the web mining algorithm to rank methods and prune unexecuted ones.
     • LSI, Dyn, & PageRank: LSI+Dyn+PR(bin) top and bottom; LSI+Dyn+PR(freq) top and bottom.
     • LSI, Dyn, & HITS: LSI+Dyn+HITS(h, bin) top and bottom; LSI+Dyn+HITS(h, freq) top and bottom; LSI+Dyn+HITS(a, bin) top and bottom; LSI+Dyn+HITS(a, freq) top and bottom.
     • The combined techniques use LSI to rank methods, prune unexecuted methods, and then use the web mining algorithm to prune the top- or bottom-ranked methods from LSI+Dyn's results.

  17. Feature Location Techniques Explained
     [figure: a query is run through LSI over the source code to produce ranked methods; a scenario is run through a tracer to obtain the executed methods; together these yield the ranked, executed methods of LSI+Dyn; web mining (e.g., PR(bin) top, PR(bin) bottom, HITS(h, bin) bottom) then prunes methods from that list to produce the final results]

  18. Subject Systems
     • Eclipse 3.0: 10K classes, 120K methods, and 1.6 million LOC.
     • 45 features.
     • Gold set: the methods modified to fix each bug.
     • Queries: the short description from the bug report.
     • Traces: the steps to reproduce the bug.

  20. Subject Systems
     • Rhino 1.5: 138 classes, 1,870 methods, and 32,134 LOC.
     • 241 features.
     • Gold set: Eaddy et al.'s dataset*.
     • Queries: the description in the specification.
     • Traces: test cases.
     * http://www.cs.columbia.edu/~eaddy/concerntagger/

  21. Size of Traces

     Eclipse              Min    Max     25%    Med    75%    μ      σ
     Methods              88K    1.5MM   312K   525K   1MM    666K   406K
     Unique Methods       1.9K   9.3K    3.9K   5K     6.3K   5.1K   2K
     Size (MB)            9.5    290     55     98     202    124    83
     Threads              1      26      7      10     12     10     5

     Rhino                Min    Max     25%    Med    75%    μ      σ
     Methods              160K   12MM    612K   909K   1.8MM  1.8MM  2.3MM
     Unique Methods       777    1.1K    870    917    943    912    54
     Size (MB)            18     1,668   71     104    214    210    273
     Threads              1      1       1      1      1      1      0

  22. Research Questions
     • RQ1: Does combining web mining algorithms with an existing approach to feature location improve its effectiveness?
     • RQ2: Which web mining algorithm, HITS or PageRank, produces better results?

  23. Data Collection & Testing
     • Effectiveness measure: the rank of the first relevant method in the returned list (e.g., if the first relevant method appears at position 4, effectiveness = 4); lower is better.
     • Descriptive statistics computed over 45 Eclipse features and 241 Rhino features.
     • Statistical testing: Wilcoxon rank sum test.
       – Null hypothesis: there is no significant difference between the effectiveness of X and the baseline (LSI+Dyn).
       – Alternative hypothesis: the effectiveness of X is significantly better than the baseline (LSI+Dyn).
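
As an illustration, the test could be run with SciPy's implementation of the Wilcoxon rank-sum test; both effectiveness samples below are invented:

    from scipy.stats import ranksums

    # Effectiveness (rank of first relevant method) per feature; lower is better.
    baseline = [12, 40, 7, 33, 90, 15, 22]   # LSI+Dyn
    technique_x = [5, 18, 7, 20, 41, 9, 16]  # e.g., LSI+Dyn+PR(bin)

    # One-sided test: is technique X's effectiveness significantly smaller
    # (i.e., better) than the baseline's?
    stat, p = ranksums(technique_x, baseline, alternative="less")
    print(f"W = {stat:.3f}, p = {p:.4f}")  # reject H0 at alpha = 0.05 if p < 0.05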
