Inferring semantically related words from software context Jinqiu - PowerPoint PPT Presentation

Inferring semantically related words from software context Jinqiu Yang, Lin Tan University of Waterloo 1

Motivation I need to find all functions that disable interrupts in the Linux kernel. Hmmm, so I search for “disable*interrupt”. MISSING: disable_irq(...), mask_irq(...) New Search Queries: “disable*irq”, “mask*irq” BUT how am I supposed to know??? 2

How to Find Synonyms or Related Words? Can’t find that disable & mask are synonyms! Guess on my own Ask developers 3

Our Approach: Leveraging Context
• Comments: “Disable all interrupt sources” / “Disable all irq sources”
• Identifiers: void mask_all_interrupts() / void disable_all_interrupts()
Real comments and identifiers from the Linux kernel
• We call a pair of such semantically related words an rPair. 4
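The intuition on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the helper names `to_words` and `differing_pairs` are hypothetical, and the rule "same length, differ position by position" is a simplifying assumption made for the sketch.

```python
import re

def to_words(text):
    """Split a comment or identifier into lowercase words.

    Underscores, spaces and camelCase boundaries are treated as
    separators (an assumption made for this sketch).
    """
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]

def differing_pairs(seq_a, seq_b):
    """Return the word pairs that differ at the same position in two
    equal-length word sequences; return [] if the lengths differ."""
    if len(seq_a) != len(seq_b):
        return []
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != b]

# Identifiers from the slide
print(differing_pairs(to_words("mask_all_interrupts"),
                      to_words("disable_all_interrupts")))
# Comments from the slide
print(differing_pairs(to_words("Disable all interrupt sources"),
                      to_words("Disable all irq sources")))
```

The shared context ("all interrupts", "Disable all ... sources") is what suggests that the differing words, (mask, disable) and (interrupt, irq), are related.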

Contributions • A general context-based approach to automatically infer semantically related words from software context • Has reasonable accuracy in 7 large code bases written in C and Java. • Is more helpful to code search than the state of the art. 5

Outline • Motivation, Intuition and Contributions • Our Approach • A Running Example: Parsing, Clustering, Extracting, Refining • Evaluation Methods & Results • Related Work • Conclusion 6

A Running Example: Parsing
• maybe add a higher-level description
• min of spare daemons
• data in the appropriate order
• the compiled max daemons
• an iovec to store the trailer sent after the file
• data in the wrong order
• an iovec to store the headers sent before the file
• return err maybe add a higher-level desc
• if a user manually creates a data file
Real comments from Apache HTTPD Server 7

Extracting rPairs
• “an iovec to store the trailer sent after the file” vs. “an iovec to store the headers sent before the file”: SimilarityMeasure = 8/10
• “the compiled max threads” vs. “min of spare threads”: SimilarityMeasure = 1/4
• SimilarityMeasure = (Number of Common Words in the Two Sequences) / (Total Number of Words in the Shorter Sequence)
• threshold = 0.7
You can find how different thresholds affect our results in our paper. 8
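A minimal sketch of the similarity measure, reproducing the two values on the slide. The slide does not say how "common words" are counted; this sketch assumes the length of the longest common subsequence of the two word lists, and the function names are hypothetical.

```python
THRESHOLD = 0.7  # value given on the slide

def common_word_count(seq_a, seq_b):
    """Length of the longest common subsequence of two word lists
    (an assumed way of counting 'common words')."""
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(seq_a, seq_b):
    """Common words divided by the length of the shorter sequence."""
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    return common_word_count(seq_a, seq_b) / shorter

trailer = "an iovec to store the trailer sent after the file".split()
headers = "an iovec to store the headers sent before the file".split()
print(similarity(trailer, headers))  # 8/10, above the threshold

max_threads = "the compiled max threads".split()
spare_threads = "min of spare threads".split()
print(similarity(max_threads, spare_threads))  # 1/4, below the threshold
```

Only the first pair passes the 0.7 threshold, so only its differing words, (trailer, headers) and (after, before), would be considered as rPair candidates.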

Running Out of Time • Pairwise comparisons of a large number of sequences are expensive. • 519,168 unique comments in the Linux kernel ➔ over 100 billion comparisons 9
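The "over 100 billion" figure follows from the number of unordered pairs among 519,168 comments, n(n-1)/2. A quick check:

```python
import math

unique_comments = 519_168  # unique Linux kernel comments, from the slide

# Number of unordered pairs: n * (n - 1) / 2
pairwise = math.comb(unique_comments, 2)
print(f"{pairwise:,}")  # 134,767,446,528
```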

Clustering
• add: “maybe add a higher-level description”; “return err maybe add a higher-level desc”
• daemons: “min of spare daemons”; “the compiled max daemons”
• data: “data in the appropriate order”; “data in the wrong order”; “if a user manually creates a data file”
• iovec: “an iovec to store the headers sent before the file”; “an iovec to store the trailer sent after the file”
10
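One way to read the clustering slide: group sequences by the words they contain, then compare only sequences that share a cluster. The sketch below assumes every word not in a (hypothetical) stop-word set acts as a cluster key; the slide shows only the keys add, daemons, data and iovec, so the actual key selection may differ.

```python
from collections import defaultdict
from itertools import combinations

def cluster_by_word(sequences, stop_words=frozenset()):
    """Map each word to the indices of the sequences containing it."""
    clusters = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for word in set(seq) - stop_words:
            clusters[word].add(idx)
    return clusters

def candidate_pairs(sequences, stop_words=frozenset()):
    """Index pairs of sequences that share at least one cluster."""
    pairs = set()
    for members in cluster_by_word(sequences, stop_words).values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

comments = [c.split() for c in [
    "maybe add a higher-level description",
    "min of spare daemons",
    "data in the appropriate order",
    "the compiled max daemons",
    "an iovec to store the trailer sent after the file",
    "data in the wrong order",
    "an iovec to store the headers sent before the file",
    "return err maybe add a higher-level desc",
    "if a user manually creates a data file",
]]

pairs = candidate_pairs(comments)
all_pairs = len(comments) * (len(comments) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Sequences with no word in common, such as "maybe add a higher-level description" and "min of spare daemons", are never compared, which is where the savings come from.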

The Speedup After Clustering • Pairwise comparisons of a large number of sequences are expensive. • 519,168 unique comments in the Linux kernel ➔ over 100 billion comparisons. • Clustering speeds up the process for the Linux kernel by almost 100 times. 11

Refining rPairs • Filtering: • Using stemming to remove rPairs that consist of words with the same root, e.g., (called, call). • Normalization: • (threads, daemons) → (thread, daemon) • (called, invoked) → (call, invoke) 12
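The two refining steps can be sketched as follows. The slide does not name a stemmer, so `crude_stem` below is a deliberately rough, hypothetical stand-in that only strips a few suffixes; a real stemmer or lemmatizer would be needed to turn "invoked" into "invoke".

```python
def crude_stem(word):
    """Very rough suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def refine(rpairs):
    """Filter out same-root pairs and normalize the remaining ones."""
    refined = []
    for a, b in rpairs:
        pair = (crude_stem(a), crude_stem(b))
        if pair[0] == pair[1]:
            continue  # filtering: both words share one root
        if pair not in refined:
            refined.append(pair)  # normalization: keep the stemmed form once
    return refined

raw = [("called", "call"), ("threads", "daemons"), ("thread", "daemon")]
print(refine(raw))  # [('thread', 'daemon')]
```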

Outline • Motivation, Intuition and Contributions • Our Approach • A Running Example: Parsing, Clustering, Extracting, Refining • Evaluation Methods & Results • Related Work • Conclusion 13

Evaluation Methods • Extraction Accuracy • 7 large code bases, in Java & C, from Comment-Comment, Code-Code, Comment-Code • Search-Related Evaluation • Comparison with SWUM [Hill PhD Thesis] in Code-Code 14

Comment-Comment Accuracy Results

Software        rPairs    Accuracy   Not in Webster or WordNet
Linux           108,571   47%        76.6%
HTTPD           1,428     47%        93.6%
Collections     469       74%        97.3%
iReport         878       84%        95.2%
jBidWatcher     111       64%        98.4%
javaHMO         144       56%        91.1%
jajuk           203       69%        94.2%
Total/Average   111,804   63%        91.7%

We randomly sample 100 rPairs per project for manual verification (all 111 for jBidWatcher).
• The majority (91.7%) of correct rPairs discovered are not in Webster or WordNet. 15

Evaluation Methods • Extraction Accuracy • 7 large code bases, in Java & C, from Comment-Comment, Code-Code, Comment-Code • Search-Related Evaluation • Comparison with SWUM [Hill PhD Thesis] in Code-Code 16

Search-Related Evaluation
In jBidWatcher, “Add auction”
Query expansion: “XXX auction”
• Suggested words, our approach: new, register, ...
• Suggested words, SWUM: register, ...
• SWUM gold set: JBidMouse.DoAuction(...), AuctionServer.registerAuction(...), AuctionManager.newAuctionEntry(...), FilterManager.addAuction(...), ...
• Our gold set: add → register, do, new
17
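Query expansion as used in this example amounts to substituting each related word for the original query word. A small sketch, with the hypothetical helper `expand_query`:

```python
def expand_query(query, word, related_words):
    """Build one new query per related word by replacing `word` in `query`."""
    tokens = query.split()
    return [" ".join(r if t == word else t for t in tokens)
            for r in related_words]

# Related words for "add" taken from the gold set on the slide
print(expand_query("add auction", "add", ["register", "do", "new"]))
# ['register auction', 'do auction', 'new auction']
```

Each expanded query can then match methods such as registerAuction(...) or newAuctionEntry(...) that the original "add auction" query would miss.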

Search-Related Evaluation
In jBidWatcher, “Add auction”; gold set: add → register, do, new
• Our approach (55 words): new, register, do, ... Precision = 3/55 = 5.5%, Recall = 3/3 = 100%
• SWUM (84 words): register, do, load, ... Precision = 2/84 = 2.4%, Recall = 2/3 = 66.7%
• Our approach achieves higher precision and higher/equal recall for 5 out of 6 rPair groups in the gold set.
18
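The precision and recall figures for the "add" group can be recomputed from the counts on the slide. The full suggestion lists are not given, so this sketch works from the counts alone; `precision_recall` is a hypothetical helper.

```python
def precision_recall(hits, suggested, gold_size):
    """Precision = hits / suggested words; recall = hits / gold-set size."""
    return hits / suggested, hits / gold_size

GOLD_SIZE = 3  # gold set for "add": register, do, new

# (hits, suggested words) per approach, from the slide
for name, hits, suggested in [("Our approach", 3, 55), ("SWUM", 2, 84)]:
    precision, recall = precision_recall(hits, suggested, GOLD_SIZE)
    print(f"{name}: precision = {precision:.1%}, recall = {recall:.1%}")
```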

Related Work • Verb-DO (Direct Object) [Shepherd et al. AOSD] & SWUM, an improved version of Verb-DO [Hill PhD Thesis] • Requires Natural Language Processing (NLP) techniques • Requires manually generated heuristics 19

Conclusions • A simple, general technique to automatically infer semantically related words from software context • No Natural Language Processing (NLP) required • Reasonable accuracy in 7 large C & Java code bases • The majority of rPairs discovered are not in the dictionaries or WordNet. • Higher precision & recall than the state of the art 20
