Technical Aspects of the Paper: Improving Code Readability Models with Textual Features
Deeksha Arya, COMP762
Key Concepts
´ Previous work as mentioned in the paper:
  ´ QALP tool (to compute similarity between comments and code)
  ´ Entropy
  ´ Halstead's volume metric
  ´ Area Under the Curve (AUC)
´ Concepts used in the paper's experiments:
  ´ Center selection (used to get 200 representative code snippets)
  ´ Cronbach's alpha (to evaluate agreement between participants regarding readability values)
  ´ Logistic regression with a wrapper strategy (binary classification algorithm)
  ´ Wilcoxon test
  ´ Cliff's delta
QALP Score (Quality Assessment using Language Processing)
´ Measures the correlation between the natural language used in program code (mainly identifiers) and that of its documentation (in this case, its comments), hence identifies well-documented code
´ Pre-processing involves:
  ´ Removing stop words (custom-defined for code – includes keywords, library functions and predefined variable names)
  ´ Stemming (elimination of word suffixes)
  ´ Atomic splitting of identifiers from code (splits compound identifiers into multiple atomic terms using a lex-based scanner built on an island grammar)
  ´ Weighting the words using tf-idf (high weight to terms which occur more often than average in a document but are rarer in the entire collection)
´ Considers each word as a separate dimension in an n-dimensional vector space – vectorizes comments and code separately
´ Calculates the cosine similarity between the comment and code vectors
´ A greater QALP score indicates that both document models in question describe concepts using the same vocabulary
Ref: Increasing diversity: Natural language measures for software fault prediction
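The vector-space comparison above can be sketched as follows. This is a minimal illustration of cosine similarity between two bags of words, not the actual QALP tool: the tf-idf weighting, stop-word removal and identifier splitting steps are omitted, and the token lists are invented examples.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of words in a shared term space."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)  # Counter returns 0 for absent terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical comment vs. identifier tokens: shared vocabulary raises the score
comment_tokens = ["sort", "list", "ascending"]
code_tokens = ["sort", "list", "item"]
score = cosine_similarity(comment_tokens, code_tokens)
```

Identical vocabularies yield 1.0 and disjoint vocabularies yield 0.0, matching the slide's interpretation of a high QALP score.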
Entropy
´ Measures the complexity, the degree of disorder, or the amount of information in a data set
´ Let x_i be a term in document X, and p(x_i) the ratio of the count of occurrences of x_i to the total number of words in the document. Then the entropy H(X) is given by:
  H(X) = −Σ_i p(x_i) log_2 p(x_i)
´ Higher entropy indicates a more uniform distribution; lower entropy indicates a highly skewed distribution
Ref: A Simpler Model of Software Readability
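A minimal sketch of this definition in Python, computing H(X) from a document's token list (the tokens here are illustrative):

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy H(X) = -sum_i p(x_i) * log2 p(x_i) over the terms of a document."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A uniform distribution over four distinct terms gives the maximum 2 bits, while a document repeating a single term gives 0, matching the slide's interpretation.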
Halstead's Volume
´ Similar to the idea of entropy
´ Represents the minimum number of bits needed to naively represent the program, or the number of mental comparisons needed to write the program
´ Program length N = total number of operators + total number of operands
´ Program vocabulary n = number of distinct operators + number of distinct operands
´ Halstead Volume: V = N log_2 n
´ Greater volume indicates greater complexity
Ref: A Simpler Model of Software Readability
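The definition translates directly to code. This sketch assumes the operators and operands have already been extracted from the program (tokenization itself is out of scope here); the example tokens correspond to the statement `a = b + b`:

```python
import math

def halstead_volume(operators, operands):
    """V = N * log2(n), where N counts all occurrences and n counts distinct tokens."""
    N = len(operators) + len(operands)          # program length
    n = len(set(operators)) + len(set(operands))  # program vocabulary
    return N * math.log2(n)

# Tokens for `a = b + b`: operators {=, +}, operands {a, b, b}
volume = halstead_volume(["=", "+"], ["a", "b", "b"])  # N = 5, n = 4, V = 10.0
```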
AUC – Area Under the (ROC) Curve
´ Receiver Operating Characteristic (ROC) curve
´ True Positive Rate (Sensitivity): TP/(TP+FN)
´ False Positive Rate (1−Specificity): FP/(FP+TN)
´ The ROC curve is plotted by varying the discrimination threshold
´ All such curves pass through (0,0) and (1,1)
´ The point (0,1) represents perfect classification, and points on the ROC curve close to (0,1) represent good classifiers
Ref: A Simpler Model of Software Readability
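A compact way to compute AUC without tracing the curve is the rank interpretation: AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. This sketch uses that equivalence (the labels and scores are invented, and this O(P·N) loop is for illustration, not efficiency):

```python
def roc_auc(labels, scores):
    """AUC as P(score of random positive > score of random negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0 (the (0,1) corner of ROC space)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```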
Binary Classification with Logistic Regression
´ Supervised learning algorithm
´ Binary classification: "Not Readable" (0), "Readable" (1)
´ Takes real-valued inputs of some dimension n and predicts the probability of the input belonging to the default class (1). If the probability > 0.5, the predicted class is 1, else 0.
´ Probability = sigmoid(θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n), where sigmoid(z) = 1/(1 + e^(−z))
Training with Logistic Regression
´ Training involves finding the gradient of the error and updating the coefficient vector θ to better fit the model and improve accuracy over a number of iterations
´ Gradient Descent Step:
  θ_j := θ_j − (α/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
´ Here, m = total number of training examples, h_θ(x) = predicted output, y = actual labelled output, α = learning rate
´ When an optimal set of coefficients is found, the model is used to predict the class of previously unseen data points
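The prediction rule and the gradient descent step above can be sketched together in a few lines. This is a toy implementation on an invented one-feature dataset, not the paper's actual training setup (which uses many readability features and a wrapper strategy):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """P(y=1 | x) = sigmoid(theta_0 + theta_1*x_1 + ... + theta_n*x_n)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def gradient_descent_step(theta, X, y, alpha):
    """theta_j := theta_j - (alpha/m) * sum_i (h_theta(x_i) - y_i) * x_ij."""
    m = len(X)
    errors = [predict_proba(theta, x) - yi for x, yi in zip(X, y)]
    new_theta = list(theta)
    new_theta[0] -= alpha / m * sum(errors)                      # bias term, x_0 = 1
    for j in range(1, len(theta)):
        new_theta[j] -= alpha / m * sum(e * x[j - 1] for e, x in zip(errors, X))
    return new_theta

# Toy separable data: class 1 for larger feature values
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, 0.5)
predictions = [1 if predict_proba(theta, x) > 0.5 else 0 for x in X]
```

After training, the learned decision boundary sits between the two classes, so `predictions` recovers the labels.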
Overfitting
´ To reduce overfitting: reduce the number of features used to model the data
Feature Selection using the Wrapper Method
´ Create all possible subsets of size k from the feature vector
  ´ k is determined via cross-validation
´ Perform classification on each subset of features
´ The feature subset on which classification achieves the highest accuracy is chosen as the best feature representation
Ref: Large Scale Attribute Selection Using Wrappers
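The steps above can be sketched as an exhaustive search over size-k subsets. The classifier is abstracted into an `evaluate(subset) -> accuracy` callback; the feature names and the toy scoring function below are hypothetical stand-ins, not the paper's features or accuracies:

```python
from itertools import combinations

def wrapper_select(features, k, evaluate):
    """Score every size-k feature subset with evaluate() and keep the best one."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(features, k):
        score = evaluate(subset)  # in practice: cross-validated classifier accuracy
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical per-feature "usefulness" standing in for classifier accuracy
useful = {"lines": 0.7, "entropy": 0.6, "comments": 0.2}
evaluate = lambda subset: sum(useful[f] for f in subset)
best, score = wrapper_select(list(useful), 2, evaluate)
```

Note the cost: the number of subsets grows combinatorially in k, which is why wrapper methods are usually paired with small k or greedy search in practice.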
Center Selection
´ Used to select the 200 most representative methods for evaluation
´ Repeatedly draw an edge between the closest pair of points based on distance
  ´ In this case Euclidean distance – the square root of the sum of the squared differences between the vector components
´ Do not create edges between two points which are already in the same cluster -> hence single-link clusters
´ Once there are k connected components, stop the procedure
Ref: Algorithm Design by J. Kleinberg and É. Tardos
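The procedure above is Kruskal-style single-linkage clustering: sort all pairwise distances, merge the closest pair of clusters repeatedly, and stop at k connected components. A minimal sketch on invented 2-D points (the paper applies this to feature vectors of code snippets, not coordinates):

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link_clusters(points, k):
    """Merge the closest pair of points in different clusters until k clusters remain."""
    parent = list(range(len(points)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: euclidean(points[e[0]], points[e[1]]))
    clusters = len(points)
    for i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:          # skip edges inside an existing cluster
            parent[ri] = rj
            clusters -= 1
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = single_link_clusters(points, 2)  # two well-separated pairs
```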
Cronbach's alpha
´ Measures reliability – how well a test measures what it should
´ Measure of how closely items within a group are related
´ Used to measure the level of agreement among annotators on what readable code is
´ Can be written as a function of the number of items and the average inter-correlation among the items:
  α = (N · c̄) / (v̄ + (N − 1) · c̄)
´ N: number of items, c̄: average inter-item covariance, v̄: average variance
Ref: https://stats.idre.ucla.edu/spss/faq/what-does-cronbachs-alpha-mean
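A minimal sketch of that formula, assuming the ratings are given as one list per item with one score per respondent (the data layout and sample values are illustrative):

```python
def cronbach_alpha(items):
    """alpha = (N * c_bar) / (v_bar + (N - 1) * c_bar).

    items: N equal-length lists of scores, one list per item.
    """
    N = len(items)
    m = len(items[0])

    def mean(xs):
        return sum(xs) / len(xs)

    def cov(x, y):
        mx, my = mean(x), mean(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1)

    v_bar = mean([cov(x, x) for x in items])  # average item variance
    c_bar = mean([cov(items[i], items[j])     # average inter-item covariance
                  for i in range(N) for j in range(N) if i != j])
    return (N * c_bar) / (v_bar + (N - 1) * c_bar)

# Two items that move in perfect lockstep -> maximal internal consistency
alpha = cronbach_alpha([[1, 2, 3], [1, 2, 3]])
```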
Wilcoxon Test
´ Used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their mean ranks differ
´ Used to determine if the classification accuracy of the proposed model is significantly different from that of the other models
´ Algorithm:
  ´ Find the difference between each pair of values
  ´ Rank the absolute values of these differences, ignoring any "0" differences. Give the lowest rank to the smallest absolute difference. If two or more differences are equal, this is a "tie": tied scores get the average of the ranks those scores would have obtained had they been different from each other.
  ´ Re-apply the negative sign to the ranks of negative differences and add together all the rank scores – this sum is the test statistic W
  ´ N = number of non-zero differences
  ´ Look up the critical value W_c in a Wilcoxon table for α = 0.05 and the given N
  ´ If |W| is less than the critical value, the two samples are similar; if it is greater, they are significantly different
Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
Worked example:
´ Test statistic: |W| = |1.5 + 1.5 − 3 − 4 − 5 − 6 + 7 + 8 + 9| = 9
´ Critical value: W_c(α = 0.05, N = 9) = 6
´ Since |W| > W_c, the two datasets are significantly different
Ref: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
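The ranking-and-summing steps can be sketched as follows. This computes only the signed-rank statistic W, with tie-averaged ranks and zero differences dropped; looking up W_c in the table is left out, and the paired values are invented:

```python
def wilcoxon_w(pairs):
    """Signed-rank statistic W for paired samples.

    Drop zero differences, rank |differences| (ties share the average rank),
    re-apply the signs, and sum the signed ranks.
    """
    diffs = [a - b for a, b in pairs if a != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(ordered):
        end = pos
        # extend over a run of tied absolute differences
        while (end + 1 < len(ordered)
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[pos]])):
            end += 1
        avg = (pos + end) / 2 + 1  # average of 1-based ranks pos+1 .. end+1
        for idx in ordered[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return sum(r if d > 0 else -r for d, r in zip(diffs, ranks))

# Differences 1, -1, 2: the tie at |1| yields ranks 1.5, 1.5; W = 1.5 - 1.5 + 3
w = wilcoxon_w([(2, 1), (1, 2), (5, 3)])
```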
Cliff's Delta
´ Measure of how often the values in one distribution are larger than the values in a second distribution
´ Used to perform pairwise comparisons between the all-features model and the other models
´ δ = (#{x_1 > x_2} − #{x_1 < x_2}) / (n_1 · n_2), where x_1 and x_2 are scores within group 1 and group 2, and n_1 and n_2 are the sizes of the sample groups respectively
´ Ranges from 1, when all values from one group are higher than the values from the other group, to −1 when the reverse is true. Completely overlapping distributions have a Cliff's delta of 0.
Ref: http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S1657-92672011000200018
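The definition maps directly onto a pairwise count. A minimal sketch (the sample groups are invented; the O(n_1·n_2) double loop is for clarity, not efficiency):

```python
def cliffs_delta(xs, ys):
    """delta = (#{x > y} - #{x < y}) / (n1 * n2), ranging over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Fully separated groups hit the +1 / -1 extremes; identical groups give 0
delta = cliffs_delta([2, 3], [0, 1])
```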
P-value
´ The p-value is defined as the probability of obtaining, under the null hypothesis (H_0), a result equal to or more extreme than what was actually observed
´ The null hypothesis is a prediction of no difference – for example, "adding a particular feature to the input set makes no difference in determining readability"
´ The smaller the p-value, the higher the significance, because it tells the investigator that the null hypothesis under consideration may not adequately explain the observation
´ The hypothesis is rejected if this probability is less than or equal to a pre-defined threshold value α, referred to as the level of significance
Ref: https://www.statsdirect.com/help/basics/p_values.htm