
Improving Trace Accuracy through Data-Driven Configuration and Composition of Tracing Features



  1. Improving Trace Accuracy through Data-Driven Configuration and Composition of Tracing Features
     Barleen Kaur, COMP 762

  2. What is traceability?
     Traces are navigable links between data held in software artifacts (such as requirements documents, design documents, code, and test cases) that are otherwise disconnected [1].
     Requirements traceability matrix: a matrix that shows which pairs of artifacts are associated, and through which link.
     [1] Software and Systems Traceability, Jane Cleland-Huang et al.

  3. Algorithms for traceability
     Without tool support, developers must read through each document/artifact thoroughly and create the links by hand. Manual maintenance of traceability links is error-prone and laborious.
     Algorithms that semi-automate the creation of trace links:
     1) Vector space model (VSM)
     2) Probabilistic approaches
     3) Latent semantic indexing (LSI)
     4) Rule-based approaches, and so on
     Problem: different algorithms work best for different datasets.

  4. Motivation
     ● Manual creation and maintenance of trace links is error-prone and laborious.
     ● There is no one-size-fits-all solution: finding the best configuration of tracing techniques for a specific dataset can lead to significant improvements in the quality of the generated links.
     ● New techniques should be compared against the best configuration of existing techniques as a baseline, rather than against a single technique that performs inadequately on the dataset in question.
     Goal: find the best combination of existing traceability techniques to generate accurate trace links for the specific dataset at hand.

  5. Dynamic Trace Configuration (DTC): High-Level Architecture
     The best configuration is selected dynamically at run time for a specific dataset.
     [Figure: DTC architecture. Inputs: a training set of source and target artifacts with validated trace links, and the existing tracing techniques. Search: a genetic algorithm that searches the space of viable configurations intelligently. Output: the best/top configuration generating quality links for the given dataset and feature model.]
     Elements of DTC:
     1) Feature model
     2) Simulation environment that generates trace links from a configuration for a dataset
     3) Intelligent search algorithm

  6. Feature Model
     The feature model is represented using the Textual Variability Language (a non-graphical, text-based language, like C).
     Goals: to be scalable, succinct, modular, and comprehensible. It has its own grammar rules and semantics.
     Preprocessors:
     ● Acronym expander: "RBAC" -> "role-based access control"
     ● Stemmer: reduces inflected word forms to their base form, e.g. "bank", "banking" -> "bank"
     ● Stopper: removes commonly occurring words that do not convey significant meaning, such as "the" and "this"
     ● Dynamic stopper: generates the stopword list dynamically instead of using a precompiled list

  7. Feature Model
     Dictionary builder:
     ● Local tf-idf: based on the terms present in the artifacts
     ● ANC: based on the terms in the American National Corpus
     Trace algorithms: generate the trace links using VSM (tf-idf vector representation, then cosine similarity) or LSI (uses SVD to match queries and documents by meaning).
     Ordering of trace links:
     ● Ranked order: based on similarity scores
     ● Incremental approach: incremental user feedback determines the order
     ● Direct query manipulation: the query is modified until the user gets the desired results
     In addition to these, the "requires" and "constraints" relationships and the parameters of each feature are also captured.
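The VSM option above (tf-idf vectors, then cosine similarity, then ranked order) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the smoothed idf formula, and the example artifacts are all assumptions made for this sketch.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing (+1) so terms shared by all documents keep some weight.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_links(source_tokens, targets):
    """Rank target artifacts for one source artifact by similarity score.

    targets: {artifact_name: token list} (hypothetical artifact names).
    """
    vectors = tf_idf_vectors([source_tokens] + list(targets.values()))
    query, rest = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(targets, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical requirement and code artifacts, already preprocessed into tokens.
requirement = ["user", "login", "password"]
candidates = {
    "auth.py": ["login", "password", "check"],
    "report.py": ["print", "report"],
}
ranking = rank_links(requirement, candidates)
```

Here `ranking` lists the candidate links in ranked order; an artifact that shares no terms with the requirement gets a score of 0.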

  8. Evaluation of generated trace links
     Intuition behind Mean Average Precision (MAP): suppose we search for images of a flower in an image retrieval system. We get back a ranked list of images (from most likely to least likely), and usually not all of them are correct. So we compute the precision at every correctly returned image and then take the average.
     If the returned result is 1, 0, 0, 1, 1, 1, where 1 is an image of a flower and 0 is not, then the precision at each correct position is the number of correct images encountered up to that position (including the current one) divided by the total number of images seen up to that position: 1/1, 0, 0, 2/4, 3/5, 4/6. Averaging over the four correct positions gives (1 + 0.5 + 0.6 + 0.6667) / 4, so the AP for this example is 0.6917.
     For example, an AP of 0.5 could come from results like 0, 1, 0, 1, 0, 1, ..., where every second image is correct, while an AP of 0.333 could come from 0, 0, 1, 0, 0, 1, 0, 0, 1, ..., where every third image is correct.
     MAP is just an extension: the mean of the AP scores across many queries.
     Source: https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/
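The AP and MAP computation described above can be written in a few lines. This is a small sketch that follows the slide's definition (average over the correctly returned positions); the function names are chosen for this example only.

```python
def average_precision(relevance):
    """relevance: ranked list of 1 (correct) / 0 (incorrect) results."""
    hits = 0
    precisions = []
    for rank, is_correct in enumerate(relevance, start=1):
        if is_correct:
            hits += 1
            # Precision at this position: correct results so far / results seen so far.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """queries: list of ranked relevance lists, one per query."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)

ap_flower = average_precision([1, 0, 0, 1, 1, 1])        # the slide's example
ap_every_second = average_precision([0, 1, 0, 1, 0, 1])  # every second result correct
map_score = mean_average_precision([[1, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1]])
```

For trace links, each query would be one source artifact and each ranked list the candidate links returned for it, marked correct or incorrect against the validated links.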

  9. Pipe and filter architecture
     ● Components in the pipeline can be turned on or off to produce different configurations.
     ● Future work: dynamic sequencing of the preprocessor components and/or merging the output of multiple tracing techniques using voting techniques.
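A toy version of such a pipeline, where each filter can be switched on or off, might look like this. Everything here is an assumption for illustration: the filter names, the tiny stopword and acronym tables, and especially the crude suffix-stripping stemmer, which stands in for a real stemming algorithm.

```python
STOPWORDS = {"the", "this", "a", "an", "of", "to"}   # tiny illustrative list
ACRONYMS = {"rbac": "role based access control"}     # hypothetical acronym table

def expand_acronyms(tokens):
    """Replace each known acronym with the words of its expansion."""
    out = []
    for token in tokens:
        out.extend(ACRONYMS.get(token, token).split())
    return out

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(tokens):
    """Crude stand-in for a stemmer: strip a trailing 'ing' from longer words."""
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

# Fixed filter order; a configuration only decides which filters are enabled.
FILTERS = [
    ("acronyms", expand_acronyms),
    ("stopper", remove_stopwords),
    ("stemmer", naive_stem),
]

def run_pipeline(text, enabled):
    """Pipe the tokens through every enabled filter, in order."""
    tokens = text.lower().split()
    for name, filter_fn in FILTERS:
        if name in enabled:
            tokens = filter_fn(tokens)
    return tokens

all_on = run_pipeline("The RBAC banking module", {"acronyms", "stopper", "stemmer"})
only_stopper = run_pipeline("The RBAC banking module", {"stopper"})
```

Each distinct set of enabled filters corresponds to one configuration that the search can evaluate.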

  10. Genetic Algorithm
      A saying often attributed to Charles Darwin: "It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change."

  11. Intelligent search: Genetic Algorithm
      ● Initialisation: define the population, where each individual has its own set of chromosomes (binary strings).
      ● Fitness function: scores each chromosome so that chromosomes can be compared; the paper uses MAP as the fitness score.
      ● Selection: select fit chromosomes from the population to mate and produce healthy offspring. Always picking only the fittest, however, would make the chromosomes grow closer to one another within a few generations and so reduce diversity.
      Roulette wheel: divide the wheel into m sectors, where m is the number of chromosomes in the population. The area of each sector is proportional to the fitness value of its chromosome. The wheel is spun, and the chromosome whose sector stops in front of a fixed point is selected.
      Source: https://www.analyticsvidhya.com/blog/2017/07/introduction-to-genetic-algorithm/
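Roulette-wheel selection as described above can be sketched like this. The function name and the example population are assumptions for illustration; the fitness values stand in for MAP scores.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate wheel (no fitness at all): fall back to a uniform pick.
        return rng.choice(population)
    pick = rng.uniform(0, total)        # where the fixed point lands on the wheel
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness              # end of this individual's sector
        if pick < running:
            return individual
    return population[-1]               # guard against floating-point rounding

# Hypothetical population of three configurations with made-up fitness scores.
rng = random.Random(0)
counts = {"A": 0, "B": 0, "C": 0}
for _ in range(5000):
    counts[roulette_select(["A", "B", "C"], [0.0, 0.1, 0.9], rng)] += 1
```

Fitter individuals are chosen more often, but weaker ones with non-zero fitness still get picked occasionally, which helps preserve diversity.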

  12. Genetic Algorithm (contd.)
      Crossover: the reproduction step. A random crossover point is selected and the tails of the two chromosomes are swapped to produce new offspring. This is known as one-point crossover.
      Mutation: children do not have exactly the same traits as their parents. Mutation is a random tweak in the chromosome, which also promotes diversity in the population.
      The offspring are then evaluated with the fitness function and, if fit enough, replace the less fit chromosomes in the population.
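One-point crossover and bit-flip mutation on binary chromosomes can be sketched as follows. The function names and the default mutation rate are assumptions for this example, not values taken from the paper.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes at a random point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length, at least 2")
    point = rng.randint(1, len(parent_a) - 1)   # keep at least one gene from each side
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, rate=0.05, rng=random):
    """Flip each bit independently with probability `rate` (assumed default)."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(1)
zeros, ones = [0] * 8, [1] * 8
child_a, child_b = one_point_crossover(zeros, ones, rng)
mutant = mutate(child_a, rate=0.05, rng=rng)
```

In a feature-configuration setting, each bit could mean "this pipeline component is on or off", so a child inherits part of its configuration from each parent.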

  13. Stopping Criteria
      ● The population has not improved for x iterations (5 generations).
      ● A predefined absolute number of generations has been reached (the 60th generation).
      ● The fitness function has reached a predefined value.
      End result: the highest-performing configuration across all generations is selected.
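The three criteria can be combined into one check that runs after every generation. This is a sketch under assumptions: the function and parameter names are made up for this example, and "no improvement" is interpreted as the best fitness of the last 5 generations not exceeding the best seen before them.

```python
def should_stop(best_per_generation, max_generations=60, patience=5, target=None):
    """best_per_generation: best fitness (e.g. MAP) recorded in each generation so far."""
    generation = len(best_per_generation)
    if generation == 0:
        return False
    # Criterion: the predefined absolute number of generations is reached.
    if generation >= max_generations:
        return True
    # Criterion: the fitness has reached a predefined value.
    if target is not None and best_per_generation[-1] >= target:
        return True
    # Criterion: no improvement over the last `patience` generations.
    if generation > patience:
        recent_best = max(best_per_generation[-patience:])
        earlier_best = max(best_per_generation[:-patience])
        if recent_best <= earlier_best:
            return True
    return False

def best_overall(best_per_generation):
    """End result: the highest fitness seen across all generations."""
    return max(best_per_generation)

stalled = should_stop([0.40, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50])
improving = should_stop([0.10, 0.20, 0.30])
```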

  14. Hypothesis Testing
      Question: a company states that its straw machine makes straws that are 4 mm in diameter. A worker believes that the machine no longer makes straws of that size and samples 100 straws to perform a hypothesis test at the 99% confidence level.
      n = 100, c = 0.99, alpha = 1 - c = 0.01
      Null hypothesis: H0 => mean = 4 mm
      Alternative hypothesis: Ha => mean is not equal to 4 mm

  15. Source: https://www.youtube.com/watch?v=cW16A7hXbTo

  16. If the P-value is low, the null must go! (reject H0)
      If the P-value is high, the null must fly! (fail to reject H0)
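The decision rule can be applied to the straw example with a two-sided z-test, sketched below. The slides do not give the sample statistics, so the sample means and standard deviation used here are hypothetical, and treating the sample standard deviation as the population value (a z-test rather than a t-test) is an assumption of this sketch.

```python
import math

def two_sided_z_test(sample_mean, sample_std, n, mu0, alpha):
    """Return (z, p_value, reject_h0) for H0: mean == mu0 vs Ha: mean != mu0."""
    standard_error = sample_std / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, p_value < alpha

# n, mu0 and alpha come from the slide; the sample mean and std are hypothetical.
z_far, p_far, reject_far = two_sided_z_test(4.05, 0.1, n=100, mu0=4.0, alpha=0.01)
z_near, p_near, reject_near = two_sided_z_test(4.01, 0.1, n=100, mu0=4.0, alpha=0.01)
```

With the first hypothetical sample the P-value is far below alpha, so the null "must go"; with the second it is well above alpha, so we fail to reject H0.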
