
Discovering Information Explaining API Types Using Text (PowerPoint Presentation)

  1. Discovering Information Explaining API Types Using Text. Course Instructor: Dr. Jin Guo. Presented by: Sunyam Bagga

  2. TEXT CLASSIFICATION: classify [API type, section fragment] pairs as Relevant/Irrelevant. Source: https://www.python-course.eu/text_classification_introduction.php

  3. Technical Concepts: 1. RecoDoc tool 2. LOOCV 3. Maximum Entropy 4. Cosine similarity with tf-idf weighting 5. Kappa

  4. RecoDoc: “Recovering Traceability Links between an API and Its Learning Resources” [1]

  5. Aim: find API types referenced in a tutorial. RecoDoc identifies code-like terms (CLTs) and links each CLT to the exact API type it refers to. Example: “DateTime….such as year() or monthOfYear().” The goal is to precisely link code-like terms (e.g., year()) to specific code elements (e.g., DateTime.year()).

  6. Ambiguity ▪ Declaration Ambiguity: CLTs are rarely fully qualified. ▪ Overload Ambiguity: CLTs do not indicate the number/type of parameters when a method is overloaded. ▪ External Reference Ambiguity: a CLT may refer to code elements in external libraries. ▪ Language Ambiguity: human errors such as typos (HtttpClient), case errors, forgotten parameters, etc.

  7. Parsing Artifacts and Recovering Traceability Links - Linking Types: given a CLT, find all types in the codebase whose name matches the term, then disambiguate and filter the candidates (a minimal name-matching sketch follows below).
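
  A minimal sketch of the first linking step, collecting candidate types whose simple name matches a code-like term; the function, the type list, and the filtering comment are hypothetical illustrations, not RecoDoc's actual implementation.

      # Hypothetical illustration: gather candidate types for a code-like term (CLT).
      def candidate_types(clt, codebase_types):
          """Return fully qualified type names whose simple name matches the CLT."""
          simple = clt.rstrip("()").split(".")[-1]
          return [t for t in codebase_types if t.split(".")[-1] == simple]

      types = ["org.joda.time.DateTime", "java.sql.Date", "org.joda.time.LocalDate"]
      print(candidate_types("DateTime", types))  # ['org.joda.time.DateTime']
      # Disambiguation and filtering (e.g., by surrounding context) would follow.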

  8. LOOCV: “Evaluating a classifier’s performance” [2]

  9. Leave-One-Out Cross-Validation: each data point is held out once as the test set while the model is trained on all remaining points (a scikit-learn sketch follows below). Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
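
  A minimal LOOCV sketch, assuming scikit-learn is available; the classifier choice and the toy Iris dataset are illustrative only, not the data used in the paper.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import LeaveOneOut, cross_val_score

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000)
      # One fold per sample: train on n-1 points, test on the single held-out point.
      scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
      print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} folds")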

  10. MaxEnt Classifier: “Using Maximum Entropy for Text Classification” by Nigam et al. [3]

  11. Maximum Entropy: - A technique for estimating probability distributions from data. - Principle: without external knowledge, pick the distribution that has the maximum entropy (the most uniform one). - Labeled training data puts constraints on the distribution (a small classifier sketch follows below).
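
  In practice, a maximum-entropy text classifier is equivalent to (multinomial) logistic regression, so scikit-learn's LogisticRegression can stand in for it in a sketch; the four documents and their relevant/irrelevant labels below are made up for illustration.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      docs = ["DateTime exposes year() and monthOfYear()",
              "This paragraph only gives historical background",
              "Call year() on a DateTime instance to get the year",
              "The tutorial was written a long time ago"]
      labels = ["relevant", "irrelevant", "relevant", "irrelevant"]

      maxent = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
      maxent.fit(docs, labels)
      print(maxent.predict(["monthOfYear() returns the month of a DateTime"]))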

  12. Example Source: NLP by Dan Jurafsky and Chris Manning

  13. Add Noun feature: f1 = {NN, NNS, NNP, NNPS} Add Proper Noun feature: f2 = {NNP, NNPS} Source: NLP by Dan Jurafsky and Chris Manning

  14. Constraints and Features - The model distribution is constrained to have the same expected value for each feature as observed in the training data D (the constraint and model form are sketched below). - Features for text classification are defined over (document, class) pairs, e.g. per-class word-count features.
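
  A sketch of the standard maximum-entropy constraint and the resulting exponential model, roughly in the notation of Nigam et al. (the slide's exact notation may differ):

      % Constraint: the model's expected value of each feature f_i matches
      % the empirical value observed over the labeled training data D.
      \frac{1}{|D|} \sum_{d \in D} f_i(d, c(d))
        \;=\;
      \frac{1}{|D|} \sum_{d \in D} \sum_{c} P(c \mid d)\, f_i(d, c)

      % The maximum-entropy solution is exponential in the features:
      P(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_i \lambda_i f_i(d, c) \Big),
      \qquad Z(d) = \sum_{c'} \exp\Big( \sum_i \lambda_i f_i(d, c') \Big)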

  15. Cosine Similarity with tf-idf: “Comparison with Information Retrieval” [4]

  16. Tf-Idf - A technique to vectorise text data. - Term Frequency is a simple frequency count of a term in a document. - Inverse Document Frequency gives more weight to rare words (a common weighting scheme is sketched below).
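
  One common tf-idf weighting, shown as a sketch (several variants exist, and the slide may use a different one):

      % tf(t, d): count of term t in document d
      % N: number of documents; df(t): number of documents containing t
      \mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}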

  17. Cosine Similarity - Measures the cosine of the angle between the two tf-idf vectors, i.e. their dot product divided by the product of their norms. - They consider a section relevant if the similarity value is higher than a certain threshold (a small sketch follows below).
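
  A minimal sketch of the tf-idf plus cosine-similarity comparison, assuming scikit-learn; the two texts and any threshold value are illustrative, not taken from the paper.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      api_type_text = "DateTime year monthOfYear immutable datetime class"
      section = "The DateTime class provides year() and monthOfYear() accessors."

      vectors = TfidfVectorizer().fit_transform([api_type_text, section])
      sim = cosine_similarity(vectors[0], vectors[1])[0, 0]
      # The section counts as relevant if sim exceeds a chosen threshold.
      print(f"cosine similarity = {sim:.2f}")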

  18. Kappa score: “Annotating the Experimental Corpus” [5]

  19. Kappa formula: Kappa = (Po - Pe) / (1 - Pe) - Measures inter-annotator agreement. ▪ Po: observed agreement among annotators ▪ Pe: hypothetical probability of chance agreement ▪ More robust than a simple percent-agreement calculation

  20. Kappa Example: ▪ Po = (20+15) / 50 = 0.7 ▪ P(Yes) = 0.5*0.6 = 0.3 ▪ P(No) = 0.5*0.4 = 0.2 ▪ Pe = P(Yes) + P(No) = 0.5 ▪ Kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4 (a quick check follows below) Source: https://en.wikipedia.org/wiki/Cohen%27s_kappa
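
  A quick check of the worked example, assuming scikit-learn; the two label lists simply encode the 2x2 agreement table behind the numbers above (20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no).

      from sklearn.metrics import cohen_kappa_score

      rater_a = ["yes"] * 25 + ["no"] * 25
      rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
      print(cohen_kappa_score(rater_a, rater_b))  # ~0.4, matching the slide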

  21. Thanks! Any questions?
