Discovering Information Explaining API Types Using Text Classification
Course Instructor: Dr. Jin Guo
Presented by: Sunyam Bagga
TEXT CLASSIFICATION: Relevant/Irrelevant [API type, Section fragment]
Source: https://www.python-course.eu/text_classification_introduction.php
Technical Concepts
1. RecoDoc tool
2. LOOCV
3. Maximum Entropy
4. Cosine similarity with tf-idf weighting
5. Kappa
RecoDoc “Recovering Traceability Links between an API and Its Learning Resources” 1
Aim:
- Find API types referenced in a tutorial:
- Identify code-like terms (CLTs)
- Link these CLTs to the exact API type
“DateTime… such as year() or monthOfYear().”
- Precisely link code-like terms (e.g., year()) to specific code elements (e.g., DateTime.year())
Ambiguity
▪ Declaration Ambiguity: CLTs are rarely fully qualified.
▪ Overload Ambiguity: CLTs do not indicate the number/type of parameters when a method is overloaded.
▪ External Reference Ambiguity: CLTs may refer to code elements in external libraries.
▪ Language Ambiguity: human errors such as typos (HtttpClient), case errors, forgotten parameters, etc.
Parsing Artifacts and Recovering Traceability Links
- Linking types: given a CLT, find all types in the codebase whose name matches the term.
- Disambiguate and filter the candidates.
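The linking step above can be sketched as a simple name match against an index of qualified names. This is only an illustrative sketch, not RecoDoc's actual implementation; the codebase index and matching rule here are assumptions.

```python
# Hypothetical sketch of the "linking types" step: match a code-like term
# (CLT) against fully qualified names in a codebase index.

def candidate_links(clt, qualified_names):
    """Return all qualified names whose last segment matches the CLT."""
    term = clt.rstrip("()")                 # drop call parentheses: "year()" -> "year"
    return [q for q in qualified_names if q.split(".")[-1] == term]

# Toy index (assumed names); two candidates illustrate declaration ambiguity.
codebase = ["DateTime.year", "DateTime.monthOfYear", "Interval.year"]
print(candidate_links("year()", codebase))  # → ['DateTime.year', 'Interval.year']
```

Because two types declare `year`, the result set has more than one candidate, which is exactly why the disambiguation and filtering step follows.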
LOOCV “Evaluating a classifier’s performance” 2
Leave-one-out Cross Validation Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
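Leave-one-out cross-validation can be sketched in pure Python: train on all but one example, test on the held-out one, and repeat for every example. The toy 1-D data and the 1-NN classifier below are illustrative assumptions (the paper's own classifier is MaxEnt).

```python
# Minimal sketch of leave-one-out cross-validation (LOOCV) with a
# 1-nearest-neighbour classifier on assumed toy data.

def loocv_accuracy(xs, ys):
    """Hold out each point in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j != i]
        # 1-NN prediction: label of the closest remaining training point
        pred = min(train, key=lambda t: abs(t[0] - xs[i]))[1]
        correct += (pred == ys[i])
    return correct / len(xs)

xs = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
ys = ["rel", "rel", "rel", "irrel", "irrel", "irrel"]
print(loocv_accuracy(xs, ys))  # → 1.0 (each held-out point is nearest its own class)
```

With n examples this trains n models, which is why LOOCV is common for small corpora like an annotated tutorial dataset.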
MaxEnt Classifier “Using Maximum Entropy for Text Classification” by Nigam et al. 3
Maximum Entropy:
- Technique for estimating probability distributions from data
- Principle: without external knowledge, pick the distribution that has the maximum entropy (the most uniform one)
- Labeled training data puts constraints on the distribution
Example Source: NLP by Dan Jurafsky and Chris Manning
Add Noun feature: f1 = {NN, NNS, NNP, NNPS} Add Proper Noun feature: f2 = {NNP, NNPS} Source: NLP by Dan Jurafsky and Chris Manning
Constraints and Features
- Constrain the model distribution to have the same expected value for each feature as observed in the training data, D.
- Features for text classification: word-presence indicators per class.
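A binary maximum-entropy classifier is equivalent to logistic regression, and the constraint above shows up directly in training: the gradient is the empirical feature value minus the model's expected feature value. The toy documents and binary word features below are illustrative assumptions, not the paper's setup.

```python
# A maximum-entropy (logistic regression) text classifier trained by
# gradient ascent on assumed toy relevant(1)/irrelevant(0) documents.
import math

docs = [({"api", "method"}, 1), ({"api", "class"}, 1),
        ({"intro", "history"}, 0), ({"history", "overview"}, 0)]
vocab = sorted(set(w for ws, _ in docs for w in ws))

def p_relevant(words, w):
    """P(label=1 | doc) under the exponential (max-ent) model."""
    score = sum(w[i] for i, v in enumerate(vocab) if v in words)
    return 1.0 / (1.0 + math.exp(-score))

w = [0.0] * len(vocab)
for _ in range(200):                    # gradient ascent on the log-likelihood
    for words, y in docs:
        err = y - p_relevant(words, w)  # empirical minus expected feature value
        for i, v in enumerate(vocab):
            if v in words:
                w[i] += 0.5 * err

print(p_relevant({"api"}, w) > 0.5)     # "api" only occurs in relevant docs
```

At convergence the updates vanish, i.e. expected feature counts match the training data, which is the constraint the slide states.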
Cosine Similarity with tf-idf “Comparison with Information Retrieval” 4
Tf-Idf
- Technique to vectorise text data
- Term Frequency is a simple frequency count of a term in a document
- Inverse Document Frequency gives more weight to rare words
Cosine Similarity - Measures the cosine of the angle between the vectors: - They consider a section relevant if the similarity value is higher than a certain threshold.
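The tf-idf and cosine-similarity pipeline can be sketched in a few lines. The tiny corpus, the query, and the 0.1 relevance threshold below are illustrative assumptions; the paper's actual threshold is tuned, not fixed.

```python
# Sketch of tf-idf vectorisation plus cosine similarity with a
# relevance threshold, on an assumed toy corpus.
import math
from collections import Counter

corpus = ["datetime year month", "parse date string", "http client request"]
docs = [d.split() for d in corpus]

def tfidf(doc):
    """Map each term to tf * idf, with idf = log(N / document frequency)."""
    n = len(docs)
    return {t: c * math.log(n / sum(t in d for d in docs))
            for t, c in Counter(doc).items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tfidf("datetime year".split())
sims = [cosine(query, tfidf(d)) for d in docs]
relevant = [s > 0.1 for s in sims]      # assumed similarity threshold
print(relevant)                         # → [True, False, False]
```

Only the first document shares terms with the query, so it is the only one whose similarity clears the threshold.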
KAPPA score “Annotating the Experimental Corpus” 5
Kappa formula - Measures inter-annotator agreement. ▪ Po: observed agreement among annotators ▪ Pe: hypothetical probability of chance agreement ▪ More robust than simple percent agreement calculation
Kappa Example:
▪ Po = (20+15) / 50 = 0.7
▪ P(Yes) = 0.5 * 0.6 = 0.3
▪ P(No) = 0.5 * 0.4 = 0.2
▪ Pe = P(Yes) + P(No) = 0.5
▪ Kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4
Source: https://en.wikipedia.org/wiki/Cohen%27s_kappa
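The worked example above corresponds to a 2x2 agreement table (rows: annotator A, columns: annotator B) with 50 items; the counts below are the ones from the cited Wikipedia example that produce Po = 0.7 and Pe = 0.5.

```python
# Cohen's kappa computed from the slide's example agreement table.

def cohens_kappa(matrix):
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n    # observed agreement
    p_e = sum((sum(matrix[i]) / n) * (sum(row[i] for row in matrix) / n)
              for i in range(len(matrix)))                     # chance agreement
    return (p_o - p_e) / (1 - p_e)

counts = [[20, 5],    # A=Yes: B said Yes 20 times, No 5 times
          [10, 15]]   # A=No:  B said Yes 10 times, No 15 times
print(cohens_kappa(counts))  # ≈ 0.4, as in the worked example
```

Kappa of 0.4 despite 70% raw agreement shows why chance correction matters: half of the agreement is expected by chance alone.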
Thanks! Any questions?