
Michael Ryan, John Noecker Jr Evaluating Variations in Language Lab



  1. Michael Ryan, John Noecker Jr. Evaluating Variations in Language Lab, Duquesne University. {mryan, jnoecker}@jgaap.com

  2. Tools
     - JGAAP (Java Graphical Authorship Attribution Program): a modular test bed for authorship attribution methods.
     - All methods used are either available in JGAAP or are extensions of it.
     - Source code for the methods used in this experiment is available at jgaap.com.

  3. Mixture of Experts
     - Combined three authorship attribution techniques.
     - Each technique casts one vote for the author of the document.
     - If no author receives a majority of the votes, assume the author was not in the sample group.
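
The voting rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the JGAAP implementation: the function name `mixture_of_experts` and the use of `None` for "author not in the sample group" are assumptions made for the example, and "majority" is read as a strict majority of the votes cast.

```python
from collections import Counter


def mixture_of_experts(votes, unknown_label=None):
    """Return the author with a strict majority of votes, else unknown_label.

    `votes` is a list with one author label per expert (hypothetical input
    format). With three experts, a majority means at least two agree.
    """
    if not votes:
        return unknown_label
    author, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes cast.
    return author if count > len(votes) / 2 else unknown_label


# Two of three experts agree, so the majority author wins.
print(mixture_of_experts(["Austen", "Austen", "Dickens"]))  # Austen
# No majority: treat the author as outside the sample group.
print(mixture_of_experts(["Austen", "Dickens", "Twain"]))   # None
```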

  4. Centroid L1
     - Break documents into feature vectors of character 3-grams, using the relative frequencies of the 3-grams.
     - Build centroids for the known authors.
       - Take the average of that author's feature vectors.
     - Measure the L1 distance between each author's centroid and the unknown document's feature vector.
     - Assign the vote to the author whose centroid has the smallest L1 distance.
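
The Centroid L1 steps above can be sketched as follows. This is a hedged illustration in plain Python rather than JGAAP code: all function names are hypothetical, feature vectors are stored as sparse dictionaries, and an n-gram missing from a vector is treated as having frequency 0.

```python
from collections import Counter, defaultdict


def char_ngram_freqs(text, n=3):
    """Relative frequencies of the character n-grams in `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    return {g: c / len(grams) for g, c in Counter(grams).items()}


def centroid(vectors):
    """Average sparse feature vectors; absent features count as 0."""
    if not vectors:
        return {}
    summed = defaultdict(float)
    for vec in vectors:
        for gram, freq in vec.items():
            summed[gram] += freq
    return {gram: total / len(vectors) for gram, total in summed.items()}


def l1_distance(a, b):
    """Sum of absolute differences over every feature in either vector."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def centroid_l1_vote(known_docs, unknown_text, n=3):
    """Vote for the author whose centroid is nearest in L1 distance.

    `known_docs` maps author name -> list of document strings
    (a hypothetical input format chosen for this sketch).
    """
    centroids = {
        author: centroid([char_ngram_freqs(doc, n) for doc in docs])
        for author, docs in known_docs.items()
    }
    unknown = char_ngram_freqs(unknown_text, n)
    return min(centroids, key=lambda a: l1_distance(centroids[a], unknown))
```

Because the vectors hold relative frequencies, the L1 distance between any two non-empty vectors lies between 0 and 2, which keeps documents of different lengths comparable.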

  5. WEKA SMO
     - Break documents into feature vectors of character 3-grams, using the relative frequencies of the 3-grams.
     - Train WEKA's Sequential Minimal Optimization (SMO) support vector machine on the known authors' feature vectors.
     - SMO rates how similar each author is to the unknown document.
     - Assign the vote to the most similar author.

  6. Repeated Microdocument Analysis
     - Break all documents into 3,000-character chunks.
     - Reduce all contiguous whitespace to single spaces and convert all characters to lower case.
     - Break chunks into feature vectors of character 11-grams, using the relative frequencies of the 11-grams.
     - Generate centroids for the known authors.
       - Take the average of the author's feature vectors.
     - Measure the intersection distance between the author centroids and the chunks, assigning the closest centroid's author to each chunk.
     - Vote for the author who receives a majority of the chunks.
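
A rough Python sketch of the Repeated Microdocument Analysis pipeline follows. Several points are assumptions rather than details from the slides: the function names are hypothetical, a short trailing chunk is kept as-is, and the intersection distance is taken to be a Jaccard-style measure (one minus the shared n-grams divided by all n-grams in either vector). JGAAP's own definition of intersection distance may differ.

```python
import re
from collections import Counter, defaultdict


def prepare(text, size=3000):
    """Split into fixed-size chunks, then normalize each chunk."""
    raw = [text[i:i + size] for i in range(0, len(text), size)]
    # Collapse whitespace runs to one space and lower-case everything.
    return [re.sub(r"\s+", " ", chunk).lower() for chunk in raw]


def ngram_freqs(text, n=11):
    """Relative frequencies of the character n-grams in `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    return {g: c / len(grams) for g, c in Counter(grams).items()}


def centroid(vectors):
    """Average sparse feature vectors; absent features count as 0."""
    if not vectors:
        return {}
    summed = defaultdict(float)
    for vec in vectors:
        for gram, freq in vec.items():
            summed[gram] += freq
    return {gram: total / len(vectors) for gram, total in summed.items()}


def intersection_distance(a, b):
    """ASSUMED definition: 1 - |shared n-grams| / |all n-grams|."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 1.0
    return 1.0 - len(keys_a & keys_b) / len(union)


def rma_vote(known_docs, unknown_text, size=3000, n=11):
    """Label each unknown chunk with its nearest author, then take the majority.

    `known_docs` maps author -> list of document strings (hypothetical format).
    Returns None when no author wins a strict majority of the chunks.
    """
    centroids = {}
    for author, docs in known_docs.items():
        vecs = [ngram_freqs(c, n) for d in docs for c in prepare(d, size)]
        centroids[author] = centroid([v for v in vecs if v])
    tally = Counter()
    for chunk in prepare(unknown_text, size):
        vec = ngram_freqs(chunk, n)
        if not vec:  # chunk shorter than n characters: nothing to compare
            continue
        nearest = min(centroids,
                      key=lambda a: intersection_distance(centroids[a], vec))
        tally[nearest] += 1
    if not tally:
        return None
    author, count = tally.most_common(1)[0]
    return author if count > sum(tally.values()) / 2 else None
```

The `size` and `n` parameters default to the 3,000-character chunks and 11-grams named on the slide; smaller values are convenient when trying the sketch on toy inputs.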

  7. Author Diarization Method
     - Break documents into paragraphs.
     - Extract named entities from the paragraphs.
     - Group paragraphs that have named entities in common.
     - Assume each group is an author.
     - Use the grouped paragraphs as known chunks for Repeated Microdocument Analysis and the ungrouped paragraphs as unknowns.
     - Add the ungrouped paragraph that is closest to a group to that group, and re-run the analysis until all paragraphs are grouped.
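
The initial grouping step can be sketched as below. This covers only the "group paragraphs with named entities in common" stage and makes two assumptions not stated on the slide: named entities have already been extracted into one set of strings per paragraph, and grouping is transitive (if paragraphs 0 and 2 share one entity and 2 and 4 share another, all three land in one group). The function name is hypothetical, and the iterative reassignment with Repeated Microdocument Analysis is not shown.

```python
from collections import defaultdict


def group_by_shared_entities(paragraph_entities):
    """Group paragraph indices that share named entities.

    `paragraph_entities` is a list of sets, one per paragraph.
    Returns (groups, ungrouped): groups of two or more paragraph indices,
    and the indices of paragraphs that share no entity with any other.
    """
    parent = list(range(len(paragraph_entities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # entity -> first paragraph index that mentioned it
    for idx, entities in enumerate(paragraph_entities):
        for entity in entities:
            if entity in first_seen:
                parent[find(idx)] = find(first_seen[entity])
            else:
                first_seen[entity] = idx

    clusters = defaultdict(list)
    for idx in range(len(paragraph_entities)):
        clusters[find(idx)].append(idx)

    groups = [members for members in clusters.values() if len(members) > 1]
    ungrouped = [members[0] for members in clusters.values() if len(members) == 1]
    return groups, ungrouped


entities = [{"Alice"}, {"Bob"}, {"Alice", "Carol"}, set(), {"Carol"}, {"Bob"}]
groups, ungrouped = group_by_shared_entities(entities)
print(groups)     # [[0, 2, 4], [1, 5]]
print(ungrouped)  # [3]
```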

  8. Results

     Problem   Number Correct   Total   Accuracy
     A                      6       6       100%
     B                      7      10        70%
     C                      7       8      87.5%
     D                     10      17      58.8%
     E                     83      90      92.2%
     F                     77      80      96.3%
     I                     12      14      85.7%
     J                     12      16      75.0%
     Total                214     241      88.8%

  9. Conclusions
     - These methods show promise, with a document accuracy of 88.8% and a mean accuracy of 83.2%, which placed first and third in the competition, respectively.
     - The methods performed poorly on open-class problems because they were developed with only closed-class problems in mind; removing the open-class portions raises the two accuracies to 91.6% and 88.5%.
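
The two headline figures can be reproduced from the per-problem results on the Results slide. The sketch below reads "document accuracy" as total correct documents over total documents, and "mean accuracy" as the unweighted average of the per-problem accuracies; that reading is an interpretation, though it does match the reported numbers. The open-class figures are not recomputed here because the slides do not say which documents were open-class.

```python
# (correct, total) per problem, taken from the Results slide.
results = {
    "A": (6, 6),
    "B": (7, 10),
    "C": (7, 8),
    "D": (10, 17),
    "E": (83, 90),
    "F": (77, 80),
    "I": (12, 14),
    "J": (12, 16),
}

total_correct = sum(correct for correct, _ in results.values())
total_docs = sum(total for _, total in results.values())

# Document accuracy: every document counts equally.
document_accuracy = total_correct / total_docs

# Mean accuracy: every problem counts equally, regardless of its size.
mean_accuracy = sum(c / t for c, t in results.values()) / len(results)

print(f"{total_correct}/{total_docs}")  # 214/241
print(f"{document_accuracy:.1%}")       # 88.8%
print(f"{mean_accuracy:.1%}")           # 83.2%
```

The gap between the two numbers comes from weighting: the large problems E and F (170 of the 241 documents) score above 92%, which lifts the document accuracy, while the small, harder problems B and D pull down the per-problem mean.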

  10. Future Work
     - Refine the analysis of open-class problems by examining how different experts perform in identifying them and how many experts it takes to reach a conclusion.
