Automated Topic Naming to Support Cross-project Analysis of Software Maintenance Activities
Abram Hindle, Dept. of Computer Science, University of California, Davis, Davis, CA, USA, abram@softwareprocess.es
Neil A. Ernst, Dept. of Computer Science, University of Toronto, Toronto, Ontario, CANADA, nernst@cs.toronto.edu
Michael W. Godfrey, David Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, CANADA, migod@uwaterloo.ca
John Mylopoulos, Dept. Information Eng. and Computer Science, University of Trento, Trento, ITALY, jm@disi.unitn.it
US NSF SHF Medium 0964703 1
Who Cares About Quality? Managers Developers New Developers Investors Customers 2
What is this commit about? Added a test for bug #1326 on OSX 3
What is this commit about? Added a test for bug #1326 on OSX. Labels: Maintainability, Reliability, Portability 5
But we have many commits.. [Labels: Maintainability, Reliability, Portability] 6
Developer Topics [Diagram: commits grouped by LDA / LSI into developer topics; what is each topic's purpose? e.g. Maintainability, Reliability] 7
Cross Project Relevance [Diagram: several version control repositories with shared concepts: efficiency, usability, reliability, functionality (includes correctness), maintainability, and portability] 8
Quality-related Non Functional Requirements (NFRs): portability, reliability, functionality (includes correctness), usability, efficiency, and maintainability [iso9126] [cleland-huang03] [ernst10] 9
Can't we just summarize quality related efforts within this project? [Diagram: Software Repositories; Revisions of Source Code, Build / Configuration, Tests, and Documentation; plotted over time against Non-Functional Requirements: Maintainability, Functionality, Portability, Efficiency, Usability, Reliability] 10
Labelled Developer Topics [Plot: unique topics (y-axis) over time in months (x-axis)] 11
Labelled Developer Topics [Same plot, with example topics: Linux Kernel, Windows, AMD64] 12
Labelled Developer Topics [Same plot, with topics labelled: efficiency, portability, functionality, maintainability, reliability] 13
Example [Blei] apologies to those with prior LDA/LSI experience 14
Opinion Arts International News 15
[Diagram: the Arts and International News sections, each containing articles] 16
What if we didn't know what section the articles were in? [Diagram: a pile of unlabelled articles] 17
[Diagram: articles passed to LDA / LSI] 18
[Diagram: more articles passed to LDA / LSI] 19
Word Distribution. Article: dog, cat, car, city, pound, festival, street, mischief. Documents are represented as word distributions (word counts). [LDA / LSI] 20
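As a minimal sketch of the "documents as word counts" idea, assuming a simple lowercase tokenizer (the function name `word_distribution` and the sample sentence are illustrative, not taken from the slides):

```python
import re
from collections import Counter


def word_distribution(text):
    """Return a word -> count mapping for one document (hypothetical helper)."""
    # Assumption: lowercase everything and keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Illustrative article text built from words on the slide.
article = "The dog chased the cat down the street; the dog ended up in the city pound."
dist = word_distribution(article)

# Show the most frequent words of this one document.
print(dist.most_common(3))
```

A real pipeline would likely also strip stop words and very rare words before handing the counts to LDA or LSI.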
Word Distributions. Topics: Independent Word Distributions. LDA finds independent word distributions that the documents are related to. Documents can be associated with more than one topic. [LDA / LSI] 21
Original Article Word Distributions: Baseball; Movie; Athlete and Actor Award Nominees; Theatre Review. Topics (Independent Word Distributions): Sports; Entertainment. 22
Documents are represented as a linear combination of independent topics. Topics: Independent Word Distributions. $\text{Athlete and Actor} \approx C_0 \times \text{Sports} + C_1 \times \text{Entertainment}$ 23
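The linear-combination view can be sketched in a few lines. The topic probabilities and the weights (standing in for C_0 and C_1) below are made-up illustrative values, and `mix_topics` is a hypothetical helper, not part of any LDA library:

```python
def mix_topics(topics, weights):
    """Combine topic word distributions using one weight per topic."""
    mixed = {}
    for topic, weight in zip(topics, weights):
        for word, prob in topic.items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# Assumed toy topics; each is a probability distribution over words.
sports = {"play": 0.4, "game": 0.3, "player": 0.3}
entertainment = {"play": 0.2, "movie": 0.5, "theatre": 0.3}

# An "Athlete and Actor" article drawn half from each topic (C_0 = C_1 = 0.5).
athlete_and_actor = mix_topics([sports, entertainment], [0.5, 0.5])
print(athlete_and_actor)
```

Because the weights sum to one and each topic is a distribution, the mixed result is again a distribution over words.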
Here are two topics. I don't know what they are about! These word lists look like: Sports and Entertainment!
Topic 1: * play * game * inning * player * quarter * opponent * ...
Topic 2: * gambling * play * night life * comedy * movie * theatre * ...
[Diagram: articles passed to LDA / LSI] 24
Word bag analysis Usability Maintainability Portability Reliability Efficiency 26
Word Bag Examples
Portability: portability, transferability, interoperability, documentation, internationalization, i18n, ...
Reliability: reliability, failure, error, redundancy, fails, bug, ...
27
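A minimal sketch of word-bag labelling, using only the example words shown on this slide. The matching rule (a topic receives a label if any of its top words appears in that label's bag) is an assumption made for illustration; the slides do not spell out the exact rule:

```python
# Word bags copied from the slide; the real bags are longer (the slide ends in "...").
WORD_BAGS = {
    "portability": {"portability", "transferability", "interoperability",
                    "documentation", "internationalization", "i18n"},
    "reliability": {"reliability", "failure", "error", "redundancy",
                    "fails", "bug"},
}


def label_topic(top_words, word_bags=WORD_BAGS):
    """Return the sorted labels whose word bag overlaps the topic's top words."""
    words = {w.lower() for w in top_words}
    return sorted(label for label, bag in word_bags.items() if words & bag)


# Hypothetical top words of a developer topic.
print(label_topic(["bug", "test", "i18n", "osx"]))
```

A topic whose top words match no bag comes back with an empty list, which corresponds to an unlabelled topic.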
Labelled Topics of MaxDB 7.500 [Plot: unique topics (y-axis) over time in months (x-axis); topic labels: efficiency, portability, functionality, maintainability, reliability] 28
MaxDB 7.500 Timeline Maintainability Maintainability Maintainability Portability Portability Efficiency Reliability Efficiency 29
Topics of MySQL 3.23 [Plot: unique topics (y-axis) over time in months (x-axis); topic tags include functionality, usability, efficiency, portability, reliability, functionality/portability, reliability/usability, and maintainability/reliability/portability] 30
MySQL 3.23 Timeline Maintainability Maintainability Functionality Functionality Functionality Reliability Efficiency Portability Portability 31
ROC Values of Semi-Supervised Word Bags [Bar chart: ROC (y-axis, 0.4 to 0.8) per NFR (x-axis: portability, efficiency, reliability, functionality, maintainability, usability, total NFR); series: MaxDB exp2, MySQL exp2, MaxDB exp3, MySQL exp3] 32
Supervised Tags 33
Supervised Multitag Classifiers: MySQL and MaxDB [Two bar charts, one per project: micro and macro scores (y-axis, 0.4 to 1) for the classifiers BR, CLR, and HOMER] 34
Conclusions [Summary diagram: Developer Topic Analysis and Labelling (LDA / LSI) [Hindle09ICSM] over version control revisions; shared concepts across projects: efficiency, usability, reliability, functionality (includes correctness), maintainability, portability; stakeholders: Managers, Core Developers, New Developers, Investors and Acquisitions, Customers] http://softwareprocess.es/name/ 35
F-1 Measure of Semi-Supervised Word Bags [Bar chart: F-1 (y-axis, 0 to 0.8) per NFR (x-axis: portability, efficiency, reliability, functionality, maintainability, usability, total NFR); series: MaxDB exp2, MySQL exp2, MaxDB exp3, MySQL exp3] 36
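For readers unfamiliar with the F-1 measure plotted here, the sketch below computes it for a single NFR label from predicted versus manually assigned labels. The sample data is invented for illustration and does not reproduce any result from the study:

```python
def f1_score(predicted, actual):
    """F-1 (harmonic mean of precision and recall) for one label.

    `predicted` and `actual` are parallel lists of booleans, one per topic.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # true positives
    fp = sum(1 for p, a in pairs if p and not a)    # false positives
    fn = sum(1 for p, a in pairs if not p and a)    # false negatives
    if tp == 0:
        return 0.0  # convention: no correct positives means F-1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical "portability" labels for five topics.
word_bag_says = [True, True, False, False, True]
annotator_says = [True, False, True, False, True]
print(f1_score(word_bag_says, annotator_says))
```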
Many Documents Topic 1 Topic 10 Few Documents Topic 20 37
Annotation: Stop Words. MaxDB 7.500 Case Study: 2 long trends instead of one; topics joined due to similarity. [Callouts: STOP words] 38
Annotation: Training Sets Maintainability+ Version Control Maintainability- 39
Annotation: Stop Words [Word cloud of STOP words] Used in topic analysis or to reduce # of features for learners. 40
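Stop-word filtering before topic analysis can be sketched as follows. The stop list here is a tiny hypothetical one chosen for illustration (the actual list is in the word cloud and is not recoverable from this transcript), and whitespace tokenization is an assumption:

```python
# Hypothetical stop list; the study's real list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "for", "of", "on", "to"}


def remove_stop_words(message, stop_words=STOP_WORDS):
    """Lowercase a commit message, split on whitespace, and drop stop words."""
    return [w for w in message.lower().split() if w not in stop_words]


# The commit message from the earlier slide.
tokens = remove_stop_words("Added a test for bug #1326 on OSX")
print(tokens)
```

Dropping these words shrinks the vocabulary, which is what reduces the number of features seen by the learners.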
Annotation: Training Sets [Diagram: version control messages are labelled Maintainability+ / Maintainability- automatically (AUTO); MANUAL: sample and correct] 41
Message Word Distribution Top 10 Words: Topic * perforce * bug # * POSIX * Opteron * ... Trend 42