
Using ML to Design a Flexible LOC Counter

MaLTeSQuE 2017, Feb 21st, 2017, Klagenfurt. Using ML to Design a Flexible LOC Counter. Mirosław Ochodek, Miroslaw Staron, Dominik Bargowski, Wilhelm Meding, Regina Hebig. Workshop on Machine Learning Techniques for Software Quality Evaluation


  1. MaLTeSQuE 2017, Feb 21st, 2017, Klagenfurt
     Using ML to Design a Flexible LOC Counter
     Mirosław Ochodek, Miroslaw Staron, Dominik Bargowski, Wilhelm Meding, Regina Hebig
     Workshop on Machine Learning Techniques for Software Quality Evaluation

  2. Software size
     • Size is used for: cost prediction, productivity, metrics normalization
     • Defects density = #Defects / Size

  3. The Problem
     [Slide background: first page of a paper by Miroslaw Staron, Darko Durisic, and Rakesh Rana (University of Gothenburg; Volvo Car Group) on using calibration to find systematic measurement error in lines-of-code measurement]
     • Four tools
     • Error (vs. median): up to ~20%
     • Output: 2512 LOC
     • Introduces (unknown) measurement error, problems with reliability of the measurement, difficulties in measuring multi-language code base…

  4. Potential solutions
     A tool based on Programming Language (PL) parsers
     • Explicitly known rules for counting that can be somehow formulated
     • 100% accurate according to the rules
     • Requires implementation for each PL
     • Can be also implemented to allow for some configuration of rules (however, probably somehow limited)
     A machine learning (ML) approach
     • It is difficult to explicitly define the rules (either not known or too complex)
     • Learns from examples (requires a training set)
     • Classification error depending on the quality of the training set
     • Doesn't require a new implementation for a new language (however, may require a new training set)

  6. Idea of the solution
     • Flexible lines of code counter (CCFlex)
       – A user teaches the tool which lines should be counted based on a sample (a training set)
     [Figure labels: 10 LOC; Justification]

  7. Idea of the solution

  8. Feature acquisition
     Each line is characterized by a set of features and its decision class (count or ignore). We parse the text to extract those features.

     File type   #Characters   If     …   Decision class
     java        25            TRUE   …   Count
     …           …             …      …   …
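The idea of pairing feature rows with a decision class can be illustrated with a minimal sketch. The slides do not name the classification algorithm at this point, so a 1-nearest-neighbour rule is used purely as a stand-in; the feature names (`characters`, `has_if`), the functions `distance` and `classify`, and the second training row are hypothetical.

```python
from typing import Dict, List, Tuple

# One training row: (numeric features of a line, decision class).
Row = Tuple[Dict[str, float], str]


def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Squared Euclidean distance; a missing feature is treated as 0."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def classify(features: Dict[str, float], training_set: List[Row]) -> str:
    """Return the decision class of the closest labelled line (1-NN stand-in)."""
    _, label = min(training_set, key=lambda row: distance(features, row[0]))
    return label


training_set: List[Row] = [
    # Row shown on the slide: 25 characters, "If" is TRUE, class Count.
    ({"characters": 25, "has_if": 1}, "count"),
    # Hypothetical row (not on the slide): an empty line that is ignored.
    ({"characters": 0, "has_if": 0}, "ignore"),
]

print(classify({"characters": 22, "has_if": 1}, training_set))
```

Any classifier that accepts labelled feature vectors could replace `classify` here; the point is only that the user supplies labelled lines rather than counting rules.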

  9. Feature acquisition
     • Plain text (F01-F04):
       – File extension
       – Full and trimmed length (characters)
       – Tokens
     • Programming language (F05-F19):
       – Assignment,
       – Brackets,
       – Class,
       – Comment,
       – Semicolons,
       – …

     ID    Name             Type      Description
     F01   File extension   Nominal   The extension of the file (e.g., java, cpp, etc.).
     F02   Full length      Numeric   The number of characters in the line.
     F03   Length           Numeric   The number of characters in the line after removing all leading and trailing white characters.
     F04   Tokens           Numeric   The number of tokens in the line (the line is split based on white characters).
     F05   Semicolons       Numeric   The number of semicolons in the line.
     F06   Comments         Boolean   The line includes any of //, /*, */ or after trimming starts with *.
     F07   Assignments      Numeric   The number of single assignment signs in the line (=).
     F08   Brackets         Numeric   The number of brackets: ( , ) in the line.
     F09   Square brackets  Numeric   The number of square brackets: [ , ] in the line.
     F10   Curly brackets   Numeric   The number of curly brackets: { , } in the line.
     F11   Class            Boolean   The word "class" appears in the line.
     F12   For              Boolean   The word "for" appears in the line.
     F13   If               Boolean   The word "if" appears in the line.
     F14   While            Boolean   The word "while" appears in the line.
     F15   Case             Boolean   The word "case" appears in the line.
     F16   Try              Boolean   The word "try" appears in the line.
     F17   Catch            Boolean   The word "catch" appears in the line.
     F18   Expect           Boolean   The word "expect" appears in the line.
     F19   Member access    Numeric   Counts member accessors: . or ->
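The feature table above could be computed for a single line roughly as follows. This is a sketch under assumptions, not the CCFlex implementation: the function name `extract_features` and the dictionary keys are hypothetical, keywords are matched as whole words, and a "single assignment sign" is taken to mean an `=` that is not part of `==`, `!=`, `<=` or `>=`. The line is expected without its trailing newline.

```python
import os
import re

# Keywords behind the Boolean features F11-F18, in table order.
KEYWORDS = ["class", "for", "if", "while", "case", "try", "catch", "expect"]


def extract_features(file_name: str, line: str) -> dict:
    """Compute features F01-F19 for one physical line of a source file."""
    trimmed = line.strip()
    features = {
        "F01_file_extension": os.path.splitext(file_name)[1].lstrip(".").lower(),
        "F02_full_length": len(line),
        "F03_length": len(trimmed),
        "F04_tokens": len(line.split()),  # split on white characters
        "F05_semicolons": line.count(";"),
        "F06_comments": any(m in line for m in ("//", "/*", "*/"))
        or trimmed.startswith("*"),
        # Assumption: '=' counts only when it is not part of ==, !=, <=, >=.
        "F07_assignments": len(re.findall(r"(?<![=!<>])=(?!=)", line)),
        "F08_brackets": line.count("(") + line.count(")"),
        "F09_square_brackets": line.count("[") + line.count("]"),
        "F10_curly_brackets": line.count("{") + line.count("}"),
    }
    # Assumption: "the word appears" means a whole-word match.
    for number, word in enumerate(KEYWORDS, start=11):
        features[f"F{number}_{word}"] = re.search(rf"\b{word}\b", line) is not None
    features["F19_member_access"] = line.count(".") + line.count("->")
    return features


example = extract_features("Foo.java", "    int x = a.b;  ")
print(example["F01_file_extension"], example["F03_length"], example["F04_tokens"])
```

A plain substring test would also match the table's wording for F11-F18; the whole-word match is chosen here only to avoid hits such as "if" inside "notify".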

  10. Feature acquisition
      • Bag of words approach (automatic)
        – Tokenize: ()[]{}!@#$%^&*-=;:'"\|`~,.<>/?
        – Treat split character as a token
        – Calculate thresholds:
          • Frequencies of tokens in the code base (min. 5)
          • % of files a token is present in (min. 25%)
        – If thresholds are met:
          • F_i: the number of times the token i occurs in a line
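The bag-of-words steps above can be sketched as follows. The names `tokenize`, `select_tokens` and `line_features` are hypothetical, the separator set is a best-effort reading of the slide, and both thresholds are assumed to be inclusive (at least 5 occurrences, at least 25% of files).

```python
import re
from collections import Counter
from typing import Dict, List

# Separator characters from the slide; each one is also kept as a token.
SEPARATORS = "()[]{}!@#$%^&*-=;:'\"\\|`~,.<>/?"
_SEP_CLASS = re.escape(SEPARATORS)
# A token is either one separator character or a run of other non-space characters.
_TOKEN_RE = re.compile(f"[{_SEP_CLASS}]|[^\\s{_SEP_CLASS}]+")


def tokenize(line: str) -> List[str]:
    return _TOKEN_RE.findall(line)


def select_tokens(files: Dict[str, List[str]],
                  min_frequency: int = 5,
                  min_file_share: float = 0.25) -> List[str]:
    """Keep tokens that meet both thresholds across the code base."""
    frequency: Counter = Counter()   # occurrences in the whole code base
    file_count: Counter = Counter()  # number of files containing the token
    for lines in files.values():
        tokens_in_file = [t for line in lines for t in tokenize(line)]
        frequency.update(tokens_in_file)
        file_count.update(set(tokens_in_file))
    return sorted(
        token for token in frequency
        if frequency[token] >= min_frequency
        and file_count[token] / len(files) >= min_file_share
    )


def line_features(line: str, vocabulary: List[str]) -> List[int]:
    """F_i: the number of times token i occurs in the line."""
    counts = Counter(tokenize(line))
    return [counts[token] for token in vocabulary]


# Hypothetical toy code base, only to exercise the functions.
files = {
    "A.java": ["x = 1;"] * 3,
    "B.java": ["x = 2;"] * 2,
    "C.java": ["y"],
    "D.java": ["y"],
}
vocabulary = select_tokens(files)
print(vocabulary, line_features("x = x;", vocabulary))
```

In the toy code base, `y` appears in half of the files but only twice overall, so it fails the frequency threshold and is left out of the vocabulary.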

  11. Preliminary validation
      • RQ1: What level of prediction quality can be achieved by the proposed approach?
      • RQ2: How does automatic feature acquisition affect the classification quality?
      • RQ3: How does the choice of classification algorithm affect the classification quality?

  12. Code databases
      • 2402 physical lines of code in total
        – Eclipse: 475 LOC
        – Jasper Reports: 757 LOC
        – Spring MVC: 1170 LOC
      • ELOC (Count 1492 / Ignore 910)
      • Subjective (Count 1237 / Ignore 1165)
