Using software trails to recover the evolution of software



  1. Using software trails to recover the evolution of software
     3rd ELISA 2003
     Daniel M. German
     Software Engineering Group, University of Victoria, Canada
     September 23, 2003
     Version: 1.0.0

  2. Introduction
     • By using tools that become vital to the success of a project, its history is being recorded in software trails:
       – Configuration management systems (including version control and defect management systems)
       – Mailing lists
       – ChangeLogs

  3. Evolution
     • The initial objective of this research was to try to recover the evolution of Evolution using its software trails
       – It is the Outlook of the GNOME project
       – Almost 4 years of development
       – It is becoming one of the free mail clients
       – Unlike many other OSS projects:
         ∗ It started as a group project, with its software requirements drawn before the code was written
         ∗ It has been driven by one company: Ximian (recently bought by Novell)

  4. Methodology
     • Define a schema that represents and correlates software trails
     • Gather the trails:
       – Recover the trails and map them to the schema
       – Trails are usually available as logs and history reports
     • Extend the information:
       – Combine the available information, creating new facts
       – It might require some heuristics
     • Analyze:
       – Using query languages and visualization tools
       – It is a time-consuming task
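The first two steps above (define a schema, then map the gathered trails onto it) could be sketched as follows. This is a minimal illustration, not the schema used in the talk: the class names `Developer`, `FileRevision`, `ChangeLogEntry`, and `TrailStore`, and all of their fields, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Developer:
    """One person, with the identifiers each trail uses for them (hypothetical)."""
    name: str
    cvs_ids: frozenset = frozenset()
    emails: frozenset = frozenset()


@dataclass
class FileRevision:
    """One revision of one file, as a version control log might report it."""
    path: str
    revision: str
    author_id: str
    when: datetime
    log: str


@dataclass
class ChangeLogEntry:
    """One ChangeLog entry: who changed which files, and why."""
    when: datetime
    author_email: str
    files: list
    text: str


@dataclass
class TrailStore:
    """Holds the gathered trails so they can be correlated and queried."""
    revisions: list = field(default_factory=list)
    changelog: list = field(default_factory=list)

    def revisions_by(self, dev):
        # Correlate version control records with a person via their CVS ids.
        return [r for r in self.revisions if r.author_id in dev.cvs_ids]

    def changelog_by(self, dev):
        # Correlate ChangeLog records with the same person via e-mail addresses.
        return [e for e in self.changelog if e.author_email in dev.emails]
```

Keeping the person separate from the identifiers each trail records is what lets a query such as `store.revisions_by(dev)` and `store.changelog_by(dev)` refer to the same developer across trails.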

  5. Is this info useful?
     • The most important question: can we trust this information?
     • The answer: it depends
     • Some projects establish clear guidelines (and follow them) on how to use these tools.
       – IBM uses a Configuration Management System that tracks several trails
       – Many free/Open Source software projects use a toolkit based on CVS, Bugzilla, mailman, following a set of de facto standards

  6. Evolution Trails
     • This paper uses information from:
       – ChangeLogs: “explain how earlier versions of software were different from the current version.”
       – CVS: the most popular version control system
         ∗ Keeps track of who modifies what, and when; supports branching
         ∗ It does not support transaction-oriented operations
       – Mailing lists
         ∗ For developers and for users
       – Source code releases
     • In several cases, it was necessary to reverse engineer their formats
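Because CVS records each file revision separately, with no transaction tying together the files changed in one commit, a multi-file change has to be reconstructed. One common heuristic, shown here as a sketch and not necessarily the procedure used in this work, groups revisions that share an author and log message and follow each other within a short time window. The function name `group_into_mrs` and the 180-second window are assumptions for illustration.

```python
from datetime import datetime, timedelta


def group_into_mrs(revisions, window=timedelta(seconds=180)):
    """Group per-file revisions into candidate multi-file changes.

    revisions: iterable of (author, log, when, path) tuples.
    Two revisions join the same group when they have the same author and
    log message and the later one is within `window` of the previous
    revision already in that group (a sliding window).
    """
    open_groups = {}  # (author, log) -> most recent group for that key
    groups = []
    for rev in sorted(revisions, key=lambda r: r[2]):
        author, log, when, _path = rev
        key = (author, log)
        group = open_groups.get(key)
        if group is not None and when - group[-1][2] <= window:
            group.append(rev)
        else:
            group = [rev]
            open_groups[key] = group
            groups.append(group)
    return groups
```

The window slides: each new revision is compared with the latest one in its group, so a long commit over a slow connection still ends up in a single group.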

  7. The Challenge of Extending the Trails
     • It is difficult to correlate raw trails
     • For example, identifying developers:
       – CVS uses an id to record the developer
       – The ChangeLog lists his/her preferred email address
       – The mailing list might list his/her spam, or commonly used, address
       – Some changes come from developers without CVS access, and they are recorded only in the ChangeLogs
     • Nonetheless, the trails provide a gold mine of information for following the evolution of a project
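The developer-identification problem above can be illustrated with a small sketch. It assumes ChangeLog entries start with a GNU-style header line (`YYYY-MM-DD  Name  <email>`) and links an e-mail address to a CVS id when the local part of the address equals the id, or when a hand-maintained alias table says so. The function names, the alias table, and the matching rule are all hypothetical; a real project would need more heuristics than this.

```python
import re
from collections import defaultdict

# GNU-style ChangeLog header: date, two or more spaces, name, <email>.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s*$")


def changelog_authors(text):
    """Yield (date, name, email) for every header line in a ChangeLog."""
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            yield match.groups()


def link_identities(cvs_ids, changelog_people, aliases=None):
    """Map each known CVS id to the e-mail addresses believed to be that person.

    Heuristic (an assumption): an address belongs to a CVS id if its local
    part equals the id, unless `aliases` maps the address to another id.
    """
    aliases = aliases or {}
    linked = defaultdict(set)
    for _date, _name, email in changelog_people:
        email = email.lower()
        guess = aliases.get(email, email.split("@", 1)[0])
        if guess in cvs_ids:
            linked[guess].add(email)
    return dict(linked)
```

Addresses that match no CVS id are simply left unlinked, which is also how changes contributed by people without CVS access would show up.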

  8. Milestones of Evolution

     | Milestone | Date |
     |---|---|
     | Coding of camel starts | 1999-01-01 |
     | Evolution starts | 1999-04-16 |
     | Ximian is established | 1999-10-01 |
     | Version 0.0 | 2000-05-10 |
     | Version 1.0 | 2001-11-21 |
     | Version 1.1.1 | 2002-09-09 |
     | Version 1.2.0 | 2002-11-07 |
     | LinuxWorld “Best Front Office Solution” award | 2003-01-23 |
     | Version 1.3.1 | 2003-02-28 |

  9. Size of the Distributions
     [Figure: size in MBytes by month, 00/07 to 03/01. Series: size of version, size of source code, size of translations, size of ChangeLogs; major releases marked.]

  10. Size of the Distributions...
      [Figure: LOCS, clean LOCS, and total number of files (number of source files on the right axis) by month, 00/07 to 03/01; major releases marked.]

  11. How is the code base changing?
      [Figure: new LOCS and new source files (right axis) by month, 00/07 to 03/01; major releases marked.]

  12. And the developers?
      [Figure: MRs and code MRs by date, 98/01 to 03/01. Major and minor releases marked (0.0, 1.0, 1.1.1, 1.2, 1.3.1), with an annotation where Ximian starts operations.]

  13. Change in code base vs. contributors activity
      [Figure: MRs and code MRs against new LOCS and LOCS added in release (right axis) by date, 00/01 to 03/01. Major releases marked (1.0, 1.1.1, 1.2, 1.3.1).]

  14. How many contributors?
      [Figure: contributors activity. y-axis: proportion of total MRs (log scale); x-axis: contributors (log scale, 1 to 128).]

  15. Revisions per type of file

      | Extension | Prop. | Accum. | Number of files in CVS |
      |---|---|---|---|
      | .c | 0.41 | 0.41 | 1195 |
      | ChangeLog | 0.22 | 0.62 | 43 |
      | .h | 0.13 | 0.75 | 1063 |
      | .am | 0.05 | 0.81 | 174 |
      | .po | 0.04 | 0.85 | 71 |
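A table of this shape (proportion and accumulated proportion per file type) can be derived from raw revision counts as sketched below. The function name `proportion_table` and the counts in the usage note are made up for illustration; they are not the Evolution data. Accumulating unrounded proportions and rounding only for display may also be why a table can show 0.41 and 0.22 accumulating to 0.62 rather than 0.63.

```python
from itertools import accumulate


def proportion_table(counts):
    """Return (name, proportion, accumulated proportion) rows, largest first.

    counts: mapping from file type to number of revisions.
    """
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    props = [n / total for _name, n in rows]
    return [
        (name, prop, acc)
        for (name, _n), prop, acc in zip(rows, props, accumulate(props))
    ]
```

For example, `proportion_table({".c": 50, "ChangeLog": 30, ".h": 20})` lists `.c` first with a proportion of one half, and the accumulated column reaches 1 on the last row.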

  16. Most files are rarely changed
      [Figure: revisions to files. y-axis: proportion of revisions to a given code file; x-axis: files (log scale, 1 to 1000).]

  17. Modules
      [Figure: MRs per module, bar chart, 0 to 3000 MRs. Modules shown: mail, camel, calendar, addressbook, shell, widgets, composer, e-util, filter, my-evolution, tests, libical, libibex, executive-summary, wombat, importers, im, libversit, notes, tools, libwombat, cmdline, ebook.]

  18. Evolution of the size of the modules
      [Figure: LOCS per module by date, 00/07 to 03/01. Modules shown: camel, calendar, mail, addressbook, shell, libical, widgets; major releases marked.]

  19. Changes are usually localized in a given module
      [Figure: number of modules in a codeMR. y-axis: number of codeMRs (log scale); x-axis: number of modules in a codeMR (0 to 16).]

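  20. Developers tend to concentrate in one module

      | Mod | Developers | Id | Prop | Acc |
      |---|---|---|---|---|
      | shell | 17 | ettore | 0.65 | 0.65 |
      | | | danw | 0.11 | 0.76 |
      | | | toshok | 0.05 | 0.81 |
      | | | clahey | 0.04 | 0.84 |
      | | | zucchi | 0.03 | 0.87 |
      | mail | 19 | fejj | 0.52 | 0.52 |
      | | | rodo | 0.13 | 0.65 |
      | | | zucchi | 0.12 | 0.77 |
      | | | ettore | 0.07 | 0.83 |
      | | | danw | 0.06 | 0.89 |
      | calendar | 17 | jpr | 0.40 | 0.40 |
      | | | rodrigo | 0.32 | 0.72 |
      | | | ettore | 0.07 | 0.79 |
      | | | danw | 0.06 | 0.85 |
      | | | damon | 0.03 | 0.88 |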
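A per-module breakdown like the one above can be computed from a list of changes once each change is attributed to a module and a developer. The sketch below assumes the module is the top-level directory of the changed file's path; that rule, the function names, and the input shape are assumptions for illustration, not the method behind the table.

```python
from collections import Counter, defaultdict


def module_of(path):
    """Assume the module is the first directory component of the path."""
    return path.split("/", 1)[0]


def developer_shares(changes, top=5):
    """Summarize who makes the changes in each module.

    changes: iterable of (path, developer_id) pairs, one per change.
    Returns {module: (number of developers, [(id, prop, acc), ...])},
    listing the `top` most active developers per module.
    """
    per_module = defaultdict(Counter)
    for path, dev in changes:
        per_module[module_of(path)][dev] += 1

    report = {}
    for module, counter in per_module.items():
        total = sum(counter.values())
        acc = 0.0
        rows = []
        for dev, n in counter.most_common(top):
            acc += n / total
            rows.append((dev, n / total, acc))
        report[module] = (len(counter), rows)
    return report
```

The developer count covers everyone who touched the module, while the rows cover only the most active few, which is why the accumulated column in such a table usually stops short of 1.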

  21. Observations
      • One software trail does not tell the whole story
      • Schema evolution
      • Informal structure in trail
      • Information overload and the need for analysis and visualization tools
      • Quality of software trails

  22. Quality of Trails
      • Some projects keep better trails than others.
      • One hypothesis: it is a measure of:
        – the number of developers,
        – their dislocation,
        – and the maturity of the project.

  23. Conclusions and Future Work
      • Extracting and correlating software trails can tell a detailed story of how a software project has evolved
      • But it comes at a cost: too much information to analyze
      • What is needed:
        – Creation of standardized schemas
        – More tools to recover and enhance the trails
        – Heuristics to automatically discover “interesting” facts
        – Metrics to quantify trails
