Evolution of GitHub Repositories



  1. Evolution of GitHub Repositories
     - Aseel Awdeh, School of Computer Science, University of Ottawa, Ottawa, Canada (araed104@uottawa.ca)
     - Kalonji Kalala, School of Computer Science, University of Ottawa, Ottawa, Canada (hkalo081@uottawa.ca)
     - COMP 5900 Project Presentation

  2. Outline
     - Background
     - Motivation
     - Research Questions
     - Challenges
     - Methodology
     - Results
     - Future Work and Implications

  3. Background
     - GitHub:
       - Collaborative code hosting
       - Collaborative code review
       - Integrated issue tracking
       - Social features
     - Over 10 million git repositories and 5 million developers.
     - Largest code hosting site in the world.
     - Important source of software artifacts on the Internet.

  4. Background [figure; source: http://www.dataschool.io/]

  5. Motivation
     - Increasing number of projects and users on GitHub.
     - GitHub has surpassed older forges (SourceForge) in size and popularity.
     - Existing research:
       - GitHub's event logs.
       - Effects of branching and pull-based software development.
       - Social nature of GitHub.
     - No studies on the evolution of GitHub repositories.

  6. Research Questions
     - RQ1: How do the projects evolve?
     - RQ2: How does the popularity of projects change over time?
     - RQ3: What is the health of the projects?

  7. Methodology: (1) Data Collection, (2) Data Extraction, (3) Data Analysis

  8. Dataset Collection Challenges
     - MSR 2016 challenges faced.
     - GHTorrent dataset challenge:
       - GitHub allows access to its internal data through a REST API.
       - GHTorrent gathers event streams and data from GitHub.
     - Used the MSR 2014 dataset.

  9. MSR 2014 Dataset
     - Top 10 software projects for each of the top programming languages on GitHub, resulting in 90 projects.
     - Covers each project from its year of creation until 2013.
     - Some characteristics of each project:
       - issues
       - pull requests
       - followers
       - stars
       - commits

  10. Dataset Challenges
      - Limited dataset:
        - Selection of projects.
        - Tables provided for each project (project_language table).
      - Incomplete attributes for some tables.
      - Not representative of GitHub's historical dataset.

  11. Factors
      - For each project:
        - Commits
        - Issues
        - Pull requests
        - Committers
        - Watchers
        - Language

  12. Overall Evolution

  13. Project Growth [figure: number of projects (y-axis, 0 to 100) by month and year (x-axis, March 2008 to June 2012)]

  14. Commits Growth [figure: number of commits (y-axis, 0 to 18,000) by year (x-axis, ticks at February, May, August, and November, 2003-2004 through 2013-2014)]

  15. Committers Growth [figure: number of committers (y-axis, 0 to 2,000) by year (x-axis, ticks at February, May, August, and November, 2003-2004 through 2013-2014)]

  16. Growth of projects in terms of ...

  17. Number of commits per project [figure: number of commits (y-axis, 0 to 50,000) for each project (x-axis, accessible-boilerplate through zf2)]

  18. Number of issues per project [figure: number of issues (y-axis, 0 to 30,000) for each project (x-axis, ActionBarSherlock through zf2)]

  19. Number of committers per project [figure: number of committers (y-axis, 0 to 6,000) for each project (x-axis, accessible-boilerplate through zf2)]

  20. Popularity

  21. Watchers Growth [figure: number of watchers (y-axis, 0 to 25,000) by year (x-axis, ticks every two months, 2008-2009 through 2013-2014)]

  22. Number of watchers per project [figure: number of watchers (y-axis, 0 to 20,000) for each project (x-axis)]

  23. Initial Data Analysis: Pearson Correlation
      - Watchers and Commits: -0.12411
      - Watchers and Committers: -0.08331
      - Watchers and Issues: -0.0308
      - Commits and Issues: 0.765961
      - Commits and Committers: 0.79765
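
A minimal sketch of how a Pearson correlation between two per-project measures could be computed, using only the Python standard library. The function name `pearson` and the sample numbers are hypothetical illustrations; they are not taken from the MSR 2014 dataset and do not reproduce the coefficients on the slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-project totals (made up for illustration only).
commits = [120, 450, 900, 3000, 150]
committers = [4, 20, 35, 110, 6]
watchers = [5000, 300, 1200, 800, 9000]

r_commits_committers = pearson(commits, committers)
r_watchers_commits = pearson(watchers, commits)
print(f"commits vs committers: {r_commits_committers:.3f}")
print(f"watchers vs commits:   {r_watchers_commits:.3f}")
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0, like the watcher correlations reported on the slide, indicates little linear relationship.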

  24. Anticipated Health Results
      - Assumption: active projects are healthy.
      - A project with a large number of watchers is not necessarily healthy or active.
      - Projects usually die out after 3 years of development.
      - Health of most projects increases in 2011.
      - Most projects are there for storage, not development.
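
One way the "active projects are healthy" assumption could be made concrete is to count commits per year and report the span of years in which a project was active. The sketch below is only an illustration: the function `active_span`, the `min_commits` threshold, and the sample years are all hypothetical and are not the measure used in the presentation.

```python
from collections import Counter


def active_span(commit_years, min_commits=1):
    """Return (first, last) year with at least min_commits commits, or None.

    commit_years holds one entry per commit, giving the year it was made.
    """
    counts = Counter(commit_years)
    active = sorted(year for year, n in counts.items() if n >= min_commits)
    if not active:
        return None
    return active[0], active[-1]


# Hypothetical commit years for a single project (illustration only).
years = [2009, 2009, 2010, 2011, 2011, 2011, 2013]

span = active_span(years, min_commits=2)
if span is not None:
    first, last = span
    print(f"active from {first} to {last} ({last - first + 1} years)")
```

Raising `min_commits` makes the notion of "active" stricter, so occasional maintenance commits do not count as ongoing development.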

  25. Example: Commits per project per year [figure: commits (y-axis, 0 to 40,000) for each project (x-axis), with one series per year from 2003 to 2012]

  26. Future Data Analysis
      - Relationships:
        - Number of issues and committers related to number of commits per project.
        - Number of commits related to programming languages used.
        - Number of commits related to number of watchers.
      - Kruskal-Wallis test:
        - Tests whether independent groups of projects follow the same distribution.
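
A minimal sketch of the Kruskal-Wallis H statistic in plain Python, assuming the groups are per-language samples of commit counts. That grouping, the function names, and the numbers are hypothetical. The sketch omits the correction for ties and does not compute a p-value; H is usually compared against a chi-squared distribution with (number of groups - 1) degrees of freedom.

```python
def average_ranks(values):
    """1-based ranks of values; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical commit counts for projects grouped by language (illustration only).
by_language = {
    "Ruby": [1200, 3400, 560, 980],
    "Python": [2100, 450, 7800, 300],
    "Java": [150, 900, 640, 5000],
}
h = kruskal_wallis_h(list(by_language.values()))
print(f"H = {h:.3f} with {len(by_language) - 1} degrees of freedom")
```

Because the test works on ranks rather than raw values, it does not assume normally distributed commit counts, which suits skewed repository data.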

  27. Implications and Future Work
      - Sample and analyze data from the larger GHTorrent dataset.
      - Analyze GitHub per user instead of per project.
      - Help predict future growth patterns and requirements of GitHub.

  28. Thank you
