Large-Scale Reuse in Open Source Software
Audris Mockus
audris@avaya.com
Avaya Labs Research
Basking Ridge, NJ 07920
http://mockus.org/

Open Source Innovations
✦ Fundamentally different model of software development
  ✧ Built by large numbers of volunteers without physical contact
  ✧ Work is not assigned but chosen
  ✧ Design controlled by a few architects
✦ Resulting properties of software and process [2]
  ✧ Small core team controlling code submission and new features, with a bug-fix community an order of magnitude larger and a problem-reporting community two orders of magnitude larger
  ✧ Low post-feature-test defect density
  ✧ High developer productivity
  ✧ Rapid response to user problems

Research Goals
✦ A key premise of open source is that the code can be used in other projects
  ✧ Reduces the risk of a project’s code becoming unavailable or unsupported
  ✧ Provides social value by encouraging innovation (no need to reimplement existing functionality)
✦ These suggest the following research questions:
  ✧ What is the extent of reuse?
  ✧ What are the properties of highly reused code?
  ✧ How to evaluate the reuse potential of a component?
  ✧ How to find the code most suitable for reuse?
  ✧ How to produce code that is more likely to be reused?

Experimental approach
✦ Sample a large set of open source projects
✦ Identify and quantify instances of large-scale reuse
  ✧ not a copy and paste in an editor
  ✧ not a case of reuse where another project is reused as-is through libraries without copying the code
✦ Identify common patterns of reuse
✦ Quantify quality and other properties of the reused code

Sample selection and retrieval
✦ Sample
  ✧ Important projects: Apache, Gnome, KDE, Mozilla, OpenSolaris, Postgres, and W3C
  ✧ Large distributions: Fedora 6, Gentoo, Slackware, FreeBSD, NetBSD, and OpenBSD
  ✧ Development portals: Savannah, SourceForge, and Tigris
  ✧ Random or language specific: FreshMeat, CPAN, RpmForge, and Gallery of Free Software Packages
✦ Retrieval
  ✧ SVN/CVS, wget, and page scraping (FreshMeat)
  ✧ 13.2M files from 49.9K bundles
  ✧ 5.3M source code files and 38.7K bundles after normalization (removing package versions, binary files, ...)
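One part of the normalization step, removing package versions so that several releases of a package count as a single bundle, can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the actual rules, and the archive extensions, the version pattern, and the function names `normalize_bundle` and `unique_bundles` are hypothetical.

```python
import re

# Assumed archive suffixes; the real sample may have used others.
_ARCHIVE_EXT = re.compile(r"\.(tar\.(gz|bz2|xz)|tgz|zip)$")
# Assumed version pattern: a trailing "-1.2.3" or "_v2" style suffix.
_VERSION = re.compile(r"[-_]v?\d+(\.\d+)*$")


def normalize_bundle(name):
    """Strip an archive extension and a trailing version from a bundle name."""
    base = _ARCHIVE_EXT.sub("", name)
    return _VERSION.sub("", base)


def unique_bundles(names):
    """Collapse different releases of the same package into one bundle name."""
    return {normalize_bundle(n) for n in names}


if __name__ == "__main__":
    releases = ["httpd-2.2.3.tar.gz", "httpd-2.2.4.tar.gz", "gettext-0.16.1.tar.bz2"]
    print(sorted(unique_bundles(releases)))
```

Names without a recognizable version suffix pass through unchanged, so the sketch errs on the side of keeping bundles distinct.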

Quantify large-scale reuse
✦ Method
  ✧ Identify pairs of directories that share a large fraction of their filenames [1] as reused directories
  ✧ Consider files with the same names in reused directories to be reused
✦ Measures
  ✧ Overall reuse: the fraction of files that are in more than one project
  ✧ Component reuse: the number of projects in which the component is present
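The method and the two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from [1]: the slides do not define how the shared fraction is computed, so Jaccard overlap of the two filename sets is assumed here, and the data layout and function names are hypothetical. The pairwise comparison is quadratic and would not scale to millions of files as written.

```python
from itertools import combinations


def shared_fraction(a, b):
    """Jaccard overlap of two filename sets (an assumed definition)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_reused_files(dirs, threshold=0.8):
    """dirs maps (project, directory) -> set of filenames.

    Returns (project, directory, filename) triples for files that share a
    name across two directories of different projects whose shared
    fraction reaches the threshold.
    """
    reused = set()
    for (key_a, files_a), (key_b, files_b) in combinations(dirs.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # same project, so not cross-project reuse
        if shared_fraction(files_a, files_b) >= threshold:
            for name in files_a & files_b:
                reused.add((*key_a, name))
                reused.add((*key_b, name))
    return reused


def overall_reuse(dirs, reused):
    """Fraction of all files that were identified as reused."""
    total = sum(len(files) for files in dirs.values())
    return len(reused) / total if total else 0.0


def component_reuse(dirs, component, threshold=0.8):
    """Number of projects with a directory matching the component's filenames."""
    projects = {proj for (proj, _), files in dirs.items()
                if shared_fraction(component, files) >= threshold}
    return len(projects)


if __name__ == "__main__":
    po = {"am.po", "de.po", "fr.po", "ja.po", "zh_TW.po"}
    dirs = {("p1", "po"): set(po), ("p2", "po"): set(po),
            ("p3", "src"): {"main.c", "util.c"}}
    reused = find_reused_files(dirs)
    print(overall_reuse(dirs, reused), component_reuse(dirs, po))
```

Raising or lowering `threshold` corresponds to the parameter for the minimal fraction of shared filenames between two directories.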

Results
✦ Results using different parameter values for the minimal fraction of shared filenames between two directories

  Minimal shared fraction    30%          50%          80%
  File count                 2,837,233    2,782,339    2,654,977
  Overall reuse              0.53         0.52         0.49

  Table 1: Reused files in open source projects.

Scenarios of reuse
✦ Most reused (numbers are based on the 80% cutoff)
  ✧ Text template: 657 projects using language translations, “po” directory with almost 50 files: “am.po”, ..., “zh_TW.po”
  ✧ Functional template: 576 projects using the install module for Perl
  ✧ Verbatim copy: 547 projects using C functions for internationalization
✦ Largest components reused at least 50 times
  ✧ 701 include files for the Linux kernel
  ✧ System dependent configuration: glibc/sysdeps/generic with 750 files

Validity
✦ Sampling process to increase the representativeness of the project sample
✦ The definition of large-scale reuse
  ✧ not a copy and paste in an editor
  ✧ not a case of reuse where another project is reused as-is through libraries without copying the code
✦ No substantial changes to filenames or directory structure
✦ The instances of reuse are underestimated (no cases of mistaken identification of reuse were found)

Summary and future work
✦ Findings
  ✧ The three most common patterns of reuse do not suggest immediate ways to increase reuse but point to less intuitive avenues for reuse
  ✧ Reuse is indeed massive and therefore has to facilitate innovation and ensure that reused code lives on even if some projects die or vegetate
  ✧ The amount of OSS code is not that vast
✦ Future
  ✧ Better sample, identification of reuse, classification of patterns
  ✧ Reconstructing authorship and implicit collaborations via universal version history
  ✧ Quantifying quality and other properties of highly reused code
  ✧ Quantifying benefits to society

References
[1] Hung-Fu Chang and Audris Mockus. Constructing universal version history. In ICSE’06 Workshop on Mining Software Repositories, pages 76–79, Shanghai, China, May 22–23, 2006.
[2] Audris Mockus, Roy T. Fielding, and James Herbsleb. Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3):1–38, July 2002.
