Research Infrastructure for Empirical Science of F/OSS

Research Infrastructure for Empirical Science of F/OSS
Les Gasser, Gabriel Ripoche, Robert J. Sandusky
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
{gasser,gripoche,sandusky}@uiuc.edu
ICSE – MSR Workshop, May 25, 2004

Introduction
● UCI/UIUC 2003 “Design in F/OSS” workshop: Pressing need for research infrastructure
● What are the objects and methods of analysis?
● What are the data requirements?
● What are the available data?
● What are the common issues?
● How can these issues be addressed?

Our research questions
● How are software problems managed in practice, in large-scale, distributed communities?
– What are the factors and processes that impact performance?
– How are these processes enacted? How do they unfold?
● How does information shape activity? How does activity shape information?
● Bug Report Networks: How do information networks structure social activity?

Our research Bug report networks

Objects of study in F/OSS research
Objects | Success measures | Critical driving factors
Artifacts | Quality, reliability, usability, durability, fit, ... | Size, complexity, software architecture (structure, substrates, infrastructure), ...
Processes | Time, cost, complexity, manageability, predictability, ... | Size, distribution, collaboration, knowledge/information management, artifact structure, ...
Communities | Ease of creation, sustainability, trust, social capital, ... | Size, economic setting, organizational architecture, behaviors, incentive structures, ...
Knowledge | Creation, use, need, management, ... | Tools, conventions, norms, social structures, technical content, ...
● RI should support variety, and allow for extension

Current research approaches
● Large-scale quantitative cross-analyses
– Code size, code change evolution, group size, composition and organization, development processes
● Small-scale qualitative case studies
– Specific processes and practices, hypothesis development and testing
● Main issues:
– Scalability
– Richness
● RI should facilitate articulation of the two sides

Data requirements
Characteristics | Requirements
Empirical and natural | ● Reflect reality
Sufficient size and variety | ● Adequate coverage ● Representative level of variance ● Statistical significance
Common frameworks and representations (sharable) | ● Comparable results ● Repeatable, testable, extendable

Data available
Variety of | Types | Examples
Content | Communication | Discussion forums, newsgroups, chats, community digests, ...
Content | Documentation | HOWTOs, FAQs, user and developer documentation, tutorials, ...
Content | Development | Source code, bug reports, design documents, ...
Medium | Communication | Mailman, phpBB, ...
Medium | Source control | CVS, Subversion, BitKeeper, ...
Medium | Issue tracking | Bugzilla, Scarab, Gnats, ...
Medium | Content mgt. | Wiki, Plone, ...
Location | Project sites | Mozilla, Linux, KDE, Gnome, Gimp, ...
Location | Community sites | Slashdot, Newsforge, FSF, ...
Location | Repositories & indexes | SourceForge, Freshmeat, Tigris, ...
● Data available as byproducts, not generated for research

Issues with empirical data
● Discovery and selection
● Access and gathering
● Cleaning and normalization
● Linked aggregation
● Evolution
[Diagram labels: Data, prep., Research]

Issues with empirical data: Cleaning and normalization
● Bug report normalization
– Multiple formats of the “bug report” object (Bugzilla, Scarab, ...)
– What information is necessary for research? (and is that information readily available?)
● Bug reference normalization
– Various types of references: How do we normalize them?
● E.g.: depends on, blocks, duplicate, ...
– Some of them not formalized: How do we mine them?
● E.g.: see also, related, ...
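One way to picture bug reference normalization is a lookup from tracker-specific labels to a shared vocabulary. The sketch below is only an illustration: the canonical names (`depends_on`, `duplicate_of`, `related`) and the label variants in the table are assumptions made for this example, not the field names of Bugzilla, Scarab, or any other tracker.

```python
from typing import NamedTuple


class BugRef(NamedTuple):
    """A normalized reference from one bug report to another."""
    source: str
    relation: str
    target: str


# Hypothetical mapping from tracker-specific labels to a shared vocabulary.
# The label variants are illustrative assumptions, not real tracker fields.
CANONICAL_RELATIONS = {
    "depends on": "depends_on",
    "dependson": "depends_on",
    "blocks": "blocks",
    "blocked": "blocks",
    "duplicate": "duplicate_of",
    "duplicate of": "duplicate_of",
    "see also": "related",
    "related": "related",
}


def normalize_relation(label: str) -> str:
    """Lower-case, unify separators and whitespace, then look up the label."""
    key = " ".join(label.lower().replace("_", " ").split())
    return CANONICAL_RELATIONS.get(key, "unknown")


def normalize_ref(source, label, target) -> BugRef:
    """Build a normalized reference; IDs are kept as strings."""
    return BugRef(str(source), normalize_relation(label), str(target))
```

Labels that fall outside the table come back as `unknown`, which keeps the non-formalized cases visible for later manual review instead of silently dropping them.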

Issues with empirical data: Linked aggregation
● BRN complete only if multiple repositories are aggregated
● Some issues span across multiple repositories
– Gnome & Red Hat: Who's got responsibility for a bug?
– Debian, Gentoo bug posting instructions
● The need for aggregation is two way:
– Same tool, different projects
– Same project, different tools
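A minimal sketch of what linked aggregation could look like in code, assuming each bug is identified by a (repository, local id) pair so that identifiers from different trackers cannot collide. The repository names and bug numbers in the usage note are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def qualify(repo: str, bug_id) -> str:
    """Namespace a local bug id with its repository name."""
    return f"{repo}:{bug_id}"


def build_network(links):
    """Build an undirected bug report network.

    `links` is an iterable of (src_repo, src_id, dst_repo, dst_id) tuples.
    """
    graph = defaultdict(set)
    for src_repo, src_id, dst_repo, dst_id in links:
        a = qualify(src_repo, src_id)
        b = qualify(dst_repo, dst_id)
        graph[a].add(b)
        graph[b].add(a)
    return graph


def cross_repo_edges(graph):
    """Return the edges whose endpoints live in different repositories."""
    edges = set()
    for node, neighbours in graph.items():
        for other in neighbours:
            if node.split(":", 1)[0] != other.split(":", 1)[0]:
                edges.add(tuple(sorted((node, other))))
    return edges
```

For example, `build_network([("gnome", 10, "redhat", 77), ("gnome", 10, "gnome", 11)])` yields a network in which only the `gnome:10` to `redhat:77` edge spans repositories; that edge would be lost if each repository were analyzed on its own.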

Components of a research infrastructure
● Representation standards
● Metadata
● Tools (downstream & upstream)
● Centralized data repositories
● Federated access
● Processed research collection
● Integrated data-to-literature environments

Components of a research infrastructure: Representation standards
● Bug report XML representation
● Abstracted properties
– Smallest or largest common denominator?
● Additional information for research purposes
– Metadata
– Mined/inferred properties

<!ELEMENT bug_report ( id, alias?, creation_ts, last_modification_ts, status, resolution, product, component, hardware_list, os_list, version_list, severity, priority, target_milestone, reporter, responsible_party, qa_contact, cc_list, manifesting_url, summary, status_whiteboard, keywords, dependency_list, attachment_list, vote_list, comment_list, bug_activity_transaction_list, provenance )>
<!ATTLIST bug_report id ID #REQUIRED>
<!ELEMENT id ( #PCDATA )>
<!ELEMENT alias ( #PCDATA )>
<!ELEMENT creation_ts ( %timestamp; )>
<!ELEMENT last_modification_ts ( %timestamp; )>
<!ELEMENT status ( #PCDATA )>
<!ELEMENT resolution ( #PCDATA )>
...
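To make the XML representation more concrete, here is a small sketch that reads a few of the `#PCDATA` elements named in the DTD excerpt using Python's standard `xml.etree.ElementTree`. The sample document and all of its values are made up for illustration. It is deliberately a fragment: it omits the required elements whose content models are not shown in the excerpt (for example `creation_ts`, which depends on the `%timestamp;` entity), so it would not validate against the full DTD, and ElementTree performs no DTD validation in any case.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: element names come from the DTD excerpt, values do not.
SAMPLE = """<bug_report id="bug-1001">
  <id>1001</id>
  <alias>example-alias</alias>
  <status>RESOLVED</status>
  <resolution>FIXED</resolution>
</bug_report>"""

# Only the elements declared as #PCDATA in the excerpt are read here.
PCDATA_FIELDS = ("id", "alias", "status", "resolution")


def read_bug_report(xml_text: str) -> dict:
    """Parse one bug_report element into a flat dictionary."""
    root = ET.fromstring(xml_text)
    if root.tag != "bug_report":
        raise ValueError(f"expected <bug_report>, got <{root.tag}>")
    record = {"xml_id": root.get("id")}
    for name in PCDATA_FIELDS:
        child = root.find(name)
        has_text = child is not None and child.text
        # Optional or absent elements (e.g. alias) are recorded as None.
        record[name] = child.text.strip() if has_text else None
    return record
```

Calling `read_bug_report(SAMPLE)` returns a dictionary with the attribute under `xml_id` and one key per field; absent optional elements such as `alias` come back as `None`.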

Components of a research infrastructure: Tools
● Extraction of bug cross-references
– 100% of formalized references are automatically minable
– 40-70% of non-formalized references are minable (regex) but hard to automatically categorize
– The remainder require human help
● Three possible approaches:
– Facilitate human mining (downstream)
– Improve automated extraction tools (downstream), e.g.: more complex regex, NLP
– Increase formalization at creation time (upstream)

Recommendations
● Refine knowledge of F/OSS research needs
● Exploit experience from other domains
● Develop data selection policies
● Develop data standards
● Instrument studied tools
● Create federation middleware
● Create prototypes

Conclusions
Research infrastructure might increase collaboration and lower the “entry cost” of doing F/OSS research, but:
● Is there a sufficient drive for a common infrastructure?
– What are the common questions?
– What are the common needs?
● Risk of limiting research to “low-hanging fruit”
– Features easy to measure and extract
– Many studies on few common corpora
– Same underlying assumptions about data
