Dominoes: Exploratory Data Analysis of Software Repositories Through GPU Processing
Jose Ricardo, Esteban Clua, Leonardo Murta, Anita Sarma

2 Introduction
• Identifying expertise is important
• Normally, expertise is identified through an informal process:
  • Social network
  • Implicit knowledge of work dependencies
• This is even more challenging in globally distributed development

3 Introduction
• Software development leaves behind activity logs that can be mined for relationships:
  • Commits in a version control system
  • Tasks in an issue tracker
  • Communication
• Finding these relationships is not a trivial task:
  • There is an extensive amount of data to be analyzed
  • Data is typically stored across different repositories
• Scalability problems arise depending on the project size
  • Processing the history of large repositories at fine grain for exploratory analysis at interactive rates

4 Related Work
• EEL scopes the analysis to 1,000 project elements
  • Restricts the history to a small chunk of data
• Cataldo analyzes data at coarse grain
  • A developer is treated as an expert on the whole artifact
• Boa allows fine-grained analysis by using a CPU cluster
  • Normally requires a time slice for using the cluster
  • Requires data submission for processing

5 Solving the Problem
Usability + Performance = Deeper Research
• Usability: user interface
• Performance: implement the operations used for data analysis on the GPU
• Deeper research: happy user / researcher

6 Solving the Problem
• Efficient large-scale repository analysis
• Enable users to explore relationships across different levels of granularity
• No requirement for a specialized infrastructure

7 Dominoes
• Infrastructure that enables interactive exploratory data analysis at varying levels of granularity using the GPU
• Organizes data from software repositories into multiple matrices
• Each matrix is treated as a Dominoes tile
• Tiles can be combined through operations to generate derived tiles
  • Transposition, multiplication, addition, …

8 Dominoes UI
• Dominoes’ tiles resemble the pieces of a dominoes game, which the user can play with to build new relationships

10 Examples of Derived Building Tiles
• [method|method] ($MM = CM^{T} \times CM$): represents method dependencies
• [class|class] ($ClCl = ClM \times MM \times ClM^{T}$): represents class dependencies
• [issue|method] ($IM = IC \times CM$): represents the methods that were changed to implement/fix an issue
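The first derived tile can be illustrated with a small worked example. This is only a sketch in plain Python, not the GPU implementation from the slides: it assumes CM is a [commit|method] matrix (one row per commit, one column per method), and the helper names `transpose` and `multiply` as well as the sample data are hypothetical.

```python
def transpose(m):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def multiply(a, b):
    """Multiply two dense matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


# Hypothetical [commit|method] tile: 3 commits (rows) x 3 methods (columns).
# CM[c][m] == 1 means that commit c changed method m.
CM = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

# Derived [method|method] tile: MM = CM^T x CM.
# MM[i][j] counts the commits that changed both method i and method j.
MM = multiply(transpose(CM), CM)

for row in MM:
    print(row)
```

In this sample, methods 0 and 1 were changed together in two commits, so MM[0][1] is 2, while the diagonal entry MM[1][1] is the number of commits that touched method 1.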

11 Dominoes Architecture
• The Extractor module gathers information from the repository and saves it to a database
• The basic block builder is responsible for generating the building-block relationships from the database
• Operations are performed on the GPU through a Java Native Interface call
• Derived and basic building blocks stay in memory for future use
[Architecture diagram: Dominoes Extractor, Database, Basic Tile Builder, Client, Analysis Request, 2D Tile, 3D Tile, Derived Tile, Memory, Serialize / Unserialize, CUDA Kernels (Linear Transformations, Data Mining, Statistics)]

12 Data Structure
• Matrices are very sparse for some relationships
  • Developer × Commit
• The Java side maintains a pointer to the sparse matrix allocated on the C side
• The matrices are stored in CRS format
• Matrix operations are performed in C through a JNI interface

Java
    …
    long m1, m2, res;   // native pointers to the matrices
    createObj(res);
    multiplication(m1, m2, res);
    …

C / CUDA
    void multiplication(JNIEnv *env, jclass obj, jlong m1, jlong m2, jlong res) {
        Matrix *_m1 = (Matrix*) m1;
        Matrix *_m2 = (Matrix*) m2;
        Matrix *_res = (Matrix*) res;
        GPUMul(_m1, _m2, _res);
    }
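The CRS (compressed row storage) layout mentioned above keeps only the non-zero values, their column indices, and one offset per row. The following is a minimal plain-Python sketch of that layout, not the C-side data structure used by Dominoes; the function names `to_crs` and `crs_row` and the sample matrix are hypothetical.

```python
def to_crs(dense):
    """Convert a dense matrix (list of rows) into CRS arrays.

    Returns (values, col_idx, row_ptr), where row i occupies the slice
    row_ptr[i]:row_ptr[i + 1] of values and col_idx.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def crs_row(values, col_idx, row_ptr, i):
    """Return the (column, value) pairs stored for row i."""
    start, end = row_ptr[i], row_ptr[i + 1]
    return list(zip(col_idx[start:end], values[start:end]))


# Hypothetical sparse [developer|commit] tile with one empty row.
dense = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
values, col_idx, row_ptr = to_crs(dense)
print(values, col_idx, row_ptr)
print(crs_row(values, col_idx, row_ptr, 0))
```

An empty row costs only one repeated offset in `row_ptr`, which is why the format suits very sparse tiles.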

13 Operations in GPU
• Linear Transformation: Addition, Multiplication, Transposition
• Data Mining: Confidence, Lift, Support
• Statistics: Mean, Standard Deviation, Z-Score

14 Linear Transforms
• Allow connecting pieces in Dominoes by changing their edges
• Allow extracting further relationships from the data by combining the different types of data
• Use the cusp library for performing linear transforms

15 Reduction
• Normally used for calculating the amount of a relationship
  • Total number of classes modified by a developer
  • How many bugs a developer has inserted in a method Y
• Uses the Thrust library for calculating it on the GPU
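A row-wise reduction of this kind can be sketched sequentially as follows. The slides perform it with Thrust on the GPU; this plain-Python version only illustrates the arithmetic. It assumes a hypothetical [developer|class] tile stored as CRS arrays in which each stored value is 1 when the developer modified the class, and the name `reduce_rows` is an assumption.

```python
def reduce_rows(values, row_ptr):
    """Sum the stored values of every row of a CRS matrix."""
    return [sum(values[row_ptr[i]:row_ptr[i + 1]])
            for i in range(len(row_ptr) - 1)]


# Hypothetical [developer|class] tile with 3 developers:
# developer 0 modified two classes, developer 1 none, developer 2 one.
values = [1, 1, 1]
row_ptr = [0, 2, 2, 3]

classes_per_developer = reduce_rows(values, row_ptr)
print(classes_per_developer)
```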

16 Confidence
• Used to detect the direction of a relationship
  [Diagram: Class A and Class B linked by two opposite "Depends" arrows, one with 98% confidence and the other with 0.8%]
• Each GPU thread is responsible for computing the confidence of one element:
  $M_{conf}[i,j] = \dfrac{M_{SUP}[i,j]}{M_{SUP}[i,i]}$

17 Confidence
• Because the row and column of each element must be known, they are computed and stored in vectors
• Given a sparse M × M matrix with t non-zero values:
    Value:    V1 V2 V3 … Vt
    Row:      R1 R2 R3 … Rt
    Col:      C1 C2 C3 … Ct
    Diagonal: D1 D2 … DM
• For each of the t GPU threads:
    diagIdx = row[idx];
    conf[idx] = value[idx] / diagonal[diagIdx]
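The per-thread confidence computation can be sketched sequentially, with one loop iteration standing in for one GPU thread. This is an illustration in plain Python, not the CUDA kernel itself; the function names `extract_diagonal` and `confidence` and the sample support values are hypothetical.

```python
def extract_diagonal(values, rows, cols, size):
    """Collect the diagonal entries M_SUP[i, i] into a dense vector."""
    diagonal = [0.0] * size
    for v, r, c in zip(values, rows, cols):
        if r == c:
            diagonal[r] = v
    return diagonal


def confidence(values, rows, diagonal):
    """Compute conf[idx] = value[idx] / diagonal[row[idx]] per element."""
    conf = [0.0] * len(values)
    for idx in range(len(values)):  # one iteration per GPU thread
        diag_idx = rows[idx]
        conf[idx] = values[idx] / diagonal[diag_idx]
    return conf


# Hypothetical 2 x 2 support matrix [[4, 2], [2, 8]] in coordinate form.
values = [4.0, 2.0, 2.0, 8.0]
rows = [0, 0, 1, 1]
cols = [0, 1, 0, 1]

diagonal = extract_diagonal(values, rows, cols, 2)
conf = confidence(values, rows, diagonal)
print(conf)
```

The same off-diagonal support (2) yields a confidence of 0.5 in one direction and 0.25 in the other, which is how the measure exposes the direction of a relationship.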

18 Z-Score
• Responsible for converting an absolute value into a score relative to the mean:
  $z = \dfrac{x - \mu}{\sigma}$
  where x = absolute score, µ = mean, σ = standard deviation
• Requires a set of steps:
  • Calculating the mean per column
  • Calculating the standard deviation
  • Finally, calculating the z-score

19 Z-Score
• Calculating the mean per column
• Given an M × N matrix containing t non-zero values, the GPU sums up all values of each column and produces a vector of size N holding the means
    Value: V1 V2 V3 … Vt
    Col:   C1 C2 C3 … Ct
• Kernel 1, for each of the t GPU threads:
    colIdx = col[idx]
    atomicAdd(value[idx], sum[colIdx])
    atomicAdd(1, count[colIdx])
• Kernel 2, for each of the N GPU threads:
    mean[idx] = sum[idx] / count[idx]
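The two mean kernels can be sketched sequentially in plain Python, with the first loop playing the role of the accumulation kernel and the second of the division kernel. This is an illustration only: the name `column_means` is hypothetical, and returning 0.0 for a column without stored values is an assumption added here to avoid a division by zero, not something the slides specify.

```python
def column_means(values, cols, n_cols):
    """Mean of the stored (non-zero) values of every column."""
    total = [0.0] * n_cols
    count = [0] * n_cols

    # Kernel 1: one iteration per stored value (atomic adds on the GPU).
    for idx in range(len(values)):
        col_idx = cols[idx]
        total[col_idx] += values[idx]
        count[col_idx] += 1

    # Kernel 2: one iteration per column.
    # Assumption: an empty column gets a mean of 0.0.
    return [total[j] / count[j] if count[j] else 0.0
            for j in range(n_cols)]


# Hypothetical data: three values in column 0, one in column 1, none in 2.
values = [2.0, 4.0, 6.0, 1.0]
cols = [0, 0, 0, 1]

means = column_means(values, cols, 3)
print(means)
```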

20 Z-Score
• Calculating the standard deviation per column
• Given an M × N matrix, the GPU sums up the squared deviations of each column and produces a vector of size N holding the standard deviations
    Value: V1 V2 V3 … Vt
    Col:   C1 C2 C3 … Ct
    Mean:  M1 M2 M3 … MN
• Kernel 1, for each of the t GPU threads:
    colIdx = col[idx]
    colMean = mean[colIdx]
    deviate = value[idx] - colMean
    deviatePower2 = deviate * deviate
    atomicAdd(deviatePower2, variance[colIdx])
• Kernel 2, for each of the N GPU threads:
    colVariance = variance[idx]
    colVarianceSqrt = sqrt(colVariance / M)
    deviation[idx] = colVarianceSqrt
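The standard deviation step can be sketched the same way, again as sequential Python rather than CUDA. The sketch follows the pseudocode above literally: only stored values contribute squared deviations, and the sum is divided by the number of rows M. The name `column_std`, the parameter `n_rows` (standing for M), and the sample data are hypothetical.

```python
import math


def column_std(values, cols, means, n_rows):
    """Per-column standard deviation, following the two kernels above."""
    variance = [0.0] * len(means)

    # Kernel 1: accumulate squared deviations (atomic adds on the GPU).
    for idx in range(len(values)):
        col_idx = cols[idx]
        deviate = values[idx] - means[col_idx]
        variance[col_idx] += deviate * deviate

    # Kernel 2: divide by the number of rows and take the square root.
    return [math.sqrt(v / n_rows) for v in variance]


# Hypothetical 2-row matrix with two fully populated columns.
values = [1.0, 3.0, 5.0, 5.0]
cols = [0, 0, 1, 1]
means = [2.0, 5.0]

sds = column_std(values, cols, means, n_rows=2)
print(sds)
```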

21 Z-Score
• Calculating the standard score
• Given an M × N matrix with t non-zero elements, the GPU produces the z-score of each element
    Value: V1 V2 V3 … Vt
    Col:   C1 C2 C3 … Ct
    Mean:  M1 M2 M3 … MN
    SD:    S1 S2 S3 … SN
• For each of the t GPU threads:
    colIdx = col[idx]
    colMean = mean[colIdx]
    standardDev = sd[colIdx]
    z = (value[idx] - colMean) / standardDev
    zscore[idx] = z
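The final kernel is a pure element-wise map, which the following plain-Python sketch mirrors with one loop iteration per stored value. The name `z_scores` and the sample vectors are hypothetical, and the sketch assumes every column that holds a value has a non-zero standard deviation.

```python
def z_scores(values, cols, means, sds):
    """Compute z = (value - column mean) / column standard deviation."""
    zscore = [0.0] * len(values)
    for idx in range(len(values)):  # one iteration per GPU thread
        col_idx = cols[idx]
        col_mean = means[col_idx]
        standard_dev = sds[col_idx]  # assumed to be non-zero
        zscore[idx] = (values[idx] - col_mean) / standard_dev
    return zscore


# Hypothetical inputs: two values in column 0 and one in column 1.
values = [1.0, 3.0, 6.0]
cols = [0, 0, 1]
means = [2.0, 4.0]
sds = [1.0, 2.0]

z = z_scores(values, cols, means, sds)
print(z)
```

A value one standard deviation below its column mean maps to -1.0 and one standard deviation above maps to 1.0, regardless of the column's scale.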

22 Applicability
• Expertise identification
• Dependency identification
• Expertise breadth identification

23 Results
• Evaluation time (support and confidence)
• [file|commit] (34,335 × 7,578)
  • CPU: 696 minutes | GPU: 0.7 minutes | Speed-up: 994
• [method|commit] (305,551 × 7,578)
  • CPU: N/A | GPU: 5 minutes | Speed-up: -
* An Intel Core 2 Quad Q6600 2.40 GHz PC with 4 GB of RAM and an NVIDIA GeForce GTX 580 graphics card was used.

24 Results
• [Developer|File|Time]: 114 layers of 36 × 3,400 (13,953,600 elements)
• [Developer|Method|Time]: 114 layers of 36 × 43,788 (179,705,952 elements)

EBD computation time (seconds)
          | EBD when considering files    | EBD when considering methods
          | Mean & SD | Z-Score | Total   | Mean & SD | Z-Score  | Total
CPU       | 2.19      | 301.23  | 303.42  | 424.71    | 1,573.60 | 1,998.31
GPU       | 0.10      | 19.49   | 19.59   | 8.55      | 203.46   | 212.01
Speed-up  | 21.90     | 15.45   | 15.48   | 49.67     | 7.73     | 9.42
* EBD = Expertise Breadth of a Developer.

25 Results
• J. R. da Silva, E. Clua, L. Murta, and A. Sarma. Niche vs. breadth: Calculating expertise over time through a fine-grained analysis. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 409–418, Mar. 2015.
• J. R. da Silva, E. Clua, L. Murta, and A. Sarma. Multi-Perspective Exploratory Analysis of Software Development Data. International Journal of Software Engineering and Knowledge Engineering, 25(01):51–68, 2015.
• J. R. da Silva Junior, E. Clua, L. Murta, and A. Sarma. Exploratory Data Analysis of Software Repositories via GPU Processing. 26th SEKE, 2014.

26 Conclusions
• The main contribution is the use of the GPU for solving Software Engineering problems
• Employing the GPU allows seamless relationship manipulation at interactive rates
• Matrices are used underneath to represent the building blocks
• Dominoes opens a new realm of exploratory software analysis, as endless combinations of Dominoes’ pieces can be tried in an exploratory fashion
• Thanks to the use of the GPU, users can run their analyses on their own machines

27 Questions
jricardo@ic.uff.br
http://www.josericardojunior.com
https://twitter.com/jricardojunior
https://br.linkedin.com/in/jose-ricardo-da-silva-junior-7299987
