Do Developers Feel Emotion? An Exploratory Analysis of Emotions.
Motivation • Feelings and emotions dictate to a large extent our actions and decisions. • Developers’ potential and productivity can only be fully unlocked if people feel safe and happy. • It is important to support managers and project leaders in detecting emotions.
Final Goal • Building a tool for automatic emotion detection. A first step: • Can emotions actually be detected from issue reports? • If so, can humans actually agree on the identified emotions?
Our approach • A significant sample of developers’ comments from the Apache issue repository was analyzed based on Parrott’s emotional framework. • Can human raters, without any training, agree on the presence of emotions in issue reports? • Does training improve the agreement of human raters? • Does context improve the agreement of human raters?
Related Work Ahmed Hassan et al. tried to answer these questions: • What is the personality type of OSS developers? • Does the language and attitude of a developer change as he or she moves from being a current to a departing developer?
Related Work • Guzman et al. proposed an approach to improve emotional awareness in software development teams by means of quantitative emotion summaries. • Their approach automatically extracts and summarizes emotions expressed in collaboration artifacts by combining probabilistic topic modelling with lexical sentiment analysis techniques.
Emotion Mining • Emotion mining tries to identify the presence of emotions like joy or fear • Sentiment analysis evaluates a given emotion as being positive or negative
Emotion Mining in Software Engineering • Applied to text artifacts, it can provide hints on factors responsible for joy and satisfaction, or fear and anger, among developers. • It provides a different perspective from which to interpret productivity and job satisfaction.
Parrott’s Framework
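Parrott’s framework is only named on this slide (the hierarchy itself was presumably shown as a figure). Below is a minimal Python sketch of the primary level used in this study, with a few illustrative secondary emotions per branch; the secondary emotions listed are a partial, illustrative selection, and Parrott’s full framework also defines a tertiary level.

```python
# Partial sketch of Parrott's hierarchical emotion framework:
# the six primary emotions used as rating labels in this study,
# each branching into secondary (and tertiary) emotions.
# Only a few secondary emotions are listed here for illustration.
PARROTT_PRIMARY = {
    "love":     ["affection", "longing"],
    "joy":      ["cheerfulness", "contentment", "optimism"],
    "surprise": ["surprise"],
    "anger":    ["irritation", "rage", "disgust"],
    "sadness":  ["disappointment", "shame", "suffering"],
    "fear":     ["nervousness", "horror"],
}
```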
Issue Tracking System • A repository used by software companies to organize software maintenance and evolution. • Team members submit and discuss issues, including bugs and feature requests, ask for advice, or share opinions. • It might reveal how committers feel towards a bug, feature, project or even their colleagues. • Each issue is characterized by several attributes, like priority, status, and type (improvement, perfective maintenance, new feature, corrective maintenance, adaptive maintenance).
Experimental Setup • Goal: understand the kinds of emotions found in issue reports • Four authors rated issue reports from open source systems • Analyzing the identified emotions and the raters’ agreement
Dataset • Issue repository of the Apache Software Foundation • Hosts 117 open source projects, ranging from large, long-lived projects to small ones, providing representative data
Dataset • Issue reports from 19 October 2000 until July 2013 • Developers’ comments + issue report attributes • No distinction between bugs, new features, and enhancements • Granularity: issue comment level • Enough issue comments to obtain a 95% confidence level.
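The slide does not show the sample-size calculation behind the 95% confidence level. A minimal sketch, assuming the standard Cochran formula with a 5% margin of error and worst-case variability (p = 0.5), which lands near the sample sizes used later in the talk, is:

```python
import math

def cochran_sample_size(z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Minimum number of comments needed to estimate a proportion at the
    given confidence level (z-score), margin of error, and expected proportion p."""
    return math.ceil((z ** 2) * p * (1 - p) / (margin ** 2))

# 95% confidence (z = 1.96), 5% margin of error, worst-case p = 0.5
print(cochran_sample_size())  # 385, i.e. roughly the ~384 comments per sample
```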
Emotion Mining • Each rater identified the emotions associated with each comment according to Parrott’s six primary emotions: love, joy, surprise, anger, sadness, fear • Personal rating • Based on a common understanding of Parrott’s framework • No ground truth: agreement is considered as correct; agreement is determined by majority vote
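As a concrete illustration of the rating scheme and the majority-vote aggregation described above, here is a small hypothetical sketch; the actual annotation in the study was done manually, and the rater labels below are invented for illustration only.

```python
from collections import Counter

PRIMARY_EMOTIONS = ("love", "joy", "surprise", "anger", "sadness", "fear")

# Hypothetical ratings: each rater marks which primary emotions are
# present in a given comment (a binary decision per emotion).
ratings_for_comment = [
    {"love"},         # rater A
    {"love", "joy"},  # rater B
    {"love"},         # rater C
    set(),            # rater D (no emotion)
]

def majority_vote(rater_labels):
    """Assign an emotion to a comment only if more than half of the
    raters marked it as present (no ground truth, so majority decides)."""
    counts = Counter(label for labels in rater_labels for label in labels)
    quorum = len(rater_labels) / 2
    return {e for e in PRIMARY_EMOTIONS if counts[e] > quorum}

print(majority_vote(ratings_for_comment))  # {'love'}
```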
Examples • I'm not so convinced that moving all the static methods out is useful (Fear). • How is a bunch of static methods on a utility class easier than a bunch of static methods within the HtmlCalendarRenderer better? (Anger) • The risk of introducing new bugs for no great benefit (Fear). • Previously almost all these helper methods were private; this patch makes them all public [...] (Neutral)
Measuring Agreement • Degree of inter-rater agreement • Cohen’s kappa for two raters • Fleiss’ kappa for more than two raters
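A minimal sketch of how such agreement scores can be computed in Python, assuming scikit-learn and statsmodels are available (the slide does not say which tooling the authors actually used, and the ratings below are invented):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score                              # two raters
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa   # more than two

# Hypothetical binary ratings (1 = emotion present, 0 = absent) for one
# emotion, e.g. "love", over eight issue comments.
rater_a = [1, 0, 0, 1, 0, 0, 0, 1]
rater_b = [1, 0, 0, 0, 0, 0, 0, 1]

# Cohen's kappa: chance-corrected agreement between two raters.
print(cohen_kappa_score(rater_a, rater_b))

# Fleiss' kappa: chance-corrected agreement among more than two raters.
rater_c = [1, 0, 1, 1, 0, 0, 0, 1]
ratings = np.array([rater_a, rater_b, rater_c]).T   # shape: (subjects, raters)
table, _ = aggregate_raters(ratings)                # counts per category per subject
print(fleiss_kappa(table))
```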
Question 1 • Can human raters, without any training, agree on the presence of emotions in issue reports? • Motivation: Emotion mining from software development artifacts is not trivial, since they consist of unstructured data, are relatively short, and are written in an informal way.
Question 1: Approach • 400 issue report comments were arbitrarily assigned to two of the raters. • Each author selected the emotions that were present in the comment • Once all comments had been annotated, the four files were collected and analyzed using Cohen’s kappa.
Question 1: Results • In 41% of the comments, the raters agreed on all 6 emotions, whereas 85% of comments do not contain any emotion • Only for Love did the raters achieve more than slight agreement (moderate agreement). • At most 6.5% agreed on the presence of a particular emotion (Love), and up to 96.75% on the absence of an emotion (Surprise).
Result • While some emotions obtain higher agreement than others, only one emotion obtained moderate agreement, and raters agree the most on the absence of an emotion.
Question 2 • Does training improve the agreement of human raters on the presence of emotions in issue reports? • Motivation: Without thorough training, raters achieve only slight agreement. This leads to the current question.
Question 2: Approach • Each rater compiled a list of generic expressions he or she felt insecure about • A general example and the associated emotion were added for each expression • 144 expressions were obtained • A meeting was held for discussion • A replication and a refinement study were performed
Question 2: Replication and Refinement Study • The replication repeated our study of RQ1 on a second sample. • The refinement study revisited the 235 comments of RQ1 with at least one emotion disagreement; all four authors decided on the occurrence of each emotion. • Why was the refinement done?
Question 2: Results • In 65% of comments, the raters agreed on all 6 emotions • Four out of six emotions improved from slight to fair agreement: Joy, Anger, Sadness and Fear • 4.17% agreed on the presence of an emotion (Love) • 72.76% of comments obtained agreement by at least 3 raters.
Result • Training improves the overall agreement on emotions, as well as for most of the individual emotions. Love, joy and sadness are the most common emotions.
Question 3 • Does context improve the agreement of human raters on the presence of emotions in issue reports? • Motivation: the previous experiments can be compared to eavesdropping on a group and catching just one phrase. • Due to the technical and unstructured nature of software development artifacts, the impact of context might be different than in literary English.
Question 3: Example • Sentence: “yeah right” • “moving to java 8 we solve all problems” • “breaking backward compatibility is risky”
Question 3: Approach • Experiment with two steps: • Replication of the study of RQ2: 384 comments, two raters • The same analysis with the context of those comments
Question 3: Results • Adding context reduces rater agreement for love • More raters change their mind for comments with context • Context seems to make raters doubt about their rating, introducing more disagreement.
Discussion • A. Impact of Context: • At first, our findings seem counter-intuitive. • Using a simple yes/no decision as rating is too great a simplification; instead, multiple rating levels should be used. • B. Do Emotions Really Matter for Issue Reports? • Our findings suggest there is a link between emotions and software development: reports with the “love” emotion tend to have a lower number of comments and a shorter fixing time.
Threats to Validity • Internal validity: We rely on the presence of a causal relationship between a developer’s emotions and what he or she writes in issue report comments. • Construct validity: Ambiguity of messages and subjectivity of emotions. To reduce it: • Parrott’s framework is adopted • explanation and clarification of the framework • each comment was analyzed by at least two authors
Threats to Validity • External validity: Replications of this work on other open source systems and on commercial projects are needed to confirm our findings. • Reliability: No ground truth exists to compare our findings against. We expect that different groups of raters would overall obtain the same results.
Conclusion • Software development, as collaborative activity of developers, is influenced by human emotions. • Issue reports do express emotions towards design choices, maintenance activity or colleagues. • Love, joy and sadness are easier to agree on. • Emotion mining can improve through training • Some challenges like the impact of context need to be studied more, on more data sources and systems.