Writing Reusable Code Feedback at Scale with Mixed-Initiative Program Synthesis

Andrew Head*, Elena Glassman*, Gustavo Soares*, Ryo Suzuki, Lucas Figueredo, Loris D'Antoni, Björn Hartmann
*These three authors contributed equally to this work.
When Writing Feedback on Student Code, Teachers Can Draw on Deep Domain Knowledge

[Figure: incorrect student code submissions, each annotated with a teacher comment, e.g. "What happens when n is zero? Hint: look at lecture 5's slides.", "While this helper function is useful, it does not handle the ca…", "Have you considered what would happen if combiner was set…"]

…but it does not scale.
In Lieu of Teacher-Written Feedback, an Autograder Shows Test Cases

[Figure: a student submission runs through the course autograder, which returns test case results.]

…but there's still a gulf of evaluation.
Program Synthesis Techniques Can Shrink the Gulf by Automatically Finding and Suggesting Bug Fixes for Students

[Figure: a student submission and its test case results feed into a synthesizer, which suggests a fix: "In line 2, change total = 0 to total = 1."]

…but the automatically generated feedback is often mechanical and formulaic. Related systems: AutoGrader [PLDI '13], AutomataTutor [TOCHI '15], CodeAssist [FSE '16]

Can we combine teachers' deep domain knowledge with program synthesis to give students better feedback?
Program Synthesis: Learning Code Transformations from Pairs of Incorrect and Correct Submissions

[Figure: Student 1 fixes an iterative solution and Student 2 fixes a recursive solution; both pairs of edits generalize to a single code transformation.]
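The core idea can be sketched in a few lines of Python, using the "total = 0" to "total = 1" seed bug from the motivating example. This is a deliberately naive, hypothetical stand-in: the real backend (Refazer) synthesizes generalized AST-level transformations, while the sketch below learns a single line-level rewrite from one (incorrect, correct) pair and reapplies it to another student's structurally different submission.

```python
# Naive stand-in for learning a bug-fixing transformation from one
# (incorrect, correct) pair. Refazer works on ASTs; this sketch works
# on lines of text purely for illustration. All submissions below are
# hypothetical.

def learn_rewrite(incorrect, correct):
    """Return the first differing line pair as a (before, after) rule."""
    for bad, good in zip(incorrect.splitlines(), correct.splitlines()):
        if bad != good:
            return bad.strip(), good.strip()
    return None

def apply_rewrite(rule, submission):
    """Apply the learned rewrite to each matching line of a submission."""
    before, after = rule
    fixed = []
    for line in submission.splitlines():
        if line.strip() == before:
            line = line.replace(before, after)
        fixed.append(line)
    return "\n".join(fixed)

# Student 1's pair: wrong accumulator seed in an iterative product.
incorrect_1 = ("def product(n, f):\n"
               "    total = 0\n"
               "    while n > 0:\n"
               "        total = total * f(n)\n"
               "        n = n - 1\n"
               "    return total")
correct_1 = incorrect_1.replace("total = 0", "total = 1")

rule = learn_rewrite(incorrect_1, correct_1)  # ('total = 0', 'total = 1')

# Student 2 made the same mistake in a different loop-based solution;
# the generalized rule fixes it too.
incorrect_2 = ("def product(n, f):\n"
               "    total = 0\n"
               "    for k in range(1, n + 1):\n"
               "        total *= f(k)\n"
               "    return total")
fixed_2 = apply_rewrite(rule, incorrect_2)
```

Because the rule abstracts away everything except the buggy line, the fix transfers across structurally different solutions; Refazer achieves the same transfer far more robustly by generalizing over syntax trees rather than raw text.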
Program Synthesis: Learning Bug-Fixing Code Transformations
We Scale Up a Little Teacher-Written Feedback by Attaching It to Code Transformations

[Figure: several incorrect student submissions share one code transformation ("add base case"); a single teacher comment, "What happens when n is zero? Hint: look at lecture 5's slides on base cases.", is attached to the transformation and reaches all of them.]
Two Interfaces for Attaching Feedback to Code Transformations

MistakeBrowser: giving feedback on clusters.

[Figure: the system learns transformations from the autograder's history of incorrect submissions and final correct submissions, clusters incorrect submissions by transformation, and collects teacher feedback on each cluster into a feedback bank.]

Related systems: Divide and Conquer [ITS '14], AutoStyle [ITS '16]
Two Interfaces for Attaching Feedback to Code Transformations

FixPropagator: attaching feedback to individual fixes.

[Figure: the system learns transformations from, and collects feedback into a feedback bank from, a teacher who picks a submission, fixes it, and writes a hint.]
Our Program Synthesis Backend: Refazer

Refazer (/hɛ.fa.ˈze(h)/) means "to redo." Using Refazer [ICSE '17] as a backend, our systems learn bug-fixing code transformations.
Contributions

• An approach for combining human expertise with program synthesis to deliver reusable, scalable code feedback
• Implementations of two systems that use our approach: FixPropagator and MistakeBrowser
• In-lab studies suggesting that the systems fulfill our goals and also inform teachers about common student bugs
Outline

• Related Work
• Program Synthesis
• Systems
• Evaluation
System Design

[Figure: a mixed-initiative workflow [L@S '17]. Through teacher-facing interfaces, teachers demonstrate fixes and write feedback; the Refazer program synthesis backend [ICSE '17] suggests fixes and feedback in return.]
Systems: MistakeBrowser

[Figure: the MistakeBrowser workflow.]
1. Students submit incorrect code and, eventually, a final correct submission.
2. The teacher uploads test cases.
3. The system learns transformations from pairs of incorrect and correct submissions, and clusters submissions by transformation.
4. The teacher writes feedback for each cluster.
5. Next semester, the system finds the transformation that fixes a new submission and returns the feedback written for it.
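The clustering step in this workflow can be sketched as follows, under the simplifying assumption that a learned transformation is just a (before, after) line rewrite and that a submission belongs to a cluster when the rewrite makes it pass the teacher's tests. All rules, tests, and submissions here are hypothetical stand-ins for what Refazer and the autograder would provide.

```python
# Sketch of MistakeBrowser's clustering: group each incorrect
# submission under the first learned transformation that fixes it,
# then attach one teacher comment per cluster. Rules here are naive
# text rewrites standing in for Refazer's learned transformations.

def passes_tests(src, tests):
    """Run a submission and check it against the teacher's test cases."""
    ns = {}
    try:
        exec(src, ns)
        return all(t(ns) for t in tests)
    except Exception:
        return False

def cluster_by_fix(submissions, rules, tests):
    """Map each transformation to the submissions it fixes."""
    clusters = {rule: [] for rule in rules}
    for sub in submissions:
        for before, after in rules:
            if passes_tests(sub.replace(before, after), tests):
                clusters[(before, after)].append(sub)
                break
    return clusters

# Hypothetical test cases uploaded by the teacher.
tests = [lambda ns: ns['product'](3, lambda x: x) == 6,
         lambda ns: ns['product'](0, lambda x: x) == 1]

# Hypothetical transformations learned from past (incorrect, correct) pairs.
rules = [('total = 0', 'total = 1'),          # wrong accumulator seed
         ('range(1, n)', 'range(1, n + 1)')]  # off-by-one loop bound

submissions = [
    "def product(n, f):\n    total = 0\n"
    "    for k in range(1, n + 1):\n        total *= f(k)\n    return total",
    "def product(n, f):\n    total = 1\n"
    "    for k in range(1, n):\n        total *= f(k)\n    return total",
]

clusters = cluster_by_fix(submissions, rules, tests)

# One teacher comment per cluster, reused for every member.
feedback = {rules[0]: "What should a running product start at?",
            rules[1]: "Check your loop's upper bound."}
```

Next semester, a new incorrect submission can be matched against the same rules and immediately receive the comment written for its cluster.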
Systems: MistakeBrowser

[Screenshots: the MistakeBrowser interface.]
[Screenshot: example feedback shown in MistakeBrowser: "Looks like you're writing a recursive call. What might you be missing to enable recursion?"]
But Not All Classes Have Submission Histories for Hundreds of Students
Systems: FixPropagator

[Figure: the FixPropagator workflow.]
1. Students submit incorrect code.
2. The teacher uploads test cases, picks a submission, fixes it, and writes a hint.
3. The system learns transformations, makes clusters, and attaches feedback.
4. The system suggests fixes and feedback for further submissions; the teacher accepts or modifies the suggestions.
5. The system returns feedback to students.
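Under the same simplifying assumptions (a learned transformation as a naive line rewrite rather than Refazer's AST-level transformations), the propagation step could look like the sketch below; the submissions, tests, and hint are hypothetical.

```python
# Sketch of FixPropagator's flow: the teacher fixes one submission and
# writes one hint; the system reuses both for later submissions that
# the learned rewrite also fixes, and routes non-matching submissions
# back to the teacher.

def learn_rewrite(incorrect, correct):
    """Naive stand-in for Refazer: the first differing line becomes the rule."""
    for bad, good in zip(incorrect.splitlines(), correct.splitlines()):
        if bad != good:
            return bad.strip(), good.strip()
    return None

def propagate(rule, hint, submission, tests):
    """Suggest the learned fix and reuse the hint if the fix passes the tests."""
    before, after = rule
    candidate = submission.replace(before, after)
    ns = {}
    try:
        exec(candidate, ns)
        if all(t(ns) for t in tests):
            return candidate, hint
    except Exception:
        pass
    return None, None  # the bug doesn't match; ask the teacher for a new fix

tests = [lambda ns: ns['product'](3, lambda x: x) == 6,
         lambda ns: ns['product'](0, lambda x: x) == 1]

# The teacher picks one submission, fixes it, and writes a hint once.
incorrect = ("def product(n, f):\n    total = 0\n    while n > 0:\n"
             "        total = total * f(n)\n        n = n - 1\n    return total")
correct = incorrect.replace("total = 0", "total = 1")
hint = "What should a running product start at?"
rule = learn_rewrite(incorrect, correct)

# A later submission with the same bug gets the fix and hint automatically.
later = ("def product(n, f):\n    total = 0\n"
         "    for k in range(1, n + 1):\n        total *= f(k)\n    return total")
suggested_fix, reused_hint = propagate(rule, hint, later, tests)
```

The teacher still reviews each suggestion, accepting or modifying the suggested fix and hint before they reach the student.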
Systems: FixPropagator

[Screenshots: the FixPropagator interface.]
[Screenshot: a new student submission with the same bug, alongside the suggested fix.]
Both Fixes and Feedback Can Be Further Modified
A Study of the Systems

Participants: current and former teaching staff from CS1. MistakeBrowser (N = 9), FixPropagator (N = 8).
Interface walkthrough (5 min.)
Main task (30 min.): giving feedback on student submissions.
Measurements: feedback, manual corrections, responses to feedback recommendations (accepted, changed, rejected), between-task surveys.
Qualitative feedback: survey and post-study interview.
1. Can a few manual corrections fix many submissions?

FixPropagator propagates fixes from dozens of corrections to hundreds of submissions.

[Chart: median number of submissions given feedback by the teacher alone vs. by FixPropagator, on a scale from 0 to 250.]

• Fixes were propagated within minutes (median = 2m20s, σ = 7m34s per correction).
2. How often is a teacher's feedback relevant when it is matched to other students' submissions?

Feedback propagated with FixPropagator was correct a majority of the time, but not always. Teachers reused feedback a median of 20 times, modifying it a median of 6 times (30%).

Generalizable comment: "Check if you have the product of the correct number of terms."
Non-generalizable comment: "Your starting value of z should be a function, not an int."
2. How often is a teacher's feedback relevant when it is matched to other students' submissions?

MistakeBrowser created conceptually consistent clusters of student bugs.

[Chart: teachers' responses to "Do these submissions share the same misconception?" for N = 11 clusters. Response options ranged from "no" or "no idea" through 50%, 75%, and almost 100%, to 100%; the y-axis shows the percentage of clusters (0-40%).]
Evaluation Questions

1. Can a few manual corrections fix many submissions? With a median of 10 corrections, FixPropagator suggested fixes for a median of 201 submissions.

2. How often is a teacher's feedback relevant when it is matched to another student's submission? Matched feedback was relevant ~75% of the time.
Limitations

• The impact of teacher feedback on student learning outcomes has not been evaluated.
• The learned code transformations fix only submissions that are one or two bugs away from correct.
Conclusion

We present an approach for combining human expertise with program synthesis to deliver reusable, scalable code feedback, and two systems implementing this approach: MistakeBrowser and FixPropagator.

Questions?