Where's The Reward? A Review of Reinforcement Learning for Instructional Sequencing
Shayan Doroudi

Research Question: Over the past 50 years, how successful has RL


  1. The Dark Ages (c. 1972 - 2000s)
     - By 1970s: Howard, Smallwood, Matheson et al. go back to operations research (sans education)
     - 1975: Atkinson leaves research (for administrative positions)

  3. Suppes (1974), The Place of Theory in Educational Research, AERA Presidential Address: “The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”

     Atkinson (2014): “work [on MOOCs] is promising, but the key to success is individualizing instruction, and necessarily that requires a psychological theory of the learning process”

  9. Second Wave: 2000s

     Why 2000s?
     - Intelligent Tutoring Systems
     - Reinforcement Learning formed as a field
     - AIED/EDM: studying statistical models of learning

     Parallels 1960s:
     - Teaching machines and Computer-Assisted Instruction
     - Dynamic Programming and Markov Decision Processes
     - Mathematical Psych: studying mathematical models of learning

  12. Reinforcement Learning and AI in Education / ITS
      - Reinforcement Learning: Andrew Barto, Balaraman Ravindran
      - AI in Education / ITS: Beverly Woolf, Joe Beck

  13. Reinforcement Learning and AI in Education / ITS
      - Emma Brunskill
      - Vincent Aleven
      - Shayan Doroudi

  18. The Third Wave: What Lies in the Horizon

      Why 2010s?
      - Massive Open Online Courses (MOOCs)
      - Deep Reinforcement Learning formed as a field
      - Deep Learning: building deep models of learning
      - 35% increase in papers/books mentioning “reinforcement learning” from 2016 to 2017 (Google Scholar)

  21. Three Waves: Summary

      | | First Wave (1960s-70s) | Second Wave (2000s-2010s) | Third Wave (2010s) |
      |---|---|---|---|
      | Medium of Instruction | Teaching Machines / CAI | Intelligent Tutoring Systems | Massive Open Online Courses |
      | Optimization Models | Decision Processes | Reinforcement Learning | Deep RL |
      | Models of Learning | Mathematical Psychology | Machine Learning, AIED/EDM | Deep Learning |

      Trend annotations on the slide: "More data-generating" (next to Medium of Instruction); "More data-driven" (next to Optimization Models and Models of Learning).

  22. Overview
      - Reinforcement Learning: Towards a “Theory of Instruction”
      - Part 1: Historical Perspective
      - Part 2: Systematic Review
      - Discussion: Where's the Reward?
      - Part 3: Case Study
      - Planning for the Future

  27. Inclusion Criteria

      We consider any papers where:
      - There is (implicitly) a model of the learning process, where different instructional actions probabilistically change the state of a student.
      - There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.
      - Data collected from students are used to learn either the model or an adaptive policy.
      - If the model is learned, the instructional policy is designed to (approximately) optimize that model according to some reward function.
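To make the inclusion criteria concrete, here is a minimal Python sketch of one system that would satisfy them. It is an illustration only, not a method taken from any reviewed paper: it assumes a Bayesian Knowledge Tracing-style model per skill, a reward equal to the expected number of mastered skills, and a one-step (myopic) policy. The names `SkillParams`, `update_belief`, and `myopic_policy`, the skill names, and all parameter values are hypothetical; in a real study the parameters would be fit to student data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillParams:
    """Hypothetical per-skill parameters (would be learned from student data)."""
    p_learn: float  # P(unmastered -> mastered) after one practice item
    p_slip: float   # P(incorrect response | mastered)
    p_guess: float  # P(correct response | unmastered)


def posterior_after_response(p_mastered, correct, sp):
    """Bayes update of P(mastered) given one observed response."""
    if correct:
        num = p_mastered * (1 - sp.p_slip)
        den = num + (1 - p_mastered) * sp.p_guess
    else:
        num = p_mastered * sp.p_slip
        den = num + (1 - p_mastered) * (1 - sp.p_guess)
    return num / den


def after_practice(p_mastered, sp):
    """Model of learning: practicing a skill probabilistically changes the state."""
    return p_mastered + (1 - p_mastered) * sp.p_learn


def update_belief(belief, skill, correct, params):
    """Fold one observation into the belief, then apply the learning transition."""
    new_belief = dict(belief)
    post = posterior_after_response(belief[skill], correct, params[skill])
    new_belief[skill] = after_practice(post, params[skill])
    return new_belief


def myopic_policy(belief, params):
    """Instructional policy: map the belief (a summary of past observations)
    to the skill whose practice gives the largest one-step expected increase
    in reward, where reward = expected number of mastered skills."""
    def expected_gain(skill):
        return after_practice(belief[skill], params[skill]) - belief[skill]
    return max(belief, key=expected_gain)


if __name__ == "__main__":
    params = {
        "fractions": SkillParams(p_learn=0.2, p_slip=0.1, p_guess=0.2),
        "decimals": SkillParams(p_learn=0.1, p_slip=0.1, p_guess=0.25),
    }
    belief = {"fractions": 0.3, "decimals": 0.3}
    for correct in (True, False, True):  # made-up student responses
        skill = myopic_policy(belief, params)
        belief = update_belief(belief, skill, correct, params)
        print(skill, {k: round(v, 3) for k, v in belief.items()})
```

A myopic policy is only an approximate optimizer of the model; a planner that looks several steps ahead would also fit the last criterion.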

  32. What's Not Included?
      - Adaptive policies that use hand-made or heuristic decision rules (rather than data-driven/optimized decision rules)
      - Experiments that do not control for everything other than sequence of instruction
      - Machine teaching experiments
      - Experiments that use RL for other educational purposes, such as generating data-driven hints (Stamper et al., 2013) or giving feedback (Rafferty et al., 2015)

  37. Review Overview
      - 27 studies empirically compare adaptive policy to baseline
      - ≥ 10 papers compare policies learned with student data in simulation
      - ≥ 16 papers build policies only on simulated data
      - ≥ 7 papers that propose using RL for instructional sequencing
      - ≥ 3 other papers with policies used on real students

  41. Review Overview

      Among papers with empirical comparisons:
      - 14 found a significant difference between adaptive policy and baseline
      - 2 found a significant aptitude-treatment interaction
        - Policy is significantly better for below-median learners
      - 2 found a significant difference between adaptive policy and some but not all baselines
      - 9 found no significant difference between policies
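As a quick arithmetic check, the four outcome counts can be tallied in a few lines of Python. The counts come from the slide; the dictionary keys are paraphrased labels, and treating the categories as mutually exclusive is an assumption (it is consistent with the counts summing to the 27 empirical studies).

```python
# Outcome counts among the empirical comparisons, as listed on the slide.
outcomes = {
    "significant difference vs. baseline": 14,
    "significant aptitude-treatment interaction": 2,
    "significant vs. some but not all baselines": 2,
    "no significant difference": 9,
}

total = sum(outcomes.values())  # should match the 27 empirical studies
shares = {label: count / total for label, count in outcomes.items()}

for label, share in shares.items():
    print(f"{label}: {outcomes[label]} of {total} ({share:.0%})")
```

Under that assumption, roughly half of the empirical studies report a significant difference against every baseline they tested, and a third report none.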

  42. Studies by Year

  43. Review Summary

  44. Overview
      - Reinforcement Learning: Towards a “Theory of Instruction”
      - Part 1: Historical Perspective
      - Part 2: Systematic Review
      - Discussion: Where's the Reward?
      - Part 3: Case Study
      - Planning for the Future

  50. Where's the Reward? The Pessimistic Story

      Studies with a significant difference were often constrained:
      - 7 of them only compare to random policy or other RL-induced policy
      - 9 of them were on paired-association tasks or concept learning tasks
        - Decent psychological understanding of how humans learn
      - 2 of the studies (+ 2 ATI studies) sequenced activity types rather than content
      - 2 of the studies did not optimize for learning
      - 1 study seems to have been “lucky”

  55. Where's the Reward? The Pessimistic Story

      Among papers without a significant difference:
      - Only 3 of them only compare to random policy or other RL-induced policy
      - Only 3 of them were on paired-association or concept learning tasks
      - Only 2 of them sequenced activity types rather than content

      Papers that showed no significant difference were generally more complex and ambitious in a number of dimensions.
