Reinforcement Learning for Interactive Theorem Proving in HOL4

  1. Reinforcement Learning for Interactive Theorem Proving in HOL4
  Minchao Wu (1), Michael Norrish (1,2), Christian Walder (1,2), Amir Dezfouli (2)
  (1) Research School of Computer Science, Australian National University
  (2) Data61, CSIRO
  September 14, 2020

  2-5. Overview
  ◮ Interface: HOL4 as an RL environment
    ◮ Enables interaction with HOL4.
    ◮ Monitor proof states on the Python side.
  ◮ Reinforcement learning settings
    ◮ Policies for choosing proof states, tactics, and theorems or terms as arguments.
  ◮ Learning: policy gradient

  6. Environment
  ◮ An environment can be created by specifying an initial goal.
    e = HolEnv(GOAL)
  ◮ An environment can be reset by providing a new goal.
    e.reset(GOAL2)
  ◮ The basic function is querying HOL4 about tactic applications.
    e.query("∀l. NULL l ⇒ l = []", "strip_tac")

  7. Environment
  The e.step(action) function applies an action to the current state and generates the new state.
  step takes an action and returns the immediate reward received and a Boolean value indicating whether the proof attempt has finished.
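
A minimal interaction loop built on the interface above; a sketch only. It assumes step returns the (reward, done) pair described on this slide, that an action is the (i, j, tactic) triple introduced later in the talk, and that HolEnv comes from the authors' Python interface (the import path below is hypothetical).

# Sketch of driving HOL4 from Python via HolEnv. The action encoding and the
# random tactic pool are assumptions made for illustration.
# from hol4_env import HolEnv   # hypothetical import path; the slides only name HolEnv
from random import choice

TACTICS = ["strip_tac", "simp[]", "fs[]", "metis_tac[]"]   # example tactic pool

e = HolEnv("∀l. NULL l ⇒ l = []")        # create an environment from an initial goal

total_reward = 0.0
for _ in range(50):                       # cap the number of proof steps
    action = (0, 0, choice(TACTICS))      # fringe 0, goal 0, a randomly chosen tactic
    reward, done = e.step(action)         # apply the tactic, observe reward and termination
    total_reward += reward
    if done:                              # the proof attempt has finished
        break

e.reset("∀x. x = x")                      # start again with a new goal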

  8. Demo ◮ A quick demo.

  9-15. RL Formalization
  ◮ A goal g ∈ G is a HOL4 proposition.
  ◮ A fringe is a finite set of goals.
  ◮ A fringe consists of all the remaining goals.
  ◮ The main goal is proved if everything in any one fringe is discharged.
  ◮ A state s is a finite sequence of fringes.
  ◮ A fringe can be referred to by its index i, i.e., s(i).
  ◮ A reward is a real number r ∈ ℝ.
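
The definitions above can be pictured with a small Python sketch; the type aliases below are illustrative, not part of the authors' interface, and goals are kept as raw HOL4 proposition strings.

# Illustrative encoding of goals, fringes, and states; the concrete
# representation used by the actual interface may differ.
from typing import List

Goal = str                  # a goal is a HOL4 proposition, kept here as text
Fringe = List[Goal]         # a fringe: all goals that remain to be proved
State = List[Fringe]        # a state: the finite sequence of fringes produced so far

# The two example fringes shown on the next slide:
s: State = [
    ["p ∧ q ⇒ p ∧ q"],                   # fringe 0: the original goal
    ["p ⇒ q ⇒ p", "p ⇒ q ⇒ q"],          # fringe 1: after applying strip_tac to fringe 0
]

# The main goal is proved once some fringe has no remaining goals.
proved = any(len(fringe) == 0 for fringe in s)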

  16. Examples
  Fringe 0
    0: p ∧ q ⇒ p ∧ q
  Fringe 1
    0: p ⇒ q ⇒ p
    1: p ⇒ q ⇒ q
  Figure: Example fringes and states

  17-22. RL Formalization
  ◮ An action is a triple (i, j, t) : ℕ × ℕ × tactic.
  ◮ i selects the i-th fringe in a state s.
  ◮ j selects the j-th goal within fringe s(i).
  ◮ t is a HOL4 tactic.
  ◮ Example: (0, 0, fs[listTheory.MEM])
  ◮ Rewards
    ◮ Successful application: 0.1
    ◮ Discharges the current goal completely: 0.2
    ◮ Main goal proved: 5
    ◮ Otherwise: -0.1
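
The reward scheme can be written as a small helper; the Boolean outcome flags below are hypothetical names, since the slides do not say how the interface reports each case.

def reward(applied_ok: bool, goal_discharged: bool, main_goal_proved: bool) -> float:
    # Reward values taken from the slide; the flags are illustrative placeholders.
    if main_goal_proved:
        return 5.0      # the main goal is proved
    if goal_discharged:
        return 0.2      # the selected goal is discharged completely
    if applied_ok:
        return 0.1      # the tactic applied successfully
    return -0.1         # failed or unproductive tactic application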

  23-27. Example
  Fringe 0
    0: p ∧ q ⇒ p ∧ q
  (0,0,strip_tac) yields
  Fringe 1
    0: p ⇒ q ⇒ p
    1: p ⇒ q ⇒ q
  (1,0,simp[]) yields Fringe 2; (1,0,Induct_on `p`) yields Fringe 3
  Fringe 2
    0: p ⇒ q ⇒ q
  Fringe 3
    0: p ⇒ q ⇒ q
    1: F ⇒ q ⇒ F
    2: T ⇒ q ⇒ T
  (2,0,simp[]) applied to Fringe 2 yields
  Fringe 4
    QED
  Figure: Example proof search

  28-31. Choosing fringes
  An action is a triple (i, j, t). Given state s:
  ◮ A value network V_goal : G → ℝ.
  ◮ The value v_i of fringe s(i) is defined by v_i = Σ_{g ∈ s(i)} V_goal(g).
  ◮ Sample from the following distribution: π_fringe(s) = Softmax(v_1, ..., v_|s|).
  ◮ By default, j is fixed to be 0. That is, we always deal with the first goal in a fringe.
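
A sketch of fringe selection under these definitions. V_goal is modelled as a small network over fixed-size goal embeddings; GoalValueNet, encode_goal, and the embedding dimension are illustrative assumptions rather than the authors' architecture.

import torch
import torch.nn as nn

class GoalValueNet(nn.Module):
    """Placeholder V_goal : G -> R over goal embeddings."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, goal_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(goal_embedding).squeeze(-1)

def choose_fringe(state, V_goal, encode_goal):
    """state: list of fringes, each a list of goals; encode_goal maps a goal to an embedding.
    Returns a sampled fringe index i and log pi_fringe(i | s)."""
    # v_i = sum over goals g in fringe s(i) of V_goal(g)
    values = torch.stack([
        sum(V_goal(encode_goal(g)) for g in fringe) for fringe in state
    ])
    dist = torch.distributions.Categorical(logits=values)   # Softmax(v_1, ..., v_|s|)
    i = dist.sample()
    return i.item(), dist.log_prob(i)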

  32-36. Generating tactics
  Suppose we are dealing with goal g.
  ◮ A tactic is either
    ◮ a tactic name followed by a list of theorem names, or
    ◮ a tactic name followed by a list of terms.
  ◮ A value network V_tactic : G → ℝ^D, where D is the total number of tactic names allowed.
  ◮ Sample from the following distribution: π_tactic(g) = Softmax(V_tactic(g)).
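
Tactic-name selection follows the same pattern: V_tactic maps a goal representation to D scores, one per allowed tactic name. The tactic pool and TacticValueNet below are illustrative assumptions.

import torch
import torch.nn as nn

TACTIC_NAMES = ["strip_tac", "simp", "fs", "metis_tac", "Induct_on"]   # example pool of size D

class TacticValueNet(nn.Module):
    """Placeholder V_tactic : G -> R^D."""
    def __init__(self, dim: int = 256, num_tactics: int = len(TACTIC_NAMES)):
        super().__init__()
        self.score = nn.Linear(dim, num_tactics)

    def forward(self, goal_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(goal_embedding)

def choose_tactic(goal_embedding, V_tactic):
    """Returns a sampled tactic name and log pi_tactic(t | g)."""
    dist = torch.distributions.Categorical(logits=V_tactic(goal_embedding))   # Softmax(V_tactic(g))
    t = dist.sample()
    return TACTIC_NAMES[t.item()], dist.log_prob(t)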

  37. Argument policy
  Figure: Generation of arguments. x_i holds the candidate theorems, h_i is a hidden variable, a_i is a chosen argument, and v_i are the values computed by the policy. Each theorem is represented by an N-dimensional tensor based on its tokenized expression in Polish notation. If we have M candidate theorems, then the shape of x_i is M × N. The representations are computed by a separately trained transformer.

  38. Generating arguments
  Given a chosen goal g. Each theorem is represented by an N-dimensional tensor based on its tokenized expression. Suppose we have M candidate theorems.
  Input: the chosen tactic or theorem t ∈ ℝ^N, the candidate theorems X ∈ ℝ^(M×N), and a hidden variable h ∈ ℝ^N.
  Policy: V_arg : ℝ^N × ℝ^(M×N) × ℝ^N → ℝ^N × ℝ^M
  Initialize the hidden variable h to t.
  l ← [t]
  Loop for the allowed length of arguments (e.g., 5):
    h, v ← V_arg(t, X, h)
    t ← sample from π_arg(g) = Softmax(v)
    append t to l
  Return l and the associated (log) probabilities.
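
The loop above can be sketched in Python as follows. The recurrent structure (hidden state initialised to the chosen tactic or theorem, scores over the M candidates, sampled argument fed back in) follows the slide; the concrete ArgPolicy network is an illustrative assumption, not the authors' architecture.

import torch
import torch.nn as nn

class ArgPolicy(nn.Module):
    """Placeholder V_arg : R^N x R^(M x N) x R^N -> R^N x R^M."""
    def __init__(self, n: int):
        super().__init__()
        self.cell = nn.GRUCell(n, n)        # updates the hidden variable h
        self.scorer = nn.Bilinear(n, n, 1)  # scores each candidate theorem against h

    def forward(self, t, X, h):
        h = self.cell(t.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        v = self.scorer(X, h.expand_as(X)).squeeze(-1)   # shape (M,)
        return h, v

def generate_arguments(t, X, V_arg, max_args: int = 5):
    """t: chosen tactic/theorem embedding of shape (N,); X: candidate theorems of shape (M, N)."""
    h = t                                     # initialise the hidden variable to t
    chosen, log_probs = [t], []
    for _ in range(max_args):                 # allowed length of arguments (e.g., 5)
        h, v = V_arg(t, X, h)
        dist = torch.distributions.Categorical(logits=v)   # pi_arg = Softmax(v)
        idx = dist.sample()
        log_probs.append(dist.log_prob(idx))
        t = X[idx]                            # the sampled theorem becomes the next input
        chosen.append(t)
    return chosen, log_probs                  # the argument list and its (log) probabilities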

  39-40. Generating actions
  Given state s, we now have some (log) probabilities:
  ◮ p(f | s) given by π_fringe.
  ◮ p(t | s, f) given by π_tactic.
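
The overview slide names policy gradient as the learning method. Below is a REINFORCE-style sketch of how the factored log-probabilities might be combined into a single update; the discount factor, the episode format, and the assumption that the argument log-probabilities from the previous slides are also available are all illustrative choices, not the authors' exact training loop.

import torch

def policy_gradient_loss(episode, gamma: float = 0.99):
    """episode: list of (log_probs, reward) pairs, one per action, where log_probs
    collects the log-probability tensors produced when the fringe, tactic, and
    arguments of that action were sampled."""
    # Discounted returns, computed backwards through the episode (gamma is an assumption).
    returns, G = [], 0.0
    for _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    loss = torch.zeros(())
    for (log_probs, _), G in zip(episode, returns):
        log_p_action = torch.stack(log_probs).sum()   # log p(f|s) + log p(t|s,f) + ...
        loss = loss - log_p_action * G                # REINFORCE objective, negated for minimisation
    return loss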
