Branes with Brains: Reinforcement learning in the landscape of intersecting brane worlds
Fabian Ruehle (University of Oxford)
String_Data 2017, Boston, 11/30/2017
Based on [work in progress] with Brent Nelson and Jim Halverson
Motivation - ML
‣ Three approaches to machine learning:
• Supervised Learning: Train the machine by telling it what to do
• Unsupervised Learning: Let the machine train without telling it what to do
• Reinforcement Learning: [Sutton, Barto '98 '17]
✦ Based on behavioral psychology
✦ Don't tell the machine exactly what to do, but reward "good" and/or punish "bad" actions
‣ AI = reinforcement learning + deep learning (neural networks) [Silver '16]
Motivation - RL
‣ Agents interact with an environment (e.g. the string landscape)
‣ Each interaction changes the state of the agent, e.g. the degrees of freedom parameterizing the string vacuum
‣ Each step is either rewarded (the action led to a more realistic vacuum) or punished (the action led to a less realistic vacuum)
‣ The agent acts with the aim of maximizing its long-term reward
‣ The agent repeats actions until it is told to stop (it found a realistic vacuum, or it gives up)
Outline ‣ String Theory setup: • Intersecting D6-branes on orbifolds of toroidal orientifolds ‣ Implementation in Reinforcement Learning (RL) • Basic overview • Implementing the RL code • Modelling the environment ‣ Preliminary results • Finding consistent solutions ‣ Conclusion
String Theory 101 Intersecting D6-branes on orbifolds of toroidal orientifolds
String Theory 101
‣ Have: IIA string theory in 9D + time with 32 supercharges
‣ Want: A theory in 3D + time with 4 supercharges
‣ Idea: Make the extra 6 dimensions so small that we do not see them
‣ How do we do that?
1. Make them compact
2. Make their diameter so small that our experiments cannot detect them
‣ Reduce supercharges from 32 to 4:
• Identify some points with their mirror image
String Theory 101 - Setup
‣ Why this setup?
• Well studied [Blumenhagen, Gmeiner, Honecker, Lust, Weigand '04 '05; Douglas, Taylor '07, ...]
• Comparatively simple [Ibanez, Uranga '12]
• Number of (well-defined) solutions known to be finite: [Douglas, Taylor '07]
✦ Use symmetries to relate different vacua
✦ Combine consistency conditions to rule out combinations
• BUT: The number of possibilities is so large that not a single "interesting" solution could be found despite enormous random scans (estimated odds of roughly 1 : 10^9)
• Seems Taylor-made for big data / AI methods
String Theory 101 - Compactification
‣ How to make a dimension compact? Pacman ⇒ leave the screen on one side, re-enter on the other (identify the two ends)
String Theory 101 - Compactification
[Figure: three tori T^2 with coordinates (x_i, y_i), i = 1, 2, 3]
‣ Now six compact dimensions, but the construction is too simple
‣ The resulting space is too simple (but only by a little bit)
‣ Make it a bit more complicated
String Theory 101 - Orbifolds
[Figure: T^2 with coordinates (x_1, y_1) and its quotient T^2/Z_2]
‣ Mathematically: (x_1, y_1) → (−x_1, −y_1)
‣ The resulting object is called an orbifold
‣ Need to also orientifold: (x_1, y_1) → (x_1, −y_1) (plus something similar for the string itself)
String Theory 101 - Winding numbers
[Figure: lines on T^2 with winding numbers (n, m) = (1, 0), (0, 1), (1, 2)]
‣ Winding numbers (n, m): how often a line wraps the two cycles of the torus
‣ Note: Due to the orientifold, include (n, −m) along with (n, m)
String Theory 101 - D6 branes
[Figure: our 3D space times a line on each of the three tori T^2 with coordinates (x_i, y_i)]
‣ D6 brane: our 3D + a line on each torus
‣ Can stack multiple D6 branes on top of each other
‣ Brane stack ⇔ tuple (N, n^1, m^1, n^2, m^2, n^3, m^3)
String Theory 101 - Gauge group and particles
‣ Observed gauge group: SU(3) × SU(2) × U(1)_Y
‣ N D6 branes on top of each other: U(N)
Special cases:
• N D6 branes parallel to the O6-plane: SO(2N)
• N D6 branes orthogonal to the O6-plane: Sp(N)
‣ Intersection of an N-brane stack and an M-brane stack:
particles in the representation (N, M̄)_(1, −1)
‣ Observed particles in the universe:
Quarks: 3 × (3, 2)_1 + 3 × (3̄, 1)_{−4} + 3 × (3̄, 1)_2
Leptons + Higgs: 4 × (1, 2)_{−3} + 1 × (1, 2)_3 + 3 × (1, 1)_6
String Theory 101 - MSSM
[Figure: two brane stacks (green and yellow) wrapping the three tori]
‣ Green and yellow intersect in 3 · 1 · 1 = 3 points
‣ Note: Counting intersections on the orbifold is a bit more subtle (a small sketch of the torus-level count follows below)
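As an illustration (not part of the original slides), a minimal sketch of the torus-level intersection-number count I_ab = ∏_i (n_a^i m_b^i − m_a^i n_b^i) for two factorizable brane stacks; the orbifold/orientifold subtleties mentioned above are ignored here, and the example winding numbers are hypothetical.

```python
# Minimal sketch (assumption: factorizable branes on T^2 x T^2 x T^2, ignoring
# orbifold/orientifold subtleties): the chiral intersection number of two brane
# stacks a, b is the product of the per-torus numbers n_a^i m_b^i - m_a^i n_b^i.

def intersection_number(windings_a, windings_b):
    """windings_* = [(n^1, m^1), (n^2, m^2), (n^3, m^3)] for one brane stack."""
    result = 1
    for (n_a, m_a), (n_b, m_b) in zip(windings_a, windings_b):
        result *= n_a * m_b - m_a * n_b
    return result

# Hypothetical winding numbers chosen so the per-torus factors are 3, 1, 1:
print(intersection_number([(1, -1), (1, 0), (0, 1)],
                          [(2, 1), (0, 1), (-1, 0)]))   # prints 3
```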
String Theory 101 - Consistency
‣ Tadpole cancellation: Balance the energy of the D6 and O6 branes:
$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} N_a\, n_a^1 n_a^2 n_a^3 \\ -N_a\, n_a^1 m_a^2 m_a^3 \\ -N_a\, m_a^1 n_a^2 m_a^3 \\ -N_a\, m_a^1 m_a^2 n_a^3 \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \\ 4 \\ 8 \end{pmatrix}$$
‣ K-Theory: Global consistency:
$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} 2 N_a\, m_a^1 m_a^2 m_a^3 \\ -N_a\, m_a^1 n_a^2 n_a^3 \\ -N_a\, n_a^1 m_a^2 n_a^3 \\ -2 N_a\, n_a^1 n_a^2 m_a^3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \;\mathrm{mod}\; 2$$
(a small consistency-check sketch follows below)
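A minimal sketch (not from the slides) of how one might check these two conditions in code. The stack encoding, function names, and in particular the tadpole right-hand-side values are taken from the reconstruction above and should be treated as illustrative assumptions, not a verified implementation.

```python
# Minimal sketch: check the tadpole and K-theory conditions for a list of brane
# stacks, each given as (N, [(n^1, m^1), (n^2, m^2), (n^3, m^3)]).

def tadpole_vector(stacks):
    totals = [0, 0, 0, 0]
    for N, w in stacks:
        (n1, m1), (n2, m2), (n3, m3) = w
        totals[0] += N * n1 * n2 * n3
        totals[1] += -N * n1 * m2 * m3
        totals[2] += -N * m1 * n2 * m3
        totals[3] += -N * m1 * m2 * n3
    return totals

def satisfies_tadpole(stacks, rhs=(8, 4, 4, 8)):   # rhs values assumed as on the slide
    return tuple(tadpole_vector(stacks)) == tuple(rhs)

def satisfies_k_theory(stacks):
    totals = [0, 0, 0, 0]
    for N, w in stacks:
        (n1, m1), (n2, m2), (n3, m3) = w
        totals[0] += 2 * N * m1 * m2 * m3
        totals[1] += -N * m1 * n2 * n3
        totals[2] += -N * n1 * m2 * n3
        totals[3] += -2 * N * n1 * n2 * m3
    return all(t % 2 == 0 for t in totals)
```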
String Theory 101 - Consistency
‣ SUSY (computational control): for all a = 1, …, #stacks:
$$m_a^1 m_a^2 m_a^3 - j\, m_a^1 n_a^2 n_a^3 - k\, n_a^1 m_a^2 n_a^3 - \ell\, n_a^1 n_a^2 m_a^3 = 0$$
$$n_a^1 n_a^2 n_a^3 - j\, n_a^1 m_a^2 m_a^3 - k\, m_a^1 n_a^2 m_a^3 - \ell\, m_a^1 m_a^2 n_a^3 > 0$$
‣ Pheno: SU(3) × SU(2) × U(1) + particles
‣ U(1)_Y, given by a combination T = (T_1, T_2, …, T_k) with k = # U(N) stacks, is massless iff:
$$\begin{pmatrix} 2N_1 m_1^1 & 2N_2 m_2^1 & \cdots & 2N_k m_k^1 \\ 2N_1 m_1^2 & 2N_2 m_2^2 & \cdots & 2N_k m_k^2 \\ 2N_1 m_1^3 & 2N_2 m_2^3 & \cdots & 2N_k m_k^3 \end{pmatrix} \begin{pmatrix} T_1 \\ T_2 \\ \vdots \\ T_k \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$
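A minimal sketch (again not from the slides) of the SUSY check for fixed torus moduli j, k, ℓ; a full treatment would instead ask whether *some* positive moduli exist that satisfy the conditions for all stacks simultaneously. The stack encoding matches the sketch above.

```python
# Minimal sketch: check the SUSY conditions for fixed (hypothetical) moduli j, k, l.
# stacks: list of (N, [(n^1, m^1), (n^2, m^2), (n^3, m^3)]).

def is_susy(stacks, j, k, l, tol=1e-9):
    for N, w in stacks:
        (n1, m1), (n2, m2), (n3, m3) = w
        eq = m1*m2*m3 - j*m1*n2*n3 - k*n1*m2*n3 - l*n1*n2*m3
        ineq = n1*n2*n3 - j*n1*m2*m3 - k*m1*n2*m3 - l*m1*m2*n3
        if abs(eq) > tol or ineq <= 0:
            return False
    return True
```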
String Theory 101 - IIA state space
‣ State space gigantic
• Choose a maximal value w_max for the winding numbers
• Let N_B be the number of possible winding-number combinations (up to w_max) after symmetry reduction
• Let N_S be the maximal number of stacks
• This allows for $\binom{N_B}{N_S}$ combinations
• Note: Each stack can have N = 1, 2, 3, … branes
(a small counting sketch follows below)
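To make the combinatorics concrete, a small counting sketch under stated assumptions: it enumerates winding-number choices *without* the symmetry reduction or coprimality constraints, so it overcounts N_B, and the values of w_max and N_S below are hypothetical.

```python
# Minimal sketch: crude size estimate of the search space described above.
from math import comb

def num_brane_types(w_max):
    # each stack carries three (n, m) pairs with entries in [-w_max, w_max];
    # no symmetry reduction or coprimality is applied, so this overcounts N_B
    return (2 * w_max + 1) ** 6

w_max, N_S = 3, 6            # hypothetical values
N_B = num_brane_types(w_max)
print(N_B, comb(N_B, N_S))   # number of brane types and of N_S-stack choices
```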
Reinforcement learning
Reinforcement learning - Overview
‣ At time t, the agent is in state s_t ∈ S_total
‣ Select action a_t from the action space A based on the policy π, π: S_total → A
‣ Receive reward r_t ∈ ℝ for action a_t based on the reward function R, R: S_total × A → ℝ
‣ Transition to the next state s_{t+1}
‣ Try to maximize the long-term return $G_t = \sum_{k=1}^{\infty} \gamma^k r_{t+k}$, γ ∈ (0, 1]
‣ Keep track of the state value v(s) ("how good is the state")
‣ Compute the advantage estimate Adv = r − v ("how much better than expected has the action turned out to be")
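A tiny sketch of the discounted return as written above (with the slide's convention of summing from k = 1); the infinite sum is truncated to a finite list of observed rewards, and the example rewards are arbitrary.

```python
# Minimal sketch: discounted return G_t = sum_{k>=1} gamma^k * r_{t+k},
# truncated to a finite reward sequence [r_{t+1}, r_{t+2}, ...].

def discounted_return(rewards, gamma=0.99):
    G = 0.0
    for k, r in enumerate(rewards, start=1):  # k = 1 multiplies r_{t+1}
        G += gamma**k * r
    return G

print(discounted_return([0.0, 0.0, 1.0, -0.5]))
```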
Reinforcement Learning - Overview
‣ How to maximize the future return?
• Depends on the policy π
‣ Several approaches
• Tabular (small state/action spaces): [Sutton, Barto '98]
✦ Temporal difference learning
✦ SARSA
✦ Q-learning
⇒ my breakout group on Friday (a tabular update is sketched below)
• Deep RL (large/infinite state/action spaces):
✦ Deep Q-Network [Mnih et al '15]
✦ Asynchronous advantage actor-critic (A3C) [Mnih et al '16]
✦ Variations/extensions: Wolpertinger [Dulac-Arnold et al '16], Rainbow [Hessel et al '17]
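For orientation, a minimal sketch of the tabular Q-learning update mentioned in the list above (not code from the talk); the learning rate and discount are hypothetical.

```python
# Minimal sketch of a tabular Q-learning update; states and actions are
# hashable keys into a dictionary of value estimates.
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> value estimate
alpha, gamma = 0.1, 0.99     # learning rate and discount (hypothetical)

def q_update(state, action, reward, next_state, actions):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```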
Reinforcement Learning - A3C
[Diagram: a global instance with a policy/value network; n workers (Worker 1, …, Worker n), each with its own policy/value network, input, and environment, synchronizing with the global instance]
Reinforcement Learning - A3C
‣ Asynchronous: Have n workers explore the environment simultaneously and asynchronously
• improves training stability (the workers' experiences are separated)
• improves exploration
‣ Advantage: Use the advantage estimate to update the policy
‣ Actor-critic: To maximize the return, one needs to know the state or action value and optimize the policy
• Methods like Q-learning focus on the value function
• Methods like policy gradient focus on the policy
• AC: Use the value estimate ("critic") to update the policy ("actor")
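A schematic, framework-agnostic sketch of the two actor-critic loss terms (not the ChainerRL implementation used in the talk): the critic regresses the observed return, and the actor is pushed along the log-probability of the chosen action weighted by the advantage. In practice the advantage in the policy loss is treated as a constant, so no gradient flows through the value network there.

```python
import math

# Schematic actor-critic losses for a single step (illustrative only).
def actor_critic_losses(log_prob_action, value_estimate, observed_return):
    advantage = observed_return - value_estimate                  # "better than expected?"
    policy_loss = -log_prob_action * advantage                    # actor: reinforce good actions
    value_loss = 0.5 * (observed_return - value_estimate) ** 2    # critic: fit the return
    return policy_loss, value_loss

print(actor_critic_losses(log_prob_action=math.log(0.25),
                          value_estimate=0.3, observed_return=1.0))
```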
Reinforcement Learning - Implementation
‣ OpenAI Gym: Interface between agent (RL) and environment (string landscape) [Brockman et al '16]
• We provide the environment
• We use ChainerRL's implementation of A3C for the agent
‣ Environment (our side):
✦ step method
✦ reset method
✦ action space
✦ observation (state) space
‣ ChainerRL (agent side):
✦ make environment
✦ specify RL method (A3C, DQN, …)
✦ specify policy NN architecture (FF, LSTM, …)
‣ step:
• go to new state
• return (new_state, reward, done, comment)
‣ reset:
• reset episode
• return start_state
(a schematic environment skeleton follows below)
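A schematic sketch of what such a Gym environment might look like, using the classic Gym API of the time (reset returns the start state; step returns the 4-tuple listed above). The class name, state dimension, and all physics hooks are placeholders, not the actual code of the talk.

```python
import gym
import numpy as np
from gym import spaces

class BraneWorldEnv(gym.Env):
    """Schematic environment for the brane landscape (placeholder logic).

    The real environment encodes brane stacks and rewards consistency
    (tadpole, K-theory, SUSY) and Standard-Model-like spectra; both are
    stubbed out here.
    """

    def __init__(self, num_actions=10, state_dim=28):   # hypothetical sizes
        self.action_space = spaces.Discrete(num_actions)  # e.g. change a winding number
        self.observation_space = spaces.Box(low=-10, high=10,
                                            shape=(state_dim,), dtype=np.float32)
        self.state = None

    def reset(self):
        # reset episode, return start_state
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)
        return self.state

    def step(self, action):
        # go to new state
        self.state = self._apply_action(self.state, action)
        reward = self._reward(self.state)        # reward "good", punish "bad" actions
        done = self._is_realistic(self.state)    # stop when a realistic vacuum is found
        return self.state, reward, done, {}      # (new_state, reward, done, comment)

    # --- placeholders for the physics, not part of the original slides ---
    def _apply_action(self, state, action):
        return state

    def _reward(self, state):
        return 0.0

    def _is_realistic(self, state):
        return False
```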