“Humies” Competition, GECCO 2018
Emergent solutions to high-dimensional multi-task reinforcement learning
Stephen Kelly & Malcolm Heywood
Why does the result qualify as human competitive?

[Figure: game-playing agent loop. The agent receives the visual state s(t) from the game title (Atari, Doom), emits an atomic action a(t), and is scored by the game score at the end of evaluation.]
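A minimal sketch of this evaluation loop, using the ale-py bindings to the Arcade Learning Environment. The Agent interface and the ROM path are hypothetical placeholders, not the authors' implementation.

```python
# Evaluation loop from the figure: visual state in, atomic action out,
# fitness is the end-of-evaluation game score.
from ale_py import ALEInterface

def evaluate(agent, rom_path: str) -> int:
    ale = ALEInterface()
    ale.loadROM(rom_path)
    actions = ale.getMinimalActionSet()
    ale.reset_game()
    score = 0
    while not ale.game_over():
        s = ale.getScreenRGB()        # visual state s(t)
        a = agent.act(s)              # index of an atomic action a(t)
        score += ale.act(actions[a])  # reward accrues into the game score
    return score
```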
Visual RL dominated by deep learning

• DQN (2015)
  – Visual RL on the Arcade Learning Environment (49 titles)
  – Q-learning with deep learning
  – Cropped visual image (84 × 84)
  – Frame stacking (removes the interleaving of sprites & stochastic properties)
  – “able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games” [Nature (2015) Vol. 518]
• Gorila (2015), Double Q (2016), Dueling DQN (2016), A3C (2016), Noisy DQN (2017), Distributional DQN (2017), Rainbow (2018)
• One policy per game title
• Learning parameters and DNN topology identified a priori
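The cropping and frame stacking mentioned above are not spelled out on the slide; this is a sketch of the standard DQN input pipeline following Mnih et al. (2015), an illustrative reconstruction rather than the authors' code.

```python
# Standard DQN preprocessing: de-flicker with a pixel-wise max over
# consecutive raw frames (removes sprite interleaving), grayscale,
# resize to 84x84, and stack the last 4 frames so motion is visible.
from collections import deque
import numpy as np
import cv2

class DQNPreprocessor:
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)
        self.prev_raw = None

    def observe(self, raw: np.ndarray) -> np.ndarray:
        rgb = raw if self.prev_raw is None else np.maximum(raw, self.prev_raw)
        self.prev_raw = raw
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
        self.frames.append(small)
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(small)          # pad at episode start
        return np.stack(self.frames, axis=0)   # network input (4, 84, 84)
```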
Visual RL compared to ‘human’

[Figure: best and worst game title per algorithm, plotted as log(% human score), where % human score = 100 × (algorithm − random) / (human − random). Algorithms: TPG, DQN, Gorila, Double-DQN, H-NEAT. Above the 100% human line an algorithm is better than human, below it worse; the compared algorithms are statistically equivalent.]
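The human-normalized score used on the axis above is simple to compute; a one-function sketch:

```python
def pct_human(algorithm: float, human: float, random: float) -> float:
    """Human-normalized score from the figure:
    100 * (algorithm - random) / (human - random).
    100 means human-level; above 100 is better than human."""
    return 100.0 * (algorithm - random) / (human - random)
```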
Visual RL and multi-task learning

• Multiple game titles played by a single agent
• Single-title DQN provides the baseline
• Best DNN result needs prior knowledge regarding parameters and topology
• Constitutes an example of a task pertaining to ‘Artificial General Intelligence’
Multi-title TPG versus single-title DQN

[Figure: multi-task TPG score per game title as a percentage of the single-title DQN score (log scale); above 100% the multi-task TPG agent is better than single-title DQN, below it worse. Group 1: Alien, Asteroids, Bank Heist, Battle Zone, Bowling. Group 2: Centipede, Chopper Command, Fishing Derby, Frostbite, Kangaroo. Group 3: Krull, Kung-Fu Master, Ms. Pac-Man, Private Eye, Time Pilot.]
Why [is our entry] ‘best’ in comparison to other entries?

• Single-title task
  – TPG provides solutions competitive with human and DQN
• Multi-title task
  – Agents have to be competitive over multiple game titles
  – TPG multi-task solution is competitive with DQN trained under a single-title setting
  – DNN state-of-the-art in the single task does not address the multi-title task
• TPG for the single-title task is a special case of TPG for the multi-title task
The ‘icing on the cake’

• TPG addresses multiple issues simultaneously:
  – Complexity of topology is emergent and:
    • Highly modular
    • Unique to the task
    • Explicitly reflects a decomposition of the task
  – No image-specific instructions, just:
    • Four 2-argument operators {+, −, ×, ÷}
    • Three 1-argument operators {log, exp, cosine}
    • One conditional operator
  – TPG is highly efficient computationally
• Some examples…
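To make the instruction set concrete, here is a minimal sketch of a register-machine (linear GP) program restricted to the eight operators above. The register count, protected division, and the conditional's skip semantics are illustrative assumptions, not the authors' exact design.

```python
import math

def run_program(program, pixels, n_regs=8):
    """program: list of (op, dst, src1, src2) tuples; a source index
    addresses a register or, when negative, a raw screen pixel."""
    r = [0.0] * n_regs
    fetch = lambda i: pixels[-i - 1] if i < 0 else r[i]
    skip = False
    for op, dst, s1, s2 in program:
        if skip:                       # previous conditional failed
            skip = False
            continue
        x, y = fetch(s1), fetch(s2)
        if   op == '+':   r[dst] = x + y
        elif op == '-':   r[dst] = x - y
        elif op == '*':   r[dst] = x * y
        elif op == '/':   r[dst] = x / y if y != 0 else 0.0  # protected
        elif op == 'log': r[dst] = math.log(abs(x)) if x != 0 else 0.0
        elif op == 'exp': r[dst] = math.exp(min(x, 50.0))    # overflow guard
        elif op == 'cos': r[dst] = math.cos(x)
        elif op == 'if':  skip = not (x < y)  # conditionally skip next op
    return r[0]  # the program's bid on the current state
```

Note that no convolution or other image-specific operator appears anywhere: each program simply computes a bid over raw pixels, and a team follows its highest bidder.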
Emergent complexity

[Figure: number of teams (nodes) versus generation, log scale 1–200, for champion policy graphs on Alien, Asteroids, Boxing, Bowling, and Ms. Pac-Man; ‘Rand’ marks random initialization. Overall solution-graph complexity emerges and grows over generations, while the number of teams visited per decision during test (ditto for pixels used) remains far lower.]
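The gap between overall and per-decision complexity falls out of how a policy graph is evaluated: each decision follows a single path of winning bids from the root team to an atomic action, so only a fraction of the graph (and of the pixels) is ever touched. A minimal traversal sketch, reusing the run_program interpreter above; Team and Learner are hypothetical containers.

```python
class Learner:
    def __init__(self, program, action):
        self.program = program
        self.action = action           # atomic action OR another Team

class Team:
    def __init__(self, learners):
        self.learners = learners       # assumed: at least one atomic action

def decide(team, pixels, visited=None):
    visited = set() if visited is None else visited
    visited.add(id(team))
    # Learners pointing at an already-visited team are skipped, so the
    # traversal cannot loop; every path ends at an atomic action.
    eligible = [l for l in team.learners
                if not (isinstance(l.action, Team) and id(l.action) in visited)]
    best = max(eligible, key=lambda l: run_program(l.program, pixels))
    if isinstance(best.action, Team):
        return decide(best.action, pixels, visited)
    return best.action                 # an atomic action a(t)
```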
Emergent discovery of multi-title solutions

[Figure: champion multi-task policy graph; the subgraphs visited when playing Ms. Pac-Man, Frostbite, and Centipede are distinct, an emergent decomposition of the multi-title task.]
Run-time complexity

DQN (3.2 GHz Intel i7-4700s):
• ≈1.6 million weights in the MLP
• ≈3.2 million convolution operations in the DNN
• 5 decisions per second; 330 decisions per second with GPU acceleration

TPG (2.2 GHz Intel E5-2650):
• Single title: 71–2346 instructions per decision (avg); 758–2853 decisions per second
• Multi-title: 413–869 instructions per decision (avg); 1832–2922 decisions per second
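Decisions per second is straightforward to measure; a minimal timing sketch, where the agent and the pre-captured frames are hypothetical stand-ins (a real measurement would run against the live game, as in evaluate() above).

```python
import time

def decisions_per_second(agent, frames) -> float:
    start = time.perf_counter()
    for s in frames:
        agent.act(s)                   # one decision per visual state
    return len(frames) / (time.perf_counter() - start)
```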
Questions?