Harnessing Wake Vortices for Efficient Collective Swimming via Deep Reinforcement Learning
Siddhartha Verma
With: Guido Novati and Petros Koumoutsakos
CSE Lab, http://www.cse-lab.ethz.ch
Collective Swimming
• Hydrodynamic benefit of swimming in groups: are wake vortices exploited by fish for propulsion? (Image credit: Artbeats)
• Theoretical work on schooling & formation swimming: Breder (1965), Weihs (1973, 1975), Shaw (1978)
• Experiments: Abrahams & Colgan (1985), Herskin & Steffensen (1998), Svendsen (2003), Killen et al. (2011)
• Simulations with pre-assigned, fixed formations: Hemelrijk et al. (2015), Daghooghi & Borazjani (2015), Maertens et al. (2017)
• But schools evolve dynamically: Breder (1965), Weihs (1973), Shaw (1978)
THIS TALK: Adaptive Collective Swimming
• Autonomous decision-making capability, based on learning from experience
• Goal: maximize energy efficiency, with no positional or formation constraints
The Need for Control
• Without control, trailing fish may get ejected from the leader's wake
• Coordinated swimming through an unsteady flow field requires:
  • The ability to observe the environment
  • The decision to react appropriately
• The swimmers learn how to interact with the environment
• Prior work @ CSE Lab: "vanilla" reinforcement learning. Goal: follow the leader (Novati et al., Bioinspir. Biomim. 2017)
• HERE: deep reinforcement learning. Goal: energy extraction from the vortex wake
Reinforcement Learning
• An agent learns the best action through trial-and-error interaction with the environment (Image credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html)
• Actions have consequences; reward (feedback) is delayed
• Goal: maximize the cumulative future reward; specify what to do, not how to do it
• Credit assignment: the agent receives feedback, and the expected reward is updated in previously visited states
• Q-learning: a POLICY for taking the best ACTION in a given STATE

Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \;\middle|\; a_k = \pi(s_k)\ \forall k > t \right]
Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right]    (Bellman, 1957)
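As a minimal illustration of the Bellman update above, here is a tabular Q-learning sketch; the state/action sizes, learning rate, and exploration rate are illustrative assumptions, not the setup used in the talk:

```python
import numpy as np

# Hypothetical discretized problem: 100 states, 3 actions (illustrative only).
n_states, n_actions = 100, 3
gamma, alpha, eps = 0.9, 0.1, 0.1   # discount, learning rate, exploration rate
Q = np.zeros((n_states, n_actions))

def q_learning_step(s, a, r, s_next):
    """One Bellman-style update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def select_action(s):
    """Epsilon-greedy policy: mostly exploit the current Q estimate, sometimes explore."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```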
Deep Reinforcement Learning
• Stable algorithm for training NN surrogates of Q
• Sample past transitions: experience replay
  • Breaks correlations in the data
  • Learns from all past policies
• "Frozen" target Q-network to avoid oscillations

Acting, at each iteration:
• The agent is in a state s
• Select action a: greedy, based on max Q(s, a, w), or explore (random action)
• Observe the new state s' and reward r
• Store the tuple {s, a, s', r} in memory

Learning, at each iteration:
• Sample a tuple {s, a, s', r} (or a batch)
• Update w with respect to the target computed with the old weights w^-:
  \frac{\partial}{\partial w} \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2
• Periodically update the fixed weights: w^- \leftarrow w

V. Mnih et al., "Human-level control through deep reinforcement learning," Nature (2015)
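A compact sketch of this acting/learning loop; the network size, replay-buffer capacity, batch size, and optimizer settings are assumptions for illustration, not the values used in the work (terminal-state masking is omitted for brevity):

```python
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions = 6, 5          # illustrative dimensions
gamma, batch_size = 0.9, 32

def make_q_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_net = make_q_net()                  # Q(s, a, w)
target_net = make_q_net()             # Q(s, a, w-), the "frozen" copy
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)        # experience replay buffer

def store(s, a, r, s_next):
    """Store one transition as tensors: (s, a, r, s')."""
    replay.append((torch.as_tensor(s, dtype=torch.float32),
                   torch.tensor(a, dtype=torch.long),
                   torch.tensor(float(r)),
                   torch.as_tensor(s_next, dtype=torch.float32)))

def learn_step():
    if len(replay) < batch_size:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay, batch_size)))
    with torch.no_grad():             # the target uses the frozen weights w-
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():                    # periodically: w- <- w
    target_net.load_state_dict(q_net.state_dict())
```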
Actions, States, Reward
Actions: turn and modulate velocity by controlling body deformation
• Increase curvature
• Decrease curvature
States:
• Orientation relative to the leader: Δx, Δy, θ
• Time since the previous tail beat: Δt
• Current shape of the body (manoeuvre)
Reward: based on swimming efficiency
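For concreteness, one possible way to assemble the observation described above; the field names and the simplified body-shape descriptor are hypothetical, and the actual curvature actions are richer than the two labels shown here:

```python
from dataclasses import dataclass

@dataclass
class FollowerObservation:
    dx: float            # streamwise displacement relative to the leader
    dy: float            # lateral displacement relative to the leader
    theta: float         # orientation relative to the leader
    dt_tailbeat: float   # time since the previous tail beat
    body_shape: tuple    # current midline shape / ongoing manoeuvre (simplified here)

    def to_vector(self):
        return [self.dx, self.dy, self.theta, self.dt_tailbeat, *self.body_shape]

# Discrete actions modulating the body curvature (illustrative labels only).
ACTIONS = ("increase_curvature", "decrease_curvature")
```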
After training: Efficiency-maximizing 'Follower' (Leader + Smart Follower)
• The smart follower stays in line with the leader
  • Decides on its own the best strategy
  • Free to swim outside the wake's influence
• Energetics: the smart follower exploits the wake
  • Head synchronised with the lateral flow velocity
• Compared to a solitary swimmer with identical muscle movements
  • The presence/absence of the wake is the only difference

          η      Speed   CoT    P_def
Smart     1.32   1.11    0.64   0.71
Solo      1      1       1      1
(values relative to the solitary swimmer)
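For reference, the efficiency reported in the table is the same quantity used later as the reward, and CoT is read here as the standard cost of transport; the exact normalisation of CoT is an assumption on my part:

\eta = \frac{P_{\mathrm{thrust}}}{P_{\mathrm{thrust}} + \max(P_{\mathrm{def}}, 0)}, \qquad \mathrm{CoT} \propto \frac{\text{energy consumed}}{\text{distance travelled}}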
After training: Efficiency-maximizing 'Follower' (Leader + Smart Follower)
• The smart follower stays in line with the leader: it decides on its own the best strategy and is free to swim outside the wake's influence
• Energetics: the smart follower exploits the wake; head synchronised with the lateral flow velocity
(Figure: distributions over the first 10,000 vs. the last 10,000 transitions of training)
• How does the smart follower's behaviour evolve during training?
• Why the peaks in the distribution?
Sequence of events (snapshot when η is maximum; figure labels W1, L1, S1)
• The wake vortex (W1) lifts up the boundary layer on the swimmer's body (L1)
• The lifted vortex generates a secondary vortex (S1)
• The secondary vortex is a high-speed region, producing suction due to low pressure
• The flow-induced force and the body deformation together determine P_def (muscle use)
• Low P_def values are preferable
Implementing the Learned Strategy in 3D
• Target coordinates: maxima in the velocity correlation
• PID controller (sketch below):
  • Modulates the follower's undulations (curvature + amplitude)
  • Maintains the specified target position
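A minimal sketch of such a position-holding PID loop; the gains, the error signal, and how the correction maps onto undulation amplitude/curvature are illustrative assumptions, not the controller used in the simulations:

```python
class PID:
    """Basic discrete PID controller: output = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Hypothetical usage: keep the follower at a target offset behind the leader.
pid_x = PID(kp=1.0, ki=0.05, kd=0.2)   # gains are placeholders

def control_step(target_dx, current_dx, dt, base_amplitude):
    correction = pid_x.update(target_dx - current_dx, dt)
    # Modulate the undulation amplitude (and, analogously, the curvature) to close the gap.
    return base_amplitude * (1.0 + correction)
```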
3D Wake Interactions
• Wake interactions benefit the follower: 11.6% increase in efficiency, 5.3% reduction in CoT
• The oncoming wake-vortex ring is intercepted and generates a new 'lifted-vortex' ring (LR), similar to the 2D case
(Figure: η vs. t for the follower and the leader; snapshots highlighting the lifted-vortex ring LR)
11% increase in efficiency for each follower
Summary
• The autonomous swimmer learns to exploit unsteady fluctuations in the velocity field
• It decides to interact with the wake, even when free to swim clear
• Large energetic savings, without loss of speed (improvements: 30% and 11%)
• Swimming via reinforcement learning: an effective and robust method for harnessing energy from unsteady flow
NEXT: Energy-efficient swarms of drones?
Backup
Reacting to an erratic leader
• Note: the reward allotted here has no connection to the relative displacement
(Videos: two fish swimming together in Greece; two fish swimming together in the Swiss supercomputer)
Robustness: Responds Effectively to Perturbations
• The agent never experienced deviations in the leader's behaviour during training
• But analogous situations were encountered during training (random actions during learning)
• The agent reacts appropriately to maximise the cumulative reward
Numerical methods
• Remeshed vortex methods (2D): solve the vorticity form of the incompressible Navier-Stokes equations,
  \frac{\partial \omega}{\partial t} + u \cdot \nabla \omega = \omega \cdot \nabla u + \nu \nabla^2 \omega + \lambda \nabla \times \left( \chi\, (u_s - u) \right)
  with advection (u·∇ω), diffusion (ν∇²ω), and penalization (λ∇×(χ(u_s − u))) terms; the stretching term ω·∇u is 0 in 2D
• Brinkman penalization: accounts for the fluid-solid interaction, Angot et al., Numerische Mathematik (1999)
• 2D: wavelet-based adaptive grid, cost-effective compared to uniform grids, Rossinelli et al., J. Comput. Phys. (2015)
• 3D: finite differences with pressure projection (Chorin 1968), Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
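A toy illustration of how the penalized vorticity equation above could be advanced explicitly on a uniform periodic grid; the actual solver uses remeshed vortex methods on a wavelet-adapted grid, and the grid size, time step, penalization strength, and body indicator χ here are placeholders (the velocity field would be recovered from ω via a stream-function Poisson solve, omitted here):

```python
import numpy as np

N, h, dt = 256, 1.0 / 256, 1e-4      # grid size, spacing, time step (placeholders)
nu, lam = 1e-4, 1e4                  # viscosity and penalization strength (placeholders)

def ddx(f):  # central difference in x (arrays indexed as f[y, x]), periodic
    return (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2 * h)

def ddy(f):  # central difference in y, periodic
    return (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2 * h)

def laplacian(f):
    return (np.roll(f, -1, 0) + np.roll(f, 1, 0) +
            np.roll(f, -1, 1) + np.roll(f, 1, 1) - 4 * f) / h**2

def vorticity_step(w, ux, uy, chi, us_x, us_y):
    """Explicit Euler update of dw/dt = -u.grad(w) + nu*lap(w) + lam*curl(chi*(us - u))."""
    advection    = -(ux * ddx(w) + uy * ddy(w))
    diffusion    = nu * laplacian(w)
    fx, fy       = chi * (us_x - ux), chi * (us_y - uy)   # penalization forcing
    penalization = lam * (ddx(fy) - ddy(fx))              # 2D curl of the forcing
    return w + dt * (advection + diffusion + penalization)
```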
Reinforcement Learning: Reward
Goal #1: learn to stay behind the leader. Reward based on the vertical displacement (positive when the follower stays within 0.5L of the leader's line, negative otherwise):
R_{\Delta y} = 1 - \frac{|\Delta y|}{0.5\,L}
• Failure condition (stray too far or collide with the leader): R_{end} = -1
Goal #2: learn to maximise swimming efficiency. Reward based on the efficiency:
R_\eta = \frac{P_{\mathrm{thrust}}}{P_{\mathrm{thrust}} + \max(P_{\mathrm{def}}, 0)} = \frac{T\,|u_{CM}|}{T\,|u_{CM}| + \max\left( \int_{\partial\Omega} F(x) \cdot u_{\mathrm{def}}(x)\, dx,\ 0 \right)}
where T|u_{CM}| is the thrust power and the surface integral is the deformation power.
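A small sketch of these two reward signals; the body length L and the power estimates are inputs that, in the solver, come from surface integrals over the swimmer:

```python
def reward_displacement(dy, L):
    """Goal #1: stay behind the leader; positive while |dy| < 0.5*L, negative beyond."""
    return 1.0 - abs(dy) / (0.5 * L)

def reward_efficiency(p_thrust, p_def):
    """Goal #2: swimming efficiency R_eta = P_thrust / (P_thrust + max(P_def, 0))."""
    return p_thrust / (p_thrust + max(p_def, 0.0))

R_END = -1.0  # terminal reward when the follower strays too far or collides
```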
Reinforcement Learning: Basic idea
• An agent learns the best action through trial-and-error interaction with the environment (Image credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html)
• Actions have long-term consequences; reward (feedback) is delayed
• Goal: maximize the cumulative future reward; specify what to do, not how to do it
• Credit assignment: the agent receives feedback, and the expected reward is updated in previously visited states; now we have a policy
• Example: maze solving
  • State: the agent's position (A)
  • Actions: go U, D, L, R
  • Reward: -1 per step taken, 0 at the terminal state
  (Figure: grid of expected cumulative rewards over the maze)

Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \;\middle|\; a_k = \pi(s_k)\ \forall k > t \right]
Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right]    (Bellman, 1957)
Recurrent Neural Network
(Diagram: the observation o_n is fed through three stacked LSTM layers, which output the action values q_n^(a_1), ..., q_n^(a_5) for the five available actions)
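A minimal PyTorch sketch of such a recurrent Q-network; the layer width and observation size are assumptions, and only the three stacked LSTM layers and the five Q-value outputs follow the diagram:

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    def __init__(self, obs_dim=6, hidden=64, n_actions=5):
        super().__init__()
        # Three stacked LSTM layers processing the observation sequence o_1, ..., o_n.
        self.lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden,
                            num_layers=3, batch_first=True)
        # Linear head mapping the hidden state to Q-values for the 5 actions.
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)   # out: (batch, time, hidden)
        return self.head(out), state             # Q-values per time step, plus LSTM state

# Hypothetical usage: Q-values for a batch of one sequence of 10 observations.
net = RecurrentQNetwork()
q_values, _ = net(torch.zeros(1, 10, 6))
print(q_values.shape)   # torch.Size([1, 10, 5])
```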
A flexible maneuvering model
• Modified midline kinematics preserve the travelling wave: the body curvature is the sum of a travelling wave and a travelling spline
• Each action prescribes a point of the spline, with control points at c, c + ¼, c + ½, c + ¾, and c + 1, increasing or decreasing the local curvature (see the sketch below)
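To make the decomposition concrete, here is a simplified sketch of a midline curvature built as a baseline travelling wave plus an action-controlled travelling contribution; the wave parameters, the interpolation, and the uniform application along the body are illustrative assumptions, not the exact kinematics used in the work:

```python
import numpy as np

def baseline_curvature(s, t, A=5.7, wavelength=1.0, period=1.0):
    """Baseline travelling-wave curvature along the body coordinate s in [0, 1]."""
    return A * np.sin(2 * np.pi * (t / period - s / wavelength))

def control_curvature(s, t, nodes, values, period=1.0):
    """Action-prescribed curvature offsets at times c, c+1/4, ..., c+1 (here linearly
    interpolated in time and applied uniformly along the body, a simplification)."""
    return np.interp(t / period, nodes, values) * np.ones_like(s)

def midline_curvature(s, t, nodes, values):
    # Total curvature = baseline travelling wave + superimposed control contribution.
    return baseline_curvature(s, t) + control_curvature(s, t, nodes, values)

# Hypothetical usage: an action that increases curvature around mid-cycle.
s = np.linspace(0.0, 1.0, 100)              # body coordinate, head to tail
nodes = [0.0, 0.25, 0.5, 0.75, 1.0]         # control points c, c+1/4, ..., c+1
values = [0.0, 0.0, 1.5, 0.0, 0.0]          # curvature offsets set by the action
k = midline_curvature(s, t=0.4, nodes=nodes, values=values)
```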
Examples
• The effect of an action depends on when the action is made
(Figure panels: increasing local curvature; reducing local curvature; a chain of actions)
Simulation Cost (2D)
• Wavelet-based adaptive grid: https://github.com/cselab/MRAG-I2D, Rossinelli et al., JCP (2015)
• Production runs (Re = 5000):
  • Domain: [0,1] x [0,1], resolution 8192 x 8192, 1600 points along the fish midline
  • Running with 24 threads (12 hyper-threaded cores, Piz Daint)
  • 10 tail-beat cycles: 27,000 time steps at approx. 1 second/step, approx. 96 core hours
• Training simulations (lower resolution):
  • Resolution: 2048 x 2048
  • 10 tail-beat cycles: 36 core hours
  • Learning converges in approx. 150,000 tail-beats: 0.54 million core hours per learning episode (see the estimate below)
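The quoted learning cost follows from the training numbers above, under the assumption that each training simulation covers 10 tail-beat cycles:

\frac{150{,}000\ \text{tail-beats}}{10\ \text{tail-beats per run}} \times 36\ \text{core-hours per run} \approx 5.4 \times 10^{5}\ \text{core-hours} = 0.54\ \text{million core-hours}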