Energy Efficient Adaptive Beamforming on Sensor Networks


  1. Energy Efficient Adaptive Beamforming on Sensor Networks
Viktor K. Prasanna, Bhargava Gundala, Mitali Singh
Dept. of EE-Systems, University of Southern California
Email: prasanna@usc.edu
http://ceng.usc.edu/~prasanna
http://pacman.usc.edu

  2. Outline
- Problem Definition
- Computational Characteristics
- Prior Solution
- Power Optimizations
  - Sensor Node Level
  - Inter Node Level
- Challenges/Discussion

  3. Problem Scenario
[Figure: an energy constrained sensor network with passive and active nodes.]

  4. Beamforming
Definition: the technique that spatially filters the signals received from an array of sensors and estimates the spatial features of the sources.
Procedure:
1. Passively and repeatedly sample acoustic propagation wave field signals.
2. Linearly combine the input data with a weight matrix to form a sonar beam for a particular direction of look.
Adaptive sonar beamforming (for high SNR and high resolution): time-changing signal and noise properties are included in the derivation of the weights, making them adapt accordingly.
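
As a concrete illustration of step 2, here is a minimal sketch (not from the slides) of forming one beam sample as a weighted sum of sensor samples; the array size, weight values, and function name are assumptions chosen only for illustration, and a practical beamformer would use complex weights per frequency bin.

```c
#include <stdio.h>

#define NUM_SENSORS 8   /* assumed array size, for illustration only */

/* Form one beam sample by linearly combining one sample per sensor
 * with a per-sensor weight (real-valued here for simplicity). */
double form_beam(const double samples[NUM_SENSORS],
                 const double weights[NUM_SENSORS])
{
    double beam = 0.0;
    for (int i = 0; i < NUM_SENSORS; i++)
        beam += weights[i] * samples[i];
    return beam;
}

int main(void)
{
    /* Hypothetical snapshot of sensor samples and steering weights. */
    double samples[NUM_SENSORS] = {0.1, 0.3, -0.2, 0.5, 0.0, 0.4, -0.1, 0.2};
    double weights[NUM_SENSORS] = {0.125, 0.125, 0.125, 0.125,
                                   0.125, 0.125, 0.125, 0.125};
    printf("beam output = %f\n", form_beam(samples, weights));
    return 0;
}
```

In an adaptive beamformer the weights are not fixed like this; they are recomputed as the signal and noise statistics change, as the later slides describe.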

  5. Space Time Adaptive Processing
[Figure: the input is organized as a data cube with L range gates, N elements, and M pulse repetition intervals (PRIs); each coherent processing interval (CPI) of data is processed for target detection.]

  6. MITRE RT_STAP Benchmark
[Figure: processing pipeline. The input data cube (L = 1920, N = 22, M = 64) flows through preprocessing step 1, preprocessing step 2, Doppler processing, weight computation, and weight application.]
T_latency = 161.25 msec and T_period = 32.25 msec

  7. Input Data Cube
[Figure: the input data cube has N = 22 elements, M = 64 PRIs, and L = 1920 range gates.]

  8. Sonar Signal Processing
[Figure: adaptive beamforming processing chains. Sampling rate: 10 Hz to 25 KHz; output rate: 1 Hz to 100 Hz; 100 to ~5000 beams per output. Frequency-domain paths: FFT followed by element-space adaptive beamforming, or FFT followed by beam-space adaptive beamforming. Time-domain path: conventional beamforming.]

  9. An Example Adaptive Beamformer: MVDR (Minimum Variance Distortionless Response)
[Figure: processing chain for N channels, F frequency bins, and B beams per bin: FFT, corner turn, covariance factorization, and a linear solver and beamformer driven by steering vectors.]
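
For reference, the weights that the linear-solver stage produces for each frequency bin follow the standard MVDR formulation (not spelled out on the slide): minimize the output power subject to a distortionless response in the look direction,

\[
\min_{w}\; w^{H} R\, w \quad \text{subject to} \quad w^{H} d = 1
\qquad\Longrightarrow\qquad
w = \frac{R^{-1} d}{d^{H} R^{-1} d},
\]

where R is the estimated covariance matrix for the bin and d is the steering vector for the beam. This is why the pipeline needs a covariance factorization followed by a linear solve for every bin and beam.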

  10. Computational Characteristics
[Figure: initial data layout, and how the data flows through subproblems S1 to S4 to produce the outputs.]
- Overall processing consists of a sequence of subproblems
- Computational requirements are different for each subproblem
- A large amount of data is repeatedly processed in real time
- Data access patterns change from subproblem to subproblem
- Throughput and latency performance requirements

  11. Adaptive Processing: Key Problems
- Doppler processing (FFT)
- Weight computation (covariance matrix factorization)
- Weight application (matrix-vector product)
[Figure: the adaptation and apply steps operate on the data cube of N = 22 elements, M = 64 PRIs, and L = 1920 range gates.]

  12. Prior Solution
Architecture: a tightly coupled collection of processors connected by a high-bandwidth, low-latency network, producing target detections.

  13. Key Issue: Communication Cost
Coarse grain machines have powerful processing nodes:
- T3E typical configuration: 1200 Mflops/node (T3E-1200); local memory access time 87 to 253 nsec; global memory access time 1 to 2 µsec (SHMEM)
- SP-2 typical configuration: 640 Mflops/node; 64 MB to 4 GB memory; 4.5 to 36.2 GB internal disk
Large software overhead for message transfer:
- SP-2: ~39 µsec overhead per message using MPL/MPI; ~9 nsec/byte/node transfer rate; local memory access: 100's of nsec

  14. Key Idea: Data Remapping
[Figure: the data access pattern changes from subproblem S1 to S2 to S3 while the data remains distributed over processors P0 through P3; between subproblems the data may optionally be remapped.]
The benefits of remapping must exceed the overhead.

  15. Impact of Data Remapping
[Figure: our results compared with the results reported in IPPS '95.]
Implementation performed on the IBM SP-2 at MHPCC; code developed using C, MPI, and ESSL.
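
The slides do not show the remapping code itself; the following is a minimal sketch, assuming the remapping is a corner turn done with MPI_Alltoall, in which each of P processes trades equal-sized tiles with every other process so that data distributed by rows becomes distributed by columns before the next subproblem. Function and variable names are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

/* Corner-turn sketch: each process owns `rows_per_proc` rows of an
 * n x n row-major matrix and ends up owning a block of columns.
 * Assumes n is divisible by the number of processes. */
void corner_turn(const double *local_rows, double *local_cols,
                 int n, int rows_per_proc, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int cols_per_proc = n / p;
    int tile = rows_per_proc * cols_per_proc;   /* elements per tile */

    /* Pack: the tile destined for process j holds this process's rows
     * restricted to columns [j*cols_per_proc, (j+1)*cols_per_proc). */
    double *sendbuf = malloc((size_t)p * tile * sizeof(double));
    for (int j = 0; j < p; j++)
        for (int r = 0; r < rows_per_proc; r++)
            for (int c = 0; c < cols_per_proc; c++)
                sendbuf[j * tile + r * cols_per_proc + c] =
                    local_rows[r * n + j * cols_per_proc + c];

    /* All-to-all exchange of tiles; the receiver still has to reorder
     * the incoming tiles into its column block (omitted here). */
    MPI_Alltoall(sendbuf, tile, MPI_DOUBLE,
                 local_cols, tile, MPI_DOUBLE, comm);
    free(sendbuf);
}
```

Whether to perform such a remap between two subproblems is exactly the trade-off on the previous slide: the all-to-all cost must be recovered by faster data access in the next computation phase.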

  16. Lessons Learnt
Objective: adaptive beamforming on parallel machines.
- Task level parallelism
- Minimize communication cost
- Data remapping

  17. Energy Efficiency
Energy constrained network: power is critical and must be conserved at the sensors.
- Reduce power dissipation at the sensor node level: energy efficient algorithms
- Decrease power dissipation at the inter-node level: optimize the communication cost between sensors

  18. Power Model for a Processing Element
[Figure: a processing element with frequency control over the processor clock f_p and the data bus clock f_b; the processor contains functional units and a cache, and is connected over the data bus to memory.]
Power_Total = Power_Processor + Power_DataBus + Power_Memory
Power_unit = Power_Dynamic + Power_Static = 0.5 · f(n) · C · V² · f_Active + V · I_Leakage
f_max ∝ (V - V_t)/V
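
One consequence worth spelling out (derived from the two relations above rather than stated on the slide): the dynamic energy spent per operation scales with the square of the supply voltage, while the achievable clock rate falls off much more slowly, which is what makes voltage scaling attractive.

\[
E_{op} \approx \frac{P_{Dynamic}}{f_{Active}} = 0.5\, f(n)\, C\, V^2
\qquad\Rightarrow\qquad
\frac{E_{op}(V/2)}{E_{op}(V)} = \frac{1}{4},
\qquad
f_{max} \propto \frac{V - V_t}{V}.
\]

Halving the supply voltage therefore cuts the dynamic energy per operation to roughly one quarter, at the cost of a lower maximum clock frequency and a correspondingly longer execution time.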

  19. Reduce Processor-Memory Data Traffic
Instructions that access memory consume a lot of power:

Instruction     Energy (10^-8 Joules, Intel 486DX2)
MOV DX, BX      2.49
MOV DX, [BX]    3.53
MOV [BX], DX    4.30

Reduce the number of memory accesses:
- Reduce cache misses through high data reuse in the cache
- Use registers
Reduce the power consumed on the data bus.
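
A minimal illustration of the "use registers" point (my example, not from the slides): accumulating into a local variable lets the compiler keep the running sum in a register, so the loop performs one load per element instead of a load plus a store of the memory-resident result on every iteration.

```c
/* Memory-heavy version: out[0] is read and written on every iteration
 * (the compiler cannot keep it in a register if out may alias a). */
void sum_to_memory(const int *a, int n, int *out)
{
    out[0] = 0;
    for (int i = 0; i < n; i++)
        out[0] += a[i];            /* load a[i], load out[0], add, store */
}

/* Register-friendly version: the accumulator lives in a register. */
void sum_in_register(const int *a, int n, int *out)
{
    int acc = 0;                   /* held in a register by the compiler */
    for (int i = 0; i < n; i++)
        acc += a[i];               /* load a[i], add */
    out[0] = acc;                  /* single store at the end */
}
```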

  20. Example: Matrix Multiplication (cache size = n)

    for i = 0 to n-1
        for j = 0 to n-1
            A[i,j] ← 0
            for k = 0 to n-1
                A[i,j] ← A[i,j] + B[i,k] × C[k,j]

Energy = α·n³ + β·(n + n²)·n + γ·(3n²) ≈ (α + β)·n³
Time = n³ + lower order terms
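
A runnable C version of the loop nest above (a sketch; the flat row-major storage is my choice), useful as the baseline that the next two optimizations improve on:

```c
/* Naive n x n matrix multiply, A = B * C, row-major storage.
 * The inner loop walks a column of C with stride n, so for large n
 * almost every C access misses in the cache; this is roughly what the
 * bus-traffic (beta) term on the slide captures. */
void matmul_naive(double *A, const double *B, const double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;                 /* accumulate in a register */
            for (int k = 0; k < n; k++)
                acc += B[i * n + k] * C[k * n + j];
            A[i * n + j] = acc;
        }
    }
}
```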

  21. Optimization I: Reduce Bus Traffic (Block Matrix Multiply)
[Figure: the n × n matrices are multiplied block by block.]
Energy = α·n³ + 2β·(n·n^(1/2))·n + γ·(3n²)
Time = n³ + lower order terms
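
A sketch of the blocked version (the block size b is a tuning parameter chosen so that a few b × b blocks fit in the cache; the value below is only illustrative). Once a block of B or C is loaded it is reused across a whole block of results, which is where the drop in the bus-traffic term from roughly n³ to roughly n^(5/2) words comes from when b is on the order of √n.

```c
#define BLOCK 16   /* illustrative; tune so ~3*BLOCK*BLOCK doubles fit in cache */

static inline int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked n x n matrix multiply, A = B * C, row-major storage.
 * A must be zero-initialized by the caller. */
void matmul_blocked(double *A, const double *B, const double *C, int n)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                /* Multiply one pair of cache-resident blocks. */
                for (int i = ii; i < min_int(ii + BLOCK, n); i++)
                    for (int k = kk; k < min_int(kk + BLOCK, n); k++) {
                        double b_ik = B[i * n + k];   /* reused across j */
                        for (int j = jj; j < min_int(jj + BLOCK, n); j++)
                            A[i * n + j] += b_ik * C[k * n + j];
                    }
}
```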

  22. Optimization II: Reduce Peak Bus Bandwidth
[Figure: A, B, and C partitioned into blocks.]
With blocking, roughly 2n² words of data moved over the bus support n³ operations, so the required bus data rate is proportional to the processor rate divided by n.
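
The general form of this argument (standard blocking analysis, written here with an explicit block size b rather than the slide's specific figures): multiplying one pair of b × b blocks moves about 2b² words over the bus but performs about 2b³ arithmetic operations, so

\[
\frac{\text{words moved}}{\text{operations}} \approx \frac{2b^2}{2b^3} = \frac{1}{b}
\qquad\Longrightarrow\qquad
\text{required bus data rate} \approx \frac{\text{processor rate}}{b}.
\]

Larger blocks therefore lower the peak bandwidth the bus must sustain, which is what would allow the bus clock f_b in the earlier power model to be scaled down, up to the point where the blocks no longer fit in the cache.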

  23. Optimization III: Application Directed Data Layouts
- Applications have different data access patterns
  - Matrices accessed by rows, columns, diagonals, sub-squares
  - Tree structures accessed along paths, sub-trees
- "Naive" data layouts degrade performance
  - Large working sets cause capacity misses
  - Improper alignment in memory causes conflict misses
[Figure: a 4 × 4 matrix a(0,0) ... a(3,3) stored in row-major layout versus block layout, and how its elements map onto pages 0 through 3.]
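
To make the block layout concrete, here is a small sketch (my illustration; the function name and the assumption that the block size divides the matrix size are mine) of the index arithmetic that maps a logical element (i, j) to its offset in a block layout, so that all elements of one sub-square land in one contiguous region (one page in the slide's figure):

```c
/* Offset of element (i, j) of an n x n matrix stored block by block,
 * with b x b blocks laid out in row-major order and each block stored
 * contiguously (also row-major).  Assumes b divides n. */
static inline long block_layout_offset(int i, int j, int n, int b)
{
    long blocks_per_row = n / b;
    long block_row = i / b, block_col = j / b;      /* which block */
    long in_row = i % b, in_col = j % b;            /* position inside it */
    return (block_row * blocks_per_row + block_col) * (long)(b * b)
         + in_row * b + in_col;
}
```

For the 4 × 4 example with b = 2, the sub-square a(0,0), a(0,1), a(1,0), a(1,1) occupies offsets 0 through 3, i.e. a single page, whereas in row-major order the same sub-square is split across two pages.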

  24. Cache Friendly Algorithms
Cache friendly means:
- High data reuse
- Low cache pollution
- Regular access patterns
Data layouts:
- Static data layouts (matrix multiply)
- Dynamic data layouts (FFT)

  25. Fast Fourier Transform
DFT via the Cooley-Tukey algorithm:
- Compute a DFT of size N = N1 * N2
- Step 1: compute N2 DFTs of size N1
- Step 2: multiply by the twiddle factors
- Step 3: compute N1 DFTs of size N2
- Divide and conquer recursively
Current approach: MIT FFTW
- Determine the optimal factorization
- Perform low level optimizations for the kernels
- Construct larger size FFTs from the kernels
- Key assumption: all DFTs of the same size have the same execution time
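
Written out (the standard Cooley-Tukey identity, matching the three steps above), with the input indexed as n = N2·n1 + n2 and the output as k = k1 + N1·k2:

\[
X[k_1 + N_1 k_2] \;=\; \sum_{n_2=0}^{N_2-1} W_N^{\,n_2 k_1}
\left( \sum_{n_1=0}^{N_1-1} x[N_2 n_1 + n_2]\, W_{N_1}^{\,n_1 k_1} \right) W_{N_2}^{\,n_2 k_2},
\qquad W_M = e^{-2\pi i / M}.
\]

The inner sum is step 1 (N2 DFTs of size N1), the factor W_N^{n2 k1} is step 2 (the twiddle multiplication), and the outer sum is step 3 (N1 DFTs of size N2). Note that for a fixed n2 the inner DFT reads x at stride N2, which is exactly the strided access whose cost the next slide measures.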

  26. Problem with the Current Approach
All N-point DFTs do not have the same cost!
- Different data access patterns with various strides
- Stride affects execution time
[Figure: measured execution time (µsec) of a 32-point FFT as a function of the access stride 2^s, for s up to about 20, on a Sun Ultra 1: 167 MHz, L2 cache = 512 KB = 32 K points.]

  27. Our Approach: Dynamic Data Layout
Reorganize the input data layout to turn non-unit-stride accesses into unit-stride accesses, performing the data reorganization during the computation.
[Figure: the data viewed as an N1 × N2 array, with the N2-point FFT stage, the N1-point FFT stage, and a data reorganization step between them.]
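
A minimal sketch of such a reorganization step (my illustration, not the authors' code): gathering each strided column of the N1 × N2 view into contiguous storage, i.e. a transpose, so that the following stage's FFTs run at unit stride. In practice this copy would itself be blocked for the cache, and its cost must be outweighed by the savings from the unit-stride FFTs.

```c
#include <complex.h>

/* View the length n1*n2 input as an n1 x n2 row-major array and
 * transpose it, so that data previously accessed at stride n2
 * becomes contiguous for the next batch of FFTs. */
void reorganize_for_unit_stride(const double complex *in,
                                double complex *out, int n1, int n2)
{
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            out[j * n1 + i] = in[i * n2 + j];   /* column j becomes row j */
}
```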

  28. Example
[Figure: decomposition trees for a 1024 × 1024 point FFT, FFTW versus the USC approach: 1611.125 ms versus 1039.6496 ms.]
54.96% improvement over the state-of-the-art FFTW package on a DEC Alpha.

  29. Other Techniques for Node Level Power Optimizations
- Voltage-frequency scaling: f_max ∝ (V - V_t)/V
- Power management (idle/sleep/active states)
- Reduce precision
- Clock gating

Instruction     Energy (10^-8 Joules, Fujitsu Sparc '934)
OR              3.26
MUL             3.26

  30. Current Work
- Development and verification of the techniques proposed for power optimization
- Existing simulators: SimplePower (based on the SimpleScalar architecture), JouleTrack (code length limitations)
- Board level power measurements: Brutus evaluation board (SA-1100)
- Build a functional level power simulator: fast, with an acceptable level of accuracy
- Develop a multiprocessor power model

  31. Space Time Representation
A ⊗ B for N × N matrices, with c = cache size.
[Figure: blocks A11 ... A1N and B11 ... B1N combined block by block to produce each result block (i, j).]
- Compute the results in each block, scheduling the blocks in row-major order: N²/c steps, one per √c × √c block of results
- Data per step ∝ N·√c
- Operations per step ∝ N·c
- Data reuse per step ∝ √c
- Total traffic ∝ (N²/c) · N·√c = N³/√c, against N³ total operations

  32. Theorem: a unidirectional space-time representation leads to cache friendly algorithms, and hence to energy efficient algorithms.

  33. Network Level Energy Optimization
- Computation cost is much lower than communication cost
- The radio interface consumes a large amount of power

WINS sensor node          Power consumed
Transmission (100 m)      600 mW (at 100 kbits/sec)
Reception                 300 mW
Processor (SA-1100)       250 MIPS/watt

- Energy to transfer 32 bits over 100 m in a WINS sensor node = ((600 + 300) mW / 100 kbits/s) × 32 = 288 × 10^-6 Joules
- Energy to execute a 32-bit instruction on the SA-1100 processor = 1 / (250 MIPS/watt) = 0.004 × 10^-6 Joules
- Additional overhead for the bits added for error correction
- Retransmissions are frequent due to unreliable links (e.g. wireless)
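
Combining the two figures above gives the ratio that motivates doing as much computation as possible inside the node; the arithmetic follows directly from the slide's numbers:

\[
\frac{E_{\text{transfer 32 bits over 100 m}}}{E_{\text{32-bit instruction}}}
= \frac{288 \times 10^{-6}\ \text{J}}{0.004 \times 10^{-6}\ \text{J}} = 72{,}000,
\]

i.e. tens of thousands of instructions can be executed locally for the energy cost of sending and receiving a single 32-bit word over 100 m, before even counting error-correction overhead and retransmissions.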
