BigStation: Enabling Scalable Real-time Signal Processing in Large MU-MIMO Systems. Qing Yang, Xiaoxiao Li, Hongyi Yao, Ji Fang, Kun Tan, Wenjun Hu, Jiansong Zhang, Yongguang Zhang. Microsoft Research Asia, Beijing, China. Presented at SIGCOMM 2013, Hong Kong, Aug 2013.


  1. BigStation: Enabling Scalable Real-time Signal Processing in Large MU-MIMO Systems. Qing Yang*, Xiaoxiao Li§, Hongyi Yao¶, Ji Fang‡, Kun Tan†, Wenjun Hu†, Jiansong Zhang†, Yongguang Zhang†. †Microsoft Research Asia, Beijing, China; *MSRA and CUHK, Hong Kong; §MSRA and Tsinghua University, Beijing, China; ¶MSRA and USTC, Hefei, Anhui, China; ‡MSRA and BJTU, Beijing, China

  2. Motivation • Demand for more wireless capacity – Proliferation of mobile devices: wireless access is the primary access – Data-intensive applications: video, telepresence – "The amount of net traffic carried on wireless will exceed the amount of wired traffic by 2015" (source: Cisco VNI 2011-2016)

  3. Motivation • Demand for more wireless capacity – Proliferation of mobile devices: wireless access is the primary access – Data-intensive applications: video, telepresence • Can we engineer the next wireless network to match the existing wired network – gigabit wireless throughput to every user?

  4. How to Gain More Wireless Capacity • More spectrum (DSA) – Spectrum is a scarce, shared resource, and there is a limit • Spectrum reuse (micro cells, pico cells, ...) – Existing cells are already small (like Wi-Fi) – Increased deployment and management complexity • Spatial multiplexing (MU-MIMO) – More promising

  5. Background: MU-MIMO • An access point (AP) with m antennas performs joint signal processing to transmit to / receive from multiple mobile stations with n total client antennas • Uplink: the AP observes Y = HS and recovers the client symbols by zero-forcing, S = (H*H)^-1 H* Y • Downlink: the AP precodes T = H*(HH*)^-1 S, so the clients receive Z = HT = S • In theory, capacity scales linearly with the number of AP antennas
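
The zero-forcing algebra above is easy to sanity-check numerically. Below is a minimal NumPy sketch (illustrative only, with made-up sizes; this is not the BigStation code) showing that the uplink detector recovers the transmitted symbols and the downlink precoder delivers each client its own symbol:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 8, 4  # m AP antennas, n client streams (hypothetical sizes)

    # Uplink: clients send S, the AP observes Y = H S (H is m x n).
    H = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    S = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=n)  # QPSK
    Y = H @ S
    S_hat = np.linalg.solve(H.conj().T @ H, H.conj().T @ Y)  # (H*H)^-1 H* Y
    assert np.allclose(S_hat, S)  # noiseless case: exact recovery

    # Downlink: precode T = Hd*(Hd Hd*)^-1 S for the n x m downlink
    # channel Hd, so each client observes only its own symbol: Hd T = S.
    Hd = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
    T = Hd.conj().T @ np.linalg.solve(Hd @ Hd.conj().T, S)
    assert np.allclose(Hd @ T, S)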

  6. How Many Antennas Do We Need? • ... for a gigabit wireless link per user. PHY rate vs. number of antennas (20/40MHz rows are 802.11n, 80/160MHz rows are 802.11ac; 16+ antennas means large-scale MU-MIMO):

              1       2       4       8      16      32      64     128
    20MHz   72.2M   144M    289M    578M    1.2G    2.3G    4.6G    9.2G
    40MHz   150M    300M    600M    1.2G    2.4G    4.8G    9.6G   19.2G
    80MHz   325M    650M    1.3G    2.6G    5.2G   10.4G   20.8G   41.6G
    160MHz  650M    1.3G    2.6G    5.2G   10.4G   20.8G   41.6G   83.2G

  Gigabit to 20 concurrent users: a 160MHz channel with at least 40 antennas
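
Since the rates in the table scale linearly with antenna count, the whole table can be regenerated (up to rounding) from the single-antenna rates alone. A small sketch, taking the single-stream rates from the table itself:

    # Regenerate the table from the single-antenna PHY rates (Mbps).
    base_rate = {"20MHz": 72.2, "40MHz": 150.0, "80MHz": 325.0, "160MHz": 650.0}
    for bw, r in base_rate.items():
        row = [r * n for n in (1, 2, 4, 8, 16, 32, 64, 128)]
        print(bw, ["%.1fG" % (x / 1000) if x >= 1000 else "%.0fM" % x for x in row])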

  7. Challenge • Can we build a scalable AP to support such large-scale MU-MIMO operation – as n, and with it m, grows large? [Figure: AP performing joint signal processing across m AP antennas for n total client antennas]

  8. Computation and Throughput Requirements: a Back-of-the-Envelope Estimate • Setting: 160MHz channel, 40 antennas • Data path: – 160MHz channel width → s ≈ 5 Gbps of samples per antenna – 40 antennas → 200 Gbps in total • Computation: – Channel inversion (once every frame): O(nm²) work per subcarrier → 269 GOPS – Spatial demultiplexing/precoding: O(nms) → 1.5 TOPS – Channel decoding: O(ns) → 5.5 TOPS – 7.27 TOPS in total! • A state-of-the-art multi-core CPU achieves only ~50 GOPS
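
The data-path half of this estimate follows directly from the sample format. A sketch of the arithmetic, assuming 16-bit I and Q samples at the channel rate (the sample format is our assumption; the slide only states the results):

    # Back-of-the-envelope data-path estimate for 160MHz, 40 antennas.
    bw = 160e6                  # channel width (complex samples/s)
    bits_per_sample = 2 * 16    # 16-bit I + 16-bit Q (assumed format)
    m = 40                      # AP antennas

    per_ant = bw * bits_per_sample   # ~5.1 Gbps per antenna
    total = per_ant * m              # ~205 Gbps aggregate, i.e. ~200 Gbps
    print(f"{per_ant / 1e9:.1f} Gbps per antenna, {total / 1e9:.0f} Gbps total")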

  9. A Single Central Processing Unit [Figure: a single joint signal processing unit inside the AP connects all m AP antennas to the n total client antennas]

  10. BigStation: Parallelizing to Scale [Figure: the BigStation AP replaces the central unit with many simple processing units connected by an interconnecting network, serving the same m AP antennas and n total client antennas]

  11. Outline • Parallel architecture • Parallel algorithms and optimization • Performance • Conclusion

  12. Naive Architecture • A pool of processing servers – send all samples of the same frame to one server • Enough processing capability with ⌈total processing requirement / per-server capability⌉ servers

  13. Naive Architecture • Issue: long processing latency for a frame (~1 s), since a single server must work through the whole frame's computation alone • Wireless protocols require responses within milliseconds

  14. Our Approach: Distributed Pipeline • Parallelize MU-MIMO processing into a 3-stage pipeline: channel inversion → spatial demultiplexing → channel decoding • At each stage, the computation is further parallelized among multiple servers

  15. Data Partitioning Across Servers • Exploit data parallelism inside MU-MIMO: partition by subcarriers [Figure: the OFDM signal is split per subcarrier across servers in the channel inversion, spatial demultiplexing, and channel decoding stages]

  16. Data Partitioning Across Servers • Exploit data parallelism inside MU-MIMO: partition by spatial streams [Figure: the same three-stage pipeline with servers split per spatial stream]

  17. Example • Gigabit to 20 users – a 160MHz channel provides 468 parallel subcarriers • Subcarrier partitioning – each server handles as little as 10Mbps of data • Spatial-stream partitioning – each server handles 5Gbps of data • Generally within an existing server's processing capability – multi-core (4~16 cores) – 10G NIC (see the sketch below)
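
The mapping of work units onto servers can be plain round-robin assignment. A small sketch using the slide's numbers (partition is a hypothetical helper, not from the paper, and the server counts are just examples):

    # Round-robin assignment of work units to servers.
    def partition(units, num_servers):
        """One list of units per server, assigned round-robin."""
        return [units[i::num_servers] for i in range(num_servers)]

    subcarriers = list(range(468))   # 160MHz -> 468 data subcarriers
    streams = list(range(40))        # one spatial stream per AP antenna

    by_subcarrier = partition(subcarriers, 47)  # ~10 subcarriers per server
    by_stream = partition(streams, 40)          # one 5Gbps stream per server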

  18. Summary • Distributed pipeline for low latency • Exploit data parallelism across servers at each processing stage • If a single data partition is still beyond the capability of a single processing unit – build a deeper pipeline (see the paper for details)

  19. Outline • Parallel architecture • Parallel algorithms and optimization • Performance • Conclusion

  20. Computation Partitioning in a Server • Three key operations in MU-MIMO – Matrix multiplication – Matrix inversion – Viterbi decoding (channel decoding)

  21. Parallel Matrix Multiplication • Divide-and-conquer: split H row-wise into H1 (core 1) and H2 (core 2); then H*H = [H1* H2*][H1; H2] = H1*H1 + H2*H2, so each core computes one partial product and the results are summed
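
A minimal sketch of that decomposition with NumPy, where thread workers stand in for the cores (sizes are made up; this is not the BigStation runtime):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    rng = np.random.default_rng(1)
    H = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
    H1, H2 = H[:4], H[4:]   # row split: "core 1" half, "core 2" half

    def gram(block):
        return block.conj().T @ block   # partial product Hi* Hi

    with ThreadPoolExecutor(max_workers=2) as pool:
        g1, g2 = pool.map(gram, [H1, H2])

    assert np.allclose(g1 + g2, H.conj().T @ H)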

  22. Parallel Matrix Inversion • Based on the Gauss-Jordan method [Figure: the n x 2n augmented matrix [H | I], with its columns split between core 1 and core 2]

  23. Parallel Matrix Inversion • Based on the Gauss-Jordan method [Figure: row operations reduce [H | I] to [I | H^-1]; each core applies the same row operations to its own column block]
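
The column split works because every Gauss-Jordan row operation is identical across columns: once the pivot column is known, each core can update its own column block independently. A sequential sketch that makes this column-wise structure visible (no pivoting, illustrative only):

    import numpy as np

    def gj_inverse(H):
        """Invert H by reducing [H | I] to [I | H^-1]."""
        n = H.shape[0]
        A = np.hstack([H.astype(complex), np.eye(n)])   # [H | I]
        for k in range(n):
            A[k] = A[k] / A[k, k]              # scale pivot row
            for i in range(n):
                if i != k:
                    A[i] -= A[i, k] * A[k]     # eliminate column k
        # Both updates act on each column independently given column k,
        # so the columns of A could be owned by different cores.
        return A[:, n:]

    H = np.array([[2.0, 1.0], [1.0, 3.0]])
    assert np.allclose(gj_inverse(H) @ H, np.eye(2))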

  24. Parallel Viterbi Decoding • Challenge: sequential operations on a continuous (soft-)bit stream • Solution: – artificially divide the bit stream into blocks, one per core

  25. Parallel Viterbi Decoding • Challenge: sequential operations on a continuous (soft-)bit stream • Solution: – artificially divide the bit stream into blocks – add overlaps between blocks to ensure convergence to the optimal path

  26. Parallel Viterbi Decoding • How to choose the right block size M? – a tradeoff between latency and overhead • Goal: fully utilize the computation capacity while keeping M minimal – each M-bit block carries 2E extra overlap bits, so n cores keep up only if v(M + 2E)/M ≤ nw • Optimal size: M* = 2Ev/(nw − v) – v: stream bit rate, w: processing rate per core, n: number of cores, E: overlap length at each end of a block [Figure: overlapping blocks assigned to core 1 and core 2]
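
A sketch of the block-parallel decode loop. decode_block here is a hypothetical per-core Viterbi routine (not named in the paper); the point is only the split-with-overlap and stitch logic, which keeps the middle M bits of each decoded block and discards the overlap:

    def optimal_block(E, v, w, n):
        """M* = 2Ev/(nw - v): smallest block size that keeps n cores of
        rate w ahead of a v-bit/s stream with 2E overlap bits per block."""
        assert n * w > v, "cores cannot keep up at any block size"
        return 2 * E * v / (n * w - v)

    def decode_parallel(stream, M, E, decode_block):
        out = []
        for start in range(0, len(stream), M):
            lo = max(0, start - E)                  # overlap before block
            hi = min(len(stream), start + M + E)    # overlap after block
            decoded = decode_block(stream[lo:hi])   # run on its own core
            out.extend(decoded[start - lo : start - lo + M])
        return out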

  27. Optimization: Lock-free Computing Structure • Complex interaction between communication and computation threads causes contention at the output buffer • A lock-free structure removes the contention (1.31x throughput improvement)
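
One common lock-free shape for this producer/consumer hand-off is a single-producer, single-consumer ring buffer, where each thread owns one index and no mutex guards the hot path. A sketch of the idea (ours, not BigStation's actual C++ structure):

    class SPSCRing:
        """Single-producer/single-consumer ring: no locks needed because
        only the producer writes tail and only the consumer writes head."""
        def __init__(self, capacity):
            self.buf = [None] * capacity
            self.head = 0   # advanced only by the consumer
            self.tail = 0   # advanced only by the producer

        def push(self, item):               # producer thread only
            nxt = (self.tail + 1) % len(self.buf)
            if nxt == self.head:
                return False                # full: caller backs off
            self.buf[self.tail] = item
            self.tail = nxt
            return True

        def pop(self):                      # consumer thread only
            if self.head == self.tail:
                return None                 # empty
            item = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            return item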

  28. Optimization: Communication • Parallelize communication among multiple cores • Deal with the incast problem – application-level flow control • Isolate communication and computation on different cores

  29. Outline • Parallel architecture • Parallel algorithms and optimization • Performance • Conclusion

  30. Micro-benchmarks • Platform: Dell server with an Intel Xeon E5520 CPU (2.26 GHz, 4 cores) [Figure: channel inversion throughput]

  31. Micro-benchmarks [Figures: spatial demultiplexing and Viterbi decoding throughput]

  32. Micro-benchmarks [Figure: results for 6 users at 100Mbps, 20 users at 600Mbps, and 50 users at 1Gbps]

  33. Prototype • Software radio: Sora MIMO Kit – 4x phase-coherent radio chains – extensible with an external clock

  34. Capacity Gain [Figure: capacity gain capped at a constant value due to random user selection]

  35. Capacity Gain [Figure: 6.8x capacity gain with overprovisioned AP antennas]

  36. Processing Delay [Figure: 860 μs processing delay under light load (1 frame per 10ms) and under heavy load (back-to-back frames)]
