Widar3.0: Zero-Effort Cross-Domain Gesture Recognition with Wi-Fi
Yue Zheng¹, Yi Zhang¹, Kun Qian¹, Guidong Zhang¹, Yunhao Liu¹,³, Chenshu Wu², Zheng Yang¹
¹ Tsinghua University  ² University of Maryland, College Park  ³ Michigan State University
Motivation
• Human gesture recognition is the core enabler for a wide range of applications: smart home, virtual reality, security surveillance.
• RF radios vs. cameras/wearable devices/ultrasound:
  – Less privacy concern.
  – No requirement of on-body sensors.
  – More ubiquitous deployment and larger sensing range.
• Wi-Fi is currently widely deployed!
State-of-the-Art Works
• E-eyes (Wang et al., MobiCom'14) – a pioneering work that uses the strength distribution of commercial Wi-Fi signals and KNN to recognize human activities.
• CARM (Wang et al., MobiCom'15) – calculates the power distribution of Doppler Frequency Shift (DFS) components as learning features of an HMM model.
• WIMU (Venkatnarayan et al., MobiSys'18) – segments DFS power profiles for multi-person activity recognition.
These works use primitive signal features, which usually carry environment information unrelated to gestures.
State-of-the-Art Works: Cross-Domain Gesture Recognition
Domain: location, orientation, environment.
• Explore the cross-domain generalization ability of the recognition model:
  – CrossSense (Zhang et al., MobiCom'18)
  – EI (Jiang et al., MobiCom'18)
• Generate signal features of the target domain for model re-training:
  – WiAG (Virmani et al., MobiSys'17)
All of them require extra training effort each time a new target domain is added to the recognition model.
Key Idea
• Can we avoid extra data collection or model re-training for cross-domain recognition?
  – Yes! We push the generalization ability down to the lower signal level, rather than the upper model level.
  – Extract domain-independent features.
  – Train once, use anywhere.
System Overview
• C1: How to define a domain-independent feature in theory?
• C2: How to estimate the feature in practice from collected Wi-Fi measurements?
• C3: How to devise the recognition model to fully capture the characteristics of the new feature?
Our Prior Efforts
• Widar (MobiHoc'17)
  – models the relation among a person's walking velocity, location, and DFS, and pinpoints the person passively.
  – achieves decimeter-level accuracy with only one commercial Wi-Fi sender and two receivers.
Our Prior Efforts
• Widar2.0 (MobiSys'18)
  – proposes a unified model of ToF, AoA, and DFS and devises an efficient algorithm for their joint estimation.
  – with fine-grained range and AoA provided by a single link, directly localizes the moving person at the decimeter level.
[Figure: geometric model of a Tx–Rx link — LoS path, reflection ray, ToF ellipse, DFS curve, and AoA at the receiver array.]
Prior works regard a person as a single point, which is infeasible for recognizing complex gestures that involve multiple body parts. We need to define a new feature!
Anticipated Properties of Signal Features for Finer-Grained Tasks
• Domain-independent – capture only human actions rather than domain factors (location, orientation, environment, etc.).
• Zero-effort – no model re-training for a new domain.
• Finer-grained – contain multiple signal components that correspond to different body parts.
Our Solution
• BVP: Body-coordinate Velocity Profile
  – The same gesture may exhibit different velocity distributions in the global coordinate system, but its distribution in the body coordinate system is invariant.
  – The transformation from global to body coordinates can be achieved with the knowledge of the devices' locations and the user's location and orientation, as sketched below.
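To make the transformation concrete, below is a minimal Python sketch of rotating a velocity component from the global (room) frame into the body frame. The 2-D geometry, the angle convention, and the function name are illustrative assumptions, not part of Widar3.0's released code.

```python
import numpy as np

def global_to_body(vx, vy, orientation_rad):
    """Rotate a velocity component from the global (room) frame into the
    body frame, so that +x always points in the user's facing direction.
    `orientation_rad` is the user's orientation in the global frame
    (the angle convention is an assumption for illustration)."""
    c, s = np.cos(-orientation_rad), np.sin(-orientation_rad)
    return c * vx - s * vy, s * vx + c * vy

# Example: 1 m/s along global +x while the user faces global +y (90 deg);
# in body coordinates the same motion appears on the user's right side.
print(global_to_body(1.0, 0.0, np.pi / 2))  # ~(0.0, -1.0)
```

In Widar3.0 this rotation is applied to the whole velocity grid of the profile, which is why knowing the user's orientation (plus locations, for the link geometry) suffices to remove the domain dependency.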
One-Link DFS and BVP
• The relation between the DFS profile of the $i$-th link, $D^{(i)}$, and the vectorized BVP $V$, which includes multiple velocity components, can be modeled as:
  – $D^{(i)} = c^{(i)} A^{(i)} V$
    • $c^{(i)}$ – scaling factor due to propagation loss.
    • $A^{(i)}$ – assignment matrix: $A^{(i)}_{k,l} = 1$ if $f_k = f^{(i)}(\vec{v}_l)$, and $0$ otherwise.
    • $D^{(i)} \in \mathbb{R}^{F \times 1}$, where $F$ is the number of sampling points in the frequency domain.
    • $V \in \mathbb{R}^{N^2 \times 1}$, where $N$ is the number of sampling points in the velocity domain.
From Multiple DFS to BVP
• DFS from one link only depicts radial velocity components [1].
• DFS from multiple links is utilized to fully recover the BVP, as in the sketch below.
[1] Widar, MobiHoc'17
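As an illustration of the forward model $D^{(i)} = c^{(i)} A^{(i)} V$, the sketch below constructs the assignment matrix of one link from its geometry: the DFS induced by a body part moving with velocity $\vec{v}$ is the rate of change of the Tx→body→Rx path length divided by the carrier wavelength. The wavelength, grid sizes, frequency bins, and function names are illustrative assumptions.

```python
import numpy as np

WAVELEN = 0.052        # ~5.8 GHz Wi-Fi wavelength in metres (assumption)
V_MAX, N = 2.0, 20     # velocity range of +-2 m/s on an N x N grid (assumption)
FREQ_BINS = np.linspace(-60, 60, 121)  # sampled DFS frequencies in Hz (assumption)

def dfs_of_velocity(v, person, tx, rx):
    """Doppler shift induced by a body part at `person` moving with velocity
    `v`: rate of change of the reflection path length Tx -> person -> Rx,
    divided by the carrier wavelength."""
    u_tx = (person - tx) / np.linalg.norm(person - tx)
    u_rx = (person - rx) / np.linalg.norm(person - rx)
    return np.dot(v, u_tx + u_rx) / WAVELEN

def assignment_matrix(person, tx, rx):
    """A^(i): maps each of the N*N velocity bins to its nearest sampled DFS
    frequency bin for link i (cf. D^(i) = c^(i) A^(i) V)."""
    vs = np.linspace(-V_MAX, V_MAX, N)
    A = np.zeros((len(FREQ_BINS), N * N))
    for k, (vx, vy) in enumerate((vx, vy) for vx in vs for vy in vs):
        f = dfs_of_velocity(np.array([vx, vy]), person, tx, rx)
        A[np.argmin(np.abs(FREQ_BINS - f)), k] = 1.0
    return A

# One link of an assumed 2-D layout: Tx at the origin, Rx 4 m away.
A1 = assignment_matrix(person=np.array([2.0, 2.0]),
                       tx=np.array([0.0, 0.0]), rx=np.array([4.0, 0.0]))
```

Because each link projects the 2-D velocity grid onto a 1-D frequency axis, many velocity bins collide in the same row of $A^{(i)}$; this is exactly why a single link cannot invert the system and several links with different geometries are needed.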
Problems of BVP Estimation
• The equation system $D^{(i)} = c^{(i)} A^{(i)} V$ is severely under-determined.
  – DFS profiles from multiple links provide far fewer constraints than the number of variables to be estimated in the BVP.
• Fortunately, only a few dominant velocity components exist in each BVP snapshot, so the solution is sparse.
Optimization of BVP Estimation
• We adopt sparse recovery to estimate the BVP.
• We formulate the estimation of the BVP as an $\ell_0$ optimization problem:
  – $\min_{V} \sum_{i=1}^{M} \mathrm{EMD}(A^{(i)} V, D^{(i)}) + \eta \|V\|_0$
  – The sparsity of the velocity components is enforced by the term $\eta \|V\|_0$.
  – EMD (Earth Mover's Distance) resolves the unknown scaling factor caused by the propagation loss of the reflected signal and relieves quantization error in the BVP. A simplified solver is sketched below.
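Below is a minimal sketch of the estimation step with two deliberate simplifications: the EMD term is replaced by a squared error, and the $\ell_0$ penalty is relaxed to $\ell_1$ (a standard sparse-recovery surrogate). It is therefore an illustration of sparse recovery under these assumptions, not the paper's exact solver, and the scaling factors $c^{(i)}$ are assumed normalized away.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_bvp(A_list, D_list, eta=0.01, n=20):
    """Stack the per-link systems D^(i) ~= A^(i) V and solve the surrogate
    min_V sum_i ||A^(i) V - D^(i)||^2 + eta * ||V||_1.
    (Illustrative simplification: the paper uses EMD and an l0 penalty.)"""
    A = np.vstack(A_list)              # (sum_i F_i) x N^2 stacked constraints
    D = np.concatenate(D_list)         # stacked DFS profiles
    model = Lasso(alpha=eta, positive=True, max_iter=10000)  # BVP power >= 0
    model.fit(A, D)
    return model.coef_.reshape(n, n)   # back onto the N x N velocity grid
```

The `positive=True` constraint encodes that a velocity profile is a power distribution, which further narrows the under-determined solution space.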
Comparison of Signal Features
• We investigate raw CSI, DFS, and BVP:
  – Example gesture: Pushing and Pulling.
  – Two domains.
Comparison of Signal Features
[Figure: CSI, DFS, and BVP of the same gesture under Domain 1 (orientation #1, location #1, environment #1) and Domain 2 (orientation #2, location #2, environment #2).]
The CSI and DFS of the same gesture are likely to vary across different domains, but the BVP stays consistent!
BVP Examples
[Figure: BVP series of three example gestures — Sliding, Pushing & Pulling, Clapping.]
Gesture Recognition Model
• A hybrid CNN+RNN model is designed to fully capture the characteristics of the BVP (see the sketch below).
  – A CNN extracts spatial features from each single BVP snapshot.
  – A GRU captures temporal dependencies among BVP snapshots, and is easier to train with less data.
• With the help of the BVP, this simple recognition model is effective.
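A minimal PyTorch sketch of such a hybrid model, assuming 20×20 BVP snapshots and 6 gesture classes; the layer sizes and names are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BVPGestureNet(nn.Module):
    """Hybrid CNN+RNN sketch: a small CNN encodes each 20x20 BVP snapshot;
    a GRU models the temporal sequence of snapshot embeddings; a linear
    head classifies the gesture."""
    def __init__(self, n_classes=6, emb_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 20x20 -> 10x10
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 10x10 -> 5x5
            nn.Flatten(), nn.Linear(32 * 5 * 5, emb_dim), nn.ReLU(),
        )
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, T, 20, 20)
        b, t, h, w = x.shape
        z = self.cnn(x.reshape(b * t, 1, h, w)).reshape(b, t, -1)
        _, h_n = self.gru(z)          # final hidden state summarizes the sequence
        return self.head(h_n[-1])     # gesture logits

logits = BVPGestureNet()(torch.randn(4, 30, 20, 20))  # 4 samples, 30 snapshots each
```

Because the BVP already strips away domain factors, a compact network like this can suffice, which is the point the slide makes.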
Experiment
• Implementation
  – Mini-desktops with Intel 5300 NICs.
• Setup
  – 3 scenarios: classroom, hall, office.
[Figure: deployment layouts of (a) classroom, (b) hall, and (c) office — one Tx and multiple Rx placed around a roughly 2 m × 2 m sensing area, with 5 candidate locations (A–E) and 5 orientations (1–5) marked inside.]
Overall Accuracy
• Dataset: 12,000 gesture samples (16 users × 5 positions × 5 orientations × 6 gestures × 5 instances).
• Gestures: pushing and pulling, sweeping, clapping, sliding, drawing circle, and drawing zigzag.
• Widar3.0 achieves consistently high accuracy across different domains.
Method Comparison
[Figure: accuracy comparison across (i) different approaches, (ii) different inputs, and (iii) different learning models.]
• Widar3.0 outperforms the state-of-the-art cross-domain learning methodologies.
  – It does not require extra data from a new domain or model re-training.
• BVP outperforms both denoised CSI and DFS as input.
• The proposed recognition model is simple but effective with BVP as input.
Parameter Study
• Impact of training set diversity.
  – The accuracy increases from 74% to 89% as the number of training users grows from 1 to 7.
    • More data to train the learning model.
    • A more diverse training set is more likely to cover the behavioral differences between testing and training persons.
Conclusion
• From Widar and Widar2.0 to Widar3.0
  – Widar3.0 aims at recognizing complex gestures that involve multiple body parts, rather than regarding a person as a single point.
• Zero-effort cross-domain gesture recognition system
  – We propose a domain-independent feature, the BVP.
  – With the BVP as input, the recognition model requires no extra data collection or model re-training when a new domain is added.
  – With the spatial-temporal characteristics of the BVP fully captured, the system achieves high recognition accuracy across different domain factors: specifically, 89.7%, 82.6%, and 92.4% across the user's location, orientation, and environment, respectively.
  – The dataset is publicly available.
Data Availability
• We collect a hand gesture dataset consisting of raw Wi-Fi readings (CSI) and other sophisticated features (e.g., DFS and BVP): 258K instances, 8,620 minutes in duration, from 75 domains.
• The dataset and the Widar series of works can be found at http://tns.thss.tsinghua.edu.cn/widar3.0/index.html
Yue Zheng
Tsinghua University
cczhengy@gmail.com