Learning and Vision Research Group
Shuicheng YAN, National University of Singapore
Learning and Vision Research Group (LV): founded in early 2008, 20-30 members
Three Indicators of Excellence for Members: Industry Commercialization, Competition Awards, High Citations. One indicator is enough for a member to be an excellent researcher.
Past, Present and Future of LV
● Past (revisiting the classics): Subspace Learning, Sparsity/Low-rank
● Present (grounded in the present): Deep Learning
● Future (looking ahead): Smart Services/Devices (Never-ending Learning)
Learning and Vision Group, Past: Subspace Learning, Sparsity/Low-rank [Block-Diagonality] [Guangcan LIU, Canyi LU, Jiashi FENG]
Subspace: Graph Embedding and Extensions
Intrinsic graph: $G = \{\{x_i\}_{i=1}^{N}, S\}$ with $x_i \in \mathbb{R}^n$; penalty graph: $G^p = \{\{x_i\}_{i=1}^{N}, S^p\}$.
With $Y = [y_1, y_2, \ldots, y_N]$, the intrinsic graph gives the objective
$\min_Y \sum_{i \neq j} \|y_i - y_j\|^2 S_{ij} = \min_Y \operatorname{Tr}(Y L Y^T)$, where $L = D - S$, $D_{ii} = \sum_{j \neq i} S_{ij}$,
while the penalty graph provides the scale/constraint term
$\max_Y \sum_{i \neq j} \|y_i - y_j\|^2 S^p_{ij} = \max_Y \operatorname{Tr}(Y L^p Y^T)$, where $L^p = D^p - S^p$, $D^p_{ii} = \sum_{j \neq i} S^p_{ij}$.
Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, TPAMI'07, Yan, et al.
Subspace: Graph Embedding and Extensions
With the intrinsic graph $G = \{\{x_i\}_{i=1}^{N}, S\}$ and penalty graph $G^p = \{\{x_i\}_{i=1}^{N}, S^p\}$ as above, the framework covers four embedding types:
● Direct graph embedding: $\min \operatorname{Tr}(Y L Y^T)$ subject to $\operatorname{Tr}(Y L^p Y^T) = d$; examples: original PCA & LDA, ISOMAP, LLE, Laplacian Eigenmap, ...
● Linearization: $y_i = W^T x_i$; examples: PCA, LDA, LPP, LEA, ...
● Kernelization: apply the linear form in a kernel-induced feature space; examples: KPCA, KDA, ...
● Tensorization: $y_i = x_i \times_1 W_1 \times_2 W_2 \cdots \times_n W_n$; examples: CSA, DATER, ...
Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, TPAMI'07, Yan, et al.
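As a concrete illustration (not taken from the slides), here is a minimal NumPy/SciPy sketch of the direct graph embedding step: the embedding coordinates come from the smallest generalized eigenvectors of the pair (L, L^p), skipping the trivial constant vector. The function name graph_embedding and the small regularizer are assumptions of this sketch.

import numpy as np
from scipy.linalg import eigh

def graph_embedding(S, Sp, dim=2):
    # Laplacians of the intrinsic and penalty graphs: L = D - S, L^p = D^p - S^p.
    L = np.diag(S.sum(axis=1)) - S
    Lp = np.diag(Sp.sum(axis=1)) - Sp
    Lp = Lp + 1e-6 * np.eye(len(S))          # regularize so (L, L^p) is well posed
    evals, evecs = eigh(L, Lp)               # generalized eigenproblem L v = lambda L^p v
    # The smallest eigenvectors minimize Tr(Y L Y^T) under the penalty-graph constraint;
    # skip the trivial constant eigenvector.
    return evecs[:, 1:dim + 1].T             # Y = [y_1, ..., y_N], shape (dim, N)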
Block-diagonality-1: Low-Rank Representation (LRR)
Given data $X = [x_1, \ldots, x_N]$, learn the affinity matrix $Z$ by LRR: $\min_Z \|Z\|_*$ s.t. $X = XZ$ (with an extra sparse error term $E$ and $X = XZ + E$ in the noisy case).
Theorem: the solution to LRR is block diagonal when the data are drawn from independent subspaces, yielding a block-diagonal affinity matrix.
Robust recovery of subspace structures by low-rank representation, TPAMI'13, Liu, et al.
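For reference, in the noiseless case this problem has a known closed-form solution, the shape interaction matrix; a minimal NumPy sketch (illustrative, not slide content; the function name lrr_noiseless and tolerance are assumptions):

import numpy as np

def lrr_noiseless(X, tol=1e-8):
    # Noiseless LRR: min ||Z||_* s.t. X = XZ.
    # Closed-form minimizer: Z* = V_r V_r^T, where X = U_r S_r V_r^T is the skinny SVD.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s[0]).sum())          # numerical rank of X
    Vr = Vt[:r].T                            # N x r right singular vectors
    return Vr @ Vr.T                         # block diagonal for independent subspaces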
Block-diagonality-2: Unified Block-diagonal Conditions
Theorem: the solution to the general representation problem ($\min_Z f(Z)$ s.t. $X = XZ$) is block diagonal if (1) the data lie in independent subspaces and (2) the regularizer $f$ satisfies the Enforced Block Diagonal (EBD) conditions, or the solution is unique.
Enforced Block Diagonal (EBD) conditions: a set of conditions on the regularizer $f$ (see the paper for the precise statement).
Many known regularizers satisfy the EBD conditions, e.g., the squared Frobenius norm (LSR) and the nuclear norm (LRR).
Robust and efficient subspace segmentation via least squares regression, ECCV'12, Lu, et al.
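As one concrete instance from the cited ECCV'12 paper, LSR uses the squared Frobenius norm and admits a closed-form solution; a minimal NumPy sketch (illustrative only; the function name lsr and the default lam are assumptions):

import numpy as np

def lsr(X, lam=1e-2):
    # LSR representation: min_Z ||X - X Z||_F^2 + lam ||Z||_F^2
    # Closed form: Z* = (X^T X + lam I)^{-1} X^T X,
    # block diagonal for independent subspaces (Frobenius norm satisfies EBD).
    N = X.shape[1]
    G = X.T @ X
    return np.linalg.solve(G + lam * np.eye(N), G)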
Block-diagonality-3: Hard Block-diagonal Constraint
Key property: an affinity matrix is block diagonal with $k$ blocks exactly when its graph Laplacian has $k$ zero eigenvalues.
The block-diagonal prior enforces this property directly, giving LRR with a hard block-diagonal constraint.
Robust Subspace Segmentation with Laplacian Constraint, CVPR'14, Feng, et al.
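For completeness, the key spectral fact behind the hard constraint can be written as follows (standard spectral graph theory rather than text copied from the slide):

% The multiplicity of the eigenvalue 0 of the graph Laplacian equals the number of
% connected components (diagonal blocks) of a nonnegative symmetric affinity B.
\[
  B \in \mathbb{R}^{N \times N},\; B \ge 0,\; B = B^{T}:\qquad
  B \text{ is $k$-block-diagonal}
  \;\Longleftrightarrow\;
  \operatorname{rank}(L_B) = N - k,
  \quad L_B = \operatorname{Diag}(B\mathbf{1}) - B .
\]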
Learning and Vision Group, Present: NUS-Purine, a Bi-graph based Deep Learning Framework [A^3: Architecture, Algorithms, Applications]
Deep Learning in Learning and Vision (LV) Research Group
● Architecture: Purine, a general bi-graph based DL framework; multi-PC, multi-CPU/GPU; approximately linear speedup; high re-usability, bridging academia and industry
● Algorithms: Network-in-Network + Computational Baby Learning; more human-brain-like network structure and learning process, regularizers
● Applications (Landing): Smart Services/Devices + Cloud/Embedded System; object analytics, product search/recommendation, human analytics, and others
Selected results and awards:
● Competitions: 4 winner awards in VOC; one 2nd prize in VOC; 2nd prize in ImageNet'13; 1st prize in ImageNet'14
● Best paper/demo awards: ACM MM12, ACM MM13
● Other highlights: LFW 98.78% (2nd best); best human parsing performance; cross-age synthesis; face analysis with occlusions; technology also licensed
A^3-I. Architecture: Purine, a Bi-graph based Deep Learning Framework [Min LIN, Xuan LUO, Shuo LI]
What is "Purine"?
● Purine benefits from the open-source deep learning framework Caffe (http://caffe.berkeleyvision.org/).
● In Purine, the math functions and core computations are adapted from Caffe.
● The name is a nod to chemistry: purine and caffeine have closely related molecular structures.
Difference from Caffe [figure: side-by-side comparison of Caffe and Purine]
Definition Graph vs. Computation Graph [figure: definition graph and computation graph of a convolutional layer]
Definition Graph vs. Computation Graph [figure: definition graph and computation graph of a dropout layer]
Purine Overview: Two Subsystems in Purine
● Interpretation: compose the network in Python and generate the computation graph in YAML
● Optimization: dispatch and solve the computation graphs
Basic Components
● Blob: a tensor that contains data.
● Op: an operator that performs computation on blobs and outputs blobs.
Ops are modular: they can be developed and packed in a shared library with some common functions exported. Purine can then dynamically load the ops like extensions.
Built-in Op types: Conv, ConvDown, ConvWeight, Inner, InnerDown, InnerWeight, Bias, BiasDown, Pool, PoolDown, Relu, ReluDown, Softmax, SoftmaxDown, SoftmaxLoss, SoftmaxLossDown, Gaussian, Bernoulli, Constant, Uniform, Copy, Merge, Slice, Sum, WeightedSum, Mul, Swap, Dumper, Loader.
Sub-system-1: Interpretation [figure: a definition graph is expanded into the corresponding computation graph] (A conceptual sketch of this expansion follows.)
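A conceptual sketch of the interpretation step, assuming PyYAML is available. This is not Purine's actual API: the helper expand_conv_layer and the wiring of the gradient ("diff") blobs are hypothetical, but the op names (Conv, ConvDown, ConvWeight) and the YAML fields (type, op_type, name, inputs, outputs) follow the deck's own op list and YAML examples.

import yaml  # PyYAML

def expand_conv_layer(name, bottom, top, weight):
    # One definition-graph "Conv layer" becomes three computation-graph ops:
    # forward Conv, backward-to-data ConvDown, backward-to-weight ConvWeight.
    return [
        {"type": "op", "op_type": "Conv", "name": name,
         "inputs": [bottom, weight], "outputs": [top]},
        {"type": "op", "op_type": "ConvDown", "name": name + "_down",
         "inputs": [top + "_diff", weight], "outputs": [bottom + "_diff"]},
        {"type": "op", "op_type": "ConvWeight", "name": name + "_weight",
         "inputs": [top + "_diff", bottom], "outputs": [weight + "_diff"]},
    ]

if __name__ == "__main__":
    graph = expand_conv_layer("conv1", "data", "conv1_top", "conv1_w")
    print(yaml.dump(graph, sort_keys=False))   # emit the computation graph as YAML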
Sub-system-2: Optimization
How to solve the computation graph?
● Start from sources, stop at sinks
● Applies to any Directed Acyclic Graph (DAG)
● An op computes when all of its inputs are ready
● A blob is ready when all of its inputs have computed
● All computations are event based and asynchronous, parallelized where possible
(A minimal scheduler sketch follows.)
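A minimal Python sketch of this event-driven scheduling over a bipartite blob/op graph. The Blob/Op classes and the run_graph helper are illustrative stand-ins, not Purine code, with threads in place of CPU/GPU workers.

from concurrent.futures import ThreadPoolExecutor
import threading

class Blob:
    def __init__(self, name):
        self.name = name
        self.pending = 0            # number of ops that still have to write this blob
        self.consumers = []         # ops that read this blob
        self.lock = threading.Lock()

class Op:
    def __init__(self, name, inputs, outputs, fn):
        self.name, self.fn = name, fn
        self.inputs, self.outputs = inputs, outputs
        self.pending = len(inputs)  # input blobs that are not ready yet
        self.lock = threading.Lock()
        for b in inputs:
            b.consumers.append(self)
        for b in outputs:
            b.pending += 1

def run_graph(source_blobs, pool):
    def blob_ready(blob):
        # A blob became ready: every consumer op loses one pending input.
        for op in blob.consumers:
            with op.lock:
                op.pending -= 1
                fire = (op.pending == 0)
            if fire:
                pool.submit(run_op, op)   # op fires when all its inputs are ready

    def run_op(op):
        op.fn()                           # the actual computation on the blobs
        for b in op.outputs:
            with b.lock:
                b.pending -= 1
                ready = (b.pending == 0)
            if ready:
                blob_ready(b)             # blob is ready once all writers have run

    for b in source_blobs:                # data/weight blobs that are ready at start
        blob_ready(b)

A caller would build the graph from Blob and Op nodes, call run_graph with the source blobs, and then wait on the pool; Copy ops that move data across devices or machines are scheduled by exactly the same rule.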
Why a Computation Graph?
● Less hard coding: all tasks (algorithm and parallel computing) are consistently defined in graphs
● The forward and backward passes live in the same graph
  - With only a definition graph, the solver must hard-code concepts such as alternating forward and backward passes
  - With a computation graph, that logic is expressed in the graph itself
● Any scheme of parallelism can be expressed in the computation graph
Parallelization Implementation: Properties of Ops and Blobs
● Location: the place that the blob/op resides on, including the IP address of the target machine and what device it is on (CPU/GPU).
● Thread: a thread is needed for an op because both CPU and GPU can be multi-threaded (streams, in terms of NVIDIA GPUs).

Example blob defined in YAML:
type: blob
name: weight
size: [96, 3, 11, 11]
location:
  ip: 127.0.0.1
  device: 0

Example op defined in YAML:
type: op
op_type: Conv
name: conv1
inputs: [ bottom, weight ]
outputs: [ top ]
location:
  ip: 127.0.0.1
  device: 0
thread: 1
other fields ...
Parallelization-1 (Pipeline)
One computation graph can span multiple machines!
Special Op: Copy.
● If locations A & B are the same machine but different devices, Copy does one of: cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, or cudaMemcpyDeviceToHost.
● If locations A & B are on different machines, Copy resides on both machines: the source side calls nn_send(socket, data) and the target side calls nn_receive(socket, data).
How to run this pipeline?
● Copy is executed as soon as its input blob is ready.
● Copy runs in its own worker thread, so computation and data transfer are overlapped wherever possible.
● GPU inbound and outbound copies are placed in different streams to fully utilize CUDA's dual copy engines.
(A sketch of the overlapping Copy worker follows.)
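A toy Python sketch of the overlap described above: a Copy stage runs in its own thread between two compute stages, so transfer and computation proceed concurrently. The run_pipeline and _stage helpers are hypothetical, and the identity copy stands in for cudaMemcpy* or the nn_send/nn_receive pair.

import threading, queue

_SENTINEL = object()

def _stage(fn, inq, outq):
    # Generic pipeline stage: pull an item, process it, push the result.
    while True:
        item = inq.get()
        if item is _SENTINEL:
            outq.put(_SENTINEL)   # forward the shutdown signal downstream
            break
        outq.put(fn(item))

def run_pipeline(batches, compute_a, compute_b):
    q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
    copy = lambda x: x            # stands in for the actual memcpy / network transfer
    stages = [(compute_a, q0, q1), (copy, q1, q2), (compute_b, q2, q3)]
    threads = [threading.Thread(target=_stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for x in batches:             # stage A computes while Copy moves earlier results
        q0.put(x)
    q0.put(_SENTINEL)
    results = []
    while True:
        item = q3.get()
        if item is _SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results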
Parallelization-1 (Pipeline) [figure: graphs/subgraphs (Graph 1, Graph 2, Graph 3) are replicated and iterated]
Parallelization-2 (Data Parallelism)
Case 2, data parallelism:
● Explicitly duplicate the nets at different locations
● Each duplicate runs on different data
● Weight gradients are gathered at a parameter server
Parallelization ‐ 2 (Data parallelism) ● Higher layer gradients are computed Overlap data transfer and computation earlier than lower layers. ● Higher layer can send gradients to parameter server and get them back while the lower layers are doing their computation. ● Especially true for very deep networks ● Data parallelism even for fully connected layers. Though lots of parameter for FC layer, latency is hidden. ● Cross machine (network) latency is less of a problem
Profiling Result
[figure: profiling trace showing data transfer overlapping with computation and the parameter update of the lowest layer; plot of images per second vs. number of GPUs (1, 2, 3, 4, 8)]
● Note that the 8 GPUs are on different machines.
● 8 GPUs train GoogLeNet in 40 hours; top-5 error rate 12.67% (still tuning).
A^3-II. Algorithms: Network-in-Network [More Human-brain-like Network Structure] [Min LIN, Qiang CHENG]