Learning and Vision Research Group

1. Learning and Vision Research Group. Shuicheng YAN, National University of Singapore

2. Learning and Vision Research Group (LV). Founded in early 2008; 20-30 members.

3. Three Indicators of Excellence for Members: Industry Commercialization, Competition Awards, High Citations. One indicator is enough for a member to be an excellent researcher.

4. Past, Present and Future of LV. Past (revisiting the classics): subspace learning, sparsity/low-rank. Present (grounded in the present): deep learning. Future (envisioning the future): smart services/devices (never-ending learning).

5. Learning and Vision Group, Past: Subspace Learning, Sparsity/Low-rank [Block-Diagonality] [Guangcan LIU, Canyi LU, Jiashi FENG]

6. Subspace: Graph Embedding and Extensions
Intrinsic graph: $G = \{\{x_i\}_{i=1}^{N}, S\}$, with samples $x_i \in \mathbb{R}^{n}$ and embeddings $Y = [y_1, y_2, \ldots, y_N]$:
$\min \sum_{i \neq j} \|y_i - y_j\|^2 S_{ij} = \min \mathrm{Tr}(Y L Y^{\top})$, where $L = D - S$ and $D_{ii} = \sum_{j \neq i} S_{ij}$.
Penalty graph: $G^{p} = \{\{x_i\}_{i=1}^{N}, S^{p}\}$:
$\max \sum_{i \neq j} \|y_i - y_j\|^2 S^{p}_{ij} = \max \mathrm{Tr}(Y L^{p} Y^{\top})$, where $L^{p} = D^{p} - S^{p}$ and $D^{p}_{ii} = \sum_{j \neq i} S^{p}_{ij}$.
Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, TPAMI'07, Yan, et al.
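The equality between the weighted pairwise distances and the trace form is the standard graph Laplacian identity; a short derivation (the factor of 2 is a constant that does not change the minimizer, so it is commonly absorbed):

```latex
% With Y = [y_1, ..., y_N], symmetric S, D_{ii} = \sum_j S_{ij}, and L = D - S:
\begin{align*}
\sum_{i \neq j} \| y_i - y_j \|^2 S_{ij}
  &= \sum_{i,j} S_{ij} \left( \| y_i \|^2 + \| y_j \|^2 - 2\, y_i^{\top} y_j \right) \\
  &= 2 \sum_{i} D_{ii} \| y_i \|^2 - 2 \sum_{i,j} S_{ij}\, y_i^{\top} y_j
   = 2\,\mathrm{Tr}(Y D Y^{\top}) - 2\,\mathrm{Tr}(Y S Y^{\top})
   = 2\,\mathrm{Tr}(Y L Y^{\top}).
\end{align*}
```

The same identity applied to $S^{p}$ yields the penalty term $\mathrm{Tr}(Y L^{p} Y^{\top})$.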

7. Subspace: Graph Embedding and Extensions
Starting from the intrinsic graph $G = \{\{x_i\}_{i=1}^{N}, S\}$ and the penalty graph $G^{p} = \{\{x_i\}_{i=1}^{N}, S^{p}\}$, the criterion $\min \mathrm{Tr}(Y L Y^{\top})$ s.t. $\mathrm{Tr}(Y L^{p} Y^{\top}) = d$ (with $L = D - S$, $L^{p} = D^{p} - S^{p}$) admits four types of formulation:
- Direct graph embedding: optimize the embedding $Y$ itself. Examples: original PCA & LDA, ISOMAP, LLE, Laplacian Eigenmap, ...
- Linearization: $y_i = W^{\top} x_i$. Examples: PCA, LDA, LPP, LEA.
- Kernelization: apply the linear form after a kernel-induced feature map. Examples: KPCA, KDA.
- Tensorization: $y_i = X_i \times_1 W_1 \times_2 W_2 \cdots \times_n W_n$. Examples: CSA, DATER, ...
Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, TPAMI'07, Yan, et al.
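As a worked instance of the linearization row, substituting $Y = W^{\top} X$ reduces the criterion to a generalized eigenvalue problem (the standard reduction in this framework):

```latex
\min_{W} \; \mathrm{Tr}\!\left( W^{\top} X L X^{\top} W \right)
\quad \text{s.t.} \quad \mathrm{Tr}\!\left( W^{\top} X L^{p} X^{\top} W \right) = d
\qquad \Longrightarrow \qquad
X L X^{\top} w \;=\; \lambda \, X L^{p} X^{\top} w ,
```

whose generalized eigenvectors with the smallest eigenvalues form the columns of $W$; choosing the intrinsic and penalty graphs appropriately recovers LDA, LPP, and related linear methods.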

8. Block-diagonality-1: Low-Rank Representation (LRR)
- Given the data matrix, learn the affinity matrix by LRR (the formulation is sketched below).
- Theorem: the solution to LRR is block diagonal when the data are drawn from independent subspaces, i.e., the learned affinity matrix is block diagonal.
Robust recovery of subspace structures by low-rank representation, TPAMI'13, Liu, et al.
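For context, the LRR formulation behind the theorem, as introduced in the cited TPAMI'13 paper ($X$ is the data matrix whose columns are the samples, $Z$ the representation/affinity matrix):

```latex
% Noiseless LRR: nuclear norm as a convex surrogate for rank
\min_{Z} \; \| Z \|_{*} \quad \text{s.t.} \quad X = X Z .
% Robust LRR with sample-specific corruptions E (outliers):
\min_{Z, E} \; \| Z \|_{*} + \lambda \| E \|_{2,1} \quad \text{s.t.} \quad X = X Z + E .
```

In the noiseless case with independent subspaces, the minimizer $Z^{*}$ has one block per subspace, which is exactly the block-diagonal affinity used for segmentation.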

9. Block-diagonality-2: Unified Block-diagonal Conditions
- Theorem: the solution to the regularized self-representation problem is block diagonal if (1) the data lie in independent subspaces, and (2) the regularizer satisfies the Enforced Block Diagonal (EBD) conditions or the solution is unique.
- Many known regularizers satisfy the EBD conditions; one example is sketched below.
Robust and efficient subspace segmentation via least squares regression, ECCV'12, Lu, et al.
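A concrete regularizer satisfying the EBD conditions is the Frobenius norm used by Least Squares Regression (LSR) in the cited ECCV'12 paper; a sketch of its objective and closed-form solution:

```latex
% LSR for subspace segmentation
\min_{Z} \; \| X - X Z \|_{F}^{2} + \lambda \| Z \|_{F}^{2},
\qquad
Z^{*} = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} X .
```

Because the Frobenius norm satisfies the EBD conditions, $Z^{*}$ is block diagonal for data from independent subspaces while being much cheaper to compute than nuclear-norm or sparsity based alternatives.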

10. Block-diagonality-3: Hard Block-diagonal Constraint
- Key property: the block structure of the affinity is determined by the rank of its graph Laplacian (see below).
- The block-diagonal prior is imposed directly: LRR with a hard block-diagonal (Laplacian) constraint.
Robust Subspace Segmentation with Laplacian Constraint, CVPR'14, Feng, et al.
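The key property referenced here is the standard spectral-graph fact that ties block-diagonal structure to the Laplacian rank; stated for context with $N$ samples and $k$ subspaces (the exact constraint form follows the cited CVPR'14 paper):

```latex
% For a nonnegative symmetric affinity W with Laplacian L_W = D - W:
\mathrm{rank}(L_{W}) = N - k
\;\Longleftrightarrow\;
\text{the affinity graph has exactly } k \text{ connected components.}
```

Imposing this as a hard constraint on the learned affinity forces exactly $k$ blocks, rather than relying on the independent-subspace assumption alone.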

11. Learning and Vision Group, Present. NUS-Purine: A Bi-graph based Deep Learning Framework [A³: Architecture, Algorithms, Applications]

12. Deep Learning in the Learning and Vision (LV) Research Group
- Architecture: Purine, a general, bi-graph based DL framework; multi-PC, multi-CPU/GPU; approximately linear speedup; high re-usability, bridging academia and industry.
- Algorithms: Network-in-Network + Computational Baby Learning; more human-brain-like network structures, learning processes, and regularizers.
- Applications: landing smart services/devices + cloud/embedded systems; object analytics, product search/recommendation, human analytics, and others.

13. Deep Learning in the Learning and Vision (LV) Research Group: selected results
- Competition awards: 4 winner awards in VOC; one 2nd prize in VOC; 2nd prize in ImageNet'13; 1st prize in ImageNet'14.
- Other highlights: best paper/demo awards (ACM MM12, ACM MM13); LFW 98.78%, 2nd best; best human parsing performance; cross-age synthesis (also licensed); face analysis with occlusions.

14. A³-I. Architecture. Purine: a Bi-graph based Deep Learning Framework [Min LIN, Xuan LUO, Shuo LI]

15. What is "Purine"?
● Purine benefits from the open-source deep learning framework Caffe: its math functions and core computations are adapted from Caffe.
● The name reflects the close molecular structure of purine and caffeine.
http://caffe.berkeleyvision.org/

16. Difference from Caffe [figure: side-by-side comparison of the Caffe and Purine designs]

17. Definition Graph vs. Computation Graph [figure: a definition graph, the corresponding computation graph, and the computation graph of a convolutional layer]

18. Definition Graph vs. Computation Graph [figure: a definition graph, the corresponding computation graph, and the computation graph of a dropout layer]

19. Purine Overview. Two subsystems in Purine:
● Interpretation: compose the network in Python, generate the computation graph in YAML.
● Optimization: dispatch and solve computation graphs.

20. Basic Components
● Blob: a tensor that contains data.
● Op: an operator that performs computation on blobs and outputs blobs.
Ops are modular: they can be developed and packed in a shared library with some common functions exported, and Purine can then dynamically load the ops like extensions.
Built-in Op types: Conv, ConvDown, ConvWeight; Inner, InnerDown, InnerWeight; Bias, BiasDown; Pool, PoolDown; Relu, ReluDown; Softmax, SoftmaxDown; SoftmaxLoss, SoftmaxLossDown; Gaussian, Bernoulli, Constant, Uniform; Copy, Merge, Slice, Sum, WeightedSum, Mul, Swap; Dumper, Loader.
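To make the Blob/Op modularity concrete, here is a minimal, hypothetical Python sketch of the idea; the class and function names are illustrative only, not Purine's actual shared-library API.

```python
# Hypothetical sketch of the Blob/Op abstraction and extension-style op loading.
import importlib

class Blob:
    """A tensor that carries data between ops."""
    def __init__(self, name, shape):
        self.name, self.shape, self.data = name, shape, None

class Op:
    """An operator that consumes input blobs and fills output blobs."""
    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs
    def run(self):
        raise NotImplementedError

OP_REGISTRY = {}                      # op type name -> op class

def register_op(op_type):
    """Ops register themselves by name, so new types plug in like extensions."""
    def wrap(cls):
        OP_REGISTRY[op_type] = cls
        return cls
    return wrap

def load_op_module(module_name):
    """Import a module whose ops self-register (mimics dynamic loading)."""
    importlib.import_module(module_name)

@register_op("Sum")
class SumOp(Op):
    def run(self):
        self.outputs[0].data = sum(b.data for b in self.inputs)

# Usage: wire blobs to an op and run it.
a, b, out = Blob("a", (1,)), Blob("b", (1,)), Blob("out", (1,))
a.data, b.data = 2.0, 3.0
OP_REGISTRY["Sum"]("sum1", inputs=[a, b], outputs=[out]).run()   # out.data == 5.0
```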

21. Sub-system-1: Interpretation [figure: a definition graph and the computation graph generated from it]
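A hypothetical sketch of what interpretation does conceptually: each layer node in the definition graph is expanded into forward and backward ops plus the blobs connecting them (the wiring below follows standard convolution backprop and reuses the op names from the built-in list; the real Purine emits YAML, so this is an illustration, not its API).

```python
# Illustrative expansion of one "Conv" definition-graph node into
# computation-graph nodes: forward, gradient w.r.t. input, gradient w.r.t. weight.
def expand_conv_layer(layer_name, bottom, weight, top):
    blobs = [bottom, weight, top,
             bottom + "_diff", weight + "_diff", top + "_diff"]
    ops = [
        {"op_type": "Conv",       "name": layer_name,
         "inputs": [bottom, weight],        "outputs": [top]},
        {"op_type": "ConvDown",   "name": layer_name + "_down",
         "inputs": [top + "_diff", weight], "outputs": [bottom + "_diff"]},
        {"op_type": "ConvWeight", "name": layer_name + "_weight",
         "inputs": [top + "_diff", bottom], "outputs": [weight + "_diff"]},
    ]
    return blobs, ops

blobs, ops = expand_conv_layer("conv1", "data", "conv1_w", "conv1_out")
```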

22. Sub-system-2: Optimization. How to solve a computation graph?
● Start from the sources and stop at the sinks; this applies to any Directed Acyclic Graph (DAG).
● An Op computes when all of its inputs are ready.
● A Blob is ready when all of the Ops writing to it have computed.
● All computations are event based and asynchronous, parallelized where possible.
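A minimal Python sketch of that scheduling rule under the stated assumptions (a generic DAG of ops over named blobs); it illustrates the event-driven idea, not Purine's actual asynchronous implementation.

```python
from collections import defaultdict, deque

def solve_graph(ops):
    """ops: list of dicts with 'name', 'inputs', 'outputs' (blob names).
    An op fires once all its input blobs are ready; a blob is ready once every
    op that writes it has fired. print() stands in for the real computation."""
    writers = defaultdict(set)        # blob -> names of ops that write it
    consumers = defaultdict(list)     # blob -> ops that read it
    for op in ops:
        for b in op["outputs"]:
            writers[b].add(op["name"])
        for b in op["inputs"]:
            consumers[b].append(op)

    # Blobs with no writers (graph sources) count as ready from the start.
    pending = {op["name"]: {b for b in op["inputs"] if writers[b]} for op in ops}
    runnable = deque(op for op in ops if not pending[op["name"]])
    fired = set()

    while runnable:
        op = runnable.popleft()
        if op["name"] in fired:
            continue
        fired.add(op["name"])
        print("run", op["name"])
        for b in op["outputs"]:
            writers[b].discard(op["name"])
            if not writers[b]:        # blob b is now ready
                for nxt in consumers[b]:
                    pending[nxt["name"]].discard(b)
                    if not pending[nxt["name"]]:
                        runnable.append(nxt)
```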

23. Why Computation Graph
● Less hard coding: all tasks (algorithms and parallel computing) are consistently defined in graphs.
● The solver, and the forward and backward passes, live in the same graph. With a definition graph, concepts like forward pass and backward pass must be introduced and the solver is hard coded to alternate between them; with a computation graph, that logic is in the graph itself.
● Any scheme of parallelism can be expressed in a computation graph.

24. Parallelization Implementation. Properties of Ops and Blobs:
● Location: where the blob/op resides, including the IP address of the target machine and which device it is on (CPU/GPU).
● Thread: a thread is needed for an op because both CPU and GPU can be multi-threaded (streams in terms of NVIDIA GPUs).
Example Blob defined in YAML:
  type: blob
  name: weight
  size: [96, 3, 11, 11]
  location:
    ip: 127.0.0.1
    device: 0
Example Op defined in YAML:
  type: op
  op_type: Conv
  name: conv1
  inputs: [ bottom, weight ]
  outputs: [ top ]
  location:
    ip: 127.0.0.1
    device: 0
    thread: 1
  other fields ...

25. Parallelization-1 (Pipeline). One computation graph can span multiple machines!
Special Op: Copy. Case 1, pipeline:
● Location A & B are on the same machine but different devices: Copy does one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, or cudaMemcpyDeviceToHost.
● Location A & B are on different machines: Copy resides on both machines; the source side calls nn_send(socket, data), the target side calls nn_receive(socket, data).
How to run this pipeline?
● Copy is executed as soon as its input blob is ready.
● Copy runs in its own worker thread, so computation and data transfer are overlapped wherever possible.
● GPU inbound and outbound copies are in different streams, fully utilizing CUDA's dual copy engines.
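A minimal Python sketch of the overlap idea, using a generic worker-thread/queue setup as a stand-in for the Copy op (an illustration of the scheduling pattern, not Purine's actual implementation):

```python
import queue, threading, time

transfer_q = queue.Queue()

def copy_worker():
    """Stand-in for the Copy op: moves finished blobs to the other device or
    machine while computation continues on the producer side."""
    while True:
        blob = transfer_q.get()
        if blob is None:                 # sentinel: no more blobs
            break
        time.sleep(0.01)                 # pretend cudaMemcpy / nn_send happens here
        print("transferred", blob)

worker = threading.Thread(target=copy_worker)
worker.start()

for i in range(4):                       # producer side of the pipeline
    time.sleep(0.02)                     # pretend this is the real computation
    transfer_q.put(f"blob_{i}")          # Copy starts as soon as the blob is ready
transfer_q.put(None)
worker.join()
```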

26. Parallelization-1 (Pipeline) [figure: replicating and iterating graphs/subgraphs (Graph 1, Graph 2, Graph 3)]

27. Parallelization-2 (Data parallelism). Case 2, data parallelism:
● Explicitly duplicate the nets at different locations.
● Each duplicate runs on different data.
● Gather the weight gradients at a parameter server.

28. Parallelization-2 (Data parallelism): overlap data transfer and computation.
● Higher-layer gradients are computed earlier than lower-layer gradients in the backward pass.
● A higher layer can therefore send its gradients to the parameter server and get the update back while the lower layers are still doing their computation; this is especially true for very deep networks.
● Data parallelism works even for fully connected layers: although FC layers have many parameters, the latency is hidden.
● Cross-machine (network) latency is thus less of a problem. (A sketch of the pattern follows.)
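A hypothetical Python sketch of that pattern, with dummy layer and parameter-server classes standing in for the real components (push_pull and the class names are illustrative, not Purine's API): a higher layer's gradient exchange with the server runs in a background thread while lower layers keep computing.

```python
import threading, time

class DummyServer:
    """Stand-in for a parameter server; push_pull returns the updated weights."""
    def push_pull(self, name, grad):
        time.sleep(0.02)                  # pretend network round trip
        return grad                       # echo the gradient back as the "update"

class DummyLayer:
    def __init__(self, name): self.name = name
    def backward(self):
        time.sleep(0.01)                  # pretend gradient computation
        return f"grad({self.name})"
    def apply_update(self, update):
        print(self.name, "updated with", update)

def backward_with_overlap(layers, server):
    """Backward pass from top to bottom; each layer's gradient exchange runs in a
    background thread, overlapping with the computation of the layers below it."""
    pending = []
    for layer in layers:                  # top layer first in backward order
        grad = layer.backward()
        th = threading.Thread(
            target=lambda l=layer, g=grad: l.apply_update(server.push_pull(l.name, g)))
        th.start()
        pending.append(th)
    for th in pending:
        th.join()

backward_with_overlap([DummyLayer("fc8"), DummyLayer("conv5"), DummyLayer("conv1")],
                      DummyServer())
```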

29. Profiling Result. Data transfer overlaps with computation; the parameter update of the lowest layer is visible in the trace. [Chart: images per second vs. number of GPUs (1, 2, 3, 4, 8); note that the 8 GPUs are on different machines.] 8 GPUs train GoogLeNet in 40 hours; top-5 error rate 12.67% (still tuning).

30. A³-II. Algorithms. Network-in-Network [More Human-brain-like Network Structure] [Min LIN, Qiang CHENG]
