
Kickoff IA – Chaire BiSCottE (Bridging Statistical and Computational Efficiency in AI)



  1. Kickoff IA – Chaire BiSCottE (Bridging Statistical and Computational Efficiency in AI)
     Gilles Blanchard, Université Paris-Saclay, 9 sept. 2020
     Participating doctoral candidates:
     ◮ Jean-Baptiste Fermanian (ENS Rennes)
     ◮ Karl Hajjar (Saclay)
     ◮ Hannah Marienwald (TU Berlin)
     ◮ El Mehdi Saad (Saclay)
     ◮ Olympio Hacquard (Saclay)
     ◮ Jérémie Capitao Miniconi (Saclay)
     Collaborating Colleagues:
     ◮ Sylvain Arlot (IMO, Saclay)
     ◮ Frédéric Chazal (INRIA, Saclay)
     ◮ Lénaïc Chizat (CNRS, IMO, Saclay)
     ◮ Elisabeth Gassiat (IMO, Saclay)
     ◮ Christophe Giraud (IMO, Saclay)
     ◮ Rémi Gribonval (INRIA, Lyon)

  2. High-level goals
     ◮ Project positioned within the current trend of statistical and computational tradeoffs
     ◮ Label efficiency – in the information-theoretic sense
       ◮ Example: requesting only just enough data (online) as needed for the task at hand
       ◮ Example: the “small data” problem – many learning tasks, each with few data
     ◮ Computational resource efficiency
       ◮ Computation time
       ◮ Memory
       ◮ Example: early stopping of iterative approximation/optimization
     ◮ Structural efficiency – taking advantage of unknown structures in data
       ◮ Example: data lies (close to) an unknown manifold
       ◮ Example: finding efficient representations
     ◮ Mainly theoretical orientation – interactions welcome

  3. Efficient variable selection
     Work with El Mehdi Saad
     ◮ Start from the fundamental linear regression problem:
       $Y_i = \langle X_i, \beta^* \rangle + \varepsilon_i$, with $(X_i, Y_i)$ i.i.d.
     ◮ Assume $X_i \in \mathbb{R}^d$ but $\mathrm{Supp}(\beta^*) := \{\, i \le d : \beta^{(i)}_* \neq 0 \,\}$ with $|\mathrm{Supp}(\beta^*)| \ll d$.
     ◮ Many variable selection methods; Orthogonal Matching Pursuit still very popular (a code sketch follows below):
       0. [Init] $\bar\beta \leftarrow 0$, $S \leftarrow \emptyset$, all data ($i = 1, \dots, n$) available
       1. [Residuals] $R_i \leftarrow Y_i - \langle X_i, \bar\beta \rangle$, $i = 1, \dots, n$
       2. [Selection] $S \leftarrow S \cup \operatorname{Arg\,Max}_{s \in [d] \setminus S} \widehat{\mathbb{E}}\big(R X^{(s)}\big)$
       3. [OLS] $\bar\beta \leftarrow \operatorname{Arg\,Min}_{\mathrm{Supp}(\beta) \subseteq S} \widehat{\mathbb{E}}_n\big[(Y - \langle X, \beta \rangle)^2\big]$
       4. Go to step 1.
     ◮ Statistical reliability studied by Zhang (JMLR 2009): minimum data size $n$ (under appropriate assumptions) for selection consistency
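A minimal numpy sketch of the batch OMP loop above, for illustration only; the function name `omp_select`, the absolute-correlation selection rule, and the synthetic example are assumptions, not taken from the slides.

```python
import numpy as np

def omp_select(X, y, k):
    """Batch Orthogonal Matching Pursuit: run k selection steps.

    X : (n, d) design matrix, y : (n,) response vector.
    Returns the selected support S and the OLS estimate restricted to S.
    """
    n, d = X.shape
    beta = np.zeros(d)
    S = []
    for _ in range(k):
        # [Residuals] residuals of the current restricted OLS fit
        r = y - X @ beta
        # [Selection] coordinate with largest empirical correlation with the residuals
        corr = np.abs(X.T @ r) / n
        corr[np.array(S, dtype=int)] = -np.inf   # never re-select a coordinate
        S.append(int(np.argmax(corr)))
        # [OLS] least-squares refit restricted to the current support S
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        beta = np.zeros(d)
        beta[S] = coef
    return S, beta

# Tiny usage example on synthetic sparse data
rng = np.random.default_rng(0)
n, d = 200, 50
beta_star = np.zeros(d)
beta_star[[3, 17, 42]] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, d))
y = X @ beta_star + 0.1 * rng.standard_normal(n)
S, beta_hat = omp_select(X, y, k=3)
print(sorted(S))   # ideally recovers [3, 17, 42]
```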

  4. Efficient variable selection – Online OMP
     ◮ Complexity of batch OMP (for k selection steps): O(knd), and n depends on some a priori assumptions (RIP, smallest coefficient magnitude)
     ◮ Approach:
       ◮ query data only as needed for reliable selection at each step (bandit-arm style)
       ◮ approximate the OLS step as needed by averaged SGD (ASGD) – see the sketch below
     ◮ Study sample & computational complexity under:
       ◮ Data Base model (arbitrary (data, coordinate) queries with unit cost)
       ◮ Data Stream model (asked for partially observed new samples, cannot query backwards)
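The slide's online OMP procedure itself is not reproduced here; as a hedged illustration of one ingredient only, the sketch below shows a generic Polyak-Ruppert averaged SGD (ASGD) routine for the least-squares subproblem restricted to a known support, consuming one sample at a time as in the data-stream model. All names, step sizes, and constants are illustrative assumptions.

```python
import numpy as np

def asgd_ols(sample_stream, support, d, n_steps, lr=0.05):
    """Polyak-Ruppert averaged SGD for least squares restricted to `support`:
    minimize E[(Y - <X, beta>)^2] over beta with Supp(beta) contained in `support`.

    `sample_stream` yields (x, y) pairs one at a time (data-stream setting);
    only the coordinates of x listed in `support` are read.
    """
    S = np.asarray(support, dtype=int)
    w = np.zeros(len(S))          # current SGD iterate (restricted coordinates)
    w_bar = np.zeros(len(S))      # running Polyak-Ruppert average of the iterates
    for t in range(1, n_steps + 1):
        x, y = next(sample_stream)
        xs = x[S]
        grad = 2.0 * (xs @ w - y) * xs        # stochastic gradient of the squared loss
        w -= lr / np.sqrt(t) * grad           # decaying step size
        w_bar += (w - w_bar) / t              # online averaging of the iterates
    beta = np.zeros(d)
    beta[S] = w_bar
    return beta

# Illustration: i.i.d. stream from a sparse linear model with known support
rng = np.random.default_rng(1)
d = 50
beta_star = np.zeros(d)
beta_star[[3, 17, 42]] = [2.0, -1.5, 1.0]

def stream():
    while True:
        x = rng.standard_normal(d)
        yield x, x @ beta_star + 0.1 * rng.standard_normal()

beta_hat = asgd_ols(stream(), support=[3, 17, 42], d=d, n_steps=20000)
print(np.round(beta_hat[[3, 17, 42]], 2))   # approximately [2.0, -1.5, 1.0]
```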

  5. Efficient multiple-mean estimation
     Work with Hannah Marienwald, Jean-Baptiste Fermanian
     ◮ Independent samples $X^{(b)}_\bullet$, $b = 1, \dots, B$, on $\mathbb{R}^d$:
       $X^{(b)}_\bullet := (X^{(b)}_i)_{1 \le i \le N_b} \overset{\text{i.i.d.}}{\sim} P_b$, with $(X^{(1)}_\bullet, \dots, X^{(B)}_\bullet)$ independent.
     ◮ Goal is to estimate the means $\mu_b := \mathbb{E}_{X \sim P_b}[X] \in \mathbb{R}^d$, $b = 1, \dots, B$.
     ◮ Question: can we exploit unknown structure in the true means (clustering, manifold...) to improve over the naive estimator $\widehat{\mu}^{\mathrm{NE}}_b := N_b^{-1} \sum_{i=1}^{N_b} X^{(b)}_i$?
       → Structural efficiency and the small data problem
     ◮ Relation to AI/machine learning?
       ◮ large databases of that form (e.g. medical records, online activity of many users)
       ◮ relation to Kernel Mean Embedding (KME): estimation of $\Phi(P) = \mathbb{E}_{X \sim P}[\Phi(X)]$ where $\Phi$ is some kernel feature map – see the sketch below
       ◮ improving KME estimation has many applications (Muandet et al., ICML 2014)
       ◮ improving multiple mean estimation has also been analyzed in ML (Feldman et al., NIPS 2012; JMLR 2014)
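A minimal sketch of the plug-in empirical KME mentioned above, assuming a Gaussian RBF kernel and representing the embedding by its evaluation function t ↦ (1/N) Σᵢ k(Xᵢ, t); the kernel choice, bandwidth, and function names are illustrative assumptions (the improved estimators of Muandet et al. are not shown).

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2.0 * bandwidth ** 2))

def empirical_kme(sample):
    """Naive empirical kernel mean embedding of P from an i.i.d. sample:
    Phi_hat(P) = (1/N) sum_i Phi(X_i), returned here as the evaluation
    function t -> (1/N) sum_i k(X_i, t)."""
    def embedding(t):
        return float(np.mean(rbf_kernel(sample, t)))
    return embedding

# Tiny usage example: embed a sample from N(mu, I_d), evaluate near and far from mu
rng = np.random.default_rng(2)
d, N = 5, 100
mu = np.ones(d)
sample = mu + rng.standard_normal((N, d))
phi_hat = empirical_kme(sample)
print(phi_hat(mu), phi_hat(mu + 10.0))   # large value near mu, nearly 0 far away
```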

  6. Multiple-mean estimation by local averaging
     ◮ Assume standard Gaussian distributions and equal sample sizes $N_b = N$.
     ◮ Focus on estimating $\mu_0$. The naive estimator $\widehat{\mu}^{\mathrm{NE}}_0$ has $\mathrm{MSE}(\widehat{\mu}^{\mathrm{NE}}_0) = d/N =: \sigma^2$.
     ◮ Assume we know that $\Delta_i^2 = \|\mu_i - \mu_0\|^2 \le \delta^2$ for “neighbor tasks” $i = 1, \dots, K$.
     ◮ Consider simple neighbor averaging (a simulation sketch follows below):
       $\widehat{\mu}_0 := \frac{1}{K+1} \sum_{i=0}^{K} \widehat{\mu}^{\mathrm{NE}}_i$, then $\mathrm{MSE}(\widehat{\mu}_0) \le \frac{\sigma^2}{K+1} + \delta^2$.
     ◮ Gain if we can detect “neighboring tasks” such that $\Delta_i^2 \le \delta^2 \ll \sigma^2$.
     ◮ Is it a pipe dream? No: one can detect $\Delta_i^2 \lesssim \sigma^2/\sqrt{d}$!
     ◮ Blessing of dimensionality phenomenon.
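A Monte Carlo sketch checking the naive MSE d/N and the neighbor-averaging bound σ²/(K+1) + δ² above, with illustrative constants (d, N, K, δ chosen here, not from the talk); the neighbor set is assumed known in this sketch, whereas detecting it is the point of the slide.

```python
import numpy as np

# B = K + 1 tasks whose true means lie within delta of mu_0, N samples each,
# standard Gaussian noise in dimension d.
rng = np.random.default_rng(3)
d, N, K, delta, n_rep = 100, 20, 9, 0.2, 500
sigma2 = d / N                                     # MSE of the naive estimator

mse_naive, mse_avg = 0.0, 0.0
for _ in range(n_rep):
    mu_0 = rng.standard_normal(d)
    # neighbor task means: mu_0 plus a perturbation of norm exactly delta
    dirs = rng.standard_normal((K, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    mus = np.vstack([mu_0, mu_0 + delta * dirs])
    # naive per-task estimators: empirical means of N Gaussian samples per task
    naive = mus + rng.standard_normal((K + 1, N, d)).mean(axis=1)
    avg = naive.mean(axis=0)                       # simple neighbor averaging
    mse_naive += np.sum((naive[0] - mu_0) ** 2) / n_rep
    mse_avg += np.sum((avg - mu_0) ** 2) / n_rep

print(f"naive MSE ~ {mse_naive:.2f} (theory: {sigma2:.2f})")
print(f"averaged MSE ~ {mse_avg:.2f} (bound: {sigma2 / (K + 1) + delta**2:.2f})")
```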

  7. THANK YOU (Do not hesitate to reach out!)
