  1. Learning and Data Selection in Big Datasets
H. S. Ghadikolaei, H. Ghauch, C. Fischione, and M. Skoglund
School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
http://www.kth.se/profile/hshokri | hshokri@kth.se
International Conference on Machine Learning (ICML), Long Beach, CA, USA, June 2019

  2. Big data era
- ML achieves outstanding performance, usually when trained over massive datasets.
- Examples: MNIST (70k samples) and MovieLens (20M samples).
- Question: is there a small set of critical samples that best describes an unknown model?
H. S. Ghadikolaei (hshokri@kth.se) | Learning and data selection for big dataset 1/7

  3. Related works
- Experiment design [Sacks-Welch-Mitchell-Wynn, 1989]: minimizes total labeling cost, but in a different setting.
- Active learning [Settles, 2012]: minimizes total labeling cost, but in a different setting.
- Core set selection [Tsang-Kwok-Cheung, 2005]: finds a small representative dataset, but is limited to SVMs.
- Influence score [Koh-Liang, 2017]: quantifies the importance of every sample, but is greedy and cannot score a set of samples.

  4. Our approach
Conventional training ($\ell_i$: loss of sample $i$; $N$: dataset size; $h$: parameterized function from space $\mathcal{H}$):

$$\min_{h \in \mathcal{H}} \ \frac{1}{N} \sum_{i=1}^{N} \ell_i(h).$$

Our proposal (joint learning and data selection):

$$\min_{h \in \mathcal{H},\, z \in \{0,1\}^N} \ \frac{1}{\mathbf{1}^\top z} \sum_{i=1}^{N} z_i \ell_i(h) \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} \ell_i(h) \le \epsilon, \quad \mathbf{1}^\top z \ge K.$$

Maximum compression rate: $1 - K/N$.
Solved efficiently by our proposed Alternating Data Selection and Function Approximation algorithm.
Under some regularity assumptions, $K \ge \lceil (1 + 2LT\sqrt{d}/\delta)^d \rceil$ samples are enough for learning an $L$-Lipschitz function defined on $[0, T]^d$ with arbitrary accuracy $\delta$ ($\delta \le \epsilon$).
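The alternating structure named above can be illustrated with a minimal sketch. This is not the authors' exact Alternating Data Selection and Function Approximation implementation: the affine model class, the squared loss, and the greedy keep-the-K-smallest-loss selection step are assumptions made only to keep the example self-contained.

```python
def fit_affine(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx

def select_and_learn(xs, ys, K, n_iters=10):
    """Alternate between data selection (z-step) and model fitting (h-step)."""
    a, b = fit_affine(xs, ys)  # initialize h on the full dataset
    keep = list(range(len(xs)))
    for _ in range(n_iters):
        losses = [(y - (a * x + b)) ** 2 for x, y in zip(xs, ys)]
        # z-step: for fixed h, keep the K samples with the smallest loss
        # (the minimizer of the selected-average objective over z)
        keep = sorted(range(len(xs)), key=losses.__getitem__)[:K]
        # h-step: refit the model on the selected subset only
        a, b = fit_affine([xs[i] for i in keep], [ys[i] for i in keep])
    return (a, b), keep

# Toy data: points on y = 2x plus one gross outlier; the alternating scheme
# discards the outlier and recovers the clean line from K = 8 samples.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]
ys[9] = 100.0
(a, b), keep = select_and_learn(xs, ys, K=8)
```

On this toy instance the recovered fit is the clean line (a ≈ 2, b ≈ 0) and the outlier index is excluded from the selected subset.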

  5. Experimental results
Illustrative example:
[Figure: original function vs. approximated function over x in [0, 8], with the compressed dataset (K = 12) marked.]
Real-world datasets (from the UCI repository):
- Experiments on the Individual household electric power consumption (N = 1.5M, d = 9) and YearPredictionMSD (N = 463K, d = 90) datasets.
- Almost no loss in learning performance after 95% compression using our approach.
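To make the 95% figure concrete, a back-of-the-envelope helper, assuming the compression rate is defined as 1 − K/N as on the previous slide (the resulting K values are derived arithmetic, not numbers stated on the slide):

```python
def compressed_size(N, rate):
    """Number of samples kept at a given compression rate (rate = 1 - K/N)."""
    return round(N * (1 - rate))

# 95% compression on the two UCI datasets from the slide:
household = compressed_size(1_500_000, 0.95)  # Individual household power
msd = compressed_size(463_000, 0.95)          # YearPredictionMSD
```

So training proceeds on roughly 75,000 and 23,150 samples, respectively, instead of the full datasets.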

  6. Final remarks
- Theoretically, almost 100% compression of big data is feasible without a noticeable drop in learning performance.
- Training is much faster over the small representative dataset.
- Existing approaches to creating datasets are inefficient and lead to massive amounts of redundancy.
- Applications:
  - Edge computing: reducing the communication overhead.
  - IoT: enabling low-latency learning and inference over a communication-limited network.
Visit our poster: Pacific Ballroom #170

  7. References
- J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, "Design and analysis of computer experiments," Statistical Science, 1989.
- B. Settles, "Active learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
- I. W. Tsang, J. T. Kwok, and P. M. Cheung, "Core vector machines: Fast SVM training on very large data sets," Journal of Machine Learning Research, 2005.
- P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," in Proc. International Conference on Machine Learning, 2017.

  8. Learning and Data Selection in Big Datasets
H. S. Ghadikolaei, H. Ghauch, C. Fischione, and M. Skoglund
School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
http://www.kth.se/profile/hshokri | hshokri@kth.se
International Conference on Machine Learning (ICML), Long Beach, CA, USA, June 2019
