
Scalable Learning Technologies for Big Data Mining - PowerPoint PPT Presentation

DASFAA 2015 Hanoi Tutorial: Scalable Learning Technologies for Big Data Mining. Gerard de Melo, Tsinghua University


  1. Distant Supervision ● Sentiment Analysis: look for Twitter tweets with emoticons like ":)" or ":(" ● Remove the emoticons, then use the tweets as training data! Crimson Hexagon
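A minimal Python sketch of this distant-supervision step (the emoticon sets and example tweets are illustrative assumptions; the slide only mentions ":)" and ":("):

```python
# Distant supervision for sentiment: emoticons act as noisy labels.
POSITIVE = {":)", ":-)", ":D"}   # assumption: any set of positive emoticons
NEGATIVE = {":(", ":-("}         # assumption: any set of negative emoticons

def distant_label(tweet):
    """Label a tweet by its emoticons, then strip them so a classifier
    cannot simply memorize the label source."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:                     # no emoticon, or contradictory ones
        return None
    for e in sorted(POSITIVE | NEGATIVE, key=len, reverse=True):
        tweet = tweet.replace(e, "")           # remove the emoticons
    return (tweet.strip(), "positive" if has_pos else "negative")

tweets = ["great day :)", "ugh, delayed again :("]       # illustrative tweets
training_data = [ex for ex in map(distant_label, tweets) if ex]
# -> [('great day', 'positive'), ('ugh, delayed again', 'negative')]
```

The resulting (text, label) pairs can then be fed to any standard classifier as if they were hand-labeled training data.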

  2. Representation Learning to Better Exploit Big Data

  3. Representations. Image: David Warde-Farley via Bengio et al., Deep Learning book

  4. Representations ● Note the sharing between classes ● Inputs as bits: 0011001… Images: Marc'Aurelio Ranzato

  5. Representations ● Massive improvements in image object recognition (human-level?) and in speech recognition ● Good improvements in NLP- and IR-related tasks ● Inputs as bits: 0011001… Images: Marc'Aurelio Ranzato

  6. Example: Google's image. Source: Jeff Dean, Google

  7. Inspiration: The Brain ● Input: delivered via dendrites from other neurons ● Processing: synapses may alter the input signals; the cell then combines all input signals ● Output: if there is enough activation from the inputs, an output signal is sent through a long cable ("axon"). Source: Alex Smola

  8. Perceptron ● Input: features. Every feature f_i gets a weight w_i ● Example feature weights: dog 7.2, food 3.4, bank -7.3, delicious 1.5, train -4.2 ● The weighted features f_1 … f_4 feed into the neuron via the weights w_1 … w_4

  9. Perceptron ● Activation of the neuron: multiply the feature values of an object x with the feature weights: a(x) = Σ_i w_i f_i(x) = w^T f(x)

  10. Perceptron ● Output of the neuron: check whether the activation exceeds a threshold t = -b: output(x) = g(w^T f(x) + b) ● e.g. g could return 1 if its argument is positive, -1 otherwise ● e.g. 1 for "spam", -1 for "not spam"
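A minimal NumPy sketch of this scoring rule (the feature values, bias, and function names are illustrative assumptions; the weights are the example numbers from slide 8):

```python
import numpy as np

f_x = np.array([1.0, 1.0, 0.0, 1.0])   # f(x): dog, food, bank, delicious present/absent
w   = np.array([7.2, 3.4, -7.3, 1.5])  # feature weights from the slide
b   = -2.0                             # bias b = -t (threshold value is an assumption)

def perceptron_output(f_x, w, b):
    """a(x) = w^T f(x); output g(a(x) + b), where g is a sign threshold."""
    activation = w @ f_x
    return 1 if activation + b > 0 else -1   # e.g. 1 = "spam", -1 = "not spam"

print(perceptron_output(f_x, w, b))          # activation 12.1, so output 1
```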

  11. Decision Surfaces ● Linear classifiers (Perceptron, SVM): only straight decision surfaces; the Perceptron is not max-margin ● Kernel-based classifiers (Kernel Perceptron, Kernel SVM), decision trees, and the multi-layer perceptron: any decision surface. Images: Vibhav Gogate

  12. Deep Learning: Multi-Layer Perceptron ● Input layer (features f_1 … f_4) → hidden layer (Neuron 1, Neuron 2) → output layer (output neuron)

  13. Deep Learning: Multi-Layer Perceptron ● Input layer (f_1 … f_4) → hidden layer (Neuron 1, Neuron 2) → output layer

  14. Deep Learning: Multi-Layer Perceptron ● Input layer (f_1 … f_4) → hidden layer (Neuron 1, Neuron 2) → output layer (Output 1, Output 2)

  15. Deep Learning: Multi-Layer Perceptron ● Input layer (feature extraction): f(x) ● Single layer: output(x) = g(W f(x) + b) ● Three-layer network: output(x) = g_2(W_2 g_1(W_1 f(x) + b_1) + b_2) ● Four-layer network: output(x) = g_3(W_3 g_2(W_2 g_1(W_1 f(x) + b_1) + b_2) + b_3)
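A minimal NumPy sketch of the three-layer formula above (the layer sizes, random weights, and the choice of tanh and sign for g_1 and g_2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
f_x = rng.normal(size=4)                        # f(x): extracted input features

# Three-layer network: output(x) = g2(W2 g1(W1 f(x) + b1) + b2)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden layer -> output layer
g1, g2 = np.tanh, np.sign                       # non-linearities (assumptions)

hidden = g1(W1 @ f_x + b1)
output = g2(W2 @ hidden + b2)
print(output)                                   # e.g. [1.] or [-1.]
```

A four-layer network simply wraps one more g(W · + b) around the same expression.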

  16. Deep Learning: Multi-Layer Perceptron

  17. Deep Learning: Computing the Output ● Simply evaluate the output function: for each node, compute an output based on the node's inputs. (Diagram: inputs x_1, x_2 → hidden nodes z_1, z_2, z_3 → outputs y_1, y_2)

  18. Deep Learning: Training ● Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it ● Backpropagation: the error is propagated back from the output nodes towards the input layer

  19. Deep Learning: Training ● We are interested in the gradient, i.e. the partial derivatives of the output function with respect to all [inputs and] weights, including those in deeper parts of the network ● For z = g(y) with y = f(x), exploit the chain rule to compute the gradient: ∂z/∂x = (∂z/∂y) · (∂y/∂x) ● Backpropagation: the error is propagated back from the output nodes towards the input layer
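A minimal sketch of that chain rule for a single neuron (the sigmoid activation and squared-error loss are assumptions; the slides do not fix a particular g or error function):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.5])      # inputs
w = np.array([0.3, -0.2])     # weights to be learned
t = 1.0                       # target output

# Forward pass: y = f(x) = w.x, z = g(y) = sigmoid(y), error E = (z - t)^2 / 2
y = w @ x
z = sigmoid(y)

# Backward pass (chain rule): dE/dw = dE/dz * dz/dy * dy/dw
dE_dz = z - t                 # derivative of the squared error
dz_dy = z * (1.0 - z)         # derivative of the sigmoid
dy_dw = x                     # derivative of the dot product w.r.t. w
grad_w = dE_dz * dz_dy * dy_dw

w = w - 0.1 * grad_w          # one stochastic gradient step (learning rate 0.1)
```

In a deeper network the same pattern repeats layer by layer, which is exactly the backward propagation of the error described on the slide.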

  20. DropOut Technique ● Basic idea: while training, randomly drop inputs (set the feature to zero) ● Effect: training on variations of the original training data (an artificial increase of the training data size); the trained network relies less on the existence of specific features. Reference: Hinton et al. (2012). Also: Maxout Networks, Goodfellow et al. (2013)
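A minimal sketch of that idea (assuming a drop rate of 0.5 and the common inverted-dropout rescaling, which is an assumption beyond the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(features, rate=0.5, training=True):
    """While training, randomly zero a fraction `rate` of the features and
    rescale the survivors so the expected activation stays unchanged."""
    if not training:
        return features
    mask = rng.random(features.shape) >= rate
    return features * mask / (1.0 - rate)

x = np.array([7.2, 3.4, -7.3, 1.5])
print(dropout(x))    # a randomly thinned, rescaled variation of x
```

Each training step sees a different random variation of the input, which is the artificial data-size increase mentioned above.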

  21. Deep Learning: Convolutional Neural Networks. Reference: Yann LeCun's work. Image: http://torch.cogbits.com/doc/tutorials_supervised/

  22. Deep Learning: Recurrent Neural Networks. Source: Bayesian Behavior Lab, Northwestern University

  23. Deep Learning: Recurrent Neural Networks ● One can then do backpropagation (through time) ● Challenge: vanishing/exploding gradients. Source: Bayesian Behavior Lab, Northwestern University
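A minimal sketch of an unrolled recurrence, illustrating where the vanishing/exploding-gradient problem comes from (all sizes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
W_h = rng.normal(scale=1.5, size=(4, 4))   # recurrent weights
W_x = rng.normal(size=(4, 3))              # input weights
h = np.zeros(4)

xs = rng.normal(size=(10, 3))              # a sequence of 10 input vectors
for x in xs:                               # unroll the recurrence over time
    h = np.tanh(W_h @ h + W_x @ x)

# Backpropagation through the unrolled network multiplies by W_h (times the
# local tanh derivatives) once per time step, so the gradient's magnitude
# tends to shrink or blow up with sequence length:
print([round(np.linalg.norm(np.linalg.matrix_power(W_h, t)), 2) for t in (1, 5, 10)])
```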

  24. Deep Learning: Long Short-Term Memory Networks. Source: Bayesian Behavior Lab, Northwestern University

  25. Deep Learning: Long Short-Term Memory Networks ● Deep LSTMs for sequence-to-sequence learning. Sutskever et al. 2014 (Google)

  26. Deep Learning: Long Short-Term Memory Networks ● French original: La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers, ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu ce mois-ci. ● LSTM's English translation: The dispute is raging between large aircraft manufacturers on the size of the tourist seats on the long-haul flights, leading to a bitter confrontation at the Dubai Airshow in the month of October. ● Ground-truth English translation: A row has flared up between leading plane makers over the width of tourist-class seats on long-distance flights, setting the tone for a bitter confrontation at this month's Dubai Airshow. Sutskever et al. 2014 (Google)

  27. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University

  28. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University

  29. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University

  30. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University

  31. Deep Learning: Neural Turing Machines. Source: Bayesian Behavior Lab, Northwestern University

  32. Deep Learning: Neural Turing Machines ● Learning to sort! ● The vectors for the numbers are random

  33. Big Data in Feature Engineering and Representation Learning

  34. Web Semantics: Statistics from Big Data as Features ● Language models for autocompletion

  35. Word Segmentation. Source: Wang et al., An Overview of Microsoft Web N-gram Corpus and Applications

  36. Parsing: Ambiguity ● NP coordination. Source: Bansal & Klein (2011)

  37. Parsing: Web Semantics. Source: Bansal & Klein (2011)

  38. Adjective Ordering ● Lapata & Keller (2004): The Web as a Baseline (also: Bergsma et al. 2010) ● "big fat Greek wedding" but not "fat Greek big wedding". Source: Shane Bergsma

  39. Coreference Resolution. Source: Bansal & Klein (2012)

  40. Coreference Resolution. Source: Bansal & Klein (2012)

  41. Distributional Semantics ● Data sparsity: most words are rare (in the "long tail"), e.g. in the Brown Corpus, and are thus missing in training data (Source: Baroni & Evert) ● Solution (Blitzer et al. 2006, Koo & Collins 2008, Huang & Yates 2009, etc.): cluster together similar features, and use the clustered features instead of / in addition to the original features

  42. Spelling Correction ● Even worse: Arnold Schwarzenegger

  43. Vector Representations ● Put words into a vector space (e.g. with d = 300 dimensions) ● Nearby points in the space: bird, petronia, sparrow; parched, arid, dry

  44. Word Vector Representations ● Tomas Mikolov et al., Proc. ICLR 2013. Available from https://code.google.com/p/word2vec/
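A minimal sketch of how such vectors are used once trained, with cosine similarity over made-up 4-dimensional vectors (real word2vec vectors would have e.g. d = 300; all numbers below are illustrative assumptions):

```python
import numpy as np

# Toy word vectors (assumption; in practice these come out of word2vec training)
vectors = {
    "sparrow": np.array([0.9, 0.8, 0.1, 0.0]),
    "bird":    np.array([0.8, 0.9, 0.2, 0.1]),
    "arid":    np.array([0.0, 0.1, 0.9, 0.8]),
    "dry":     np.array([0.1, 0.0, 0.8, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["sparrow"], vectors["bird"]))   # high: related words
print(cosine(vectors["sparrow"], vectors["arid"]))   # low: unrelated words
```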

  45. Wikipedia

  46. Text Simplification ● Exploit the edit history, especially on the Simple English Wikipedia ● "collaborate" → "work together", "stands for" → "is the same as"

  47. Answering Questions ● IBM's Jeopardy!-winning Watson system. Gerard de Melo

  48. Knowledge Integration

  49. UWN/MENTA: a multilingual extension of WordNet, with word senses and taxonomical information for over 200 languages. www.lexvo.org/uwn/

  50. WebChild: Common-Sense Knowledge ● WebChild: AAAI 2014, WSDM 2014, AAAI 2011

  51. Challenge: From Really Big Data to Real Insights. Image: Brett Ryder

  52. Big Data Mining in Practice

  53. Gerard de Melo (Tsinghua University, Beijing, China), Aparna Varde (Montclair State University, NJ, USA). DASFAA, Hanoi, Vietnam, April 2015

  54. Dr. Aparna Varde

  55. ● Cloud computing: Internet-based computing where shared resources, software & data are provided on demand, like the electricity grid ● Follows a pay-as-you-go model

  56. ● Several technologies, e.g., MapReduce & Hadoop ● MapReduce: a data-parallel programming model for clusters of commodity machines • Pioneered by Google • Processes 20 PB of data per day ● Hadoop: an open-source framework for distributed storage and processing of very large data sets • HDFS (Hadoop Distributed File System) for storage • MapReduce for processing • Developed by Apache

  57. ● Scalability - to large data volumes: scanning 100 TB on 1 node @ 50 MB/s takes 24 days; scanning on a 1000-node cluster takes 35 minutes ● Cost-efficiency: commodity nodes (cheap, but unreliable), commodity network (low bandwidth), automatic fault-tolerance (fewer admins), easy to use (fewer programmers)
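A quick back-of-the-envelope check of those scan times, assuming the data splits perfectly evenly across nodes (a sketch, not a benchmark):

```python
TB, MB = 10**12, 10**6

data = 100 * TB            # bytes to scan
rate = 50 * MB             # bytes per second per node

one_node = data / rate                 # seconds on a single node
print(one_node / 86400)                # ~23.1 days (the slide rounds to 24)
print(one_node / 1000 / 60)            # ~33.3 minutes on 1000 nodes (~35 on the slide)
```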

  58. ● Data type: key-value records ● Map function: (K_in, V_in) → list(K_inter, V_inter) ● Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

  59. MapReduce Example ● Word count. Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow" ● Map emits (word, 1) pairs; Shuffle & Sort groups them by word; Reduce sums the counts ● Output: ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3
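A minimal pure-Python sketch of that word-count job, simulating the map, shuffle-and-sort, and reduce phases in a single process (function names are assumptions; a real job would run these on a Hadoop cluster):

```python
from collections import defaultdict

def map_fn(_, line):            # (K_in, V_in) -> list(K_inter, V_inter)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):    # (K_inter, list(V_inter)) -> list(K_out, V_out)
    return [(word, sum(counts))]

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Shuffle & sort: group the intermediate (word, 1) pairs by word
groups = defaultdict(list)
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        groups[k].append(v)

output = [pair for k in sorted(groups) for pair in reduce_fn(k, groups[k])]
print(output)   # [('ate', 1), ('brown', 2), ..., ('the', 3)]
```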

  60. ● 40 nodes/rack, 1000-4000 nodes per cluster ● 1 Gbps bandwidth within a rack, 8 Gbps out of the rack (aggregation switch, rack switch) ● Node specs (Facebook): 8-16 cores, 32 GB RAM, 8 × 1.5 TB disks, no RAID

  61. ● Files are split into 128 MB blocks ● Blocks are replicated across several data nodes (often 3) ● The name node stores metadata (file names, locations, etc.) ● Optimized for large files and sequential reads ● Files are append-only. (Diagram: Namenode tracking File1's blocks 1-4, replicated across Datanodes)

  62. ● Hive: a relational DB on Hadoop, developed at Facebook ● Provides an SQL-like query language

  63. ● Supports table partitioning, complex data types, sampling, and some query optimization ● These help discover knowledge via various tasks, e.g.: • Searching for relevant terms • Operations such as word count • Aggregates like MIN and AVG

  64. /* Find documents of the enron table with word frequencies within the range of 75 to 80 */
  SELECT DISTINCT D.DocID
  FROM docword_enron D
  WHERE D.count > 75 AND D.count < 80
  LIMIT 10;
  OK
  1853… 11578 16653
  Time taken: 64.788 seconds

  65. /* Create a view to find the count for WordID = 90 and DocID = 40 in the nips table */
  CREATE VIEW Word_Freq AS
  SELECT D.DocID, D.WordID, V.word, D.count
  FROM docword_Nips D JOIN vocabNips V
    ON D.WordID = V.WordID AND D.DocId = 40 AND D.WordId = 90;
  OK
  Time taken: 1.244 seconds

  66. /* Find documents which use the word "rational" in the nips table */
  SELECT D.DocID, V.word
  FROM docword_Nips D JOIN vocabnips V
    ON D.wordID = V.wordID AND V.word = "rational"
  LIMIT 10;
  OK
  434 rational
  275 rational
  158 rational
  ….
  290 rational
  422 rational
  Time taken: 98.706 seconds

  67. /* Find the average frequency of all words in the enron table */
  SELECT AVG(count) FROM docWord_enron;
  OK
  1.728152608060543
  Time taken: 68.2 seconds

  68. Query execution time for HQL & MySQL on big data sets (chart). Similar claims hold for other SQL packages.

  69. Server storage capacity: max storage per instance (chart).

