Distant Supervision
● Sentiment Analysis:
● Look for Twitter tweets with emoticons like ":)", ":("
● Remove the emoticons, then use the tweets as training data!
Crimson Hexagon
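To make the recipe concrete, here is a minimal Python sketch of emoticon-based distant labeling; the example tweets, the emoticon patterns, and the distant_label helper are illustrative, not part of any particular system:

import re

# Hypothetical example tweets; in practice these would be collected from the Twitter API.
tweets = [
    "loving the new update :)",
    "my flight got cancelled again :(",
]

POSITIVE = re.compile(r":\)")
NEGATIVE = re.compile(r":\(")

def distant_label(tweet):
    """Use emoticons as noisy labels, then strip them from the text."""
    if POSITIVE.search(tweet):
        label = "positive"
    elif NEGATIVE.search(tweet):
        label = "negative"
    else:
        return None  # no emoticon, no distant label
    text = POSITIVE.sub("", NEGATIVE.sub("", tweet)).strip()
    return text, label

training_data = [ex for t in tweets if (ex := distant_label(t))]
print(training_data)
# [('loving the new update', 'positive'), ('my flight got cancelled again', 'negative')]

The resulting (text, label) pairs can then be fed to any ordinary sentiment classifier.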
Representation Learning to Better Exploit Big Data
Representations
Image: David Warde-Farley via Bengio et al., Deep Learning Book
Representations
● Note sharing between classes
● Inputs as bits: 0011001…
Images: Marc'Aurelio Ranzato
Representations
● Massive improvements in image object recognition (human-level?) and speech recognition
● Good improvements in NLP and IR-related tasks
● Inputs as bits: 0011001…
Images: Marc'Aurelio Ranzato
Example
Google's image
Source: Jeff Dean, Google
Inspiration: The Brain
● Input: delivered via dendrites from other neurons
● Processing: synapses may alter input signals; the cell then combines all input signals
● Output: if there is enough activation from the inputs, an output signal is sent through a long cable ("axon")
Source: Alex Smola
Perceptron
● Input: features
● Every feature f_i gets a weight w_i
Example feature weights:
  dog        7.2
  food       3.4
  bank      -7.3
  delicious  1.5
  train     -4.2
(Features f_1 … f_4 feed into the neuron with weights w_1 … w_4.)
Perceptron
Activation of the neuron: multiply the feature values of an object x with the feature weights:
    a(x) = Σ_i w_i f_i(x) = w^T f(x)
(Features f_1 … f_4 feed into the neuron with weights w_1 … w_4.)
Perceptron
Output of the neuron: check whether the activation exceeds a threshold t = -b:
    output(x) = g(w^T f(x) + b)
e.g. g could return 1 (positive) if the activation is positive, -1 otherwise
e.g. 1 for "spam", -1 for "not-spam"
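As a minimal Python sketch of this computation, reusing the slide's example weights (dog, food, bank, delicious, train); the input feature values and the bias are made up for illustration:

import numpy as np

# Weights from the example table; the feature values for this one input are hypothetical.
w = np.array([7.2, 3.4, -7.3, 1.5, -4.2])   # dog, food, bank, delicious, train
f_x = np.array([1.0, 1.0, 0.0, 1.0, 0.0])   # feature values f(x) for one object x
b = -5.0                                     # bias b = -t

a = w @ f_x                        # activation a(x) = sum_i w_i * f_i(x)
output = 1 if a + b > 0 else -1    # g: +1 if activation exceeds the threshold, else -1
print(a, output)                   # 12.1, 1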
Decision Surfaces
● Linear Classifiers (Perceptron, SVM): only straight decision surfaces
● Decision Trees: not max-margin
● Kernel-based Classifiers (Kernel Perceptron, Kernel SVM): any decision surface
● Multi-Layer Perceptron: any decision surface
Images: Vibhav Gogate
Deep Learning: Multi-Layer Perceptron
(Network diagrams: an input layer with features f_1 … f_4, one or more hidden layers of neurons, and an output layer with one or more output neurons.)
Deep Learning: Multi-Layer Perceptron
Input layer (feature extraction): f(x)
Single-layer: output(x) = g(W f(x) + b)
Three-layer network: output(x) = g_2(W_2 g_1(W_1 f(x) + b_1) + b_2)
Four-layer network: output(x) = g_3(W_3 g_2(W_2 g_1(W_1 f(x) + b_1) + b_2) + b_3)
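As a sketch, the layered formulas above can be evaluated directly in Python; the weights below are random placeholders rather than trained parameters, and ReLU/sigmoid are just example choices for g_1 and g_2:

import numpy as np

def layer(W, b, x, g):
    """One layer: apply the non-linearity g to the affine map W x + b."""
    return g(W @ x + b)

relu = lambda v: np.maximum(0, v)
sigmoid = lambda v: 1 / (1 + np.exp(-v))

rng = np.random.default_rng(0)
f_x = rng.normal(size=4)                  # input features f(x)

# Hypothetical weights; in practice they are learned by backpropagation.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

h = layer(W1, b1, f_x, relu)              # hidden layer: g_1(W_1 f(x) + b_1)
out = layer(W2, b2, h, sigmoid)           # output layer: g_2(W_2 g_1(...) + b_2)
print(out)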
Deep Learning: Multi-Layer Perceptron
Deep Learning: Computing the Output
Simply evaluate the output function: for each node, compute an output based on the node's inputs.
(Figure: inputs x_1, x_2 feed hidden nodes z_1, z_2, z_3, which feed outputs y_1, y_2.)
Deep Learning: Training
● Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it.
● Backpropagation: the error is propagated back from the output nodes towards the input layer.
(Figure: inputs x_1, x_2 feed hidden nodes z_1, z_2, z_3, which feed outputs y_1, y_2.)
Deep Learning: Training
● Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it.
● We are interested in the gradient, i.e. the partial derivatives of the output function with respect to all [inputs and] weights, including those at a deeper part of the network.
● Exploit the chain rule to compute the gradient: for z = g(y) and y = f(x),
    ∂z/∂x = (∂z/∂y) · (∂y/∂x)
● Backpropagation: the error is propagated back from the output nodes towards the input layer.
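A tiny worked example of the chain rule, assuming a one-weight "network" z = sigmoid(w·x + b); the analytic gradient is compared against a finite-difference check:

import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

w, b, x = 2.0, -1.0, 0.5

# Forward pass: y = f(x) = w*x + b, then z = g(y) = sigmoid(y)
y = w * x + b
z = sigmoid(y)

# Backward pass via the chain rule: dz/dw = (dz/dy) * (dy/dw)
dz_dy = z * (1 - z)      # derivative of the sigmoid at y
dy_dw = x
dz_dw = dz_dy * dy_dw

# Sanity check with a finite-difference approximation
eps = 1e-6
z_eps = sigmoid((w + eps) * x + b)
print(dz_dw, (z_eps - z) / eps)   # the two numbers should agree closely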
DropOut Technique
● Basic Idea: while training, randomly drop inputs (set the feature to zero).
● Effect: training on variations of the original training data (an artificial increase of the training data size); the trained network relies less on the existence of specific features.
Reference: Hinton et al. (2012). Also: Maxout Networks by Goodfellow et al. (2013)
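A minimal sketch of the idea in Python; this uses the common "inverted dropout" variant (rescaling the surviving units at training time), which differs slightly from the test-time rescaling described in Hinton et al. (2012):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    """Randomly zero units while training; rescale the survivors so the
    expected activation is unchanged (inverted dropout)."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(h))   # some entries are zeroed, the survivors are scaled up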
Deep Learning: Convolutional Neural Networks
Reference: Yann LeCun's work
Image: http://torch.cogbits.com/doc/tutorials_supervised/
Deep Learning: Recurrent Neural Networks
Source: Bayesian Behavior Lab, Northwestern University
Deep Learning: Recurrent Neural Networks
● We can then do backpropagation (through the unrolled network).
● Challenge: vanishing/exploding gradients
Source: Bayesian Behavior Lab, Northwestern University
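A toy illustration of why the gradients vanish or explode: for a single tanh unit unrolled over T steps, the gradient with respect to the earliest state is a product of T per-step factors, so it shrinks or grows exponentially with T (the numbers here are made up for illustration):

import numpy as np

# One recurrent unit h_t = tanh(w * h_{t-1}), unrolled over T time steps.
w, T = 0.5, 20
h, grad = 1.0, 1.0
for t in range(T):
    h = np.tanh(w * h)
    grad *= w * (1 - h ** 2)   # per-step Jacobian d h_t / d h_{t-1} for a tanh unit

print(grad)   # tiny: the error signal barely reaches the early time steps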
Deep Learning: Long Short-Term Memory Networks
Source: Bayesian Behavior Lab, Northwestern University
Deep Learning: Long Short-Term Memory Networks
Deep LSTMs for sequence-to-sequence learning
Sutskever et al. 2014 (Google)
Deep Learning: Long Short-Term Memory Networks
French original: La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers, ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu ce mois-ci.
LSTM's English translation: The dispute is raging between large aircraft manufacturers on the size of the tourist seats on the long-haul flights, leading to a bitter confrontation at the Dubai Airshow in the month of October.
Ground-truth English translation: A row has flared up between leading plane makers over the width of tourist-class seats on long-distance flights, setting the tone for a bitter confrontation at this month's Dubai Airshow.
Sutskever et al. 2014 (Google)
Deep Learning: Neural Turing Machines
Source: Bayesian Behavior Lab, Northwestern University
Deep Learning: Neural Turing Machines
● Learning to sort!
● The vectors for the numbers are random
Big Data in Feature Engineering and Representation Learning
Web Semantics: Statistics from Big Data as Features
● Language Models for Autocompletion
Word Segmentation
Source: Wang et al., An Overview of Microsoft Web N-gram Corpus and Applications
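A sketch of unigram-language-model word segmentation in Python; the toy probability table stands in for real web-scale n-gram statistics such as the Microsoft Web N-gram Corpus:

from functools import lru_cache

# Hypothetical unigram probabilities; a real system would query web n-gram counts.
P = {"word": 0.01, "segmentation": 0.001, "words": 0.005}

def prob(w):
    return P.get(w, 1e-12)   # tiny probability for unknown strings

@lru_cache(maxsize=None)
def segment(text):
    """Split text into the most probable word sequence under a unigram model."""
    if not text:
        return 1.0, []
    best = (0.0, [text])
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        p_tail, words_tail = segment(tail)
        cand = (prob(head) * p_tail, [head] + words_tail)
        if cand[0] > best[0]:
            best = cand
    return best

print(segment("wordsegmentation")[1])   # ['word', 'segmentation']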
Parsing: Ambiguity
● NP Coordination
Source: Bansal & Klein (2011)
Parsing: Web Semantics
Source: Bansal & Klein (2011)
Adjective Ordering
● Lapata & Keller (2004): The Web as a Baseline (also: Bergsma et al. 2010)
● "big fat Greek wedding" but not "fat Greek big wedding"
Source: Shane Bergsma
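A sketch of the web-as-baseline idea: score each candidate ordering by its web n-gram count and keep the most frequent one; the counts below are hypothetical stand-ins for real web statistics:

from itertools import permutations

# Hypothetical web-scale phrase counts; a real system would query a web n-gram corpus.
counts = {
    "big fat greek wedding": 120000,
    "fat big greek wedding": 40,
    "fat greek big wedding": 12,
}

def best_order(adjectives, noun):
    """Pick the adjective order whose full phrase is most frequent on the web."""
    candidates = (" ".join(p) + " " + noun for p in permutations(adjectives))
    return max(candidates, key=lambda phrase: counts.get(phrase, 0))

print(best_order(["big", "fat", "greek"], "wedding"))   # 'big fat greek wedding'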
Coreference Resolution
Source: Bansal & Klein 2012
Distributional Semantics
● Data sparsity: most words are rare (in the "long tail") → missing in training data (e.g. in the Brown Corpus; source: Baroni & Evert)
● Solution (Blitzer et al. 2006, Koo & Collins 2008, Huang & Yates 2009, etc.):
  ● Cluster together similar features
  ● Use clustered features instead of / in addition to the original features
Spelling Correction
Even worse: Arnold Schwarzenegger
Vector Representations
Put words into a vector space (e.g. with d = 300 dimensions)
(Figure: one cluster with bird, petronia, sparrow; another with dry, arid, parched.)
Word Vector Representations
Tomas Mikolov et al., Proc. ICLR 2013. Available from https://code.google.com/p/word2vec/
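A small sketch of what "nearby in the vector space" means; the 300-dimensional vectors below are random stand-ins (nudged so that related words correlate), not real word2vec vectors, but the same cosine measure applies to trained embeddings:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 300-dimensional vectors; real ones would come from a model such as word2vec.
vectors = {w: rng.normal(size=300) for w in ["sparrow", "bird", "dry", "arid"]}
# Nudge related words toward each other, just to make the toy example behave.
vectors["bird"] = 0.9 * vectors["sparrow"] + 0.1 * vectors["bird"]
vectors["arid"] = 0.9 * vectors["dry"] + 0.1 * vectors["arid"]

def cos(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(vectors["sparrow"], vectors["bird"]))   # high: nearby in the space
print(cos(vectors["sparrow"], vectors["dry"]))    # low: far apart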
Wikipedia
Text Simplification
● Exploit edit history, especially on the Simple English Wikipedia
● "collaborate" → "work together", "stands for" → "is the same as"
Answering Questions IBM's Jeopardy!-winning Watson system Gerard de Melo
Knowledge Integration
UWN/MENTA: a multilingual extension of WordNet providing word senses and taxonomical information for over 200 languages
www.lexvo.org/uwn/
WebChild: Common-Sense Knowledge
WebChild: AAAI 2014, WSDM 2014, AAAI 2011
Challenge: From Really Big Data to Real Insights
Image: Brett Ryder
Big Data Mining in Practice
Gerard de Melo (Tsinghua University, Beijing, China)
Aparna Varde (Montclair State University, NJ, USA)
DASFAA, Hanoi, Vietnam, April 2015
Dr. Aparna Varde
Internet-based computing: shared resources, software & data provided on demand, like the electricity grid.
Follows a pay-as-you-go model.
Several technologies, e.g., MapReduce & Hadoop
MapReduce: data-parallel programming model for clusters of commodity machines
• Pioneered by Google
• Processes 20 PB of data per day
Hadoop: open-source framework for distributed storage and processing of very large data sets
• HDFS (Hadoop Distributed File System) for storage
• MapReduce for processing
• Developed by Apache
• Scalability
  – To large data volumes
  – Scan 100 TB on 1 node @ 50 MB/s = 24 days
  – Scan on a 1000-node cluster = 35 minutes
• Cost-efficiency
  – Commodity nodes (cheap, but unreliable)
  – Commodity network (low bandwidth)
  – Automatic fault-tolerance (fewer admins)
  – Easy to use (fewer programmers)
Data type: key-value records
Map function: (K_in, V_in) → list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
MapReduce Example: Word Count
Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Map: emit (word, 1) for each word, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1), …
Shuffle & Sort: group the pairs by word
Reduce: sum the counts per word
Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
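A minimal Python sketch of the word-count map and reduce functions from the example above; it runs in-process here, whereas a real job would ship these functions to a Hadoop cluster (e.g. via Hadoop Streaming):

from itertools import groupby

def mapper(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce: sum the counts per word (pairs arrive grouped/sorted by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(c for _, c in group)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(dict(reducer(mapper(lines))))
# {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1, 'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}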
• 40 nodes/rack, 1000-4000 nodes per cluster
• 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
• Node specs (Facebook): 8-16 cores, 32 GB RAM, 8 × 1.5 TB disks, no RAID
(Figure: aggregation switch connecting rack switches)
• Files split into 128 MB blocks
• Blocks replicated across several data nodes (often 3)
• Name node stores metadata (file names, locations, etc.)
• Optimized for large files and sequential reads
• Files are append-only
(Figure: Namenode holding metadata for File1, whose blocks 1-4 are replicated across the Datanodes)
Hive: relational DB on Hadoop, developed at Facebook
Provides an SQL-like query language
Supports table partitioning, complex data types, sampling, and some query optimization
These help discover knowledge via various tasks, e.g.,
• Search for relevant terms
• Operations such as word count
• Aggregates like MIN, AVG
/* Find documents of the enron table with word frequencies within the range of 75 and 80 */
SELECT DISTINCT D.DocID
FROM docword_enron D
WHERE D.count > 75 AND D.count < 80
LIMIT 10;
OK
1853
…
11578
16653
Time taken: 64.788 seconds
/* Create a view to find the count for WordID=90 and DocID=40, for the nips table */
CREATE VIEW Word_Freq AS
SELECT D.DocID, D.WordID, V.word, D.count
FROM docword_Nips D JOIN vocabNips V
  ON D.WordID = V.WordID AND D.DocId = 40 AND D.WordId = 90;
OK
Time taken: 1.244 seconds
/* Find documents which use the word "rational" from the nips table */
SELECT D.DocID, V.word
FROM docword_Nips D JOIN vocabnips V
  ON D.wordID = V.wordID AND V.word = "rational"
LIMIT 10;
OK
434 rational
275 rational
158 rational
…
290 rational
422 rational
Time taken: 98.706 seconds
/* Find the average frequency of all words in the enron table */
SELECT AVG(count) FROM docWord_enron;
OK
1.728152608060543
Time taken: 68.2 seconds
(Chart: query execution time for HQL & MySQL on big data sets)
Similar claims hold for other SQL packages
(Chart: server storage capacity — max storage per instance)