Textual Data Analysis

J.-C. Chappelier
Laboratoire d’Intelligence Artificielle, Faculté I&C
© EPFL, J.-C. Chappelier — Textual Data Analysis – 1 / 48
Objectives of this lecture

Basics of textual data analysis:
➥ classification
➥ visualization: dimension reduction / projection
  (useful for a good understanding/presentation of classification/clustering results)
Is this course a Machine Learning course? CAVEAT/REMINDER

◮ NLP makes use of Machine Learning (as would Image Processing, for instance)
◮ but good results require:
  ◮ good preprocessing
  ◮ good data (to learn from), relevant annotations
  ◮ good understanding of the pros/cons, features, outputs, results, ...
☞ The goal of this course is to provide you with specific knowledge about NLP.
☞ The goal of this lecture is to make some links between general ML and NLP.
This lecture is worth deepening with a real ML course.
Introduction: Data Analysis

WHAT does Data Analysis consist in?
“to represent in a live and intelligible manner the (statistical) information, simplifying and summarizing it in diagrams” [L. Lebart]

Two complementary approaches:
☞ classification (regrouping in the original space)
☞ visualization: projection into a low-dimension space

Classification/clustering consists in regrouping objects into categories/clusters (i.e. subsets of objects).
Visualization: display in an intelligible way the internal structures of the data (documents here).
Contents

➀ Classification
  ➀ Framework
  ➁ Methods (in general)
  ➂ Presentation of a few methods
  ➃ Evaluation
➁ Visualization
  ➀ Introduction
  ➁ Principal Component Analysis (PCA)
  ➂ Multidimensional Scaling
Supervised/unsupervised Classification

Classification can be
◮ supervised (strict meaning of "classification"):
  classes are known a priori; they are usually meaningful for the user
◮ unsupervised (called clustering):
  clusters are based on the inner structures of the data (e.g. neighborhoods);
  their meaning is much more dubious

Textual Data Analysis: relate documents (or words) so as to...
structure (supervised) / discover structure (unsupervised)
Classify what?

WHAT is to be classified?
Starting point: a chart (numbers) representing, in one way or another, a set of objects
◮ continuous values
◮ contingency tables: cooccurrence counts
◮ presence/absence of attributes
◮ distance/(dis)similarity (square symmetric chart)

☞ N "row" objects (or "observations") x^(i), characterized by m "features" (columns) x_j^(i)

Two complementary points of view:
➀ N points in R^m
➁ m points in R^N
Not necessarily the same metrics: object similarity vs. feature similarity
Classify what?

[Figure: the N × m data matrix; rows = the N objects, columns = the m features;
entry x_j^(i) = "importance" of feature j for object i]
Textual Data Classification

◮ What is classified?
  ◮ authors (1 object = several documents)
  ◮ documents
  ◮ paragraphs
  ◮ "words" (/tokens) (vocabulary study, lexicometry)
◮ How to represent the objects?
  ◮ document indexing
  ◮ choose the textual units that are meaningful
  ◮ choice of the metric/similarity
☞ preprocessing: "unsequentialize" the text, suppress (meaningless) lexical variability

Frequently: lines = documents, columns = "words" (tokens, words, n-grams)
☞ the former two "visions" are complementary
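The "lines = documents, columns = words" representation can be sketched as follows; this is a minimal illustration on a hypothetical toy corpus, not the course's reference implementation:

```python
from collections import Counter

# Hypothetical toy corpus: lines = documents, columns = "words" (tokens).
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# "Unsequentialize" the text: keep only token counts, dropping word order.
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})

# N x m matrix: entry [i][j] = count of word j in document i.
matrix = [[Counter(toks)[w] for w in vocab] for toks in tokenized]

for row in matrix:
    print(row)
```

A real pipeline would of course add the preprocessing mentioned above (tokenization, normalization of lexical variability) before counting.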
Textual Data Classification: Examples of applications

◮ Information Retrieval
◮ Open-Question Surveys (polls)
◮ email classification/routing
◮ client surveys (complaint analysis)
◮ automated processing of ads
◮ ...
(Dis)Similarity Matrix

Most classification techniques use distance measures or (dis)similarities:
the matrix of distances between all data points: N(N−1)/2 values (symmetric, with null diagonal)

distance:
➀ d(x, y) ≥ 0 and d(x, y) = 0 ⇔ x = y
➁ d(x, y) = d(y, x)
➂ d(x, y) ≤ d(x, z) + d(z, y)

dissimilarity: ➀ and ➁ only
Some of the usual metrics/similarities

◮ Euclidean:
  d(x, y) = √( Σ_{j=1}^m (x_j − y_j)² )
◮ generalized (Minkowski, p ∈ [1, ∞[):
  d_p(x, y) = ( Σ_{j=1}^m |x_j − y_j|^p )^{1/p}
◮ χ²:
  d(x, y) = Σ_{j=1}^m λ_j ( x_j / Σ_{j′} x_{j′} − y_j / Σ_{j′} y_{j′} )²
  where λ_j = ( Σ_i Σ_{j′} u_{ij′} ) / ( Σ_i u_{ij} ) depends on some reference data (u^(i), i = 1...N)
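These three distances can be sketched directly from the formulas; this is a minimal illustration (the λ_j weights are passed in precomputed, under the assumption that the reference data has already been aggregated):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def minkowski(x, y, p):
    # Generalized (Minkowski) distance; p = 2 recovers the Euclidean case.
    return sum(abs(xj - yj) ** p for xj, yj in zip(x, y)) ** (1.0 / p)

def chi2(x, y, lam):
    # Chi-squared distance between profiles (rows normalized to sum to 1);
    # lam[j] is the weight lambda_j computed from the reference data.
    sx, sy = sum(x), sum(y)
    return sum(l * (xj / sx - yj / sy) ** 2
               for l, xj, yj in zip(lam, x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]
print(euclidean(x, y))  # identical to minkowski(x, y, 2)
```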
Some of the usual metrics/similarities (cont.)

◮ cosine (similarity):
  S(x, y) = ( Σ_{j=1}^m x_j y_j ) / ( √(Σ_j x_j²) · √(Σ_j y_j²) ) = ⟨x, y⟩ / ( ||x|| · ||y|| )
◮ for probability distributions:
  ◮ KL-divergence:
    D_KL(x, y) = Σ_{j=1}^m x_j log( x_j / y_j )
  ◮ Jensen–Shannon divergence:
    JS(x, y) = ½ [ D_KL(x, (x+y)/2) + D_KL(y, (x+y)/2) ]
  ◮ Hellinger distance:
    d(x, y) = d_Euclid(√x, √y) = √( Σ_{j=1}^m (√x_j − √y_j)² )
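The probability-distribution measures above translate almost line for line into code; a minimal sketch (assuming the inputs are already valid probability vectors, with y_j > 0 wherever x_j > 0 for the KL term):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def kl(x, y):
    # D_KL(x, y); terms with x_j = 0 contribute 0 by convention.
    return sum(xj * math.log(xj / yj) for xj, yj in zip(x, y) if xj > 0)

def js(x, y):
    # Jensen-Shannon: average KL to the midpoint distribution (symmetric).
    m = [(xj + yj) / 2 for xj, yj in zip(x, y)]
    return (kl(x, m) + kl(y, m)) / 2

def hellinger(x, y):
    # Euclidean distance between the square-rooted distributions.
    return math.sqrt(sum((math.sqrt(xj) - math.sqrt(yj)) ** 2
                         for xj, yj in zip(x, y)))

p, q = [0.5, 0.5], [0.9, 0.1]
```

Note that KL is not symmetric (hence not a distance), while JS and Hellinger are.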
Computational Complexity

Complexities vary (depending on the method), but typically:
N(N−1)/2 distances, m computations for one single distance
☞ complexity in m · N²
Costly: m ≃ 10³, N ≃ 10⁴ ☞ → 10¹¹ operations!
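The arithmetic behind the 10¹¹ figure can be checked directly (N(N−1)/2 is half of N², so the count lands at the same order of magnitude as m · N²):

```python
# Rough cost of naive pairwise-distance computation:
# N(N-1)/2 distances, each costing ~m operations.
N, m = 10_000, 1_000
n_distances = N * (N - 1) // 2
total_ops = n_distances * m
print(f"{n_distances:,} distances, ~{total_ops:.1e} operations")
```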
Classification as a mathematical problem

◮ supervised:
  ◮ function approximation: f(x_1, ..., x_m) = C_k
  ◮ distribution estimation: P(C_k | x_1, ..., x_m) or P(x_1, ..., x_m | C_k)
    ◮ parametric: multi-Gaussian, maximum likelihood, Bayesian inference, discriminant analysis
    ◮ non-parametric: kernels, K nearest neighbors, LVQ, neural nets (Deep Learning, SVM)
  ◮ inference: if x_i = ... and x_j = ... (etc.) then C = C_k ☞ decision trees
◮ unsupervised (clustering):
  ◮ (local) minimization of a global criterion over the data set
Many different classification methods

How to choose? ☞ Several criteria

Task specification:
◮ supervised / unsupervised
◮ overlapping / non-overlapping (partition)
◮ hierarchical / non-hierarchical

Model choices:
◮ generative models (P(X, Y)) / discriminative models (P(Y | X))
◮ parametric / non-parametric (= many parameters)
◮ linear methods (Statistics)
◮ trees (GOFAI)
◮ neural networks
Classification methods: examples

◮ supervised
  ◮ Naive Bayes
  ◮ K-nearest neighbors
  ◮ ID3 – C4.5 (decision trees)
  ◮ kernels, Support Vector Machines (SVM)
  ◮ Gaussian Mixtures
  ◮ neural nets: Deep Learning, SVM, MLP, Learning Vector Quantization
  ◮ ...
◮ unsupervised
  ◮ K-means
  ◮ dendrograms
  ◮ minimum spanning tree
  ◮ neural net: Kohonen’s Self-Organizing Maps (SOM)
  ◮ ...
☞ The question you should ask yourself: what is the optimized criterion?
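For K-means, the optimized criterion is the within-cluster sum of squared Euclidean distances, (locally) minimized by alternating assignment and update steps; a minimal sketch, not a production implementation:

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-means: (locally) minimizes the within-cluster sum
    of squared Euclidean distances (the optimized criterion)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Since only a local minimum is reached, the result depends on the random initialization; in practice one runs the algorithm several times and keeps the best criterion value.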