Textual Data Analysis

J.-C. Chappelier
Laboratoire d’Intelligence Artificielle, Faculté I&C
© EPFL, J.-C. Chappelier — Textual Data Analysis – 1 / 48
Objectives of this lecture

Basics of textual data analysis:
➥ classification
➥ visualization: dimension reduction / projection
  (useful for a good understanding/presentation of classification/clustering results)
Is this course a Machine Learning course? CAVEAT/REMINDER

◮ NLP makes use of Machine Learning (as would Image Processing, for instance)
◮ but good results require:
  ◮ good preprocessing
  ◮ good data (to learn from), relevant annotations
  ◮ good understanding of the pros/cons, features, outputs, results, ...
☞ The goal of this course is to provide you with specific knowledge about NLP.
☞ The goal of this lecture is to make some links between general ML and NLP.
This lecture is worth deepening with a real ML course.
Introduction: Data Analysis

WHAT does Data Analysis consist in?
“to represent in a live and intelligible manner the (statistical) information, simplifying and summarizing it in diagrams” [L. Lebart]

Two complementary approaches:
☞ classification (regrouping in the original space)
☞ visualization: projection into a low-dimension space

Classification/clustering consists in regrouping objects into categories/clusters (i.e. subsets of objects).
Visualization: display in an intelligible way the internal structures of the data (documents here).
Contents

➀ Classification
  ➀ Framework
  ➁ Methods (in general)
  ➂ Presentation of a few methods
  ➃ Evaluation
➁ Visualization
  ➀ Introduction
  ➁ Principal Component Analysis (PCA)
  ➂ Multidimensional Scaling
Supervised/unsupervised Classification

Classification can be
◮ supervised (strict meaning of "classification"):
  classes are known a priori; they are usually meaningful for the user
◮ unsupervised (called clustering):
  clusters are based on the inner structures of the data (e.g. neighborhoods);
  their meaning is much more dubious

Textual Data Analysis: relate documents (or words) so as to...
structure (supervised) / discover structure (unsupervised)
Classify what?

WHAT is to be classified?
Starting point: a chart (numbers) representing, in one way or another, a set of objects
◮ continuous values
◮ contingency tables: cooccurrence counts
◮ presence/absence of attributes
◮ distance/(dis)similarity (square symmetric chart)

☞ N "row" objects (or "observations") x^(i), characterized by m "features" (columns) x_j^(i)

Two complementary points of view:
➀ N points in R^m
➁ m points in R^N
Not necessarily the same metrics: object similarity vs. feature similarity
Classify what?

[Figure: the N × m data matrix; rows = the N objects, columns = the m features;
entry x_j^(i) = "importance" of feature j for object i]
Textual Data Classification

◮ What is classified?
  ◮ authors (1 object = several documents)
  ◮ documents
  ◮ paragraphs
  ◮ "words" (/tokens) (vocabulary study, lexicometry)
◮ How to represent the objects?
  ◮ document indexing
  ◮ choose the textual units that are meaningful
  ◮ choice of the metric/similarity
☞ preprocessing: "unsequentialize" the text, suppress (meaningless) lexical variability

Frequently: lines = documents, columns = "words" (tokens, words, n-grams)
☞ the former two "visions" are complementary
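The "lines = documents, columns = words" representation can be sketched as follows; this is a minimal illustration on a hypothetical toy corpus, not the course's reference implementation:

```python
from collections import Counter

# Hypothetical toy corpus: lines = documents, columns = "words" (tokens).
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# "Unsequentialize" the text: keep only token counts, dropping word order.
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})

# N x m matrix: entry [i][j] = count of word j in document i.
matrix = [[Counter(toks)[w] for w in vocab] for toks in tokenized]

for row in matrix:
    print(row)
```

A real pipeline would of course add the preprocessing mentioned above (tokenization, normalization of lexical variability) before counting.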
Textual Data Classification: Examples of applications

◮ Information Retrieval
◮ Open-Question Surveys (polls)
◮ email classification/routing
◮ client surveys (complaint analysis)
◮ automated processing of ads
◮ ...
(Dis)Similarity Matrix

Most classification techniques use distance measures or (dis)similarities:
the matrix of distances between all data points: N(N−1)/2 values (symmetric, with null diagonal)

distance:
➀ d(x, y) ≥ 0 and d(x, y) = 0 ⇔ x = y
➁ d(x, y) = d(y, x)
➂ d(x, y) ≤ d(x, z) + d(z, y)

dissimilarity: ➀ and ➁ only
Some of the usual metrics/similarities

◮ Euclidean:
  d(x, y) = √( Σ_{j=1}^m (x_j − y_j)² )
◮ generalized (Minkowski, p ∈ [1, ∞[):
  d_p(x, y) = ( Σ_{j=1}^m |x_j − y_j|^p )^{1/p}
◮ χ²:
  d(x, y) = Σ_{j=1}^m λ_j ( x_j / Σ_{j′} x_{j′} − y_j / Σ_{j′} y_{j′} )²
  where λ_j = ( Σ_i Σ_{j′} u_{ij′} ) / ( Σ_i u_{ij} ) depends on some reference data (u^(i), i = 1...N)
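These three distances can be sketched directly from the formulas; this is a minimal illustration (the λ_j weights are passed in precomputed, under the assumption that the reference data has already been aggregated):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def minkowski(x, y, p):
    # Generalized (Minkowski) distance; p = 2 recovers the Euclidean case.
    return sum(abs(xj - yj) ** p for xj, yj in zip(x, y)) ** (1.0 / p)

def chi2(x, y, lam):
    # Chi-squared distance between profiles (rows normalized to sum to 1);
    # lam[j] is the weight lambda_j computed from the reference data.
    sx, sy = sum(x), sum(y)
    return sum(l * (xj / sx - yj / sy) ** 2
               for l, xj, yj in zip(lam, x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]
print(euclidean(x, y))  # identical to minkowski(x, y, 2)
```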
Some of the usual metrics/similarities (cont.)

◮ cosine (similarity):
  S(x, y) = ( Σ_{j=1}^m x_j y_j ) / ( √(Σ_j x_j²) · √(Σ_j y_j²) ) = ⟨x, y⟩ / ( ||x|| · ||y|| )
◮ for probability distributions:
  ◮ KL-divergence:
    D_KL(x, y) = Σ_{j=1}^m x_j log( x_j / y_j )
  ◮ Jensen–Shannon divergence:
    JS(x, y) = ½ [ D_KL(x, (x+y)/2) + D_KL(y, (x+y)/2) ]
  ◮ Hellinger distance:
    d(x, y) = d_Euclid(√x, √y) = √( Σ_{j=1}^m (√x_j − √y_j)² )
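The probability-distribution measures above translate almost line for line into code; a minimal sketch (assuming the inputs are already valid probability vectors, with y_j > 0 wherever x_j > 0 for the KL term):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def kl(x, y):
    # D_KL(x, y); terms with x_j = 0 contribute 0 by convention.
    return sum(xj * math.log(xj / yj) for xj, yj in zip(x, y) if xj > 0)

def js(x, y):
    # Jensen-Shannon: average KL to the midpoint distribution (symmetric).
    m = [(xj + yj) / 2 for xj, yj in zip(x, y)]
    return (kl(x, m) + kl(y, m)) / 2

def hellinger(x, y):
    # Euclidean distance between the square-rooted distributions.
    return math.sqrt(sum((math.sqrt(xj) - math.sqrt(yj)) ** 2
                         for xj, yj in zip(x, y)))

p, q = [0.5, 0.5], [0.9, 0.1]
```

Note that KL is not symmetric (hence not a distance), while JS and Hellinger are.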
Computational Complexity

Complexities vary (depending on the method), but typically:
N(N−1)/2 distances, m computations for one single distance
☞ complexity in m · N²
Costly: m ≃ 10³, N ≃ 10⁴ ☞ → 10¹¹ operations!
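The arithmetic behind the 10¹¹ figure can be checked directly (N(N−1)/2 is half of N², so the count lands at the same order of magnitude as m · N²):

```python
# Rough cost of naive pairwise-distance computation:
# N(N-1)/2 distances, each costing ~m operations.
N, m = 10_000, 1_000
n_distances = N * (N - 1) // 2
total_ops = n_distances * m
print(f"{n_distances:,} distances, ~{total_ops:.1e} operations")
```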
Classification as a mathematical problem

◮ supervised:
  ◮ function approximation: f(x_1, ..., x_m) = C_k
  ◮ distribution estimation: P(C_k | x_1, ..., x_m) or P(x_1, ..., x_m | C_k)
    ◮ parametric: multi-Gaussian, maximum likelihood, Bayesian inference, discriminant analysis
    ◮ non-parametric: kernels, K nearest neighbors, LVQ, neural nets (Deep Learning, SVM)
  ◮ inference: if x_i = ... and x_j = ... (etc.) then C = C_k ☞ decision trees
◮ unsupervised (clustering):
  ◮ (local) minimization of a global criterion over the data set
Many different classification methods

How to choose? ☞ Several criteria

Task specification:
◮ supervised / unsupervised
◮ overlapping / non-overlapping (partition)
◮ hierarchical / non-hierarchical

Model choices:
◮ generative models (P(X, Y)) / discriminative models (P(Y | X))
◮ parametric / non-parametric (= many parameters)
◮ linear methods (Statistics)
◮ trees (GOFAI)
◮ neural networks
Classification methods: examples

◮ supervised
  ◮ Naive Bayes
  ◮ K-nearest neighbors
  ◮ ID3 – C4.5 (decision trees)
  ◮ kernels, Support Vector Machines (SVM)
  ◮ Gaussian Mixtures
  ◮ neural nets: Deep Learning, SVM, MLP, Learning Vector Quantization
  ◮ ...
◮ unsupervised
  ◮ K-means
  ◮ dendrograms
  ◮ minimum spanning tree
  ◮ neural net: Kohonen’s Self-Organizing Maps (SOM)
  ◮ ...
☞ The question you should ask yourself: what is the optimized criterion?
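For K-means, the optimized criterion is the within-cluster sum of squared Euclidean distances, (locally) minimized by alternating assignment and update steps; a minimal sketch, not a production implementation:

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-means: (locally) minimizes the within-cluster sum
    of squared Euclidean distances (the optimized criterion)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Since only a local minimum is reached, the result depends on the random initialization; in practice one runs the algorithm several times and keeps the best criterion value.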