Text Classification using Weka
Jörg Steffen, DFKI
Substitute: Günter Neumann, DFKI
steffen@dfki.de
10.11.2014
Language Technology I - An Introduction to Text Classification - WS 2014/2015

What is Weka?
• Workbench for machine learning and data mining
• Supports a large number of ML approaches
• Developed by the ML group at the University of Waikato (NZ)
• Implemented in Java
• Open Source software under GNU GPL
• http://www.cs.waikato.ac.nz/~ml/weka/index.html

Weka Datasets
• Used for training and testing
• Collection of examples
    - attributes with values
• Represented as ARFF file
    - ARFF: attribute-relation file format
    - header with attribute types
        • nominal: finite set of strings
        • numeric
        • string
        • date
    - example instances as comma-separated lists of attribute values

ARFF Example

Header:
    @relation golf_weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {true, false}
    @attribute playGolf {yes, no}

Instances:
    @data
    sunny, 29, 85, false, no
    sunny, 27, 90, true, no
    overcast, 28, 86, false, yes
    rainy, 21, 96, false, yes
    rainy, 20, 80, false, yes
    rainy, 18, 70, true, no
    overcast, 17, 65, true, yes
    sunny, 22, 95, false, no
    sunny, 21, 70, false, yes
    rainy, 21, 80, false, yes
    sunny, 24, 70, true, yes
    overcast, 22, 90, true, yes
    overcast, 27, 75, false, yes
    rainy, 22, 91, true, no
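To make the header/instances split concrete, here is a minimal Python sketch that reads the small ARFF subset used in this example (@relation, @attribute with nominal or numeric types, @data). The function name parse_arff is made up for illustration. It is not Weka's own reader and ignores quoting, string and date attributes, and sparse instances.

```python
def parse_arff(text):
    """Parse a tiny ARFF subset: @relation, @attribute (nominal/numeric), @data."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            # Each data line is a comma-separated list of attribute values.
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the finite set of allowed strings.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """@relation golf_weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute playGolf {yes, no}
@data
sunny, 29, 85, false, no
overcast, 28, 86, false, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

The header fixes the order and type of the attributes; every instance line is then interpreted against that order.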

J48 Decision Tree

    > java -cp weka-3.6.3.jar weka.classifiers.trees.J48 -t weather.arff -i

    J48 pruned tree
    ------------------
    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = true: no (2.0)
    |   windy = false: yes (3.0)

    Number of Leaves  : 5
    Size of the tree  : 8

    === Error on training data ===
    Correctly Classified Instances      14    100 %
    Incorrectly Classified Instances     0      0 %
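The pruned tree can be read as nested if/else rules. The following Python sketch hand-translates the tree printed by J48 (it is not generated by Weka; the function name classify is made up) and re-checks it against the 14 golf_weather instances from the ARFF example.

```python
def classify(outlook, humidity, windy):
    """Hand-written translation of the pruned J48 tree shown above."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook}")


# (outlook, temperature, humidity, windy, playGolf) from the ARFF example
INSTANCES = [
    ("sunny", 29, 85, False, "no"),
    ("sunny", 27, 90, True, "no"),
    ("overcast", 28, 86, False, "yes"),
    ("rainy", 21, 96, False, "yes"),
    ("rainy", 20, 80, False, "yes"),
    ("rainy", 18, 70, True, "no"),
    ("overcast", 17, 65, True, "yes"),
    ("sunny", 22, 95, False, "no"),
    ("sunny", 21, 70, False, "yes"),
    ("rainy", 21, 80, False, "yes"),
    ("sunny", 24, 70, True, "yes"),
    ("overcast", 22, 90, True, "yes"),
    ("overcast", 27, 75, False, "yes"),
    ("rainy", 22, 91, True, "no"),
]

correct = sum(
    classify(outlook, humidity, windy) == label
    for outlook, _temperature, humidity, windy, label in INSTANCES
)
print(f"{correct} of {len(INSTANCES)} training instances classified correctly")
# prints "14 of 14 training instances classified correctly"
```

Note that temperature does not appear in the tree at all: the three remaining attributes already separate the training data perfectly.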

Vector-Based Text Classification
• Document features as numeric Weka attributes
• Feature weights as attribute values
• Document class as last Weka attribute
• Example instances as feature vectors followed by document class

    @attribute 'I' numeric
    @attribute 'walk' numeric
    @attribute 'drive' numeric
    @attribute moving_type {walking, driving}
    @data
    1,1,0,walking
    1,0,1,driving
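A small Python sketch of how such instance lines could be produced. It assumes whitespace tokenisation and raw term counts as feature weights, and the two input sentences "I walk" and "I drive" are guessed from the attribute names; none of this is stated on the slide. The helper name to_arff_row is hypothetical.

```python
VOCABULARY = ["I", "walk", "drive"]  # one numeric Weka attribute per feature


def to_arff_row(tokens, vocabulary, label):
    """Feature vector (raw counts, an assumption) followed by the document class."""
    counts = [tokens.count(term) for term in vocabulary]
    return ",".join(str(c) for c in counts) + "," + label


row_walk = to_arff_row("I walk".split(), VOCABULARY, "walking")
row_drive = to_arff_row("I drive".split(), VOCABULARY, "driving")
print(row_walk)   # 1,1,0,walking
print(row_drive)  # 1,0,1,driving
```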

Language Identification
• Classes: 12 languages
    - German (de)
    - Italian (it)
    - Catalan (ca)
    - Norwegian (no)
    - Finnish (fi)
    - Danish (dk)
    - Sorbian (sb)
    - Swedish (sv)
    - French (fr)
    - English (en)
    - Estonian (et)
    - Dutch (nl)
• Data source: http://corpora.uni-leipzig.de/download.html
• Features: character unigrams and bigrams

Language Identification
• Training data: 1000 sentences per language → train.arff
• Test data: 500 sentences per language → test.arff
• Feature selection using corpus frequency >= 4
    - 4764 total features, 1845 filtered out → 2919 features left
• Feature weight: tf.idf
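The feature pipeline described here can be sketched in Python as follows. Several details are assumptions, since the slides do not spell them out: spaces are kept as characters, corpus frequency is counted over all training sentences, and tf.idf is taken as raw term frequency times log(N / df). All function names are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(sentence):
    """Character unigrams followed by character bigrams of one sentence."""
    unigrams = list(sentence)
    bigrams = [sentence[i:i + 2] for i in range(len(sentence) - 1)]
    return unigrams + bigrams


def select_features(sentences, min_corpus_freq=4):
    """Keep n-grams whose total count over the corpus is >= min_corpus_freq."""
    corpus_freq = Counter()
    for sentence in sentences:
        corpus_freq.update(char_ngrams(sentence))
    return sorted(f for f, n in corpus_freq.items() if n >= min_corpus_freq)


def document_frequency(sentences):
    """Number of sentences in which each n-gram occurs at least once."""
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(char_ngrams(sentence)))
    return doc_freq


def tfidf_vector(sentence, features, doc_freq, n_docs):
    """tf.idf weights (tf * log(N / df), one common variant) for the features."""
    tf = Counter(char_ngrams(sentence))
    return [
        tf[f] * math.log(n_docs / doc_freq[f]) if doc_freq.get(f) else 0.0
        for f in features
    ]


# Toy corpus, for illustration only.
corpus = ["abab", "ab"]
features = select_features(corpus, min_corpus_freq=3)
print(features)  # ['a', 'ab', 'b']

# Sanity check of the feature counts reported above.
remaining = 4764 - 1845
print(remaining)  # 2919
```

With the real data, each sentence would become one ARFF instance: its tf.idf vector over the 2919 selected features, followed by the language label.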

Language Identification ARFF File

    ...
    @attribute 'Ru' numeric
    @attribute 'Ry' numeric
    @attribute 'Rà' numeric
    @attribute 'Rä' numeric
    @attribute 'Rå' numeric
    @attribute 'Ré' numeric
    ...
    @attribute lang {de,it,ca,no,fi,dk,sb,sv,fr,en,et,nl}
    @data
    ...
    0,0,14.2323,0,0,7.456, ..., de
    ...

Language Identification Results (NaiveBayes)

    > java -Xms2048m -Xmx2048m -Dfile.encoding=utf-8 -cp weka-3.6.3.jar \
        weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff

    Time taken to build model: 9.57 seconds
    Time taken to test model on training data: 101.29 seconds

    === Error on test data ===
    Correctly Classified Instances      5514    91.9 %
    Incorrectly Classified Instances     486     8.1 %
    ...
    Total Number of Instances           6000

    === Confusion Matrix ===
       a   b   c   d   e   f   g   h   i   j   k   l   <-- classified as
     479   0   1   3   0   0   3   3   0   3   0   8 | a = de
       0 479   5   4   0   1   6   1   0   4   0   0 | b = it
       9   6 445   3   0   0   5   6   8   6   0  12 | c = ca
      12   0   3 388   0  72   1  17   0   2   0   5 | d = no
       2   1   0   2 487   0   0   4   0   0   3   1 | e = fi
       4   1   2  73   1 393   0   8   0   9   1   8 | f = dk
       3   0   0   1   1   1 492   0   0   1   1   0 | g = sb
       6   0   0  11   1  10   0 461   0   8   0   3 | h = sv
       3   0  13   5   0   0   2   1 453   4   0  19 | i = fr
       3   0   1   4   0   2   3   2   0 464   0  21 | j = en
       1   0   0   1   1   0   2   1   1   2 489   2 | k = et
       7   0   0   1   0   0   1   1   2   4   0 484 | l = nl
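The summary figures follow directly from the confusion matrix: rows are the true classes, columns the predicted classes, and the diagonal holds the correct predictions. A short Python check of the NaiveBayes numbers (the matrix is copied from the output above; the variable names are made up):

```python
LABELS = ["de", "it", "ca", "no", "fi", "dk", "sb", "sv", "fr", "en", "et", "nl"]

# NaiveBayes confusion matrix: row = true language, column = predicted language.
NB_MATRIX = [
    [479, 0, 1, 3, 0, 0, 3, 3, 0, 3, 0, 8],
    [0, 479, 5, 4, 0, 1, 6, 1, 0, 4, 0, 0],
    [9, 6, 445, 3, 0, 0, 5, 6, 8, 6, 0, 12],
    [12, 0, 3, 388, 0, 72, 1, 17, 0, 2, 0, 5],
    [2, 1, 0, 2, 487, 0, 0, 4, 0, 0, 3, 1],
    [4, 1, 2, 73, 1, 393, 0, 8, 0, 9, 1, 8],
    [3, 0, 0, 1, 1, 1, 492, 0, 0, 1, 1, 0],
    [6, 0, 0, 11, 1, 10, 0, 461, 0, 8, 0, 3],
    [3, 0, 13, 5, 0, 0, 2, 1, 453, 4, 0, 19],
    [3, 0, 1, 4, 0, 2, 3, 2, 0, 464, 0, 21],
    [1, 0, 0, 1, 1, 0, 2, 1, 1, 2, 489, 2],
    [7, 0, 0, 1, 0, 0, 1, 1, 2, 4, 0, 484],
]

total = sum(sum(row) for row in NB_MATRIX)
correct = sum(NB_MATRIX[i][i] for i in range(len(LABELS)))
accuracy = correct / total

# Per-class recall: correct predictions divided by the row total (500 per language).
recall = {
    label: NB_MATRIX[i][i] / sum(NB_MATRIX[i]) for i, label in enumerate(LABELS)
}

print(correct, total, f"{accuracy:.1%}")  # 5514 6000 91.9%
print(min(recall, key=recall.get))        # no
```

Norwegian has the lowest recall (388 of 500), mostly because 72 Norwegian sentences are classified as Danish.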

Language Identification Results (SMO)

    > java -Xms2048m -Xmx2048m -Dfile.encoding=utf-8 -cp weka-3.6.3.jar \
        weka.classifiers.functions.SMO -t train.arff -T test.arff

    Time taken to build model: 94.77 seconds
    Time taken to test model on training data: 23.07 seconds

    === Error on test data ===
    Correctly Classified Instances      5703    95.05 %
    Incorrectly Classified Instances     297     4.95 %
    ...
    Total Number of Instances           6000

    === Confusion Matrix ===
       a   b   c   d   e   f   g   h   i   j   k   l   <-- classified as
     497   0   0   2   0   0   1   0   0   0   0   0 | a = de
       0 490   6   0   0   1   0   0   2   1   0   0 | b = it
       0   8 486   1   0   1   0   1   2   1   0   0 | c = ca
       9   3   1 431   1  43   0   8   1   2   0   1 | d = no
       1   1   0   2 492   0   0   3   0   0   1   0 | e = fi
       4   1   1  84   0 402   0   5   0   1   0   2 | f = dk
       3   4   1   2   0   1 483   1   1   0   4   0 | g = sb
       4   1   4  15   0   5   0 468   1   1   1   0 | h = sv
       0   2   2   0   0   0   0   0 492   2   0   2 | i = fr
       1   2   6   2   0   0   0   1   3 485   0   0 | j = en
       1   0   1   0   2   0   0   0   0   0 496   0 | k = et
       4   1   1   1   0   2   0   0   6   4   0 481 | l = nl
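To see where the remaining SMO errors come from, the off-diagonal cells can be ranked by size. A Python sketch (matrix copied from the output above; the helper name top_confusions is made up):

```python
LABELS = ["de", "it", "ca", "no", "fi", "dk", "sb", "sv", "fr", "en", "et", "nl"]

# SMO confusion matrix: row = true language, column = predicted language.
SMO_MATRIX = [
    [497, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 490, 6, 0, 0, 1, 0, 0, 2, 1, 0, 0],
    [0, 8, 486, 1, 0, 1, 0, 1, 2, 1, 0, 0],
    [9, 3, 1, 431, 1, 43, 0, 8, 1, 2, 0, 1],
    [1, 1, 0, 2, 492, 0, 0, 3, 0, 0, 1, 0],
    [4, 1, 1, 84, 0, 402, 0, 5, 0, 1, 0, 2],
    [3, 4, 1, 2, 0, 1, 483, 1, 1, 0, 4, 0],
    [4, 1, 4, 15, 0, 5, 0, 468, 1, 1, 1, 0],
    [0, 2, 2, 0, 0, 0, 0, 0, 492, 2, 0, 2],
    [1, 2, 6, 2, 0, 0, 0, 1, 3, 485, 0, 0],
    [1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 496, 0],
    [4, 1, 1, 1, 0, 2, 0, 0, 6, 4, 0, 481],
]


def top_confusions(matrix, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cells = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:k]


confusions = top_confusions(SMO_MATRIX, LABELS)
for true_label, predicted, count in confusions:
    print(f"{true_label} classified as {predicted}: {count}")
```

The three largest cells all involve the Scandinavian languages: Danish as Norwegian (84), Norwegian as Danish (43), and Swedish as Norwegian (15).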
