Chapter 6: Automatic Classification (Supervised Data Organization)
6.1 Simple Distance-based Classifiers
6.2 Feature Selection
6.3 Distribution-based (Bayesian) Classifiers
6.4 Discriminative Classifiers: Decision Trees
6.5 Discriminative Classifiers: Support Vector Machines
6.6 Hierarchical Classification
6.7 Classifiers with Semisupervised Learning
6.8 Hypertext Classifiers
6.9 Application: Focused Crawling
Classification Problem (Categorization)

Determine the class/topic membership(s) of feature vectors, given:
• known classes and labeled training data → supervised learning (classification)
• unknown classes → unsupervised learning (clustering)

[Figure: two scatter plots over a feature space (f1, f2), contrasting supervised learning, where class regions are inferred from labeled training points and a new point "?" must be assigned, with unsupervised clustering of unlabeled points]
Uses of Automatic Classification in IR
• Filtering: test newly arriving documents (e.g., mail, news) for whether they belong to a class of interest (stock market news, spam, etc.)
• Summary/overview: organize query or crawler results, directories, feeds, etc.
• Query expansion: assign a query to an appropriate class and expand the query with class-specific search terms
• Relevance feedback: classify query results and let the user identify relevant classes for improved query generation
• Word sense disambiguation: map words (in context) to concepts
• Query efficiency: restrict (index) search to the relevant class(es)
• (Semi-)automated portal building: automatically generate topic directories such as yahoo.com, dmoz.org, about.com, etc.

Classification variants:
• with terms, term frequencies, link structure, etc. as features
• binary: does a document d belong to class c or not?
• many-way: into which of k classes does a document fit best?
• hierarchical: use multiple classifiers to assign a document to node(s) of a topic tree
Automatic Classification in Data Mining
Goal: categorize persons, business entities, or scientific objects and predict their behavioral patterns.

Application examples:
• categorize types of bookstore customers based on purchased books
• categorize movie genres based on title and cast
• categorize opinions on movies, books, political discussions, etc.
• identify high-risk loan applicants based on their financial history
• identify high-risk insurance customers based on observed demographic, consumer, and health parameters
• predict protein folding structure types based on specific properties of amino acid sequences
• predict cancer risk based on genomic, health, and other parameters
...
Classification with Training Data (Supervised Learning): Overview

New documents from the WWW / an intranet are mapped into a feature space, e.g. term frequencies: feature vectors $\vec f$ with components $f_i \in \mathbb{R}_0^+$ (i = 1, ..., m). The classes $c_k$ are the nodes of a topic directory (e.g., Science → Mathematics → Algebra, Probability and Statistics → Hypothesis Testing, Large Deviation), and the training data consists of documents intellectually assigned to these classes.

Automatic assignment: estimate $P[d \in c_k \mid \vec f]$ and assign the document to the class with the highest probability, e.g. with a Bayesian method:

$$P[d \in c_k \mid \vec f] = \frac{P[\vec f \mid d \in c_k] \cdot P[d \in c_k]}{P[\vec f]}$$
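The Bayes rule above already yields a complete decision procedure once priors and likelihoods have been estimated from the training data. A minimal sketch; the class names and all probability values are made up for illustration, not taken from the slides:

```python
# Hypothetical estimates for two classes.
priors = {"Algebra": 0.5, "Calculus": 0.5}            # P[d in c_k]
likelihoods = {"Algebra": 0.002, "Calculus": 0.008}   # P[f | d in c_k] for one observed f

# P[f] by the law of total probability: sum_k P[f | c_k] * P[c_k]
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P[d in c_k | f] via Bayes' rule; assign to the argmax class.
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)                            # {'Algebra': 0.2, 'Calculus': 0.8}
print(max(posteriors, key=posteriors.get))   # Calculus
```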
Assessment of Classification Quality

Empirical: by automatic classification of documents that do not belong to the training data (in benchmarks, the class labels of the test data are usually known).

For binary classification with regard to class C:
a = #docs that are classified into C and do belong to C
b = #docs that are classified into C but do not belong to C
c = #docs that are not classified into C but do belong to C
d = #docs that are not classified into C and do not belong to C

$$Accuracy = \frac{a+d}{a+b+c+d} \qquad Error = 1 - Accuracy$$

$$Precision = \frac{a}{a+b} \qquad Recall = \frac{a}{a+c} \qquad F_1 = \frac{2}{1/Precision + 1/Recall} \;\text{(harmonic mean of precision and recall)}$$

For many-way classification with regard to classes C1, ..., Ck:
• macro-average over the k classes (average the per-class measures), or
• micro-average over the k classes (aggregate the counts a, b, c, d over all classes, then compute the measures)
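These measures translate directly into code. A minimal sketch (the function names are ours, not from the slides):

```python
def binary_quality(a, b, c, d):
    """Quality measures for binary classification w.r.t. one class C,
    from the four counts a, b, c, d defined above."""
    accuracy = (a + d) / (a + b + c + d)
    error = 1 - accuracy
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return accuracy, error, precision, recall, f1

def macro_micro_precision(per_class):
    """per_class: list of (a, b, c, d) tuples, one per class C_1..C_k.
    Assumes every class received at least one prediction; recall and F1
    are averaged analogously."""
    # macro: average the per-class measures
    macro = sum(a / (a + b) for a, b, c, d in per_class) / len(per_class)
    # micro: sum the raw counts first, then compute one measure
    A = sum(a for a, b, c, d in per_class)
    B = sum(b for a, b, c, d in per_class)
    micro = A / (A + B)
    return macro, micro
```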
Estimation of Classifier Quality

Use a benchmark collection of completely labeled documents (e.g., Reuters newswire data from the TREC benchmark).

Cross-validation (with held-out training data):
• partition the training data into k equally sized (randomized) parts
• for each of the k possible choices of k−1 partitions: train with these k−1 partitions and apply the classifier to the remaining partition
• determine precision, recall, etc.
• compute micro-averaged quality measures

Leave-one-out validation/estimation: variant of cross-validation with two partitions of unequal size: use n−1 documents for training and classify the n-th document (repeated for each of the n documents).
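A sketch of the k-fold procedure; `train_and_test` is assumed to be a caller-supplied function that trains on one index set, classifies the other, and returns the (a, b, c, d) counts defined above:

```python
import random

def cross_validate(n_docs, train_and_test, k=10, seed=42):
    """k-fold cross-validation: partition the labeled data into k randomized
    parts; train on k-1 parts, test on the held-out part; micro-average the
    counts over all folds. Leave-one-out is the special case k = n_docs."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k (roughly) equal parts
    a = b = c = d = 0
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        da, db, dc, dd = train_and_test(train, test)
        a, b, c, d = a + da, b + db, c + dc, d + dd
    return a / (a + b), a / (a + c)              # micro-avg precision, recall
```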
6.1 Distance-based Classifiers: k-Nearest-Neighbor Method (kNN)

Step 1: Among the training documents of all classes, find the k (e.g., 10-100) documents most similar to $\vec d$ (e.g., based on cosine similarity): the k nearest neighbors of $\vec d$.

Step 2: Assign $\vec d$ to the class $C_j$ for which the function value

$$f(\vec d, C_j) = \sum_{\vec v \in kNN(\vec d)} sim(\vec d, \vec v) \cdot \begin{cases} 1 & \text{if } \vec v \in C_j \\ 0 & \text{otherwise} \end{cases}$$

is maximized.

For binary classification, assign $\vec d$ to class C if $f(\vec d, C)$ is above some threshold δ (δ > 0.5).
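A minimal sketch of both steps, assuming documents are sparse term-weight vectors represented as Python dicts:

```python
import math
from collections import defaultdict

def cosine(x, y):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(d, training, k=10):
    """training: list of (vector, class_label) pairs.
    Step 1: find the k training docs most similar to d.
    Step 2: return the class with the highest similarity-weighted vote."""
    neighbors = sorted(((cosine(d, v), c) for v, c in training),
                       reverse=True)[:k]
    score = defaultdict(float)
    for sim, c in neighbors:
        score[c] += sim          # adds sim(d, v) only for neighbors v in C_j
    return max(score, key=score.get)
```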
Distance-based Classifiers: Rocchio Method

Step 1: Represent the training documents of class $C_j$ by a prototype vector with tf*idf-based vector components:

$$\vec c_j := \alpha \cdot \frac{1}{|C_j|} \sum_{\vec d \in C_j} \vec d \;-\; \beta \cdot \frac{1}{|D - C_j|} \sum_{\vec d \in D - C_j} \vec d$$

with appropriate coefficients α and β (e.g., α = 16, β = 4).

Step 2: Assign a new document $\vec d$ to the class $C_j$ for which the cosine similarity $\cos(\vec c_j, \vec d)$ is maximized.
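A sketch of the prototype computation, under the same sparse-dict representation as in the kNN example above:

```python
from collections import defaultdict

def rocchio_prototype(class_docs, other_docs, alpha=16.0, beta=4.0):
    """Prototype for class C_j: alpha times the centroid of C_j's training
    docs minus beta times the centroid of all other training docs (D - C_j).
    Documents are sparse tf*idf dicts."""
    proto = defaultdict(float)
    for doc in class_docs:
        for t, w in doc.items():
            proto[t] += alpha * w / len(class_docs)
    for doc in other_docs:
        for t, w in doc.items():
            proto[t] -= beta * w / len(other_docs)
    return dict(proto)

# Step 2: assign a new document d to the class whose prototype maximizes
# cosine(prototype, d), with cosine() as in the kNN sketch.
```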
6.2 Feature Selection

For efficiency of the classifier and to suppress noise, choose a subset of all possible features. Selected features should be
• frequent enough to avoid overfitting the classifier to the training data,
• but not too frequent, in order to remain characteristic.

Features should be good discriminators between classes (i.e., frequent/characteristic in one class but infrequent in other classes).

Approach:
• compute a measure of discrimination for each feature
• select the top-k most discriminative features in a greedy manner

tf*idf is usually not a good discrimination measure; it may give undue weight to terms with a high idf value (leading to the danger of overfitting).
Example for Feature Selection

Class tree: Entertainment and Math below the root; Algebra and Calculus below Math.

Features (binary term occurrence): f1 = theorem, f2 = integral, f3 = vector, f4 = group, f5 = chart, f6 = limit, f7 = film, f8 = hit

Training docs:
      f1  f2  f3  f4  f5  f6  f7  f8
d1:    1   1   0   0   0   0   0   0
d2:    0   1   1   0   0   0   1   0
d3:    1   0   1   0   0   0   0   0
d4:    0   1   1   0   0   0   0   0
d5:    0   0   0   1   1   1   0   0
d6:    0   0   0   1   0   1   0   0
d7:    0   0   0   0   1   0   0   0
d8:    0   0   0   1   0   1   0   0
d9:    0   0   0   0   0   0   1   1
d10:   0   0   0   1   0   0   1   1
d11:   0   0   0   1   0   1   0   1
d12:   0   0   1   1   1   0   1   0

Class labels of the training docs:
d1, d2, d3, d4 → Calculus
d5, d6, d7, d8 → Algebra
d9, d10, d11, d12 → Entertainment
Simple (Class-unspecific) Criteria for Feature Selection

Document frequency thresholding: consider for class $C_j$ only terms $t_i$ that occur in at least δ training documents of $C_j$.

Term strength: for the decision between classes $C_1, ..., C_k$, select (binary) features $X_i$ with the highest value of

$$s(X_i) := P[\,X_i \text{ occurs in doc } d' \mid X_i \text{ occurs in similar doc } d\,]$$

To this end, the set of similar document pairs (d, d') is obtained
• by thresholding on pairwise similarity, or
• by clustering/grouping the training docs.

Further criteria along these lines are possible.
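A sketch of the term-strength estimate, obtaining the similar pairs by thresholding pairwise similarity; the docs-as-term-sets representation, the pluggable `sim` function, and the threshold value are our assumptions:

```python
from itertools import combinations

def term_strength(docs, sim, threshold=0.3):
    """Estimate s(X_i) = P[X_i occurs in d' | X_i occurs in similar doc d]
    over ordered pairs of sufficiently similar documents.
    docs: list of term sets; sim: any pairwise similarity function."""
    occurs, cooccurs = {}, {}
    for d, d2 in combinations(docs, 2):
        if sim(d, d2) < threshold:
            continue
        for a, b in ((d, d2), (d2, d)):        # count both orders of the pair
            for t in a:
                occurs[t] = occurs.get(t, 0) + 1
                if t in b:
                    cooccurs[t] = cooccurs.get(t, 0) + 1
    return {t: cooccurs.get(t, 0) / n for t, n in occurs.items()}
```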
Feature Selection Based on the χ² Test

For class $C_j$, select those terms for which the χ² test (performed on the training data) gives the highest confidence that $C_j$ and $t_i$ are not independent. As a discrimination measure, compute for each class $C_j$ and term $t_i$:

$$\chi^2(X_i, C_j) = \sum_{X \in \{X_i, \bar X_i\}} \; \sum_{C \in \{C_j, \bar C_j\}} \frac{\big(freq(X \wedge C) - freq(X) \cdot freq(C)/n\big)^2}{freq(X) \cdot freq(C)/n}$$

with absolute frequencies freq and n training documents.
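The measure is the classic χ² statistic of the 2×2 contingency table of term occurrence vs. class membership. A sketch from absolute document frequencies (the argument names are ours):

```python
def chi_square(n, n_term, n_class, n_both):
    """chi^2(X_i, C_j) from absolute frequencies:
    n       = total number of training docs
    n_term  = docs containing term X_i          (freq(X_i))
    n_class = docs belonging to class C_j       (freq(C_j))
    n_both  = docs in C_j that contain X_i      (freq(X_i and C_j))"""
    score = 0.0
    for x in (True, False):            # X_i occurs / does not occur
        for c in (True, False):        # doc in C_j / not in C_j
            observed = (n_both if x and c
                        else n_term - n_both if x
                        else n_class - n_both if c
                        else n - n_term - n_class + n_both)
            expected = ((n_term if x else n - n_term)
                        * (n_class if c else n - n_class) / n)
            if expected:
                score += (observed - expected) ** 2 / expected
    return score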
Feature Selection Based on Information Gain

Information gain: for discriminating classes $c_1, ..., c_k$, select the binary features $X_i$ (term occurrence) with the largest gain in entropy:

$$G(X_i) = \sum_{j=1}^{k} P[c_j] \log_2 \frac{1}{P[c_j]} \;-\; P[X_i] \sum_{j=1}^{k} P[c_j \mid X_i] \log_2 \frac{1}{P[c_j \mid X_i]} \;-\; P[\bar X_i] \sum_{j=1}^{k} P[c_j \mid \bar X_i] \log_2 \frac{1}{P[c_j \mid \bar X_i]}$$

This can be computed in time O(n) + O(mk) for n training documents, m terms, and k classes.
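A sketch that computes G(X_i) for one binary feature from document counts (the argument names are ours); ranking all m terms by this value and keeping the top k implements the greedy selection described at the start of 6.2:

```python
import math

def information_gain(n, n_term, class_counts, class_term_counts):
    """G(X_i) = class entropy minus expected entropy after observing X_i.
    n                    = total number of training docs
    n_term               = docs containing the term (X_i occurs)
    class_counts[j]      = docs in class c_j
    class_term_counts[j] = docs in class c_j that contain the term"""
    def entropy(probs):
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    k = len(class_counts)
    prior = entropy([class_counts[j] / n for j in range(k)])
    cond_x = entropy([class_term_counts[j] / n_term
                      for j in range(k)]) if n_term else 0.0
    n_not = n - n_term
    cond_not = entropy([(class_counts[j] - class_term_counts[j]) / n_not
                        for j in range(k)]) if n_not else 0.0
    return prior - (n_term / n) * cond_x - (n_not / n) * cond_not
```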