Guiding People to Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing
S. Bradshaw, A. Scheinkman & K. Hammond

Context
• Information Retrieval Systems and Machine Learning
  – ML techniques/algorithms used in IR
  – IR applied to ML, esp. CBR
• User feedback in learning systems

Plan for Discussion
• Background on IR
• Indexing and textual classification:
  – As specific to text-based knowledge systems
  – As relevant to automated knowledge acquisition in general
  – Implications as a cognitive model

Larger Issues Raised
• ML in document classification
• Citation indexing
  – CiteSeer, Rosetta
• Indexing from the perspective of rhetorical theory
• Practical & theoretical aspects of the underlying cognitive problem of textual indexing & classification
• CBR & texts; CBR "textuality"
Automated Text Categorization
• Task: assign a value (usually {0, 1}) to each entry in a decision matrix, where a_ij records whether document d_j is filed under category c_i:

          d_0   ...   d_j   ...   d_n
    c_0   a_00  ...   a_0j  ...   a_0n
    ...   ...   ...   ...   ...   ...
    c_i   a_i0  ...   a_ij  ...   a_in
    ...   ...   ...   ...   ...   ...
    c_m   a_m0  ...   a_mj  ...   a_mn

• Categories are labels (no access to meaning)
• Attribution is content-based (no metadata)
• Constraints can differ wrt the cardinalities of the assignment

Automated Text Categorization
• CPC vs DPC (both filling orders are sketched in code after these slides):
  – Category-pivoted categorization (CPC)
    • One row at a time
    • When categories are added dynamically
  – Document-pivoted categorization (DPC)
    • One column at a time
    • When documents are added over a long period of time
• Assignment vs "relevance"
  – The latter is subjective
  – It is largely the same as the notion of relevance to an information need

Automated Text Categorization
• Applications
  – Automatic indexing for IR using a controlled dictionary; usually performed by experts; indices = categories
  – Classified ads
  – Document filtering (e.g., Reuters/AP)
  – WSD (word sense disambiguation)
    • Word occurrence contexts = docs, senses = categories
    • Itself important as an indexing technique
  – Web-page categorization (e.g., Yahoo)

Document Categorization & ML
• Earliest efforts ('80s): knowledge engineering (manually building an expert system) using rules
  – Example: CONSTRUE (for Reuters)
  – Typical problem: the "knowledge acquisition bottleneck" (updating)
• More recently ('90s): the ML approach
  – The effort goes into constructing not the classifier itself but the builder of classifiers
  – Variety of approaches (both inductive & lazy)
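The DPC/CPC distinction above is simply a choice of filling order for the decision matrix. The following is a minimal Python sketch of that idea, not code from the paper or slides; the `classify` predicate is a hypothetical placeholder for any of the classifiers discussed later.

```python
from typing import Callable, Dict, Iterable

def classify(document: str, category: str) -> int:
    """Hypothetical placeholder: return 1 if `document` should be filed
    under `category`, 0 otherwise. Any classifier could stand behind it."""
    raise NotImplementedError

def document_pivoted(documents: Iterable[str],
                     categories: Iterable[str],
                     clf: Callable[[str, str], int] = classify
                     ) -> Dict[str, Dict[str, int]]:
    """DPC: fill the matrix one column (document) at a time -- the natural
    order when documents keep arriving over a long period of time."""
    return {d: {c: clf(d, c) for c in categories} for d in documents}

def category_pivoted(documents: Iterable[str],
                     categories: Iterable[str],
                     clf: Callable[[str, str], int] = classify
                     ) -> Dict[str, Dict[str, int]]:
    """CPC: fill the matrix one row (category) at a time -- the natural
    order when new categories are added to an existing collection."""
    return {c: {d: clf(d, c) for d in documents} for c in categories}
```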
VSM (Vector-Space Model)
• Vector of n weighted index terms: "bag of words"
  – More sophisticated approaches are based on noun phrases
    • Linguistic vs. statistical notion of phrase
    • Results not conclusive

VSM (Vector-Space Model)
• Standard model: tfidf (term frequency × inverse document frequency), sketched in code after these slides:

    tfidf(t_k, d_j) = #(t_k, d_j) · log(|Tr| / #_Tr(t_k))

  where #(t_k, d_j) is the number of occurrences of t_k in d_j, |Tr| the number of training documents, and #_Tr(t_k) the number of training documents containing t_k
• Assumptions:
  – the more often a term occurs in a document, the more representative it is
  – the more documents a term occurs in, the less discriminating it is
  – Is this always an appropriate model? Are "representative" and "significant" the same?

VSM (Vector-Space Model)
• Pre-processing:
  – Removal of function words
  – Stemming
• "Distance": based on the dot product of vectors
• Dimensionality problem
  – IR cosine matching scales well, but other learning algorithms used for classifier induction do not
  – DR: dimensionality reduction (also reduces overfitting)

VSM (Vector-Space Model)
• Dimensionality reduction
  – Local: each category separately
    • Each d_j has a different representation for each c_i
    • In practice: subsets of d_j's original representation
  – Global: all categories are reduced in the same way
  – Bases: linear algebra, information theory
• Feature extraction vs. selection
  – In extraction, the new features are not a subset of the original terms and are not homogeneous with them: they combine or transform the originals
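A minimal Python sketch of the tfidf weighting and cosine matching described above, assuming documents have already been pre-processed into lists of stemmed tokens with function words removed. The function names and the dict-based sparse-vector representation are my assumptions, not from the slides.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

def document_frequencies(training_docs: Iterable[List[str]]) -> Counter:
    """#Tr(t): the number of training documents in which each term occurs."""
    df: Counter = Counter()
    for doc in training_docs:
        df.update(set(doc))
    return df

def tfidf_vector(doc: List[str], df: Counter, n_training: int) -> Dict[str, float]:
    """Weight each term t of `doc` by #(t, d) * log(|Tr| / #Tr(t)),
    the tfidf formula from the slide; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: count * math.log(n_training / df[t])
            for t, count in tf.items() if df.get(t)}

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine matching: length-normalized dot product of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Dicts serve as sparse vectors here because, for any single document, almost all weights over the full term space are zero.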
VSM (Vector-Space Model)
• Feature selection: TSR (term space reduction), proven to diminish effectiveness the least
  – Document frequency: terms that occur in the most documents of the collection are the most valuable (see the sketch after these slides)
    • Apparently contradicts the premise of tfidf that low-df terms are more informative
    • But the majority of words that occur in a corpus have extremely low df, so a reduction by a factor of 10 will only prune these (which are probably insignificant within the documents they occur in as well)
  – Other techniques: information gain, chi-square, correlation coefficient, etc.
    • Some improvement

VSM (Vector-Space Model)
• Feature extraction: reparameterization
  – Features are "synthesized" rather than naturally occurring
  – A way of dealing with polysemy, homonymy and synonymy
• Term clustering
  – Group words with pair-wise semantic relatedness and use their "centroid" as a term
    • One way: co-occurrence/co-absence
• Latent semantic indexing
  – Combines original dimensions on the basis of co-occurrence
  – Capable of educing an underlying semantics not available from the original terms
• The complexity of these techniques precludes easy interpretation of why results are better

VSM (Vector-Space Model)
• Example: a great number of terms each contribute a small amount to the whole
  – category: "Demographic shifts in the U.S. with economic impact"
  – text: "The nation grew to 249.6 million in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West"
• Problems
  – Sometimes the new terms are not readily interpretable
  – Could eliminate an original term which was significant
• Synthetic production of indices: relation to the type of indexing done with citations

Building the Classifier
• Two phases
  – Definition of a mapping function
  – Definition of a threshold on the values returned by that function
• Methods for building the mapping function
  – Parametric (training data used to estimate the parameters of a probability distribution)
  – Non-parametric
    • Profile-based (linear classifier): extract a vector from the training (pre-categorized) documents; use this profile to categorize documents in D according to RSV (retrieval status value)
    • Example-based: use the training-set documents that yield the highest category status values as the basis for classifying documents in D
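As an illustration of the document-frequency argument on the first slide above (not the slides' own code), the sketch below keeps only the highest-df fraction of the term space; with keep_fraction=0.1 it corresponds to the tenfold reduction mentioned there. The function name and signature are assumptions of mine.

```python
from collections import Counter
from typing import Iterable, List, Set

def df_term_space_reduction(training_docs: Iterable[List[str]],
                            keep_fraction: float = 0.1) -> Set[str]:
    """Term-space reduction by document frequency: rank terms by the number
    of training documents they occur in and keep only the top fraction.
    Because most corpus terms have extremely low df, a tenfold reduction
    (keep_fraction=0.1) mostly prunes rare terms."""
    df: Counter = Counter()
    for doc in training_docs:
        df.update(set(doc))
    keep = max(1, int(len(df) * keep_fraction))
    return {term for term, _ in df.most_common(keep)}
```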
Building the Classifier
• Parametric
  – "Naïve Bayes"
  – Cannot use feature selection (needs the full term space)
  – As in most Bayesian learning, assumes that features are independent
  – Has been shown to work well
• Profile
  – Embodies an explicit/declarative representation of the category
  – Incremental or batch (on training docs)
  – Most common batch method: Rocchio, where w_yj is the weight of term t_y in document d_j and w_yi the weight of t_y in the profile used to decide whether to classify documents under c_i (a code sketch follows these slides):

    w_yi = β · Σ_{d_j : ca_ij = 1} w_yj / |{d_j | ca_ij = 1}| + γ · Σ_{d_j : ca_ij = 0} w_yj / |{d_j | ca_ij = 0}|

    with β + γ = 1, β ≥ 0, γ ≤ 0

Building the Classifier
• Profile-based, cont'd
  – In general, rewards closeness to the positive centroid and distance from the negative centroid
  – Produces understandable classifiers (amenable to human tuning)
  – However, since it is a linear average, it divides the space into only two subspaces (n-spheres), so it risks excluding most of the positive training examples
• EBL (example-based) classifiers
  – Not explicit or declarative
  – Use k-NN: look at the k training documents most similar to d_j to see if they have been classified under c_i; a threshold value determines the decision

Building the Classifier
• EBL classifiers

    CSV_i(d_j) = Σ_{d_z ∈ TR_k(d_j)} RSV(d_j, d_z) · ca_iz

  – TR_k(d_j): the set of k training documents d_z for which RSV(d_j, d_z) is maximum; the ca values come from the correct decision matrix
  – RSV is some measure of semantic relatedness: could be probabilistic or vector-based cosine
  – Does not subdivide the space into only two subspaces
  – Efficient: O(|Tr|)
  – One variant uses 1, −1 instead of 1, 0 for ca

Building the Classifier
• Lam & Ho: an attempt to combine the profile- and example-based approaches
  – The k-NN algorithm is given generalized instances (GIs) instead of training documents:
    • Cluster the positive instances of category c_i into {cl_i1 … cl_ik_i}
    • Extract a profile from each cluster with a linear classifier
    • Apply k-NN to these GIs
  – Avoids the sensitivity of k-NN to noise, but exploits its superiority over linear classifiers
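A minimal sketch of the two non-parametric routes above: a Rocchio-style profile built from the positive and negative centroids, and the example-based CSV_i(d_j) summed over the k nearest training documents. It assumes the tfidf dict vectors from the earlier sketch; the β/γ defaults (chosen to satisfy the slide's constraint) and the use of cosine as the RSV measure are illustrative choices of mine, not prescribed by the slides.

```python
import math
from typing import Dict, Iterable, Sequence

Vector = Dict[str, float]

def cosine(u: Vector, v: Vector) -> float:
    """Vector-based cosine similarity, used here as the RSV measure."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rocchio_profile(term_space: Iterable[str],
                    training_vectors: Sequence[Vector],
                    ca_column: Sequence[int],
                    beta: float = 1.25,
                    gamma: float = -0.25) -> Vector:
    """Batch Rocchio profile for one category c_i, following the slide's
    formula: w_yi = beta * (average w_yj over positive training docs)
                   + gamma * (average w_yj over negative training docs),
    with beta + gamma = 1, beta >= 0, gamma <= 0 (so the negative
    centroid is effectively subtracted)."""
    pos = [v for v, ca in zip(training_vectors, ca_column) if ca == 1]
    neg = [v for v, ca in zip(training_vectors, ca_column) if ca == 0]
    profile: Vector = {}
    for term in term_space:
        pos_avg = sum(v.get(term, 0.0) for v in pos) / len(pos) if pos else 0.0
        neg_avg = sum(v.get(term, 0.0) for v in neg) / len(neg) if neg else 0.0
        profile[term] = beta * pos_avg + gamma * neg_avg
    return profile

def knn_csv(d_j: Vector,
            training_vectors: Sequence[Vector],
            ca_column: Sequence[int],
            k: int = 30) -> float:
    """Example-based CSV_i(d_j): sum of RSV(d_j, d_z) * ca_iz over the k
    training documents d_z most similar to d_j, i.e. TR_k(d_j)."""
    scored = sorted(((cosine(d_j, d_z), ca)
                     for d_z, ca in zip(training_vectors, ca_column)),
                    key=lambda pair: pair[0],
                    reverse=True)
    return sum(rsv * ca for rsv, ca in scored[:k])
```

In either case, a threshold on the resulting score (the profile's dot product with d_j, or CSV_i(d_j)) turns it into the {0, 1} entries of the decision matrix, as in the two-phase scheme described earlier.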