Guiding People to Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing
S. Bradshaw, A. Scheinkman & K. Hammond

Context
• Information Retrieval Systems and Machine Learning
  – ML techniques/algorithms used in IR
  – IR applied to ML, esp. CBR
• User feedback in learning systems

Plan for Discussion
• Background on IR
• Indexing and textual classification:
  – As specific to text-based knowledge systems
  – As relevant to automated knowledge acquisition in general
  – Implications as a cognitive model

Larger Issues Raised
• ML in document classification
• Citation indexing
  – CiteSeer, Rosetta
• Indexing from the perspective of rhetorical theory
• Practical & theoretical aspects of the underlying cognitive problem of textual indexing & classification
• CBR & texts; CBR "textuality"
Automated Text Categorization
• Task: assign a value (usually {0, 1}) to each entry in a decision matrix, where a_ij records whether document d_j is filed under category c_i:

          d_0   ...   d_j   ...   d_n
    c_0   a_00  ...   a_0j  ...   a_0n
    ...   ...   ...   ...   ...   ...
    c_i   a_i0  ...   a_ij  ...   a_in
    ...   ...   ...   ...   ...   ...
    c_m   a_m0  ...   a_mj  ...   a_mn

• Categories are labels (no access to meaning)
• Attribution is content-based (no metadata)
• Constraints can differ wrt the cardinalities of the assignment

Automated Text Categorization
• CPC vs DPC (both filling orders are sketched in code after these slides):
  – Category-pivoted categorization (CPC)
    • One row at a time
    • When categories are added dynamically
  – Document-pivoted categorization (DPC)
    • One column at a time
    • When documents are added over a long period of time
• Assignment vs "relevance"
  – The latter is subjective
  – It is largely the same as the notion of relevance to an information need

Automated Text Categorization
• Applications
  – Automatic indexing for IR using a controlled dictionary; usually performed by experts; indices = categories
  – Classified ads
  – Document filtering (e.g., Reuters/AP)
  – WSD (word sense disambiguation)
    • Word occurrence contexts = docs, senses = categories
    • Itself important as an indexing technique
  – Web-page categorization (e.g., Yahoo)

Document Categorization & ML
• Earliest efforts ('80s): knowledge engineering (manually building an expert system) using rules
  – Example: CONSTRUE (for Reuters)
  – Typical problem: the "knowledge acquisition bottleneck" (updating)
• More recently ('90s): the ML approach
  – The effort goes into constructing not the classifier itself but the builder of classifiers
  – Variety of approaches (both inductive & lazy)
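The DPC/CPC distinction above is simply a choice of filling order for the decision matrix. The following is a minimal Python sketch of that idea, not code from the paper or slides; the `classify` predicate is a hypothetical placeholder for any of the classifiers discussed later.

```python
from typing import Callable, Dict, Iterable

def classify(document: str, category: str) -> int:
    """Hypothetical placeholder: return 1 if `document` should be filed
    under `category`, 0 otherwise. Any classifier could stand behind it."""
    raise NotImplementedError

def document_pivoted(documents: Iterable[str],
                     categories: Iterable[str],
                     clf: Callable[[str, str], int] = classify
                     ) -> Dict[str, Dict[str, int]]:
    """DPC: fill the matrix one column (document) at a time -- the natural
    order when documents keep arriving over a long period of time."""
    return {d: {c: clf(d, c) for c in categories} for d in documents}

def category_pivoted(documents: Iterable[str],
                     categories: Iterable[str],
                     clf: Callable[[str, str], int] = classify
                     ) -> Dict[str, Dict[str, int]]:
    """CPC: fill the matrix one row (category) at a time -- the natural
    order when new categories are added to an existing collection."""
    return {c: {d: clf(d, c) for d in documents} for c in categories}
```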
VSM (Vector-Space Model)
• Vector of n weighted index terms: "bag of words"
  – More sophisticated approaches are based on noun phrases
    • Linguistic vs. statistical notion of phrase
    • Results not conclusive

VSM (Vector-Space Model)
• Standard model: tfidf (term frequency × inverse document frequency), sketched in code after these slides:

    tfidf(t_k, d_j) = #(t_k, d_j) · log(|Tr| / #_Tr(t_k))

  where #(t_k, d_j) is the number of occurrences of t_k in d_j, |Tr| the number of training documents, and #_Tr(t_k) the number of training documents containing t_k
• Assumptions:
  – the more often a term occurs in a document, the more representative it is
  – the more documents a term occurs in, the less discriminating it is
  – Is this always an appropriate model? Are "representative" and "significant" the same?

VSM (Vector-Space Model)
• Pre-processing:
  – Removal of function words
  – Stemming
• "Distance": based on the dot product of vectors
• Dimensionality problem
  – IR cosine matching scales well, but other learning algorithms used for classifier induction do not
  – DR: dimensionality reduction (also reduces overfitting)

VSM (Vector-Space Model)
• Dimensionality reduction
  – Local: each category separately
    • Each d_j has a different representation for each c_i
    • In practice: subsets of d_j's original representation
  – Global: all categories are reduced in the same way
  – Bases: linear algebra, information theory
• Feature extraction vs. selection
  – In extraction, the new features are not a subset of the original terms and are not homogeneous with them: they combine or transform the originals
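A minimal Python sketch of the tfidf weighting and cosine matching described above, assuming documents have already been pre-processed into lists of stemmed tokens with function words removed. The function names and the dict-based sparse-vector representation are my assumptions, not from the slides.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

def document_frequencies(training_docs: Iterable[List[str]]) -> Counter:
    """#Tr(t): the number of training documents in which each term occurs."""
    df: Counter = Counter()
    for doc in training_docs:
        df.update(set(doc))
    return df

def tfidf_vector(doc: List[str], df: Counter, n_training: int) -> Dict[str, float]:
    """Weight each term t of `doc` by #(t, d) * log(|Tr| / #Tr(t)),
    the tfidf formula from the slide; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: count * math.log(n_training / df[t])
            for t, count in tf.items() if df.get(t)}

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine matching: length-normalized dot product of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Dicts serve as sparse vectors here because, for any single document, almost all weights over the full term space are zero.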
VSM (Vector-Space Model)
• Feature selection: TSR (term space reduction), proven to diminish effectiveness the least
  – Document frequency: terms that occur in the most documents of the collection are the most valuable (see the sketch after these slides)
    • Apparently contradicts the premise of tfidf that low-df terms are more informative
    • But the majority of words that occur in a corpus have extremely low df, so a reduction by a factor of 10 will only prune these (which are probably insignificant within the documents they occur in as well)
  – Other techniques: information gain, chi-square, correlation coefficient, etc.
    • Some improvement

VSM (Vector-Space Model)
• Feature extraction: reparameterization
  – Features are "synthesized" rather than naturally occurring
  – A way of dealing with polysemy, homonymy and synonymy
• Term clustering
  – Group words with pair-wise semantic relatedness and use their "centroid" as a term
    • One way: co-occurrence/co-absence
• Latent semantic indexing
  – Combines original dimensions on the basis of co-occurrence
  – Capable of educing an underlying semantics not available from the original terms
• The complexity of these techniques precludes easy interpretation of why results are better

VSM (Vector-Space Model)
• Example: a great number of terms each contribute a small amount to the whole
  – category: "Demographic shifts in the U.S. with economic impact"
  – text: "The nation grew to 249.6 million in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West"
• Problems
  – Sometimes the new terms are not readily interpretable
  – Could eliminate an original term which was significant
• Synthetic production of indices: relation to the type of indexing done with citations

Building the Classifier
• Two phases
  – Definition of a mapping function
  – Definition of a threshold on the values returned by that function
• Methods for building the mapping function
  – Parametric (training data used to estimate the parameters of a probability distribution)
  – Non-parametric
    • Profile-based (linear classifier): extract a vector from the training (pre-categorized) documents; use this profile to categorize documents in D according to RSV (retrieval status value)
    • Example-based: use the training-set documents that yield the highest category status values as the basis for classifying documents in D
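As an illustration of the document-frequency argument on the first slide above (not the slides' own code), the sketch below keeps only the highest-df fraction of the term space; with keep_fraction=0.1 it corresponds to the tenfold reduction mentioned there. The function name and signature are assumptions of mine.

```python
from collections import Counter
from typing import Iterable, List, Set

def df_term_space_reduction(training_docs: Iterable[List[str]],
                            keep_fraction: float = 0.1) -> Set[str]:
    """Term-space reduction by document frequency: rank terms by the number
    of training documents they occur in and keep only the top fraction.
    Because most corpus terms have extremely low df, a tenfold reduction
    (keep_fraction=0.1) mostly prunes rare terms."""
    df: Counter = Counter()
    for doc in training_docs:
        df.update(set(doc))
    keep = max(1, int(len(df) * keep_fraction))
    return {term for term, _ in df.most_common(keep)}
```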
Building the Classifier
• Parametric
  – "Naïve Bayes"
  – Cannot use feature selection (needs the full term space)
  – As in most Bayesian learning, assumes that features are independent
  – Has been shown to work well
• Profile
  – Embodies an explicit/declarative representation of the category
  – Incremental or batch (on training docs)
  – Most common batch method: Rocchio, where w_yj is the weight of term t_y in document d_j and w_yi the weight of t_y in the profile used to decide whether to classify documents under c_i (a code sketch follows these slides):

    w_yi = β · Σ_{d_j : ca_ij = 1} w_yj / |{d_j | ca_ij = 1}| + γ · Σ_{d_j : ca_ij = 0} w_yj / |{d_j | ca_ij = 0}|

    with β + γ = 1, β ≥ 0, γ ≤ 0

Building the Classifier
• Profile-based, cont'd
  – In general, rewards closeness to the positive centroid and distance from the negative centroid
  – Produces understandable classifiers (amenable to human tuning)
  – However, since it is a linear average, it divides the space into only two subspaces (n-spheres), so it risks excluding most of the positive training examples
• EBL (example-based) classifiers
  – Not explicit or declarative
  – Use k-NN: look at the k training documents most similar to d_j to see if they have been classified under c_i; a threshold value determines the decision

Building the Classifier
• EBL classifiers

    CSV_i(d_j) = Σ_{d_z ∈ TR_k(d_j)} RSV(d_j, d_z) · ca_iz

  – TR_k(d_j): the set of k training documents d_z for which RSV(d_j, d_z) is maximum; the ca values come from the correct decision matrix
  – RSV is some measure of semantic relatedness: could be probabilistic or vector-based cosine
  – Does not subdivide the space into only two subspaces
  – Efficient: O(|Tr|)
  – One variant uses 1, −1 instead of 1, 0 for ca

Building the Classifier
• Lam & Ho: an attempt to combine the profile- and example-based approaches
  – The k-NN algorithm is given generalized instances (GIs) instead of training documents:
    • Cluster the positive instances of category c_i into {cl_i1 … cl_ik_i}
    • Extract a profile from each cluster with a linear classifier
    • Apply k-NN to these GIs
  – Avoids the sensitivity of k-NN to noise, but exploits its superiority over linear classifiers
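A minimal sketch of the two non-parametric routes above: a Rocchio-style profile built from the positive and negative centroids, and the example-based CSV_i(d_j) summed over the k nearest training documents. It assumes the tfidf dict vectors from the earlier sketch; the β/γ defaults (chosen to satisfy the slide's constraint) and the use of cosine as the RSV measure are illustrative choices of mine, not prescribed by the slides.

```python
import math
from typing import Dict, Iterable, Sequence

Vector = Dict[str, float]

def cosine(u: Vector, v: Vector) -> float:
    """Vector-based cosine similarity, used here as the RSV measure."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rocchio_profile(term_space: Iterable[str],
                    training_vectors: Sequence[Vector],
                    ca_column: Sequence[int],
                    beta: float = 1.25,
                    gamma: float = -0.25) -> Vector:
    """Batch Rocchio profile for one category c_i, following the slide's
    formula: w_yi = beta * (average w_yj over positive training docs)
                   + gamma * (average w_yj over negative training docs),
    with beta + gamma = 1, beta >= 0, gamma <= 0 (so the negative
    centroid is effectively subtracted)."""
    pos = [v for v, ca in zip(training_vectors, ca_column) if ca == 1]
    neg = [v for v, ca in zip(training_vectors, ca_column) if ca == 0]
    profile: Vector = {}
    for term in term_space:
        pos_avg = sum(v.get(term, 0.0) for v in pos) / len(pos) if pos else 0.0
        neg_avg = sum(v.get(term, 0.0) for v in neg) / len(neg) if neg else 0.0
        profile[term] = beta * pos_avg + gamma * neg_avg
    return profile

def knn_csv(d_j: Vector,
            training_vectors: Sequence[Vector],
            ca_column: Sequence[int],
            k: int = 30) -> float:
    """Example-based CSV_i(d_j): sum of RSV(d_j, d_z) * ca_iz over the k
    training documents d_z most similar to d_j, i.e. TR_k(d_j)."""
    scored = sorted(((cosine(d_j, d_z), ca)
                     for d_z, ca in zip(training_vectors, ca_column)),
                    key=lambda pair: pair[0],
                    reverse=True)
    return sum(rsv * ca for rsv, ca in scored[:k])
```

In either case, a threshold on the resulting score (the profile's dot product with d_j, or CSV_i(d_j)) turns it into the {0, 1} entries of the decision matrix, as in the two-phase scheme described earlier.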