Text Representation http://www.cse.iitb.ac.in/~soumen/mining-the-web/ Ahmed Rafea
Text Representation
• Document Preprocessing
• Vector Space Model for Document Storage
• Measure of Similarity
Document preprocessing (1/4)
• Tokenization
  – Filter away tags
  – A token is a nonempty sequence of characters excluding spaces and punctuation
  – Each token is represented by a suitable integer, tid, typically 32 bits
  – Optional: stemming/conflation of words
  – Result: a document (did) is transformed into a sequence of integer pairs (tid, pos), where pos is the token's position
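The tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function name `tokenize`, the regular expressions, the lowercasing step, and the in-memory `vocab` dictionary are all assumptions made for the example.

```python
import re

def tokenize(text, vocab):
    """Turn raw text into a list of (tid, pos) pairs.

    vocab maps token string -> integer tid and is extended in place.
    """
    text = re.sub(r"<[^>]*>", " ", text)           # filter away markup tags
    # Assumption: tokens are runs of letters/digits, lowercased.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    pairs = []
    for pos, tok in enumerate(tokens):
        tid = vocab.setdefault(tok, len(vocab))    # next free integer for a new token
        pairs.append((tid, pos))
    return pairs

vocab = {}
pairs = tokenize("<p>The cat sat; the cat slept.</p>", vocab)
print(pairs)   # [(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (3, 5)]
print(vocab)   # {'the': 0, 'cat': 1, 'sat': 2, 'slept': 3}
```

Repeated tokens reuse their tid, so the document becomes a compact integer sequence while `pos` preserves word order.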
Document preprocessing (2/4)
• Stopwords
  – Function words and connectives
  – Appear in a large number of documents and are of little use in pinpointing documents
  – Issues
    – Queries containing only stopwords are ruled out
    – Polysemous words that are stopwords in one sense but not in others
    – E.g., "can" as a verb vs. "can" as a noun
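Both issues listed above show up in a small sketch. The stopword list here is a short illustrative set chosen for the example, not a standard list, and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "to", "be", "or", "not", "can"}

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

normal_query = remove_stopwords(["the", "history", "of", "the", "web"])
stopword_only = remove_stopwords(["to", "be", "or", "not", "to", "be"])
polysemous = remove_stopwords(["a", "can", "of", "soup"])

print(normal_query)    # ['history', 'web']
print(stopword_only)   # []  (the whole query is ruled out)
print(polysemous)      # ['soup']  (the noun "can" is lost as well)
```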
Document preprocessing (3/4)
• Stemming
  – Removes inflections that convey part of speech, tense, and number
  – E.g., "university" and "universal" both stem to "universe"
  – Techniques
    – morphological analysis (e.g., Porter's algorithm)
    – dictionary lookup (e.g., WordNet)
  – Stemming may increase the number of documents returned for a query, but at the price of precision
    – It is not a good idea to stem abbreviations or names coined in the technical and commercial sectors
    – E.g., stemming “ides” to “IDE” (the hard disk standard), or the “SOCKS” firewall protocol to “sock” (worn on the foot), may be bad!
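The precision problem can be seen with a deliberately crude stemmer. The sketch below is not Porter's algorithm: `crude_stem` only lowercases and strips one trailing "s", and the `protected` set used to shield abbreviations is an assumption introduced for illustration.

```python
def crude_stem(token):
    """Toy stemmer: lowercase, then strip one trailing 's' (NOT Porter's algorithm)."""
    t = token.lower()
    return t[:-1] if len(t) > 3 and t.endswith("s") else t

def guarded_stem(token, protected):
    """Leave protected abbreviations and coined names untouched."""
    return token if token in protected else crude_stem(token)

# Unwanted conflation: unrelated words collapse to the same stem.
print(crude_stem("ides"), crude_stem("IDE"))       # ide ide
print(crude_stem("SOCKS"), crude_stem("socks"))    # sock sock

# Hypothetical guard list of terms that should not be stemmed.
protected = {"SOCKS", "IDE"}
print(guarded_stem("SOCKS", protected))            # SOCKS
print(guarded_stem("socks", protected))            # sock
```

A guard list like this only helps if case is preserved up to the stemming step, which is a design choice the slides do not specify.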
Document preprocessing (4/4)
• Non-uniformity of word spellings
  – dialects of English
  – transliteration from other languages
• Two ways to reduce this problem:
  1. Use an aggressive conflation mechanism to collapse variant spellings into the same token
     – E.g., Soundex: takes phonetics and pronunciation details into account
     – Used with great success in indexing and searching last names in census and telephone directory data
  2. Decompose terms into a sequence of q-grams, i.e., sequences of q characters (2 ≤ q ≤ 4)
     – Check for similarity in the q-grams
     – Looking up the inverted index becomes a two-stage affair:
       – A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
       – These terms are submitted to the regular index
     – Used by Google for spelling correction
     – Idea also adopted for eliminating near-duplicate pages
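The first stage of the q-gram lookup can be sketched as follows. The slides do not say how q-gram similarity is scored, so this example assumes a Jaccard overlap of the two q-gram sets with a hypothetical threshold of 0.45; the function names and the tiny lexicon are also made up for the illustration.

```python
from collections import defaultdict

def qgrams(term, q=2):
    """Set of contiguous character q-grams of a term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def build_qgram_index(lexicon, q=2):
    """Small index: q-gram -> set of lexicon terms containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in qgrams(word, q):
            index[gram].add(word)
    return index

def expand_term(term, index, q=2, threshold=0.45):
    """Stage 1: expand a query term into similar lexicon terms."""
    grams = qgrams(term, q)
    candidates = set()
    for gram in grams:
        candidates |= index.get(gram, set())
    matches = []
    for word in candidates:
        other = qgrams(word, q)
        score = len(grams & other) / len(grams | other)  # assumed similarity measure
        if score >= threshold:
            matches.append(word)
    return sorted(matches)

index = build_qgram_index(["color", "colony", "cooler"])
expanded = expand_term("colour", index)
print(expanded)   # ['color']
```

In the second stage, each term in `expanded` would be submitted to the regular inverted index.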
The vector space model (1/4)
• Documents are represented as vectors in a multi-dimensional Euclidean space
  – Each axis = a term (token)
• The coordinate of document d in the direction of term t is determined by:
  – Term frequency TF(d,t): the number of times term t occurs in document d, scaled in a variety of ways to normalize for document length
  – Inverse document frequency IDF(t): scales down the coordinates of terms that occur in many documents
The vector space model (2/4)
• Term frequency
  – $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\sum_{\tau} n(d,\tau)}$ or $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\max_{\tau} n(d,\tau)}$
• The Cornell SMART system uses a smoothed version:
  $$\mathrm{TF}(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log(1 + n(d,t)) & \text{otherwise} \end{cases}$$
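The three term-frequency scalings can be compared on a toy document. This sketch follows the formulas as they appear on the slide; the function names are hypothetical and the natural logarithm is an assumption, since the slide does not state a base.

```python
import math
from collections import Counter

def tf_sum(counts, t):
    """n(d,t) divided by the total number of term occurrences in d."""
    total = sum(counts.values())
    return counts[t] / total if total else 0.0

def tf_max(counts, t):
    """n(d,t) divided by the count of the most frequent term in d."""
    peak = max(counts.values(), default=0)
    return counts[t] / peak if peak else 0.0

def tf_smart(counts, t):
    """Smoothed version: 0 if absent, else 1 + log(1 + n(d,t))."""
    n = counts[t]
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + n)

counts = Counter(["web", "mining", "web", "graph"])   # n(d, "web") = 2
print(tf_sum(counts, "web"))      # 0.5
print(tf_max(counts, "web"))      # 1.0
print(tf_smart(counts, "web"))    # 1 + log(3)
print(tf_smart(counts, "text"))   # 0.0 for an absent term
```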
The vector space model (3/4)
• Inverse document frequency
  – Given: $D$ is the document collection and $D_t$ is the set of documents containing $t$
  – Formulae
    – mostly dampened functions of $|D| / |D_t|$
    – SMART: $\mathrm{IDF}(t) = \log\left(\dfrac{1 + |D|}{|D_t|}\right)$
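A short sketch of the SMART formula shows how rarer terms receive a larger weight. The helper name `idf_smart`, the natural logarithm, and the three-document collection are assumptions for the example; the function is only defined for terms that occur in at least one document.

```python
import math

def idf_smart(num_docs, doc_freq):
    """IDF(t) = log((1 + |D|) / |D_t|), with |D_t| >= 1."""
    if doc_freq < 1:
        raise ValueError("term must occur in at least one document")
    return math.log((1 + num_docs) / doc_freq)

docs = [{"web", "mining"}, {"web", "graph"}, {"text"}]
num_docs = len(docs)
df_web = sum(1 for d in docs if "web" in d)     # |D_t| = 2
df_text = sum(1 for d in docs if "text" in d)   # |D_t| = 1

idf_web = idf_smart(num_docs, df_web)     # log(4 / 2)
idf_text = idf_smart(num_docs, df_text)   # log(4 / 1)
print(idf_web < idf_text)                 # True
```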
The vector space model (4/4)
• Coordinate of document d on axis t
  – $d_t = \mathrm{TF}(d,t)\,\mathrm{IDF}(t)$
  – Document d is transformed to $\vec{d}$ in the TFIDF-space
• Query q
  – Interpreted as a document
  – Transformed to $\vec{q}$ in the same TFIDF-space as d
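Putting the two factors together, documents and queries can be mapped into the same space with one function. This is a sketch under stated assumptions: it uses the SMART-style TF and IDF from the slides with natural logarithms, stores vectors as sparse dictionaries, and silently drops query terms that never occur in the collection. All names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse TFIDF vector {term: TF(d,t) * IDF(t)} for a token list."""
    vec = {}
    for term, n in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue                            # no axis for unseen terms (assumption)
        tf = 1.0 + math.log(1.0 + n)            # smoothed TF, n >= 1 here
        idf = math.log((1 + num_docs) / df)     # SMART IDF
        vec[term] = tf * idf
    return vec

docs = [["web", "mining", "web"], ["web", "graph"], ["text", "mining"]]
doc_freq = Counter(t for d in docs for t in set(d))
doc_vectors = [tfidf_vector(d, doc_freq, len(docs)) for d in docs]

# The query is treated exactly like a (very short) document.
query_vector = tfidf_vector(["web", "mining", "zzz"], doc_freq, len(docs))
print(sorted(query_vector))   # ['mining', 'web']
```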
Measures of Similarity (1/3)
• Distance measure
  – Magnitude of the vector difference, $|\vec{d} - \vec{q}|$
  – Document vectors must be normalized to unit ($L_1$ or $L_2$) length
    – Else shorter documents dominate (since queries are short)
• Cosine similarity
  – Cosine of the angle between $\vec{d}$ and $\vec{q}$
    – Shorter documents are penalized
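The effect of normalization on the distance measure can be checked numerically. In this sketch the vectors are hand-picked sparse dictionaries rather than real TFIDF output, and the helper names are hypothetical. The two documents point in the same direction and differ only in length.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec.values()))

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = l2_norm(vec)
    return {t: x / norm for t, x in vec.items()} if norm else dict(vec)

def euclidean_distance(a, b):
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def cosine_similarity(a, b):
    dot = sum(x * b.get(t, 0.0) for t, x in a.items())
    na, nb = l2_norm(a), l2_norm(b)
    return dot / (na * nb) if na and nb else 0.0

q = {"web": 1.0}
short_doc = {"web": 1.0, "graph": 1.0}
long_doc = {"web": 10.0, "graph": 10.0}   # same direction, ten times longer

# Without normalization the short document looks much closer to the query.
raw_short = euclidean_distance(short_doc, q)   # 1.0
raw_long = euclidean_distance(long_doc, q)     # about 13.45

# After L2 normalization both documents are equally distant.
unit_short = euclidean_distance(l2_normalize(short_doc), l2_normalize(q))
unit_long = euclidean_distance(l2_normalize(long_doc), l2_normalize(q))

cos_short = cosine_similarity(short_doc, q)
cos_long = cosine_similarity(long_doc, q)
```

For unit-length vectors the two measures carry the same information, since $|\vec{d} - \vec{q}|^2 = 2 - 2\cos(\vec{d}, \vec{q})$.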
Measures of Similarity (2/3)
• Jaccard coefficient of similarity between documents $d_1$ and $d_2$
  – $T(d)$ = set of tokens in document $d$
  – $r'(d_1, d_2) = \dfrac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
  – Symmetric, reflexive
  – Forgives any number of occurrences and any permutations of the terms
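A direct transcription of the Jaccard coefficient makes the "forgiving" behaviour visible. The function name and the sample sentences are made up for the example, and returning 1.0 for two empty documents is a convention chosen here to keep the measure reflexive.

```python
def jaccard(tokens_a, tokens_b):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| on token sets."""
    ta, tb = set(tokens_a), set(tokens_b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

d1 = "the cat sat on the mat".split()
d2 = "on the mat the cat sat sat".split()   # reordered, with a repeated token
d3 = "the dog sat".split()

same_tokens = jaccard(d1, d2)   # 1.0: order and repetition are ignored
partial = jaccard(d1, d3)       # 2 shared tokens out of 6 distinct tokens
print(same_tokens, partial)
```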
Measures of Similarity (3/3)
• Represent each document as a set of q-grams (shingles)
• A shingle is a contiguous subsequence of tokens taken from a document
• S(d,w) is the set of distinct shingles of width w taken from document d
• When w is fixed, S(d,w) is shortened to S(d)
• When w = 1, S(d) = T(d)
• Using the shingled document representation, one may define the resemblance $r(d_1, d_2)$ between $d_1$ and $d_2$ as the Jaccard similarity with $T(d)$ replaced by $S(d,w)$
• The two documents are considered similar if this Jaccard similarity is above a threshold
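Replacing T(d) by S(d,w) in the Jaccard formula gives $r(d_1, d_2) = |S(d_1) \cap S(d_2)| / |S(d_1) \cup S(d_2)|$. The sketch below illustrates this with shingles stored as tuples of tokens. The function names, sample sentences, and the 0.9 threshold are assumptions for illustration only.

```python
def shingles(tokens, w):
    """S(d, w): set of distinct contiguous token subsequences of width w."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(tokens_a, tokens_b, w):
    """Jaccard similarity computed on shingle sets instead of token sets."""
    sa, sb = shingles(tokens_a, w), shingles(tokens_b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

a = "the cat sat on the mat".split()
b = "the mat sat on the cat".split()   # same tokens, different order

r1 = resemblance(a, b, 1)   # 1.0: with w = 1, S(d) = T(d)
r2 = resemblance(a, b, 2)   # 4 shared shingles out of 6 distinct ones
print(r1, r2)

THRESHOLD = 0.9             # hypothetical near-duplicate cutoff
print(r2 >= THRESHOLD)      # False
```

With w > 1 the measure becomes sensitive to word order, so permutations are no longer fully forgiven.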