Text Representation - PowerPoint PPT Presentation

Text Representation
http://www.cse.iitb.ac.in/~soumen/mining-the-web/
Ahmed Rafea

Text Representation
• Document Preprocessing
• Vector Space Model for Document Storage
• Measure of Similarity

Document preprocessing (1/4)
• Tokenization
  – Filter away tags
  – A token is a nonempty sequence of characters excluding spaces and punctuation
  – Each token is represented by a suitable integer, tid, typically 32 bits
  – Optional: stemming/conflation of words
  – Result: a document (did) is transformed into a sequence of integers (tid, pos)
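The tokenization steps above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the regular expressions, the lowercasing step, and the names `tokenize` and `encode` are assumptions made for the example.

```python
import re

def tokenize(html):
    """Strip markup and return the document's tokens in order."""
    text = re.sub(r"<[^>]+>", " ", html)           # filter away tags
    # Nonempty runs of letters/digits; spaces and punctuation are excluded.
    # Lowercasing is an assumption, not something the slide prescribes.
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def encode(tokens, vocab):
    """Map tokens to (tid, pos) pairs, assigning new integer ids on demand."""
    pairs = []
    for pos, token in enumerate(tokens):
        tid = vocab.setdefault(token, len(vocab))  # reuse the id if already seen
        pairs.append((tid, pos))
    return pairs

vocab = {}
pairs = encode(tokenize("<p>The cat, the hat.</p>"), vocab)
```

Here `vocab` plays the role of the token-to-tid dictionary; a real system would typically keep it as a persistent structure shared across documents.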

Document preprocessing (2/4)
• Stopwords
  – Function words and connectives
  – Appear in a large number of documents and are of little use in pinpointing documents
  – Issues
    · Queries containing only stopwords are ruled out
    · Polysemous words that are stopwords in one sense but not in others, e.g., "can" as a verb vs. "can" as a noun
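A small sketch of stopword filtering shows both issues listed above. The stopword list here is a hypothetical toy list chosen for the example, not a standard one.

```python
# Hypothetical toy stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "be", "not", "in", "is", "can"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in stopwords]

# Issue 1: a query made only of stopwords becomes empty.
only_stopwords = remove_stopwords("to be or not to be".split())

# Issue 2: "can" is dropped even when it is the noun the user is looking for.
lost_noun = remove_stopwords(["open", "the", "can"])
```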

Document preprocessing (3/4)
• Stemming
  – Remove inflections that convey parts of speech, tense, and number
  – E.g.: university and universal both stem to universe
  – Techniques
    · morphological analysis (e.g., Porter's algorithm)
    · dictionary lookup (e.g., WordNet)
  – Stemming may increase the number of documents returned for a query, but at the price of precision
    · It is not a good idea to stem abbreviations or names coined in the technical and commercial sectors
    · E.g.: stemming "ides" to "IDE" (the hard disk standard), or "SOCKS" (the firewall protocol) to "sock" (worn on the foot), may be bad!
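The precision risk can be illustrated with a deliberately naive suffix stripper. This is not Porter's algorithm; the suffix list, the minimum stem length, and the `PROTECTED` set are all assumptions made for the sketch.

```python
# Hypothetical set of abbreviations and coined names that must not be stemmed.
PROTECTED = {"SOCKS", "IDE"}

def naive_stem(token):
    """Strip one common suffix unless the token is a protected name."""
    if token in PROTECTED:
        return token                      # leave coined names untouched
    word = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Without the protected set, "ides" collapses to "ide" and "socks" to "sock", which is exactly the kind of conflation the slide warns about.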

Document preprocessing (4/4)
• Non-uniformity of word spellings
  – dialects of English
  – transliteration from other languages
• Two ways to reduce this problem:
  1. Aggressive conflation mechanism to collapse variant spellings into the same token
     – E.g.: Soundex, which takes phonetics and pronunciation details into account
     – Used with great success in indexing and searching last names in census and telephone directory data
  2. Decompose terms into a sequence of q-grams, i.e., sequences of q characters (2 ≤ q ≤ 4)
     – Check for similarity in the q-grams
     – Looking up the inverted index becomes a two-stage affair:
       · A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
       · These terms are submitted to the regular index
     – Used by Google for spelling correction
     – Idea also adopted for eliminating near-duplicate pages
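The two-stage q-gram lookup can be sketched as below. The overlap criterion (Jaccard overlap of q-gram sets with a 0.5 threshold), the default q = 2, and all function names are assumptions for illustration; the slides do not specify how similarity between q-gram sets is scored.

```python
from collections import defaultdict

def qgrams(term, q=2):
    """Set of all length-q character substrings of a term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def build_qgram_index(vocabulary, q=2):
    """Stage-one index: q-gram -> set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in qgrams(term, q):
            index[gram].add(term)
    return index

def expand(term, index, q=2, min_overlap=0.5):
    """Expand a query term into similar vocabulary terms (assumed threshold)."""
    grams = qgrams(term, q)
    shared = defaultdict(int)
    for gram in grams:
        for candidate in index.get(gram, ()):
            shared[candidate] += 1
    expanded = set()
    for candidate, count in shared.items():
        union = len(grams | qgrams(candidate, q))
        if count / union >= min_overlap:
            expanded.add(candidate)
    return expanded

index = build_qgram_index(["colour", "color", "collar"])
variants = expand("colour", index)   # candidates to submit to the regular index
```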

The vector space model (1/4)
• Documents are represented as vectors in a multi-dimensional Euclidean space
  – Each axis = a term (token)
• The coordinate of document d in the direction of term t is determined by:
  – Term frequency TF(d,t)
    · number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  – Inverse document frequency IDF(t)
    · scales down the coordinates of terms that occur in many documents

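The vector space model (2/4)
• Term frequency, with $n(d,t)$ the number of times term $t$ occurs in document $d$
  – $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\sum_{\tau} n(d,\tau)}$
  – $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\max_{\tau} n(d,\tau)}$
• Cornell SMART system uses a smoothed version
  – $\mathrm{TF}(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log(1 + n(d,t)) & \text{otherwise} \end{cases}$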
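The three term-frequency scalings can be compared on a small example. This sketch assumes the natural logarithm for the SMART variant (the slide does not state the base); the function names are illustrative.

```python
import math
from collections import Counter

def smart_tf(count):
    """SMART smoothed term frequency: 0 for absent terms, else 1 + log(1 + n)."""
    return 0.0 if count == 0 else 1.0 + math.log(1.0 + count)

def tf_variants(tokens):
    """Return the sum-normalized, max-normalized, and SMART TF for each term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    total = sum(counts.values())      # sum over tau of n(d, tau)
    peak = max(counts.values())       # max over tau of n(d, tau)
    return {
        term: {"sum": n / total, "max": n / peak, "smart": smart_tf(n)}
        for term, n in counts.items()
    }

tf = tf_variants(["a", "b", "a", "c"])
```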

The vector space model (3/4)
• Inverse document frequency
  – Given: $D$ is the document collection and $D_t$ is the set of documents containing $t$
  – Formulae: mostly dampened functions of $\dfrac{|D|}{|D_t|}$
  – SMART: $\mathrm{IDF}(t) = \log\left(\dfrac{1 + |D|}{|D_t|}\right)$
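A minimal sketch of the SMART inverse document frequency follows. It assumes the natural logarithm and represents each document as a set of tokens; raising an error for a term that occurs in no document is a design choice of this example, since the formula is undefined when $|D_t| = 0$.

```python
import math

def smart_idf(term, docs):
    """IDF(t) = log((1 + |D|) / |D_t|) over a collection of token sets."""
    containing = sum(1 for doc in docs if term in doc)   # |D_t|
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log((1 + len(docs)) / containing)

docs = [{"web", "mining"}, {"web", "search"}, {"text"}]
common = smart_idf("web", docs)    # occurs in 2 of 3 documents
rare = smart_idf("text", docs)     # occurs in 1 of 3 documents
```

Terms that occur in many documents receive a smaller weight than rare ones, which is the dampening effect the slide describes.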

The vector space model (4/4)
• Coordinate of document d in axis t
  – $d_t = \mathrm{TF}(d,t)\,\mathrm{IDF}(t)$
  – Transformed to $\vec{d}$ in the TFIDF-space
• Query q
  – Interpreted as a document
  – Transformed to $\vec{q}$ in the same TFIDF-space as $d$
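Putting the pieces together, documents and queries can be mapped into the same TFIDF space. This sketch assumes the SMART TF and IDF variants with natural logarithms, stores vectors sparsely as dictionaries, and ignores query terms unseen in the collection; all of these are choices made for the example.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """|D_t| for every term t, where docs is a list of token lists."""
    return Counter(term for doc in docs for term in set(doc))

def tfidf_vector(tokens, df, num_docs):
    """Sparse vector with coordinate d_t = TF(d,t) * IDF(t)."""
    counts = Counter(t for t in tokens if t in df)   # skip unseen terms
    return {
        term: (1.0 + math.log(1.0 + n)) * math.log((1 + num_docs) / df[term])
        for term, n in counts.items()
    }

docs = [["web", "mining"], ["web", "search"]]
df = document_frequencies(docs)
doc_vectors = [tfidf_vector(d, df, len(docs)) for d in docs]
# The query is treated as a short document in the same space.
query_vector = tfidf_vector(["mining", "unknown"], df, len(docs))
```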

Measures of Similarity (1/3)
• Distance measure
  – Magnitude of the vector difference, $|\vec{d} - \vec{q}|$
  – Document vectors must be normalized to unit ($L_1$ or $L_2$) length
    · Else shorter documents dominate (since queries are short)
• Cosine similarity
  – Cosine of the angle between $\vec{d}$ and $\vec{q}$
    · Shorter documents are penalized
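Both measures can be sketched over sparse vectors stored as dictionaries. The representation, the function names, and the choice to return 0.0 for a zero vector are assumptions of this example.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec.values()))

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length (zero vectors are returned as-is)."""
    norm = l2_norm(vec)
    return {t: x / norm for t, x in vec.items()} if norm else dict(vec)

def euclidean(u, v):
    """Magnitude of the vector difference |u - v|."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def cosine(u, v):
    """Cosine of the angle between u and v."""
    norms = l2_norm(u) * l2_norm(v)
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    return dot / norms if norms else 0.0

short_doc = {"web": 3.0, "mining": 4.0}
long_doc = {"web": 6.0, "mining": 8.0}    # same direction, twice the length
raw_distance = euclidean(short_doc, long_doc)
unit_distance = euclidean(l2_normalize(short_doc), l2_normalize(long_doc))
```

The two vectors point in the same direction, so the raw distance reflects only the difference in length; after normalization the distance vanishes, and the cosine is 1 either way.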

Measures of Similarity (2/3)
• Jaccard coefficient of similarity between documents $d_1$ and $d_2$
  – $T(d)$ = set of tokens in document $d$
  – $r'(d_1, d_2) = \dfrac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
  – Symmetric, reflexive
  – Forgives any number of occurrences and any permutations of the terms
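A short sketch of the Jaccard coefficient over token sets makes the "forgiving" behavior concrete. Returning 1.0 for two empty documents is an assumption of this example; the formula itself is undefined in that case.

```python
def jaccard(tokens1, tokens2):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| over token sets."""
    t1, t2 = set(tokens1), set(tokens2)
    if not t1 and not t2:
        return 1.0            # assumed convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)

# Repetition and word order do not matter, only the set of tokens.
same_set = jaccard("a b c".split(), "c b a a".split())
partial = jaccard(["a", "b"], ["b", "c"])
```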

Measures of Similarity (3/3)
• Represent each document as a set of q-grams (shingles)
  – A shingle is a contiguous subsequence of tokens taken from a document
  – $S(d,w)$ is the set of distinct shingles of width $w$ taken from document $d$
  – When $w$ is fixed, $S(d,w)$ is shortened to $S(d)$
  – When $w = 1$, $S(d) = T(d)$
• Using the shingled document representation, one may define the resemblance $r(d_1, d_2)$ between $d_1$ and $d_2$ using Jaccard similarity, replacing $T(d)$ by $S(d,w)$
• The two documents are similar if the Jaccard similarity is above a threshold
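Shingling and resemblance can be sketched as follows. The default width w = 4, the threshold value 0.9, and the function names are hypothetical choices for the example; the slides leave both parameters open.

```python
def shingles(tokens, w):
    """S(d, w): the set of distinct width-w contiguous token subsequences."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(tokens1, tokens2, w=4):
    """r(d1, d2): Jaccard similarity of the two shingle sets."""
    s1, s2 = shingles(tokens1, w), shingles(tokens2, w)
    if not s1 and not s2:
        return 1.0            # assumed convention when neither has a shingle
    return len(s1 & s2) / len(s1 | s2)

def near_duplicates(tokens1, tokens2, w=4, threshold=0.9):
    """Flag two documents as similar when resemblance exceeds the threshold."""
    return resemblance(tokens1, tokens2, w) > threshold

doc = "a rose is a rose is a rose".split()
distinct = shingles(doc, 4)   # eight tokens give five shingles, three distinct
```

Unlike the plain token-set Jaccard coefficient, shingles of width greater than 1 are sensitive to word order, which is what makes them useful for near-duplicate detection.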
