Text Mining

Motivation for Text Mining
• Approximately 90% of the world's data is held in unstructured formats
  - Web pages
  - Emails
  - Technical documents
  - Corporate documents
  - Books
  - Digital libraries
  - Customer complaint letters
• Growing rapidly in size and importance

Text Mining Applications
• Classification of news stories, web pages, …, according to their content
• Email and news filtering
• Organize repositories of document-related meta-information for search and retrieval (search engines)
• Clustering documents or web pages
• Gain insights about trends, relations between people, places and/or organizations
• Find associations among entities such as:
  - Author = Wilson ⇒ Author = Holmes
  - Supervisor = William ⇒ Examiner = Ferdinand

Personalizing an Online Newspaper
• Politics
• Economic
--------------
• UK
• World
--------------
• Sport
• Entertainment
Challenges
• Information is in unstructured textual form
• Large textual database: almost all publications are also in electronic form
• Very high number of possible "dimensions" (but sparse):
  - all possible word and phrase types in the language!!
• Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
• Word ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
• Noisy data
  - Example: spelling mistakes

Clustering Results Of Search Engine Queries
[Figure]

Semi-Structured Data
• Text databases are, in general, semi-structured
• Example:
  - Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
  - Unstructured: Abstract, Content

Text Mining Process
• Text preprocessing
  - Syntactic/Semantic text analysis
• Features Generation
  - Bag of words
• Features Selection
  - Simple counting
  - Statistics
• Text/Data Mining
  - Classification
  - Clustering
  - Associations
• Analyzing results
"Search" versus "Discover"

                            Search (goal-oriented)    Discover (opportunistic)
Structured Data             Data Retrieval            Data Mining
Unstructured Data (Text)    Information Retrieval     Text Mining

Handling Text Data
• Modeling semi-structured data
• Information Retrieval (IR) from unstructured documents
  - Locates relevant documents and ranks documents
  - Keyword based (Boolean matching)
  - Similarity based
• Text mining
  - Classify documents
  - Cluster documents
  - Find patterns or trends across documents

Information Retrieval (IR)
• Information retrieval problem: locating relevant documents (e.g., given a set of keywords) in a corpus of documents
• Major application: Web search engines

Structuring Textual Information
• Many methods are designed to analyze structured data
• If we can represent documents by a set of attributes we will be able to use existing data mining methods
• How to represent a document?
  - Vector based representation (referred to as "bag of words" as it is invariant to permutations)
• Use statistics to add a numerical dimension to unstructured text
  - Term frequency
  - Document frequency
  - Term proximity
  - Document length
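The keyword-based (Boolean matching) retrieval mentioned under Handling Text Data can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `boolean_match`, the toy corpus, and the choice of a pure AND query over whitespace-separated tokens are all hypothetical and do not come from the slides.

```python
def boolean_match(query_terms, documents):
    """Return ids of documents that contain every query term (Boolean AND)."""
    query = {term.lower() for term in query_terms}
    return [
        doc_id
        for doc_id, text in documents.items()
        # A document matches when the query terms are a subset of its words.
        if query <= set(text.lower().split())
    ]


# Hypothetical toy corpus, used only for illustration.
documents = {
    "d1": "digital camera with memory card",
    "d2": "print a digital photo",
    "d3": "camera strap",
}

matches = boolean_match(["digital", "camera"], documents)
```

Boolean matching only separates matching from non-matching documents; it does not rank them, which is why the slides go on to similarity-based retrieval.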
Document Representation
• A document representation aims to capture what the document is about
• One possible approach:
  - Each entry describes a document
  - Attributes describe whether or not a term appears in the document

Example:

              Terms
              Camera   Digital   Memory   Pixel   …
Document 1    1        1         0        1
Document 2    1        1         0        0
…             …        …         …        …

Document Representation
• Another approach:
  - Each entry describes a document
  - Attributes represent the frequency with which a term appears in the document

Example: Term frequency table

              Terms
              Camera   Digital   Memory   Print   …
Document 1    3        2         0        1
Document 2    0        4         0        3
…             …        …         …        …

Document Representation
• Stop word removal: many words are not informative and thus irrelevant for document representation
  - the, and, a, an, is, of, that, …
• Stemming: reducing words to their root form
  - A document may contain several occurrences of words like fish, fishes, fisher, and fishers
  - But it would not be retrieved by a query with the keyword fishing
  - Different words that share the same word stem should be represented by the stem instead of the actual word: fish

More on Document Representation
• But a term is mentioned more times in longer documents
• Therefore, use relative frequency (% of document):
  - No. of occurrences / No. of words in document

              Terms
              Camera   Digital   Memory   Print   …
Document 1    0.03     0.02      0        0.01
Document 2    0        0.004     0        0.003
…             …        …         …        …
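The three representations described in these slides (term presence, term frequency, and relative frequency) can be sketched as follows. This is a minimal illustration, not the course's implementation: the helper names, the sample sentence, and the whitespace tokenizer are hypothetical, stemming is omitted, and the relative frequency divides by the number of words left after stop-word removal, which is an assumption the slides do not state.

```python
from collections import Counter

# Stop words taken from the slide's example list.
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def represent(text, vocabulary):
    """Return binary, count, and relative-frequency vectors for one document."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)  # assumption: length is measured after stop-word removal
    binary = [1 if counts[term] > 0 else 0 for term in vocabulary]
    frequency = [counts[term] for term in vocabulary]
    relative = [counts[term] / total if total else 0.0 for term in vocabulary]
    return binary, frequency, relative


# Hypothetical document, used only for illustration.
vocabulary = ["camera", "digital", "memory", "print"]
doc = "the digital camera is a camera that can print"
binary, frequency, relative = represent(doc, vocabulary)
```

Each returned list is one row of the corresponding table: the vocabulary fixes the columns, and the document supplies the values.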
Weighting Scheme for Term Frequencies
• TF-IDF weighting: give higher weight to terms that are rare
  - TF: term frequency (increases weight of frequent terms)
  - If a term is frequent in lots of documents it does not have discriminative power
  - IDF: inverse document frequency

For a given term $w_j$ and document $d_i$:

$$TF_{ij} = \frac{n_{ij}}{|d_i|}$$

where $n_{ij}$ is the number of occurrences of $w_j$ in document $d_i$ and $|d_i|$ is the number of words in document $d_i$.

$$IDF_j = \log\frac{n}{n_j}$$

where $n$ is the number of documents and $n_j$ is the number of documents that contain $w_j$.

$$x_{ij} = TF_{ij} \cdot IDF_j$$

There is no compelling motivation for this method, but it has been shown to be superior to other methods.

Locating Relevant Documents
• Given a set of keywords
  - Use a similarity/distance measure to find similar/relevant documents
  - Rank documents by their relevance/similarity
• How to determine if two documents are similar?

Angle Based Matching
• Cosine of the angle between the vectors representing the document and the query
• Documents "in the same direction" are closely related
• Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowest

$$D(X, Y) = \cos(X, Y) = \frac{X^T Y}{|X| \cdot |Y|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}$$

[Figure: points A, B, C and D]

Distance Based Matching
• In order to retrieve documents similar to a given document we need a measure of similarity
• Euclidean distance (example of a metric distance):
  - The Euclidean distance between $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$ is defined as:

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Properties of a metric distance:
• D(X,X) = 0
• D(X,Y) = D(Y,X)
• D(X,Z) + D(Z,Y) ≥ D(X,Y)

[Figure: points A, B, C and D]
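The TF-IDF weighting and the two matching measures above can be sketched together. This is a hedged illustration rather than a reference implementation: the function names are hypothetical, the natural logarithm is assumed because the slides do not give a base, and the document length is taken as the sum of the counted terms. The first two rows of `counts` come from the term frequency table in the slides; the third row is invented so that the IDF values differ across terms.

```python
import math


def tf_idf(counts):
    """counts[i][j] = occurrences of term j in document i.

    Returns x[i][j] = TF_ij * IDF_j with TF_ij = n_ij / |d_i|
    and IDF_j = log(n / n_j).
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # n_j: number of documents that contain term j.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    weights = []
    for row in counts:
        length = sum(row)  # assumption: |d_i| is the sum of the counted terms
        weights.append([
            (row[j] / length) * math.log(n_docs / doc_freq[j])
            if length and doc_freq[j] else 0.0
            for j in range(n_terms)
        ])
    return weights


def cosine(x, y):
    """Cosine of the angle between x and y (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Rows 1-2 follow the slides' term frequency table; row 3 is hypothetical.
counts = [[3, 2, 0, 1], [0, 4, 0, 3], [1, 0, 2, 0]]
weights = tf_idf(counts)

# A document and a copy that is twice as long point in the same direction:
# the cosine is close to 1 while the Euclidean distance is clearly positive.
doubled = [2 * c for c in counts[0]]
angle_similarity = cosine(counts[0], doubled)
distance = euclidean(counts[0], doubled)
```

The last three lines show why the choice between distance and angle matters: the angle ignores document length, whereas the Euclidean distance grows with it.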
Performance Measure
• The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
• The quality of a collection can be compared by the two following measures:

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

the percentage of documents that are relevant to the query and were, in fact, retrieved

[Figure: Venn diagram over all documents, showing retrieved documents, relevant documents, and their overlap (relevant & retrieved)]

Distance vs. Angle
[Figure: points A, B, C and D]

Text Mining
• Document classification
• Document clustering
• Key-word based association rules

Document Classification
• Human experts classify a set of documents
  - training data set
• Induce a classification model

              Terms                               Class
              Oil     Iraq    build   France  …   Interesting/Not interesting
Document 1    0.01    0.05    0.03    0           Interesting
Document 2    0       0.05    0       0.01        Not interesting
…             …       …       …       …           …
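The precision and recall measures from the Performance Measure slide can be computed directly from two sets of document ids. The sketch below is illustrative only: the function name and the document ids are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both relevant and retrieved
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents retrieved, three documents relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so the two measures differ even though they share a numerator.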