

  1. Information Retrieval

  2. What Is Clustering?
     • Group data into clusters
       – Objects are similar to one another within the same cluster
       – Objects are dissimilar to the objects in other clusters
       – Unsupervised learning: no predefined classes
     (Figure: Cluster 1 and Cluster 2, with outliers)
     CMPT 354: Database I -- Information Retrieval

  3. Application Examples
     • A stand-alone tool: explore data distribution
     • A preprocessing step for other algorithms
     • Pattern recognition, spatial data analysis, image processing, market research, WWW, …
       – Cluster documents
       – Cluster web log data to discover groups of similar access patterns

  4. What Is Good Clustering?
     • High intra-class similarity and low inter-class similarity
       – Depending on the similarity measure
     • The ability to discover some or all of the hidden patterns

  5. Partitioning Algorithms: Basics
     • Partition n objects into k clusters
       – Optimize the chosen partitioning criterion
     • Global optimum: examine all possible partitions
       – $k^n - (k-1)^n - \dots - 1$ possible partitions, too expensive!
     • Heuristic methods: k-means and k-medoids
       – K-means: each cluster is represented by its center (the mean of its objects)
       – K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster

  6. K-Means: Example
     (Figure: five panels on 0 to 10 axes illustrating k-means with K = 2. Arbitrarily choose K objects as the initial cluster centers → assign each object to the most similar center → update the cluster means → reassign → update the cluster means → reassign.)

  7. K-means
     • Arbitrarily choose k objects as the initial cluster centers
     • Until no change, do
       – (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
       – Update the cluster means, i.e., calculate the mean value of the objects for each cluster
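
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `kmeans`, the use of squared Euclidean distance as the (dis)similarity measure, the `max_iter` cap, and the toy points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the nearest center
        # (assumed measure: squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change since the last pass: stop
        assignment = new_assignment
        # Update each cluster mean; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On data this well separated the two groups are recovered whichever objects are picked first; on harder data the result depends on the initial centers.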

  8. Pros and Cons of K-means
     • Relatively efficient: O(tkn)
       – n: # objects, k: # clusters, t: # iterations; normally k, t << n
     • Often terminates at a local optimum
     • Applicable only when the mean is defined
       – What about categorical data?
     • Need to specify the number of clusters in advance
     • Unable to handle noisy data and outliers
     • Unsuitable for discovering non-convex clusters

  9. Information Retrieval
     • Deals with the representation, storage, organization of, and access to information items
       – Information instead of just data
       – “Interpret” the contents of the documents
       – Rank documents according to their degree of relevance to the user query
     • The notion of relevance is at the center of information retrieval

  10. Information Retrieval History
      • Simple information retrieval functions: book tables of contents, index cards, and traditional library management systems
        – Computer-centered view: building efficient indexes
        – Human-centered view: understanding the behavior of users and their information needs
      • The Web and digital libraries

  11. Information Retrieval Systems
      • Information retrieval (IR) systems use a simpler data model than database systems
        – Information is organized as a collection of documents
        – Documents are unstructured, with no schema
      • Information retrieval locates relevant documents on the basis of user input such as keywords or example documents
        – e.g., find documents containing the words “database systems”
      • Can be used even on textual descriptions provided with non-textual data such as images
      • Web search engines are the most familiar example of IR systems

  12. IR versus DB
      • IR systems do not deal with transactional updates (including concurrency control and recovery)
      • Database systems deal with structured data, with schemas that define the data organization
      • IR systems deal with some querying issues not generally addressed by database systems
        – Approximate searching by keywords
        – Ranking of retrieved answers by estimated degree of relevance

  13. Data and Queries

      |                                      | Structured data (relational data)                       | Unstructured data (e.g., free text, multimedia) |
      |--------------------------------------|---------------------------------------------------------|-------------------------------------------------|
      | Structured queries                   | Relational databases                                    | XML for semi-structured data                    |
      | Unstructured queries (keywords only) | IR in DB (new direction in DB research and development) | Information retrieval                           |

  14. Keyword Search
      • In full text retrieval, all the words in each document are considered to be keywords
      • Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
        – Ands are implicit, even if not explicitly specified
      • Relevance ranking is based on factors such as
        – Term frequency
          • Frequency of occurrence of the query keyword in the document
        – Inverse document frequency
          • How many documents the query keyword occurs in
          • Fewer → give more importance to the keyword
        – Hyperlinks to documents
          • More links to a document → the document is more important
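
One common way to evaluate such keyword expressions is an inverted index that maps each word to the set of documents containing it, so that and, or, and not become set operations. The sketch below assumes whitespace tokenization and three made-up documents; `build_index` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical toy collection.
docs = {
    1: "database systems and information retrieval",
    2: "information retrieval on the web",
    3: "database transactions and recovery",
}
index = build_index(docs)

both = index["database"] & index["information"]      # database AND information
either = index["database"] | index["web"]            # database OR web
without = index["information"] - index["database"]   # information AND NOT database
```

Here `both` contains only document 1, `either` contains all three documents, and `without` contains only document 2.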

  15. TF-IDF
      • Term frequency / inverse document frequency ranking
      • Let n(d) = number of terms in the document d
      • n(d, t) = number of occurrences of term t in the document d
      • Relevance of a document d to a term t:
        $$TF(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$$
        The log factor is to avoid giving excessive weight to frequent terms
      • Relevance of a document d to a query Q:
        $$r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)}$$
        where n(t) is the number of documents that contain the term t
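
A small worked sketch of the two formulas on this slide, assuming documents are already tokenized into word lists and taking n(t) to be the number of documents containing t. The function names, the toy documents, and the choice to skip query terms that occur in no document (which would otherwise divide by zero) are assumptions made for the example.

```python
import math

def tf(doc, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)) for a tokenized document."""
    return math.log(1 + doc.count(term) / len(doc))

def relevance(doc, query, docs):
    """r(d, Q) = sum over t in Q of TF(d, t) / n(t)."""
    score = 0.0
    for term in query:
        n_t = sum(1 for d in docs if term in d)  # documents containing the term
        if n_t:  # assumption: ignore terms that occur in no document
            score += tf(doc, term) / n_t
    return score

# Hypothetical toy collection.
docs = [
    "database systems store structured data".split(),
    "information retrieval systems rank documents".split(),
    "web search engines are information retrieval systems".split(),
]
query = ["database", "systems"]
scores = [relevance(d, query, docs) for d in docs]
# Document indices in decreasing order of relevance score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document ranks highest because it is the only one containing "database", a term that n(t) marks as rare; "systems" occurs in every document and so contributes little to any score.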

  16. Relevance Ranking Using Terms
      • Most systems also consider the following
        – Words that occur in the title, author list, section headings, etc. are given greater importance
        – Words whose first occurrence is late in the document are given lower importance
        – Very common words (stop words) such as “a”, “an”, “the”, “it”, etc. are eliminated
        – Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart
      • Documents are returned in decreasing order of relevance score
        – Usually only the top few documents are returned, not all
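
Two of the points above, stop-word elimination and returning only the top few documents, can be sketched as follows. The stop-word set holds only the four examples from the slide, the scores are made-up numbers, and `index_terms` and `top_k` are hypothetical helper names.

```python
import heapq

# Only the examples named on the slide; real stop-word lists are longer.
STOP_WORDS = {"a", "an", "the", "it"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

terms = index_terms("The index is a map and it speeds up the search")
# Hypothetical relevance scores for four documents.
best = top_k({"d1": 0.12, "d2": 0.40, "d3": 0.05, "d4": 0.31}, k=2)
```

`heapq.nlargest` avoids sorting the full result list when only a few documents are needed.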

  17. Similarity Based Retrieval
      • Similarity based retrieval: retrieve documents similar to a given document
        – Similarity may be defined on the basis of common words, e.g., find the k terms in the given document A with the highest TF(A, t) / n(t) and use these terms to find the relevance of other documents
      • Relevance feedback: similarity can be used to refine the answer set of a keyword query
        – The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
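
A minimal sketch of the first step, picking the k most characteristic terms of a given document. It assumes the TF definition from the TF-IDF slide, takes n(t) to be the number of documents containing t, assumes the given document is itself part of the collection (so n(t) is never zero), and breaks ties alphabetically; `top_terms` and the toy documents are hypothetical.

```python
import math

def top_terms(doc, docs, k):
    """Return the k distinct terms of doc with the highest TF(doc, t) / n(t)."""
    def weight(term):
        tf = math.log(1 + doc.count(term) / len(doc))
        n_t = sum(1 for d in docs if term in d)  # >= 1 because doc is in docs
        return tf / n_t
    # Highest weight first; equal weights are ordered alphabetically.
    return sorted(set(doc), key=lambda t: (-weight(t), t))[:k]

# Hypothetical toy collection; the first document plays the role of A.
docs = [
    "travel deals travel agency flights".split(),
    "flights and hotels for travel".split(),
    "database systems and flights".split(),
]
terms = top_terms(docs[0], docs, k=2)
```

The selected terms can then be used as a keyword query against the rest of the collection to find documents similar to A.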

  18. Vector Space Model
      • Define an n-dimensional space, where n is the number of words in the document set
      • The vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t), where t is the i-th word
      • The cosine of the angle between the vectors of two documents is used as a measure of their similarity
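
The cosine measure can be sketched as below. Document vectors are stored sparsely as `{term: weight}` dictionaries, which is an implementation choice for the example; the weights shown are made-up values standing in for TF(d, t) / n(t).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical weights.
a = {"database": 0.3, "systems": 0.1}
b = {"database": 0.6, "systems": 0.2}   # same direction as a, twice the length
c = {"travel": 0.5}                     # no terms in common with a

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
```

Vectors pointing in the same direction give a cosine of 1 (up to rounding) regardless of length, and documents with no words in common give 0.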

  19. Relevance Using Hyperlinks
      • The number of documents relevant to a query can be enormous if only term frequencies are taken into account
      • Using term frequencies makes “spamming” easy
        – E.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
      • People often look for pages from popular sites
      • Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match the given keywords
        – Problem: hard to find the actual popularity of a site

  20. Relevance Using Hyperlinks
      • Use the number of hyperlinks to a site as a measure of the popularity or prestige of the site
        – Count only one hyperlink from each site (why?)
        – The popularity measure is for the site, not for an individual page
          • But most hyperlinks are to the root of a site
          • Also, the concept of “site” is difficult to define, since a URL prefix like cs.sfu.ca contains many unrelated pages of varying popularity
      • Refinements
        – When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
          • The definition is circular
          • Set up and solve a system of simultaneous linear equations

  21. PageRank
      $$PR(a) = q + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{C(p_i)}$$
      • Simulates a user navigating randomly in the web who jumps to a random page with probability q or follows a random hyperlink with probability (1 - q)
      • C(a) is the number of outgoing links of page a
      • Page a is pointed to by pages p_1 to p_n
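
Because the definition is circular, one simple way to approximate the values is to apply the formula repeatedly until they settle. The sketch below does that with a fixed number of passes; the value q = 0.15, the iteration count, the three-page link graph, and the requirement that every page has at least one outgoing link to a page in the graph are assumptions of the example, not something the slide specifies.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate PR(a) = q + (1 - q) * sum(PR(p) / C(p)) over pages p linking to a.

    links maps each page to the list of pages it links to; every page is
    assumed to have at least one outgoing link to a page in the graph.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Contribution PR(p) / C(p) from every page p that points to this page.
            incoming = (pr[p] / len(out) for p, out in links.items() if page in out)
            new_pr[page] = q + (1 - q) * sum(incoming)
        pr = new_pr
    return pr

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this graph C ends up with the highest value because both other pages link to it, and B with the lowest because it receives only half of A's weight.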
