Linear Algebraic Models in Information Retrieval

Nathan Pruitt and Rami Awwad
December 12th, 2016
Information Retrieval in a Nutshell

Information retrieval (I.R.) is the task of finding the information relevant to a search in a database containing documents, images, articles, and other media.
A practical real-life example: finding an article or book in a library through its catalog system, or through the library's database via a search engine.
The most common examples are internet search engines such as Google and Yahoo, but I.R. is used on many other sites, wherever there is a search feature.
A Brief History of I.R. in the Digital Domain

S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text) was developed at Cornell University in the 1960s.
Its legacy includes the development of foundational I.R. models, including the vector space model.
The Vector Space Model

A text-based ranking model common to internet search engines in the early 1990s.
It works by building a t × d matrix, where t is the number of terms (think of every term in an English dictionary) and d is the number of documents in the search engine's database.
The Vector Space Model

\[
M =
\begin{array}{c|cccccc}
 & d_1 & d_2 & \cdots & d_5 & \cdots & d_{1{,}000{,}000} \\ \hline
t_1 & m_{1,1} & m_{1,2} & \cdots & m_{1,5} & \cdots & m_{1,\,1{,}000{,}000} \\
t_2 & m_{2,1} & m_{2,2} & \cdots & m_{2,5} & \cdots & m_{2,\,1{,}000{,}000} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
t_{300{,}000} & m_{300{,}000,\,1} & m_{300{,}000,\,2} & \cdots & m_{300{,}000,\,5} & \cdots & m_{300{,}000,\,1{,}000{,}000}
\end{array}
\]

Each entry m is given a weight depending on the number of times term t occurs in document d, then adjusted with an arithmetic weighting scheme.
The weights allow document-to-document and document-to-query comparisons via the angles between their column vectors.
VSM: A Simpler Example

\[
M_{\text{example}} =
\begin{array}{c|ccc}
 & \text{doc 1} & \text{doc 2} & \text{doc 3} \\ \hline
\text{internet} & 38 & 14 & 20 \\
\text{graph} & 10 & 20 & 5 \\
\text{directed} & 0 & 2 & 10
\end{array}
\qquad
\text{Query} =
\begin{array}{c|c}
\text{internet} & 1 \\
\text{graph} & 1 \\
\text{directed} & 1
\end{array}
\]

The entries are called term frequencies.
Term frequencies are processed through an arithmetic weighting scheme, because a higher term frequency does not necessarily mean a more relevant website.
The engine treats the query as a bag of words: the order of the terms is ignored.
A sketch of assembling such a matrix follows.
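As a minimal sketch (not the original authors' code), the following builds a term-frequency matrix by counting. The document strings are invented stand-ins; only the example's three vocabulary terms are counted.

```python
import numpy as np

# Hypothetical stand-in documents; only the example's three terms are counted.
vocabulary = ["internet", "graph", "directed"]
documents = [
    "internet pages form a directed graph",
    "a graph with directed edges",
    "crawling the internet",
]

def term_frequency_matrix(vocab, docs):
    """Build the t x d matrix of raw term counts (rows = terms, cols = docs)."""
    M = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        words = doc.lower().split()
        for i, term in enumerate(vocab):
            M[i, j] = words.count(term)
    return M

M = term_frequency_matrix(vocabulary, documents)
print(M)   # raw term frequencies, one column per document
```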
Length-Normalized t × d Matrix and Query Vector

\[
M^{*}_{\text{example}} =
\begin{array}{c|ccc}
\text{term} & \text{doc 1} & \text{doc 2} & \text{doc 3} \\ \hline
\text{internet} & 0.790 & 0.630 & 0.659 \\
\text{graph} & 0.612 & 0.676 & 0.487 \\
\text{directed} & 0 & 0.382 & 0.573
\end{array}
\qquad
\text{Query}^{*} =
\begin{array}{c|c}
\text{internet} & 1/\sqrt{3} \\
\text{graph} & 1/\sqrt{3} \\
\text{directed} & 1/\sqrt{3}
\end{array}
\]

After the arithmetic scheme is applied, the matrix and the query vector are length normalized: each column is divided by its Euclidean length.
This simplifies the calculation of the angles between the document vectors, and between the document vectors and the query, since every denominator in the cosine formula becomes 1.
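The slides do not name the arithmetic weighting scheme, but the sublinear scaling w = 1 + log10(tf) for tf > 0 (and 0 otherwise) reproduces the normalized values shown above, so the sketch below assumes it.

```python
import numpy as np

M_example = np.array([
    [38.0, 14.0, 20.0],   # internet
    [10.0, 20.0,  5.0],   # graph
    [ 0.0,  2.0, 10.0],   # directed
])
query = np.array([1.0, 1.0, 1.0])

def sublinear_weight(tf):
    """Assumed scheme: w = 1 + log10(tf) when tf > 0, else 0."""
    logs = np.zeros_like(tf)
    np.log10(tf, out=logs, where=tf > 0)
    return np.where(tf > 0, 1.0 + logs, 0.0)

def normalize_columns(A):
    """Divide each column by its Euclidean length."""
    return A / np.linalg.norm(A, axis=0)

M_star = normalize_columns(sublinear_weight(M_example))
q_star = query / np.linalg.norm(query)

print(np.round(M_star, 3))   # matches the normalized matrix above
```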
VSM: The "Cosine Similarity"

\[
\cos(\text{doc 1}, \text{doc 2}) =
\frac{\text{doc 1} \cdot \text{doc 2}}{\lVert \text{doc 1} \rVert \, \lVert \text{doc 2} \rVert}
= \frac{\begin{pmatrix} 0.790 \\ 0.612 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0.630 \\ 0.676 \\ 0.382 \end{pmatrix}}{1}
\approx 0.912
\]
\[
\cos(\text{doc 1}, \text{doc 3}) \approx 0.819 \qquad
\cos(\text{doc 2}, \text{doc 3}) \approx 0.963
\]
\[
\cos(\text{Query}, \text{doc 1}) \approx 0.810 \qquad
\cos(\text{Query}, \text{doc 2}) \approx 0.975 \qquad
\cos(\text{Query}, \text{doc 3}) \approx 0.993
\]

These calculations imply the following angles separate each pair of vectors:

\[
\angle(\text{doc 1}, \text{doc 2}) \approx \arccos(0.912) \cdot \frac{180^{\circ}}{\pi} \approx 24.188^{\circ}
\]
\[
\angle(\text{doc 1}, \text{doc 3}) \approx 34.985^{\circ} \qquad
\angle(\text{doc 2}, \text{doc 3}) \approx 15.530^{\circ}
\]
\[
\angle(\text{Query}, \text{doc 1}) \approx 35.901^{\circ} \qquad
\angle(\text{Query}, \text{doc 2}) \approx 12.918^{\circ} \qquad
\angle(\text{Query}, \text{doc 3}) \approx 7.006^{\circ}
\]

Ranked by cosine similarity, doc 3 is the most relevant to the query, followed by doc 2 and then doc 1; a sketch of this ranking follows.
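As a rough sketch of how an engine would use these numbers, the following ranks the three documents against the query. The normalized vectors are copied (already rounded) from the slide, so the printed values match the ones above only up to rounding.

```python
import numpy as np

# Length-normalized document vectors (columns) and query, from the slide.
M_star = np.array([
    [0.790, 0.630, 0.659],
    [0.612, 0.676, 0.487],
    [0.000, 0.382, 0.573],
])
q_star = np.full(3, 1.0 / np.sqrt(3))

def cosine(u, v):
    # The denominator is ~1 for unit vectors; kept for generality.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(q_star, M_star[:, j]) for j in range(3)]
for j in np.argsort(sims)[::-1]:   # highest similarity first
    angle = np.degrees(np.arccos(sims[j]))
    print(f"doc {j + 1}: cos = {sims[j]:.3f}, angle = {angle:.3f} deg")
# Prints doc 3 first, then doc 2, then doc 1.
```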
VSM: Visualization of Document Vectors and their Shared Angles

[Figure: cosine similarity between doc 1 and doc 2, and between doc 2 and doc 3]
[Figure: cosine similarity between doc 1 and doc 3]
VSM: Visualization of Document Vectors and their Shared Angles with the Query Vector

[Figure: cosine similarity between doc 2 and the query, and between doc 3 and the query]
[Figure: cosine similarity between doc 1 and the query]
PageRank Algorithm

Google's matrix has over 8 billion rows and columns.

[Figure: directed graph on websites 1 through 7]

This directed graph represents the link structure from which the overall rankings of the websites are computed; it defines a Markov chain.
The arrows represent links between different websites. For example, website 1 links only to website 2.
PageRank Algorithm

\[
P =
\begin{array}{c|ccccccc}
 & j_1 & j_2 & j_3 & j_4 & j_5 & j_6 & j_7 \\ \hline
i_1 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\
i_2 & 1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & 0 \\
i_3 & 0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
i_4 & 0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & 0 \\
i_5 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & 0 \\
i_6 & 0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & 0 \\
i_7 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & 1
\end{array}
\]

This matrix P shows the probabilities of movement between these websites.
Because website 1 links only to website 2, there is a 100 percent chance of that move, so the first column is 1 in row 2 and 0 elsewhere.
Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
A sketch of building P from the link lists follows.
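A minimal sketch of assembling P from the link structure. The outlink lists below are read off the columns of the matrix above; they are not given separately in the slides.

```python
import numpy as np

# Outgoing links per website, inferred from the columns of P above.
outlinks = {
    1: [2],           # website 1 links only to website 2
    2: [3, 4, 6],
    3: [2, 4],
    4: [1, 5],
    5: [2, 6],
    6: [2, 4, 5, 7],
    7: [7],           # website 7 links only to itself
}

n = len(outlinks)
P = np.zeros((n, n))
for j, targets in outlinks.items():
    for i in targets:
        # Each outgoing link from website j is followed with equal probability.
        P[i - 1, j - 1] = 1.0 / len(targets)

print(P.sum(axis=0))   # each column sums to 1: P is column-stochastic
```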
PageRank Algorithm

Notice that the entries of each column vector of the transition matrix P sum to 1.
Therefore, all column vectors of P are probability vectors, and our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties.
One of these properties states that every stochastic matrix has an eigenvalue equal to 1.
The eigenvector corresponding to the eigenvalue 1 tells us the rank of our 7 websites, or in Google's terms, the PageRank of each website; a sketch of extracting it numerically follows.
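A sketch of reading that eigenvector off numerically, continuing from the matrix P built in the previous sketch. Note that for this particular P, the eigenvector for eigenvalue 1 puts all of the probability on website 7, since its self-link makes it an absorbing state; this is precisely what the adjustment two slides ahead repairs.

```python
import numpy as np

eigvals, eigvecs = np.linalg.eig(P)          # P from the previous sketch
k = int(np.argmin(np.abs(eigvals - 1.0)))    # locate the eigenvalue 1
v = np.real(eigvecs[:, k])
pagerank = v / v.sum()                       # rescale into a probability vector
print(np.round(pagerank, 4))                 # all mass ends up on website 7
```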
PageRank Algorithm

To approach this eigenvector, we calculate the steady-state vector x_n of our 7-website chain:

\[
x_n = \begin{pmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_7 \end{pmatrix}
\]

All stochastic matrices have a steady-state vector.
Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.
PageRank Algorithm

We use this equation to compute steady-state vectors:

\[
\lim_{n \to \infty} x_n = \lim_{n \to \infty} P_k^{\,n} \, x_0
\]

Here x_0 is an initial probability vector and P_k is the adjusted transition matrix defined on the next slide.
A sketch of the resulting power iteration follows.
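A sketch of the power method this limit suggests: repeatedly apply the matrix to an initial probability vector until it stops changing. The function name, tolerance, and iteration cap are illustrative choices, not from the slides.

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    """Iterate x <- P x from the uniform vector until x stops changing."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)          # x0: equal chance of starting anywhere
    for _ in range(max_iter):
        x_next = P @ x
        if np.linalg.norm(x_next - x, 1) < tol:
            return x_next            # converged to the steady-state vector
        x = x_next
    return x
```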
Adjustment to the Transition Matrix

Google is said to use a damping factor p with a value of 0.85. We then retrieve our adjusted matrix P_k as follows. (Note that column 7 of P, belonging to the self-linking website 7, is first replaced by uniform entries 1/7, so that the chain cannot remain trapped there; this is what the printed entries imply.)

\[
P_k = 0.85
\begin{pmatrix}
0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{7} \\
1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{7} \\
0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7}
\end{pmatrix}
+ 0.15
\begin{pmatrix}
\tfrac{1}{7} & \cdots & \tfrac{1}{7} \\
\vdots & \ddots & \vdots \\
\tfrac{1}{7} & \cdots & \tfrac{1}{7}
\end{pmatrix}
\]

\[
=
\begin{pmatrix}
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.02142 & 0.14285 \\
0.87142 & 0.02142 & 0.44642 & 0.02142 & 0.44642 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.14285 \\
0.02142 & 0.30476 & 0.44642 & 0.02142 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.23392 & 0.14285
\end{pmatrix}
\]

A sketch that assembles P_k and reads off the rankings follows.
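Putting the pieces together, a sketch that forms P_k and extracts the rankings. It reuses P and steady_state from the earlier sketches, and the uniform replacement of column 7 reflects the reading of this slide noted above.

```python
import numpy as np

p = 0.85                              # Google's reported damping factor
n = 7
P_hat = P.copy()                      # P from the transition-matrix sketch
P_hat[:, 6] = 1.0 / n                 # replace website 7's column by 1/7s
P_k = p * P_hat + (1 - p) * np.ones((n, n)) / n

x = steady_state(P_k)                 # steady_state from the previous sketch
for site in np.argsort(x)[::-1]:      # best-ranked website first
    print(f"website {site + 1}: PageRank {x[site]:.4f}")
```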