CS535 Big Data, Week 2-B, 1/30/2019
Sangmi Lee Pallickara
Computer Science, Colorado State University, Spring 2019
http://www.cs.colostate.edu/~cs535

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING

FAQs
• Term project deliverable 0
  • Item 1: Your team members
  • Item 2: Tentative project titles (up to 3)
  • Submission deadline: Feb. 1
  • Via email or Canvas
• PA1
  • Hadoop and Spark installation guides are posted
  • If you would like to start your homework, please send me an email with your team information. I will assign the port range for your team.
• Quiz 1: February 4, 2019, in class

Topics of Today's Class
• Overview of Programming Assignment 1
• 3. Distributed Computing Models for Scalable Batch Computing
  • MapReduce

Programming Assignment 1
Hyperlink-Induced Topic Search (HITS)

This material is built based on
• Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632

Types of Web queries
• Yes/No queries
  • Does Chrome support the .ogv video format?
• Broad topic queries
  • Find information about "polar vortex"
• Similar-page queries
  • Find pages similar to 'https://stackoverflow.com'
Image credit: https://www.cnn.com/2019/01/30/weather/winter-weather-wednesday-wxc/index.html

Ranking algorithm to find the most "authoritative" pages
• Goal: find the small set of the most authoritative pages that are relevant to the query
• Examples of authoritative pages
  • For the topic "python"
    • https://www.python.org/
  • For information about "Colorado State University"
    • https://www.colostate.edu/
  • For images of "iPhone"
    • https://www.apple.com/iphone/

Challenge of content-based ranking
• Most useful pages do not include the keyword (that the users are looking for)
  • "computer" in the Apple page? (Captured Jan. 30, 2019)
  • How about IBM's web page? (Captured Jan. 30, 2019)
• Pages are not sufficiently descriptive
  • "health care" in the Poudre Valley Hospital page? (Captured Jan. 30, 2019)

HITS (Hypertext-Induced Topic Search)
• PageRank captures a simplistic view of a network
• Authority
  • A Web page with good, authoritative content on a specific topic
  • A Web page that is linked to by many hubs
• Hub
  • A Web page pointing to many authoritative Web pages
  • e.g. portal pages (Yahoo)
• A.K.A. Hubs and Authorities
• Jon Kleinberg 1997
• Topic search
• Automatically determines hubs/authorities
• In practice
  • Performed only on the result set (PageRank is applied on the complete set of documents)
• Developed for the IBM Clever project
• Used by Teoma (later Ask.com)

Understanding Authorities and Hubs [1/2]
• Intuitive idea: find authoritative results using link analysis
  • Not all hyperlinks are related to the conferral of authority
  • Patterns that authoritative pages have
    • Authoritative pages share considerable overlap in the sets of pages that point to them

Understanding Authorities and Hubs [2/2]
• A good hub page points to many good authoritative pages
• A good authoritative page is pointed to by many good hub pages
• Authorities and hubs have a mutual reinforcement relationship
[Figure: hubs pointing to authorities]

Calculating Authority/Hub scores [1/3]
Let there be n Web pages.
Define the n x n adjacency matrix A such that
  $A_{uv} = 1$ if there is a link from u to v; otherwise $A_{uv} = 0$.
Graph with pages P1, P2, P3, P4 (row u = linking page, column v = linked page):

        P1  P2  P3  P4
  P1     0   1   1   1
  P2     0   0   1   1
  P3     1   0   0   1
  P4     0   0   0   1

Calculating Authority/Hub scores [2/3]
Each Web page has an authority score $a_i$ and a hub score $h_i$.
We define the authority score of a page by summing up the hub scores of the pages that point to it:
  $a_i = \sum_{j=1}^{n} h_j A_{ji}$
  j: row # in the matrix
  i: column # in the matrix
This can be written concisely as
  $a = A^{T} h$

Calculating Authority/Hub scores [3/3]
Similarly, we define the hub score of a page by summing up the authority scores $a_i$ of the pages it points to:
  $h_j = \sum_{i=1}^{n} a_i A_{ji}$
  j: row # in the matrix
  i: column # in the matrix
This can be written concisely as
  $h = A a$

Hubs and Authorities
Let's start arbitrarily from $a_0 = \mathbf{1}$, $h_0 = \mathbf{1}$, where $\mathbf{1}$ is the all-one vector.
  a_0 = (1, 1, 1, 1)
  h_0 = (1, 1, 1, 1)
Repeating the updates, the sequences a_0, a_1, a_2, … and h_0, h_1, h_2, … converge (to limits x* and y*).
  a_1 = ( (1 x 0)+(1 x 0)+(1 x 1)+(1 x 0),
          (1 x 1)+(1 x 0)+(1 x 0)+(1 x 0),
          (1 x 1)+(1 x 1)+(1 x 0)+(1 x 0),
          (1 x 1)+(1 x 1)+(1 x 1)+(1 x 1) ) = (1, 1, 2, 4)
Normalize it: (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)
  a_1 = (1/8, 1/8, 1/4, 1/2) (← authority values after the first iteration)

Hubs and Authorities
Graph with pages (adjacency matrix A):

  0  1  1  1
  0  0  1  1
  1  0  0  1
  0  0  0  1

Let's start arbitrarily from a_0 = 1, h_0 = 1, where 1 is the all-one vector.
  a_0 = (1, 1, 1, 1)
  h_0 = (1, 1, 1, 1)
  a_1 = (1/8, 1/8, 1/4, 1/2)
  h_1 = ( (1/8 x 0)+(1/8 x 1)+(1/4 x 1)+(1/2 x 1),
          (1/8 x 0)+(1/8 x 0)+(1/4 x 1)+(1/2 x 1),
          (1/8 x 1)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1),
          (1/8 x 0)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1) ) = (7/8, 6/8, 5/8, 4/8)
After the normalization:
  h_1 = (7/22, 6/22, 5/22, 4/22) (← hub values after the first iteration)

Implementing Topic Search using HITS
• Step 1.
  • Construct a focused subgraph based on a query
• Step 2.
  • Iteratively calculate the authority value and hub value of each page in the subgraph

Step 1. Constructing a focused subgraph (root set)
• Generate a root set R from a text-based search engine
  • e.g. pages containing the query words

Step 2. Constructing a focused subgraph (base set)
• For each page p ∈ R
  • Add the set of all pages p points to
  • Add the set of all pages pointing to p
[Figure: the root set contained in the base set]

Step 3. Initial values

  Nodes   Hubs   Authority
  P1      1      1
  P2      1      1
  P3      1      1
  P4      1      1

Ranks
  Hub: P1=P2=P3=P4
  Authority: P1=P2=P3=P4

Step 4. After the first iteration

  Nodes   Hubs   Authority
  P1      7/22   1/8
  P2      6/22   1/8
  P3      5/22   2/8
  P4      4/22   4/8

Ranks
  Hub: P1>P2>P3>P4
  Authority: P1=P2<P3<P4

Normalization
• Original paper: normalize so that the squares of the values sum to 1
• You can instead normalize so that the values sum to 1
  • value = value/(sum of all values)
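
The steps above (root set, base set, iterative scoring with sum normalization) can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the PA1 solution: the link data reuses the four-page graph from the slides, the root set {"P3"} stands in for a search-engine result, and all function names are hypothetical.

```python
def invert_links(out_links):
    """Build the in-link map (page -> pages pointing to it) from the out-link map."""
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)
    return in_links


def build_base_set(root_set, out_links, in_links):
    """Base set = root set + pages each root page points to + pages pointing to it."""
    base = set(root_set)
    for p in root_set:
        base.update(out_links.get(p, ()))
        base.update(in_links.get(p, ()))
    return base


def normalize(scores):
    """Sum normalization: value = value / (sum of all values)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {p: s / total for p, s in scores.items()}


def hits(pages, out_links, iterations=1):
    """Run HITS on the subgraph induced by `pages`, starting from all-one scores."""
    pages = sorted(pages)
    members = set(pages)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add each page's hub score to the pages it links to.
        raw_auth = {p: 0.0 for p in pages}
        for u in pages:
            for v in out_links.get(u, ()):
                if v in members:
                    raw_auth[v] += hub[u]
        auth = normalize(raw_auth)
        # Hub update: sum the new authority scores of the pages u links to.
        raw_hub = {
            u: sum(auth[v] for v in out_links.get(u, ()) if v in members)
            for u in pages
        }
        hub = normalize(raw_hub)
    return hub, auth


# Link data taken from the slide graph; the root set is a made-up example.
out_links = {
    "P1": ["P2", "P3", "P4"],
    "P2": ["P3", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P4"],
}
root_set = {"P3"}

base = build_base_set(root_set, out_links, invert_links(out_links))
hub, auth = hits(base, out_links, iterations=1)

print(sorted(base))  # ['P1', 'P2', 'P3', 'P4']
print(auth)          # {'P1': 0.125, 'P2': 0.125, 'P3': 0.25, 'P4': 0.5}
```

With `iterations=1` the scores match the Step 4 table. Raising `iterations` repeats the two updates. Swapping `normalize` for a version that divides by the square root of the sum of squares would give the normalization used in the original paper.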