CS535 Big Data, 1/29/2020, Week 2-B, Sangmi Lee Pallickara
CS535 Big Data | Computer Science | Colorado State University
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• TP0
  • Team compositions may be adjusted
• PA1
  • Hadoop and Spark installation video clips are posted

Topics of Today's Class
• Overview of Programming Assignment 1
• 3. Distributed Computing Models for Scalable Batch Computing
  • MapReduce

Programming Assignment 1
Hyperlink-Induced Topic Search (HITS)

This material is built based on
• Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632

Types of Web queries
• Yes/No queries
  • Does Chrome support the .ogv video format?
• Broad topic queries
  • Find information about “Coronavirus”
  • Image credit: https://www.cnn.com/2020/01/22/world/wuhan-coronavirus-visual-guide-intl/index.html
• Similarity queries
  • Find a person similar to “Justin Bieber”
  • Image credit: https://www.google.com/search?source=hp&ei=tMYxXsHaFZO4tQae7ILACQ&q=similar+to+justin+bieber&oq=Similar+to+justin+&gs_l=psy-ab.3.0.0l3j0i22i30l7.546394.575419..576451...17.0..0.184.1712.34j1......0....1..gws-wiz.......0i131j0i70i249j0i10.DWTD5rf16d8

Challenge of content-based ranking for topic search
• Assume that you are looking for “computer”
• How about IBM’s web page?
• Does “computer” appear on the Apple page?
• O.K… Now, Google?
• Most useful pages do not include the keyword that users are searching for
• Pages are not sufficiently descriptive!
• Semantic mismatch
  • Search keys vs. descriptions
• Image credit: https://e360.yale.edu/features/could-massive-storm-surge-barriers-end-the-hudson-rivers-revival

Ranking algorithm to find the most “authoritative” pages for the given topic
• Goal: find the small set of the most authoritative pages that are relevant to the query
• Examples of authoritative pages
  • For the topic “python”
    • https://www.python.org/
  • For information about “Colorado State University”
    • https://www.colostate.edu/

HITS (Hypertext-Induced Topic Search)
• PageRank captures a simplistic view of a network
• Authority
  • A Web page with good, authoritative content on a specific topic
  • A Web page that is linked to by many hubs
• Hub
  • A Web page pointing to many authoritative Web pages
  • e.g. portal pages (Yahoo)

HITS (Hypertext-Induced Topic Search)
• A.K.A. Hubs and Authorities
• Jon Kleinberg 1997
• Topic search
• Automatically determines hubs/authorities
• In practice
  • Performed only on the result set (PageRank is applied on the complete set of documents)
• Developed for the IBM Clever project
• Used by Teoma (later Ask.com)

Understanding Authorities and Hubs [1/2]
• Intuitive idea to find authoritative results using link analysis:
  • Not all hyperlinks are related to the conferral of authority
  • Patterns that authoritative pages have
    • Authoritative pages share considerable overlap in the sets of pages that point to them.
[Figure: a set of hubs pointing to a set of authorities]

Understanding Authorities and Hubs [2/2]
• A good hub page points to many good authoritative pages
• A good authoritative page is pointed to by many good hub pages
• Authorities and hubs have a mutual reinforcement relationship

Calculating Authority/Hub scores [1/3]
Let there be n Web pages.
Define the n x n adjacency matrix A such that $A_{uv} = 1$ if there is a link from u to v. Otherwise $A_{uv} = 0$.
[Figure: graph with pages P1, P2, P3, P4]
For this graph, with rows and columns ordered P1, P2, P3, P4:
$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Calculating Authority/Hub scores [2/3]
Each Web page i has an authority score $a_i$ and a hub score $h_i$.
We define the authority score of page i by summing up the hub scores of the pages that point to it:
$$a_i = \sum_{j=1}^{n} h_j A_{ji}$$
(j: row # in the matrix, i: column # in the matrix)
This can be written concisely as
$$a = A^T h$$

Calculating Authority/Hub scores [3/3]
Similarly, we define the hub score of a Web page i by summing up the authority scores of the pages it points to:
$$h_i = \sum_{j=1}^{n} A_{ij} a_j$$
(i: row # in the matrix, j: column # in the matrix)
This can be written concisely as
$$h = A a$$

Hubs and Authorities
Let’s start arbitrarily from $a_0 = \mathbf{1}$, $h_0 = \mathbf{1}$, where $\mathbf{1}$ is the all-one vector.
• $a_0 = (1, 1, 1, 1)$
• $h_0 = (1, 1, 1, 1)$

Adjacency matrix A of the graph with pages P1 to P4 (rows and columns ordered P1, P2, P3, P4):
0 1 1 1
0 0 1 1
1 0 0 1
0 0 0 1

Authority update, $a_1 = A^T h_0$ (each entry uses one column of A):
a_1 = ((1 x 0)+(1 x 0)+(1 x 1)+(1 x 0),
       (1 x 1)+(1 x 0)+(1 x 0)+(1 x 0),
       (1 x 1)+(1 x 1)+(1 x 0)+(1 x 0),
       (1 x 1)+(1 x 1)+(1 x 1)+(1 x 1)) = (1, 1, 2, 4)
Normalize it: (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)
a_1 = (1/8, 1/8, 1/4, 1/2) (← authority values after the first iteration)

Hub update, $h_1 = A a_1$ (each entry uses one row of A):
h_1 = ((1/8 x 0)+(1/8 x 1)+(1/4 x 1)+(1/2 x 1),
       (1/8 x 0)+(1/8 x 0)+(1/4 x 1)+(1/2 x 1),
       (1/8 x 1)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1),
       (1/8 x 0)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1)) = (7/8, 6/8, 5/8, 4/8)
After the normalization:
h_1 = (7/22, 6/22, 5/22, 4/22) (← hub values after the first iteration)

Repeating this, the sequences $a_0, a_1, a_2, \ldots$ and $h_0, h_1, h_2, \ldots$ converge to limits $a^*$ and $h^*$ (written $x^*$ and $y^*$ in Kleinberg's paper).

Implementing Topic Search using HITS
• Step 1.
  • Construct a focused subgraph based on a query
• Step 2.
  • Iteratively calculate the authority value and hub value of each page in the subgraph

Step 1. Constructing a focused subgraph (root set)
• Generate a root set R from a text-based search engine
  • e.g. pages containing the query words
[Figure: root set]

Step 2. Constructing a focused subgraph (base set)
• For each page p ∈ R
  • Add the set of all pages p points to
  • Add the set of all pages pointing to p
[Figure: base set containing pages P1, P2, P3, P4]

Step 3. Initial values
Nodes   Hubs   Authority
P1      1      1
P2      1      1
P3      1      1
P4      1      1

Ranks
• Hub: P1=P2=P3=P4
• Authority: P1=P2=P3=P4
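
The matrix updates $a = A^T h$ and $h = A a$ can be checked against the first iteration of the four-page example. The following is a minimal sketch in Python, assuming exact fractions and the sum-to-one normalization used in that example; the function names are illustrative and not part of the course material.

```python
from fractions import Fraction

# Adjacency matrix of the four-page example: A[u][v] = 1 if page u links to page v.
A = [
    [0, 1, 1, 1],  # P1
    [0, 0, 1, 1],  # P2
    [1, 0, 0, 1],  # P3
    [0, 0, 0, 1],  # P4
]

def normalize(scores):
    """Scale the scores so that they sum to 1."""
    total = sum(scores)
    return [Fraction(s) / total for s in scores]

def authority_update(A, h):
    """a = A^T h: a_i adds the hub scores of the pages j that link to i (column i of A)."""
    n = len(A)
    return [sum(h[j] * A[j][i] for j in range(n)) for i in range(n)]

def hub_update(A, a):
    """h = A a: h_i adds the authority scores of the pages j that i links to (row i of A)."""
    n = len(A)
    return [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]

h0 = [Fraction(1)] * 4                    # initial hub values, all ones
a1 = normalize(authority_update(A, h0))   # authority values after the first iteration
h1 = normalize(hub_update(A, a1))         # hub values after the first iteration

print([str(x) for x in a1])  # ['1/8', '1/8', '1/4', '1/2']
print([str(x) for x in h1])  # ['7/22', '3/11', '5/22', '2/11']
```

The printed hub values are the reduced forms of 7/22, 6/22, 5/22 and 4/22.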
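
The three steps (root set, base set, iteration from all-one initial values) can be put together roughly as follows. This is only a sketch under stated assumptions: the link graph `OUT_LINKS`, the pages P5 and P6, the choice of root set `{"P2"}`, and the fixed iteration count are hypothetical and chosen for illustration, and the text-based search that would produce the root set is not modeled.

```python
# Hypothetical link graph: page -> pages it links to. P1..P4 follow the
# four-page example; P5 and P6 are made-up pages unrelated to the query.
OUT_LINKS = {
    "P1": ["P2", "P3", "P4"],
    "P2": ["P3", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P4"],
    "P5": ["P6"],
    "P6": [],
}

def build_base_set(root, out_links):
    """Expand the root set with pages each root page points to and pages pointing to it."""
    base = set(root)
    for p in root:
        base.update(out_links.get(p, []))                                   # p points to
        base.update(q for q, targets in out_links.items() if p in targets)  # pointing to p
    return base

def hits(pages, out_links, iterations=20):
    """Run a fixed number of HITS iterations restricted to the given pages."""
    pages = sorted(pages)
    hub = {p: 1.0 for p in pages}  # initial values: all ones
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of pages in the set that link to p.
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in out_links.get(q, []):
                if p in auth:
                    auth[p] += hub[q]
        total = sum(auth.values()) or 1.0
        auth = {p: s / total for p, s in auth.items()}
        # Hub of q: sum of the authority scores of pages in the set that q links to.
        hub = {q: sum(auth[p] for p in out_links.get(q, []) if p in auth) for q in pages}
        total = sum(hub.values()) or 1.0
        hub = {q: s / total for q, s in hub.items()}
    return hub, auth

root = {"P2"}                           # hypothetical root set from a text search
base = build_base_set(root, OUT_LINKS)
print(sorted(base))                     # ['P1', 'P2', 'P3', 'P4']

hub, auth = hits(base, OUT_LINKS)
print(sorted(auth, key=auth.get, reverse=True))  # pages ranked by authority
print(sorted(hub, key=hub.get, reverse=True))    # pages ranked by hub score
```

Restricting both updates to pages in the base set mirrors the point that HITS is performed only on the query's result set rather than on the complete set of documents.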