Announcements:
- Thank you for your course feedback!
- Watch out for the homework 2 feedback poll
- Course project: TAs will reach out with feedback
- Regrade requests for HW1: deadline Thu next week at 23:59
- Today: HW2 due / HW3 release
4/27/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547
[Figure: example graph with PageRank scores. A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; five unlabeled nodes: 1.6 each]
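Example for the three-node graph with nodes y, a, m:

$$A = 0.8\,M + 0.2\,\left[\tfrac{1}{N}\right]_{N \times N} = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$

Each entry of A is the corresponding edge weight in the figure, e.g. 0.8·½ + 0.2·⅓ = 7/15, 0.2·⅓ = 1/15, and 0.8 + 0.2·⅓ = 13/15.

Iterating r = A r:

$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix},\ \begin{bmatrix} 0.33 \\ 0.20 \\ 0.46 \end{bmatrix},\ \begin{bmatrix} 0.24 \\ 0.20 \\ 0.52 \end{bmatrix},\ \begin{bmatrix} 0.26 \\ 0.18 \\ 0.56 \end{bmatrix},\ \ldots,\ \begin{bmatrix} 7/33 \\ 5/33 \\ 21/33 \end{bmatrix}$$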
- Input: Graph $G$ and parameter $\beta$
  - Directed graph $G$ (can have spider traps and dead ends)
  - Parameter $\beta$
- Output: PageRank vector $r$
  - Set: $r_j^{(0)} = \frac{1}{N}$, $t = 1$
  - Do:
    - $\forall j:\ r_j'^{(t)} = \sum_{i \to j} \beta \, \frac{r_i^{(t-1)}}{d_i}$, where $d_i$ is the out-degree of $i$; $r_j'^{(t)} = 0$ if the in-degree of $j$ is 0
    - Now re-insert the leaked PageRank: $\forall j:\ r_j^{(t)} = r_j'^{(t)} + \frac{1 - S}{N}$, where $S = \sum_j r_j'^{(t)}$
    - $t = t + 1$
  - while $\sum_j \left| r_j^{(t)} - r_j^{(t-1)} \right| > \varepsilon$

Note: If the graph has no dead ends, then the amount of leaked PageRank is $1 - \beta$. But since we have dead ends, the amount of leaked PageRank may be larger. We have to account for it explicitly by computing $S$.
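The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pagerank`, the dictionary-based graph format, and the `max_iter` safety cap are assumptions made for the example. The test graph is the three-node y/a/m example, read off the columns of M.

```python
def pagerank(out_links, beta=0.8, eps=1e-10, max_iter=1000):
    """PageRank with explicit re-insertion of leaked mass.

    out_links: dict mapping every node to a list of its successors
               (assumed format; dead ends map to an empty list).
    """
    nodes = list(out_links)
    n = len(nodes)
    r_old = {j: 1.0 / n for j in nodes}          # r_j^(0) = 1/N
    for _ in range(max_iter):
        # r'_j = sum over in-links i -> j of beta * r_i / d_i
        r_prime = {j: 0.0 for j in nodes}
        for i, succs in out_links.items():
            if succs:                            # dead ends push nothing
                share = beta * r_old[i] / len(succs)
                for j in succs:
                    r_prime[j] += share
        # Re-insert the leaked PageRank: (1 - S) / N to every node.
        S = sum(r_prime.values())
        r_new = {j: r_prime[j] + (1.0 - S) / n for j in nodes}
        diff = sum(abs(r_new[j] - r_old[j]) for j in nodes)
        r_old = r_new
        if diff <= eps:                          # stop once the change is small
            break
    return r_old


# Three-node example: y -> {y, a}, a -> {y, m}, m -> {m}
example = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
scores = pagerank(example, beta=0.8)
```

Because the leaked mass is computed as 1 - S rather than assumed to be 1 - β, the same code also keeps the scores summing to 1 when some nodes are dead ends.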
- Measures generic popularity of a page
  - Will ignore/miss topic-specific authorities
  - Solution: Topic-Specific PageRank (next)
- Uses a single measure of importance
  - Other models of importance
  - Solution: Hubs-and-Authorities
- Susceptible to link spam
  - Artificial link topologies created in order to boost PageRank
  - Solution: TrustRank
- Instead of generic popularity, can we measure popularity within a topic?
- Goal: Evaluate Web pages not just according to their popularity, but also by how close they are to a particular topic, e.g. “sports” or “history”
- Allows search queries to be answered based on interests of the user
  - Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history, or computer security
- Random walker has a small probability of teleporting at any step
- Teleport can go to:
  - Standard PageRank: Any page with equal probability
    - To avoid dead-end and spider-trap problems
  - Topic-Specific PageRank: A topic-specific set of “relevant” pages (teleport set)
- Idea: Bias the random walk
  - When the walker teleports, she picks a page from a set S
  - S contains only pages that are relevant to the topic
    - E.g., Open Directory (DMOZ) pages for a given topic/query
  - For each teleport set S, we get a different vector r_S
- To make this work, all we need is to update the teleportation part of the PageRank formulation:

$$A_{ij} = \begin{cases} \beta M_{ij} + (1 - \beta)/|S| & \text{if } i \in S \\ \beta M_{ij} + 0 & \text{otherwise} \end{cases}$$

  - A is a stochastic matrix!
- We weighted all pages in the teleport set S equally
  - Could also assign different weights to pages!
- Compute as for regular PageRank:
  - Multiply by M, then add a vector
  - Maintains sparseness
Suppose S = {1}, β = 0.8

[Figure: the four-node example graph (nodes 1-4) with its edge and teleport weights]

Node   Iteration 0   1      2      …   stable
1      0.25          0.4    0.28       0.294
2      0.25          0.1    0.16       0.118
3      0.25          0.3    0.32       0.327
4      0.25          0.2    0.24       0.261

S           β     r1     r2     r3     r4
{1,2,3,4}   0.8   0.13   0.10   0.39   0.36
{1,2,3}     0.8   0.17   0.13   0.38   0.30
{1,2}       0.8   0.26   0.20   0.29   0.23
{1}         0.8   0.29   0.12   0.33   0.26

S     β     r1     r2     r3     r4
{1}   0.9   0.17   0.07   0.40   0.36
{1}   0.8   0.29   0.12   0.33   0.26
{1}   0.7   0.39   0.14   0.27   0.19
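The example with S = {1} and β = 0.8 can be reproduced with a short sketch. The edge list used here (1→2, 1→3, 2→1, 3→4, 4→3) is an assumption: it is not stated in the text, but it is consistent with the iteration values shown for this example. The function name `topic_specific_pagerank` and its parameters are likewise illustrative.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.8,
                            eps=1e-12, max_iter=1000):
    """Power iteration where teleports land only on pages in teleport_set.

    out_links: dict mapping every node to a list of its successors (assumed format).
    All pages in the teleport set are weighted equally.
    """
    nodes = list(out_links)
    teleport = set(teleport_set)
    r = {j: 1.0 / len(nodes) for j in nodes}
    for _ in range(max_iter):
        # "Multiply by M" (scaled by beta) ...
        r_next = {j: 0.0 for j in nodes}
        for i, succs in out_links.items():
            for j in succs:
                r_next[j] += beta * r[i] / len(succs)
        # ... "then add a vector": the remaining mass goes to the teleport set.
        # Without dead ends this mass is exactly 1 - beta. Sending dead-end
        # mass to the teleport set as well is a design choice of this sketch.
        leaked = 1.0 - sum(r_next.values())
        for j in teleport:
            r_next[j] += leaked / len(teleport)
        diff = sum(abs(r_next[j] - r[j]) for j in nodes)
        r = r_next
        if diff <= eps:
            break
    return r


# Assumed edges for the four-node example graph.
graph = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
first_step = topic_specific_pagerank(graph, {1}, beta=0.8, max_iter=1)
stable = topic_specific_pagerank(graph, {1}, beta=0.8)
```

Only the teleport vector changes between topics, so the sparse multiplication by M is shared with regular PageRank.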
- Create different PageRanks for different topics
  - The 16 DMOZ top-level categories:
    - Arts, Business, Sports,…
- Which topic ranking to use?
  - User can pick from a menu
  - Classify query into a topic
  - Can use the context of the query
    - E.g., query is launched from a web page talking about a known topic
    - History of queries, e.g., “basketball” followed by “Jordan”
  - User context, e.g., user’s bookmarks, …
Random Walk with Restarts: set S is a single node
[Figure: example graph with nodes A, B, D, E, F, G, H, I, J and unit edge weights] [Tong-Faloutsos, ‘06]
a.k.a.: Relevance, Closeness, ‘Similarity’…
- Shortest path is not good:
  - No effect of degree-1 nodes (E, F, G)!
  - Multi-faceted relationships
- Network flow is not good:
  - Does not punish long paths
- Need a method that considers:
  - Multiple connections
  - Multiple paths
  - Direct and indirect connections
  - Degree of the node
- SimRank: Random walks from a fixed node on k-partite graphs
- Setting: k-partite graph with k types of nodes
  - E.g.: Authors, Conferences, Tags
- Topic-Specific PageRank from node u: teleport set S = {u}
- Resulting scores measure similarity/proximity to node u
- Problem:
  - Must be done once for each node u
  - Only suitable for sub-Web-scale applications
Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}
[Figure: bipartite Conference-Author graph. Conferences shown: …, IJCAI, KDD, ICDM, SDM, AAAI, NIPS, …; authors shown: …, Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan, …]
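A question like this one can be answered with a random walk that always restarts at the query node. The sketch below is illustrative only: the function name `rank_by_proximity` is hypothetical, and the author-conference edges are made-up toy data (authors `a1` to `a4` are placeholders, not the authors in the figure).

```python
from collections import defaultdict


def rank_by_proximity(edges, query, candidates, beta=0.8, n_iter=200):
    """Rank candidates by their random-walk-with-restart score from `query`.

    edges: iterable of undirected (u, v) pairs, e.g. (author, conference).
    The teleport set is the single node S = {query}.
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    nodes = list(neighbors)
    r = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(n_iter):
        r_next = {n: 0.0 for n in nodes}
        for u in nodes:
            share = beta * r[u] / len(neighbors[u])
            for v in neighbors[u]:
                r_next[v] += share
        # All remaining probability mass restarts at the query node.
        r_next[query] += 1.0 - sum(r_next.values())
        r = r_next
    others = [c for c in candidates if c != query]
    return sorted(others, key=lambda c: r.get(c, 0.0), reverse=True)


# Hypothetical toy data: which placeholder author published at which conference.
toy_edges = [
    ("a1", "ICDM"), ("a1", "KDD"),
    ("a2", "ICDM"), ("a2", "KDD"),
    ("a3", "ICDM"), ("a3", "SDM"),
    ("a4", "SDM"), ("a4", "AAAI"), ("a4", "NIPS"),
]
ranking = rank_by_proximity(toy_edges, "ICDM", ["KDD", "SDM", "AAAI", "NIPS"])
```

In this toy graph, KDD shares two authors with ICDM and SDM shares one, so the walk reaches KDD more often. The whole iteration has to be repeated for every query node.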
[Figure: a Pin and a Board]
- Pins belong to Boards
[Figures: two examples, each showing an input pin and the resulting recommendations]
- Idea:
  - Every node has some importance
  - Importance gets evenly split among all edges and pushed to the neighbors
- Given a set of QUERY NODES Q, simulate a random walk
- Proximity to query node(s) Q:

[Figure: pin-board graph around the query node Q, annotated with per-node numbers (e.g. 16, 14, 9, 8, 7, 5, 1); labels shown: Yummm, Strawberries, Smoothies, Smoothie Madness!]
- Pixie:
  - Outputs top 1k pins with highest visit count

Extensions:
- Weighted edges:
  - The walk prefers to traverse certain edges:
    - Edges to pins in your local language
- Early stopping:
  - Don’t need to walk a fixed big number of steps
  - Walk until 1k-th pin has at least 20 visits
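A visit-count random walk with early stopping might look like the sketch below. It is not Pixie's actual implementation: the function name, the dictionary-based pin-board graph, the restart probability `alpha = 0.5`, uniform sampling of query pins, and the fixed seed are all assumptions made for illustration. Weighted edges are left out.

```python
import random
from collections import Counter


def pixie_random_walk(pin_to_boards, board_to_pins, query_pins,
                      n_steps=100_000, alpha=0.5,
                      top_k=1000, min_visits=20, seed=0):
    """Return up to top_k pins with the highest visit counts.

    pin_to_boards / board_to_pins: dicts of non-empty lists describing the
    bipartite pin-board graph (assumed format).
    alpha: probability of restarting at a query pin (assumed value).
    """
    rng = random.Random(seed)
    visits = Counter()
    well_visited = 0                  # pins with at least min_visits visits
    pin = rng.choice(query_pins)
    for _ in range(n_steps):
        # One step: pin -> random board -> random pin on that board.
        board = rng.choice(pin_to_boards[pin])
        pin = rng.choice(board_to_pins[board])
        visits[pin] += 1
        # Early stopping: once top_k pins have reached min_visits visits,
        # the top_k-th most visited pin has at least min_visits visits.
        if visits[pin] == min_visits:
            well_visited += 1
            if well_visited >= top_k:
                break
        if rng.random() < alpha:      # restart at a query pin
            pin = rng.choice(query_pins)
    return [p for p, _ in visits.most_common(top_k)]


# Hypothetical toy graph: p4 sits on a board that is unreachable from p1.
pin_to_boards = {"p1": ["b1"], "p2": ["b1", "b2"], "p3": ["b2"], "p4": ["b3"]}
board_to_pins = {"b1": ["p1", "p2"], "b2": ["p2", "p3"], "b3": ["p4"]}
recs = pixie_random_walk(pin_to_boards, board_to_pins, ["p1"],
                         n_steps=10_000, top_k=2, min_visits=20)
```

Counting how many pins have crossed the `min_visits` threshold avoids re-sorting the visit counts at every step just to check the stopping condition.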