
Mining of Massive Datasets (http://www.mmds.org) - PowerPoint PPT Presentation


  1. Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University. http://www.mmds.org

  2. High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction.
     Graph data: PageRank, SimRank; Network Analysis; Spam Detection.
     Infinite data: Filtering data streams; Web advertising; Queries on streams.
     Machine learning: SVM; Decision Trees; Perceptron, kNN.
     Apps: Recommender systems; Association Rules; Duplicate document detection.
     J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  3. [Hays and Efros, SIGGRAPH 2007]

  4. [Hays and Efros, SIGGRAPH 2007]

  5. [Hays and Efros, SIGGRAPH 2007] 10 nearest neighbors from a collection of 20,000 images

  6. [Hays and Efros, SIGGRAPH 2007] 10 nearest neighbors from a collection of 2 million images

  7. Many problems can be expressed as finding “similar” sets: find near-neighbors in high-dimensional space. Examples:
     - Pages with similar words (for duplicate detection, classification by topic)
     - Customers who purchased similar products
     - Products with similar customer sets
     - Images with similar features
     - Users who visited similar websites

  8. Given: high-dimensional data points $x_1, x_2, \ldots$
     - For example: an image is a long vector of pixel colors
     - And some distance function $d(x_1, x_2)$, which quantifies the “distance” between $x_1$ and $x_2$
     Goal: Find all pairs of data points $(x_i, x_j)$ that are within some distance threshold: $d(x_i, x_j) \le s$
     Note: The naïve solution would take $O(N^2)$, where $N$ is the number of data points
     MAGIC: This can be done in $O(N)$!! How?
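For reference, the naïve all-pairs search that the slide contrasts against can be sketched as below. This is only an illustration under assumptions: the Euclidean distance, the function names `euclidean` and `naive_near_pairs`, and the threshold name `s` are choices made here, not taken from the slides.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors (an assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def naive_near_pairs(points, dist, s):
    """Return all index pairs (i, j), i < j, with dist(points[i], points[j]) <= s.

    Every pair is compared once, so the work grows as N * (N - 1) / 2,
    which is the quadratic cost the slide refers to.
    """
    return [
        (i, j)
        for (i, xi), (j, xj) in combinations(enumerate(points), 2)
        if dist(xi, xj) <= s
    ]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5)]
pairs = naive_near_pairs(points, euclidean, s=1.0)
print(pairs)
```

With N points the loop performs N(N-1)/2 distance evaluations, which becomes impractical at the scale of millions of points.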

  9. Last time: Finding frequent pairs

  10. Last time: Finding frequent pairs. Further improvement: PCY
      Pass 1:
      - Count the exact frequency of each item
      - Take pairs of items {i,j}, hash them into B buckets, and count the number of pairs that hashed to each bucket
      Example: Basket 1: {1,2,3}; Pairs: {1,2} {1,3} {2,3}

  11. Last time: Finding frequent pairs. Further improvement: PCY
      Pass 1:
      - Count the exact frequency of each item
      - Take pairs of items {i,j}, hash them into B buckets, and count the number of pairs that hashed to each bucket
      Pass 2:
      - For a pair {i,j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket!
      Example: Basket 1: {1,2,3}; Pairs: {1,2} {1,3} {2,3}. Basket 2: {1,2,4}; Pairs: {1,2} {1,4} {2,4}
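The two PCY passes described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the lecture's implementation: the function name `pcy_frequent_pairs`, the parameter names `support` and `num_buckets`, and the use of Python's built-in `hash` on the pair are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def pcy_frequent_pairs(baskets, support, num_buckets):
    """Return {pair: count} for pairs occurring in at least `support` baskets."""
    # Pass 1: count each item exactly, and count pairs per hash bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # One flag per bucket: did the bucket reach the support threshold?
    frequent_bucket = [c >= support for c in bucket_counts]

    # Pass 2: count only candidate pairs, i.e. both singletons are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for pair in combinations(items, 2):
            if (pair[0] in frequent_items
                    and pair[1] in frequent_items
                    and frequent_bucket[hash(pair) % num_buckets]):
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2, 4], [1, 2], [3, 4]]
result = pcy_frequent_pairs(baskets, support=2, num_buckets=5)
print(result)
```

A truly frequent pair always lands in a frequent bucket, so the bucket filter can only discard pairs that could not be frequent anyway; the choice of hash function affects how many false candidates survive, not the final answer.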

  12. Last time: Finding frequent pairs. Further improvement: PCY
      Pass 1:
      - Count the exact frequency of each item
      - Take pairs of items {i,j}, hash them into B buckets, and count the number of pairs that hashed to each bucket
      Pass 2:
      - For a pair {i,j} to be a candidate for a frequent pair, its singletons have to be frequent and it has to hash to a frequent bucket!
      Example: Basket 1: {1,2,3}; Pairs: {1,2} {1,3} {2,3}. Basket 2: {1,2,4}; Pairs: {1,2} {1,4} {2,4}
      Previous lecture: A-Priori
      - Main idea: Candidates. Instead of keeping a count of each pair, only keep a count of candidate pairs!
      Today’s lecture: Find pairs of similar docs
      - Main idea: Candidates
      - Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket
      - Pass 2: Only compare documents that are candidates (i.e., they hashed to the same bucket)
      - Benefits: Instead of $O(N^2)$ comparisons, we need $O(N)$ comparisons to find similar documents

  13. Goal: Find near-neighbors in high-dim. space
      - We formally define “near neighbors” as points that are a “small distance” apart
      For each application, we first need to define what “distance” means
      Today: Jaccard distance/similarity
      - The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: $\operatorname{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
      - Jaccard distance: $d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
      Example: 3 in intersection, 8 in union; Jaccard similarity = 3/8, Jaccard distance = 5/8
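The Jaccard definitions translate directly into a few lines of code. The sketch below uses Python sets; the function names and the convention of returning similarity 1.0 for two empty sets are assumptions made here. The example sets are chosen so that 3 elements are shared and the union has 8 elements.

```python
def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two collections treated as sets."""
    c1, c2 = set(c1), set(c2)
    union = c1 | c2
    if not union:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(c1 & c2) / len(union)

def jaccard_distance(c1, c2):
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)

c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))  # 3 shared elements out of 8 in the union
print(jaccard_distance(c1, c2))
```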

  14. Goal: Given a large number ($N$ in the millions or billions) of documents, find “near duplicate” pairs
      Applications:
      - Mirror websites, or approximate mirrors (don’t want to show both in search results)
      - Similar news articles at many news sites (cluster articles by “same story”)
      Problems:
      - Many small pieces of one document can appear out of order in another
      - Too many documents to compare all pairs
      - Documents are so large or so many that they cannot fit in main memory

  15. 1. Shingling: Convert documents to sets
      2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
      3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
         → Candidate pairs!
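The three steps can be strung together in a small sketch. The slide only names the steps, so the details here are assumptions: character-level shingles of length `k`, min-hash functions of the form (a·x + b) mod p over CRC32 shingle ids, and an LSH step that splits each signature into `bands` bands of `rows` rows and treats documents sharing any band as candidates. All function and parameter names are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the modulus for the hash family

def shingles(text, k=5):
    """Step 1: the set of all length-k character substrings of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hash_params(n, seed=42):
    """Draw n (a, b) coefficient pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, hash_params):
    """Step 2: one minimum per hash function. Assumes a non-empty shingle set."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def lsh_candidate_pairs(signatures, bands, rows):
    """Step 3: documents whose signatures agree on every row of some band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog",
    "d3": "mining of massive datasets covers locality sensitive hashing",
}
params = make_hash_params(20)
signatures = {d: minhash_signature(shingles(t), params) for d, t in docs.items()}
print(lsh_candidate_pairs(signatures, bands=5, rows=4))
```

Identical documents always share every band and so always become a candidate pair; for merely similar documents the outcome is probabilistic and depends on the choice of `bands` and `rows`.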
