

  1. Saurav Kumar Singh Department of Computer Science & Engineering Dual Degree, 4th year

  2. Outline  Motivation  Basics  Hierarchical Structure  Parameter Generation  Query Types  Algorithm

  3. Motivation  All previous clustering algorithms are query-dependent  They are built for one query and are generally of no use for other queries  Each query needs a separate scan of the data  So the computation cost is at least O(n) per query  We therefore need a structure built from the database so that various queries can be answered without rescanning it.

  4. Basics  Grid-based method: quantizes the object space into a finite number of cells that form a grid structure, on which all clustering operations are performed  Develop a hierarchical structure over the given data and answer various queries efficiently  Every level of the hierarchy consists of cells  Answering a query is then not O(n), where n is the number of elements in the database

  5. A hierarchical structure for STING clustering

  6. continue …..  The root of the hierarchy is at level 1  A cell at level i corresponds to the union of the areas of its children at level i + 1  A cell at a higher level is partitioned to form a number of cells at the next lower level  Statistical information for each cell is calculated and stored beforehand and is used to answer queries

  7. Cell parameters  Attribute-independent parameter: n - number of objects (points) in this cell  Attribute-dependent parameters: m - mean of all attribute values in this cell; s - standard deviation of all attribute values in this cell; min - the minimum value of the attribute in this cell; max - the maximum value of the attribute in this cell; distribution - the type of distribution that the attribute values in this cell follow

  8. Parameter Generation  n, m, s, min, and max of bottom-level cells are calculated directly from the data  The distribution can either be assigned by the user or be obtained by hypothesis tests such as the χ² test  Parameters of higher-level cells are calculated from the parameters of their lower-level cells.
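As a sketch of how this bottom-up aggregation can work, the snippet below merges child-cell statistics into parent parameters using the standard pooled-mean and pooled-variance identities (E[X²] = s² + m² per child). The dict-based cell representation is an illustrative assumption, not from the slides.

```python
import math

def merge_cell_params(children):
    """Combine child-cell statistics {n, m, s, min, max} into parent params.

    The parent mean is the count-weighted mean of the child means; the parent
    variance is recovered from each child's second moment E[X^2] = s_i^2 + m_i^2.
    """
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    # pooled second moment over all points, then variance = E[X^2] - m^2
    ex2 = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n
    s = math.sqrt(max(ex2 - m * m, 0.0))  # guard against tiny negative rounding
    return {
        "n": n,
        "m": m,
        "s": s,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }
```

Merging is associative, so the same function can be applied level by level all the way to the root.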

  9. continue…..  Let n, m, s, min, max, dist be the parameters of the current cell  Let n_i, m_i, s_i, min_i, max_i and dist_i be the parameters of the corresponding lower-level cells

  10. dist for the parent cell  Set dist to the distribution type followed by most points in this cell  Then count the conflicting points in the child cells, calling the count confl: 1. If dist_i ≠ dist, m_i ≈ m and s_i ≈ s, then confl is increased by n_i; 2. If dist_i ≠ dist, and either m_i ≈ m or s_i ≈ s is not satisfied, then confl is set to n; 3. If dist_i = dist, m_i ≈ m and s_i ≈ s, then confl is increased by 0; 4. If dist_i = dist, but either m_i ≈ m or s_i ≈ s is not satisfied, then confl is set to n.

  11. continue…..  If confl/n is greater than a threshold t, set dist to NONE  Otherwise keep the original type. Example:

  12. continue…..  The parameters of the parent cell would be n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL  210 points follow the NORMAL distribution, so set dist of the parent to NORMAL  confl = 10  confl/n = 10/220 ≈ 0.045 < 0.05, so keep the original type.
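The dist rules above can be sketched as follows. The `close` predicate deciding when m_i ≈ m and s_i ≈ s is an assumption (the slides do not define the tolerance), and the cell dicts are illustrative.

```python
def parent_dist(children, m, s, t=0.05, close=None):
    """Decide the distribution type of a parent cell from its children.

    children: dicts with keys n, m, s, dist.
    m, s: the parent's already-merged mean and standard deviation.
    t: conflict threshold (0.05 in the slide example).
    close: approximate-equality test; the 10% tolerance is an assumption.
    """
    if close is None:
        close = lambda a, b: abs(a - b) <= 0.1 * max(abs(b), 1.0)
    n = sum(c["n"] for c in children)
    # dist = distribution type followed by the most points
    votes = {}
    for c in children:
        votes[c["dist"]] = votes.get(c["dist"], 0) + c["n"]
    dist = max(votes, key=votes.get)
    confl = 0
    for c in children:
        ok = close(c["m"], m) and close(c["s"], s)
        if c["dist"] == dist and ok:
            confl += 0            # rule 3: fully consistent child
        elif c["dist"] != dist and ok:
            confl += c["n"]       # rule 1: type differs, moments agree
        else:
            confl = n             # rules 2 and 4: irreconcilable conflict
            break
    return "NONE" if confl / n > t else dist
```

On the slide's example (210 NORMAL points out of 220, confl = 10), this keeps NORMAL because 10/220 ≈ 0.045 < 0.05.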

  13. Query types  The STING structure can answer various kinds of queries  When it cannot, we still have the underlying database  Even if the statistical information is not sufficient to answer a query exactly, we can still generate a possible set of answers.

  14. Common queries  Select regions that satisfy certain conditions. Select the maximal regions that have at least 100 houses per unit area, where at least 70% of the house prices are above $400K, with total area at least 100 units, with 90% confidence: SELECT REGION FROM house-map WHERE DENSITY IN (100, ∞) AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1) AND AREA (100, ∞) AND WITH CONFIDENCE 0.9

  15. continue….  Select regions and return some function of the region. Select the range of age of houses in those maximal regions that have at least 100 houses per unit area, where at least 70% of the houses have a price between $150K and $300K, with area at least 100 units, in California: SELECT RANGE(age) FROM house-map WHERE DENSITY IN (100, ∞) AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1) AND AREA (100, ∞) AND LOCATION California

  16. Algorithm  With the hierarchical structure of grid cells in hand, we can use a top-down approach to answer spatial data mining queries  For any query, we begin by examining the cells of a high-level layer  We calculate the likelihood that a cell is relevant to the query at some confidence level, using the parameters of the cell  If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead

  17. continue….  After we obtain the confidence interval, we label the cell relevant or not relevant at the specified confidence level  We proceed to the next layer, but consider only the children of the relevant cells of the upper layer  We repeat this until we reach the bottom layer  The relevant cells of the bottom layer usually hold enough statistical information to give a satisfactory answer to the query  For more accurate mining, however, we may retrieve the data corresponding to the relevant cells and process it further.
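The top-down descent described above might be sketched like this. The `relevant` and `children` callables are hypothetical stand-ins for the statistical relevance test and the grid hierarchy; only children of relevant cells are ever examined.

```python
def sting_query(top_layer, relevant, children):
    """Top-down STING query sketch.

    top_layer: cells of the starting (high-level) layer.
    relevant(cell) -> bool: relevance decided from the cell's stored statistics.
    children(cell) -> list: child cells (empty list at the bottom layer).
    Returns the relevant bottom-layer cells.
    """
    layer = [c for c in top_layer if relevant(c)]
    # descend while the current relevant cells still have children
    while layer and children(layer[0]):
        layer = [ch for c in layer for ch in children(c) if relevant(ch)]
    return layer
```

Note that a child of an irrelevant cell is pruned without ever being inspected, which is what makes the traversal cheaper than a full scan.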

  18. Finding regions  After we have all the relevant cells at the final level, we need to output the regions that satisfy the query  We can do this using breadth-first search

  19. Breadth-first search  We examine cells within a certain distance of the center of the current cell  If the average density within this small area is greater than the density specified, mark the area  Put the relevant cells just examined into a queue  Take an element from the queue and repeat the same procedure, except that only those relevant cells not examined before are enqueued  When the queue is empty, we have identified one region.
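A minimal sketch of this region-growing step, assuming the relevant bottom-layer cells are given as (row, col) grid coordinates and that 4-neighbour adjacency defines connectivity; the average-density re-check around each cell center is omitted for brevity.

```python
from collections import deque

def find_regions(relevant_cells):
    """Group relevant bottom-layer cells into connected regions via BFS."""
    remaining = set(relevant_cells)   # relevant cells not yet examined
    regions = []
    while remaining:
        start = remaining.pop()
        queue, region = deque([start]), {start}
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:   # enqueue only unexamined relevant cells
                    remaining.remove(nb)
                    region.add(nb)
                    queue.append(nb)
        regions.append(region)        # queue empty: one region identified
    return regions
```

Each relevant cell is enqueued at most once, so this pass is linear in the number of relevant cells.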

  20. Statistical Information Grid-based Algorithm 1. Determine a layer to begin with. 2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that the cell is relevant to the query. 3. From the interval calculated above, label the cell as relevant or not relevant. 4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5. 5. Go down the hierarchy by one level, and go to Step 2 for the children of the relevant cells of the higher-level layer. 6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7. 7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirements of the query. Go to Step 9. 8. Find the regions of relevant cells. Return those regions that meet the requirements of the query. Go to Step 9. 9. Stop.

  21. Time Analysis:  Step 1 takes constant time; Steps 2 and 3 require constant time per cell  The total time is therefore at most proportional to the total number of cells in the hierarchical structure  Note that the total number of cells is about 1.33K, where K is the number of cells at the bottom layer: with 4 children per cell, the layers sum to K(1 + 1/4 + 1/16 + …) ≈ 1.33K  So the overall computational complexity on the grid hierarchy is O(K)
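The 1.33K figure can be checked with a short computation, assuming each cell has 4 children so that layer sizes form the geometric series K, K/4, K/16, …:

```python
def total_cells(K, branching=4):
    """Total cells in the hierarchy: K + K/b + K/b^2 + ... down to the root.

    For branching b = 4 this converges to K * b/(b-1) = 1.33K.
    """
    total, layer = 0, K
    while layer >= 1:
        total += layer
        layer //= branching   # next layer up is b times smaller
    return total
```

For bottom layers that are exact powers of 4, the count matches the closed form (4K − 1)/3.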

  22. Time Analysis:  STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of building the structure is O(n), where n is the total number of objects  After the hierarchical structure has been generated, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.

  23. Comparison

  24. CLIQUE: A Dimension-Growth Subspace Clustering Method  The first dimension-growth subspace clustering algorithm  Clustering starts in single-dimensional subspaces and moves upward to higher-dimensional subspaces  The algorithm can be viewed as an integration of density-based and grid-based clustering

  25. Informal problem statement  Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.  CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the data set.  A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.  In CLIQUE, a cluster is defined as a maximal set of connected dense units.

  26. Formal Problem Statement  Let A = {A_1, A_2, …, A_d} be a set of bounded, totally ordered domains and S = A_1 × A_2 × ⋯ × A_d a d-dimensional numerical space  We refer to A_1, …, A_d as the dimensions (attributes) of S  The input is a set of d-dimensional points V = {v_1, v_2, …, v_m}, where v_i = (v_i1, v_i2, …, v_id)  The j-th component of v_i is drawn from domain A_j.

  27. CLIQUE Working  A 2-step process  1st step: partition the d-dimensional data space  2nd step: generate a minimal description of each cluster.

  28. 1 st step- Partitioning  Partitioning is done for each dimension.

  29. Example continue….

  30. continue….  The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist  This way of selecting candidates is quite similar to the Apriori-Gen procedure for generating candidate itemsets  It relies on the observation that if a unit is dense in a higher-dimensional space, its projections into lower-dimensional subspaces cannot be sparse.
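The Apriori-Gen-style candidate step can be sketched as follows, representing each dense unit as a frozenset of (dimension, interval) pairs; this representation is an illustrative assumption. Join pairs of (k-1)-dimensional dense units that agree on k-2 pairs, then prune any candidate with a non-dense (k-1)-dimensional projection.

```python
from itertools import combinations

def candidate_units(dense_prev, k):
    """Generate candidate k-dimensional units from (k-1)-dimensional dense units.

    dense_prev: set of frozensets of (dimension, interval) pairs, each over
    k-1 distinct dimensions.
    """
    candidates = set()
    for a, b in combinations(dense_prev, 2):
        merged = a | b
        # join only if the union spans exactly k distinct dimensions
        if len(merged) == k and len({dim for dim, _ in merged}) == k:
            # prune: every (k-1)-dimensional projection must itself be dense
            if all(frozenset(sub) in dense_prev
                   for sub in combinations(merged, k - 1)):
                candidates.add(frozenset(merged))
    return candidates
```

Only these candidates are then counted against the data, exactly as Apriori counts only candidate itemsets whose subsets are all frequent.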
