Clustering Data Streams



Sudipto Guha*   Nina Mishra†   Rajeev Motwani‡   Liadan O'Callaghan§

Abstract

We study clustering under the data stream model of computation where: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-Median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.

1 Introduction

A data stream is an ordered sequence of points that can be read only once or a small number of times. Formally, a data stream is a sequence of points $x_1, \ldots, x_i, \ldots, x_n$ read in increasing order of the indices $i$. The performance of an algorithm that operates on data streams is measured by the number of passes the algorithm must make over the stream, when constrained in terms of available memory, in addition to the more conventional measures. The data stream model is motivated by emerging applications involving massive data sets; e.g., customer click streams, telephone records, large sets of web pages, multimedia data, and sets of retail chain transactions can be modeled as data streams. These data sets are far too large to fit in main memory and are typically stored in secondary storage devices, making access, particularly random access, very expensive. Data stream algorithms access the input only via linear scans without random access and only require a few (hopefully, one) such scans over the data. Furthermore, since the amount of data far exceeds the amount of space (main memory) available to the algorithm, it is not possible for the algorithm to "remember" too much of the data scanned in the past. This scarcity of space necessitates the design of a novel kind of algorithm that stores only a summary of past data, leaving enough memory for the processing of future data. We remark that this is not the same as the model of online algorithms.

Clustering has recently been widely studied across several disciplines, but only a few of the techniques developed scale to support clustering of very large data sets. A common formulation of clustering is the k-Median problem: find k centers in a set of n points so as to minimize the sum of distances from data points to their closest cluster centers. Most algorithms for k-Median have large space requirements and involve random access to the input data. We give constant-factor approximation algorithms for the k-Median problem that naturally fit into this data stream setting. Our algorithms make a single pass over the data and use small space. We first give a randomized constant-factor approximation algorithm for k-Median, which makes one pass over the data using $n^{\epsilon}$ memory (for $\epsilon < 1$) and requires only $\tilde{O}(nk)$ time. We also prove that any deterministic k-Median algorithm that achieves a constant-factor approximation cannot run in time less than $\Omega(nk)$. Finally, we give a deterministic $\tilde{O}(nk)$-time, polylog(n)-approximation single-pass algorithm that uses $n^{\epsilon}$ space, for $\epsilon < 1$.

Related Work on Data Streams  One of the first results in data streams was the result of Munro and Paterson [16], where they studied the space requirement of selection and sorting as a function of the number of passes over the data. The model was formalized by Henzinger, Raghavan, and Rajagopalan [7], who gave several algorithms and complexity results re-

* Department of Computer Science, Stanford University, CA 94305. Email: sudipto@cs.stanford.edu. Research supported by IBM Research Fellowship and NSF Grant IIS-9811904.
† Hewlett Packard Laboratories, Palo Alto, CA 94304. Email: nmishra@hpl.hp.com.
‡ Department of Computer Science, Stanford University, CA 94305. Email: rajeev@cs.stanford.edu. Research supported in part by NSF Grant IIS-9811904.
§ Department of Computer Science, Stanford University, CA 94305. Email: loc@cs.stanford.edu. Research supported in part by an NSF Graduate Fellowship, ARO MURI Grant DAAH04-96-1-0007, and NSF Grant IIS-9811904.

0-7695-0850-2/00 $10.00 © 2000 IEEE
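
As an illustration of the k-Median objective stated in the Introduction (the sum of distances from data points to their closest centers), the following minimal sketch evaluates that cost for a fixed set of centers in one linear scan. It is not one of the paper's algorithms: it only scores given centers and does not choose them. The function name `kmedian_cost`, the use of Euclidean distance, and the sample points are assumptions made for the example.

```python
import math
from typing import Iterable, Sequence

Point = Sequence[float]


def kmedian_cost(stream: Iterable[Point], centers: Sequence[Point]) -> float:
    """Return the k-Median cost of `centers` over one pass of `stream`.

    Assumption: distances are Euclidean; the objective itself only
    needs some distance function between points.
    """
    total = 0.0
    for x in stream:
        # Each point is read once, in order, and never stored:
        # only the running total and the k centers are kept in memory.
        total += min(math.dist(x, c) for c in centers)
    return total


# Hypothetical sample data: two tight groups of points, one center in each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
cost = kmedian_cost(iter(points), centers)
```

Passing an iterator rather than a list mirrors the stream setting: the function cannot revisit earlier points, and its memory use depends on the number of centers rather than on the number of points. Each point is compared against all k centers, so the scan takes time proportional to nk.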
