



Wave Computing in the Cloud

Bingsheng He†, Mao Yang†, Zhenyu Guo†, Rishan Chen†‡, Wei Lin†, Bing Su†, Hongyi Wang†, Lidong Zhou†
†Microsoft Research Asia ‡Beijing University

ABSTRACT

We introduce the new Wave model for exposing the temporal relationship among the queries in data-intensive distributed computing. The model defines the notion of query series to capture the recurrent nature of batched computation on periodically updated input streams. This seemingly simple concept captures a significant portion of the queries we observed in a production system. The recurring nature of the computation on the same input stream opens up surprisingly significant opportunities for achieving better performance and higher resource utilization.

1 INTRODUCTION

Recent work on data-intensive distributed computing (e.g., MapReduce [4], Dryad [7], and Hadoop [6]) has enabled large-scale data analysis as a query to execute in parallel on a large cluster of machines, despite failures during the computation. While the emergence of high-level languages, such as Sawzall [11], Pig [9], SCOPE [3], and DryadLINQ [14], has further reduced programming complexity, the research remains largely centered on individual queries. In reality, we are facing the challenging system problem of executing a large number of potentially complicated queries on a large amount of data every day on a large-scale cluster. Questions naturally arise: is the system doing a good job of utilizing the resources fully? Is the system executing the queries in a globally optimal way? We have not yet been able to answer such basic system questions satisfactorily or even to define the system goals precisely.

Our experience with a production computing cluster shows that we are far from reaching the ideal. For example, in the cluster we investigate, we have seen significant redundancy in computation across queries; that is, the same computation is performed multiple times on the same data for different queries, resulting in wasted I/O and computation. Load imbalance is also evident over time, with periods of system overload and periods of resource under-utilization. Those can be attributed to inadequate data and resource management in the system.

Performance and resource optimization through the management of data and resources has been studied extensively in database systems and (distributed) operating systems for decades. It is natural for us to look for solutions and inspirations from those fields, as proposed by Olston et al. [1, 8]. For example, the notion of views in databases and the view matching [15] techniques are particularly effective in identifying common computations or sub-computations across queries and in allowing the results to be reused.

While leveraging the proven concepts in fields such as databases is clearly a step in the right direction, applying those concepts in the current computing environment itself is particularly challenging due to the inherent complexity and unpredictability in the system. For example, query optimization in databases hinges on a cost model. For a query in our environment, the system often has little knowledge about the data being processed; a query could use custom functions with unknown performance characteristics; a query is often complicated and contains sub-queries, resulting in computation consisting of multiple distributed steps. All those make a reliable cost model nearly impossible.

With challenges also come opportunities. We observe that log data mining has been the original motivation for such data-intensive distributed computing systems and it remains a dominant workload in such systems. We therefore introduce a new Wave model that captures the key properties of log mining. In Wave computing, we model the data not as a static file, but as a stream that is periodically updated. The stream is append-only and partitioned on multiple machines. A segment is the data from a single bulk update, e.g., the daily generated log. We further define the notion of query series to refer to recurrent computations on a stream, with each performed on one or more stream segments. Query series captures a sequence of the same computation on different sets of segments of the same stream and explicitly exposes the correlations among the queries in the query series in terms of both data and computation.

This seemingly simple notion of query series brings predictability into the system, and opens up new research opportunities by making previously unsolvable problems tractable. For example, with query series, the system knows the queries that need to be executed as data streams are updated. Query series makes the occurrence of these queries predictable. This offers flexibility in the scheduling decisions: queries in different query series might share the same I/O to scan the input data and might even share common computation. Those queries could be scheduled to run together as a single combined query by removing redundancies. Furthermore, query series makes the construction of a reliable cost model a
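
To make the notions of stream, segment, and query series described above more concrete, the following is a minimal single-machine sketch in Python. It is an illustration only and not the authors' system: the names `Stream`, `QuerySeries`, `on_segment_arrival`, the `window` parameter, and the use of simple record counts as the computation are all assumptions introduced here. The sketch shows one possible reading of shared I/O: when a new segment arrives, it is scanned once and the scan feeds every registered query series.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stream:
    """Append-only stream; each bulk update adds one segment (e.g. a daily log)."""
    name: str
    segments: List[List[str]] = field(default_factory=list)

    def append_segment(self, records: List[str]) -> None:
        self.segments.append(list(records))


@dataclass
class QuerySeries:
    """The same computation repeated over the most recent `window` segments.

    Hypothetical simplification: the computation is "count matching records".
    """
    name: str
    predicate: Callable[[str], bool]
    window: int = 1
    per_segment: List[int] = field(default_factory=list)  # one count per segment

    def latest_result(self) -> int:
        # Result of the newest query in the series: sum over its segment window.
        return sum(self.per_segment[-self.window:])


def on_segment_arrival(stream: Stream, records: List[str],
                       series: List[QuerySeries]) -> Dict[str, int]:
    """Append a segment, scan it once, and update every query series."""
    stream.append_segment(records)
    counts = [0] * len(series)
    for record in stream.segments[-1]:      # single shared scan of the new data
        for i, qs in enumerate(series):
            if qs.predicate(record):
                counts[i] += 1
    for qs, count in zip(series, counts):
        qs.per_segment.append(count)
    return {qs.name: qs.latest_result() for qs in series}


log = Stream("service_log")
errors = QuerySeries("daily_errors", lambda r: "ERROR" in r, window=1)
logins = QuerySeries("weekly_logins", lambda r: "login" in r, window=7)

on_segment_arrival(log, ["login ok", "ERROR disk", "login ok"], [errors, logins])
results = on_segment_arrival(log, ["ERROR net", "login ok"], [errors, logins])
print(results)
```

Keeping one partial count per segment is a design choice of this sketch rather than something stated in the text; it is one simple way to let consecutive queries in a series reuse work on the segments they have in common.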
