Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham
Cloud Computing
[Figure: cloud computing services: app code, server, email, database, multimedia]
Models such as Software as a Service and DAS offer many advantages:
• Better availability
• Reduced costs
• Unlimited scalability and elasticity
Hybrid Cloud
Integrates local infrastructure with public cloud resources
[Figure: a private (internal) cloud and a public (external) cloud combined into a hybrid cloud]
Extra advantages
• Flexibility to shift workload to the public cloud when the private cloud is overwhelmed (cloud bursting)
• Use of in-house resources alongside public resources
Cons
• Sensitive data exposure
• Public cloud resource allocation cost (both storage and computing)
Data & Computation Partitioning Challenge
Student (sensitive)
s_id | name    | ssn  | dept
1    | James   | 1234 | CS
2    | Charlie | 4321 | EE
3    | John    | 5645 | CS
4    | Matt    | 8743 | ECON
Q1: SELECT name, ssn FROM Student
Q2: SELECT dept, count(*) FROM Student GROUP BY dept
How to split the computation? How to partition the table?
Constraints
• Q1 contains sensitive information
• Q2 execution is more expensive
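The example can be reproduced with a small in-memory database. The sketch below uses Python's built-in `sqlite3` rather than Hive, and it assumes that `ssn` is the sensitive attribute; the helper `touches_sensitive` and the column lists passed to it are hypothetical and written by hand, not derived from the SQL.

```python
import sqlite3

# Student table from the slide.
ROWS = [
    (1, "James", "1234", "CS"),
    (2, "Charlie", "4321", "EE"),
    (3, "John", "5645", "CS"),
    (4, "Matt", "8743", "ECON"),
]
SENSITIVE_COLUMNS = {"ssn"}  # assumption: attribute-level sensitivity on ssn


def build_db():
    """Create an in-memory copy of the Student table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Student (s_id INTEGER, name TEXT, ssn TEXT, dept TEXT)"
    )
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", ROWS)
    return conn


def touches_sensitive(columns):
    """True if any column referenced by a query is marked sensitive."""
    return bool(SENSITIVE_COLUMNS & set(columns))


conn = build_db()
q1 = conn.execute("SELECT name, ssn FROM Student ORDER BY s_id").fetchall()
q2 = conn.execute(
    "SELECT dept, COUNT(*) FROM Student GROUP BY dept ORDER BY dept"
).fetchall()

q1_sensitive = touches_sensitive(["name", "ssn"])  # Q1 reads ssn
q2_sensitive = touches_sensitive(["dept"])         # Q2 reads only dept
```

Under this assumption Q1 must stay on the private side, while Q2 is a candidate for the public side provided the `dept` column alone is shipped there.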
Our Hybrid Cloud Architecture
[Figure: architecture diagram]
• Inputs: queries Q, constraints C, relations R
• Layers: User Interface Layer, Statistics Gathering Layer, Data and Query Management Layer
• Public side (Hive on Hadoop HDFS): receives R_pub, Q_pub and returns the results for Q_pub
• Private side (Hive on Hadoop HDFS): receives R, Q_priv and returns the results for Q_priv
Design Spectrum
• Data model: relational, semi-structured, key-value stores, text
• Sensitivity model: attribute level, privacy associations, view-based
• Partitioning models: workload partitioning, intra-query parallelism, dynamic workload
• Minimization priority: running time, sensitive data disclosure, monetary cost
Outline of Solution
• Notation
• Formulate the Computation Partitioning Problem (CPP)
• Solution to CPP
• Experimental results
Notation
• sens(R'): the estimated number of sensitive cells in dataset R'
• baseTables(q): the estimated minimum set of data items necessary to answer query q ∈ Q
• runT_x(q): the estimated running time of query q ∈ Q at site x (either public or private)
• ORunT(Q', Q''): the overall execution time of the queries in Q', given that the queries in Q'' are executed on the public cloud

\[
\mathit{ORunT}(Q', Q'') = \max
\begin{cases}
\sum_{q \in Q''} \mathit{freq}(q) \times \mathit{runT}_{pub}(q) \\
\sum_{q \in Q' \setminus Q''} \mathit{freq}(q) \times \mathit{runT}_{priv}(q)
\end{cases}
\]
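As a rough illustration of the ORunT definition, the sketch below takes the maximum of the total public-side time and the total private-side time, which assumes the two sides run in parallel. The function name `orunt`, the dictionary layout, and all numbers are hypothetical.

```python
def orunt(queries, public, freq, run_pub, run_priv):
    """Overall execution time of `queries` when the subset `public`
    runs on the public cloud and the rest runs on the private cloud."""
    pub_time = sum(freq[q] * run_pub[q] for q in queries if q in public)
    priv_time = sum(freq[q] * run_priv[q] for q in queries if q not in public)
    return max(pub_time, priv_time)


# Hypothetical workload: frequencies and per-site running times in seconds.
freq = {"q1": 2, "q2": 1, "q3": 1}
run_pub = {"q1": 10, "q2": 30, "q3": 5}
run_priv = {"q1": 50, "q2": 150, "q3": 25}
workload = ["q1", "q2", "q3"]

only_q2_public = orunt(workload, {"q2"}, freq, run_pub, run_priv)
q1_q2_public = orunt(workload, {"q1", "q2"}, freq, run_pub, run_priv)
```

With these numbers, moving q1 to the public side as well lowers the overall time from 125 to 50 seconds, because the private side stops being the bottleneck.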
Detailed Hybrid Cloud Architecture
[Figure: detailed architecture diagram]
• Inputs: queries Q, constraints C, relations R, SR
• Statistics Gathering Layer: produces runT_x(q) and baseTables(q)
• Data and Query Management Layer: Monetary Cost Estimator, Computation Partitioning Module, Disclosure Risk Estimator
• Public side (Hive on Hadoop HDFS): receives R_pub, Q_pub
• Private side (Hive on Hadoop HDFS): receives R, Q_priv
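Computation Partitioning Problem (CPP)
Find a subset of the given query workload, Q_pub ⊆ Q, and a subset of the given dataset, R_pub ⊆ R, where

\[
\begin{aligned}
\text{minimize} \quad & \mathit{ORunT}(Q, Q_{pub}) \\
\text{subject to} \quad & (1)\ \mathit{store}(R_{pub}) + \sum_{q \in Q_{pub}} \mathit{freq}(q) \times \mathit{proc}(q) \le MC \\
& (2)\ \mathit{sens}(R_{pub}) \le DC \\
& (3)\ \bigcup_{q \in Q_{pub}} \mathit{baseTables}(q) \subseteq R_{pub}
\end{aligned}
\]

MC and DC are user-defined constraints.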
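For a handful of queries, the formulation can be checked by exhaustive search. The sketch below is not the solution presented in the talk; it enumerates every candidate Q_pub, takes R_pub to be the union of the base tables of Q_pub (the smallest set satisfying constraint 3), and keeps the feasible candidate with the lowest overall time. The function name, the dictionary layout, and all numbers are hypothetical.

```python
from itertools import combinations


def solve_cpp_brute_force(queries, tables, mc, dc):
    """queries: name -> {freq, run_pub, run_priv, proc, base}
    tables:  name -> {store, sens}
    Returns (overall_time, q_pub, r_pub) for the best feasible split."""
    best = None
    names = sorted(queries)
    for size in range(len(names) + 1):
        for public in combinations(names, size):
            # Constraint (3): R_pub must cover the base tables of Q_pub.
            r_pub = set()
            for q in public:
                r_pub |= queries[q]["base"]
            # Constraint (1): storage plus processing cost within MC.
            cost = sum(tables[t]["store"] for t in r_pub) + sum(
                queries[q]["freq"] * queries[q]["proc"] for q in public
            )
            # Constraint (2): sensitive cells exposed within DC.
            risk = sum(tables[t]["sens"] for t in r_pub)
            if cost > mc or risk > dc:
                continue
            pub_time = sum(
                queries[q]["freq"] * queries[q]["run_pub"] for q in public
            )
            priv_time = sum(
                queries[q]["freq"] * queries[q]["run_priv"]
                for q in names
                if q not in public
            )
            total = max(pub_time, priv_time)
            if best is None or total < best[0]:
                best = (total, set(public), r_pub)
    return best


# Hypothetical statistics for two tables and three queries.
tables = {
    "customer": {"store": 5, "sens": 30},
    "lineitem": {"store": 10, "sens": 0},
}
queries = {
    "q1": {"freq": 1, "run_pub": 10, "run_priv": 50, "proc": 2, "base": {"lineitem"}},
    "q2": {"freq": 1, "run_pub": 20, "run_priv": 100, "proc": 3, "base": {"lineitem"}},
    "q3": {"freq": 1, "run_pub": 8, "run_priv": 40, "proc": 1, "base": {"customer"}},
}

loose = solve_cpp_brute_force(queries, tables, mc=20, dc=10)
tight = solve_cpp_brute_force(queries, tables, mc=14, dc=10)
```

The search is exponential in the number of queries, so it is only useful as a reference point for small workloads.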
Metrics in CPP
• Query execution time, runT_x(q):

\[
\mathit{runT}_x(q) = \frac{\sum_{\text{operator } o \in q} \left( \mathit{inpSize}(o) + \mathit{outSize}(o) \right)}{w_x}
\]

• Monetary costs
  store(R_pub): storage monetary cost of the public cloud partition
  proc(q): processing monetary cost of a public-side query q
• Sensitive data disclosure risk, sens(R_pub): the estimated number of sensitive cells within R_pub
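A minimal sketch of the running-time estimate, assuming it is the total input and output size of a query's operators divided by the site weight w_x. The function name and the operator sizes are hypothetical; the two weights are the values reported in the experimental setting (w_pub ≈ 40 MB/sec, w_priv ≈ 8 MB/sec).

```python
def estimate_run_time(operators, weight_mb_per_sec):
    """Estimate running time in seconds.

    operators: list of (input_size_mb, output_size_mb), one pair per operator
    weight_mb_per_sec: throughput weight w_x of the site
    """
    total_mb = sum(inp + out for inp, out in operators)
    return total_mb / weight_mb_per_sec


# Hypothetical query with two operators (sizes in MB).
operators = [(1000, 200), (200, 40)]

public_estimate = estimate_run_time(operators, 40)
private_estimate = estimate_run_time(operators, 8)
```

For this query the estimate is 36 seconds on the public side and 180 seconds on the private side, a ratio that simply mirrors the ratio of the two weights.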
Solution to CPP
• CPP can be simplified to finding only Q_pub
• Dynamic programming approach: CPP(Q, MC, DC) = Q_pub
  Input: the query set Q, the monetary constraint MC, and the disclosure constraint DC
  Output: Q_pub
Example
Q = {q1, q2, q3}
If MC < 25 or DC < 20, q3 can only run on the private side:
CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)
Example
Q = {q1, q2, q3}
If q3 can run on both sides:
Case 1: What if q3 runs on the private side?
CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)
Example
Q = {q1, q2, q3}, Q^2 = {q1, q2}
Case 2: What if q3 runs on the public side?
CPP(Q, MC, DC) = MIN_TIME(CPP(Q^2, j, k) + q3), where MC - 25 ≤ j ≤ MC - 15 and DC - 20 ≤ k ≤ DC - 0
• 25 and 15: the maximum and minimum possible monetary cost of q3
• 20 and 0: the maximum and minimum possible disclosure risk of q3
Choose the minimum overall running time between Case 1 and Case 2
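The two cases can be sketched as a memoized recursion. This is a simplification, not the authors' algorithm: each query is given one fixed integer cost and risk instead of the minimum-to-maximum ranges used in the example, so shared base tables are ignored. Because the overall time is a maximum of two sums, choosing between the two sub-solutions this way is not guaranteed to be globally optimal; the sketch only shows the shape of the recurrence. All names and numbers are hypothetical, except that q3 uses the upper bounds from the example (cost 25, risk 20).

```python
from functools import lru_cache


def cpp(queries, mc, dc):
    """queries: tuple of (name, freq, run_pub, run_priv, cost, risk).
    Returns the chosen public-side query set as a frozenset."""

    def overall_time(n, public):
        # Overall time of the first n queries for a given public set.
        pub = sum(f * rp for (name, f, rp, _, _, _) in queries[:n] if name in public)
        priv = sum(
            f * rv for (name, f, _, rv, _, _) in queries[:n] if name not in public
        )
        return max(pub, priv)

    @lru_cache(maxsize=None)
    def best(n, j, k):
        # Best public set among the first n queries with budgets j and k.
        if n == 0:
            return frozenset()
        name, _, _, _, cost, risk = queries[n - 1]
        # Case 1: q_n runs on the private side.
        case1 = best(n - 1, j, k)
        if cost > j or risk > k:
            return case1  # q_n does not fit within the remaining budgets
        # Case 2: q_n runs on the public side and consumes budget.
        case2 = best(n - 1, j - cost, k - risk) | {name}
        return min(case1, case2, key=lambda s: overall_time(n, s))

    return best(len(queries), mc, dc)


workload = (
    ("q1", 1, 10, 50, 10, 0),
    ("q2", 1, 20, 100, 10, 5),
    ("q3", 1, 8, 40, 25, 20),
)

small_budget = cpp(workload, 24, 100)  # MC < 25: q3 stays private
large_budget = cpp(workload, 45, 100)  # q3 may run on either side
```

With MC = 24 the result equals the answer for {q1, q2} alone, matching the first example; with MC = 45 the recursion compares both cases for q3 and moves it to the public side.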
Experimental Setting
• Private cloud: 14 nodes, located at UTD; Pentium IV, 4 GB RAM, 290-320 GB disk space
• Public cloud: 38 nodes, located at UCI; AMD dual core, 8 GB RAM, 631 GB disk space
• Software: Hadoop 0.20.2 and Hive 0.7.1
• Dataset and statistics collection: 100 GB TPC-H data
• Query workload: 40 queries containing modified versions of TPC-H queries Q1, Q3, Q6, Q11
Experimental Setting
• Estimation of weight (w_x): running all 22 TPC-H queries on a 300 GB dataset gives w_pub ≈ 40 MB/sec and w_priv ≈ 8 MB/sec
• Resource allocation cost
  Amazon S3 pricing for storage and communication:
    Storage = $0.140/GB + PUT; Communication = $0.120/GB + GET
    PUT = $0.01 per 1,000 requests; GET = $0.01 per 10,000 requests
  Amazon EC2 and EMR pricing for processing: $0.085 + $0.015 = $0.1/hour
• Sensitivity
  Customer: c_name, c_phone, c_address attributes
  Lineitem: all attributes in 1%, 5%, or 10% of tuples
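The quoted prices translate into cost estimates along the following lines. This is a sketch only: the function names are hypothetical, the `nodes` parameter assumes the hourly rate applies per node, and the request counts and sizes in the example are made up.

```python
# Prices quoted on the slide (US dollars).
STORAGE_PER_GB = 0.140
TRANSFER_PER_GB = 0.120
PUT_PER_REQUEST = 0.01 / 1000
GET_PER_REQUEST = 0.01 / 10000
COMPUTE_PER_HOUR = 0.085 + 0.015  # EC2 plus EMR


def storage_cost(gigabytes, put_requests):
    """Cost of storing data on the public side, including PUT requests."""
    return gigabytes * STORAGE_PER_GB + put_requests * PUT_PER_REQUEST


def transfer_cost(gigabytes, get_requests):
    """Cost of moving data out of the public side, including GET requests."""
    return gigabytes * TRANSFER_PER_GB + get_requests * GET_PER_REQUEST


def processing_cost(hours, nodes=1):
    """Cost of running `nodes` machines for `hours` each (assumed per node)."""
    return hours * nodes * COMPUTE_PER_HOUR


# Hypothetical usage: 100 GB stored with 2,000 PUTs, 38 nodes for 2 hours.
example_storage = storage_cost(100, 2000)
example_processing = processing_cost(2, nodes=38)
```

In this example storage comes to roughly $14.02 and processing to roughly $7.60, which are the kinds of quantities that feed the monetary constraint MC.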
Experimental Results
Future Work
• Extend this work to enable intra-query parallelism
• Support dynamically changing (or arriving) workloads
• Extend this work to other cloud computing technologies
• Support different sensitivity models