Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, - PowerPoint PPT Presentation


  1. Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham

  2. Cloud Computing
     [Diagram labels: App Code, Server, Email, Database, Multimedia, around a central "Cloud Computing" cloud]
     Like Software as a Service, the DAS model offers many advantages:
     - Better availability
     - Reduced costs
     - Unlimited scalability and elasticity

  3. Hybrid Cloud
     Integrates local infrastructure with public cloud resources.
     [Diagram labels: Private/Internal, Public/External, Hybrid Cloud]
     Extra advantages:
     - The flexibility to shift workload to the public cloud when the private cloud is overwhelmed (cloud bursting)
     - Use of in-house resources along with public resources
     Cons:
     - Sensitive data exposure
     - Public cloud resource allocation cost (both storage and computing)

  4. Data & Computation Partitioning Challenge

     Student table (marked "Sensitive" in the diagram):

     s_id  name     ssn   dept
     1     James    1234  CS
     2     Charlie  4321  EE
     3     John     5645  CS
     4     Matt     8743  ECON

     Q1: SELECT name, ssn FROM Student
     Q2: SELECT dept, count(*) FROM Student GROUP BY dept

     How to split the computation? How to partition the table?

     Constraints:
     - Q1 contains sensitive information
     - Q2 execution is more expensive

  5. Our Hybrid Cloud Architecture
     [Diagram labels: Queries Q; Constraints C; Relations R; Results for Q_pub; Results for Q_priv; User Interface Layer; Statistics Gathering Layer; Data and Query Management Layer; R_pub, Q_pub to Hive on Hadoop HDFS (public); R, Q_priv to Hive on Hadoop HDFS (private)]

  6. Design Spectrum
     - Data model: relational, semi-structured, key-value stores, text
     - Sensitivity model: attribute level, privacy associations, view-based
     - Partitioning models: workload partitioning, intra-query parallelism, dynamic workload
     - Minimization priority: running time, sensitive data disclosure, monetary cost

  7. Outline of Solution
     - Notation
     - Formulate the Computation Partitioning Problem (CPP)
     - Solution to CPP
     - Experimental results

  8. Notation
     - sens(R'): the estimated number of sensitive cells in dataset R'
     - baseTables(q): the estimated minimum set of data items necessary to answer query q ∈ Q
     - runT_x(q): the estimated running time of query q ∈ Q at site x (either public or private)
     - ORunT(Q', Q''): the overall execution time of the queries in Q', given that the queries in Q'' are executed on the public cloud:

     $$\mathrm{ORunT}(Q', Q'') = \max\left( \sum_{q \in Q''} \mathrm{freq}(q) \times \mathrm{runT}_{pub}(q),\ \sum_{q \in Q' \setminus Q''} \mathrm{freq}(q) \times \mathrm{runT}_{priv}(q) \right)$$
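A minimal Python sketch of the ORunT definition may make the max-of-two-sums structure concrete. It assumes freq(q) is the number of times q occurs in the workload and that the public and private sides run concurrently, which is how the max is read here; the function name `orun_t` and all numbers are hypothetical.

```python
from typing import Dict, Set


def orun_t(workload: Set[str], public: Set[str],
           freq: Dict[str, int],
           run_pub: Dict[str, float],
           run_priv: Dict[str, float]) -> float:
    """ORunT(Q', Q''): overall time when `public` (Q'') runs on the public
    cloud and the rest of `workload` (Q') runs on the private cloud."""
    # Total time spent by the public side on its share of the workload.
    pub_time = sum(freq[q] * run_pub[q] for q in public)
    # Total time spent by the private side on the remaining queries.
    priv_time = sum(freq[q] * run_priv[q] for q in workload - public)
    # Assumption: both sides work concurrently, so the slower one dominates.
    return max(pub_time, priv_time)


# Hypothetical workload: frequencies and per-site running times in seconds.
freq = {"q1": 2, "q2": 1, "q3": 3}
run_pub = {"q1": 10.0, "q2": 40.0, "q3": 5.0}
run_priv = {"q1": 50.0, "q2": 200.0, "q3": 25.0}

total = orun_t({"q1", "q2", "q3"}, {"q2"}, freq, run_pub, run_priv)
print(total)  # public side: 40 s, private side: 175 s
```

Moving a query to the public side shrinks the private sum and grows the public one, so the partitioning decision amounts to balancing the two sums.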

  9. Detailed Hybrid Cloud Architecture
     [Diagram labels: Queries Q; Constraints C; Relations R; SR; Statistics Gathering Layer (runT_x(q), baseTables(q)); Data and Query Management Layer (Monetary Cost Estimator, Computation Partitioning Module, Disclosure Risk Estimator); R_pub, Q_pub to Hive on Hadoop HDFS (public); R, Q_priv to Hive on Hadoop HDFS (private)]

  10. Computation Partitioning Problem (CPP)
      Find a subset of the given query workload, Q_pub ⊆ Q, and a subset of the given dataset, R_pub ⊆ R, that

      $$\text{minimize} \quad \mathrm{ORunT}(Q, Q_{pub})$$

      subject to

      $$(1)\quad \mathrm{store}(R_{pub}) + \sum_{q \in Q_{pub}} \mathrm{freq}(q) \times \mathrm{proc}(q) \le MC$$
      $$(2)\quad \mathrm{sens}(R_{pub}) \le DC$$
      $$(3)\quad \forall q \in Q_{pub}:\ \mathrm{baseTables}(q) \subseteq R_{pub}$$

      MC and DC are user-defined constraints.
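For very small workloads the formulation can be checked by plain enumeration. The sketch below is not the authors' method: it tries every candidate Q_pub, takes R_pub to be the union of baseTables(q) over the chosen queries, and assumes that storage cost and sensitivity are additive per table. The function name `solve_cpp_bruteforce` and all inputs are hypothetical.

```python
from itertools import combinations


def solve_cpp_bruteforce(queries, freq, run_pub, run_priv, proc,
                         base_tables, store_cost, sens, mc, dc):
    """Return (Q_pub, ORunT) minimizing overall running time under MC and DC."""
    names = sorted(queries)
    best_pub, best_time = None, None
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            q_pub = set(combo)
            # Constraint (3): R_pub must cover the base tables of every
            # public query; the union is the smallest such R_pub.
            r_pub = set().union(*(base_tables[q] for q in q_pub))
            # Constraint (1): storage plus frequency-weighted processing cost.
            # Assumption: storage cost is additive over tables.
            money = (sum(store_cost[t] for t in r_pub)
                     + sum(freq[q] * proc[q] for q in q_pub))
            # Constraint (2). Assumption: sensitivity is additive over tables.
            risk = sum(sens[t] for t in r_pub)
            if money > mc or risk > dc:
                continue
            pub_time = sum(freq[q] * run_pub[q] for q in q_pub)
            priv_time = sum(freq[q] * run_priv[q] for q in set(names) - q_pub)
            overall = max(pub_time, priv_time)
            if best_time is None or overall < best_time:
                best_pub, best_time = q_pub, overall
    return best_pub, best_time


# Hypothetical inputs: q1 needs the sensitive customer table.
queries = {"q1", "q2", "q3"}
freq = {"q1": 1, "q2": 1, "q3": 1}
run_pub = {"q1": 10, "q2": 20, "q3": 30}
run_priv = {"q1": 50, "q2": 100, "q3": 150}
proc = {"q1": 1, "q2": 2, "q3": 3}
base_tables = {"q1": {"customer"}, "q2": {"lineitem"}, "q3": {"lineitem"}}
store_cost = {"customer": 5, "lineitem": 10}
sens = {"customer": 30, "lineitem": 5}

best_pub, best_time = solve_cpp_bruteforce(
    queries, freq, run_pub, run_priv, proc,
    base_tables, store_cost, sens, mc=20, dc=10)
print(sorted(best_pub), best_time)
```

With these numbers any set containing q1 violates the disclosure constraint, so q1 stays private. Enumeration is exponential in the number of queries and serves only as a reference for checking a faster solver.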

  11. Metrics in CPP
      - Query execution time, runT_x(q):

        $$\mathrm{runT}_x(q) = \frac{\sum_{\mathrm{operator} \in q} \big( \mathrm{inpSize}(\mathrm{operator}) + \mathrm{outSize}(\mathrm{operator}) \big)}{w_x}$$

      - Monetary costs
        - store(R_pub): storage monetary cost of the public cloud partition
        - proc(q): processing monetary cost of a public-side query q
      - Sensitive data disclosure risk, sens(R_pub): the estimated number of sensitive cells within R_pub
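As a rough illustration of the running-time metric, the sketch below reads it as the total input plus output size over all operators of a query, divided by the site weight w_x. That reading is an assumption; it matches a weight given as a throughput in MB/sec. The operator sizes are hypothetical, and the two weights are the values reported in the experimental setting.

```python
from typing import List, Tuple


def estimate_run_time(operators: List[Tuple[float, float]],
                      weight_mb_per_sec: float) -> float:
    """Estimate runT_x(q) in seconds.

    `operators` holds one (input_size_mb, output_size_mb) pair per operator
    in the query plan; `weight_mb_per_sec` is the site weight w_x.
    """
    total_io_mb = sum(inp + out for inp, out in operators)
    return total_io_mb / weight_mb_per_sec


# Hypothetical two-operator plan: a scan/filter followed by an aggregation.
plan = [(1000.0, 200.0), (200.0, 40.0)]

time_pub = estimate_run_time(plan, 40.0)   # w_pub of about 40 MB/sec
time_priv = estimate_run_time(plan, 8.0)   # w_priv of about 8 MB/sec
print(time_pub, time_priv)
```

Under this reading the same plan is estimated to run five times slower on the private side, simply because w_priv is one fifth of w_pub.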

  12. Solution to CPP
      - CPP can be simplified to finding only Q_pub
      - Dynamic programming approach: CPP(Q, MC, DC) = Q_pub
        - Input: query set Q, monetary constraint MC, disclosure constraint DC
        - Output: Q_pub

  13. Example
      Q = {q1, q2, q3}
      - If MC < 25 or DC < 20, q3 can only run on the private side:
        CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)

  14. Example
      Q = {q1, q2, q3}
      If q3 can run on both sides:
      - Case 1: what if q3 runs on the private side?
        CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)

  15. Example
      Q = {q1, q2, q3}, Q^2 = {q1, q2}
      - Case 2: what if q3 runs on the public side?
        CPP(Q, MC, DC) = MIN_TIME(CPP(Q^2, j, k) + q3), where MC - 25 ≤ j ≤ MC - 15 and DC - 20 ≤ k ≤ DC - 0
        (25 and 15 are the maximum and minimum possible monetary cost of q3; 20 and 0 are the maximum and minimum possible disclosure risk of q3)
      - Choose the minimum overall running time between Case 1 and Case 2
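The two cases can be sketched as a plain recursion over the queries. This is a simplified illustration, not the authors' dynamic program: it has no memoization, and it assumes each query has one fixed monetary cost and one fixed disclosure risk, whereas the slides give ranges (15 to 25 and 0 to 20 for q3). The `Query` type, the function `best_split`, and every number except q3's cost of 25 and risk of 20 are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Query(NamedTuple):
    name: str
    freq: int
    run_pub: float   # running time on the public side
    run_priv: float  # running time on the private side
    cost: float      # assumed fixed monetary cost if run publicly
    risk: float      # assumed fixed disclosure risk if run publicly


def best_split(queries: List[Query], i: int, mc_left: float, dc_left: float,
               pub_time: float = 0.0,
               priv_time: float = 0.0) -> Tuple[float, FrozenSet[str]]:
    """Return (overall running time, public query names) for queries[i:]."""
    if i == len(queries):
        return max(pub_time, priv_time), frozenset()
    q = queries[i]
    # Case 1: q runs on the private side; the budgets are unchanged.
    best = best_split(queries, i + 1, mc_left, dc_left,
                      pub_time, priv_time + q.freq * q.run_priv)
    # Case 2: q runs on the public side, if both budgets allow it.
    if q.cost <= mc_left and q.risk <= dc_left:
        t, pub = best_split(queries, i + 1, mc_left - q.cost,
                            dc_left - q.risk,
                            pub_time + q.freq * q.run_pub, priv_time)
        # Keep whichever case gives the smaller overall running time.
        if t < best[0]:
            best = (t, pub | {q.name})
    return best


workload = [
    Query("q1", 1, 10.0, 40.0, 10.0, 5.0),
    Query("q2", 1, 20.0, 80.0, 10.0, 5.0),
    Query("q3", 1, 30.0, 120.0, 25.0, 20.0),  # worst-case cost and risk of q3
]

roomy = best_split(workload, 0, mc_left=40.0, dc_left=30.0)
tight = best_split(workload, 0, mc_left=20.0, dc_left=30.0)
print(roomy, tight)
```

With MC = 20 the second call can never place q3 on the public side, which mirrors the MC < 25 situation in the example. Carrying the accumulated times through the recursion is needed because the objective is a max of two sums rather than a single additive quantity.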

  16. Experimental Setting
      - Private cloud: 14 nodes, located at UTD, Pentium IV, 4 GB RAM, 290-320 GB disk space
      - Public cloud: 38 nodes, located at UCI, AMD dual core, 8 GB RAM, 631 GB disk space
      - Hadoop 0.20.2 and Hive 0.7.1
      - Dataset and statistics collection: 100 GB TPC-H data
      - Query workload: 40 queries containing modified versions of Q1, Q3, Q6, Q11

  17. Experimental Setting
      - Estimation of weight (w_x)
        - Ran all 22 TPC-H queries on a 300 GB dataset
        - w_pub ≈ 40 MB/sec, w_priv ≈ 8 MB/sec
      - Resource allocation cost
        - Amazon S3 pricing for storage and communication
          - Storage = $0.140/GB + PUT; communication = $0.120/GB + GET
          - PUT = $0.01 per 1,000 requests; GET = $0.01 per 10,000 requests
        - Amazon EC2 and EMR pricing for processing: $0.085 + $0.015 = $0.10/hour
      - Sensitivity
        - Customer: c_name, c_phone, c_address attributes
        - Lineitem: all attributes in 1%, 5%, or 10% of tuples
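The listed prices can be turned into a small cost calculator. This is only a sketch of the arithmetic: the slide does not state billing periods or how requests are counted, so the function names, the request counts, and the data sizes below are hypothetical.

```python
# Prices as listed on the slide; billing periods are not stated there.
STORAGE_USD_PER_GB = 0.140
COMMUNICATION_USD_PER_GB = 0.120
PUT_USD_PER_REQUEST = 0.01 / 1000
GET_USD_PER_REQUEST = 0.01 / 10000
PROCESSING_USD_PER_HOUR = 0.085 + 0.015  # EC2 plus EMR


def storage_cost(size_gb: float, put_requests: int) -> float:
    """Storage = $0.140/GB plus the PUT requests used to upload the data."""
    return size_gb * STORAGE_USD_PER_GB + put_requests * PUT_USD_PER_REQUEST


def communication_cost(size_gb: float, get_requests: int) -> float:
    """Communication = $0.120/GB plus the GET requests used to read the data."""
    return size_gb * COMMUNICATION_USD_PER_GB + get_requests * GET_USD_PER_REQUEST


def processing_cost(hours: float) -> float:
    """Processing at the combined hourly rate."""
    return hours * PROCESSING_USD_PER_HOUR


# Hypothetical usage: store 100 GB with 1,000 PUTs, read back 10 GB with
# 10,000 GETs, and process for 10 hours.
print(f"storage:       ${storage_cost(100, 1000):.2f}")
print(f"communication: ${communication_cost(10, 10000):.2f}")
print(f"processing:    ${processing_cost(10):.2f}")
```

In terms of the CPP constraints, the storage figure would feed store(R_pub) and the processing figure would feed proc(q), assuming those are measured in dollars.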

  18. Experimental Results

  19. Experimental Results

  20. Future Work
      - Extend the work to enable intra-query parallelism
      - Support a dynamically changing (or arriving) workload
      - Extend this work to other cloud computing technologies
      - Support different sensitivity models
