A Cloud-native Architecture for Replicated Data Services
Hemant Saxena, Jeffery Pound
University of Waterloo, SAP Labs Waterloo
Outline
● Problem overview
● Solution overview
  ○ Kafka
  ○ Cassandra
● Evaluation
Problem overview
➢ Cloud has become the de facto standard for deploying applications
➢ However, applications designed for on-premise infrastructure find it challenging to use cloud storage efficiently, because:
  ○ On-premise, data replication provides fault tolerance (FT) and high availability (HA)
  ○ Cloud storage, however, already uses replication to provide FT and HA
  ○ This makes the application's own replication redundant, resulting in additional storage cost
Typical replicated application on-premise
[Figure: a client sends requests to a replicated application running as a replica-set]
Typical replicated application on the cloud
[Figure: a client sends requests to a replicated application (replica-set), whose replicas all write to a storage service]
- Application-level replication (replica-set)
- Storage-level replication
- Resulting in redundant replicas
- Introducing additional storage cost
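To make the redundancy concrete, here is a back-of-the-envelope sketch. If the application keeps k replicas and the storage service internally keeps r copies of every object, the physical footprint is k × r copies of the data, versus r copies if a single shared copy were stored. The values k = 3 and r = 3 are illustrative assumptions; the slides do not state the storage service's internal replication factor. The ratio is independent of r and ignores the size of any deltas, so it is an upper bound.

```python
def physical_copies(app_replicas: int, storage_copies: int) -> int:
    # Each application replica persists its own copy of the data, and the
    # storage service transparently keeps storage_copies of every object.
    return app_replicas * storage_copies


def savings_factor(app_replicas: int, storage_copies: int) -> float:
    # Compare against one shared copy on the (already fault-tolerant) storage.
    return physical_copies(app_replicas, storage_copies) / physical_copies(1, storage_copies)


k, r = 3, 3  # illustrative assumption: r is not given in the slides
print(physical_copies(k, r))  # 9
print(savings_factor(k, r))   # 3.0
```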
Problem overview
We ask the following research question: how can we easily allow applications designed for on-premise infrastructure to use cloud storage efficiently?
Naïve solution
- Keep a single replica (i.e., no application-level replication)
- Solves the problem of redundant replication
- But it is prone to node failure, and hence not highly available
Contributions of this work
➢ We show how the well-known main-delta architecture can be used to leverage cloud storage efficiently, i.e.,
  ○ ensure no redundant replication
  ○ while maintaining the fault-tolerance and availability guarantees of the applications
➢ We show that incorporating the main-delta architecture into existing on-premise applications is easy
  ○ by controlling how buffers are managed and flushed to storage
  ○ and that it is compatible with the whole spectrum of replication strategies
Quick recap of the main-delta architecture
➢ Originally designed for efficiently handling mixed read/update workloads
➢ Two parts
  ○ Static, read-only, read-optimized main
  ○ Small, write-optimized delta
  ○ Deltas are merged with the main at regular intervals
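A minimal sketch of the main-delta idea recapped above, as a toy in-memory key-value store. Writes only touch the small delta, reads consult the delta before the main, and a merge builds a new main from the old main plus the delta. Assumptions: the merge is triggered when the delta reaches a hypothetical `delta_capacity` rather than on a timer, and deletes are not modeled.

```python
from typing import Dict, Optional


class MainDeltaStore:
    """Toy key-value store with a read-optimized main and a small delta."""

    def __init__(self, delta_capacity: int = 4) -> None:
        self.main: Dict[str, str] = {}   # static, read-only between merges
        self.delta: Dict[str, str] = {}  # small, write-optimized buffer
        self.delta_capacity = delta_capacity  # hypothetical merge trigger

    def put(self, key: str, value: str) -> None:
        # All updates go to the delta; the main is never modified in place.
        self.delta[key] = value
        if len(self.delta) >= self.delta_capacity:
            self.merge()

    def get(self, key: str) -> Optional[str]:
        # The delta holds the newest values, so consult it first.
        if key in self.delta:
            return self.delta[key]
        return self.main.get(key)

    def merge(self) -> None:
        # Build a fresh main from the old main plus the delta, then reset the delta.
        new_main = dict(self.main)
        new_main.update(self.delta)
        self.main = new_main
        self.delta = {}


store = MainDeltaStore(delta_capacity=2)
store.put("a", "1")
store.put("b", "2")  # delta is full, so it is merged into the main
store.put("a", "3")
print(store.get("a"))   # 3
print(store.main["a"])  # 1
```

Building a fresh main instead of editing it in place mirrors the "static, read-only main" property; a real system would typically write the new main as immutable files.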
Solution overview
[Figure: replica-set in which each replica keeps a local delta, all sharing one main on cloud storage]
- Replicated local deltas, maintained by the application
- But a single shared main on cloud storage (which is fault-tolerant)
How to merge the deltas?
Merging Deltas to Main
➢ The key detail is how the deltas are merged into the main, such that
  ○ No data is lost from any delta
  ○ Applications have the same guarantees as in an on-premise deployment
➢ The delta-merge strategy depends on the replication strategy
  ○ A single primary node means a single delta to merge
  ○ Multiple primary nodes mean multiple deltas to merge
Classification of replication strategies
▪ Write to any, read from any (e.g., quorum)
▪ Write to primary, read from primary
▪ Write to primary, read from any
[Figure: for each strategy, a request handler routing client requests to a replica-set]
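Following the rule that a single primary means a single delta to merge and multiple primaries mean multiple deltas, the three strategies can be mapped to the set of nodes whose deltas must reach the shared main. This is a hedged sketch: the enum and `deltas_to_merge` are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import List


class Replication(Enum):
    WRITE_ANY_READ_ANY = "write to any, read from any (quorum)"
    WRITE_PRIMARY_READ_PRIMARY = "write to primary, read from primary"
    WRITE_PRIMARY_READ_ANY = "write to primary, read from any"


def deltas_to_merge(strategy: Replication, primary: str, replicas: List[str]) -> List[str]:
    """Return the nodes whose deltas must be merged into the shared main."""
    if strategy is Replication.WRITE_ANY_READ_ANY:
        # Any node may accept a write, so each delta can hold updates
        # that other nodes lack: every delta must be merged.
        return list(replicas)
    # With a single write primary, the primary's delta contains every accepted
    # update; the other replicas hold (possibly lagging) copies of it.
    return [primary]


nodes = ["n1", "n2", "n3"]
print(deltas_to_merge(Replication.WRITE_PRIMARY_READ_ANY, "n1", nodes))  # ['n1']
print(deltas_to_merge(Replication.WRITE_ANY_READ_ANY, "n1", nodes))      # ['n1', 'n2', 'n3']
```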
Case study 1: Delta merge for a single primary (Kafka)
● Idea: in-memory buffers serve as deltas, on-disk data as the main
● Only the primary merges its delta into the main; the other replicas discard their deltas when they are full
● If the primary node fails, the new primary node takes over the responsibility of merging deltas
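The single-primary case can be sketched as a tiny replicated log: every replica buffers entries in its delta, only the primary appends them to one shared main, and after a failure the new primary resumes merging from its own replicated delta. All class and method names are hypothetical. One assumption goes beyond the slide: a full follower discards only entries the primary has already merged (a check against the main's next offset), which is what keeps unmerged data from being lost on failover.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (log offset, record)


class SharedMain:
    """Single copy of the main on (assumed fault-tolerant) cloud storage."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def next_offset(self) -> int:
        return len(self.records)


class Replica:
    def __init__(self, name: str, main: SharedMain, capacity: int) -> None:
        self.name = name
        self.main = main
        self.capacity = capacity      # hypothetical delta size limit
        self.delta: List[Entry] = []  # in-memory buffer acting as the delta

    def merge_to_main(self) -> None:
        # Primary only: append the entries the main does not have yet.
        for offset, record in self.delta:
            if offset < self.main.next_offset():
                continue  # already merged by a previous primary
            if offset > self.main.next_offset():
                raise RuntimeError("gap in log: unmerged data would be lost")
            self.main.records.append(record)
        self.delta = []

    def discard_if_full(self) -> None:
        # Follower only (assumption): when full, drop just the entries
        # that the primary has already merged into the main.
        if len(self.delta) >= self.capacity:
            merged_upto = self.main.next_offset()
            self.delta = [e for e in self.delta if e[0] >= merged_upto]


class ReplicaSet:
    def __init__(self, names: List[str], capacity: int) -> None:
        self.main = SharedMain()
        self.replicas = [Replica(n, self.main, capacity) for n in names]
        self.primary = self.replicas[0]
        self.next_offset = 0

    def write(self, record: str) -> None:
        entry = (self.next_offset, record)
        self.next_offset += 1
        for replica in self.replicas:  # application-level replication of the delta
            replica.delta.append(entry)
        if len(self.primary.delta) >= self.primary.capacity:
            self.primary.merge_to_main()
        for replica in self.replicas:
            if replica is not self.primary:
                replica.discard_if_full()

    def fail_primary(self) -> None:
        # The failed node's delta is gone; a surviving replica becomes
        # primary and takes over merging from its replicated delta.
        self.replicas.remove(self.primary)
        self.primary = self.replicas[0]


rs = ReplicaSet(["a", "b", "c"], capacity=2)
for rec in ["r0", "r1", "r2"]:
    rs.write(rec)
rs.fail_primary()
rs.write("r3")
print(rs.main.records)  # ['r0', 'r1', 'r2', 'r3']
```

The main is written once no matter how many replicas exist, while "r2", which was still only in the deltas when the primary failed, survives because the new primary merges it.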
Case study 2: Delta merge for a quorum system (Cassandra)
● The memtable and SSTables can easily be leveraged as the delta and the main
● Deciding which node merges the delta is tricky:
  ○ Each node can have a different set of updates
Case study 2: Delta merge for a quorum system (Cassandra)
● Nodes flush their deltas to cloud storage
● A background compaction job combines the deltas and merges them into the main
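A hedged sketch of the quorum case: every node flushes its own memtable (delta) to shared storage, because no single node is guaranteed to have seen every update, and a background job merges all flushed deltas into one main. Conflicts are resolved last-write-wins on the write timestamp, as Cassandra does for cells; the class and function names are hypothetical, and timestamp ties simply keep the existing cell (a simplification).

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, str]   # (write timestamp, value)
Table = Dict[str, Cell]  # key -> newest cell known to this table


class CloudStorage:
    """Stand-in for the shared store holding flushed deltas and the single main."""

    def __init__(self) -> None:
        self.flushed_deltas: List[Table] = []
        self.main: Table = {}


class Node:
    def __init__(self, storage: CloudStorage) -> None:
        self.storage = storage
        self.memtable: Table = {}  # this node's local delta

    def write(self, key: str, value: str, ts: int) -> None:
        current = self.memtable.get(key)
        if current is None or ts > current[0]:
            self.memtable[key] = (ts, value)

    def flush(self) -> None:
        # Every node flushes its own delta to shared storage.
        if self.memtable:
            self.storage.flushed_deltas.append(self.memtable)
            self.memtable = {}


def compact(storage: CloudStorage) -> None:
    # Background job: fold all flushed deltas into the main,
    # keeping the cell with the highest timestamp for each key.
    merged = dict(storage.main)
    for delta in storage.flushed_deltas:
        for key, cell in delta.items():
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    storage.main = merged
    storage.flushed_deltas = []


storage = CloudStorage()
n1, n2, n3 = Node(storage), Node(storage), Node(storage)
for n in (n1, n2):  # each write reaches a quorum of 2 out of 3 nodes
    n.write("k1", "v1", ts=1)
for n in (n2, n3):
    n.write("k1", "v2", ts=2)
for n in (n1, n3):
    n.write("k2", "x", ts=3)
for n in (n1, n2, n3):
    n.flush()
compact(storage)
print(storage.main)  # {'k1': (2, 'v2'), 'k2': (3, 'x')}
```

No node's memtable alone holds the final state (n1 has a stale `k1`, n2 lacks `k2`), which is why every node flushes and the merge happens in a shared background job.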
Evaluation
● Goal: show that our cloud-native design saves storage cost while keeping performance the same
● Tested the performance of our prototype on Kafka and Cassandra
  ○ Used real cloud infrastructure: Amazon Web Services (AWS)
  ○ Tested different storage types: EBS and EFS
Evaluation
● Implementations:
  ○ md-kafka: Kafka implementation based on the main-delta architecture
  ○ kafka: vanilla Kafka
● 3x storage cost savings
  ○ With a replication factor of 3
  ○ Savings come by design
● Similar write throughput for block-based storage (EBS)
● Almost 2x throughput improvement for EFS storage, due to batching
Evaluation
● Implementations:
  ○ md-cassandra-efs: main-delta based Cassandra using EFS storage
  ○ cassandra-ebs: vanilla Cassandra using EBS
  ○ cassandra-efs: vanilla Cassandra using EFS
● Close to 2.8x storage cost savings
  ○ With a replication factor of 3
● Nearly the same throughput for all 3 types of workloads
Conclusion
➢ Existing on-premise applications (with replication), when deployed on the cloud, end up with redundant replication
➢ We proposed a main-delta based cloud-native architecture to solve this problem
  ○ Allowing storage cost savings of up to a factor of k (the application's replication factor)
➢ We show that our approach is general enough to work with the complete spectrum of replication strategies
  ○ Simplest strategy: single primary (Kafka case study)
  ○ Complex strategy: quorum-based systems (Cassandra case study)
Thank you!
Contact for any follow-up questions: Hemant Saxena (email: hemant.saxena@uwaterloo.ca)