Redefining Data Locality for Cross-Data Center Storage


  1. Redefining Data Locality for Cross-Data Center Storage. Kwangsung Oh, Ajaykrishna Raghavan, Abhishek Chandra, and Jon Weissman. Department of Computer Science and Engineering, University of Minnesota Twin Cities.

  2. Background: Private Cloud

  3. Background: Computation, Storage, Network

  4. Background [Figure: App/Server with ElastiCache, EBS, and S3]

  5. Data replication is unavoidable

  6. Questions • Where to store data? • Which datacenter: the local, a nearby, or a remote DC? • Which storage tier: a faster or a slower tier? • When and where to replicate or move data? • Which data? • There is no single answer: the answer changes with user requirements such as QoS (performance), consistency, expected workload, and cost (see the sketch below).
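The trade-off behind "no single answer" can be made concrete with a toy placement policy. The sketch below is purely illustrative and not from the talk: the Requirements fields, the thresholds, and the DC/tier names are all hypothetical assumptions. It only shows how QoS, consistency, workload, and cost jointly drive the choice of datacenter and storage tier.

```python
# Illustrative sketch only: a toy placement policy. The fields,
# thresholds, and DC/tier names are hypothetical, not from the talk.
from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_ms: float    # QoS target for reads
    strong_consistency: bool
    reads_per_hour: int      # expected workload
    budget_per_gb: float     # cost ceiling (USD/GB/month)

def place(req: Requirements) -> tuple[str, str]:
    """Return a (datacenter, storage tier) choice for one object."""
    # Tight latency targets force a memory tier; a nearby DC's memory
    # can still meet them (the key observation of this talk).
    if req.max_latency_ms < 10:
        dc = "local" if req.strong_consistency else "nearby"
        return dc, "memory"
    # Hot data earns a faster tier; cold data goes to cheap storage.
    if req.reads_per_hour > 1000 and req.budget_per_gb > 5:
        return "nearby", "memory"
    return "local", "disk" if req.reads_per_hour > 10 else "archival"

print(place(Requirements(5, False, 50, 10.0)))   # ('nearby', 'memory')
print(place(Requirements(200, False, 1, 1.0)))   # ('local', 'archival')
```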

  7. Disk-locality in datacenter computing considered irrelevant

  8. Motivation (map from http://www.datacentermap.com)

  9. Key observations • Multiple DCs are in the same region and close to each other. • By using a nearby DC, data locality can be extended. • Data can be stored in a non-local DC's storage with little or no data-locality concern. [Figure: App/Server with Memcache and Disk; Azure Storage, ElastiCache, EBS, S3]

  10. DC locations example

  11. Latency and bandwidth between DCs

Latency (ms) between DCs (row and column give the two DCs' providers within each region):

              US West        US East        Europe West           Asia Southeast
              AWS    Azure   AWS    Azure   AWS     Azure  GC     AWS    Azure
  AWS         -      3.84    -      1.97    -       17.58  16.33  -      1.84
  Azure       3.62   -       1.99   -       18.67   -      16.02  1.98   -
  GC          -      -       -      -       16.35   16.12  -      -      -

Bandwidth (MB/s) between DCs:

              US West        US East        Europe West           Asia Southeast
              AWS    Azure   AWS    Azure   AWS     Azure  GC     AWS    Azure
  AWS         -      48.75   -      48.13   -       48.38  48.63  -      48.88
  Azure       21.62  -       23.63  -       45.25   -      53.5   24.38  -
  GC          -      -       -      -       32.38   40.25  -      -      -
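The talk does not say how these numbers were collected; below is a minimal sketch of how similar measurements could be gathered, assuming latency is approximated by TCP connect time and bandwidth by timing a single HTTP download. The endpoints in the comments are hypothetical placeholders.

```python
# A minimal measurement sketch; the methodology (TCP connect time for
# latency, one timed download for bandwidth) is an assumption, not the
# talk's actual tooling.
import socket
import time
import urllib.request

def connect_latency_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median TCP connect time to a DC endpoint, in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[len(times) // 2]

def download_bandwidth_mbs(url: str) -> float:
    """Achieved throughput (MB/s) fetching one object; the timing includes
    connection setup, so use an object large enough to amortize it."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=60) as resp:
        nbytes = len(resp.read())
    return nbytes / (1024 * 1024) / (time.perf_counter() - start)

# Hypothetical endpoints, for illustration only:
# print(connect_latency_ms("s3.us-east-1.amazonaws.com"))
# print(download_bandwidth_mbs("https://example.com/100MB.bin"))
```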

  12. Data Retrieval Time (100KB)

  13. Data Retrieval Time (100KB) in US East

  14. Disk performance of AWS and Azure

  15. Various Data Sizes

  16. Summary of experiments • Accessing data in memory in a nearby DC is faster than accessing a local, slower storage tier. • Accessing data from disk (archival) storage in a nearby DC can be as fast as accessing disk (archival) storage in the local DC. • These trends hold for data sizes up to 1MB (and can extend further), which encompasses many common Internet applications. A back-of-the-envelope model follows below.
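A rough latency model helps explain these trends. The sketch below combines the measured inter-DC numbers from slide 11 (US West, AWS/Azure pair) with assumed, not measured, local disk seek time and disk bandwidth to estimate retrieval time for small objects.

```python
# Rough model using the slide 11 table's US West AWS<->Azure numbers.
# Disk and memory access times are assumed values, not measurements.
RTT_MS = 3.84            # inter-DC latency, from the slide 11 table
BW_MBS = 48.75           # inter-DC bandwidth, from the slide 11 table
DISK_SEEK_MS = 10.0      # assumed: local disk positioning time
DISK_BW_MBS = 100.0      # assumed: local disk sequential throughput
MEM_ACCESS_MS = 0.1      # assumed: in-memory lookup in the nearby DC

def nearby_memory_ms(size_kb: float) -> float:
    """Fetch from a nearby DC's memory tier: RTT + transfer + lookup."""
    return RTT_MS + size_kb / 1024 / BW_MBS * 1000 + MEM_ACCESS_MS

def local_disk_ms(size_kb: float) -> float:
    """Fetch from the local DC's disk: seek + sequential transfer."""
    return DISK_SEEK_MS + size_kb / 1024 / DISK_BW_MBS * 1000

for size_kb in (100, 1024):
    print(f"{size_kb:>5} KB: nearby memory ~{nearby_memory_ms(size_kb):5.1f} ms,"
          f" local disk ~{local_disk_ms(size_kb):5.1f} ms")
# 100 KB: ~5.9 ms from nearby memory vs ~11.0 ms from local disk;
# ~1 MB: ~24.5 ms vs ~20.2 ms, roughly the crossover the slide notes.
```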

  17. Use cases • Simpler Consistency Policy • Using a faster (memory) tier can reduce the number of replicas. • Lowering the number of replicas reduces the network traffic needed for consistency. • Hot and Cold Data • Data can be placed in memory in DC A and on disk or archival storage in DC B based on access patterns (see the sketch below). • Higher Availability • If DC A fails, the application can minimize the performance penalty by using DC B's faster storage.
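For the hot/cold use case, a minimal tiering sketch follows. The threshold, tier names, and bookkeeping are hypothetical; it only illustrates demoting cold objects from DC A's memory to DC B's disk based on a per-window access count, with the actual data migration elided.

```python
# Illustrative hot/cold tiering sketch; all names and the threshold are
# hypothetical, and the cross-DC data copy itself is elided.
from collections import defaultdict

HOT_THRESHOLD = 100                      # accesses per window to stay "hot"
access_counts: defaultdict[str, int] = defaultdict(int)
placement: dict[str, str] = {}           # object key -> current location

def record_access(key: str) -> None:
    access_counts[key] += 1

def rebalance() -> None:
    """Run once per time window: hot objects stay in DC A's memory tier,
    cold ones are demoted to DC B's disk/archival tier."""
    for key, count in access_counts.items():
        placement[key] = "dcA:memory" if count >= HOT_THRESHOLD else "dcB:disk"
    access_counts.clear()

# Example: one hot key, one cold key.
for _ in range(150):
    record_access("hot-item")
record_access("cold-item")
rebalance()
print(placement)  # {'hot-item': 'dcA:memory', 'cold-item': 'dcB:disk'}
```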

  18. Use cases • Expanding the Memory Tier • Spawning a new VM instance to add memory can be expensive. • A spawn request can also be rejected by the provider's policy even when there is no outage; a nearby DC's memory tier offers an alternative.

  19. Use cases • Competitive Pricing • Each cloud provider has a different pricing policy for its services (a cost check follows below).

The cheapest VM instance for 3.5GB of memory from each cloud provider in US East:
  AWS t2.medium (4GB, 2 cores): $0.052/hour = $37.44/month ($9.36/GB)
  Azure A2 Basic tier (3.5GB, 2 cores): $0.088/hour = $63.36/month ($18.10/GB)
  Google Cloud n1-standard-1 (3.5GB, 1 core): $0.049/hour = $35.29/month ($10.08/GB)

The cheapest VM instance for 25GB of memory from each cloud provider in US East:
  AWS r3.xlarge (30.5GB, 4 cores): $0.350/hour = $252.00/month ($8.26/GB)
  Azure D12 (28GB, 4 cores): $0.476/hour = $342.72/month ($12.24/GB)
  Google Cloud n1-highmem-8 (26GB, 4 cores): $0.226/hour = $162.72/month ($6.25/GB)
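The $/GB figures can be reproduced from the hourly prices: the monthly cost is the hourly price times 720 hours, and $/GB divides that by the instance's memory. The sketch below rechecks the table's arithmetic; prices are the talk's (from its time of publication), not current.

```python
# Reproducing the $/GB arithmetic from the table above. Instance specs
# and prices are copied from the slide, not current list prices.
HOURS_PER_MONTH = 720

instances = [  # (name, memory_gb, usd_per_hour)
    ("AWS t2.medium",     4.0,  0.052),
    ("Azure A2 Basic",    3.5,  0.088),
    ("GC n1-standard-1",  3.5,  0.049),
    ("AWS r3.xlarge",    30.5,  0.350),
    ("Azure D12",        28.0,  0.476),
    ("GC n1-highmem-8",  26.0,  0.226),
]
for name, mem_gb, hourly in instances:
    monthly = hourly * HOURS_PER_MONTH
    print(f"{name:<18} ${monthly:7.2f}/month  ${monthly / mem_gb:5.2f}/GB")
```

The printed values match the table to within a cent of rounding (e.g., $35.28/month for n1-standard-1 versus the slide's $35.29).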

  20. Web application case study (RUBiS) • An eBay-like web application (Apache + MySQL). • 1,000,000 users and 1,000,000 items (about 2GB of data). • Emulates 300 users (view, sell, bid, buy, comment, ...). • Varies where the MySQL storage is placed: • the local node's disk with the system buffer cache; • the local disk with a limited-size buffer cache; • a nearby DC node's disk; • a ramdisk (with the system buffer cache).

  21. Benchmark (RUBiS)

  22. Challenges • Infrastructure Dynamics • Cloud services do not provide consistent performance over time. • Performance throttling based on the VM instance size.

  23. Challenges • Application Dynamics • Data sizes and access patterns keep changing. • Simple Storage Abstraction • Added complexity from different storage interfaces and varied pricing policies (see the sketch below). • Discovering Nearby DCs • Network performance between DCs is not determined by physical distance. • Cloud Providers' Implementations and Policies • The same storage tier performs differently across providers. • New VM instance types and new pricing policies keep appearing. • Network cost must be considered for cost optimization.
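The "Simple Storage Abstraction" challenge is essentially an adapter problem: one get/put interface hiding provider-specific APIs. Below is a minimal sketch under that reading; the class names are hypothetical, and the in-memory tier merely stands in for a real backend such as ElastiCache or S3, whose client libraries (and pricing) would replace it.

```python
# Hypothetical sketch of a unified storage interface over multiple
# (DC, tier) backends; real provider clients would replace InMemoryTier.
from abc import ABC, abstractmethod

class StorageTier(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryTier(StorageTier):
    """Stands in for a memory tier (e.g., ElastiCache) in this sketch."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._store[key] = data
    def get(self, key: str) -> bytes:
        return self._store[key]

class CrossDCStore:
    """Routes each key to a (DC, tier) chosen by a placement policy."""
    def __init__(self, tiers: dict[str, StorageTier], policy) -> None:
        self.tiers = tiers      # e.g. {"dcA:memory": ..., "dcB:disk": ...}
        self.policy = policy    # function: key -> tier name
    def put(self, key: str, data: bytes) -> None:
        self.tiers[self.policy(key)].put(key, data)
    def get(self, key: str) -> bytes:
        return self.tiers[self.policy(key)].get(key)

store = CrossDCStore({"dcA:memory": InMemoryTier()}, lambda key: "dcA:memory")
store.put("item1", b"hello")
print(store.get("item1"))  # b'hello'
```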

  24. Conclusion • Data locality can be extended as data centers grow denser. • Accessing data in a nearby DC can be faster than using local storage tiers. • Small data can be stored in a nearby DC with little or no locality concern. • Benefits of using multiple data centers: better performance, reduced cost, better availability, and durability. • Many challenges remain to be overcome to realize these benefits.

  25. Thank you! • Questions?
