The Performance Analysis of Cache Architecture based on Alluxio over Virtualized Infrastructure
Xu Chang, Li Zha
Contents
• Background
• Related Works
• Motivation
• Experiments
• Results
• Conclusion
• Future Work
Background
• Cloud Computing
  – Computing as a service
  – Resources are allocated on demand and paid for on demand
• Virtualization
  – Integrates and encapsulates resources
  – Provides resources in fine-grained pieces
  – Transparent to users
Background
• Traditional architecture: compute nodes and data nodes are co-located in the same cluster.
• Decoupled architecture: a compute cluster (compute nodes only) accesses a separate data center of object storage (data nodes).
• Decoupled vs. traditional
  – Advantages: more flexible; overall cost is reduced
  – Shortcoming: performance declines
Related Works
Making up for the loss of performance:
• Traditional optimization methods
  – Speed up the shuffle phase of jobs with SSDs
  – [kambatla2014truth] [ruan2017improving]
• Reduce the frequency of accessing the object storage
  – Construct a cache layer between applications and the object storage
  – [shankar2017performance] [qureshi2014cache]
Related Works
Alluxio (formerly Tachyon)
• The world's first memory-speed virtual distributed storage system
• Resides between computation frameworks and storage systems
Source: https://www.alluxio.org/
Motivation
• Prior work is concerned only with performance and does not consider cost
• Cost reduction is critical
• Question: how should the caching architecture be designed to maximize cost performance?
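The slides do not define cost performance explicitly. One plausible definition, consistent with the paired throughput and cost-performance charts later in the deck, is throughput per unit cache cost; this is an assumption, not a formula taken from the source.

```latex
\[
\text{cost performance} \;=\;
\frac{\text{throughput (MB/s)}}{C_{\text{mem}}\,S_{\text{mem}} + C_{\text{ssd}}\,S_{\text{ssd}}}
\]
```

Here \(C_{\text{mem}}\) and \(C_{\text{ssd}}\) are the per-GB prices of memory and SSD, and \(S_{\text{mem}}\), \(S_{\text{ssd}}\) are the capacities of the two cache tiers.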
Experiments
System architecture: on each node, MapReduce runs on top of an Alluxio worker; all Alluxio workers use a shared cloud (object) storage as the under store.
Source: https://www.alluxio.org/
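Not from the slides: a toy sketch of the read path such a cache layer provides (memory tier, then SSD tier, then the object store). It is purely illustrative and is not Alluxio's actual implementation; the class and method names are hypothetical, and `object_store` is assumed to expose a `read(key)` method.

```python
class TieredCache:
    """Toy read-through cache: memory tier -> SSD tier -> object store.

    Illustrative only; real Alluxio manages tiers, eviction and data
    locality far more carefully.
    """

    def __init__(self, object_store, mem_capacity, ssd_capacity):
        self.object_store = object_store   # slow, remote backend (assumed API)
        self.mem = {}                      # fast tier (RAM)
        self.ssd = {}                      # larger, cheaper tier (SSD)
        self.mem_capacity = mem_capacity
        self.ssd_capacity = ssd_capacity

    def get(self, key):
        # 1. Hot data is served from memory.
        if key in self.mem:
            return self.mem[key]
        # 2. Warm data is served from SSD and promoted to memory.
        if key in self.ssd:
            value = self.ssd.pop(key)
            self._put_mem(key, value)
            return value
        # 3. Cold data is fetched from the object store, then cached.
        value = self.object_store.read(key)
        self._put_mem(key, value)
        return value

    def _put_mem(self, key, value):
        # When the memory tier is full, demote the oldest entry to SSD
        # (FIFO here only to keep the sketch short; Alluxio uses
        # configurable eviction policies).
        if len(self.mem) >= self.mem_capacity:
            old_key, old_value = next(iter(self.mem.items()))
            del self.mem[old_key]
            if len(self.ssd) < self.ssd_capacity:
                self.ssd[old_key] = old_value
        self.mem[key] = value
```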
Experiments
Experimental environment
• Experiment 1 — Platform: AWS; Servers: 4 × m3.2xlarge; Object storage: S3
• Experiment 2 — Platform: G-Cloud; Servers: 4 × (8 cores, 30 GB memory); Object storage: Ceph
Experiments
Experimental scheme
• Experiment 1 — Workload: TeraSort × 6
• Experiment 2 — Workload: Hive join × 3
• Data size: 120 GB
• Cost ratio of memory to SSD in the cache: 8:0, 7:1, 5:3, 3:5, 1:7, 0:8
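Not stated on the slides: a small sketch of the arithmetic behind "cost ratio of memory to SSD", turning a ratio and a fixed cache budget into per-tier capacities. The per-GB prices and the budget below are made-up placeholders, not the values used in the AWS or G-Cloud experiments.

```python
# Convert a memory:SSD *cost* ratio into per-tier capacities (GB).
# Prices and budget are hypothetical placeholders.
MEM_PRICE_PER_GB = 4.0    # $/GB, hypothetical
SSD_PRICE_PER_GB = 0.5    # $/GB, hypothetical
CACHE_BUDGET     = 256.0  # total $ spent on the cache layer, hypothetical

def tier_sizes(mem_share, ssd_share):
    """Split the cache budget by cost ratio and convert each share to GB."""
    total = mem_share + ssd_share
    mem_dollars = CACHE_BUDGET * mem_share / total
    ssd_dollars = CACHE_BUDGET * ssd_share / total
    return mem_dollars / MEM_PRICE_PER_GB, ssd_dollars / SSD_PRICE_PER_GB

# The six configurations evaluated on the slides.
for ratio in [(8, 0), (7, 1), (5, 3), (3, 5), (1, 7), (0, 8)]:
    mem_gb, ssd_gb = tier_sizes(*ratio)
    print(f"mem:ssd cost ratio {ratio[0]}:{ratio[1]} -> "
          f"{mem_gb:.0f} GB memory, {ssd_gb:.0f} GB SSD")
```

Because SSD is far cheaper per GB, an equal-cost split buys much more SSD capacity than memory, which is what makes the hybrid configurations interesting for cost performance.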
Results
Experiment 1
[Charts: throughput (MB/s) and cost performance for cache mixes ranging from 100% memory to 100% SSD, in steps of 87.5%, 62.5%, 37.5% and 12.5% memory]
Results
Experiment 2
[Charts: throughput (MB/s) and cost performance for the same cache mixes, from 100% memory to 100% SSD]
Conclusion
• A hybrid (memory + SSD) cache architecture is recommended.
• For workloads with a large output and a small hot-data set, the memory-to-SSD cost ratio in the cache should be around 1:7.
• For workloads with a small output and a large hot-data set, the memory-to-SSD cost ratio in the cache should be around 5:3.
Future Work
• Study further factors that affect cost performance, and try to derive a configuration scheme with the best cost performance.
• Add more workload types and application scenarios, so that the conclusions come closer to real deployments and generalize better.
Q & A
Thanks!