Caching for Data Intensive Scientific Repositories
Ani Thakar, Dan Wang, Tanu Malik, Philip Little, Amitabh Chaudhary
Scientific repositories can have a large “network footprint”
[Figure: telescope, data repository, and network.]
Pan-STARRS is expected to serve over 10 TB of query results each day. LSST will be 150 times the size of Pan-STARRS.
Well-designed proxy caches can help reduce the network footprint
[Figure: telescope, data repository, and network.]
In simulations on SDSS (static data), traffic was reduced to one-fifth.
Proxy caches are hard to design well
[Figure: telescope, data repository, and network.]
Three challenges:
• How do we adaptively choose the best objects to cache?
• How do we process queries on transient objects?
• How do we move large data objects?
Cache objects have varying sizes and varying load costs
[Figure: data objects A to I and the network.]
Objects can be relations, columns, horizontal partitions, vertical partitions, etc.
Caching decisions are not limited to loading and evicting objects
[Figure: data objects A to I, the network, and the example workload listed here.]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
Three types of data communication:
• Query shipping
• Object loading
• Update shipping
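The trade-off among the three types of communication can be made concrete with a toy cost model. The sketch below is only an illustration under stated assumptions: a shipped query costs the size of its result, loading costs the size of the object, a shipped update costs the size of the update, queries answered from the cache and updates to uncached objects cost nothing on the network, and all events touch a single object. The function names and byte counts are hypothetical and do not come from the slides.

```python
# Hypothetical cost model: every name and number here is illustrative.

def traffic_query_shipping_only(events):
    """Object is never cached: each query is shipped and its result comes back."""
    return sum(nbytes for kind, nbytes in events if kind == "query")


def traffic_load_and_ship_updates(events, object_bytes):
    """Object is loaded once up front: every later update must be shipped to it."""
    return object_bytes + sum(nbytes for kind, nbytes in events if kind == "update")


# (kind, bytes) pairs for one object, in arrival order.
events = [("query", 40), ("update", 5), ("query", 60), ("update", 5)]

print(traffic_query_shipping_only(events))        # 100
print(traffic_load_and_ship_updates(events, 50))  # 60
```

With these numbers caching wins; a workload dominated by updates would tip the balance the other way.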
Query shipping is for answering queries without using the cache contents
[Figure: Q1 and Q2 are sent over the network to the repository, and the results are returned.]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q1 result: 4.5
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
Q2 result: 2.3, 4.5, 7.9, 2.1, ....
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
Loading is for moving frequently accessed objects
[Figure: objects A, G, and B are loaded over the network into the cache.]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
This may require evicting other objects from the cache.
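Loading objects of varying sizes into a bounded cache, evicting others when space runs out, might look like the sketch below. The slides do not say which eviction policy is used, so least-recently-used eviction is an assumption made purely for illustration. The class `ObjectCache`, its methods, and the sizes are hypothetical.

```python
from collections import OrderedDict


class ObjectCache:
    """Toy size-aware cache. LRU eviction is an assumed placeholder policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.objects = OrderedDict()  # name -> size, least recently used first

    def touch(self, name):
        """Record an access so the object becomes the most recently used."""
        if name in self.objects:
            self.objects.move_to_end(name)

    def load(self, name, size):
        """Load an object, evicting others if needed; return the evicted names."""
        if name in self.objects:
            self.touch(name)
            return []
        if size > self.capacity:
            raise ValueError(f"{name} does not fit in the cache")
        evicted = []
        while self.used + size > self.capacity:
            victim, victim_size = self.objects.popitem(last=False)
            self.used -= victim_size
            evicted.append(victim)
        self.objects[name] = size
        self.used += size
        return evicted


cache = ObjectCache(capacity=100)
cache.load("A", 40)
cache.load("G", 30)
cache.touch("A")            # A is now more recently used than G
print(cache.load("B", 50))  # ['G']
```

A policy that also weighs load cost and update rate, as the later slides discuss, would replace the `popitem(last=False)` choice of victim.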
Update shipping is for keeping objects up-to-date
[Figure: the cache holds A, G, and B; updates U1 and U2 are shipped to it over the network.]
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
The objective is to keep the heavily queried objects in cache and the heavily updated objects out of it, adaptively
[Figure: query hotspots (Q1, Q2) and update hotspots (U1, U2) over objects A and B, with loading, update shipping, and query shipping across the network.]
The interdependencies between objects make this even harder.
Algorithm Benefit learns from the past window, but is hard to tune
[Figure: cumulative network traffic cost (GB, 0 to 250) versus query and update events (0 to 250k) for NoCache, Benefit, VCover, and SOptimal.]
It greedily loads objects by the benefit of keeping them in cache.
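A greedy, window-based selection in the spirit of Benefit can be sketched as follows. This is not the authors' algorithm. It assumes that an object's benefit over the past window is the query traffic it would have saved, minus the update traffic it would have required, minus its load cost, and that objects are ranked by benefit per byte of cache space. Each event is also assumed to touch one object, which ignores the interdependencies noted earlier. All names and numbers are hypothetical.

```python
def benefit_choose(window, sizes, capacity):
    """Pick objects to cache from (kind, object, bytes) events in the past window."""
    saved = {}
    for kind, obj, nbytes in window:
        # Cached queries save traffic; updates to cached objects cost traffic.
        delta = nbytes if kind == "query" else -nbytes
        saved[obj] = saved.get(obj, 0) + delta

    # Assumed benefit: net traffic saved minus the one-time load cost.
    benefit = {obj: s - sizes[obj] for obj, s in saved.items()}

    # Greedy order: highest benefit per byte of cache space first.
    ranked = sorted(
        (obj for obj in benefit if benefit[obj] > 0),
        key=lambda obj: benefit[obj] / sizes[obj],
        reverse=True,
    )

    chosen, used = [], 0
    for obj in ranked:
        if used + sizes[obj] <= capacity:
            chosen.append(obj)
            used += sizes[obj]
    return chosen


window = [
    ("query", "A", 80), ("query", "A", 70), ("update", "A", 10),
    ("query", "G", 30), ("update", "G", 50),
    ("query", "B", 90),
]
sizes = {"A": 60, "G": 20, "B": 40}
print(benefit_choose(window, sizes, capacity=100))  # ['A', 'B']
```

The length of `window` is the tuning knob: too short and the estimate is noisy, too long and it lags behind shifts in the hotspots.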
Algorithm VCover is conservative but performs close to the offline (static) optimal
[Figure: cumulative network traffic cost (GB, 0 to 250) versus query and update events (0 to 250k) for NoCache, Benefit, VCover, and SOptimal.]
Characteristics:
• It is based on online algorithms for caching.
• It incorporates a rent-versus-buy approach.
• It captures query-update interactions in a bipartite graph (the minimum weighted vertex cover of which is the optimal solution).
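The bipartite-graph view can be illustrated with a small example. The sketch assumes that queries and updates are the two vertex sets, that an edge joins a query to an update touching an object the query reads, and that each vertex weight is the cost of shipping that query or update. These details are assumptions for illustration; the slides state only that the minimum weighted vertex cover is the optimal solution. Under this reading, every edge forces either the query or the update to be shipped. The search below is exhaustive and only suits tiny graphs; on bipartite graphs the same cover can be found in polynomial time with a max-flow reduction.

```python
from itertools import combinations


def min_weight_vertex_cover(weights, edges):
    """Exhaustively find a minimum-weight vertex cover (exponential time)."""
    vertices = list(weights)
    best_cover, best_cost = set(vertices), sum(weights.values())
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            chosen = set(subset)
            # A cover must contain at least one endpoint of every edge.
            if all(u in chosen or v in chosen for u, v in edges):
                cost = sum(weights[v] for v in chosen)
                if cost < best_cost:
                    best_cover, best_cost = chosen, cost
    return best_cover, best_cost


# Hypothetical shipping costs for the example queries and updates.
weights = {"Q1": 4, "Q2": 9, "U1": 3, "U2": 5}
# Q1 reads A and B, Q2 reads A and G; U1 writes A, U2 writes G.
edges = [("Q1", "U1"), ("Q2", "U1"), ("Q2", "U2")]

cover, cost = min_weight_vertex_cover(weights, edges)
print(sorted(cover), cost)  # ['U1', 'U2'] 8
```

With these weights the cover selects both updates, which corresponds to shipping U1 and U2 to the cache and answering Q1 and Q2 there.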
Several open questions remain in creating an effective database cache
[Figure: query hotspots (Q1, Q2) and update hotspots (U1, U2) over objects A and B, with loading, update shipping, and query shipping across the network.]
• Can we reduce the size of VCover data structures?
• Are there better caching algorithms?
• What is the best granularity for a data object?
• Should we be caching query results rather than data objects?
• How do we re-write queries for transient data objects?
• Can we use indices on transient data objects?
In summary, clever algorithms can help build effective caching solutions for data intensive repositories, but much remains to be done
[Figure: query hotspots (Q1, Q2) and update hotspots (U1, U2) over objects A and B, with loading, update shipping, and query shipping across the network.]
Questions?