Caching for Data Intensive Scientific Repositories
Ani Thakar, Dan Wang, Tanu Malik, Philip Little, Amitabh Chaudhary

Scientific repositories can have a large “network footprint”
[Diagram labels: Network, Telescope, Data Repository]
Pan-STARRS is expected to service over 10 TB of query results each day. LSST will be 150 times the size of Pan-STARRS.

Well-designed proxy caches can help reduce the network footprint
[Diagram labels: Network, Telescope, Data Repository]
In simulations on SDSS (static data), traffic was reduced to one-fifth.

Effective proxy caches are hard to design
[Diagram labels: Network, Telescope, Data Repository]
Three challenges:
• How do we adaptively choose the best objects to cache?
• How do we process queries on transient objects?
• How do we move large data objects?

Cache objects have varying sizes and varying load costs
[Diagram labels: objects A to I, Network]
Objects can be relations, columns, horizontal partitions, vertical partitions, etc.

Caching decisions are not limited to loading and evicting objects
[Diagram labels: objects A to I, Network, queries Q1 and Q2, updates U1 and U2]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
Three types of data communication:
• Query shipping
• Object loading
• Update shipping
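The three types of communication can be sketched as a small traffic-accounting model. This is only an illustration: the class name `ProxyCache`, its methods, and all byte counts are hypothetical and do not come from the slides. It assumes a query can be answered locally only when every object it touches is cached, and that updates are shipped only for cached objects. Eviction is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyCache:
    """Toy proxy cache that tallies network traffic by communication type."""
    object_sizes: dict  # object name -> size in bytes (hypothetical values)
    cached: set = field(default_factory=set)
    traffic: dict = field(default_factory=lambda: {
        "query_shipping": 0, "object_loading": 0, "update_shipping": 0})

    def query(self, objects, result_bytes):
        # Assumption: a query runs locally only if all its objects are cached.
        if set(objects) <= self.cached:
            return "local"
        # Otherwise the query goes to the repository and the result comes back.
        self.traffic["query_shipping"] += result_bytes
        return "shipped"

    def load(self, name):
        # Loading moves the whole object across the network.
        self.traffic["object_loading"] += self.object_sizes[name]
        self.cached.add(name)

    def update(self, name, update_bytes):
        # Only cached copies need to be kept up to date.
        if name in self.cached:
            self.traffic["update_shipping"] += update_bytes
            return "shipped"
        return "not needed"


cache = ProxyCache({"A": 1000, "B": 500, "G": 2000})
cache.query(["A", "B"], result_bytes=8)   # Q1 before loading A and B
cache.load("A")
cache.load("B")
cache.query(["A", "B"], result_bytes=8)   # Q1 again, now answered locally
cache.update("A", update_bytes=40)        # U1 touches cached object A
cache.update("G", update_bytes=40)        # U2 touches uncached object G
print(cache.traffic)
```

In this toy run the second Q1 costs nothing, but keeping A in cache makes U1 cost network traffic, which is the trade-off the later slides address.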

Query shipping is for answering queries without using the cache contents
[Diagram labels: objects A to I, Network, queries Q1 and Q2, updates U1 and U2]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q1 Result: 4.5
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
Q2 Result: 2.3, 4.5, 7.9, 2.1, ....
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6

Loading is for moving frequently accessed objects
[Diagram labels: objects A to I, Network, with copies of A, G, and B in the cache]
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
This may require evicting other objects from the cache.

Update shipping is for keeping objects up-to-date
[Diagram labels: objects A to I, Network, with copies of A, G, and B in the cache]
U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)
U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6
Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y
Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z

The objective is to adaptively keep the heavily queried objects in cache and the heavily updated objects out of it
[Diagram labels: query hotspots (Q1, Q2), update hotspots (U1, U2), objects A and B, Network, "Loading, Update Shipping", "Query Shipping"]
The interdependencies between objects make this even harder.

Algorithm Benefit learns from the past window, but is hard to tune
[Plot: cumulative network traffic cost (GB, 0 to 250) versus query and update events (0 to 250k), for NoCache, Benefit, VCover, and SOptimal]
It greedily loads objects by the benefit of keeping them in cache.
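A greedy, window-based selection of this kind might look like the sketch below. The slides do not define "benefit", so the definition here is an assumption: query bytes that would have been saved, minus update bytes that would have been shipped, minus the one-time load cost. The function name, the event format, and the simplification that each event touches a single object are also hypothetical.

```python
from collections import defaultdict


def greedy_benefit_selection(window, sizes, capacity):
    """Pick objects to cache from a window of past events.

    window:   list of (kind, object_name, nbytes), kind is "query" or "update"
    sizes:    object name -> size in bytes (also used as the load cost)
    capacity: cache capacity in bytes
    """
    benefit = defaultdict(float)
    for kind, obj, nbytes in window:
        if kind == "query":
            benefit[obj] += nbytes   # traffic saved had obj been cached
        elif kind == "update":
            benefit[obj] -= nbytes   # traffic spent keeping the copy fresh

    # Assumed net benefit: subtract the one-time cost of loading the object.
    net = {obj: b - sizes[obj] for obj, b in benefit.items()}

    chosen, used = [], 0
    # Greedy order: highest net benefit per byte of cache space first.
    for obj in sorted(net, key=lambda o: net[o] / sizes[o], reverse=True):
        if net[obj] > 0 and used + sizes[obj] <= capacity:
            chosen.append(obj)
            used += sizes[obj]
    return chosen


sizes = {"A": 100, "B": 50, "G": 200}
window = [
    ("query", "A", 80), ("query", "A", 90), ("update", "A", 10),
    ("query", "G", 150), ("update", "G", 300),
    ("query", "B", 120),
]
print(greedy_benefit_selection(window, sizes, capacity=120))
print(greedy_benefit_selection(window, sizes, capacity=150))
```

Here G is never chosen because its updates outweigh its queries, and A is chosen only when the cache has room for it after B.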

Algorithm VCover is conservative but performs close to the offline (static) optimal
[Plot: cumulative network traffic cost (GB, 0 to 250) versus query and update events (0 to 250k), for NoCache, Benefit, VCover, and SOptimal]
Characteristics:
• It is based on online algorithms for caching.
• It incorporates a rent-versus-buy approach.
• It captures query-update interactions in a bipartite graph (the minimum weighted vertex cover of which is the optimal solution).
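Two of these ideas can be illustrated with a short sketch. It is not the VCover algorithm itself, whose details the slides do not give. The first function shows the generic rent-versus-buy rule: keep shipping queries ("rent") until the accumulated cost reaches the load cost, then load the object ("buy"). The second finds a minimum weighted vertex cover by exhaustive search, which is practical only for tiny graphs. The interpretation that covering an edge means shipping either the query or the update, and all weights, are assumptions. The edges follow the example queries: Q1 reads A, Q2 reads A and G, U1 writes A, U2 writes G.

```python
from itertools import combinations


def rent_or_buy_load_point(result_sizes, load_cost):
    """Index of the query after which the object gets loaded, or None."""
    shipped = 0
    for i, nbytes in enumerate(result_sizes):
        shipped += nbytes            # "rent": ship this query's result
        if shipped >= load_cost:     # rent paid so far matches the price
            return i                 # "buy": load the object now
    return None


def min_weight_vertex_cover(weights, edges):
    """Exhaustive minimum weighted vertex cover (exponential time)."""
    nodes = list(weights)
    best_cover, best_cost = set(nodes), sum(weights.values())
    for r in range(len(nodes) + 1):
        for subset in combinations(nodes, r):
            chosen = set(subset)
            # A cover must contain at least one endpoint of every edge.
            if all(q in chosen or u in chosen for q, u in edges):
                cost = sum(weights[n] for n in chosen)
                if cost < best_cost:
                    best_cover, best_cost = chosen, cost
    return best_cover, best_cost


# Hypothetical shipping costs in bytes for each query and update.
weights = {"Q1": 8, "Q2": 500, "U1": 40, "U2": 60}
# An edge links a query to an update on an object that the query reads.
edges = [("Q1", "U1"), ("Q2", "U1"), ("Q2", "U2")]

print(rent_or_buy_load_point([300, 300, 300, 300, 300], load_cost=1000))
print(min_weight_vertex_cover(weights, edges))
```

With these weights the cheapest cover ships both updates, because shipping the result of Q2 would cost more than keeping A and G current.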

Several open questions remain in creating an effective database cache
[Diagram labels: query hotspots (Q1, Q2), update hotspots (U1, U2), objects A and B, Network, "Loading, Update Shipping", "Query Shipping"]
• Can we reduce the size of VCover data structures?
• Are there better caching algorithms?
• What is the best granularity for a data object?
• Should we be caching query results rather than data objects?
• How do we re-write queries for transient data objects?
• Can we use indices on transient data objects?

In summary, clever algorithms can help build effective caching solutions for data intensive repositories, but much remains to be done
[Diagram labels: query hotspots (Q1, Q2), update hotspots (U1, U2), objects A and B, Network, "Loading, Update Shipping", "Query Shipping"]
Questions?
