CS 744: SCOPE
Shivaram Venkataraman
Fall 2020
ADMINISTRIVIA
- Assignment grades this week
- Midterm details on Piazza
- Course Project Proposal Submission
    - Convert to PDF and upload a single PDF file to HotCRP
    - Peer review is anonymous: don't include your names in the PDF; include them only in HotCRP itself
Applications: Machine Learning, SQL, Streaming, Graph
Computational Engines: MapReduce, Spark, Ray
Scalable Storage Systems
Resource Management
Datacenter Architecture
SQL: STRUCTURED QUERY LANGUAGE
- A language to query a database
DATABASE SYSTEMS
- OLAP
- OLTP: transaction processing (e.g. airline reservation)
PROCEDURAL VS. RELATIONAL
Procedural (Spark RDD):
    val lines = sc.textFile("users")
    val csv = lines.map(x => x.split(','))
    // fields are strings, so convert the age field before comparing
    val young = csv.filter(x => x(1).toInt < 21)
    println(young.count())
Relational (SQL):
    SELECT COUNT(*)
    FROM "users"
    WHERE age < 21
- The relational version relies on a schema (e.g. age is an int) and is easier to program
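The contrast on this slide can be sketched in Python. This is a minimal illustration, not the slide's Spark code: the sample rows, the table layout, and the use of the standard-library `sqlite3` module are all assumptions made so the example runs on its own.

```python
import sqlite3

# Hypothetical sample data: "name,age" lines standing in for the "users" file.
raw_lines = ["alice,19", "bob,34", "carol,20", "dave,21"]

# Procedural style: parse each line by hand and filter on a positional field.
rows = [line.split(",") for line in raw_lines]
young_procedural = sum(1 for r in rows if int(r[1]) < 21)

# Relational style: declare a schema once, then state what is wanted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(r[0], int(r[1])) for r in rows])
young_relational = conn.execute(
    "SELECT COUNT(*) FROM users WHERE age < 21").fetchone()[0]
conn.close()

print(young_procedural, young_relational)
```

In the procedural version the meaning of field 1 lives only in the programmer's head; in the relational version the schema names it, which is what gives an optimizer something to reason about.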
Microsoft SCOPE
    SELECT query, COUNT(*) AS count
    FROM "search.log" USING LogExtractor
    GROUP BY query
    HAVING count > 1000
    ORDER BY count DESC;
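To make the semantics of this query concrete, here is a rough single-machine equivalent in Python. The log contents are made up, and the threshold is lowered from the slide's 1000 to 1 so that the toy data produces output; none of this reflects how SCOPE executes the query.

```python
from collections import Counter

# Hypothetical log: one search query per entry.
search_log = ["scope", "spark", "scope", "ray", "spark", "scope", "hive"]
THRESHOLD = 1  # assumption: stands in for the slide's 1000

counts = Counter(search_log)                        # GROUP BY query, COUNT(*)
frequent = [(q, c) for q, c in counts.items()
            if c > THRESHOLD]                       # HAVING count > THRESHOLD
frequent.sort(key=lambda qc: qc[1], reverse=True)   # ORDER BY count DESC

print(frequent)
```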
SCOPE OPERATORS
Input reading: What is different (compared with textFile on an RDD)?
    EXTRACT column[:<type>] [, ...]
    FROM <input_stream(s)>
    USING <Extractor> [(args)]
    [HAVING <predicate>]
1. Schema information: columns are declared, optionally with types
2. Pluggable Extractor class (e.g. for CSV)
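The two differences (a declared schema and a pluggable extractor) can be sketched as follows. The names `csv_extractor` and `extract`, and the choice to skip malformed lines, are assumptions for illustration and are not SCOPE's actual extractor interface.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical schema: column name -> type constructor, as in EXTRACT name:type.
Schema = Dict[str, Callable[[str], object]]

def csv_extractor(lines: Iterable[str], schema: Schema,
                  sep: str = ",") -> Iterator[dict]:
    """Turn raw text lines into typed rows according to the schema."""
    names = list(schema)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != len(names):
            continue  # assumption: skip malformed lines instead of failing
        yield {name: schema[name](value) for name, value in zip(names, fields)}

def extract(lines, schema, extractor=csv_extractor, having=None, **args):
    """Mimics EXTRACT ... FROM lines USING extractor(args) [HAVING predicate]."""
    rows = extractor(lines, schema, **args)
    return [row for row in rows if having is None or having(row)]

users = extract(["alice,19", "bob,34", "broken"],
                {"name": str, "age": int},
                having=lambda row: row["age"] < 21)
print(users)
```

Swapping in a different extractor function changes the input format without touching the rest of the query, which is the point of making it pluggable.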
SQL OPERATORS
Select – read rows that satisfy some predicate
Join – Equijoin with support for Inner and Outer join operators
GroupBy – Group by some column
OrderBy – Sorting the output
Aggregations – COUNT, SUM, MAX etc.
LANGUAGE INTEGRATION
    R1 = SELECT A+C AS ac, B.Trim() AS B1
         FROM R
         WHERE StringOccurs(C, "xyz") > 2

    #CS
    public static int StringOccurs(string str, string ptrn) {
        int cnt = 0; int pos = -1;
        while (pos + 1 < str.Length) {
            pos = str.IndexOf(ptrn, pos + 1);
            if (pos < 0) break;
            cnt++;
        }
        return cnt;
    }
    #ENDCS
- B.Trim() is a C# standard library function
- StringOccurs is a custom user-defined function (UDF) written inline in C# between #CS and #ENDCS
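For readers who want to trace the UDF's logic, here is a direct Python port of the slide's `StringOccurs` loop, followed by a filter that plays the role of the WHERE clause. The function name `string_occurs` and the sample strings are made up for illustration.

```python
def string_occurs(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, mirroring the slide's C# loop."""
    count = 0
    pos = -1
    while pos + 1 < len(text):
        # str.find returns -1 when the pattern is absent, like C# IndexOf.
        pos = text.find(pattern, pos + 1)
        if pos < 0:
            break
        count += 1
    return count

# Hypothetical values of column C; keep rows where the UDF result exceeds 2.
column_c = ["xyz--xyz--xyz", "xyz only once", ""]
selected = [c for c in column_c if string_occurs(c, "xyz") > 2]
print(selected)
```

Because the search restarts at `pos + 1` and not after the whole match, overlapping occurrences are counted too.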
MAPREDUCE-LIKE?
- Process: like map; a UDF operator that takes rows as input and produces output rows
- Reduce: like reduce; operates on a group of rows
- Combine: like a join; takes two rowsets as input
    COMBINE S1 WITH S2
    ON S1.A==S2.A AND S1.B==S2.B AND S1.C==S2.C
    USING MultiSetDifference
    PRODUCE A, B, C
- The ON clause is an equi-join condition; MultiSetDifference is a user-defined function
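A small sketch of what a multiset difference over two rowsets computes. This is only an assumption about the semantics suggested by the name `MultiSetDifference`; the slide does not show the combiner's implementation, and the rows here are invented.

```python
from collections import Counter

def multiset_difference(s1, s2):
    """Rows of s1 minus rows of s2, respecting duplicates."""
    # Counter subtraction keeps only rows whose remaining count is positive.
    remaining = Counter(s1) - Counter(s2)
    return sorted(remaining.elements())

# Hypothetical rowsets with columns (A, B, C).
s1 = [(1, "x", 10), (1, "x", 10), (2, "y", 20)]
s2 = [(1, "x", 10), (3, "z", 30)]

diff = multiset_difference(s1, s2)
print(diff)
```

Note that swapping the two inputs gives a different answer, so unlike an inner equi-join this combiner is not commutative.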
EXECUTION: COMPILER
    SELECT query, COUNT(*) AS count
    FROM "search.log" USING LogExtractor
    GROUP BY query
    HAVING count > 1000
    ORDER BY count DESC;
- Check syntax, resolve names
- Check that columns have been defined
- Result: internal parse tree
OPTIMIZER
Cost-based optimizer: rewrite the query expression → lowest cost
Examples:
- Removing unnecessary columns
- Pushing down selection predicates (filtering before grouping)
- Pre-aggregating (similar to a combiner)
Also need to reason about partitioning (See VLDBJ paper)
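Pre-aggregation can be sketched on the running COUNT example. The partitioned input and the "rows shipped" measure are assumptions chosen to show why the rewritten plan can be cheaper; a real optimizer would rely on its own cost model.

```python
from collections import Counter

# Hypothetical input: the search log split into partitions on different machines.
partitions = [
    ["scope", "spark", "scope"],
    ["spark", "scope", "ray"],
]

def naive_plan(parts):
    """Ship every row to one place, then count."""
    all_rows = [q for part in parts for q in part]
    return Counter(all_rows), len(all_rows)          # (result, rows shipped)

def preaggregated_plan(parts):
    """Count inside each partition first, then merge the partial counts."""
    partials = [Counter(part) for part in parts]
    total = Counter()
    for partial in partials:
        total.update(partial)                        # adds counts key by key
    return total, sum(len(p) for p in partials)      # (result, partial rows shipped)

naive_result, naive_shipped = naive_plan(partitions)
pre_result, pre_shipped = preaggregated_plan(partitions)
print(naive_result == pre_result, naive_shipped, pre_shipped)
```

Both plans give the same counts, but the second ships one row per distinct query per partition. A filter such as HAVING count > 1000 still has to run after the merge, since a partial count below the threshold says nothing about the total.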
RUNTIME OPTIMIZATIONS
- Hierarchical aggregation: aggregate within a rack first, since links between racks are more constrained
- Locality-sensitive task placement (similar to Spark / MR)
- Grouping heuristics?
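Hierarchical aggregation can be illustrated with a two-level merge. The cluster layout, machine names, and the message count used as a cost proxy are all hypothetical; the slide does not specify how SCOPE groups partial results.

```python
from collections import Counter

# Hypothetical layout: rack -> machine -> locally computed partial counts.
cluster = {
    "rack1": {"m1": Counter(scope=2, spark=1), "m2": Counter(scope=1)},
    "rack2": {"m3": Counter(spark=3), "m4": Counter(ray=1, scope=1)},
}

def merge(counters):
    """Sum a collection of partial counts into one Counter."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

# Level 1: merge the partial counts inside each rack.
per_rack = {rack: merge(machines.values()) for rack, machines in cluster.items()}
# Level 2: merge one partial result per rack at the final aggregator.
global_counts = merge(per_rack.values())

# Partial results the final aggregator receives under each scheme.
hierarchical_messages = len(per_rack)
flat_messages = sum(len(machines) for machines in cluster.values())
print(global_counts, hierarchical_messages, flat_messages)
```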
SUMMARY, TAKEAWAYS
Relational API (schema)
- Enables rich space of optimizations
- Easy to use, integration with C# UDFs
SCOPE Execution
- Compiler to check for errors, generate DAG
- Optimizer to accelerate queries (static + dynamic)
Precursor to systems like SparkSQL
DISCUSSION https://forms.gle/hL8VJ6uSG7Lzm164A
Consider a column-oriented data layout on your storage system (for example, Apache Parquet). What are some reasons that a SCOPE query might be faster than running an equivalent MR program?
- The notion of an Extractor: if the query touches a single column, the extractor can read just that column, which is efficient with a columnar layout
- Pre-filtering can happen in the extractor
http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
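As a starting point for this discussion, the sketch below counts how many values a scan has to read under each layout. The table contents and the "values read" metric are assumptions; real formats such as Parquet add encoding, compression, and metadata that this ignores.

```python
# Hypothetical table with three columns; the query only needs "age".
row_layout = [("alice", 19, "WI"), ("bob", 34, "CA"), ("carol", 20, "NY")]

column_layout = {
    "name": ["alice", "bob", "carol"],
    "age": [19, 34, 20],
    "state": ["WI", "CA", "NY"],
}

def count_young_rows(rows):
    """Row layout: every field of every row is read to reach the age."""
    values_read = 0
    count = 0
    for row in rows:
        values_read += len(row)
        if row[1] < 21:
            count += 1
    return count, values_read

def count_young_columns(columns):
    """Column layout: only the age column is read."""
    ages = columns["age"]
    return sum(1 for a in ages if a < 21), len(ages)

print(count_young_rows(row_layout), count_young_columns(column_layout))
```

A declarative query tells the system which columns are needed, so it can pick the second strategy; an opaque map function over whole records generally does not expose that information.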
Does a SCOPE-like optimizer help ML workloads? Consider the code in your Assignment 2. What parts of your code would benefit and what parts would not?
- Could benefit: feature extraction, column filtering
- Joins are rare in ML workloads?
- Other optimizations: caching intermediate outputs
- The optimizer also decides details such as hash join vs. merge join
NEXT STEPS
- Next class: Elastic Data Warehousing with Snowflake
- Project proposals due tomorrow! See Piazza!
- Midterm coming up!