sorting aggregations
play

Sorting & Aggregations Lecture # 11 Database Systems Andy - PowerPoint PPT Presentation

Sorting & Aggregations Lecture # 11 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 TO DAY'S AGEN DA Sorting Algorithms Aggregations CMU 15-445/645 (Fall 2018) 3 WH Y DO WE N


  1. Sorting & Aggregations Lecture # 11 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018

  2. 2 TO DAY'S AGEN DA Sorting Algorithms Aggregations CMU 15-445/645 (Fall 2018)

  3. 3 WH Y DO WE N EED TO SO RT? Tuples in a table have no specific order But users often want to retrieve tuples in a specific order. → Trivial to support duplicate elimination ( DISTINCT ) → Bulk loading sorted tuples into a B+ tree index is faster → Aggregations ( GROUP BY ) CMU 15-445/645 (Fall 2018)

  4. 4 SO RTIN G ALGO RITH M S If data fits in memory, then we can use a standard sorting algorithm like quick-sort. If data does not fit in memory, then we need to use a technique that is aware of the cost of writing data out to disk. CMU 15-445/645 (Fall 2018)

  5. 5 EXTERN AL M ERGE SO RT Sorting Phase → Sort small chunks of data that fit in main-memory, and then write back the sorted data to a file on disk. Merge Phase → Combine sorted sub-files into a single larger file. CMU 15-445/645 (Fall 2018)

  6. 6 OVERVIEW We will start with a simple example of a 2-way external merge sort. Files are broken up into N pages. The DBMS has a finite number of B fixed-size buffers. CMU 15-445/645 (Fall 2018)

  7. 7 2- WAY EXTERN AL M ERGE SO RT Pass #0 → Reads every B pages of the table into memory → Sorts them, and writes them back to disk. → Each sorted set of pages is called a run. Pass #1,2,3,… → Recursively merges pairs of runs into runs twice as long → Uses three buffer pages (2 for input pages, 1 for output) Memory Disk CMU 15-445/645 (Fall 2018)

  8. 7 2- WAY EXTERN AL M ERGE SO RT Pass #0 → Reads every B pages of the table into memory → Sorts them, and writes them back to disk. → Each sorted set of pages is called a run. Pass #1,2,3,… → Recursively merges pairs of runs into runs twice as long → Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Disk CMU 15-445/645 (Fall 2018)

  9. 7 2- WAY EXTERN AL M ERGE SO RT Pass #0 → Reads every B pages of the table into memory → Sorts them, and writes them back to disk. → Each sorted set of pages is called a run. Pass #1,2,3,… → Recursively merges pairs of runs into runs twice as long → Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Disk CMU 15-445/645 (Fall 2018)

  10. 7 2- WAY EXTERN AL M ERGE SO RT Pass #0 → Reads every B pages of the table into memory → Sorts them, and writes them back to disk. → Each sorted set of pages is called a run. Pass #1,2,3,… → Recursively merges pairs of runs into runs twice as long → Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Memory Disk CMU 15-445/645 (Fall 2018)

  11. 7 2- WAY EXTERN AL M ERGE SO RT Pass #0 → Reads every B pages of the table into memory → Sorts them, and writes them back to disk. → Each sorted set of pages is called a run. Pass #1,2,3,… → Recursively merges pairs of runs into runs twice as long → Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Memory Disk CMU 15-445/645 (Fall 2018)

  12. 8 2- WAY EXTERN AL M ERGE SO RT EOF 3,4 6,2 9,4 8,7 5,6 3,1 2 ∅ In each pass, we read and write each page in file. Number of passes = 1 + ⌈ log 2 N ⌉ Total I/O cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  13. 8 2- WAY EXTERN AL M ERGE SO RT EOF 3,4 6,2 9,4 8,7 5,6 3,1 2 ∅ In each pass, we read and PASS 1 -PAGE write each page in file. 3,4 2,6 4,9 7,8 5,6 1,3 2 ∅ #0 RUNS Number of passes = 1 + ⌈ log 2 N ⌉ Total I/O cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  14. 8 2- WAY EXTERN AL M ERGE SO RT EOF 3,4 6,2 9,4 8,7 5,6 3,1 2 ∅ In each pass, we read and PASS 1 -PAGE write each page in file. 3,4 2,6 4,9 7,8 5,6 1,3 2 ∅ #0 RUNS PASS 2-PAGE Number of passes 2,3 4,7 1,3 ∅ #1 RUNS 4,6 8,9 5,6 2 = 1 + ⌈ log 2 N ⌉ Total I/O cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  15. 8 2- WAY EXTERN AL M ERGE SO RT EOF 3,4 6,2 9,4 8,7 5,6 3,1 2 ∅ In each pass, we read and PASS 1 -PAGE write each page in file. 3,4 2,6 4,9 7,8 5,6 1,3 2 ∅ #0 RUNS PASS 2-PAGE Number of passes 2,3 4,7 1,3 ∅ #1 RUNS 4,6 8,9 5,6 2 = 1 + ⌈ log 2 N ⌉ PASS 4-PAGE ∅ 2,3 #2 RUNS Total I/O cost 4,4 1,2 6,7 3,5 = 2 N ∙ (# of passes) 8,9 6 CMU 15-445/645 (Fall 2018)

  16. 8 2- WAY EXTERN AL M ERGE SO RT EOF 3,4 6,2 9,4 8,7 5,6 3,1 2 ∅ In each pass, we read and PASS 1 -PAGE write each page in file. 3,4 2,6 4,9 7,8 5,6 1,3 2 ∅ #0 RUNS PASS 2-PAGE Number of passes 2,3 4,7 1,3 ∅ #1 RUNS 4,6 8,9 5,6 2 = 1 + ⌈ log 2 N ⌉ PASS 4-PAGE ∅ 2,3 #2 RUNS Total I/O cost 4,4 1,2 6,7 3,5 = 2 N ∙ (# of passes) 8,9 6 PASS 8-PAGE ∅ #3 RUNS 1,2 2,3 3,4 4,5 6,6 7,8 9 CMU 15-445/645 (Fall 2018)

  17. 9 2- WAY EXTERN AL M ERGE SO RT This algorithm only requires three buffer pages ( B =3 ). Even if we have more buffer space available ( B >3 ), it does not effectively utilize them. Let’s next generalize the algorithm to make use of extra buffer space. CMU 15-445/645 (Fall 2018)

  18. 10 GEN ERAL EXTERN AL M ERGE SO RT Pass #0 → Use B buffer pages. → Produce ⌈ N / B ⌉ sorted runs of size B Pass #1,2,3,… → Merge B -1 runs (i.e., K-way merge). Number of passes = 1 + ⌈ log B -1 ⌈ N / B ⌉ ⌉ Total I/O Cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  19. 10 GEN ERAL EXTERN AL M ERGE SO RT Pass #0 → Use B buffer pages. → Produce ⌈ N / B ⌉ sorted runs of size B Pass #1,2,3,… → Merge B -1 runs (i.e., K-way merge). Number of passes = 1 + ⌈ log B -1 ⌈ N / B ⌉ ⌉ Total I/O Cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  20. 10 GEN ERAL EXTERN AL M ERGE SO RT Pass #0 → Use B buffer pages. → Produce ⌈ N / B ⌉ sorted runs of size B Pass #1,2,3,… → Merge B -1 runs (i.e., K-way merge). Number of passes = 1 + ⌈ log B -1 ⌈ N / B ⌉ ⌉ Total I/O Cost = 2 N ∙ (# of passes) CMU 15-445/645 (Fall 2018)

  21. 12 EXAM PLE Sort 108 page file with 5 buffer pages: N =108 , B =5 → Pass #0: ⌈ N / B ⌉ = ⌈ 108 / 5 ⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages) → Pass #1: ⌈ N’ / B -1 ⌉ = ⌈ 22 / 4 ⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages) → Pass #2: ⌈ N’’ / B -1 ⌉ = ⌈ 6 / 4 ⌉ = 2 sorted runs, 80 pages and 28 pages → Pass #3: Sorted file of 108 pages 1+ ⌈ log B-1 ⌈ N / B ⌉ ⌉ = 1+ ⌈ log 4 22 ⌉ = 1+ ⌈ 2.229... ⌉ = 4 passes CMU 15-445/645 (Fall 2018)

  22. 13 USIN G B+ TREES If the table that must be sorted already has a B+ tree index on the sort attribute(s), then we can use that to accelerate sorting. Retrieve tuples in desired sort order by simply traversing the leaf pages of the tree. Cases to consider: → Clustered B+ tree → Unclustered B+ tree CMU 15-445/645 (Fall 2018)

  23. 14 CASE 1: CLUSTERED B+ TREE Index Traverse to the left-most leaf page, (Directs search) Data Entries and then retrieve tuples from all leaf ("Sequence set") pages. This will always better than external 101 102 103 104 sorting. Data Records CMU 15-445/645 (Fall 2018)

  24. 15 CASE 2: UN CLUSTERED B+ TREE Index Chase each pointer to the page that (Directs search) Data Entries contains the data. ("Sequence set") This is almost always a bad idea. In general, one I/O per data record. 101 102 103 104 Data Records CMU 15-445/645 (Fall 2018)

  25. 17 AGGREGATIO NS Collapse multiple tuples into a single scalar value. Two implementation choices: → Sorting → Hashing CMU 15-445/645 (Fall 2018)

  26. 18 SO RTIN G AGGREGATIO N enrolled(sid,cid,grade) SELECT DISTINCT cid sid cid grade FROM enrolled 53666 15-445 C WHERE grade IN ('B','C') 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 15-445 C sid cid grade 53666 15-445 C 53688 15-826 B 53666 15-721 C Filter 53655 15-445 C CMU 15-445/645 (Fall 2018)

  27. 18 SO RTIN G AGGREGATIO N enrolled(sid,cid,grade) SELECT DISTINCT cid sid cid grade FROM enrolled 53666 15-445 C WHERE grade IN ('B','C') 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 15-445 C sid cid grade cid 53666 15-445 C 15-445 53688 15-826 B 15-826 15-721 53666 15-721 C Filter Remove 15-445 53655 15-445 C Columns CMU 15-445/645 (Fall 2018)

  28. 18 SO RTIN G AGGREGATIO N enrolled(sid,cid,grade) SELECT DISTINCT cid sid cid grade FROM enrolled 53666 15-445 C WHERE grade IN ('B','C') 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 15-445 C sid cid grade cid cid 53666 15-445 C 15-445 15-445 53688 15-826 B 15-826 15-445 15-721 15-721 53666 15-721 C Filter Sort Remove 15-445 15-826 53655 15-445 C Columns CMU 15-445/645 (Fall 2018)

  29. 18 SO RTIN G AGGREGATIO N enrolled(sid,cid,grade) SELECT DISTINCT cid sid cid grade FROM enrolled 53666 15-445 C WHERE grade IN ('B','C') 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 15-445 C sid cid grade cid cid 53666 15-445 C 15-445 15-445 X 53688 15-826 B 15-826 15-445 15-721 15-721 53666 15-721 C Filter Sort Remove 15-445 15-826 53655 15-445 C Columns Eliminate Dupes CMU 15-445/645 (Fall 2018)

  30. 20 ALTERN ATIVES TO SO RTIN G What if we don’ t need the data to be ordered? → Forming groups in GROUP BY (no ordering) → Removing duplicates in DISTINCT (no ordering) CMU 15-445/645 (Fall 2018)

Recommend


More recommend