Algorithm Engineering (aka How to Write Fast Code), CS260 Lecture 6


slide-1
SLIDE 1

Algorithm Engineering

(aka. How to Write Fast Code)

I/O Algorithms and Parallel Samplesort

CS260, Lecture 6, Yan Gu

slide-2
SLIDE 2

CS260: Algorithm Engineering, Lecture 6

  • Review of Samplesort
  • Semisort
  • Course Policy

slide-3
SLIDE 3

Sample-sort outline

Analogous to multiway quicksort

  • 1. Split input array into √N contiguous subarrays of size √N. Sort subarrays recursively

[Figure: input array divided into √N subarrays, each of size √N and sorted]

slide-4
SLIDE 4

Sample-sort outline

Analogous to multiway quicksort

  • 1. Split input array into √N contiguous subarrays of size √N. Sort subarrays recursively (sequentially)

[Figure: input array divided into subarrays of size √N, each sorted]

slide-5
SLIDE 5

Sample-sort outline

  • 2. Choose √N − 1 "good" pivots p_1 ≤ p_2 ≤ ... ≤ p_{√N−1}
  • 3. Distribute subarrays into buckets, according to pivots

[Figure: sorted subarrays of size √N distributed into Bucket 1, Bucket 2, ..., Bucket √N, separated by the pivots p_1 ≤ p_2 ≤ ... ≤ p_{√N−1}; each bucket has size ≈ √N]

slide-6
SLIDE 6

Sample-sort outline

  • 4. Recursively sort the buckets
  • 5. Copy concatenated buckets back to input array

[Figure: Bucket 1, Bucket 2, ..., Bucket √N, separated by the pivots p_1 ≤ p_2 ≤ ... ≤ p_{√N−1}, each sorted]
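The five steps of the outline can be sketched sequentially as follows. This is only an illustration under assumptions that are not on the slides: the function name sample_sort, the base-case size, the oversampling factor 4, and choosing pivots from a random sample with replacement are all hypothetical choices, and nothing here is parallel or I/O-tuned.

```python
import bisect
import math
import random


def sample_sort(a, base=16, seed=0):
    """Sequential sketch of the sample-sort outline (illustrative only)."""
    n = len(a)
    if n <= base:
        return sorted(a)
    m = math.isqrt(n)  # about sqrt(N) subarrays and buckets

    # Step 1: split into contiguous subarrays of size ~sqrt(N), sort each.
    size = -(-n // m)  # ceil(n / m)
    subarrays = [sorted(a[i:i + size]) for i in range(0, n, size)]

    # Step 2: choose m - 1 pivots from a sorted random sample
    # (assumption: oversampling factor 4, sampling with replacement).
    rng = random.Random(seed)
    sample = sorted(rng.choice(a) for _ in range(4 * m))
    pivots = [sample[4 * j] for j in range(1, m)]

    # Step 3: distribute the subarrays into buckets according to the pivots.
    buckets = [[] for _ in range(m)]
    for sub in subarrays:
        for x in sub:
            buckets[bisect.bisect_left(pivots, x)].append(x)

    # Guard: if one bucket received everything (e.g. all keys equal),
    # fall back to a plain sort so the recursion always terminates.
    if max(len(b) for b in buckets) == n:
        return sorted(a)

    # Steps 4 and 5: recursively sort the buckets and concatenate them.
    out = []
    for b in buckets:
        out.extend(sample_sort(b, base, seed + 1))
    return out
```

In this simplified version the sorted subarrays are not needed for correctness; a real implementation uses them to distribute each subarray with a merge-like pass over the pivots.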

slide-7
SLIDE 7

CS260: Algorithm Engineering, Lecture 6

  • Review of Samplesort
  • Semisort
  • Course Policy

slide-8
SLIDE 8
What is semisort?

  • Input:
      • An array of records with associated keys
      • Assume keys can be hashed to the range [n^k]
  • Goal:
      • All records with equal keys should be adjacent

key    45 12 45 61 28 61 61 45 28 45
value   2  5  3  9  5  9  8  1  7  5

slide-9
SLIDE 9
What is semisort?

  • Input:
      • An array of records with associated keys
      • Assume keys can be hashed to the range [n^k]
  • Goal:
      • All records with equal keys should be adjacent

key    12 61 61 61 45 45 45 45 28 28
value   5  8  9  9  2  5  1  3  7  5

slide-10
SLIDE 10
What is semisort?

  • Input:
      • An array of records with associated keys
      • Assume keys can be hashed to the range [n^k]
  • Goal:
      • All records with equal keys should be adjacent
      • Different keys are not necessarily sorted
      • Records with equal keys do not need to be sorted by their values

key    45 45 45 45 12 61 61 61 28 28
value   2  5  1  3  5  8  9  9  7  5

slide-11
SLIDE 11
What is semisort?

  • Input:
      • An array of records with associated keys
      • Assume keys can be hashed to the range [n^k]
  • Goal:
      • All records with equal keys should be adjacent
      • Different keys are not necessarily sorted
      • Records with equal keys do not need to be sorted by their values

key    45 45 45 45 12 61 61 61 28 28
value   1  5  3  2  5  8  9  9  7  5
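The goal above can be stated as a small checker: an array is semisorted when no key reappears after its run has ended. The function name is_semisorted is a hypothetical helper, not part of the lecture; the two test arrays are the unsorted and grouped key/value rows from the slides.

```python
def is_semisorted(records):
    """Return True if all records with equal keys are adjacent."""
    seen = set()
    prev = object()  # sentinel that compares unequal to every key
    for key, _value in records:
        if key != prev:
            if key in seen:  # the key's earlier run already ended
                return False
            seen.add(key)
            prev = key
    return True


unsorted_records = list(zip([45, 12, 45, 61, 28, 61, 61, 45, 28, 45],
                            [2, 5, 3, 9, 5, 9, 8, 1, 7, 5]))
grouped_records = list(zip([45, 45, 45, 45, 12, 61, 61, 61, 28, 28],
                           [1, 5, 3, 2, 5, 8, 9, 9, 7, 5]))
print(is_semisorted(unsorted_records))  # False
print(is_semisorted(grouped_records))   # True
```

Note that the check ignores both the order of distinct keys and the order of values within a key, matching the last two bullets of the goal.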

slide-12
SLIDE 12

Semisort is one of the most useful primitives in parallel algorithms

  • Parallel In-Place Algorithms: Theory and Practice
  • Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
  • Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
  • Efficient BVH Construction via Approximate Agglomerative Clustering
  • Theoretically-Efficient and Practical Parallel DBSCAN

slide-13
SLIDE 13

Why is semisort so useful? (albeit not seen before)

  • Semisorting can be done by sorting, but faster (fewer restrictions)
  • Theoretically it can be done in O(n) work, not O(n log n) work
  • Can be used to implement counting / integer sort
      • Integer sort: given n key-value pairs with keys in range [1, ..., n], query the KV-pairs with a certain key
      • Counting sort: given n key-value pairs with keys in range [1, ..., n], query the number of KV-pairs with a certain key
  • In the database community, this is called the GroupBy operator
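The two queries above (all KV-pairs with a key, and the number of KV-pairs with a key) can be answered after a sequential two-pass grouping: count each key first, then place each pair into a pre-allocated output array. This is a minimal sketch assuming integer keys in 1..n; the names group_by_key, count, and start are hypothetical.

```python
def group_by_key(pairs, n):
    """Group (key, value) pairs with integer keys in 1..n (sequential sketch)."""
    # Pass 1: count[k] = number of pairs with key k (index 0 unused).
    count = [0] * (n + 1)
    for k, _ in pairs:
        count[k] += 1

    # Prefix sums: pairs with key k will occupy out[start[k]:start[k + 1]].
    start = [0] * (n + 2)
    for k in range(1, n + 1):
        start[k + 1] = start[k] + count[k]

    # Pass 2: place every pair into the next free slot of its key.
    out = [None] * len(pairs)
    cursor = start[:]
    for k, v in pairs:
        out[cursor[k]] = (k, v)
        cursor[k] += 1
    return out, start, count


pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
out, start, count = group_by_key(pairs, 3)
print(out[start[3]:start[4]])  # [(3, 'a'), (3, 'c')]
print(count[3])                # 2
```

The pre-counting pass is exactly what makes this approach awkward to parallelize directly, which motivates the sampling-based algorithm later in the lecture.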
slide-14
SLIDE 14

Why is semisort so useful? (albeit not seen before)

  • Semisorting can be done by sorting, but faster (fewer restrictions)
  • Theoretically it can be done in O(n) work, not O(n log n) work
  • Can be used to implement counting / integer sort

[Figure: a table of keys (37, ..., 58, ..., 92, ...) in which each key points to a linked list of its values; values shown: 12, 9, 52, 92, 56, 11, 19, 8, 56]

slide-15
SLIDE 15

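Attempts, sequentially: pre-allocated array

[Figure: a table of keys (37, ..., 58, ..., 92, ...) in which each key points to a pre-allocated array of its values; values shown: 12, 9, 52, 92, 56, 11, 19, 8, 44, 31, 56]

  • Problem
      • Need to pre-count the number of occurrences of each key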

slide-16
SLIDE 16
Another use case for semisort

  • Generate adjacency array for a graph

Edge list    Sorted edge list
(3,5)        (3,5)
(1,7)        (3,7)
(2,3)        (3,6)
(3,6)        (5,4)
(5,4)        (1,6)
(3,7)        (1,7)
(1,6)        (2,3)

Vertices: 1 2 3 4 5 6 7
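A rough sketch of this use case: group the edges by source vertex, then record where each vertex's run of neighbors starts. The grouping below is a sequential dictionary-based stand-in for a real parallel semisort, and the names semisort, adjacency_array, first, and degree are hypothetical. The edge list is the seven-edge example from the slide.

```python
def semisort(records, key=lambda r: r[0]):
    """Sequential stand-in for semisort: make equal keys adjacent."""
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r)
    return [r for group in groups.values() for r in group]


def adjacency_array(edges, num_vertices):
    """Build a neighbor array plus per-vertex (first, degree) pointers."""
    grouped = semisort(edges)
    neighbors = [v for _, v in grouped]
    first = [0] * (num_vertices + 1)   # vertices are 1-based; index 0 unused
    degree = [0] * (num_vertices + 1)
    for i, (u, _) in enumerate(grouped):
        if degree[u] == 0:
            first[u] = i               # start of u's contiguous run
        degree[u] += 1
    return neighbors, first, degree


edges = [(3, 5), (1, 7), (2, 3), (3, 6), (5, 4), (3, 7), (1, 6)]
neighbors, first, degree = adjacency_array(edges, 7)
# Neighbors of vertex 3 are one contiguous slice of the neighbor array.
print(sorted(neighbors[first[3]:first[3] + degree[3]]))  # [5, 6, 7]
```

Because the source vertices only need to be grouped, not ordered, each vertex keeps an explicit start pointer instead of relying on a sorted offset array.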

slide-17
SLIDE 17
What is semisort?

  • Input:
      • An array of records with associated keys
      • Assume keys can be hashed to the range [n^k]
  • Goal:
      • All records with equal keys should be adjacent
      • Different keys are not necessarily sorted
      • Records with equal keys do not need to be sorted by their values

key    45 45 45 45 12 61 61 61 28 28
value   1  5  3  2  5  8  9  9  7  5

slide-18
SLIDE 18
Why is semisort hard?

  • There can be many duplicate keys
      • Heavy keys
  • Or, there can be almost no duplicate keys
      • Light keys

key    45 45 45 45 12 61 61 61 28 28
value   1  5  3  2  5  8  9  9  7  5

slide-19
SLIDE 19
Implement integer sort using semisort

  • Input: n KV-pairs with keys in [n]
  • Step 1: hash the keys (i.e., for (k_i, v_i), generate h_i = hash(k_i))
  • Step 2: semisort (h_i, (k_i, v_i)), and resolve conflicts
  • Step 3: get the pointer for each key k_i

key    45 45 45 45 12 61 61 61 28 28
value   1  5  3  2  5  8  9  9  7  5
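The three steps can be sketched as below. Two things are assumptions made for illustration only: the multiplicative hash (an arbitrary choice, with hash range n^2), and the use of an ordinary sort on (hash, key) as a stand-in for a real semisort, which also resolves hash collisions by keeping different keys with the same hash apart. The names hash_key and integer_sort_index are hypothetical.

```python
def hash_key(k, n):
    """Illustrative hash into [0, n^2); the constant is an arbitrary choice."""
    return (k * 2654435761) % (n * n)


def integer_sort_index(pairs):
    """Return the grouped records and a pointer table key -> [begin, end)."""
    n = max(1, len(pairs))
    # Step 1: hash the keys, producing (h_i, k_i, v_i).
    tagged = [(hash_key(k, n), k, v) for k, v in pairs]
    # Step 2: stand-in for semisort. Sorting by (hash, key) makes equal
    # keys adjacent and separates different keys that share a hash.
    tagged.sort(key=lambda t: (t[0], t[1]))
    # Step 3: record the begin/end pointer of each key's run.
    pointers = {}
    for i, (_, k, _) in enumerate(tagged):
        if k not in pointers:
            pointers[k] = [i, i + 1]
        else:
            pointers[k][1] = i + 1
    records = [(k, v) for _, k, v in tagged]
    return records, pointers


keys = [45, 12, 45, 61, 28, 61, 61, 45, 28, 45]
values = [2, 5, 3, 9, 5, 9, 8, 1, 7, 5]
records, pointers = integer_sort_index(list(zip(keys, values)))
begin, end = pointers[45]
print(sorted(v for _, v in records[begin:end]))  # [1, 2, 3, 5]
```

With the pointer table, the integer-sort query (all pairs with a key) is a slice, and the counting-sort query (how many pairs) is end minus begin.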

slide-20
SLIDE 20

The Top-Down Parallel Semisort Algorithm


slide-21
SLIDE 21
The main goal: estimate key counts

  • And tell the heavy keys from the light ones. How? Sampling!
  • A key that appears more than n/t times is called a heavy key
  • Otherwise, it is called a light key
  • We can treat them separately

slide-22
SLIDE 22
The algorithm

  • Take t log n samples and sort them
  • Keys with more than log n appearances in the sample are marked as heavy keys; the others are light keys
  • Each heavy key gets its own bucket, and another t buckets are used for the light keys, each corresponding to a range of n^k/t hash values
      • The input keys are hashed into [n^k]
  • In total we have no more than 2t buckets
  • The rest of the algorithm is pretty similar to samplesort

slide-23
SLIDE 23

Phase 1: Sampling and sorting

  • 1. Select a sample set S of t log n keys
  • 2. Sort S

[Figure: the sample set S is drawn from the input (sampling) and then sorted (or counted); sample values shown: 5 5 5 8 8 8 8 8 17 17 ... 11 17]
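Phase 1 can be sketched as follows: draw t log n samples, sort them, and mark every key with more than log n appearances in the sample as heavy. Sampling with replacement, the base-2 logarithm rounded up, and the fixed seed are assumptions made for this illustration; the name find_heavy_keys is hypothetical.

```python
import math
import random
from collections import Counter


def find_heavy_keys(keys, t, seed=0):
    """Estimate the heavy keys from a sorted sample of size t * log n."""
    n = len(keys)
    log_n = max(1, math.ceil(math.log2(n)))
    rng = random.Random(seed)
    # 1. Select the sample set S (assumption: uniform, with replacement).
    # 2. Sort S, so equal sampled keys become adjacent.
    sample = sorted(rng.choice(keys) for _ in range(t * log_n))
    # Count the appearances of each key in the sorted sample.
    counts = Counter(sample)
    heavy = {k for k, c in counts.items() if c > log_n}
    return heavy, sample


# One key makes up 90% of the input; the other 100 keys appear once each.
keys = [7] * 900 + list(range(100, 200))
heavy, sample = find_heavy_keys(keys, t=8)
print(heavy)
```

Since every heavy key owns more than log n of the t log n samples, this rule can never mark more than t keys as heavy, which is where the bound of at most 2t buckets comes from.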

slide-24
SLIDE 24

Phase 2: Bucket Construction

Sorted samples: 5 5 5 8 8 8 8 8 17 17 ... 11 17

Counting & Filtering:

  • Heavy keys: 8, 20, 65, ...
  • Light keys, by range (0-15, 16-31, ...): 5, 11, 17, 21, 26, 31, ...

slide-25
SLIDE 25
At the end of Phase 2

  • In total we have no more than 2t buckets
      • t of them are for light keys
  • Then we construct a hash table for the heavy keys
  • Now we know which bucket each KV-pair (k_i, v_i) goes to:
      • If k_i is found in the hash table, assign it to the associated heavy bucket
      • Otherwise, it goes to the light bucket based on the range of k_i
  • The rest of the algorithm is almost identical to samplesort
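The bucket assignment described above can be sketched as a lookup function: heavy keys are found in a hash table (a Python dict here), and every other key falls into one of t light buckets by the range of its hash value. The names make_bucket_fn, hash_range, and hash_fn are hypothetical, and the identity hash in the example is used only so the ranges 0-15 and 16-31 are easy to follow.

```python
def make_bucket_fn(heavy_keys, t, hash_range, hash_fn):
    """Return (bucket_of, number_of_buckets) for the given heavy keys."""
    # Hash table for the heavy keys: key -> its own bucket id.
    heavy_index = {k: i for i, k in enumerate(sorted(heavy_keys))}
    num_heavy = len(heavy_index)
    # Each of the t light buckets covers a range of hash values.
    width = -(-hash_range // t)  # ceil(hash_range / t)

    def bucket_of(key):
        if key in heavy_index:
            return heavy_index[key]                # heavy bucket
        return num_heavy + hash_fn(key) // width   # light bucket by range

    return bucket_of, num_heavy + t


# Example: heavy keys 8, 20, 65; t = 8 light buckets over hash range 128,
# so light bucket 0 covers 0-15, light bucket 1 covers 16-31, and so on.
bucket_of, num_buckets = make_bucket_fn({8, 20, 65}, t=8, hash_range=128,
                                        hash_fn=lambda k: k)
print([bucket_of(k) for k in (8, 20, 65)])  # [0, 1, 2]
print([bucket_of(k) for k in (5, 11)])      # [3, 3]
print([bucket_of(k) for k in (17, 21, 26, 31)])  # [4, 4, 4, 4]
```

Heavy buckets need no further sorting because they hold a single key; only the light buckets are processed recursively.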

slide-26
SLIDE 26

Sample-sort outline

Analogous to multiway quicksort

  • 1. Split input array into √N contiguous subarrays of size √N

[Figure: input array split into subarrays; annotations: √N, √N, N/t, t]

slide-27
SLIDE 27

Sample-sort outline

Analogous to multiway quicksort

  • 1. Split input array into N/t contiguous subarrays of size t. Sort subarrays recursively (sequentially)

[Figure: input array split into subarrays, each of size ≈ t]

slide-28
SLIDE 28

Sample-sort outline

  • 2. Distribute subarrays into buckets

[Figure: subarrays distributed into Bucket 1, Bucket 2, ..., Bucket √N, separated by the pivots p_1 ≤ p_2 ≤ ... ≤ p_{√N−1}]

slide-29
SLIDE 29
Sample-sort outline

  • 3. Recursively sort the buckets
  • 4. Copy concatenated buckets back to input array

Only for the light buckets

[Figure: Bucket 1, Bucket 2, ..., Bucket √N, each sorted]

slide-30
SLIDE 30

Difference 2: subarrays are not sorted

  • For simplicity, assume n = 16, and the input is

[1, 2, 3, 4, 1, 1, 3, 3, 1, 2, 2, 4, 1, 2, 4, 4]

  • First, get the count for each subarray in each bucket

[1, 1, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 1, 1, 0, 2]

  • Then, transpose the array and scan to compute the offsets

[1, 2, 1, 1, 1, 0, 2, 1, 1, 2, 0, 0, 1, 0, 1, 2]
[0, 1, 3, 4, 5, 6, 6, 8, 9, 10, 12, 12, 12, 13, 13, 14]

  • Lastly, move each element to the corresponding bucket

[∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅]
[1, ∅, ∅, ∅, ∅, 2, ∅, ∅, ∅, 3, ∅, ∅, 4, ∅, ∅, ∅]
[1, 1, 1, ∅, ∅, 2, ∅, ∅, ∅, 3, 3, 3, 4, ∅, ∅, ∅]

slide-31
SLIDE 31

Difference 2: subarrays are not sorted, but it doesn't matter

  • For simplicity, assume n = 16, and the input is

[1, 3, 2, 4, 1, 3, 1, 3, 1, 2, 2, 4, 1, 2, 4, 4]

  • First, get the count for each subarray in each bucket

[1, 1, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 1, 1, 0, 2]

  • Then, transpose the array and scan to compute the offsets

[1, 2, 1, 1, 1, 0, 2, 1, 1, 2, 0, 0, 1, 0, 1, 2]
[0, 1, 3, 4, 5, 6, 6, 8, 9, 10, 12, 12, 12, 13, 13, 14]

  • Lastly, move each element to the corresponding bucket

[∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅]
[1, ∅, ∅, ∅, ∅, 2, ∅, ∅, ∅, 3, ∅, ∅, 4, ∅, ∅, ∅]
[1, 1, 1, ∅, ∅, 2, ∅, ∅, ∅, 3, 3, 3, 4, ∅, ∅, ∅]
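The count, transpose, scan, and move steps can be sketched as follows, using a 16-element input with keys 1 to 4, four subarrays of size 4, and one bucket per key. The loops run sequentially here; in a parallel implementation each subarray would be handled by its own task. The names distribute, sub_size, and bucket_of are hypothetical.

```python
def distribute(a, num_buckets, sub_size, bucket_of):
    """Count per subarray, transpose, scan for offsets, then scatter."""
    subarrays = [a[i:i + sub_size] for i in range(0, len(a), sub_size)]
    num_sub = len(subarrays)

    # counts[s][b] = number of elements of subarray s that go to bucket b.
    counts = [[0] * num_buckets for _ in range(num_sub)]
    for s, sub in enumerate(subarrays):
        for x in sub:
            counts[s][bucket_of(x)] += 1

    # Transpose to bucket-major order and flatten.
    transposed = [counts[s][b] for b in range(num_buckets)
                  for s in range(num_sub)]

    # Exclusive scan: offsets[b * num_sub + s] is where subarray s starts
    # writing inside bucket b.
    offsets, total = [], 0
    for c in transposed:
        offsets.append(total)
        total += c

    # Move each element to its slot; subarrays never write to the same slot.
    out = [None] * len(a)
    for s, sub in enumerate(subarrays):
        cursor = [offsets[b * num_sub + s] for b in range(num_buckets)]
        for x in sub:
            b = bucket_of(x)
            out[cursor[b]] = x
            cursor[b] += 1
    return counts, transposed, offsets, out


a = [1, 3, 2, 4, 1, 3, 1, 3, 1, 2, 2, 4, 1, 2, 4, 4]
counts, transposed, offsets, out = distribute(a, 4, 4, lambda x: x - 1)
print(offsets)  # [0, 1, 3, 4, 5, 6, 6, 8, 9, 10, 12, 12, 12, 13, 13, 14]
print(out)      # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```

The offsets depend only on the counts, not on the order of elements inside a subarray, which is why unsorted subarrays give the same result.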

slide-32
SLIDE 32

Takeaways for semisort

  • Semisort is very useful
      • Implements bucket and integer sort, and can be applied even to a large key range
  • Theoretically takes linear work and O(log n) depth, although in this lecture I talked about a simpler version that does not have either bound
  • The key insight is the partition into heavy and light keys
      • Heavy keys have their own buckets, which can be large but need no further sorting
      • Light keys are grouped based on ranges. Since the keys are hashed, the light buckets are small (each contains O(n/t) elements, analysis in [GSSB15])

slide-33
SLIDE 33

CS260: Algorithm Engineering, Lecture 6

  • Review of Samplesort
  • Semisort
  • Course Policy

slide-34
SLIDE 34

Paper Reading and Course Presentation


slide-35
SLIDE 35

Paper Reading and Course Presentation

  • 10 students have reserved the papers for reading and presenting
  • If Paper 8 is not reserved, Yunshu will present it on 4/27
  • Deadlines, instructions and schedules are on the course webpage and iLearn

slide-36
SLIDE 36

Course Presentation

  • Each of you will give a 22-minute talk and have a 5-minute Q&A. Time management is crucial.
  • Tomorrow, I will upload Prof. Sun's lecture on how to give a clear talk. It is mandatory to study it before your presentation.
  • Meanwhile, I will also attach a speaking skill evaluation form that is used to evaluate your talk.
  • You should check it before you give the presentation

slide-37
SLIDE 37

Preparation for Course Presentation

  • It's highly recommended to give 2-3 practice talks to your friends and/or classmates before your presentation, in order to guarantee that everything you say makes sense and is understandable.
  • Otherwise you are just wasting everyone's time. Let's not do that, since it's embarrassing.
  • You're obliged to submit mostly-done slides to Yan 48h ahead, as well as the corresponding paper reading.

slide-38
SLIDE 38

Quiz


slide-39
SLIDE 39

Quiz

  • The quiz is on 4/24. I will send each of you a Google Doc, and you should answer in it.
  • Don't write in other apps and copy and paste into it. Google Docs keeps track of all your edits (so please don't cheat).
  • Cheating on the quiz/exam is fatal. Please don't make me handle that.
  • It counts for only 10% of the score. Don't panic.
  • It is open-book, but you should still review the lectures, since the quiz lasts 1 hour and you might not have time to search for each problem.

slide-40
SLIDE 40

Midterm and Final Project


slide-41
SLIDE 41

Midterm Project

  • Due on April 29, so you still have more than 2 weeks.
  • It's a hard deadline; if you feel short of time, submit what you have at that time
  • Pre-proposal meeting is on May 1, and final proposal is due on May 4
  • You should start now, and meanwhile, you should expect to spend at least two days writing the report
  • Writing a good report can largely increase your score

slide-42
SLIDE 42

Final Project

  • Pre-proposal meeting: 5/1
  • Proposal: 5/4
  • Weekly progress report 1: 5/13
  • Milestone: 5/22
  • Weekly progress report 2: 5/29
  • Final project presentation: 6/1-5
  • Final report due: 6/8

slide-43
SLIDE 43

Final Project: Score Breakdown

  • Proposal: 10%
  • Weekly progress report 1: 5%
  • Milestone: 10%
  • Weekly progress report 2: 5%
  • Final project presentation: 20%
  • Final report: 50%

slide-44
SLIDE 44

Milestone and Final project

  • Milestone: 5-minute talk for each student; discuss the progress and whether you meet the goals in the proposal
  • Final project presentation: 20+5 (Q&A) minutes for each student; talk about your work like the paper presentation