

  1. Massive Data Algorithmics
     Gerth Stølting Brodal, Aarhus University
     Forskningsdag for Datamatikerlærere (Research Day for Datamatiker Lecturers), Erhvervsakademiet Lillebælt, Vejle, November 2, 2010

  2. [Timeline figure: education and career, milestones spanning 1969-2004. M.Sc. and PhD at AU (advisor Erik Meineche Schmidt), Post Doc at MPII (with Kurt Mehlhorn), then faculty at AU.]

  3. Outline of Talk
     • MADALGO: who, where, what? Research areas
     • External memory algorithmics: models; searching and sorting
     • Fault-tolerant searching
     • Flow simulation

  4. Where? [map slide with a "here" marker]

  5. Center for Massive Data Algorithmics (MADALGO)
     • Lars Arge, Professor, Center leader
     • Gerth S. Brodal, Associate Professor
     • 5 Post Docs, 10 PhD students, 4 TAP
     • Total budget for 5 years ca. 60 million DKK
     Participating sites and people: AU (Arge, Brodal), MIT (Demaine, Indyk), MPII (Mehlhorn), Frankfurt (Meyer)

  6. Faculty: Lars Arge, Gerth Stølting Brodal
     Researchers and PhD students: Henrik Blunck, Lasse Kosetski Deleuran, Jakob Truelsen, Brody Sandel, Freek van Walderveen, Kostas Tsakalidis, Nodari Sitchinava, Casper Kejlberg-Rasmussen, Mark Greve, Elad Verbin, Kasper Dalgaard Larsen, Morten Revsbæk, Qin Zhang, Jesper Erenskjold Moeslund, Pooya Davoodi

  7. PhD Education @ AU [diagram: the Danish "5+3", "4+4" and "3+5" degree models across the 80s, 90s and 00s, years 1-8: Bachelor, then MSc / PhD Part A (with merit transfer), then PhD Part B; the 80s had the Licentiat (PhD) degree, and datamatiker appears as an entry path]

  8. PhD Education @ MADALGO [diagram: the "3+5" model, years 1-8, with 6 months abroad and merit transfer; students placed by stage: Kasper around the Bachelor level; Mark, Jakob, Lasse, Kostas and Casper around MSc / PhD Part A; Morten, Pooya and Freek in PhD Part B]

  9. High level objectives:
     • Advance algorithmic knowledge in the massive data processing area
     • Train researchers in a world-leading international environment
     • Be a catalyst for multidisciplinary and industry collaboration
     Building on:
     • A strong international team
     • A vibrant international environment (focus on people)

  10. Massive Data
     • Pervasive use of computers and sensors
     • Increased ability to acquire/store/process data → massive data collected everywhere
     • Society increasingly “data driven” → access/process data anywhere, any time
     • Nature special issues: 2/06 “2020 – Future of computing”; 9/08 “Big Data”
     • Scientific data size growing exponentially, while quality and availability improve
     • Paradigm shift: science will be about mining data → computer science paramount in all sciences

  11. Massive Data (same slide, with an added overlay): obviously not only in the sciences. Economist 02/10: from 150 billion gigabytes five years ago to 1,200 billion today; managing the data deluge is difficult, but doing so will transform business and public life.

  12. Example: Massive Terrain Data
     • New technologies make it much easier/cheaper to collect detailed data
     • Previous “manual” or radar-based methods: often 30 meters between data points, sometimes 10-meter data available
     • New laser scanning methods (LIDAR): less than 1 meter between data points, centimeter accuracy (previously meters)
     • Denmark: ~2 million points at 30 meters (<1 GB), ~18 billion points at 1 meter (>1 TB)

  13. Research areas [figure]: Streaming Algorithms • I/O-Efficient Algorithms • Cache-Oblivious Algorithms • Algorithm Engineering

  14. The problem... [plot: running time vs. input size. A “normal” algorithm’s running time degrades sharply once the input outgrows main memory, while an I/O-efficient algorithm scales smoothly; the bottleneck is the memory size.]

  15. Memory Hierarchies [figure: Processor → L1 → L2 → L3 → RAM → Disk, with increasing access times and memory sizes down the hierarchy; the RAM-to-disk boundary is the bottleneck]

  16. Memory Hierarchies vs. Running Time [plot: running time vs. input size, with a jump each time the input outgrows L1, L2, L3, and finally RAM]

  17. Disk Mechanics [figure: read/write head, read/write arm, track, magnetic surface]
     “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

  18. Disk Mechanics [figure as before]
     • I/O is often the bottleneck when handling massive datasets
     • Disk access is 10^7 times slower than main memory access!
     • Disk systems try to amortize the large access time by transferring large contiguous blocks of data
     • Data must be stored and accessed so as to take advantage of blocks!

  19. Memory Access Times

     Level      Latency   Relative to CPU
     --------   -------   ---------------
     Register   0.5 ns    1
     L1 cache   0.5 ns    1-2
     L2 cache   3 ns      2-7
     DRAM       150 ns    80-200
     TLB        500+ ns   200-2000
     Disk       10 ms     10^7

  20. I/O-Efficient Algorithms Matter
     • Example: traversing a linked list (list ranking), with array size N = 10 elements, disk block size B = 2 elements, main memory size M = 4 elements (2 blocks)
       – Algorithm 1 (elements laid out in scattered order 1 5 2 6 3 8 9 4 7 10): N = 10 I/Os
       – Algorithm 2 (elements laid out blockwise as 1 2 | 10 9 | 5 6 | 3 4 | 8 7): N/B = 5 I/Os
       (the sketch below replays this count)
     • The difference between N and N/B is large since the block size is large
       – Example: N = 256×10^6, B = 8000, 1 ms disk access time
       – N I/Os take 256×10^3 sec ≈ 4266 min ≈ 71 hours
       – N/B I/Os take 256×10^6/8000 ms = 32 sec
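     To make the counting concrete, here is a minimal sketch of the toy example. The LRU cache standing in for main memory and all names are our own illustration, not code from the talk:

```python
from collections import OrderedDict

def count_ios(array, visit_order, B, memory_blocks):
    """Count disk-block reads while visiting elements in visit_order,
    given their positions in `array`, block size B, and an LRU cache
    of `memory_blocks` blocks standing in for main memory."""
    pos = {elem: i for i, elem in enumerate(array)}
    cache = OrderedDict()              # block id -> None, in LRU order
    ios = 0
    for elem in visit_order:
        block = pos[elem] // B
        if block in cache:
            cache.move_to_end(block)   # hit: block already in memory
        else:
            ios += 1                   # miss: one disk I/O
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)   # evict least recently used
    return ios

order = list(range(1, 11))                   # follow the list 1, 2, ..., 10
scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]  # Algorithm 1 layout
blocked   = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]  # Algorithm 2 layout
print(count_ios(scattered, order, B=2, memory_blocks=2))  # 10 = N I/Os
print(count_ios(blocked,   order, B=2, memory_blocks=2))  # 5 = N/B I/Os
```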

  21. I/O-Efficient Scanning [figure: array A of N elements read block by block, block size B]
     Scanning an array of N elements costs O(N/B) I/Os.
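     A minimal sketch of block-wise scanning; the 8 KiB block size and the record format are assumptions for illustration only:

```python
RECORD = 8                   # assumed: 8-byte little-endian integers
BLOCK_BYTES = 8192           # assumed disk block size, so B = 1024 records

def scan_sum(path):
    """Sum all records in a file; each read() fetches a whole block,
    so N records cost roughly N/B block transfers."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_BYTES)      # one block = one I/O
            if not block:
                break
            for i in range(0, len(block), RECORD):
                total += int.from_bytes(block[i:i + RECORD], "little")
    return total
```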

  22. External-Memory Merging [figure: a k-way merger streaming k sorted input runs from disk and writing one merged output run]
     Merging k sorted sequences with N elements in total requires O(N/B) I/Os, provided k ≤ M/B − 1 (one memory block per input run plus one for the output).
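     A sketch of the k-way merger using a size-k heap; plain Python iterators stand in for the block-buffered run readers of the external version:

```python
import heapq

def kway_merge(runs):
    """Yield the merged output of k sorted runs. In the external version
    each run is streamed one block at a time, so the merge needs only
    k + 1 memory blocks and O(N/B) I/Os in total."""
    heap = []
    for i, run in enumerate(runs):
        it = iter(run)
        for first in it:                  # seed heap with each run's head
            heap.append((first, i, it))
            break
    heapq.heapify(heap)
    while heap:
        value, i, it = heapq.heappop(heap)
        yield value
        for nxt in it:                    # refill from the run just popped
            heapq.heappush(heap, (nxt, i, it))
            break

print(list(kway_merge([[2, 3, 5, 6, 9], [1, 4, 7, 10, 14], [8, 12, 16]])))
```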

  23. External-Memory Sorting (MergeSort)
     • Partition the unsorted input of N elements into N/M runs of size M
     • Sort each run in main memory
     • Merge the sorted runs in passes (Merge pass I, II, ...) until a single sorted output remains
     • MergeSort uses O(N/B · log_{M/B}(N/B)) I/Os
     • In practice the number of I/Os is 4-6× that of scanning the input
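     A compact sketch of the two phases, with in-memory lists standing in for disk runs (M here counts elements, and we assume the run count stays below M/B − 1 so a single merge pass suffices):

```python
import heapq

M = 4   # toy "main memory" capacity, in elements

def external_mergesort(data):
    # Phase 1: partition into N/M runs and sort each run in memory.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    # Phase 2: merge all runs; heapq.merge streams them the way the
    # k-way merger on the previous slide does.
    return list(heapq.merge(*runs))

print(external_mergesort([5, 2, 9, 1, 7, 3, 8, 6, 4, 0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```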

  24. B-trees – The Basic Searching Structure [figure: B-tree with fan-out B, top levels resident in internal memory, root-to-leaf search path highlighted]
     • Searches: in practice 4-5 I/Os
     • Repeated searching, with the top levels cached in internal memory: in practice 1-2 I/Os

  25. B-trees [same figure]
     • Searches: O(log_B N) I/Os – best possible
     • Updates: O(log_B N) I/Os – can updates be done faster?
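     A toy sketch of why searches cost O(log_B N) I/Os: each node fetched from disk is one block holding up to B keys, so the cost is the root-to-leaf path length, log_B N. The node layout below is hypothetical, not the slide's:

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # up to B sorted keys = one disk block
        self.children = children    # None for a leaf

def search(node, key):
    """Return (found, blocks_read); each node visit is one block I/O,
    so the count is bounded by the tree height, O(log_B N)."""
    ios = 0
    while node is not None:
        ios += 1
        if key in node.keys:
            return True, ios
        if node.children is None:
            return False, ios
        node = node.children[bisect_right(node.keys, key)]

root = Node([3, 6], [Node([1, 2]), Node([4, 5]), Node([7, 8])])
print(search(root, 5))   # (True, 2): root plus one leaf
```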

  26. Brodal and Fagerberg (2003): B-trees with Buffered Updates [figure: tree nodes augmented with buffers of pending updates, size ~√B]
     • Searches cost O(log_B N) I/Os
     • N updates cost O(N/√B · log_B N) I/Os, i.e. a factor ~√B cheaper than in a plain B-tree
     • This trade-off between search and update times is optimal!

  27. Brodal and Fagerberg (2003): B-trees with Buffered Updates [figure]

  28. Hedegaard (2004): B-trees with Buffered Updates – Experimental Study
     • 100,000,000 elements
     • Search time basically unchanged with buffers
     • Updates 100 times faster
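     The reported ~100× lines up with the √B saving from slide 26. As a back-of-the-envelope check (the block size of 8000 elements is borrowed from the earlier list-ranking example, not from this slide):

```python
import math

B = 8_000          # assumed elements per disk block
N = 100_000_000    # elements in the experiment

plain_btree = N * math.log(N, B)                 # O(N log_B N) update I/Os
buffered    = N / math.sqrt(B) * math.log(N, B)  # O(N/sqrt(B) * log_B N)
print(f"predicted speedup ~ {plain_btree / buffered:.0f}x")   # ~89x
```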
