

  1. Massive Data Algorithmics
     Gerth Stølting Brodal, Aarhus University
     Forskningsdag for Datamatikerlærere (Research Day for Datamatiker Lecturers), Erhvervsakademiet Lillebælt, Vejle, November 2, 2010

  2. [Timeline figure: education and career, milestones spanning 1969-2004. M.Sc. and PhD at AU (advisor Erik Meineche Schmidt), Post Doc at MPII (with Kurt Mehlhorn), then faculty at AU.]

  3. Outline of Talk
     • MADALGO: who, where, what? Research areas
     • External memory algorithmics: models; searching and sorting
     • Fault-tolerant searching
     • Flow simulation

  4. Where? [map slide with a "here" marker]

  5. Center for Massive Data Algorithmics (MADALGO)
     • Lars Arge, Professor, Center leader
     • Gerth S. Brodal, Associate Professor
     • 5 Post Docs, 10 PhD students, 4 TAP
     • Total budget for 5 years ca. 60 million DKK
     Participating sites and people: AU (Arge, Brodal), MIT (Demaine, Indyk), MPII (Mehlhorn), Frankfurt (Meyer)

  6. Faculty: Lars Arge, Gerth Stølting Brodal
     Researchers and PhD students: Henrik Blunck, Lasse Kosetski Deleuran, Jakob Truelsen, Brody Sandel, Freek van Walderveen, Kostas Tsakalidis, Nodari Sitchinava, Casper Kejlberg-Rasmussen, Mark Greve, Elad Verbin, Kasper Dalgaard Larsen, Morten Revsbæk, Qin Zhang, Jesper Erenskjold Moeslund, Pooya Davoodi

  7. PhD Education @ AU [diagram: the Danish "5+3", "4+4" and "3+5" degree models across the 80s, 90s and 00s, years 1-8: Bachelor, then MSc / PhD Part A (with merit transfer), then PhD Part B; the 80s had the Licentiat (PhD) degree, and datamatiker appears as an entry path]

  8. PhD Education @ MADALGO [diagram: the "3+5" model, years 1-8, with 6 months abroad and merit transfer; students placed by stage: Kasper around the Bachelor level; Mark, Jakob, Lasse, Kostas and Casper around MSc / PhD Part A; Morten, Pooya and Freek in PhD Part B]

  9. High level objectives:
     • Advance algorithmic knowledge in the massive data processing area
     • Train researchers in a world-leading international environment
     • Be a catalyst for multidisciplinary and industry collaboration
     Building on:
     • A strong international team
     • A vibrant international environment (focus on people)

  10. Massive Data
     • Pervasive use of computers and sensors
     • Increased ability to acquire/store/process data → massive data collected everywhere
     • Society increasingly “data driven” → access/process data anywhere, any time
     • Nature special issues: 2/06 “2020 – Future of computing”; 9/08 “Big Data”
     • Scientific data size growing exponentially, while quality and availability improve
     • Paradigm shift: science will be about mining data → computer science paramount in all sciences

  11. Massive Data (same slide, with an added overlay): obviously not only in the sciences. Economist 02/10: from 150 billion gigabytes five years ago to 1,200 billion today; managing the data deluge is difficult, but doing so will transform business and public life.

  12. Example: Massive Terrain Data
     • New technologies make it much easier/cheaper to collect detailed data
     • Previous “manual” or radar-based methods: often 30 meters between data points, sometimes 10-meter data available
     • New laser scanning methods (LIDAR): less than 1 meter between data points, centimeter accuracy (previously meters)
     • Denmark: ~2 million points at 30 meters (<1 GB), ~18 billion points at 1 meter (>1 TB)

  13. Research areas [figure]: Streaming Algorithms • I/O-Efficient Algorithms • Cache-Oblivious Algorithms • Algorithm Engineering

  14. The problem... [plot: running time vs. input size. A “normal” algorithm’s running time degrades sharply once the input outgrows main memory, while an I/O-efficient algorithm scales smoothly; the bottleneck is the memory size.]

  15. Memory Hierarchies [figure: Processor → L1 → L2 → L3 → RAM → Disk, with increasing access times and memory sizes down the hierarchy; the RAM-to-disk boundary is the bottleneck]

  16. Memory Hierarchies vs. Running Time [plot: running time vs. input size, with a jump each time the input outgrows L1, L2, L3, and finally RAM]

  17. Disk Mechanics [figure: read/write head, read/write arm, track, magnetic surface]
     “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

  18. Disk Mechanics [figure as before]
     • I/O is often the bottleneck when handling massive datasets
     • Disk access is 10^7 times slower than main memory access!
     • Disk systems try to amortize the large access time by transferring large contiguous blocks of data
     • Data must be stored and accessed so as to take advantage of blocks!

  19. Memory Access Times

     Level      Latency   Relative to CPU
     --------   -------   ---------------
     Register   0.5 ns    1
     L1 cache   0.5 ns    1-2
     L2 cache   3 ns      2-7
     DRAM       150 ns    80-200
     TLB        500+ ns   200-2000
     Disk       10 ms     10^7

  20. I/O-Efficient Algorithms Matter
     • Example: traversing a linked list (list ranking), with array size N = 10 elements, disk block size B = 2 elements, main memory size M = 4 elements (2 blocks)
       – Algorithm 1 (elements laid out in scattered order 1 5 2 6 3 8 9 4 7 10): N = 10 I/Os
       – Algorithm 2 (elements laid out blockwise as 1 2 | 10 9 | 5 6 | 3 4 | 8 7): N/B = 5 I/Os
       (the sketch below replays this count)
     • The difference between N and N/B is large since the block size is large
       – Example: N = 256×10^6, B = 8000, 1 ms disk access time
       – N I/Os take 256×10^3 sec ≈ 4266 min ≈ 71 hours
       – N/B I/Os take 256×10^6/8000 ms = 32 sec
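     To make the counting concrete, here is a minimal sketch of the toy example. The LRU cache standing in for main memory and all names are our own illustration, not code from the talk:

```python
from collections import OrderedDict

def count_ios(array, visit_order, B, memory_blocks):
    """Count disk-block reads while visiting elements in visit_order,
    given their positions in `array`, block size B, and an LRU cache
    of `memory_blocks` blocks standing in for main memory."""
    pos = {elem: i for i, elem in enumerate(array)}
    cache = OrderedDict()              # block id -> None, in LRU order
    ios = 0
    for elem in visit_order:
        block = pos[elem] // B
        if block in cache:
            cache.move_to_end(block)   # hit: block already in memory
        else:
            ios += 1                   # miss: one disk I/O
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)   # evict least recently used
    return ios

order = list(range(1, 11))                   # follow the list 1, 2, ..., 10
scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]  # Algorithm 1 layout
blocked   = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]  # Algorithm 2 layout
print(count_ios(scattered, order, B=2, memory_blocks=2))  # 10 = N I/Os
print(count_ios(blocked,   order, B=2, memory_blocks=2))  # 5 = N/B I/Os
```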

  21. I/O-Efficient Scanning [figure: array A of N elements read block by block, block size B]
     Scanning an array of N elements costs O(N/B) I/Os.
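     A minimal sketch of block-wise scanning; the 8 KiB block size and the record format are assumptions for illustration only:

```python
RECORD = 8                   # assumed: 8-byte little-endian integers
BLOCK_BYTES = 8192           # assumed disk block size, so B = 1024 records

def scan_sum(path):
    """Sum all records in a file; each read() fetches a whole block,
    so N records cost roughly N/B block transfers."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_BYTES)      # one block = one I/O
            if not block:
                break
            for i in range(0, len(block), RECORD):
                total += int.from_bytes(block[i:i + RECORD], "little")
    return total
```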

  22. External-Memory Merging [figure: a k-way merger streaming k sorted input runs from disk and writing one merged output run]
     Merging k sorted sequences with N elements in total requires O(N/B) I/Os, provided k ≤ M/B − 1 (one memory block per input run plus one for the output).
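     A sketch of the k-way merger using a size-k heap; plain Python iterators stand in for the block-buffered run readers of the external version:

```python
import heapq

def kway_merge(runs):
    """Yield the merged output of k sorted runs. In the external version
    each run is streamed one block at a time, so the merge needs only
    k + 1 memory blocks and O(N/B) I/Os in total."""
    heap = []
    for i, run in enumerate(runs):
        it = iter(run)
        for first in it:                  # seed heap with each run's head
            heap.append((first, i, it))
            break
    heapq.heapify(heap)
    while heap:
        value, i, it = heapq.heappop(heap)
        yield value
        for nxt in it:                    # refill from the run just popped
            heapq.heappush(heap, (nxt, i, it))
            break

print(list(kway_merge([[2, 3, 5, 6, 9], [1, 4, 7, 10, 14], [8, 12, 16]])))
```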

  23. External-Memory Sorting (MergeSort)
     • Partition the unsorted input of N elements into N/M runs of size M
     • Sort each run in main memory
     • Merge the sorted runs in passes (Merge pass I, II, ...) until a single sorted output remains
     • MergeSort uses O(N/B · log_{M/B}(N/B)) I/Os
     • In practice the number of I/Os is 4-6× that of scanning the input
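     A compact sketch of the two phases, with in-memory lists standing in for disk runs (M here counts elements, and we assume the run count stays below M/B − 1 so a single merge pass suffices):

```python
import heapq

M = 4   # toy "main memory" capacity, in elements

def external_mergesort(data):
    # Phase 1: partition into N/M runs and sort each run in memory.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    # Phase 2: merge all runs; heapq.merge streams them the way the
    # k-way merger on the previous slide does.
    return list(heapq.merge(*runs))

print(external_mergesort([5, 2, 9, 1, 7, 3, 8, 6, 4, 0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```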

  24. B-trees – The Basic Searching Structure [figure: B-tree with fan-out B, top levels resident in internal memory, root-to-leaf search path highlighted]
     • Searches: in practice 4-5 I/Os
     • Repeated searching, with the top levels cached in internal memory: in practice 1-2 I/Os

  25. B-trees [same figure]
     • Searches: O(log_B N) I/Os – best possible
     • Updates: O(log_B N) I/Os – can updates be done faster?
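     A toy sketch of why searches cost O(log_B N) I/Os: each node fetched from disk is one block holding up to B keys, so the cost is the root-to-leaf path length, log_B N. The node layout below is hypothetical, not the slide's:

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # up to B sorted keys = one disk block
        self.children = children    # None for a leaf

def search(node, key):
    """Return (found, blocks_read); each node visit is one block I/O,
    so the count is bounded by the tree height, O(log_B N)."""
    ios = 0
    while node is not None:
        ios += 1
        if key in node.keys:
            return True, ios
        if node.children is None:
            return False, ios
        node = node.children[bisect_right(node.keys, key)]

root = Node([3, 6], [Node([1, 2]), Node([4, 5]), Node([7, 8])])
print(search(root, 5))   # (True, 2): root plus one leaf
```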

  26. Brodal and Fagerberg (2003): B-trees with Buffered Updates [figure: tree nodes augmented with buffers of pending updates, size ~√B]
     • Searches cost O(log_B N) I/Os
     • N updates cost O(N/√B · log_B N) I/Os, i.e. a factor ~√B cheaper than in a plain B-tree
     • This trade-off between search and update times is optimal!

  27. Brodal and Fagerberg (2003): B-trees with Buffered Updates [figure]

  28. Hedegaard (2004): B-trees with Buffered Updates – Experimental Study
     • 100,000,000 elements
     • Search time basically unchanged with buffers
     • Updates 100 times faster
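     The reported ~100× lines up with the √B saving from slide 26. As a back-of-the-envelope check (the block size of 8000 elements is borrowed from the earlier list-ranking example, not from this slide):

```python
import math

B = 8_000          # assumed elements per disk block
N = 100_000_000    # elements in the experiment

plain_btree = N * math.log(N, B)                 # O(N log_B N) update I/Os
buffered    = N / math.sqrt(B) * math.log(N, B)  # O(N/sqrt(B) * log_B N)
print(f"predicted speedup ~ {plain_btree / buffered:.0f}x")   # ~89x
```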
