OS Support for a Commodity Database on PC Clusters:
Distributed Devices vs. Distributed File Systems

Felix Rauch (National ICT Australia)
Thomas M. Stricker (Google Inc., USA)

Laboratory for Computer Systems, ETH Zurich, Switzerland
Commodity Solutions for OLAP Workloads

TPC-D data model: Customer, Nation, Region, Supplier, Part, PartSupp, Order, LineItem.
Database size: 10-100 GByte.

What kind of system architectures are suitable for this type of workload?
Platforms

Traditionally: Symmetric Multiprocessor (SMP), e.g. DEC 8400.
More recently: Cluster of commodity PCs, e.g. the Patagonia multi-use cluster at ETH Zurich.
Killer SMPs vs. Clusters of PCs

[Diagram: an SMP with processors and caches sharing memory and disks over one bus, next to a cluster of commodity PCs with per-node memory and disks joined by a network.]

Killer SMP: killer performance, but a killing price...
Cluster of commodity PCs: killer price, but killing performance?
Overview

• Introduction
• Motivation
• Distributed storage architectures
• Evaluation
• Analysis of results
• Alternative: Middleware
• Conclusion
Research Goal

Turn PC clusters into "killer SMPs" for OLAP. Combine the excess storage and high-speed network already available on cluster nodes. Provide a transparent distributed storage architecture as the database's storage backend for OLAP applications.

System architect's point of view: focus on performance and understanding.
Storage Architectures for Clusters of PCs

Traditional:
• Big server with RAID
• Storage-area networks (SAN)
• Network-attached storage (NAS)
→ Additional hardware and costs

Our proposed alternative: use the available commodity hardware and distribute the data in software layers.
Why Should Such an Architecture Work?

Commodity hardware and software (OS) allow high cost effectiveness.

Trends:
• Disks becoming larger and cheaper
• Built-in high-speed networks
Large Hard-Disk Drives

[Chart: median disk size vs. full OS size (incl. applications), in GByte, over the survey years 1998-2004; disk sizes climb towards 120-140 GByte while installed OS sizes stay far below, leaving ever more excess capacity on each node.]
High-Speed Network

[Chart: throughput in MByte/s, log scale, over the years 1995-2005, comparing Ethernet generations (Fast Ethernet, Gigabit Ethernet, 10 Gigabit Ethernet) with the maximum disk throughput.]

→ Enough bandwidth to support distributed storage.
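A rough sanity check behind that claim (link and disk rates assumed from typical hardware of the time, not read off the chart):

\[
b_{\mathrm{GbE}} \approx \frac{1000\ \mathrm{Mbit/s}}{8} \approx 125\ \mathrm{MByte/s} \;\gg\; b_{\mathrm{disk}} \approx 40\ \mathrm{MByte/s},
\]

so a single Gigabit Ethernet link can absorb the sequential streams of roughly three commodity disks.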
Our Scenario

[Diagram: on the left, a parallel file system for high-performance computing: many compute nodes served by I/O nodes; scalable (Lustre, PVFS). On the right, our scenario: a single DB node served by I/O nodes through a distributed file system (network RAID0) to boost DB performance.]
Alternative Systems

• Petal [Lee & Thekkath, 1996]: Distributed virtual disks with special emphasis on dynamic reconfiguration and load balancing.
• Frangipani [Thekkath, Mann & Lee, 1997]: Distributed file system that builds on Petal.
• Lustre [Cluster File Systems, Inc.]: Object-oriented file system for large clusters.
Investigated Architectures

Fast Network Block Device (FNBD):
• Maps a hard-disk device over the network
• No intelligence, but highly optimised

Parallel Virtual File System (PVFS):
• Integrates the nodes' disks into a parallel FS
• Fully featured file system
Fast Network Block Device (FNBD)

• Loosely based on the Linux network block device
• Implemented as kernel modules
• Maps remote disk blocks over Gigabit Ethernet (from 3 servers)
• Uses hardware features of the commodity network interface to implement zero copy
• Multiple instances combine into a RAID0-like array of networked disks (sketched below)
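A minimal user-space sketch of the block-mapping idea (names, port and message format are ours for illustration; the real FNBD is a set of in-kernel modules that exploit NIC hardware for zero copy):

```python
import socket, struct

BLOCK = 4096
REQ = struct.Struct("!QI")  # request header: byte offset, length

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf

def serve(backing_path, port=9010):
    # Server node: export a backing file (or raw device) as a "disk".
    srv = socket.create_server(("", port))
    with open(backing_path, "rb") as disk:
        while True:
            conn, _ = srv.accept()
            with conn:
                offset, length = REQ.unpack(recv_exact(conn, REQ.size))
                disk.seek(offset)
                conn.sendall(disk.read(length))

def read_block(servers, blockno, port=9010):
    # Client node: RAID0-like striping, block i lives on server i % N.
    host = servers[blockno % len(servers)]
    with socket.create_connection((host, port)) as s:
        s.sendall(REQ.pack((blockno // len(servers)) * BLOCK, BLOCK))
        return recv_exact(s, BLOCK)
```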
Parallel Virtual File System (PVFS)

• Widely used on PC clusters
• Implemented as a dynamically linked library
• Fully featured distributed file system
• Can be accessed by any participating node
• Combines special directories on the server nodes into one large file system (striping sketched below)
• 6 servers in our setup, due to disk-space limitations
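How a RAID0-style striped read splits across I/O servers, as a hedged sketch (stripe size and round-robin layout are our assumptions; PVFS's real layout metadata is more general):

```python
STRIPE = 64 * 1024  # bytes per stripe unit (illustrative)

def split_read(offset, length, n_servers):
    """Yield (server, server_local_offset, chunk_length) for one logical read."""
    end = offset + length
    while offset < end:
        stripe_no = offset // STRIPE
        server = stripe_no % n_servers            # round-robin placement
        local = (stripe_no // n_servers) * STRIPE + offset % STRIPE
        chunk = min(STRIPE - offset % STRIPE, end - offset)
        yield server, local, chunk
        offset += chunk

# Example: a 256-KByte read starting at offset 32 KByte, striped over 6 servers.
for srv, loc, chk in split_read(32 * 1024, 256 * 1024, 6):
    print(f"server {srv}: read {chk} bytes at local offset {loc}")
```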
Architecture of Reference Case

[Diagram: a single node; the application(s) sit on the OS kernel's file system and disk driver, with local disk access.]
Architecture of FNBD

[Diagram: on the client node, the application(s) run on the OS kernel's file system, which sits on the distributed device driver (client part); on each server node, the distributed device driver (server part) sits on the local disk driver. Together they form the Fast Network Block Device.]
Architecture of PVFS

[Diagram: on the client node, the application(s) access files through the user-space PVFS library above the OS kernel; on each server node, a user-space PVFS server daemon runs on top of the local file system and disk driver. Together they form the Parallel Virtual File System.]
A Stream-Based Analytic Model

Presented at the EuroPar 2000 conference. Considers the flow of the data stream and the limits of the building blocks it passes through.

→ A set of (in)equations; solve to find the maximal throughput of the stream.

Simple, and works well for large data streams.
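The flavour of the model, in our own notation (an illustration, not the exact equations of the paper): a stream passing through stages with bandwidth limits \(b_i\) satisfies

\[
T \le b_i \ \text{for every stage } i \qquad\Rightarrow\qquad T_{\max} = \min_i b_i ,
\]

and stages contending for one shared resource (e.g. several memory copies on the same bus) are combined into a single constraint such as \(\sum_{i \in \text{shared}} T / b_i \le 1\).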
Modelling the Workload

Need to know the performance characteristics of all involved building blocks.

• Easy for small and simple parts (HW, OS functionality): measurements or data sheets.
• Very difficult for complex, closed software (RDBMS): black box.

→ Calibrate the model with known queries.
Calibration of Database Performance

Two cases:
• "Simple" case: full table scan (find max.)
• "Complex" case: scan including CPU work (sort)

Experimental calibration with data in RAM (used in the worked example below):
• 140 MByte/s throughput for the simple case
• 7.75 MByte/s throughput for the complex case
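Plugging these numbers into the min-of-stages model shows which queries can benefit from faster storage at all (the disk and network figures below are assumed round values; only the two DB throughputs are calibrated):

```python
def bottleneck(stages):
    """Return the (name, bandwidth) of the slowest stage."""
    return min(stages.items(), key=lambda kv: kv[1])

simple   = {"disk": 40.0, "network": 100.0, "db_scan": 140.0}  # MByte/s
complex_ = {"disk": 40.0, "network": 100.0, "db_sort": 7.75}

print(bottleneck(simple))    # ('disk', 40.0): faster storage should help
print(bottleneck(complex_))  # ('db_sort', 7.75): faster storage cannot help
```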
Modelling OLAP on FNBD

[Diagram: the data stream flows from the server-side disk driver through the FNBD driver (server part) and NIC driver, across the Gigabit/s network by DMA, into the client-side FNBD driver (client part), the file system and one kernel copy, then via a reduced copy over the application pipe into the RDBMS.]
Modelling OLAP on PVFS

[Diagram: the data stream flows from the server-side disk driver through the local file system into the user-space PVFS daemon, down through TCP/IP and the NIC, across the Gigabit/s network by DMA, up through the client's TCP/IP stack into the PVFS library and file system, and finally via a reduced copy over the application pipe into the RDBMS; the path contains several full memory copies.]
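The copy count is what separates the two diagrams: each full memory copy reads and writes the whole stream once, consuming memory-system bandwidth. A simplified shared-resource bound (our assumption, with an assumed memory-copy bandwidth; not the paper's exact equations):

```python
B_MEM = 400.0  # MByte/s, assumed memory-copy bandwidth of a PC of that era

def copy_limited(b_mem, n_copies):
    # Each copy moves the stream through the memory system twice (read + write).
    return b_mem if n_copies == 0 else b_mem / (2 * n_copies)

print(copy_limited(B_MEM, 1))  # one full copy (FNBD-like client path): 200.0
print(copy_limited(B_MEM, 3))  # three full copies (PVFS-like client path): ~66.7
```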
Evaluation Criteria

Small microbenchmark "speed":
• Throughput for large contiguous I/O operations with varying user-level block sizes.

Application benchmark TPC-D:
• Broad range of decision-support applications; long-running, complex ad-hoc queries.
• The newer TPC-H and TPC-R include updates.
Experimental Testbed

Multi-use cluster with 16 nodes, each with:
• Two 1-GHz Pentium III CPUs
• 512 MByte ECC SDRAM
• 2 x 9 GByte disk space
• 2 Gigabit Ethernet adapters
• Linux kernel 2.4.3
Microbenchmarks

[Chart: throughput in MByte/s (up to 45) over user-level block sizes of 4-256 KByte, for the reference case (single local disk), the distributed devices (FNBD, 3 servers) and the distributed file system (PVFS, 6 servers).]
Experimental Evaluation with OLAP

TPC-D decision support benchmark on ORACLE.

[Chart: speedup over the single-local-disk reference case (0-1.2) for TPC-D queries 1, 2, 3, 4, 6, 9, 10, 12, 13 and 17, for the distributed devices (FNBD, 3 servers) and the distributed file system (PVFS, 6 servers); one disk-limited query is highlighted.]
Quantitative Performance: Model vs. Measurements

[Chart: measured vs. modelled speedup over the single-local-disk reference case (0-1.6) for the simple query, the complex query and TPC-D query 4, for both the distributed devices (FNBD) and the distributed file system (PVFS).]
Analysis of Results

Performance was lower than expected:
• Aggregation of distributed disks did not increase application performance.
• The fully featured distributed file system failed to deliver decent performance.
• The stream-based analytic model is too simple for such a complex workload.
Alternative: Performance with TP-Lite Middleware

Data distribution in a middleware layer: TP-Lite by [Böhm et al., 2000] (sketched below)
• Distributes queries to multiple database servers in parallel
• Needs multiple servers (costs)
• Requires small changes to the application (not always possible)
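The core idea in a few lines (our sketch, not the actual TP-Lite code; `server.execute` stands in for an RDBMS call, and the column name is illustrative): run the same aggregate query on every server's partition in parallel and merge the partial results. Decomposable aggregates such as MAX merge trivially; others (e.g. AVG) need rewriting.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(server, query):
    # Placeholder for a real RDBMS call, e.g. over a DB-API connection.
    return server.execute(query)

def parallel_max(servers, query="SELECT MAX(l_extendedprice) FROM lineitem"):
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = pool.map(lambda s: scan_partition(s, query), servers)
    return max(partials)  # MAX of per-partition maxima = global MAX
```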
Modelling OLAP with TP-Lite

[Diagram: the data stream flows from the server-side disk driver through the local file system into a server-side RDBMS in user space, down through TCP/IP and the NIC with reduced copies and DMA, across the Gigabit/s network, and up through the client's TCP/IP stack into the client-side RDBMS via a reduced copy over the application pipe; most copies on this path are reduced.]