Persistent I/O: Challenges & Approaches
Angelos Bilas, FORTH (bilas@ics.forth.gr)
17-June-2011, TERENA TF on Storage
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
• Remarks
Application Stacks
• STREAM
• CumuloNimbo
Stream Global Architecture Picture
[Architecture diagram: applications (telephony and credit-card fraud detection, SLA violation detection, compliance, aggregation) running queries over profiles and COIs; StreamMine and StreamCloud layers with parallel stream operators, parallel DB and MapReduce operators, dynamic graphs, state fault tolerance, and self-provisioning; a communication and storage layer with mem-to-mem queues, compressed SSD communication, persistent streaming, and silent error detection.]
CumuloNimbo Global Architecture
[Architecture diagram: a JEE application server (JBoss + Hibernate) over an object cache (CumuloCache), a query engine (Derby), a column-oriented data store and block cache (HBase), and a distributed file system (HDFS); surrounding services include a self-provisioner, transaction management (concurrency controllers, commit sequencers, loggers), monitors, load balancers, communication, storage, and elasticity management.]
Application Stacks
• They tend to be complex
• Each layer adds substantial protocol “machinery”
  • E.g. transactions, global name space
• Today, I/O is a significant bottleneck
• Hard to know what all the layers do
• Questionable what can be modified realistically
• How can modern storage systems best support these stacks?
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
• Remarks
Dimension Infrastructure Properly
[Diagram: multicores + memory + I/O throughput; different flavors of PCs/blades; 1000s of application servers and 100s of file servers connected by high-speed interconnects (10-100 Gbit/s and 10-40 Gbit/s); disk controllers at ~2 GB/s; 12-36 SATA disks per node at 100 MB/s and ~2 TB each, plus ~10% SSD cache.]
• Dimensioning issues are not straightforward today
  • I/O application overheads are not understood
  • Do you balance thin or fat?
  • Other factors matter besides performance, e.g. power
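As a rough illustration of the thin-vs-fat question, using this slide's ballpark figures:

  36 disks/node x 100 MB/s = 3.6 GB/s of raw disk bandwidth per node,
  vs. ~2 GB/s sustained by the disk controller.

A "fat" node therefore strands roughly 45% of its disk bandwidth behind the controller, while a "thin" node of ~20 disks is balanced at the controller but needs more nodes (and more power) to reach the same capacity.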
Scaling I/O on Multicore CPUs
• Observation
  • As the number of cores increases in modern systems, we are not able to perform more I/O
  • Target: 1M IOPS, 10 GB/s
• Goal
  • Provide a scalable (virtualized) I/O stack over direct and networked storage devices
• Approach
  1. Performance and scaling analysis
  2. Hybrid hierarchies to take advantage of the potential
  3. Design for memory and synchronization issues
  4. Parallelism in the lower part of the networked I/O stack
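To put the target in perspective, a back-of-the-envelope calculation (the request sizes are assumptions, not stated on the slide):

  1M IOPS x 4 KB/request    = ~4 GB/s   (small random requests)
  10 GB/s at 128 KB/request = ~80K IOPS (large sequential requests)

The two numbers thus cover both the small-request and large-request ends of the workload spectrum.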
(1) Performance and Scaling Analysis
• Bottom-up analysis of the I/O path
• Controller
  • Actual controller, PCI, host drivers
• Block layer
  • SCSI, block
• Filesystem
  • xfs (a well-accepted fs)
  • vfs (integral linux part)
[Diagram of the virtualized I/O path: applications and middleware in guest user space; system calls, VFS + FS, and virtual drivers in the guest OS; system calls, VFS + FS, block devices, SCSI layers, HW device drivers, and the PCI driver in the host OS; PCI Express interconnect down to the network storage controller and disk controller.]
I/O Controller [Systor’10]
• (1) A queue protocol over PCI
  • Many parameters and quite complex
  • Requires decisions: tune for high throughput
• (2) Request translation on the controller
  • Memory management: balance between speed and waste
• (3) Request issue/completion towards devices
  • Use existing mechanisms but do careful scheduling
• Prototype comparable to commercial products
Results and Outlook
[Plots: DMA throughput, host-to-HBA and HBA-to-host, vs. transfer size (4-64 KB); impact of host-issued PIO on DMA throughput. Diagram: a head/tail queue shared between host and controller; the controller initiates DMAs and needs to know the tail at the host, while the host needs to know the head at the controller.]
• Throughput: each controller can achieve 2 GB/s bi-directionally
• IOPS: each controller can achieve ~80K IOPS
  • 50K for commercial controllers with full I/O processing
• Controller CPU is an important limitation
• Outlook
  • (1) Scale throughput and IOPS by using multiple controllers
  • (2) I/O controllers should be fused with the host CPU
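A minimal sketch of the head/tail queue idea in C. The field names, queue depth, and single-producer/single-consumer discipline are assumptions; the project's actual protocol (doorbells, DMA of the counters over PCI, batching) is not specified on the slide:

#include <stdatomic.h>
#include <stdint.h>

#define QDEPTH 256  /* must be a power of two */

struct io_request { uint64_t lba; uint32_t len; uint16_t op; uint16_t tag; };

/* One-producer (host) / one-consumer (controller) ring.
 * head: next slot the consumer will dequeue (written by consumer).
 * tail: next free slot the producer will fill (written by producer).
 * Each side publishes its counter to the other, as on the slide. */
struct ring {
    _Atomic uint32_t head, tail;    /* free-running counters */
    struct io_request slot[QDEPTH];
};

/* Host side: enqueue a request; returns 0 if the ring is full. */
static int ring_submit(struct ring *r, const struct io_request *req)
{
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == QDEPTH)            /* full: consumer has not caught up */
        return 0;
    r->slot[t & (QDEPTH - 1)] = *req;
    /* release: slot contents become visible before the new tail */
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

/* Controller side: dequeue the next request; returns 0 if empty. */
static int ring_consume(struct ring *r, struct io_request *out)
{
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t)
        return 0;
    *out = r->slot[h & (QDEPTH - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}

The "many parameters" on the previous slide map directly to choices here: queue depth, request size, and how often each side re-reads the other's counter all trade latency against throughput.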
Block Layer
• I/O request protocol translation, e.g. to SCSI
• Buffer management and placement
• Other layers involved; essentially a block-type operation
• Modern architecture trends create significant problems
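To make the translation step concrete, a minimal sketch of turning a block request into a SCSI READ(10) command descriptor block. The CDB layout is the standard one; everything around it (how the request arrives, where the CDB goes) is left out:

#include <stdint.h>
#include <string.h>

/* Build a SCSI READ(10) CDB (opcode 0x28) for a block request.
 * lba and nblocks are in device logical blocks; both big-endian. */
static void build_read10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                  /* READ(10) opcode */
    cdb[2] = (lba >> 24) & 0xff;    /* logical block address, MSB first */
    cdb[3] = (lba >> 16) & 0xff;
    cdb[4] = (lba >> 8) & 0xff;
    cdb[5] = lba & 0xff;
    cdb[7] = (nblocks >> 8) & 0xff; /* transfer length in blocks */
    cdb[8] = nblocks & 0xff;
}

The translation itself is cheap; the scaling problems on the next slide come from where it runs and where the buffers it touches live.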
Results and Outlook
[Plots: random I/O operations (read/write IOPS) and sequential I/O throughput (MB/s) vs. number of controllers (1-4); benchmark throughput under different placement/affinity configurations vs. number of benchmark instances.]
• Translation processing scales with the number of cores
  • Both throughput and IOPS
• I/O translation incurs overhead
• Affinity is an important problem
  • Wrong placement can reduce throughput almost to half
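A minimal sketch of the kind of placement control involved, using the standard Linux pthread affinity API to pin an I/O worker thread. The core number is a made-up example; the right mapping depends on the machine topology and where the controller's interrupts land:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling I/O worker thread to a single core, e.g. one on
 * the same socket/NUMA node as the controller servicing its queue.
 * Wrong placement (a remote socket) can nearly halve throughput. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_to_core(2);        /* core 2: assumed local to the HBA */
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    /* ... issue I/O from here; completions stay cache-local ... */
    return 0;
}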
Filesystem
• Complex layer
  • Many complain about FS performance on multicores
• Translates from a (request, file, offset, size) API to a (request, block#) API
• Responsible for recovery (the first layer to include extensive metadata in traditional systems)
• We include VFS in our analysis: additional complexity
• Detailed analysis with extensive modifications to the kernel
  • Required non-trivial instrumentation to measure lock and wait times
  • Extensive tuning to ensure that we measure “meaningful” cases
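The (file, offset) to block# translation can be observed from user space with the Linux FIBMAP ioctl; a minimal sketch (requires root, and assumes the filesystem supports FIBMAP):

#include <fcntl.h>
#include <linux/fs.h>      /* FIBMAP, FIGETBSZ */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int blksz = 0, blk = 0;             /* blk = logical block 0 of the file */
    ioctl(fd, FIGETBSZ, &blksz);        /* filesystem block size */
    if (ioctl(fd, FIBMAP, &blk) < 0) {  /* in: logical block, out: physical */
        perror("FIBMAP");
        return 1;
    }
    printf("file block 0 (%d-byte blocks) -> device block %d\n", blksz, blk);
    close(fd);
    return 0;
}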
Results and Outlook
[Plots: kfsmark CPU breakdown (1 MB files, 64 application threads) showing I/O-wait, user, system, interrupt, and idle time vs. number of CPUs; CREAT and READ ops/sec vs. number of CPUs (1-16), with one log per process vs. a shared log.]
• Most FS operations do not scale with the number of cores
• Two main scaling problems
  • (1) vfs locking
    • vfs uses a structure for maintaining directory entry and inode information (the dentry and inode caches)
    • Synchronization over the dentry cache is problematic due to the vfs design
  • (2) FS journaling
    • All modern FSs need to worry about recovery
    • Most use a journaling scheme that is integrated with the lookup/update path
    • Synchronization over this journal hinders scaling
• Outlook
  • There is significant potential in both (1) and (2)
  • (1) is being discussed: (a) people are working on it, (b) there is potential to bypass it
  • (2) is more fundamental; our goal is to target this
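A minimal sketch of the kind of microbenchmark behind these numbers, in the spirit of kfsmark (thread counts and file counts are made up): many threads creating and syncing files in one directory, which serializes on the dentry-cache locks (1) and on journal commits (2):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 16
#define FILES_PER_THREAD 1000

/* Each thread creates and fsyncs files in a shared directory.
 * Creates contend on the directory's dentry-cache locks, and
 * every fsync forces a journal commit, so ops/sec typically
 * flattens well before NTHREADS cores are busy. */
static void *creator(void *arg)
{
    long id = (long)arg;
    char path[64];
    for (int i = 0; i < FILES_PER_THREAD; i++) {
        snprintf(path, sizeof(path), "bench/t%ld-f%d", id, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) { fsync(fd); close(fd); }
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    mkdir("bench", 0755);               /* shared target directory */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, creator, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;   /* time this externally, e.g. with /usr/bin/time */
}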
Summary of Analysis
• (1) Fundamentally, I/O performance should scale
• (2) Controller: use spatial parallelism and go with technology trends
• (3) Block: worry about placement and affinity problems
• (4) FS: worry about synchronization at specific points
• Both (3) and (4) are due to current trends in multicores
  • Not broadly known problems yet