Distributed Storage and Consistency
Storage moves into the net
[Diagram: tradeoffs among network delays, network cost, network bandwidth, storage capacity/volume, and administrative cost]
Shared storage with scalable bandwidth and capacity.
Consolidate, multiplex, decentralize, replicate.
Reconfigure to mix-and-match loads and resources.
Storage as a service
SSP: Storage Service Provider. ASP: Application Service Provider.
Outsourcing: storage and/or applications as a service.
For ASPs (e.g., Web services), storage is just a component.
Storage Abstractions
• relational database (IBM and Oracle): tables, transactions, query language
• file system: hierarchical name space of files with ACLs. Each file is a linear space of fixed-size blocks.
• block storage: SAN, Petal, RAID-in-a-box (e.g., EMC). Each logical unit (LU) or volume is a linear space of fixed-size blocks.
• object storage: object == file, with a flat name space: NASD, DDS, Porcupine. Varying views of the object size: NASD/OSD/Slice objects may act as large-ish “buckets” that aggregate file system state.
• persistent objects: pointer structures, requires transactions: OODB, ObjectStore
Network Block Storage
One approach to scalable storage is to attach raw block storage to a network.
• Abstraction: OS addresses storage by <volume, sector>.
  iSCSI, Petal, FibreChannel: access through a special device driver
• Dedicated Storage Area Network (SAN) or general-purpose network.
  FibreChannel (FC) vs. Ethernet
• Volume-based administrative tools: backup, volume replication, remote sharing
• Called “raw” or “block”, “storage volumes” or just “SAN”.
• Least common denominator for any file system or database.
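A minimal sketch of this raw block abstraction, with invented type and function names (not a real driver API): the client names storage only by <volume, sector> and transfers fixed-size blocks.

#include <stdint.h>

#define SECTOR_SIZE 512            /* fixed block size of the volume's linear space */

typedef uint32_t volume_id_t;      /* names a logical unit (LU) or volume */
typedef uint64_t sector_t;         /* index into the volume's linear array of blocks */

/* Read or write one block; buf must hold SECTOR_SIZE bytes.
 * A real initiator (iSCSI, FC HBA driver, Petal client) sits behind these. */
int block_read(volume_id_t vol, sector_t sector, void *buf);
int block_write(volume_id_t vol, sector_t sector, const void *buf);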
“NAS vs. SAN”
In the commercial sector there is a raging debate today about “NAS vs. SAN”.
• Network-Attached Storage (NAS) has been the dominant approach to shared storage since NFS.
  NAS == NFS or CIFS: named files over Ethernet/Internet. E.g., Network Appliance “filers”
• Proponents of FibreChannel SANs market them as a fundamentally faster way to access shared storage.
  no “indirection through a file server” (“SAD”)
  lower overhead on clients
  network is better/faster (if not cheaper) and dedicated/trusted
  Brocade, HP, Emulex are some big players.
NAS vs. SAN: Cutting through the BS
• FibreChannel is a high-end technology incorporating NIC enhancements to reduce host overhead...
  ...but it is bogged down in interoperability problems.
• Ethernet is getting faster, faster than FibreChannel is:
  gigabit, 10-gigabit, plus smarter NICs and smarter/faster switches
• The future battleground is Ethernet vs. Infiniband.
• The choice of network is fundamentally orthogonal to storage service design.
  Well, almost: flow control, RDMA, user-level access (DAFS/VI)
• The fundamental questions are really about abstractions:
  shared raw volume vs. shared file volume vs. private disks
Storage Architecture
Any of these abstractions can be built using any, some, or all of the others. Use the “right” abstraction for your application.
Basic operations: create/remove, open/close, read/write.
The fundamental questions are:
• What is the best way to build the abstraction you want?
  division of function between device, network, server, and client
• What level of the system should implement the features and properties you want?
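To make the layering concrete, here is a rough sketch (not from the slides) of a file read built on the block interface sketched earlier; file_handle_t, fs_volume(), and block_map() are hypothetical stand-ins for a file system's per-file block index.

#include <sys/types.h>   /* ssize_t, off_t */

typedef int file_handle_t;

/* Hypothetical helpers: which volume backs this file, and which sector
 * holds its Nth block. */
volume_id_t fs_volume(file_handle_t fh);
sector_t    block_map(file_handle_t fh, uint64_t file_block);

/* A file read decomposes into raw block reads on the underlying volume.
 * Simplified: assumes len is a multiple of SECTOR_SIZE and the offset is
 * sector-aligned; no caching, minimal error handling. */
ssize_t fs_read(file_handle_t fh, void *buf, size_t len, off_t offset)
{
    size_t done = 0;
    while (done < len) {
        sector_t s = block_map(fh, (uint64_t)(offset + done) / SECTOR_SIZE);
        if (block_read(fs_volume(fh), s, (char *)buf + done) < 0)
            return -1;
        done += SECTOR_SIZE;
    }
    return (ssize_t)len;
}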
Duke Mass Storage Testbed
Goal: managed storage on demand for cross-disciplinary research.
Direct SAN access for “power clients” and NAS PoPs; other clients access through NAS.
[Diagram: IBM Shark/HSM on a campus FC net, with IP LANs serving the Brain Lab and Med Ctr]
Problems
poor interoperability
• Must have a common volume layout across heterogeneous SAN clients.
poor sharing control
• The granularity of access control is an entire volume.
• SAN clients must be trusted.
• SAN clients must coordinate their access.
$$$
Duke Storage Testbed, v2.0
Each SAN volume is managed by a single NAS PoP.
All access to each volume is mediated by its NAS PoP.
[Diagram: IBM Shark/HSM on the campus FC net; the Brain Lab and Med Ctr reach their NAS PoPs over the campus IP net]
Testbed v2.0: pro and con
Supports resource sharing and data sharing.
Does not leverage the Fibre Channel investment.
Does not scale access to individual volumes.
Prone to load imbalances.
Data crosses the campus IP network in the clear.
Identities and authentication must be centrally administered.
It’s only as good as the NAS clients, which tend to be fair at best.
Sharing Network Storage
How can we control sharing to a space of files or blocks?
• Access control etc.
• Data model and storage abstraction
• Caching
• Optimistic replication
Consistency
• One-copy consistency vs. weak consistency
• Read-only (immutable) files?
• Read-mostly files with weak consistency?
• Write-anywhere files?
File/Block Cache Consistency
• Basic write-ownership protocol.
  Distributed shared memory (software DSM)
• Timestamp validation (NFS).
  Timestamp each cache entry, and periodically query the server: “has this file changed since time t?”; invalidate the cache entry if stale.
• Callback invalidation (AFS, Sprite, Spritely NFS).
  Request notification (callback) from the server if the file changes; invalidate cache and/or disable caching on callback.
• Leases (NQ-NFS, NFSv4, DAFS)
  [Gray&Cheriton89, Macklem93]
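A sketch of the timestamp-validation idea, with invented names (get_server_mtime() stands in for a GETATTR-style RPC): each cache entry remembers the file's modification time when it was fetched, and the entry is discarded if the server's copy has changed since then.

#include <stdbool.h>
#include <time.h>

struct cache_entry {
    time_t fetch_mtime;       /* server's file mtime when this copy was cached */
    /* ... cached blocks or attributes ... */
};

/* Hypothetical RPC: ask the server for the file's current mtime. */
time_t get_server_mtime(const char *path);

/* "Has this file changed since time t?"  If so, the cached copy is stale. */
bool cache_entry_fresh(const char *path, const struct cache_entry *e)
{
    return get_server_mtime(path) <= e->fetch_mtime;
}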
Software DSM 101
Software-based distributed shared memory (DSM) provides an illusion of shared memory on a cluster.
• remote-fork the same program on each node
• data resides in common virtual address space
  library/kernel collude to make the shared VAS appear consistent
• The Great War: shared memory vs. message passing
  for the full story, take CPS 221
[Diagram: cluster nodes connected by a switched interconnect]
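A sketch of what this model looks like to the application, using a made-up DSM API (dsm_init, dsm_alloc, dsm_lock, dsm_barrier, dsm_node_id are illustrative, not a real library): the same program runs on every node, and the runtime keeps the shared region consistent behind the scenes.

#include <stdio.h>

/* Hypothetical DSM runtime interface. */
void  dsm_init(void);                /* join the cluster; remote-fork peers */
void *dsm_alloc(unsigned long size); /* allocate in the shared VAS, zero-initialized */
void  dsm_lock(int id);
void  dsm_unlock(int id);
void  dsm_barrier(void);
int   dsm_node_id(void);

int main(void)
{
    dsm_init();
    int *counter = dsm_alloc(sizeof(int));   /* visible to all nodes */

    dsm_lock(0);             /* the runtime moves/invalidates pages as needed */
    (*counter)++;
    dsm_unlock(0);

    dsm_barrier();
    if (dsm_node_id() == 0)
        printf("count = %d\n", *counter);
    return 0;
}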
Page-Based DSM (Shared Virtual Memory)
The virtual address space is shared across nodes.
[Diagram: a single shared virtual address space, with pages backed by physical DRAM on different nodes]
The Sequential Consistency Memory Model
[Diagram: sequential processors P1, P2, P3 issue memory ops in program order; a switch between the processors and memory is set randomly after each memory op, ensuring some serial order among all operations.]
Easily implemented with a shared bus.
For page-based DSM, weaker consistency models may be useful... but that’s for later.
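A standard litmus test (not from the slides) shows what sequential consistency rules out: all operations must appear in some single serial order that respects each thread's program order.

/* Initially x == y == 0; one thread runs t1(), another runs t2(). */
int x = 0, y = 0;
int r1, r2;

void t1(void) { x = 1; r1 = y; }
void t2(void) { y = 1; r2 = x; }

/* Under sequential consistency, (r1 == 0 && r2 == 0) is impossible:
 * whichever store is serialized first is seen by the other thread's load.
 * Outcomes (1,0), (0,1), and (1,1) are all allowed.
 * Weaker models (and real hardware with store buffers) can produce (0,0). */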
Inside Page-Based DSM (SVM)
The page-based approach uses a write-ownership token protocol on virtual memory pages.
• Kai Li [Ivy SVM, 1986], Paul Leach [Apollo, 1982]
• Each node maintains per-node, per-page access modes.
  {shared, exclusive, no-access} determines local accesses allowed
  For SVM, modes are enforced with VM page protection:

  mode        load (read)   store (write)
  shared      yes           no
  exclusive   yes           yes
  no-access   no            no
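A minimal sketch of enforcing these modes with VM page protection on a POSIX system (assumes mprotect() and page-aligned addresses; the SIGSEGV handler that drives the protocol on faults is not shown):

#include <stddef.h>
#include <sys/mman.h>

typedef enum { NO_ACCESS, SHARED, EXCLUSIVE } page_mode_t;

/* Map a DSM access mode onto hardware page protections, so that any
 * disallowed load or store traps into the DSM runtime. */
static int set_page_mode(void *page, size_t pagesize, page_mode_t mode)
{
    switch (mode) {
    case NO_ACCESS: return mprotect(page, pagesize, PROT_NONE);              /* loads and stores fault */
    case SHARED:    return mprotect(page, pagesize, PROT_READ);              /* loads allowed, stores fault */
    case EXCLUSIVE: return mprotect(page, pagesize, PROT_READ | PROT_WRITE); /* loads and stores allowed */
    }
    return -1;
}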
Write-Ownership Protocol
A write-ownership protocol guarantees that nodes observe sequential consistency of memory accesses:
• Any node with any access has the latest copy of the page.
  On any transition from no-access, fetch the current copy of the page.
• A node with exclusive access holds the only copy.
  At most one node may hold a page in exclusive mode.
  On transition into exclusive, invalidate all remote copies and set their mode to no-access.
• Multiple nodes may hold a page in shared mode.
  Permits concurrent reads: every holder has the same data.
  On transition into shared mode, invalidate the exclusive remote copy (if any), and set its mode to shared as well.
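A sketch of how these transitions might be driven by page faults, reusing page_mode_t from the previous sketch; fetch_page(), invalidate_remote_copies(), demote_remote_owner(), local_mode(), and set_local_mode() are hypothetical stand-ins for the DSM runtime's directory and messaging layer.

/* Hypothetical runtime messaging (directory lookups, page transfer). */
void fetch_page(int page);               /* pull the latest copy from a current holder */
void invalidate_remote_copies(int page); /* force all remote holders to no-access */
void demote_remote_owner(int page);      /* drop a remote exclusive holder to shared */
page_mode_t local_mode(int page);
void set_local_mode(int page, page_mode_t mode);   /* wraps set_page_mode() */

/* Read fault: transition from no-access to shared. */
void on_read_fault(int page)
{
    fetch_page(page);              /* any holder has the latest copy */
    demote_remote_owner(page);     /* an exclusive holder, if any, becomes shared */
    set_local_mode(page, SHARED);  /* concurrent readers all see the same data */
}

/* Write fault: transition into exclusive. */
void on_write_fault(int page)
{
    if (local_mode(page) == NO_ACCESS)
        fetch_page(page);              /* must start from the latest copy */
    invalidate_remote_copies(page);    /* this node now holds the only copy */
    set_local_mode(page, EXCLUSIVE);
}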
Network File System (NFS)
[Diagram: on the client, user programs enter the syscall layer and VFS, which dispatches to the NFS client alongside local file systems (*FS); on the server, the NFS server sits beneath the syscall layer/VFS above a local *FS. Client and server communicate via RPC over UDP or TCP.]
NFS Protocol
NFS is a network protocol layered above TCP/IP.
• Original implementations (and most today) use UDP datagram transport for low overhead.
  Maximum IP datagram size was increased to match FS block size, to allow send/receive of entire file blocks.
  Some implementations use TCP as a transport.
• The NFS protocol is a set of message formats and types.
  Client issues a request message for a service operation.
  Server performs the requested operation and returns a reply message with status and (perhaps) requested data.
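As a rough illustration of the request/reply style (field names are invented; real NFS messages are defined in XDR and carried over ONC RPC, e.g., RFC 1094 for NFSv2), a READ exchange might carry something like:

#include <stdint.h>

/* Client -> server: read `count` bytes at `offset` from the file named
 * by an opaque handle the server issued earlier (e.g., on LOOKUP). */
struct read_request {
    uint8_t  fhandle[32];   /* opaque file handle, meaningful only to the server */
    uint64_t offset;
    uint32_t count;
};

/* Server -> client: status plus (perhaps) the requested data. */
struct read_reply {
    uint32_t status;        /* OK or an error code */
    uint32_t count;         /* bytes actually returned */
    uint8_t  data[8192];    /* up to one file system block */
};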