  1. File and Metadata Replication in XtreemFS
     Björn Kolbeck, Zuse Institute Berlin

  2. Why Replicate?
     – fault tolerance: mail server, source repository
     – bandwidth: start 1,000 VMs in parallel, grid workflows
     – latency: local repositories (climate data, telescope images); HSM: fast (disk) vs. slow (tape) replicas

  3. The CAP Theorem
     – Consistency
     – Availability
     – Partition tolerance
     – "dernier cri": A+P (eventual consistency)
     Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2004.

  4. CAP: Examples
     – C+A: single server, Linux HA (one data center)
     – A+P: Amazon S3, Mercurial, Coda/AFS
     – C+P: distributed databases and file systems

  5. File System: Expected Semantics
     [Diagram: App A creates foo.txt and gets "ok" from the FS, then sends a message to App B; App B's subsequent open of foo.txt succeeds. In a second scenario, two applications create index.txt concurrently; the FS answers one with "ok" and the other with EEXIST.]
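
The second scenario in the diagram is the POSIX exclusive-create contract that the consistency requirements on the next slide build on. A minimal sketch in standard Python (the file name is taken from the slide) of how two racing creators observe it:

```python
import errno, os

def create_exclusive(path):
    """Atomic, exclusive create: succeeds for exactly one of two racing creators."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return "ok"
    except OSError as e:
        if e.errno == errno.EEXIST:
            return "EEXIST"          # the later creator observes the earlier create
        raise

# App A and App B race on the same name; exactly one of them succeeds.
print(create_exclusive("index.txt"))     # "ok" (assuming index.txt did not exist yet)
print(create_exclusive("index.txt"))     # "EEXIST"
```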

  6. File System: Consistency
     – linearizability (metadata and file data): communication between applications / users
     – atomic operations (metadata only): unique file names (create, rename)
     – used by real-world applications, e.g. dovecot
     – expensive

  7. File System: Do we really need consistency?
     – A+P = conflicts: name clashes, multiple versions
     – A+P vs. the POSIX API: can't resolve name clashes, no support for multiple versions, no interface to resolve conflicts
     – A+P vs. expectations: developers assume consistency and synchronization

  8. XtreemFS
     – distributed file system
     – "POSIX semantics"
     – object-based design
     – focus on replication (grid, cloud)

  9. Two problems – one solution
     1. Metadata replication (problem: bottleneck): replication algorithms, "relax" requirements, our solution
     2. File data replication (problem: scale): our solution, central lock service
     3. Other file systems

  10. Metadata: How to replicate? Replicated State Machine (C+P) – Paxos
      + no primary/master, no SPOF, no extra latency on failure
      − slow (two round trips), needs distributed transactions, difficult to implement

  11. Metadata: How to replicate? Primary/Backup (C+P) – replicated databases
      + fast (write = 1 RT, read = local), no distributed transactions, easy to implement
      − primary failover causes a short interruption, primary = bottleneck
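
As a rough illustration of the primary/backup scheme this slide favours, here is a minimal sketch (class and method names are invented for illustration, not XtreemFS code): the client talks only to the primary, which applies each update locally, forwards it to the backups in one round trip, and serves reads from its local copy, so no distributed transaction is needed.

```python
class Backup:
    """Backup replica: only applies updates forwarded by the primary."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value


class Primary:
    """Primary replica: all writes go through it, reads are served locally."""
    def __init__(self, backups):
        self.state = {}
        self.backups = backups

    def write(self, key, value):
        self.state[key] = value          # apply locally
        for backup in self.backups:      # forward to the backups (one round trip)
            backup.apply(key, value)
        return "ok"                      # no distributed transaction needed

    def read(self, key):
        return self.state.get(key)       # read = local, no remote communication


primary = Primary([Backup(), Backup()])
primary.write("/volume/foo.txt", {"size": 0})
print(primary.read("/volume/foo.txt"))
```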

  12. 1. Metadata: How to replicate? Linux HA (C+A) – heartbeat signal + STONITH, shared storage, Lustre failover
      + can be added "on top"
      − still SPOFs: STONITH..., only for clusters, passive backups

  13. 1. Metadata: "relax" (see the sketch below)
      – read from all replicas = sequential consistency
        – stat, getattr, readdir (50-80% of all calls)
        – load balancing
        – upper bound on "staleness"
      – write updates asynchronously
        – ack after local write
        – max. window of data loss
        – similar to sync in PostgreSQL
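
A minimal sketch of this relaxation, with invented names and an arbitrary staleness bound (an illustration of the idea, not the XtreemFS implementation): read-only calls may be answered by any sufficiently fresh replica, while updates are acknowledged after the primary's local write and shipped to the backups asynchronously.

```python
import random

STALENESS_BOUND = 10     # max number of updates a replica may lag behind (illustrative)

class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0          # how many updates this replica has applied

class RelaxedMetadataService:
    """Sketch: read-only calls (stat/getattr/readdir) may be served by any replica
    within the staleness bound (load balancing); updates are acknowledged after the
    primary's local write and replicated asynchronously (bounded window of data loss)."""

    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = backups
        self.log = []             # all updates, in order; backups may lag behind

    def update(self, path, attrs):
        self.primary.state[path] = attrs
        self.primary.applied += 1
        self.log.append((path, attrs))
        return "ok"               # ack after the local write only

    def replicate_async(self):
        # would run in the background in a real system; called explicitly here
        for backup in self.backups:
            while backup.applied < len(self.log):
                path, attrs = self.log[backup.applied]
                backup.state[path] = attrs
                backup.applied += 1

    def stat(self, path):
        # any replica that lags by at most STALENESS_BOUND updates may answer
        fresh = [self.primary] + [b for b in self.backups
                                  if len(self.log) - b.applied <= STALENESS_BOUND]
        return random.choice(fresh).state.get(path)


svc = RelaxedMetadataService(Replica(), [Replica(), Replica()])
svc.update("/home/foo.txt", {"size": 0})
svc.replicate_async()
print(svc.stat("/home/foo.txt"))
```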

  14. 1. Metadata: Implementation in XtreemFS
      – map metadata onto a flat index (see the sketch below)
      – replicate the index with primary/backup
        – use leases to elect the primary
        – replicate insert/update/delete
      – future work: weaker consistency for some ops
        – e.g. chmod, file size updates
        – upper bound on "staleness"
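
A toy sketch of what mapping metadata onto a flat index can look like. The key layout (parent directory ID plus entry name) and the helper names are assumptions for illustration, not the exact record format used by the XtreemFS metadata server or BabuDB:

```python
# Sketch: mapping a hierarchical namespace onto a flat key/value index.
# Each directory entry becomes one record, so create/rename/delete turn into
# single inserts/deletes that are easy to replicate via primary/backup.

index = {}                      # stands in for the replicated flat index
dir_ids = {"/": 1}              # path of a directory -> its numeric ID
next_id = 2

def entry_key(parent_id, name):
    return f"{parent_id:016x}/{name}"      # flat, lexicographically ordered key

def create(parent, name, attrs, is_dir=False):
    global next_id
    key = entry_key(dir_ids[parent], name)
    if key in index:
        return "EEXIST"
    index[key] = dict(attrs, id=next_id, dir=is_dir)   # a single insert to replicate
    if is_dir:
        dir_ids[parent.rstrip("/") + "/" + name] = next_id
    next_id += 1
    return "ok"

def readdir(path):
    prefix = f"{dir_ids[path]:016x}/"
    return [k[len(prefix):] for k in sorted(index) if k.startswith(prefix)]

create("/", "home", {"mode": 0o755}, is_dir=True)
create("/home", "foo.txt", {"mode": 0o644, "size": 0})
print(readdir("/home"))          # ['foo.txt']
```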

  15. 1. Metadata: Excursion – Flat index vs. Tree
      – database backend (BabuDB, LSM-tree based) vs. ext4 (empty files)
      [Charts: duration in seconds for a Linux kernel build and an IMAP trace (dovecot imapstress); the two backends differ only slightly (roughly 385 vs. 357 s on the kernel build, 1,904 vs. 1,799 s on the IMAP trace)]
      ➔ competitive performance

  16. 2. File Data: Expected Semantics
      – same as metadata, but no atomic operations
      – many applications require less:
        – read-only files / write-once
        – a single process reading/writing
        – explicit fsync

  17. 2. File Data: Implementation in XtreemFS
      – write-once: separate mechanism
        – more efficient
        – support for partial replicas
        – large number of replicas
      – read-write: primary/backup (see the sketch below)
        – use leases for primary failover
        – requires a service for lease coordination, e.g. a lock service
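
A minimal sketch of the two ideas on this slide, with invented class names and a stubbed lease service standing in for the real coordination service: a server accepts writes for a read-write replica only while it holds a valid lease, and a partial (write-once) replica fetches missing objects on demand.

```python
import time

class StaticLeaseService:
    """Stand-in for the lease coordination service; always reports the same owner.
    Purely illustrative, not a real lock service or Flease."""
    def __init__(self, owner, duration=30.0):
        self.owner, self.expires = owner, time.time() + duration

    def current_lease(self, file_id):
        return self.owner, self.expires


class LeaseGatedReplica:
    """Sketch: a storage server accepts writes for a read-write replica only while
    it holds a valid lease; a partial replica fetches missing objects on demand."""

    def __init__(self, server_id, lease_service):
        self.server_id = server_id
        self.lease_service = lease_service
        self.data = {}                                  # (file_id, object number) -> bytes

    def write(self, file_id, obj_no, payload):
        owner, expires = self.lease_service.current_lease(file_id)
        if owner != self.server_id or time.time() >= expires:
            return ("redirect", owner)                  # not primary: client retries there
        self.data[(file_id, obj_no)] = payload
        # a real implementation would also push the update to the backup replicas
        return ("ok", None)

    def read(self, file_id, obj_no, other_replicas=()):
        if (file_id, obj_no) not in self.data:          # partial replica: fetch on demand
            for replica in other_replicas:
                if (file_id, obj_no) in replica.data:
                    self.data[(file_id, obj_no)] = replica.data[(file_id, obj_no)]
                    break
        return self.data.get((file_id, obj_no))


lease   = StaticLeaseService(owner="osd-1")
primary = LeaseGatedReplica("osd-1", lease)
other   = LeaseGatedReplica("osd-2", lease)
print(primary.write("file-42", 0, b"hello"))                    # ('ok', None)
print(other.write("file-42", 0, b"hi"))                         # ('redirect', 'osd-1')
print(other.read("file-42", 0, other_replicas=[primary]))       # b'hello'
```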

  18. 2. File Data: Problem of Scale
      – large number of storage servers
      – large number of files
      – a primary per open file?
      – a primary per partition?
      – long lease timeouts, e.g. 1 min?

  19. 2. File Data: How to coordinate many leases?
      – Flease: decentralized lease coordination (simplified sketch below)
        – no central lock service
        – coordinated among the storage servers holding a replica
      – numbers:
        – Google's Chubby: ~640 ops/sec
        – Zookeeper: ~7,000 ops/sec
        – Flease: ~5,000 ops/sec (3 nodes), ~50,000 ops/sec (30 nodes)
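
Flease itself is described in a separate publication by the authors; the toy sketch below is not Flease. It only illustrates the architectural point of this slide, namely that the storage servers holding a file's replicas can grant the lease among themselves by majority agreement, without a central lock service. All names and the timeout are illustrative.

```python
import time

class ReplicaNode:
    """One storage server holding a replica of the file; votes on lease requests."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.promised = {}           # file_id -> (owner, expiry) it has acknowledged

    def request_lease(self, file_id, candidate, duration):
        owner, expiry = self.promised.get(file_id, (None, 0.0))
        if owner is None or owner == candidate or time.time() >= expiry:
            self.promised[file_id] = (candidate, time.time() + duration)
            return True              # grant or renew
        return False                 # somebody else still holds a valid lease


def acquire_lease(file_id, candidate, replica_set, duration=15.0):
    """A candidate becomes primary only if a majority of the replica set agrees."""
    votes = sum(r.request_lease(file_id, candidate, duration) for r in replica_set)
    return votes > len(replica_set) // 2


nodes = [ReplicaNode(i) for i in range(3)]
print(acquire_lease("file-42", "osd-1", nodes))    # True: osd-1 becomes primary
print(acquire_lease("file-42", "osd-2", nodes))    # False: lease still held by osd-1
```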

  20. 2. File Data: Max. number of open files/server
      [Chart: maximum number of open files per server as a function of the lease timeout, 30 nodes, LAN. Flease: 1,701 / 3,402 / 8,505 / 17,010 / 25,515 / 51,029 / 102,058 open files at timeouts of 1 / 2 / 5 / 10 / 15 / 30 / 60 s; Zookeeper: 245 / 489 / 1,223 / 2,445 / 3,668 / 7,336 / 14,672 at the same timeouts.]
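
One plausible reading of these numbers (an assumption, not stated on the slide) is that every open file needs its lease renewed before the timeout expires, so the sustainable number of open files per server is roughly the per-server lease throughput multiplied by the lease timeout; the chart's figures fit that model almost exactly.

```python
# Back-of-the-envelope check of the "open files = lease throughput x timeout" reading.
# The per-server rates below are assumptions inferred from the chart's values.
flease_ops_per_sec    = 1701
zookeeper_ops_per_sec = 245

for timeout_s in (1, 10, 30, 60):
    print(timeout_s,
          flease_ops_per_sec * timeout_s,       # 60 s -> 102,060, close to the 102,058 shown
          zookeeper_ops_per_sec * timeout_s)    # 60 s -> 14,700, close to the 14,672 shown
```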

  21. Replication: Other File Systems (metadata / file data)
      – Lustre: Linux HA / Linux HA
      – CEPH: primary/backup, plus central cfg. service and monitoring
      – GlusterFS: – / RAID 1
      – HDFS: – / write-once

  22. Replication: Lessons Learned
      – event-based design → no message re-ordering
      – separation of replication layers → simplified implementation and testing
      – no free lunch: consistency across data centers is expensive

  23. Thank You
      – http://www.xtreemfs.org
      – the upcoming release 1.3 includes replication
      XtreemFS is developed within the XtreemOS project. XtreemOS is funded by the European Commission under contract #FP6-033576.
