an intro to ceph and big data



  1. an intro to ceph and big data patrick mcgarry – inktank Big Data Workshop – 27 JUN 2013

  2. what is ceph?
     ● distributed storage system
       – reliable system built with unreliable components
       – fault tolerant, no SPoF
     ● commodity hardware
       – expensive arrays, controllers, specialized networks not required
     ● large scale (10s to 10,000s of nodes)
       – heterogeneous hardware (no fork-lift upgrades)
       – incremental expansion (or contraction)
     ● dynamic cluster

  3. what is ceph?
     ● unified storage platform
       – scalable object + compute storage platform
       – RESTful object storage (e.g., S3, Swift)
       – block storage
       – distributed file system
     ● open source
       – LGPL server-side
       – client support in mainline Linux kernel

  4. [architecture diagram; consumers shown: APP, APP, HOST/VM, CLIENT]
     ● LIBRADOS
       – A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
     ● RADOSGW
       – A bucket-based REST gateway, compatible with S3 and Swift
     ● RBD
       – A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
     ● CEPH FS
       – A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
     ● RADOS – the Ceph object store
       – A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

  5. [diagram: five OSDs, each on a local file system (btrfs, xfs, ext4, zfs?) on its own DISK; three M nodes]

  6. [diagram: object placement]
     hash(object name) % num pg
     CRUSH(pg, cluster state, policy)
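The two expressions on slide 6 describe a two-step lookup: an object name is hashed into a placement group (pg), and the pg is then mapped to storage nodes. A minimal Python sketch of that shape follows. The second step uses rendezvous (highest-random-weight) hashing as a stand-in and is not the real CRUSH algorithm; the function names, the `osd.N` labels, and the replica count are assumptions made for illustration, and "cluster state" is reduced to a plain list of node names.

```python
import hashlib
from typing import List, Tuple


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pg: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def pg_to_osds(pg: int, osds: List[str], replicas: int) -> List[str]:
    """Step 2, a stand-in for CRUSH(pg, cluster state, policy).

    Rendezvous hashing, NOT real CRUSH: every OSD gets a score derived
    from (pg, osd) and the `replicas` highest scores win.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pg: int, osds: List[str],
           replicas: int = 2) -> Tuple[int, List[str]]:
    """Compute where an object lives without consulting any lookup table."""
    pg = object_to_pg(object_name, num_pg)
    return pg, pg_to_osds(pg, osds, replicas)


if __name__ == "__main__":
    cluster = [f"osd.{i}" for i in range(5)]  # hypothetical node names
    for name in ("foo", "bar", "baz"):
        pg, targets = locate(name, 64, cluster)
        print(name, "-> pg", pg, "->", targets)
```

Because both steps are pure functions of the name and the node list, any client holding the same inputs computes the same answer. In this stand-in, removing a node that does not hold a given pg leaves that pg's placement unchanged.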

  7. [diagram continued, no text]

  8. [diagram: CLIENT ??]

  9. [diagram: CLIENT ??]

  10. So what about big data?
      ● CephFS
      ● s/HDFS/CephFS/g
      ● Object Storage
      ● Key-value store

  11. librados
      ● direct access to RADOS from applications
      ● C, C++, Python, PHP, Java, Erlang
      ● direct access to storage nodes
      ● no HTTP overhead

  12. rich librados API
      ● efficient key/value storage inside an object
      ● atomic single-object transactions
        – update data, attr, keys together
        – atomic compare-and-swap
      ● object-granularity snapshot infrastructure
      ● inter-client communication via object
      ● embed code in ceph-osd daemon via plugin API
        – arbitrary atomic object mutations, processing
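To make "atomic single-object transactions" and "atomic compare-and-swap" from slide 12 concrete, here is a toy in-memory model in Python. It is not the librados API: `ToyObject`, `transact`, and the `version` attribute are hypothetical names invented for this sketch, and a lock stands in for whatever the storage daemon does internally. It only illustrates the semantics: data, attrs, and keys change together, or not at all when the guard fails.

```python
import threading
from typing import Dict, Optional, Tuple


class ToyObject:
    """In-memory stand-in for one object: byte data, attrs, and a key/value map."""

    def __init__(self) -> None:
        self.data = b""
        self.attrs: Dict[str, str] = {}
        self.keys: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def transact(self,
                 expect_attr: Optional[Tuple[str, str]] = None,
                 data: Optional[bytes] = None,
                 attrs: Optional[Dict[str, str]] = None,
                 keys: Optional[Dict[str, bytes]] = None) -> bool:
        """Apply every update atomically, or none of them if the guard fails.

        expect_attr is an optional (name, expected_value) compare guard,
        which gives compare-and-swap behaviour.
        """
        with self._lock:
            if expect_attr is not None:
                name, expected = expect_attr
                if self.attrs.get(name) != expected:
                    return False  # guard failed: nothing is modified
            if data is not None:
                self.data = data
            self.attrs.update(attrs or {})
            self.keys.update(keys or {})
            return True


if __name__ == "__main__":
    obj = ToyObject()
    obj.transact(data=b"v1", attrs={"version": "1"}, keys={"owner": b"alice"})
    # Succeeds: the writer saw version 1 and bumps it to 2 in the same step.
    print(obj.transact(expect_attr=("version", "1"), data=b"v2", attrs={"version": "2"}))
    # Fails: a second writer still believes the version is 1.
    print(obj.transact(expect_attr=("version", "1"), data=b"stale"))
```

The version attribute acts as an optimistic-concurrency token: a writer that lost the race gets `False` back and can re-read and retry.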

  13. Data and compute
      ● RADOS Embedded Object Classes
      ● Moves compute directly adjacent to data
      ● C++ by default
      ● Lua bindings available

  14. die, POSIX, die
      ● successful exascale architectures will replace or transcend POSIX
        – hierarchical model does not distribute
      ● line between compute and storage will blur
        – some processing is data-local, some is not
      ● fault tolerance will be a first-class property of the architecture
        – for both computation and storage

  15. POSIX – I'm not dead yet!
      ● CephFS builds POSIX namespace on top of RADOS
        – metadata managed by ceph-mds daemons
        – stored in objects
      ● strong consistency, stateful client protocol
        – heavy prefetching, embedded inodes
      ● architected for HPC workloads
        – distribute namespace across cluster of MDSs
        – mitigate bursty workloads
        – adapt distribution as workloads shift over time

  16. [diagram: CLIENT with separate metadata and data paths; M nodes]

  17. [diagram: M nodes]

  18. one tree, three metadata servers: ??

  19. DYNAMIC SUBTREE PARTITIONING

  20. recursive accounting
      ● ceph-mds tracks recursive directory stats
        – file sizes
        – file and directory counts
        – modification time
      ● efficient

      $ ls -alSh | head
      total 0
      drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
      drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
      drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
      drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
      drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
      drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
      drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
      drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
      drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
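One way to see why the listing on slide 20 can report whole-subtree sizes cheaply is to keep running totals on every directory and push each change up to its ancestors at write time. The Python sketch below is a toy model of that idea, not how ceph-mds is implemented; the class `Dir` and the field names `rbytes`, `rfiles`, `rsubdirs`, and `rmtime` are chosen here for illustration.

```python
from typing import Dict, Optional


class Dir:
    """Toy directory that keeps recursive totals for its whole subtree."""

    def __init__(self, name: str, parent: Optional["Dir"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Dir"] = {}
        self.files: Dict[str, int] = {}  # file name -> size in bytes
        self.rbytes = 0      # total bytes under this directory
        self.rfiles = 0      # total files under this directory
        self.rsubdirs = 0    # total directories under this directory
        self.rmtime = 0.0    # newest modification time under this directory

    def mkdir(self, name: str) -> "Dir":
        child = Dir(name, self)
        self.children[name] = child
        self._propagate(dsubdirs=1)
        return child

    def write_file(self, name: str, size: int, mtime: float) -> None:
        old = self.files.get(name)
        self.files[name] = size
        self._propagate(dbytes=size - (old or 0),
                        dfiles=0 if old is not None else 1,
                        mtime=mtime)

    def _propagate(self, dbytes: int = 0, dfiles: int = 0,
                   dsubdirs: int = 0, mtime: Optional[float] = None) -> None:
        """Apply a delta to this directory and every ancestor up to the root."""
        node: Optional[Dir] = self
        while node is not None:
            node.rbytes += dbytes
            node.rfiles += dfiles
            node.rsubdirs += dsubdirs
            if mtime is not None:
                node.rmtime = max(node.rmtime, mtime)
            node = node.parent


if __name__ == "__main__":
    root = Dir("/")
    home = root.mkdir("home")
    home.mkdir("alice").write_file("data.bin", 4096, mtime=100.0)
    print(root.rbytes, root.rfiles, root.rsubdirs)
```

A write costs one step per ancestor, while reading a directory's total is a single field lookup with no tree walk.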

  21. snapshots
      ● snapshot arbitrary subdirectories
      ● simple interface
        – hidden '.snap' directory
        – no special tools

      $ mkdir foo/.snap/one      # create snapshot
      $ ls foo/.snap
      one
      $ ls foo/bar/.snap
      _one_1099511627776         # parent's snap name is mangled
      $ rm foo/myfile
      $ ls -F foo
      bar/
      $ ls -F foo/.snap/one
      myfile bar/
      $ rmdir foo/.snap/one      # remove snapshot

  22. how can you help?
      ● try ceph and tell us what you think
        – http://ceph.com/resources/downloads
      ● http://ceph.com/resources/mailing-list-irc/
        – ask if you need help
      ● ask your organization to start dedicating resources to the project http://github.com/ceph
      ● find a bug (http://tracker.ceph.com) and fix it
      ● participate in our ceph developer summit
        – http://ceph.com/events/ceph-developer-summit

  23. questions?

  24. thanks
      patrick mcgarry
      patrick@inktank.com
      http://github.com/ceph
      @scuttlemonkey
      http://ceph.com/
