Build and operate a CEPH Infrastructure - University of Pisa case study



  1. Build and operate a CEPH Infrastructure - University of Pisa case study
     Simone Spinelli - simone.spinelli@unipi.it
     17 TF-Storage meeting - Pisa, 13-14 October 2015

  2. Agenda
     ● CEPH @ unipi: an overview
     ● Infrastructure bricks:
       – Network
       – OSD nodes
       – Monitor nodes
       – Racks
       – MGMT tools
     ● Performances
     ● Our experience
     ● Conclusions

  3. University of Pisa
     ● Large Italian university:
       – 70K students
       – 8K employees
       – Not a campus but spread all over the city → no big datacenter, but many small sites
     ● Owns and manages an optical infrastructure with an MPLS-based MAN on top
     ● Proud host of a GARR network PoP
     ● Surrounded by other research/educational institutions (CNR, Sant'Anna, Scuola Normale, …)

  4. How we use CEPH
     Currently in production as the backend for an OpenStack installation, it hosts:
     ● department tenants (web servers, etc.)
     ● tenants for research projects (DNA sequencing, etc.)
     ● tenants for us: multimedia content from e-learning platforms
     Working on:
     ● an email system for students hosted on OpenStack → RBD
     ● a sync & share platform → RadosGW
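
The slides do not name the underlying pools; a minimal sketch, assuming the conventional pool names and client capabilities from the Ceph/OpenStack integration guide (PG counts are examples only):

# RBD pools for Glance images, Cinder volumes and Nova VM disks
ceph osd pool create images  512
ceph osd pool create volumes 512
ceph osd pool create vms     512

# a dedicated key for Cinder, limited to the pools it actually needs
ceph auth get-or-create client.cinder \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'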

  5. Timeline
     ● Spring 2014: we started to plan:
       – capacity/replica planning
       – rack engineering (power/cooling)
       – bare metal management
       – configuration management
     ● Dec 2014: first testbed
     ● Feb 2015: 12-node cluster goes into production
     ● Jul 2015: OpenStack goes into production
     ● Oct 2015: start deploying new Ceph nodes (+12)

  6. Overview
     ● 3 sites (we started with 2):
       – one replica per site
       – 2 sites active for computing and storage
       – 1 site for storage and quorum
     ● 2 different network infrastructures:
       – services (1 Gb and 10 Gb)
       – storage (10 Gb and 40 Gb)
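
With one replica per site the pools run with three copies; a hedged sketch of the matching pool settings (pool name and min_size value are examples, and the per-site placement itself comes from the CRUSH rule on slide 13):

# three replicas, one per datacenter (placement enforced by the CRUSH rule)
ceph osd pool set volumes size 3
# keep serving I/O with one site down (example value)
ceph osd pool set volumes min_size 2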

  7. Network
     ● Ceph client and cluster networks are realized as VLANs on the same switching infrastructure
     ● Redundancy and load balancing are achieved with LACP
     ● Switching platforms:
       – Juniper EX4550: 32 SFP+ ports
       – Juniper EX4200: 24 copper ports
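
A minimal sketch of how the two VLANs map onto Ceph's configuration (the subnets below are placeholder examples, not the real addressing plan):

# append to /etc/ceph/ceph.conf on every node; subnets are examples only
cat <<'EOF' >> /etc/ceph/ceph.conf
[global]
public network  = 192.0.2.0/24      # client/services VLAN
cluster network = 198.51.100.0/24   # replication/backfill VLAN
EOF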

  8. Storage ring
     ● Sites are interconnected with a 2x40 Gb ERP (Ethernet ring protection)
     ● For storage nodes, 1 Virtual Chassis (VC) per DC:
       – maximizes the bandwidth: 128 Gb/s backplane inside the VC
       – easy to configure and manage (NSSU)
       – no more than 8 nodes per VC
       – computing nodes go in different VCs

  9. Hardware: OSD nodes - DELL R720XD (2U)
     ● 2x Xeon E5-2603 @ 1.8 GHz: 8 cores total
     ● 64 GB DDR3 RAM
     ● 2x 10Gb Intel X520 network adapters
     ● 12x 2TB SATA disks (6 disks/RU)
     ● 2x Samsung 850 256GB SSDs:
       – mdadm RAID1 for the OS
       – 6 partitions per disk for XFS journals
     ● Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
     ● Linux bonding driver:
       – no special functions
       – less complex
     ● Really easy to deploy with iDRAC
     ● Intended to be the virtual machine pool (faster)
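
A hedged sketch of preparing one OSD with its journal on an SSD partition, in the ceph-deploy style of the time (host and device names are hypothetical):

# data disk /dev/sdb, journal on SSD partition /dev/sdm1 (examples only)
ceph-deploy osd prepare  osd-node-01:/dev/sdb:/dev/sdm1
ceph-deploy osd activate osd-node-01:/dev/sdb1:/dev/sdm1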

  10. Hardware: OSD nodes - Supermicro SSG6047R-OSD120H
     ● 2x Xeon E5-2630v2 @ 2.60 GHz: 24 cores total
     ● 256 GB DDR3 RAM
     ● 4x 10Gb Intel X520 network adapters
     ● 2 SSDs in RAID1 for the OS (dedicated)
     ● 30x 6TB SATA disks (7.5 disks/RU)
     ● 6 Intel 3700 SSDs for XFS journals:
       – 1 SSD → journals for 5 OSDs
     ● Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
     ● Linux bonding driver:
       – no special functions
       – less complex
     ● Intended to be the object storage pool (slow)
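
Carving one journal SSD into five journal partitions could look like this (device name and journal size are assumptions, not values from the slides):

# five 20 GB journal partitions on one Intel 3700 SSD (example device and size)
for i in 1 2 3 4 5; do
    sgdisk --new=${i}:0:+20G --change-name=${i}:"ceph journal" /dev/sdae
done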

  11. Hardware: monitor nodes - Sun SunFire X4150
     ● Physical hardware, not virtual (3 in production, going to be 5)
     ● 2x Intel Xeon X5355 @ 2.66 GHz
     ● 16 GB RAM
     ● 5x 120GB Intel 3500 SSDs in RAID10 + hot spare
     ● 2x 1Gb Intel NICs for the Ceph client network (LACP)
     ● Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
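
Growing the monitor set from 3 to 5 is essentially one ceph-deploy step per new host; a sketch with hypothetical hostnames:

# add two more monitors to reach a quorum of 5 (hostnames are examples)
ceph-deploy mon add mon-04
ceph-deploy mon add mon-05
ceph quorum_status --format json-pretty   # verify the enlarged quorum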

  12. Rack plans
     NOW: computing and storage are in specific racks.
     For storage:
     ● 32U OSD nodes
     ● 2U monitor/cache
     ● 10U network
     For computing:
     ● 32U computing nodes
     ● 10U network
     IN PROGRESS: computing and storage will be mixed.
     ● 24U OSD nodes
     ● 4U computing nodes
     ● 2U monitor/cache
     ● 8U network
     The storage network fan-out is optimized.

  13. Configuration essentials
     CRUSH tree:
       -1    262.1   root default
       -15    87.36  datacenter fibonacci
       -16    87.36  rack rack-c03-fib
       -14    87.36  datacenter serra
       -17    87.36  rack rack-02-ser
       -35    87.36  datacenter ingegneria
       -31     0     rack rack-01-ing
       -32     0     rack rack-02-ing
       -33     0     rack rack-03-ing
       -34     0     rack rack-04-ing
       -18    87.36  rack rack-03-ser
     CRUSH rule:
       rule serra_fibo_ing_high-end_ruleset {
           ruleset 3
           type replicated
           min_size 1
           max_size 10
           step take default
           step choose firstn 0 type datacenter
           step chooseleaf firstn 1 type host-high-end
           step emit
       }
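
For reference, the usual round trip for extracting, editing and re-injecting a CRUSH map like the one above (file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin        # dump the compiled map
crushtool -d crushmap.bin -o crushmap.txt   # decompile to text and edit the rule
crushtool -c crushmap.txt -o crushmap.new   # recompile
ceph osd setcrushmap -i crushmap.new        # inject the new map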

  14. Tools
     Just 3 people working on CEPH (not full time), and you need to grow quickly → automation is REALLY important.
     ● Configuration management: Puppet
       – most of the classes are already production-ready
       – a lot of documentation (best practices, books, community)
     ● Bare metal installation: The Foreman
       – complete lifecycle for hardware
       – DHCP, DNS, Puppet ENC

  15. Tools
     For monitoring/alarming:
     ● Nagios + Check_MK
       – alarms
       – graphing
     ● Rsyslog
     ● Looking at collectd + Graphite
       – metrics correlation
     Test environment (Vagrant and VirtualBox) to test what is hardware independent:
     ● new functionalities
     ● Puppet classes
     ● upgrade procedures
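
A minimal sketch of a Nagios/Check_MK-style probe built on "ceph health" (the script itself is an illustrative assumption; exit codes follow the usual Nagios convention):

#!/bin/bash
# map Ceph health to Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
status=$(ceph health 2>/dev/null)
case "$status" in
    HEALTH_OK*)   echo "OK - $status";       exit 0 ;;
    HEALTH_WARN*) echo "WARNING - $status";  exit 1 ;;
    HEALTH_ERR*)  echo "CRITICAL - $status"; exit 2 ;;
    *)            echo "UNKNOWN - no answer from the cluster"; exit 3 ;;
esac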

  16. OpenStack integration
     ● It works straightforwardly
     ● CEPH as a backend for:
       – volumes
       – VMs
       – images
     ● Copy-on-write: a VM is a snapshot of its image
     ● Shared storage → live migration
     ● Multiple pools are supported
     ● Current issues (OpenStack = Juno, Ceph = Giant):
       – massive volume deletion
       – evacuate
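
A hedged sketch of the Juno-era settings that wire Cinder and Glance to RBD and allow the copy-on-write cloning mentioned above (pool names, users and the secret UUID are placeholders):

# cinder.conf additions - RBD driver for volumes (values are examples)
cat <<'EOF' >> /etc/cinder/cinder.conf
[DEFAULT]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = REPLACE-WITH-LIBVIRT-SECRET-UUID
EOF

# glance-api.conf additions - RBD store plus the flag that enables copy-on-write clones
cat <<'EOF' >> /etc/glance/glance-api.conf
[DEFAULT]
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance
show_image_direct_url = True
EOF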

  17. Performances – ceph bench writes
                                   10 s run     60 s run     120 s run
     Total time run (s)            10.353915    60.308706    120.537838
     Total writes made             1330         5942         12593
     Write size (bytes)            4194304      4194304      4194304
     Bandwidth (MB/sec)            513.815      394.106      417.894
     Stddev bandwidth (MB/sec)     161.337      103.204      84.4311
     Max bandwidth (MB/sec)        564          524          560
     Min bandwidth (MB/sec)        0            0            0
     Average latency (s)           0.123224     0.162265     0.153105
     Stddev latency (s)            0.0928879    0.211504     0.175394
     Max latency (s)               0.955342     2.71961      2.05649
     Min latency (s)               0.045272     0.041313     0.038814
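
These figures read like rados bench write runs of 10, 60 and 120 seconds; the invocations were presumably along these lines (pool name taken from the read slide; --no-cleanup assumed so the objects remain for the read benchmarks):

rados bench -p BenchPool 10  write --no-cleanup
rados bench -p BenchPool 60  write --no-cleanup
rados bench -p BenchPool 120 write --no-cleanup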

  18. Performances – ceph bench reads
     Commands: rados bench -p BenchPool 10 rand / rados bench -p BenchPool 10 seq
                              rand          seq
     Total time run (s)       10.065519     10.057527
     Total reads made         1561          1561
     Read size (bytes)        4194304       4194304
     Bandwidth (MB/sec)       620.336       620.829
     Average latency (s)      0.102881      0.102826
     Max latency (s)          0.294117      0.328899
     Min latency (s)          0.04644       0.041481

  19. Performances: adding VMs
     What to measure:
     ● see how latency is influenced by IOPS, measuring it while we add VMs (fixed load generator)
     ● see how total bandwidth decreases while adding VMs
     Setup:
     ● 40 VMs on OpenStack, each with 2x 10 GB volumes (pre-allocated with dd):
       – one with a bandwidth cap (100 MB/s)
       – one with an IOPS cap (200 total)
     ● We use fio as the benchmark tool and dsh to launch it from a master node.
     ● Reference: "Measure Ceph RBD performance in a quantitative way",
       https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
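
A hedged sketch of driving the benchmark from the master node with dsh (the machine group "benchvms" is hypothetical; the fio options are the random-read job from slide 20):

# run the same fio job on all 40 VMs concurrently, prefixing output with the VM name
dsh -M -c -g benchvms -- fio --size=1G --runtime=60 --ioengine=libaio --direct=1 \
    --rw=randread --name=fiojob --blocksize=4K --iodepth=2 --rate_iops=200 --output=randread.out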

  20. Fio
     Random I/O job:
       fio --size=1G \
           --runtime=60 \
           --ioengine=libaio \
           --direct=1 \
           --rw=randread [randwrite] \
           --name=fiojob \
           --blocksize=4K \
           --iodepth=2 \
           --rate_iops=200 \
           --output=randread.out
     Sequential I/O job:
       fio --size=4G \
           --runtime=60 \
           --ioengine=libaio \
           --direct=1 \
           --rw=read [write] \
           --name=fiojob \
           --blocksize=128K [256K] \
           --iodepth=64 \
           --output=seqread.out

  21. Performances - write

  22. Performances - write

  23. Performances - read

  24. Performances - read
