Stateful Workloads on Kubernetes with Ceph
NAVER, 유장선
Agenda
▶ CaaS / Kubernetes
▶ Ceph Storage
▶ Operation
Cloud Service Model
Figure: the standard stack (Applications, Data, Runtime, Middleware, OS, Virtualization, Server, Storage, Network) compared across On-Premises, IaaS, CaaS, PaaS and SaaS; moving toward SaaS, more of the stack is managed by the provider.
• On-Premises: built exactly as you want (non-standard), but higher cost and longer lead time.
• Toward SaaS: standardization, cost savings, on demand.
Transformation of deployment
Figure: application stacks in Traditional (apps and libraries directly on the host OS), Virtualized (apps in VMs on a hypervisor), and Containerized (apps in containers on a shared OS and container runtime) deployments.
• Traditional: applications interfere with each other, library compatibility issues, and splitting workloads onto separate nodes raises cost.
• Virtualized: isolation through VMs and improved security, but each VM carries its own guest OS, increasing resource usage and boot time.
• Containerized: lighter than VMs (the OS kernel is shared), isolation via namespaces, fast deployment, better resource efficiency and scalability, workloads split into small independent units, high efficiency and density.
MSA (microservice architecture)
Figure: a monolithic application (one application server and one DB) vs. microservices (Service A-D, each with its own DB).
• Monolithic: the codebase grows large and complex, the QA scope expands with every change, linked services are affected by changes, and the risk from a single failure is high.
• Microservice: services are split into small units, deployment is simplified, each service can diversify its technology (libraries, languages, frameworks), and scalability improves.
• Drivers: fast-moving markets and a shift in the development paradigm.
Container Orchestration
Figure: service replicas (Svc A, Svc B, Svc C, Svc D, ...) distributed across multiple worker nodes.
What an orchestrator provides:
• Provisioning / deployment of containers
• Fault tolerance (replicas)
• Load balancing
• Service discovery
• Auto scaling (scale in/out)
• Resource limit control
• Scheduling
• Health checking
• Cluster management
• Configuration management
• Monitoring
Orchestrators: Cloud Foundry, CoreOS Fleet, Docker Swarm, Mesos Marathon, Kubernetes, Google Container Engine, Amazon ECS, Azure Container Service
Kubernetes (K8S)
• The de facto standard container orchestrator
• Open-source version of Borg, Google's internal container system (15 years of operational experience)
• Donated to the CNCF (Cloud Native Computing Foundation)
• Supports various cloud and bare-metal environments
• Written in Go
• Self-healing
• Horizontal scaling
• Service discovery / load balancing
• Automatic rollouts / rollbacks
• Secret / configuration management
• Storage orchestration
• Batch execution (cron-style jobs)
• ...
Kubernetes
Figure: a client uses kubectl to submit YAML to the API server on the K8s master; three nginx pods are scheduled onto worker nodes, and a Service of type LoadBalancer exposes them through DNS/LB with internal and external IPs.
The YAML fragments in the figure sketch a Deployment (kind: Deployment, selector app: nginx, replicas: 3, pod template with image nginx and label app: nginx) and a Service (kind: Service, selector app: nginx, type: LoadBalancer).
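As a minimal sketch, the two fragments above can be written out as complete manifests; the object names and the container port are illustrative additions, only the image, replica count, labels and service type come from the slide:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx                  # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx           # image from the slide
        ports:
        - containerPort: 80    # assumed port
---
apiVersion: v1
kind: Service
metadata:
  name: nginx                  # illustrative name
spec:
  type: LoadBalancer           # exposed via the external LB in the figure
  selector:
    app: nginx
  ports:
  - port: 80                   # assumed port
    targetPort: 80
```

Applying these with kubectl apply -f gives the three nginx pods behind a load-balanced service, as drawn in the figure.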
CI/CD Pipeline: Canary / Blue-Green Deployments
Figure: the developer commits and pushes code to the Git repository; the CI server builds the Docker image, runs tests, and pushes the image to the Docker registry, then updates the Kubernetes Deployment; Kubernetes creates a new pod and health-checks it: if healthy, the old pod is deleted; if not healthy, the new pod is restarted and the old pod keeps running.
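The slide shows the flow, not manifests; one common way to get the create-new-pod, health-check, then delete-old-pod behaviour in Kubernetes is a RollingUpdate strategy combined with a readiness probe, sketched below with illustrative names, image and probe settings:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # illustrative
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # start one new pod before removing an old one
      maxUnavailable: 0    # keep the full replica count serving traffic
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:v2   # illustrative image from the registry push step
        readinessProbe:                      # the "health check" box in the figure
          httpGet:
            path: /healthz                   # assumed health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
```

Old pods are only removed once a new pod passes its readiness probe, which matches the healthy / not-healthy branches in the figure.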
Autoscaling
• HPA (Horizontal Pod Autoscaler): checks metrics against a threshold and changes the number of replicas in the Deployment, scaling the number of pods in and out.
• VPA (Vertical Pod Autoscaler): checks metrics against a threshold and changes the CPU / memory values of the pods, scaling individual pods up and down.
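A minimal HPA manifest corresponding to the horizontal (left-hand) side of the figure; the target name and the 50% CPU threshold are illustrative values, not from the slide:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # illustrative deployment to scale
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50   # the "threshold is met?" check
```

When average CPU usage across the pods crosses the threshold, the controller changes the Deployment's replica count, scaling the number of pods in or out.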
Stateful workloads
Storage in K8S
• Ephemeral (local): data lives inside the pod/container and is deleted together with the pod.
• Local disk (local): data is stored on the host's local disk; it survives pod deletion but becomes unavailable if the host fails.
• Shared storage (remote): external network storage shared by multiple pods; data survives pod deletion and a host failure does not affect the service.
• Block storage (remote): external network storage allocated per pod; data survives pod deletion and a host failure does not affect the service.
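A sketch of how these options appear in a pod spec, using the upstream volume types (emptyDir for ephemeral data, hostPath for the host's local disk, and a PersistentVolumeClaim for remote storage); every name and path below is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: storage-demo              # illustrative
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: scratch
      mountPath: /scratch         # ephemeral: removed together with the pod
    - name: hostdata
      mountPath: /hostdata        # host local disk: unavailable if the host fails
    - name: data
      mountPath: /data            # remote storage claimed through a PVC
  volumes:
  - name: scratch
    emptyDir: {}
  - name: hostdata
    hostPath:
      path: /var/lib/app          # illustrative host path
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc         # illustrative claim name
```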
Volume Plugin in Kubernetes
In-house plugin development:
• Integrates with OpenStack (Cinder)
• Multi-tenancy support
• Authentication / authorization integration (internal company auth)
• Implemented as a FlexVolume driver
• Applies operational know-how from our Docker volume plugin
• On-line resize support
• Read-only multi-attach support
• Snapshot support
• CephFS FUSE / kernel mount support
• RBD multi-attach prevention (lock)
• Blacklist addition when a node is drained
• IO monitoring
• Front-end QoS (using cgroups)
• Quota support
• ...
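The plugin itself is NAVER-internal, so the manifest below is only a generic sketch of how a workload would consume such a driver through a StorageClass and a PVC; the provisioner name and all parameters are assumptions, not the actual values used in the plugin:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd                       # illustrative class name
provisioner: example.com/ceph-rbd      # assumed provisioner name
parameters:
  pool: kube                           # assumed Ceph pool
  fsType: ext4
allowVolumeExpansion: true             # corresponds to the on-line resize feature
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc                       # illustrative claim name
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 10Gi
```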
Statefulset
Each StatefulSet pod gets its own PVC and PV:

  PODs     PVC           PV
  web-0    www-web-0     pv-uuid
  web-1    www-web-1     pv-uuid
  web-2    www-web-2     pv-uuid
  ...      ...           ...

  apiVersion: apps/v1
  kind: StatefulSet
  spec:
    replicas: 3
    template:
      spec:
        containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          volumeMounts:
          - name: www
            mountPath: /usr/share/nginx/html
    volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
Volume Plugin in NAVER
Figure: when a StatefulSet is created from its YAML definition, the plugin authenticates against Keystone and requests volume creation through Cinder; the Ceph provisioner creates an RBD volume in the Ceph cluster and the PV is checked in; on scheduling, the Ceph driver attaches and mounts the volume on the worker node (kernel rbd map to /dev/rbd0 plus mount) and hands it to the pod.
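NAVER's FlexVolume driver is internal, so for illustration only, the upstream in-tree way to describe the RBD image that the provisioner creates looks roughly like the sketch below; the monitor address, pool, image, user and secret names are all assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-rbd-example                 # illustrative
spec:
  capacity:
    storage: 10Gi
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Delete
  rbd:
    monitors:
    - "10.0.0.1:6789"                  # assumed Ceph monitor address
    pool: kube                         # assumed pool
    image: pv-uuid                     # RBD image created by the provisioner
    user: admin                        # assumed CephX user
    secretRef:
      name: ceph-secret                # assumed secret holding the keyring
    fsType: ext4
```

On attach, the node maps the image (e.g. /dev/rbd0 as in the figure) and mounts it into the pod.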
Distributed platform on distributed storage
Figure: Kafka brokers and Elasticsearch (warm) nodes as containers spread over the Ceph cluster.
• Kafka on Ceph RBD (3x replication): Kafka already keeps 3 copies itself, so placing it on 3-way replicated RBD yields 3 x 3 = 9 copies of the data.
• Elasticsearch (warm) on Ceph erasure coding (1.5x overhead): with ES keeping 2 copies, 2 x 1.5 = 3 copies in total.
Single Copy Storage
Figure: Kafka brokers #1-#3 are each attached to a different zone group; every zone group exposes single-copy volumes (VOL#1, VOL#2, VOL#3, ...) carved from local volume groups (VG of disks) and served over iSCSI.
Since Kafka already replicates data across brokers at the application level, the underlying volumes can be kept as a single copy instead of 3x-replicated RBD.
Ceph Storage
Ceph Storage Service
Figure: consumers of the Ceph cluster: Docker/K8s, internal storage services, QEMU (PM/VM), the Docker Registry and OpenStack use RBD through rbd.ko, nbd.ko and librbd; object storage is served over the SWIFT and S3 APIs with company authentication and an Object -> NFS export; CephFS is accessed via FUSE and kernel clients and also exported as NFS and iSCSI.
Migration to BlueStore (+ NVMe)
• FileStore (before): RAID1 disks for the OS, SSDs for the journal and for Docker, and 8 x 6 TB data disks = 48 TB per node (66% used).
• BlueStore (after): a single NVMe device carries the OS, the DB/WAL, and a BCache cache in front of the data disks, leaving room for 12 x 6 TB data disks = 72 TB per node (100% used).
Providing CephFS
• Shared file system (like NFS)
• POSIX-compliant file system
• Data pool (coexists with RBD pools on the same cluster)
• Metadata pool
• Multiple MDS servers
• Hot standby / standby MDS
• Scheduling
• Clients access file data blocks directly on the OSDs
• FUSE / kernel mount
• Quota support
https://docs.ceph.com/docs/master/_images/cephfs-architecture.svg
MDS High Availability: Standby MDS
Figure: two configurations. Floating standby MDS: active MDS #1 (rank 0) and MDS #2 (rank 1) share a single standby MDS. Hot standby MDS: each active MDS has its own hot-standby (standby-replay) MDS.
Multiple MDS

  ceph fs set <fs_name> max_mds 3

Figure: with a single MDS, rank 0 becomes a bottleneck; raising max_mds activates multiple MDS daemons (ranks 0, 1, 2) that share the metadata load.
Subtree Pinning (static)

  cephfs_volume_prefix = /ceph_ssd
  setfattr -n ceph.dir.layout.pool -v SSD_POOL /ceph_ssd
  setfattr -n ceph.dir.pin -v <rank> <path>

Figure: directories under the root (e.g. /ceph_hdd, /ceph_ssd and per-shard data directories) are statically pinned to specific MDS ranks (0, 1, 2); the file layout of /ceph_ssd points at the SSD pool.
CephFS clients: fuse / kernel
• ceph-fuse: the CephFS client runs in user space on top of FUSE (Application -> VFS -> FUSE -> ceph-fuse); supports quotas.
• kernel mount: the client lives in the kernel (Application -> VFS -> CephFS kernel module); fast, but requires kernel support.
Block Cache: bcache
• bcache in writeback mode on NVMe, caching random reads/writes in front of the data disks
• In the mainline kernel since 3.10
• Alternatives: Flashcache (Facebook), EnhanceIO
https://pommi.nethuis.nl/ssd-caching-using-linux-and-bcache/