Learnings from scaling Ironic at Yahoo
Arun S A G
saga@yahoo-inc.com
zer0c00l on freenode
Yahoo Inc
May 08, 2017
https://github.com/sagarun/presentations/
Background and Architecture
Cluster Architecture
[Diagram: cluster architecture. Labels recoverable from the figure:]
◮ Database nodes: db[1-2].ostk (mysql rw / mysql ro, behind a VIP)
◮ API nodes: api[1-3].ostk (nova-api, nova-scheduler, neutron-server, nova-compute, keystone, glance-api, glance-registry, ironic-api, Horizon)
◮ Ironic conductors: ic[1-2].ostk
◮ Message queue: mq[1-2].ostk (rpc on 5671, 5672)
◮ ats[1-2].ostk: ATS proxy to yapache http server (images + ipxe via https)
◮ its[1-2].ostk (10g nic): neutron-agent, dhcp (67), tftp (69), ipxe binary via tftp
◮ NFS target for glance (nfs -rw, nfs -ro)
◮ zookeeper ensembles
◮ Existing Console / Tools / Bootbox server
◮ OOB - ipmi power management of target nodes
◮ Networks: Openstack Mgmt Vlan, Out of band network, Corp vlan
Migrating to Ironic
◮ Import nodes from the old system into Ironic
◮ Create a neutron port for each node
◮ If the node is already active in the old system, 'fake' boot it with the fake_pxe driver
◮ Once everything is successful, switch to the pxe_ipmitool driver
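The migration order above can be sketched as a small planning function. This is only an illustration of the sequencing, not the tooling used at Yahoo: `LegacyNode`, `plan_migration`, and the step strings are hypothetical names, and the handling of inactive nodes is an assumption. Only the driver names `fake_pxe` and `pxe_ipmitool` come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LegacyNode:
    """Hypothetical record exported from the old provisioning system."""
    hostname: str
    mac: str
    active: bool  # True if the node is already in service in the old system


def plan_migration(node: LegacyNode) -> List[str]:
    """Return the ordered migration steps for one node.

    Active nodes start on fake_pxe and are switched to pxe_ipmitool at the end.
    Inactive nodes are enrolled with pxe_ipmitool directly (an assumption;
    the slide only describes the active case).
    """
    initial_driver = "fake_pxe" if node.active else "pxe_ipmitool"
    steps = [
        f"import {node.hostname} into Ironic with driver={initial_driver}",
        f"create neutron port for {node.mac}",
    ]
    if node.active:
        steps.append(f"'fake' boot {node.hostname} with fake_pxe")
        steps.append(f"switch {node.hostname} to driver=pxe_ipmitool")
    return steps


if __name__ == "__main__":
    for step in plan_migration(LegacyNode("host1", "00:1c:1a:1d:10:54", True)):
        print(step)
```

Keeping the plan as data makes it easy to review or dry-run a batch before any real API call is made.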
Ironic
Ironic Setup
◮ Ironic API runs behind an Apache server
◮ Two Ironic conductors
What could possibly go wrong?
◮ Ironic boots started to fail
◮ ironic-conductor was using a lot of CPU
◮ Ironic API calls took too long
Solutions
◮ Sync_Power_State periodic task
◮ Increase the number of Ironic conductors
◮ Run multiple conductors on the same host
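A rough back-of-the-envelope model shows why the power-state sync task and the conductor count matter together. The sketch below assumes every node is polled once per interval and that nodes are spread evenly across conductors; the node count and intervals are hypothetical inputs, not figures from the talk. In Ironic the interval is governed by the `sync_power_state_interval` option in the `[conductor]` section, and with an ipmitool-based driver each poll typically means running an `ipmitool` process.

```python
def power_sync_calls_per_second(nodes: int, interval_s: float, conductors: int) -> float:
    """Average power-state queries per second that each conductor must issue,
    assuming one query per node per interval and an even node distribution."""
    if nodes < 0 or interval_s <= 0 or conductors < 1:
        raise ValueError("need nodes >= 0, interval_s > 0, conductors >= 1")
    return nodes / interval_s / conductors


if __name__ == "__main__":
    # Hypothetical fleet of 24000 nodes.
    print(power_sync_calls_per_second(24000, 60, 2))   # 200.0
    print(power_sync_calls_per_second(24000, 60, 8))   # 50.0 (more conductors)
    print(power_sync_calls_per_second(24000, 300, 2))  # 40.0 (longer interval)
```

Either lever reduces per-conductor load: adding conductors divides the work, while a longer interval reduces the total work at the cost of staler power state.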
Neutron
Neutron setup
◮ All 3 API servers run neutron-server
◮ 24 API/RPC workers
◮ 4 neutron DHCP agents
◮ All networks/subnets are managed by all 4 agents (HA)
◮ ISC DHCPD driver instead of dnsmasq
What is sync state?
A tale of two drivers
◮ OMShell driver
◮ Pypureomapi driver
OMShell

    -bash-4.1$ omshell
    > server 127.0.0.1
    > port 7911
    > key keyname secret
    > connect
    obj: <null>
    > new host
    obj: host
    > set hardware-address = 00:1c:1a:1d:10:54
    obj: host
    hardware-address = 00:1c:1a:1d:10:54
    > open
    obj: host
    hardware-address = 00:1c:1a:1d:10:54
    ip-address = 0a:d7:a6:b1
    name = "hostname.yahoo.com-0"
    hardware-type = 00:00:00:01
    > remove
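In the session above, omshell prints the host's `ip-address` as colon-separated hex bytes rather than dotted decimal. Any tool that parses omshell output has to convert it back; a minimal sketch using only the standard library follows. The function name `omapi_hex_to_ip` is a hypothetical helper, not part of omshell or any driver.

```python
import ipaddress


def omapi_hex_to_ip(value: str) -> str:
    """Convert an omshell-style hex byte string such as '0a:d7:a6:b1'
    into a dotted-decimal IPv4 address."""
    raw = bytes(int(part, 16) for part in value.split(":"))
    if len(raw) != 4:
        raise ValueError(f"expected 4 bytes, got {len(raw)}")
    return str(ipaddress.IPv4Address(raw))


if __name__ == "__main__":
    print(omapi_hex_to_ip("0a:d7:a6:b1"))  # 10.215.166.177
```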
Sync State with OMShell
Sync State with PypureOMAPI
Where do we go from here?
◮ ISC DHCPD restarts are not ideal
◮ The VIP thinks dhcpd is down whenever it restarts
◮ Move to the Kea DHCP server
Density Test
When did things start to break?
◮ At 24500 nodes, the API servers started swapping
Swap and memory usage on API nodes
Memory usage
◮ Neutron is the biggest user of memory: 1.4 GB per process
◮ Subnets: 2500, ports: 43000
◮ Easy fix: reduce the number of api_workers and rpc_workers
◮ Long-term fix: investigate memory usage, isolate neutron
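The effect of the easy fix can be estimated with simple arithmetic: each neutron-server worker is a separate process, so resident memory grows roughly linearly with `api_workers + rpc_workers`. The sketch below uses the 1.4 GB per-process figure from the slide; the worker splits (12 + 12 and 4 + 4) are assumptions for illustration, since the slides only give a combined count of 24, and the parent process is ignored.

```python
def neutron_server_memory_gb(api_workers: int, rpc_workers: int,
                             per_process_gb: float = 1.4) -> float:
    """Approximate memory used by neutron-server workers on one host,
    assuming every worker process grows to the same size."""
    if api_workers < 0 or rpc_workers < 0:
        raise ValueError("worker counts must be non-negative")
    return (api_workers + rpc_workers) * per_process_gb


if __name__ == "__main__":
    # Hypothetical split of 24 workers, then a reduced configuration.
    print(round(neutron_server_memory_gb(12, 12), 1))  # 33.6
    print(round(neutron_server_memory_gb(4, 4), 1))    # 11.2
```

Because the API nodes also host nova, keystone, glance, and ironic-api, tens of gigabytes for neutron alone is consistent with the swapping described on the density-test slides.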
Learnings
Learnings
◮ Do *density* and *scale* testing before taking on production
◮ Avoid spawning processes; use native Python libraries whenever possible
◮ Pay attention to periodic tasks
◮ Be prepared to scale horizontally
◮ Pay attention to the number of workers, conductors, and rpc_workers
◮ Don't forget to have fun :)
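The cost behind the "avoid spawning processes" learning is easy to measure. The sketch below is a generic micro-benchmark, not code from Ironic or the DHCP drivers: it computes the same trivial result once by spawning a child Python interpreter (a stand-in for shelling out to a CLI such as omshell or ipmitool) and once in-process. All function names are hypothetical, and the absolute timings depend on the machine.

```python
import subprocess
import sys
import time


def add_via_subprocess(a: int, b: int) -> int:
    """Compute a + b by spawning a child interpreter and parsing its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", f"print({a} + {b})"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def add_native(a: int, b: int) -> int:
    """Compute a + b in-process."""
    return a + b


def seconds_per_call(fn, repeat: int = 20) -> float:
    """Average wall-clock seconds per call of fn(2, 3) over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn(2, 3)
    return (time.perf_counter() - start) / repeat


if __name__ == "__main__":
    print(f"subprocess: {seconds_per_call(add_via_subprocess):.6f} s/call")
    print(f"native:     {seconds_per_call(add_native, repeat=10000):.9f} s/call")
```

Multiplied by tens of thousands of nodes and a periodic task that runs every minute, the per-call overhead of process creation becomes a significant share of CPU time.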
Questions
References
◮ Layout and background: https://github.com/mtreinish/openstack-health-presentation
◮ Picture from TV show: http://www.imdb.com/title/tt4338930/
◮ Picture of explosion: https://en.wikipedia.org/wiki/Explosion