Resource maintenance
- Sometimes we need to restart services (e.g. to reload config)
- Can't put a single resource on a single node into maintenance mode
- Can only put a whole node, or a whole service (cluster-wide), into maintenance mode (see the sketch below)
Can we do better? Do we need to?
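To make the granularity gap concrete, here is a minimal crmsh sketch (node and resource names are illustrative, and a reasonably recent crmsh is assumed):

```
# Whole node: Pacemaker stops managing everything running on controller1
crm node maintenance controller1

# Whole resource: maintenance applies on every node at once
crm resource maintenance openstack-glance-api on

# There is no "this resource on this node only" variant in between
```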
- single master with replication: SOLVED (yeah, right…)
- neutron: HA is tricky, but out of scope for this talk
- cinder-volume (block storage backend): active-active support still under development
- keepalived often used instead for VIPs
- Pacemaker manages the remaining control-plane services
- active-active API services are now controlled by systemd, which auto-restarts them on crash
Standard active-passive HAProxy architecture (sketched below):
Any downsides?
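As a rough illustration of that architecture, a minimal crmsh sketch of the usual VIP plus HAProxy pairing (resource names and the address are illustrative):

```
# Virtual IP that the API endpoints are published on
crm configure primitive vip-public ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s

# HAProxy fronting the (active-active) API services
crm configure primitive haproxy systemd:haproxy \
    op monitor interval=30s

# Keep the VIP and HAProxy together and start them in order
crm configure group g-haproxy vip-public haproxy
```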
https://launchpad.net/openstack-resource-agents
- systemd: resources under Pacemaker (Pacemaker auto-restarts the service on crash)
- systemd only, no Pacemaker, as per Red Hat (systemd auto-restarts the service on crash; see the drop-in sketch below)
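For the systemd-only option, the restart-on-crash behaviour is plain unit configuration; a minimal sketch using a drop-in (the glance unit name is just an example):

```
# /etc/systemd/system/openstack-glance-api.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=5
```

systemd will respawn a crashed process, but it has no notion of an application-level health check.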
Pros
- familiar systemctl / service(8) interface
- start / stop / status delegated to systemd (Pacemaker talks to systemd via dbus signalling)
Cons
- application-level health checks would mean extending systemd, i.e. systemd development
- a meaningful monitor still has to be implemented in an OCF RA
- systemd quirks
systemd: / ocf: hybrids
- start / stop / status via systemd
- monitor via OCF RA
Pros
- pair a systemd resource with a monitor-only OCF RA (container meta-attribute)
- no systemd changes required
http://blog.clusterlabs.org/blog/2016/composable-openstack-ha
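A hedged crmsh sketch of the pairing idea, with a hypothetical monitor-only agent ocf:openstack:glance-api-monitor (the blog post above describes the actual container meta-attribute mechanism):

```
# systemd owns start/stop/status of the service
crm configure primitive glance-api systemd:openstack-glance-api \
    op monitor interval=60s

# hypothetical monitor-only OCF RA doing a real application-level health check
crm configure primitive glance-api-check ocf:openstack:glance-api-monitor \
    op monitor interval=30s

# keep the health check on the same node as the service it watches
crm configure colocation glance-check-with-api inf: glance-api-check glance-api
```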
Can we auto-promote remotes to core to maintain quorum?
Can we do better? Do we need to?
booth is the obvious candidate for deciding the master site
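A minimal booth.conf sketch of how that could look, assuming two sites plus an arbitrator (all addresses and the ticket name are illustrative):

```
# /etc/booth/booth.conf
transport = UDP
port = 9929
site = 192.0.2.10          # site A
site = 198.51.100.10       # site B
arbitrator = 203.0.113.10  # tie-breaker, not a full site
ticket = "master-site"
    expire = 600           # seconds before an unrenewed ticket is revoked
```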
pacemaker_remote to the rescue!
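A minimal sketch of attaching a compute node as a pacemaker_remote node (host name is illustrative); the compute node runs only pacemaker_remoted, not the full corosync/pacemaker stack:

```
# On the cluster: represent the compute host as a remote node
crm configure primitive compute1 ocf:pacemaker:remote \
    params server=compute1.example.com \
    op monitor interval=30s
```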
NovaCompute / NovaEvacuate OCF agents
- live in the openstack-resource-agents repo (sort of)
- nova-compute: what if nova fails during recovery?
Different cloud operators will want to support different SLAs with different workflows:
In some failure scenarios, VMs are still perfectly healthy but unmanageable.
Should they be automatically killed? Depends on the workload.
| Admin network | Neutron network | Storage network | nova-compute / libvirtd | Corrective action (cattle) | Corrective action (pets) |
|---|---|---|---|---|---|
| Down | Up | Up | Up | Fence → resurrect | Notify operator |
| Up | Down | Up | Up | Migrate | |
| Up | Up | Down | Up | Fence, resurrect | |
| Up | Up | Up | Down | Restart → notify operator | |
openstack-resource-agents-specs repository
- libvirtd OCF RA
- NovaCompute OCF RA
e.g. NovaCompute with:
- start-failure-is-fatal=false
- migration-threshold=3
After the third failure, Pacemaker gives up!
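A hedged crmsh sketch of that configuration (agent name per the openstack-resource-agents repo; intervals are illustrative); note that start-failure-is-fatal is a cluster-wide property, while migration-threshold is set per resource:

```
# A failed start counts as a normal failure instead of banning the node outright
crm configure property start-failure-is-fatal=false

# After 3 failures on a node the resource is moved away; once every node has
# hit the threshold, Pacemaker stops trying
crm configure primitive nova-compute ocf:openstack:NovaCompute \
    meta migration-threshold=3 \
    op monitor interval=30s timeout=300
```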
We need something more proactive.
Log messages other than DEBUG-level ones should be targeted at users.
Already some great steps in the right direction, e.g. crm_simulate -SL, but documentation / man pages are often out of date or incomplete.
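For example, crm_simulate can dry-run what the policy engine would do next (the saved-CIB path is illustrative):

```
# Simulate (-S) the next transition against the live CIB (-L)
crm_simulate -SL

# The same against a saved CIB, e.g. for offline post-mortem analysis
crm_simulate -S -x /path/to/saved-cib.xml
```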
Like with a gossip protocol
e.g. 5-node cluster, nodes 2/3/4 lose connection to node 5
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.