[06:37:43] topranks: you trunked a new vlan for cloudgw1003 in https://netbox.wikimedia.org/extras/changelog/213525/ but then "automation" removed it (I guess because it's not configured on the host) in https://netbox.wikimedia.org/extras/changelog/265623/ so there is an outstanding diff on the switch.
[06:39:39] thanks
[06:39:53] hopefully the automation tidies up its mess :D
[06:41:09] joking… I’ll run homer shortly
[06:55:34] This sre_bot guy always messing things up
[07:37:04] greetings
[07:41:08] I'll get ready to resume the tests in T417393
[07:41:08] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393
[07:42:34] morning
[07:45:23] hey dcaro
[07:47:05] ok so today I'm going to try again, one at a time, hosts that are meant to perform automatic failover: namely cloudgw1003.eqiad.wmnet cloudlb1001.eqiad.wmnet cloudnet1005.eqiad.wmnet cloudservices1006.eqiad.wmnet cloudcontrol1011.eqiad.wmnet cloudrabbit1001.eqiad.wmnet
[07:47:31] i.e. the rest of C8, without ceph because we tested it last time
[07:51:15] ack, let me know if you want me to check/monitor anything specific, I'm around otherwise
[07:51:44] cheers, yeah generally keeping an eye on alerts should be enough I think
[07:51:58] I got https://alerts.wikimedia.org/?q=team%3Dwmcs open, will be looking at irc too
[07:54:27] 👍
[07:59:55] I'll go in the same order as the task, cloudgw1003 first
[08:01:16] ok cloudgw1003 done
[08:02:10] morning
[08:02:43] hey taavi
[08:05:38] ok expected for bots to quit, things seem to be working so far
[08:07:07] I'll wait for cloudgw1004 network to plateau then move to cloudlb1001
[08:12:01] ok moving on to cloudlb1001
[08:13:56] {{done}} I just noticed that cloudlb traffic seems to follow cloudgw, i.e. traffic moved away from lb1001 when gw1003 went down
[08:14:27] or most of the traffic anyway
[08:20:01] moving on to cloudnet1005 unless there are objections ?
[08:20:17] go for it
[08:20:33] {{done}}
[08:21:24] bit anticlimactic as there's no load on cloudnet1005
[08:22:09] xd
[08:22:13] I'll do cloudservices1006
[08:22:32] again for that I imagine moving cloudgw already moved most of the traffic :P
[08:23:27] that services or net taavi ?
[08:24:01] services
[08:24:34] indeed
[08:25:00] the network always chooses the shortest path for these anycast services, and the server in the same rack is always closer than something in the other one
[08:27:03] heh thank you, lesson learned not to start from unplugging cloudgw
[08:27:45] moving on to cloudrabbit1001
[08:29:48] {{done}}
[08:33:58] ok i'm looking at https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency and some latencies are rising, e.g. heat
[08:36:43] more or less looks like the same as T418444
[08:36:43] T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update - https://phabricator.wikimedia.org/T418444
[08:38:33] ok I'll bring back cloudrabbit1001
[08:38:44] unless someone wants to look further ?
[08:40:00] I'll take that as a no
[08:40:31] ok host is back, let's see if there's recovery
[08:43:59] there was a spike on 500s for cloudlb1002, probably rabbit
[08:46:18] indeed, ok kinda expected openstack api is not recovering by itself, I'm +1 to restart openstack services via cookbook
[08:46:32] unless folks want to take a look before I do that ?
[08:47:13] I think that's ok, it seems to me as if openstack did not failover to the other rabbit node, and was trying to reconnect to the same
[08:47:24] it's back up again right?
[08:47:36] `Mar 18 08:42:57 cloudcontrol1007 nova-api-wsgi[3618]: 2026-03-18 08:42:57.217 3618 INFO oslo.messaging._drivers.impl_rabbit [-] [1afdbe95-dc98-4f62-8e28-ff1b39fe8090] Reconnected to AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 57272.`
[08:47:48] but it seems to be having issues delivering/getting messages
[08:47:59] yes indeed, looks like that to me too
[08:48:04] `The reply 8947a11e8eec4d519422eeafe1f8bdba failed to send after 60 seconds due to a missing queue (reply_7f9954f505f44d89b04f26d910567f3a). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable`
[08:48:20] we can debug async I think
[08:48:34] +1 for restarting services
[08:48:40] ack, doing
[08:53:28] new alert, maybe the rabbit setup did get partitioned somehow? `summary: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned.`
[08:54:13] indeed
[08:55:07] yeah rabbitmqctl cluster_status says there's a 1001 / 1002+1003 partition
[08:56:08] :/, I'm a bit confused, I thought that a single node would not be able to become a partition for being out of quorum, but maybe we have a quorum of 1?
[08:57:02] could be, not sure yet, following https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition
[08:59:40] the default is to ignore partitions, and just push forward xd https://www.rabbitmq.com/docs/partitions
[08:59:45] maybe we are not changing it
[09:00:10] bouncing rabbit on cloudrabbit1001 seems to have done the trick, I don't see the partitions anymore
[09:01:25] commented out yep https://codesearch.wmcloud.org/search/?q=cluster_partition_handling
[09:01:56] that's a way of forcing rabbit to heal yep, sometimes it does not work though, and then you have to rebuild the whole cluster
[09:02:35] you take out the partition you don't want, and bring those nodes up one by one waiting for them to join the existing cluster, they will ignore their data and use the partition that was left up
[09:03:28] ok got it, how to tell whether restarting rabbit on 1001 worked to heal the cluster ?
[09:04:38] iirc it was tricky sometimes, as in it would not have some queues (botchered state), it showed up as random errors
[09:04:56] *butchered? partial inconsistent state essentially xd
[09:05:51] the number of 500 errors has flattened though, we did not have those stats back then from what I remember, maybe that's a good indicator
[09:06:04] (number of 500s being 0 -> recovered ok)
[09:06:11] which dashboard are you looking at ?
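The behaviour discussed above is controlled by a single setting in RabbitMQ's configuration; the default (`ignore`) matches what was observed here. A hypothetical fragment switching to the `pause_minority` mode floated later in the conversation (the file path and comment wording are assumptions, not the actual WMCS config):

```ini
# /etc/rabbitmq/rabbitmq.conf (path is an assumption)
# Default is "ignore": both sides of a partition keep running and state diverges.
# With pause_minority, the side that loses quorum pauses until the partition
# heals, so a single isolated node (e.g. cloudrabbit1001) stops serving
# rather than split-braining against the other two.
cluster_partition_handling = pause_minority
```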
[09:06:41] the one you passed, api-latency, the graphs on the right, responses/s per response code
[09:07:18] https://grafana-rw.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency
[09:07:53] ack thank you
[09:09:25] ok of course if things can recover by themselves that'd be better, I don't see latencies going down at least atm
[09:10:17] for rabbit, we might want to configure it to not split brain when one node is isolated, using `cluster_partition_handling = pause_minority`
[09:11:19] should be safe with three bare metal nodes (it seems to have issues if in VMs that get suspended, due to time shifting for the VMs)
[09:11:44] maybe andrewbogot.t already tried in the past and there was some issue or something though xd
[09:12:35] the other potential issue is why did openstack not failover? (or seemed not to do so)
[09:13:46] yes that too
[09:14:09] ok I'm looking at https://grafana.wikimedia.org/goto/bfgdii9275pmoc?orgId=1 and "ready messages" does not inspire confidence
[09:15:57] it seems to be piling up ype
[09:15:59] *yep
[09:16:03] that seems to start when I restarted rabbit on 1001
[09:16:15] sigh ok it seems it is "reset rabbit" time
[09:16:56] 🐰
[09:18:13] morning
[09:18:31] that would be wmcs.openstack.rabbitmq.rebuild_rabbit_cluster dhinus ?
[09:18:35] err sorry, dcaro
[09:18:37] good morning dhinus
[09:23:50] actually nevermind I think rabbit is recovering
[09:25:10] ok I'll bring back the hosts that were previously shut
[09:26:19] hosts are back
[09:26:19] this went much nicer than last time
[09:26:29] indeed, very true
[09:26:39] and worse than next time, ideally
[09:27:24] I have this feeling the other spicy host is going to be cloudservices
[09:27:30] err cloudcontrol
[09:38:01] hmm, it took ~30min for rabbit to start shrinking the queued messages
[09:43:57] indeed, I'll take a break
[11:56:04] i wrote a very tiny script https://phabricator.wikimedia.org/P89878 to slowly restart ~all toolforge web services to backfill the httproute resources, any opposition to me starting that now in the tools project?
[13:13:12] lgtm
[13:20:40] lgtm
[14:07:47] godog: thanks again for running those failure tests. do we already have a subtask to test/experiment more with rabbitmq splitbrain? If not, I'll make one
[14:08:39] andrewbogott: sure np! yes I've piggybacked on T418444 since it is the same issue
[14:08:40] T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update - https://phabricator.wikimedia.org/T418444
[14:08:59] ok!
[14:10:08] also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1254877 for the cluster partition handling change
[14:10:20] The other nagging question I have is whether or not dnsaas can actually fail over properly when a cloudcontrol goes down. It sounds like that wasn't an issue.
[14:11:09] but you can see that VM scheduling failed for about 20 minutes: https://openstack-browser.toolforge.org/project/admin-monitoring
[14:11:12] you would only see issues relating to record updates when taking down a cloudcontrol, right?
[14:11:24] oh, yes, was a cloudcontrol not part of this test?
[14:11:41] although rabbit misbehavior would also mess up things on cloudcontrols
[14:11:59] it was, though after cloudrabbit and I stopped after that, i.e. cloudcontrol yet to do
[14:13:06] ok! and the scheduling failures linked above would fit with rabbit misbehavior.
[14:13:12] So the designate question remains
[14:16:38] speaking of which, how did the cloudcontrol rolling reboot go? meaning is there any explicit action other than reboot ?
[14:21:48] Reboot is part of the upgrade cookbook, so we got the reboots 'for free'
[14:22:09] but that also means I didn't test a simple reboot since the upgrade includes API version changes and such
[14:23:21] got it, ok!
[14:37:16] taavi: I'm ready for another review of https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1248047 -- at your convenience
[14:39:44] code is less ugly than before, tests are uglier
[14:48:07] thanks!
[15:15:17] andrewbogott: re: https://openstack-browser.toolforge.org/project/admin-monitoring you mentioned, anything to do for those VMs left behind or they will be cleaned up automatically ?
[15:16:19] They hang around for post-mortem and then I delete them manually.
[15:16:26] Which I will do shortly since I think we know what happened.
[15:16:33] SGTM
[15:59:02] 🦄 0 alerts! \o/
[16:45:07] I've repooled clouddb1013 as it's been working fine for the past 48 hours, cc taavi
[16:47:19] more details at T420177
[16:47:20] T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177
[16:49:10] 🎉
[17:14:22] I'm slightly worried clouddb1013 might crash again now that it's receiving user traffic.. andrewbogott if you see any alert about it later today, please depool it again, I added the command here https://phabricator.wikimedia.org/T420177#11724310
[17:14:52] ok! It will alert if it crashes?
[17:14:59] And you're talking about maria crashing, not the server right?
[17:15:31] yep exactly
[17:15:51] it will alert but I'm not sure if the alert is routed to our team tag
[17:16:18] the alert is not about the crash itself, but about replag increasing, because after crashing mariadb restarts, but replication does not
[17:17:04] ok. So... how did you notice when it crashed originally? #wikimedia-operations?
[17:17:40] good question, I think taavi noticed first, it was over the weekend
[17:17:53] hmmm ok
[17:17:54] I noticed a replag alert on #wikimedia-cloud-feed
[17:18:07] great, I will give that an extra eye then thanks
[17:18:10] update: about half of the 2,000 tools running an active webservice have been restarted to pick up gateway api routing configuration
[17:18:14] thanks both!
[17:18:49] (we could probably do that a lot faster, but I've made the script run extra slow in case of bonus surprises and we're not in a particular hurry with that)
[17:30:22] * dhinus off
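The deliberately-slow restart script mentioned in the log (P89878) isn't shown here; a minimal sketch of the throttled rolling-restart idea it describes, with a hypothetical `restart_tool` callable standing in for whatever actually bounces a Toolforge webservice:

```python
import time


def rolling_restart(tools, restart_tool, delay_seconds=30.0, sleep=time.sleep):
    """Restart each tool's webservice one at a time, pausing between
    restarts so any problems surface gradually rather than all at once.

    `restart_tool` is a hypothetical callable (the real script at P89878
    may work quite differently); `sleep` is injectable for testing.
    """
    restarted = []
    for tool in tools:
        restart_tool(tool)      # e.g. shell out to the webservice restart command
        restarted.append(tool)
        sleep(delay_seconds)    # throttle: spread ~2,000 restarts over many hours
    return restarted
```

With a 30-second delay, ~2,000 restarts take roughly 17 hours, which matches the "extra slow, no particular hurry" approach described in the log.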