[05:23:50] morning. I see several paging alerts for ceph got fired and then resolved
[05:26:21] I see some cleanups need to be done
[05:26:44] the graphs show dropped packets in the switch
[05:27:27] ok, well, I will be back later to properly review all of this
[07:09:15] * dcaro paged
[07:09:17] looking
[07:11:16] It seems that the stats have been very spotty, as in missing chunks of data
[07:15:03] I've started undraining a few more osds to increase redundancy
[07:19:47] recovery is being quite slow
[07:24:39] hmm... the metrics come without an `instance` label
[07:27:12] The network is a bit jumpy with RTT between the prometheus node and the mon nodes, jumping from 0.X to >6ms from time to time (both ok, though weird)
[07:27:16] https://www.irccloud.com/pastebin/esmoUuPp/
[07:28:12] from within the cloudnet is way stabler (always <0.3ms)
[07:28:24] (might make sense)
[07:30:24] the mon seems to be replying ok to requests so far
[07:30:32] (metrics requests)
[07:30:38] root@prometheus1005:~# curl -v http://cloudcephmon1006:9283/metrics
[07:43:11] hmm... openstack apis seem a bit unresponsive
[07:43:15] Sep 18 07:40:04 cloudcontrol1005 wmcs-dns-floating-ip-updater[4185481]: requests.exceptions.HTTPError: 504 Server Error: GATEWAY TIMEOUT for url: https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones/61d722cc-d44a-4a3c-b431-44c28e9debeb/recordsets
[07:45:47] and of course, since I'm looking, the ceph stats have been reliable xd, heisenbug!
[07:48:12] nova-fullstack has been failing with a timeout waiting for a record
[07:48:13] Timed out waiting for A record for fullstackd-20240918074231.admin-monitoring.eqiad1.wikimedia.cloud
[07:48:40] or waiting for a vm to be created: 'Exception: Creation of 69184699-5754-4c8d-95ea-e88491008512 timed out'
[07:59:44] dcaro: em yeah probably our biggest bottleneck is the cloudsw<->cr link, there will probably be some buffering there (and the cloudsw model doesn't have much), might explain occasional jumps in latency from prometheus
[07:59:54] 6ms isn't huge, but it is a massive increase which is not good
[08:01:05] thanks for the explanation :), that should not be giving us issues, unless it starts dropping (that I have not seen)
[08:01:08] looking at the switches they seem stable since then
[08:01:22] I would call it a "theory" rather than an explanation - but it's not unlikely
[08:02:41] how are we looking in general? do you suspect any network issues or is it more perhaps just the cluster not recovering as quickly as expected?
[08:04:19] I have not seen network 'breakage', but I notice the cluster does not seem to saturate the network as it did before and that's a bit weird
[08:04:28] https://usercontent.irccloud-cdn.com/file/mDV6yVLO/image.png
[08:04:39] might be the cluster though
[08:05:18] do you see any retransmits or similar perf indicators in the switch?
[08:08:50] designate is complaining about db connections timing out, looking, maybe the mariadb cluster is not healthy
[08:09:55] restarting designate services
[08:10:23] no, but the switches really either drop or forward, they don't manage flows or do any retransmits etc
[08:10:27] topranks, dcaro, congrats on the upgrade!!! And thanks a lot.
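(For context, a minimal sketch of the kind of latency and scrape checks discussed above; the host names come from the log, but the flags and cycle counts are illustrative assumptions, not the exact commands behind the pastebins:)

  # report-mode mtr from the prometheus host to a ceph mon, to spot the occasional >6ms RTT spikes
  mtr --report --report-cycles 30 cloudcephmon1006
  # time a metrics scrape the way prometheus would hit the ceph mgr exporter
  curl -s -o /dev/null -w 'http %{http_code}, total %{time_total}s\n' http://cloudcephmon1006:9283/metrics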
[08:10:35] for instance between the two hosts above with timeouts I don't see an issue
[08:10:39] https://www.irccloud.com/pastebin/NyssQ2LU/
[08:10:58] XioNoX: the universe intervened but we got it over the line now :)
[08:11:15] sorry for the confusion, the cloudcontrol1005 timeouts are a different thing than ceph xd
[08:11:21] (there's a few fires at the same time)
[08:11:32] dcaro: brief iperf test hitting 6Gb/sec between two cloudceph nodes (going from e4 to f4)
[08:11:41] https://www.irccloud.com/pastebin/lYPqXFyD/
[08:11:46] that's good to know yep
[08:12:09] so doesn't seem to be a bottleneck there. we don't want to go crazy trying to stress-test obviously lest we affect real traffic!
[08:12:26] yep :), live testing is tricky
[08:14:33] openstack side: after restarting designate it seems to reconnect to the db, I still see some intermittent errors with rabbit connections
[08:16:25] ip updater is still timing out when querying zones :/
[08:16:26] Sep 18 08:15:58 cloudcontrol1005 wmcs-dns-floating-ip-updater[12861]: requests.exceptions.HTTPError: 504 Server Error: GATEWAY TIMEOUT for url: https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones
[08:16:36] these are connections to "cloudrabbit100x" from elsewhere?
[08:17:09] ^^ yeah so the mtr I did in the first paste above was between cloudcontrol1005 and that endpoint. Comms look good
[08:18:34] the destination IP is being injected into the switch routing tables by cloudlb1001 and cloudlb1002 but that looks ok
[08:18:56] ack
[08:19:01] from cloudcontrol1005 the traffic will route via cloudlb1001 as they are in that same rack
[08:19:02] I saw briefly a galera alert yesterday
[08:19:15] that was because cloudcontrol1005 was in the rack
[08:19:20] (one of the galera nodes)
[08:19:44] yeah, the alert seems to be gone today, but I wonder if we would need a full restart of openstack because of that
[08:20:09] kinda what I'm doing yep xd, nova is restarting now
[08:20:10] it wouldn't hurt I think, so I'd run it unless you don't think it's a good idea dcaro
[08:20:18] ok
[08:20:51] curl to the above url to get the zones gets an instant response (a no-auth error, but I can't reproduce the timeout)
[08:20:52] topranks: the connections to cloudrabbit come from cloudcontrols and cloudvirts, though iirc they go through cloudlb?
[08:21:13] i don't recall tbh, what's the destination hostname or IP?
[08:21:23] I think yes they may be behind the cloudlb
[08:21:58] they actually have all the nodes one by one rabbitmq01.eqiad1.wikimediacloud.org
[08:22:06] though that might be behind cloudlb too
[08:22:23] (same with 02 and 03)
[08:22:52] I don't remember either
[08:23:05] that's directly to the cloudrabbit host on the cloud-private network
[08:23:33] nova services are failing to start with errors connecting to rabbit, looking
[08:23:35] cmooney@cloudsw1-c8-eqiad> show arp | match 172.20.1.17
[08:23:36] 5c:6f:69:cd:61:b0 172.20.1.17 cloudrabbit1001.private.e irb.1151 [xe-0/0/21.0] none
[08:24:30] dcaro: we may need a rabbitmq reset... :-(
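(A rough sketch of the quick iperf throughput check mentioned above, assuming iperf3; the real hosts, port and duration in the pastebin may differ, and the run is kept short to avoid disturbing live traffic:)

  # on the receiving cloudceph node (rack f4 in the log)
  iperf3 -s -p 5201
  # on the sending node (rack e4), a short 5-second run
  iperf3 -c <receiving-host> -p 5201 -t 5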
[08:25:21] hmm... manually restarted one of the failed nova services and it came back up :/, I don't like flaky errors xd
[08:25:42] comms seem rock solid
[08:25:45] https://www.irccloud.com/pastebin/jOCenjaA/
[08:27:18] nice :), we have had issues with rabbit in the past being flaky
[08:33:00] * dhinus paged cloudvirt1039/nova-compute proc maximum
[08:33:06] (and a few other cloudvirts)
[08:33:17] yep, sorry about that
[08:33:29] I'll try restarting rabbit one by one, and restart everything else after
[08:33:59] they're all resolved now
[08:35:21] side note: I see some duplicated alerts in alertmanager, because of a different receiver tag
[08:36:56] restarted cloudrabbit01, restarting 02
[08:37:38] cluster seems up and stable, going with 03
[08:38:46] 03 restarted, giving it a minute and I'll restart all openstack services again
[08:39:48] ack
[08:40:14] starting the full restart of openstack
[08:40:39] arturo: duplicated alerts happen quite frequently, maybe it's T353457?
[08:40:40] T353457: Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457
[08:41:32] yep, duplicated receivers
[08:41:56] yeah, seems the same
[08:43:01] Sep 18 08:42:54 cloudvirt1033 nova-compute[3174700]: 2024-09-18 08:42:54.720 3174700 ERROR oslo.messaging._drivers.impl_rabbit [None req-65eb58a1-5a93-4ab4-96f1-fb5430f50679 - - - - - -] [2637bf51-ba19-44e7-839e-6d34b947cabe] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: . Trying again in 0 seconds.: amqp.exceptions.MessageNacked
[08:43:09] ^on a freshly restarted cloudvirt
[08:43:26] we may need a full rabbit reset
[08:43:32] sigh, rabbitmq ...
[08:47:11] I'll try that then
[08:50:39] thanks
[08:55:32] got this error when bringing 1002 back to the cluster, though starting the app then worked
[08:55:40] https://www.irccloud.com/pastebin/NjXk2vi6/
[08:56:58] cluster seems up and running now
[08:57:06] I have not seen this error before
[08:58:34] I'm getting timeouts from the mobile app for splunk when trying to ack the pages
[08:58:34] xd
[08:58:49] LOL
[08:59:02] they must have rabbitmq internally too
[08:59:11] I'll start restarting openstack services, see if that helped
[08:59:13] I'm acking everything from the web interface
[08:59:43] thanks
[09:00:15] from the nova PoV, some hypervisors are UP and some DOWN
[09:00:17] https://usercontent.irccloud-cdn.com/file/Cr3MQyMX/image.png
[09:00:21] let's see how things evolve
[09:00:41] yep, those are the ones that failed restarting nova due to rabbitmq connections I think
[09:01:04] I see some of them coming back online now
[09:01:50] the rabbitmq full reset procedure should be a cookbook, if we don't have it already
[09:05:50] hmm, this does not look good
[09:05:56] https://www.irccloud.com/pastebin/5c44z7vf/
[09:10:54] all hypervisors show as up now though
[09:11:05] reboot finished
[09:14:33] things seem to be stabilizing on the openstack side
[09:15:24] no more errors like the above since that one happened on rabbit either
[09:16:25] nova-fullstack is passing again
[09:16:35] I think that's good enough
[09:29:05] oh, found the issue with ceph stats
[09:29:11] the mon is taking too long to reply
[09:29:17] https://www.irccloud.com/pastebin/vy5NiCYT/
[09:29:42] I'll try increasing the timeout
[09:32:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073744 that should do it
[09:32:20] ^ reviewers welcome
[09:32:26] * dcaro starts running pcc
[09:33:26] sorry, was in a meeting
[09:33:27] LGTM
[10:31:47] * dcaro lunch
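(A hedged sketch of the rolling rabbit restart and health checks described above; the unit name and exact checks are assumptions, and the later "full reset" was a separate, manual procedure:)

  # on each cloudrabbit node in turn, letting the cluster settle in between
  systemctl restart rabbitmq-server
  rabbitmqctl cluster_status                     # all three nodes should be listed as running
  rabbitmqctl list_queues name messages | head   # spot-check that queues are draining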
[11:33:31] dhinus: I would like to merge this now https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46
[11:34:36] I'm eating lunch but you can go ahead
[11:35:27] thanks, I'll run a last minute double check and then merge
[11:42:02] I'll reduce the scope of the change, only merging a tiny security group import for eqiad1, in case it gives problems on the apply step
[12:03:44] I'm looking at the plan, why so many resources "created" and not just "imported"?
[12:03:51] this is a bit annoying :/ "There is no support at this time for upgrading or deleting CRDs using Helm."
[12:04:04] I might have to create a new chart just for the CRDs of things
[12:04:31] dhinus: I'm not importing the security group rules, just recreating them. There are many many rules, and it would have been too tedious
[12:04:57] so when running tofu apply, I delete by hand all the rules, then tofu apply re-creates them
[12:05:00] hmm... that means creating a new repo, toolforge component, and such... hmpf
[12:05:10] arturo: ok makes sense
[12:05:40] dcaro: is that for tekton?
[12:05:45] yep
[12:06:01] arturo: did you generate the yaml from the openstack cli?
[12:06:09] helm only supports installing CRDs on the first install, after that, there's no support
[12:06:13] dhinus: yes
[12:06:41] dcaro: ack
[12:07:19] arturo: nice, that should ensure nothing was forgotten
[12:07:46] have you already applied it to codfw?
[12:08:03] dhinus: note however that the security group definition itself is imported. And that's important because there could be cross-references based on the id, and re-creating would destroy them
[12:08:12] I see
[12:08:16] dhinus: I have applied it to both eqiad1 and codfw1dev
[12:09:38] but in eqiad1 you only have 1 sec group, are there more that you haven't imported yet?
[12:09:58] yeah, I split the MR in two, to only merge 1 sec group in the first MR
[12:10:05] ok
[12:10:13] the second one was also merged now: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47
[12:19:29] I noticed there are 2 alerts in victorops that have not resolved automatically
[12:20:11] https://portal.victorops.com/ui/wikimedia/incident/5186 and https://portal.victorops.com/ui/wikimedia/incident/5173
[12:21:59] one is from icinga, and it's resolved in icinga. the other one is from alertmanager, and it's resolved in alertmanager
[12:22:08] I will resolve them manually in victorops
[12:23:08] I think that also happened in the past, right? (some alerts resolved in icinga/am but not in splunk, as if it had lost the update)
[12:26:36] I'm unsure but maybe yes
[12:27:12] I can see 85 alerts in total in victorops (between yesterday and today) and all of them autoresolved apart from those 2
[12:27:40] I wouldn't worry too much
[12:31:56] ack
[12:40:17] ack
[13:32:27] crds are messy
[13:32:27] xd
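(On the helm CRD limitation mentioned above: a common workaround, sketched here with made-up chart names and paths, is to keep the CRD manifests outside helm's lifecycle, apply them directly, and have helm skip its own CRD handling; this is not necessarily what was done for tekton here:)

  # apply or upgrade the CRDs directly, outside of helm
  kubectl apply --server-side -f components/tekton-pipelines/crds/
  # let helm manage the rest of the chart, skipping CRDs
  helm upgrade --install tekton-pipelines components/tekton-pipelines --skip-crds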
[15:08:03] I'm getting '[GET /status] getStatus (status 503): {}' on alertmanager, am I the only one?
[15:08:42] same here
[15:09:06] same
[15:09:32] asking in o11y
[15:10:02] they are switching alertmanagers
[15:11:50] I think I'm calling it a day
[15:11:52] cya tomorrow
[15:11:55] * dcaro off
[16:08:50] dhinus: found an "interesting" case with neutron security groups, see T375111
[16:08:50] T375111: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111
[16:09:03] * arturo offline for today now
[17:00:50] I've created https://wikitech.wikimedia.org/wiki/Cloud_roots_and_Cloud_admins -- feedback is welcome onwiki or in T375113
[17:00:50] T375113: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113