[05:23:50] morning. I see several paging alerts for ceph got fired and then resolved
[05:26:21] I see some cleanups need to be done
[05:26:44] the graphs show dropped packets in the switch
[05:27:27] ok, well, I will be back later to properly review all of this
[07:09:15] * dcaro paged
[07:09:17] looking
[07:11:16] It seems that the stats have been very spotty, as in missing chunks of data
[07:15:03] I've started undraining a few more osds to increase redundancy
[07:19:47] recovery is being quite slow
[07:24:39] hmm... the metrics come without an `instance` label
[07:27:12] The network is a bit jumpy with RTT between the prometheus node and the mon nodes, jumping from 0.X to >6ms from time to time (both ok, though weird)
[07:27:16] https://www.irccloud.com/pastebin/esmoUuPp/
[07:28:12] from within the cloudnet is way stabler (always <0.3ms)
[07:28:24] (might make sense)
[07:30:24] the mon seems to be replying ok to requests so far
[07:30:32] (metrics requests)
[07:30:38] root@prometheus1005:~# curl -v http://cloudcephmon1006:9283/metrics
[07:43:11] hmm... openstack apis seem a bit unresponsive
[07:43:15] Sep 18 07:40:04 cloudcontrol1005 wmcs-dns-floating-ip-updater[4185481]: requests.exceptions.HTTPError: 504 Server Error: GATEWAY TIMEOUT for url: https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones/61d722cc-d44a-4a3c-b431-44c28e9debeb/recordsets
[07:45:47] and of course, since I'm looking, the ceph stats have been reliable xd, heisenbug!
[07:48:12] nova-fullstack has been failing with a timeout waiting for a record
[07:48:13] Timed out waiting for A record for fullstackd-20240918074231.admin-monitoring.eqiad1.wikimedia.cloud
[07:48:40] or waiting for a vm to be created: 'Exception: Creation of 69184699-5754-4c8d-95ea-e88491008512 timed out'
[07:59:44] dcaro: em yeah probably our biggest bottleneck is the cloudsw<->cr link, there will probably be some buffering there (and the cloudsw model doesn't have much), might explain occasional jumps in latency from prometheus
[07:59:54] 6ms isn't huge, but it is a massive increase which is not good
[08:01:05] thanks for the explanation :), that should not be giving us issues, unless it starts dropping (that I have not seen)
[08:01:08] looking at the switches they seem stable since then
[08:01:22] I would call it a "theory" rather than an explanation - but it's not unlikely
[08:02:41] how are we looking in general? do you suspect any network issues or is it more perhaps just the cluster not recovering as quickly as expected?
[08:04:19] I have not seen network 'breakage', but I notice the cluster does not seem to saturate the network as it did before and that's a bit weird
[08:04:28] https://usercontent.irccloud-cdn.com/file/mDV6yVLO/image.png
[08:04:39] might be the cluster though
[08:05:18] do you see any retransmits or similar perf indicators in the switch?
[08:08:50] designate is complaining about db connections timing out, looking, maybe the mariadb cluster is not healthy
[08:09:55] restarting designate services
[08:10:23] no, but the switches really either drop or forward, they don't manage flows or do any retransmits etc
[08:10:27] topranks, dcaro, congrats on the upgrade!!! And thanks a lot.
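(For context, a minimal sketch of the kind of latency and scrape checks discussed above; the host names come from the log, but the flags and cycle counts are illustrative assumptions, not the exact commands behind the pastebins:)

  # report-mode mtr from the prometheus host to a ceph mon, to spot the occasional >6ms RTT spikes
  mtr --report --report-cycles 30 cloudcephmon1006
  # time a metrics scrape the way prometheus would hit the ceph mgr exporter
  curl -s -o /dev/null -w 'http %{http_code}, total %{time_total}s\n' http://cloudcephmon1006:9283/metrics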
[08:10:35] for instance between the two hosts above with timeouts I don't see an issue
[08:10:39] https://www.irccloud.com/pastebin/NyssQ2LU/
[08:10:58] XioNoX: the universe intervened but we got it over the line now :)
[08:11:15] sorry for the confusion, the cloudcontrol1005 timeouts are a different thing than ceph xd
[08:11:21] (there's a few fires at the same time)
[08:11:32] dcaro: brief iperf test hitting 6Gb/sec between two cloudceph nodes (going from e4 to f4)
[08:11:41] https://www.irccloud.com/pastebin/lYPqXFyD/
[08:11:46] that's good to know yep
[08:12:09] so doesn't seem to be a bottleneck there. we don't want to go crazy trying to stress-test obviously lest we affect real traffic!
[08:12:26] yep :), live testing is tricky
[08:14:33] openstack side: after restarting designate it seems to reconnect to the db, I still see some intermittent errors with rabbit connections
[08:16:25] ip updater is still timing out when querying zones :/
[08:16:26] Sep 18 08:15:58 cloudcontrol1005 wmcs-dns-floating-ip-updater[12861]: requests.exceptions.HTTPError: 504 Server Error: GATEWAY TIMEOUT for url: https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones
[08:16:36] these are connections to "cloudrabbit100x" from elsewhere?
[08:17:09] ^^ yeah so the mtr I did in the first paste above was between cloudcontrol1005 and that endpoint. Comms look good
[08:18:34] the destination IP is being injected into the switch routing tables by cloudlb1001 and cloudlb1002 but that looks ok
[08:18:56] ack
[08:19:01] from cloudcontrol1005 the traffic will route via cloudlb1001 as they are in that same rack
[08:19:02] I saw briefly a galera alert yesterday
[08:19:15] that was because cloudcontrol1005 was in the rack
[08:19:20] (one of the galera nodes)
[08:19:44] yeah, the alert seems to be gone today, but I wonder if we would need a full restart of openstack because of that
[08:20:09] kinda what I'm doing yep xd, nova is restarting now
[08:20:10] it wouldn't hurt I think, so I'd run it unless you don't think it's a good idea dcaro
[08:20:18] ok
[08:20:51] curl to the above url to get the zones gets an instant response (a no-auth error, but I can't reproduce the timeout)
[08:20:52] topranks: the connections to cloudrabbit come from cloudcontrols and cloudvirts, though iirc they go through cloudlb?
[08:21:13] i don't recall tbh, what's the destination hostname or IP?
[08:21:23] I think yes they may be behind the cloudlb
[08:21:58] they actually have all the nodes one by one rabbitmq01.eqiad1.wikimediacloud.org
[08:22:06] though that might be behind cloudlb too
[08:22:23] (same with 02 and 03)
[08:22:52] I don't remember either
[08:23:05] that's directly to the cloudrabbit host on the cloud-private network
[08:23:33] nova services are failing to start with errors connecting to rabbit, looking
[08:23:35] cmooney@cloudsw1-c8-eqiad> show arp | match 172.20.1.17
[08:23:36] 5c:6f:69:cd:61:b0 172.20.1.17 cloudrabbit1001.private.e irb.1151 [xe-0/0/21.0] none
[08:24:30] dcaro: we may need a rabbitmq reset... :-(
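(A rough sketch of the quick iperf throughput check mentioned above, assuming iperf3; the real hosts, port and duration in the pastebin may differ, and the run is kept short to avoid disturbing live traffic:)

  # on the receiving cloudceph node (rack f4 in the log)
  iperf3 -s -p 5201
  # on the sending node (rack e4), a short 5-second run
  iperf3 -c <receiving-host> -p 5201 -t 5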
[08:25:21] hmm... manually restarted one of the failed nova services and it came back up :/, I don't like flaky errors xd
[08:25:42] comms seem rock solid
[08:25:45] https://www.irccloud.com/pastebin/jOCenjaA/
[08:27:18] nice :), we have had issues with rabbit in the past being flaky
[08:33:00] * dhinus paged cloudvirt1039/nova-compute proc maximum
[08:33:06] (and a few other cloudvirts)
[08:33:17] yep, sorry about that
[08:33:29] I'll try restarting rabbit one by one, and restart everything else after
[08:33:59] they're all resolved now
[08:35:21] side note: I see some duplicated alerts in alertmanager, because of a different receiver tag
[08:36:56] restarted cloudrabbit01, restarting 02
[08:37:38] cluster seems up and stable, going with 03
[08:38:46] 03 restarted, giving it a minute and I'll restart all openstack services again
[08:39:48] ack
[08:40:14] starting the full restart of openstack
[08:40:39] arturo: duplicated alerts happen quite frequently, maybe it's T353457?
[08:40:40] T353457: Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457
[08:41:32] yep, duplicated receivers
[08:41:56] yeah, seems the same
[08:43:01] Sep 18 08:42:54 cloudvirt1033 nova-compute[3174700]: 2024-09-18 08:42:54.720 3174700 ERROR oslo.messaging._drivers.impl_rabbit [None req-65eb58a1-5a93-4ab4-96f1-fb5430f50679 - - - - - -] [2637bf51-ba19-44e7-839e-6d34b947cabe] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: . Trying again in 0 seconds.: amqp.exceptions.MessageNacked
[08:43:09] ^on a freshly restarted cloudvirt
[08:43:26] we may need a full rabbit reset
[08:43:32] sigh, rabbitmq ...
[08:47:11] I'll try that then
[08:50:39] thanks
[08:55:32] got this error when bringing 1002 back to the cluster, though starting the app then worked
[08:55:40] https://www.irccloud.com/pastebin/NjXk2vi6/
[08:56:58] cluster seems up and running now
[08:57:06] I have not seen this error before
[08:58:34] I'm getting timeouts from the mobile app for splunk when trying to ack the pages
[08:58:34] xd
[08:58:49] LOL
[08:59:02] they must have rabbitmq internally too
[08:59:11] I'll start restarting openstack services, see if that helped
[08:59:13] I'm acking everything from the web interface
[08:59:43] thanks
[09:00:15] from the nova PoV, some hypervisors are UP and some DOWN
[09:00:17] https://usercontent.irccloud-cdn.com/file/Cr3MQyMX/image.png
[09:00:21] let's see how things evolve
[09:00:41] yep, those are the ones that failed restarting nova due to rabbitmq connections I think
[09:01:04] I see some of them coming back online now
[09:01:50] the rabbitmq full reset procedure should be a cookbook, if we don't have it already
[09:05:50] hmm, this does not look good
[09:05:56] https://www.irccloud.com/pastebin/5c44z7vf/
[09:10:54] all hypervisors show as up now though
[09:11:05] reboot finished
[09:14:33] things seem to be stabilizing on the openstack side
[09:15:24] no more errors like the above since that one happened on rabbit either
[09:16:25] nova-fullstack is passing again
[09:16:35] I think that's good enough
[09:29:05] oh, found the issue with ceph stats
[09:29:11] the mon is taking too long to reply
[09:29:17] https://www.irccloud.com/pastebin/vy5NiCYT/
[09:29:42] I'll try increasing the timeout
[09:32:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073744 that should do it
[09:32:20] ^ reviewers welcome
[09:32:26] * dcaro starts running pcc
[09:33:26] sorry, was in a meeting
[09:33:27] LGTM
[10:31:47] * dcaro lunch
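(A hedged sketch of the rolling rabbit restart and health checks described above; the unit name and exact checks are assumptions, and the later "full reset" was a separate, manual procedure:)

  # on each cloudrabbit node in turn, letting the cluster settle in between
  systemctl restart rabbitmq-server
  rabbitmqctl cluster_status                     # all three nodes should be listed as running
  rabbitmqctl list_queues name messages | head   # spot-check that queues are draining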
[11:33:31] dhinus: I would like to merge this now https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46
[11:34:36] I'm eating lunch but you can go ahead
[11:35:27] thanks, I'll run a last minute double check and then merge
[11:42:02] I'll reduce the scope of the change, only merging a tiny security group import for eqiad1, in case it gives problems on the apply step
[12:03:44] I'm looking at the plan, why so many resources "created" and not just "imported"?
[12:03:51] this is a bit annoying :/ "There is no support at this time for upgrading or deleting CRDs using Helm."
[12:04:04] I might have to create a new chart just for the CRDs of things
[12:04:31] dhinus: I'm not importing the security group rules, just recreating them. There are many many rules, and it would have been too tedious
[12:04:57] so when running tofu apply, I delete by hand all the rules, then tofu apply re-creates them
[12:05:00] hmm... that means creating a new repo, toolforge component, and such... hmpf
[12:05:10] arturo: ok makes sense
[12:05:40] dcaro: is that for tekton?
[12:05:45] yep
[12:06:01] arturo: did you generate the yaml from the openstack cli?
[12:06:09] helm only supports installing CRDs on the first install, after that, there's no support
[12:06:13] dhinus: yes
[12:06:41] dcaro: ack
[12:07:19] arturo: nice, that should ensure nothing was forgotten
[12:07:46] have you already applied it to codfw?
[12:08:03] dhinus: note however that the security group definition itself is imported. And that's important because there could be cross-references based on the id, and re-creating would destroy them
[12:08:12] I see
[12:08:16] dhinus: I have applied it to both eqiad1 and codfw1dev
[12:09:38] but in eqiad1 you only have 1 sec group, are there more that you haven't imported yet?
[12:09:58] yeah, I split the MR in two, to only merge 1 sec group in the first MR
[12:10:05] ok
[12:10:13] the second one was also merged now: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47
[12:19:29] I noticed there are 2 alerts in victorops that have not resolved automatically
[12:20:11] https://portal.victorops.com/ui/wikimedia/incident/5186 and https://portal.victorops.com/ui/wikimedia/incident/5173
[12:21:59] one is from icinga, and it's resolved in icinga. the other one is from alertmanager, and it's resolved in alertmanager
[12:22:08] I will resolve them manually in victorops
[12:23:08] I think that also happened in the past, right? (some alerts resolved in icinga/am but not in splunk, as if it had lost the update)
[12:26:36] I'm unsure but maybe yes
[12:27:12] I can see 85 alerts in total in victorops (between yesterday and today) and all of them autoresolved apart from those 2
[12:27:40] I wouldn't worry too much
[12:31:56] ack
[12:40:17] ack
[13:32:27] crds are messy
[13:32:27] xd
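(On the helm CRD limitation mentioned above: a common workaround, sketched here with made-up chart names and paths, is to keep the CRD manifests outside helm's lifecycle, apply them directly, and have helm skip its own CRD handling; this is not necessarily what was done for tekton here:)

  # apply or upgrade the CRDs directly, outside of helm
  kubectl apply --server-side -f components/tekton-pipelines/crds/
  # let helm manage the rest of the chart, skipping CRDs
  helm upgrade --install tekton-pipelines components/tekton-pipelines --skip-crds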
[15:08:03] I'm getting '[GET /status] getStatus (status 503): {}' on alertmanager, am I the only one?
[15:08:42] same here
[15:09:06] same
[15:09:32] asking in o11y
[15:10:02] they are switching alertmanagers
[15:11:50] I think I'm calling it a day
[15:11:52] cya tomorrow
[15:11:55] * dcaro off
[16:08:50] dhinus: found an "interesting" case with neutron security groups, see T375111
[16:08:50] T375111: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111
[16:09:03] * arturo offline for today now
[17:00:50] I've created https://wikitech.wikimedia.org/wiki/Cloud_roots_and_Cloud_admins -- feedback is welcome onwiki or in T375113
[17:00:50] T375113: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113