[07:27:48] * dhinus paged HarborComponentDown
[07:28:48] it resolved by itself
[07:29:33] need help?
[07:30:55] > RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down
[07:31:03] so that hints that the issue is from traffic?
[07:34:10] the Harbor alert is back to green, and I don't see other active alerts except a k8s worker with processes in D state
[07:35:11] I'll make myself a coffee, then have another look
[07:35:34] feel free to do more debugging but it doesn't look urgent
[08:00:12] there's no OOM on the machine this time either
[08:00:46] network started misbehaving too (timeouts)
[08:02:05] nss did fail this time though
[08:06:18] there is some hint of a spike in load in the graphs, at the same time as the network traffic drops to 0
[08:06:22] https://usercontent.irccloud-cdn.com/file/yMmm11QQ/image.png
[08:07:20] the prometheus logs only show the network starting to fail
[08:10:53] sssd failed with a timeout too `(2025-06-24 6:58:23): [be[wikimedia.org]] [generic_ext_search_handler] (0x0020): sdap_get_generic_ext_recv failed: [110]: Connection timed out [ldap_search_timeout]`
[08:24:52] -9 is still the idle one, looking at the load graph
[08:25:46] load spiked to 20 after the blip, then quickly went back to ~ 0
[08:28:03] -9 is the current one behind the proxies (flapped it yesterday, not sure what you mean with 'still the idle one' xd)
[08:28:17] I was trying to guess it by looking at the graphs
[08:28:39] if it was indeed flipped, then why is load on -9 almost at 0, while load on -8 is around 0.5?
[08:29:20] network activity looks very similar between the two hosts, around 1.5MB
[08:29:47] the load graphs are the only ones that are significantly different between -8 and -9
[08:31:44] ah my bad, it's just a scale issue
[08:31:59] load on -8 is also hovering around 0.5
[08:32:03] until the spike
[08:32:14] fresh prometheus restart maybe?
[08:32:30] (vs a prometheus that's been running for a while)
[08:33:02] no, they're exactly the same, I was looking at the graphs ignoring the Y-scale
[08:33:35] okok
[08:34:23] I'm still surprised that all the graphs are the same, shouldn't the "active" one have more network traffic than the "idle" one?
[08:34:48] I guess they both poll the metrics, so maybe that's the main part of the network traffic?
[08:35:21] yep, both do yes
[08:36:43] so on the one hand, the fact that -9 went down seems to confirm taavi's theory that the "active" node is the one affected by the issue
[08:37:13] on the other hand, in the graphs I cannot see a significant increase in network activity, so what is causing the crash?
[08:37:38] network activity actually drops (and network on the VM starts failing)
[08:39:21] so I guess one question is, what comes first: the load spike (seen most of the time), or the network going down (seen every time)? and for whichever comes first, what causes it? (a problematic query, weird activity in the network causing an interface issue/kernel bug/..., a ceph hard drive slowing down, ...)
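One way to answer that ordering question is to pull the raw series from the Prometheus API and compare timestamps directly instead of eyeballing dashboards. This is only a sketch: it assumes the tools Prometheus API sits under https://prometheus.svc.toolforge.org/tools/api/v1/ (the UI links later in this log point at that host), that standard node_exporter metric names are in use, and that the instance label matches tools-prometheus-9; the time window is a guess bracketing the blip.

```bash
# Sketch: fetch 1-minute-resolution load and network-receive series around the
# blip so the timestamps of "load goes up" vs "traffic drops to 0" can be
# compared. Metric names assume node_exporter; instance label and time window
# are assumptions and will need adjusting.
PROM="https://prometheus.svc.toolforge.org/tools/api/v1/query_range"
START=$(date -u -d '2025-06-24 06:30:00' +%s)
END=$(date -u -d '2025-06-24 07:30:00' +%s)

for query in \
    'node_load1{instance=~"tools-prometheus-9.*"}' \
    'rate(node_network_receive_bytes_total{instance=~"tools-prometheus-9.*",device!="lo"}[1m])'
do
    echo "### $query"
    curl -sG "$PROM" \
        --data-urlencode "query=$query" \
        --data-urlencode "start=$START" \
        --data-urlencode "end=$END" \
        --data-urlencode "step=60" \
        | jq -r '.data.result[] | .metric.device as $d | .values[] | [$d // "load", .[0], .[1]] | @tsv'
done
```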
[08:40:10] this time, for example, there's a spike in IO wait/stall
[08:40:13] https://usercontent.irccloud-cdn.com/file/bIrN6Om9/image.png
[08:40:56] and a small spike in disk reads
[08:40:58] https://usercontent.irccloud-cdn.com/file/k0dvwnBf/image.png
[08:41:55] there also seems to be a spike in RAM usage
[08:41:57] https://usercontent.irccloud-cdn.com/file/iaiaDmJw/image.png
[08:42:00] (the green there)
[08:42:13] but there's no OOM or any other logs
[08:45:39] the write test on the prometheus-8 hard disk seems fine
[08:45:44] https://www.irccloud.com/pastebin/Bf2e9Sa7/
[08:46:07] that's >20 times the read speed shown there
[08:46:36] reading that file is also quite fast
[08:50:02] do we have metrics on the prometheus query volume?
[08:50:51] my guess is either a query spike, or a small number of very expensive queries
[08:53:22] https://prometheus.svc.toolforge.org/tools/graph?g0.expr=increase(prometheus_http_requests_total%5B10m%5D)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=6h
[08:53:33] there are some stats, but we don't have any in the dashboard
[08:54:45] there's no big spike before the issue, there's a spike after (probably expected)
[08:55:56] this seems to get a spike https://prometheus.svc.toolforge.org/tools/graph?g0.expr=prometheus_engine_queries&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=6h
[08:59:16] maybe we can temporarily enable the query log https://prometheus.io/docs/guides/query-log/
[09:00:03] hmm, we should already have access logs at the apache level
[09:02:00] the requests are mostly POST, so you don't see the queries
[09:04:26] and according to prometheus there's no spike in HTTP requests, but there is in engine queries
[09:06:41] e.g. from the apache logs on -9, these are the 10 min before it started failing
[09:07:20] https://www.irccloud.com/pastebin/d2xwluuB/
[09:07:36] ~1 query per minute
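If a problematic query is the culprit, the query log mentioned above is probably the quickest way to catch it in the act. A minimal sketch of enabling it temporarily, assuming the config lives at /etc/prometheus/prometheus.yml and the log path is writable (both the paths and the assumption that puppet won't immediately revert the change are guesses):

```bash
# Add this under the existing "global:" section of prometheus.yml:
#
#   global:
#     query_log_file: /var/lib/prometheus/query.log
#
# Prometheus reloads its config on SIGHUP, so no restart (and no
# --web.enable-lifecycle flag) is needed:
sudo kill -HUP "$(pidof prometheus)"

# Each executed query is then written as a JSON line including the query text
# and timing stats (exact field names vary a bit between Prometheus versions),
# which should make an expensive query around the next blip easy to spot:
sudo tail -f /var/lib/prometheus/query.log | jq -c '{query: .params.query, stats: .stats}'
```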
[09:45:52] review for https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/821?
[09:47:14] dcaro: can I leave this one to you? ^
[09:56:17] LGTM
[10:32:41] turns out it wasn't that simple, T386480 has a bunch of follow-up patches attached
[10:32:41] T386480: [o11y,logging,infra] Deploy Loki to store Toolforge log data - https://phabricator.wikimedia.org/T386480
[10:32:42] * taavi lunch
[11:38:08] dhinus: fyi. I've fixed the issues with https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88 and 90 (rebased properly, essentially)
[11:38:45] let me know when you get to them, if you ping me here I'll be able to reply more quickly
[11:39:04] dhinus: follow-up from yesterday: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:45:53] taavi: +1d
[12:06:36] komla: when you have some time, I'd appreciate a review of the beta launch email https://etherpad.wikimedia.org/p/push-to-deploy-beta-announce
[12:06:43] (the link to the wiki does not work yet though)
[12:07:26] also, for anyone interested, I would appreciate reviews of the user help page https://wikitech.wikimedia.org/wiki/User:DCaro_(WMF)/Push_to_deploy_user_page
[12:09:49] dhinus: btw. if you want to pair this afternoon for the reviews I'm available
[12:17:02] what does "or show you a fake example" mean?
[12:18:35] dcaro: is the setup usable in tools at this point? I think it'd be nice to migrate some "real" but admin-managed tools to it for experimentation before announcing it to the world
[12:19:15] taavi: it's not deployed in tools yet, we do have it in toolsbeta and it has been deploying the sample-complex-app for a few months
[12:19:29] when will it be deployed?
[12:19:43] also, can the deployment token be passed as a header or something? doing it as a GET parameter is not optimal since it'll get logged at the various proxies it'll pass through
[12:19:51] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/785
[12:20:26] that looks simple enough, so +1
[12:20:31] thanks
[12:21:02] it's not implemented as a header, but it could be passed as such, feel free to open a task for it, it should not be too hard to allow
[12:23:36] sure, T397712
[12:23:36] T397712: [components-api] Deployment token should not be a GET param - https://phabricator.wikimedia.org/T397712
[12:25:36] thanks
[12:25:51] hmm, apparently the cli was installed by hand on toolsbeta?
[12:26:25] I can make a patch to install it on tools as well, do you have a task I can attach it against? T394337 is very specific that it's only about the API :D
[12:26:26] T394337: [components-api] deploy on tools - https://phabricator.wikimedia.org/T394337
[12:29:09] dcaro: I'm back from lunch and I'll continue with the reviews. I would rather not do a long pairing session, but maybe we could schedule 30 mins before the toolforge checkin, if that works for you
[12:31:54] sure, whatever works better for you
[12:32:09] I'll send an invite
[12:33:59] thanks
[12:34:37] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/828 <- enabling the component api to be exposed by the gateway on tools (in the openapi.json endpoint)
[12:46:23] ^nm. self-merged that one
[13:13:08] nice, sample-complex-app is now being deployed using the components-api in tools too
[13:13:11] (manually)
[13:14:01] this enables the automatic deploy https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/5
[13:28:29] taavi, I would like to have another go at updating neutron policies; do you have time/ability to recheck the thing that I broke yesterday?
[13:28:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163003/3/modules/openstack/files/epoxy/neutron/policy.yaml
[13:29:09] I am assuming that you were right about the member vs. reader thing
[13:32:30] andrewbogott: hmm. I thought `openstack --os-cloud novaobserver --os-project-name tools port list` was a way to reproduce this but that's empty even now
[13:32:51] * andrewbogott tries it
[13:33:39] ...it's possible that I already broke it again, with my surely-no-op fixes that I already merged...
[13:33:49] I mean we can test it via the toolforge tofu repo, but that's not great :-)
[13:36:49] is 'port list' a thing that has worked in the past?
[13:37:17] oh, it works as novaadmin :/
[13:37:53] not sure if that exact command worked, but listing ports is what failed yesterday
[13:37:54] * andrewbogott tries some live hacking in codfw1dev
[13:41:10] ok, if I set the policy "get_port": "!"
[13:41:16] then running as novaadmin also gets an empty list
[13:41:34] so that confirms that access failure produces an empty list
[13:42:40] now let's see if I open it to all...
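For context, "open it to all" here refers to the oslo.policy rule string for that API call in policy.yaml. This is only an illustrative sketch of the two extremes being toggled during the test; the actual patch under review uses role-based rules (member/reader), not either of these:

```bash
# oslo.policy rule-string semantics for the key being tested:
#   "get_port": "!"   -> deny everyone (matches the empty list seen even as novaadmin)
#   "get_port": ""    -> skip the check entirely, i.e. allow everyone
#
# After editing policy.yaml (and restarting or reloading neutron-server if it
# does not pick the file up automatically), re-run the read-only check:
openstack --os-cloud novaobserver --os-project-name tools port list
```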
[13:43:17] nope, that has it empty for observer and filled for novaadmin
[13:50:46] I have not found any policy rule that allows novaobserver to get anything but an empty list
[13:51:02] So something 'interesting' is happening
[13:51:18] but also I'm pretty convinced that the 'get_port' test is not a valid way to demonstrate the tofu failure
[13:51:52] * andrewbogott heads for the source
[13:52:05] andrewbogott: if you're happy with it, I'm happy with just merging your patches and then testing if we get the same weird errors we got yesterday
[13:52:25] if it's easy to retest the exact tofu thing... then let's do that
[13:52:39] yeah
[13:53:00] ok, I'll merge the first patch and we'll see.
[13:54:58] in case you're wondering... the actual point of all this is to get service users to be able to do what only novaadmin can do now. So that when nova creates a VM and calls out to neutron to set up networking, it doesn't need novaadmin privs to do it.
[13:55:37] but this is starting to be a lot of trouble for a small cleanup
[13:59:27] taavi, test now?
[13:59:40] one second
[14:00:21] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/545410
[14:01:03] looks good to me
[14:55:08] dhinus: let me know if you want to do any more syncs for the reviews
[15:01:58] taavi: can you try that tofu test again?
[15:03:49] yep
[15:04:38] andrewbogott: still looks good to me
[15:04:44] great!
[15:05:14] I'm revisiting the question of whether we can just change a network to have shared=False, might try that in codfw1dev
[15:05:57] since wan-transport-codfw has shared=false and floating IPs still seem to work everywhere...
[15:16:32] functional tests enabled for components-api in tools :)
[15:18:27] dcaro: nice!
[15:18:51] dcaro: we can have another sync tomorrow, I'll send an invite
[15:19:45] dhinus: this is my first priority, so if you have questions/etc. please don't wait for me to reply in the review section (it might take me some time to see that the review happened), and also ping me here so I can check right away
[15:20:04] ack, I will ping you on IRC
[16:02:22] dcaro: I approved https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 with two minor comments
[16:03:31] thanks!
[17:01:56] dcaro: approved https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/38 as well
[17:02:06] awesome :), thanks!
[17:17:36] I'll bump and deploy tomorrow :), thanks again!
[17:17:41] * dcaro off
[20:46:40] taavi, dcaro, dhinus, if anyone is awake please help me with this toolforge outage
[20:49:01] Sup?
[20:49:20] andrewbogott: I have hands to put to keyboard but don't really know the new API layers
[20:49:51] I was/am working on rabbitmq but that really shouldn't have the power to take down toolforge
[20:50:05] some tools are still up
[20:50:07] looking
[20:50:09] dcaro: pretty much no toolforge services were reachable for a few minutes.
[20:50:16] (wm-lol.toolforge.org replies)
[20:50:18] I think things are coming back now but I don't really know what's up.
[20:50:39] I restarted all the neutron agents because that was the only thing that rabbit could've broken. But neutron isn't itself a router so that shouldn't matter...
[20:50:54] k8s nodes say they are all ready
[20:50:55] #wikimedia-cloud-feed is full of DOWNs and now some RECOVERYs
[20:51:19] the current alerts are all probe down, external comms?
[20:51:21] My guess would be that the web proxy was down
[20:51:34] or yeah, something related to network access
[20:51:48] it looks like a bunch of instances lost network, yep
[20:51:50] flagged as down
[20:52:31] things seem to be recovering, yep
[20:52:50] I both want and don't want this to be correlated with rabbit freezing...
[20:53:08] xd
[20:53:10] btw I'm about to do something even more drastic with rabbit (reinstall the nodes), so that will be noisy for another few minutes
[20:53:46] what is it with erlang and its conviction about ignoring kill signals
[20:53:49] also cloudvps was affected, right? (toolsbeta and others also seem to have gone down for a bit)
[20:54:31] Not that I noticed but I could believe it
[20:54:48] if so then... there are things about openstack that I don't understand, possibly related to the recent network changes
[20:56:05] `FIRING: InstanceDown: Project project-proxy instance proxy-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown` for example (also in -cloud there are users saying cvn.wmflabs.org went down)
[20:56:20] yeah, ok
[20:56:24] so...
[20:56:32] I will think about that once I have things working again
[20:57:17] feel free to correct my message in -cloud xd
[20:57:55] rabbit now complains about partitioning, hmpf
[20:59:42] it should complain, it has no nodes running.
[20:59:46] I'm starting one now, gradually
[21:00:01] ack
[21:00:03] (the context is -- rabbit was complaining about a stuck process that survived the recreation of each individual node.)
[21:00:15] And I suspect it of messing with cinder in bad ways
[21:00:42] so I am trying to do as complete a rebuild as I can without actually reimaging the servers
[21:01:24] okok
[21:01:34] network is still unstable (proxy-5 just went down)
[21:02:28] crap
[21:02:30] got a bunch of emails complaining that nova is not up on many cloudvirts, looking
[21:02:33] well, I'll rebuild things as fast as I can
[21:02:42] that's because it can't talk to rabbit and (sometimes) crashes as a result
[21:02:44] expected
[21:02:47] proxy-5 is back, it seems
[21:02:49] okok
[21:03:59] let me know if there's anything I can do to help, I'll try not to be in the way otherwise
[21:04:36] rabbit is up, restarting all neutron workers...
[21:05:01] ...and nova-compute nodes to make those alerts clear...
[21:05:27] yep, starting to get recoveries
[21:05:28] now as far as I know all serious/alerting services should be back up and running. Does that look true to you?
[21:05:55] LOL and rabbitmq is STILL reporting a stuck process with the same pid as before.
[21:06:12] So I was led completely astray, it must just always say that regardless of actual working state
[21:06:13] ffs
[21:06:31] things seem to be coming up
[21:06:38] yep
[21:06:56] * andrewbogott now both sheepish and annoyed
[21:07:45] xd, I for one am grateful you're doing these things, someone has to ;)
[21:08:18] well in this case not really
[21:08:36] but I guess I wouldn't have known until I tried. The internet certainly says that that stuck process is a problem, not "ignore it, it will always say that"
[21:08:46] but it has now survived reboots and dpkg --purge
[21:08:52] and a complete state reset
[21:09:19] ok, so... all is well now, right? I will remove my hands from the terminal and send an outage email and then probably step away from the keyboard for a while
[21:09:26] thank you for appearing!
[21:09:28] +1 for investigating, even if you did not find the issue yet, I think it's part of the process
[21:09:41] I really thought the toolforge thing was unrelated, rabbit and neutron should not be able to cause this
[21:09:45] but that's a question for another day
[21:10:02] yep, it would be interesting to understand how those were related
[21:10:59] okok, /me clocking out
[21:11:33] thanks again! \o
[21:24:21] quick summary of what seems to have happened, but is surprising if it happened: https://phabricator.wikimedia.org/T397783
[21:36:23] * andrewbogott out for a while
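For reference, the per-node "complete state reset" described above usually boils down to something like the following. This is a generic sketch, not the exact procedure used on the cloudrabbit hosts (the cluster peer name is a placeholder), and as the log notes it did not make the stuck-process warning go away here:

```bash
# Generic rabbitmq node reset/rejoin cycle, run on one node at a time.
sudo rabbitmqctl stop_app        # stop the broker app, keep the Erlang VM running
sudo rabbitmqctl force_reset     # wipe this node's state and cluster membership
# sudo rabbitmqctl join_cluster rabbit@cloudrabbit1001   # placeholder peer; only when rejoining
sudo rabbitmqctl start_app
sudo rabbitmqctl cluster_status
sudo rabbitmq-diagnostics check_local_alarms
```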