[08:17:46] hey folks, as a heads up I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1265382 which should be a no-op
[08:17:57] it affects basically all hosts in prod
[08:24:19] seems good
[08:24:28] I tested a couple of nodes here and there
[08:26:19] inflatador: https://debmonitor.wikimedia.org/packages/nginx-common shows a number of uses of nginx, but no swift any more. But thanks for the ping :)
[11:43:02] hnowlan: should the tasks you've created be #wikimedia-incident? they look more like incident follow-ups.
[11:43:50] RhinosF1: yep, fixing that
[11:47:57] FYI, I'm rebooting puppetserver2004, best to hold off puppet merges for ~5 minutes
[11:51:27] k
[11:55:08] 2004 is back, but I'm rebooting puppetserver1003 as well, so best to hold off puppet merges for another ~5 minutes
[12:02:02] and puppetserver1003 is also back, puppet merges can happen again
[13:30:54] Emperor: thanks for checking, will merge shortly
[15:38:46] moritzm: shall I merge 'Record LDAP access for passimacopoulos'?
[15:40:49] (done)
[15:50:06] andrewbogott: ah, yes. sorry I got distracted
[16:35:25] sukhe: inflatador: like, I wonder if we could add an include/exclude filter to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/pybal/files/check_pybal_ipvs_diff.py#83
[16:36:08] that way we could run two instances of that alert or something, one without the known-to-be-flappy wdqs services so that the main alert stays a useful signal, and another one just for the flappy signals that can be practically ignored(?)
[16:36:39] well, not sure if that is the exact function but the point is the same I think
[16:36:53] FYI, there's an issue related to a defunct aqs cassandra host still referenced by a handful of k8s services in configuration, which may impact their ability to come up cleanly if disrupted. this appears to be affecting a single pod of editor-analytics at the moment.
[16:36:53] I'm going to attempt to apply a quick fix, and then work with urandom to remove the defunct hosts from their configuration.
[16:54:31] tracking this in https://phabricator.wikimedia.org/T423168
[17:23:12] taavi: yeah, that is where the change needs to be made. whether we should blanket silence and ignore the hosts, or come up with some other way (not sure what) for when we are under duress for the hosts, not sure about that.
[17:31:35] taavi, sukhe: thanks, that looks good to me. I'd say we can blanket silence anything pybal-related for these until we've migrated off blazegraph
[17:32:17] inflatador: I see. and out of curiosity, what's the timeline on that? I'm asking to figure out how long we will keep these disabled for
[17:35:27] sukhe: 6m-1y, probably closer to the 6m side
[17:35:41] ok. I will run it by Traffic tomorrow and we can follow up. thanks!
[17:36:07] np
[17:36:21] thanks for caring about keeping the alerting SNR in check :)
[17:38:06] Young me spent waaaay too much time responding to useless CPU and memory % alerts ;P
[17:53:40] following up, the issue in T423168 is sorted for now, but there's some cleanup to do. I need to step away for a bit, but will take care of said cleanup (and work out action items) later today.
[17:53:41] T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168
[22:55:14] elukey: I saw you ping h.ashar on T421348. I just wanted to let you know he's out on holiday for 2 weeks starting today, so you probably won't hear back for a while.
[22:55:14] T421348: Add tox-uv support to the tox-v{3,4} Docker images - https://phabricator.wikimedia.org/T421348
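For context on the 16:35–16:36 pybal alert discussion, here is a minimal sketch of what a service include/exclude filter for the IPVS diff check might look like. This is purely illustrative and assumes nothing about the real structure of check_pybal_ipvs_diff.py; the function name (filter_services) and the CLI flags (--include-service, --exclude-service) are hypothetical, not the script's actual API.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: a service-name include/exclude filter of the kind
discussed above for the pybal/IPVS diff alert. All names and flags here are
hypothetical and do not reflect the real check_pybal_ipvs_diff.py."""

import argparse
import fnmatch


def filter_services(services, include=None, exclude=None):
    """Keep only service names matching the include globs (if given) and
    matching none of the exclude globs."""
    kept = []
    for name in services:
        if include and not any(fnmatch.fnmatch(name, pat) for pat in include):
            continue
        if exclude and any(fnmatch.fnmatch(name, pat) for pat in exclude):
            continue
        kept.append(name)
    return kept


def main():
    parser = argparse.ArgumentParser(
        description="Sketch of an include/exclude filter for a pybal/IPVS diff alert"
    )
    parser.add_argument("--include-service", action="append", default=[],
                        help="glob of service names to check (may be repeated)")
    parser.add_argument("--exclude-service", action="append", default=[],
                        help="glob of service names to ignore (may be repeated)")
    args = parser.parse_args()

    # In the real check the service list would come from pybal/IPVS state;
    # this made-up sample only demonstrates the filtering behaviour.
    sample = ["text-https_443", "wdqs_80", "wdqs-ssl_443", "swift-https_443"]
    checked = filter_services(sample,
                              include=args.include_service,
                              exclude=args.exclude_service)
    print("services that would be considered by this alert instance:", checked)


if __name__ == "__main__":
    main()
```

With something along these lines, one alert instance could run with --exclude-service 'wdqs*' so the main signal stays clean, and a second instance with --include-service 'wdqs*' for the known-flappy services, matching the two-instance idea floated above.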