[08:55:24] morning!
[09:13:47] greetings
[09:14:53] hmm... maintain-dbusers is failing to connect to some database `pymysql.err.OperationalError: (1130, "Host '10.64.148.21' is not allowed to connect to this MariaDB server")`
[09:15:10] an-redacteddb1001.eqiad.wmnet specifically, looking
[09:16:17] oh, it's https://phabricator.wikimedia.org/T407485#11691124
[09:18:06] I think this should fix the pint issue with the Neutron alert https://gerrit.wikimedia.org/r/c/operations/alerts/+/1251011
[09:18:57] ^ quick review when someone has a minute in-between things (no rush)
[09:19:19] LGTM
[09:20:07] thanks!
[09:35:18] might've been easier to just silence the pint check for that rule, the metric is indeed expected to not exist for most of the time
[09:39:38] morning
[09:54:15] well, the linter error is still here it seems
[09:57:47] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1251024
[11:03:17] re: k8s memory requests hitting the alert threshold, does https://phabricator.wikimedia.org/T419824#11701502 make sense? I'm thinking of expanding the cluster as a short term bandaid and then work on https://phabricator.wikimedia.org/T414513
[11:19:21] LGTM
[11:20:15] remember to use a higher mem image for the new workers (kinda 2x would be ok), we wanted to make them bigger to unblock some people using higher memory
[11:21:13] * dcaro lunch
[11:21:39] good point, latest workers are using g4.cores8.ram16.disk20.ephem140 thus going to 32gb ?
[11:21:43] have a good lunch
[11:52:28] godog: please don't create flavors by hand
[11:52:42] instead those should go to https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/resources/eqiad1-r/admin/flavors.tf
[12:43:54] FYI, I'm rolling out postgresql security updates on the cloudbackup hosts
[13:14:55] taavi: my bad, thank you for letting me know, I'll fix flavors.tf
[13:20:46] are there any docs you followed that need updating/clarifying for that?
[13:22:31] good question, I'll update https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_new_nodes which is what I was looking at initially
[13:23:03] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Add_a_worker rather
[13:25:57] {{done}}
[13:48:31] taavi, any idea what the decom script is trying to do here? It's generally not coping well with the way we moved IPs rather than deleting/re-adding them but I think I have all the cases sorted except for 'ge-0/0/6' https://phabricator.wikimedia.org/T419738
[13:49:42] not off the top of my head, and my flight boards in 5'
[13:51:34] have a good flight!
[13:56:43] safe flight!
[14:09:01] andrewbogott, dcaro, for https://phabricator.wikimedia.org/T419647 clouddb1020 doesn't have a `sudo depool` command, any idea how to depool it?
[14:09:37] (or if it needs a depool)
[14:10:11] https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas
[14:11:01] (I did look at enabling "sudo depool" on those hosts, but it's not working atm, they must be depooled from cumin)
[14:11:33] dhinus: sounds good, on it
[14:11:36] this should do "confctl select name=clouddb1020.eqiad.wmnet set/pooled=no"
[14:12:15] done, thx!
[14:12:37] ty!
[14:15:41] XioNoX: since you're here, do you have a guess about how I can get past this? https://phabricator.wikimedia.org/T419738#11699813
[14:16:05] I'm in the middle of a maintenance :)
[14:16:12] ok, nevermind!
[14:32:12] andrewbogott: can you run the decom script with the --homer parameter ?
[14:32:31] dunno what's the issue but that might work around it
[14:34:48] nevermind, that's for the provision one...
[14:35:33] yeah, that flag doesn't seem to exist. Also, I've made things worse in the meantime so will stand back for now.
[14:35:42] how worse?
[14:36:45] was trying to manually remove the thing that the cookbook wanted to remove, managed to detach the IP from the mgmt interface.
[14:40:13] mgmt ip fixed
[14:40:26] thx
[14:40:43] so now let's see if the decom cookbook fails in the same way...
[14:40:55] it probably will
[14:41:35] yep, I'm back to
[14:41:37] Entering configuration mode
[14:41:37] [edit protocols sflow]
[14:41:37] + interfaces ge-0/0/6.0;
[14:42:32] from the same cookbook?
[14:43:13] yes, from "sudo cookbook -d sre.hosts.decommission cloudgw2002-dev.codfw.wmnet -t T419738"
[14:43:13] T419738: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738
[14:43:54] andrewbogott: could you try that https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1251099 with test-cookbook ?
[14:44:17] (once CI is happy)
[14:44:25] sure
[14:44:41] once that previous cookbook run finishes aborting...
[14:45:24] the current juniper implementation parses cli output, so even if it has been working well, maybe something changed and broke it (like a recent upgrade)
[14:45:53] and now that we're moving away from virtual chassis, running homer will become the default
[14:46:13] with virtual-chassis it was taking ages that's why I took a different approach
[14:48:41] I don't think I know what 'virtual-chassis' means, but maybe I don't need to
[14:49:53] CI says -1
[14:50:00] yep, looking
[14:51:12] good catch CI, I copy pasted it too fast from the provision cookbook
[14:57:30] Weeks of me being confused about broken software ultimately resolved by an irc chat -- T419777
[14:57:31] T419777: rproxy running on Toolforge fails to successfully proxy requests to musicbrainz.org - https://phabricator.wikimedia.org/T419777
[14:59:45] XioNoX, stepping into a meeting but will keep an eye on that patch
[15:27:35] ok, back, trying with that change now
[15:29:37] following up from the meeting: changing the flavor specs in opentofu meant g4.cores8.ram16.disk20.ephem140 got recreated and thus tools/toolsbeta instances provisioned with that flavor now show up as "not available" for the flavor
[15:30:17] my mistake heh, though the instances themselves are fine since the old flavor is pinned to them, I tested a reboot and works just fine, over time we'll recycle the workers anyways
[15:47:55] XioNoX: seems to be working better! Now I see pending changes for dse-k8s-worker1018 though, should I apply those as well?
[15:50:15] btullis: I suspect this pending dns change is yours (dse-k8s-worker1018)
[15:52:05] I'm applying the change
[17:50:23] * dcaro off
[17:50:26] cya next week!
[17:55:37] am I missing something or was T419582 missing a +1 before it was done?
[17:55:37] T419582: Add floating IP and vanity domain for azwikimedia project - https://phabricator.wikimedia.org/T419582
[18:02:42] taavi: we discussed it in the meeting
[18:23:14] could we document those things in the task so one doesn't have to guess?
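
Note on the flavor sizing discussed at 11:20-11:21 ("kinda 2x would be ok", "thus going to 32gb ?"): a minimal Python sketch of deriving the doubled-memory flavor name from g4.cores8.ram16.disk20.ephem140. The naming pattern is an assumption inferred from that single flavor name in the log, and the helper names (`parse_flavor`, `double_ram`) are hypothetical. It only computes a name. Per the 11:52 messages, the flavor itself would still be defined in tofu-infra's flavors.tf rather than created by hand, and the log does not say whether cores or disk should change too.

```python
import re

# Assumed pattern, inferred only from "g4.cores8.ram16.disk20.ephem140".
FLAVOR_RE = re.compile(
    r"^(?P<gen>g\d+)\.cores(?P<cores>\d+)\.ram(?P<ram>\d+)"
    r"\.disk(?P<disk>\d+)\.ephem(?P<ephem>\d+)$"
)


def parse_flavor(name):
    """Split a flavor name into its parts; numeric fields become ints."""
    match = FLAVOR_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized flavor name: {name!r}")
    fields = match.groupdict()
    return {
        "gen": fields["gen"],
        "cores": int(fields["cores"]),
        "ram": int(fields["ram"]),      # GB, going by "ram16" -> "32gb" in the log
        "disk": int(fields["disk"]),
        "ephem": int(fields["ephem"]),
    }


def double_ram(name):
    """Return the flavor name with RAM doubled and every other field unchanged."""
    spec = parse_flavor(name)
    return (
        f"{spec['gen']}.cores{spec['cores']}.ram{spec['ram'] * 2}"
        f".disk{spec['disk']}.ephem{spec['ephem']}"
    )


if __name__ == "__main__":
    print(double_ram("g4.cores8.ram16.disk20.ephem140"))
```

Keeping the other fields fixed mirrors the "2x mem" suggestion only. A real change would go through review in flavors.tf, where recreating an existing flavor has the side effect described at 15:29.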