[00:44:48] <andrewbogott>	 ...and again
[00:47:09] <bd808>	 andrewbogott: oh hey... when you get a chance, I could use your gerrit rights to submit https://gerrit.wikimedia.org/r/c/cloud/instance-puppet/+/1204710
[00:47:50] <bd808>	 those records are orphans from some vm cleanup crash
[00:48:19] <bd808>	 oh... got it
[00:48:42] <andrewbogott>	 If they're also orphaned from Horizon then I can take extreme measures :)
[00:49:13] <bd808>	 the instances are gone so I think it needs direct api poking to clean up
[00:49:45] <andrewbogott>	 hm, yeah
[00:49:59] <andrewbogott>	 or rapid creation and deletion of VMs with the same name :p
[00:50:51] <bd808>	 hmmm.... I think I could actually do it with tofu
[00:51:14] <bd808>	 I'd need to import the existing items and then delete them
[00:51:32] <bd808>	 E_TOOHARD :)
[00:53:19] <bd808>	 andrewbogott: 5 years ago you let me do it that way! ;) -- https://gerrit.wikimedia.org/r/c/cloud/instance-puppet/+/598081
[00:53:49] <andrewbogott>	 huh, so those might still have orphan records in the database...
[00:54:01] <andrewbogott>	 or possibly nothing is ever properly removed from the git repo
[00:54:47] <bd808>	 the git repo cleans up normally. Those rows from 5 years ago may or may not still be in the db.
[00:56:13] <bd808>	 not a huge deal. I may try to remember how to poke the api directly later. I remember doing it a few months ago while working on tofu things and trying to import an existing config.
[07:07:59] <godog>	 hah https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/
[07:08:07] <godog>	 also, greetings
[08:35:05] <taavi>	 yep T392356
[08:35:06] <stashbot>	 T392356: toolforge: Investigate ingress-nginx replacements - https://phabricator.wikimedia.org/T392356
[08:39:16] <godog>	 hah, thank you for the pointer
[08:46:47] <dcaro>	 marning
[08:49:23] <dcaro>	 I got a lot of emails from wikitech static alerts during the night
[08:49:41] <dcaro>	 flapping ~every 5min
[09:21:19] <taavi>	 dcaro: re T410009, where did you see the vscode server? (and do you want to let them know about its licensing issues or should I?)
[09:21:20] <stashbot>	 T410009: SSH session hangs after authentication for user delemike on login.toolforge.org. Logs show hang at debug1: pledge: filesystem. - https://phabricator.wikimedia.org/T410009
[09:22:58] <dcaro>	 ps aux
[09:23:35] <dcaro>	 I'm not sure if it was vscode or codium open alternative though, I think they use the same paths (vscode-...)
[09:23:51] <taavi>	 duh
[09:23:53] * dcaro does not find the terminal scroll
[09:24:08] <taavi>	 codium uses .vscodium-server
[09:24:25] <dcaro>	 oh, nice, I thought it did not (maybe I tried the wrong thing)
[09:24:48] <taavi>	 there's a few different variations iirc, but if it's just vscode-server I think it's safe to assume it's the non-free thing
[09:25:13] <taavi>	 also now I'm there I also see someone running celery on the bastion, so I'll kill that and email the maintainers
[09:25:20] <dcaro>	 👍
[09:26:44] <dcaro>	 you are right, nice!
[09:26:49] <dcaro>	 dcaro    2154740  0.0  0.0   2680  1872 ?        SN   09:26   0:00 sh /home/dcaro/.vscodium-server/bin/9e6954323e23e2f62c1ea78348dbd1b53e5b827e/bin/codium-server --start-server --host=127.0.0.1 --port=0 --connection-token-file /home/dcaro/.vscodium-server/.9e6954323e23e2f62c1ea78348dbd1b53e5b827e.token --telemetry-level off --enable-remote-auto-shutdown --accept-server-license-terms
[09:27:20] <dcaro>	 so we can add some extra killing wheel of misfortune kind of thing
[09:29:47] <taavi>	 T390885. anyway, I will also comment on the task to let this particular user know
[09:29:48] <stashbot>	 T390885: Check for non-libre vscode-server installs/processes on Toolforge bastions - https://phabricator.wikimedia.org/T390885
[09:30:06] <dcaro>	 thanks
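The path-based distinction discussed above (`.vscodium-server` for codium, plain `.vscode-server` assumed to be the non-free build) could be checked against `ps aux` output roughly like this. This is a minimal sketch, not the check tracked in T390885: the function name `classify_cmdline` and the sample command lines other than the codium one pasted above are hypothetical, and other directory variants may exist.

```python
import re

# Directory names taken from the discussion above. Anything else is
# reported as None rather than guessed.
_PATTERNS = (
    ("vscodium", re.compile(r"/\.vscodium-server(/|\s|$)")),
    ("vscode", re.compile(r"/\.vscode-server(/|\s|$)")),
)


def classify_cmdline(cmdline):
    """Return "vscodium", "vscode", or None for one process command line."""
    for name, pattern in _PATTERNS:
        if pattern.search(cmdline):
            return name
    return None


# Command line shortened from the ps output pasted in the channel.
codium = (
    "sh /home/dcaro/.vscodium-server/bin/"
    "9e6954323e23e2f62c1ea78348dbd1b53e5b827e/bin/codium-server --start-server"
)
# Hypothetical examples for illustration only.
nonfree = "/home/example/.vscode-server/bin/abc123/node server-main.js"
unrelated = "python3 -m celery worker"

print(classify_cmdline(codium))
print(classify_cmdline(nonfree))
print(classify_cmdline(unrelated))
```

Matching on the directory name rather than the binary name follows taavi's point that the server directory is what differs between the two builds.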
[10:11:56] <dcaro>	 first maintain-dbusers alert
[10:11:58] <dcaro>	 https://usercontent.irccloud-cdn.com/file/dyKUjwuD/image.png
[10:12:06] <dcaro>	 taavi: is that your network blip?
[10:12:20] <taavi>	 dcaro: your alert says 38 minutes, so extremely unlikely
[10:12:37] <dcaro>	 ack, looking then
[10:13:47] <dcaro>	 oh... the expression is not correct, reversed :facepalm:
[10:16:03] <dcaro>	 hmm... I wonder why the test was not failing
[10:16:57] <dcaro>	 added a 'reverse test' too just in case
[10:17:08] <dcaro>	 https://gerrit.wikimedia.org/r/c/operations/alerts/+/1204810
[10:30:46] <dcaro>	 oh, I think it was the grouping, with `up{..} or on() vector(-1) > 0`, the grouping happens like `up{} or on() (vector(-1) > 0)`, so it goes to the right side
[10:38:21] <taavi>	 the interface mtu setting is live on all eqiad1 cloudvirts/cloudnets, and will be live on all misc cloud-private services in the next half an hour
[10:40:16] <dcaro>	 🎉
[10:46:00] <taavi>	 dcaro: T330075 says jumbo frames will also benefit cloudvirt<->ceph communication, do you know if there's a specific setting that we need to flip now or does ceph realize this automatically?
[10:46:00] <godog>	 neat
[10:46:01] <stashbot>	 T330075: [cloudvirt] Enable jumbo frames on cloud-hosts/cloud-private interfaces - https://phabricator.wikimedia.org/T330075
[10:46:46] <taavi>	 next up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204814 to tell neutron that the interfaces are capable of jumbos
[10:52:41] <godog>	 looking
[10:59:27] <taavi>	 ^ will require manual restarts of neutron agents, so my plan is to wait half an hour and then run the restart cookbook for it
[11:00:03] <taavi>	 now there's a different maintain-dbusers alert?
[11:00:43] <taavi>	 in the meantime
[11:00:44] * taavi lunch
[11:03:30] <dcaro>	 hmpf.... the other alert also has false positives, promql is tricky
[11:05:55] <dcaro>	 that's a tricky one though.... the metric does not exist if it never got set after reboot, so how can we guard against us messing up the name in the future?
[11:06:11] <dcaro>	 I guess initializing the metric on reboot?
[11:08:12] <godog>	 this is MaintainDBUsersManyErrors I'm assuming?
[11:11:23] <dcaro>	 yep
[11:12:44] <godog>	 yes if easy to do I recommend initializing all label values
[11:13:10] <dcaro>	 I'll try
[11:13:18] <godog>	 though in this case I'm not sure I understand the or vector condition
[11:13:42] <godog>	 as opposed to increase(maintain_dbusers_populate_total{status="errored"}[1h]) > 0
[11:22:30] <godog>	 dcaro: ^ what do you think ?
[11:23:53] <dcaro>	 I think the expression is not correct there either, I think it should be `(increase(maintain_dbusers_populate_total{status="errored"}[1h]) or on() vector(-1)) > 0`
[11:24:18] <dcaro>	 otherwise if for whichever reason (say, typo) maintain_dbusers_populate_total does not exist, not using vector will still not trigger the alert
[11:24:37] <dcaro>	 wait, the vector should be > 0 xd
[11:24:51] <dcaro>	 `(increase(maintain_dbusers_populate_total{status="errored"}[1h]) or on() vector(1)) > 0`
[11:24:54] <dcaro>	 that should work
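The grouping issue above comes from PromQL precedence: comparison operators bind tighter than `or`, so without parentheses the `> 0` applies only to the `vector(...)` fallback. A toy Python model of instant vectors illustrates the difference. This is a sketch for illustration only, not a PromQL engine; the helper names (`or_on_empty`, `gt`, and so on) and the sample values are made up.

```python
# Toy model of a PromQL instant vector: a dict mapping a label tuple to a value.

def vector(value):
    """Model of PromQL vector(s): a single sample with no labels."""
    return {(): value}


def or_on_empty(lhs, rhs):
    """Model of `lhs or on() rhs`.

    With an empty on() list every sample matches every other sample, so
    rhs only contributes when lhs has no samples at all.
    """
    return dict(lhs) if lhs else dict(rhs)


def gt(vec, threshold):
    """Model of `vec > threshold` in filter form: keep samples above it."""
    return {labels: value for labels, value in vec.items() if value > threshold}


def alert_parenthesized(metric):
    # (metric or on() vector(1)) > 0
    return gt(or_on_empty(metric, vector(1)), 0)


def alert_misgrouped(metric):
    # metric or on() vector(-1) > 0, parsed as metric or on() (vector(-1) > 0)
    return or_on_empty(metric, gt(vector(-1), 0))


healthy = {(("status", "errored"),): 0.0}  # metric present, no errors
failing = {(("status", "errored"),): 3.0}  # metric present, errors seen
missing = {}                               # metric absent (e.g. typo in name)

for name, metric in (("healthy", healthy), ("failing", failing), ("missing", missing)):
    print(name, bool(alert_parenthesized(metric)), bool(alert_misgrouped(metric)))
```

In this model the parenthesized form fires for the failing and missing cases only, while the misgrouped form fires whenever the metric exists (even at 0) and stays silent when it is missing, which matches the false positives seen in the channel.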
[11:27:33] <dcaro>	 this adds the stats initialization https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204820
[11:28:38] <godog>	 I see, for the metric disappearing nowadays we have linting alerts when that happens
[11:29:02] <godog>	 see for example this (sorry long url) https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=problem%3Dprometheus%20%22ops%22%20at%20http%3A%2F%2F127.0.0.1%3A9900%2Fops%20didn%27t%20have%20any%20series%20for%20%22cluster%3Apuppet_agent_resources_total%3Acount0%22%20metric%20in%20the%20last%201w
[11:29:12] <godog>	 I'm on vpn and I can't use w.wiki )o)
[11:29:23] <godog>	 oh wait I can, nevermind
[11:29:25] <dcaro>	 I did not see an alert for it today
[11:29:26] <godog>	 https://w.wiki/G4Yy
[11:29:50] <dcaro>	 oh, maybe it does not have the team flag
[11:30:01] <godog>	 yes because the metric does exist, labels I don't think get checked
[11:30:22] <godog>	 team is inferred from the alert file name, should work out of the box
[11:30:53] * godog lunch, will check later
[11:32:33] <dcaro>	 hmmm... okok, vector still works with labels, though I wonder if the metric exists is good enough
[11:34:05] <dcaro>	 I guess it still has value if the label value is mistyped (ex, it will catch if we change from "error" to "failed")
[11:47:47] <dhinus>	 can I get a +1 for T409970?
[11:47:47] <stashbot>	 T409970: Increase volume storage on project analytics - https://phabricator.wikimedia.org/T409970
[11:48:25] <dhinus>	 and also for T409981
[11:48:25] <stashbot>	 T409981: Request increased build quota for MilHistBot Toolforge tool - https://phabricator.wikimedia.org/T409981
[11:57:46] * dcaro lunch
[11:58:39] <dcaro>	 dhinus: +1d
[11:58:50] <dhinus>	 thanks
[12:49:36] <dcaro>	 there's an alert on harbor being down for the last 9 min
[12:49:48] <dcaro>	 anyone doing anything related?
[12:53:56] <dcaro>	 I can access it and everything looks ok
[12:54:32] <godog>	 dcaro: true re: label values typos, the expression with vector totally works too
[12:54:59] <godog>	 having said that, I'm logging off and tomorrow I'm off too for oncall comp, see you on Mon
[12:55:31] <dcaro>	 good weekend!
[12:55:38] <godog>	 cheers, you too
[12:57:35] <dcaro>	 taavi: can this be a side-effect of the mtu changes you made? maybe pingthing can't reach harbor anymore?
[12:58:43] <taavi>	 lemme see
[12:59:59] <dcaro>	 I can access harbor ok from my laptop, one of the workers, http and pull an image
[13:00:21] <dcaro>	 note that harbor is running inside docker inside a vm, with the 1450 mtu set
[13:00:34] <dcaro>	 https://www.irccloud.com/pastebin/swqPKJwN/
[13:02:15] <taavi>	 as far as I can tell from the blackbox UI there was a brief moment during which the probe failed when I rebooted the proxy VMs to pick up the new MTU, and the probe has been succeeding since then
[13:02:20] <taavi>	 no clue why the alert is still firing
[13:03:04] <taavi>	 eh, sorry, no, I just can't read the interface correctly
[13:03:10] <taavi>	 time=2025-11-13T13:01:55.310Z level=ERROR source=http.go:474 msg="Error for HTTP request" module=http_connect_23xx target=https://tools-harbor.wmcloud.org/api/v2.0/ping err="Get \"https://[2a02:ec80:a000:1::1d]/api/v2.0/ping\": dial tcp [2a02:ec80:a000:1::1d]:443: connect: no route to host"
[13:04:12] <taavi>	 Nov 13 12:27:44 proxy-5 Keepalived_vrrp[524]: (vrrp_ipv6): entering FAULT state (src address not configured)
[13:04:12] <taavi>	 Nov 13 12:27:44 proxy-5 Keepalived_vrrp[524]: (vrrp_ipv6) Entering FAULT STATE
[13:04:42] <dcaro>	 from cloudcontrol2010-dev curl fails
[13:05:02] <dcaro>	 from apt2002 it works
[13:05:26] <dcaro>	 it's trying ip6
[13:06:37] <taavi>	 i restarted keepalived and it seems to work now
[13:06:42] <dcaro>	 ack, looking
[13:06:56] <taavi>	 so it seems like we have a race condition where keepalived starts before the v6 address is assigned to the interface?
[13:07:44] <dcaro>	 who is setting the ip address?
[13:07:59] <dcaro>	 alert went away 👍
[13:09:08] <taavi>	 systemd-networkd via DHCPv6 (which in the logs seems to happen two seconds after the v4 address)
[13:10:25] <taavi>	 https://phabricator.wikimedia.org/P85315 wait-online is happy once a v4 address is assigned (and a v6 link-local?)
[13:10:40] <dcaro>	 I was thinking something like that yep
[13:11:27] <dcaro>	 it seems there's some option to force it to wait also for ip6
[13:11:49] <taavi>	 something to note is that this happened on both bookworm project-proxy instances, but on none of the toolforge k8s haproxies (4 nodes in total, all trixie)
[13:12:00] <taavi>	 so it might or might not be a bookworm-specific bug
[13:12:43] <taavi>	 https://manpages.debian.org/trixie/systemd/systemd-networkd-wait-online.8.en.html says there is a config option to require a routable v6 address to consider a system online
[13:13:05] <taavi>	 also note that we have the complication that we don't set up networkd directly, we have cloud-init configuring netplan configuring networkd configuring the actual interface
[13:17:44] <dcaro>	 🪆
[13:17:45] <dcaro>	 xd
[13:49:19] <dcaro>	 quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204820, initializing the stats for maintain-dbusers
[13:49:26] <dcaro>	 (currently deployed in couldcontrol1007)
[13:51:55] <taavi>	 lol at that typo in the hostname
[13:51:56] <taavi>	 but +1
[13:55:41] <dcaro>	 xd, yep, coludcontrol gets caught by puppet checks, so I innovate
[15:25:44] <dhinus>	 does anybody have a good answer to "who is responsible for signing off gadgets?" in T408387?
[15:25:45] <stashbot>	 T408387: CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387
[15:26:45] <taavi>	 there is no such thing
[15:28:19] <Reedy>	 Wikidata community would self govern on that
[15:28:44] <andrewbogott>	 I feel like I'm being kind of a pain on that ticket because I'm asking them to be sure that this will satisfy the requirement but the requirement seems to be pretty vague.
[15:29:00] <andrewbogott>	 Reedy: does that mean they should ask on a talk page someplace?
[15:29:14] <Reedy>	 probably, just having a look
[15:29:30] <Reedy>	 >If you have a problem with a gadget or you have an idea for a new gadget: please add a task to Phabricator.
[15:29:34] <Reedy>	 https://phabricator.wikimedia.org/maniphest/task/create/?projects=Wikidata-gadgets
[15:29:39] <taavi>	 well really the question they want answered is not "can this be a gadget", but "can gadgets make network requests to place X" with various values of X
[15:29:48] <andrewbogott>	 I'm also open to someone else intervening and saying that what they want to do is definitely fine (or not fine). It just seems like a weird 'laundering' hack
[15:31:45] <dhinus>	 what I would like to avoid is creating a cloud-vps project, having them deploy their things, and at the end being told "you cannot run this as a gadget anyway"
[15:31:52] <andrewbogott>	 yes!
[15:33:20] <dhinus>	 the last comment on-wiki seems to be https://www.wikidata.org/wiki/Wikidata:Tools/Potential_gadgets#c-Ep%C3%ACdosis-20250806105800-Albert.meronyo-20250806105100
[15:34:50] <dhinus>	 I see they also created a task in the wikidata-gadgets project T374177
[15:34:51] <stashbot>	 T374177: New gadget proposal: Reference Verification Gadget for Wikidata (ProVe) - https://phabricator.wikimedia.org/T374177
[15:34:53] <Reedy>	 Loading stuff from cloud-y things is going to be better than some random other site... but there's the whole not making production depend on cloud
[15:35:17] <Reedy>	 https://meta.wikimedia.org/wiki/Third-party_resources_policy
[15:35:45] <Reedy>	 https://meta.wikimedia.org/wiki/Third-party_resources_policy#Opt-in_exemption_granted_by_users
[15:36:02] <taavi>	 ^ that page is not actual policy atm, right?
[15:36:21] <Reedy>	 Not with any sort of enforcement AIUI
[15:38:03] <taavi>	 like, i don't think there is currently any explicit policy about what opt-in gadgets can and can't do, so basically this entire thing that started this was one volunteer wikidata editor saying that they're not comfortable with the current way it's done
[15:43:15] <andrewbogott>	 yeah, I was thinking that the volunteer would ask /that/ editor "how does it sound if I do X"?
[15:43:27] <andrewbogott>	 But maybe it's not easy to do that?
[15:43:37] <dhinus>	 it's the opposite of what I said 5 mins ago, but maybe we could just approve the cloud-vps project, and let them figure out how to convince the wikidata community
[15:44:32] <taavi>	 i would be fine with that
[15:44:53] <andrewbogott>	 Yeah, as long as we make it clear (which I suppose we already have) that we don't have any ability to tell them that this will solve the problem
[15:48:51] <dcaro>	 +1 for that with a clear "this does not mean that it's approved as a gadget"
[15:55:00] <dhinus>	 how does this sound? "The decision on whether this can be approved as a gadget sits with the Wikidata community. The fact that a part of the code runs in Wikimedia Cloud might or might not be enough to convince them."
[15:56:54] <dcaro>	 nice, maybe adding, we are happy to give you a project to get you started, or similar too
[16:00:11] <dhinus>	 thanks, replied in the task
[16:55:35] * andrewbogott waves to cciufo 
[17:36:02] * dhinus off
[17:44:54] * dcaro off
[17:44:57] <dcaro>	 cya!