[00:44:48] ...and again
[00:47:09] andrewbogott: oh hey... when you get a chance, I could use your gerrit rights to submit https://gerrit.wikimedia.org/r/c/cloud/instance-puppet/+/1204710
[00:47:50] those records are orphans from some vm cleanup crash
[00:48:19] oh... got it
[00:48:42] If they're also orphaned from Horizon then I can take extreme measures :)
[00:49:13] the instances are gone so I think it needs direct api poking to clean up
[00:49:45] hm, yeah
[00:49:59] or rapid creation and deletion of VMs with the same name :p
[00:50:51] hmmm.... I think I could actually do it with tofu
[00:51:14] I'd need to import the existing items and then delete them
[00:51:32] E_TOOHARD :)
[00:53:19] andrewbogott: 5 years ago you let me do it that way! ;) -- https://gerrit.wikimedia.org/r/c/cloud/instance-puppet/+/598081
[00:53:49] huh, so those might still have orphan records in the database...
[00:54:01] or possibly nothing is ever properly removed from the git repo
[00:54:47] the git repo cleans up normally. Those rows from 5 years ago may or may not still be in the db.
[00:56:13] not a huge deal. I may try to remember how to poke the api directly later. I remember doing it a few months ago while working on tofu things and trying to import an existing config.
[07:07:59] hah https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/
[07:08:07] also, greetings
[08:35:05] yep T392356
[08:35:06] T392356: toolforge: Investigate ingress-nginx replacements - https://phabricator.wikimedia.org/T392356
[08:39:16] hah, thank you for the pointer
[08:46:47] marning
[08:49:23] I got a lot of emails from wikitech static alerts during the night
[08:49:41] flapping ~every 5min
[09:21:19] dcaro: re T410009, where did you see the vscode server? (and you want to let them know about its licensing issues or should I?)
[09:21:20] T410009: SSH session hangs after authentication for user delemike on login.toolforge.org. Logs show hang at debug1: pledge: filesystem.
- https://phabricator.wikimedia.org/T410009
[09:22:58] ps aux
[09:23:35] I'm not sure if it was vscode or codium open alternative though, I think they use the same paths (vscode-...)
[09:23:51] duh
[09:23:53] * dcaro does not find the terminal scroll
[09:24:08] codium uses .vscodium-server
[09:24:25] oh, nice, I thought it did not (maybe I tried the wrong thing)
[09:24:48] there's a few different variations iirc, but if it's just vscode-server I think it's safe to assume it's the non-free thing
[09:25:13] also now I'm there I also see someone running celery on the bastion, so I'll kill that and email the maintainers
[09:25:20] 👍
[09:26:44] you are right, nice!
[09:26:49] dcaro 2154740 0.0 0.0 2680 1872 ? SN 09:26 0:00 sh /home/dcaro/.vscodium-server/bin/9e6954323e23e2f62c1ea78348dbd1b53e5b827e/bin/codium-server --start-server --host=127.0.0.1 --port=0 --connection-token-file /home/dcaro/.vscodium-server/.9e6954323e23e2f62c1ea78348dbd1b53e5b827e.token --telemetry-level off --enable-remote-auto-shutdown --accept-server-license-terms
[09:27:20] so we can add some extra killing wheel of misfortune kind of thing
[09:29:47] T390885. anyway, I will also comment on the task to let this particular user know
[09:29:48] T390885: Check for non-libre vscode-server installs/processes on Toolforge bastions - https://phabricator.wikimedia.org/T390885
[09:30:06] thanks
[10:11:56] first maintain-dbusers alert
[10:11:58] https://usercontent.irccloud-cdn.com/file/dyKUjwuD/image.png
[10:12:06] taavi: is that your network blip?
[10:12:20] dcaro: your alert says 38 minutes, so extremely unlikely
[10:12:37] ack, looking then
[10:13:47] oh... the expression is not correct, reversed :facepalm:
[10:16:03] hmm... 
I wonder why the test was not failing
[10:16:57] added a 'reverse test' too just in case
[10:17:08] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1204810
[10:30:46] oh, I think it was the grouping, with `up{..} or on() vector(-1) > 0`, the grouping happens like `up{} or on() (vector(-1) > 0)`, no it goes to the right side
[10:38:21] the interface mtu setting is live on all eqiad1 cloudvirts/cloudnets, and will be live on all misc cloud-private services in the next half an hour
[10:40:16] 🎉
[10:46:00] dcaro: T330075 says jumbo frames will also benefit cloudvirt<->ceph communication, do you know if there's a specific setting that we need to flip now or does ceph realize this automatically?
[10:46:00] neat
[10:46:01] T330075: [cloudvirt] Enable jumbo frames on cloud-hosts/cloud-private interfaces - https://phabricator.wikimedia.org/T330075
[10:46:46] next up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204814 to tell neutron that the interfaces are capable of jumbos
[10:52:41] looking
[10:59:27] ^ will require manual restarts of neutron agents, so my plan is to wait half an hour and then run the restart cookbook for it
[11:00:03] now there's a different maintain-dbusers alert?
[11:00:43] in the meantime
[11:00:44] * taavi lunch
[11:03:30] hmpf.... the other alert also has false positives, promql is tricky
[11:05:55] that's a tricky one though.... the metric does not exist if it never got set after reboot, so how can we guard against us messing up the name in the future?
[11:06:11] I guess initializing the metric on reboot?
[11:08:12] this is MaintainDBUsersManyErrors I'm assuming?
[11:11:23] yep
[11:12:44] yes if easy to do I recommend initializing all label values
[11:13:10] I'll try
[11:13:18] though in this case I'm not sure I understand the or vector condition
[11:13:42] as opposed to increase(maintain_dbusers_populate_total{status="errored"}[1h]) > 0
[11:22:30] dcaro: ^ what do you think ?
[11:23:53] I think the expression is not correct there either, I think it should be `(increase(maintain_dbusers_populate_total{status="errored"}[1h]) or on() vector(-1)) > 0`
[11:24:18] otherwise if for whichever reason (say, typo) maintain_dbusers_populate_total does not exist, not using vector will still not trigger the alert
[11:24:37] wait, the vector should be > 0 xd
[11:24:51] `(increase(maintain_dbusers_populate_total{status="errored"}[1h]) or on() vector(1)) > 0`
[11:24:54] that should work
[11:27:33] this adds the stats initialization https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204820
[11:28:38] I see, for the metric disappearing nowadays we have linting alerts when that happens
[11:29:02] see for example this (sorry long url) https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=problem%3Dprometheus%20%22ops%22%20at%20http%3A%2F%2F127.0.0.1%3A9900%2Fops%20didn%27t%20have%20any%20series%20for%20%22cluster%3Apuppet_agent_resources_total%3Acount0%22%20metric%20in%20the%20last%201w
[11:29:12] I'm on vpn and I can't use w.wiki )o)
[11:29:23] oh wait I can, nevermind
[11:29:25] I did not see an alert for it today
[11:29:26] https://w.wiki/G4Yy
[11:29:50] oh, maybe it does not have the team flag
[11:30:01] yes because the metric does exist, labels I don't think get checked
[11:30:22] team is inferred from the alert file name, should work out of the box
[11:30:53] * godog lunch, will check later
[11:32:33] hmmm... okok, vector still works with labels, though I wonder if the metric exists is good enough
[11:34:05] I guess it still has value if the label value is mistyped (ex, it will catch if we change from "error" to "failed")
[11:47:47] can I get a +1 for T409970?
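(editor's note) The `or on() vector(...)` guard discussed between 10:30 and 11:34 can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is plain Python, not Prometheus, and the helper names (`or_on_empty`, `gt`, `vector`, `guarded`, `unparenthesized`) are hypothetical. It relies on two PromQL rules: comparison operators bind tighter than `or`, and with an empty `on()` list the right-hand side of `or` is kept only when the left-hand side has no series at all.

```python
# Toy model of PromQL instant vectors: a dict mapping a label tuple to a value.
# NOT Prometheus; it only mimics `or on()` and a `> N` filter for illustration.

def or_on_empty(left, right):
    """`left or on() right`: with an empty on() list every series matches,
    so the right side survives only when the left side is empty."""
    return dict(left) if left else dict(right)

def gt(vec, threshold):
    """`vec > threshold`: keep only the series whose value exceeds it."""
    return {labels: value for labels, value in vec.items() if value > threshold}

def vector(value):
    """`vector(value)`: a single series with no labels."""
    return {(): value}

def guarded(increase, fallback):
    """Models `(increase(...) or on() vector(fallback)) > 0`."""
    return gt(or_on_empty(increase, vector(fallback)), 0)

def unparenthesized(increase, fallback):
    """Models `increase(...) or on() vector(fallback) > 0`,
    where `> 0` applies to the vector() side only."""
    return or_on_empty(increase, gt(vector(fallback), 0))

ERRORED = (("status", "errored"),)
healthy = {ERRORED: 0.0}  # metric exists, no errors in the window
failing = {ERRORED: 3.0}  # metric exists, errors seen
missing = {}              # metric absent, e.g. a typo in its name

print(bool(guarded(healthy, 1)))           # False: nothing to alert on
print(bool(guarded(failing, 1)))           # True: real errors
print(bool(guarded(missing, 1)))           # True: absent metric is caught
print(bool(guarded(missing, -1)))          # False: vector(-1) never passes > 0
print(bool(unparenthesized(healthy, -1)))  # True: false positive without parentheses
```

In this model the parenthesized form with `vector(1)` fires both on real errors and on a missing series, while the unparenthesized form returns every left-hand series regardless of its value.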
[11:47:47] T409970: Increase volume storage on project analytics - https://phabricator.wikimedia.org/T409970
[11:48:25] and also for T409981
[11:48:25] T409981: Request increased build quota for MilHistBot Toolforge tool - https://phabricator.wikimedia.org/T409981
[11:57:46] * dcaro lunch
[11:58:39] dhinus: +1d
[11:58:50] thanks
[12:49:36] there's an alert on harbor down since 9 min
[12:49:48] anyone doing anything related?
[12:53:56] I can access it and everything looks ok
[12:54:32] dcaro: true re: label values typos, the expression with vector totally works too
[12:54:59] having said that, I'm logging off and tomorrow I'm off too for oncall comp, see you on Mon
[12:55:31] good weekend!
[12:55:38] cheers, you too
[12:57:35] taavi: can this be a side-effect of the mtu changes you made? maybe pingthing can't reach harbor anymore?
[12:58:43] lemme see
[12:59:59] I can access harbor ok from my laptop, one of the workers, http and pull an image
[13:00:21] note that harbor is running inside docker inside a vm, with the 1450 mtu set
[13:00:34] https://www.irccloud.com/pastebin/swqPKJwN/
[13:02:15] as far as I can tell from the blackbox UI there was a brief moment during which the probe failed when I rebooted the proxy VMs to pick up the new MTU, and the probe has been succeeding since then
[13:02:20] no clue why the alert is still firing
[13:03:04] eh, sorry, no I can't just read the interface correctly
[13:03:10] time=2025-11-13T13:01:55.310Z level=ERROR source=http.go:474 msg="Error for HTTP request" module=http_connect_23xx target=https://tools-harbor.wmcloud.org/api/v2.0/ping err="Get \"https://[2a02:ec80:a000:1::1d]/api/v2.0/ping\": dial tcp [2a02:ec80:a000:1::1d]:443: connect: no route to host"
[13:04:12] Nov 13 12:27:44 proxy-5 Keepalived_vrrp[524]: (vrrp_ipv6): entering FAULT state (src address not configured)
[13:04:12] Nov 13 12:27:44 proxy-5 Keepalived_vrrp[524]: (vrrp_ipv6) Entering FAULT STATE
[13:04:42] from cloudcontrol2010-dev curl fails
[13:05:02] from 
apt2002 it works
[13:05:26] it's trying ip6
[13:06:37] i restarted keepalived and it seems to work now
[13:06:42] ack, looking
[13:06:56] so it seems like we have a race condition where keepalived starts before the v6 address is assigned to the interface?
[13:07:44] who is setting the ip address?
[13:07:59] alert went away 👍
[13:09:08] systemd-networkd via DHCPv6 (which in the logs seems to happen two seconds after the v4 address)
[13:10:25] https://phabricator.wikimedia.org/P85315 wait-online is happy once a v4 address is assigned (and a v6 link-local?)
[13:10:40] I was thinking something like that yep
[13:11:27] it seems there's some option to force it to wait also for ip6
[13:11:49] something to note is that this happened on both bookworm project-proxy instances, but on none of the toolforge k8s haproxies (4 nodes in total, all trixie)
[13:12:00] so it might or might not be a bookworm-specific bug
[13:12:43] https://manpages.debian.org/trixie/systemd/systemd-networkd-wait-online.8.en.html says there is a config option to require a routable v6 address to consider a system online
[13:13:05] also note that we have the complication that we don't set up networkd directly, we have cloud-init configuring netplan configuring networkd configuring the actual interface
[13:17:44] 🪆
[13:17:45] xd
[13:49:19] quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204820, initializing the stats for maintain-dbusers
[13:49:26] (currently deployed in couldcontrol1007)
[13:51:55] lol at that typo in the hostname
[13:51:56] but +1
[13:55:41] xd, yep, coludcontrol gets caught by puppet checks, so I innovate
[15:25:44] does anybody have a good answer to "who is responsible for signing off gadgets?" in T408387?
[15:25:45] T408387: CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387
[15:26:45] there is no such thing
[15:28:19] Wikidata community would self govern on that
[15:28:44] I feel like I'm being kind of a pain on that ticket because I'm asking them to be sure that this will satisfy the requirement but the requirement seems to be pretty vague.
[15:29:00] Reedy: does that mean they should ask on a talk page someplace?
[15:29:14] probably, just having a look
[15:29:30] >If you have a problem with a gadget or you have an idea for a new gadget: please add a task to Phabricator.
[15:29:34] https://phabricator.wikimedia.org/maniphest/task/create/?projects=Wikidata-gadgets
[15:29:39] well really the question they want answered is not "can this be a gadget", but "can gadgets make network requests to place X" with various values of X
[15:29:48] I'm also open to someone else intervening and saying that what they want to do is definitely fine (or not fine). It just seems like a weird 'laundering' hack
[15:31:45] what I would like to avoid is creating a cloud-vps project, having them deploy their things, and at the end being told "you cannot run this as a gadget anyway"
[15:31:52] yes!
[15:33:20] the last comment on-wiki seems to be https://www.wikidata.org/wiki/Wikidata:Tools/Potential_gadgets#c-Ep%C3%ACdosis-20250806105800-Albert.meronyo-20250806105100
[15:34:50] I see they also created a task in the wikidata-gadgets project T374177
[15:34:51] T374177: New gadget proposal: Reference Verification Gadget for Wikidata (ProVe) - https://phabricator.wikimedia.org/T374177
[15:34:53] Loading stuff from cloud-y things is going to be better than some random other site... but the whole not making production depend on cloud
[15:35:17] https://meta.wikimedia.org/wiki/Third-party_resources_policy
[15:35:45] https://meta.wikimedia.org/wiki/Third-party_resources_policy#Opt-in_exemption_granted_by_users
[15:36:02] ^ that page is not actual policy atm, right?
[15:36:21] Not with any sort of enforcement AIUI
[15:38:03] like, i don't think there is currently any explicit policy about what opt-in gadgets can and can't do, so basically this entire thing that started this was one volunteer wikidata editor saying that they're not comfortable with the current way it's done
[15:43:15] yeah, I was thinking that the volunteer would ask /that/ editor "how does it sound if I do X"?
[15:43:27] But maybe it's not easy to do that?
[15:43:37] it's the opposite of what I said 5 mins ago, but maybe we could just approve the cloud-vps project, and let them figure out how to convince the wikidata community
[15:44:32] i would be fine with that
[15:44:53] Yeah, as long as we make it clear (which I suppose we already have) that we don't have any ability to tell them that this will solve the problem
[15:48:51] +1 for that with a clear "this does not mean that it's approved as a gadget"
[15:55:00] how does this sound? "The decision on whether this can be approved as a gadget sits with the Wikidata community. The fact that a part of the code runs in Wikimedia Cloud might or might not be enough to convince them."
[15:56:54] nice, maybe adding, we are happy to give you a project to get you started, or similar too
[16:00:11] thanks, replied in the task
[16:55:35] * andrewbogott waves to cciufo
[17:36:02] * dhinus off
[17:44:54] * dcaro off
[17:44:57] cya!