[07:41:42] greetings
[08:30:50] hmm, apparently at some point we've accidentally renamed the "python" image into "python3.4" as the canonical name (and "python" as an alias, but unlike jobs, webservice does not resolve aliases)
[08:38:57] same thing with golang111 becoming golang1.11
[09:11:00] soooo many tools still using python 3.4 :(
[09:14:08] :( :(
[09:25:17] with those fixes we are up to 1847 httproute objects (out of 1922 ingress objects, although some of the latter are manually configured by tool maintainers so out of scope for the automatic migration)
[09:25:49] very nice
[09:28:08] right now running the script with a fix for tools that have set their home directory non-world-readable, after that I think most cases will be related to half-working build service configurations that we've later added strict validation into
[09:28:35] bit annoying that the `webservice` logic doesn't allow easily just creating the single object without also having all other details of the service correct
[09:34:41] there's a few tools that mangle with the shell/bashrc and such, and python path, just fyi
[09:43:39] dcaro: fwiw sample-complex-app is failing with 'toolforge webservice: error: argument --buildservice-image: Build service image option must specify an image tag'
[09:44:00] also some of the sample-$FOO-buildpack-apps are failing due to --mount not being set
[09:46:10] hmm, looking
[09:48:20] fixed sample-complex-app
[09:48:44] if you have the list of the others I can go through them and fix it
[09:49:56] sample-python-buildpack-app sample-ruby-rails-buildpack-app nodejs-flask-buildpack-sample
[09:52:31] ack
[09:56:16] morning
[09:59:17] also fastapi-blueprint
[10:07:42] dhinus: want to reboot clouddumps now, so we can start the k8s worker ones later?
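(Editor's note: the "must specify an image tag" validation discussed above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the actual `toolforge webservice` code; the function name `has_image_tag` is made up. The one subtlety is that a `:` only denotes a tag when it appears after the last `/`, since a registry host may carry a port.)

```python
def has_image_tag(image_ref: str) -> bool:
    """Return True if an OCI image reference pins an explicit tag.

    Hypothetical sketch of a check like the one behind the
    'Build service image option must specify an image tag' error:
    a ':' only counts as a tag separator if it appears after the
    last '/', because registry hosts may include a port
    (e.g. registry:5000/app has no tag).
    """
    last_segment = image_ref.rsplit("/", 1)[-1]
    return ":" in last_segment
```

For example, `has_image_tag("tool/app:latest")` is true, while `has_image_tag("registry:5000/app")` is false because the colon there belongs to the registry port, not a tag.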
[10:07:58] sgtm
[10:09:50] I'm not sure if it actually helps or not doing the depooling as described in https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Dumps#Failover
[10:10:14] for NFS probably not
[10:10:29] for web clients it does reduce downtime, but not sure if it's worth it for the couple minutes it takes
[10:10:57] I'd be up to just rebooting them without that and seeing what happens :D
[10:11:07] agreed, I'll just reboot, starting from 1001
[10:11:27] ty!
[10:14:00] clouddumps1001 rebooting now
[10:22:36] ack
[10:23:12] clouddumps1001 up again
[10:24:04] all looks good on https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview
[10:24:13] there was a small bump of "pending pods" during the reboot
[10:24:49] I'm gonna reboot clouddumps1002 in a minute, unless you see anything not working
[10:29:27] clouddumps1002 rebooting now
[10:37:05] clouddumps1002 up again
[10:41:07] toolforge graphs are looking good
[10:44:37] does anyone know why is there an Ingress object in the infra-tracing-loki namespace? isn't that traffic routed via the NodePort service + haproxy?
[10:48:21] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1177
[10:49:10] does it have any metadata that it was created by helm?
[10:49:24] (maybe was tweaked manually or similar? leftover from a failed deployment?)
[10:49:30] does app.kubernetes.io/managed-by=Helm count?
[10:49:39] I'd say so
[10:50:19] as in, if it was manually created, it probably would not have that (unless it was dumped from a helm deployment and applied manually), does it have any other info?
[10:52:24] i mean, the object is very clearly managed by helm, as the patch above says
[10:52:40] I'm wondering /why/ it's configured to do that, since I don't think any traffic is actually flowing through it
[10:57:03] I thought you were pointing to the patch that removed it, not one that you just sent to do so xd
[11:01:39] maybe that ingress was part of an upstream helm chart that v.olans used before setting up the nodeport?
[11:03:03] +1 to removing it
[11:04:34] speaking of +1s, I would need a +1 on T420544
[11:04:35] T420544: Disk quota increase for catalyst - https://phabricator.wikimedia.org/T420544
[11:04:40] and also on T420532
[11:04:41] T420532: Request creation of s3etherpad VPS project - https://phabricator.wikimedia.org/T420532
[11:05:21] +1d
[11:07:38] with that, the four tools with an Ingress but not a HTTPGateway are two examples I mentioned earlier (fastapi-blueprint and nodejs-flask-buildpack-sample), and two where the maintainer has set things up by hand
[11:12:24] dcaro: I did a quick audit on tools memory requests re: what you said about changing default values, details in T420565
[11:12:24] T420565: Audit tools memory requests vs actual usage - https://phabricator.wikimedia.org/T420565
[11:13:03] tl;dr I'd say about 4-5x overprovisioned requests vs usage for ~20% of tools
[11:14:35] I think fastapi-blueprint was a test by blancadesal
[11:14:57] does that mean it's no longer needed? can it be disabled?
[11:15:33] I think so, though it's her personal tool, so maybe ask first
[11:15:56] (it does some nlp stuff)
[11:18:34] anyway, managed to just restart it so the immediate issue is now fixed
[11:18:55] (sorry, thought that was one of the main buildservice tests/examples still used for tests etc)
[11:21:09] restarted/fixed the rest, a couple don't work anymore though
[11:21:17] but webservice restarted xd
[11:26:57] dcaro: any idea what could be causing this? T420547
[11:26:57] T420547: "toolforge build start" fails due to Heroku download error? - https://phabricator.wikimedia.org/T420547
[11:31:53] seems to be some issue on github side (as in, the repo missing the file)
[11:33:31] I did not change anything (yet) xd
[11:35:05] testing with the sample-php-buildpack-app
[11:35:27] yep, same issue
[11:58:33] I have to go for a bit, but will be back later, I'm thinking on updating the current `--use-latest-versions` buildpacks right away, to unblock php, though that might start breaking some tools that use the current `--use-latest-versions` ones (trying to test locally)
[12:04:24] dcaro: +1 to that plan, some tools are going to break anyway, but at least we have a working build path
[12:05:01] unless we can find a way to hotfix the old buildpack, but I'm not sure there's an easy way
[12:05:15] * dhinus lunch, bbl
[12:27:34] didn't we mirror these repos to gitlab at some point to prevent this from happening?
[12:27:48] are other buildpacks at risk from breaking in a similar way?
[12:46:37] some of it yes, but not all of it (same as we don't mirror python/php/nodejs distribution packages, if upstream obsoletes a version and the buildpack tries to pull it it will fail)
[12:48:29] note that this is not the buildpack itself, but something the buildpack pulls in to install php
[14:34:27] I'm still testing a bunch of builds with the newer buildpacks, but I have an email prepared, if anyone wants to give it a look (it's in reply to the upgrade email I sent the other day): https://etherpad.wikimedia.org/p/email_buildpacks_
[14:36:30] dcaro: I would add a line break before "this means that if you were...", and clarify that with more details
[14:37:47] there's another missing part I think: if you were NOT using the flag, and your build has broken since today, adding the flag might fix it :)
[14:37:59] welcome bliviero!
[14:38:03] (or is the bug only affecting people using --latest?)
[14:38:05] hello everybody!
[14:38:09] folks, bliviero is the new manager of tools-infra
[14:38:18] \o/
[14:38:19] the 'b' is for Belinda
[14:38:24] welcome!
[14:39:10] bliviero: welcome! 💃
[14:39:40] dhinus: tried to address your comment, does that help? (feel free to edit/rephrase)
[14:40:34] dcaro: yes it does address my first comment!
[14:41:59] is the bug also affecting people who are NOT using --latest?
[14:42:27] the php one yes
[14:42:30] affects everyone
[14:42:55] (unfortunately), the buildpacks versions that we have in the current `--use-latest-versions` is quite old already
[14:43:24] ok, then I will add another line about that
[14:48:02] thanks yes
[14:48:46] I added all that I think should be covered, but I feel the overall clarity can still be improved
[14:51:23] dhinus: are there any more potential clouddumps reboot artifacts you want to look for on the tools k8s workers or should I start getting the kernel updates applied there too?
[14:51:41] taavi: you can go ahead!
[14:51:48] will do, thanks!
[16:21:09] thank you @dhinus and @dcaro
[16:25:41] dhinus: Raymond_Ndibe any last eyes before I merge https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1179 ?
[16:42:44] oh I was almost ready to say "clouddb1013 looks stable", and it crashed again :(
[16:43:15] :(
[16:43:20] I'm depooling it
[16:46:32] at least it waited for you to be back at work
[16:47:08] haha that was kind :)
[16:47:20] I added a comment here, not sure what's the best way forward https://phabricator.wikimedia.org/T420177#11728804
[16:49:24] DBA will need to have a look at it and probably report upstream
[16:50:27] sounds fair. manuel is out but back next week
[16:55:34] I pinged in #-data-persistence as well
[16:56:19] * dhinus off for today
[18:49:30] * dcaro off
[18:53:43] the last ceph osd reboot seems to have caused some kind of disruption, I'm digging in
[18:55:13] oh, I see
[18:55:17] toolforge started having alerts
[18:55:40] toolsdb seems to be suffering too
[18:55:51] ceph just says "HEALTH_WARN Reduced data availability: 7 pgs inactive, 7 pgs peering; 305 slow ops, oldest one blocked for 1211 sec, daemons [osd.241,osd.255,osd.297] have slow ops."
[18:56:04] it seems stuck there but that really shouldn't be breaking anything
[18:56:57] which host are those OSDs on?
[18:57:10] the osd slow ops are in 3 different osds in 3 different hosts on F4 and D5
[18:57:29] 1036 was the last reboot
[18:57:48] I just unset noout in case ceph was feeling constrained about healing...
[18:58:01] it's one of them
[18:58:04] (1036)
[18:58:18] 1033 and 30 are the other two
[18:59:04] I was running wmcs.ceph.roll_reboot_osds; it checks for a good health check before moving to the next node
[18:59:13] so in theory everything was at 100% health before 1036
[18:59:42] but everything is up and connecting now, so why are those 7 pgs still cursed?
[19:00:03] from journal logs of osd.297
[19:00:04] Mar 19 18:59:49 cloudcephosd1036 ceph-osd[2190]: 2026-03-19T18:59:49.377+0000 7f5acd9986c0 0 log_channel(cluster) log [WRN] : 256 slow requests (by type [ 'delayed' : 256 ] most affected pool [ 'eqiad1-compute' : 226 ])
[19:00:35] ok, so... maybe drop that osd entirely and force it to rebalance?
[19:00:41] I'll restart it, see if it gets over the hiccup or forces it to rebalance
[19:00:43] ok
[19:00:45] 1030 started reporting slow ops at 18:36Z
[19:00:50] that's better than dropping it completely
[19:00:51] +1 to restarting the affected osd services
[19:01:12] dcaro: you're doing 'systemd restart'? or something internal to ceph?
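(Editor's note: a small parsing sketch for the HEALTH_WARN summary quoted above, useful when deciding which `ceph-osd@N` services to restart. The function name `slow_op_osds` is made up; it assumes the `daemons [osd.X,osd.Y,...] have slow ops` phrasing that appears in the ceph status output quoted in this log.)

```python
import re


def slow_op_osds(health_summary: str) -> list[str]:
    """Extract OSD daemon ids from a ceph 'have slow ops' health warning.

    Matches the 'daemons [osd.X,osd.Y,...] have slow ops' clause and
    returns the daemon ids as a list, e.g. ['osd.241', 'osd.255'].
    Returns an empty list if the clause is absent (e.g. HEALTH_OK).
    """
    m = re.search(r"daemons \[([^\]]+)\] have slow ops", health_summary)
    if not m:
        return []
    return [d.strip() for d in m.group(1).split(",")]
```

Each returned id maps directly onto a systemd unit on its host, e.g. `osd.297` was cleared above with `systemctl restart ceph-osd@297.service` on cloudcephosd1036.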
[19:01:14] that seems to have done something
[19:01:18] yep
[19:01:19] health: HEALTH_OK
[19:01:26] root@cloudcephosd1036:~# systemctl restart ceph-osd@297.service
[19:01:28] yep
[19:01:39] :/
[19:01:42] it had been a while
[19:01:47] Yeah
[19:01:52] Thank you for appearing, both of you!
[19:01:58] So what kind of cleanup does toolforge need now?
[19:02:22] and I guess that was the usual failure case of the toolforge nfs share being so large that basically any pg disruptions disrupted that as well?
[19:03:35] So in theory with the new NFS version the nfs mounts will recover without reboots...
[19:03:45] that seems to be true?
[19:04:07] oh, thanks for the vote of confidence stashbot
[19:04:12] :D
[19:04:38] might have just been really unlucky timing with the reboot cookbook I already have running in the background
[19:04:40] taavi: yep, any pg that goes inactive will affect any rbd that has any data on it when trying to access it, if you are big enough, you are likely to spread in many pgs
[19:04:48] think I should restart the toolsdb engine?
[19:04:52] iirc the chunk size was in the order of megabytes
[19:05:21] andrewbogott: are you seeing issues on toolsdb?
[19:05:35] just alerts, maybe they're stale
[19:05:38] but yeah, so far Toolforge seems to be recovering pretty well
[19:05:50] ToolsDB number of read/write instances is not 1 #page
[19:05:55] ^ that's serious if true
[19:06:02] it might be 0 xd
[19:06:08] exactly
[19:06:09] (/me hopes)
[19:06:13] better than many
[19:06:19] taavi: you confirmed that it's up and read/write?
[19:06:28] one second
[19:06:44] it's ro
[19:07:09] mariadb seems to have restarted 6 minutes ago
[19:07:18] that's normal if it crashes and restarts, right? Needs to be switched to r/w manually?
[19:07:37] yeah, it hung
[19:07:49] Mar 19 19:07:02 tools-db-6 mysqld[1149860]: 2026-03-19 19:07:02 12831 [ERROR] mysqld: Table './s51211__duplicity_p/no_wd' is marked as crashed and should be repaired
[19:08:02] understandable if it can't read its disk
[19:08:14] Mar 19 19:02:09 tools-db-6 mysqld[1149860]: 2026-03-19 19:02:09 2643 [ERROR] mysqld: Table './s51083__ukbot/contests' is marked as crashed and should be repaired
[19:08:38] there was a repair command or something iirc
[19:08:48] yeah, REPAIR TABLE
[19:08:52] running it on those two
[19:09:09] ack xd
[19:09:42] we are back in r/w
[19:09:47] \o/
[19:09:56] what are y'all's thoughts about continued reboots? (apart from the fact that I can't use the cookbook now because it'll go back to 1)
[19:09:59] thank you taavi!
[19:10:17] also repaired s52861__secWatch.rev_log, s55928__wiki.searchindex
[19:10:24] * dcaro does not like the slowops
[19:12:50] I would like to poke a bit more into the logs if there's something that explains the slowops
[19:13:07] since "restart the osd service" is not something that should normally be needed
[19:18:23] yep, last things I saw, it finished doing a scrub, and started creating memtables, and then the first slow op appeared
[19:18:26] https://www.irccloud.com/pastebin/kqSDCehP/
[19:20:55] oh, there were lost pings also around
[19:21:20] https://usercontent.irccloud-cdn.com/file/cRpXB3Hz/image.png
[19:21:32] oh, those are the reboots xd
[19:25:17] the osd 297 is running on newer disks (not the old dell) I think
[19:25:19] https://www.irccloud.com/pastebin/KU68KUMz/
[19:26:30] 1036 is not slated for refresh next year, that's only up through 1034. But maybe it got one of the drive experiments...
[19:26:30] anyhow, /me going away to walk the dogs and such, interesting stuff
[19:28:00] taavi: I am in no hurry to complete those reboots so I'll set this task aside for now, we can discuss again during working hours.
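(Editor's note: the crashed-table repairs above were done by hand; a small sketch of turning mysqld error-log lines into REPAIR TABLE statements. The function name `repair_statements` is made up; the log-line format is taken from the journal excerpts quoted in this log, and MyISAM-style `Table './db/table' is marked as crashed` messages generally follow it.)

```python
import re

# Matches e.g.:
# [ERROR] mysqld: Table './s51083__ukbot/contests' is marked as crashed and should be repaired
CRASHED_RE = re.compile(
    r"\[ERROR\] mysqld: Table '\./([^/]+)/([^']+)' is marked as crashed"
)


def repair_statements(log_lines: list[str]) -> list[str]:
    """Turn mysqld 'marked as crashed' error-log lines into REPAIR TABLE
    statements.

    Deduplicates while preserving first-seen order, since a crashed table
    is often reported many times. A sketch only: not a substitute for
    checking each table after the repair runs.
    """
    seen: list[tuple[str, str]] = []
    for line in log_lines:
        m = CRASHED_RE.search(line)
        if m and (m.group(1), m.group(2)) not in seen:
            seen.append((m.group(1), m.group(2)))
    return [f"REPAIR TABLE `{db}`.`{tbl}`;" for db, tbl in seen]
```

Fed the two journal lines quoted above, this yields the two statements that were run by hand against `s51211__duplicity_p.no_wd` and `s51083__ukbot.contests`.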