[07:45:24] FYI, I'm re-enabling Puppet on apt1002, it was disabled by Andrew to debug an installer (possibly T407586), but we need it re-enabled to allow current changes to preseed.yaml
[07:45:24] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[08:35:21] greetings
[08:55:55] i'd like to enable the interface-level jumbo frame setting on a single eqiad1 cloudvirt: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203760/ after seeing no issues on that level in codfw1dev
[08:58:19] LGTM
[09:04:39] morning
[09:06:09] hello!
[09:06:25] morning
[09:50:23] I'm re-creating my lima-kilo and it's consistently failing with:
[09:50:24] Error mounting /mnt/lima-cache/local: mount: /mnt/lima-cache/local: special device /var/lib/docker/overlay2 does not exist.
[09:50:32] is this a known issue?
[09:53:28] it says that dmesg might have info but a quick look at it inside the VM didn't give me any insight
[09:54:19] /var/lib/docker/ does not have an overlay2
[10:19:28] running ./start-devenv.sh --no-cache seems to be going through (thanks francesco for the tip) but not sure how slow it will all be
[10:21:09] it should not change much if you have a fast internet connection
[10:21:42] we should still fix it, I'll check if I can reproduce the error, maybe something changed in lima and/or docker
[10:23:17] I rebuilt it yesterday from scratch and had no issues (fyi)
[12:12:53] * dcaro lunch
[13:30:12] toolsdb just crashed while I was testing some config options (very minor ones)
[13:30:23] and it's failing to restart, expect alerts/pages
[13:30:26] I'm looking into it
[13:30:55] the disk is not full
[13:32:10] "[ERROR] InnoDB: Missing FILE_CHECKPOINT at 123517410991977 between the checkpoint 12351731714...."
[13:33:56] this could be related to the crash from last week, though it's strange...
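A rough triage sketch for the crash above (free disk space first, then the InnoDB errors from the service journal); the FQDN, the mount point and the systemd unit name (mariadb vs. mysql) are assumptions for illustration, not taken from the log:

    # Hedged first-look commands on the crashed host; names and paths are assumptions.
    ssh tools-db-4.tools.eqiad1.wikimedia.cloud   # hypothetical FQDN
    df -h /srv                                    # confirm the data disk really isn't full
    sudo systemctl status mariadb --no-pager      # why the restart keeps failing
    sudo journalctl -u mariadb --since today | grep -i 'InnoDB' | tail -n 50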
[13:34:05] I will fail over to tools-db-6 as I was already planning to do it anyway
[13:35:02] ack, please reach out with any support you need dhinus
[13:35:26] * dhinus paged as expected
[13:36:05] godog: thanks, if you think it's worth it you can add some info to the existing incident doc, or to a new one
[13:36:25] tools-db-readonly is still up&running but the primary is hard down
[13:36:31] sure I'll start a new doc
[13:37:12] I'm following the guide at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
[13:37:39] doc just copied here, I'll make adjustments https://docs.google.com/document/d/1xinOSzx2JtfX_NYC4ntn4lxEC3PViszPcn6JDkojHuQ/edit?tab=t.0
[13:39:36] the first crash log is Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch>
[13:40:53] there's a previous one actually Nov 11 13:20:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last check>
[13:41:18] ack
[13:43:43] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/281
[13:44:13] LGTM
[13:46:08] running tofu apply with cookbook
[13:47:04] setting read_only=off on tools-db-6 (the new primary)
[13:49:02] I see write transactions are already happening on the new host
[13:49:21] `sql tools` from a toolforge bastion also works
[13:50:06] neat
[13:51:00] alert is resolved
[13:51:18] I'm missing only the pt_heartbeat process that requires a change in instance-puppet, but horizon is down for me
[13:51:40] https://horizon.wikimedia.org/ hangs and never loads
[13:51:46] I don't see other active alerts though
[13:51:56] horizon WFM, what should I change?
[13:52:21] weird. it's not super urgent but you can follow https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Update_the_Hiera_key_for_pt-heartbeat
[13:52:34] ok will do
[13:52:37] (toolsdb will still work without it, but it won't report lag data)
[13:53:54] horizon is now working again, maybe just a glitch in my dns/internet connection
[13:54:01] I'll let you finish the change if you're already on it
[13:54:09] {{done}}
[13:58:45] thanks. running run-puppet-agent to pick up the change
[13:59:42] not urgent but I'm adding an action item to make that setting follow the dns cname instead, one less moving part
[14:00:30] yes that would be nice
[14:02:49] seems like i just missed all of the fun
[14:04:31] taavi: :) but you can still try to guess what caused the crash because I have no idea...
[14:06:57] I was at lunch, do you need any help?
[14:07:46] is there anything to do after merging a patch for the lima-kilo repo?
[14:09:41] not really, just update your local repo, you might need to rebuild your lima-kilo depending on what changed
[14:13:15] volans: all looking good at the moment
[14:13:51] ack, thx
[14:14:06] I'm running innochecksum on the "ibdata1" file on the old server that crashed. it's taking a while, but maybe it will give us some info
[14:18:15] ah I need to fix the grants for IPv6 T409563
[14:18:16] T409563: [toolsdb] Add users and grants for IPv6, remove obsolete ones - https://phabricator.wikimedia.org/T409563
[14:18:36] otherwise tools-db-readonly might not work correctly (e.g. from Quarry)
[14:19:22] taavi: what IPv6 prefix can I use for the IPv6 mysql users?
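For reference, a minimal sketch of the promotion step and the innochecksum run mentioned a few messages up; the canonical procedure is the linked wikitech page, and the auth details and datadir path here are assumptions:

    # Hedged sketch: take the new primary (tools-db-6) out of read-only mode, then verify.
    sudo mysql -e "SET GLOBAL read_only = OFF;"
    sudo mysql -e "SELECT @@hostname, @@read_only;"   # expect tools-db-6 / 0

    # innochecksum against the crashed host's shared tablespace; the path is an assumption.
    sudo innochecksum /srv/labsdb/data/ibdata1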
[14:20:35] cloud vps vms are 2a02:ec80:a000::/56 per https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space
[14:23:24] thanks
[14:47:13] what about the source IPs for the labsdbadmin grants? should I use the cloudprivate ones, i.e. 2a02:*?
[14:47:42] dhinus: yeah, those should be the cloud-private addresses of the cloudcontrol hosts
[14:48:04] ack thanks. I'm writing down everything in the task so you can review it before I apply them
[15:22:27] what's the procedure to import a new version of X into docker-registry.svc.toolforge.org ?
[15:23:54] let's say I want to use the latest chart of loki from upstream, it looks for loki:3.5.7 while we have 3.5.0
[15:24:07] `cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry`
[15:24:31] thx, looking
[15:26:25] and it's safe to import because no one will pick it up unless explicitly selecting that version, right?
[15:26:45] correct
[15:29:17] worked like a charm and was also quick, thanks!
[15:44:09] the puppet alerts on cloudcontrols are T373815
[15:44:09] T373815: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815
[15:44:24] dcaro and I fixed them manually, I reopened the task
[16:07:12] taavi: with the split of loki-tracing to its own component that I'm about to send, does https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92/diffs need some adjustment?
[16:07:36] afaik no
[16:54:13] ok, https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 should now be ready for a full pass/final review. I'll be out the rest of the week so if the reviews are positive and anyone wants to take over the deploy feel free. (I'll also send a short email about it)
[17:06:40] {email sent}
[17:14:48] volans: thanks!
[17:17:27] added a last-minute comment to the MR with some useful log/command to check it
[18:09:40] * dhinus off
[18:38:04] * dcaro off
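Not from the log, but roughly how one of the IPv6 grants from T409563 might look, assuming MariaDB's '%' host-pattern matching against the 2a02:ec80:a000::/56 range quoted above; 's12345' and the password are placeholders, and note that a string pattern like this is looser than a strict /56 match:

    # Hypothetical sketch of an IPv6 grant for a Toolforge tool user.
    sudo mysql -e "CREATE USER IF NOT EXISTS 's12345'@'2a02:ec80:a000:%' IDENTIFIED BY 'REDACTED';"
    sudo mysql -e "GRANT ALL PRIVILEGES ON \`s12345\_\_%\`.* TO 's12345'@'2a02:ec80:a000:%';"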