[07:45:24] FYI, I'm re-enabling Puppet on apt1002, it was disabled by Andrew to debug an installer (possibly T407586), but we need it re-enabled to allow current changes to preseed.yaml
[07:45:24] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[08:35:21] greetings
[08:55:55] i'd like to enable the interface-level jumbo frame setting on a single eqiad1 cloudvirt: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203760/ after seeing no issues on that level in codfw1dev
[08:58:19] LGTM
[09:04:39] morning
[09:06:09] hello!
[09:06:25] morning
[09:50:23] I'm re-creating my lima-kilo and it's consistently failing with:
[09:50:24] Error mounting /mnt/lima-cache/local: mount: /mnt/lima-cache/local: special device /var/lib/docker/overlay2 does not exist.
[09:50:32] is this a known issue?
[09:53:28] it says that dmesg might have info but a quick look at it inside the VM didn't give me any insight
[09:54:19] /var/lib/docker/ does not have an overlay2
[10:19:28] running ./start-devenv.sh --no-cache seems to be going through (thanks francesco for the tip) but not sure how slow it will all be
[10:21:09] it should not change much if you have a fast internet connection
[10:21:42] we should still fix it, I'll check if I can reproduce the error, maybe something changed in lima and/or docker
[10:23:17] I rebuilt it yesterday from scratch and had no issues (fyi)
[12:12:53] * dcaro lunch
[13:30:12] toolsdb just crashed while I was testing some config options (very minor ones)
[13:30:23] and it's failing to restart, expect alerts/pages
[13:30:26] I'm looking into it
[13:30:55] the disk is not full
[13:32:10] "[ERROR] InnoDB: Missing FILE_CHECKPOINT at 123517410991977 between the checkpoint 12351731714...."
[13:33:56] this could be related to the crash from last week, though it's strange...
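A rough triage sketch for the crash above (free disk space first, then the InnoDB errors from the service journal); the FQDN, the mount point and the systemd unit name (mariadb vs. mysql) are assumptions for illustration, not taken from the log:

    # Hedged first-look commands on the crashed host; names and paths are assumptions.
    ssh tools-db-4.tools.eqiad1.wikimedia.cloud   # hypothetical FQDN
    df -h /srv                                    # confirm the data disk really isn't full
    sudo systemctl status mariadb --no-pager      # why the restart keeps failing
    sudo journalctl -u mariadb --since today | grep -i 'InnoDB' | tail -n 50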
[13:34:05] I will fail over to tools-db-6 as I was already planning to do it anyway
[13:35:02] ack, please reach out with any support you need dhinus
[13:35:26] * dhinus paged as expected
[13:36:05] godog: thanks, if you think it's worth it you can add some info to the existing incident doc, or to a new one
[13:36:25] tools-db-readonly is still up&running but the primary is hard down
[13:36:31] sure I'll start a new doc
[13:37:12] I'm following the guide at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
[13:37:39] doc just copied here, I'll make adjustments https://docs.google.com/document/d/1xinOSzx2JtfX_NYC4ntn4lxEC3PViszPcn6JDkojHuQ/edit?tab=t.0
[13:39:36] the first crash log is Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch>
[13:40:53] there's a previous one actually Nov 11 13:20:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last check>
[13:41:18] ack
[13:43:43] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/281
[13:44:13] LGTM
[13:46:08] running tofu apply with cookbook
[13:47:04] setting read_only=off on tools-db-6 (the new primary)
[13:49:02] I see write transactions are already happening on the new host
[13:49:21] `sql tools` from a toolforge bastion also works
[13:50:06] neat
[13:51:00] alert is resolved
[13:51:18] I'm missing only the pt_heartbeat process that requires a change in instance-puppet, but horizon is down for me
[13:51:40] https://horizon.wikimedia.org/ hangs and never loads
[13:51:46] I don't see other active alerts though
[13:51:56] horizon WFM, what should I change?
[13:52:21] weird. it's not super urgent but you can follow https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Update_the_Hiera_key_for_pt-heartbeat
[13:52:34] ok will do
[13:52:37] (toolsdb will still work without it, but it won't report lag data)
[13:53:54] horizon is now working again, maybe just a glitch in my dns/internet connection
[13:54:01] I'll let you finish the change if you're already on it
[13:54:09] {{done}}
[13:58:45] thanks. running run-puppet-agent to pick up the change
[13:59:42] not urgent but I'm adding an action item to make that setting follow the dns cname instead, one less moving part
[14:00:30] yes that would be nice
[14:02:49] seems like i just missed all of the fun
[14:04:31] taavi: :) but you can still try to guess what caused the crash because I have no idea...
[14:06:57] I was at lunch, do you need any help?
[14:07:46] is there anything to do after merging a patch for the lima-kilo repo?
[14:09:41] not really, just update your local repo, you might need to rebuild your lima-kilo depending on what changed
[14:13:15] volans: all looking good at the moment
[14:13:51] ack, thx
[14:14:06] I'm running innochecksum on the "ibdata1" file on the old server that crashed. it's taking a while, but maybe it will give us some info
[14:18:15] ah I need to fix the grants for IPv6 T409563
[14:18:16] T409563: [toolsdb] Add users and grants for IPv6, remove obsolete ones - https://phabricator.wikimedia.org/T409563
[14:18:36] otherwise tools-db-readonly might not work correctly (e.g. from Quarry)
[14:19:22] taavi: what IPv6 prefix can I use for the IPv6 mysql users?
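For reference, a minimal sketch of the promotion step and the innochecksum run mentioned a few messages up; the canonical procedure is the linked wikitech page, and the auth details and datadir path here are assumptions:

    # Hedged sketch: take the new primary (tools-db-6) out of read-only mode, then verify.
    sudo mysql -e "SET GLOBAL read_only = OFF;"
    sudo mysql -e "SELECT @@hostname, @@read_only;"   # expect tools-db-6 / 0

    # innochecksum against the crashed host's shared tablespace; the path is an assumption.
    sudo innochecksum /srv/labsdb/data/ibdata1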
[14:20:35] cloud vps vms are 2a02:ec80:a000::/56 per https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space
[14:23:24] thanks
[14:47:13] what about the source IPs for the labsdbadmin grants? should I use the cloudprivate ones, i.e. 2a02:*?
[14:47:42] dhinus: yeah, those should be the cloud-private addresses of the cloudcontrol hosts
[14:48:04] ack thanks. I'm writing down everything in the task so you can review it before I apply them
[15:22:27] what's the procedure to import a new version of X into docker-registry.svc.toolforge.org ?
[15:23:54] let's say I want to use the latest chart of loki from upstream, it looks for loki:3.5.7 while we have 3.5.0
[15:24:07] `cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry`
[15:24:31] thx, looking
[15:26:25] and it's safe to import because no one will pick it up unless explicitly selecting that version, right?
[15:26:45] correct
[15:29:17] worked like a charm and was also quick, thanks!
[15:44:09] the puppet alerts on cloudcontrols are T373815
[15:44:09] T373815: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815
[15:44:24] dcaro and I fixed them manually, I reopened the task
[16:07:12] taavi: with the split of loki-tracing to its own component that I'm about to send, does https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92/diffs need some adjustment?
[16:07:36] afaik no
[16:54:13] ok, https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 should now be ready for a full pass/final review. I'll be out the rest of the week so if the reviews are positive and anyone wants to take over the deploy feel free. (I'll also send a short email about it)
[17:06:40] {email sent}
[17:14:48] volans: thanks!
[17:17:27] added a last-minute comment to the MR with some useful log/command to check it
[18:09:40] * dhinus off
[18:38:04] * dcaro off
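Not from the log, but roughly how one of the IPv6 grants from T409563 might look, assuming MariaDB's '%' host-pattern matching against the 2a02:ec80:a000::/56 range quoted above; 's12345' and the password are placeholders, and note that a string pattern like this is looser than a strict /56 match:

    # Hypothetical sketch of an IPv6 grant for a Toolforge tool user.
    sudo mysql -e "CREATE USER IF NOT EXISTS 's12345'@'2a02:ec80:a000:%' IDENTIFIED BY 'REDACTED';"
    sudo mysql -e "GRANT ALL PRIVILEGES ON \`s12345\_\_%\`.* TO 's12345'@'2a02:ec80:a000:%';"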