[00:25:55] taavi: For T256168 I would like to add email and irc alerting when a systemd::timer::job fails/recovers. I have notes in T315695 about config things, but I'm not sure about how to get the job status signal into Prometheus. Does the `monitoring_enabled` flag on systemd::timer::job work for that in deployment-prep, or do I need to do something completely different?
[00:25:55] T256168: Move beta cluster automatic deployment to a dedicated infrastructure - https://phabricator.wikimedia.org/T256168
[00:25:56] T315695: Add basic MediaWiki/web site up alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695
[06:44:32] greetings
[09:30:18] bd808: no. I'm fairly sure that switch only controls an icinga nrpe check (which won't work in cloud vps, which has no icinga), and even if it somehow managed to feed the prometheus instance in deployment-prep, that prometheus instance can't send alerts out by itself
[09:30:57] so the 'easy' solution at the moment is to manually poke a prometheus alerting rule in metricsinfra, based on the `node_systemd_unit_state` metric
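A minimal sketch of the kind of metricsinfra rule described above, assuming node_exporter's systemd collector is enabled on the instance; the rule file name, unit name, and thresholds are illustrative assumptions, not the actual deployment-prep config:

```
# Hypothetical alerting rule: fire when a unit managed by
# systemd::timer::job sits in the "failed" state.
# node_systemd_unit_state has one time series per (unit, state)
# pair, with value 1 for the state the unit is currently in.
cat > timer-job-alerts.yaml <<'EOF'
groups:
  - name: systemd_timer_jobs
    rules:
      - alert: SystemdTimerJobFailed
        expr: node_systemd_unit_state{name="beta-update.service",state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} failed on {{ $labels.instance }}"
EOF
```

The email/IRC routing and the "recovered" notification then come from metricsinfra's alertmanager when the alert resolves, which is exactly the piece the deployment-prep-local prometheus lacks; metricsinfra manages its rules through its own tooling, so the above only sketches the PromQL side.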
[11:34:40] paws is currently down, volans and I are investigating
[11:35:01] please do shout for help and/or rubberducking as needed
[11:35:12] the trigger was me running the deploy script. when volans ran it, it kept timing out
[11:35:27] I ran it and it "worked", but now paws is replying "Service Unavailable"
[11:36:04] ack
[11:39:22] maybe a pvc issue: Warning FailedMount pod/hub-55869874bb-6pdlm
[11:40:30] there's a hub pod stuck in "ContainerCreating", and it's using an old version of the image...
[11:40:38] I tried killing that pod
[11:41:46] it spun up a new one but it's still using quay.io/wikimedia-paws-prod/paws-hub:pr-499
[11:41:49] it should be using pr-511
[11:42:29] weird
[11:42:53] the deployment is pointing to pr-499, it's not up to date
[11:42:57] AGE: 221d
[11:43:09] so the deploy script that "worked" for you didn't actually work?
[11:43:20] looks like it.
[11:43:25] also
[11:43:25] -rw-r--r-- 1 root root 2368 Jul 8 2025 terraform.tfstate
[11:43:53] shouldn't that have been updated by the deployment? or is it only ansible, and tofu was a noop
[11:45:23] tofu was a noop
[11:46:03] I'm not sure but maybe helm was also a no-op for "hub", as you only updated a dep of the workers
[11:46:18] but then why is it stuck in creating?
[11:47:06] wait I still see ingress-nginx in ansible's paws.yaml
[11:47:11] mmhh did you see this already?
[11:47:13] Warning FailedAttachVolume 6m36s attachdetach-controller Multi-Attach error for volume "pvc-287ed3a7-d043-4ad0-a233-727038d4b110" Volume is already exclusively attached to one node and can't be attached to another Warning FailedMount 2s (x3 over 4m33s) kubelet Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[pvc],
[11:47:20] failed to process volumes=[]: timed out waiting for the condition
[11:47:23] from kubectl describe pod hub-55869874bb-wr967 -n prod
[11:47:35] ye
[11:47:39] yep
[11:48:12] ok! nevermind
[11:48:31] shouldn't ingress-nginx be gone? could be related?
[11:48:50] no, ingress-nginx was removed from toolforge but we're still using it in paws
[11:48:56] ok
[11:51:07] ok pr-499 is correct for hub
[11:51:29] https://github.com/toolforge/paws/blob/main/paws/values.yaml#L253
[11:52:03] so in theory "hub" should not have been impacted at all by this deploy, which only updated the image for jupyterhub-singleuser: https://github.com/toolforge/paws/blob/main/paws/values.yaml#L253
[11:55:21] volume attachments are on node1, but the hub is running on node0
[11:55:41] lacking some colocation config, or failed to migrate it?
[11:56:39] not sure
[11:56:56] I have to jump into a meeting, will catch up later
[11:57:20] though perhaps we can delete the volumeattachment?
[12:03:13] godog: yes potentially, or we get the pod to be scheduled on the right node
[12:16:49] I tried to add an explicit nodeSelector to the deployment
[12:17:18] aaand we're back
[12:17:29] awesome!
[12:17:51] now, I have no idea what the proper fix for this should be :D
[12:18:21] and re-running my notebook I see 11.2.0 for pywikibot
[12:18:26] nice
[12:30:19] nice!
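A sketch of how one might chase the Multi-Attach error above and pin the hub pod back onto the node that already holds the volume; the node name is a hypothetical placeholder, and as noted at 12:17:51 the proper fix for the scheduling mismatch was still open at this point:

```
# Which node is the volume exclusively attached to?
kubectl get volumeattachments | grep pvc-287ed3a7-d043-4ad0-a233-727038d4b110

# ...and where did the hub pod actually land?
kubectl get pods -n prod -o wide | grep hub-

# Pin the deployment onto the node holding the attachment;
# "paws-nfs-node1" is a made-up hostname, kubernetes.io/hostname
# is the built-in node label.
kubectl patch deployment hub -n prod -p '
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: paws-nfs-node1
'
```

Deleting the stale VolumeAttachment, as floated at 11:57:20, would have been the other way out; the nodeSelector route just sidesteps the detach.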
[12:33:17] I will do codfw1dev later after checking with andrew what's the status of magnum there
[12:33:50] I was going to do toolforge now but it fails to build the image, Error: The runtime.txt file isn't supported (should use .python-version)
[12:35:38] should I just modify the pywikibot source: mv runtime.txt .python-version and remove 'python-' from the content?
[12:39:14] volans: yes, I think you are the first one to run that build with the new heroku:24 stack...
[12:39:28] hopefully that will be the only required change #fingerscrossed
[12:39:51] don't jinx it, you've already done a great job today for that :-P
[12:40:36] * dhinus hides
[12:41:13] the patch is just for the toolforge branch I guess?
[12:44:00] yes, runtime.txt is not even present upstream, it was added in the toolforge branch
[12:44:34] and do we need to keep the old 3.9? (python-3.9.18 currently)
[12:44:36] commit d9341693b, I would stick to the pattern of adding [toolforge] to the commit message
[12:45:18] about the version, not sure
[12:45:56] upstream is requires-python = ">=3.9.0" but then supports up to 3.15
[12:46:33] I think we can upgrade, worst case we can rebuild it later with an older version
[12:47:20] the heroku build stack should support up to 3.14
[12:47:20] I wonder what happens to all the tools that import it via nfs
[12:47:32] is that related?
[12:47:42] I don't think they will be using this image, but let me check
[12:48:25] I think this image is only for people following https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_Pywikibot_scripts
[12:48:54] it doesn't go into "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/" ?
[12:50:57] no idea what the "pywikibot" tool is doing, looking
[12:53:03] looks like that tool runs a nightly build of pywikibot using python:3.11, and publishes it at https://pywikibot.toolforge.org/
[12:53:21] but with a different system?
[12:53:32] or it should be broken too
[12:53:35] yes, not a buildpack build, just a python build
[12:53:40] k
[12:53:48] it doesn't build a docker image
[12:53:52] 3.11? 3.13? 3.14? preferences?
[12:54:12] I'd say 3.14 -- you can blame me when it inevitably breaks :D
[12:54:17] sure
[12:54:41] we could even skip the .python-version file and just let the buildpack use the default version
[12:56:30] https://gitlab.wikimedia.org/toolforge-repos/pywikibot-buildservice/-/merge_requests/2
[12:57:20] lgtm, give it a spin
[12:57:42] thx
[12:59:13] at least it passed that point
[12:59:14] let's see
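The buildpack fix discussed above comes down to a two-line change in the toolforge branch; a sketch, assuming the 3.14 choice sticks (the merge request linked at 12:56:30 is the authoritative change):

```
# heroku:24 builders reject runtime.txt in favour of .python-version,
# which wants a bare version number without the old "python-" prefix.
git mv runtime.txt .python-version
echo "3.14" > .python-version    # was: python-3.9.18
git commit -a -m "[toolforge] Replace runtime.txt with .python-version"
```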
[13:21:15] godog (or whoever), reprepro keeps telling me that there are 0 packages to import from the debian.net repo
[13:21:20] https://www.irccloud.com/pastebin/BPxoqabj/
[13:21:45] are you able to see the typo/broken path/whatever in updates that I'm unable to see?
[13:22:11] andrewbogott: mmhh not yet, checking
[13:22:15] andrewbogott: lmk if/when I can update pywikibot in paws in codfw1dev
[13:23:44] volans: hm... tbh last time I tried it hung in the same way that it hung for you in eqiad1... that just fixed itself without explanation, right?
[13:24:09] for me it kept timing out; when francesco tried, it worked but then broke paws
[13:24:25] because of a pod on a node that didn't have the mountpoint required for the pod to exist
[13:24:33] two great outcomes to look forward to!
[13:25:00] basically I just wanted to know if you were playing with magnum in a way that might break this even more
[13:25:04] I'm planning to work on that more this afternoon anyway so you should just leave it to me I think.
[13:25:06] or I should just skip
[13:25:11] ok
[13:25:46] I've been using it as my test case for the new driver, so both magnum /and/ the paws tofu config are in weird hacked states right now
[13:26:26] ok all yours
[13:27:23] godog: I spent a while yesterday thinking the issue was a lack of ListShellHook but when I bodged in a ".*" wildcard there it didn't change the behavior.
[13:27:45] andrewbogott: ack, ok I'll poke at it a little bit
[13:27:51] thanks
[13:32:06] andrewbogott: the trixie-wikimedia 'distributions' file is missing the components in the Update: section
[13:32:37] easier if I send a patch
[13:32:51] thx
[13:32:58] I assume it's missing for all four things in that case
[13:35:16] mmhh I think so yeah
[13:39:23] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278473 ?
[13:40:01] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278472
[13:41:08] I see, ok
[13:41:29] the naming between components and updates is clashing in this case, though my understanding is that the Update: section wants updates, not components
[13:43:16] we'll find out! Shall I do the merging?
[13:43:42] sure go ahead
[13:44:49] andrewbogott: what I did earlier to test is use the 'checkupdate' reprepro action FWIW
[13:45:33] reprepro -C thirdparty/openstack-trixie-flamingo-backports --noskipold checkupdate trixie-wikimedia
[13:45:44] that just said 'nothing to do' though, right?
[13:46:14] I did test my change temporarily before, then checkupdate reported things to do
[13:52:16] ok, now it is doing very many things. Thanks godog
[13:53:26] sure np! glad it works
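For reference, a sketch of what the fix above amounts to: each update rule defined in reprepro's conf/updates has to be listed in the distribution's Update: field, or "reprepro update" has no rules to apply and silently finds nothing. The stanza excerpt is illustrative (the real change is in the Gerrit patches linked above), patterned on the one component named in the log:

```
# conf/distributions needs the update-rule names (here named after
# the components) in the Update: field, roughly:
#
#   Codename: trixie-wikimedia
#   Components: main thirdparty/openstack-trixie-flamingo-backports ...
#   Update: thirdparty/openstack-trixie-flamingo-backports ...
#
# Dry-run check that the rules now see packages, without importing:
reprepro -C thirdparty/openstack-trixie-flamingo-backports --noskipold \
    checkupdate trixie-wikimedia
# Then the actual import:
reprepro --noskipold update trixie-wikimedia
```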
[14:22:13] here's the last bit https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276011
[14:24:05] thank you whoever took care of those puppet errors on syslog-server-audit0x
[14:58:05] yw
[14:58:16] i will be a minute or two late to the toolforge meeting
[15:37:10] proposed email about istio changes: https://etherpad.wikimedia.org/p/toolforge-gateway-api
[15:37:13] cc bd808: ^
[15:42:32] andrewbogott: could you look into https://phabricator.wikimedia.org/T424675 in the next 1-2 days? the intermediate expires on Sunday and a lot of European folks are out on Friday, so it would be good to have this wrapped up by Thursday
[15:43:43] mmmaybe, I suspect taavi would be quicker than me
[15:43:49] but I'll see what I can do
[15:48:13] taavi: looks ok to me. Thanks for that. It will help me feel like I'm not the only person who might know what changed.
[15:49:35] andrewbogott: you have found the singular timeslot when I actually could have a look - are you already on it, or do you want me to do it?
[15:49:54] I'm busy dealing with a designate failure so if you can do it please do
[15:50:24] godog: if you're still working can you log onto cloudcontrol1011 and help me understand why zookeeper is crashing and why systemd insists that it is not crashing?
[15:50:32] If you're already out for the day that's fine, I'll sort it out
[15:51:53] (same goes for volans)
[15:54:23] andrewbogott: I can briefly have a look
[15:55:08] my evidence that it isn't actually running: no java processes in ps, also connection refused for all clients who try to reach it (e.g. telnet localhost 2181)
[15:55:56] status active (exited)
[15:56:16] it doesn't have a systemd unit, it's an init script
[15:56:56] can I try to start it?
[15:57:18] sure
[15:57:31] you can do whatever, I'm testing on a different cloudcontrol
[15:58:54] andrewbogott: I see that there is a change at every puppet run, that's not a great sign
[15:58:57] https://puppetboard.wikimedia.org/report/cloudcontrol1011.eqiad.wmnet/6c7ed661cd019613d7593f58b2d7b1bf27b19c5c
[15:59:27] the mode change? Or the service restart?
[16:00:07] the start usually means that it is failing (and not self-restarting), and the mode shouldn't be there; who's modifying it?
[16:01:11] the same setup (?) is running in codfw1dev without issues, I'm going to doublecheck that puppet isn't doing that there...
[16:04:20] andrewbogott: I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278502 and all seems fine. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278503 is the equivalent for the prod hosts, but I don't want to merge it and then immediately disappear, can you take care of that?
[16:04:26] (or someone else)
[16:04:29] yeah, it doesn't happen in codfw1dev. So that's... interesting.
[16:04:43] taavi: yes, I will when I'm done with the problem of the moment
[16:04:53] thanks!
[16:05:47] so volans if you want an example of what it /should/ be doing you can look at cloudcontrol2010-dev
[16:08:56] I'm reverting things back to memcached for now, I think that will leave the zookeeper package for us to keep testing
[16:14:27] ack, can I still play with the host a bit?
[16:15:23] yep
[16:15:59] I'm assuming the issue is 'zookeeper exits right away' which will probably still happen when designate isn't using it
[16:22:23] andrewbogott: this is the error
[16:22:24] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[16:22:24] SLF4J: Defaulting to no-operation (NOP) logger implementation
[16:22:24] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[16:22:28] Invalid config, exiting abnormally
[16:22:44] how did you get it to log?
[16:22:49] unless I executed it in the wrong way (still possible)
[16:23:11] oh, running on the cli?
[16:23:12] I modified /etc/init.d/zookeeper to echo the start-stop-daemon line
[16:23:15] ok
[16:23:18] given that start-stop-daemon -v didn't show anything
[16:23:23] (replacing --quiet with -v)
[16:23:43] so I ran the java command as the zookeeper user to see if it was going
[16:23:46] and got that
[16:24:09] ofc I didn't do the pidfile part of start-stop-daemon but I doubt it matters for this
[16:24:30] will you paste the command?
[16:25:06] https://phabricator.wikimedia.org/P91808
[16:25:25] it's also in journalctl
[16:25:47] another bit: given the bad integration of the init script with systemd, you have to stop+start to do anything
[16:25:57] if it's in the start (failed) state a new start does nothing
[16:26:51] I diff'd the java packages installed on a working host and a non-working host, they're the same :/
[16:30:29] andrewbogott: /var/lib/zookeeper/myid doesn't have the id
[16:30:34] instead it has "replace this text with the cluster-unique zookeeper's instance id (1-255)"
[16:31:11] that's interesting!
[16:31:39] looking to see if that's puppetized on 2010-dev
[16:32:16] cloudcontrol2205-dev and 2006-dev have an id
[16:32:33] it's the host numeric part (2006) that is actually outside of the suggested range 1-255 though
[16:32:38] yeah, it's either set by zookeeper on (non-failing) startup or puppetized
[16:32:42] yeah, I noticed that too
[16:34:57] ok, yep, changing that to a valid value makes it start up
[16:35:19] So... seems like puppet should manage that, I'll look into that
[16:35:39] k
[16:35:40] A classic case of 'how did this ever work'
[16:35:42] thank you!
[16:35:47] no prob
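Pulling the debugging thread above together as a sketch: the exact java invocation volans used is in P91808, so the paths here are only what the Debian zookeeper package typically ships, and the id value is illustrative (it just has to be an integer in 1-255, unique within the cluster):

```
# Run the server by hand as the zookeeper user to see the error the
# init script swallows (start-stop-daemon -v printed nothing):
sudo -u zookeeper /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/zookeeper.jar \
    org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
# -> Invalid config, exiting abnormally

# The invalid bit: myid still held the packaged placeholder text.
cat /var/lib/zookeeper/myid

# Hand-fix (this is the part that should really be puppetized);
# note stop+start rather than restart, since the init script doesn't
# track its state well under systemd.
echo 11 | sudo tee /var/lib/zookeeper/myid
sudo /etc/init.d/zookeeper stop && sudo /etc/init.d/zookeeper start
```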