[08:45:11] hey folks, just a heads up - me and Riccardo reworked the tox settings for spicerack a little, it should be faster now both locally and in CI. All tracked in https://phabricator.wikimedia.org/T420475, I suggest cleaning up your .tox spicerack dir the next time you work on it. We are still working on reducing the timing further, we'll keep you posted :)
[08:45:26] 👍
[08:47:45] nice!
[09:10:08] hey folks, i'm trying to understand the difference between 'critical' and 'warning' alerts. is the difference primarily how those alerts are displayed on alerts.wikimedia.org? or is there more to it?
[09:13:50] icinga or alertmanager? in general warnings don't notify on IRC
[09:14:01] alertmanager
[09:14:22] hm, i see
[09:16:40] but you can also have alerts that create phab tasks for example
[09:17:16] so depends what you want to achieve, but I'll 301 to o11y for the details as I'm not too familiar with the details
[09:32:33] thanks!
[10:17:31] bjensen: it depends on how the routing is configured in https://github.com/wikimedia/operations-puppet/blob/production/modules/alertmanager/templates/alertmanager.yml.erb each team has their own policies, kind of
[10:32:04] ah, okay, so not a standard, makes sense
[10:32:21] elukey: I've updated T423286, but: I tried ms-be2069 without any firmware upgrades today, and it's the same failure mode - installer works, reboots fine, but after the initial puppet run it's unbootable, hanging at "GRUB " forever :(
[10:32:21] T423286: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286
[10:35:19] Emperor: very weird, at this point I'd try to target a different os to see the difference. IIUC you are targeting bullseye but we'll have to move away from it during the next 2/3 months anyway, I am wondering if bookworm and/or trixie make any difference (namely, I am trying to exclude variables like you did with the firmware upgrade)
[10:38:28] elukey: I can try another OS, but we realistically can't move swift off bullseye in the near future (constructing a test cluster to even test the process is a goal for this quarter)
[10:38:45] I'll give trixie a go on ms-be2069
[14:07:51] In testing the Wikifunctions k8s staging services, curl is saying the certificates have expired. Is this a known thing? Should I file a task?
[14:10:31] elukey: Might this be related to your work on cert-manager?
[14:18:33] James_F: yes yes my fault sorry, I was testing something that it sound working :D I am going to revert later on, sorry for the trouble
[14:18:44] for the moment you can just accept those certs expired
[14:18:57] How do I do that? curl -k still just reports the error.
[14:21:54] is curl saying the certs have expired or is curl proxying an upstream reverse proxy telling another reverse proxy that its certs have expired 😅
[14:22:02] Probably the latter.
[14:22:18] can you `curl -v` ?
[14:23:03] you can `|& phaste` if you don't wanna read all that
[14:23:20] https://phabricator.wikimedia.org/P90793
[14:23:44] ah ok so it is curl directly, while talking to what must be the staging ingress
[14:24:00] and then you ignore that, and then you get an openssl error from envoy
[14:25:07] Yes, if I don't pass in -k I get roughly the same error but formatted differently, presumably from… whatever is between me and the staging ingress.
[14:31:52] James_F: will revert in ~30 mins if it is ok for you
[14:32:06] elukey: No worries, I've reverted my attempted deploy.
[14:57:28] James_F: should be fixed now or in few mins, cert-manager is up and certs should renew now
[15:33:06] elukey: Confirming it's fixed, thanks!
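Editorial note on the certificate thread above: a minimal sketch of checking whether a certificate's expiry date has passed, which is one way to confirm what `curl -v` reports. It assumes the expiry is available as an OpenSSL-style `notAfter` string (the format Python's `ssl.SSLSocket.getpeercert()` returns); the helper names `cert_expiry` and `is_expired` are hypothetical, and nothing here reflects how cert-manager or the staging ingress actually work.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional


def cert_expiry(not_after: str) -> datetime:
    """Parse an OpenSSL-style 'notAfter' string into an aware UTC datetime."""
    # ssl.cert_time_to_seconds expects e.g. "Jan  5 09:34:43 2018 GMT"
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def is_expired(not_after: str, now: Optional[datetime] = None) -> bool:
    """Return True if the certificate expiry lies before 'now' (default: current UTC time)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return cert_expiry(not_after) < now
```

Passing `now` explicitly keeps the check reproducible; in practice the string would come from whatever tool or API exposes the peer certificate.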
[16:21:17] mutante: could use some help/ideas around what to do with locks at T421147
[16:21:18] T421147: Codesearch stuck at Feb 12th? - https://phabricator.wikimedia.org/T421147
[16:21:42] do you know of any cases where a git lock is not safe to delete – if we take as given there are no running git processes?
[19:25:12] One of my favorites is when the syslog of a crashed server just says "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@" -- does that /mean/ something? Why that character in particular?
[19:25:37] andrewbogott: that's often how NUL (0) renders in a terminal
[19:26:05] yeah. So it really is just the server going 'ummmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'
[19:26:26] not a userful diagnostic :(
[19:26:30] *useful
[19:38:26] I've seen that quite a few times after a disk failure
[19:40:16] disk failure would certainly explain the lack of log messages
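Editorial note on the `^@` discussion above: since `^@` is how a NUL byte (0x00) is commonly rendered, a short sketch like the following could locate such runs in a log that has been read as raw bytes. The function name `find_nul_runs` and the `min_len` threshold of 16 are assumptions for illustration, not an established tool.

```python
import re
from typing import List, Tuple

# One or more consecutive NUL bytes
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_len: int = 16) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of at least min_len NUL bytes."""
    return [
        (match.start(), len(match.group()))
        for match in NUL_RUN.finditer(data)
        if len(match.group()) >= min_len
    ]
```

For a real file the bytes would come from something like `open(path, "rb").read()`. The offset of a run shows where the intact log entries stop, which may help narrow down when writes stopped reaching the disk.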