[07:44:52] can I get a (hopefully) easy +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180724
[07:47:50] XioNoX: could be useful to add what (broadly) the test vm would be used for?
[07:48:03] XioNoX: yes
[07:48:22] ah sorry I just saw the T396864 on the comment
[07:48:22] T396864: Routed Ganeti: same node DHCP limitation - https://phabricator.wikimedia.org/T396864
[07:48:24] nv
[07:48:24] sorry, fabfur, didn't see your comment here (though I'd say the link to the task was sufficient)
[07:48:50] Emperor: yes, sorry, I didn't see that before, anyway +1 as it was non-blocking anyway
[08:03:34] thx!
[08:10:07] FYI, I'm installing Java 17 security updates on the Puppet servers, these need an immediate restart of the Puppet server, so there will be a few Puppet failures (usually 10-20 fleet-wide), I'll spread these out to minimise impact
[14:35:29] nemo-yiannis: I propose 1) deploy restbase change, 2) figure out how to get mobileapps responsive, 3) implement ats plugin for beta that is basically "rest-gateway light" which is similar to what production does already but mapping to mobileapps directly instead of through another layer.
[14:35:53] I can do #3 but need some help with #1 and #2.
[14:35:59] Is the config.labs.yaml file in the repo actually used?
[14:37:50] maintenance would be less that way, since we either keep beta versions of 3 things (2 ats plugins + rest-gateway) and keep the service running, configured and updated, or keep a beta version of 1 thing (1 ats plugin).
[14:44:02] I am not sure if I understand, did we change the domains for beta? Because other than that, mobileapps should respond properly
[14:44:28] I added a comment on the task. the backend service doesn't seem to work. your example works because it is returning an old cached revision restbase has locally
[14:44:35] ah got it
[14:44:39] i can check the mobileapps on beta
[14:44:50] but yes we did change domains. wmflabs.org>wmcloud.org
[14:45:11] ok got it
[14:45:11] Gergo's patch might fix that, I'm not familiar with where that is decided/maintained.
[14:45:35] but that will, I suspect, only reveal the next issue which is that mobileapps is not responding since the domain restrict feature was added
[14:46:17] once the mobileapps service is working, I'm happy to implement the ats plugin for beta that routes /api/rest to mobileapps as-needed
[14:49:17] ref T402206
[14:49:18] T402206: HyperSwitch/errors/not found (404) on beta cluster: There was an issue displaying this preview - https://phabricator.wikimedia.org/T402206
[14:49:24] ok i think (?) that i can allow all domains on beta
[14:50:41] I assume that mechanism is defense in-depth given that it is not publicly exposed in prod or beta, I think?
[14:51:14] although I do see `https://mobileapps.wmflabs.org` defined.
[14:51:18] not sure what that's used for
[14:52:10] in prod we use `mobile_html_rest_api_base_uri: "//{{host}}/api/v1/"`
[14:52:15] seems like that should work in beta too
[15:00:11] hm.. looks like we already do that the same way in beta on the actual server. maybe the config.labs.yaml file isn't used?
[15:38:25] it's also referenced here: https://gerrit.wikimedia.org/g/mediawiki/services/restbase/deploy/+/1586262e70251e81a12ea0f01482b7e45e2b683c/scap/vars.yaml#26
[15:38:54] which appears to be a prod config (new domains are regularly added), but I assume prod restbase doesn't call beta.
[15:41:47] looks like puppet writes a different config file:
[15:41:48] https://gerrit.wikimedia.org/g/operations/puppet/+/9aaf502587ab8dc389bc6c0c0d3a283e346fa4d1/modules/service/manifests/node/config/scap3.pp#72
[15:42:05] https://gerrit.wikimedia.org/g/operations/puppet/+/9aaf502587ab8dc389bc6c0c0d3a283e346fa4d1/modules/profile/manifests/restbase.pp#141
[15:42:28] so I'm guessing somewhere both are read, and we don't want the underlying layer to be empty/incomplete?
[16:40:53] moritzm: hi, seems like you upgraded openssh-server fleet wide earlier today? seems like that uninstalled systemd-timesyncd and some other packages at least on cloudnet1005 for some reason
[16:42:57] AFAICT from debmonitor systemd-timesyncd is missing only on 34 hosts
[16:43:27] that's interesting -- what do they have in common? weird apt sources?
[16:47:29] cloudnet[1005-1006].eqiad.wmnet and dns* hosts AFAICT
[16:47:54] for DNS that is expected I think, as they run 'proper' NTP daemons
[16:48:13] volans: on the host we're looking at (cloudnet1005) systemd-timesyncd is installed, but puppet wants to upgrade it and can't
[16:48:28] So the symptom in that case isn't a missing package but a puppet failure
[16:48:54] it shows as rc systemd-timesyncd
[16:48:59] same on cloudnet1006 but /not/ on cloudnet2005-dev which should have identical packages
[16:49:06] oh you're right, nm
[16:49:16] it is not installed. it was, but the openssh upgrade made apt solver think removing that was the correct choice
[16:49:44] so, the good news is that it doesn't seem to be a widespread issue
[16:49:51] yeah
[16:50:06] apparently these hosts have a /etc/apt/preferences.d/systemd.pref with `release n=bookworm`??
[16:50:32] that could explain this, now that the packages are coming from the security repo
[16:51:00] /var/log/apt/history.log is clear on what happened, less on the why
[16:52:22] andrewbogott: I think the systemd apt pin in openstack::serverpackages::epoxy::bookworm is to blame, it's excluding the systemd package in bookworm-security
[16:52:54] hm, why is it not happening on other epoxy/openstack hosts then?
[16:52:56] * andrewbogott makes sure it isn't
[16:54:42] it isn't
[17:02:51] cloudnet1006 is the standby, I tried removing the epoxy pins and retried, still can't install timesyncd
[17:03:05] so I don't think it's related to the osbpo
[17:05:52] did you try removing the systemd pin?
[17:09:13] that seems to be doing something. That pin is related to the epoxy osbpo somehow?
[17:14:23] that pin is defined in the server packages class. the comment references T247013 from 2020.
[17:14:23] T247013: cloudservices1003/1004: Warning: NTP not enabled! - https://phabricator.wikimedia.org/T247013
[17:15:09] the pins seem odd to me (in general these hosts should require an explicit opt-in for backports anyway), but are also buggy as they exclude the -updates and -security repos as I said earlier
[17:15:39] if they're really still needed then they could be reversed to set a low/negative priority for -backports, but I suspect they could be dropped entirely
[17:16:40] yes, the reasons described in that thread seem very no-longer-applicable
[17:36:27] not sure what led apt to uninstall these, but the systemd pref indeed seems to be the root cause
[17:37:46] it's probably from a time when the openstack repo included systemd as well?
[21:26:56] heads-up that we're depooling eqiad search again. We've suppressed alerts for the next 2 hrs as well, but do reach out if you see anything
[21:50:24] OK, eqiad is repooled and everything is looking good
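
Editor's note on the apt pin discussed between 16:50 and 17:16: the point that a `release n=bookworm` pin leaves out the -updates and -security repos can be illustrated with a small sketch. This is a simplified model, not apt's actual resolver: it only compares the pin's codename pattern against each archive's codename, using glob matching as described in apt_preferences(5). The function names and the `bookworm*` pattern are illustrative assumptions, not taken from the puppet class, and the sketch does not explain why the solver chose to remove systemd-timesyncd.

```python
from fnmatch import fnmatchcase

# Codenames published in the Release files of the three Debian bookworm archives.
SUITES = ["bookworm", "bookworm-updates", "bookworm-security"]


def pin_matches(pin_codename: str, release_codename: str) -> bool:
    """Simplified model of `Pin: release n=<pin_codename>`.

    Hypothetical helper: treats the pin value as a case-sensitive glob
    and ignores every other pin field (origin, label, version, ...).
    """
    return fnmatchcase(release_codename, pin_codename)


def matching_suites(pin_codename: str) -> list[str]:
    """Return the archives from SUITES that the pin would select."""
    return [suite for suite in SUITES if pin_matches(pin_codename, suite)]


if __name__ == "__main__":
    # The pin as quoted in the log versus a hypothetical glob variant.
    for pin in ("bookworm", "bookworm*"):
        print(f"n={pin} selects {matching_suites(pin)}")
```

Under this model the pin quoted in the log selects only the base `bookworm` archive, which is consistent with the observation that packages now coming from bookworm-security fall outside it.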