[06:56:19] greetings
[08:15:54] I'm seeking reviews/feedback on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203383 and the deployment plan at https://phabricator.wikimedia.org/T399180#11310845
[08:41:51] morning
[08:42:44] godog: what's the process in case the host does not come back from the network flip?
[08:43:50] dcaro: good question, revert the patch and restore connectivity from the console, I'll update the task
[08:46:41] netbox looks ok (interface tagged with both vlans), how does puppet select the interface to use?
[08:46:51] actually moved the procedures to the task description
[08:47:03] it reads the public interface configuration
[08:47:13] sending a pcc so it is clearer
[08:47:35] ack, the ones tagged are eno.*np0 from netbox it seems
[08:48:45] indeed, I picked 1048 and 1049, which I verified are already configured in netbox; we got T409690 to audit all cloudcephosd hosts
[08:48:46] T409690: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690
[08:49:52] https://puppet-compiler.wmflabs.org/output/1203383/7604/cloudcephosd1048.eqiad.wmnet/index.html
[08:50:46] thanks!
[08:53:51] PCC looks ok, the interfaces in 1048 are also 25G so good, LGTM to go ahead, though if you want a deep review you might want to wait for someone more network-saavi
[08:54:32] (specifically, the process to set up the vlan through puppet)
[08:54:44] I see what you did there re: saavi
[08:55:32] xd
[08:55:38] thank you, 1050 and 1051 are already live with single_iface: true, I'll be merging the patch next week since I'm off tomorrow afternoon and fri
[08:56:40] were those also migrated? or rebuilt from scratch?
[08:58:09] we did the single_iface: true migration post-reimage, and then put ceph weight on them
[08:58:46] then the process is well tested, I'm quite confident then
[09:00:07] yes indeed
[09:27:00] quick review for toolviews alerts to fire after 2.5h instead of 1.5h (to allow a single failure+retry) https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/48
[09:31:39] dcaro: LGTM, modulo tests
[09:36:17] thanks, fixed the tests
[10:26:32] dcaro: the main issue with that alert change is that the toolviews script processes the last rotated file every time it runs, so we're losing data if it doesn't run for a specific hour
[10:26:58] I will try to see why it's been failing, it's not supposed to be doing that
[10:28:21] it was failing due to the connection to mysql failing
[10:29:18] or hm, the 'last run failed' alert should catch that
[10:29:29] so it might be fine, if it's just something weird with how systemd schedules that unit
[10:30:39] what about not triggering it from logrotate, but from a timer, and just checking if the current log.1 is already processed?
[10:30:55] that might be an option
[10:31:24] might need some patches though
[10:31:29] godog: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203383, is there some mechanism to bring the old interface down when applying the change? otherwise we'll end up with the same IP on both ifaces
[10:32:29] a-ha
[10:32:35] found the "issue"
[10:32:43] the logrotate unit, even when it runs 'hourly', has RandomizedDelaySec=1h
[10:33:37] 🤦‍♂️
[10:34:14] hmm... does that mean that it could run twice one after the other?
[10:35:23] seems to be wildly inconsistent, the shortest interval I can quickly see in the logs is about twenty minutes
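
(For context on the RandomizedDelaySec exchange above: systemd re-rolls the delay on every trigger, so an 'hourly' timer with a one-hour random delay can fire anywhere in each hour-long window, and two consecutive runs can land minutes or almost two hours apart. A minimal sketch of the relevant timer settings; the unit name is illustrative, not the actual Toolforge unit:)

    # toolviews-logrotate.timer (illustrative name)
    [Unit]
    Description=Hourly logrotate run with a randomized delay

    [Timer]
    OnCalendar=hourly
    # systemd picks a fresh random delay in [0, 1h] for every trigger, so the
    # effective interval between two runs ranges from near zero up to ~2 hours.
    RandomizedDelaySec=1h
    # FixedRandomDelay=true (systemd >= 247) would instead pin one stable offset
    # per host, restoring an even hourly cadence while still staggering hosts.

    [Install]
    WantedBy=timers.target
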
[10:47:07] taavi: yes indeed, via exec { 'bring-down-extra-iface'
[10:48:45] ah cool, then +1 to give that a go
[10:50:51] sweet, thank you
[12:22:14] the alert ToolforgeToolviewsStale is very flappy, does anybody know why?
[12:22:31] see above :-)
[12:22:50] tl;dr it's due to how the systemd timer is defined, https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/48 fixed it
[13:13:40] sorry, I didn't read the scrollback :)
[13:36:42] * dhinus paged: ToolsDBWritableState
[13:36:44] looking
[13:38:30] it's the page from yesterday that was not resolved
[13:38:42] resolved now
[13:42:27] that's a bit annoying :/ (the retriggering)
[13:42:37] I started seeing this in the maintain-dbusers logs
[13:42:39] https://www.irccloud.com/pastebin/eNOdNHTa/
[13:43:05] is anyone familiar with those? (before I start looking into it, should they be unreachable or something?)
[13:43:46] dcaro: i think those are the new clouddb hosts still in setup, cc dhinus ^
[13:44:08] ack
[13:51:44] yes, new hosts that manuel is setting up
[13:55:32] ack thanks
[13:58:54] easy patch adding a last-run stat to maintain-dbusers https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204381
[13:59:10] and the matching alerts https://gerrit.wikimedia.org/r/c/operations/alerts/+/1204575
[14:01:28] folks will join in 5, apologies
[15:51:15] andrewbogott: would you be opposed to flicking the interface MTU setting on all eqiad1 cloudvirts now, or would you prefer me to do it more incrementally? it causes a ~5s network hiccup per cloudvirt during the puppet agent run, and it's already live on all of codfw1dev and a single eqiad1 cloudvirt since yesterday
[15:52:38] taavi: I'm tempted to say we should wait for the Trixie rebuilds, but of course I don't yet know when that's going to happen...
[15:56:14] If we have actual things that are failing today due to the broken MTU then I'll revise my answer
[15:57:42] my main argument for doing it today is that the more we wait, the more people migrating from old bullseye instances to the new network will hit this issue
[15:57:59] and given it takes a full VM stop-start cycle to fix the interface MTU
[16:02:08] If you want to do it today or tomorrow that's fine, can you send an email beforehand along the lines of "this is going to happen, you might notice, you don't need to do anything"?
[16:03:23] ack, will do that tomorrow then
[16:04:23] thx
[16:05:11] happy wikiversary, komla!
[16:08:30] thank you!
[16:22:45] the patch for doing that will be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204623/
[16:47:32] dhinus: andrewbogott quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204381 https://gerrit.wikimedia.org/r/c/operations/alerts/+/1204575
[17:01:27] dhinus: thanks! replied on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204381, will change the other but it depends on that one
[17:14:06] * dcaro off
[17:20:12] +1'd both patches
[17:20:16] * dhinus off
[17:57:20] bd808: I did a bunch of experimentation for T409474 and I'm fairly sure there's no ingress-nginx setting to do redirects that include the current request path without turning off the new validation features.
[17:57:20] T409474: Reduce tool breakage over new ingress-nginx annotation validation rules - https://phabricator.wikimedia.org/T409474
[17:57:29] so I'm all ears if you have thoughts about what's the best way forward there.
[18:38:17] taavi: I was thinking about a build system container that does a redirect based on an envvar for config.
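
(A minimal sketch of that redirect-container idea in plain Python stdlib; REDIRECT_TARGET and PORT are hypothetical envvar names used for illustration, not an existing Toolforge interface:)

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical config: base URL to redirect to; the request path is appended.
    TARGET = os.environ["REDIRECT_TARGET"].rstrip("/")
    PORT = int(os.environ.get("PORT", "8000"))

    class Redirect(BaseHTTPRequestHandler):
        def do_GET(self):
            # Append the incoming path (and query string) to the target, which is
            # the part that the ingress-nginx redirect annotations can't express
            # without turning off the new validation rules.
            self.send_response(308)  # 308 preserves the request method
            self.send_header("Location", TARGET + self.path)
            self.end_headers()

        do_HEAD = do_GET
        do_POST = do_GET

    if __name__ == "__main__":
        HTTPServer(("", PORT), Redirect).serve_forever()
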
[21:54:22] apologies to people who are getting a renewed flood of wikitech-static alerts; I turned them back on to find out if my rate-limiting change worked, and it clearly did not.