[08:56:10] hm... how do I get aptly to add a new arch to a repo?
[08:58:48] hmm, at least https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/aptly/manifests/repo.pp#21 needs adjusting
[09:06:50] we might need https://github.com/aptly-dev/aptly/pull/1366 for it :/
[09:13:39] hmmm.....m.m.m. maybe creating a new one with the new arch too, copying all the packages, dropping the old one and recreating it again with all the archs, and copying all the packages back? To avoid losing the packages?
[09:14:21] * dcaro does a backup
[09:14:35] AIUI on aptly, you can delete and re-create the publish config without deleting the actual repo the packages are on
[09:14:59] tbh I've for a while wanted to move that aptly instance to use reprepro instead (the same software that apt.wikimedia.org uses)
[09:15:08] this seems to point otherwise https://github.com/aptly-dev/aptly/issues/1242#issuecomment-1998161431
[09:15:45] huh
[09:16:12] that's, uh, not great
[09:16:34] that aptly fix is in v1.6.0 and newer, which will be in trixie
[09:18:01] yep
[09:18:23] I can try installing the newer aptly by hand?
[09:19:40] the trixie packages won't run on bookworm as-is due to glibc linking, but I can try rebuilding them on bookworm
[09:20:27] it seems to work though
[09:20:31] https://www.irccloud.com/pastebin/ZPtENQOB/
[09:20:41] but it might crash at some other point, better to rebuild, yep
[09:20:43] sbuild-build-depends-main-dummy : Depends: golang-github-aws-aws-sdk-go-v2-dev (>= 1.24.1) but 1.17.1-3 is to be installed
[09:20:43] Depends: golang-github-rs-zerolog-dev (>= 1.29.1) but 1.26.1-2 is to be installed
[09:20:43] Depends: golang-etcd-server-dev (>= 3.5.15-7) but it is not going to be installed
[09:21:24] hmm? depends on etcd?
[09:22:07] apparently yeah. I'm as surprised as you
[09:28:25] I'll try creating a new "backup" publish for the old jessie repo with all the archs, remove the old publish, and recreate the old publish adding the extra arch, then clean up the "backup" publish; that might allow me to replace the old publish with the new arch without ending up removing the packages
[09:32:55] got both now:
[09:32:58] https://www.irccloud.com/pastebin/kpHGqRaa/
[09:35:43] recreated the old one, and the packages are still there \o/
[09:35:45] https://usercontent.irccloud-cdn.com/file/cCu6PWdw/image.png
[09:35:49] with the new arch
[09:37:09] okok, I'll do the same with the others, and change puppet too :)
[09:39:49] the puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167551
[09:41:40] this is such a rabbit hole of a tiny change
[09:41:47] first, why do we still have a jessie-tools repo around
[09:42:18] second, apparently there's role::aptly::server which seems rather clearly not to be in use anymore as it hasn't been updated since stretch, but also is applied to some deployment-prep instances
[09:42:25] anyway, +1
[09:42:50] xd yep
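
For reference, the "backup publish" dance described above maps roughly to the following aptly commands (a minimal sketch; the repo name, temporary prefix and architecture list are placeholders rather than the values actually used on the server, and gpg/signing options are omitted):

    # Publish the same local repo a second time under a temporary prefix, already
    # listing the extra architecture, so the packages stay referenced by a publish.
    aptly publish repo -architectures="amd64,arm64" -distribution="jessie-tools" jessie-tools backup
    # Drop and re-create the main publish with the new architecture list.
    aptly publish drop jessie-tools
    aptly publish repo -architectures="amd64,arm64" -distribution="jessie-tools" jessie-tools
    # Once the main publish looks right, drop the temporary one and sanity-check the repo.
    aptly publish drop jessie-tools backup
    aptly repo show -with-packages jessie-tools
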
[10:48:11] * dcaro lunch
[11:39:13] dcaro: the new misctools binary seems to have made it to buster-tools, which is making puppet fail on sgebastion-10
[11:52:25] hmm... puppet should not be trying to update the package, no?
[11:55:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167590
[11:59:37] yes, but there should not be a version in buster-tools to try to update to that doesn't work on buster
[12:01:57] I can manually remove it, though we are going to get rid of it anyhow soon enough, imo it's not a big issue if there's one. No new VMs, or any autoupgrade are expected for buster
[12:05:08] manually removed
[12:05:22] the patch is still good though
[12:07:17] does the cookbook also need fixing to not put new misctools releases in there in the future?
[12:09:00] yep, mind me asking what's the issue you foresee if we put them there?
[12:10:42] even if you remove the auto-update bit from puppet, unattended-upgrades will try to upgrade when it runs nightly
[12:12:23] it should not, it does not upgrade the toolforge clis, no?
[12:12:35] (at least I was operating with that in mind :/ )
[12:13:09] it should https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/aptly/manifests/client.pp#23
[12:13:39] and in general I consider any situation where the latest packages in the apt repo are not in an immediately installable state a bug
[12:14:28] (that's why there was a separate cookbook to copy things from $FOO-toolsbeta to $FOO-tools after testing, which was apparently removed while I was away)
[12:17:36] this should avoid that then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167594
[12:17:50] we don't want the clis to auto-upgrade, we want to closely control the version they are on
[12:17:58] (and upgrade only on demand)
[12:18:51] sorry, botched the rebase...
[12:19:02] I am opposed to that change, before the auto-upgrades were added we had too many issues where a package was upgraded on one host but not the other or similar, leading to user-facing issues
[12:19:15] again, the way to do this is on the apt repository level
[12:19:24] that does not happen anymore, as the packages are upgraded by the cookbook on all the bastions
[12:22:11] (and gets the bastions from cumin/puppetdb)
[12:22:27] wait no, it gets them from the inventory
[12:22:37] (a bit less nice, but still ok imo)
[12:22:57] if the only thing operating on the installed versions of the packages is the cookbook, which immediately installs the latest versions, then I don't see how the bastions would get in a state where unattended-upgrades would be harmful?
[12:23:16] it installs a specific version
[12:23:27] the one you give by the branch you specify in the deploy cookbook
[12:23:39] that allows rollbacks
[12:24:07] (btw. the packages are not moved to the tools repos until you deploy on tools itself, so deploying on toolsbeta does not put the packages in the tools repos)
[12:27:23] if a change needs to be rolled back long term, that should be done by publishing a new release imho
[12:27:43] the packages are/will be used in a variety of containers, development environments, etc, and you can't rely on a manual downgrade in any of those
[12:27:50] yep, though now that 'long term' can be more than what's left for the unattended upgrades to run
[12:28:58] for me "long term" here is anything longer than what it takes to publish that fixed or rolled back release
[12:29:42] both can happen at the same time
[12:35:51] especially across time zones
[12:36:37] ideally we would have the versions set by config somewhere (ex. hiera, from the toolforge-deploy script), and puppet would enforce that version specifically, but we are not yet there
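
In the meantime, the "closely control the version" approach boils down to pinning or holding the packages on the hosts; a minimal sketch with plain apt (the package name and version are made-up placeholders, and this only illustrates the idea, not what the deploy cookbook actually runs):

    # Install the exact version chosen for this deploy (roughly what the cookbook is described as doing above).
    apt-get install --allow-downgrades toolforge-jobs-framework-cli=15.0.0
    # Keep unattended-upgrades (and a plain "apt-get upgrade") from moving it.
    apt-mark hold toolforge-jobs-framework-cli
    # ...and release the hold again right before the next cookbook-driven upgrade.
    apt-mark unhold toolforge-jobs-framework-cli
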
[12:46:21] "the packages are/will be used in a variety of containers, development environments, etc, and you can't rely on a manual downgrade in any of those" I'm not sure this is true though, clis will be distributed as binaries eventually, not through debian repos
[12:56:44] for the package name change https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/3
[12:58:17] there was some metadata you can put specifying that it replaces the old one right?
[13:00:57] taavi: thanks for the pointer there
[13:01:25] hmm... I'll do the same for jobs-cli now that I'm at it
[13:01:27] i think you missed Provides:
[13:01:50] yep :) I sent the patch before seeing the comment, just added it
[13:01:52] * taavi still uses jobs-framework-* as the names of his local clones
[13:02:08] ah I thought you were just super fast with the fix, that makes more sense :D
[13:02:15] it will help remove a bunch of custom logic to handle packagename-component mappings xd
[13:02:57] if you're in the mood of renaming things, the gitlab repo name for webservice has annoyed me for a while :-)
[13:03:38] what should the name be?
[13:03:50] toolforge-webservice?
[13:03:54] yeah
[13:03:56] ack
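
The metadata being referenced is the standard Debian rename pattern: the new package declares Replaces/Breaks on the old name plus a Provides for it, so upgrades swap one package for the other cleanly. A sketch of what such a stanza could look like in debian/control (package names and the version bound are illustrative, not copied from the actual merge request):

    # illustrative names/version only - adjust to the real packages
    Package: toolforge-misctools
    Provides: misctools
    Replaces: misctools (<< 1.50)
    Breaks: misctools (<< 1.50)
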
[13:08:05] I tested the offboarding checklist and I think it's good, T399068
[13:08:06] T399068: WMCS Offboarding: Chuck Onwumelu - https://phabricator.wikimedia.org/T399068
[13:08:41] most things were already done, one thing I'm missing is the cloud@ mailing list, can somebody make me an admin there?
[13:09:09] I'll remove wikitech-l, ops and ops-private from the checklist, I think they can be handled by the SRE team
[13:09:51] tbh wikitech-l is a public list, not sure why people need to be removed there in the first place
[13:10:05] my thinking was that if it's a WMF email, it will bounce
[13:10:14] I'm assuming the email will be deactivated
[13:10:20] yeah, although Mailman will catch the bounces and take care of it automatically
[13:11:01] for cloud@ we do want to check for potential moderator/owner access. speaking of that, I don't seem to be one myself either
[13:23:52] dcaro: if you have a change to look at https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1166955 then I should be able to start upgrading eqiad ceph things to bookworm. It's pretty quick with that partman change!
[13:27:55] 👀
[13:32:21] andrewbogott: wikitech-static did not sync overnight :/
[13:32:35] dang, I'll see what's up
[13:32:39] thx
[13:34:14] * andrewbogott just runs the sync on the cli
[13:34:40] andrewbogott: the patch looks good, added some nits (feel free to ignore), and a question, but you'll see if it's useful when using it better :)
[13:36:28] As you can see from 'patchset 49,' I have already tried it many times!
[13:36:32] thanks for the review.
[13:57:50] the MaxConntrack alert is again diffscan02 opening MANY connections, who's managing that VM? T399050
[13:57:50] T399050: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050
[13:59:15] last year the same happened, then it looks like the number just went down on its own (T355222)
[13:59:16] T355222: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1043:9100 - https://phabricator.wikimedia.org/T355222
[14:19:58] dhinus: I've been shoving VMs gradually off 1067 and noticing that it didn't help. Did you have a dashboard or something that pointed you to the diffscan host or did you just remember that it was a problem in the past?
[14:24:40] andrewbogott: see comments in T399050, "conntrack -L |grep 172.16.3.44" shows ~80% of conns are from that host
[14:24:41] T399050: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050
[14:24:55] I checked that one first because it was the one that caused issues last year
[14:25:36] I tried to identify other VMs explaining the remaining 20% but that's harder :)
[14:26:16] dhinus: obviously your approach was more useful than mine!
[14:26:52] we could probably do some |awk|sort magic to order by IP :)
[14:29:20] the limit is 524288, so 283593 is not enough to trigger the alert. there might be something else that's new
[14:33:23] this might give you what you ask `conntrack -L | grep -o 'src=[^ ]*' | sort | uniq -c | sort -n`
[15:03:40] dcaro: yep that worked, I pasted the results in the task!
[15:06:00] I'm now thinking diffscan is stable, and it's something else that tripped the alert, maybe PAWS? T399050
[15:06:01] T399050: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050
[15:10:14] I'm running the ceph mon upgrade cookbook which will reboot each mon in turn, there may be brief 'down' alerts
[15:10:40] ack
[15:13:17] andrewbogott: I created a silence matching service=~.*ceph.*, should catch most of the leaks I think, but not sure, let me know if it does note
[15:13:19] *not
[15:13:53] thanks!
[15:28:31] looks like there are many connections between cloudvirt1067 and cloudvirt1062 on port 4789, not sure what that is but it could explain the alert
[15:32:59] probably a vxlan tunnel, as that uses port 4789
[15:39:45] in itself that is only 10% of the connection limit, so maybe we should go back to the original plan and throttle diffscan
[15:39:51] I'll add some notes to T399050
[15:39:52] T399050: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050
[16:03:58] * dhinus offline
[16:21:01] dhinus: part of the issue with wikitech-static was https://gerrit.wikimedia.org/r/c/operations/wikitech-static/+/1167671
[16:21:20] it seems to be syncing now; we'll see if that clears the alert
[16:35:46] for a second I thought that the typo was 'pyp' -> 'php'
[16:36:49] oops
[17:14:21] * dcaro off
[17:14:24] cya tomorrow!
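
As a footnote to the conntrack one-liner above: the same aggregation trick, grouped by destination port instead of source address, is one way traffic like the port-4789/vxlan flows mentioned at 15:28 could be spotted (a sketch, not a command taken from the log):

    # conntrack prints a dport for both directions of each flow, so treat the counts as a rough breakdown
    conntrack -L | grep -o 'dport=[0-9]*' | sort | uniq -c | sort -n | tail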