[06:43:38] <Amir1>	 <copied from -operations> beta cluster only patch to fix url shortener in beta cluster. Any SRE willing to merge please 🥺 https://gerrit.wikimedia.org/r/c/operations/puppet/+/673389
[06:43:49] <Amir1>	 tested already on the vm
[08:58:22] <Majavah>	 hi again, one more betacluster hiera patch in operations/puppet: https://gerrit.wikimedia.org/r/673047
[08:59:39] <volans>	 Majavah: I can take care of it, are those restbase instance being removed?
[09:00:04] <Majavah>	 volans: yes, those were Jessie hosts that are being removed
[09:00:35] <volans>	 k
[09:00:37] <volans>	 merging
[09:00:45] <Majavah>	 ty
[09:01:34] <volans>	 Majavah: {done}
[09:38:45] <jynus>	 anyone has any suggestion of where .sql files (for user consumption, no automation yet) should be put on a debian filesystem? /usr/share/doc/packagename ?
[09:49:29] <akosiaris>	 if they aren't meant to be consumed by something in the package but just users's eye, yes
[09:50:18] <akosiaris>	 otherwise, /usr/share/<package> would be more appropriate.
[10:18:03] <jynus>	 thank you very much, alex!
[10:28:08] <volans>	 klausman, elukey: cumin alias A:ml-serve doens't match any host, did maybe the role name change? (O:ml_serve currently)
[10:28:22] <volans>	 it seems they were converted to ml_k8s::{worker,master} from a cursory loo
[10:28:25] <volans>	 k
[10:29:02] <klausman>	 Yes, that was likely borken by one of my recent CLs
[10:30:00] <elukey>	 volans: ahahah I was just wondering how much time would have passed between the email and you pinging us :D
[10:30:24] <volans>	 elukey: I'm also on clinic duty this week, so you were doomed from the start :-P
[10:30:36] <elukey>	 jokes aside, I am working on a fix, hopefully puppet will be happier soon
[10:30:41] <volans>	  ack
[10:30:57] <volans>	 but I think this is different from the logstash plugin stuff
[10:31:17] <volans>	 I didn't ping you for the broken puppet because I saw it's already being worked ;)
[10:33:00] <elukey>	 volans: you are right, we changed the role name! Silly me
[10:33:14] <elukey>	 fixing
[10:33:32] <volans>	 thx! <3
[12:01:18] <volans>	 dcaro, legoktm and others: FYI wrt the pip backgracking issue in some of our python tooling repositories, I've just sent a proposal fix for wmflib. Once we agree on the solution I'll propagate the fix to the other repos as well.
[12:02:41] * volans bbl
[12:13:45] <Majavah>	 moritzm: hi, re T161675 I'm not exactly sure what you're suggesting for apertium/recommendation-api (my understanding is that graphoid isn't needed anymore), both of those are still running on deployment-sca* and restbase is configured to use those hosts, I think those services need to be moved somewhere else before removing deployment-sca hosts
[12:13:45] <stashbot>	 T161675: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675
[12:14:09] <Majavah>	 oops wrong ticket, meant T218729
[12:14:10] <stashbot>	 T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster - https://phabricator.wikimedia.org/T218729
[12:16:36] <moritzm>	 but there's no place to move them in beta given that it lacks k8s :-) so I think removing them is the only option at this point, the remaining jessie support in puppet.git will need to be removed in a few weeks
[12:17:53] <moritzm>	 sure, ideally that would move to a separate k8s cluster for beta following what production does, but in the absences of that I don't see another option?
[12:18:07] * Majavah wonders what exactly would break if those services would be unavailable
[12:18:25] <Majavah>	 yes, beta will need a k8s cluster at some point, especially when mediawiki is moving to k8s
[12:19:31] <moritzm>	 it's hard to tell, we could just shut off the VMs for a week and if there are no complaints, delete them next Friday?
[12:19:55] <Majavah>	 fwiw puppet on those hosts is already broken, so I'm not exactly sure what would happen if those were still around when jessie support would be removed from ops/puppet.git
[12:21:13] <moritzm>	 independent of puppet we can't leave them around as well; the security support for jessie has ended and for now has been on life support by backporting security updates internally on apt.wikimedia.org
[12:21:31] <moritzm>	 but with the forthcoming migration of the remaining prod services that'll end as well
[12:22:06] <moritzm>	 and we shouldn't keep VMs without security updates in Cloud VPS either
[12:24:45] <Majavah>	 those services don't have any more documentation than "The Recommendation API is a Node.js API service", but based on a quick look those removing them should only affect content translation and breaking that for some time shouldn't be a too large problem
[12:25:59] <Majavah>	 I already shut down jessie restbases from deployment-prep today, so those and sca* only leave one deployment-prep instance running on jessie
[12:26:26] <moritzm>	 I'm not familiar with the internals of apertium/recommendation_api, akosiaris might know where to look whether it's actually in use in beta or not
[12:40:32] * akosiaris reading backlog
[12:44:00] <akosiaris>	 so, apertium is used by cxserver and both have been moved to k8s in production for a pretty long time now. It also means that whatever is in beta is way too old to be of any use and is probably not being used. I 'd just shut it off. recommendation-api is also in k8s now and has been for more > 3 months IIRC. The version there hasn't really been
[12:44:01] <akosiaris>	 updated so the way too old argument doesn't stand here. That being said, there isn't even a proper service owner for it in production, nevermind beta. I 'd just remove it
[12:44:33] <akosiaris>	 If anyone ends up complaining, they just ended up volunteering themselves for becoming the owner of it
[12:46:07] <Majavah>	 akosiaris: ok, thanks!
[12:46:27] <akosiaris>	 yw
[12:46:37] <akosiaris>	 sorry I couldn't help in a more concrete way
[12:49:12] <Majavah>	 now I just need to figure out what to do with deployment-logstash2 and then beta cluster is on Stretch or newer
[12:55:29] <moritzm>	 thanks for all your work on this!
[13:47:16] <klausman>	 So who would I pester about additions to the rescue netboot image?
[13:47:42] <klausman>	 Or is the answer already: if it ain't in busybox, we ain't got it.
[13:48:09] <bblack>	 what did you have in mind?
[13:48:34] <klausman>	 Specifically, I'd like to have wipefs in the rescue image. With existing things like MD and LVM vgs partman can get quite confused.
[13:48:57] <klausman>	 Wipefs is part of util-linux
[13:50:52] <bblack>	 right
[13:51:18] <bblack>	 I mean, it seems like a reasonable thing to need in a rescue image for manual work
[13:52:03] <bblack>	 one could perhaps argue that some early pre-script in our installer should maybe be doing wipefs work to ensure no leftover complications from the previous install, which might cover your case, too?
[13:52:09] <klausman>	 Yeah, the only alternative is dd if=/dev/zero, and that can take forever on big disks. better to cleverly erase the partions/signatures
[13:52:51] <klausman>	 Ah, I was considering the wipefs step, but some partman recipes do not wipe everything, eg. the stat1xxx hosts usually keep /srv intact quite on purpose.
[13:53:28] <klausman>	 I'd rather have wipefs available but not the default. Otherwise, there might be very pointy questions if-when it misfires.
[13:53:44] <bblack>	 yeah, that's maybe interesting to make note of as well.  Most of us probably make the assumption that reimages don't retain data...
[13:53:54] <bblack>	 but yeah, I can see how stateful users don't like that :)
[13:55:55] <bblack>	 but yeah, the ideal we're aiming for (not that we ever quite reach ideal states) is that imaging/reimaging shouldn't involve manual steps.  So through some lens, you can view any actual use of a rescue image (other than perhaps for helping diagnose an actual hardwarefail-crash) as a pointer towards something that should be automated better.
[13:56:22] <bblack>	 whether that's pragmatic in this case is ???
[13:59:16] <bblack>	 I guess as a thought-comparison point about practices: in some environments, storage that's too large and/or important to be wiped on reimage would probably live in some SAN-attached stuff, and the local disks would always be considered wipeable.
[13:59:45] <bblack>	 but we don't operate that way here (for some good reasons in most cases, I think)
[14:04:17] <bblack>	 I don't see anything "obvious" for customizing our rescue image at present in the puppet repo
[14:04:21] <volans>	 klausman: we have a plan to add a boot menu with various utilities, like disk wipe, stress test, d-i, local disk, live CD for debugging. No ETA right now
[14:04:31] <bblack>	 but there is  https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/rescue-pxe/
[14:04:36] <volans>	 FYI the decommission cookbook already wipes the FS partition table
[14:04:57] <klausman>	 My suspicion is that it does more than I want, tho.
[14:05:06] <bblack>	 which looks like it hasn't been worked on in a while, and there's some work on it in https://phabricator.wikimedia.org/T78135
[14:05:30] <volans>	 it's on the plate of SRE I/F fwiw
[14:05:41] <volans>	 cc paravoid for prioritization
[14:07:32] <volans>	 and then dcops takes care of the full wipe + physical decom
[14:08:11] <klausman>	 Ack. Maybe a good idea to consider other useful rescue-y stuff as well. Naturally, the rescue image can't be 100G of stuff, but what little Debian ships by default with their minimal install/rescue ISO is just very, very barebones.
[14:08:14] <bblack>	 wondering aloud with probably dumb ideas, but perhaps if reimage workflows which preserve important data partitions are desirable, a path towards that would be to mark "stateless" filesystems like rootfs with a marker file like "/.wipeable", or even better some kind of metadata flag we could set at the partition level if such a thing exists.
[14:08:34] <bblack>	 and then have decom/reimage only wipe filesystems that are detectable as flagged-wipeable.
[14:08:53] <volans>	 anyway, I shouldn't even be here right now...
[14:08:59] * volans vanishes in a puff of logic
[14:09:04] <klausman>	 Yeah, neat idea. I wonder if GPT is a bit more powerful when it comes to metadata for partitions.
[14:09:31] <kormat>	 please do not complicated reuseparts further with your fancy logic :P
[14:09:35] <kormat>	 *complicate
[14:10:00] <klausman>	 There's that.
[14:10:03] <bblack>	 what's "reuseparts"?
[14:10:15] <klausman>	 The part of our partman recipes that doesn't wipe everything
[14:10:19] <kormat>	 bblack: the support i wrote to allow partman to know which partitions to reformat and which to keep
[14:10:40] <bblack>	 oh where is that?
[14:10:48] <klausman>	 The "nuke this partition" logic above would/could/should live outside of partman, though
[14:10:49] <kormat>	 `modules/install_server/files/autoinstall/scripts/reuse-parts.sh`
[14:10:52] <paravoid>	 it is indeed in the roadmap. if folks are interested in working more of that, we can take the help for sure :) we could perhaps have a shared OKR or something
[14:12:09] <bblack>	 woah
[14:12:15] <klausman>	 Of course I bring this up the last day before 2 weeks of PTO. :D
[14:12:18] <kormat>	 jbond42_: is there any way to tell puppet 'please ensure this systemd .service exists, but never start/stop the service?
[14:13:05] <kormat>	 bblack: that was my _starter_ project
[14:13:30] <bblack>	 trial by partman
[14:13:31] <jbond42>	 kormat: is it a service you are creating or one that comes from a deb package?
[14:14:03] <kormat>	 jbond42: one i'm creating. the CR is at https://gerrit.wikimedia.org/r/c/operations/puppet/+/665324. i only just realised that it'll cause puppet to either start or stop the service on every puppet run, depending on a hiera var. both are very unwanted.
[14:14:26] * jbond42 looking
[14:15:29] <kormat>	 i think the answer is probably going to be "well, write the template to the correct place, but do not use Service or systemd::service"
[14:16:44] <bblack>	 yeah I know "Service" only has ensure=>stopped|running support.  It really should have an "exists" or something like that.
[14:16:52] <kormat>	 bblack: yeah :(
[14:18:17] <bblack>	 but surely we could add such a state at the systemd::service level, which probably ends up not defining a Service, but still doing the rest?
[14:20:03] <bblack>	 systemd::service currently does an unconditional "ensure_resource('service', $label, $params)"
[14:20:15] <jbond42>	 at the puppet level you can delclare service{'foo': } and as far as i can tell it will do nothing
[14:20:43] <jbond42>	 you could also do https://phabricator.wikimedia.org/P14951 but both of theses are a bit hacky and i think bblack solution is better
[14:21:16] <jbond42>	 i.e. add $manage_service = true to systemd::service and use it to gaurd ensure_resource('service')
[14:21:44] <bblack>	 although another way of thinking of it is that systemd::service is mostly just a wrapper for systemd::unit + that Service definition
[14:21:50] <bblack>	 so maybe could just use systemd::unit directly?
[14:22:45] <jbond42>	 +1 also a good idea
[14:23:13] <kormat>	 sounds like one i can probably pull off, too :)
[14:30:12] <kormat>	 and it works! thanks bblack, jbond42
[14:32:30] <jbond42>	 yay :) np
[17:34:36] <legoktm>	 volans|off: thanks :D +2'd
[18:57:02] <legoktm>	 if someone would like to +1/+2 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/673558/ to unbreak CI, it's the same exact patch vol.ans made for pywmflib
[19:06:48] <bblack>	 legoktm: ^ +1'd :)
[19:07:00] <legoktm>	 thanks :)
[20:18:50] <Majavah>	 legoktm: hi! you might have sucked me into the rabbit hole of debian packaging, now working on converting udp2log to systemd :D I made https://gerrit.wikimedia.org/r/c/analytics/udplog/+/673596, but no clue if it's correct or how to even test it, advice and reviews appreciated
[20:19:51] <legoktm>	 wheee
[20:21:33] <legoktm>	 heh, it's only 11 lines
[20:22:15] <legoktm>	 Majavah: I'll take a look after lunch, I also added mor.itz as a reviewer since he's also great at packaging+systemd stuff
[20:22:57] <Majavah>	 legoktm: cool, thanks!
[20:45:06] <legoktm>	 FYI: the sre.ganeti.makevm cookbook now takes a hostname as an argument, not the fqdn: https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1904470&oldid=1901141
[20:50:31] <mutante>	 oh, that's good to know. thanks legoktm