[06:43:38] beta cluster only patch to fix url shortener in beta cluster. Any SRE willing to merge please 🥺 https://gerrit.wikimedia.org/r/c/operations/puppet/+/673389 [06:43:49] tested already on the vm [08:58:22] hi again, one more betacluster hiera patch in operations/puppet: https://gerrit.wikimedia.org/r/673047 [08:59:39] Majavah: I can take care of it, are those restbase instances being removed? [09:00:04] volans: yes, those were Jessie hosts that are being removed [09:00:35] k [09:00:37] merging [09:00:45] ty [09:01:34] Majavah: {done} [09:38:45] anyone have any suggestion of where .sql files (for user consumption, no automation yet) should be put on a debian filesystem? /usr/share/doc/packagename ? [09:49:29] if they aren't meant to be consumed by something in the package but just users' eyes, yes [09:50:18] otherwise, /usr/share/ would be more appropriate. [10:18:03] thank you very much, alex! [10:28:08] klausman, elukey: cumin alias A:ml-serve doesn't match any host, did maybe the role name change? (O:ml_serve currently) [10:28:22] it seems they were converted to ml_k8s::{worker,master} from a cursory loo [10:28:25] k [10:29:02] Yes, that was likely borken by one of my recent CLs [10:30:00] volans: ahahah I was just wondering how much time would have passed between the email and you pinging us :D [10:30:24] elukey: I'm also on clinic duty this week, so you were doomed from the start :-P [10:30:36] jokes aside, I am working on a fix, hopefully puppet will be happier soon [10:30:41] ack [10:30:57] but I think this is different from the logstash plugin stuff [10:31:17] I didn't ping you for the broken puppet because I saw it's already being worked on ;) [10:33:00] volans: you are right, we changed the role name! Silly me [10:33:14] fixing [10:33:32] thx! <3 [12:01:18] dcaro, legoktm and others: FYI wrt the pip backtracking issue in some of our python tooling repositories, I've just sent a proposed fix for wmflib.
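The placement advice above can be sketched as follows; the package name "mytool" is hypothetical, and the paths follow Debian policy / FHS conventions for machine-consumed versus user-facing data files.

```shell
# Sketch of the .sql placement advice; "mytool" is a hypothetical
# package name, paths follow Debian policy / the FHS.
pkg=mytool
# .sql files read by code shipped in the package itself:
echo "machine-consumed: /usr/share/$pkg/"
# .sql files meant only for users to read or run by hand:
echo "user-facing:      /usr/share/doc/$pkg/"
```

In a debhelper-based package these destinations would typically be listed in the package's `.install` file.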
Once we agree on the solution I'll propagate the fix to the other repos as well. [12:02:41] * volans bbl [12:13:45] moritzm: hi, re T161675 I'm not exactly sure what you're suggesting for apertium/recommendation-api (my understanding is that graphoid isn't needed anymore), both of those are still running on deployment-sca* and restbase is configured to use those hosts, I think those services need to be moved somewhere else before removing deployment-sca hosts [12:13:45] T161675: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 [12:14:09] oops wrong ticket, meant T218729 [12:14:10] T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster - https://phabricator.wikimedia.org/T218729 [12:16:36] but there's no place to move them in beta given that it lacks k8s :-) so I think removing them is the only option at this point, the remaining jessie support in puppet.git will need to be removed in a few weeks [12:17:53] sure, ideally that would move to a separate k8s cluster for beta following what production does, but in the absence of that I don't see another option? [12:18:07] * Majavah wonders what exactly would break if those services were unavailable [12:18:25] yes, beta will need a k8s cluster at some point, especially when mediawiki is moving to k8s [12:19:31] it's hard to tell, we could just shut off the VMs for a week and if there are no complaints, delete them next Friday?
[12:19:55] fwiw puppet on those hosts is already broken, so I'm not exactly sure what would happen if those were still around when jessie support is removed from ops/puppet.git [12:21:13] independent of puppet we can't leave them around anyway; security support for jessie has ended and it has been kept on life support for now by backporting security updates internally on apt.wikimedia.org [12:21:31] but with the forthcoming migration of the remaining prod services that'll end as well [12:22:06] and we shouldn't keep VMs without security updates in Cloud VPS either [12:24:45] those services don't have any more documentation than "The Recommendation API is a Node.js API service", but based on a quick look removing them should only affect content translation, and breaking that for some time shouldn't be too large a problem [12:25:59] I already shut down jessie restbases from deployment-prep today, so those and sca* leave only one deployment-prep instance running on jessie [12:26:26] I'm not familiar with the internals of apertium/recommendation_api, akosiaris might know where to look to check whether it's actually in use in beta or not [12:40:32] * akosiaris reading backlog [12:44:00] so, apertium is used by cxserver and both have been moved to k8s in production for a pretty long time now. It also means that whatever is in beta is way too old to be of any use and is probably not being used. I'd just shut it off. recommendation-api is also in k8s now and has been for more than 3 months IIRC. The version there hasn't really been [12:44:01] updated so the way-too-old argument doesn't stand here. That being said, there isn't even a proper service owner for it in production, never mind beta. I'd just remove it [12:44:33] If anyone ends up complaining, they've just volunteered themselves to become the owner of it [12:46:07] akosiaris: ok, thanks!
[12:46:27] yw [12:46:37] sorry I couldn't help in a more concrete way [12:49:12] now I just need to figure out what to do with deployment-logstash2 and then beta cluster is on Stretch or newer [12:55:29] thanks for all your work on this! [13:47:16] So who would I pester about additions to the rescue netboot image? [13:47:42] Or is the answer already: if it ain't in busybox, we ain't got it. [13:48:09] what did you have in mind? [13:48:34] Specifically, I'd like to have wipefs in the rescue image. With existing things like MD and LVM vgs partman can get quite confused. [13:48:57] Wipefs is part of util-linux [13:50:52] right [13:51:18] I mean, it seems like a reasonable thing to need in a rescue image for manual work [13:52:03] one could perhaps argue that some early pre-script in our installer should maybe be doing wipefs work to ensure no leftover complications from the previous install, which might cover your case, too? [13:52:09] Yeah, the only alternative is dd if=/dev/zero, and that can take forever on big disks. better to cleverly erase the partitions/signatures [13:52:51] Ah, I was considering the wipefs step, but some partman recipes do not wipe everything, e.g. the stat1xxx hosts usually keep /srv intact quite on purpose. [13:53:28] I'd rather have wipefs available but not the default. Otherwise, there might be very pointy questions if-when it misfires. [13:53:44] yeah, that's maybe interesting to make note of as well. Most of us probably make the assumption that reimages don't retain data... [13:53:54] but yeah, I can see how stateful users don't like that :) [13:55:55] but yeah, the ideal we're aiming for (not that we ever quite reach ideal states) is that imaging/reimaging shouldn't involve manual steps. So through some lens, you can view any actual use of a rescue image (other than perhaps for helping diagnose an actual hardware failure/crash) as a pointer towards something that should be automated better.
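The wipefs-vs-dd point above can be demonstrated safely against a scratch file rather than a real disk (wipefs and mkswap both ship in util-linux); this is only an illustration, not the actual rescue-image workflow.

```shell
# Safe demo of wipefs vs dd, using a scratch file instead of a disk.
img=$(mktemp)
truncate -s 1M "$img"
mkswap "$img" >/dev/null 2>&1   # give the file a swap signature
wipefs "$img"                   # lists the signature it found
# dd if=/dev/zero would rewrite the whole device and scale with its
# size; wipefs --all erases just the known signature bytes instead:
wipefs --all "$img" >/dev/null
wipefs "$img"                   # no signatures left, prints nothing
rm -f "$img"
```

The same speed argument holds on real disks: wipefs touches a handful of bytes per signature, while zeroing a multi-terabyte drive with dd can run for hours.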
[13:56:22] whether that's pragmatic in this case is ??? [13:59:16] I guess as a thought-comparison point about practices: in some environments, storage that's too large and/or important to be wiped on reimage would probably live in some SAN-attached stuff, and the local disks would always be considered wipeable. [13:59:45] but we don't operate that way here (for some good reasons in most cases, I think) [14:04:17] I don't see anything "obvious" for customizing our rescue image at present in the puppet repo [14:04:21] klausman: we have a plan to add a boot menu with various utilities, like disk wipe, stress test, d-i, local disk, live CD for debugging. No ETA right now [14:04:31] but there is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/rescue-pxe/ [14:04:36] FYI the decommission cookbook already wipes the FS partition table [14:04:57] My suspicion is that it does more than I want, tho. [14:05:06] which looks like it hasn't been worked on in a while, and there's some work on it in https://phabricator.wikimedia.org/T78135 [14:05:30] it's on the plate of SRE I/F fwiw [14:05:41] cc paravoid for prioritization [14:07:32] and then dcops takes care of the full wipe + physical decom [14:08:11] Ack. Maybe a good idea to consider other useful rescue-y stuff as well. Naturally, the rescue image can't be 100G of stuff, but what little Debian ships by default with their minimal install/rescue ISO is just very, very barebones. [14:08:14] wondering aloud with probably dumb ideas, but perhaps if reimage workflows which preserve important data partitions are desirable, a path towards that would be to mark "stateless" filesystems like rootfs with a marker file like "/.wipeable", or even better some kind of metadata flag we could set at the partition level if such a thing exists. [14:08:34] and then have decom/reimage only wipe filesystems that are detectable as flagged-wipeable. [14:08:53] anyway, I shouldn't even be here right now... 
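The "/.wipeable" marker idea floated above could look something like the sketch below; the mountpoints and the flag-file name are purely illustrative, not anything that exists in the puppet repo.

```shell
# Sketch of the marker-file idea: a decom/reimage step that clears
# only filesystems whose root carries the flag file. Mountpoints and
# the ".wipeable" name are illustrative assumptions.
for mnt in / /srv; do
    if [ -e "$mnt/.wipeable" ]; then
        echo "would wipe: $mnt"
    else
        echo "would keep: $mnt"
    fi
done
```

A partition-level metadata flag (e.g. a GPT partition attribute) would survive a destroyed filesystem better than a file inside it, which is presumably why the GPT question came up next.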
[14:08:59] * volans vanishes in a puff of logic [14:09:04] Yeah, neat idea. I wonder if GPT is a bit more powerful when it comes to metadata for partitions. [14:09:31] please do not complicated reuseparts further with your fancy logic :P [14:09:35] *complicate [14:10:00] There's that. [14:10:03] what's "reuseparts"? [14:10:15] The part of our partman recipes that doesn't wipe everything [14:10:19] bblack: the support i wrote to allow partman to know which partitions to reformat and which to keep [14:10:40] oh where is that? [14:10:48] The "nuke this partition" logic above would/could/should live outside of partman, though [14:10:49] `modules/install_server/files/autoinstall/scripts/reuse-parts.sh` [14:10:52] it is indeed in the roadmap. if folks are interested in working on more of that, we can take the help for sure :) we could perhaps have a shared OKR or something [14:12:09] woah [14:12:15] Of course I bring this up the last day before 2 weeks of PTO. :D [14:12:18] jbond42_: is there any way to tell puppet 'please ensure this systemd .service exists, but never start/stop the service'? [14:13:05] bblack: that was my _starter_ project [14:13:30] trial by partman [14:13:31] kormat: is it a service you are creating or one that comes from a deb package? [14:14:03] jbond42: one i'm creating. the CR is at https://gerrit.wikimedia.org/r/c/operations/puppet/+/665324. i only just realised that it'll cause puppet to either start or stop the service on every puppet run, depending on a hiera var. both are very unwanted. [14:14:26] * jbond42 looking [14:15:29] i think the answer is probably going to be "well, write the template to the correct place, but do not use Service or systemd::service" [14:16:44] yeah I know "Service" only has ensure=>stopped|running support. It really should have an "exists" or something like that.
[14:16:52] bblack: yeah :( [14:18:17] but surely we could add such a state at the systemd::service level, which probably ends up not defining a Service, but still doing the rest? [14:20:03] systemd::service currently does an unconditional "ensure_resource('service', $label, $params)" [14:20:15] at the puppet level you can declare service{'foo': } and as far as i can tell it will do nothing [14:20:43] you could also do https://phabricator.wikimedia.org/P14951 but both of these are a bit hacky and i think bblack's solution is better [14:21:16] i.e. add $manage_service = true to systemd::service and use it to guard ensure_resource('service') [14:21:44] although another way of thinking of it is that systemd::service is mostly just a wrapper for systemd::unit + that Service definition [14:21:50] so maybe could just use systemd::unit directly? [14:22:45] +1 also a good idea [14:23:13] sounds like one i can probably pull off, too :) [14:30:12] and it works! thanks bblack, jbond42 [14:32:30] yay :) np [17:34:36] volans|off: thanks :D +2'd [18:57:02] if someone would like to +1/+2 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/673558/ to unbreak CI, it's the same exact patch vol.ans made for pywmflib [19:06:48] legoktm: ^ +1'd :) [19:07:00] thanks :) [20:18:50] legoktm: hi! you might have sucked me into the rabbit hole of debian packaging, now working on converting udp2log to systemd :D I made https://gerrit.wikimedia.org/r/c/analytics/udplog/+/673596, but no clue if it's correct or how to even test it, advice and reviews appreciated [20:19:51] wheee [20:21:33] heh, it's only 11 lines [20:22:15] Majavah: I'll take a look after lunch, I also added mor.itz as a reviewer since he's also great at packaging+systemd stuff [20:22:57] legoktm: cool, thanks!
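On the host, the "unit file exists but service is never started/stopped" end state discussed above boils down to the sketch below. A throwaway fake root is used so it runs anywhere; on a real host the target would be /etc/systemd/system plus a `systemctl daemon-reload`, which is roughly what managing only the unit file (e.g. via Puppet's systemd::unit) achieves. The unit name "myjob.service" is hypothetical.

```shell
# Sketch: the unit file is present on disk, but nothing ever starts,
# stops, or enables the service, so its run state is left alone.
root=$(mktemp -d)   # fake root; a real host would use /etc directly
mkdir -p "$root/etc/systemd/system"
cat > "$root/etc/systemd/system/myjob.service" <<'EOF'
[Unit]
Description=hypothetical example job

[Service]
ExecStart=/bin/true
EOF
# deliberately absent: any `systemctl start|stop|enable myjob.service`
ls "$root/etc/systemd/system"   # prints: myjob.service
rm -rf "$root"
```

This matches the thread's conclusion: only the unit file is under configuration management, and operators remain free to start or stop the service by hand.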
[20:45:06] FYI: the sre.ganeti.makevm cookbook now takes a hostname as an argument, not the fqdn: https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1904470&oldid=1901141 [20:50:31] oh, that's good to know. thanks legoktm