[05:44:47] Yes, so db1114 was depooled by mistake at https://phabricator.wikimedia.org/P11039
[05:44:50] I am going to pool it back
[07:02:03] interesting:
[07:02:06] Cache key contains characters that are not allowed: `commonswiki
[07:03:33] going to open a task
[07:08:59] https://phabricator.wikimedia.org/T251368
[08:29:35] jayme: no helm 3 related task that I know of. Create one! :-)
[08:31:09] Did so *duck* :)
[11:50:12] mutante: hey, can you give me +2 rights in my repo? I can't merge https://gerrit.wikimedia.org/r/c/wikimedia/meet-accountmanager :D
[11:51:11] Amir1: Ah, it has already been created? nice! the permissions should have been part of that request though. uhm...
[11:51:54] seems like 404 to me, looking
[11:51:56] https://gerrit.wikimedia.org/r/#/admin/groups/1775,members
[11:53:16] Amir1: reload that page
[11:53:38] Thanks!
[11:54:41] Amir1: i see the history got imported, yay :)
[11:54:52] \o/
[11:54:54] that's what i was hoping for
[12:37:56] mutante: could you please remind me where is the new repo server?
[12:38:17] got it!
[12:38:20] arturo@endurance:~ $ host apt.wikimedia.org
[12:38:20] apt.wikimedia.org is an alias for apt1001.wikimedia.org.
[12:38:41] arturo: correct :)
[12:38:58] :-)
[13:21:22] volans: do we have any system for sending automated changes to gerrit, waiting on the result, and then proceeding?
[13:22:28] 👀
[13:22:49] 👀
[13:23:03] specifically i'm interested in having automation make changes to the puppet repo. i'm fine with (and indeed in favour of) humans reviewing the changes
[13:23:10] 👀👀👀👀👀👀
[13:28:37] if it's for the disabled notification that's the wrong approach ;)
[13:28:51] do tell :)
[13:29:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/593205 is an example
[13:29:36] new hosts?
[13:29:40] or existing ones
[13:30:05] this specific case was reimaging an existing host
[13:30:45] ok, so I'll start with questions, why do you need to change partman? which partman was using before?
[13:31:05] volans: for that bit, https://phabricator.wikimedia.org/T251392 gives some context
[13:31:37] if only phab was working...
[13:33:23] is the grafana db with the dashboard definitions in backed up in any sort of way?
[13:33:31] addshore: yes, in bacula
[13:33:31] the TL;DR is we do not want a db server that mistakenly boots of off pxe to have its data wiped
[13:33:36] cdanis: amazing thanks!
[13:34:47] kormat: volans: I've long thought there's a deeper problem here, which is that reimages can destroy stateful data. not all services work this way; AIUI on the Swift backend hosts, partman formats the system drives but leaves the storage drives alone, and we have Puppet that does a smarter job of seeing if they are already partitioned and such
[13:35:19] and for some services an accidental reimage is also fine as they are stateless to begin with
[13:35:21] i imagine we'll want to do something similar, too
[13:35:31] but it's a laudable task to work on!
[13:35:46] some thoughts/requirements/ideas:
[13:35:47] kormat: yeah I know that part, but 1) I was about to suggest the same thing, add a line before, case will exit at the first one AFAIK, 2) which partman recipe is then it using? we don't have a default
[13:36:21] I think we should entirely decouple the partman config from the data which defines whether a host is allowed to be reimaged
[13:36:23] volans: it doesn't use any. which currently means we attach to the console and manually handle partitioning
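
(A minimal sketch of the decoupling suggested above at 13:36-13:37: one lookup answers "which partman recipe does this host use" and a completely separate one answers "is this host currently allowed to be reimaged". The file names, the pattern-matching scheme and the function names are all made up for illustration; they do not reflect the actual operations/puppet layout.)

    #!/usr/bin/env python3
    # Hypothetical illustration only: partman recipe selection decoupled from
    # the "safe to reimage" decision. Neither file name is real.
    import re

    PARTMAN_MAP = "partman-recipes.map"          # hostname-pattern -> recipe, always defined
    REIMAGE_ALLOWLIST = "reimage-allowed.list"   # hosts explicitly opted in to reimaging

    def partman_recipe(host: str) -> str:
        """Return the partman recipe this host is meant to use (always defined)."""
        with open(PARTMAN_MAP) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                pattern, recipe = line.split()
                if re.fullmatch(pattern, host):
                    return recipe
        raise LookupError(f"no partman recipe defined for {host}")

    def reimage_allowed(host: str) -> bool:
        """Separate question: is this host currently whitelisted for a reimage?"""
        with open(REIMAGE_ALLOWLIST) as f:
            return host in {l.strip() for l in f if l.strip()}
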
[13:36:34] that's wrong in so many ways
[13:36:38] and should not be done IMHPO
[13:36:44] *IMHO
[13:36:55] i don't think anyone is arguing the current process is ideal :)
[13:37:05] i'm just saying what it _is_, and i'm looking at how to make it less worse
[13:37:14] I think those are the *only* servers in which this is done, it's a first for me
[13:37:37] Only for those where we do not want to format /Srv
[13:37:52] the current no-srv-format.cfg is a hack, partman.cfg should simply configure what partman recipe a server is meant to use (independent of whether that server is currently whitelisted for a reimage)
[13:39:28] d-i could simply shell out to a small wrapper "am-i-fine-to-reimage" which aborts d-i if the server in question (and we could use the same for non-DB servers as well)
[13:39:42] is not whitelisted for reimage
[13:40:00] we can have a partman that never touches /srv and have puppet create it if it doesn't exists also
[13:40:06] is there a reason servers pxe boot an installer even when they're not being reimaged?
[13:40:27] could we not have a default pxe boot target that is a no-op, or just boots from the first hard drive, or something?
[13:40:37] yes, that's the plan, have a menu, ENOTIME
[13:40:47] but there is a whole plan about it if you're curious
[13:40:53] kormat: currently not, there were plans for a diagnostics boot, but currently PXE is only used for reimages/installs
[13:41:00] but by default it boots from disk
[13:41:01] mm, ack.
[13:41:15] only if you override with PXE then it will boot with pxe
[13:41:19] volans: dells (at least) default to booting from pxe, it seems
[13:41:20] and because this has happened once
[13:41:31] it boots whatever you configure it
[13:41:32] which has caused data loss in the past, i'm told
[13:41:33] in the bios
[13:42:26] volans: we've had cases with hosts booting up by default with PXE, and getting reimaged
[13:42:32] I know
[13:42:35] that's why the no-srv-format.cfg partman came up :(
[13:42:36] I'm not saying it can't happen
[13:42:53] the long term solution is to always boot pxe, have a menu that by default goes back to local disk
[13:42:58] It happened to an es host actually, so that was a few TB to reclone
[13:43:01] volans: I agree
[13:43:12] and that menu will also have additional options like stress test, wipe, debug with live OS
[13:43:21] and normal PXE reinstall
[13:45:30] "but by default it boots from disk" not true
[13:45:47] it tries everywhere
[13:45:56] if the dc op forgot about it
[13:45:58] in the interim we could add a flag in Netbox "prevent reimage of server", which generates a list of hostnames on the PXE servers and then modify d-i to check that list (and abort if matching). then a reimage of a DB server would require to untick the box (and tick it again when complete)
[13:46:25] or a technician replaced the board, reseting the bios
[13:46:37] an ideal system would allow a host to boot off the os install pxe image once, and then revert
[13:46:43] or the disk, for some reason, fails to boot (e.g. grub malfunction)
[13:46:44] we reimage
[13:46:59] I am all for moritz solution :-D
[13:47:19] I proposed that a while ago, we got ignored :-(
[13:47:32] so we "fixed it" on our own
[13:50:31] and netbox (or similar) is the right place to actually handle that
[13:50:52] property: reimaginable: YES/NO
[13:51:00] or whatever
[13:51:15] what was it that your proposed jynus?
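
(A minimal sketch of the "am-i-fine-to-reimage" guard discussed above at 13:39-13:46, assuming the install server publishes a plain-text do-not-reimage list generated from a Netbox flag. The URL, the list format and the idea of wiring this into a d-i preseed early_command are assumptions, not current practice; in the real d-i environment this would more likely be a small shell snippet, but the logic is the same.)

    #!/usr/bin/env python3
    # Hypothetical "am-i-fine-to-reimage" guard: exit non-zero if this host
    # appears on a generated do-not-reimage list, so the caller can abort d-i.
    import socket
    import sys
    import urllib.request

    # Assumed location of the generated list; not a real endpoint.
    LIST_URL = "http://apt.wikimedia.org/do-not-reimage.list"

    def main() -> int:
        host = socket.getfqdn()
        with urllib.request.urlopen(LIST_URL, timeout=10) as resp:
            blocked = {line.strip() for line in resp.read().decode().splitlines()}
        if host in blocked:
            print(f"{host} is flagged 'prevent reimage'; refusing to continue",
                  file=sys.stderr)
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())
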
[13:51:50] moritzm: a bonus is that it doesn't require going through the puppet repo, which is a plus for automation
[13:51:51] a puppet config to prevent reimaging servers accidentally
[13:52:05] doesn't necesarilly have to be puppet
[13:52:17] could be netbox "life cycle state"
[13:52:37] did this happen?
[13:52:48] what do you mean?
[13:52:49] kormat: i do like the idea of bots that can propose changes, it's just tricky
[13:53:07] some host getting accidentally reimaged
[13:53:10] yes
[13:53:11] kormat: defenitely, we can even optimise the second "tick it again" step away and have Puppet mark it non-reimageable when the mariadb role runs Puppet the first time
[13:53:20] paravoid: yes we lost all wiki contents
[13:53:29] https://phabricator.wikimedia.org/T160242
[13:53:51] we suffered to recover them without an outage
[13:53:58] if you have any questions on how d-i works, please ping me, it's technically quite elegant/powerful, but not very approachable
[13:54:25] so this may sound bureaucratic, but
[13:54:26] it also happened with a proxy, but that was less impactful as no data was lost
[13:54:32] despite the fact that we recovered without an outage
[13:54:39] it does sound like what I'd call an incident
[13:54:47] it was an incident
[13:54:48] and I think it may be worth it to track it as one
[13:54:54] same as the proxy
[13:54:56] looks like this was in 2017?
[13:55:03] it happened again later
[13:57:17] I proposed to work (myself) on a fix, but people started discussing how dc ops or someone else was going to create a pxe menu and a workflow, but until then, we did the partman hack
[13:57:59] +1 to some work decoupling partman definitions from a 'is-safe-to-reimage' bit, and +1 to hopefully eventually making reimages never destroy service state data
[13:58:18] of course
[13:58:33] I am not saying this is ideal, I am saying that is the only thing we have now :-D
[13:58:35] indeed :)
[13:58:43] I/F talked about building a pxe menu last year, dunno if anyone actually found the time
[13:59:10] cdanis, let me be skeptical, because I have been hearing that since 2017, every year :-D
[13:59:22] and that is ok, as volans said it is ETIME
[13:59:42] what I mean is what we have now > nothing :-D
[13:59:44] the plan might have been on a server that got accidentally reimaged ;)
[14:00:55] I am the first person that will switch to another system, and again, I also proposed working on it
[14:00:56] the PXE menu idea is late, but it hasn't been on the plans since 2017
[14:01:13] similar ideas were proposed after that happened
[14:01:20] but regardless -- as all incidents go, I think there is a) value into looking for a short-term fix b) thinking of the broader stroke/long-term issues
[14:01:24] <_joe_> my 2c are: if a server should reboot to pxe or not is state, not configuration. So puppet is a horrendous place to keep that information. Unless we start using a db as a hiera backend for puppet :)
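
(Along the lines of _joe_'s point above that reimageability is state rather than configuration, a minimal sketch of reading such a flag from Netbox's REST API instead of from the puppet repo. The boolean custom field name allow_reimage is made up, and the Netbox URL and token handling are illustrative assumptions.)

    #!/usr/bin/env python3
    # Hypothetical: read a per-device "allow_reimage" flag from Netbox.
    # The custom field name is an assumption for illustration only.
    import os
    import requests

    NETBOX_API = "https://netbox.wikimedia.org/api"

    def allow_reimage(device: str) -> bool:
        resp = requests.get(
            f"{NETBOX_API}/dcim/devices/",
            params={"name": device},
            headers={"Authorization": f"Token {os.environ['NETBOX_TOKEN']}"},
            timeout=10,
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            raise LookupError(f"{device} not found in Netbox")
        # Treat a missing or unset flag as "do not reimage", to fail safe.
        return bool(results[0].get("custom_fields", {}).get("allow_reimage"))
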
[14:01:35] but also prioritizing them based on impact, and frequency of occurence
[14:01:35] _joe_: +1 too
[14:01:38] <_joe_> and yes, netbox acting as a ENC would fit the bill too
[14:01:44] that is why I said netbox or similar
[14:01:46] <_joe_> (that is basically moritzm's proposal)
[14:01:50] it would be programmable
[14:01:55] dynamic
[14:02:03] and integrated into server lifecycle
[14:02:21] although I can see people not wanting to put too much functionality into netbox
[14:02:23] there is so much depth to the "netbox as an ENC" problem that I'd prefer to not discuss adhoc and on IRC
[14:02:29] exactly
[14:02:39] so I leave it as "on netbox or similar"
[14:02:59] simialar == any kind of dynamic store
[14:03:09] store/service
[14:03:29] I think it'd be more meaningful to talk about the problem rather than solutions at this point
[14:03:45] and this has been the yearly discussion about the PXC menu :-DDDD
[14:03:52] *PXE
[14:05:05] again, this is all boils down to prioritization, which is a factor of resourcing, other priorities, risk/threats associated with this problem (incl. probability it would happen, and impact if it did)
[14:05:37] I'm happy to have that discussion and reconsider our priorities, but I'd like to talk about _that_ rather than "netbox as an ENC or dynamic store?"
[14:05:39] please don't take this as more as an informal rant
[14:05:59] but I think accidental reimage of a backup and database or swirft server is very real
[14:06:07] I'm saying that I hear your informal rant and would like to make it more actionable ;)
[14:06:20] and the fact that it doesn't happen more often is beacause we hack it around
[14:06:57] what I am anoyed is people looking at yes, a hack, and saying "this is not great"
[14:07:14] and me the first person that agrees, but has nothing better at the moment :-D
[14:07:40] jynus: do you know how many times a host rebooted into PXE when it was not supposed to? both those that went bad and those saved by the hack
[14:08:10] volans: I'd say in reimage time, multiple times a week
[14:08:24] but it doesn't matter, if it only happened 1 it is better than 0
[14:08:36] that is good feedback and news to me
[14:08:47] jynus: what that means multiple times a week?
[14:08:49] it does matter, because we have to balance that over other issues
[14:08:55] in most cases this is related to hw issues
[14:09:07] can I rope you into filing a task about this?
[14:09:21] ideally with some links to times where this cropped up, past incidents etc.?
[14:09:32] paravoid: sure, I can even work on it (no saying I should, just that I am very willing to help it happen)
[14:09:43] and some data to help us prioritize
[14:09:55] but that the main obstacule I found wasn't "I have no time"
[14:10:03] but "it makes no sense" to fix it
[14:10:14] at minimum, you can add a comment to that puppet code that links to the task
[14:10:29] puppet code?
[14:10:31] so if anyone has questions can look it up and see that it's a workaround for a known problem
[14:10:44] d-i partman recipe, whatever the "hack" is
[14:10:47] ah, sure
[14:11:03] thanks!
[14:11:11] the other thing, before I write the task
[14:11:17] what's ENC?
[14:11:22] "description of the problem" [14:11:40] XioNoX: external node classifier, a programmable site.pp [14:11:51] thx [14:12:01] is that there is sometimes not fully comprehension between teams of the main challenges one has [14:12:05] where puppet runs a script with the hostname as input, and that script returns classes to be applied, and parameters to them [14:12:25] and I am the first one that doesn't understand what other people priorities are [14:12:42] but I know other people doesn't understand my team's [14:12:58] relevant to this case [14:13:13] I can see how "reimaging an app server" is not a big deal [14:13:16] :-D [14:13:26] not the time and place to have that conversation I would say -- something I'd encourage you to talk to your manager about though [14:13:29] ok [14:13:40] I am just trying to open a channel here [14:14:24] I want to know what your thought process is, and I want to let you know what mine is [14:14:38] so we can reack mutual understaning of an issue [14:15:25] It looks like kormat has already opened one T251392 [14:15:26] T251392: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 [14:16:03] but that is only 1 side of the story, the question is why it is like that [14:28:50] smart-data-dump exports prometheus metrics - do we have a dashboard somewhere that displays them? [14:32:49] paravoid: T251416 [14:32:51] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 [14:33:00] I am actually happy you are asking for that ticket [16:42:59] herron: will follow up to our email tomorrow. Run out of time today :-P [16:44:22] arturo: sure, sounds good. have a good night