[05:44:47] Yes, so db1114 was depooled by mistake at https://phabricator.wikimedia.org/P11039
[05:44:50] I am going to pool it back
[07:02:03] interesting:
[07:02:06] Cache key contains characters that are not allowed: `commonswiki
[07:03:33] going to open a task
[07:08:59] https://phabricator.wikimedia.org/T251368
[08:29:35] jayme: no helm 3 related task that I know of. Create one! :-)
[08:31:09] Did so *duck* :)
[11:50:12] mutante: hey, can you give me +2 rights in my repo? I can't merge https://gerrit.wikimedia.org/r/c/wikimedia/meet-accountmanager :D
[11:51:11] Amir1: Ah, it has already been created? nice! the permissions should have been part of that request though. uhm...
[11:51:54] seems like 404 to me, looking
[11:51:56] https://gerrit.wikimedia.org/r/#/admin/groups/1775,members
[11:53:16] Amir1: reload that page
[11:53:38] Thanks!
[11:54:41] Amir1: i see the history got imported, yay :)
[11:54:52] \o/
[11:54:54] that's what i was hoping for
[12:37:56] mutante: could you please remind me where is the new repo server?
[12:38:17] got it!
[12:38:20] arturo@endurance:~ $ host apt.wikimedia.org
[12:38:20] apt.wikimedia.org is an alias for apt1001.wikimedia.org.
[12:38:41] arturo: correct :)
[12:38:58] :-)
[13:21:22] volans: do we have any system for sending automated changes to gerrit, waiting on the result, and then proceeding?
[13:22:28] 👀
[13:22:49] 👀
[13:23:03] specifically i'm interested in having automation make changes to the puppet repo. i'm fine with (and indeed in favour of) humans reviewing the changes
[13:23:10] 👀👀👀👀👀👀
[13:28:37] if it's for the disabled notification that's the wrong approach ;)
[13:28:51] do tell :)
[13:29:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/593205 is an example
[13:29:36] new hosts?
[13:29:40] or existing ones
[13:30:05] this specific case was reimaging an existing host
[13:30:45] ok, so I'll start with questions, why do you need to change partman? which partman was using before?
[13:31:05] volans: for that bit, https://phabricator.wikimedia.org/T251392 gives some context
[13:31:37] if only phab was working...
[13:33:23] is the grafana db with the dashboard definitions in backed up in any sort of way?
[13:33:31] addshore: yes, in bacula
[13:33:31] the TL;DR is we do not want a db server that mistakenly boots of off pxe to have its data wiped
[13:33:36] cdanis: amazing thanks!
[13:34:47] kormat: volans: I've long thought there's a deeper problem here, which is that reimages can destroy stateful data. not all services work this way; AIUI on the Swift backend hosts, partman formats the system drives but leaves the storage drives alone, and we have Puppet that does a smarter job of seeing if they are already partitioned and such
[13:35:19] and for some services an accidental reimage is also fine as they are stateless to begin with
[13:35:21] i imagine we'll want to do something similar, too
[13:35:31] but it's a laudable task to work on!
[13:35:46] some thoughts/requirements/ideas:
[13:35:47] kormat: yeah I know that part, but 1) I was about to suggest the same thing, add a line before, case will exit at the first one AFAIK, 2) which partman recipe is then it using? we don't have a default
[13:36:21] I think we should entirely decouple the partman config from the data which defines whether a host is allowed to be reimaged
[13:36:23] volans: it doesn't use any. which currently means we attach to the console and manually handle partitioning
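
(A minimal sketch of the decoupling suggested above at 13:36-13:37: one lookup answers "which partman recipe does this host use" and a completely separate one answers "is this host currently allowed to be reimaged". The file names, the pattern-matching scheme and the function names are all made up for illustration; they do not reflect the actual operations/puppet layout.)

    #!/usr/bin/env python3
    # Hypothetical illustration only: partman recipe selection decoupled from
    # the "safe to reimage" decision. Neither file name is real.
    import re

    PARTMAN_MAP = "partman-recipes.map"          # hostname-pattern -> recipe, always defined
    REIMAGE_ALLOWLIST = "reimage-allowed.list"   # hosts explicitly opted in to reimaging

    def partman_recipe(host: str) -> str:
        """Return the partman recipe this host is meant to use (always defined)."""
        with open(PARTMAN_MAP) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                pattern, recipe = line.split()
                if re.fullmatch(pattern, host):
                    return recipe
        raise LookupError(f"no partman recipe defined for {host}")

    def reimage_allowed(host: str) -> bool:
        """Separate question: is this host currently whitelisted for a reimage?"""
        with open(REIMAGE_ALLOWLIST) as f:
            return host in {l.strip() for l in f if l.strip()}
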
[13:36:34] that's wrong in so many ways
[13:36:38] and should not be done IMHPO
[13:36:44] *IMHO
[13:36:55] i don't think anyone is arguing the current process is ideal :)
[13:37:05] i'm just saying what it _is_, and i'm looking at how to make it less worse
[13:37:14] I think those are the *only* servers in which this is done, it's a first for me
[13:37:37] Only for those where we do not want to format /Srv
[13:37:52] the current no-srv-format.cfg is a hack, partman.cfg should simply configure what partman recipe a server is meant to use (independent of whether that server is currently whitelisted for a reimage)
[13:39:28] d-i could simply shell out to a small wrapper "am-i-fine-to-reimage" which aborts d-i if the server in question (and we could use the same for non-DB servers as well)
[13:39:42] is not whitelisted for reimage
[13:40:00] we can have a partman that never touches /srv and have puppet create it if it doesn't exists also
[13:40:06] is there a reason servers pxe boot an installer even when they're not being reimaged?
[13:40:27] could we not have a default pxe boot target that is a no-op, or just boots from the first hard drive, or something?
[13:40:37] yes, that's the plan, have a menu, ENOTIME
[13:40:47] but there is a whole plan about it if you're curious
[13:40:53] kormat: currently not, there were plans for a diagnostics boot, but currently PXE is only used for reimages/installs
[13:41:00] but by default it boots from disk
[13:41:01] mm, ack.
[13:41:15] only if you override with PXE then it will boot with pxe
[13:41:19] volans: dells (at least) default to booting from pxe, it seems
[13:41:20] and because this has happened once
[13:41:31] it boots whatever you configure it
[13:41:32] which has caused data loss in the past, i'm told
[13:41:33] in the bios
[13:42:26] volans: we've had cases with hosts booting up by default with PXE, and getting reimaged
[13:42:32] I know
[13:42:35] that's why the no-srv-format.cfg partman came up :(
[13:42:36] I'm not saying it can't happen
[13:42:53] the long term solution is to always boot pxe, have a menu that by default goes back to local disk
[13:42:58] It happened to an es host actually, so that was a few TB to reclone
[13:43:01] volans: I agree
[13:43:12] and that menu will also have additional options like stress test, wipe, debug with live OS
[13:43:21] and normal PXE reinstall
[13:45:30] "but by default it boots from disk" not true
[13:45:47] it tries everywhere
[13:45:56] if the dc op forgot about it
[13:45:58] in the interim we could add a flag in Netbox "prevent reimage of server", which generates a list of hostnames on the PXE servers and then modify d-i to check that list (and abort if matching). then a reimage of a DB server would require to untick the box (and tick it again when complete)
[13:46:25] or a technician replaced the board, reseting the bios
[13:46:37] an ideal system would allow a host to boot off the os install pxe image once, and then revert
[13:46:43] or the disk, for some reason, fails to boot (e.g. grub malfunction)
[13:46:44] we reimage
[13:46:59] I am all for moritz solution :-D
[13:47:19] I proposed that a while ago, we got ignored :-(
[13:47:32] so we "fixed it" on our own
[13:50:31] and netbox (or similar) is the right place to actually handle that
[13:50:52] property: reimaginable: YES/NO
[13:51:00] or whatever
[13:51:15] what was it that your proposed jynus?
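
(A minimal sketch of the "am-i-fine-to-reimage" guard discussed above at 13:39-13:46, assuming the install server publishes a plain-text do-not-reimage list generated from a Netbox flag. The URL, the list format and the idea of wiring this into a d-i preseed early_command are assumptions, not current practice; in the real d-i environment this would more likely be a small shell snippet, but the logic is the same.)

    #!/usr/bin/env python3
    # Hypothetical "am-i-fine-to-reimage" guard: exit non-zero if this host
    # appears on a generated do-not-reimage list, so the caller can abort d-i.
    import socket
    import sys
    import urllib.request

    # Assumed location of the generated list; not a real endpoint.
    LIST_URL = "http://apt.wikimedia.org/do-not-reimage.list"

    def main() -> int:
        host = socket.getfqdn()
        with urllib.request.urlopen(LIST_URL, timeout=10) as resp:
            blocked = {line.strip() for line in resp.read().decode().splitlines()}
        if host in blocked:
            print(f"{host} is flagged 'prevent reimage'; refusing to continue",
                  file=sys.stderr)
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())
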
[13:51:50] moritzm: a bonus is that it doesn't require going through the puppet repo, which is a plus for automation
[13:51:51] a puppet config to prevent reimaging servers accidentally
[13:52:05] doesn't necesarilly have to be puppet
[13:52:17] could be netbox "life cycle state"
[13:52:37] did this happen?
[13:52:48] what do you mean?
[13:52:49] kormat: i do like the idea of bots that can propose changes, it's just tricky
[13:53:07] some host getting accidentally reimaged
[13:53:10] yes
[13:53:11] kormat: defenitely, we can even optimise the second "tick it again" step away and have Puppet mark it non-reimageable when the mariadb role runs Puppet the first time
[13:53:20] paravoid: yes we lost all wiki contents
[13:53:29] https://phabricator.wikimedia.org/T160242
[13:53:51] we suffered to recover them without an outage
[13:53:58] if you have any questions on how d-i works, please ping me, it's technically quite elegant/powerful, but not very approachable
[13:54:25] so this may sound bureaucratic, but
[13:54:26] it also happened with a proxy, but that was less impactful as no data was lost
[13:54:32] despite the fact that we recovered without an outage
[13:54:39] it does sound like what I'd call an incident
[13:54:47] it was an incident
[13:54:48] and I think it may be worth it to track it as one
[13:54:54] same as the proxy
[13:54:56] looks like this was in 2017?
[13:55:03] it happened again later
[13:57:17] I proposed to work (myself) on a fix, but people started discussing how dc ops or someone else was going to create a pxe menu and a workflow, but until then, we did the partman hack
[13:57:59] +1 to some work decoupling partman definitions from a 'is-safe-to-reimage' bit, and +1 to hopefully eventually making reimages never destroy service state data
[13:58:18] of course
[13:58:33] I am not saying this is ideal, I am saying that is the only thing we have now :-D
[13:58:35] indeed :)
[13:58:43] I/F talked about building a pxe menu last year, dunno if anyone actually found the time
[13:59:10] cdanis, let me be skeptical, because I have been hearing that since 2017, every year :-D
[13:59:22] and that is ok, as volans said it is ETIME
[13:59:42] what I mean is what we have now > nothing :-D
[13:59:44] the plan might have been on a server that got accidentally reimaged ;)
[14:00:55] I am the first person that will switch to another system, and again, I also proposed working on it
[14:00:56] the PXE menu idea is late, but it hasn't been on the plans since 2017
[14:01:13] similar ideas were proposed after that happened
[14:01:20] but regardless -- as all incidents go, I think there is a) value into looking for a short-term fix b) thinking of the broader stroke/long-term issues
[14:01:24] <_joe_> my 2c are: if a server should reboot to pxe or not is state, not configuration. So puppet is a horrendous place to keep that information. Unless we start using a db as a hiera backend for puppet :)
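
(Along the lines of _joe_'s point above that reimageability is state rather than configuration, a minimal sketch of reading such a flag from Netbox's REST API instead of from the puppet repo. The boolean custom field name allow_reimage is made up, and the Netbox URL and token handling are illustrative assumptions.)

    #!/usr/bin/env python3
    # Hypothetical: read a per-device "allow_reimage" flag from Netbox.
    # The custom field name is an assumption for illustration only.
    import os
    import requests

    NETBOX_API = "https://netbox.wikimedia.org/api"

    def allow_reimage(device: str) -> bool:
        resp = requests.get(
            f"{NETBOX_API}/dcim/devices/",
            params={"name": device},
            headers={"Authorization": f"Token {os.environ['NETBOX_TOKEN']}"},
            timeout=10,
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            raise LookupError(f"{device} not found in Netbox")
        # Treat a missing or unset flag as "do not reimage", to fail safe.
        return bool(results[0].get("custom_fields", {}).get("allow_reimage"))
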
[14:01:35] but also prioritizing them based on impact, and frequency of occurence
[14:01:35] _joe_: +1 too
[14:01:38] <_joe_> and yes, netbox acting as a ENC would fit the bill too
[14:01:44] that is why I said netbox or similar
[14:01:46] <_joe_> (that is basically moritzm's proposal)
[14:01:50] it would be programmable
[14:01:55] dynamic
[14:02:03] and integrated into server lifecycle
[14:02:21] although I can see people not wanting to put too much functionality into netbox
[14:02:23] there is so much depth to the "netbox as an ENC" problem that I'd prefer to not discuss adhoc and on IRC
[14:02:29] exactly
[14:02:39] so I leave it as "on netbox or similar"
[14:02:59] simialar == any kind of dynamic store
[14:03:09] store/service
[14:03:29] I think it'd be more meaningful to talk about the problem rather than solutions at this point
[14:03:45] and this has been the yearly discussion about the PXC menu :-DDDD
[14:03:52] *PXE
[14:05:05] again, this is all boils down to prioritization, which is a factor of resourcing, other priorities, risk/threats associated with this problem (incl. probability it would happen, and impact if it did)
[14:05:37] I'm happy to have that discussion and reconsider our priorities, but I'd like to talk about _that_ rather than "netbox as an ENC or dynamic store?"
[14:05:39] please don't take this as more as an informal rant
[14:05:59] but I think accidental reimage of a backup and database or swirft server is very real
[14:06:07] I'm saying that I hear your informal rant and would like to make it more actionable ;)
[14:06:20] and the fact that it doesn't happen more often is beacause we hack it around
[14:06:57] what I am anoyed is people looking at yes, a hack, and saying "this is not great"
[14:07:14] and me the first person that agrees, but has nothing better at the moment :-D
[14:07:40] jynus: do you know how many times a host rebooted into PXE when it was not supposed to? both those that went bad and those saved by the hack
[14:08:10] volans: I'd say in reimage time, multiple times a week
[14:08:24] but it doesn't matter, if it only happened 1 it is better than 0
[14:08:36] that is good feedback and news to me
[14:08:47] jynus: what that means multiple times a week?
[14:08:49] it does matter, because we have to balance that over other issues
[14:08:55] in most cases this is related to hw issues
[14:09:07] can I rope you into filing a task about this?
[14:09:21] ideally with some links to times where this cropped up, past incidents etc.?
[14:09:32] paravoid: sure, I can even work on it (no saying I should, just that I am very willing to help it happen)
[14:09:43] and some data to help us prioritize
[14:09:55] but that the main obstacule I found wasn't "I have no time"
[14:10:03] but "it makes no sense" to fix it
[14:10:14] at minimum, you can add a comment to that puppet code that links to the task
[14:10:29] puppet code?
[14:10:31] so if anyone has questions can look it up and see that it's a workaround for a known problem
[14:10:44] d-i partman recipe, whatever the "hack" is
[14:10:47] ah, sure
[14:11:03] thanks!
[14:11:11] the other thing, before I write the task
[14:11:17] what's ENC?
[14:11:22] "description of the problem" [14:11:40] XioNoX: external node classifier, a programmable site.pp [14:11:51] thx [14:12:01] is that there is sometimes not fully comprehension between teams of the main challenges one has [14:12:05] where puppet runs a script with the hostname as input, and that script returns classes to be applied, and parameters to them [14:12:25] and I am the first one that doesn't understand what other people priorities are [14:12:42] but I know other people doesn't understand my team's [14:12:58] relevant to this case [14:13:13] I can see how "reimaging an app server" is not a big deal [14:13:16] :-D [14:13:26] not the time and place to have that conversation I would say -- something I'd encourage you to talk to your manager about though [14:13:29] ok [14:13:40] I am just trying to open a channel here [14:14:24] I want to know what your thought process is, and I want to let you know what mine is [14:14:38] so we can reack mutual understaning of an issue [14:15:25] It looks like kormat has already opened one T251392 [14:15:26] T251392: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 [14:16:03] but that is only 1 side of the story, the question is why it is like that [14:28:50] smart-data-dump exports prometheus metrics - do we have a dashboard somewhere that displays them? [14:32:49] paravoid: T251416 [14:32:51] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 [14:33:00] I am actually happy you are asking for that ticket [16:42:59] herron: will follow up to our email tomorrow. Run out of time today :-P [16:44:22] arturo: sure, sounds good. have a good night