[07:07:16] <_joe_> elukey: so, jayme and I will fix a couple things with etcd
[07:07:22] <_joe_> then work on the zk transition
[07:09:54] +1
[07:10:25] <_joe_> and when I say "jayme and I" I mean jayme
[07:21:03] disclaimer: the procedure that I added in the task for the zk swap of conf2001/conf2004 is not super clean, but I didn't find a better one since zk doesn't allow (in v3.4) reloading members of the cluster on the fly
[07:21:51] so please review it before applying :D
[07:21:59] (we can chat about it in here if you want)
[07:25:42] <_joe_> sure, we'll finish the etcd part first
[07:35:07] <_joe_> jayme: let's start replication?
[07:35:36] 1sec. nginx did not pick up the new cert after the puppet run
[07:36:03] <_joe_> that's by design
[07:36:08] ah
[07:36:17] <_joe_> tlsproxy::instance lets you choose how to restart nginx
[07:36:35] It seems to have tried restarting, though
[07:36:46] while the cert was not there
[07:37:13] <_joe_> uhm that's a bug then
[07:37:25] it tried reloading, sorry
[07:37:31] <_joe_> probably in our code
[07:37:41] <_joe_> I'll take a look
[07:38:18] reload doesn't seem to be enough in that case
[07:39:09] <_joe_> yeah possibly
[07:39:22] <_joe_> because the code is correct, the cert should be installed before the reload is issued
[07:40:04] Yes. It's just that the reload will fail then
[07:41:03] <_joe_> ok
[07:41:58] all green now. We can configure replication
[07:42:07] you said there is a non-obvious detail?
[07:42:43] <_joe_> yes
[07:43:02] <_joe_> https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication
[07:43:26] <_joe_> as you can read there, etcdmirror keeps track of the state of replication under /__replication/$destination_prefix
[07:43:32] <_joe_> obviously now you don't have it
[07:43:41] <_joe_> so when you enable the replica it will fail
[07:43:57] <_joe_> unless you start etcdmirror first with --reload
[07:44:09] <_joe_> so my suggestion on how to proceed is:
[07:44:14] <_joe_> * downtime the host
[07:45:09] <_joe_> * run puppet; replication will fail
[07:45:35] <_joe_> * run etcdmirror with the --reload parameter from the cli; once it's done the initial dump and load, stop it and run puppet again
[07:45:41] <_joe_> * replication should be ok
[07:46:15] <_joe_> I'm thinking of writing a script that tests if /__replication/$destination_prefix is not present and runs a dump and load
[07:46:34] <_joe_> with the same parameters as the replication instance
[07:48:47] Okay. "The host" in your steps is the one I choose to enable replication on, right? That can, but must not, be the same one that is configured as replication::dst_url?
[07:49:11] s/must not/does not have to be/
[07:49:30] <_joe_> does not have to be, but I'm not sure about firewalls :)
[07:50:12] yeah. I will obviously choose that one anyways :D
[07:54:54] <_joe_> ok so, I just wrote a patch
[07:55:12] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/682498
[07:55:33] <_joe_> lemme check it for a sec, but that should give you a script to reload the cluster before you enable replication via puppet
[07:57:29] <_joe_> so maybe we don't even need the downtime :)
[07:57:38] okay. 2005 is downtimed anyways
[07:58:26] <_joe_> https://puppet-compiler.wmflabs.org/compiler1001/29187/conf2004.codfw.wmnet/fulldiff.html
[07:59:15] Looks about right to me
[07:59:31] <_joe_> ok so lemme merge it, and then we can try running it
[07:59:37] ack
[08:01:08] <_joe_> /usr/local/sbin/reload-etcdmirror-conftool-eqiad-wmnet is now on conf2005
[08:01:22] <_joe_> let me open a root tmux so we can operate together
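A minimal sketch of the guard script _joe_ describes, assuming an etcd v2-style key layout. Only the /__replication/$destination_prefix convention and the --reload flag come from the discussion above; the default prefix value and the bare etcdmirror invocation are illustrative placeholders, not the actual contents of reload-etcdmirror-conftool-eqiad-wmnet:

```bash
#!/bin/bash
# Sketch: run the initial dump-and-load only when the replication
# watermark key is absent. /__replication/<prefix> and --reload are
# from the discussion above; everything else is a placeholder.
set -euo pipefail

DST_PREFIX="${1:-/conftool}"   # hypothetical destination prefix

if etcdctl get "/__replication${DST_PREFIX}" >/dev/null 2>&1; then
    echo "replication state present under /__replication${DST_PREFIX}; nothing to do"
    exit 0
fi

echo "no state under /__replication${DST_PREFIX}; doing the initial dump and load"
# The real script would invoke etcdmirror with the same src/dst parameters
# as the puppetized replication instance, plus --reload; shown schematically.
etcdmirror --reload
```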
[08:47:50] https://www.confluent.io/blog/kafka-without-zookeeper-a-sneak-peek/ is also nice and related to today's maintenance :D
[09:02:46] <_joe_> elukey: kraft makes me think of gross prepared food :P
[09:03:02] :D
[09:19:17] ryankemper: re: raid0, you need to include partman/raid0.cfg before the raid0 recipes
[09:19:30] there are other examples in netboot.cfg
[09:20:37] it'd be ideal if CI validated netboot.cfg of course, not sure if there's a simple way to ensure e.g. a regexp passes
[10:04:26] something like "if there's partman/raid0- on a line then there must also be partman/raid0.cfg before it"; I suspect it isn't low-hanging fruit to implement, but I would love to be wrong
[15:22:53] godog: that totally makes sense, thanks for looking into it!
[15:23:02] I'll take a note to see if there's a simple way to have CI validate it
[15:41:55] ryankemper: sure, no worries, it isn't super intuitive for sure heh
[16:19:32] could someone familiar with mod_security double-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/681244 for me?
[16:35:56] arturo: o/ - quick question if you have a min - when I add users to a wmcs project, do I need to do anything to allow ssh access, or is there a sync happening once every X time?
[16:39:13] elukey: should be instant
[16:41:28] Majavah: I thought it needed a puppet run to push users, but after me and others were added to ores-staging we were not able to ssh to the instances, this is why I was asking
[16:41:54] on prod it needs a puppet run, but on wmcs it's queried from ldap and instant
[16:42:27] what error are you getting? "permission denied (publickey)"?
[16:43:06] connection closed from remote host during key exchange
[16:44:04] "open failed: administratively prohibited: open failed" etc..
[16:44:47] double check your instance names and fqdns, that sounds like issues when hopping from the bastion to the actual vm
[16:45:40] elukey: my proxy command for VPS looks like this: ProxyCommand ssh -W %h.eqiad1.wikimedia.cloud:%p dzahn@restricted.bastion.wmcloud.org
[16:45:53] maybe the ".cloud" and "restricted" parts there
[16:46:13] trying to ssh to an instance that does not exist gives "channel 0: open failed: administratively prohibited: open failed" to me, so that's my best guess on what's happening there
[16:46:21] host names changed
[16:46:41] old names work for hosts that originally had them
[16:46:45] mutante: yeah I am using restricted.bastion.wmcloud.org too, but I see that puppet is borked over there (and I use it for other wmcs projects successfully)
[16:46:47] which specific instance is that?
[16:47:04] ores-staging01.eqiad1.wikimedia.cloud
[16:47:47] https://openstack-browser.toolforge.org/server/ores-staging-01.ores-staging.eqiad1.wikimedia.cloud
[16:47:51] the instance name has a dash
[16:47:56] that you're missing
[16:48:37] * elukey plays sad_trombone.wav
[16:48:50] yes I am in now
[16:48:53] thanks :)
[16:49:01] I'll open a task for puppet on the bastion anyway :)
[16:49:22] I'll check that others can ssh in too
[16:50:13] thanks Majavah!
[16:52:20] thanks also mutante for the brainbounce :)
[16:53:00] opened https://phabricator.wikimedia.org/T281176
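As an aside, mutante's ProxyCommand above corresponds to a ~/.ssh/config stanza roughly like the one below. The username is his, and this variant matches full instance FQDNs rather than the short names his ProxyCommand expands; the one gotcha from the debugging above is that instance FQDNs embed the project name:

```
# Illustrative ~/.ssh/config stanza; swap in your own shell username
# for "dzahn". Instance FQDNs include the project, e.g.
#   ores-staging-01.ores-staging.eqiad1.wikimedia.cloud
Host *.eqiad1.wikimedia.cloud
    ProxyCommand ssh -W %h:%p dzahn@restricted.bastion.wmcloud.org
```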
[17:00:17] elukey: ah, I think I know who to ping for that
[17:00:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/675124/11/modules/ssh/manifests/server.pp
[17:01:00] something about "# Allow Cloud VPS restricted bastions to override it for Cumin"
[18:18:33] elukey: bastionhost fixed! (2 separate issues)
[20:53:05] jbond42: do you have a workaround in mind for testing with Bullseye builds? Right now https://gerrit.wikimedia.org/r/c/operations/puppet/+/677496 causes puppet to bail out early on.
[20:55:30] (pursuing this in response to godog's request for test builds on cloudvps)
[20:57:25] andrewbogott: this is fixed in base-files 11.1, which will migrate to bullseye in four days: https://packages.qa.debian.org/b/base-files/news/20210410T203326Z.html
[20:57:54] moritzm: ok, I'll just wait until Friday to make my build. Thanks!
[20:58:19] yeah, if you want to build a cloud vps image better wait; otherwise you can edit /etc/debian_version manually for a one-off host
[20:58:59] I'm patient if filippo is
[22:52:52] razzi: want me to merge 'sqoop: switch to single grouped_wikis.csv'?
[22:53:05] andrewbogott: that'd be great, but no rush
[22:53:36] done
[22:54:49] great, thanks andrewbogott.
[22:54:49] I'm curious, did we submit at the same time, or is something else going on with puppet?
[22:55:39] Just both submitted at the same time
[22:56:06] so the manual merge wanted to merge both patches at once
[22:56:38] cool
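Circling back to moritzm's one-off suggestion above: on a single test host the workaround amounts to something like the following. The exact version string that base-files 11.1 ships is an assumption here, so treat this as a sketch and revert it once the real package migrates:

```bash
# One-off hack for a single bullseye test host, per moritzm above: write
# the release number to /etc/debian_version so puppet stops bailing out.
# "11.0" is an assumed value for what base-files 11.1 ships; undo this
# once base-files >= 11.1 is actually installed.
echo '11.0' | sudo tee /etc/debian_version
```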