[07:07:16] <_joe_> elukey: so, jayme and I will fix a couple things with etcd
[07:07:22] <_joe_> then work on the zk transition
[07:09:54] +1
[07:10:25] <_joe_> and when I say "jayme and I" I mean jayme
[07:21:03] disclaimer: the procedure that I added in the task for the zk swap of conf2001/conf2004 is not super clean, but I didn't find a better one since zk doesn't allow (in v3.4) reloading members of the cluster on the fly
[07:21:51] so please review it before applying :D
[07:21:59] (we can chat about it in here if you want)
[07:25:42] <_joe_> sure, we'll finish the etcd part first
[07:35:07] <_joe_> jayme: let's start replication?
[07:35:36] 1sec. nginx did not pick up the new cert after the puppet run
[07:36:03] <_joe_> that's by design
[07:36:08] ah
[07:36:17] <_joe_> tlsproxy::instance lets you choose how to restart nginx
[07:36:35] It seems to have tried restarting, though
[07:36:46] while the cert was not there
[07:37:13] <_joe_> uhm that's a bug then
[07:37:25] it tried reloading, sorry
[07:37:31] <_joe_> probably in our code
[07:37:41] <_joe_> I'll take a look
[07:38:18] reload doesn't seem to be enough in that case
[07:39:09] <_joe_> yeah possibly
[07:39:22] <_joe_> because the code is correct, the cert should be installed before the reload is issued
[07:40:04] Yes. It's just that the reload will fail then
[07:41:03] <_joe_> ok
[07:41:58] all green now. We can configure replication
[07:42:07] you said there is a non-obvious detail?
[07:42:43] <_joe_> yes
[07:43:02] <_joe_> https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication
[07:43:26] <_joe_> as you can read there, etcdmirror keeps track of the state of replication under /__replication/$destination_prefix
[07:43:32] <_joe_> obviously now you don't have it
[07:43:41] <_joe_> so when you enable the replica it will fail
[07:43:57] <_joe_> unless you start etcdmirror first with --reload
[07:44:09] <_joe_> so my suggestion on how to proceed is:
[07:44:14] <_joe_> * downtime the host
[07:45:09] <_joe_> * run puppet; replication will fail
[07:45:35] <_joe_> * run etcdmirror with the --reload parameter from the cli; once it's done the initial dump and load, stop it and run puppet again
[07:45:41] <_joe_> * replication should be ok
[07:46:15] <_joe_> I'm thinking of writing a script that tests if /__replication/$destination_prefix is not present and runs a dump and load
[07:46:34] <_joe_> with the same parameters as the replication instance
[07:48:47] Okay. "The host" in your steps is the one I choose to enable replication on, right? That can, but must not, be the same one that is configured as replication::dst_url?
[07:49:11] s/must not/does not have to be/
[07:49:30] <_joe_> does not have to be, but I'm not sure about firewalls :)
[07:50:12] yeah. I will obviously choose that one anyways :D
[07:54:54] <_joe_> ok so, I just wrote a patch
[07:55:12] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/682498
[07:55:33] <_joe_> lemme check it for a sec, but that should give you a script to reload the cluster before you enable replication via puppet
[07:57:29] <_joe_> so maybe we don't even need the downtime :)
[07:57:38] okay. 2005 is downtimed anyways
[07:58:26] <_joe_> https://puppet-compiler.wmflabs.org/compiler1001/29187/conf2004.codfw.wmnet/fulldiff.html
[07:59:15] Looks about right to me
[07:59:31] <_joe_> ok so lemme merge it, and then we can try running it
[07:59:37] ack
[08:01:08] <_joe_> /usr/local/sbin/reload-etcdmirror-conftool-eqiad-wmnet is now on conf2005
[08:01:22] <_joe_> let me open a root tmux so we can operate together
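A minimal sketch of the guard script _joe_ describes, assuming an etcd v2-style key layout. Only the /__replication/$destination_prefix convention and the --reload flag come from the discussion above; the default prefix value and the bare etcdmirror invocation are illustrative placeholders, not the actual contents of reload-etcdmirror-conftool-eqiad-wmnet:

```bash
#!/bin/bash
# Sketch: run the initial dump-and-load only when the replication
# watermark key is absent. /__replication/<prefix> and --reload are
# from the discussion above; everything else is a placeholder.
set -euo pipefail

DST_PREFIX="${1:-/conftool}"   # hypothetical destination prefix

if etcdctl get "/__replication${DST_PREFIX}" >/dev/null 2>&1; then
    echo "replication state present under /__replication${DST_PREFIX}; nothing to do"
    exit 0
fi

echo "no state under /__replication${DST_PREFIX}; doing the initial dump and load"
# The real script would invoke etcdmirror with the same src/dst parameters
# as the puppetized replication instance, plus --reload; shown schematically.
etcdmirror --reload
```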
[08:47:50] https://www.confluent.io/blog/kafka-without-zookeeper-a-sneak-peek/ is also nice and related to today's maintenance :D
[09:02:46] <_joe_> elukey: kraft makes me think of gross prepared food :P
[09:03:02] :D
[09:19:17] ryankemper: re: raid0, you need to include partman/raid0.cfg before the raid0 recipes
[09:19:30] there are other examples in netboot.cfg
[09:20:37] it'd be ideal if CI validated netboot.cfg of course, not sure if there's a simple way to ensure e.g. a regexp passes
[10:04:26] something like "if there's partman/raid0- on a line then there must also be partman/raid0.cfg before it"; I suspect it isn't low-hanging fruit to implement, but I would love to be wrong
[15:22:53] godog: that totally makes sense, thanks for looking into it!
[15:23:02] I'll take a note to see if there's a simple way to have CI validate it
[15:41:55] ryankemper: sure, no worries, it isn't super intuitive for sure heh
[16:19:32] could someone familiar with mod_security double-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/681244 for me?
[16:35:56] arturo: o/ - quick question if you have a min - when I add users to a wmcs project, do I need to do anything to allow ssh access, or is there a sync happening once every X time?
[16:39:13] elukey: should be instant
[16:41:28] Majavah: I thought it needed a puppet run to push users, but after me and others were added to ores-staging we were not able to ssh to the instances, this is why I was asking
[16:41:54] on prod it needs a puppet run, but on wmcs it's queried from ldap and instant
[16:42:27] what error are you getting? "permission denied (publickey)"?
[16:43:06] connection closed from remote host during key exchange
[16:44:04] "open failed: administratively prohibited: open failed" etc..
[16:44:47] double check your instance names and fqdns, that sounds like issues when hopping from the bastion to the actual vm
[16:45:40] elukey: my proxy command for VPS looks like this: ProxyCommand ssh -W %h.eqiad1.wikimedia.cloud:%p dzahn@restricted.bastion.wmcloud.org
[16:45:53] maybe the ".cloud" and "restricted" parts there
[16:46:13] trying to ssh to an instance that does not exist gives "channel 0: open failed: administratively prohibited: open failed" to me, so that's my best guess on what's happening there
[16:46:21] host names changed
[16:46:41] old names work for hosts that originally had them
[16:46:45] mutante: yeah I am using restricted.bastion.wmcloud.org too, but I see that puppet is borked over there (and I use it for other wmcs projects successfully)
[16:46:47] which specific instance is that?
[16:47:04] ores-staging01.eqiad1.wikimedia.cloud
[16:47:47] https://openstack-browser.toolforge.org/server/ores-staging-01.ores-staging.eqiad1.wikimedia.cloud
[16:47:51] the instance name has a dash
[16:47:56] that you're missing
[16:48:37] * elukey plays sad_trombone.wav
[16:48:50] yes I am in now
[16:48:53] thanks :)
[16:49:01] I'll open a task for puppet on the bastion anyway :)
[16:49:22] I'll check that others can ssh in too
[16:50:13] thanks Majavah!
[16:52:20] thanks also mutante for the brainbounce :)
[16:53:00] opened https://phabricator.wikimedia.org/T281176
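As an aside, mutante's ProxyCommand above corresponds to a ~/.ssh/config stanza roughly like the one below. The username is his, and this variant matches full instance FQDNs rather than the short names his ProxyCommand expands; the one gotcha from the debugging above is that instance FQDNs embed the project name:

```
# Illustrative ~/.ssh/config stanza; swap in your own shell username
# for "dzahn". Instance FQDNs include the project, e.g.
#   ores-staging-01.ores-staging.eqiad1.wikimedia.cloud
Host *.eqiad1.wikimedia.cloud
    ProxyCommand ssh -W %h:%p dzahn@restricted.bastion.wmcloud.org
```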
[17:00:17] elukey: ah, I think I know who to ping for that
[17:00:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/675124/11/modules/ssh/manifests/server.pp
[17:01:00] something about "# Allow Cloud VPS restricted bastions to override it for Cumin"
[18:18:33] elukey: bastionhost fixed! (2 separate issues)
[20:53:05] jbond42: do you have a workaround in mind for testing with Bullseye builds? Right now https://gerrit.wikimedia.org/r/c/operations/puppet/+/677496 causes puppet to bail out early on.
[20:55:30] (pursuing this in response to godog's request for test builds on cloudvps)
[20:57:25] andrewbogott: this is fixed in base-files 11.1, which will migrate to bullseye in four days: https://packages.qa.debian.org/b/base-files/news/20210410T203326Z.html
[20:57:54] moritzm: ok, I'll just wait until Friday to make my build. Thanks!
[20:58:19] yeah, if you want to build a cloud vps image better wait; otherwise you can edit /etc/debian_version manually for a one-off host
[20:58:59] I'm patient if filippo is
[22:52:52] razzi: want me to merge 'sqoop: switch to single grouped_wikis.csv'?
[22:53:05] andrewbogott: that'd be great, but no rush
[22:53:36] done
[22:54:49] great, thanks andrewbogott.
[22:54:49] I'm curious, did we submit at the same time, or is something else going on with puppet?
[22:55:39] Just both submitted at the same time
[22:56:06] so the manual merge wanted to merge both patches at once
[22:56:38] cool
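Circling back to moritzm's one-off suggestion above: on a single test host the workaround amounts to something like the following. The exact version string that base-files 11.1 ships is an assumption here, so treat this as a sketch and revert it once the real package migrates:

```bash
# One-off hack for a single bullseye test host, per moritzm above: write
# the release number to /etc/debian_version so puppet stops bailing out.
# "11.0" is an assumed value for what base-files 11.1 ships; undo this
# once base-files >= 11.1 is actually installed.
echo '11.0' | sudo tee /etc/debian_version
```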