[00:01:43] Okay, so next step for me was to figure out whether it's the query itself or the endpoint being talked to, and it looks like if I use the query `kafka_burrow_partition_lag{ exported_cluster="main-eqiad", group="change-prop-cirrusSearchElasticaWrite", topic="eqiad.mediawiki.job.cirrusSearchElasticaWrite"}` on `prometheus.svc.codfw.wmnet`
[00:05:47] ^ Disregard that last line, I got mixed up. I'd actually run the `codfw` query so that didn't prove anything I didn't already know
[10:43:02] silly one-line fix for review if anyone has a sec https://gerrit.wikimedia.org/r/c/operations/puppet/+/603960
[10:44:49] done
[10:48:48] thanks!
[11:46:54] <_joe_> hah that's a classic
[11:47:24] <_joe_> it's a bit strange it wasn't caught by flake8
[11:56:53] Amir1: if the instance called "meet-auth" is the important one and the "jitsi" instance is just the client.. then why does jitsi save /srv/meet-auth with the cloned repo but on meet-auth /srv is empty?
[11:57:05] s/save/have
[12:00:02] mutante: the meet-auth is the gate to https://meet-auth.wmflabs.org/create (this resides in the VM), the jitsi VM just gets the data from this VM
[12:00:18] yeah, I think the account manager bit is currently in my home :(
[12:00:23] I should have fixed it
[12:02:12] Amir1: ah, i see it in your home. and that version is also newer. so i figure /srv/meet-auth on jitsi is not used
[12:02:44] also the remote there is already gerrit, cool
[12:06:40] Amir1: shouldn't i expect to see a uwsgi process on one of the instances?
[12:07:07] The jitsi VM uses the auth manager to receive the result and create accounts locally but it's secondary
[12:07:20] mutante: it's currently a screen, I should fix that too :(
[12:08:28] basically the meet-auth VM needs to run https://github.com/wikimedia/wikimedia-meet-accountmanager/blob/master/server.py
[12:08:46] and the jitsi VM needs to run https://github.com/wikimedia/wikimedia-meet-accountmanager/blob/master/client.py
[12:08:56] ok.. i see your screen but no uwsgi process
[12:09:03] so they can talk to each other
[12:09:25] fixing the uwsgi shouldn't be too hard
[12:10:34] i can apply the minimal role on meet-auth
[12:10:45] and see what happens
[12:13:37] Sure. I'd love to see it
[12:13:44] ok
[12:17:53] moritzm: is it ok with you if i reimage sretest1002 ~now?
[12:19:56] sure, please go ahead! make sure to add some content to /srv, I doubt there's currently anything which the partman recipe can retain :-)
[12:21:39] good point, done :)
[13:41:54] i think i violated some constraints of wmf-auto-reimage. it died with:
[13:42:00] 12:56:10 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to detect_init
[13:42:44] weird, that was a backward-compatibility check to distinguish between systemd and init
[13:42:47] let me check the logs
[13:42:50] cumin1001?
[13:42:54] yep
[13:43:01] 12:56:10 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to detect_init
[13:43:06] er
[13:43:09] /var/log/wmf-auto-reimage/202006091224_kormat_33963_sretest1002_eqiad_wmnet.log
[13:43:31] unrecognized option '--no-headers'
[13:43:36] BusyBox v1.30.1 (Debian 1:1.30.1-4) multi-call binary.
[13:43:39] you're in a busybox
[13:43:51] from the end of 202006091224_kormat_33963_sretest1002_eqiad_wmnet_cumin.out
[13:44:09] volans: i ended up manually rebooting the host a bunch of times as i was fixing stuff,
[13:44:24] i suspect the tool doesn't like that
[13:44:31] my main question is: how do i recover from here?
[13:45:51] what's the current status?
[13:45:58] is an OS installed?
[13:46:25] *base OS
[13:47:28] yes
[13:47:56] but e.g. `puppet agent -t` gives `Exiting; no certificate found and waitforcert is disabled`
[13:47:57] ok so if you just want it to resume from running puppet and such, you can re-run it with the --no-pxe option
[13:48:05] ah
[13:48:51] that should be covered, it will delete the current cert and re-create one IIRC
[13:49:06] heh. `--no-pxe` still asks for the ipmi password :)
[13:49:11] anyway, trying that
[13:49:16] lol
[13:49:20] that's being picky :D
[13:49:52] it might complain about the host not existing in puppetdb hence can't verify it
[13:49:58] if that's the case use --new
[13:50:02] 13:49:23 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to icinga_downtime
[13:50:13] volans: what's the difference between --no-verify and --new?
[13:51:34] no-verify still performs the verification but doesn't fail if it can't verify it, new skips the verification completely
[13:51:47] the first downtime is skipped if either --new or --no-downtime are passed
[13:52:09] i've added `--no-downtime`, trying again
[13:52:15] thanks volans :)
[13:52:49] looks like this is working - it's made it as far as the first puppet run \o/
[13:52:53] volans: thanks! :)
[13:52:56] so basically if you run and fail, it's almost like a new host
[13:53:03] so --new is quicker than thinking of the other stuff :D
[14:28:24] <_joe_> or you know, the automation should really figure that out for you
[14:29:39] jbond42: lol, in my testing, I found/created a bug
[14:30:03] https://phabricator.wikimedia.org/P11435
[14:30:06] fixing
[14:31:55] lol :)
[14:55:25] marostegui, moritzm: fully-automated reinstall of sretest1002 with reuse-parts completed successfully. it works \o/
[14:55:45] kormat: \o/
[14:55:56] kormat: congratulations! your reward is owning that code forever
[14:56:57] Pyrrhic victory, i guess :)
[14:57:10] kormat: \o/ excellent news!!!
[14:58:20] cdanis: check out https://gerrit.wikimedia.org/r/604020
[14:58:51] kormat: very nice!
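The flag semantics volans explains above (resume with `--no-pxe`; `--new` skips puppetdb verification *and* the first icinga downtime, while `--no-downtime` only skips the latter) can be sketched as a small shell snippet. This is a hypothetical illustration, not part of wmf-auto-reimage itself; the state variables and the final echoed command are made up for the example:

```shell
# Hypothetical sketch of choosing wmf-auto-reimage recovery flags.
# These variables describe the state of the half-reimaged host:
os_installed=yes   # a base OS is on disk, so skip the PXE reinstall
in_puppetdb=no     # host unknown to puppetdb, verification would fail

flags=""
# --no-pxe resumes from the first puppet run instead of reinstalling
[ "$os_installed" = yes ] && flags="$flags --no-pxe"
if [ "$in_puppetdb" = no ]; then
  # --new skips the puppetdb verification AND the first icinga downtime
  flags="$flags --new"
else
  # --no-downtime only skips the first icinga downtime
  flags="$flags --no-downtime"
fi
echo "wmf-auto-reimage$flags sretest1002.eqiad.wmnet"
```

As the conversation concludes, if the run already failed once, `--new` is usually the quickest choice since the host looks like a new one anyway.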
[14:59:49] let's also try to get this merged in d-i, so that you can also become the upstream maintainer of partman d-i code :-)
[15:00:14] so like _joe_ is the zuul expert, kormat is now the partman expert, I like it
[15:00:47] <_joe_> moritzm: wow that escalated *fast*
[15:01:26] * volans hands over the PartmanExpert badge to kormat
[15:01:36] that was coined for the occasion
[15:01:52] using the heaviest element on earth
[15:02:15] as a side product it's also radioactive... but that's a plus, glows in the dark! :D
[15:02:42] volans: thanks, i think. ;)
[15:03:10] moritzm: i'm up for pushing it upstream in a few weeks, when we've used it a bit
[15:03:28] now, any sane person upstream would say hell no, but if they're working on partman, they're not sane. so there's a chance. :)
[15:04:14] sure, makes sense. It doesn't seem entirely unlikely to get this merged, it's certainly useful/missing functionality
[15:04:38] ping me when this has seen more testing in the wild and we can discuss next steps to submit it
[15:04:44] +1
[16:04:40] For cassandra SSL expiration, is it acceptable to delete the old keys in `/srv/private` and run `cassandra-ca-manager` to regenerate them?
[16:12:12] i am not sure if we want to switch to using cergen because cergen was "originally based on cassandra-ca-manager". note the root ca file is also in srv/private along with the certs. _if_ you use cassandra-ca-manager there is a ticket that the certs should have the FQDN and not just the short host name at https://phabricator.wikimedia.org/T141541
[16:14:37] the certs created with cergen and using the Puppet CA are all under /srv/private/modules/secret/secrets/certificates while the cassandra certs are kind of a special case under /srv/private/modules/secret/secrets/cassandra/restbase and with their own CA
[16:16:05] maybe it would make sense to switch to https://wikitech.wikimedia.org/wiki/Cergen
[17:02:58] mutante: interesting, thanks for the context. it looks like cassandra-ca-manager was used in January for some new hosts
[23:00:46] https://crt.sh/?id=2109629862 - we've an OV cert?
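Before deleting and regenerating the keys discussed above, it's worth confirming which certs are actually near expiry. A minimal openssl sketch of that check follows; the throwaway self-signed cert, its path, and its CN are invented here purely so the example is self-contained (the real cassandra certs live under /srv/private as described above):

```shell
# Illustrative only: create a throwaway self-signed cert as a stand-in
# for a cassandra cert, then inspect how close it is to expiry.
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -subj "/CN=restbase-demo.example.wmnet" 2>/dev/null

# Print the expiry (notAfter) date of the cert
openssl x509 -enddate -noout -in /tmp/demo.crt

# Exit 0 if the cert is still valid 7 days from now, non-zero otherwise;
# useful as a guard before bothering to regenerate anything
openssl x509 -checkend $((7 * 24 * 3600)) -noout -in /tmp/demo.crt \
  && echo "not expiring within 7 days"
```

The same `-enddate`/`-checkend` invocations work against any PEM cert, including ones fetched from crt.sh or pulled out of /srv/private.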