[00:01:43] Okay, so next step for me was to figure out whether it's the query itself or the endpoint being talked to, and it looks like if I use the query `kafka_burrow_partition_lag{ exported_cluster="main-eqiad", group="change-prop-cirrusSearchElasticaWrite", topic="eqiad.mediawiki.job.cirrusSearchElasticaWrite"}` on `prometheus.svc.codfw.wmnet`
[00:05:47] ^ Disregard that last line, I got mixed up. I'd actually run the `codfw` query so that didn't prove anything I didn't already know
[10:43:02] silly one-line fix for review if anyone has a sec https://gerrit.wikimedia.org/r/c/operations/puppet/+/603960
[10:44:49] done
[10:48:48] thanks!
[11:46:54] <_joe_> hah that's a classic
[11:47:24] <_joe_> it's a bit strange it wasn't caught by flake8
[11:56:53] Amir1: if the instance called "meet-auth" is the important one and the "jitsi" instance is just the client.. then why does jitsi save /srv/meet-auth with the cloned repo but on meet-auth /srv is empty?
[11:57:05] s/save/have
[12:00:02] mutante: the meet-auth is the gate to https://meet-auth.wmflabs.org/create (this resides in the VM), the jitsi VM just gets the data from this VM
[12:00:18] yeah, I think the account manager bit is currently in my home :(
[12:00:23] I should have fixed it
[12:02:12] Amir1: ah, i see it in your home. and that version is also newer. so i figure /srv/meet-auth on jitsi is not used
[12:02:44] also the remote there is already gerrit, cool
[12:06:40] Amir1: shouldn't i expect to see a uwsgi process on one of the instances?
[12:07:07] The jitsi VM uses the auth manager to receive the result and create accounts locally but it's secondary
[12:07:20] mutante: it's currently a screen, I should fix that too :(
[12:08:28] basically the meet-auth VM needs to run https://github.com/wikimedia/wikimedia-meet-accountmanager/blob/master/server.py
[12:08:46] and the jitsi VM needs to run https://github.com/wikimedia/wikimedia-meet-accountmanager/blob/master/client.py
[12:08:56] ok.. i see your screen but no uwsgi process
[12:09:03] so they can talk to each other
[12:09:25] fixing the uwsgi shouldn't be too hard
[12:10:34] i can apply the minimal role on meet-auth
[12:10:45] and see what happens
[12:13:37] Sure. I'd love to see it
[12:13:44] ok
[12:17:53] moritzm: is it ok with you if i reimage sretest1002 ~now?
[12:19:56] sure, please go ahead! make sure to add some content to /srv, I doubt there's currently anything which the partman recipe can retain :-)
[12:21:39] good point, done :)
[13:41:54] i think i violated some constraints of wmf-auto-reimage. it died with:
[13:42:00] 12:56:10 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to detect_init
[13:42:44] weird, that was a backward-compatibility check to distinguish between systemd and init
[13:42:47] let me check the logs
[13:42:50] cumin1001?
[13:42:54] yep
[13:43:01] 12:56:10 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to detect_init
[13:43:06] er
[13:43:09] /var/log/wmf-auto-reimage/202006091224_kormat_33963_sretest1002_eqiad_wmnet.log
[13:43:31] unrecognized option '--no-headers'
[13:43:36] BusyBox v1.30.1 (Debian 1:1.30.1-4) multi-call binary.
[13:43:39] you're in a busybox
[13:43:51] from the end of 202006091224_kormat_33963_sretest1002_eqiad_wmnet_cumin.out
[13:44:09] volans: i ended up manually rebooting the host a bunch of times as i was fixing stuff,
[13:44:24] i suspect the tool doesn't like that
[13:44:31] my main question is: how do i recover from here?
[13:45:51] what's the current status?
[13:45:58] is an OS installed?
[13:46:25] *base OS
[13:47:28] yes
[13:47:56] but e.g. `puppet agent -t` gives `Exiting; no certificate found and waitforcert is disabled`
[13:47:57] ok so if you just want it to resume from running puppet and such, you can re-run it with the --no-pxe option
[13:48:05] ah
[13:48:51] that should be covered, it will delete the current cert and re-create one IIRC
[13:49:06] heh. `--no-pxe` still asks for the ipmi password :)
[13:49:11] anyway, trying that
[13:49:16] lol
[13:49:20] that's being picky :D
[13:49:52] it might complain about the host not existing in puppetdb hence can't verify it
[13:49:58] if that's the case use --new
[13:50:02] 13:49:23 | sretest1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to icinga_downtime
[13:50:13] volans: what's the difference between --no-verify and --new?
[13:51:34] no-verify still performs the verification but doesn't fail if it can't verify it, new skips the verification completely
[13:51:47] the first downtime is skipped if either --new or --no-downtime are passed
[13:52:09] i've added `--no-downtime`, trying again
[13:52:15] thanks volans :)
[13:52:49] looks like this is working - it's made it as far as the first puppet run \o/
[13:52:53] volans: thanks! :)
[13:52:56] so basically if you run and fail, it's almost like a new host
[13:53:03] so --new is quicker than thinking of the other stuff :D
[14:28:24] <_joe_> or you know, the automation should really figure that out for you
[14:29:39] jbond42: lol, in my testing, I found/created a bug
[14:30:03] https://phabricator.wikimedia.org/P11435
[14:30:06] fixing
[14:31:55] lol :)
[14:55:25] marostegui, moritzm: fully-automated reinstall of sretest1002 with reuse-parts completed successfully. it works \o/
[14:55:45] kormat: \o/
[14:55:56] kormat: congratulations! your reward is owning that code forever
[14:56:57] Pyrrhic victory, i guess :)
[14:57:10] kormat: \o/ excellent news!!!
[14:58:20] cdanis: check out https://gerrit.wikimedia.org/r/604020
[14:58:51] kormat: very nice!
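The flag semantics volans explains above (resume with `--no-pxe`; `--new` skips puppetdb verification *and* the first icinga downtime, while `--no-downtime` only skips the latter) can be sketched as a small shell snippet. This is a hypothetical illustration, not part of wmf-auto-reimage itself; the state variables and the final echoed command are made up for the example:

```shell
# Hypothetical sketch of choosing wmf-auto-reimage recovery flags.
# These variables describe the state of the half-reimaged host:
os_installed=yes   # a base OS is on disk, so skip the PXE reinstall
in_puppetdb=no     # host unknown to puppetdb, verification would fail

flags=""
# --no-pxe resumes from the first puppet run instead of reinstalling
[ "$os_installed" = yes ] && flags="$flags --no-pxe"
if [ "$in_puppetdb" = no ]; then
  # --new skips the puppetdb verification AND the first icinga downtime
  flags="$flags --new"
else
  # --no-downtime only skips the first icinga downtime
  flags="$flags --no-downtime"
fi
echo "wmf-auto-reimage$flags sretest1002.eqiad.wmnet"
```

As the conversation concludes, if the run already failed once, `--new` is usually the quickest choice since the host looks like a new one anyway.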
[14:59:49] let's also try to get this merged in d-i, so that you can also become the upstream maintainer of partman d-i code :-)
[15:00:14] so like _joe_ is the zuul expert, kormat is now the partman expert, I like it
[15:00:47] <_joe_> moritzm: wow that escalated *fast*
[15:01:26] * volans hands over the PartmanExpert badge to kormat
[15:01:36] that was coined for the occasion
[15:01:52] using the heaviest element on earth
[15:02:15] as a side product it's also radioactive... but that's a plus, glows in the dark! :D
[15:02:42] volans: thanks, i think. ;)
[15:03:10] moritzm: i'm up for pushing it upstream in a few weeks, when we've used it a bit
[15:03:28] now, any sane person upstream would say hell no, but if they're working on partman, they're not sane. so there's a chance. :)
[15:04:14] sure, makes sense. It doesn't seem entirely unlikely to get this merged, it's certainly useful/missing functionality
[15:04:38] ping me when this has seen more testing in the wild and we can discuss next steps to submit it
[15:04:44] +1
[16:04:40] For cassandra SSL expiration, is it acceptable to delete the old keys in `/srv/private` and run `cassandra-ca-manager` to regenerate them?
[16:12:12] i am not sure if we want to switch to using cergen because cergen was "originally based on cassandra-ca-manager". note the root ca file is also in srv/private along with the certs. _if_ you use cassandra-ca-manager there is a ticket that the certs should have the FQDN and not just the short host name at https://phabricator.wikimedia.org/T141541
[16:14:37] the certs created with cergen and using the Puppet CA are all under /srv/private/modules/secret/secrets/certificates while the cassandra certs are kind of a special case under /srv/private/modules/secret/secrets/cassandra/restbase and with their own CA
[16:16:05] maybe it would make sense to switch to https://wikitech.wikimedia.org/wiki/Cergen
[17:02:58] mutante: interesting, thanks for the context. it looks like cassandra-ca-manager was used in January for some new hosts
[23:00:46] https://crt.sh/?id=2109629862 - we've an OV cert?
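Before deleting and regenerating the keys discussed above, it's worth confirming which certs are actually near expiry. A minimal openssl sketch of that check follows; the throwaway self-signed cert, its path, and its CN are invented here purely so the example is self-contained (the real cassandra certs live under /srv/private as described above):

```shell
# Illustrative only: create a throwaway self-signed cert as a stand-in
# for a cassandra cert, then inspect how close it is to expiry.
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -subj "/CN=restbase-demo.example.wmnet" 2>/dev/null

# Print the expiry (notAfter) date of the cert
openssl x509 -enddate -noout -in /tmp/demo.crt

# Exit 0 if the cert is still valid 7 days from now, non-zero otherwise;
# useful as a guard before bothering to regenerate anything
openssl x509 -checkend $((7 * 24 * 3600)) -noout -in /tmp/demo.crt \
  && echo "not expiring within 7 days"
```

The same `-enddate`/`-checkend` invocations work against any PEM cert, including ones fetched from crt.sh or pulled out of /srv/private.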