[00:34:34] so I accidentally created a VM named registry2004.eqiad.wmnet in codfw
[00:35:07] the VM isn't in puppet yet, so the decom script doesn't work
[00:35:25] if I try to add it to puppet, https://gerrit.wikimedia.org/r/c/operations/puppet/+/668571 jenkins is correctly complaining
[00:38:53] I think it might be a bug in the decom script, but my general question still remains as to what to do
[03:12:58] legoktm: (may be a late comment) but can you decom it by ssh into the host itself?
[05:51:34] phamhi: Please make sure to run apt-get update before doing the upgrade on clouddb* hosts, so the new mariadb version is also installed :)
[08:15:21] legoktm: simply remove it with gnt-instance (needs to be run from the ganeti master in codfw): https://wikitech.wikimedia.org/wiki/Ganeti#Delete_a_VM
[08:15:51] Oh, is that good enough? What about netbox and stuff?
[08:38:06] depends on how you created it, if you used the cookbook to create the instance and it completed, then you'll in fact also need to remove the entry from netbox, yes
[08:41:23] if all else fails, burn netbox to the ground, and tell volans a bug in his code did it
[08:42:24] Yeah I created it with the cookbook
[08:42:32] <_joe_> kormat: you're doing it wrong
[08:42:36] But didn't do the initial install or puppet runs
[08:42:49] <_joe_> kormat: you just need to upgrade netbox by one patch version
[08:43:07] <_joe_> that usually obtains the goal of destroying it, also offering you plausible deniability
[08:43:37] ah haha
[08:43:57] <_joe_> (big props to cas and riccardo for making us barely notice any upgrade)
[09:35:58] <_joe_> Majavah: jayme: around?
[09:36:14] somewhat yes
[09:36:22] <_joe_> I can wait :)
[09:37:49] I'm in that "somewhat" state until about ~12:30 UTC
[09:39:19] <_joe_> ok, let's reconvene later then, but thinking about it, jayme and I will need to do some experimenting with etcd on buster, so we might relieve you of the task, if you're not doing it for learning :)
[09:39:55] <_joe_> so: if you want to learn how to set up an etcd server, I'm happy to assist; otherwise, I'm happy to pick up that work from you
[09:40:51] I'm not around ~11:45 - 13:00 UTC, anything else is fine
[09:40:55] I would like to learn that if it doesn't cause too much trouble
[09:41:07] anything after ~12:30 works for me
[10:01:35] <_joe_> Majavah: not at all
[10:20:37] <_joe_> elukey, effie you both brought up https://phabricator.wikimedia.org/T273950 for the memcached image. The rationale in that task is valid, but doesn't really apply to stuff running in containers IMHO
[10:21:55] _joe_ yep yep I just mentioned it in case it was relevant, didn't really have a strong opinion
[10:22:03] in the image itself, you will avoid the chown as well
[10:22:05] I agree that in containers nobody is fine
[10:22:39] <_joe_> effie: the disadvantage is we won't be able to use numeric uids in USER consistently
[10:22:47] <_joe_> that creates problems for pod security policies
[10:23:58] if that is the case, then sure
[10:24:43] the argument was for simplicity mainly, but since there is a more important reason, good to know
[13:17:37] _joe_, jayme: hi?
[13:18:20] <_joe_> Majavah: gimme 10'
[13:19:39] <_joe_> in the meantime, give a look at profile::etcd::v3 in puppet
[13:31:53] <_joe_> ok, so, you will be setting up a one-node etcd cluster, without encryption or other complications
[13:32:33] <_joe_> that means you just need to fire up a VM with buster
[13:33:10] <_joe_> add profile::etcd::v3 to it
[13:33:23] <_joe_> and set some parameters in hiera
[13:33:29] sorry, took me a while longer. Back now
[13:34:40] I created deployment-etcd02 as Buster g2.cores1.ram2.disk20
[13:34:50] now waiting for it to boot up
[13:34:57] <_joe_> the tricky one can be discovery: so let's assume that your server is called "deployment-etcd02.deployment-prep.eqiad.wmflabs"
[13:36:03] <_joe_> then I think discovery should have the value: "name=http://deployment-etcd02.deployment-prep.eqiad.wmflabs:2380"
[13:36:40] that dns name won't exist anymore, since only .eqiad1.wikimedia.cloud names are created for new VMs
[13:36:46] <_joe_> you also need to set cluster_bootstrap to true
[13:36:52] <_joe_> ok, use that then :)
[13:37:01] waiting for the VM to boot up first
[13:37:08] <_joe_> muscle memory is hard to fight
[13:37:30] <_joe_> the next thing we'll need to do is to import data from the old VM
[13:37:56] <_joe_> it can be done with puppet, but I don't think that's worth it
[13:40:31] VM created, installing etcd now
[13:42:05] do I need other hiera values than discovery and cluster_bootstrap?
[13:43:17] Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::etcd::v3::cluster_name'
[13:43:21] <_joe_> yes
[13:43:22] ah, probably everything that the profile has
[13:43:30] <_joe_> see the docs for the profile
[13:43:55] <_joe_> you need to set a cluster name (use "beta" so that we further confuse the terminology :D)
[13:44:01] <_joe_> and also
[13:44:07] <_joe_> use_client_certs: false
[13:44:15] <_joe_> allow_from: 0.0.0.0
[13:44:34] <_joe_> there is no point in restricting things more tbh
[13:45:17] <_joe_> profile::etcd::v3::max_latency: 100
[13:45:37] <_joe_> profile::etcd::v3::adv_client_port: 2379
[13:45:50] <_joe_> the latency is irrelevant here, as we have no cluster
[13:46:02] why is that port different than the one in discovery?
[13:46:23] <_joe_> etcd uses one port for peer-to-peer communications
[13:46:31] <_joe_> that's 2380 by default
[13:46:35] ah
[13:46:42] <_joe_> and one to listen to clients, by default 2379
[13:47:22] <_joe_> now, in production we have a proxy in front of etcd, so we listen on 2379 but we /advertise/ to clients another port, the one of the proxy
[13:47:28] <_joe_> but in your case, no proxy
[13:48:03] okay
[13:48:53] next problem: it needs certificate files, and deployment-puppetmaster04 does not have cergen, not sure what beta uses for these
[13:50:58] <_joe_> yeah interesting problem :/
[13:52:25] <_joe_> I mean we clearly don't need them here, as we have no peers
[13:52:50] <_joe_> but OTOH I think we might want to start using cergen in beta too
[13:53:11] <_joe_> jbond42: or should we start out using the pki stuff you've been working on maybe?
[13:54:18] <_joe_> one obvious way to solve it would be to add a parameter to the profile allowing to use puppet certificates for peer communications
[13:54:55] _joe_: i have literally just finished building a pki service in cloud which could be used. (although by literally i mean 5 minutes ago)
[13:55:04] <_joe_> lol
[13:55:07] <_joe_> perfect timing
[13:55:18] the production one needs to be rebuilt although i don't think that would work here
[13:55:58] let's try it and see, i'll create a new intermediate for this do you have a preference for CN?
[13:56:23] <_joe_> so given it's all of deployment-prep where having a pki would be super useful
[13:57:10] <_joe_> the CN of the intermediate CA should probably be "deployment-prep.eqiad1.wikimedia.cloud"
[13:57:42] ack give me 5 mins
[13:57:49] <_joe_> sure :)
[13:58:16] <_joe_> Majavah: so we might actually use this installation as a pilot for a new way of providing certs :)
[13:58:23] ooh exciting
[13:58:34] * jayme grabs more popcorn
[14:00:16] can you see if you can reach pki-intermediate.pki.eqiad1.wikimedia.cloud 8888
[14:00:44] <_joe_> from the server or from the puppetmaster?
[14:00:54] from the server
[14:01:00] <_joe_> yes
[14:01:17] pki-intermediate.pki.eqiad1.wikimedia.cloud [172.16.5.134] 8888 (?) open
[14:02:38] <_joe_> jbond42: I have a few Qs prolly, once you're done
[14:04:58] sure
[14:07:44] <_joe_> jbond42: I suppose I need to use cfssl::cert to get a cert, correct?
[14:08:04] profile::pki::client
[14:08:13] <_joe_> ah ok
[14:09:19] <_joe_> ok, and that needs to be configured in hiera, uhm, not exactly the best UI for me, but I can work with it
[14:09:24] https://github.com/wikimedia/puppet/blob/production/hieradata/cloud/eqiad1/pki/common.yaml
[14:10:07] once in production they will all have sane defaults, then it's just a matter of adding profile::pki::client::certs:
[14:10:37] but you can also use cfssl::cert directly (documentation is still a bit lacking)
[14:10:46] _joe_: are you applying those somewhere or should I?
[14:11:01] <_joe_> Majavah: working on a patch to the profile for now
[14:11:35] <_joe_> jbond42: so I just need to override the certs for the host, correct?
[14:11:57] yes
[14:12:22] <_joe_> ok, I'll bake up a patch, and ask for your review
[14:12:26] ack
[14:12:52] <_joe_> jbond42: where do I find the CA file when I apply that profile?
[14:13:44] fyi this is what the certs look like https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/pki.yaml#L41-L43
[14:14:16] your label will be deployment_prep_eqiad1_wikimedia_cloud
[14:14:38] * jbond42 looking for the second q now
[14:16:29] _joe_: looking at the config, getting the CA bundle may be a pending feature
[14:17:18] <_joe_> jbond42: uhm but without the CA available, I can't use the certs for etcd I fear
[14:17:39] yes indeed (i'm just testing now)
[14:19:13] <_joe_> ok, then plan b
[14:20:14] sorry about that, however i'll look to add it now
[14:24:49] <_joe_> no problem!
[14:24:57] <_joe_> it's how we find out such issues
[14:25:17] :)
[14:25:19] <_joe_> Majavah: ok so I'm going another way, clearly the profile was written with only production use-cases in mind
[14:31:24] sure, what is your plan now?
[14:34:42] <_joe_> Majavah: using the puppet certs
[14:34:59] <_joe_> I'm writing a patch, I'll also add all hiera variables in the patch
[14:43:57] _joe_: do you have a specific reason to set those hiera values in ops/puppet instead of horizon?
[14:44:07] <_joe_> Majavah: visibility
[14:44:41] <_joe_> q: how do I refresh the cloud facts on the puppet compiler?
[14:45:04] no idea
[14:45:12] should I remove those values from horizon then?
[14:45:15] <_joe_> yeah it was a more general question :)
[14:45:40] PUPPET_MASTER=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs ./modules/puppet_compiler/files/compiler-update-facts
[14:45:43] <_joe_> Majavah: sure, as I think ops/puppet takes precedence anyways
[14:45:57] <_joe_> jbond42: thanks
[14:46:19] np you may need to change the puppetmaster, that was from my history
[14:46:31] that is indeed the current deployment-prep puppetmaster
[14:46:36] <_joe_> jbond42: no that's the current one
[14:46:40] ack
[14:47:13] <_joe_> Majavah: let me run the compiler, although I'll be honest, I'm not 100% confident merging such a change on a friday afternoon
[14:49:33] <_joe_> good news is that my change makes compilation work, so we can just cherry-pick it on deployment-prep for now
[14:50:08] {{doing}}
[14:52:14] <_joe_> Majavah: let me do it, sorry
[14:52:19] okay, sure
[14:52:26] <_joe_> I want to be able to fix issues as they happen
[15:03:53] <_joe_> etcd is currently working on that machine, as far as I can tell, with a minor quirk
[15:06:48] what kind of quirk?
[15:09:39] <_joe_> Majavah: that you can't connect to etcd from the etcd server without overriding what's in /etc/hosts
[15:10:08] <_joe_> curl https://deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud:2379/v2/keys will work from e.g. the puppetmaster
[15:10:20] <_joe_> but if you try it from the etcd server itself, it will fail
[15:10:29] <_joe_> because we only listen on the public IP
[15:10:33] <_joe_> and not on localhost
[15:12:13] <_joe_> so to test locally you need
[15:12:33] <_joe_> curl -k --resolve deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud:2379:$(dig +short deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud) https://deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud:2379/v2/keys
[15:38:05] are you still trying to figure that out or something else?
[15:49:53] _joe_: ^
[15:50:11] <_joe_> Majavah: no it's generally working
[15:50:27] <_joe_> now it just needs the data to be copied over from the other server
[15:50:38] ok, just thought that you were still working on that, thanks
[15:50:51] what next? somehow import data from deployment-etcd-01?
[15:55:49] deployment-etcd-01 is etcd v2.2.1, etcd02 is v3.2.26
[15:57:26] looking at the upgrade docs it might require hopping via 2.3 and 3.1, since 3.0 upgrade requires 2.3 and 3.2 no longer supports upgrading from 2 :/
[15:57:57] what's the best way to get those intermediary versions running for a short time?
[16:38:27] <_joe_> Majavah: sorry I was in another convo
[16:38:51] <_joe_> Majavah: you need to just copy stuff from one v2 datastore to the other
[16:39:09] <_joe_> I can take care of that, but there are a few dump&load tools for doing so
[16:43:45] _joe_: if you can that works with me, I can also try to take care of that if you give pointers to the right tools
[16:43:59] anyways dinner time for me now, afk
[17:24:52] _joe_: I see the conftool data on etcd02, so I'm assuming you did the import, thank you!!
[17:42:05] does someone remember why IABot causes 415s ?
[17:44:05] <_joe_> Majavah: barely of the data needed
[17:44:28] <_joe_> but yes, you should be able to switch servers in mediawiki-config
[17:46:05] and etcd_host hiera value set on horizon for all of deployment-prep I'd imagine
[17:46:31] effie: yeah, it tries to fetch action=raw for wikidata pages, which don't support that
[17:46:33] https://phabricator.wikimedia.org/T269914
[17:46:36] <_joe_> prolly, yes, although it should not be used
[17:46:38] yeah found it
[17:47:31] I'll switch it anyways since it pointing to a nonexistent server does not sound desirable
[17:58:56] I have a mw-config patch but those aren't deployed on fridays or weekends and Jenkins replaces all local changes
[17:59:06] thanks for helping, really appreciate it
[18:38:18] Majavah: if the mw-config patch is beta-only, we can deploy it (as long as someone is going to keep an eye on it for a bit to ensure it didn't break)
[18:39:31] legoktm: it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668751, in theory I can look but it's hard because logstash-beta is not receiving any events
[18:40:29] elukey: thanks for the merge :)
[18:41:13] legoktm: thank you for the patch! I didn't notice it, my bad, the bug was mine :(
[18:51:11] :)) and no worries, worked perfectly now ^.^
[19:02:01] legoktm: can we deploy that config patch? or somehow fix logstash-beta first?
[19:02:17] fwiw I already spent a few hours looking into it but couldn't fix it
[19:03:27] I'm missing the link between switching etcd and logstash-beta being down
[19:03:54] harder to see errors if there are any
[19:03:58] wouldn't any failure logs end up on mwlog too?
[19:04:46] I mean, if it's not working we'd know right away right? Won't MW go down without it?
[19:04:55] Yeah, I guess that would work too
[19:05:16] I think MediaWiki caches the last good data on disk to avoid etcd taking everything down
[19:05:24] I once fixed logstash by rebooting the VM
[19:05:38] but I guess that's been tried already
[19:05:57] I've tried restarting all the processes related, but not the VM itself
[19:12:09] Majavah: I think it should be deployed now
[19:12:23] legoktm: it is, at least on deployment-mediawiki-07
[19:12:31] nothing on exception.log which is a good sign
[19:13:19] uhhh (/srv/mediawiki/php-master/maintenance/eval.php(81) : eval()'d code:1) syntax error, unexpected ','
[19:13:36] * tabbycat smells smoke
[19:14:03] IAS, beta-cluster is kinda FUBAR
[19:14:07] someone manually made a typo?
[19:14:19] legoktm: I didn't, I would have guessed that was you
[19:14:28] but that stack trace would suggest that
[19:14:39] nope, not me
[19:15:13] hmmm
[19:15:59] now got another ParseError: syntax error, unexpected '{', expecting ',' or ')'
[19:16:15] they are on wikidatawiki, but otherwise have no useful exception
[19:16:30] what host are they coming from?
[19:16:46] deployment-mediawiki-07
[19:17:10] four in total, first one today 19:04:50 UTC
[19:17:24] which was before we switched over
[19:17:35] oh, it's ppchelko
[19:17:49] guess he's just debugging something
[19:17:51] ah you were faster than me looking at that
[19:17:59] I just ran `w` :p
[19:18:18] yes, I did that too, but you were faster posting here
[19:18:29] hehe
[19:19:13] if you want to be absolutely sure etcd is working properly, you could rename those cache files so MW has to hit etcd and create new ones?
[19:19:32] I'm trying to find what they are named to do that :P
[19:20:04] $localCache = new APCUBagOStuff;
[19:20:10] so just restart php-fpm?
[19:20:26] actually not sure how to verify that it would fill then
[19:22:21] etcd v2 was proxied via nginx so it had an access log (which shows that requests to it stopped) but v3 isn't proxied so no access log
[19:23:46] it'll throw new ConfigException( "Failed to load configuration from etcd: $error" );
[19:23:52] if the cache is empty and it can't talk to etcd
[19:24:07] so yeah, restarting php-fpm to clear APCu seems reasonable
[19:25:38] did that, everything seems to be still working
[19:25:52] woot
[19:27:14] I'm declaring success and shutting down etcd-01 and if nothing breaks next week delete it
[19:27:34] but still no logstash Majavah ?
[19:28:35] oh, that, but it's unrelated to etcd
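
Editor's note: the caching behaviour discussed from 19:05 to 19:24 (serve the last good config from a local cache, and only fail when the cache is empty and etcd is unreachable) can be sketched roughly as below. This is a simplified illustration in Python, not MediaWiki's actual implementation: the class and method names are hypothetical, a plain dict stands in for APCu, and a real implementation would likely also apply a cache TTL rather than fetching on every load.

```python
class ConfigError(Exception):
    """Raised when no cached copy exists and the backend is unreachable."""


class CachedConfigSource:
    """Hypothetical config loader that falls back to the last good copy."""

    def __init__(self, fetch, cache=None):
        # fetch: callable returning the config dict; may raise on failure.
        self._fetch = fetch
        # A plain dict stands in for a per-process cache such as APCu.
        self._cache = cache if cache is not None else {}

    def load(self):
        try:
            fresh = self._fetch()
        except Exception as error:
            # Backend is down: serve the last good data if there is any.
            if "config" in self._cache:
                return self._cache["config"]
            raise ConfigError(
                f"Failed to load configuration from etcd: {error}"
            ) from error
        # Remember the last good data for future outages.
        self._cache["config"] = fresh
        return fresh

    def clear_cache(self):
        # Roughly what restarting php-fpm does to APCu in the log above.
        self._cache.clear()
```

Under these assumptions, clearing the cache and then calling `load()` is a real test of the backend: it either returns fresh data or raises, which matches the reasoning at 19:24:07 for restarting php-fpm.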