[05:56:34] andrewbogott: deleted! Sorry for the noise
[07:08:47] network question - I am decomming analytics1032, and did all the steps except the switch's port disable
[07:09:16] that seems straightforward: edit interface blabla; set disable; commit
[07:09:34] is it something that whoever has network access can do?
[07:09:55] (task is https://phabricator.wikimedia.org/T233080)
[07:15:34] that step is normally done by DC ops in the final unracking step, probably fine if you simply do it, but with the new Netbox consistency checks this might trigger some internal inconsistency, not sure
[07:15:44] best to ping the dcops channel for clarification I'd say
[07:19:18] moritzm: regarding the python3-cryptography issue... does it make sense to upload a wmf1 version while the debian package is broken?
[07:24:38] the py-crypto update that I mentioned yesterday was submitted before anyone noticed your bug: https://lists.debian.org/debian-release/2019/09/msg00598.html, so let's do the following:
[07:26:25] - backport the patch to a 2.6.1-3+deb10u1~wmf1 build which includes the patch from http://bugs.debian.org/941451 plus a backport for the memleak (the new OpenSSL DSA will be released tonight and without the patch from 941451 it will fail to build)
[07:26:47] - let's verify that it fixes the mem leak on acmechief* hosts
[07:26:56] - if so, we report back to the bug
[07:27:33] - we can then offer the py-crypto maintainer that we'll do the legwork of submitting a 2.6.1-3+deb10u2 with the backported memleak patch
[07:27:55] - and once that is out, we yank our 2.6.1-3+deb10u1~wmf1 build from buster-wikimedia
[07:28:10] ack
[07:28:41] I'm gonna do it as a one shot thingie on boron... no internal repo on gerrit.wm.o, is that ok?
[07:29:04] yeah, given that we have only two acmechief hosts we can also simply test the packages with a mere dpkg -i
[07:29:27] does the leak also show up on the acmechief test hosts or do they see too few requests?
[07:29:39] it shows as well
[07:29:52] ack
[07:29:56] it's linked to the number of configured certificates, not clients using the API
[07:42:56] moritzm: so... applying that patch on top of debian/2.6.1-3 doesn't make quilt happy
[07:45:23] forget about it
[07:45:25] L8 issue on my side
[08:54:25] moritzm: my bug report is incomplete, more backports are required to fix the issue
[08:54:40] but I got a python3-cryptography_2.6.1-3+deb10u1~wmf1_amd64.deb already... testing on acmechief-test1001
[08:54:46] ack!
[08:58:43] moritzm: valgrind is happy now :D
[09:01:41] cool :-)
[09:03:47] I've also referenced the two PRs required to fix the issue in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941413#10
[09:04:18] I'm gonna hit the acme-chief production environment as well
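
The verification step above ("valgrind is happy now") boils down to running a small reproducer under valgrind before and after installing the patched package. Below is a minimal sketch of such a reproducer; it assumes the leak is triggered by repeatedly parsing certificates (the chat ties the leak to the number of configured certificates, not API clients), the actual reproducer attached to the Debian bug may differ, and the inline self-signed certificate is only there to keep the script self-contained.

    #!/usr/bin/env python3
    # Sketch of a python3-cryptography leak check meant to be run under valgrind, e.g.:
    #   PYTHONMALLOC=malloc valgrind --leak-check=summary python3 leak_check.py
    # The assumption that certificate/extension parsing is what leaks comes from the
    # discussion above, not from the Debian bug itself.
    import datetime

    from cryptography import x509
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    # Build a throwaway self-signed certificate so the script needs no input files.
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048,
                                   backend=default_backend())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"leak-test.example.org")])
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(datetime.datetime.utcnow())
        .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=1))
        .sign(key, hashes.SHA256(), default_backend())
    )
    pem = cert.public_bytes(serialization.Encoding.PEM)

    # Parse the same PEM many times; with a leaky python3-cryptography the RSS and
    # valgrind "definitely lost" numbers grow with the iteration count.
    for _ in range(50000):
        parsed = x509.load_pem_x509_certificate(pem, default_backend())
        list(parsed.extensions)
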
[10:13:32] hi all, i plan to upgrade puppetmaster2001 today. i'll disable puppet in codfw, ulsfo and eqsin while the reimage takes place. i plan to start at 11:00 UTC. please say if there are any issues
[10:22:46] hmmm
[10:22:52] what's the expected downtime?
[10:24:41] the previous reimages of the backends took 45-50 mins
[10:26:54] also i have rescheduled to 12:00 to give the ops list a bit more lead time
[11:42:19] do we have a declared maintainer for this grafana dashboard? https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics
[11:53:18] we don't really have fixed maintainers, try pinging the last person who modified the dashboard per its history :-)
[13:03:56] <_joe_> arturo: why do you want to use that dashboard?
[13:04:14] it's linked from another dashboard. But I see all the queries are updated
[13:04:16] <_joe_> I think you should use "host overview" instead
[13:04:17] outdated*
[13:05:19] thanks _joe_ will update the links
[13:09:03] I'd like to live in a world where we have fewer dashboards, and a set of 'core' dashboards (like host overview) that lived in source control or something
[13:09:15] +1
[13:10:41] <_joe_> I think we tried at some point
[13:10:52] <_joe_> then I defeated all of you
[13:12:07] <_joe_> (jokes aside, I think you're right, I would even go further and make the root folder read-only in grafana, but I don't think they have good enough multitenancy)
[13:12:19] the dashboards in git didn't really work, too unreadable. we'd need some meta language we keep in git, which generates the actual dashboards
[13:12:39] <_joe_> cdanis: I think this is a perfect job for the new hire too!
[13:14:17] moritzm: yeah, I think keeping them as JSON is a bad idea, I'd prefer our own specializations around grafanalib (https://github.com/weaveworks/grafanalib) or similar
[13:15:32] <_joe_> I don't love grafanalib, but well
[13:15:34] ofc then you wouldn't be able to tweak a 'core' dashboard in the UI and then easily save the changes, but hopefully everyone who needs to edit them wouldn't mind
[13:15:57] <_joe_> yes, probably you need to evolve a bit on top of it and you can build something user-friendly
[13:16:20] _joe_: I don't love grafanalib either, but it does seem to be the thing that exists
[13:17:07] <_joe_> cdanis: sure, my main grievance is that being very general-purpose it's not opinionated, so it's as inconvenient to use as grafana itself
[13:17:21] <_joe_> only you don't get immediate feedback on your changes
[13:18:05] <_joe_> while I would love to have something like LatencyGraph(timeseries={"label": "query", ...}) that does all the right things
[13:18:27] +1 to something more opinionated and specialized
[13:18:35] <_joe_> including setting up our most used variables or something. It's mostly a matter of a good UX
[13:19:06] <_joe_> which is very hard to get right :P
[13:19:12] indeed
[13:19:40] <_joe_> but yes, something more high level with the possibility to go lower-level with grafanalib itself
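
As a rough illustration of the "opinionated wrapper around grafanalib" idea being discussed here, a minimal sketch follows, assuming grafanalib's classic rows-based API; LatencyGraph, the metric name, and the datasource name are made up for the example rather than taken from any existing WMF code:

    # Hypothetical opinionated helper on top of grafanalib (sketch only).
    from grafanalib.core import (
        Dashboard, Graph, Row, Target,
        single_y_axis, SECONDS_FORMAT,
    )

    def LatencyGraph(title, query, datasource="prometheus"):
        # Wrap grafanalib's Graph with the defaults we always want for latency
        # panels; the names and the datasource here are illustrative.
        return Graph(
            title=title,
            dataSource=datasource,
            targets=[Target(expr=query, legendFormat="p99", refId="A")],
            yAxes=single_y_axis(format=SECONDS_FORMAT),
        )

    # grafanalib's generate-dashboard tool looks for a module-level `dashboard`
    # object and renders it to JSON, which can then be loaded into Grafana.
    dashboard = Dashboard(
        title="Appserver latency (example)",
        rows=[Row(panels=[
            LatencyGraph("MediaWiki p99 latency",
                         'example_request_duration_seconds{quantile="0.99"}'),
        ])],
    ).auto_panel_ids()

Rendering this to JSON (for example with grafanalib's generate-dashboard tool) would then happen in CI or a deploy step, which is exactly where the "no immediate feedback" trade-off mentioned above shows up.
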
[13:48:30] jbond42: puppet merges can continue while you work, right?
[13:48:58] cdanis: 2001 has just come back online so it should be fine regardless
[13:49:04] ahh thanks
[13:49:04] but yes i think it should
[13:56:10] /usr/local/bin/puppet-merge: 213: /usr/local/bin/puppet-merge: cannot create /srv/config-master/puppet-sha1.txt: Permission denied
[13:56:13] Connection to puppetmaster2001.codfw.wmnet closed.
[13:56:15] ERROR: puppet-merge on puppetmaster2001.codfw.wmnet failed
[13:58:02] ack thanks, looking
[13:58:13] looks like the /srv/config-master/puppet-sha1.txt and labsprivate-sha1.txt files aren't puppetized, nor are permissions that allow the 'gitprivate' user to write in the directory -- I'm guessing the files got created there as owner gitprivate, and so writes to the existing files happen to work
[14:00:32] how can I force puppet-merge to run?
[14:03:39] s/gitprivate/gitpuppet/ above
[14:05:19] cheers
[14:06:31] <_joe_> why are we saving those files in /srv/config-master??
[14:06:43] <_joe_> oh my.
[14:06:45] <_joe_> anyways
[14:06:53] so that cloud hosts can know what the 'canonical' puppet revision is
[14:07:17] <_joe_> as opposed to reading the puppet repo?
[14:07:29] they need to know what's been merged in production
[14:07:35] obviously they are also reading the puppet repo
[14:07:49] <_joe_> no, I mean the puppet repo from the frontend puppetmaster
[14:07:57] <_joe_> anyways, please disregard
[14:07:59] that I don't know
[14:08:25] <_joe_> I don't understand why there is one for labsprivate on a production puppetmaster either, but I am not following evolutions much there
[14:09:02] anyway, should be fixed now, although it will be a pain when reimaging other puppetmasters
[14:10:01] I think it would be reasonable to puppetize something about the permissions needed: either make the directory writable by group gitpuppet, or if that's undesirable, is there a way to have puppet create files with a given owner but not care about the contents of said file?
[14:22:54] sorry i dropped off there. thanks chris, i'll create a task for the permissions
[14:26:03] <_joe_> I would prefer not to make that dir writable to gitpuppet, just having puppet ensure => present and owner, group without source or content shall be enough
[14:29:47] ticket here https://phabricator.wikimedia.org/T234332
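
To illustrate the mechanism described above (cloud hosts reading the published sha1 files to learn the 'canonical' merged revision), here is a minimal sketch of a consumer-side check; the URL and the local checkout path are assumptions made for the example and are not necessarily what the real tooling uses:

    # Sketch: compare the published "canonical" puppet sha1 against a local checkout.
    import subprocess
    import urllib.request

    CANONICAL_SHA_URL = "https://config-master.wikimedia.org/puppet-sha1.txt"  # assumed URL
    LOCAL_PUPPET_REPO = "/var/lib/git/operations/puppet"  # assumed checkout path

    def puppet_checkout_is_current() -> bool:
        # The file served from /srv/config-master is written by puppet-merge and
        # contains the sha1 of the last revision merged in production.
        canonical = urllib.request.urlopen(CANONICAL_SHA_URL).read().decode().strip()
        local = subprocess.check_output(
            ["git", "-C", LOCAL_PUPPET_REPO, "rev-parse", "HEAD"], text=True
        ).strip()
        return canonical == local

    if __name__ == "__main__":
        print("up to date" if puppet_checkout_is_current() else "behind production")
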
[14:54:12] jbond42: okay, hopefully on future puppetmaster reimages it will Just Work
[14:54:32] well i can test tomorrow :) thx
[14:56:04] <_joe_> cdanis: they mostly work, and unless we test reimaging constantly, we will always find things that have been created incrementally when implemented and don't work on a clean system
[14:56:34] <_joe_> that's something I thought about btw, having a couple of VMs on which to reimage an appserver daily
[14:56:41] _joe_: maybe we should reimage mostly-stateless systems constantly
[14:57:01] <_joe_> cdanis: if puppet was remotely less slow in applying a large catalog, sure
[14:57:17] ok, once a week then
[14:57:21] <_joe_> reimaging an appserver takes ~2 hours for the first puppet run to happen
[14:57:53] <_joe_> something like a service that picks one appserver at random, depools it, and reimages it every N hours
[14:58:15] <_joe_> (and it also takes about 15-20 more minutes to run its first scap pull)
[14:58:37] <_joe_> I'm pretty confident we have a good validation system at least for all the things in the services layer
[14:58:49] <_joe_> between service-checker and the tests I devised for the appservers
[14:58:59] <_joe_> not so sure about the rest of the stack
[14:59:16] jaime and manuel are now starting to present at percona live
[14:59:59] <_joe_> no streaming I guess :)
[15:00:20] i don't think so ;)
[15:00:24] pretty decent audience
[15:00:42] <_joe_> cdanis: I can tell you I was positively surprised when an appserver reimaged at the first attempt without a hiccup whatsoever last time I tried a few weeks back
[15:00:47] haha
[15:00:51] <_joe_> after like one year when we never did it
[15:01:01] <_joe_> and we changed literally 90% of the puppet code applied there
[15:01:12] never did it
[15:01:16] what
[15:01:26] <_joe_> reimage a server
[15:01:28] that's amazing and upsetting for multiple reasons
[15:02:12] <_joe_> upsetting why?
[15:02:42] <_joe_> because it would be great to have a policy to try regularly? sure
[15:02:54] <_joe_> but cost/benefit as usual
[15:03:05] <_joe_> the only way it makes sense is if we automate the procedure
[15:03:32] <_joe_> also very soon(TM) all the servers I manage will just be kubernetes nodes :P
[15:04:59] the more of that, the better
[15:05:53] are you saying we can't have pets? :,-(
[15:06:10] datacenters are very inhospitable environments for pets, mark
[15:06:14] I'm just trying to be humane
[15:07:06] <_joe_> mark: don't worry, we still have to support gerrit and jenkins and phabricator
[15:07:15] <_joe_> plenty of pets
[15:07:47] if we fix https://phabricator.wikimedia.org/T195847 before reimaging the mw servers, this will reduce reimage times quite a bit (especially on the older servers with slow I/O), that TeX stuff is insanely big and I think we only installed it for Math
[15:08:44] <_joe_> oh, we can finally really remove it?
[15:08:50] are you saying I should host those pets at my home?
[15:09:05] <_joe_> mark: no, just our mailserver :P
[15:09:23] ;)
[15:09:26] <_joe_> moritzm: yeah, that is also the biggest nightmare I had re: mediawiki in containers
[15:09:39] that was a proper colocated server in a professional data center, tbf ;)
[15:09:56] <_joe_> mark: it's funnier if you let the legend become that it was under your desk
[15:10:03] yeah yeah ;)
[15:15:26] yeah, all the math packages should be good to go, IIRC the database side of the extension is even removed by now
[15:33:36] <_joe_> moritzm: from a quick test on mwdebug2002, we should just undeclare those packages in puppet, then run apt-get remove --autoremove
[15:36:32] yeah, but let's keep them installed on the canaries for a week or so, we might have some weird extension shelling out to some TeX binary which simply relied on the TeTeX packages being around for Math
[15:36:49] (before dropping fleet-wide)
[15:45:00] <_joe_> yes, I was wondering about that
[15:45:19] <_joe_> it's easier to manage this way too