[00:22:34] elukey: <3
[06:43:58] zayo link repaired afaics :)
[08:18:48] oof, syslog spam with this: {"timestamp":"2020-02-24T08:18:34.018737+00:00", "message":"PHP Notice: Undefined index: es4 in \/srv\/mediawiki\/wmf-config\/etcd.php on line 79", "host":"mw1288", "logsource":"mw1288", "severity":"notice", "facility":"daemon", "program":"php7.2-fpm"}
[08:18:57] marostegui: ^
[08:19:11] :(
[08:19:14] cdanis: ^
[08:19:23] blah
[08:19:40] oh I think I see an easy fix
[08:21:27] _joe_: around?
[08:22:43] thanks cdanis / marostegui
[08:23:45] if anyone who is even half-PHP-y wants to take a look, https://gerrit.wikimedia.org/r/574377
[08:23:46] cdanis: https://gerrit.wikimedia.org/r/574378
[08:23:50] oh XD
[08:24:11] aha
[08:24:17] cdanis: maybe yours is safer indeed
[08:24:18] yours is better, if those are indeed to be the names
[08:24:21] haha
[08:24:23] XDDDDDD
[08:24:29] oh no now I don't know where to vote /o\
[08:24:36] cdanis: yeah, those are the names
[08:24:56] Maybe I can put the master of es4 in read only
[08:24:57] just in case
[08:25:02] to ensure nothing gets written
[08:25:18] as nothing is supposed to write there anyways
[08:25:23] so better to get an error than a split
[08:26:49] cdanis: let's go for mine + RO on es4 master (es1020)?
[08:27:25] marostegui: yeah, testing your change on mwdebug1002 atm
[08:28:58] ok, going to change both es4 and es5 to RO on mysql level
[08:29:32] <_joe_> cdanis: sorry I'm back now
[08:29:46] done
[08:29:49] ok looks good
[08:29:55] ok, let me push then
[08:29:59] ok!
[08:33:59] godog: change pushed, can you let us know if you see the messages stopping?
[08:35:51] marostegui: yup, we're back
[08:36:02] thanks folks! appreciate it
[08:36:19] thanks for the heads up :)
[08:36:44] yeah :)
[14:13:05] <_joe_> jynus: what do you want to look at in aggregation?
[14:14:02] traffic throughput in general at app level, http bytes, requests, requests per method
[14:14:38] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1
[14:14:51] <_joe_> that, yes
[14:15:08] ok, I see, I skipped that one because of the RED name, thought it only had alerts
[14:15:20] very useful also on its own
[14:15:34] thanks
[14:15:36] false cognate :) Requests, Errors, Delay
[14:15:57] or, sorry, Rate, Errors, Duration
[14:16:01] I always misremember but it's the same idea
[16:06:50] ferm seems to have failed to start on 2 ms-be codfw hosts 5 hours ago
[16:06:58] is it safe to reload?
[16:08:35] I see other dns-related issues today, so I will proceed if I confirm it is the same root issue
[16:08:42] ferm-dns
[16:08:46] it's unknown why it suddenly fails, but there's a patch: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/574426/
[16:08:57] not worried that it failed
[16:09:22] just wanted to make sure I didn't apply an unwanted change or someone was working on them
[16:10:18] indeed it was a dns timeout
[16:10:46] ack, there should be no unwarranted rules changes or so
[16:11:49] I like to ask because you never know :-D
[16:12:05] o/
[16:12:33] is that for me?
[16:19:50] jynus: thanks for the heads up, yes I restarted another couple this morning. I want also to follow up with traffic because AFAIK it's the first time we got this and that ferm config is there since last Sep.
[16:20:06] so might be related with recent changes to the local recursors/authdns
[16:20:10] ?
[16:20:12] first time?
[16:20:23] I am seeing ferm failing due to dns for a long time?
[16:20:45] not as a huge problem, but like 1/10 chance to happen on every reboot
[16:21:11] although maybe the new thing is that it happened without a reboot?
[16:21:25] (I saw the server was running for 270 days)
[16:21:30] I've never seen it on a normal ferm reload/restart on a running host
[16:21:39] @reboot might be the order of things in systemd
[16:21:51] we saw it before
[16:22:16] I think brandon started digging and he saw some interesting behaviour client side
[16:22:26] the reboot part?
[16:22:31] no, in general
[16:22:39] the ferm rules for swift change/reload whenever there's a new host added/removed, so probably caused by the recent expansions
[16:22:40] on how dns is used by ferm
[16:23:00] he said it was something like concurrency making failure more likely
[16:23:04] or not retrying
[16:23:17] cannot remember, you should ask him :-D
[16:23:40] I am saying it because it may be a new thing, but exposing an underlying sw flaw
[16:24:00] e.g. if many calls are sent, more likely to expose it as you said on the patch
[16:24:05] moritzm: this one was actually triggered by $CODFW_PRIVATE_CLOUD
[16:24:06] _INSTANCES2_B_CODFW_IPV4
[16:24:49] or that, but basically also with every server change as they restrict traffic between themselves (hence it's more frequent than for other roles)
[18:22:17] Greeks: adding gr.wikimedia.org for a new Wikimedia User Group Greece
[21:16:25] 2 VMs, created at the same time. one in eqiad, one in codfw. eqiad works fine, codfw just don't see any console output ever. ganeti says status is up, ganeti logs show disks are in sync and that API returns 200 for it.. double checked the row.. hmmm
[21:16:51] mutante: how have you created the VM?
[21:17:00] volans: cookbook
[21:17:14] and the creation log was ok?
[21:19:02] volans: yes, it ended just fine and showed me the MAC ..just like with the one in eqiad
[21:19:17] looking in "extended" log now too
[21:19:37] that's basically debug logging, I doubt you'll find anything different wrt ganeti logs
[21:58:03] on buster: Failed to reload nginx.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
[21:58:06] aww :p
[22:00:54] that's missing sudo. but with sudo "not active, cannot reload" always something :)
[22:12:03] nginx: [emerg] "ssl_stapling_file" directive is duplicate :p
[22:12:16] guess on jessie it ignored more
[22:15:21] mutante: context?
[22:15:40] I can guess the deep part of what's going on from your error message, but I don't know what host/service/situation
[22:15:46] bblack: installservers, nothing critical
[22:16:08] new VMs called apt[12] that are APT repos without the TFTP/DHCP part
[22:16:20] but it includes "nginx from jessie to buster"
[22:16:27] on jessie and stretch, we had a custom-patched nginx, which allowed multi-stapling-file
[22:16:45] oh. that is a valuable hint
[22:16:45] but traffic moved to ats-tls and doesn't use it on buster, you likely have the stock debian buster nginx, which doesn't do multi-stapling-file
[22:17:02] we haven't tried and weren't planning to, build a custom nginx for buster
[22:17:05] i guess i could also consider httpd instead
[22:17:12] well
[22:17:21] it's just a webserver for the APT repo
[22:17:40] yeah, but it will have the same problem, probably, if we can even *do* stapling on httpd this way (maybe not)
[22:18:31] nope AFAIK
[22:18:54] I'd say the best option is follow the ncredir route and drop the dual cert setup
[22:18:55] another thing you could do, is configure it for either rsa-only or ecdsa-only
[22:19:02] hmm.. *nod*
[22:19:02] :)
[22:19:10] for some things where we care about backcompat less, we've been going ecdsa-only
[22:19:25] if we care about backcompat more for this case (I'm really not sure), we could also go rsa-only
[22:19:55] I think for the public apt, we still offer unredirected plain http anyways, because Debian
[22:20:04] so may as well go ecdsa-only for the tls part?
[22:20:07] yea.. uhmm. if it was only internal for us then ecdsa but it's also for the public
[22:21:14] fair about plain http.. yea. not everybody has apt-transport-https
[22:21:52] and if they od, chances are they have ecdsa, it's not that new :)
[22:21:58] s/od/do/
[22:22:14] yea, that makes sense to me. thanks
[22:22:37] any decent http implementation shouldn't connect to wikimedia.org using plain text.. so I don't know if we should support that use case
[22:23:34] there's some debian thing about linking them as http:// and it breaking older apt tools that wouldn't follow the 301
[22:23:41] at least, last I heard, some time back
[22:25:11] what about removing the stapling directive entirely
[22:25:22] we dont need it anymore since switch to ATS?
[22:25:26] i hear
[22:26:09] we should keep it as-is where we can
[22:26:21] ok
[22:26:26] we've never tried to hard-require converting all the minor tls terminator cases like ATS to stapling, but it's nice to have where we can
[22:26:36] ugh typo confusion
[22:26:43] we've never tried to hard-require converting all the minor tls terminator cases like *APT* to stapling, but it's nice to have where we can
[22:29:24] uploads a patch to go ecdsa-only and adds Moritz too
[22:58:28] do I have to define a new type of ssl_ciphersuite in addition to "mid" and "strong" that removes all the RSA ciphers.. if i remove the stapling file for RSA?
[23:00:25] no
[23:00:32] it will just ignore the RSA variants
[23:00:50] if you're manually hacking before the ecdsa-only patch, you need to remove all of the RSA parts together
[23:00:57] (stapling, key, cert)
[23:01:32] ssl_certificate /etc/acmecerts/apt/live/rsa-2048.chained.crt;
[23:01:33] ssl_certificate_key /etc/acmecerts/apt/live/rsa-2048.key;
[23:01:35] ^ those
[23:02:05] gotcha, thanks
[23:12:06] tox
[23:12:09] woops
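
Editorial note on the 23:00 advice ("remove all of the RSA parts together (stapling, key, cert)"): a minimal Python sketch of that manual cleanup, not the actual ecdsa-only patch. It assumes the RSA variants can be recognised by "rsa" in the file name, as in the two paths quoted above. The function name `drop_rsa_directives`, the `.ocsp` stapling file names and the `ec-example` paths are hypothetical and only there for illustration.

```python
# Directives that exist once per key type in a dual-certificate nginx setup.
PER_CERT_DIRECTIVES = ("ssl_certificate", "ssl_certificate_key", "ssl_stapling_file")


def drop_rsa_directives(config_text: str) -> str:
    """Return config_text without per-certificate directives pointing at RSA files.

    Assumption: an RSA variant is any of the directives above whose file name
    contains "rsa" (e.g. rsa-2048.key). All other lines are kept unchanged.
    """
    kept = []
    for line in config_text.splitlines():
        parts = line.strip().rstrip(";").split()
        is_rsa = (
            len(parts) == 2
            and parts[0] in PER_CERT_DIRECTIVES
            and "rsa" in parts[1].rsplit("/", 1)[-1].lower()
        )
        if not is_rsa:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical dual-cert fragment; only the two rsa-2048 .crt/.key paths come from the log.
example = """\
ssl_certificate /etc/acmecerts/apt/live/rsa-2048.chained.crt;
ssl_certificate_key /etc/acmecerts/apt/live/rsa-2048.key;
ssl_stapling_file /etc/acmecerts/apt/live/rsa-2048.ocsp;
ssl_certificate /etc/acmecerts/apt/live/ec-example.chained.crt;
ssl_certificate_key /etc/acmecerts/apt/live/ec-example.key;
ssl_stapling_file /etc/acmecerts/apt/live/ec-example.ocsp;
"""

cleaned = drop_rsa_directives(example)
print(cleaned)
```

Removing the three directives in one pass leaves a single `ssl_stapling_file` line, which is what avoids the "directive is duplicate" error reported at 22:12 on stock buster nginx.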