[08:34:45] <_joe_> hey there traffic people
[08:35:05] <_joe_> pybals on lvs1015 and lvs1016 are suffering a lot
[08:35:19] <_joe_> it can be 30 seconds before they detect a server being up or down
[09:09:55] probably related to the 83 services configured there
[09:25:59] <_joe_> sure
[10:22:35] 10Traffic, 10Analytics, 10Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10BBlack) The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_b...
[10:38:32] 10Traffic, 10Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) Reported and proposed a solution to upstream in https://github.com/apache/trafficserver/pull/5935
[10:40:29] _joe_: how do we measure the 30s lag?
[10:41:05] (is this cpu-limited like we've seen in the past? eventloop falls behind basically?)
[10:41:46] <_joe_> bblack: One sec, finishing an interview
[10:47:01] <_joe_> so what I noticed is that when restarting a service, the new wrapper scripts change the pooled state in etcd, then check the state in pybal every 10 seconds via an http query to its api
[10:47:25] <_joe_> and sometimes it retries 3, 4 times before pybal actually sees the service is up again
[10:47:39] <_joe_> even when I know the service is ready on the servers
[10:48:20] <_joe_> that would be ~15-20 seconds of lag. Earlier I even got some timeouts, which happen after 5 retries
[10:50:07] <_joe_> I can debug this further if you want, but yes, I assumed it was the eventloop falling behind
[10:50:32] I can play with it a bit
[10:50:41] it could be that the issue is with consumption of etcd, too
[10:51:00] (on pybal's end)
[10:52:03] <_joe_> that has that 1-second lag that's a bit annoying, I would expect 0.1 seconds would do just fine there
[10:52:13] <_joe_> but I don't think that's the case
[10:52:23] <_joe_> I see the state from pybal as enabled/down/not pooled
[10:52:33] <_joe_> so it's the checks that don't get performed in time
[10:54:17] yeah, I'm looking at some stuff
[10:54:34] pybal doesn't ever really seem to lock up a CPU core, at least not for more than an isolated second here or there
[10:54:48] but there are definitely some inefficiencies in the eventloop for the healthchecks, which we've stared at before
[10:55:31] at the end of the day it's just not a very efficient model to have all of these checks in the same singular process, thread, and loop as each other and everything else
[10:55:49] (esp when it's never truly "nonblocking", there are some minor hiccups that happen)
[10:55:59] <_joe_> yes
[10:59:23] so one thing I'm staring at here just in general optimization terms
[11:00:04] we still use the RunCommand healthchecks... which just ssh to the service host and execute the "true" command. But only for the port 80 appservers + api pools (not the port 443 services, and we don't use it for any other pools).
[11:00:45] and these seem to all execute batched together as healthchecks too, meaning there's a point in this healthcheck cycle where pybal forks off a very large number of subshells and ssh clients, etc.
[11:01:06] how useful is that runcommand in terms of healthcheck?
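For illustration, the RunCommand check described above boils down to roughly the following. This is a minimal sketch rather than PyBal's actual RunCommand monitor, and the host name is a placeholder:

```python
# Minimal sketch of a RunCommand-style check: fork an ssh client to the
# backend, run `true`, and treat the exit status as the health result.
import subprocess

def runcommand_check(host: str, timeout: float = 10.0) -> bool:
    """Return True if we can ssh to the backend and `true` exits 0."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            timeout=timeout,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# e.g. runcommand_check("mw1261.eqiad.wmnet")  # hypothetical host
```

Doing this for every appserver/api backend from a single process in one check cycle is exactly the burst of subshells and ssh clients described at 11:00:45.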
[11:01:30] <_joe_> yeah I don't think it is anymore :)
[11:01:51] ok, I might do some testing (first repro the delays under normal conditions) and see if it makes any big diff
[11:01:53] <_joe_> let's ask mark for the historical reason for it, but I've not seen runcommand catch a problem in years
[11:02:13] https://phabricator.wikimedia.org/T111899
[11:02:15] it may not be a problem anyways
[11:02:43] we discussed this back in 2015 when the ferm rules were added for mw servers
[11:02:45] oh look someone recorded this from 4 years ago :)
[11:03:01] I knew this might be useful some day :-)
[11:04:09] for the "avoid pooling servers that were broken and not up to date" angle we do have sufficient other types of monitoring I'd say
[11:05:10] I'd say at this point in history, really _joe_ is the authority on this, although he should maybe read that convo before making a call :)
[11:05:58] <_joe_> yeah, lemme read that
[11:10:12] <_joe_> ok so
[11:10:18] now that I get 4 years to think between conversation lines: I doubt the sh and uptime commands actually validate that the rootfs still works, they could still be in cache from other continuous check executions.
[11:10:29] (and now uptime is actually just "true")
[11:10:35] <_joe_> yeah and also
[11:10:52] <_joe_> we are using Special:BlankPage
[11:11:02] <_joe_> which loads at least some data
[11:11:09] the only angle on this that seems semi-valid
[11:11:09] <_joe_> we have multiple checks available
[11:11:34] <_joe_> if scap fails on a server, we just depool it quickly
[11:11:40] is that S:BP could be working, but ssh and/or other stuff is borked and it's falling behind on scap updates, and somehow this also kills the ssh runcommand check
[11:11:45] but, yeah, that
[11:12:04] <_joe_> I don't think we had a single case in years
[11:12:16] <_joe_> where runcommand did depool a server that the other checks didn't
[11:12:29] <_joe_> I mean if I had to name a check that would be useful
[11:12:45] <_joe_> is calling mtail to check the error rate maybe :P
[11:24:18] _joe_: minor Q while staring at things, is it intentional that eqiad api servers in row D have lower weight than A/B/C?
[11:25:08] <_joe_> it's because they're older I guess?
[11:26:49] I donno
[13:19:38] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi)
[15:06:10] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) Note that with new eqiad routing engines we can set the MSS at the router level (untested). Advantages are: easier to deploy (one configuration change) and can be applied to ext...
[15:32:45] 10Traffic, 10Analytics, 10Operations: Add google weblight to the list of trusted proxies - https://phabricator.wikimedia.org/T232849 (10Nuria)
[15:33:13] XioNoX: that reminds me, the LVS 2x bgp patch, let's aim for monday? it's somewhat risky for today at this point. ditto the last anycast recdns enablement patch.
[15:33:31] bblack: wfm!
[15:53:11] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct m...
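Going back to the lag _joe_ described at 10:47, the pool-then-wait pattern amounts to something like the sketch below. The instrumentation port, endpoint path, pool and host names, and the exact status string are assumptions for illustration; this is not the actual wrapper script:

```python
# Minimal sketch: after the desired state has been written to etcd, poll
# PyBal's HTTP API until it reports the backend as pooled, checking every
# 10 seconds and giving up after 5 attempts (matching the timeouts above).
import time
import urllib.request

POLL_INTERVAL = 10  # seconds between API checks
MAX_ATTEMPTS = 5    # give up (timeout) after this many retries

def wait_until_pooled(pool: str, host: str) -> bool:
    """Return True once PyBal reports the host as enabled/up/pooled."""
    url = f"http://localhost:9090/pools/{pool}"  # assumed endpoint and port
    for _ in range(MAX_ATTEMPTS):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                for line in resp.read().decode().splitlines():
                    # assumed format, one backend per line:
                    # "mw1261.eqiad.wmnet: enabled/up/pooled"
                    if line.startswith(host) and line.rstrip().endswith("enabled/up/pooled"):
                        return True
        except OSError:
            pass  # treat an unreachable API as "not confirmed yet"
        time.sleep(POLL_INTERVAL)
    return False

# e.g. wait_until_pooled("api_80", "mw1261.eqiad.wmnet")  # hypothetical names
```

Every extra healthcheck round PyBal needs before it notices the backend is up again adds one or more 10-second polls here, which is where the 15-20 seconds of apparent lag (and the occasional timeouts) come from.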
[15:54:45] bblack: "but we'll still need direct mitigation on the hosts where it matters for inbound" I don't think so (but could be wrong)
[15:55:41] as the routers would clamp the ADVMSS on the outbound packets, it will influence the inbound traffic's MSS
[15:56:36] I thought it was only on outbound SYN?
[15:56:57] (whereas for the cp case, we need to clamp the one in the outbound SYNACK response)
[15:58:12] if it's just "all outward-bound packets regardless of syn directionality", then yeah
[15:58:22] we can just put it on the router for all applicable links to the outside world.
[16:02:36] bblack: doc is unclear "If the router receives a TCP packet with the SYN bit"
[16:02:57] but I'd think more that it works with SYNACK too
[16:02:59] well SYN+ACK does have the SYN bit
[16:03:09] yeah exactly
[16:03:26] to be tested
[16:03:51] I mean, it would make sense. The TCP RFC also defines that the MSS Option is only sent with the SYN bit (SYN or SYN+ACK).
[16:03:55] just you know, juniper :)
[16:04:04] eh
[16:04:05] but yeah we can test it
[16:04:26] another thing for "monday" (I'm sure all the things I've said monday about will stretch longer than that)
[16:05:00] assuming it works right, it is a much better solution than software
[16:05:02] not today, but easy to test, especially if you still have your test server
[16:05:08] and we can keep doing the software hack for now just on cp3
[16:05:24] yeah
[16:06:37] and write a runbook :)
[16:07:05] step 1: don't get ddos'd
[16:07:07] :)
[16:18:50] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested.
[16:56:52] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria)
[16:58:39] 10Traffic, 10Analytics, 10Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10Nuria) I have started another ticket that, as you mentioned, better explains the rationale behind having "trusted proxies", we really do not need them if we can capture the original i...
[16:59:00] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) ping @Ottomata and @JAllemandou for thou...
[17:59:29] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10BBlack) The problem stems from the "Trust" in "...
[18:40:43] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) Right, I see the UA issue but in the abs...
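Recapping the MSS clamping from the 15:54-16:18 exchange above with some illustrative arithmetic. This assumes a plain GRE-over-IPv4 tunnel on a 1500-byte path with no GRE key/checksum/sequence options; the exact overheads and values used for T232602 or the cp3 workaround may differ:

```python
# Why the advertised MSS has to shrink when traffic is GRE-encapsulated:
# the tunnel eats part of the 1500-byte path MTU, and the MSS only counts
# the TCP payload, so the inner IP and TCP headers come off as well.
PATH_MTU = 1500    # typical Ethernet MTU on the underlying path
OUTER_IPV4 = 20    # outer IPv4 header added by GRE encapsulation
GRE_HEADER = 4     # base GRE header (no key/checksum/sequence options)

tunnel_mtu = PATH_MTU - OUTER_IPV4 - GRE_HEADER  # 1476

mss_inner_v4 = tunnel_mtu - 20 - 20  # inner IPv4 + TCP headers -> 1436
mss_inner_v6 = tunnel_mtu - 40 - 20  # inner IPv6 + TCP headers -> 1416

print(f"clamp advertised MSS to {mss_inner_v4} (IPv4) / {mss_inner_v6} (IPv6)")
```

Since each endpoint advertises its own MSS in the segment carrying its SYN bit (the SYN in one direction, the SYN+ACK in the other), a router that rewrites the option in both should influence traffic in both directions, which is the behaviour left "to be tested" on the Juniper side.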
[19:14:44] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Dzahn)
[19:14:49] 10Traffic, 10Operations, 10Phabricator: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn)
[19:15:08] 10Traffic, 10Operations, 10Phabricator: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn) merging into T226044
[19:18:28] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) == From the merged task: Blog posts on phame cannot currently be cached by our...
[19:21:14] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) a:05mmodell→03None This is unblocked on my end, @ema feel free to proceed wh...
[19:22:42] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) Also important, @epriestley's comment at T219978#5346100
[19:42:06] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Krinkle)
[21:04:41] 10Traffic, 10Operations, 10ops-eqiad: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10wiki_willy) Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy
[21:05:33] 10netops, 10Operations, 10ops-eqiad: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10wiki_willy) @Cmjohnson - can you provide an update on this one next week? Thanks, Willy