[08:34:45] <_joe_> hey there traffic people
[08:35:05] <_joe_> pybals on lvs1015 and lvs1016 are suffering a lot
[08:35:19] <_joe_> it can be 30 seconds before they detect a server being up or down
[09:09:55] probably related to the 83 services configured there
[09:25:59] <_joe_> sure
[10:22:35] 10Traffic, 10Analytics, 10Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10BBlack) The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_b...
[10:38:32] 10Traffic, 10Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (10Vgutierrez) Reported and proposed a solution to upstream in https://github.com/apache/trafficserver/pull/5935
[10:40:29] _joe_: how do we measure the 30s lag?
[10:41:05] (is this cpu-limited like we've seen in the past? eventloop falls behind basically?)
[10:41:46] <_joe_> bblack: One sec, finishing an interview
[10:47:01] <_joe_> so what I noticed is that when restarting a service, the new wrapper scripts change the pooled state in etcd, then check the state in pybal every 10 seconds via an http query to its api
[10:47:25] <_joe_> and sometimes it retries 3, 4 times before pybal actually sees the service is up again
[10:47:39] <_joe_> even when I know the service is ready on the servers
[10:48:20] <_joe_> that would be ~15-20 seconds of lag. Earlier I even got some timeouts, which happen after 5 retries
[10:50:07] <_joe_> I can debug this further if you want, but yes, I assumed it was the eventloop falling behind
[10:50:32] I can play with it a bit
[10:50:41] it could be that the issue is with consumption of etcd, too
[10:51:00] (on pybal's end)
[10:52:03] <_joe_> that has that 1-second lag that's a bit annoying, I would expect 0.1 seconds would do just fine there
[10:52:13] <_joe_> but I don't think that's the case
[10:52:23] <_joe_> I see the state from pybal as enabled/down/not pooled
[10:52:33] <_joe_> so it's the checks that don't get performed in time
[10:54:17] yeah, I'm looking at some stuff
[10:54:34] pybal doesn't ever really seem to lock up a CPU core, at least not for more than an isolated second here or there
[10:54:48] but there are definitely some inefficiencies in the eventloop for the healthchecks, which we've stared at before
[10:55:31] at the end of the day it's just not a very efficient model to have all of these checks in the same singular process, thread, and loop as each other and everything else
[10:55:49] (esp when it's never truly "nonblocking", there are some minor hiccups that happen)
[10:55:59] <_joe_> yes
[10:59:23] so one thing I'm staring at here just in general optimization terms
[11:00:04] we still use the RunCommand healthchecks... which just ssh to the service host and execute the "true" command. But only for the port 80 appservers + api pools (not the port 443 services, and we don't use it for any other pools).
[11:00:45] and these seem to all execute batched together as healthchecks too, meaning there's a point in this healthcheck cycle where pybal forks off a very large number of subshells and ssh clients, etc.
[11:01:06] how useful is that runcommand in terms of healthcheck?
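For illustration, the RunCommand check described above boils down to roughly the following. This is a minimal sketch rather than PyBal's actual RunCommand monitor, and the host name is a placeholder:

```python
# Minimal sketch of a RunCommand-style check: fork an ssh client to the
# backend, run `true`, and treat the exit status as the health result.
import subprocess

def runcommand_check(host: str, timeout: float = 10.0) -> bool:
    """Return True if we can ssh to the backend and `true` exits 0."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            timeout=timeout,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# e.g. runcommand_check("mw1261.eqiad.wmnet")  # hypothetical host
```

Doing this for every appserver/api backend from a single process in one check cycle is exactly the burst of subshells and ssh clients described at 11:00:45.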
[11:01:30] <_joe_> yeah I don't think it is anymore :)
[11:01:51] ok, I might do some testing (first repro the delays under normal conditions) and see if it makes any big diff
[11:01:53] <_joe_> let's ask mark for the historical reason for it, but I've not seen runcommand catch a problem in years
[11:02:13] https://phabricator.wikimedia.org/T111899
[11:02:15] it may not be a problem anyways
[11:02:43] we discussed this back in 2015 when the ferm rules were added for mw servers
[11:02:45] oh look someone recorded this from 4 years ago :)
[11:03:01] I knew this might be useful some day :-)
[11:04:09] for the "avoid pooling servers that were broken and not up to date" angle we do have sufficient other types of monitoring I'd say
[11:05:10] I'd say at this point in history, really _joe_ is the authority on this, although he should maybe read that convo before making a call :)
[11:05:58] <_joe_> yeah, lemme read that
[11:10:12] <_joe_> ok so
[11:10:18] now that I get 4 years to think between conversation lines: I doubt the sh and uptime commands actually validate that the rootfs still works, they could still be in cache from other continuous check executions.
[11:10:29] (and now uptime is actually just "true")
[11:10:35] <_joe_> yeah and also
[11:10:52] <_joe_> we are using Special:BlankPage
[11:11:02] <_joe_> which loads at least some data
[11:11:09] the only angle on this that seems semi-valid
[11:11:09] <_joe_> we have multiple checks available
[11:11:34] <_joe_> if scap fails on a server, we just depool it quickly
[11:11:40] is that S:BP could be working, but ssh and/or other stuff is borked and it's falling behind on scap updates, and somehow this also kills the ssh runcommand check
[11:11:45] but, yeah, that
[11:12:04] <_joe_> I don't think we had a single case in years
[11:12:16] <_joe_> where runcommand did depool a server that the other checks didn't
[11:12:29] <_joe_> I mean if I had to name a check that would be useful
[11:12:45] <_joe_> is calling mtail to check the error rate maybe :P
[11:24:18] _joe_: minor Q while staring at things, is it intentional that eqiad api servers in row D have lower weight than A/B/C?
[11:25:08] <_joe_> it's because they're older I guess?
[11:26:49] I donno
[13:19:38] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi)
[15:06:10] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) Note that with new eqiad routing engines we can set the MSS at the router level (untested). Advantages are: easier to deploy (one configuration change) and can be applied to ext...
[15:32:45] 10Traffic, 10Analytics, 10Operations: Add google weblight to the list of trusted proxies - https://phabricator.wikimedia.org/T232849 (10Nuria)
[15:33:13] XioNoX: that reminds me, the LVS 2x bgp patch, let's aim for monday? it's somewhat risky for today at this point. ditto the last anycast recdns enablement patch.
[15:33:31] bblack: wfm!
[15:53:11] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct m...
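Going back to the lag _joe_ described at 10:47, the pool-then-wait pattern amounts to something like the sketch below. The instrumentation port, endpoint path, pool and host names, and the exact status string are assumptions for illustration; this is not the actual wrapper script:

```python
# Minimal sketch: after the desired state has been written to etcd, poll
# PyBal's HTTP API until it reports the backend as pooled, checking every
# 10 seconds and giving up after 5 attempts (matching the timeouts above).
import time
import urllib.request

POLL_INTERVAL = 10  # seconds between API checks
MAX_ATTEMPTS = 5    # give up (timeout) after this many retries

def wait_until_pooled(pool: str, host: str) -> bool:
    """Return True once PyBal reports the host as enabled/up/pooled."""
    url = f"http://localhost:9090/pools/{pool}"  # assumed endpoint and port
    for _ in range(MAX_ATTEMPTS):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                for line in resp.read().decode().splitlines():
                    # assumed format, one backend per line:
                    # "mw1261.eqiad.wmnet: enabled/up/pooled"
                    if line.startswith(host) and line.rstrip().endswith("enabled/up/pooled"):
                        return True
        except OSError:
            pass  # treat an unreachable API as "not confirmed yet"
        time.sleep(POLL_INTERVAL)
    return False

# e.g. wait_until_pooled("api_80", "mw1261.eqiad.wmnet")  # hypothetical names
```

Every extra healthcheck round PyBal needs before it notices the backend is up again adds one or more 10-second polls here, which is where the 15-20 seconds of apparent lag (and the occasional timeouts) come from.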
[15:54:45] bblack: "but we'll still need direct mitigation on the hosts where it matters for inbound" I don't think so (but could be wrong)
[15:55:41] as the routers would clamp the ADVMSS on the outbound packets, it will influence the inbound traffic's MSS
[15:56:36] I thought it was only on outbound SYN?
[15:56:57] (whereas for the cp case, we need to clamp the one in the outbound SYNACK response)
[15:58:12] if it's just "all outward-bound packets regardless of syn directionality", then yeah
[15:58:22] we can just put it on the router for all applicable links to the outside world.
[16:02:36] bblack: doc is unclear "If the router receives a TCP packet with the SYN bit"
[16:02:57] but I'd think more that it works with SYNACK too
[16:02:59] well SYN+ACK does have the SYN bit
[16:03:09] yeah exactly
[16:03:26] to be tested
[16:03:51] I mean, it would make sense. The TCP RFC also defines that the MSS Option is only sent with the SYN bit (SYN or SYN+ACK).
[16:03:55] just you know, juniper :)
[16:04:04] eh
[16:04:05] but yeah we can test it
[16:04:26] another thing for "monday" (I'm sure all the things I've said monday about will stretch longer than that)
[16:05:00] assuming it works right, it is a much better solution than software
[16:05:02] not today, but easy to test, especially if you still have your test server
[16:05:08] and we can keep doing the software hack for now just on cp3
[16:05:24] yeah
[16:06:37] and write a runbook :)
[16:07:05] step 1: don't get ddos'd
[16:07:07] :)
[16:18:50] 10Traffic, 10Operations, 10Patch-For-Review: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) As discussed on IRC, this *should* work for inbound (clamping the SYNACK too), but to be tested.
[16:56:52] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria)
[16:58:39] 10Traffic, 10Analytics, 10Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (10Nuria) I have started another ticket that, as you mentioned, better explains the rationale behind having "trusted proxies", we really do not need them if we can capture the original i...
[16:59:00] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) ping @Ottomata and @JAllemandou for thou...
[17:59:29] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10BBlack) The problem stems from the "Trust" in "...
[18:40:43] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) Right, I see the UA issue but in the abs...
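Recapping the MSS clamping from the 15:54-16:18 exchange above with some illustrative arithmetic. This assumes a plain GRE-over-IPv4 tunnel on a 1500-byte path with no GRE key/checksum/sequence options; the exact overheads and values used for T232602 or the cp3 workaround may differ:

```python
# Why the advertised MSS has to shrink when traffic is GRE-encapsulated:
# the tunnel eats part of the 1500-byte path MTU, and the MSS only counts
# the TCP payload, so the inner IP and TCP headers come off as well.
PATH_MTU = 1500    # typical Ethernet MTU on the underlying path
OUTER_IPV4 = 20    # outer IPv4 header added by GRE encapsulation
GRE_HEADER = 4     # base GRE header (no key/checksum/sequence options)

tunnel_mtu = PATH_MTU - OUTER_IPV4 - GRE_HEADER  # 1476

mss_inner_v4 = tunnel_mtu - 20 - 20  # inner IPv4 + TCP headers -> 1436
mss_inner_v6 = tunnel_mtu - 40 - 20  # inner IPv6 + TCP headers -> 1416

print(f"clamp advertised MSS to {mss_inner_v4} (IPv4) / {mss_inner_v6} (IPv6)")
```

Since each endpoint advertises its own MSS in the segment carrying its SYN bit (the SYN in one direction, the SYN+ACK in the other), a router that rewrites the option in both should influence traffic in both directions, which is the behaviour left "to be tested" on the Juniper side.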
[19:14:44] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Dzahn)
[19:14:49] 10Traffic, 10Operations, 10Phabricator: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn)
[19:15:08] 10Traffic, 10Operations, 10Phabricator: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (10Dzahn) merging into T226044
[19:18:28] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) == From the merged task: Blog posts on phame cannot currently be cached by our...
[19:21:14] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) a:05mmodell→03None This is unblocked on my end, @ema feel free to proceed wh...
[19:22:42] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) Also important, @epriestley's comment at T219978#5346100
[19:42:06] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Krinkle)
[21:04:41] 10Traffic, 10Operations, 10ops-eqiad: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10wiki_willy) Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy
[21:05:33] 10netops, 10Operations, 10ops-eqiad: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10wiki_willy) @Cmjohnson - can you provide an update on this one next week? Thanks, Willy