[04:09:11] <wikibugs>	 Traffic, Operations, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (BBlack) We'll also need to normalize the incoming `Accept` headers up in the edge...
[05:34:33] <wikibugs>	 Traffic, Operations: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez)
[05:34:48] <wikibugs>	 Traffic, Operations: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez) p:Triage→Normal
[05:35:17] <wikibugs>	 Traffic, Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:35:19] <wikibugs>	 Traffic, Operations, Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (Vgutierrez) Open→Resolved
[05:39:11] <wikibugs>	 Acme-chief, Traffic, Operations, Patch-For-Review: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (Vgutierrez)
[06:27:21] <vgutierrez>	 Krenair: hi! is there any labs environment, as far as you know, using acme-chief with OCSP enabled?
[06:28:48] <vgutierrez>	 Krenair: https://gerrit.wikimedia.org/r/c/operations/puppet/+/537789 could mess with those
[06:29:35] <vgutierrez>	 deployment-prep already got acme-chief 0.21 so it already has server side OCSP responses
[06:29:49] <vgutierrez>	 so it shouldn't be an issue AFAIK
[08:12:09] <wikibugs>	 netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) ` elukey@asw2-a-eqiad# show | compare [edit interfaces xe-2/0/24] -   description cloudvirtan1002; +...
[08:18:12] <wikibugs>	 netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) ` elukey@asw2-b-eqiad# show | compare [edit interfaces xe-4/0/5] -   description cloudvirtan1004; +...
[08:33:54] <wikibugs>	 netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) @Cmjohnson @Jclark-ctr there is one last problem - an-presto1005:  1) is not connected to any switch...
[09:27:44] <wikibugs>	 Wikimedia-Apache-configuration, Performance-Team, Patch-For-Review: Apache configuration: SVGs served by MediaWiki aren't gzipped - https://phabricator.wikimedia.org/T232615 (elukey) Reporting a discussion that happened on IRC: the change looks good, but it seems that Varnish/ATS explicitly unset the Acce...
[10:40:28] <wikibugs>	 Traffic, Operations, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Lucas_Werkmeister_WMDE) Real code: [`MIMEParse.java`](https://github.com/LinkedDat...
[12:12:27] <wikibugs>	 netops, DC-Ops, Operations, ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (Cmjohnson)
[12:12:30] <wikibugs>	 netops, Operations, ops-eqiad: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (Cmjohnson)
[14:51:38] <wikibugs>	 netops, Operations: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (Papaul)
[15:14:47] <wikibugs>	 netops, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (ayounsi) p:Triage→Normal
[15:31:07] <paravoid>	 .win 29
[15:57:34] <bblack>	 XioNoX: I've been missing windows here and there to go sync up with you on the LVS multi-bgp patch for codfw, sorry!  Can we try in ~1h after I get through my next meeting?
[15:58:07] <XioNoX>	 bblack: yep
[15:58:12] <bblack>	 ok
[16:01:04] <wikibugs>	 netops, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (RobH) Just FYI it seems the serial consoles have some built-in nagios support.  I've attached a printout of the nagios configuration screen below.  {F30398437}
[16:02:23] <wikibugs>	 netops, Icinga, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (RobH)
[17:02:38] <XioNoX>	 bblack: let me know when ready, what server to start with, etc..
[17:13:58] <bblack>	 XioNoX: ok, gonna rebase patch, etc.  They're all 3x backups in codfw (2004, 5, 6), since it's still on the older 3+3 system.
[17:14:05] <bblack>	 I'll go 6-5-4
[17:17:10] <XioNoX>	 ok
[17:17:18] <bblack>	 they were all cr2-codfw-only before, now cr1+cr2
[17:18:20] <XioNoX>	 bblack: added 6 to cr1
[17:20:20] <bblack>	 Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.FSM@0x7f83dc17cc90 peer 208.80.153.192:179] INFO: State is now: ESTABLISHED
[17:20:22] <XioNoX>	 I see it established
[17:20:23] <bblack>	 Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.BGPFactory@0x7f83dc16dcf8] INFO: BGP session established for ASN 64600 peer 208.80.153.192
[17:20:26] <bblack>	 Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.FSM@0x7f83dc1870d0 peer 208.80.153.193:179] INFO: State is now: ESTABLISHED
[17:20:29] <bblack>	 Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.BGPFactory@0x7f83dc1679e0] INFO: BGP session established for ASN 64600 peer 208.80.153.193
[17:20:32] <bblack>	 seems ok here?
[17:20:32] <bblack>	 ok
[17:20:40] <XioNoX>	 Received prefixes:            34
[17:20:53] <XioNoX>	 bblack: v4 prefixes only?
[17:21:04] <bblack>	 should be both?
[17:21:28] <bblack>	 oh interesting hmmm
[17:21:34] <bblack>	 give me a sec on that ...
[17:22:15] <XioNoX>	 not receiving v6 prefixes from 6 on either cr1 or cr2
[17:22:15] <bblack>	 XioNoX: correct, 2006 only has ipv4, normal
[17:22:21] <XioNoX>	 cool :)
[17:22:28] <bblack>	 (because it's only the internal services, and none of them have v6 defined)
[17:23:14] <XioNoX>	 adding 4 and 5 to cr1
[17:24:34] <bblack>	 cr1 didn't reach estab I think for 2005
[17:24:52] <bblack>	 hmm maybe I'm reading logs wrong
[17:24:53] <XioNoX>	 2005 is established
[17:25:09] <bblack>	 Sep 19 17:23:47 lvs2005 pybal[27851]: [bgp.FSM@0x7f9fccff9f90 peer 208.80.153.192:179] INFO: State is now: ESTABLISHED
[17:25:13] <bblack>	 Sep 19 17:23:47 lvs2005 pybal[27851]: [bgp.BGPFactory@0x7f9fccffd518] INFO: BGP session established for ASN 64600 peer 208.80.153.192
[17:25:13] <XioNoX>	 4*v4 and 3*v6 prefixes
[17:25:16] <bblack>	 Sep 19 17:24:19 lvs2005 pybal[27851]: [bgp.FSM@0x7f9fcca8a490 peer 208.80.153.193:53970] INFO: State is now: ESTABLISHED
[17:25:19] <bblack>	 Sep 19 17:24:19 lvs2005 pybal[27851]: [bgp.BGPFactory@0x7f9fccffda70] INFO: BGP session established for ASN 64600 peer 208.80.153.193
[17:25:22] <bblack>	 there was a delay for some reason between the two, on pybal's side
[17:26:22] <XioNoX>	 router side looks good
[17:27:37] <bblack>	 all looks good here for all 3 now
[17:28:03] <bblack>	 figure let this ride and see if we get any anomalies for a few days (e.g. in LVS load or healthchecks or who knows what might have some weird indirect impact)
[17:28:16] <bblack>	 then we can progressively roll out others next week, etc
[17:28:23] <wikibugs>	 Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (Krinkle)
[17:28:32] <XioNoX>	 sounds good!
[17:28:33] <XioNoX>	 thx!
[17:59:12] <XioNoX>	 bblack: 2004 isn't established
[17:59:24] <XioNoX>	 on cr1
[18:00:12] <bblack>	 oh?
[18:00:21] <bblack>	 heh you're right
[18:00:33] <bblack>	 I did 6, then 5, then 6, I never restarted 2004
[18:00:36] <bblack>	 bad brain :P
[18:01:54] <bblack>	 apparently I even logged it correctly on !log as 6-5-6 :)
[18:02:07] <XioNoX>	 yeah and I checked 5 again...
[18:02:12] <XioNoX>	 anyway, 4 is up
[18:02:18] <bblack>	 thanks!
[18:07:56] <XioNoX>	 bblack: I might be hitting more MTU issues
[18:08:24] <XioNoX>	 like SSH to pfw3-codfw.wikimedia.org through bast1002.wikimedia.org hangs in the middle of the ssh handshake
[18:08:32] <XioNoX>	 but via bast2002 it works fine
[18:09:17] <bblack>	 right
[18:09:21] <bblack>	 we didn't apply it to bastions
[18:09:34] <bblack>	 should we switch to router clamping in the 3x affected sites?
[18:09:43] <bblack>	 err sorry, in eqiad only I guess
[18:09:58] <bblack>	 esams can't, and we decided that esams/eqsin cases weren't that important, being edge-only
[18:10:06] <XioNoX>	 we can try yep
[18:10:08] <bblack>	 worst case someone may have to reconfigure their bastion if so affected
[18:14:00] <XioNoX>	 bblack: pushed to cr1
[18:15:15] <XioNoX>	 and cr2
[18:15:21] <XioNoX>	 it solved my SSH issue
[18:15:50] <XioNoX>	 bblack: want to try to rollback the tcp-mss on the servers?
[18:16:38] <bblack>	 I have another short meeting starting in ~14 mins, will poke at that afterwards when I can watch better
[18:17:24] <XioNoX>	 bblack: also codfw and eqiad exchange a full view, so traffic from eqiad can exit through codfw as well
[18:17:35] <XioNoX>	 so in theory we should clamp codfw as well
[18:17:47] <XioNoX>	 I don't know how much traffic we're talking about, probably not much
[18:23:25] <XioNoX>	 bblack: also setting tcp-mss causes all the BGP peers of that interface to bounce...
[18:24:58] <bblack>	 heh, that's fun :/
[18:29:34] <bblack>	 it makes sense though, they are tcp conns!
[18:29:38] <bblack>	 back in ~30
[18:30:27] <wikibugs>	 Traffic, Operations: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (ayounsi) * Setting tcp-mss on an interface causes all the BGP sessions going over that interface to bounce * As eqiad and codfw exchange a full view, some outbound eqiad traffic goes through codfw so we shoul...
[18:32:08] <XioNoX>	 I don't think it makes sense, it should only impact new TCP sessions, it only modifies SYN packets on the fly
[18:32:18] <XioNoX>	 "Reason: Interface change for the peer-group"
[18:32:27] <XioNoX>	 not very verbose, and the doc doesn't say anything either
[18:55:23] <XioNoX>	 bblack: this is on hackernews homepage https://news.ycombinator.com/item?id=21016972
[18:55:53] <XioNoX>	 so I guess phabricator can handle the load?
[18:56:23] <bblack>	 I don't think we've done any tech work on our end, re: cacheability (would've been on a different subdomain than phab.wm.o anyways)
[18:56:33] <bblack>	 so hopefully phab will hold up to HN :)
[19:09:35] <bblack>	 ok so yeah the syn thing
[19:11:20] <bblack>	 I didn't realize we had some asymmetric routing cross between eqiad/codfw
[19:11:28] <bblack>	 that certainly makes things interesting :)
[19:11:57] <bblack>	 for that matter there's eqord too right?
[19:12:09] <bblack>	 (don't they both get some traffic via-eqord?)
[19:13:02] <XioNoX>	 bblack: yeah, eqord is in the same confederation as eqiad, so I pushed the tcp-mss there too
[19:13:05] <bblack>	 the mss clamping is only on eqiad's transit/peer right?
[19:13:09] <bblack>	 oh ok
[19:13:33] <XioNoX>	 even if it's geographically far, it's as close to eqiad as eqdfw is to codfw
[19:13:40] <bblack>	 would it be possible to apply the tcp-mss clamp on codfw/eqdfw routers, but only for source addresses in eqiad's range?
[19:19:24] <XioNoX>	 bblack: nop
[19:19:29] <XioNoX>	 it's all interface or nothing
[19:19:50] <XioNoX>	 I mean, it's for the whole interface, no filtering possible
[19:20:04] <bblack>	 yeah interesting
[19:20:23] <bblack>	 and we can't just put it on the transports from eqiad->dfw because it would clamp all our internal transport traffic heh
[19:21:47] <bblack>	 the main thing I'm concerned about is pulling the cp's hacks in eqiad.  if we're routing via-dfw from eqiad for some users, we'll see the return of the "can't edit" complaints and such
[19:22:21] <bblack>	 maybe we can just pull back on the other hosts, since they're smaller cases and less statistically likely and/or impactful
[19:22:37] <bblack>	 I donno
[19:22:44] <bblack>	 or we can shut down the CF routing :P
[19:33:19] <XioNoX>	 bblack: yeah, we can keep the cp hack for now
[19:33:31] <XioNoX>	 that too :)
[19:33:55] <XioNoX>	 not my call, but looks safe and easy to re-add if needed
[19:34:34] <XioNoX>	 for the future we can discuss the cost/benefits of sharing a full view between codfw/eqiad
[19:34:47] <XioNoX>	 vs. having them as distinct sites
[19:39:08] <bblack>	 we could also just stop our prepends for now and take a chunk of the traffic back.  But it wouldn't resolutely fix anything MTU-related until we pull the CF routing.
[19:40:28] <bblack>	 anyways, I pinged about the CF routing again.  Assume it stays for now, and let's just leave the host-level stuff in stasis.  the router clamping just buys us some extra protection for now.
[19:41:29] <XioNoX>	 bblack: no prepend in eqiad, they advertise more specific
[19:42:45] <XioNoX>	 also gives us an extra data point that clamping at the router level works (and includes a free BGP bounce)
[19:43:05] <bblack>	 yeah :)
[20:57:56] <wikibugs>	 Traffic, MediaWiki-ResourceLoader, Operations, Performance-Team, Performance-Team-publish: The 5min expires for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (Krinkle)
[20:58:00] <wikibugs>	 Traffic, MediaWiki-ResourceLoader, Operations, Performance-Team, Performance-Team-publish: The 5min expiry for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (Krinkle)
[21:03:43] <wikibugs>	 netops, DC-Ops, Operations, ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (Cmjohnson) Open→Resolved It's been nearly 24 hours and there are 0 errors. resolving the task  cmjohnson@asw2-c-eqiad> show interfaces xe-2/0/45 extensive |...
[22:38:57] <wikibugs>	 Traffic, Operations, Release-Engineering-Team-TODO: Blubberoid endpoint intermittently routing to MediaWiki backend - https://phabricator.wikimedia.org/T233369 (dduvall)
[22:51:46] <marxarelli>	 bblack, vgutierrez: just wanted to ping you on https://phabricator.wikimedia.org/T233369
[22:53:07] <marxarelli>	 blubberoid intermittently failing isn't a UBN but in case it applies to all service requests hitting cp1075
[23:52:46] <wikibugs>	 Traffic, MobileFrontend, Operations: Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (AntiCompositeNumber)