[04:09:11] Traffic, Operations, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (BBlack) We'll also need to normalize the incoming `Accept` headers up in the edge...
[05:34:33] Traffic, Operations: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez)
[05:34:48] Traffic, Operations: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez) p:Triage→Normal
[05:35:17] Traffic, Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:35:19] Traffic, Operations, Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (Vgutierrez) Open→Resolved
[05:39:11] Acme-chief, Traffic, Operations, Patch-For-Review: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (Vgutierrez)
[06:27:21] Krenair: hi! is there any labs environment, as far as you know, using acme-chief with OCSP enabled?
[06:28:48] Krenair: https://gerrit.wikimedia.org/r/c/operations/puppet/+/537789 could mess with those
[06:29:35] deployment-prep already got acme-chief 0.21 so it already has server side OCSP responses
[06:29:49] so it shouldn't be an issue AFAIK
[08:12:09] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) ` elukey@asw2-a-eqiad# show | compare [edit interfaces xe-2/0/24] - description cloudvirtan1002; +...
[08:18:12] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN.
- https://phabricator.wikimedia.org/T225128 (elukey) ` elukey@asw2-b-eqiad# show | compare [edit interfaces xe-4/0/5] - description cloudvirtan1004; +...
[08:33:54] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) @Cmjohnson @Jclark-ctr there is one last problem - an-presto1005: 1) is not connected to any switch...
[09:27:44] Wikimedia-Apache-configuration, Performance-Team, Patch-For-Review: Apache configuration: SVGs served by MediaWiki aren't gzipped - https://phabricator.wikimedia.org/T232615 (elukey) Reporting a discussion that happened on IRC: the change looks good, but it seems that Varnish/ATS explicitly unset the Acce...
[10:40:28] Traffic, Operations, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Lucas_Werkmeister_WMDE) Real code: [`MIMEParse.java`](https://github.com/LinkedDat...
[12:12:27] netops, DC-Ops, Operations, ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (Cmjohnson)
[12:12:30] netops, Operations, ops-eqiad: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (Cmjohnson)
[14:51:38] netops, Operations: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (Papaul)
[15:14:47] netops, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (ayounsi) p:Triage→Normal
[15:31:07] .win 29
[15:57:34] XioNoX: I've been missing windows here and there to go sync up with you on the LVS multi-bgp patch for codfw, sorry! Can we try in ~1h after I get through my next meeting?
[15:58:07] bblack: yep
[15:58:12] ok
[16:01:04] netops, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (RobH) Just FYI it seems the serial consoles have some built-in nagios support. I've attached a printout of the nagios configuration screen below. {F30398437}
[16:02:23] netops, Icinga, Operations, observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (RobH)
[17:02:38] bblack: let me know when ready, what server to start with, etc..
[17:13:58] XioNoX: ok, gonna rebase patch, etc. They're all 3x backups in codfw (2004, 5, 6), since it's still on the older 3+3 system.
[17:14:05] I'll go 6-5-4
[17:17:10] ok
[17:17:18] they were all cr2-codfw-only before, now cr1+cr2
[17:18:20] bblack: added 6 to cr1
[17:20:20] Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.FSM@0x7f83dc17cc90 peer 208.80.153.192:179] INFO: State is now: ESTABLISHED
[17:20:22] I see it established
[17:20:23] Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.BGPFactory@0x7f83dc16dcf8] INFO: BGP session established for ASN 64600 peer 208.80.153.192
[17:20:26] Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.FSM@0x7f83dc1870d0 peer 208.80.153.193:179] INFO: State is now: ESTABLISHED
[17:20:29] Sep 19 17:20:00 lvs2006 pybal[6248]: [bgp.BGPFactory@0x7f83dc1679e0] INFO: BGP session established for ASN 64600 peer 208.80.153.193
[17:20:32] seems ok here?
[17:20:32] ok
[17:20:40] Received prefixes: 34
[17:20:53] bblack: v4 prefixes only?
[17:21:04] should be both?
[17:21:28] oh interesting hmmm
[17:21:34] give me a sec on that ...
[17:22:15] not receiving v6 prefixes from 6 on either cr1 or cr2
[17:22:15] XioNoX: correct, 2006 only has ipv4, normal
[17:22:21] cool :)
[17:22:28] (because it's only the internal services, and none of them have v6 defined)
[17:23:14] adding 4 and 5 to cr1
[17:24:34] cr1 didn't reach estab I think for 2005
[17:24:52] hmm maybe I'm reading logs wrong
[17:24:53] 2005 is established
[17:25:09] Sep 19 17:23:47 lvs2005 pybal[27851]: [bgp.FSM@0x7f9fccff9f90 peer 208.80.153.192:179] INFO: State is now: ESTABLISHED
[17:25:13] Sep 19 17:23:47 lvs2005 pybal[27851]: [bgp.BGPFactory@0x7f9fccffd518] INFO: BGP session established for ASN 64600 peer 208.80.153.192
[17:25:13] 4*v4 and 3*v6 prefixes
[17:25:16] Sep 19 17:24:19 lvs2005 pybal[27851]: [bgp.FSM@0x7f9fcca8a490 peer 208.80.153.193:53970] INFO: State is now: ESTABLISHED
[17:25:19] Sep 19 17:24:19 lvs2005 pybal[27851]: [bgp.BGPFactory@0x7f9fccffda70] INFO: BGP session established for ASN 64600 peer 208.80.153.193
[17:25:22] there was a delay for some reason between the two, on pybal's side
[17:26:22] router side looks good
[17:27:37] all looks good here for all 3 now
[17:28:03] figure let this ride and see if we get any anomalies for a few days (e.g. in LVS load or healthchecks or who knows what might have some weird indirect impact)
[17:28:16] then we can progressively roll out others next week, etc
[17:28:23] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (Krinkle)
[17:28:32] sounds good!
[17:28:33] thx!
[17:59:12] bblack: 2004 isn't established
[17:59:24] on cr1
[18:00:12] oh?
[18:00:21] heh you're right
[18:00:33] I did 6, then 5, then 6, I never restarted 2004
[18:00:36] bad brain :P
[18:01:54] apparently I even logged it correctly on !log as 6-5-6 :)
[18:02:07] yeah and I checked 5 again...
[18:02:12] anyway, 4 is up
[18:02:18] thanks!
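Editor's note on the missed lvs2004 restart: the pybal lines pasted into the channel carry enough information to check session state per host and per peer mechanically. What follows is a minimal Python sketch, assuming the log format is exactly what is quoted in this log; the function name `established_peers` and the idea of feeding it collected syslog lines are illustrative and not part of pybal.

```python
import re
from collections import defaultdict

# Pattern modelled on the pybal FSM lines quoted in this log.
# Assumption: other pybal versions or syslog setups may format these differently.
FSM_LINE = re.compile(
    r"(?P<host>\S+) pybal\[\d+\]: \[bgp\.FSM@0x[0-9a-f]+ "
    r"peer (?P<peer>[^\s\]]+):(?P<port>\d+)\] "
    r"INFO: State is now: (?P<state>\w+)"
)


def established_peers(lines):
    """Return {host: {peer_ip, ...}} for peers whose most recent state is ESTABLISHED."""
    latest = {}
    for line in lines:
        match = FSM_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the last state counts.
            key = (match.group("host"), match.group("peer"))
            latest[key] = match.group("state")

    result = defaultdict(set)
    for (host, peer), state in latest.items():
        if state == "ESTABLISHED":
            result[host].add(peer)
    return dict(result)
```

Comparing the keys of the result against the list of hosts that were supposed to be restarted (here lvs2004, lvs2005, lvs2006) would flag a host that never logged an ESTABLISHED session to a given router.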
[18:07:56] bblack: I might be hitting more MTU issues
[18:08:24] like SSH to pfw3-codfw.wikimedia.org through bast1002.wikimedia.org hangs in the middle of the ssh handshake
[18:08:32] but via bast2002 it works fine
[18:09:17] right
[18:09:21] we didn't apply it to bastions
[18:09:34] should we switch to router clamping in the 3x affected sites?
[18:09:43] err sorry, in eqiad only I guess
[18:09:58] esams can't, and we decided that esams/eqsin cases weren't that important, being edge-only
[18:10:06] we can try yep
[18:10:08] worst case someone may have to reconfigure their bastion if so affected
[18:14:00] bblack: pushed to cr1
[18:15:15] and cr2
[18:15:21] it solved my SSH issue
[18:15:50] bblack: want to try to roll back the tcp-mss on the servers?
[18:16:38] I have another short meeting starting in ~14 mins, will poke at that afterwards when I can watch better
[18:17:24] bblack: also codfw-eqiad exchange a full view, so traffic from eqiad can exit through codfw as well
[18:17:35] so in theory we should clamp codfw as well
[18:17:47] I don't know how much traffic we're talking about, probably not much
[18:23:25] bblack: also setting tcp-mss causes all the BGP peers of that interface to bounce...
[18:24:58] heh, that's fun :/
[18:29:34] it makes sense though, they are tcp conns!
[18:29:38] back in ~30
[18:30:27] Traffic, Operations: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (ayounsi) * Setting tcp-mss on an interface causes all the BGP sessions going over that interface to bounce * As eqiad and codfw exchange a full view, some outbound eqiad traffic goes through codfw so we shoul...
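Editor's note on the tcp-mss clamping above: the MSS a TCP endpoint can safely use is the path MTU minus the IP and TCP headers, and a tunnel on the path lowers that MTU by its encapsulation overhead. The Python sketch below works through the arithmetic under stated assumptions: a 1500-byte link MTU, GRE carried over IPv4 with the base 4-byte GRE header (no key, checksum or sequence fields), and no IP or TCP options. The values actually configured on the routers are not given in this log, and `clamped_mss` is a hypothetical helper name.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, without options
GRE_HEADER = 4    # bytes, base GRE header without optional fields


def clamped_mss(link_mtu=1500, inner_ipv6=False,
                outer_header=IPV4_HEADER, gre_header=GRE_HEADER):
    """Largest TCP payload per segment that fits through the tunnel unfragmented."""
    # The tunnel MTU is what remains after the outer IP and GRE headers.
    tunnel_mtu = link_mtu - outer_header - gre_header
    # The inner packet still carries its own IP and TCP headers.
    inner_ip = IPV6_HEADER if inner_ipv6 else IPV4_HEADER
    return tunnel_mtu - inner_ip - TCP_HEADER


print(clamped_mss())                 # IPv4 inside GRE over IPv4
print(clamped_mss(inner_ipv6=True))  # IPv6 inside GRE over IPv4
```

Under these assumptions the tunnel MTU is 1476 bytes, giving an MSS of 1436 for IPv4 and 1416 for IPv6. Because the MSS option is only carried in SYN segments, a clamp applied in the network can only take effect for connections opened after it is in place.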
[18:32:08] I don't think it makes sense, it should only impact new TCP sessions, it only modifies SYN packets on the fly
[18:32:18] "Reason: Interface change for the peer-group"
[18:32:27] not very verbose, and the doc doesn't say anything either
[18:55:23] bblack: this is on hackernews homepage https://news.ycombinator.com/item?id=21016972
[18:55:53] so I guess phabricator can handle the load?
[18:56:23] I don't think we've done any tech work on our end, re: cacheability (would've been on a different subdomain than phab.wm.o anyways)
[18:56:33] so hopefully phab will hold up to HN :)
[19:09:35] ok so yeah the syn thing
[19:11:20] I didn't realize we had some asymmetric routing cross between eqiad/codfw
[19:11:28] that certainly makes things interesting :)
[19:11:57] for that matter there's eqord too right?
[19:12:09] (don't they both get some traffic via-eqord?)
[19:13:02] bblack: yeah, eqord is in the same confederation as eqiad, so I pushed the tcp-mss there too
[19:13:05] the mss clamping is only on eqiad's transit/peer right?
[19:13:09] oh ok
[19:13:33] even if it's geographically far, it's as close to eqiad as eqdfw is to codfw
[19:13:40] would it be possible to apply the tcp-mss clamp on codfw/eqdfw routers, but only for source addresses in eqiad's range?
[19:19:24] bblack: nop
[19:19:29] it's all interface or nothing
[19:19:50] I mean, it's for the whole interface, no filtering possible
[19:20:04] yeah interesting
[19:20:23] and we can't just put it on the transports from eqiad->dfw because it would clamp all our internal transport traffic heh
[19:21:47] the main thing I'm concerned about is pulling the cp's hacks in eqiad.
if we're routing via-dfw from eqiad for some users, we'll see the return of the "can't edit" complaints and such
[19:22:21] maybe we can just pull back on the other hosts, since they're smaller cases and less statistically likely and/or impactful
[19:22:37] I donno
[19:22:44] or we can shut down the CF routing :P
[19:33:19] bblack: yeah, we can keep the cp hack for now
[19:33:31] that too :)
[19:33:55] not my call, but looks safe and easy to re-add if needed
[19:34:34] for the future we can discuss the cost/benefits of sharing a full view between codfw/eqiad
[19:34:47] vs. having them as distinct sites
[19:39:08] we could also just stop our prepends for now and take a chunk of the traffic back. But it wouldn't resolutely fix anything MTU-related until we pull the CF routing.
[19:40:28] anyways, I pinged about the CF routing again. Assume it stays for now, and let's just leave the host-level stuff in stasis. the router clamping just buys us some extra protection for now.
[19:41:29] bblack: no prepend in eqiad, they advertise more specific
[19:42:45] also gives us an extra data point that clamping at the router level works (and includes a free BGP bounce)
[19:43:05] yeah :)
[20:57:56] Traffic, MediaWiki-ResourceLoader, Operations, Performance-Team, Performance-Team-publish: The 5min expires for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (Krinkle)
[20:58:00] Traffic, MediaWiki-ResourceLoader, Operations, Performance-Team, Performance-Team-publish: The 5min expiry for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (Krinkle)
[21:03:43] netops, DC-Ops, Operations, ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (Cmjohnson) Open→Resolved It's been nearly 24 hours and there are 0 errors.
resolving the task cmjohnson@asw2-c-eqiad> show interfaces xe-2/0/45 extensive |...
[22:38:57] Traffic, Operations, Release-Engineering-Team-TODO: Blubberoid endpoint intermittently routing to MediaWiki backend - https://phabricator.wikimedia.org/T233369 (dduvall)
[22:51:46] bblack, vgutierrez: just wanted to ping you on https://phabricator.wikimedia.org/T233369
[22:53:07] blubberoid intermittently failing isn't a UBN but in case it applies to all service requests hitting cp1075
[23:52:46] Traffic, MobileFrontend, Operations: Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (AntiCompositeNumber)