[09:26:37] bblack: all letsencrypt certs for *.wikipedia.org (cp[124]) are expiring on 2020-10-18. Shouldn't they have auto-renewed by now? I think you know, but it seems worth a reminder :)
[09:35:07] Traffic, Operations, Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (ema)
[09:35:13] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Followup): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (ema) Resolved→Open The dashboard has stopped working on September 1st: {F32360683}
[09:46:22] ema: they're already renewed
[09:46:48] ema: Sep 23 09:03:49 acmechief1001 acme-chief-backend[18581]: Staging_time will be enforced for unified / rsa-2048 till 2020-09-25 09:03:26
[09:47:20] bblack also mentioned it on https://phabricator.wikimedia.org/T263006#6483164
[09:47:33] * vgutierrez goes back to his vacation
[09:48:27] vgutierrez: thanks, enjoy your vac! :)
[09:48:39] <3
[10:01:53] Traffic, Operations: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) I've upgraded deployment-cache-text06 to Varnish 6, and https://en.wikipedia.beta.wmflabs.org looks fine. Later today I'll use Varnish 6 to run our VTC tests, and then proceed with the upgr...
[10:05:14] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Followup): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) That's because of the switchover, the metric is now coming from the "codfw prometheus/ops" instead of the...
[10:07:22] Traffic, Operations, Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (Gilles)
[10:07:24] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Followup): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) Open→Resolved For now I've switched the source, we'll have to remember doing it again when the pr...
[11:26:54] netops, DBA, Operations, ops-eqiad, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (jijiki)
[12:08:29] Traffic, Operations, Patch-For-Review: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) All cache_text VTC tests green with 6.0.6, proceeding with the upgrade of cp4027.
[12:49:40] so cp4027 is now pooled with 6.0.6, things look good
[12:50:03] I even reloaded VCL and the old one eventually DID get discarded
[12:50:37] exhilarating
[12:51:40] the only broken thing so far is the label for transient memory in stats
[12:51:46] varnish_sma_g_bytes{type="Transient"} is now varnish_sma_g_bytes{type="transient"}
[13:02:41] heh, nice random change :)
[13:06:51] yeah
[13:07:19] however if that's the price to pay for getting T236754 fixed, I take it :)
[13:07:19] T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754
[13:14:26] ema: if you wanted continuity of metrics, we could add a rewrite rule to the prometheis
[13:18:03] cdanis: I'm currently satisfied with =~"[Tt]ransient" on the grafana query, but yes!
[13:58:31] bblack: ema: so I know that the mmdb data we get includes an IP->ASN mapping, do we use that anywhere?
[13:58:38] context in which I'm asking is T263496
[13:58:39] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496
[13:59:35] as a larger question I think it's arguable we might want to 'staple' geoIP country and AS number as a request header to upstream services on all queries terminated by the traffic layer
[14:00:57] cdanis: at some point the various pieces of analytics magic get the AS details under "isp_data" https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
[14:01:18] ema: yeah, I believe that's done as part of some 'refine' step in Hive ingestion, but we're talking about doing it earlier
[14:01:27] yeah currently analytics does their own geoip, basically
[14:01:40] the only geoip we do at the traffic level is for the cookie for banner selections in the client
[14:02:22] this is work I didn't/won't get to this quarter, but the other reason to make AS number available at the Traffic layer itself is so we can easily say "heavily throttle / block this AS, please"
[14:02:51] there's a few tricky angles on it, probably
[14:03:38] ema: ah nice varnish 6 already deployed in ulsfo?
[14:03:53] elukey: on cp4027 only
[14:04:24] one is that an mmdb lookup per-request might be expensive (currently we mitigate this, somewhat, for some clients, by only doing the lookup+cookie-set if they don't already have a geoip session cookie)
[14:04:24] elukey: you'll be pleased to know that varnishkafka is doing fine :)
[14:04:50] but there are probably tons of clients that don't have our cookie, I'm not even sure what percentage of requests we do mmdb lookups for at this point (might be a useful thing to know!)
[14:05:17] ema: molto bene
[14:05:40] and the other that comes to mind is that if we hand off mmdb-derived data via headers to a bunch of internal services, we're opening a pandora's box of service owners using what they see without contemplating MM's terms of use, etc
[14:06:57] ah okay, the terms-of-use stuff I don't know about
[14:07:50] in practice where that becomes a problem is if service owners are doing more than just making per-request decisions based on the header. e.g. if they start storing the data in other ways and/or especially echoing it back out as data to some or all users.
[14:09:17] mm, sure, makes sense
[14:10:09] well, I'm certainly going to start with the lesser case of only doing any of this on intake-logging.wikimedia.org requests
[14:18:39] is https://book.varnish-software.com/4.0/chapters/Appendix_D__VMOD_Development.html the best primer for embedding C in VCL or are there other references?
[14:28:07] the best primer for C-in-VCL is probably to become an expert player of russian roulette through experience
[14:28:41] but the link you have is for writing vmods I think, which is somewhat-different than inline C
[14:28:51] yes, but it at least mentions inline C
[14:29:00] which is better than anything else I've found so far 🙃
[14:29:21] yeah
[14:29:46] I mean, there is https://varnish-cache.org/docs/trunk/users-guide/vcl-inline-c.html
[14:29:48] but uh.
[14:30:12] it's clearly missing a "// Here be dragons" in the example
[14:30:45] is this for the mmdb stuff?
[14:30:48] yeah
[14:31:03] we have already modules/varnish/templates/geoip.inc.vcl.erb , which might prove an interesting example and/or launching point
[14:31:07] good ol' geoip.inc.vcl.erb
[14:31:07] that gets used for the cookies
[14:31:09] yep
[14:31:44] I had planned on editing that and adding a few more subroutines and such
[14:31:56] (as much as I've planned anything in the past half hour)
[14:32:33] ok
[14:32:44] as noted in its own comments, it could also use reloading support :)
[14:32:57] there is also a TODO to switch it wholesale to a vmod
[14:33:01] (right now we just rely on the fact that VCL probably gets reloaded for some other reason at least as often as new geoip data appears)
[14:33:26] there might already be some open source mmdb vmod that works, I haven't looked in a long time
[14:33:31] there are a few!
[14:33:47] looks like a few different forks of a common ancestor
[14:34:06] https://github.com/russellsimpkins-nyt/varnish-mmdb-vmod
[14:34:19] yeah, but that also looks old enough that we've probably seen it before, so there might be a reason we don't use it
[14:34:22] yeah
[14:34:27] "NOTE This is for Varnish 3" :)
[14:34:45] they also all care about the City mmdb, whereas in this case I want to work with the ISP mmdb; not a big deal but also means it won't Just Work
[14:35:02] with a vmod variant, though, we could do the URCU thing and monitor the pathname for reloads or something
[14:35:09] right
[14:35:27] mmdb is a pretty generic format, I'd imagine there's not much functional difference to work with ASN
[14:35:34] you'll need to do some patching, but very light patching
[14:35:39] no it's probably just a few different keys in the structure, yeah
[15:14:53] Traffic, DNS, Operations, netbox: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (BBlack) @Volans is this closeable now?
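A minimal sketch of the idea raised between 13:59 and 14:02 above: map a client IP to an AS number, staple it onto the request as a header, and block a listed AS. It is plain Python for illustration only, not VCL or a vmod. The prefix table stands in for the ISP mmdb, and the header name `X-Client-ASN`, the blocklist, and both function names are assumptions that do not come from the log.

```python
import ipaddress

# Hypothetical stand-in for the ISP mmdb: prefix -> AS number.
# These are documentation prefixes and private-use ASNs, not real data.
ASN_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
    ipaddress.ip_network("198.51.100.128/25"): 64502,
}

# Hypothetical set of AS numbers to block.
BLOCKED_ASNS = {64502}


def lookup_asn(client_ip):
    """Longest-prefix match of client_ip against ASN_TABLE; None if no match."""
    addr = ipaddress.ip_address(client_ip)
    best = None
    for net, asn in ASN_TABLE.items():
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best[1] if best else None


def annotate_request(client_ip, headers):
    """Return (action, headers) with the AS number stapled on when known.

    action is "block" for a blocklisted AS and "pass" otherwise.
    The header name X-Client-ASN is an assumption made for this sketch.
    """
    out = dict(headers)
    asn = lookup_asn(client_ip)
    if asn is None:
        return "pass", out
    out["X-Client-ASN"] = str(asn)
    if asn in BLOCKED_ASNS:
        return "block", out
    return "pass", out
```

A real deployment would read the mmdb file rather than a dict, and the per-request lookup cost mentioned at 14:04:24 still applies to whatever performs the lookup.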
[15:21:49] Traffic, DNS, Operations, netbox: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (Volans) Open→Resolved a:Volans I think so didn't get any report of issues.
[16:02:02] netops, Operations: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (CDanis) I'm no expert here, but seems reasonable enough to me.
[16:06:40] Traffic, Operations: Clean up DNS server puppetization - https://phabricator.wikimedia.org/T240285 (BBlack) Open→Resolved The new puppetization has been stable for quite a while now, we can resolve this, as it's kind of ambiguous what if any further improvements are warranted outside of any speci...
[16:06:44] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack)
[16:11:38] Traffic, Operations: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852 (BBlack)
[16:11:42] Traffic, Operations: Consolidate misc servers at edge sites - https://phabricator.wikimedia.org/T257323 (BBlack)
[16:12:00] Traffic, Operations: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852 (BBlack) ^ Remaining work superseded by new plans in the ticket this was closed into.
[16:23:41] Traffic, Operations: High number of failed inbound TFO connections in esams Mon-Fri - https://phabricator.wikimedia.org/T143562 (BBlack) Open→Declined No movement in 4 years. If there are new/ongoing TFO issues, someone should make a new ticket about them!
[16:25:46] Traffic, Operations, RESTBase, RESTBase-API, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178 (BBlack) @Krinkle - Is this ticket still worth pursuing at all?
[16:30:04] Traffic, CX-cxserver, Citoid, Operations, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (BBlack) @Pchelolo what about `https://cxserver.wikimedia.org/` - Can it be removed? Or is it better to just ignore it...
[16:38:26] HTTPS, Traffic, Operations: Inbound TLS for tier-1 varnish backend caches - https://phabricator.wikimedia.org/T109321 (BBlack) Open→Invalid There is no more varnish-be
[16:38:28] HTTPS, Traffic, Operations, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (BBlack)
[16:38:47] HTTPS, Traffic, Operations, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (BBlack)
[16:38:57] HTTPS, Traffic, Varnish, Operations, codfw-rollout: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325 (BBlack) Open→Invalid There is no more varnish-be
[16:44:04] HTTPS, Traffic, Operations, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (BBlack) All subtasks gone, but there are technically still a few edge cases showing up in the trafficserver backend-facing config. Specifically: ` $ grep 'replacement: htt...
[16:45:03] HTTPS, Traffic, Operations, Wikimedia-General-or-Unknown: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269 (BBlack) Open→Declined We've long since moved on from this. Nginx isn't even terminating our public TLS...
[16:52:57] Traffic, Operations, Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559 (BBlack) Open→Resolved a:BBlack A lot of this planning is already-done, and the remainder of the plans are in progre...
[16:55:50] Traffic, Operations: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (BBlack) Open→Declined This has been idle over a year, and some of the software stack referenced here doesn't exist anymore.
[16:56:59] Traffic, CX-cxserver, Citoid, Operations, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (KartikMistry) >>! In T133001#6488134, @BBlack wrote: > @Pchelolo what about `https://cxserver.wikimedia.org/` - Can it...
[16:58:00] Traffic, Operations: Wikipedia is unavailable on Symbian phone's browsers - https://phabricator.wikimedia.org/T227828 (BBlack) Open→Declined I don't think there's much we can do here. We can expect there will be more tickets like this over time as we deprecate and remove legacy TLS standards, fr...
[17:03:34] Traffic, Operations: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (BBlack) @jbond any further thoughts here? We do still have ~55 jessies: ` conf[2001-2003].codfw.wmnet,dbmonitor1001.wikimedia.org,helium.eqiad.wmnet,heze.codfw.wmnet,kraz.wikimed...
[17:09:02] Traffic, Operations, Security: HTTP MediaWiki API GET requests to Wikimedia wikis should not be redirected to HTTPS when they have a session cookie or Authorization header - https://phabricator.wikimedia.org/T247490 (BBlack) Yeah this is an interesting angle on things. Currently for all traffic to c...
[17:27:43] Traffic, Varnish, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, and 3 others: Spike: CentralNotice: Verify that our Special:HideBanners cookie storm works as efficiently as possible - https://phabricator.wikimedia.org/T117435 (BBlack) Open→Resolved a:BBlack Resolving for no...
[17:30:07] Traffic, Operations: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839 (BBlack) Is this still an ongoing concern? No updates since 2016
[17:32:56] Traffic, Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747 (BBlack) Open→Invalid Very old ticket references very old stuff. If there are still similar concerns in more-modern cache puppetization, someone should make a new ticket!
[17:33:57] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873 (BBlack) Are we still working on something here, or is this best closed and any remaining concerns op...
[17:37:03] Traffic, MediaWiki-API, Operations, Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314 (BBlack) Anything to still pursue here? It's been a few years. Obviously, one path towards fixing these things is to not emit `Vary: Cookie...
[17:37:40] Traffic, Operations, Mobile: Samsung Internet's desktop mode getting redirected to mobile site - https://phabricator.wikimedia.org/T158599 (BBlack) Open→Declined Reopen or make a new ticket if this is still an issue for a real user, it's too-stale with no movement as-is.
[17:40:30] Traffic, netops, Operations, Pybal, Patch-For-Review: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674 (BBlack) Open→Declined Declining for lack of movement and lack of urgency.
[17:40:53] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691 (BBlack) Open→Declined `ssl_do_wait_shutdown` never really did anything, declining this one for lack of urgency (are there users impacted?) and m...
[17:41:48] Traffic, Operations, Patch-For-Review: Uncacheable content handling: hfp vs hfm - https://phabricator.wikimedia.org/T180434 (BBlack) Open→Resolved a:ema Looks like this was resolved long ago and didn't block the V5 upgrade
[17:43:46] Traffic, CheckUser, Operations: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (BBlack) Is this still desirable for checkusers? Infrastructure has changed since then and is still-changing, but we could probably find a way to pass t...
[17:45:18] Traffic, Operations: Consider using vmod_var instead of temporary headers in VCL - https://phabricator.wikimedia.org/T198620 (BBlack) This might be a useful project still, as it might help clarify our remaining frontend VCL going forward. Maybe keep this for a backburner thing to attack post-V6-upgrade.
[17:46:15] Traffic, Operations: cp3040: kernel crash in ipsec code shortly after reboot - https://phabricator.wikimedia.org/T201666 (BBlack) Open→Invalid Ipsec for cp nodes is long gone, as is this kernel I'm sure
[17:48:52] netops, Operations: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (jbond) Sorry missed this looks good to me
[17:50:20] Traffic, Operations: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (BBlack) Open→Invalid We don't have varnish-be anymore.
[17:52:31] Traffic, Operations, Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (BBlack) Open→Resolved a:ema Long-ago dealt with it looks like, and in any case varnish-be doesn't exist anymore.
[17:53:55] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (BBlack) Open→Declined Declining for now, as multiple implicated parts of the software stack have changed significantly since this report, and nothing was...
[17:57:39] Traffic, Operations, Patch-For-Review: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (jbond) >>! In T242991#6488366, @BBlack wrote: > @jbond any further thoughts here? We do still have ~55 jessies: > > ` > conf[2001-2003].codfw.wmnet,dbmonito...
[17:58:55] Traffic, Operations: cp3032 and cp3040 occasional failed fetches - https://phabricator.wikimedia.org/T235736 (BBlack) Open→Declined Probably related to the transient memory issues discussed in various tickets: T164768 T165063 T249809 . In any case this is almost a year old with no investigation,...
[18:37:19] Traffic, Operations, RESTBase, RESTBase-API, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178 (Krinkle) Yes, in so far that the primary RESTBase URL is still completely broken if accessed through the canonical version of that domainna...
[19:02:07] Traffic, Operations, observability, Patch-For-Review: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (crusnov) p:Triage→Medium a:fgiunchedi
[19:51:28] Traffic, Operations, Performance-Team (Radar): experiment with a "unified" ATS-BE pool - https://phabricator.wikimedia.org/T263291 (crusnov) p:Triage→Medium