[05:35:30] Traffic, Operations, good first task: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (ema) >>! In T249317#6034060, @srishakatux wrote: > @ema Hello! As this task is tagged as a #good_first_task, I'm wondering if it can be made clear wher...
[05:57:27] Traffic, Operations, Performance-Team, Patch-For-Review, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (ema) >>! In T238494#6031455, @Gilles wrote: > Are there any other upcoming performance improvements in the p...
[06:02:58] Traffic, Operations, Performance-Team, Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (MoritzMuehlenhoff) p:Triage→Medium
[06:07:50] Traffic, Operations: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (ema)
[06:07:56] Traffic, Operations: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (ema) p:Triage→High
[06:14:10] Traffic, MediaWiki-Cache, Operations, Page Content Service, and 4 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (ema) We have discussed this during yesterday's #traffic meeting and the current plan to attack the issue is: (1) @bb...
[06:16:51] Traffic, Operations, Patch-For-Review: varnishd crashes in vbf_stp_condfetch(): cp3057 and cp3061 - https://phabricator.wikimedia.org/T249344 (ema) Open→Resolved a:ema 5.1.3-1wm13 deployed.
[06:19:42] Traffic, Operations, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (ema) p:Medium→High
[07:46:32] <_joe_> so, envoy rejects urls longer than 8192 bytes
[07:46:45] <_joe_> vgutierrez: how is it possible we accept such long urls at the edge?
[07:46:55] <_joe_> I mean, do we?
[07:47:12] hmmm I think so
[07:47:44] <_joe_> I don't think it makes much sense tbh
[07:47:55] ema fought more with long URLs than me
[07:49:04] <_joe_> vgutierrez: I think we accept the same uri length in nginx and in apache FWIW
[07:52:21] <_joe_> yep I get the same response for that url from nginx
[08:00:19] so yeah, it doesn't make any sense to let those hit the applayer IMHO
[08:00:37] how are we getting those humongous uris in the first place?
[08:03:05] the JS beacons?
[08:07:26] in any case, ats-tls doesn't impose any kind of limit, I didn't check varnish-fe yet
[08:07:31] I'm in the middle of something else
[08:31:03] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (Gilles) @ema have you checked if there is a correlation with Keep-Alive headers? Eg. does restbase reply with a Keep-Alive head...
[09:25:24] Traffic, Operations, Phabricator, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) The aphlict service has been re-enabled on phab1001. The plan is to have ATS (caching layer) talk directly...
[10:28:56] Traffic, Operations, Phabricator, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) So we have just one last remaining issue to deal with: ` Unable to open file ("/etc/ssl/private/phabrica...
[12:06:54] Traffic, Operations, Phabricator, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) The new plan is to do TLS termination in envoy rather than in nodejs itself. Hence the new patch above to...
[12:31:10] Traffic, Operations, Repository-Admins: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (ema)
[13:50:09] Traffic, Varnish, Operations, Product-Infrastructure-Team-Backlog, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Pchelolo) We're going to remove support for this from #changeprop as well as a part of k8s transition. If there are any plans to ever...
[14:04:35] Traffic, Varnish, Operations, Product-Infrastructure-Team-Backlog, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway) Let's leave it open for now. We may have some dedicated maps maintenance capacity in the next couple quarters and would wa...
[14:05:22] Traffic, Varnish, Maps, Operations, Product-Infrastructure-Team-Backlog: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway)
[14:13:08] Traffic, Cloud-Services, Operations, Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (CDanis) @MusikAnimal Yeah, you shouldn't expect to see any request data in your tcpdumps -- it'll all be TLS-encrypted. B...
[14:55:33] v-fe does have a URI length limit, we've had to raise it before
[14:57:28] modules/varnish/templates/initscripts/varnish.systemd.erb:-p http_req_size=24576 \
[14:58:03] ^ that ~24K size is the whole request (the GET /url line + all headers)
[14:58:20] there's also a per-header limit defaulting to 8k, but I don't think it applies to the URI or the request line
[14:59:19] I don't see any explicit param for URI len, although from VCL we could certainly programmatically limit it and return a 4xx.
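(Editor's sketch, not part of the log.) The idea of limiting URI length programmatically and answering with a 4xx can be illustrated roughly as below. This is plain Python rather than VCL, and everything in it is an assumption for illustration: the function name `uri_length_status` is hypothetical, the 8192-byte default simply mirrors the envoy limit quoted earlier in the log, and 414 is one plausible choice of 4xx, since the log does not say which status would be returned.

```python
from http import HTTPStatus

# Assumption: 8192 mirrors the envoy limit mentioned in the log; the limit
# an edge layer would actually enforce is not stated there.
MAX_URI_BYTES = 8192


def uri_length_status(request_target: str, limit: int = MAX_URI_BYTES) -> HTTPStatus:
    """Return 414 if the request target is longer than `limit` bytes, else 200."""
    # Count bytes rather than characters, because the limit is given in bytes.
    if len(request_target.encode("utf-8")) > limit:
        return HTTPStatus.REQUEST_URI_TOO_LONG
    return HTTPStatus.OK


if __name__ == "__main__":
    print(uri_length_status("/wiki/Main_Page"))
    print(uri_length_status("/beacon?" + "x" * 9000))
```

Whether a target of exactly the limit should pass is also an assumption here; the sketch only rejects targets strictly longer than the limit.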
[15:03:52] (we could do it even earlier in ats-tls and it's the more-correct place to do it, but, I don't think we want to introduce any explicit error returns at the ats-tls layer yet if we don't have to, because they'd be missed by analytics until that has moved up)
[15:08:49] Traffic, Operations: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (ema) I am testing a first iteration of `purged` (T249583) on cp3052. The program sends PURGEs over multiple TCP connections, and ats-be is now doing much better:...
[15:25:42] ema: let me know if i can help with monitoring for purged
[15:27:01] cdanis: hi! Sure. So far I've got a first iteration working but we have to puppetize it and add the prometheus scraping (prometheus metrics are built-in)
[15:27:33] cdanis: what we don't have is a gerrit repo yet, but https://phabricator.wikimedia.org/P10935 and https://phabricator.wikimedia.org/P10936 is the whole thing for now
[15:30:42] of course it's 100% bug free and always will be
[15:32:53] heheh
[15:33:20] oh cool, we won't need some textfile writer :D
[15:33:25] ema: yet another t-shirt idea
[15:33:28] yeah looks pretty nice
[15:34:04] I've tried running it with -concurrency 1 and sure enough backlog was growing quickly
[15:35:24] -concurrency 4 and higher solves the backlog problem, though load on other cores increases too (which means that if PURGEs get really really out of control we might DoS the whole ats-be instead of just one thread)
[15:35:41] sukhe: yes!
[15:36:47] ema: we will probably cut purge traffic in half, fwiw
[15:38:48] cdanis: just like that?
:)
[15:39:02] yes
[15:39:49] mediawiki is doing a thing called 'rebound purges' where it sends a purge request, and then enqueues a duplicate purge request on the jobqueue, as a workaround for a few different race conditions which other pieces were made to prevent, after rebounding was implemented
[15:40:06] so, we will make it not do that
[15:45:22] Traffic, Operations: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (ema)
[15:45:35] Traffic, Operations: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (ema) p:Triage→Medium
[15:46:04] cdanis: excellent!
[15:48:23] or maybe per aaron's comment just now I've misunderstood the code paths involved entirely 🙃
[15:49:08] :)
[16:02:02] yeah the timing's now tricky, because you moved so fast! :)
[16:02:30] I have most of the patchwork done for vhtcpd to do multiple conns as well, but probably can't push it until later today
[16:02:54] it may still be useful in the interim while working out any kinks in the new thing
[16:05:43] Traffic, Cloud-Services, Operations, Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (JHedden) I think this is the full conversation stream between xtools and de.wikipedia.org (after URI: http://xtools.wmflab...
[16:31:09] Traffic, Operations, Repository-Admins: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (Dzahn) Please see https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
[16:34:34] ema: given we have a rate cut coming, and you're about done, maybe I won't bother spinning my wheels trying to make a new release, it doesn't seem like a good use of effort at this juncture
[16:38:34] (also the vhtcpd code is really ugly.
it's hard not to just rewrite it all to better standards as I go, which would be a bad idea for stability anyways)
[16:41:24] Traffic, Cloud-Services, Operations, Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (MusikAnimal) >>! In T249035#6036900, @JHedden wrote: > Can you confirm that you're seeing this on both xtools-prod06 and x...
[16:58:37] bblack: the rate cut might not be coming
[17:01:11] Traffic, Analytics, Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (Krinkle) Open→Resolved a:Krinkle Confirmed via . It now...
[17:01:18] Traffic, Analytics, Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (Krinkle)
[17:18:33] cdanis: another thing we could do in the short term, is restart the vhtcpds. We'd lose the current backlog, but it might reset the clock on falling behind for a little while
[17:18:48] e.ma already restarted one in the process of testing his new daemon
[17:19:04] bblack: yeah, another thing that had crossed my mind was seeing if they manage more throughput on a depooled node
[17:19:23] I think they will, the evidence is there in the daily patterns
[17:19:30] but I don't know how long a depool it would take
[17:19:32] it might still not be enough in esams :)
[17:19:34] yeah
[17:19:57] learning about ttl_cap this morning at least made me feel a bit better about the situation
[17:20:03] there's also the error-free way to reset the clock, which is to stop vhtcpd and then wipe the ats-be storage, then start both back up
[17:20:21] but all things considered, I don't think the cache wipes are worth it
[17:20:23] is wiping ats-be storage something that can be done online?
[17:20:30] complaint rate is relatively-tame with the current backlog
[17:20:52] it would be probably even tamer if we dropped the several hours' backlog and just started processing fresher purges
[17:21:16] cdanis: not sure, but either way it's not much trouble to do it offline with a quick depool
[17:21:21] nod
[18:24:45] Wikimedia-Apache-configuration, Operations, Puppet: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (Bugreporter)
[18:28:25] Traffic, DNS, Operations: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (Bugreporter)
[19:04:43] Traffic, CommRel-Specialists-Support, Core Platform Team, Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[19:05:23] Traffic, CommRel-Specialists-Support, Core Platform Team, Editing-team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle) Roadmap alignment and any stewardship needs from CPT confirmed by Cindy.
[19:32:24] Traffic, DNS, Operations: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (Aklapper) Open→Stalled @Bugreporter: Redirect what to what? Please be clearer and do follow https://www.mediawiki.org/wiki/How_to_report_a_bug . Clicking your first link goes to incubator, for e...
[19:41:06] Traffic, DNS, Operations: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (Bugreporter)
[19:41:09] Traffic, DNS, Operations: redirect sco.wiktionary.org - https://phabricator.wikimedia.org/T249648 (Bugreporter) Stalled→Open
[21:54:48] Traffic, DNS, Operations: redirect sco.wiktionary.org/wiki/(.*?)
-> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (Reedy)
[22:53:56] Traffic, Operations, decommission, ops-codfw: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (Papaul)
[22:59:54] Traffic, Operations, decommission, ops-codfw: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Papaul)
[23:01:50] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (Papaul)
[23:02:28] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Papaul)
[23:02:57] Traffic, Operations, decommission, ops-codfw: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (Papaul)
[23:03:25] Traffic, Operations, decommission, ops-codfw: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (Papaul)
[23:43:13] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (Papaul)
[23:43:44] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (Papaul) Open→Resolved Complete
[23:43:56] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (Papaul)
[23:44:06] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (Papaul) Open→Resolved Complete
[23:44:20] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (Papaul)
[23:44:32] Traffic, Operations, decommission,
ops-codfw, Patch-For-Review: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (Papaul) Open→Resolved complete
[23:44:49] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (Papaul)
[23:45:02] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (Papaul) Open→Resolved Complete
[23:45:18] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (Papaul)
[23:45:26] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (Papaul) Complete
[23:46:16] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (Papaul)
[23:46:29] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (Papaul) Open→Resolved Complete
[23:46:45] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (Papaul)
[23:47:01] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (Papaul) Open→Resolved Complete
[23:47:36] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Papaul)
[23:47:43] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Papaul) Complete
[23:47:55] Traffic, Operations, decommission, ops-codfw, Patch-For-Review:
decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Papaul)
[23:48:04] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Papaul) Open→Resolved Complete
[23:49:00] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (Papaul)
[23:49:19] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2012.codfw.wmnet - https://phabricator.wikimedia.org/T249080 (Papaul) Open→Resolved Complete
[23:50:20] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (Papaul)
[23:50:47] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2013.codfw.wmnet - https://phabricator.wikimedia.org/T249088 (Papaul) Open→Resolved Complete
[23:51:25] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (Papaul)
[23:51:40] Traffic, Operations, decommission, ops-codfw, Patch-For-Review: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (Papaul) Open→Resolved Complete
[23:52:09] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (Papaul) Open→Resolved
[23:52:39] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (Papaul) Open→Resolved