[06:24:09] 10netops, 10Operations, 10observability: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10MoritzMuehlenhoff) p:05Triage→03Medium
[07:14:14] 10netops, 10Operations: IRR updates needed - https://phabricator.wikimedia.org/T235886 (10ayounsi) 05Open→03Resolved I'd rather avoid creating and managing inetnums at a per team or per route object granularity, but it's not possible to create inetnums for the whole allocation (whole /22 for example), only...
[07:14:16] 10Traffic, 10netops, 10Operations: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10ayounsi)
[07:14:19] 10netops, 10Operations: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi)
[08:36:05] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Page Content Service, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10ema)
[10:08:22] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Gilles) @ema @Vgutierrez last time we talked about this a month ago, you were in the process of rolling out A...
[11:15:47] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10Joe) I can comment on the PHP part, I'm not 100% sure about the caching layer cache times given the changes happened in...
[12:39:43] 10Traffic, 10Operations, 10Wikimedia-Logstash, 10observability: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960 (10fgiunchedi) 05Open→03Declined Tentatively resolving since things are working as intended
[12:48:25] 10netops, 10Analytics, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10MoritzMuehlenhoff) p:05Triage→03Medium
[12:53:47] 10netops, 10Operations, 10Wikimedia-Incident: Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:41:32] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10ema) When it comes to caching 50x responses: I can confirm with confidence that at the ats-be level we [[https://gerrit....
[14:09:49] Krinkle: I had no idea we were multiplying purges by 2x. That seems very relevant to both the backlog situation, and to the discussion _joe_ started around other infrastructure load concerns. as for the evolution of vhtcpd, I believe bblack added the multi-tier functionality back in 2017, although I'm not sure why -- certainly it's useful for having multiple layers of cache on one machine that
[14:09:51] can cache from each other, as we do with varnishfe/ats-be
[14:10:30] <_joe_> cdanis: that's because usually
[14:10:54] <_joe_> url /foo will be potentially on all frontends, but only on one backend
[14:11:07] <_joe_> so I don't think the two purges are race-condition free
[14:11:09] <_joe_> at all
[14:11:45] is that still true in the ATS world?
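A minimal Python sketch of the layout _joe_ describes above: frontends hash the request URL to pick a single backend, so a given URL may sit in every frontend cache but only in one backend per site. Host names and the hashing scheme are illustrative, not the production configuration.

```python
import hashlib

# Hypothetical host names for illustration only.
FRONTENDS = ["cp-fe-1", "cp-fe-2", "cp-fe-3"]
BACKENDS = ["cp-be-1", "cp-be-2", "cp-be-3"]

def backend_for(url: str) -> str:
    """Pick the single backend responsible for a URL (simplified URL hashing)."""
    digest = int(hashlib.sha1(url.encode("utf-8")).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]

url = "/wiki/Foo"
# Any frontend may have served and cached this URL...
print("may be cached on any of:", FRONTENDS)
# ...but only one backend per site ever stores it, which is why purging
# frontends and backends in the wrong order can leave a stale copy behind.
print("stored on exactly one backend:", backend_for(url))
```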
[14:11:46] it doesn't really matter on how many frontends the object can be, what matters is that backend purges must happen before frontend purges because frontends fetch from backends
[14:12:33] hi :)
[14:12:43] so if you first purge a frontend and then a backend, and in between the two purges you get a new frontend request, you end up with a stale object at the frontend layer
[14:12:46] so yeah, the serial queues have been there a while. it purges the backend first, then the frontend.
[14:13:28] (and there's always a 1 second delay after the backend purge before the frontend purge, unless that time expires while the frontend is working through a backlog)
[14:13:40] it doesn't guarantee against races, but it makes them far less likely
[14:14:26] (the 1 second delay is based on the timestamp when the request entered the frontend queue, and is checked at dequeue time for >=1s ago.
[14:14:29] )
[14:15:11] the reason for the 1s delay, and the reason it doesn't completely remove races, is that all this serialization and queueing is happening in the context of one node
[14:15:28] and to prevent all races, you would need to ensure the backends of all nodes (at that site) had purged the item before any of the frontends did
[14:15:35] cdanis: still true with ATS, yes, but much less annoying than with varnish because now we only have fe<->be instead of up to fe<->be<->be<->be
[14:15:42] this doesn't guarantee that, but it makes it much more likely
[14:16:22] also: re the rebound purges (the duplicates some seconds later), at one point in time it was triplicated for whatever reason, all total.
[14:16:45] perhaps one is still sent instantly, then one by the jobqueue, then a "second" (really third) by the jobqueue?
[14:18:20] (or maybe all 3 are jobq based, or maybe one of them has been removed since I last looked. But I know at one point in the past, 3x copies of each purge was common, if you watched from the receiving side on the caches)
[14:20:09] ema: ok, sorry, I hadn't been sure if the varnishfe->atsbe hop still crossed machines within a site or not
[14:22:32] _joe_: bblack: so backends always purge 1s before frontends. that means there's no race conditions, right?
[14:22:53] Krinkle: it isn't synchronized across hosts
[14:22:57] <_joe_> Krinkle: I assume there is
[14:23:30] I don't understand "you would need to ensure the backends of all nodes (at that site) had purged the item before any of the frontends did"
[14:23:34] <_joe_> because it's multiple hosts
[14:23:47] FWIW I'm not sure if in current conditions it does make it much less likely; the stddev of queue backlogs amongst esams machines is approx 40 million
[14:24:10] <_joe_> yeah that just means we're mostly not purging what needs to be purged
[14:24:46] Oh I see. so the purges are broadcast to each cp host uncoordinated, and then on that host it will apply to the backend first, and only once it is popped off and fully processed on the local backend does it start entering the queue on the frontend.
[14:24:52] yes
[14:25:06] I figured they'd be routed to the relevant backend (which is only 1 not all, per site) and then from there go to the frontends.
[14:25:48] <_joe_> there is no guarantee that every frontend will purge before the backend is, as we do url-based hashing to the backends
[14:26:00] <_joe_> so every url should be cached only on one backend per dc
[14:26:18] yeah but what matters is that the backend purges first, not the frontend first.
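A rough Python sketch of the chained serial queues bblack describes above (vhtcpd itself is a C daemon; this is not its code): the purge is applied to the local backend first, then enqueued for the frontend with its arrival timestamp, and the frontend worker only dequeues entries that have sat in its queue for at least one second.

```python
import collections
import time

backend_queue = collections.deque()    # purges waiting for the local ats-be
frontend_queue = collections.deque()   # (enqueue_time, url) waiting for varnish-fe

def purge_backend(url):
    print(f"purge backend  {url}")     # stand-in for the real purge request

def purge_frontend(url):
    print(f"purge frontend {url}")

def run_backend_worker():
    """Purge the local backend, then hand the URL to the frontend queue."""
    while backend_queue:
        url = backend_queue.popleft()
        purge_backend(url)
        # The URL only enters the frontend queue after the backend purge,
        # stamped with the time it was enqueued there.
        frontend_queue.append((time.monotonic(), url))

def run_frontend_worker():
    """Purge the frontend, but only entries that have waited >= 1s in the queue."""
    while frontend_queue:
        enqueued_at, url = frontend_queue[0]
        if time.monotonic() - enqueued_at < 1.0:
            break                       # the delay is checked at dequeue time
        frontend_queue.popleft()
        purge_frontend(url)

backend_queue.append("/wiki/Foo")
run_backend_worker()
time.sleep(1.0)                         # the real daemon's workers run continuously
run_frontend_worker()
```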
[14:26:31] <_joe_> so if server A gets the purge before server B, but server B has the url in storage, server A will get a stale frontend
[14:27:17] but yeah, if the backends don't coordinate and also aren't fanned out in the way I thought, then yeah, we don't have guarantees right now.
[14:27:23] <_joe_> because A's frontend and (irrelevant) backend both get purged, but then A will re-fetch from B's backend, which is still not purged
[14:27:35] Krinkle: it's a much harder distributed systems problem to make it work that way
[14:28:13] right, but unlike before 2017, we do now have coordination within a local host. the backend port of any frontend+backend combo is now always purged before the frontend, they have separate queues which are chained to each other, if I understand bblack correctly.
[14:28:24] <_joe_> Krinkle: not only
[14:28:31] <_joe_> we don't have multiple-cd backends
[14:29:58] what I mean is that for cp1234 we know for sure that ats-be is purged before varnish-fe. That is hard-coded, known for sure I believe. But it doesn't help us in the big picture because of course the point of backends is that they are sharded by hash, so yeah, we don't know for sure all backends got it before any frontend.
[14:31:42] cdanis: the naive approach imho would be to have some kind of shared queue (e.g. kafka) that the responsible backend puts the purge in after it has processed it, and all frontend-purgers listen to that. The tricky part would be knowing which backend to deliver it to, or for the backend-purger to know which are relevant to it and which it should not act on/propagate.
[14:33:20] I don't know how dynamic or complex our hashing algo is or whether it's feasible to reproduce in the purger process. Or for the purger to "just" talk to it and ask it: does this hash to you. Or for the purger to not be a separate process but something that e.g. VCL or equivalent would interpret as: if PURGE, hash=me, then do it, and also issue a push to the frontend queue.
[14:33:51] cdanis: https://www.mediawiki.org/wiki/Core_Platform_Team/Initiatives/Dependency_Tracking
[14:34:03] if you introduce kafka to the mix you get to do much smarter things, for sure
[14:35:30] hm not a lot of info there but DK (and others) have a lot of ideas
[14:35:56] iiuc: a graph db of dependencies that is updated via stream-processed events and traversed in response to events as well
[14:36:02] to find what needs purged/re-rendered
[14:37:05] sure, makes sense as part of some future world :)
[14:37:41] hah ya
[14:37:50] it has been part of the future world for 2 years now
[14:38:10] ottomata: as I understand it, within the scope of that project, the output toward Varnish/ATS would be similar to today. Something in MW determines a page or title needs to update, and issues a purge at some point. After that, the problem of making that purge actually work still applies.
[14:39:57] i have very little context on the details of that project, but given it, couldn't you model both the backend and frontend as independent purges?
[14:40:06] sorry, dependent purges i mean
[14:40:17] in the graph?
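A hedged sketch of the shared-queue idea Krinkle floats at 14:31 above, using the kafka-python client; the broker address and topic name are made up for illustration. The backend responsible for a URL publishes an event only after it has processed the purge, and every frontend purger consumes that topic, so no frontend can purge ahead of the owning backend.

```python
import socket

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["kafka.example.internal:9092"]     # hypothetical broker address
TOPIC = "cdn-purge-backend-confirmed"         # hypothetical topic name

def confirm_backend_purge(url: str) -> None:
    """Run by the backend purger after the owning backend has purged the URL."""
    producer = KafkaProducer(bootstrap_servers=BROKERS)
    producer.send(TOPIC, value=url.encode("utf-8"))
    producer.flush()

def purge_local_frontend(url: str) -> None:
    print(f"purge frontend {url}")            # stand-in for the real purge call

def frontend_purger_loop() -> None:
    """Runs on every cache host: purge the local frontend only once the
    responsible backend has confirmed its own purge."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        # one consumer group per host so every frontend sees every event
        group_id=f"frontend-purger-{socket.gethostname()}",
        auto_offset_reset="latest",
    )
    for message in consumer:
        purge_local_frontend(message.value.decode("utf-8"))
```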
[14:40:22] seems like you could; it's a separate question as to whether or not you should ;)
[14:40:27] oh ha ok
[14:40:28] (I can imagine arguments both ways)
[14:41:14] anyway if no one else minds I'd like to refocus discussion on the immediate issue -- purges are very backlogged across the fleet; while esams is worst off (and did not catch up on its purges at daily nadir), all clusters are affected
[14:42:16] https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&from=now-24h&to=now&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&var-worker=Purger0
[14:48:46] ah, not at all immediate but found more info: https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/DependencyEngine https://phabricator.wikimedia.org/T105766
[14:49:08] I guess a few more specific questions
[14:49:20] 1) How concerned should we be that purging essentially isn't working in esams?
[14:50:20] 2) What stuff can we try to work through the backlog faster (I think I heard mention of having vhtcpd send purges over several parallel connections?) and who's working on them?
[14:50:37] 3) Is it feasible to reduce the purge rate coming from the application layer itself?
[14:51:44] Do we know why it increased and/or why servers are having trouble keeping up now?
[14:52:10] Do we have a sample of some purges to see what they are or which services they are coming from (app/job/api)?
[14:52:37] e.g. it would be good to maybe capture the purges mwdebug emits when editing a page, just to sanity check that there isn't some kind of obvious problem
[14:52:50] I imagine it's possible with tcpdump or such but that's magic to me.
[14:53:54] also, side note, I'm consistently seeing that the first wikipedia url I open (logged-in) after an hour of connectivity has a latency of 10 seconds or more. I confirmed that the hits in question spent <300ms in MW backend time. So something else is up, could be my ISP. But just noting it here (I am going through esams)
[14:54:10] after an hour of no connectivity*
[14:54:23] (fiber, desktop, wifi, London)
[14:59:35] catching up a little bit on the above:
[15:00:13] kafka (or similar) purging that can control the race condition has been talked about a lot in the past, it's been on our radar forever, it just hasn't risen above other projects. It would definitely be worthwhile, in general.
[15:00:54] there are ways, within the current basic context, that we could make the layered multicast purging more reliable than it is, but none are foolproof without some kind of centralized coordination.
[15:01:28] the current solution actually does mostly solve the problem in practice, but iff the queue backlogs stay small.
[15:01:59] when the ats backends are falling behind constantly, then the variance in where they are in their respective queues is large, and the 1s delay won't cover it.
[15:02:20] but if you can keep the queues small (which there are many ways to attack that), it's not awful.
[15:02:49] you could reduce the raw rate of purges, and we can do the multi-connection purging mentioned earlier with a change to vhtcpd that doesn't take too much effort, to cope better with it.
[15:03:37] one of the other ways we can make it better-er would be to somehow use the frontends' chashing to direct purges to backends, but it's kind of tricky to make that work sanely
[15:04:05] it would need to at least hit the second choice from the chash as well, and if all frontends do it, the backend purge rate will escalate pretty dramatically, too.
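A sketch of the "direct purges via the frontends' chashing" idea bblack mentions at 15:03, including the second choice on the ring. The ring construction and host names here are illustrative, not the production consistent-hashing setup.

```python
import bisect
import hashlib

BACKENDS = ["cp-be-1", "cp-be-2", "cp-be-3", "cp-be-4"]   # hypothetical hosts

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

# A toy hash ring with a few virtual nodes per backend.
_ring = sorted((_hash(f"{host}#{i}"), host) for host in BACKENDS for i in range(8))
_points = [point for point, _ in _ring]

def purge_targets(url: str):
    """Return the primary and second-choice backends for a URL, so a purge can
    be sent to both instead of being broadcast to every backend."""
    start = bisect.bisect(_points, _hash(url)) % len(_ring)
    targets = []
    for _, host in _ring[start:] + _ring[:start]:
        if host not in targets:
            targets.append(host)
        if len(targets) == 2:
            break
    return targets

print(purge_targets("/wiki/Foo"))   # e.g. ['cp-be-2', 'cp-be-4']
```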
[15:04:48] in the meantime: the 20s rebound purge isn't helping any more than our serialization + 1s is, in the face of these large backlogs in esams.
[15:05:09] and removing it might cut the rate by either 50% or 33%, which might make a difference in the queues catching up and staying small, too
[15:05:22] and ditto the whole thing about pursuing not purging every affected page from template edits
[15:16:50] 10HTTPS, 10Security-Team, 10Wikimedia-General-or-Unknown: Check all wikis for inclusions of http resources on https - https://phabricator.wikimedia.org/T36670 (10Reedy) 05Open→03Resolved >>! In T36670#6029734, @Krinkle wrote: > This is probably obsolete nowadays. But will let Security Team triage and kee...
[15:31:15] bblack: I was about to file a task about exploring the kafka idea, but I see T133821 already covers it pretty much.
[15:31:15] T133821: Content purges are unreliable - https://phabricator.wikimedia.org/T133821
[15:31:36] bblack: I'm updating that task now a bit. Do I understand correctly that problem 1 listed there is that the purges cause internal network congestion? Or is point 1 just explaining how it works?
[15:33:45] bblack: +1 to disabling rebound purges then. Seems fine indeed. vhtcpd was deployed a year or so after the rebound purges were introduced.
[15:33:59] I don't think we would've done that today knowing that vhtcpd is in place
[15:37:41] hah, and now I see this comment in mediawiki-config/wmf-config/reverse-proxy.php
[15:37:44] # Secondary purges to deal with DB replication lag
[15:37:46] $wgCdnReboundPurgeDelay = 11; // should be safely more than $wgLBFactoryConf 'max lag'
[15:38:49] right, that was the other problem
[15:39:03] I think we have another strategy for that already nowadays but worth confirming
[15:39:14] when MW knows it used a lagged replica, it is supposed to shorten the cache control
[15:39:18] but I don't think we ever tested that
[15:39:50] good catch!
[15:40:01] I believe that is `$wgCdnMaxageSubstitute`
[15:40:10] * Cache timeout for the CDN when a response is known to be wrong or incomplete (due to load)
[15:41:17] wgCdnMaxageLagged
[15:41:17] although the only place I see that referenced in mediawiki-core is in DefaultSettings.php
[15:41:30] That is the s-maxage we emit if the db is known to be lagged
[15:41:36] same for that variable
[15:41:37] the lag value is derived from the LBFactory conf
[15:42:01] wgCdnMaxageSubstitute is something else but also related
[15:42:11] That's when the response is completely broken, not just lagged
[15:42:17] e.g. missing l10n cache or corrupt
[15:42:18] oh, of course, the variables aren't referenced directly
[15:42:27] includes/MediaWiki.php: $maxAge = $config->get( 'CdnMaxageLagged' );
[15:42:28] omit wg when grepping
[15:42:29] yeah
[16:13:42] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10BBlack) I worry that removing a no-cache default might have all kinds of unintended consequences. We should probably do...
[16:14:15] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) >>! In T241145#5884800, @ema wrote: > The function `release_server_session` calls `do_io_close` if the following condition...
[16:18:57] Krinkle: bblack: _joe_: have we also thought about lowering the MediaWiki-emitted default maxage?
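To illustrate the lagged-replica behaviour discussed at 15:39–15:42 above: when MediaWiki knows it served from a lagged replica, it is supposed to cap the CDN s-maxage at CdnMaxageLagged. A minimal Python sketch of that logic (the real code is PHP in includes/MediaWiki.php; the numbers here are example values, not the production settings):

```python
NORMAL_CDN_MAXAGE = 86400      # example long-lived s-maxage (one day)
CDN_MAXAGE_LAGGED = 30         # example cap, analogous to $wgCdnMaxageLagged

def cdn_smaxage(used_lagged_replica: bool) -> int:
    """Return the s-maxage to emit, capped when a lagged replica was used."""
    if used_lagged_replica:
        return min(NORMAL_CDN_MAXAGE, CDN_MAXAGE_LAGGED)
    return NORMAL_CDN_MAXAGE

print(cdn_smaxage(False))   # 86400 -> normal caching
print(cdn_smaxage(True))    # 30    -> a stale copy ages out quickly even if a purge is lost
```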
[16:19:53] <_joe_> cdanis: yes, see the doc :)
[16:20:00] <_joe_> that was one of the ideas
[16:20:14] <_joe_> lower the maxage to ~ 1 day, disable purges on referenced pages
[16:23:24] bblack: btw, I think that disabling rebounds will decrease load by half, given https://gerrit.wikimedia.org/g/mediawiki/core/+/f98a9e576a5c11c357899f5c7dd3bac22c39a8ca/includes/deferred/CdnCacheUpdate.php#69
[17:34:57] 10Traffic, 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Krinkle)
[17:35:15] cdanis: bblack: I've updated the task description of https://phabricator.wikimedia.org/T133821 and included notes about the rebound purging and purge chaining
[17:35:29] thanks, will take a look after this meeting
[17:35:48] the old description I hurriedly copy-pasted this weekend from one of Brandon's updates :)
[17:36:12] cdanis: btw thanks for teaching me the word "nadir".
[17:36:19] * Krinkle adds it to his list as word of today.
[17:36:39] I'm used to using on-peak/off-peak, but I picked up zenith/nadir from Xi.oNoX and quite like it :)
[17:37:45] underrated word!
[18:15:29] apoapse / periapse
[19:08:18] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10Krinkle) MediaWiki's default is already to not cache by default. Any response it produces through normal means sets Cach...
[19:18:46] 10Traffic, 10Cloud-Services, 10Operations: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) The timeouts are still happening consistently, and my bots are still complaining, too 🙁 ([[ https://en.wikipedia.org/w/index.php?titl...
[19:41:47] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Page Content Service, and 4 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10Krinkle)
[19:41:50] 10Traffic, 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Krinkle)
[19:43:09] 10Traffic, 10Cloud-Services, 10Operations, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10Krinkle)
[20:50:54] 10Traffic, 10Operations, 10good first task: Only retry failed requests for external traffic on cache frontends - https://phabricator.wikimedia.org/T249317 (10srishakatux) @ema Hello! As this task is tagged as a #good_first_task, I'm wondering if it can be made clear where exactly the code needs to be changed...
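On Krinkle's point at 16:23 above that disabling rebounds would roughly halve purge load: with rebound purges enabled, each changed URL is purged once immediately and once more after the configured delay, so the CDN sees two purge events per URL. A small Python illustration of that arithmetic (the delay mirrors the $wgCdnReboundPurgeDelay = 11 quoted earlier; this is not MediaWiki's actual code):

```python
REBOUND_DELAY = 11   # seconds, mirroring the quoted $wgCdnReboundPurgeDelay

def purge_events(urls, rebound_enabled=True):
    """Return (offset_seconds, url) purge events for a set of changed URLs."""
    events = [(0, url) for url in urls]                         # immediate purges
    if rebound_enabled:
        events += [(REBOUND_DELAY, url) for url in urls]        # delayed rebound purges
    return events

changed = ["/wiki/Foo", "/wiki/Bar"]
print(len(purge_events(changed, rebound_enabled=True)))    # 4 purge events
print(len(purge_events(changed, rebound_enabled=False)))   # 2 purge events, ~50% fewer
```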