[14:29:46] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Page Content Service, and 3 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10CDanis)
[16:25:35] 10Traffic, 10Operations, 10observability, 10Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10CDanis) Updated version parsing your new output above: {P10893}
[16:26:47] 10Traffic, 10Operations, 10observability, 10Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10CDanis) And parsing current output from `cp3052`: {P10895}
[16:36:33] 10Domains, 10Traffic, 10Operations, 10WMF-Legal: wikipedia.lol - https://phabricator.wikimedia.org/T88861 (10Dzahn) Ok, fine with me. Thanks!
[16:43:54] the purger backlogs aren't just esams :(
[16:46:57] 10Traffic, 10Operations, 10observability, 10Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (10CDanis) 05Open→03Resolved a:03CDanis https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&var-datasour...
[16:52:30] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Page Content Service, and 4 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10CDanis) It isn't just esams that often has a backlog: looking at the past 10-20 minutes of monitoring data now...
[17:17:43] 10netops, 10Operations: review fastnetmon thresholds after sensible flow table sizes rollout - https://phabricator.wikimedia.org/T249454 (10CDanis)
[18:34:49] 10Traffic, 10MediaWiki-Cache, 10Operations, 10Page Content Service, and 3 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10bearND) Yes, I think it's more than just esams since the merged in task I created (T249290) was about an experi...
[20:23:53] cdanis: what's Purger 1 vs Purger 0? It seems all "Purger 0"s have the backlog, and all "Purger 1"s have a fairly similar and constant low rate of X K/s.
[20:28:10] Those purge rates are from the frontend varnish, but the frontend varnish PURGE queue doesn't receive entries until they've traversed the backend one, which is backlogged, so that's why we still see the effect there.
[20:28:43] it purges from the fe varnish first, and then the same purge gets enqueued in the ats queue
[20:29:23] anyway, i'm not working any more today unless ubn :)
[20:30:55] er, reverse what I said, but anyway
[21:03:56] so you will continue working except if it's a blocker, in which case you will stop doing any work? :)
[21:25:19] c.danis: okay, that's actually pretty cool. I recall we had bugs in the past about the race conditions between back and front, especially with ve-be in pops talking to ve-be in eqiad etc. I wonder how long this has existed? Is it new? Is it something we made, or something we got as part of ATS? Does this mean we can get rid of MW's "rebound" purge? We currently issue every purge twice: once directly, and a second time after a short
[21:25:19] delay (I think about 20 seconds) because of this.
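
Editor's note on T249346: prometheus-vhtcpd-stats.py turns the purge daemon's stats dump into Prometheus metrics, and the ticket above was about that parser falling out of date with the daemon's actual output. Below is a minimal sketch of the general node-exporter textfile-collector pattern only; the file paths, field names, and 'key:value' input format are illustrative assumptions, not vhtcpd's real stats format or the script's real code.

```python
#!/usr/bin/env python3
"""Sketch of a textfile collector: parse a daemon's stats dump and
write Prometheus exposition format for node-exporter to scrape.

All paths and field names here are hypothetical."""

STATS_PATH = "/tmp/purger_stats.txt"                    # hypothetical stats dump
PROM_PATH = "/var/lib/prometheus/node.d/purger.prom"    # hypothetical textfile output


def parse_stats(text):
    """Turn e.g. 'inpkts_recvd:123 queue_size:45' into a dict of ints."""
    metrics = {}
    for token in text.split():
        key, _, value = token.partition(":")
        if value.isdigit():
            metrics[key] = int(value)
    return metrics


def render_prometheus(metrics, prefix="purger_"):
    """Render each counter as a gauge in Prometheus exposition format."""
    lines = []
    for key, value in sorted(metrics.items()):
        lines.append(f"# TYPE {prefix}{key} gauge")
        lines.append(f"{prefix}{key} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open(STATS_PATH) as f:
        stats = parse_stats(f.read())
    with open(PROM_PATH, "w") as f:
        f.write(render_prometheus(stats))
```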
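Editor's note on the queue ordering discussed at 20:28-20:30: per the corrected explanation, a purge first drains through the backend (ATS) queue and only then lands in the frontend varnish PURGE queue, so a backend backlog also throttles the purge rate seen at the frontend. A toy simulation of that chained-queue behaviour, with made-up rates (not measured values):

```python
import collections

# Toy model of the chained purge path: purges must drain through the
# backend queue before being enqueued for the frontend varnish.

def simulate(seconds, incoming_rate, backend_rate, frontend_rate):
    backend_q = collections.deque()
    frontend_q = collections.deque()
    for t in range(seconds):
        backend_q.extend(range(incoming_rate))           # new purges arrive
        for _ in range(min(backend_rate, len(backend_q))):
            frontend_q.append(backend_q.popleft())       # drained purges reach the fe queue
        for _ in range(min(frontend_rate, len(frontend_q))):
            frontend_q.popleft()                         # fe varnish executes the purge
        print(f"t={t:3d}s backend_backlog={len(backend_q):6d} "
              f"frontend_backlog={len(frontend_q):6d}")


if __name__ == "__main__":
    # Backend slower than the incoming rate: the backlog accumulates there,
    # and the frontend purge rate is capped by what the backend lets through.
    simulate(seconds=10, incoming_rate=1000, backend_rate=600, frontend_rate=2000)
```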
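Editor's note on the "rebound" purge mentioned at 21:25: the idea is to send every CDN purge twice, once immediately and once again after a short delay, so a purge that races a concurrent cache fill from a stale origin still wins. A minimal sketch of the pattern only; the function names and the 20-second delay are assumptions, not MediaWiki's actual implementation:

```python
import threading

REBOUND_DELAY_S = 20  # assumed delay; the real MediaWiki setting may differ


def send_purge(urls):
    """Placeholder for the real PURGE transport (HTTP/multicast)."""
    for url in urls:
        print(f"PURGE {url}")


def purge_with_rebound(urls, delay_s=REBOUND_DELAY_S):
    """Purge immediately, then schedule a second 'rebound' purge.

    The delayed purge catches objects re-populated by a request that
    was in flight when the first purge landed."""
    send_purge(urls)
    timer = threading.Timer(delay_s, send_purge, args=(urls,))
    timer.daemon = True
    timer.start()
    return timer


if __name__ == "__main__":
    purge_with_rebound(["https://en.wikipedia.org/wiki/Example"])
```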