[00:34:52] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team, 10observability: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (10Krinkle) 05Open→03Resolved Not anymore. Thank you!
[03:25:09] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10leila) @Dzahn Sorry to ping you personally. I'm hoping you can point me to the righ...
[08:26:08] <wikibugs>	 10netops, 10Operations, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10fgiunchedi) >>! In T267018#6689478, @CDanis wrote: > Does this mean we can deprecate the [[ https://ge...
[09:40:14] <wikibugs>	 10netops, 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) @Dwisehaupt I have created a  [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/649592 | patch  ]] to add a new per interface `neighbours` fact....
[14:21:19] <wikibugs>	 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10holger.knust)
[15:35:46] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) Happened to see this go by -- I've dropped a single comment on the review...
[15:35:48] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) a:03RLazarus
[16:09:26] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10Dzahn) @leila It's about to get deployed. We apologize for this taking so long. Yes...
[16:24:29] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) 05Open→03Resolved Deployed and tested:  ` rzl@cumin1001:~$ httpbb /srv/deployment/httpbb-te...
[16:46:26] <rzl>	 hi traffic 👋 after merging https://gerrit.wikimedia.org/r/619130 to change a redirect target, I'd like to purge the old redirect from the cache -- can someone check me before I run this, please? :)
[16:46:28] <rzl>	 sudo cumin A:cp-text -b 1 "varnishadm -n frontend ban 'req.http.host == \"wikimedia.org\" && req.url == \"/research\"'"
[16:47:37] <bblack>	 my brain always has a hard time with quoting via cumin
[16:47:48] <bblack>	 but other than that part, looks ok?
[16:48:47] <rzl>	 same, so I slapped an "echo" in front of varnishadm and the right thing happened
[16:48:51] <rzl>	 if the rest lgty I'll go ahead
[16:49:34] <bblack>	 yeah looks good
[16:49:39] <rzl>	 thanks!
[16:49:40] <bblack>	 I even tried the syntax on one host
[16:50:05] <chaomodus>	 it'd be neat if cumin could take a shell script that it'd execute remotely to save a bit of chore
[16:52:12] <rzl>	 (for posterity, it was -b 1 A:cp-text, not A:cp-text -b 1)
[16:52:32] <bblack>	 oh yeah :)
[16:52:44] <rzl>	 oh whoops, now I'm getting a frontend cache miss but I guess I needed to do it without -n frontend first, then with?
[16:52:55] <bblack>	 well
[16:53:06] <bblack>	 without -n frontend is how you target the varnish backend caches
[16:53:11] <bblack>	 but we don't have those anymore, so that doesn't work :)
[16:53:21] <bblack>	 we actually don't have a great banning solution for backends
[16:53:43] <rzl>	 right okay -- I was doing my best to figure out how much wikitech to believe :P
[16:53:49] <bblack>	 there is a way, but it's a temporary chunk of lua code
[16:54:05] <rzl>	 ahhh that's right, I've done that with y'all's help befor3
[16:54:07] <rzl>	 *before
[16:54:36] <rzl>	 have you got an easy way of checking on what the TTL is? might or might not be worth the effort if it'll just expire naturally
[16:54:55] <bblack>	 could check the applayer response
[16:55:37] <bblack>	 oh it already changed, right
[16:56:08] <rzl>	 just the URL changed, the rest should be identical
[16:56:10] <rzl>	 Cache-Control: max-age=2592000
[16:56:12] <bblack>	 anyways, the new redirect says a week, which the be would cap down to 24h I believe
[16:56:32] <rzl>	 huh, I get 30 days, but if it's capped to 24h that's fine
[16:56:34] <bblack>	 err not a week, 30d
[16:56:37] <rzl>	 👍
[16:56:37] <bblack>	 yeah
[16:56:51] <rzl>	 cool -- thanks!
[16:57:09] <bblack>	 anyways, the fe can also refresh it past the edge of the be lifetime
[16:57:25] <bblack>	 you can get it down to ~1d if you wait 1d and then ban the frontends
[16:57:28] <bblack>	 or do the lua hack
[16:57:34] <bblack>	 (and then ban the frontends)
[16:57:51] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10Aklapper) a:05RLazarus→03Aklapper Thanks everyone! :)
[16:58:59] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) Quick correction -- this is now live on all appservers, but the old URL is still cached by the...
[16:59:31] <rzl>	 nod, I'll just FE ban tomorrow -- appreciate the walkthrough :)
[17:08:41] <rzl>	 cleaning up wikitech a little -- is this still accurate? '''Note''': if all you need is purging objects based on their URLs, see how to perform [https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge One-off purges].
[17:09:06] <rzl>	 architecturally definitely not, I guess the question is does the one-liner still work with purged
[17:17:31] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10leila) Thank you all! I really appreciate your work on this.
[17:25:06] <bblack>	 rzl: so, "varnishadm ban" is varnish-specific and not related to the usual PURGE mechanisms or HTCP or purged (and the lua hacks for ats-be are similarly ATS-specific and don't use the generic purge mechanisms)
[17:25:42] <bblack>	 mwscript.php used to drive multicast purging which would have in turn come back through purged to both varnish-fe + ats-be
[17:26:11] <bblack>	 I *assume* it has been upgraded to work without multicast now and results in sending a kafka purge to both, as deployers use that tool
[17:26:42] <bblack>	 and yeah, I guess since this is a simple URL match and not some complicated regex or whole-host case, mwscript might work for what you need, without waiting
[17:26:52] <bblack>	 (sorry I didn't think of that earlier!)
[17:27:15] <rzl>	 cool, I'll give that a try
[17:29:25] <rzl>	 hey, that did the trick :D
[17:29:33] <cdanis>	 bblack: rzl: yeah, I was about to say, Varnish bans are the wrong tool to use here for two reasons
[17:29:49] <cdanis>	 and yes, that script just invokes the same MW code that all purges use, which nowadays posts to kafka
[17:30:00] <rzl>	 love to see it
[17:32:07] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) >>! In T259979#6692703, @RLazarus wrote: > Quick correction -- this is now live on all appserve...
[17:34:16] <cdanis>	 btw rzl if you check my .zsh_history on cumin1001 there are a sample of cumin+curl invocations to check all cps for things :)
[17:34:56] <rzl>	 that's true, but also, if I check your .zsh_history on cumin1001 there's a very real chance I won't come back alive
[17:35:08] <rzl>	 or if I do, I'll be a changed man, forever harrowed by the experience
[17:35:49] <cdanis>	 there's only a few unspeakable things in there
[17:36:13] <bblack>	 feature request: implement regex purges/bans efficiently via standard PURGE for varnish+ats, so we can use mwscript for all cases
[17:37:12] <rzl>	 hey, all you really have to do is write a little PHP to expand the regex to all possible individual URLs, then it's reduced to a solved problem
[17:37:17] <cdanis>	 🤔
[17:37:21] <bblack>	 lol
[17:37:21] <rzl>	 am I doing Fun Day right
[17:37:25] <cdanis>	 would that need work in the varnish/ats layers bblack?
[17:37:35] <bblack>	 in the code of each daemon, yes
[17:37:40] <cdanis>	 right
[17:37:55] <bblack>	 you'd have to translate the varnish regex PURGE into a ban with the right params basically
[17:38:10] <bblack>	 and ATS would also have to do a ban-like thing, probably using its built-in generation counter
[17:38:45] <bblack>	 in both cases the problem is basically the same: you can't have PURGE synchronously scan a whole cache storage to look for matches, it could take forever.
[17:39:06] <bblack>	 and you can't just return async or you have no timing gaurantee on when the operation is effectively complete from the public POV
[17:39:08] <cdanis>	 yup, instead you have to check all new requests, with some sort of ttl
[17:40:04] <bblack>	 for varnish when you ban on request attributes, yes, it checks all new requests and ensures any object matching them as a hit, is newer than the timestamp the ban went into effect
[17:40:49] <bblack>	 if the conditional happens to be on a response attribute, it also does a slow background scan of storage, so it can remove the ban structure after all relevant objects are gone
[17:41:16] <bblack>	 I think for request attributes, the whole cache has to get newer than the ban before it goes away, but I'm not sure if I'm remembering that right
[17:41:36] <bblack>	 (because many request attributes aren't stored in cache)
[17:42:32] <bblack>	 so even if you implement efficiently like that, it's not a scalable solution for automated use from mediawiki for content updates or whatever
[17:42:51] <bblack>	 but you could probably hook up the bits and peices to make manually-authorized one-off cases work via kafka->purged
[17:45:08] <bblack>	 understanding varnish's request -vs- response types of bans can be important in some corner cases, since they're so different
[17:45:52] <cdanis>	 well, max_ttl helps us there, at least
[17:46:01] <bblack>	 it's better to think of a request ban as "ensure that all incoming new client requests with request attributes match foo do not hit any objects that are older than X"
[17:46:24] <bblack>	 whereas a response ban is more like what you'd normally think of as a true purge, in practice
[17:46:56] <bblack>	 I don't think max_ttl actually works for this case, because of keep/grace and conditional refreshes and longer applayer TTLs, etc
[17:47:09] <cdanis>	 oh, hm
[17:47:36] <bblack>	 but if it tracks the timestamp of the oldest object in cache, it can go by that.  that's what I seem to vaguely recall that it does.
[17:47:54] <bblack>	 but don't quote me on that
[17:48:23] <bblack>	 the max_ttl corner case affects other things too, it's another one of those long-standing but difficult-to-address issues
[17:49:40] <bblack>	 we have real cases, where e.g. MW returns an object with a max-age=30d, and we cap the TTL to 24h, but when the 24h expires we try a conditional refresh (based on Last-Modified or ETag), and MW claims with a 304 that the object is unchanged, and so we keep the copy we have and just bump it for a fresh 24h.
[17:49:53] <bblack>	 but the underlying actual content has in fact changed (e.g. due to an edit)
[17:50:37] <bblack>	 I assert that MW is lying in these cases and shouldn't be, but I recall the MW folks making some pragmatic arguments about why they had to handle these cases this way or something
[17:50:59] <bblack>	 it's all recorded in the bowels of some ancient ticket somewhere I'm sure
[17:51:48] <bblack>	 which is why the "nothing is older than 24h in the caches" assertion always has a strange asterisk on it
[17:52:32] <bblack>	 the other reason is inter-layer races.  those are probably less-common than they used to be, but we don't have a gauranteed mitigation for that, either.
[17:53:02] <bblack>	 depending on the timing of incoming new client requests to various caches in a DC, and the timing of the purge queue spooling out to those same caches at the two layers
[17:53:30] <bblack>	 you can get the case where the kafka purge of object X happens in the frontend, then a client requests gets a new copy (of the old object) from the backend, and then the backend processes the same purge
[17:53:49] <bblack>	 and so the should-be-dead old object gets to live for another TTL in the frontend, past the kafka purge
[17:54:43] <bblack>	 there are a few hacks to make it less-likely for that to impact us, but no solid solution
[17:59:50] <cdanis>	 yeah, I think purged sends a purge to ats-be before sending it to varnish, but, only on the same machine
[18:00:22] <cdanis>	 I can think of ways how we *could* mitigate that entirely, but they would make the average purge latency worse
[18:23:17] <wikibugs>	 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10daniel) We could sleep for a second before sending the 415...
[18:45:25] <rzl>	 made a start with https://wikitech.wikimedia.org/w/index.php?title=Varnish&diff=1891595&oldid=1885806, there's probably more improvements to be had
[19:11:12] <wikibugs>	 10Traffic, 10Operations: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10Majavah) added hopefully right project tags, please correct if I'm wrong
[19:30:19] <wikibugs>	 10Traffic, 10Operations: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10DannyS712)
[22:57:26] <volans>	 for the above chat about cumin's quotes, one option is to write a cookbook to perform varnishadm commands, another one would be to get a python shell and use cumin or spicerack directly, a third to implement this item from cumin's TODO: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/cumin/+/refs/heads/master/TODO.rst#26
[22:57:39] <volans>	 do you have any thoughts/preferences?