[00:34:52] 10Traffic, 10Operations, 10Performance-Team, 10observability: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (10Krinkle) 05Open→03Resolved Not anymore. Thank you!
[03:25:09] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10leila) @Dzahn Sorry to ping you personally. I'm hoping you can point me to the righ...
[08:26:08] 10netops, 10Operations, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10fgiunchedi) >>! In T267018#6689478, @CDanis wrote: > Does this mean we can deprecate the [[ https://ge...
[09:40:14] 10netops, 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) @Dwisehaupt I have created a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/649592 | patch ]] to add a new per interface `neighbours` fact....
[14:21:19] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10holger.knust)
[15:35:46] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) Happened to see this go by -- I've dropped a single comment on the review...
[15:35:48] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) a:03RLazarus
[16:09:26] 10Wikimedia-Apache-configuration, 10Operations, 10Research, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10Dzahn) @leila It's about to get deployed. We apologize for this taking so long. Yes...
[16:24:29] 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) 05Open→03Resolved Deployed and tested: ` rzl@cumin1001:~$ httpbb /srv/deployment/httpbb-te...
[16:46:26] hi traffic 👋 after merging https://gerrit.wikimedia.org/r/619130 to change a redirect target, I'd like to purge the old redirect from the cache -- can someone check me before I run this, please? :)
[16:46:28] sudo cumin A:cp-text -b 1 "varnishadm -n frontend ban 'req.http.host == \"wikimedia.org\" && req.url == \"/research\"'"
[16:47:37] my brain always has a hard time with quoting via cumin
[16:47:48] but other than that part, looks ok?
[16:48:47] same, so I slapped an "echo" in front of varnishadm and the right thing happened
[16:48:51] if the rest lgty I'll go ahead
[16:49:34] yeah looks good
[16:49:39] thanks!
[16:49:40] I even tried the syntax on one host
[16:50:05] it'd be neat if cumin could take a shell script that it'd execute remotely to save a bit of chore
[16:52:12] (for posterity, it was -b 1 A:cp-text, not A:cp-text -b 1)
[16:52:32] oh yeah :)
[16:52:44] oh whoops, now I'm getting a frontend cache miss but I guess I needed to do it without -n frontend first, then with?
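The quoting dance above can be sketched in plain shell. This is a hedged dry run, not the verified command history: the `A:cp-text` alias, the `-b 1` ordering, and the ban expression all come from the chat, and the remote command is only printed, never executed.

```shell
#!/bin/sh
# Build the ban expression once, so only one layer of quoting is left
# for cumin itself: the single quotes around ${ban_expr} reach the
# remote shell and protect the embedded double quotes for varnishadm.
ban_expr='req.http.host == "wikimedia.org" && req.url == "/research"'
remote_cmd="varnishadm -n frontend ban '${ban_expr}'"
# Note the ordering: -b 1 (batch size) goes before the host alias, not
# after it. The leading "echo" makes this a dry run on the cache hosts.
printf '%s\n' "sudo cumin -b 1 'A:cp-text' \"echo ${remote_cmd}\""
```

Prefixing the remote command with `echo` first, as done in the chat, is a cheap way to verify the quoting survived all the layers before running it for real.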
[16:52:55] well
[16:53:06] without -n frontend is how you target the varnish backend caches
[16:53:11] but we don't have those anymore, so that doesn't work :)
[16:53:21] we actually don't have a great banning solution for backends
[16:53:43] right okay -- I was doing my best to figure out how much wikitech to believe :P
[16:53:49] there is a way, but it's a temporary chunk of lua code
[16:54:05] ahhh that's right, I've done that with y'all's help before
[16:54:36] have you got an easy way of checking on what the TTL is? might or might not be worth the effort if it'll just expire naturally
[16:54:55] could check the applayer response
[16:55:37] oh it already changed, right
[16:56:08] just the URL changed, the rest should be identical
[16:56:10] Cache-Control: max-age=2592000
[16:56:12] anyways, the new redirect says a week, which the be would cap down to 24h I believe
[16:56:32] huh, I get 30 days, but if it's capped to 24h that's fine
[16:56:34] err not a week, 30d
[16:56:37] 👍
[16:56:37] yeah
[16:56:51] cool -- thanks!
[16:57:09] anyways, the fe can also refresh it past the edge of the be lifetime
[16:57:25] you can get it down to ~1d if you wait 1d and then ban the frontends
[16:57:28] or do the lua hack
[16:57:34] (and then ban the frontends)
[16:57:51] 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10Aklapper) a:05RLazarus→03Aklapper Thanks everyone! :)
[16:58:59] 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) Quick correction -- this is now live on all appservers, but the old URL is still cached by the...
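For reference, the arithmetic behind the TTL exchange above: `max-age=2592000` seconds is the 30 days one side observed, and the 24h backend cap the other side mentioned wins. A quick sketch (the cap value is the one quoted in the chat, not read from config):

```shell
#!/bin/sh
max_age=2592000   # Cache-Control: max-age from the applayer response
day=86400         # seconds per day; the backend TTL cap discussed above
days=$((max_age / day))   # 30, matching "huh, I get 30 days"
# The backend serves the object for min(max_age, cap), i.e. 24h here.
if [ "$max_age" -gt "$day" ]; then ttl=$day; else ttl=$max_age; fi
echo "origin asks for ${days}d; capped backend TTL is $((ttl / 3600))h"
# prints: origin asks for 30d; capped backend TTL is 24h
```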
[16:59:31] nod, I'll just FE ban tomorrow -- appreciate the walkthrough :)
[17:08:41] cleaning up wikitech a little -- is this still accurate? '''Note''': if all you need is purging objects based on their URLs, see how to perform [https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge One-off purges].
[17:09:06] architecturally definitely not, I guess the question is does the one-liner still work with purged
[17:17:31] 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10leila) Thank you all! I really appreciate your work on this.
[17:25:06] rzl: so, "varnishadm ban" is varnish-specific and not related to the usual PURGE mechanisms or HTCP or purged (and the lua hacks for ats-be are similarly ATS-specific and don't use the generic purge mechanisms)
[17:25:42] mwscript.php used to drive multicast purging which would have in turn come back through purged to both varnish-fe + ats-be
[17:26:11] I *assume* it has been upgraded to work without multicast now and results in sending a kafka purge to both, as deployers use that tool
[17:26:42] and yeah, I guess since this is a simple URL match and not some complicated regex or whole-host case, mwscript might work for what you need, without waiting
[17:26:52] (sorry I didn't think of that earlier!)
[17:27:15] cool, I'll give that a try
[17:29:25] hey, that did the trick :D
[17:29:33] bblack: rzl: yeah, I was about to say, Varnish bans are the wrong tool to use here for two reasons
[17:29:49] and yes, that script just invokes the same MW code that all purges use, which nowadays posts to kafka
[17:30:00] love to see it
[17:32:07] 10Wikimedia-Apache-configuration, 10Operations, 10Research: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10RLazarus) >>!
In T259979#6692703, @RLazarus wrote: > Quick correction -- this is now live on all appserve...
[17:34:16] btw rzl if you check my .zsh_history on cumin1001 there is a sample of cumin+curl invocations to check all cps for things :)
[17:34:56] that's true, but also, if I check your .zsh_history on cumin1001 there's a very real chance I won't come back alive
[17:35:08] or if I do, I'll be a changed man, forever harrowed by the experience
[17:35:49] there's only a few unspeakable things in there
[17:36:13] feature request: implement regex purges/bans efficiently via standard PURGE for varnish+ats, so we can use mwscript for all cases
[17:37:12] hey, all you really have to do is write a little PHP to expand the regex to all possible individual URLs, then it's reduced to a solved problem
[17:37:17] 🤔
[17:37:21] lol
[17:37:21] am I doing Fun Day right
[17:37:25] would that need work in the varnish/ats layers bblack?
[17:37:35] in the code of each daemon, yes
[17:37:40] right
[17:37:55] you'd have to translate the varnish regex PURGE into a ban with the right params basically
[17:38:10] and ATS would also have to do a ban-like thing, probably using its built-in generation counter
[17:38:45] in both cases the problem is basically the same: you can't have PURGE synchronously scan a whole cache storage to look for matches, it could take forever.
[17:39:06] and you can't just return async or you have no timing guarantee on when the operation is effectively complete from the public POV
[17:39:08] yup, instead you have to check all new requests, with some sort of ttl
[17:40:04] for varnish when you ban on request attributes, yes, it checks all new requests and ensures any object matching them as a hit is newer than the timestamp the ban went into effect
[17:40:49] if the conditional happens to be on a response attribute, it also does a slow background scan of storage, so it can remove the ban structure after all relevant objects are gone
[17:41:16] I think for request attributes, the whole cache has to get newer than the ban before it goes away, but I'm not sure if I'm remembering that right
[17:41:36] (because many request attributes aren't stored in cache)
[17:42:32] so even if you implement efficiently like that, it's not a scalable solution for automated use from mediawiki for content updates or whatever
[17:42:51] but you could probably hook up the bits and pieces to make manually-authorized one-off cases work via kafka->purged
[17:45:08] understanding varnish's request -vs- response types of bans can be important in some corner cases, since they're so different
[17:45:52] well, max_ttl helps us there, at least
[17:46:01] it's better to think of a request ban as "ensure that all incoming new client requests whose request attributes match foo do not hit any objects that are older than X"
[17:46:24] whereas a response ban is more like what you'd normally think of as a true purge, in practice
[17:46:56] I don't think max_ttl actually works for this case, because of keep/grace and conditional refreshes and longer applayer TTLs, etc
[17:47:09] oh, hm
[17:47:36] but if it tracks the timestamp of the oldest object in cache, it can go by that. that's what I seem to vaguely recall that it does.
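The request-ban / response-ban distinction discussed above maps directly onto the ban expression itself: `req.*` attributes can only be evaluated lazily against incoming requests, while `obj.*` attributes can also be processed by Varnish's background ban lurker. A sketch of both forms, per standard varnishadm syntax (the example expressions are illustrative and the commands are echoed, not executed):

```shell
#!/bin/sh
# Request-attribute ban: checked lazily per incoming request; it lingers
# until nothing in cache is older than the ban's timestamp.
req_ban='req.http.host == "wikimedia.org" && req.url == "/research"'
# Response-attribute ban: the ban lurker can walk storage in the
# background and retire the ban once all matching objects are gone.
obj_ban='obj.http.Content-Type ~ "^image/"'
for ban in "$req_ban" "$obj_ban"; do
    echo varnishadm -n frontend ban "$ban"   # dry run: echo, don't execute
done
```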
[17:47:54] but don't quote me on that
[17:48:23] the max_ttl corner case affects other things too, it's another one of those long-standing but difficult-to-address issues
[17:49:40] we have real cases, where e.g. MW returns an object with a max-age=30d, and we cap the TTL to 24h, but when the 24h expires we try a conditional refresh (based on Last-Modified or ETag), and MW claims with a 304 that the object is unchanged, and so we keep the copy we have and just bump it for a fresh 24h.
[17:49:53] but the underlying actual content has in fact changed (e.g. due to an edit)
[17:50:37] I assert that MW is lying in these cases and shouldn't be, but I recall the MW folks making some pragmatic arguments about why they had to handle these cases this way or something
[17:50:59] it's all recorded in the bowels of some ancient ticket somewhere I'm sure
[17:51:48] which is why the "nothing is older than 24h in the caches" assertion always has a strange asterisk on it
[17:52:32] the other reason is inter-layer races. those are probably less-common than they used to be, but we don't have a guaranteed mitigation for that, either.
[17:53:02] depending on the timing of incoming new client requests to various caches in a DC, and the timing of the purge queue spooling out to those same caches at the two layers
[17:53:30] you can get the case where the kafka purge of object X happens in the frontend, then a client request gets a new copy (of the old object) from the backend, and then the backend processes the same purge
[17:53:49] and so the should-be-dead old object gets to live for another TTL in the frontend, past the kafka purge
[17:54:43] there are a few hacks to make it less-likely for that to impact us, but no solid solution
[17:59:50] yeah, I think purged sends a purge to ats-be before sending it to varnish, but, only on the same machine
[18:00:22] I can think of ways we *could* mitigate that entirely, but they would make the average purge latency worse
[18:23:17] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10daniel) We could sleep for a second before sending the 415...
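For posterity, the mwscript purge that "did the trick" earlier is roughly the following. This is a hedged sketch, not the verified command: `purgeList.php` is MediaWiki's maintenance script for feeding URLs into the standard purge path, the `--wiki` value here is an assumption for illustration, and the invocation is only printed.

```shell
#!/bin/sh
# purgeList.php reads newline-separated URLs and pushes them through the
# normal MediaWiki purge code path, which nowadays posts to Kafka for
# purged to apply at both varnish-fe and ats-be.
url='https://wikimedia.org/research'
cmd='mwscript purgeList.php --wiki=metawiki'   # --wiki value is illustrative
echo "printf '%s\n' $url | $cmd"               # dry run: print, don't execute
```

Unlike a varnishadm ban, this goes through the generic purge mechanism, so it hits both cache layers in one step.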
[18:45:25] made a start with https://wikitech.wikimedia.org/w/index.php?title=Varnish&diff=1891595&oldid=1885806, there's probably more improvements to be had
[19:11:12] 10Traffic, 10Operations: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10Majavah) added hopefully right project tags, please correct if I'm wrong
[19:30:19] 10Traffic, 10Operations: Image fails to load with CORS violation - https://phabricator.wikimedia.org/T270209 (10DannyS712)
[22:57:26] for the above chat about cumin's quotes, one option is to write a cookbook to perform varnishadm commands, another one would be to get a python shell and use cumin or spicerack directly, a third to implement this item from cumin's TODO: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/cumin/+/refs/heads/master/TODO.rst#26
[22:57:39] do you have any thoughts/preferences?