[06:23:51] 10Traffic, 10DNS, 10Operations: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) >>! In T240341#5728007, @Bugreporter wrote: > If someone find an old Wikimania website they may think the current website is wikimania2020.wikimed...
[09:51:02] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Send X-Analytics information from Varnish to Hadoop with VCL_Log - https://phabricator.wikimedia.org/T196558 (10ema)
[09:55:31] 10Traffic, 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10MoritzMuehlenhoff)
[09:56:05] 10Traffic, 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I tested a failover and an instance migration successfully. I also changed the cluster setting so that CPU vulnerability flags are passed th...
[12:33:15] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) p:05Triage→03Normal
[12:33:36] 10Traffic, 10DNS, 10Mail, 10Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (10ema) p:05Triage→03Normal
[12:34:23] 10Traffic, 10Operations: Create a system for distributed shared secret material to server tmps - https://phabricator.wikimedia.org/T240866 (10ema) p:05Triage→03Normal
[12:34:48] 10Traffic, 10Operations: Secure shared ticket key rotation for anycast authdns - https://phabricator.wikimedia.org/T240863 (10ema) p:05Triage→03Normal
[12:36:26] 10Traffic, 10Operations: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (10ema) p:05Triage→03Normal
[12:37:02] 10Domains, 10Traffic, 10DNS, 10Operations: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (10ema) p:05Triage→03Normal
[12:53:36] 10Traffic, 10Operations: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema)
[12:54:59] 10Traffic, 10Operations: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema) p:05Triage→03Normal
[12:59:26] ema: or just not have 5k/sec purges in the first place :/
[13:00:46] +1 :)
[13:00:59] bblack: or that! :)
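(Back-of-envelope arithmetic only, based on the ~5k/sec purge rate mentioned above rather than any measurement; the Python below is purely illustrative.)

    # Rough per-purge CPU budget if a single thread has to absorb ~5k purges/sec,
    # the figure quoted in the discussion above. Illustrative arithmetic only.
    purge_rate = 5000                       # purges per second (from the chat)
    budget_us = 1_000_000 / purge_rate      # microseconds of CPU available per purge
    print(f"per-purge budget on one core: {budget_us:.0f} us")
    # -> 200 us: any per-purge work (Lua hooks included) costing more than that
    #    will saturate the core handling the PURGE traffic.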
[13:01:32] but still, it does give some pause, it's an interesting edge case on perf
[13:01:41] ema: I see that our theory checked out :D
[13:01:51] because I don't think v-be was similarly maxing out a core doing the same thing
[13:02:29] my theory is that your theory is way more complex and conspiratorial than necessary :)
[13:03:06] /o\
[13:04:07] I think vgutierrez meant "our theory that high cpu usage is due to purges" :)
[13:04:17] yep
[13:04:41] oh I figured it was something about "bblack must have an IRC watcher on certain keywords that summons him even during vacations"
[13:04:51] LOL
[13:05:04] bblack: I'm looking at varnish-fe on cp3050 now, the busiest worker uses between 20% and 40% CPU
[13:05:20] well the vhtcpd crash the other day is part of this as well
[13:05:29] this is likely the difference between a native language and lua
[13:05:41] it crashed because it got backlogged by ~40GB worth of purge data, presumably facing this cpu-limited ats-be thread
[13:06:33] on a depooled ats-be host we can try to remove the path normalization lua code and see :)
[13:06:38] (and apparently my former self in some sense decided that was ok behavior - that vhtcpd should malloc blindly and when allocation finally fails just let itself segfault :P)
[13:07:01] ema: yeah that might be an interesting experiment
[13:07:13] but if lua can only be used on non-heavy traffic.... :/
[13:07:49] it's another strong argument against the eventual death of varnish-fe
[13:07:56] yeah
[13:08:17] maybe we need to flex our C++ skills
[13:08:22] even if all our vcl->lua translations for the fe case work under "normal" load, if they melt under heavy load, it's not a very useful fe
[13:09:11] vgutierrez: yeah but you've seen our v-fe VCL complexity. Managing that in C++ would be ... yeah
[13:09:24] yep, it's far from ideal
[13:09:53] if only someone wrote a higher-level abstraction specialized for HTTP revproxy/cache operations which compiled to native C for performance
[13:09:59] maybe extending the lua plugin is enough
[13:10:08] I doubt we'll ever get 5k rps from a single IP though?
[13:10:18] single IP and single connection?
[13:10:24] that would be banned for sure
[13:10:25] bblack: new-weekend-project-alert!
[13:10:26] right, single IP and single TCP connection
[13:10:29] ema: yeah the purge model is definitely unique in that sense
[13:10:56] if the only problem is scaling to 5k rps on a single conn/thread, that's actually pretty ok. it makes abusive connections self-limiting.
[13:11:10] I'm worried that this might be a harbinger of other limitations though
[13:11:14] with the current, admittedly simple, lua used by ats-tls we're doing very well when it comes to resource usage
[13:11:52] basically if Lua slowness is the reason for purge-handling slowness in this one thread
[13:12:11] I wouldn't expect it to be any better if you spread that over many threads. Either way it's 5k/sec executions of the same lua code.
[13:12:41] but if it's something else, then sure
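(On the vhtcpd backlog discussed above, i.e. buffering purges without bound until malloc fails and the daemon segfaults: a bounded queue that sheds the oldest entries is the obvious alternative. The Python below is only a sketch of that idea with hypothetical names; the real vhtcpd is a C daemon reading HTCP multicast and does not work this way.)

    from collections import deque

    class BoundedPurgeQueue:
        """Queue purge URLs up to a byte budget, dropping the oldest when full.

        Hypothetical sketch only; not the actual vhtcpd design, which (as noted
        above) currently buffers without a limit.
        """

        def __init__(self, max_bytes=256 * 1024 * 1024):  # e.g. 256MB, not ~40GB
            self.items = deque()
            self.max_bytes = max_bytes
            self.used = 0
            self.dropped = 0

        def push(self, url: str) -> None:
            size = len(url)
            # Shed the oldest entries instead of growing until allocation fails.
            # (A single oversized URL is still accepted; fine for a sketch.)
            while self.items and self.used + size > self.max_bytes:
                self.used -= len(self.items.popleft())
                self.dropped += 1
            self.items.append(url)
            self.used += size

        def pop(self):
            # Return the next queued purge URL, or None if the queue is empty.
            if not self.items:
                return None
            url = self.items.popleft()
            self.used -= len(url)
            return url

(Dropping purges is lossy too, since a stale object can survive in cache, which is presumably why the original design preferred to keep buffering; the sketch only illustrates the trade-off being hinted at.)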
[13:22:55] 10netops, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10ayounsi) 05Open→03Resolved a:03ayounsi All good!
[13:23:56] 10Traffic, 10Operations: ats-be: consider moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema)
[13:24:37] 10Traffic, 10Operations: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema)
[13:24:47] 10Traffic, 10Operations: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema) p:05Triage→03Normal
[13:50:17] bblack: if you're around I've a (hopefully) quick question wrt netbox-generated dns
[13:50:37] ok
[13:51:17] 10Traffic, 10Operations: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 (10ema)
[13:51:27] preamble: netbox is out of CDN and LVS on purpose; to expose it I'd like it to be a/a from the two netbox hosts, ideally with geo-dns
[13:51:36] 10Traffic, 10Operations: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 (10ema) p:05Triage→03Normal
[13:51:49] so I was thinking of taking an approach similar to ncredir's (thanks valentin for the pointer)
[13:51:52] adding it in geo-resources
[13:52:16] is that a valid approach? how would we "depool" one in case of maintenance?
[13:53:28] also, is it ok to put a host IP in that mapping? or should we do something different?
[13:53:43] I can ofc go with a CNAME for now and we can improve in Jan.
[14:07:39] you lost me. You want to do geo-resources for generating host-level address records for things like puppetmaster1001(.mgmt)?.eqiad.wmnet ?
[14:08:20] no no, the netbox hosts will expose the dns repo via https right? so we'll have something like netbox-exports.w.o
[14:08:28] from where to pull the data
[14:08:40] *dns snippets repo
[14:08:43] to avoid confusion
[14:08:50] oh that
[14:09:01] ok, that seems a little less crazy :)
[14:09:21] I thought you were talking about e.g. using some dynamic system to like, depool per-host DNS records while they're being reimaged, etc :)
[14:09:51] rotfl, no no
[14:10:39] I think it's a potentially-valid approach. Although it's certainly simpler to start with a CNAME and we can always swap in more complexity later.
[14:11:07] the big question is whether it's active/active, and whether it could present a split worldview to the authdnses, etc
[14:13:01] the cookbook will be responsible for that: commit a change, pull on the other netbox, and only after that pull on the authdns hosts
[14:14:28] if later on we go a bit more automated, like having authdns watch the repo somehow and pull it automatically (given that new commits will still be manually triggered)
[14:14:57] we can still publish the new sha1 only after we're sure the commit is on both hosts
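(A minimal sketch of the ordering described above: commit on the active netbox host, pull on the other netbox host, and only then let the authdns hosts pull, reverting if the commit can't be made visible on both. Hostnames, the repo path and the run() helper are hypothetical, and a real cookbook would use the existing automation tooling rather than raw ssh.)

    import subprocess

    # Hypothetical hosts/paths; the real values would come from config, not from here.
    ACTIVE_NETBOX = "netbox1001.example.wmnet"
    PASSIVE_NETBOX = "netbox2001.example.wmnet"
    AUTHDNS_HOSTS = ["authdns1001.example.wmnet", "authdns2001.example.wmnet"]
    REPO = "/srv/netbox-exports/dns"

    def run(host, cmd):
        """Run a shell command on a host via ssh; stand-in for the cookbook's remote-execution layer."""
        return subprocess.run(["ssh", host, cmd], check=True, capture_output=True, text=True)

    def propagate_snippets(commit_message):
        # 1. Commit the newly generated snippets on the active netbox host.
        run(ACTIVE_NETBOX, f"cd {REPO} && git add -A && git commit -m '{commit_message}'")
        sha1 = run(ACTIVE_NETBOX, f"git -C {REPO} rev-parse HEAD").stdout.strip()
        try:
            # 2. Pull on the other netbox host first...
            run(PASSIVE_NETBOX, f"git -C {REPO} pull --ff-only")
        except subprocess.CalledProcessError:
            # ...and revert if the commit can't be made visible on both hosts.
            run(ACTIVE_NETBOX, f"git -C {REPO} revert --no-edit {sha1}")
            raise
        # 3. Only then update the authdns hosts (in reality this step would go
        #    through the existing authdns update tooling, not a raw git pull).
        for host in AUTHDNS_HOSTS:
            run(host, f"git -C {REPO} pull --ff-only")
        return sha1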
[14:14:59] there will always be cases where the authdns can pull without the cookbook pushing on a change, I think
[14:15:06] like?
[14:15:24] well the easiest example is reimaging an authserver
[14:15:35] it has to do its initial pull of all data from ops/dns and netbox
[14:15:55] but in general, "authdns-update" with the right flags should be capable of resyncing
[14:16:14] (e.g. a human had to edit the data for an emergency, and now wants authdns-update to bring it back into sync with upstream netbox)
[14:16:19] sure
[14:17:20] my thought was that the cookbook would revert the change if it's unable to have it on both hosts
[14:17:26] and icinga would check the sha1
[14:18:03] ofc that still leaves open a race condition for a few seconds
[14:18:27] but after the revert we could always re-pull on all authdns to make sure they're in sync
[14:19:03] it was more to not depend on a single VM, but if we prefer the simpler active/backup that's ok too
[14:19:08] even quicker for me :)
[14:25:00] I'll go with a CNAME for now
[14:30:45] so a race condition isn't the end of the world
[14:31:16] because a lesson we've / I've had to re-wire my brain for repeatedly is that DNS is inherently asynchronous
[14:31:30] you don't get the luxury of global point-in-time state changes with DNS, ever, anywhere :P
[14:31:53] ehehe
[14:32:00] so if it's just a matter of half our authdns servers getting a netbox-driven data change a few seconds (or honestly even minutes) differently in time from the others, that's actually ok
[14:32:23] I'm more worried about the really-persistent/bad splits.
[14:32:51] sure, and that we can enforce/check
[15:10:08] 10Traffic, 10DNS, 10Mail, 10Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (10Basak) Hi All! Just wanted to drop a line to share that this is the address shared as the main communication email of the group in social media and other cha...
[16:19:20] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Krinkle) 05Open→03Resolved
[16:20:18] 10Traffic, 10Operations, 10Performance-Team, 10Performance-Team-publish: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Krinkle)
[17:07:48] 10HTTPS, 10Traffic, 10Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10Bugreporter)
[17:19:44] 10HTTPS, 10Traffic, 10Operations, 10Performance-Team (Radar): Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10Krinkle)
[21:13:08] 10netops, 10Operations: fastnetmon fired for routine text-lb.esams traffic - https://phabricator.wikimedia.org/T241281 (10CDanis)
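(For the "icinga would check the sha1" idea in the 14:17 exchange above, a sketch of what such a consistency check could look like; the hosts and repo path are hypothetical, and the only assumed convention is the Nagios/Icinga exit codes, 0 for OK and 2 for CRITICAL.)

    import subprocess
    import sys

    # Hypothetical host list and repo path; a real check would read these from config.
    HOSTS = [
        "netbox1001.example.wmnet",
        "netbox2001.example.wmnet",
        "authdns1001.example.wmnet",
        "authdns2001.example.wmnet",
    ]
    REPO = "/srv/netbox-exports/dns"

    def head_sha1(host):
        # Ask each host for the current HEAD of its copy of the DNS snippets repo.
        out = subprocess.run(
            ["ssh", host, f"git -C {REPO} rev-parse HEAD"],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.strip()

    def main():
        sha1s = {host: head_sha1(host) for host in HOSTS}
        if len(set(sha1s.values())) == 1:
            print(f"OK: all hosts at {next(iter(sha1s.values()))}")
            return 0
        print("CRITICAL: netbox DNS snippets out of sync: "
              + ", ".join(f"{h}={s[:8]}" for h, s in sha1s.items()))
        return 2

    if __name__ == "__main__":
        sys.exit(main())

(A brief skew between hosts during a push is expected, per the race condition acknowledged above, so a real check would tolerate short-lived mismatches, e.g. by re-checking after a delay before alerting.)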