[06:23:51] 10Traffic, 10DNS, 10Operations: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) >>! In T240341#5728007, @Bugreporter wrote: > If someone find an old Wikimania website they may think the current website is wikimania2020.wikimed...
[09:51:02] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Send X-Analytics information from Varnish to Hadoop with VCL_Log - https://phabricator.wikimedia.org/T196558 (10ema)
[09:55:31] 10Traffic, 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10MoritzMuehlenhoff)
[09:56:05] 10Traffic, 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I tested a failover and an instance migration successfully. I also changed the cluster setting so that CPU vulnerability flags are passed th...
[12:33:15] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) p:05Triage→03Normal
[12:33:36] 10Traffic, 10DNS, 10Mail, 10Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (10ema) p:05Triage→03Normal
[12:34:23] 10Traffic, 10Operations: Create a system for distributed shared secret material to server tmps - https://phabricator.wikimedia.org/T240866 (10ema) p:05Triage→03Normal
[12:34:48] 10Traffic, 10Operations: Secure shared ticket key rotation for anycast authdns - https://phabricator.wikimedia.org/T240863 (10ema) p:05Triage→03Normal
[12:36:26] 10Traffic, 10Operations: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (10ema) p:05Triage→03Normal
[12:37:02] 10Domains, 10Traffic, 10DNS, 10Operations: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (10ema) p:05Triage→03Normal
[12:53:36] 10Traffic, 10Operations: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema)
[12:54:59] 10Traffic, 10Operations: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (10ema) p:05Triage→03Normal
[12:59:26] ema: or just not have 5k/sec purges in the first place :/
[13:00:46] +1 :)
[13:00:59] bblack: or that! :)
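(Back-of-envelope arithmetic only, based on the ~5k/sec purge rate mentioned above rather than any measurement; the Python below is purely illustrative.)

    # Rough per-purge CPU budget if a single thread has to absorb ~5k purges/sec,
    # the figure quoted in the discussion above. Illustrative arithmetic only.
    purge_rate = 5000                       # purges per second (from the chat)
    budget_us = 1_000_000 / purge_rate      # microseconds of CPU available per purge
    print(f"per-purge budget on one core: {budget_us:.0f} us")
    # -> 200 us: any per-purge work (Lua hooks included) costing more than that
    #    will saturate the core handling the PURGE traffic.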
[13:01:32] but still, it does give some pause, it's an interesting edge case on perf
[13:01:41] ema: I see that our theory checked out :D
[13:01:51] because I don't think v-be was similarly maxing out a core doing the same thing
[13:02:29] my theory is that your theory is way more complex and conspiratorial than necessary :)
[13:03:06] /o\
[13:04:07] I think vgutierrez meant "our theory that high cpu usage is due to purges" :)
[13:04:17] yep
[13:04:41] oh I figured it was something about "bblack must have an IRC watcher on certain keywords that summons him even during vacations"
[13:04:51] LOL
[13:05:04] bblack: I'm looking at varnish-fe on cp3050 now, the busiest worker uses between 20% and 40% CPU
[13:05:20] well the vhtcpd crash the other day is part of this as well
[13:05:29] this is likely the difference between a native language and lua
[13:05:41] it crashed because it got backlogged by ~40GB worth of purge data, presumably facing this cpu-limited ats-be thread
[13:06:33] on a depooled ats-be host we can try to remove the path normalization lua code and see :)
[13:06:38] (and apparently my former self in some sense decided that was ok behavior - that vhtcpd should malloc blindly and when allocation finally fails just let itself segfault :P)
[13:07:01] ema: yeah that might be an interesting experiment
[13:07:13] but if lua can only be used on non-heavy traffic.... :/
[13:07:49] it's another strong argument against the eventual death of varnish-fe
[13:07:56] yeah
[13:08:17] maybe we need to flex our C++ skills
[13:08:22] even if all our vcl->lua translations for the fe case work under "normal" load, if they melt under heavy load, it's not a very useful fe
[13:09:11] vgutierrez: yeah but you've seen our v-fe VCL complexity. Managing that in C++ would be ... yeah
[13:09:24] yep, it's far from ideal
[13:09:53] if only someone wrote a higher-level abstraction specialized for HTTP revproxy/cache operations which compiled to native C for performance
[13:09:59] maybe extending the lua plugin is enough
[13:10:08] I doubt we'll ever get 5k rps from a single IP though?
[13:10:18] single IP and single connection?
[13:10:24] that would be banned for sure
[13:10:25] bblack: new-weekend-project-alert!
[13:10:26] right, single IP and single TCP connection
[13:10:29] ema: yeah the purge model is definitely unique in that sense
[13:10:56] if the only problem is scaling to 5k rps on a single conn/thread, that's actually pretty ok. it makes abusive connections self-limiting.
[13:11:10] I'm worried that this might be a harbinger of other limitations though
[13:11:14] with the current, admittedly simple, lua used by ats-tls we're doing very well when it comes to resource usage
[13:11:52] basically if Lua slowness is the reason for purge-handling slowness in this one thread
[13:12:11] I wouldn't expect it to be any better if you spread that over many threads. Either way it's 5k/sec executions of the same lua code.
[13:12:41] but if it's something else, then sure
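(On the vhtcpd backlog discussed above, i.e. buffering purges without bound until malloc fails and the daemon segfaults: a bounded queue that sheds the oldest entries is the obvious alternative. The Python below is only a sketch of that idea with hypothetical names; the real vhtcpd is a C daemon reading HTCP multicast and does not work this way.)

    from collections import deque

    class BoundedPurgeQueue:
        """Queue purge URLs up to a byte budget, dropping the oldest when full.

        Hypothetical sketch only; not the actual vhtcpd design, which (as noted
        above) currently buffers without a limit.
        """

        def __init__(self, max_bytes=256 * 1024 * 1024):  # e.g. 256MB, not ~40GB
            self.items = deque()
            self.max_bytes = max_bytes
            self.used = 0
            self.dropped = 0

        def push(self, url: str) -> None:
            size = len(url)
            # Shed the oldest entries instead of growing until allocation fails.
            # (A single oversized URL is still accepted; fine for a sketch.)
            while self.items and self.used + size > self.max_bytes:
                self.used -= len(self.items.popleft())
                self.dropped += 1
            self.items.append(url)
            self.used += size

        def pop(self):
            # Return the next queued purge URL, or None if the queue is empty.
            if not self.items:
                return None
            url = self.items.popleft()
            self.used -= len(url)
            return url

(Dropping purges is lossy too, since a stale object can survive in cache, which is presumably why the original design preferred to keep buffering; the sketch only illustrates the trade-off being hinted at.)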
[13:22:55] 10netops, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10ayounsi) 05Open→03Resolved a:03ayounsi All good!
[13:23:56] 10Traffic, 10Operations: ats-be: consider moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema)
[13:24:37] 10Traffic, 10Operations: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema)
[13:24:47] 10Traffic, 10Operations: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10ema) p:05Triage→03Normal
[13:50:17] bblack: if you're around I've a (hopefully) quick question wrt netbox-generated dns
[13:50:37] ok
[13:51:17] 10Traffic, 10Operations: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 (10ema)
[13:51:27] preamble: netbox is out of CDN and LVS on purpose; to expose it I'd like it to be a/a from the two netbox hosts, ideally with geo-dns
[13:51:36] 10Traffic, 10Operations: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 (10ema) p:05Triage→03Normal
[13:51:49] so I was thinking of taking an approach similar to ncredir's (thanks valentin for the pointer)
[13:51:52] adding it in geo-resources
[13:52:16] is that a valid approach? how would we "depool" one in case of maintenance?
[13:53:28] also, is it ok to put a host IP in that mapping? or should we do something different?
[13:53:43] I can ofc go with a CNAME for now and we can improve in Jan.
[14:07:39] you lost me. You want to do geo-resources for generating host-level address records for things like puppetmaster1001(.mgmt)?.eqiad.wmnet ?
[14:08:20] no no, the netbox hosts will expose the dns repo via https right? so we'll have something like netbox-exports.w.o
[14:08:28] from where to pull the data
[14:08:40] *dns snippets repo
[14:08:43] to avoid confusion
[14:08:50] oh that
[14:09:01] ok, that seems a little less crazy :)
[14:09:21] I thought you were talking about e.g. using some dynamic system to like, depool per-host DNS records while they're being reimaged, etc :)
[14:09:51] rotfl, no no
[14:10:39] I think it's a potentially-valid approach. Although it's certainly simpler to start with a CNAME and we can always swap in more complexity later.
[14:11:07] the big question is whether it's active/active, and whether it could present a split worldview to the authdnses, etc
[14:13:01] the cookbook will be responsible for that: commit a change, pull on the other netbox, and only after that pull on the authdns hosts
[14:14:28] if later on we go a bit more automated, like having authdns watch the repo somehow and pull it automatically (given that new commits will still be manually triggered)
[14:14:57] we can still publish the new sha1 only after we're sure the commit is on both hosts
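(A minimal sketch of the ordering described above: commit on the active netbox host, pull on the other netbox host, and only then let the authdns hosts pull, reverting if the commit can't be made visible on both. Hostnames, the repo path and the run() helper are hypothetical, and a real cookbook would use the existing automation tooling rather than raw ssh.)

    import subprocess

    # Hypothetical hosts/paths; the real values would come from config, not from here.
    ACTIVE_NETBOX = "netbox1001.example.wmnet"
    PASSIVE_NETBOX = "netbox2001.example.wmnet"
    AUTHDNS_HOSTS = ["authdns1001.example.wmnet", "authdns2001.example.wmnet"]
    REPO = "/srv/netbox-exports/dns"

    def run(host, cmd):
        """Run a shell command on a host via ssh; stand-in for the cookbook's remote-execution layer."""
        return subprocess.run(["ssh", host, cmd], check=True, capture_output=True, text=True)

    def propagate_snippets(commit_message):
        # 1. Commit the newly generated snippets on the active netbox host.
        run(ACTIVE_NETBOX, f"cd {REPO} && git add -A && git commit -m '{commit_message}'")
        sha1 = run(ACTIVE_NETBOX, f"git -C {REPO} rev-parse HEAD").stdout.strip()
        try:
            # 2. Pull on the other netbox host first...
            run(PASSIVE_NETBOX, f"git -C {REPO} pull --ff-only")
        except subprocess.CalledProcessError:
            # ...and revert if the commit can't be made visible on both hosts.
            run(ACTIVE_NETBOX, f"git -C {REPO} revert --no-edit {sha1}")
            raise
        # 3. Only then update the authdns hosts (in reality this step would go
        #    through the existing authdns update tooling, not a raw git pull).
        for host in AUTHDNS_HOSTS:
            run(host, f"git -C {REPO} pull --ff-only")
        return sha1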
[14:14:59] there will always be cases where the authdns can pull without the cookbook pushing on a change, I think
[14:15:06] like?
[14:15:24] well the easiest example is reimaging an authserver
[14:15:35] it has to do its initial pull of all data from ops/dns and netbox
[14:15:55] but in general, "authdns-update" with the right flags should be capable of resyncing
[14:16:14] (e.g. a human had to edit the data for an emergency, and now wants authdns-update to bring it back into sync with upstream netbox)
[14:16:19] sure
[14:17:20] my thought was that the cookbook would revert the change if it's unable to have it on both hosts
[14:17:26] and icinga would check the sha1
[14:18:03] ofc that still leaves open a race condition for a few seconds
[14:18:27] but after the revert we could always re-pull on all authdns to make sure they're in sync
[14:19:03] it was more to not depend on a single VM, but if we prefer the simpler active/backup that's ok too
[14:19:08] even quicker for me :)
[14:25:00] I'll go with a CNAME for now
[14:30:45] so a race condition isn't the end of the world
[14:31:16] because a lesson we've / I've had to re-wire my brain for repeatedly is that DNS is inherently asynchronous
[14:31:30] you don't get the luxury of global point-in-time state changes with DNS, ever, anywhere :P
[14:31:53] ehehe
[14:32:00] so if it's just a matter of half our authdns servers getting a netbox-driven data change a few seconds (or honestly even minutes) differently in time from the others, that's actually ok
[14:32:23] I'm more worried about the really-persistent/bad splits.
[14:32:51] sure, and that we can enforce/check
[15:10:08] 10Traffic, 10DNS, 10Mail, 10Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (10Basak) Hi All! Just wanted to drop a line to share that this is the address shared as the main communication email of the group in social media and other cha...
[16:19:20] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Krinkle) 05Open→03Resolved
[16:20:18] 10Traffic, 10Operations, 10Performance-Team, 10Performance-Team-publish: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Krinkle)
[17:07:48] 10HTTPS, 10Traffic, 10Operations: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10Bugreporter)
[17:19:44] 10HTTPS, 10Traffic, 10Operations, 10Performance-Team (Radar): Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10Krinkle)
[21:13:08] 10netops, 10Operations: fastnetmon fired for routine text-lb.esams traffic - https://phabricator.wikimedia.org/T241281 (10CDanis)
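(For the "icinga would check the sha1" idea in the 14:17 exchange above, a sketch of what such a consistency check could look like; the hosts and repo path are hypothetical, and the only assumed convention is the Nagios/Icinga exit codes, 0 for OK and 2 for CRITICAL.)

    import subprocess
    import sys

    # Hypothetical host list and repo path; a real check would read these from config.
    HOSTS = [
        "netbox1001.example.wmnet",
        "netbox2001.example.wmnet",
        "authdns1001.example.wmnet",
        "authdns2001.example.wmnet",
    ]
    REPO = "/srv/netbox-exports/dns"

    def head_sha1(host):
        # Ask each host for the current HEAD of its copy of the DNS snippets repo.
        out = subprocess.run(
            ["ssh", host, f"git -C {REPO} rev-parse HEAD"],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.strip()

    def main():
        sha1s = {host: head_sha1(host) for host in HOSTS}
        if len(set(sha1s.values())) == 1:
            print(f"OK: all hosts at {next(iter(sha1s.values()))}")
            return 0
        print("CRITICAL: netbox DNS snippets out of sync: "
              + ", ".join(f"{h}={s[:8]}" for h, s in sha1s.items()))
        return 2

    if __name__ == "__main__":
        sys.exit(main())

(A brief skew between hosts during a push is expected, per the race condition acknowledged above, so a real check would tolerate short-lived mismatches, e.g. by re-checking after a delay before alerting.)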