[05:36:34] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) p:05Triage→03High
[05:38:59] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon)
[05:52:25] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) So for the two that went down there was no planned maintenance, but we did get an email from the vendor ("00985243 Disturbance") suggesting that this was an unplanne...
[06:23:24] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10faidon) - esams should be blacklisted for now indeed. - `test_nb_inventory_in_librenms` could use some improvement -- it didn't...
[07:04:38] 10netops, 10Operations: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) Did a chown to www-data:librenms: ` elukey@netmon1002:~$ ls -l /var/log/librenms/daily.log* -rw------- 1 www-data librenms 0 May 13 06:25 /var/log/librenms/daily.log -rw-r--r-- 1 www...
[07:33:09] 10Traffic, 10Operations: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (10Vgutierrez)
[07:33:19] 10Traffic, 10Operations: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (10Vgutierrez) p:05Triage→03Normal
[09:09:31] vgutierrez: ema: Language is kindly requesting input on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/506043/
[09:16:27] ema is on vacation right now
[09:17:10] and TBH I don't think I'm suitable to review that, probably b.black would be a better reviewer
[09:28:00] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) OK, so the vendor "bounced the interface" and the eqiad<->eqord traffic has been restored. What they noticed -and I confirmed- is that this interface was not carryin...
[09:32:36] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon)
[09:35:14] 10netops, 10Operations: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) a:03ayounsi
[09:49:15] 10Traffic, 10Analytics, 10Analytics-Cluster, 10Operations, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561 (10Ottomata)
[11:32:47] 10Traffic, 10Operations, 10serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10Volans)
[12:34:37] 10Traffic, 10Operations, 10Pybal: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10MoritzMuehlenhoff)
[14:55:23] vgutierrez: Wmflib::UserIpPort or Stdlib::Port::Unprivileged ?
[14:55:39] hmmm
[14:55:43] I'm fine either way :)
[14:55:52] AFAIK Stdlib > Wmflib
[14:56:09] so if Unprivileged is already in the version of stdlib that we are using, go for it
[14:56:19] yeah, wmflib is Integer[1024, 49151]
[14:56:49] cool
[14:56:53] thx!
[14:56:58] np
[15:03:38] bblack: I'm wondering if we could give some love to redirects.dat, assuming that HSTS is a thing for canonical domains, and make every 301 into a canonical domain an https:// one instead of //
[15:04:12] i.e. replace "rewrite wikipedia.com //www.wikipedia.org" with "rewrite wikipedia.com https://www.wikipedia.org"
[15:04:14] maybe, not sure on the details
[15:04:28] I don't think it actually makes a difference in practice, though
[15:04:50] if the request was originally insecure, varnish-fe will 301->HTTPS without even talking to the applayer where redirects.dat kicks in
[15:04:54] (for canonicals)
[15:05:10] so all of those // only get seen/used in an HTTPS case anyways
[15:05:21] right, but not for stuff like wikipedia.com
[15:05:30] which isn't canonical
[15:05:49] yeah, I was referring to canonical domains being targets
[15:06:02] ah, I see
[15:06:10] yes, maybe?
[15:06:28] although the non-canonicals will get similar treatment with the stuff you've got going this ~quarter too
[15:06:30] of course a UA with preloaded HSTS will translate that to https automatically
[15:06:50] yeah.. I'm working on that as we speak
[15:07:02] I'm having such a blast with this redirects.dat thingie :)
[15:07:51] bblack: BTW, I'd like your input regarding stuff like www.*.wikipedia.com
[15:08:07] cause obviously we cannot get a TLS certificate that matches that
[15:08:23] we probably shouldn't, yeah
[15:08:45] we could do it in theory, by doing another ~300 SANs per domainname for *.en.wikipedia.com and such
[15:08:58] yeah.. kinda bruteforcing it
[15:08:59] but it's kind of crazy and pointless and nobody needs to type those or find them anywhere organically :P
[15:09:20] most of the non-canonicals are just there for typo/confusion or trademark. A one-level wildcard should suffice.
[15:09:50] we would need 24 extra certificates to match the currently configured wildcard-in-the-middle redirections
[15:10:51] 313*3/40 (313 languages, wikipedia.{com,net,info}, 40 SANs per certificate) ~= 24 certificates
[15:11:27] are the wildcard-in-the-middle cases explicitly configured in redirects.dat today?
[15:11:31] yes
[15:11:37] in any case, I don't think that they'd exist in DNS to get useful traffic, right?
[15:12:00] if the hostname is an NXDOMAIN in DNS, the redirect was doing nothing
[15:12:04] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L193-L195
[15:12:07] hmmm
[15:12:17] Host www.en.wikipedia.com not found: 3(NXDOMAIN)
[15:12:23] right
[15:12:34] so we can get rid of those redirects
[15:12:43] could/should perhaps remove those before the rest of your conversion work, to save effort/confusion
[15:12:48] cause they're pretty useless
[15:12:56] and they are giving me a hard time :)
[15:12:58] alright, merging the RPKI validator CR
[15:13:31] bblack: so if we need to provide TLS support for every use case of redirects.dat we should ban the wildcard-in-the-middle thingie there
[15:14:41] yes, in all senses: 1) remove them from existing redirects.dat after confirming they're all NXDOMAIN (which I think they all are, based on reading all the zonefiles before) + 2) don't support the notion in whatever configurability the new stuff's redirection targeting and/or SAN config has.
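For illustration, a minimal sketch of what one such https-target entry could compile to as a backendless nginx server block. The block layout and certificate paths are assumptions for the sketch, not the actual compile_redirects output:

    # Hypothetical nginx equivalent of: rewrite wikipedia.com https://www.wikipedia.org
    server {
        listen 80;
        listen 443 ssl;
        server_name wikipedia.com www.wikipedia.com;
        # placeholder certificate paths (assumed; the real SAN cert layout may differ)
        ssl_certificate     /etc/ssl/ncredir/non-canonical.crt;
        ssl_certificate_key /etc/ssl/ncredir/non-canonical.key;
        # always 301 to the canonical target over HTTPS, regardless of the original scheme
        return 301 https://www.wikipedia.org$request_uri;
    }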
[15:15:40] I suspect almost all non-canonicals fall into one of two buckets for how we support them (in terms of hostnames that could/should exist in the DNS and/or as SANs):
[15:16:05] 1) Just needs root domain + www, both of which redirect off to some generic target like www.wikipedia.org or whatever
[15:16:54] 2) Needs root domain + www + languages (so just do a full wildcard + root), and should probably re-use the leading label in the redirect, safely?
[15:18:15] (e.g. "en.wikipedia.com" -> "en.wikipedia.org" for all possible "en". I worry slightly about some kind of reflection attack where someone uses a crazy/illegal hostname, though. Could opt to just sanitize that it's a short string of reasonable characters, e.g. [-a-z0-9]{1,10} or something like that)
[15:18:43] we have some latitude, I think, to cut a few corners if there are weird existing edge cases that don't make sense that we want to trim.
[15:19:16] wikipedia.com in previous surveys was far and away the most popular non-canonical (like, probably more traffic than all the rest combined), and even its traffic rate is very very tiny.
[15:19:36] so there's only so much reasonable effort one should put into a non-canonical domainname that gets like 3 lookups a day or whatever.
[15:20:43] so far I want to get rid of specific Apache things in redirects.dat like using %{TIME_YEAR} - https://gerrit.wikimedia.org/r/c/operations/puppet/+/513077
[15:20:57] and now those 3 wildcards
[15:23:02] yeah seems reasonable
[15:24:33] what's up with:
[15:24:35] rewrite en.wikipedia.com //en.wikipedia.org
[15:24:35] rewrite en.wikipedia.com //en.wikipedia.org
[15:24:38] rewrite *.wikipedia.com //*.wikipedia.org
[15:24:45] sorry, double-paste on the first line
[15:24:53] but doesn't the wildcard redirect already cover the en case?
[15:25:46] anyways
[15:26:18] eventually all the NCs will get out of that file, with the ncredir service handling them and mediawiki's apache no longer having knowledge of them.
[15:30:20] yeah, *.wikipedia.com covers the en.wikipedia.com one
[15:31:52] I'm really looking forward to the clarity this will bring to a bunch of weird situations
[15:32:15] so we should keep two redirects.dat files, one to be handled closer to the users by the ncredir service, and another one for mediawiki's apache
[15:32:29] people are already used to the file format, and it's pretty easy to handle
[15:32:33] ah, I didn't realize you were actually parsing redirects.dat as input to the new stuff
[15:32:44] yeah, perhaps splitting it in two makes sense
[15:32:52] yeah.. _joe_ tricked me into it
[15:32:54] ok
[15:32:58] (with some good reasons)
[15:33:06] so I'm hacking compile_redirects.rb to support nginx
[15:33:17] I'm out of sanity already, but it's almost done :)
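As an illustration of the wildcard bucket (2) discussed above, a minimal nginx sketch (hypothetical, not the real compile_redirects.rb output) that re-uses the leading label while restricting it to the suggested character class:

    # Hypothetical handling for *.wikipedia.com in the ncredir style:
    # capture a short, sanitized leading label and re-use it in the target.
    server {
        listen 443 ssl;
        server_name ~^(?<lang>[-a-z0-9]{1,10})\.wikipedia\.com$;
        # placeholder certificate paths, assumed for the sketch
        ssl_certificate     /etc/ssl/ncredir/wildcard.wikipedia.com.crt;
        ssl_certificate_key /etc/ssl/ncredir/wildcard.wikipedia.com.key;
        # hostnames not matching the regex never select this block, so no
        # unsanitized label can be reflected into the Location header
        return 301 https://$lang.wikipedia.org$request_uri;
    }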
[15:33:36] but yeah, once we get past this quarter's stuff, and then do some followup cleanups, etc... the world I'd like us to be inhabiting looks different in these key ways as a result:
[15:34:16] 1) Everything that maps to text-addrs in DNS (points at the text cache cluster) is within the canonical domains, because all the others are now pointing off at this redirect service IP
[15:34:56] 2) varnish-fe VCL related to HTTPS redirect + HSTS enforcement can stop having domain regexes, because all traffic that lands there is canonical and the rules always apply
[15:35:47] 3) Our current nginx terminator config (or equivalent in the future under ATS, whatever) can bounce port 80 in a very simple way (any traffic coming in on port 80 just gets an immediate same-host 301 to https, without parsing for special cases)
[15:36:49] in our current world, once (3) is done the redirection parts of the VCL in (2) can go away. But yeah, I'm not sure on the details of how this maps to the 3 stages of the ATS deployment. Similar in some sense, I'm sure.
[15:37:37] or worst case, we have the ATS TLS terminator only listening on 443 and not 80, and put some ridiculously-simple separate port-80 listener in place
[15:38:10] (its entire configuration is static and simple, it could still be nginx or whatever. It just listens on 80 and does a same-host 301 for any hostname to https:// (or a 403 for non-GET/HEAD))
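That static port-80 bouncer could plausibly be as small as this nginx sketch (an illustration of the behavior just described, not a deployed config):

    # Catch-all port-80 listener: no backends, no per-domain special cases
    server {
        listen 80 default_server;
        server_name _;
        # refuse anything that isn't a simple read request
        if ($request_method !~ ^(GET|HEAD)$) {
            return 403;
        }
        # same-host permanent redirect to HTTPS for everything else
        return 301 https://$host$request_uri;
    }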
[15:44:09] I love how nginx doesn't complain about anything... ~^(.+)\.wikipedia\.com$ $scheme://$1.wikipedia.org$request_uri;
[15:44:36] it just works
[15:44:42] :)
[15:45:03] yeah, it's not our ideal solution for proxying anymore, for a variety of reasons
[15:45:14] but as a backendless redirect server, its config is pretty nice!
[15:45:17] 10Traffic, 10ExternalGuidance, 10Operations, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10KartikMistry) @BBlack Can you please review https://gerrit.wikimedia.org/r/506043 ?
[15:46:00] and if $1 is not needed, then it becomes even nicer: *.wikijunior.net $scheme://en.wikibooks.org/wiki/Wikijunior;
[15:49:21] 10Traffic, 10ExternalGuidance, 10Operations, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10BBlack) Done. Are we ready to deploy it already or blocked on other MW-level deploys still?
[15:49:28] BTW, those ExpiresByType in https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/lib/puppet/parser/functions/compile_redirects.rb#L289-L297
[15:49:37] are completely useless, right?
[15:50:01] or is the redirector VirtualHost actually serving something?
[15:50:31] yeah, that's really oddball
[15:50:45] in any case, I don't think you need to parse/emit anything like that into your nginx config
[15:50:56] but setting the CC header might be useful
[15:51:09] Header set Cache-control "s-maxage=86000, max-age=0, must-revalidate"
[15:51:22] hmmm, although now I wonder why it's set that way
[15:51:41] they're basically telling UAs to never cache the redirects, but letting varnish cache them
[15:52:11] (although if it's a 301, the meaning of the status code is permanent, which would imply cacheability in a completely different sense)
[15:52:43] anyways, there's no local cache for s-maxage to apply to in this case
[15:53:14] but IMHO it makes sense to let UAs cache it for some time, in case it has any effect on the UA (I think most UAs take the 301 "permanent" pretty literally anyways)
[15:54:45] maybe just set a conservatively-timed one that's open to public caching, like "Cache-control: max-age=3600"
[15:54:55] (for all cases, as a static part of the nginx config I mean)
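In nginx terms that could be one static header on the redirect responses; a sketch (hostnames are hypothetical examples, and the 3600s value is just the conservative guess from above):

    server {
        listen 80;
        server_name wikipedia.com www.wikipedia.com;  # hypothetical example hostnames
        # modest, publicly-cacheable lifetime so UAs re-check the 301 periodically
        # instead of treating "permanent" literally (add_header applies to 301 by default)
        add_header Cache-Control "max-age=3600";
        return 301 https://www.wikipedia.org$request_uri;
    }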
[15:56:15] top answer here is informative:
[15:56:16] https://stackoverflow.com/questions/9130422/how-long-do-browsers-cache-http-301s
[15:57:10] "standards"
[15:57:12] TL;DR - with no other headers to go on, modern UAs will cache a 301 indefinitely (until a manual clear, or it's pushed out of cache to make space, or a conflicting 301 back into the source makes it re-fetch to check the loop)
[15:57:37] but putting in a "CC: max-age=3600" will probably make at least some browsers re-check it periodically
[15:58:41] the real issue at the root of all of this is the hubris of whoever decided a protocol like HTTP could declare something "Permanent" in a world controlled by humans and corporations who can never predict the future or control their own destinies, but whatever :P
[16:01:18] <_joe_> bblack, vgutierrez I had a question about the redirects service
[16:01:25] <_joe_> why on ganeti and not on kubernetes?
[16:01:37] <_joe_> I can see /one/ reason
[16:01:46] <_joe_> but I'd like to hear from you
[16:07:36] well, a few reasons of varying types:
[16:08:29] 1) AFAIK we're not planning to directly expose k8s to a public IP for most services within it? at least not production ones, so this would be the exception (the redirect service is meant to be an entirely separate public IP from our cache endpoints and not involved with them)
[16:09:00] 2) We have more experience with ganeti, and this is supposed to go live by EOQ.
[16:09:00] <_joe_> oh ok, indeed, I got it the other way around
[16:09:34] <_joe_> so 2 is kinda moot, we have experience with kubernetes, but 1) is a real issue indeed. Also, my reason was
[16:09:40] 3) We have plans in play to get small ganeti clusters at every edge DC. So in theory, if we desire, we could also put the redirector at all the edges too (although the initial deployment is just here in the core for now, as these aren't so perf-sensitive)
[16:09:43] <_joe_> if we're going to have ganeti clusters in the PoPs
[16:09:45] <_joe_> ahah ok
[16:10:07] <_joe_> still, not having traffic spreading would be nice
[16:10:26] <_joe_> heh, I mean having pop-level redundancy
[16:10:34] yeah
[16:10:37] <_joe_> this thing is going to get like 10 requests/day or so?
[16:10:43] <_joe_> I never looked at the numbers
[16:11:01] yeah, I looked once but don't remember them clearly. it's a tiny fraction of real reqs in any case.
[16:11:12] <_joe_> anyways, thanks, I'm really off now!
[16:11:33] but sometimes some of them are temporarily popular, because someone who doesn't understand why they shouldn't decides it's a great marketing idea to print a non-canonical domainname on a sticker or poster at a wiki event, etc :P
[16:13:26] the whole rationale and understanding of "canonical" vs "non-canonical" domainnames, and all the various whats and whys, and the rules we should be pushing the rest of the org/movement to adopt (i.e. don't publicize non-canonicals), is all a fascinating sub-area of the Dublin offsite's proposed topic about DNS and domainnames.
[16:13:28] FYI I added a doc section about mailbox lag on https://wikitech.wikimedia.org/wiki/Varnish#Varnish_mailbox_lag after struggling to figure out what to do yesterday
[16:13:38] feel free to edit it if it's incorrect
[16:14:32] XioNoX: it's overly simplistic and potentially misleading in some scenarios, but there is no simple bulletproof workflow for diagnosing this stuff to correct it with, either.
[16:15:02] some things just aren't easy, because software is awful, and ridding ourselves of awful software takes time and resources
[16:15:03] yeah, I put a big warning above it
[16:15:17] you can remove it too
[16:15:22] (the section I mean)
[16:15:30] eh, may as well leave it, it's better than nothing!
[16:16:08] In this past minute or two I've thought of several possible additional sentences or amendments, but I end up thinking they only make it more confusing, not less.
[16:17:45] (I'm still of the general opinion that true "mailbox lag" problems that varnish causes on its own without external impetus are rare and well-managed. They're much more likely on cache_upload than on cache_text, and in either case extenuating circumstances (ugly external traffic, or failing/slow internal services) can both mimic the same symptoms, or eventually put varnish into a bad enough state
[16:17:51] to trigger an unnatural occurrence of the real mailbox lag problem)
[16:20:22] in the case from yesterday, when I looked later, I tend to think it's that latter case, but the response is still the same
[16:20:56] the pattern of the 3 misbehaving backends, all in cache_upload, looks like something external or internal was triggering it, but the result was bad varnishes that needed restarts to get out of their messed-up states regardless.
[16:21:30] (but if the bad other behaviors had continued or increased, probably we would've seen the problem continue to cascade and re-appear around the cluster)
[16:24:08] yeah, a restart looks like a safe first step when those symptoms appear
[16:24:29] then if that doesn't solve it, better call someone who knows better
[16:25:51] yeah, it's the "first step" part that is maybe the most dangerous, though
[16:26:35] if the workflow is "see 5xxs, look for mailbox lag, restart varnishes", we're kinda missing the step where we look for other causes (bad incoming traffic, misbehaving services behind it, etc)...
[16:27:00] which in some sense is fine: "try this easy thing first, and if the problems persist look elsewhere" is reasonable
[16:27:22] except that the easy thing here also wipes a large fraction of the cache's contents, which can make some situations much worse
[16:28:03] (e.g. if the 503s on cache_upload were really due to a slow/unresponsive thumbor/swift, wiping out cache contents causes more misses into them and makes it worse)
[16:28:39] in yesterday's scenario, there were 3 varnishes that got restarted, all on reasonable evidence, and it turned out to be the right move at the time.
[16:28:48] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul)
[16:29:04] but we lost 27% of the cache_upload contents in esams as a result, and had it not been the right move, it could've also caused a cascade of other problems.
[16:30:19] I'm not knocking anyone's efforts to deal with these intractably-difficult issues. The point I'm making is just that none of this reduces easily to a simple workflow anyone can execute with confidence :/
[16:31:44] makes sense
[16:33:27] going back to yesterday: another key clue in that scenario was that it was only affecting esams: there wasn't a corresponding cache_upload 5xx spike in the core DCs or the other remote sites
[16:33:49] which tends to support the idea that it wasn't an internal service problem, and was probably induced by an incoming traffic pattern hitting esams
[16:35:32] (or a "natural" uninduced problem with particular cache_upload backends in esams is a possibility at that point in the analysis as well, but they don't tend to fail in rapid succession like that naturally. The time split on the first two was short (healthcheck failover of chashed stuff? I dunno), and the 3rd one was possibly in response to the first depool reactions by humans, but I didn't
[16:35:39] stare at the timeline enough to verify that)
[16:35:41] maybe add it as a condition in the small blob I added to the wikipage?
[16:36:22] I could add pages of asterisks to it like that, but at the end of the day it's not going to help someone execute a reasonable plan quickly in an emergency without first spending months or years understanding the nuances heh
[16:36:35] ok :)
[16:37:16] I know it sucks as an answer, but the right answer is "get rid of varnish, especially at the backend layer, and especially for cache_upload"
[16:37:43] and we're tantalizingly close to removing the cache_upload varnish backends globally; I suspect it will be over with sometime during Q1.
[16:37:51] perfect :)
[16:38:31] speaking of which, I should merge my cp3044 patch and reimage another esams cache_upload backend away from this mess
[16:48:35] 10netops, 10Operations, 10ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) console information: scs-a1-codfw port 40
[18:39:07] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3044.esams.wmnet'] ` The log can be found i...
[19:04:46] bblack: I have a puppet merge conflict
[19:05:04] BBlack: cache: reimage cp3044 as upload_ats (dcb0c1fbf9)
[19:05:52] my change should be no-impact, so feel free to merge it with yours
[19:06:07] (and ping me so I can check it's all fine)
[19:06:11] ah nice
[19:06:21] I forgot to merge, and then reimaged :)
[19:06:51] XioNoX: merged both
[19:07:12] thanks!
[19:07:36] the reimage went surprisingly fast for esams, too
[19:07:43] hopefully the next will as well :)
[19:08:01] I know I saw some puppet changes flying by lately about EFI installers and such, but I didn't pay much attention. possibly related?
[19:08:27] I hope the question is not for me :)
[19:08:53] nope
[19:10:44] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3044.esams.wmnet'] ` The log can be found i...
[19:11:32] 10netops, 10Operations: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Next step is to configure the RPKI validators on one router (e.g. cr4-ulsfo): `lang=diff [edit routing-options] + validation { + group rpki { + session 10.64.32.19 { + port 3323; +...
[19:32:40] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) >>! In T221507#5219523, @faidon wrote: > - The cr1-eqsin serial change is a bit odd. Netbox used to have a record of wh...
[19:47:05] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi)
[19:47:50] 10netops, 10Operations: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) 05Open→03Resolved Everything seems back to normal. Please reopen if the same issue happens again and we will proceed with an RMA.
[19:50:13] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3044.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3044.esams.wmnet'] `
[20:12:15] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis) a:03Ottomata Andrew, can you (or someone else) advise on rolling out this change for Analytics? I think the minimal viable thing is havin...
[20:14:48] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10BBlack) The failed reimage was finished up manually (probably not the reimager's fault)
[20:15:54] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) @CDanis we are currently at an offsite so this needs to wait until next week :) I'll bring this up tomorrow to my team!
[22:20:25] bblack: any issue with reimages? (disclaimer: I didn't read backlog)