[05:59:08] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) 05Open→03Resolved Update now uses revision IDs everywhere for non-lagged fetches.
[08:29:03] 10Traffic, 10Operations, 10ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10fgiunchedi) >>! In T222620#5164117, @CDanis wrote: >>>! In T222620#5163577, @ema wrote: >> Interestingly, there was a memory usage spike right before the host crashed. >> >> {F28951427} > > I think that is j...
[09:08:20] mailbox lag on cp5008 went from 0 to 10 million in 10 minutes
[09:08:22] not bad
[09:12:45] due to be cron-restarted today at 20:12, let's see how it behaves
[09:47:23] <_joe_> ema: is lvs1006 the backup in eqiad?
[09:48:34] <_joe_> bgp-med = 100
[09:48:35] <_joe_> yup
[09:48:54] right, it is
[09:50:03] lvs_eqiad_backup:
[09:50:03]   __regex: !ruby/regexp /^lvs100[4-6]\.wikimedia\.org$/
[09:50:03]   profile::pybal::primary: false
[09:50:54] <_joe_> ew
[09:50:56] <_joe_> :P
[10:20:47] yeah...
[10:20:49] HW issues :)
[12:30:55] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10jbond) Is there any further action for this ticket or can we close it?
[12:59:40] ema, vgutierrez - there was a spike in 503s from esams upload caches afaics
[13:00:22] esams? nothing related to the prometheus migration?
[13:00:49] from https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend it seems cp3039 related
[13:04:29] still ongoing though
[13:04:31] ack
[13:05:42] interesting, the varnish backend on cp3039 was restarted ~34 mins ago
[13:06:44] vgutierrez: so to clarify, it is still ongoing, not a spike
[13:07:03] yeah, that's the cron saving varnish from mbox lag
[13:07:56] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=esams&var-cache_type=upload&var-status_type=5&from=now-3h&to=now
[13:08:06] yeah still ongoing at a mild rate, odd
[13:08:35] ubbtibvgfnurrunekdkidkitnvjcrrhf
[13:09:29] hmmmm yubi causing other eyboard issues :P
[13:09:39] :)
[13:09:42] very odd
[13:10:01] hmmm all fine now
[13:10:15] but for like 30-45 seconds there after the accidental yubikey tap
[13:10:32] I couldn't use the letter k on my keyboard, nor a few of the meta-keys I use for window/screen switching
[13:11:18] I'm sure something odd with usb and the virtual keyboard of the yubi vs the real one, etc
[13:13:30] those remaining 503s seem to be coming from all esams, and from the frontends
[13:13:55] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10herron) 05Open→03Resolved a:03herron Ready to resolve afaict!
[13:14:44] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10GedHaywood) Do those IPv6 addresses actually send any mail? If not they can be deleted.
[13:15:12] https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload
[13:16:10] only cp3035 affected according to varnishospital it seems
[13:16:10] yeah so something's going on with upload@esams for sure, but it's broad
[13:16:31] https://logstash.wikimedia.org/goto/d938a755dd5a26aeefcd8b073b1a6721
[13:16:35] ema: you mean, the esams fe's are all having trouble with cp3035-be?
[13:17:10] bblack: that I don't know, but surely cp3035 has troubles
[13:17:11] yeah, random look at varnishlog seems to indicate that too
[13:17:27] (on 3039 fe, the fe internal 503s are "no backend connection" and trying to use 3035
[13:17:29] due to be cron restarted today at 8PM
[13:17:30] )
[13:17:42] bblack: ok to varnish-be there?
[13:17:51] *restart
[13:18:26] yes, please
[13:18:33] I think it actually is storage-related here
[13:19:00] even the cp3035->eqiad reqs that are succeeding, have ExpKill log lines in them etc
[13:20:18] restart done
[13:21:20] bin1 is my guess at the one that's falling over first lately
[13:21:41] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10herron) >>! In T221288#5167063, @GedHaywood wrote: > Do those IPv6 addresses actually send any mail? Yes, these are the IPv6 addr...
[13:21:59] it was the 2nd largest of the bins based on whatever long-ago stats we took to come up with that
[13:22:00] thanks a lot ema, 503s seems down now
[13:22:23] but maybe better compression and/or the webp stuff and/or who-knows-what-else-in-all-that-time have shifted the size stats smaller a bit
[13:22:28] elukey: ty!
[13:22:44] oh 3rd largest
[13:22:56] but in random log peeks, bin1 seemed to come up a lot...
[13:23:18] looking at varnish-machine-stats it seems that things started going very bad around 12:50
[13:23:50] bin size percentages: 0:4% 1:23% 2:40% 3:27% 4:6%
[13:25:02] and the bin cutoffs are 16K 256K 4MB 64MB ∞
[13:25:36] I'm guessing esams is our next target anyways in ~June though, may not be worth re-optimizing :)
[13:26:34] (unless we accelerate plans!)
[13:57:33] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10herron) 05Open→03Resolved
[14:25:04] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10GedHaywood) SPF works on the information in the SMTP envelope. If mx1001.wikimedia.org has only the IP addresses 208.80.154.76 an...
[14:58:08] 10netops, 10Operations, 10observability, 10Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (10ayounsi)
[14:58:23] 10netops, 10Operations, 10observability, 10Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (10ayounsi) 05Open→03Resolved https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ospf
[15:07:58] bblack: I need to discuss some stuff related to ssl_ciphersuite()
[15:08:21] mainly how to provide support for ATS there
[15:08:58] ATS inbound TLS settings are configured via a custom type: https://github.com/wikimedia/puppet/blob/production/modules/trafficserver/types/inbound_tls_settings.pp
[15:09:21] so I'd like to avoid injecting plain text from ssl_ciphersuite() in ATS config files
[15:13:19] it's what we do everywhere else, for better or worse
[15:13:27] what does the custom type have to do with it?
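The storage-bin stats quoted above (cutoffs at 16K, 256K, 4MB, 64MB, then unbounded, with bin1 holding ~23% of objects) imply a simple size-to-bin lookup. As a minimal illustrative sketch only (not the actual VCL/puppet logic, and the function name is hypothetical):

```python
# Illustrative sketch of Varnish storage "bin" selection by object size,
# using the cutoffs mentioned in the log: 16K, 256K, 4MB, 64MB, infinity.
# Hypothetical helper, not the production configuration.

BIN_CUTOFFS = [16 * 1024, 256 * 1024, 4 * 1024**2, 64 * 1024**2]

def storage_bin(size_bytes: int) -> int:
    """Return the bin index (0-4) for an object of the given size."""
    for i, cutoff in enumerate(BIN_CUTOFFS):
        if size_bytes < cutoff:
            return i
    return len(BIN_CUTOFFS)  # bin 4: everything >= 64MB

# A ~1MB thumbnail lands in bin 2, the bin the old stats put at ~40% of
# objects; the suspicion above is that sizes have since shifted smaller.
```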
[15:13:42] in general ssl_ciphersuite needs lots of updates, and isn't very general-case
[15:15:35] so our ATS puppetization expects a list of ciphersuites on Inbound_TLS_settings[common][cipher_suites]
[15:17:19] same thing for various TLS settings like the dhparams file
[15:17:40] so I don't know if I should skip ssl_ciphersuite() right now
[15:18:28] yes, but why?
[15:18:58] it's not like the cipher list is structured data we'd manipulate within puppet. it's just a string we have a few possible values for
[15:19:18] or is it that the final config output template needs a different format than the :-delimited one?
[15:20:36] so regarding the cipher suite list, ATS has one setting for TLS1.3 ciphersuites and another one for non TLS1.3 ciphersuites
[15:20:43] ok
[15:21:11] relevant side-note: the 1.3-specific ones we tacked into ssl_ciphersuite (long long ago before the software was ready) are wrong anyways and probably should be removed heh
[15:21:26] I haven't kept up with openssl's side of the TLS1.3 + ciphersuite preferences debate
[15:21:39] last I heard, openssl didn't even want 1.3 ciphersuite ordering to be configurable
[15:23:01] https://github.com/openssl/openssl/issues/541 <- that was the old issue I was arguing in
[15:23:27] and https://github.com/openssl/openssl/issues/5050
[15:25:44] anyways
[15:26:12] like I said, I haven't kept up. What does openssl actually allow to be configured about tls1.3 anyways and how does ATS handle that? I assume it's all more or less sorted out in some sense by someone's standards by now
[15:27:49] so right now regarding TLS 1.3 we can set a list of cipher suites
[15:27:56] and enable or disable TLS 1.3 support
[15:28:45] but there's no ordering I guess, so it's just allow-or-disallow?
[15:28:52] (cipher suites I mean)
[15:31:17] there is order as well
[15:32:16] according to https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_ciphersuites.html
[15:32:28] SSL_CTX_set_ciphersuites() is used to configure the available TLSv1.3 ciphersuites for ctx. This is a simple colon (":") separated list of TLSv1.3 ciphersuite names in order of preference.
[15:32:48] (that's the API used by ATS to set the TLS1.3 ciphersuite list)
[15:40:41] ok
[15:40:48] so we do have order, just not equalpref, so that's ok
[15:41:11] so, yeah, having pondered this a bit while doing a few other things (hence my long pauses)
[15:41:51] yeah, let's just skip ssl_ciphersuite, and set what we want (matching existing public cache terminator tls1.2 settings, and a TLS1.3 that uses the 3x we want in the order we want.
[15:42:12] because realistically we're not using ATS to TLS-terminate any other case but our public terminators, so there's little value in sharing, etc
[15:42:40] for TLS1.3 basically we want to mirror what we're doing in TLS1.2's high list, in the form 1.3 expects
[15:42:55] (so 3 ciphers, chapoly then aes256 then aes128)
[15:43:04] at least that would be our baseline for "don't change anything but the protocol"
[15:43:56] you could make the argument that, given what flexibility openssl gives or doesn't give us in this space, we'd rather limit to those 3 but leave preference entirely to the client, but I think we can tackle discussion on that kind of thing Later.
[15:44:20] yeah, I just want to make a "MVP" release of ats-tls ASAP
[15:44:26] right
[15:44:32] that will skip some niceties like X-Connection-Properties
[15:44:34] and TLS1.3
[15:44:49] I need to hack ATS to be able to provide X-Connection-Properties
[15:45:00] I imagine a small bit of lua can do it
[15:45:05] not right now
[15:45:14] ah they don't expose the data?
[15:45:23] yeah, that part is what's missing
[15:45:29] but they allow you to log that data
[15:45:33] well anyways
[15:45:41] so I've been checking that code and figuring out how to provide it to the lua plugin
[15:45:48] also to the rewrite-header plugin
[15:46:03] but I don't want to block the ats-tls initial tests because of that
[15:46:42] in the long view, we can go ATS-specific with our ATS TLS settings (and not ssl_ciphersuite), and then once the ATS-TLS transition is done traffic-edges won't be using ssl_ciphersuite anymore, so it can be refactored in that light (and dump most legacy things and simplify, etc, since it's only being used for non-cache public endpoints technical in nature, and internal TLS)
[15:47:59] right
[15:49:01] so hopefully at some point next week I'll torture ema to spawn an instance using ats-tls instead of nginx
[15:49:32] now that the main blockers have been merged (multiple instance support + multiple prometheus-trafficserver exporters)
[16:24:14] 10netops, 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 (10cwdent) 05Open→03Resolved
[17:49:53] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10BBlack) Putting this here for lack of a better place, for future reference: In the TLSv1.2 (and below) world, we've gone with a static preference on symmetric ciphers of ChaPoly -> AES256 ->...
[17:50:26] 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review, 10User-revi: Change kr.wikimedia.org redirection destination - https://phabricator.wikimedia.org/T222033 (10Dzahn) 13:46 < mutante> revi: it has been applied on mwdebug1001 now if that helps. it will be deployed to all others within 30min...
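The TLS 1.3 baseline discussed above (mirror the TLS 1.2 high list: chapoly, then aes256, then aes128) boils down to building the colon-separated preference string that OpenSSL's SSL_CTX_set_ciphersuites() expects. A minimal sketch of that string, using OpenSSL's standard TLS 1.3 ciphersuite names (this is an illustration of the format, not the production puppet/ATS configuration):

```python
# Sketch: the TLS 1.3 ciphersuite preference string mirroring the TLS 1.2
# "high" ordering (ChaPoly -> AES256 -> AES128), in the colon-separated
# form SSL_CTX_set_ciphersuites() takes. Names are OpenSSL's standard
# TLS 1.3 ciphersuite identifiers; the helper function is hypothetical.

TLS13_PREFERENCE = [
    "TLS_CHACHA20_POLY1305_SHA256",  # chapoly first
    "TLS_AES_256_GCM_SHA384",        # then aes256
    "TLS_AES_128_GCM_SHA256",        # then aes128
]

def tls13_ciphersuite_string(prefs=TLS13_PREFERENCE) -> str:
    """Join ciphersuite names in server-preference order, ':'-separated."""
    return ":".join(prefs)
```

Note that, per the discussion above, this gives ordered preference but no equal-preference groups, unlike the older TLS 1.2 cipher-string syntax.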
[17:51:03] 10Wikimedia-Apache-configuration, 10Operations, 10User-revi: Change kr.wikimedia.org redirection destination - https://phabricator.wikimedia.org/T222033 (10revi)
[21:14:05] 10Traffic, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (10mobrovac) @ema since the pkg has been uploaded, are we now good here? Ok to resolve the task or is there s...
[22:52:11] bblack: the amount of TCP retransmits from eqiad/codfw recursors suddenly increased: https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=18&fullscreen&orgId=1&from=1557337862628&to=1557353490863 I don't think it's causing any issues, but it's odd at best
[22:57:47] hm, looks like it's due to a drop in the overall amount of tcp sent https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&panelId=8&fullscreen&var-server=dns1002&var-datasource=eqiad%20prometheus%2Fops&from=now-6h&to=now
[22:58:28] interesting that is ~2h after the DYNA/CNAME change and I don't see anything anomalous around that time on the https://grafana.wikimedia.org/d/000000399/dns-recursors dashboard (for both eqiad and codfw)
[22:58:30] while the retransmits are staying stable https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&panelId=15&fullscreen&var-server=dns1002&var-datasource=eqiad%20prometheus%2Fops&from=now-6h&to=now
[23:00:39] overall UDP is the same, only TCP dropped by ~1/4