[05:03:31] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3219947 (10Marostegui)
[07:09:00] bblack, ema, now that row-d is ready, let me know how I can help with https://phabricator.wikimedia.org/T150256
[07:20:23] 10Traffic, 06Operations, 10ops-eqiad: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#3220052 (10ema) Thanks @Cmjohnson!
[07:44:28] 10netops, 10Monitoring, 06Operations: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#2779751 (10ayounsi) For the option "cr1 always backup, cr2 always master" [[ https://github.com/dnsmichi/manubulon-snmp/blob/master/plugins/check_snmp_vrrp.pl | This script ]] does exactly that. Success (...
[08:06:33] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220073 (10akosiaris) Great! I'll start undoing some of the preparatory works, that is * repool puppetmaster1002 * switchover oresrdb.svc.eqia...
[09:03:15] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3220182 (10akosiaris)
[09:03:41] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724981 (10akosiaris)
[09:03:42] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3193640 (10akosiaris) 05Resolved>03Open And T148506 is done, re-opening and switching back
[10:05:30] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3220324 (10elukey) 05Open>03Resolved I'd love to do it anyway, but Chris is super busy and this is only a "good to have" for the moment, so...
[10:05:32] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220326 (10elukey)
[10:24:36] 10Traffic, 06Operations, 10RESTBase, 10RESTBase-API, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2557878 (10TheDJ) Can someone be so kind to document on mediawiki.org how to configure this ? Many people there are interested in running electron on...
[10:43:38] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3220415 (10akosiaris)
[10:43:41] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 and back (after T148506) - https://phabricator.wikimedia.org/T163326#3220413 (10akosiaris) 05Open>03Resolved And switched back, re-resolving. Thanks!
[12:16:37] ema: storage stuff looks nice. most stats are similar-ish vs last restart, but a few seem improved
[12:17:07] bblack: yeah, it looks good :)
[12:17:36] we've still got the RT experiment running on cp2024 FTR
[12:18:19] ok
[12:18:28] with both options turned on via config I guess?
[12:18:33] yep
[12:33:46] nuked objects rate, for example, seems lower than usual at this point after restart
[12:33:48] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=20&fullscreen&orgId=1&var-server=cp2002&var-datasource=codfw%20prometheus%2Fops&from=now-2d&to=now
[12:47:58] at least less spiky, more natural
[12:48:37] a lot of those stats diffs that I've peered around at, intuitively they look "saner", like things are operating a bit more smoothly wrt how nukes and allocations are handled, etc
[12:49:54] hitrate seems to be slightly better than usual, all other patterns (restarts, etc) considered, but there's a lot of noise in seeing that
[12:50:13] time will tell :)
[12:50:50] yeah! OK to build/upload 4.1.6 meanwhile?
[12:57:32] yeah
[13:29:10] I'm already getting unnecessarily excited about the upcoming TLSv1.3 stuff :)
[13:29:39] basically we'll wait for an official openssl-1.1.1 release, and pair that with nginx-1.13.x, when both look about right on stabilization
[13:29:55] and make some tweaks to our ssl_ciphersuite() outputs to support it properly
[13:49:25] openssl 1.1.1 was scheduled for May, should not take that long
[14:12:47] bblack: what is the gdnsd behaviour if 2 different A records have the same name?
[14:13:48] different IP, same name
[14:18:02] it's going to hand out both IPs together
[14:18:24] that's not really gdnsd-specific, it's just how DNS data is, structurally
[14:18:50] DNS data (in the very real protocol / data-structure sense) is composed not of zones containing records
[14:18:53] that's what I thought, but there was an error in the config and db1061.eqiad.wmnet. was listed twice and dig was showing randomly one or the other
[14:18:57] I was expecting both
[14:19:00] but of zones containing "rrsets" (record sets) which contain records
[14:19:40] records sharing a name (left hand label) plus data type (e.g. "A") comprise an rrset, and the rrset is basically an indivisible unit for most purposes
[14:20:33] sorry, I'm super dumb today, I said it completely wrong
[14:21:14] the record was one, the reverse ones were two; also it seems that our linter/checks didn't catch it
[14:22:08] * volans checking too many things at once, sorry for the misinformation
[14:22:12] ok
[14:22:21] so you had a scenario like:
[14:22:30] 100 IN PTR db1061.eqiad.wmnet.
[14:22:37] 200 IN PTR db1061.eqiad.wmnet.
[14:22:38] templates/10.in-addr.arpa:227 1H IN PTR db1061.eqiad.wmnet.
[14:22:39] right?
[14:22:41] templates/10.in-addr.arpa:14 1H IN PTR db1061.eqiad.wmnet.
[14:22:42] yes
[14:22:53] and
[14:22:54] db1061 1H IN A 10.64.32.227
[14:22:58] so, that doesn't affect forward resolution at all
[14:23:11] what was the odd dig output you were getting about randomly one or the other?
[14:23:19] from dig run on puppetmaster1001, 1s apart from each other
[14:23:26] what's the exact command and outputs?
[14:23:32] db1061.eqiad.wmnet. 1133 IN A 10.64.48.14
[14:23:35] db1061.eqiad.wmnet. 3062 IN A 10.64.32.227
[14:23:53] what was the command?
[14:24:01] I'll put it in a paste: dig db1061.eqiad.wmnet
[14:24:07] ok
[14:24:21] did the forward resolution change in the past hour before you checked?
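A minimal sketch of the rrset point above, using only Python's standard library: when one name carries two A records they form a single rrset, and an ordinary lookup hands back both addresses together. The hostname is just the one from the discussion; this is illustrative, not part of any actual tooling here.

    import socket

    # resolve the name and print every IPv4 address in the A rrset, not just one
    infos = socket.getaddrinfo('db1061.eqiad.wmnet', None,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    for info in infos:
        print(info[4][0])   # sockaddr tuple is (address, port); print the address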
[14:24:48] probably yes
[14:25:10] https://phabricator.wikimedia.org/P5345
[14:25:24] so the only odd thing there really is that dns caches update asynchronously
[14:25:44] which is kind of "normal", but it also points to perhaps something undesirable in our recdns arch, too
[14:28:46] right, so two unrelated things, the double PTR, that maybe we could detect with the linter to avoid human errors
[14:29:14] and the cache that is not invalidated when we change a record
[14:29:54] doing dig @{baham,...} of course works as expected
[14:32:01] sorry to bother, I've explained it really badly in the first place :)
[14:36:23] 10Wikimedia-Apache-configuration, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3220953 (10daniel) 05Open>03Resolved a:03daniel This RFC has been approved after final call for comment on April 2...
[14:37:10] 10Wikimedia-Apache-configuration, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3220959 (10daniel)
[14:48:42] volans: yeah they are two separate issues. The dns linter is actually gdnsd itself, but the problem is there's lots of legitimate reasons to have that double record (e.g. during transitions), so it's not really an error.
[14:48:57] tru
[14:48:58] true
[14:49:18] so on the other front, the bouncy resolution on switching....
[14:49:44] there's a few different lenses we can view that through, and varying definitions of what we can fix, what we should expect, etc...
[14:51:49] So, in general, if you change an authdns record with a 1H TTL, you can expect mixed responses on various caches and end-hosts for the next hour, that's kind of normal
[14:52:07] (in general meaning DNS in general, not WMF in general)
[14:52:21] yes, we should lower the TTL of that before switching for cases like this one
[14:52:33] however, *usually* the way things normally play out in the vast majority of cases everywhere
[14:52:35] and raise it back afterwards
[14:52:52] a given end-host tends to see it change once sometime during that hour and the change sticks, not bouncing back and forth like that
[14:53:21] there's no guarantee by DNS of that behavior, but things tend to work that way for a variety of normal reasons
[14:53:50] mmmh marostegui pasted this: https://phabricator.wikimedia.org/P5344
[14:53:51] because an end-host will have an /etc/resolv.conf with cache1, cache2, cache3 listed, but it will tend to stick to cache1 unless it fails (as opposed to rotating them randomly)
[14:54:21] and cache1 tends to have a singular consistent view of the response from authdns. So sometime during the 1H window that cache will update its record and change at a single point in time
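A small sketch of the kind of per-cache check mentioned above (dig @ each server): query each recursor listed in /etc/resolv.conf directly, to see which answer each cache currently holds. It assumes the dnspython library (2.x API; older versions use .query() rather than .resolve()), and the nameserver IPs below are placeholders, not the real recdns addresses.

    import dns.resolver

    # placeholder recdns IPs; in practice read them from /etc/resolv.conf
    caches = ['10.0.0.1', '10.0.0.2']

    for ip in caches:
        r = dns.resolver.Resolver(configure=False)   # don't pick up /etc/resolv.conf defaults
        r.nameservers = [ip]                         # ask this one cache directly
        answer = r.resolve('db1061.eqiad.wmnet', 'A')
        for rr in answer:
            print(ip, '->', rr.address, 'ttl', answer.rrset.ttl)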
[14:55:18] (re paste: we're still in the initial 1H window I think)
[14:55:44] yes, but it means that a single end-host didn't see the change once
[14:55:47] but was flapping
[14:55:53] I know
[14:56:28] 14:51 < bblack> So, in general, if you change an authdns record with a 1H TTL, you can expect mixed responses on various caches and end-hosts for the next hour, that's kind of normal
[14:56:51] and then I had several lines there talking about why you usually tend to see a single point in time for the change from any one host's perspective
[14:56:53] I was referring to: bblack| a given end-host tends to see it change once sometime during that hour and the change sticks
[14:56:57] I might have misunderstood this
[14:57:14] the important thing is the qualifiers like "generally" and "tends to"
[14:57:19] :D
[14:57:25] DNS makes no guarantees anywhere about these behaviors
[14:57:32] yes, I know
[14:57:34] from DNS's standpoint, your flapping response is acceptable and normal
[14:57:48] it's within the TTL, so yes
[14:58:11] so now we get into "why would we get unusual behavior here and see such obvious flapping"?
[14:58:22] and also "should we design other things to assume non-flapping?"
[14:59:00] the answer to the second question is no. regardless of what we fix about this situation, if we're relying on DNS not flapping during the TTL after a change, we're going to sometimes hit corner cases no matter what and get screwed. We have to assume it can flap during the TTL.
[14:59:17] sure, it's also how DNS works
[14:59:35] just being clear before we launch into the other part
[14:59:39] :)
[15:00:16] one way you could get this unusual behavior would be using some non-standard resolver that did randomly rotate through the available recdns servers in /etc/resolv.conf (as opposed to glibc which sticks with the first one unless there's a failure)
[15:00:36] because the two caches in /etc/resolv.conf could have different cached responses for most of the TTL window and then the end-host would see flapping
[15:01:02] another way you could get this behavior is if network traffic from the host to the caches is flaky, thus causing glibc to fail through the list routinely
[15:01:36] but in our case I think we're using glibc's resolver and the traffic isn't flaky, so it's something else. A singular recdns cache is actually providing inconsistent answers
[15:01:50] because we upgraded powerdns recently and changed the config
[15:02:20] and one of the ways the new version of powerdns scales is it runs multiple threads with independent caches
[15:02:30] gotcha
[15:02:48] so you're thinking the different threads have different values in cache, hence the flapping
[15:02:50] so every query to the singular recdns IP is going to end up using a different one of those independent caches randomly
[15:02:53] yes
[15:03:18] and while that's acceptable and it's kind of interesting to leave it that way to punish and expose faulty DNS assumptions elsewhere
[15:03:22] we should probably fix that :)
[15:03:46] :D
[15:06:06] quick question, are we using LVS for the recursors?
[15:06:25] templates/wikimedia.org:recursor0 1H IN A 208.80.154.254 ; eqiad LVS (dns-rec-lb)
[15:07:17] yes
[15:07:39] and doesn't this mean that we'll go to different recursors from the same host?
[15:08:15] hmmmm, probably kinda? :)
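A toy model of the flapping mechanism described above, not the actual LVS or pdns code: if each query can land on a different cache holding a different answer during the TTL window, a single client sees the value bounce, whereas a source-hash style pick (like the LVS "sh" scheduler discussed below) pins a given client to one cache, so it sees at most one change. The cache names and client IP are made up; the two addresses are the ones from the dig output earlier.

    import hashlib
    import random

    # two caches whose cached answers have diverged during the TTL window
    caches = {'cache-a': '10.64.48.14', 'cache-b': '10.64.32.227'}

    def pick_random(_client_ip):
        # round-robin / independent per-thread caches: each query may hit a different cache
        return random.choice(list(caches))

    def pick_source_hash(client_ip):
        # "sh"-style: hash the client address so the same client always hits the same cache
        digest = hashlib.sha256(client_ip.encode()).digest()
        return sorted(caches)[digest[0] % len(caches)]

    client = '10.64.0.10'
    print([caches[pick_random(client)] for _ in range(5)])        # likely mixes both answers (flapping)
    print([caches[pick_source_hash(client)] for _ in range(5)])   # always the same answer for this client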
[15:08:27] unless we have only 1 backend ofc :D
[15:08:36] we have two backends, and it is set to round-robin
[15:08:56] we could set it to "sh" and source-hash clients though and reduce that sort of flapping
[15:09:17] * volans was going to look exactly at that
[15:09:22] I'm not entirely sure; even with RR it may have some short-term memory about the client->server mapping for UDP that might already reduce that flapping
[15:09:37] so maybe pdnsd is not at fault here
[15:09:51] well yeah but we'd probably have had someone else ask this same question before
[15:09:58] the pdnsd change is the most recent change in all of this
[15:10:08] we don't change a host's IP very often though
[15:12:09] and I guess that all service-related IPs that change have much shorter TTLs
[15:13:42] yeah we're not using "one packet scheduling" for UDP DNS, which is relatively new ipvs stuff
[15:13:51] so I think it does have at least a short-term memory about host->recdns mapping
[15:15:23] yeah
[15:15:25] UDP 04:45 UDP 10.192.32.78:41748 208.80.153.254:53 208.80.153.42:53
[15:15:42] ^ some raw table outputs, 04:45 is apparently a 5-minute countdown on the entry, but it's per source port heh
[15:15:49] so probably effectively random
[15:16:11] I don't know that "sh" is a great option either, but maybe
[15:16:57] we need to think ahead about how this works out with anycast too
[15:17:03] right
[22:00:30] 10Wikimedia-Apache-configuration, 06Operations: https://test.wikipedia.org/wiki/Bug%3F?action=history doesn't show the history page, unlike https://test.wikipedia.org/w/index.php?title=Bug%3F&action=history - https://phabricator.wikimedia.org/T123276#3222527 (10Krinkle)
[22:16:59] 07HTTPS, 10Traffic, 06Operations, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3222567 (10Krinkle)
[22:17:44] 07HTTPS, 10Traffic, 06Operations, 05Security: $wgServer with initial https:// does not force HTTPS (wgSecureLogin) - https://phabricator.wikimedia.org/T156320#2970877 (10Krinkle)
[22:38:06] 07HTTPS, 10Traffic, 06Operations, 05Security: $wgServer with initial https:// does not force HTTPS (wgSecureLogin) - https://phabricator.wikimedia.org/T156320#3222669 (10Krinkle)