[00:02:54] * Platonides was imagining it was an old lady having to do all of that
[06:36:51] Hmm. From home I seem to be getting 20ms more ping to gerrit on ipv6 vs ipv4
[06:37:15] ipv6
[06:37:16] 10. ae-1.a03.asbnva02.us.bb.gin.ntt.net 0.0% 18 114.0 103.3 84.6 131.4 15.6
[06:37:16] 11. xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net 0.0% 17 104.3 104.4 103.7 106.7 0.6
[06:37:16] 12. gerrit.wikimedia.org 0.0% 17 103.0 104.1 102.7 121.5 4.5
[06:37:44] ipv4
[06:37:45] 10. ae-0.a03.asbnva02.us.bb.gin.ntt.net 0.0% 20 84.6 85.9 84.5 91.7 2.1
[06:37:46] 11. xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net 0.0% 20 90.4 85.6 85.0 90.4 1.2
[06:37:46] 12. gerrit.wikimedia.org 0.0% 20 85.8 86.1 85.5 89.1 0.8
[06:38:01] Best poking my ISP (who aren't morons) or can one of you lot help? :P
[06:42:38] and/or filing a bug
[07:29:52] 10netops, 10Operations, 10ops-codfw: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10elukey) Need to check with Joe but I'd do the following: * replace mw2287 in mcrouter codfw proxy config with another one in DX (with X!=4) * before the maintenance, remove mc2033 from the mcr...
[08:04:36] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (10Reedy)
[08:06:07] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (10Reedy)
[10:06:40] 10Traffic, 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10ema)
[11:22:49] Reedy: somehow I am not surprised.... but it would be useful to make sure it's not us that are to blame in that last hop. /me taking a look
[12:14:09] FWIW, I get 37.6ms over v6 vs 39ms over v4 to gerrit.wikimedia.org from my IRC bouncer
[12:14:23] AT&T's IPv6 at my house is still too broken to turn on by default, last I checked :P
[12:14:55] so I don't think it's something generically bad for all IPv6 at our end
[12:15:51] and almost all my hops are via NTT, same as the copied bits of the trace above
[12:16:08] could be a return path diff
[12:28:19] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (10BBlack) From bast1001 to the endpoints shown in line (2) above over v4 and v6: ` bblack@bast1002:~$ mtr -c 10 -r -4 bottomless.aa.net.uk Start: Tue Dec 4 12:23:35 2018...
[12:29:36] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (10BBlack) (But note that first hop from Ashburn to Chicago is our routers' choice, so it's possible some of our route engineering is at play here).
[13:18:55] akosiaris: Cheers
[13:19:53] Is it worth me poking my ISP to see if they can send stuff by a different route?
[13:22:04] well in this case, I think it's us sending things by different routes
[13:22:32] we probably have the option to exit that traffic straight from eqiad -> NYC (via gtt), like the incoming routes for both and the outbound from us for v4
[13:22:46] but instead we hop over to our router in chicago before hitting HE.net in NYC
[13:23:11] whether that's because of our local policies, or because of something about how your ISP advertises their routes, I don't know. Arzhel might when he gets in later :)
[13:35:34] Reedy: what's your IP(v6)? :)
[13:35:45] 2001:8b0:fafa:8020::/64
[13:36:17] and v4?
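(A rough way to reproduce the v4-vs-v6 comparison above without mtr is to time TCP connects to gerrit over each address family. The sketch below is not from the log and is only a proxy for the ICMP RTTs mtr reports; host, port and sample count are arbitrary choices.)

# Hypothetical helper, not WMF tooling: compare TCP connect latency to
# gerrit.wikimedia.org over IPv4 vs IPv6, as a stand-in for the mtr/ping test.
import socket
import time

HOST, PORT, SAMPLES = "gerrit.wikimedia.org", 443, 10

def connect_times(family):
    """Return TCP connect times (seconds) for one address family."""
    addr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
    times = []
    for _ in range(SAMPLES):
        s = socket.socket(family, socket.SOCK_STREAM)
        s.settimeout(5)
        t0 = time.monotonic()
        s.connect(addr)
        times.append(time.monotonic() - t0)
        s.close()
    return times

for name, fam in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
    samples = connect_times(fam)
    print("%s: min %.1fms avg %.1fms" % (
        name, min(samples) * 1000, sum(samples) / len(samples) * 1000))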
[13:36:45] 81.187.84.202 should be being used
[13:36:55] (but I have a /29 + 1 other single ipv4 too)
[13:38:55] hm
[13:39:02] so the return path for these two goes over different transits
[13:39:07] but that said... I found a routing bug :)
[13:39:50] woo
[13:49:56] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) p:05Triage>03High
[13:51:17] paravoid: the hope to chicago is our fault?
[13:51:20] s/hope/hop/
[13:51:23] yes
[13:55:28] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably expl...
[13:55:39] bblack: ^
[13:55:48] thanks Reedy, this was interesting :)
[13:57:17] Nice
[15:16:41] do we have any bad origin server? Ideally, something that manages to send the response headers fairly quickly but is then outrageously slow at generating the response (or parts of it)
[15:18:41] I've written a little Flask app that does that (called it lameoid), but a real life case would be nice
[15:24:45] even if we had one today, as soon as we realized it someone would be fixing it I think.
[15:26:09] ema: btw did you see https://gerrit.wikimedia.org/r/c/operations/puppet/+/477424 ? I did the quick revert because I was short on time. But I think the "real" problem is that the "nhw" part also included the size filter for cache_upload frontends, so with "none" they weren't filtering huge items out of FE mem cache (in other words, the switch for "nhw" vs "exp" wasn't really designed to handle "none" correctly, since the size filter should be there in the "none" case too or whatever)
[15:27:32] we put the size filter in the "nhw" block because "exp" implicitly handles size-related issues in a different way than "nhw"
[15:28:13] (also, looking quickly through code history it looks like we've always been on "nhw"... I thought at some point we turned on "exp" for at least upload?)
[15:28:29] (maybe my memory is faulty and we never quite got there, or we reverted from it and I forgot that too)
[15:30:54] bblack: I think we only really turned exp on for cp1008
[15:36:49] uh-oh, yeah I see what you mean! We should make sure that "none" does the hfp part for objects > 256K too
[15:37:11] I'll prepare a patch shortly
[15:47:14] 10Domains, 10Traffic, 10Operations: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) 05Resolved>03Open They mailed again with the same stuff as before.. wikimedia.is isn't compliant because the SOAs differ etc.. Then MarkMonitor mailed...
[15:49:07] mutante: how is this possible, since wikimedia.is hasn't changed since like 2013?
[15:50:37] grr at our serial number system
[15:54:41] hmmm, must be that the multatuli switch interfered with the original fix?
[15:57:08] 10Domains, 10Traffic, 10Operations: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10BBlack) 05Open>03Resolved Fixed again. Copying my whole terminal output for posterity. This runs a readonly command that `md5sum`'s the zones directory to c...
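(The "lameoid" Flask app ema mentions at 15:18 wasn't pasted into the channel; a minimal sketch of an origin with that behaviour -- headers sent promptly, body generated outrageously slowly -- might look like the following. The route name and port are made up here, and this is a guess at the shape of the app, not the actual one.)

# Sketch of a deliberately slow origin in the spirit of "lameoid":
# the status line and headers go out as soon as the generator is returned,
# but the body then trickles out over roughly a minute.
from flask import Flask, Response
import time

app = Flask(__name__)

@app.route("/slow")          # hypothetical route name
def slow():
    def body():
        for _ in range(30):
            time.sleep(2)            # be outrageously slow per chunk
            yield b"x" * 1024
    return Response(body(), mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)   # port chosen arbitrarily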
[15:59:58] maybe we should give up on the serial number problem and just put a fixed "1" in them all
[16:02:28] the only real reason to put some kind of real datestamps in the serial field is for debugging, but it could end up being a lot of complexity added to the system just to make debugging work, and thus a net negative in the end.
[16:03:13] or alternatively, we could have some system that creates real serial numbers as changes are merged to master
[16:03:36] e.g. a gerrit-side hook on merge to master that increments the field when a real file (non-symlink) is altered by a merge?
[16:03:46] sounds tricky though
[16:04:24] or we could come up with something else smarter that doesn't rely on datestamps and is consistent
[16:05:02] e.g. some mapping of the git hash of the last affecting commit to the numeric space available to serial numbers, which still keeps problematic collisions unlikely
[16:05:38] the number would be somewhat meaningless on its own, but it would still have the debugging value that it can be compared to see if all authservers are using the same version of a zone's data
[16:05:53] don't they need to increase?
[16:06:51] I'm not sure, we'd have to dig into whether that's necessary
[16:07:17] bblack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477573/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477574/ https://puppet-compiler.wmflabs.org/compiler1002/13834/
[16:07:19] there's probably some standard somewhere that says so, but many of the early standards tend to implicitly assume BIND and AXFR and all sorts of non-protocol things, too
[16:07:51] so i didn't read everything, the problem is that some domain registrar requires all serials to be the same across servers?
[16:08:07] yeah
[16:08:24] you can just wait for one that assumes ascending order I guess ;)
[16:08:29] it's not because they have a functional need, it's just one of their sanity-checks and they assume if the serials don't match you must have something horribly wrong
[16:08:33] yes
[16:09:01] in our case, our dns deployment scripts have always filled out "{{ serial }}" with a datestamp-derived value when the file last changed
[16:09:22] yes
[16:09:24] what puts it out of sync is when you re-deploy DNS servers, so new ones get fresh serials from when they first were puppetized
[16:09:29] i did that originally
[16:09:45] yeah didn't think anyone would care about mismatching serials between servers
[16:10:15] I wouldn't have either, it's pretty annoying and senseless for anyone to even look at serial values unless they're actually an AXFR slave
[16:11:00] creating a timestamp from original file git commit date seems easiest to me
[16:11:01] and well, there could possibly be code out there in the wild, I could imagine, on caches.... which might notice if the SOA serial changes and affects negative cache validity, as some kind of optimization
[16:11:09] I don't know if they'd look for increments specifically
[16:11:33] ascending serials is pretty commonplace so if we don't honor that either we can just wait for someone/something to complain about that
[16:11:39] :)
[16:12:25] technically, I think even with random serials, your odds of a given move being "ascending" are 50%, but that diminishes if they've missed observing several changes in a row
[16:12:57] I'm actually not sure if there's a definition of ascending that makes sense for more than one change, either.
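(For the cross-server comparison bblack describes at 16:05 -- and the kind of mismatch that reopened T206688 -- a quick check could be a few lines of Python shelling out to dig. The zone and nameserver names below are assumptions and would need to match the real authdns set.)

# Rough consistency check: ask each authoritative server for a zone's SOA
# and compare the serial fields. Zone and nameserver names are assumptions.
import subprocess

ZONE = "wikimedia.org"
AUTHSERVERS = ["ns0.wikimedia.org", "ns1.wikimedia.org", "ns2.wikimedia.org"]

serials = {}
for ns in AUTHSERVERS:
    out = subprocess.check_output(
        ["dig", "+short", "SOA", ZONE, "@" + ns], text=True).split()
    serials[ns] = out[2]   # SOA rdata: mname rname serial refresh retry expire minimum

for ns, serial in serials.items():
    print(ns, serial)
if len(set(serials.values())) != 1:
    print("MISMATCH: authservers disagree on the SOA serial for", ZONE)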
[16:13:07] (and caches can't be sure they see each change)
[16:13:24] https://tools.ietf.org/html/rfc1982 :/
[16:13:30] well, not decreasing is probably what anyone would test
[16:13:59] heh
[16:14:04] right, but you mean over time, from one observation to the next
[16:14:28] I'm not sure that serial numbers have any mathematical meaning that makes sense over time like that, but I need to read that RFC more in that light
[16:14:47] in any case, it's implicit that an observer can't know how many changes have happened between observations
[16:15:15] and all serial math eventually wraps, but there might be some rule in there about not wrapping more than X values per Y (very long) time period or something
[16:20:58] the best part is:
[16:21:00] "Note that there are some pairs of values s1 and s2 for which s1 is not equal to s2, but for which s1 is neither greater than, nor less than, s2. An attempt to use these ordering operators on such pairs of values produces an undefined result."
[16:21:23] hahaha
[16:22:41] 4.2 Corollary 2 also basically says, for these purposes, "There is no defined ordering between X and Y, if you did two legal additions to get from value X to value Y"
[16:23:09] in other words, if you fail to observe an increment by an arbitrary amount before another occurs, you no longer know anything about <>
[16:24:12] and then the killer statement down in the SOA-specific section:
[16:24:14] "Care should be taken that the serial number not be incremented, in one or more steps, by more than this maximum within the period given by the value of SOA.expire."
[16:24:32] (this maximum being "2^(32 - 1) - 1")
[16:25:32] our current expiry value is 2 weeks
[16:25:42] eh, just implement a timestamp and make sure it's consistent between all servers for 99% of the time and I think we should be good :P
[16:25:44] although really, again since we don't use AXFR, all of those values are meaningless
[16:25:50] they are
[16:26:13] anyways
[16:26:50] we already have a system that does a timestamp that's consistent most of the time. but it fails if we don't go around and bump all the serials manually after (re-)deploying a new authdns server
[16:27:23] i guess dates are not necessarily ascending in git either
[16:27:27] rebase etc
[16:27:33] bah
[16:27:38] well you'd want the date of the actual merge, not the date of the commit
[16:27:42] yes
[16:27:45] but even then
[16:27:48] they can be rebased later, right?
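(To make the RFC 1982 quotes above concrete, here is a small sketch -- not anything used in production -- of serial-number comparison for SERIAL_BITS = 32, i.e. SOA serials, including the pairs for which the ordering is undefined.)

# RFC 1982 serial number arithmetic for SERIAL_BITS = 32 (SOA serials).
HALF = 2 ** 31          # 2^(SERIAL_BITS - 1)
MOD = 2 ** 32

def serial_lt(s1, s2):
    """True if s1 < s2 in RFC 1982 terms; None when the ordering is undefined."""
    s1, s2 = s1 % MOD, s2 % MOD
    if s1 == s2:
        return False
    if (s2 - s1) % MOD == HALF:
        return None     # the "neither greater than, nor less than" case quoted above
    return (s1 < s2 and s2 - s1 < HALF) or (s1 > s2 and s1 - s2 > HALF)

assert serial_lt(2018120401, 2018120402) is True   # normal datestamp-style bump
assert serial_lt(0xFFFFFFFF, 0) is True            # wraps, still counts as ascending
assert serial_lt(0, HALF) is None                  # undefined ordering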
[16:27:51] not that we do that typically
[16:27:53] but in theory
[16:29:07] bblack@haliax:~/repos/dns$ git log --format=fuller
[16:29:07] commit 637bdcaa339d6633f89007f8d16125ee9b09763a (HEAD -> master, origin/master, origin/HEAD)
[16:29:10] Author: pt1979
[16:29:12] AuthorDate: Mon Dec 3 17:51:12 2018 -0600
[16:29:15] Commit: Gehel
[16:29:17] CommitDate: Tue Dec 4 13:34:31 2018 +0000
[16:29:18] ^ I think CommitDate shown here might be sequential and meaningful
[16:29:20] DNS: Add mgmt and production DNS for elastic2045 - elastic2054
[16:29:36] but what's typically shown in "git log" is the AuthorDate
[16:32:56] well, maybe not
[16:34:15] I think the commitdate only looks meaningful because we have a merge policy requiring rebase
[16:38:30] in any case, I don't think a fixed or randomly-wandering value actually makes anything break anywhere
[16:38:57] except that if all servers don't agree, overzealous registrar sanity-checks may get annoying
[16:43:36] so things that would "work" and could be implemented reasonably:
[16:44:17] 1) Leave it all the way it is, and document somewhere that we need to manually regen zones when an authserver is (re-)deployed, and hopefully not forget. May get annoying/challenging later when we have a much larger count of anycast authservers.
[16:45:34] 2) Generate it from the CommitDate and not worry too much if there are rare cases where CommitDate moves backwards. Caveat here is we probably want the CommitDate that last affected each file, rather than the global last commit, so we don't bump and reload every zone pointlessly on every small change. Requires some python+git work in the scripts.
[16:46:18] 3) Set it to 1 everywhere and forget it - consistency is guaranteed and the value is meaningless for any of our own sanity checking
[16:47:54] 4) Set it to some translation of the git sha-1 to serial number space (basically reduce the hash down to a 32-bit one). Upside vs (2) is there's no chance of date collisions/reversals being confusing, while remaining consistent. Downside is the numbers become fairly human-meaningless, and the hashes may occasionally collide in 32-bit space anyways. I'm not sure which confusing problem there is more statistically likely.
[16:48:45] of course we already have problems with the current python scripts anyways outside of serial numbers. They fail to update things on their own when the list of languages changes, and there's some other similar cases.
[16:49:14] so "leave everything alone" probably isn't a great long-term answer. it all needs some rework at some point.
[16:50:09] I think (2) is probably the best option, but it's all low priority and should all be fixed together with new tooling that solves the other problems, too (and maybe that also operates better in parallel when there's many more than 3 servers)
[17:04:10] bblack: thanks for fixing it again. if it happens again i'll try to just follow your docs. wasn't sure if we want to actually talk directly to the IS registry to tell them that their rules don't make sense to us
[17:08:34] I doubt they want to hear from us about it :)
[17:37:53] Can I check the pybal healthcheck history somewhere?
[17:39:22] XioNoX: the history of pybal checking other services?
[17:39:26] yep
[17:39:48] XioNoX: well, there's log output on each lvs (I think still mirrored to /var/log/pybal.log), which is quite verbose
[17:40:18] I think icinga alerts when pybal says a whole service is in trouble as opposed to one node, so icinga alert logging may provide some higher-level view.
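(Option (2) above could be prototyped with a little git plumbing. A sketch, assuming it runs against a checkout of the DNS repo and glossing over the template handling the real deployment scripts do; the repo path and zonefile path are placeholders.)

# Sketch of option (2): derive a per-zonefile serial from the CommitDate of
# the last commit that touched that file, so every authdns server building
# from the same git revision computes the same serial. Paths are assumptions.
import subprocess

def zone_serial(zonefile, repo="/srv/authdns/git"):
    """Unix committer timestamp of the last commit touching zonefile.

    Fits in the unsigned 32-bit SOA serial field until 2106; moves backwards
    only in the rare CommitDate-reordering cases discussed above.
    """
    out = subprocess.check_output(
        ["git", "-C", repo, "log", "-1", "--format=%ct", "--", zonefile],
        text=True)
    return int(out.strip())

print(zone_serial("templates/wikimedia.org"))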
[17:41:28] bblack: on lvs2002: -rw-r----- 1 root adm 0 Nov 10 06:25 /var/log/pybal.log
[17:41:48] nice!
[17:42:17] I think that just means it's had nothing to say since the last rotation, though
[17:42:22] at least, that could be the case
[17:42:29] I'm curious to know if lvs2002 sometimes sees dns2001 as down
[17:42:30] ok
[17:42:33] lvs1016 has lots of lines
[17:43:24] bblack: new potential rabbit hole: https://phabricator.wikimedia.org/T211131
[17:43:33] 10Traffic, 10Operations: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) p:05Triage>03Normal
[17:43:44] while I'm looking at that...
[17:44:08] all four thumbor100x are failing from lvs1016 pybal all the time
[17:44:37] 10Traffic, 10Operations: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi)
[17:45:39] there's lots of interesting data in /var/log/pybal.log stuff, things that seem ignored I assume
[17:46:29] I don't know that pybal's per-server healthchecks are worth alerting on, they're too fine-grained and they should be caught by other functional per-server icinga checks
[17:46:45] but it might be nice to have a dashboard/grid sort of view of all the things a pybal is monitoring.
[17:47:43] or some alerting on enabled but flapping hosts
[17:48:02] probably not trivial to implement though
[17:48:05] I think it would get noisy and annoying
[17:48:19] I donno :)
[17:49:39] we can tune it though, like if it has been flapping for 24h, maybe there is something wrong with it
[17:50:09] XioNoX: on your TCP retrans thing... my first question about your example there is: "Why is LVS talking to dns2001:53 at all?"
[17:50:25] bblack: healthchecks?
[17:50:25] but then I remembered pybal monitoring, right
[17:50:32] which is just a random TCP connection
[17:50:40] yeah
[17:55:07] I don't have that much time to look into it (or ideas so far), but at least the task is there now
[17:59:10] XioNoX: you could try a real dns query over TCP while tracing to validate non-healthcheck behavior
[17:59:24] good idea
[17:59:32] e.g. "dig" using TCP, towards dns2001 (or the LVS virtual IP, landing on either of dns200[12])
[18:00:10] I would guess that probably whatever's going wrong here is specific to the pybal tcp healthcheck, or specific to all the odd things that are hard to remember about an LVS server contacting one of its own backends directly...
[18:09:54] bblack: indeed, from bast2001 to dns2001 (directly) or dns2002 (via the LVS VIP) I don't see any retransmits
[18:16:18] yeah, it's gotta be something funky about LVS/pybal itself then
[18:16:31] possibly asymmetric routing out the various interfaces
[18:17:00] I donno, it's worth digging into it as feedback into how we could do pybal healthchecks better in the future, but not super high priority
[18:17:48] I think it's been noted before, that pybal healthchecks should probably use the appropriate source interface/macaddr so they don't succeed over the "wrong" interface and fool us if we can't reach the backend over the correct interface for forwarding.
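(bblack's 17:59 suggestion -- a real DNS query over TCP while tracing -- can also be repeated and timed, since a lost and retransmitted segment usually shows up as an outlier near the retransmission timeout. A sketch using dnspython, which is my assumption; the actual test recorded on T211131 used plain dig.)

# Repeat a TCP DNS query against the codfw recursor VIP and report timings;
# a retransmitted segment tends to show up as a latency outlier near the RTO.
# dnspython is an assumption here; the test on T211131 used plain dig.
import socket
import time
import dns.message
import dns.query
import dns.rdatatype

TARGET = socket.gethostbyname("dns-rec-lb.codfw.wikimedia.org")
query = dns.message.make_query("en.wikipedia.org", dns.rdatatype.A)

timings = []
for _ in range(50):
    t0 = time.monotonic()
    dns.query.tcp(query, TARGET, timeout=5)   # full TCP handshake + query each time
    timings.append((time.monotonic() - t0) * 1000)

timings.sort()
print("min %.1fms median %.1fms max %.1fms" % (
    timings[0], timings[len(timings) // 2], timings[-1]))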
[18:18:42] bblack: in this case the capture is on the dns2001 side, so the packets are making it back
[18:19:20] and it's a retransmit, so it's the exact same packet re-sent
[18:20:05] what I mean is, it could be the case that the LVS->dns2001 side of the TCP conn goes over eth0 and the dns2001->LVS side over eth1 (or whatever interfaces)
[18:21:06] and what's being lost here is that LVS isn't seeing packet #2 on the return side, at least not initially
[18:21:21] or number 10 I guess it is, in your example
[18:21:51] I don't know, none of that trace makes sense. it might make more sense if you could see it from both sides
[18:22:26] but I'd still gamble on something to do with a mismatch of interfaces/MACs used for the two directions of flow for that connection.
[18:22:40] (on the LVS)
[18:22:41] 10Traffic, 10Operations, 10Pybal: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) p:05Normal>03Low Doing a DNS query over TCP from bast2001 to dns2001 (directly) or dns2002 (via the LVS VIP `dig @dns-rec-lb.codfw.wikimedia.org en.wikipedia.org +tcp`) doesn't show any...
[18:23:58] if the LVS didn't see packet #10 (syn-ack), then it wouldn't send the #11 (ack)
[18:25:23] and mac addresses are consistent
[18:28:12] 10Traffic, 10Operations, 10Pybal: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi)
[19:31:27] 10Traffic, 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) I've reduced the pool lifetime to 1s (which should be essentially as if there was no pooling if I get it correctly), let's...
[21:11:06] 10netops, 10Operations, 10ops-codfw: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) Parts keeps getting delayed, new shipping is expected for this Friday, rescheduling the work for next Wednesday.
[21:11:30] 10netops, 10Operations, 10ops-codfw: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi)
[22:05:20] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10jijiki) p:05Triage>03Normal