[06:25:42] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10ayounsi) No errors on the switch side. `lang=bash lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc rx_crc_errors: 27387518 lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc... [07:26:29] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (10ema) 05Open→03Resolved a:03ema All production nodes are now running Varnish 6.0.6-1wm1. Closing! [07:54:53] 10Traffic, 10Analytics, 10Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) I am not an expert in `perf` but I tried to do the following on cp5012: `sudo perf record -F 99 -p 29945 --call-graph dwarf sleep 10` (the pid is varnishkafka-webrequest) And I... [11:02:55] 10Traffic, 10Operations, 10observability, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) Added a panel to https://grafana.wikimedia.org/d/000000479/frontend-traffic to showcase the top p95 offenders: {F32369902} I'... [11:53:59] 10netops, 10Operations, 10observability: active/active links monitoring - https://phabricator.wikimedia.org/T264300 (10ayounsi) p:05Triage→03Medium [11:57:16] 10Traffic, 10netops, 10Operations, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10ayounsi) [11:57:21] 10netops, 10Operations: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10ayounsi) 05Open→03Resolved Monitoring discussion moved to T264300. Balancing is done. [12:41:46] vgutierrez: Hey! Would you have a few minutes to check https://gerrit.wikimedia.org/r/c/operations/puppet/+/629829 ? I'm never entirely sure about LVS configs. 
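For context on the ticket above: the `rx_crc_errors` check amounts to sampling the NIC counter from `ethtool -S` and seeing whether it grows between samples. A minimal sketch of that logic, assuming you've captured the text output twice (the counter value and interface come from the ticket; `parse_ethtool_stat` is a hypothetical helper, not an existing tool):

```python
import re

def parse_ethtool_stat(output: str, stat: str) -> int:
    """Extract a single counter value from `ethtool -S <iface>` text output."""
    match = re.search(rf"^\s*{re.escape(stat)}:\s*(\d+)$", output, re.MULTILINE)
    if match is None:
        raise KeyError(stat)
    return int(match.group(1))

def crc_errors_growing(before: str, after: str) -> bool:
    """True if rx_crc_errors increased between two ethtool -S samples."""
    return (parse_ethtool_stat(after, "rx_crc_errors")
            > parse_ethtool_stat(before, "rx_crc_errors"))

# Two samples; the first value is the one quoted in T264227:
sample1 = "NIC statistics:\n     rx_crc_errors: 27387518\n"
sample2 = "NIC statistics:\n     rx_crc_errors: 27390012\n"
print(parse_ethtool_stat(sample1, "rx_crc_errors"))  # 27387518
print(crc_errors_growing(sample1, sample2))          # True
```

A growing CRC counter on one side with a clean switch side (as ayounsi notes) usually points at the cable/SFP on the host side, which is what the ticket ultimately concluded.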
[12:42:09] Also, are there special steps that need to be taken when deploying changes to LVS configs? [12:46:10] sure.. one sec [12:46:23] I was into varnishland and that drains my tiny brain [12:48:00] vgutierrez: I can help fill that brain with LVS :) [12:50:26] so, profile::lvs::realserver::pools is affecting your servers [12:50:44] s/your/cloudelastic/ [12:50:59] yep, that's my understanding [12:51:26] if that's the missing bit, basically cloudelastic servers are missing the service IP attached to localhost [12:51:40] so they're ignoring the traffic that the LVS is routing their way [12:53:07] but from hieradata/common/service.yaml [12:53:33] you can see that all chi, psi and omega are referring to the same VIP, tagged as id004 on service.yaml [12:53:45] We already have one service exposed via LVS on those servers, but not the other 2 [12:53:52] cloudelasticlb: 208.80.154.241 [12:53:52] cloudelasticlb6: 2620:0:861:ed1a::3:241 [12:56:12] so basically that part from the LVS point of view is going to be a NOOP I believe [12:56:20] but not from the conftool side of things [12:57:02] as you can see from modules/profile/manifests/lvs/realserver.pp [12:57:07] LVS is only at IP layer? It does not care about TCP. But conftool will add additional service checks? [12:59:00] what I mean is that lvs::realserver doesn't have anything to do with the load balancers [12:59:12] but with the backend servers handling the traffic [13:00:01] psi and omega are already configured on the lvs [13:00:15] * gehel is now confused :) [13:00:34] quick check: `gehel@elastic2058:~$ curl https://cloudelastic.wikimedia.org:9643` [13:00:41] that already works as expected. [13:01:17] yup [13:01:22] https://www.irccloud.com/pastebin/K6Vwa6sg/ [13:02:11] so the only thing that this change would bring is the additional pool/depool scripts for each service [13:02:21] right [13:02:40] * gehel should have done more reading before pinging vgutierrez [13:02:47] np :) [13:02:57] ok, thanks a lot!
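Background for the exchange above: with direct-routing LVS, the real servers must have the service VIP configured on a local (loopback) interface, or they silently drop the packets the load balancer forwards to them — which is what "missing the service IP attached to localhost" means. A rough sketch of the kind of check involved, fed with `ip -o addr`-style one-line output (the sample lines are invented around the cloudelasticlb VIP quoted above; this is not how the puppet profile actually verifies it):

```python
import ipaddress

def has_vip(ip_addr_output: str, vip: str) -> bool:
    """Check whether a VIP appears among the addresses in `ip -o addr` output."""
    want = ipaddress.ip_address(vip)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # `ip -o addr` lines look like: "1: lo    inet 10.2.2.10/32 scope global lo"
        if "inet" in fields or "inet6" in fields:
            idx = fields.index("inet6") if "inet6" in fields else fields.index("inet")
            addr = fields[idx + 1].split("/")[0]  # drop the prefix length
            if ipaddress.ip_address(addr) == want:
                return True
    return False

sample = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "1: lo    inet 208.80.154.241/32 scope global lo\n"
)
print(has_vip(sample, "208.80.154.241"))  # True
print(has_vip(sample, "10.2.2.10"))       # False
```

If the second check were the real situation, traffic for 10.2.2.10 routed to the host would be ignored, exactly as described.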
[13:50:33] hi traffic o/ - I would like to remove two LVS services (just a heads-up) [13:51:50] we like [14:08:53] \o/ [14:08:59] kill them all jayme ;P [14:09:28] working on it :) [14:37:01] bblack: for when you're around, we would be ready to migrate esams (also includes a few knams records) to Netbox if today is deemed a good day and not too close to eqsin migration. [14:37:05] the patch is: https://gerrit.wikimedia.org/r/c/operations/dns/+/630647 [14:46:57] bblack: so you still need puppet disabled on lvs1016? [14:59:03] bblack: s/so/do/ :) ... if you re-enable at some point you will probably see "Services in IPVS but unknown to PyBal: set([10.2.2.10:8081, 10.2.2.47:8889])". Feel free to remove them (ipvsadm -D -t 10.2.2.10:8081; ipvsadm -D -t 10.2.2.47:8889) [14:59:26] jayme: is it just a removal? [14:59:52] lvs1016 is kinda broken, we should probably note that somewhere for now, because it's not a great idea to restart the pybals elsewhere and fail over to it, if we can help it [14:59:59] there's a ticket started yesterday [15:00:20] https://phabricator.wikimedia.org/T264227 [15:00:23] needs some dcops [15:00:24] bblack: yeah. It's done on the others. Just needs puppet run, pybal restart and "ipvsadm -D ... " [15:00:46] jayme: if you've already done the others, go ahead and re-enable puppet and go for it there [15:01:39] bblack: uh. Did not know about that. Should I re-disable puppet afterwards? [15:01:53] no, that was mostly for some testing I did, it can be left re-enabled [15:02:12] what I'm left pondering is whether we should (after this one) take a pause on all lvs service changes in eqiad until we get past this hw issue [15:02:35] (because they all involve a short failover to lvs1016 while a primary pybal is restarting, and lvs1016 has packet loss) [15:03:05] Okay.
Will do lvs1016 after meeting (~30m) [15:03:12] ack, sounds good [15:05:27] volans: from my pov, you're good to go for esams [15:18:51] ack thanks a lot [15:27:29] bblack: I'm currently unable to run puppet on lvs1016 because "puppetmaster1001.eqiad.wmnet [10.64.16.73] 8140 (puppet) : No route to host" [15:40:14] don't know if I should/can leave it like this...ema/vgutierrez maybe ^ [15:40:53] uh... that's related to your testing bblack? [15:41:15] fwiw it's not reporting to debmonitor either since 15h [15:41:19] jayme: I'd say that's far from ideal :) [15:41:21] it's maybe more related to !log re-enable and run puppet on I guess [15:41:30] oops [15:41:35] to https://phabricator.wikimedia.org/T264227 [15:49:44] I don't think I did anything strange there [15:49:46] looking [15:52:19] ah the link is flapping now [15:52:27] oh, we've lost route to one row [15:52:32] awesome [15:55:41] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) The link has gotten worse and began flapping up and down rapidly since last update, causing a loss of routing to the row. I've downtimed the whole host now in icinga, di... [15:56:14] (which ironically will fix the proxyfetch errors for now, because now they'll just check over other interfaces and get routed) [15:58:28] bblack: So you think I can apply my changes now and run? :) [16:00:06] hmm yeah, I guess you can try [16:00:14] it can probably reach puppet now [16:00:27] but it does put the host now in a dangerously unusable state for real LVS traffic [16:00:53] (it being the state of affairs with the dead link, not your changes) [16:02:22] that's clear. I just want to get all in the same state [16:05:49] bblack: okay, I'm fine. Should I re-disable puppet this time? (as you had it disabled again) [16:07:52] yeah may as well for now [16:07:55] jayme: ^ [16:08:35] will see what happens on the dcops side, we might have a quick resolution.
if not, I'll send some irc/email updates about not messing with LVS until we get this fixed. [16:12:11] bblack: okay. Disabled again. Thanks! [16:12:53] jayme: we'll not forget that it's all your fault!™ :-p [16:45:04] * jayme as blame proxy hereby forwards your blame to kormat. x-blame-forwarded-for: volans [16:45:46] lol [17:19:10] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) 05Open→03Resolved a:03Cmjohnson @Cmjohnson replaced the SFPs on both ends of this link before my reboot above. Since the reboot, we don't seem to have any abnormal... [17:20:40] esams migrated too, so far so good, we'll keep an eye ofc [17:22:51] \o/ [17:26:27] bblack: fyi ar.zhel opened T264273 today, you might have an opinion too :) [17:26:27] T264273: DNS: per prefix zone-file limitation - https://phabricator.wikimedia.org/T264273 [17:27:38] let me also comment on what we can add to the current approach [17:31:19] yeah that's tricky [17:31:43] added https://phabricator.wikimedia.org/T264273#6509764 [17:31:58] I think a desirable end-state is one include per zonefile, but obviously we're not doing that during the rollout, so that we can attack little pieces at a time [17:32:03] the bottom line is whether we want flexibility or simplification [17:34:01] if we ignore this transitional period for now and look at the end-state, is there any desirable flexibility in having multiple separate netbox includes for a single zoneflie?
[17:34:05] *zonefile [17:34:54] if we want to be able to not manage something with netbox [17:35:24] well [17:35:44] you mean something that netbox is exporting, but we want to ignore the export and provide a manual version of the records instead [17:36:29] or a whole subzone, like svc if we decide to manage that in another way (I hope not) [17:37:22] but even with a single include per zonefile, if netbox doesn't export a thing (like svc), and we define manual records in the zonefile, we're good [17:37:34] yes as long as netbox doesn't export them [17:37:52] so for NS records right now we've done this: [17:37:52] https://netbox.wikimedia.org/search/?q=ns0 [17:37:54] so as a thought experiment, we could control that on export [17:38:18] through config to the exporter or something, to filter out some otherwise-exported records. [17:38:30] but then transitions get tricky too [17:38:52] if we had exporter config filtering ns[012], and had a manual set of records for those, and a single-include-per-zone style [17:39:03] and then later we wanted to let netbox manage them, there's no clean way to get there [17:39:19] if you remove the filter first you get duplicate definitions, and if you remove the dns side first you lose the records till netbox pushes again [17:40:40] even without contemplating the single include per zonefile, this sort of thing is already a potential problem [17:41:06] there could be subsets of records in the existing netbox export includes that we want to transition back to manual, or vice-versa [17:42:34] for those we just need to deploy both manual and auto-generated stuff with the same gdnsd "reload", it's not that hard [17:42:51] just a matter of allowing the tools to support it [17:42:54] do we have a mechanism to do that?
[17:43:08] yeah I guess we could make one, but it's a "special" transition time [17:43:37] tell everyone to stop other dns changes, push the ops/dns change without authdns-update, then let the netbox-side change also pull in the latest authdns git at the same time. [17:44:33] (or some reversed equivalent, where netbox pushes new files but doesn't reload, and then authdns-update manual run picks up both changes) [17:45:02] yeah [17:45:50] another more "natural" way could be to make the change so that there is a duplicate record and make the current zone-validator (or equivalent) check for those and abort the reload [17:46:18] so you merge one change, deploy fails because of duplicate (expected), you merge the other one and deploy succeeds [17:46:20] yeah that's tricky too, since they're not illegal in most common cases [17:46:28] but yeah, we could explicitly check for matching data [17:46:47] zone validator was already failing on totally duplicated records (within the manual repo) [17:47:04] it's just a matter of making it check in the autogenerated one too and be a bit more flexible [17:47:13] ok, so rewinding back to your "svc" example [17:47:14] as I guess if we want to run it manually we might change something [17:47:16] sure [17:47:31] basically if svc.wmnet is a separate export, it's easier to turn it off and switch to manual records if we needed to [17:48:23] because we wouldn't have to (a) create some filtering mechanism to exclude svc from the whole-wmnet export + (b) do whatever simul-deploy magic like above. [17:48:43] we would still need (b) [17:48:48] but not (a) [17:49:15] we wouldn't need (b) either, because a single ops/dns change can supply the new manual records and comment out the include line for the svc.wmnet-specific include.
assuming we picked those export file boundaries well and they match what we need to switch [17:50:02] right, yeah I was thinking the mixed case [17:50:16] as long as you replace the whole thing [17:50:40] the mixed case is something like "stop exporting foo.svc.wmnet from netbox, and create a manual record for it?" [17:51:06] what's the mechanism for stopping the export if not the filtering mechanism in (a)? is there already a flag in netbox or something? [17:51:37] emptying the dns name field in netbox makes it non-exportable [17:51:41] ok [17:51:50] that's how the ns* records are "blacklisted" [17:52:02] but then if we do that for many records it defeats the whole purpose a bit [17:52:17] yeah it does [17:52:32] I think the exceptions will be rare [17:52:58] ns[012] are the only ones currently right? [17:54:00] and a few others because of different TTL [17:54:01] https://netbox.wikimedia.org/ipam/ip-addresses/?q=keep [17:54:18] because a host-prefix IP was used instead of a service one [17:54:24] gerrit and lists [17:54:32] right [17:54:50] so currently the TTL differs on a subnet level basically? [17:55:20] it's all 1H flat [17:55:24] maybe we could add a TTL override field. It might be nice for some transitions anyways, I know we've used that before (turning TTLs down low on special names and then moving them, etc) [17:55:51] it was proposed, but fa.idon was kinda against it and more towards fixing the oddities instead [17:56:50] but what about actual service IPs that are in service subnets? [17:57:16] like wmnet:blubberoid 1H IN A 10.2.2.31 ?
[17:57:41] the sort TTL ones are almost all CNAMEs AFAIK [17:57:45] *short [17:58:28] *CNAMEs or discovery records [17:58:34] right, but what if we needed to change the blubberoid IP [17:58:42] I guess that's what I mean [17:59:07] in this case, i think it has a discovery record anyways, and hopefully nobody's hitting the direct one [17:59:24] which is analogous to the text-lb.ulsfo vs en.wikipedia.org scenario [17:59:33] (enwiki has a short DYNA, text-lb has a 1H A) [18:00:13] right now for the generated ones we can't change it on a per-record basis, but would be pretty easy to do that if needed [18:00:29] so "fixing the oddities" means anything that's a real service IP should use one of the DYNA mechanisms, and not have to worry about a fast transition of a netbox-exported A-record [18:00:39] it's all a matter of where to save that data, if it's worth having it in netbox or, if it's just for migrations, have it in some more temporary place [18:01:07] I guess so yeah [18:01:42] but I think for gerrit/lists it's more tricky [18:01:51] [out of scope for netbox, but I wonder how we audit/prevent/whatever the case that some internal service uses the blubberoid.svc.eqiad.wmnet hostname to reach a service, when we didn't want it to] [18:02:06] [since that's really more of a placeholder/documentation hostname than anything, given discovery] [18:02:55] [eh, I think it's a non-totally-solved problem yet, code search, checking logs, dunno] [18:04:34] yeah I'm really lost now on these side threads [18:04:46] I think for now what you're doing makes sense, because transition and care [18:05:19] in the long run post-transition, we might be better off consolidating the includes more, so there's not so many of them.
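To make the include consolidation being discussed concrete: the zonefiles pull in the netbox-exported data via RFC 1035-style $INCLUDE directives, today roughly one per exported prefix. A hypothetical sketch of the two shapes being compared (file names and records invented for illustration; the real repo layout differs):

```
; wmnet zonefile -- hand-maintained records first
ns0     1H  IN A    198.51.100.1    ; placeholder, not the real address

; current style: one include per exported prefix
$INCLUDE netbox/10.64.0.0-24.wmnet
$INCLUDE netbox/10.64.16.0-24.wmnet
; ...many more...

; proposed end-state: a single consolidated include per zone
; $INCLUDE netbox/wmnet
```

The flexibility trade-off in the conversation is exactly this: many small includes let you comment out one prefix's worth of records in a single ops/dns change, while one big include is simpler but forces the filter-on-export plus coordinated-deploy dance described above.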
[18:05:37] the transition from many-includes to more-consolidated includes itself might be difficult, too [18:06:27] the simplest way would be to temporarily export both versions (export the 10 smaller includes of some zone, and a new combined include, then do the ops/dns include switch, then stop the smaller export copies) [18:06:42] I don't know how hard that would be on the exporter side [18:07:39] some work but probably not that much [18:07:41] it does reduce some flexibility, but really we shouldn't be aiming to support very much flexibility. it will get abused :) [18:08:09] yeah, I was asking arzhel though how far to go up in the prefix chain [18:08:22] and it's not totally clear yet what the conditions are [18:08:29] I think that's unknowable [18:09:00] and I think really if you dig into the practicalities of the problem he's stating, it comes down to the same transition problems we've talked about earlier [18:09:29] (about having some special mechanism for coordinated change, in the case that we didn't consolidate includes enough to cover a given case for some future subnet change he's talking about) [18:10:34] in any case we can't consolidate further than the zonefile level [18:10:56] so there's no way to make it all happen in one include if a /31 interface subnet moves to a different /24 [18:11:20] sure [18:11:50] but it also seems needlessly-complex, the scenario at the end of the current path with tons of includes [18:12:01] right now we do /64 for v6 and /24 for v4 *unless* the IP has a larger prefixlen in netbox, in that case we pick the prefix with the higher prefixlen [18:12:49] to change that I need to know which parent prefix to pick (or forcibly assume /24, dunno) [18:12:55] there is a certain organizational sanity to that, since for the common cases it will create a file per actual subnet (e.g.
row vlans) [18:13:11] but yeah creating tons of /31 for links seems iffy [18:13:22] indeed [18:14:34] so in the ulsfo example, above all the /31 there is: [18:14:34] but we also have a bunch of things in the middle /25, /26 /27 /28 [18:14:37] ; 198.35.26.192/27 (192-223) - Infrastructure Space [18:15:02] that's https://netbox.wikimedia.org/ipam/prefixes/15/ [18:15:06] and is marked as container [18:15:08] which is inaccurate actually [18:15:25] well maybe not, depends on your pov [18:15:29] lol [18:15:30] but the office subnet is outside that space [18:16:12] office is so 2019... [18:16:35] maybe there's some logic that makes sense here and can work? [18:17:08] like "if the cidr mask is >= 29 and there's a container above it, use the container instead"? [18:17:38] (and some v6 equivalent of the same) [18:18:21] either way we'll probably have to solve the coordinated-change problem for some edge cases [18:18:39] and I don't know if this will or won't solve most problems ay is predicting either [18:19:17] surely not arzhel's problem of creating/deleting many new /31 [18:19:27] I'm sure there will be others in the long run, too [18:19:28] as this new Q work [18:19:45] like if we renumber an existing vlan to a whole new /24 or whatever [18:20:27] (other scenarios/cases that will require upping our automation/deploy game wrt netbox+dns I mean, when we reach the need) [18:21:05] yeah [18:25:29] and let's not forget that for an emergency there is always the manual modification of the exported repo on the netbox hosts + push + deploy [18:27:15] what are the potential failure conditions that you're worrying about though?
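The "if the cidr mask is >= 29 and there's a container above it, use the container instead" heuristic proposed above maps neatly onto Python's `ipaddress` module. A toy sketch of it, using the ulsfo Infrastructure Space prefix quoted in the discussion (this implements only the proposal from this conversation, not the exporter's actual behavior, and the v6 threshold is a guess):

```python
import ipaddress

def grouping_prefix(ip_iface: str, containers: list) -> str:
    """Pick the prefix whose export file a record should land in: the
    interface's own subnet, unless it's tiny (>= /29 for v4, >= /125 for v6)
    and some container prefix covers it, in which case use the narrowest
    covering container."""
    net = ipaddress.ip_interface(ip_iface).network
    small = net.prefixlen >= (29 if net.version == 4 else 125)
    if small:
        covering = [ipaddress.ip_network(c) for c in containers
                    if net.subnet_of(ipaddress.ip_network(c))]
        if covering:
            # narrowest container = highest prefixlen
            return str(max(covering, key=lambda c: c.prefixlen))
    return str(net)

containers = ["198.35.26.192/27"]  # "Infrastructure Space" from the paste
print(grouping_prefix("198.35.26.202/31", containers))  # 198.35.26.192/27
print(grouping_prefix("10.64.0.15/24", containers))     # 10.64.0.0/24
```

Under this rule the many /31 link subnets all collapse into one file per container, while ordinary row vlans keep a file per actual subnet, which is the organizational sanity being argued for.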
[18:27:32] that can be easily added as a use case of the dns.netbox cookbook with a flag [18:27:52] chaomodus: like we need to change a single record in a non-standard way and we need to do it *now* [18:28:04] hm [18:28:15] different TTL for example [18:28:15] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Cmjohnson) 05Open→03Resolved updated em0 for both...resolving [18:28:41] yeah I see little need for non-emergency overrides, or any "going back" [18:29:01] but in strange unpredictable emergency scenarios, we might need some unpredictable random changes to something about some generated records [18:29:46] that's super easy to implement with the current cookbook, I'll generate everything in the tmp dir as it does now, pause and give you the path [18:30:03] you go do your changes, commit --amend and then tell the cookbook to continue [18:30:10] awesome [18:30:19] what would gdnsd do if you put a record after the include it also appears in? :) [18:30:42] it depends, but for the most part it would accept it and serve both, because most records allow multiple data [18:30:57] hm like the same a record with different ttl [18:31:14] yah i guess having two a records with different addresses would be a problem [18:31:25] depends on your POV [18:31:41] so you couldn't just override things by adding them to the manual part of the dns [18:31:43] but from the pure pov of dns software and protocols, two A records with different IPs for the same name are fine.
[18:31:46] it will serve both [18:32:15] mixed TTLs are a different matter, though [18:32:18] it'd be a problem from the perspective of only wanting one of them [18:32:50] well yeah, but what I mean is gdnsd-level validation wouldn't fail, and the result would not be what you wanted/intended [18:33:02] right [18:33:21] I'd have to look to remember how it treats the edge case of mixed TTLs on zonefile load. it has changed over time. [18:33:40] it's possible to express mixed TTLs, even on the wire, but you're not supposed to do it, by the standards [18:34:47] looking at the code, the zonefile loader issues a warning about it, and forces all TTLs to the first one it encountered [18:35:00] but there's a flag to upgrade warnings to errors, and we use that flag, so the load would fail [18:35:36] then you go to netbox, reset the dns name and be happy :) [18:36:03] interesting [18:36:21] if we turned off the warning-upgrades, you could use this to do an emergency TTL change without an address change, with the result being a double-A record that "works" [18:36:24] in the sense that running the cookbook will remove the generated record and gdnsd should be happy to reload [18:37:27] right that might involve munging the records in netbox altho i think that's nbd a known quantity [18:37:33] e.g. if netbox was exporting "foo 1H A 192.0.2.1", and you defined a manual record "foo 30 A 192.0.2.1" above the include, it would load with a warning and serve 2x A records on the wire, both with the shorter (first) TTL [18:38:06] which "works", but it's kinda wonky [18:38:44] the warnings are going away in gdnsd-4.x anyways, to be replaced by "please do non-fatal sanity checks with external tooling, and maybe we'll ship a simplistic one as an example" [18:38:48] I prefer the fail, then modify netbox and run the cookbook, more explicit and we get only 1 record [18:39:28] gdnsd-4.x is due to be released at least 30 days before the heat death of the universe, but it will eventually happen.
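A small illustration of the "first TTL wins" loader behavior bblack describes for the foo example above: when a manual record sits above the include, both A records get served, but every TTL in the RRset is forced to the first one seen. This only simulates the semantics as described in the conversation; it is not gdnsd code:

```python
def load_rrset(records):
    """Simulate the described loader semantics for one RRset:
    keep all data, warn on mixed TTLs, force every TTL to the first one."""
    warnings = []
    first_ttl = records[0][0]
    if any(ttl != first_ttl for ttl, _ in records):
        warnings.append("mixed TTLs; forcing all to %d" % first_ttl)
    rrset = [(first_ttl, data) for _, data in records]
    return rrset, warnings

# manual record (TTL 30) above the netbox-generated one (TTL 3600 = 1H):
rrset, warnings = load_rrset([(30, "192.0.2.1"), (3600, "192.0.2.1")])
print(rrset)     # [(30, '192.0.2.1'), (30, '192.0.2.1')]
print(warnings)  # ['mixed TTLs; forcing all to 30']
```

With the warnings-upgraded-to-errors flag mentioned above, any non-empty `warnings` would instead abort the load entirely, which is the failure mode volans says he prefers.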
[18:40:28] ahahah [18:41:08] (the basic plan is that 4.x is a simplification release that gets rid of all the plugin junk and replaces it with external-to-the-daemon tooling that we can do in convenient scripting languages for stuff like GeoIP and friends, and all data loads are explicit rather than auto-detected, and then 5.x is the version that implements DNSSEC on top of the simpler daemon) [18:41:33] it's good to have a deadline for your work. [18:42:21] I've already done a ton of work for 4.x on the design front, but only a handful of real commits in the right directions. Maybe by the end of the year I'll at least start pushing up some WIP branches. [18:44:55] lines of C code will be greatly reduced, which is always a win :) [18:47:03] +1 [18:55:50] sukhe: Stdlib::IP::Address::V4::CIDR ? [18:56:09] I don't think there's a blended V4::CIDR + V6::CIDR though, like there is for non-CIDR [18:57:08] someone should maybe define a Stdlib::IP::Address::CIDR as Variant over them [18:57:32] bblack: but the documentation says for Stdlib::IP:Address, "Match any string consisting of an IPv4 address in the quad-dotted decimal format, with or without a CIDR prefix" [18:59:09] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Cmjohnson) [19:01:44] 10netops, 10DBA, 10Operations, 10ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) 05Open→03Resolved This has been completed [20:16:55] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [20:18:46] 10netops, 10DBA, 10Operations, 10ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) 05Resolved→03Open From the task description: > [DCops] Update Netbox At least the status and 
name are incorrect (should be asw2-d4 for consistency) > [D... [21:11:44] 10netops, 10DBA, 10Operations, 10ops-eqiad, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10wiki_willy) Related to Arzhel's previous comment, getting these Netbox errors: test_missing_assets_from_accounting asw3-d4-eqiad Device with s/n TA3716160376 (WMF542... [21:38:03] 10Domains, 10Traffic, 10Operations: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) [23:07:06] 10Traffic, 10Operations: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (10CDanis)