[02:59:30] volans: I guess I'm less worried about duplication, out of all the possible errors that could happen. I think we could/should validate that, but it's on a long list of things we could/should be validating about the final output zone data.
[03:00:44] on the other other [other?] hand, though, once all of the host-level records are being auto-generated, the volume of manually-maintained data will get quite small, especially with a little de-duplication where appropriate on that end (includes and/or symlinks for common zones).
[03:01:19] maybe that reaches the point where the equation really changes (do we end up with more LOC of validation code than lines of manual dns records?)
[03:02:26] wikimedia.org will remain a contentious nexus in all of this because of its many mixed purposes, but that can/should be split up eventually anyways.
[03:04:10] maybe the perspective shift from all this causes us to rethink how we lay out zones, and how we lay out auth servers/pools/softwares too.
[03:05:41] [e.g. it'd perhaps be great if private revdns and wmnet were all on a separate internal-only auth service]
[03:06:40] I think it's hard to see yet how all that will play out. but take the first steps and see how it changes things first.
[06:52:33] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3006.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[07:29:38] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] ` and were **ALL** successful.
[09:08:31] bblack: ack! sounds like a sane approach :)
[09:49:22] 10Traffic, 10Analytics, 10Operations, 10Research, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey)
[09:49:51] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3005.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[10:02:29] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` Of which those **FAILED**: ` ['lvs3005.esams.wmnet'] `
[10:15:47] 10Traffic, 10Analytics, 10Operations, 10Research, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) >>! In T245833#5934803, @leila wrote: > > @Miriam @elukey the layered permission system can have internal use-cases,...
[10:17:41] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Vgutierrez)
[10:34:34] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3005.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
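(For reference on the duplicate-record validation discussed at 02:59/03:01 above: a minimal sketch of what such a check over the final output zone data could look like, assuming the generated output is plain zone-file text. The parsing is deliberately simplified — no $ORIGIN handling, no multi-line or multi-field rdata — and the function is hypothetical, not the actual ops/dns tooling.)

```python
import sys
from collections import Counter

def find_duplicate_records(zone_text):
    """Flag (name, type, rdata) tuples appearing more than once in
    generated zone-file text. Simplified: skips $-directives, assumes
    single-field rdata (A/AAAA/PTR/CNAME), ignores multi-line records."""
    seen = Counter()
    for line in zone_text.splitlines():
        line = line.split(';', 1)[0].strip()   # strip comments
        if not line or line.startswith('$'):   # skip $TTL/$ORIGIN etc.
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name, rtype, rdata = fields[0], fields[-2], fields[-1]
        seen[(name.lower(), rtype.upper(), rdata.lower())] += 1
    return [rec for rec, n in seen.items() if n > 1]

if __name__ == '__main__':
    dupes = find_duplicate_records(sys.stdin.read())
    for name, rtype, rdata in dupes:
        print(f'duplicate: {name} {rtype} {rdata}')
    sys.exit(1 if dupes else 0)
```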
[10:44:45] !log replace lvs2002 with lvs2008 - T196560
[10:44:45] Sorry, you are not authorized to perform this
[10:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:50] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[10:47:11] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2002.codfw.wmnet` - lvs2002.codfw.wmnet (**PASS**) - Downt...
[11:07:48] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Vgutierrez) a:05Vgutierrez→03Papaul
[11:12:55] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` and were **ALL** successful.
[12:53:13] 10netops, 10Operations, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) I'm seeing this in the openstack BGP speaker: ` 2020-03-03 12:29:28.322 2724 ERROR neutron_dynamic_routing.services.bgp.agent.bgp_dragent [re...
[13:08:50] Has someone checked to see if we got any mail to suggest we were affected by the LE CAA thing?
[13:14:24] volans: netbox doesn't have the host IP data yet? I figured it did by now, but I was looking this morning...
[13:14:39] AlexM-phone: link?
[13:15:43] bblack: I think AlexM-phone is referencing https://community.letsencrypt.org/t/2020-02-29-caa-rechecking-bug/114591 ?
[13:15:53] https://community.letsencrypt.org/t/2020-02-29-caa-rechecking-bug/114591
[13:16:00] yeah
[13:16:00] https://letsencrypt.org/caaproblem/
[13:16:06] serials at second link
[13:16:16] Yeah
[13:16:29] bblack: yes, it's in CR and has been run a few times in dry-run mode; there is one last bit under discussion about the host interfaces
[13:17:48] They should've sent an email to whatever our LE email address is - possibly noc@?
[13:17:49] because in netbox we'll get the FQDN as the "DNS name" set on an IP that is set as the primary IP (v4/v6)
[13:17:58] and that IP is attached to an interface too ofc
[13:18:04] AlexM-phone: at a glance, I wouldn't expect us to be impacted, as our CAA values are stable and consistent for this stuff
[13:18:09] but it wouldn't hurt to check
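(On the "wouldn't hurt to check" above: the check amounts to comparing the serial numbers of our Let's Encrypt certs against the affected-serials list linked from https://letsencrypt.org/caaproblem/. A sketch, assuming Python 3, of pulling a live leaf cert's serial for that comparison; the hostnames are examples only, and a real audit — like the one reported at 20:55 below — would iterate over the full set of LE-issued names.)

```python
import socket
import ssl

def cert_serial(host, port=443):
    """Fetch the leaf certificate serial for host:port as a hex string,
    suitable for grepping against the LE caa-rechecking serials file."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()['serialNumber']

# example hosts only; compare output against the downloaded serials list
for host in ('en.wikipedia.org', 'wikimedia.org'):
    print(host, cert_serial(host))
```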
[13:19:36] volans: so the issue is it doesn't consider having both v4 and v6 as the primary pair of IPs? or that we can't tell the main hostname/IP stuff from add-on interfaces?
[13:21:31] no, it's about interface naming, and whether to import all interfaces or just the primary
[13:21:48] oh ok
[13:22:01] the mgmt ones were imported as mgmt, that was trivial
[13:22:03] we don't have very many cases of multiple interfaces, I don't think?
[13:22:09] a few for sure
[13:22:18] yeah
[13:22:39] also the problem between a naming that DCops can use in the DC and a naming that we can map to the OS
[13:22:39] VIPs / service IPs, I assume will not be part of this, at least initially
[13:22:45] indeed
[13:23:31] yeah the iface naming problem is tricky. It seems like we all (the world) went through a bunch of pain for a transition to a fancy new interface naming system that doesn't solve one of the key problems it was meant to solve :/
[13:24:18] if the host boots up with an interface named enp59s0f0p1x3z8, and there's no hardware labeling on the back of the box that makes it obvious which port that is, it's all kind of silly :/
[13:25:23] yeah, one of the possible options proposed was even to have a custom numbering and then generating the iface name on the OS from that (in the long term)
[13:25:34] or at least something close-enough
[13:27:27] if the iface in software is enp59s0f1, I'd accept that the PCI slots are labeled (on the outside) with a number 59 on one of them, and that the two ports on that card are labeled "0" and "1", and that's enough to infer where 59s0f1 is (slot #59, port #1)
[13:27:44] but I don't think we're anywhere close to that kind of world, AFAIK
[13:30:28] volans: so for the self-defined case.... DC-ops would put stickers next to the ports or something? i0, i1, i2, whatever, and somehow when we initially provision, we'd feed that data to the installer in a way that works out?
[13:30:30] yes, and there is also the chicken/egg problem of new installations, where the IP and DNS must be already assigned (in netbox) but there is still no OS, basically requiring some temporary naming until the reimage script runs and staticizes the names
[13:30:45] *yes was for the prev comment
[13:31:32] we could also avoid physical labels and just always number in the same way (left to right, top to bottom, mainboard first for example)
[13:31:41] ok
[13:31:59] but still, for feeding that to the installer to make our custom iface names, how would you discover/define the mapping?
[13:32:02] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in...
[13:32:20] tricky in all cases
[13:32:35] the current solution wired into the script is to get OS names from puppetdb for backfilling
[13:32:39] existing ones
[13:32:48] and that's probably what we'll use in the end, I guess, at this point
[13:33:23] yeah that still leaves you with some kind of chicken-and-egg that gets resolved manually
[13:33:50] in practice, we don't have many multi-iface hosts though, right?
[13:34:10] just a few cases AFAIK
[13:34:15] I'm trying to think of great reasons to have it, other than cases like LVS
[13:34:33] and we have some bonded-port cases here and there, which I'm not fond of
[13:35:20] so mostly, for most cases, it all comes down to finding out what the one primary port is, which is a simpler problem
[13:35:39] there's one cable plugged in, and the one with link status up in software is it heh
[13:36:37] yep
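(On decoding enp59s0f1 above: the predictable names encode the PCI domain/bus/slot/function in decimal, and sysfs exposes the mapping directly, so the "discover/define the mapping" step can at least be automated on the OS side. A Linux-only sketch, assuming the usual /sys/class/net layout, of dumping iface-to-PCI-address pairs, e.g. for backfilling netbox:)

```python
import os

SYSFS_NET = '/sys/class/net'

def pci_addresses():
    """Map each physical net iface to its PCI address via sysfs, e.g.
    enp59s0f1 -> 0000:3b:00.1 (bus 0x3b == 59 decimal, function 1)."""
    out = {}
    for iface in sorted(os.listdir(SYSFS_NET)):
        dev = os.path.join(SYSFS_NET, iface, 'device')
        if os.path.islink(dev):  # virtual ifaces (lo, vlans) have no device link
            out[iface] = os.path.basename(os.readlink(dev))
    return out

for name, addr in pci_addresses().items():
    print(f'{name}\t{addr}')
```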
[13:37:26] there's a polarity here (I love using outdated wmf buzzwords), in all related things
[13:37:55] between defining automations and tooling that can cover all the weird cases we have today (which can be quite complex)
[13:38:03] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez)
[13:38:22] vs pushing to standardize what we've got, and kill pointless oddball corner cases, to where it fits neatly into simpler automation and management solutions.
[13:39:36] ehheh :)
[13:39:43] if there's like 4 hosts in our fleet with bonded interfaces, and it's hard to rationally justify them in terms of design principles and uptime-engineering... maybe we get rid of those bonded interfaces instead of building things to handle that case.
[13:39:50] as a random hypothetical example
[13:43:59] doesn't sound like a bad idea in terms of cost/benefit for the long run, in an infra that's growing in general
[13:50:18] next stop on the enterprise-ation train: all dns changes are sourced from an SAP extension written by a team of consultants, and you have to call up the inventory management department to inform them they have an incoming fax from the datacenter that they need to scan and process to put new machines into SAP for deployment
[13:54:22] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1016.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[13:59:29] ahahah
[14:01:28] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` and were **ALL** successful.
[14:05:51] bblack: was chatting with vgutierrez earlier today and we were wondering if the static routes for the LVS were still needed, now that we have the LVS peering with both routers, and primary/backup using MED
[14:06:22] going to open a task, but was wondering if there was something we didn't think about
[14:14:14] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1016.eqiad.wmnet'] ` and were **ALL** successful.
[14:18:25] XioNoX: well, the idea with the fallback statics is they're for the worst case where nothing is advertising
[14:19:28] if we do something unpredictably dumb with cumin and/or a puppet-merge (or even just an etcd update coupled with an unknown software bug lurking on the pybal side) and manage to cause every pybal to halt/crash in a short window of time
[14:19:38] or even just cause them all to restart in a short window of time
[14:20:08] did that ever happen?
[14:20:17] having the routers still send the traffic *somewhere*, which might be useful and might still have legitimate IPVS-level rules in place in the kernel, makes that scenario a lot smoother.
[14:20:40] ok
[14:20:58] if the router had some kind of setting for holding onto the last-known-good route for that /27 in the event that all advertisers drop out, that would be sufficient or even better than the static route, too
[14:21:41] I don't recall when the last time it saved us was
[14:22:12] bblack: yeah, it would require pybal to support BGP graceful-restart (up to 5min) or long-lived-graceful-restart (up to many hours, maybe days)
[14:22:17] but I'm pretty sure it has happened before, and there are a number of easy ways for us to shoot ourselves in the foot operationally there. We should probably fix those, but many of them are not easy to fix.
[14:22:57] XioNoX: is that just a flag it needs to send with its updates basically? it might not be too hard to patch it in.
[14:22:58] we noticed some discrepancies in the static-routes while working on codfw, so if we're going to keep them we should either audit them and/or make the rule clear in Homer for example
[14:23:24] bblack: yeah, during the session negotiation
[14:24:00] https://tools.ietf.org/html/rfc4724
[14:24:46] without defending anything about what we're doing today in this space as a good long-term idea, I'd say in the present tense the static fallbacks are still a good thing to have, and yeah maybe we need to automate them or audit them better or whatever.
[14:25:26] ok!
[14:25:40] (if we did have the grace stuff that might change things, but only if we also were able to fire off an alert when all adverts were lost and we're relying on the grace)
[14:25:54] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Vgutierrez)
[14:26:44] regardless, we should have alerting if BGP to/from a LVS goes down
[14:26:56] yeah
[14:26:58] static or graceful-restart
[14:27:15] right now it's caught by the generic BGP alert, but it takes a bit to figure out what's going on
[14:28:11] (also, I dunno if it was fixed since I last saw it, but it seemed like esams didn't even catch lvs going down in its generic BGP sessions alert, at some point in the relatively-recent past)
[14:28:46] bblack: I think we should add the LVS AS# to --critasn in https://github.com/wikimedia/puppet/blob/production/modules/nagios_common/files/check_commands/check_bgp.cfg#L3
[14:29:04] probably frack too
[14:29:16] and maybe k8s
[14:29:24] yeah
[14:29:31] otherwise they will only alert as warning after 7 days down
[14:29:32] it'd be nice if that came with nice labels, too, but that might involve a code change
[14:29:55] so that the alert tells you it's pybal or k8s or whatever, not just AS 65001 or whatever
[14:30:03] (perl code change even)
[14:31:27] bblack: there is also a check on the LVS side: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2009&service=PyBal+BGP+sessions+are+established
[14:31:36] so not sure if we need one on the router side too?
[14:32:04] I think it's wiser to have a check on the router side
[14:32:15] it's more functional in that sense
[14:32:35] I think I can patch up check_bgp for optional asn description strings in a few minutes...
[14:34:42] ok! I'll send the CR to add more AS, and will ping Alex on it about k8s
[14:35:27] our full private AS# list is on https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS
[14:43:46] https://gerrit.wikimedia.org/r/c/operations/puppet/+/576354
[14:44:15] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2001.codfw.wmnet` - lvs2001.codfw.wmnet (**PASS**) - Downt...
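(On bblack's "just a flag?" question above: roughly yes — graceful restart is a capability, code 64, advertised in the BGP OPEN during session negotiation; its value is 4 bits of flags plus a 12-bit restart time, followed by per-AFI/SAFI flags (RFC 4724 §3). A byte-level sketch — not pybal code, just an illustration of what a patch would need to emit:)

```python
import struct

CAP_GRACEFUL_RESTART = 64  # capability code per RFC 4724

def graceful_restart_cap(restart_time_s, afi_safi_pairs,
                         restarting=False, fwd_preserved=True):
    """Encode the Graceful Restart capability (RFC 4724, section 3):
    4 flag bits + 12-bit restart time, then an (AFI, SAFI, flags)
    triple per address family whose forwarding state may be kept."""
    assert restart_time_s < 4096              # 12-bit field
    flags = 0x8 if restarting else 0x0        # R bit: restart in progress
    value = struct.pack('!H', (flags << 12) | restart_time_s)
    for afi, safi in afi_safi_pairs:
        f = 0x80 if fwd_preserved else 0x00   # F bit: forwarding preserved
        value += struct.pack('!HBB', afi, safi, f)
    # capability TLV (code, length, value), carried in the OPEN's
    # Capabilities optional parameter (RFC 5492)
    return struct.pack('!BB', CAP_GRACEFUL_RESTART, len(value)) + value

# e.g. IPv4 unicast (AFI 1, SAFI 1) with a 300s restart time
print(graceful_restart_cap(300, [(1, 1)]).hex())
```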
[14:51:19] 10Traffic, 10DC-Ops, 10Operations, 10decommission, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Vgutierrez) a:05Vgutierrez→03Papaul
[14:51:32] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez)
[15:01:13] XioNoX: like this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576359/
[15:01:26] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1015.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[15:02:37] cool!
[15:02:42] bblack: tested?
[15:02:58] kinda :)
[15:03:04] not really-tested
[15:03:14] just manually ran similar code on my laptop against artificial input
[15:04:15] should try actually executing a copy of that against a real router
[15:04:40] I can do that
[15:09:17] hmmm maybe I should re-use the critasn description for the short-time warning too
[15:09:33] let me refactor a bit and make that simpler
[15:09:59] ok!
[15:10:34] maybe silly question, but is there a way to download the file directly from gerrit? without having to go through the diff/PS
[15:10:45] not that I know of
[15:11:23] nevermind, the warnings are just for non-critical cases by current logic anyways
[15:11:37] do we have a known case with a down peer somewhere I can check against?
[15:11:48] bblack: down critical?
[15:12:00] down in general, I can fake the critical part in manual testing
[15:12:26] ah found one
[15:12:39] yep, icinga has a few down IX peers - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=2&sortoption=3
[15:13:44] bblack@icinga1001:~$ ./check_bgp -H 208.80.154.197 -c $SNMP_COMM --vendor=juniper --critasn 12909/Telia
[15:13:47] BGP WARNING - AS8001/IPv4: Active (for 16h23m), AS64600/IPv4: Active (for 12m36s)
[15:13:50] bblack@icinga1001:~$ ./check_bgp -H 208.80.154.197 -c $SNMP_COMM --vendor=juniper --critasn 12909/Telia,8001
[15:13:53] BGP CRITICAL - AS8001/IPv4: Active - unknown
[15:13:56] bblack@icinga1001:~$ ./check_bgp -H 208.80.154.197 -c $SNMP_COMM --vendor=juniper --critasn 12909/Telia,8001/ImportantCarrier
[15:13:59] BGP CRITICAL - AS8001/IPv4: Active - ImportantCarrier
[15:14:05] seems to work ok :)
[15:14:43] (ignore the wrong asn numbers there, was from earlier testing)
[15:15:14] no pb!
[15:15:15] but anyways, I think we can push this, and add /labels to all the ones that are crit enough to put in the list
[15:15:40] 10netops, 10Operations: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10CDanis) Happy to help here, e.g. to perform this at an off-peak time in esams/knams.
[15:17:52] XioNoX: I believe you can view a pending patchset in gitiles, since there's a review/xxxxx tag, IIRC
[15:19:17] bblack: you can merge it and I'll check if the checks are happy
[15:19:41] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1015.eqiad.wmnet'] ` and were **ALL** successful.
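(check_bgp itself is Perl, but the `--critasn 12909/Telia,8001/ImportantCarrier` syntax exercised in the test runs above parses to something like the following — a sketch of the equivalent logic, not the actual patch. The `/description` part is optional per entry, which is why the second run above reports "unknown":)

```python
def parse_critasn(arg):
    """Parse '12909/Telia,8001' into {asn: description-or-None}."""
    out = {}
    for entry in arg.split(','):
        asn, _, desc = entry.partition('/')
        out[int(asn)] = desc or None
    return out

crit = parse_critasn('12909/Telia,8001/ImportantCarrier,64600/PyBal')
# a down session on a critical ASN then alerts with its label:
asn = 64600
if asn in crit:
    print(f'BGP CRITICAL - AS{asn}: Active - {crit[asn] or "unknown"}')
```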
[15:22:22] adding the descriptions to https://gerrit.wikimedia.org/r/#/c/576354/
[15:26:20] XioNoX: oh, there's also `git review -d 576354`
[15:43:03] I gotta pack up and move my laptop + me, I'll be back and poke at it in 10-15 mins or so
[15:45:09] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1014.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[15:46:12] 10netops, 10Operations: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10ayounsi) Steps are: # Depool esams # Ssh to the mgmt interface `re0.cr2-esams.mgmt.esams.wmnet` less likely to be impacted by the flaps # run `conf` then `set routing-options graceful-restart` # `commit` #...
[16:05:05] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1014.eqiad.wmnet'] ` and were **ALL** successful.
[16:09:39] XioNoX: it's merged and deployed on icinga1001 (the check_bgp change, not the new descriptions themselves yet)
[16:09:52] bblack: I'm merging the other one now
[16:09:57] ok
[16:13:49] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez)
[16:17:57] 10Traffic, 10Operations, 10Pybal: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (10ayounsi) p:05Triage→03Lowest
[16:18:27] XioNoX: I guess for these router-side alerts, obviously we'll only get that alert if a given router loses all pybals
[16:18:45] but also it's not traffic-class-specific, so we could lose all text-lb pybals but no alert because there's still an upload-lb pybal connected
[16:18:45] Opened https://phabricator.wikimedia.org/T246788 for pybal graceful restart. But unlike what I thought previously, I don't think it should replace the static route.
[16:19:24] bblack: it would alert as soon as 1 BGP session with the matching AS# goes down
[16:19:27] but maybe we don't need to fix what I'm pointing out above, because we could also just move towards all-active/shared where there aren't specific classes isolated to specific pybals (just per-service med prefs)
[16:19:34] XioNoX: oh ok, that's a bit better then!
[16:20:02] it's per neighbor, not per prefix
[16:20:37] right
[16:20:54] I think ideally, we'll reach a state soon where all the lvses are advertising all the prefixes anyways
[16:21:08] and just per-service MEDs are deciding which box gets which traffic when they're all online, but they all back each other up
[16:21:37] (or there's the all-active idea with ECMP, but there's a lot of things left to look at on that, I'm less-sure of it)
[16:23:49] 10netops, 10Operations, 10Wikimedia-Incident: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10RLazarus)
[16:24:10] 10Traffic, 10netops, 10Operations, 10ops-codfw: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) 05Open→03Resolved I think all good now on the interface. Closing this task. Just replaced the transceiver on the switch side. ` Laser...
[16:26:52] maybe I should stop a pybal somewhere, on a backup node, just to see the new alert and test it
[16:30:14] 10Traffic, 10Operations, 10ops-codfw: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (10Papaul) 05Open→03Declined declining this since there is a decommissioning task @ T246756
[16:32:44] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1013.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[16:33:57] I'm not getting anything on crX-eqsin
[16:34:18] maybe icinga config isn't reloading or something
[16:36:07] icinga says: "Things look okay - No serious problems were detected during the pre-flight check"
[16:36:35] I still see it executing the old version of the check though
[16:36:44] nagios 230023 230021 11 16:35 ? 00:00:00 /usr/bin/perl /usr/lib/nagios/plugins/check_bgp -H 103.102.166.130 -c BVK3oVFP --threshold 604800 --vendor juniper --critasn 1299,2914,6461,1257,13030,38930,6908,6939,6453
[16:36:53] vs puppet log output on icinga1001 saying that changed:
[16:37:03] command::Config[check_bgp]/File[/etc/icinga/commands/check_bgp.cfg]/content) - command_line $USER1$/check_bgp -H $HOSTADDRESS$ -c $ARG1$ --threshold 604800 --vendor juniper --critasn 1299,2914,6461,1257,13030,38930,6908,6939,6453
[16:37:07] Mar 3 16:13:10 icinga1001 puppet-agent[32506]: (/Stage[main]/Nagios_common::Commands/Nagios_common::Check_command[check_bgp]/Nagios_common::Check_command::Config[check_bgp]/File[/etc/icinga/commands/check_bgp.cfg]/content) + command_line $USER1$/check_bgp -H $HOSTADDRESS$ -c $ARG1$ --threshold 604800 --vendor juniper --critasn
[16:37:12] 1299/Telia,2914/NTT,6461/Zayo,1257/Tele2,13030/Init7,38930/Fiberring,6908/DataHop,6939/HE,6453/Tata,64600/PyBal,64605/Anycast,64700/frack-eqiad,64701/frack-codfw
[16:37:24] arg
[16:45:19] now that I look more closely at the puppet log
[16:45:37] it looks like after the check_bgp.cfg change, puppet just didn't bother reloading icinga config
[16:46:22] yeah, I only checked that nothing was breaking, but not what exactly was getting called
[16:47:30] it looks like that's maybe-intentional in the puppetization of icinga check_command stuff
[16:49:10] bblack: so it took your new script, but not the .cfg change?
[16:49:27] the cfg change deployed to disk, but puppet didn't trigger an icinga config reload
[16:49:59] the puppetization doesn't trigger that when a check_command file or its config snippet changes, apparently
[16:52:42] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1013.eqiad.wmnet'] ` and were **ALL** successful.
[16:59:32] 10Traffic, 10Operations: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez
[17:04:44] bblack: but does puppet load the actual check_command scripts into memory?
[17:17:59] puppet applies check_command scripts and configs (in this case, the check_bgp perl script + check_bgp.cfg icinga config snippet) to disk
[17:18:17] but it doesn't do a notify/subscribe to get icinga to reload config, like it would for some other similar changes
[17:18:40] you don't need a notify, I guess, for the scripts, but you do for the config snippets
[17:20:39] nagios_common::check_command
[17:20:49] is what handles check_bgp and others
[17:21:11] or I guess actually it's nagios_common::check_command::config
[17:31:08] ohhh ofc, since that part also gets into the config
[17:45:40] vgutierrez: on the buster-lvs topic: our kernels have ip_vs_mh, but I think our ipvsadm userland doesn't have mh support (critically, support for the port and fallback flag strings)
[17:46:08] which are basically like their "sh" cousins, which we're also not using but probably should be.
[17:46:58] so a sane path might be to switch up our tooling for the public sh-based services to use weight=0 for depools with the fallback flag set for the "sh" scheduler
[17:47:24] (I think we have some related ticketry somewhere)
[17:47:59] and then make sure ipvsadm tools are updated, and try out switching to "mh" also with its mh-fallback flag set (maybe manually one DC first, obv)
[17:49:39] once we have mh configured well and working, it would be saner to contemplate imperfect ECMP from the routers into multiple LVSes active/active, too. at the very least, it will be less-disruptive on various kinds of depools and/or pybal restarts.
[17:52:17] I could be wrong about ipvsadm of course, I'm just basing it on the help output
[17:52:35] could be they forgot help output updates :)
[17:53:01] --scheduler -s scheduler one of rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
[17:53:05] ^ no "mh"
[17:55:38] bblack: https://lwn.net/Articles/792617/ ?
[17:56:49] and 1.31 is in sid
[17:59:55] ok, maybe just the help output is defective :)
[18:01:36] dunno what we run in prod, but buster has 1.29, so maybe needs to be backported
[18:33:44] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn)
[18:34:19] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) grafana-labs-admin.wikimedia.org has been removed from DNS in https://gerrit.wikimedia.org/r/c/operations/dns/+/576408 therefore also removed here
[18:37:57] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn)
[18:38:50] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) labmon1001 has been replaced by cloudmetrics1002 and is still hosting grafana-labs and graphite-labs.
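(For reference on the scheduler-flags idea above: with a new enough ipvsadm — 1.30+, per the lwn link — the fallback/port behaviors are passed via --sched-flags, and a weight-0 edit then depools a realserver without moving the hash slots of the others. A sketch of the invocations under those assumptions; the VIP/realserver values are hypothetical and this is untested against our buster builds:)

```python
import subprocess

VIP = '10.2.1.1:443'        # hypothetical service VIP
REALSERVER = '10.64.0.10'   # hypothetical backend

def ipvsadm(*args, dry_run=True):
    """Print (or run) an ipvsadm command; mh needs ipvsadm >= 1.30."""
    cmd = ['ipvsadm', *args]
    print(' '.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # requires root

# mh scheduler with fallback (remap when a server is dead or weight 0)
# and source-port hashing, via --sched-flags (-b):
ipvsadm('-A', '-t', VIP, '-s', 'mh', '-b', 'mh-fallback,mh-port')
# depool by weight=0 rather than removal, so existing hash mappings
# for the remaining servers stay stable:
ipvsadm('-e', '-t', VIP, '-r', REALSERVER, '-w', '0')
```

(The sh scheduler takes the analogous `sh-fallback,sh-port` flags, matching the weight=0 depool approach floated at 17:46 above.)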
[19:01:00] 10netops, 10Operations, 10fundraising-tech-ops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10Jgreen)
[19:18:27] 10Traffic, 10Operations, 10Pybal: Minor fixes in pybal checks - https://phabricator.wikimedia.org/T246431 (10Dzahn) p:05Triage→03Medium
[20:33:12] 10netops, 10Operations, 10fundraising-tech-ops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10Jgreen) 05Open→03Stalled
[20:55:56] I checked for any wikipedia things on the LE CAA cert revocation list btw, nothing came up
[20:56:41] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn)
[20:57:12] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) grafana-labs and graphite-labs have switched to TLS now. [x] cloudmetrics1002.eqiad.wmnet - http://grafana-labs.wikimedia.org http://graphite-labs.wikimedia.org
[22:40:32] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Papaul) ` [edit interfaces interface-range disabled] member ge-5/0/5 { ... } + member xe-2/0/47; [edit interfaces xe-2/0/47] - description lvs2001:eno1...
[22:40:54] 10Traffic, 10DC-Ops, 10Operations, 10decommission, 10ops-codfw: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Papaul)