[00:05:52] 10Traffic, 10Operations: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Dzahn)
[00:06:01] 10Traffic, 10Operations: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Dzahn) p:05Triage>03Normal
[02:35:02] 10Traffic, 10Operations: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) Yeah I got busy and dropped this. Console was unresponsive initially. Reboot produced a responsive console, but wasn't able to initially ssh into the host (and no icinga recovery). With the fresh reboot, eth0 has no...
[02:35:55] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack)
[02:52:01] bblack: anything I need to do about https://phabricator.wikimedia.org/T210683 lvs1006? port looks down on the switch side, I'd say next step is for Chris to try a different port/cable/etc
[02:54:25] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10ayounsi) a:03Cmjohnson Port looks down (but not disabled) on the switch side, I'd say next step is for Chris to try re-seating, then a different cable/ports/etc.
[02:59:17] XioNoX: yeah, probably, he can look at it tomorrow
[02:59:28] XioNoX: did you see the email about the 3x maintenances overlapping?
[02:59:31] I reassigned it to Chris
[02:59:37] bblack: yeah I replied 5min ago :)
[02:59:44] almost forgot, thanks for the heads-up
[03:00:09] ok
[03:00:10] let me know if you agree with the conclusion of not depooling anything
[03:00:28] yeah seems fine, assuming we don't get any weird alerts overnight :)
[05:52:25] 10Traffic, 10Operations: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (10Reedy) Same IP going back, 161.30.203.16 ` Reedys-MacBook-Pro:~ reedy$ dig +short reflect.wikimedia.org 161.30.203.0 ` But it does seem to be going to ulsfo now, I guess a...
[09:09:47] 10Traffic, 10Operations, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[11:39:53] 10Traffic, 10Operations: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Aklapper) Let's close as declined as no one can reproduce anymore?
[12:14:30] 10Traffic, 10Operations: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Gilles) 05Open>03declined Sure, 'till next time ;)
[12:57:47] https://blog.cloudflare.com/know-your-scm_rights/
[13:17:41] interesting
[13:17:53] iirc gdnsd 3 uses SCM_RIGHTS
[13:54:11] it does :)
[13:55:19] in gdnsd's case, it's used to hand off live listening sockets from one instance of the daemon to the next
[13:55:36] so that you can restart for code upgrades or config changes and not have any service interruption
[13:56:34] like they say, it's a very powerful tool for solving all kinds of socket-related issues in designs
[13:57:00] it's kind of unfortunate that it's tucked away in a strange legacy interface with weird semantics and unportable limits, etc
[13:58:11] (note they don't define the value of NUM_FD in their code snippets, that's the unportable part)
[13:59:29] gdnsd's handling of that issue is: https://github.com/gdnsd/gdnsd/blob/master/src/cs.h#L78
[13:59:41] which seems to work for Linux and FreeBSD, and hopefully everyone else at least allows 32 :P
[14:02:28] I guess you could alternatively probe the limit at runtime by sending test FDs to yourself over SCM_RIGHTS and binary-searching for some limit (e.g. over the rand 1-1024 or something).
[14:02:38] s/rand/range/
[14:02:51] I'll bother with that if we find a platform that can't do 32 heh
[14:03:09] (that someone actually uses gdnsd on, which limits the scope a bunch!)
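[editor's note] The SCM_RIGHTS handoff discussed above can be sketched in a few lines. This is a minimal illustration, not gdnsd's actual C implementation: the sender passes an open file descriptor over a Unix-domain socket, and the receiver gets a fresh descriptor number referring to the same open file description. It uses `socket.send_fds()`/`recv_fds()`, the Python 3.9+ convenience wrappers around `sendmsg()`/`recvmsg()` with an SCM_RIGHTS ancillary message.

```python
import os
import socket
import tempfile

# Minimal SCM_RIGHTS sketch: hand an open fd from one end of a Unix
# socketpair to the other, in the spirit of gdnsd's listening-socket
# handoff between daemon instances.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in for a live listening socket: a plain temp file keeps the
# sketch self-contained.
with tempfile.TemporaryFile() as f:
    f.write(b"handed off")
    f.flush()

    # At least one byte of normal data must accompany the ancillary fds.
    socket.send_fds(sender, [b"x"], [f.fileno()])

    # recv_fds returns (data, fds, msg_flags, address); the third
    # argument is the max number of fds we'll accept (cf. the NUM_FD /
    # hardcoded-32 portability discussion above).
    data, fds, msg_flags, addr = socket.recv_fds(receiver, 1, 32)

    os.lseek(fds[0], 0, os.SEEK_SET)  # file offset is shared with f
    payload = os.read(fds[0], 64)
    os.close(fds[0])

sender.close()
receiver.close()
print(payload)  # b'handed off'
```

The received descriptor is a new number in the receiving context but shares the open file description (including the file offset), which is why the explicit `lseek` is needed after the sender's write.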
[15:11:05] so it looks like we have some scenarios where we need to let other groups besides root read the private key of the TLS certificate, like for lists.wm.o and exim
[15:11:07] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476510/
[15:11:15] bblack, Krenair: does this make sense to you?
[15:12:36] the use case is already linked in the commit message but it's basically this one: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/339477/1/modules/role/manifests/lists/server.pp
[15:13:59] yes
[15:22:13] yup!
[15:25:51] and accidentally we've checked that the icinga checks related to cert sync between certcentral instances work as expected O:)
[15:27:11] keyholder got disarmed... instance reboots or something?
[15:27:26] indeed
[15:27:54] one thing that's hard to really validate in our real-world deployment, until the time comes
[15:27:57] moritzm pinged me, but with another L8 issue on my side I forgot about keyholder
[15:28:12] is what happens when we reach the appropriate cert expiry times and CC re-issues an existing cert?
[15:28:36] yeah... that's going to be interesting :)
[15:28:43] I'm kind of assuming it will go smoothly, but who knows, we should keep an eye out a couple months from the first cert :)
[15:29:04] yeah let's not go for the unified cert until we've seen that process go smoothly in prod :p
[15:29:32] I never looked, but I assume CC (a) renews fairly early (e.g. ~60/90 days passed), and (b) swizzles renewal time a bit so they don't all happen exactly N days after initial issue?
[15:29:53] yeah.. I've got it on my calendar, the first will be librenms
[15:31:14] I guess swizzling isn't super important if it's not there, esp with retries. I think I had some kind of deterministic swizzle in the old one, just to avoid chances of renewal alignments triggering ratelimits or whatever.
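[editor's note] The renewal-timing arithmetic under discussion can be sketched as follows. The names here are illustrative, not CertCentral's actual API; the 30-day renewal period and the "January 17th @ 15:50 UTC" librenms figure come from the chat, and the issuance date is an assumption chosen to be consistent with them (a 90-day Let's Encrypt cert renewed 30 days before expiry is re-issued about 60 days after issuance).

```python
from datetime import datetime, timedelta

# Renew a fixed period *before* expiry (the CertCentral default
# discussed in the chat is 30 days).
RENEWAL_PERIOD = timedelta(days=30)

def renewal_time(not_after: datetime) -> datetime:
    """Earliest time the cert would be re-issued (hypothetical helper)."""
    return not_after - RENEWAL_PERIOD

# Illustrative issuance date consistent with the librenms figure:
# issued 2018-11-18 15:50 UTC, expiring 90 days later (typical ACME
# cert lifetime).
issued = datetime(2018, 11, 18, 15, 50)
expires = issued + timedelta(days=90)

print(renewal_time(expires))  # 2019-01-17 15:50:00
```

This also shows why a deterministic or randomized "swizzle" would be added on top: without one, certs issued in a batch all hit `renewal_time` in the same batch N days later.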
[15:31:24] (although now, I kind of question whether that even makes logical sense)
[15:31:38] DEFAULT_RENEWAL_PERIOD = timedelta(days=30)
[15:32:12] that means default renewal 30 days after issue, or 30 days before expiry?
[15:32:20] before expiry
[15:32:24] ok cool
[15:33:21] so January 17th @ 15:50 UTC for librenms
[15:33:35] and we will have a test run before that with pinkunicorn.wikimedia.org
[15:34:33] January 14th @ 04:51 UTC... --> insomnia sucks
[16:04:23] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) p:05Normal>03High
[16:24:19] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) @bblack @ayounsi sfp-t was bad, replaced and the link is up
[16:27:44] 10Traffic, 10netops, 10Operations, 10ops-eqiad: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) 05Open>03Resolved
[19:49:07] 10netops, 10Operations: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 (10ayounsi) 05Open>03Resolved a:03ayounsi Removed!
[20:07:04] We now have extra monitoring for Juniper virtual chassis links: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Juniper+virtual+chassis+ports
[20:11:32] nice!
[20:15:46] cool
[20:16:19] I have a half-baked check_vrrp check in my ~
[20:16:39] do you want to re{vive,view,write}/merge this as well?
[20:19:39] also a py version of check_jnx_alarms
[20:20:14] paravoid: yes, and check_jnx_alarms is already working so I care less :)
[20:20:15] https://tty.gr/s/check_vrrp.py + http://tty.gr/s/check_jnx_alarms.py
[20:20:34] I was playing around at the time with nagiosplugin which is...
a little weird to use
[20:20:39] I haven't made up my mind about using it yet
[20:20:45] these should work though
[20:21:49] paravoid: also I have check_bfd ready, but the Bird instances will alert until https://phabricator.wikimedia.org/T209989 is solved, so I'm not sure if it's better to wait, or to merge it and ACK the Bird alerts (so we have alerting for the other devices at least)
[20:22:14] we have bird deployed?
[20:27:51] paravoid: it's part of the anycast test/PoC, some docs at https://wikitech.wikimedia.org/w/index.php?title=Anycast_recursive_DNS / https://phabricator.wikimedia.org/T186550
[20:29:04] uh ok
[20:29:07] I know nothing about all that
[20:31:32] the tasks probably need an update; that task above has no comments at all, just a generic task description
[20:32:24] yeah, I think there are other tasks as well for the same thing
[20:32:38] even more reason to clean up :)
[20:33:17] anycast_healthchecker
[20:33:22] is... eh
[20:33:33] why aren't we using pybal for this?
[20:34:12] yeah, stuff started with https://phabricator.wikimedia.org/T98006
[20:34:37] and then a subtask got created for rdns
[20:34:54] none of these tasks mention a deployment
[20:34:55] or bird
[20:35:00] as far as I can see :)
[20:37:57] we had this debate once before I think :)
[20:40:41] (and got nowhere conclusive, IIRC)
[20:41:26] I don't remember it (but I'm not disputing it either!)
[20:41:43] I'll circle back to this in a few, after this meeting
[20:41:52] but I'd like to ask that we capture these kinds of things in phab, as well as track e.g. a deployment there
[21:10:46] yeah, I think that's fair. some of this has just fallen into the cracks of the task saying that we're "exploring other routes", and this one route to a solution has now been explored pretty heavily, to the point of test deployments to see how it fares.
[21:11:18] but it's also dragged out over a long period because it's not a Goal-level priority for anyone, and we're still not really at a decision point on anything.
[21:12:35] the bird-approach vs pybal (with or without LVS) debate was the one we had before. It's deep in terms of all kinds of tradeoffs (including which kinds of failure modes we're likely to handle better)
[21:12:51] but I don't have the energy in me to re-ramble it all at present!
[21:21:08] 10netops, 10Operations, 10ops-eqiad: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) p:05Triage>03High
[21:36:14] 10netops, 10Operations, 10ops-eqiad: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) 05Open>03Resolved a:05Cmjohnson>03ayounsi That was actually an unused port.