[07:41:14] The Telia link between cr1 eqiad/codfw seems up, but OSPF is down
[07:41:21] for both ipv4 and ipv6 afaics
[07:42:56] and the laser rx/tx look ok on both sides
[07:58:06] mmmm xe-5/2/1 on cr1-codfw flapped a lot
[07:58:23] looking
[07:58:49] thanks!
[07:58:57] there is also a BGP session down
[07:59:05] no, thank you!
[07:59:33] can you RTFM me when you find the issue? :D
[08:00:38] elukey: cr1-codfw> ping 208.80.153.220 source 208.80.153.221
[08:01:05] if there is a link, that means the physical layer is good (and the light)
[08:01:18] so from there I'd just open a Telia ticket
[08:01:36] elukey: which bgp session?
[08:02:06] XioNoX: on cr1-codfw, BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Connect - Telia
[08:03:01] cr1-codfw> ping 80.239.192.101 source 80.239.192.102
[08:03:04] same thing
[08:03:17] I guess both links terminate on the same equipment
[08:03:42] ok, so something is blackholing traffic on their end
[08:04:10] yep
[08:04:13] it is the same link that suffered issues from the Texas weather of the last few days
[08:05:05] anyway, thanks for the info :)
[08:05:24] elukey: you can also do stuff like cr1-codfw> monitor traffic interface xe-5/3/1
[08:06:03] that only shows traffic to/from the router's RE, not everything transiting
[08:06:37] ah interesting!
[08:43:28] heads-up: cumin1001 will be rebooted in ~5 minutes, you can use cumin2001 in the meantime
[08:56:15] cumin1001 is available again
[08:57:03] thx moritzm
[09:06:24] elukey: hahaha, should have looked at the calendar, I only looked at the mailbox
[09:06:43] they did schedule it, but didn't send a "start of maintenance" email
[09:07:01] ah nice!
[09:08:12] also they don't send "start" emails
[09:08:14] so that's why
[09:19:54] _joe_: can i trick you into reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/665324?
i mostly want to be sure i'm not missing something obvious re: systemd units
[09:20:17] <_joe_> kormat: I have quite the backlog of requests right now, so it will be delayed a bit
[09:20:23] <_joe_> but sure :)
[09:20:28] that's no problem, there's no rush.
[09:42:20] who handles cloudelastic instances? it looks like nginx is having a hard time reloading TLS material there
[09:44:08] vgutierrez: I'd assume cloud services
[09:49:54] these are handled by Search
[09:54:42] volans: just to double-check: it's normally expected to remove all references to a host from puppet before running `sre.hosts.decommission`, correct?
[09:55:08] kormat: nope, after
[09:55:19] wut
[09:55:46] it's expected to have it out of production, whatever that means for that specific host/services
[09:55:49] the help output says it checks for any reference, and then requires confirmation before proceeding
[09:56:11] yes, it's a reminder for you to know where it's referenced in puppet/dns
[09:56:38] to make sure it's not still pooled, active, or used in some config file for others to use/ping/connect/etc.
[09:57:10] it's expected to have some output there
[09:57:23] ok. thanks
[09:57:33] it's up to the user to know it's all good to proceed, and also a reminder of where to remove it afterwards
[12:15:45] dcaro: volans: Hey, re: T275488, I'm unsure what's requested. Do you want a project called "spicerack" as a sub-project of #sre-tools? Or as a top-level project, like #decommission-hardware?
[12:15:51] T275488: Create project tag for Spicerack - https://phabricator.wikimedia.org/T275488
[12:35:44] dcaro: I'm at lunch, will reply later.
TL;DR I'm not even sure it's needed, as spicerack is already an alias that points to SRE-tools, and I'm not sure having another workboard would actually help instead of just adding more fragmentation
[13:52:51] I've replied to the task and pinged interested people
[14:02:12] vgutierrez: as the last person to log into acmechief-test1001, do you know anything about the cron emails it's sending every hour? :)
[14:03:20] y
[14:06:58] kormat: nope, but I can check :)
[14:08:22] fwiw I had sent an email this morning... but probably valentin has a rule that sends all my email to the trash directly :-P
[14:09:15] volans: the gmail threading sent you to the Cron tag
[14:10:59] it was probably a Fwd of the last one
[14:11:01] yes
[14:19:35] hmm, weird issue with the LE staging environment OCSP responder
[15:07:11] XioNoX: o/ ok if I merge+deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/665814 ?
[15:08:49] yep
[15:16:32] thanksss
[15:18:35] kormat: it seems like it's triggered by https://community.letsencrypt.org/t/staging-maintenance-and-database-reset-2-march-2021/145957
[15:18:43] even if CURRENT_TIMESTAMP < March 2nd
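The Icinga alert quoted at 08:02:06 packs both Telia BGP sessions into a single string. A minimal sketch of pulling per-session state out of such a message (this is a hypothetical helper, not actual WMF monitoring code; the alert format is assumed from the one example in the log, where anything other than Established means the session is down):

```python
import re

# Assumed alert shape, from the 08:02:06 message:
#   "BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Connect - Telia"
ALERT_RE = re.compile(r"AS(\d+)/(IPv[46]): (\w+) - (\w+)")

def parse_bgp_alert(alert: str) -> list[dict]:
    """Return one dict per BGP session mentioned in the alert text."""
    sessions = []
    for asn, family, state, peer in ALERT_RE.findall(alert):
        sessions.append({
            "asn": int(asn),
            "family": family,
            "state": state,  # e.g. Active, Connect, Established
            "peer": peer,
            # Only Established means the session is actually up
            "healthy": state == "Established",
        })
    return sessions

if __name__ == "__main__":
    alert = ("BGP CRITICAL - AS1299/IPv6: Active - Telia, "
             "AS1299/IPv4: Connect - Telia")
    for s in parse_bgp_alert(alert):
        flag = "" if s["healthy"] else "  <-- down"
        print(f"{s['peer']} AS{s['asn']} {s['family']}: {s['state']}{flag}")
```

With both sessions in Active/Connect (the BGP FSM states before Established), this flags both as down, matching the diagnosis above that Telia was blackholing traffic on their end.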