[07:11:13] one linecard on cr1-codfw failed (cleanly, no outage) but until it's fixed we don't have any router redundancy over there
[07:12:43] of course it failed Friday at 11pm my time when I'm far from a laptop
[07:22:13] next steps are probably 1/ prepare CR to depool the site, I'm not expecting cr2- to fail, but we never know
[07:22:53] and 2/ follow up with juniper so they send us a replacement part
[07:24:23] bblack: mark: paravoid ^
[07:26:16] both need a laptop and I won't see mine before Monday evening
[07:28:08] we could potentially try to powercycle the linecard too, but then there is the risk that it fails in a bad state and causes a real outage
[07:28:17] hey
[07:32:31] hello!
[07:33:07] weird
[07:33:17] so yeah, what I gathered here and in the text are what I can figure out from my phone without ssh
[07:33:19] so it's FPC0 that allegedly failed but i see lots and lots of instability around fpc5
[07:33:43] and fpc0 currently seems up?
[07:34:09] the et- interfaces are up, but the AE that are supposed to be made of only one et- each are down
[07:35:25] indeed
[07:35:39] fpc0 is the one with the et- links to the switches
[07:35:56] yeah
[07:37:35] lots of instability on multiple xe-5/ ports
[07:39:33] Telia troubleshoot their transit link earlier today and it's a xe-5 interface, maybe related
[07:39:54] but I don't know if it's related to the fpc0
[07:40:29] yeah a lot of it seems telia
[07:40:34] you did a commit confirmed yesterday
[07:40:39] what was that?
[07:40:57] disabling the bgp session to telia
[07:41:25] we're investigating interface error on telia transit
[07:41:34] right
[07:41:49] https://phabricator.wikimedia.org/T222967
[07:47:13] May 25 06:17:56 re0.cr1-codfw fpc0 XMCHIP(0): HOSTIF: Protect: Log Error 0x1, Log Address 0x3774, Multiple Errors 0x0
[07:47:13] May 25 06:17:56 re0.cr1-codfw tnp.tftpd[92395]: TFTPD_CONNECT_INFO: TFTP write from address 16 port 34 file /var/tmp/pfe_debug_info_NPC0
[07:47:13] May 25 06:17:56 re0.cr1-codfw /kernel: ae_bundlestate_ifd_change: bundle ae3: bundle IFD minimum links not met 0 < 1
[07:49:00] it's like the lacp daemon or chip thinks the interface is down while it's not
[07:49:53] due to lacp timeouts
[07:50:11] May 25 06:17:56 re0.cr1-codfw lacpd[1698]: LACP_INTF_DOWN: ae3: Interface marked down due to lacp timeout on member et-0/2/0
[07:51:13] i think i may disable ospf on those links, so if it comes up it won't immediately carry traffic - vrrp is already standby
[07:51:19] and then maybe try a linecard reboot
[07:51:46] but since things are stable now, give me a few minutes, i just woke up
[07:52:47] and I need to go to sleep soon, it's 1am and I'm waking up early tomorrow :(
[07:57:42] right
[07:57:49] we can also choose to do nothing at all, which may be the safest of all
[07:57:57] it's a stable situation as long as cr2 is ok
[07:58:18] and any reboot of fpc0 we could do later as well
[08:00:14] i can try to file an RMA request
[08:01:58] It's a long weekend in the US, I don't know how that will impact the rma or replacing the part if it gets there before Tuesday
[08:02:33] yeah probably not before tue if then
[08:02:52] it's in active contract, that's a good start ;)
[08:03:07] hahah yeah :)
[08:03:23] would they even warrant the RMA from just these messages?
[08:03:56] i can try, not sure how much we can followup on debugging later today
[08:04:09] they usually ask automatically for a request system information
[08:04:58] if it's just that I can provide that
[08:06:27] thankfully we don't often have failing fpc so I don't know what their runbook is
[08:07:00] for a power supply they don't ask for anything else, but it's not the same cost :)
[08:07:01] yeah
[08:07:02] we'll see
[08:07:06] worst case nothing comes out of it until next week
[08:07:33] yeah I can take care of it Monday night
[08:08:01] how do I know which out of ~6 contracts I need? :P
[08:10:04] I don't know, I only have one in the dropdown
[08:12:03] which one is it? what nr?
[08:12:11] i probably need that one :P
[08:12:21] mine may be super old hehe