[03:05:27] Traffic, Citoid, Editing-team, RESTBase Sunsetting, and 2 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10627058 (Ryasmeen)
[10:46:52] netops, Infrastructure-Foundations, Observability-Alerting: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641 (ayounsi) NEW p:Triage→Low
[10:53:31] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642 (ayounsi) NEW
[10:54:01] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627856 (ayounsi)
[10:54:44] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642#10627863 (ayounsi)
[10:54:47] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627862 (ayounsi)
[10:56:39] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (cmooney)
[10:56:42] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627895 (cmooney)
[10:59:25] FIRING: SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp4044:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:04:25] RESOLVED: SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp4044:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:21:49] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10627951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs6003.drmrs.wmnet with OS bookworm
[11:41:03] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642#10628022 (ayounsi) Open→Resolved a:ayounsi Chatted about it with Cathal on IRC, the gNMIc deamon just needed a restart.
[11:53:31] topranks: less noise here
[11:53:34] topranks: list_destroy(): In-progress: non-empty list (1);
[11:57:53] ok
[11:58:04] what is that list_destroy() from?
[11:58:17] I see these error-level syslogs which I think triggered the syslog alert
[11:58:17] https://logstash.wikimedia.org/goto/71a4c9e2ea26417c13677f7e6d6d362b
[11:58:49] hmm is that a bgp daemon crash?
[11:58:56] on the switch? no
[11:59:57] I'm not sure exactly what it means here, but essentially it got something it didn't expect from the remote side
[12:00:04] some sessions are up weeks so ok in general
[12:00:15] https://phabricator.wikimedia.org/P74204
[12:00:23] This is Liberica/Bird on the host side right?
[12:00:26] sorry gobgp ?
[12:00:40] so it's being reimaged as we speak
[12:00:46] pybal went away
[12:00:52] and liberica with gobgp appeared
[12:01:13] ok
[12:01:15] lvs6003?
[12:01:19] yes
[12:03:20] I'm not finding much about the error
[12:04:16] anyway things seem to be ok, perhaps a quirk with the junos on this platform it logs like that when the device becomes unreachable
[12:04:37] I don't think we need to worry much anyway, back up and stable, the root cause is known and expected
[12:04:46] even if the additional alert perhaps not
[12:04:53] https://www.irccloud.com/pastebin/NkJpbunu/
[12:07:43] vgutierrez: I note the switches in drmrs are running the oldest version of JunOS we have for that platform (qfx5120)
[12:07:50] and none of the other sites are running that same version
[12:08:01] so it may be a quirk in that release of the OS it logs these additional msgs
[12:10:12] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs6003.drmrs.wmnet with OS bookworm completed: - lvs6003 (**PASS**) - Downtimed on...
[12:12:25] topranks: nice finding :)
[12:12:28] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628111 (Vgutierrez)
[14:23:43] o/ once the backport window is finished, would there be objections to me merging this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123625 it routes PUTs to the write DC similar to POSTs (which we noticed happened during the switchover live test)
[14:24:15] hnowlan: go ahead :)
[14:27:01] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628755 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6160b7b2-7281-4c01-a4ad-0c0ebed8103d) set by vgutierrez@cumin1002 for 0:30:00 on 1 host(s) and their services with reas...
[14:33:04] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628818 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs6002.drmrs.wmnet with OS bookworm
[14:34:34] Traffic, observability, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680 (ssingh) NEW
[14:34:55] Traffic, observability, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628842 (ssingh) p:Triage→Medium