[09:28:40] Traffic, DC-Ops, ops-esams, ops-magru, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10566946 (elukey) @BCornwall the easiest way is probably to use test-cookbook on a cumin host, using a depooled magru cp node as target. Once we are sure that the se...
[09:34:54] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566953 (fgiunchedi) >>! In T384731#10565308, @cmooney wrote: >>>! In T384731#10563685, @ayounsi wro...
[09:45:19] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566958 (cmooney) Thanks for the update @fgiunchedi > >! In T384731#10566953, @fgiunchedi wrote: >>...
[11:54:25] Traffic, Data-Persistence, Patch-For-Review: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564#10567253 (Vgutierrez) Open→Resolved a:Vgutierrez services have been migrated to IPIP encapsulation and maglev successfully: `name=eqiad vgut...
[12:20:20] o/ I have an ATS gateway-check change to route traffic to PCS for testwiki only. Would it be cool for me to roll it out? https://gerrit.wikimedia.org/r/1121350
[12:20:31] happy to wait if the lvs work at the moment is ongoing
[12:34:03] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567389 (cmooney) I ran gnmic in debug mode on netflow1002 but nothing is jumping out at me as a problem, at least on a basic review of the logs. One thing I do notice, and...
[13:08:47] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567530 (cmooney) Also fwiw I grabbed the same stats for 24 hours from both prometheus servers, and compared the total stats. In total there are 115 gaps in the data, 68 of...
[13:20:52] hnowlan: I would say it's pretty safe to proceed but to be extra sure I summon vgutierrez
[14:02:21] hnowlan: feel free to merge; no other work is ongoing
[14:22:49] Traffic, Maps, SRE, Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10567788 (ssingh) @MSantos: Any update on this? Thanks!
[14:35:49] fabfur, sukhe: thanks!
[14:52:17] change looks okay, enabling puppet
[14:52:34] thanks!
[16:42:01] Traffic, Patch-For-Review: liberica control plane fails to upgrade metrics on administrative depool of a realserver - https://phabricator.wikimedia.org/T386785#10568428 (Vgutierrez) Open→Resolved fixed: ` vgutierrez@lvs4008:~$ curl -s lvs4008:3003/metrics |grep _pooled_realservers |grep textl...
[17:08:25] FIRING: [2x] SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs4008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:33] sigh...
[17:09:48] liberica user can't write on /var/lib/prometheus/node.d :(
[17:10:08] oh no
[17:10:16] drwxrwx--- 2 prometheus prometheus-node-exporter 4.0K Feb 20 17:10 node.d
[17:10:22] root it is then
[17:10:39] yeah...
[17:13:39] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1121401
[17:13:44] +1
[17:13:46] <3
[17:18:25] FIRING: [4x] SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:20:59] recoveries should come soon
[17:23:25] FIRING: [4x] SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:33:25] RESOLVED: [4x] SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:16] Traffic, DC-Ops, ops-esams, ops-magru, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10568651 (BCornwall)
[20:05:53] Traffic: Adopt trafficserver UDS support - https://phabricator.wikimedia.org/T386970 (Vgutierrez) NEW
[20:06:30] Traffic: Adopt trafficserver UDS support - https://phabricator.wikimedia.org/T386970#10569239 (Vgutierrez) p:Triage→Medium