[00:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [00:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [04:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [04:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [08:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [08:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [11:07:40] 10homer, 10SRE-tools, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#10779851 (10ayounsi) Another tiny improvement would be to only prompt for yes/no when there is only 1 target device. [11:19:57] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001 (10ayounsi) 03NEW [12:21:06] 10netops, 06Infrastructure-Foundations, 13Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10780029 (10ayounsi) While the equivalent of this is needed on the pfw: ` [edit security zones security-zone production interfaces lo0.0 host-inbound-traffic system-ser... [12:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [12:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [14:48:16] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10780708 (10RobH) New remote hands entered to get this fixed: Case Order #01053614 > Directions for remote hands to repair our link between cr3 an... [15:01:03] Hey IF, there's a new KSM feature in Bookworm that's not available in Bullseye: https://devkernel.io/posts/ksm_smart_scan/ . I've been playing around with it in my home lab and it seems to conserve more memory than the old approach [15:01:50] just throwin' it out there if y'all want to play around with it. We'll probably try it whenever we (DPE SRE) build that small ganeti cluster we talked about in email [15:07:26] interesting inflatador, there was a recent lwn article about KSM, https://lwn.net/Articles/1016426/ [15:13:35] 👀 [15:20:05] Selective KSM seems like an awesome idea [15:22:51] jhathaway: hi! I wanted to set up some time to talk about the Puppet side of planning for mini pops. I am guessing Riccard.o was interested as well so I will find some time for when he is back [15:23:08] any preference for a day next week? week of May 5 [15:23:19] sounds good, no preference from me [15:23:37] after our discussion, I will add a dedicated section in the doc [15:23:40] ok thanks! [15:25:12] * inflatador wonders if we should get a group subscription to LWN, re: https://lwn.net/Articles/1019217/ [15:27:41] we already enable KSM in the ganeti class, but it was done a decade ago, so not sure if there's currently room for improvements [15:29:42] and I can only recommend to sign up for LWN, I'm a subscriber since over 20 years, it's one of the few tech publications which are doing truly good work [15:31:10] inflatador: you should feel free to use the continuing education subsidy benefit WMF offers to subscribe to LWN [15:31:54] cdanis ACK, will do that for sure [15:56:30] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10780975 (10elukey) I filed https://gerrit.wikimedia.org/r/c/operati... [16:16:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [16:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [18:48:48] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:53:48] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:22:13] 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops, 06SRE: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10781850 (10ABran-WMF) [20:16:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:38] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [20:24:38] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [20:28:48] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:33:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:23:27] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782143 (10RobH) a:05RobH→03cmooney @cmooney : > "Created by: mmariscalmata The following has been completed: > > Retrieve package #1... [21:27:07] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782147 (10cmooney) Thanks @RobH. It looks good so far, this is the graph we need to keep an eye on: https://grafana.wikimedia.org/goto/SVEEkIbHR... [23:16:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed