[00:32:35] !log [WDQS] Restarted blazegraph on `wdqs101[1,3]`
[00:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:40] !log [WDQS] Restarted blazegraph on `wdqs1014` as well. all 3 hosts were deadlocked
[00:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1240830
[00:39:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1240830 (owner: TrainBranchBot)
[00:53:52] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1240830 (owner: TrainBranchBot)
[00:56:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[01:08:51] (CR) ArielGlenn: "This looks clearer for people skimming through the tests, and that's a good thing. Left some style comments for you." [deployment-charts] - https://gerrit.wikimedia.org/r/1239972 (owner: Daniel Kinzler)
[01:09:12] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1240831
[01:09:12] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1240831 (owner: TrainBranchBot)
[01:11:51] SRE, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team, MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11634933 (Rtconner) Yes mine is working good too thank you.
[01:21:00] (PS1) BryanDavis: Revert "extension-list: add a bogus extension to test l10n-update" [mediawiki-config] - https://gerrit.wikimedia.org/r/1240832 (https://phabricator.wikimedia.org/T411516)
[01:27:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1240832 (https://phabricator.wikimedia.org/T411516) (owner: BryanDavis)
[01:34:55] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1240831 (owner: TrainBranchBot)
[01:36:09] jouncebot: nowandnext
[01:36:10] No deployments scheduled for the next 5 hour(s) and 23 minute(s)
[01:36:10] In 5 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T0700)
[01:39:41] (CR) Zabe: [C:+2] Start reading from new file tables on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1239497 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:40:37] (Merged) jenkins-bot: Start reading from new file tables on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1239497 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:41:48] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1239497|Start reading from new file tables on mediawikiwiki (T416548)]]
[01:41:52] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[01:44:00] !log zabe@deploy2002 zabe: Backport for [[gerrit:1239497|Start reading from new file tables on mediawikiwiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:45:03] !log zabe@deploy2002 zabe: Continuing with sync
[01:48:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[01:49:05] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1239497|Start reading from new file tables on mediawikiwiki (T416548)]] (duration: 07m 17s)
[01:49:09] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[02:08:21] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:39] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:39] (PS1) Dwisehaupt: Add spf records for civicrm and frmx hosts [dns] - https://gerrit.wikimedia.org/r/1240834 (https://phabricator.wikimedia.org/T417958)
[02:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:40] SRE, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team, MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11635028 (Legoktm) Was any integration or regression test added for this?
[03:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:23:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:24:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:43:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[05:54:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:55:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status -
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:00:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:08:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:10:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:14:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down -
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:16:24] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:17:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:21:24] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:58:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T0700)
[07:03:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:13:32] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.sync-instances sync Gerrit data from gerrit2003.wikimedia.org to gerrit1003.wikimedia.org
[07:18:02] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.sync-instances (exit_code=0) sync Gerrit data from gerrit2003.wikimedia.org to gerrit1003.wikimedia.org
[07:18:21] FIRING: [4x] CoreRouterInterfaceDown: Core router
interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:19:17] (PS1) Muehlenhoff: Run IDM spec tests on Bookworm/Trixie [puppet] - https://gerrit.wikimedia.org/r/1240840
[07:20:40] (PS1) Muehlenhoff: lvs: Run spec tests on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1240841
[07:24:41] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:25:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[07:26:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:29:21] (CR) Arnaudb: [C:+2] gerrit: resume replication on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1240689 (https://phabricator.wikimedia.org/T417246) (owner: Arnaudb)
[07:29:41] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:31:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord
(208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:33:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:33:55] (PS1) Muehlenhoff: Remove access for mobrovac [puppet] - https://gerrit.wikimedia.org/r/1240842
[07:35:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[07:36:39] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:37:46] FIRING: [7x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[07:37:57] FIRING: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable
[07:38:42] FIRING: [2x] ProbeDown: Service
gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit2003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:42:46] RESOLVED: [4x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable
[07:42:51] RESOLVED: [14x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable
[07:43:42] RESOLVED: [2x] ProbeDown: Service gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit2003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:43:55] (PS2) Muehlenhoff: Remove access for mobrovac [puppet] - https://gerrit.wikimedia.org/r/1240842
[07:47:07] (CR) Muehlenhoff: [C:+2] Remove access for mobrovac [puppet] - https://gerrit.wikimedia.org/r/1240842 (owner: Muehlenhoff)
[07:47:24] (CR) Brouberol: [C:+1] "LGTM!"
[cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: Ryan Kemper)
[07:49:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:52:54] (CR) Ryan Kemper: [C:+2] hadoop.reboot-workers: make host override smarter [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: Ryan Kemper)
[07:55:32] (PS9) Brouberol: elasticsearch_cluster: allow checking last reboot [software/spicerack] - https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[07:55:37] SRE, Data-Platform-SRE (2026-02-13 - 2026-03-06), Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11635286 (RKemper) `an-test-worker*` done
[07:57:29] (PS1) Muehlenhoff: Remove dotfiles of three absented users [puppet] - https://gerrit.wikimedia.org/r/1240847
[07:58:15] (Merged) jenkins-bot: hadoop.reboot-workers: make host override smarter [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: Ryan Kemper)
[07:59:08] (PS1) Arnaudb: gerrit: gerrit-spare lfs-sync enable [puppet] - https://gerrit.wikimedia.org/r/1240846 (https://phabricator.wikimedia.org/T417246)
[07:59:12] (PS1) Arnaudb: gerrit: fix gerrit1003 ssh fingerprint [puppet] - https://gerrit.wikimedia.org/r/1240848 (https://phabricator.wikimedia.org/T417246)
[07:59:55] (CR) Arnaudb: [C:+2] gerrit: gerrit-spare lfs-sync enable [puppet] - https://gerrit.wikimedia.org/r/1240846 (https://phabricator.wikimedia.org/T417246) (owner: Arnaudb)
[08:00:05] Deploy window No deploys all
day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T0800)
[08:01:46] (CR) CI reject: [V:-1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[08:01:47] (CR) Arnaudb: [C:+2] gerrit: fix gerrit1003 ssh fingerprint [puppet] - https://gerrit.wikimedia.org/r/1240848 (https://phabricator.wikimedia.org/T417246) (owner: Arnaudb)
[08:03:03] (CR) Muehlenhoff: [C:+2] Remove dotfiles of three absented users [puppet] - https://gerrit.wikimedia.org/r/1240847 (owner: Muehlenhoff)
[08:03:38] (PS1) Brouberol: Use importlib.metadata instead of pkg_resources, now deprecated/removed. [software/spicerack] - https://gerrit.wikimedia.org/r/1240850
[08:09:53] (CR) CI reject: [V:-1] Use importlib.metadata instead of pkg_resources, now deprecated/removed. [software/spicerack] - https://gerrit.wikimedia.org/r/1240850 (owner: Brouberol)
[08:13:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:15:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:16:05] (PS2) Brouberol: Use importlib.metadata instead of pkg_resources, now deprecated/removed.
[software/spicerack] - https://gerrit.wikimedia.org/r/1240850
[08:17:00] (PS1) Muehlenhoff: Remove puppetmaster class and related classes [puppet] - https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798)
[08:18:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:19:13] !log brouberol@cumin1003 START - Cookbook sre.hosts.reboot-single for host cephosd1003.eqiad.wmnet
[08:20:03] (PS3) Brouberol: Use importlib.metadata instead of pkg_resources, now deprecated/removed. [software/spicerack] - https://gerrit.wikimedia.org/r/1240850
[08:20:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:22:22] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:23:50] (PS4) Brouberol: Use importlib.metadata instead of pkg_resources, now deprecated/removed.
[software/spicerack] - https://gerrit.wikimedia.org/r/1240850
[08:27:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:28:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:29:08] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1003.eqiad.wmnet
[08:29:22] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:29:40] (PS1) Muehlenhoff: Remove puppetmaster::base_repo [puppet] - https://gerrit.wikimedia.org/r/1240856 (https://phabricator.wikimedia.org/T365798)
[08:30:32] (CR) CI reject: [V:-1] Use importlib.metadata instead of pkg_resources, now deprecated/removed.
[software/spicerack] - https://gerrit.wikimedia.org/r/1240850 (owner: Brouberol)
[08:32:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:33:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:33:50] SRE, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE (2026-02-13 - 2026-03-06), Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561#11635329 (Gehel)
[08:34:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:37:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:39:57] (CR) Gehel: "LGTM, minus the build issue" [software/spicerack] - https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[08:48:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240856 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:56:08] (PS1) Muehlenhoff: Remove Globalsign/Digicert stub certs used by PCC [labs/private] -
https://gerrit.wikimedia.org/r/1240865 (https://phabricator.wikimedia.org/T414955)
[08:57:38] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11635379 (Tacsipacsi) >>! In T414805#11632990, @Ladsgroup wrote: > A global rate limit effectively is not that different than full...
[08:58:31] (PS1) Muehlenhoff: Remove various stub certs for now removed cergen certs [labs/private] - https://gerrit.wikimedia.org/r/1240866 (https://phabricator.wikimedia.org/T357750)
[08:59:54] (CR) Muehlenhoff: [V:+2 C:+2] Remove Globalsign/Digicert stub certs used by PCC [labs/private] - https://gerrit.wikimedia.org/r/1240865 (https://phabricator.wikimedia.org/T414955) (owner: Muehlenhoff)
[09:00:15] (CR) Muehlenhoff: [V:+2 C:+2] Remove various stub certs for now removed cergen certs [labs/private] - https://gerrit.wikimedia.org/r/1240866 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[09:00:24] (PS1) Muehlenhoff: Remove long obsoleted sni/star stub certs [labs/private] - https://gerrit.wikimedia.org/r/1240867
[09:10:18] (CR) Muehlenhoff: [V:+2 C:+2] Remove long obsoleted sni/star stub certs [labs/private] - https://gerrit.wikimedia.org/r/1240867 (owner: Muehlenhoff)
[09:12:53] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#11635406 (MoritzMuehlenhoff) Open→Resolved cergen is fully undeployed from our infrastructure: All certificates have been mig...
[09:13:44] (CR) Elukey: [C:+2] setup.py: Pin setuptools < 82.0.0 to make pkg_resources available.
[software/spicerack] - https://gerrit.wikimedia.org/r/1240702 (owner: Blake)
[09:15:23] SRE, SRE-tools, Infrastructure-Foundations, ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11635412 (Volans) @Blake I think we should do that check inside the logic that fires up the cookbook, namely...
[09:26:15] (PS3) Effie Mouzeli: x-wikimedia-debug-routing: add routing to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246)
[09:28:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:28:59] (CR) Fabfur: [C:+1] x-wikimedia-debug-routing: add routing to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) (owner: Effie Mouzeli)
[09:29:50] (CR) Effie Mouzeli: [C:+2] x-wikimedia-debug-routing: add routing to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) (owner: Effie Mouzeli)
[09:30:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:33:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:33:56] jouncebot: nowandnext
[09:33:56] For the
next 22 hour(s) and 26 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T0800)
[09:33:56] In 2 hour(s) and 26 minute(s): GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T1200)
[09:34:00] arnaudb: I am going to upgrade the CI Jenkins on contint1002
[09:35:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:36:29] (CR) Elukey: hadoop.reboot-workers: make host override smarter (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: Ryan Kemper)
[09:38:57] ack !
[09:40:09] I wanted to do it yesterday but well.. I ended up not having the time for it :b
[09:41:20] !log Upgraded CI Jenkins from 2.528.3 to 2.541.2 # T417791
[09:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:44:46] (PS2) Muehlenhoff: Remove puppetmaster class and related classes [puppet] - https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798)
[09:44:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:44:58] (PS10) Brouberol: elasticsearch_cluster: allow checking last reboot [software/spicerack] - https://gerrit.wikimedia.org/r/1235112
(https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[09:46:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:46:15] (CR) Sergio Gimeno: [C:+1] "No objections, just a question." [mediawiki-config] - https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: Urbanecm)
[09:50:01] (PS3) Muehlenhoff: Fix copy&paste errors in comments [software/spicerack] - https://gerrit.wikimedia.org/r/1240202
[09:51:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:51:24] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:53:06] (CR) CI reject: [V:-1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[09:53:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:56:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:58:11]
(03PS11) 10Brouberol: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [10:00:38] 06SRE, 06Infrastructure-Foundations, 06serviceops-radar, 07Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741#11635512 (10taavi) 05Open→03Resolved assuming this is done as all of the checkboxes have been checked [10:02:53] (03CR) 10Elukey: [C:03+1] Fix copy&paste errors in comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240202 (owner: 10Muehlenhoff) [10:05:59] (03PS1) 10Muehlenhoff: Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) [10:08:11] (03CR) 10CI reject: [V:04-1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [10:09:20] (03PS12) 10Brouberol: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [10:10:48] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 07Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490#11635528 (10MoritzMuehlenhoff) [10:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:22] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11635530 (10Nux) This would not change the 
calculations much, but there are [[ https://global-search.toolforge.org/?q=thumb%5C%2F.%5C... [10:13:36] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:14:10] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:14:55] (03CR) 10CI reject: [V:04-1] Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:14:56] (03PS1) 10Muehlenhoff: Remove bgpalerter spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240878 [10:15:11] (03CR) 10Arnaudb: "the server is back in production, backups should be able to properly run again" [puppet] - 10https://gerrit.wikimedia.org/r/1240650 (owner: 10Jcrespo) [10:15:30] (03CR) 10Arnaudb: [C:03+2] Revert "backup: Temporarily ignore backup job failures from gerrit1003" [puppet] - 10https://gerrit.wikimedia.org/r/1240650 (owner: 10Jcrespo) [10:15:43] (03PS2) 10Muehlenhoff: Remove bgpalerter spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240878 [10:18:57] (03CR) 10Brouberol: [C:03+1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [10:26:19] (03PS2) 10Muehlenhoff: Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) [10:29:19] (03PS1) 10Hashar: zuul: pin spec tests to Bullseye (10) [puppet] - 10https://gerrit.wikimedia.org/r/1240879 [10:30:03] (03CR) 10CI reject: [V:04-1] zuul: pin spec tests to Bullseye (10) [puppet] - 10https://gerrit.wikimedia.org/r/1240879 (owner: 10Hashar) [10:31:58] (03PS2) 10Hashar: zuul: pin spec tests to Bullseye (10) [puppet] - 10https://gerrit.wikimedia.org/r/1240879 [10:33:04] (03CR) 10Majavah: [C:04-1] "bullseye would be Debian 11, not 10" [puppet] - 
10https://gerrit.wikimedia.org/r/1240879 (owner: 10Hashar) [10:34:14] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T417862]]' 'Wikimedia Foundation/Advancement/Community Growth/Community Resources and Partnerships' 'Wikimedia Foundation/Advancement/Community Growth/Community Investment and Partnerships' Ammarpad # T417862 [10:34:18] T417862: Request to move translatable page: :meta:Wikimedia Foundation/Advancement/Community Growth/Community Resources and Partnerships - https://phabricator.wikimedia.org/T417862 [10:35:04] (03CR) 10Hashar: "I keep being confused by the doubled version system (names vs number) :-\ Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1240879 (owner: 10Hashar) [10:37:06] (03PS3) 10Hashar: zuul: pin spec tests to Bullseye (11) [puppet] - 10https://gerrit.wikimedia.org/r/1240879 [10:37:08] !log jayme@cumin1003 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on A:wikikube-staging-worker-codfw [10:38:41] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS trixie [10:38:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240879 (owner: 10Hashar) [10:38:50] (03CR) 10Muehlenhoff: [C:03+2] zuul: pin spec tests to Bullseye (11) [puppet] - 10https://gerrit.wikimedia.org/r/1240879 (owner: 10Hashar) [10:45:37] (03CR) 10Muehlenhoff: [C:03+2] Fix copy&paste errors in comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240202 (owner: 10Muehlenhoff) [10:45:59] 10SRE-swift-storage, 06Data-Persistence, 10Prod-Kubernetes, 06ServiceOps new, and 5 others: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#11635575 (10MLechvien-WMF) @JTweed-WMF would you have inputs on how to triage this task? 
[10:47:52] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:47:58] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:48:04] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [10:48:31] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [10:49:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11635578 (10Blake) I think I'd be inclined to prefer the more-defensive option (maybe @Clement_Goubert has a pr... [10:56:30] (03PS1) 10Joal: dse-k8s-eqiad: Add turnilo-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240884 (https://phabricator.wikimedia.org/T416120) [10:57:12] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [10:58:22] (03PS2) 10Hnowlan: Revert "svg: refuse to generate SVGs larger than a particular size" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [10:58:32] (03CR) 10CI reject: [V:04-1] Revert "svg: refuse to generate SVGs larger than a particular size" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [10:58:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T415786)', diff saved to https://phabricator.wikimedia.org/P88914 and previous config saved to /var/cache/conftool/dbconfig/20260220-105847-marostegui.json [10:58:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:00:08] (03CR) 10Hnowlan: "This change has been reverted in another change and builds on master are now successful again" 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [11:02:39] (03PS1) 10Joal: Add turnilo-next to dse-k8s-eqiad kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1240887 (https://phabricator.wikimedia.org/T416119) [11:03:23] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [11:03:50] (03PS1) 10Clément Goubert: service mesh: Add page-analytics listener [puppet] - 10https://gerrit.wikimedia.org/r/1240888 (https://phabricator.wikimedia.org/T411769) [11:04:11] (03PS1) 10Muehlenhoff: civicrm: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240889 [11:04:42] (03PS3) 10Clément Goubert: wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [11:04:49] (03CR) 10Dreamy Jazz: [C:03+1] IPReputation: Lower IPoid request and connect timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240769 (https://phabricator.wikimedia.org/T417910) (owner: 10Kosta Harlan) [11:05:27] (03CR) 10Clément Goubert: wikifeeds: Add request definition for page analytics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [11:05:55] (03CR) 10Ayounsi: "I have no opinion on that" [puppet] - 10https://gerrit.wikimedia.org/r/1240878 (owner: 10Muehlenhoff) [11:06:43] (03CR) 10CI reject: [V:04-1] civicrm: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240889 (owner: 10Muehlenhoff) [11:06:58] (03CR) 10CI reject: [V:04-1] wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [11:08:08] (03PS1) 10Joal: Add helm chart for turnilo UI 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) [11:11:48] (03PS1) 10Joal: Add turnilo-next helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) [11:13:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P88915 and previous config saved to /var/cache/conftool/dbconfig/20260220-111355-marostegui.json [11:15:55] (03PS1) 10Mszwarc: Ensure that sysops don't have '(oathauth-recover-for-user)' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240892 (https://phabricator.wikimedia.org/T417877) [11:16:30] (03CR) 10Michael Große: [C:03+1] [Growth] Force legacy validation of GrowthMentorList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [11:16:43] (03CR) 10Michael Große: [C:03+1] [Growth] beta: Enable new GrowthMentorList validation on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [11:17:19] (03PS2) 10Muehlenhoff: civicrm: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240889 [11:20:45] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS trixie [11:21:07] (03CR) 10Muehlenhoff: [C:03+2] Remove bgpalerter spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240878 (owner: 10Muehlenhoff) [11:24:32] 06SRE: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#11635678 (10MoritzMuehlenhoff) 05Open→03Resolved Access has been removed [11:26:49] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage2003.codfw.wmnet with OS trixie [11:26:57] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes 
(1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11635687 (10Tgr) Not for this incident specifically, but we are planning to update tests in the next few weeks ({T4... [11:28:06] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:28:27] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:29:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P88916 and previous config saved to /var/cache/conftool/dbconfig/20260220-112903-marostegui.json [11:29:52] (03PS1) 10Muehlenhoff: Record LDAP access for mikez [puppet] - 10https://gerrit.wikimedia.org/r/1240896 [11:33:47] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for mikez [puppet] - 10https://gerrit.wikimedia.org/r/1240896 (owner: 10Muehlenhoff) [11:38:54] (03PS3) 10Tiziano Fogli: thanos::rule: add ExecReload to the service unit [puppet] - 10https://gerrit.wikimedia.org/r/1239906 (https://phabricator.wikimedia.org/T414579) [11:38:55] (03PS26) 10Tiziano Fogli: slothslos: add module to build and deploy sloth manifests [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) [11:40:32] (03PS1) 10Effie Mouzeli: mw-parsoid/experimental: update resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240899 [11:44:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T415786)', diff saved to https://phabricator.wikimedia.org/P88917 and previous config saved to /var/cache/conftool/dbconfig/20260220-114412-marostegui.json [11:44:14] (03CR) 10Clément Goubert: [C:03+1] mw-parsoid/experimental: update resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240899 (owner: 10Effie Mouzeli) [11:44:17] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:44:29] !log 
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [11:44:33] (03PS27) 10Tiziano Fogli: slothslos: add module to build and deploy sloth manifests [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) [11:44:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88918 and previous config saved to /var/cache/conftool/dbconfig/20260220-114437-marostegui.json [11:44:56] (03CR) 10Tiziano Fogli: "The flattening step now happens before generation." [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [11:45:48] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid/experimental: update resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240899 (owner: 10Effie Mouzeli) [11:47:46] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage [11:47:48] (03Merged) 10jenkins-bot: mw-parsoid/experimental: update resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240899 (owner: 10Effie Mouzeli) [11:48:39] (03PS2) 10Majavah: cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 [11:49:07] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [11:49:40] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [11:49:48] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:50:24] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:50:36] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:51:16] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:51:24] !log 
jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:52:06] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:52:28] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11635763 (10Clement_Goubert) I'd rather we be more defensive than not, especially if there is no strong enforce... [11:54:19] (03CR) 10Ayounsi: [C:03+1] cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 (owner: 10Majavah) [11:54:57] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage [12:00:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T0800) [12:00:05] jelto, arnoldokoth, mutante, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260220T1200). [12:00:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240892 (https://phabricator.wikimedia.org/T417877) (owner: 10Mszwarc) [12:07:43] (03CR) 10Clément Goubert: "Wouldn't they be relevant to beta, which is still on bare metal?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff) [12:13:46] (03PS1) 10Muehlenhoff: puppetserver: Update two hooks to the variants from the puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240924 (https://phabricator.wikimedia.org/T365798) [12:14:49] (03PS1) 10Hashar: jenkins: pin spec tests to Bullseye (11) to Bookworm (12) [puppet] - 10https://gerrit.wikimedia.org/r/1240925 [12:14:51] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:15:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2003.codfw.wmnet with OS trixie [12:16:33] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage2004.codfw.wmnet with OS trixie [12:20:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [12:20:05] (03PS1) 10Ayounsi: nftables: define NETWORK_INFRA [puppet] - 10https://gerrit.wikimedia.org/r/1240931 [12:20:42] (03PS3) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 [12:23:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:23:34] (03CR) 10JMeybohm: [C:03+2] k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [12:23:38] (03CR) 10JMeybohm: [C:03+2] k8s.roll-reimage-nodes: Remove --puppet argument when calling reimage [cookbooks] - 
10https://gerrit.wikimedia.org/r/1240755 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [12:24:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1240925 (owner: 10Hashar) [12:24:33] (03CR) 10Muehlenhoff: [C:03+2] jenkins: pin spec tests to Bullseye (11) to Bookworm (12) [puppet] - 10https://gerrit.wikimedia.org/r/1240925 (owner: 10Hashar) [12:24:43] (03CR) 10Ayounsi: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [12:27:25] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:29:13] (03Merged) 10jenkins-bot: k8s.roll-reimage-nodes: Remove --puppet argument when calling reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1240755 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [12:29:51] (03Merged) 10jenkins-bot: k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [12:30:02] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:31:50] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:33:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:34:20] !log jayme@cumin1003 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on A:wikikube-staging-worker-eqiad [12:35:54] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage1004.eqiad.wmnet with OS trixie [12:36:06] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage [12:37:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240924 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:39:53] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage [12:45:32] (03CR) 10Alex.sanford: [C:03+1] "Just noting that this will also deploy https://gitlab.wikimedia.org/repos/sre/miscweb/security-landing-page/-/merge_requests/25 (Update Te" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [12:46:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T415786)', diff saved to https://phabricator.wikimedia.org/P88922 and previous config saved to /var/cache/conftool/dbconfig/20260220-124627-marostegui.json [12:46:32] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [12:46:45] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: Add turnilo-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240884 
(https://phabricator.wikimedia.org/T416120) (owner: 10Joal) [12:46:48] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: Add turnilo-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240884 (https://phabricator.wikimedia.org/T416120) (owner: 10Joal) [12:47:07] (03CR) 10Brouberol: [C:03+1] Add turnilo-next to dse-k8s-eqiad kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1240887 (https://phabricator.wikimedia.org/T416119) (owner: 10Joal) [12:47:09] (03CR) 10Brouberol: [C:03+2] Add turnilo-next to dse-k8s-eqiad kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1240887 (https://phabricator.wikimedia.org/T416119) (owner: 10Joal) [12:51:06] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [12:52:02] 06SRE, 06serviceops, 10Wikibase GraphQL, 06Wikibase Reuse Team, 10Wikidata: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11635862 (10LSobanski) [12:54:00] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [12:55:34] 06SRE, 06serviceops, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11635869 (10taavi) [13:01:01] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2004.codfw.wmnet with OS trixie [13:01:03] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on A:wikikube-staging-worker-codfw [13:01:27] (03CR) 10Rsilvola: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [13:01:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P88923 and previous config 
saved to /var/cache/conftool/dbconfig/20260220-130136-marostegui.json [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:08:21] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:06] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1004.eqiad.wmnet with OS trixie [13:15:16] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage1005.eqiad.wmnet with OS trixie [13:16:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P88924 and previous config saved to /var/cache/conftool/dbconfig/20260220-131644-marostegui.json [13:22:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55711 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:22:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:22:53] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:30:00] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1005.eqiad.wmnet with reason: host reimage [13:31:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T415786)', diff saved to https://phabricator.wikimedia.org/P88925 and previous config saved to /var/cache/conftool/dbconfig/20260220-133152-marostegui.json [13:31:58] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:32:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [13:32:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T415786)', diff saved to https://phabricator.wikimedia.org/P88926 and previous config saved to /var/cache/conftool/dbconfig/20260220-133216-marostegui.json [13:34:16] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1005.eqiad.wmnet with reason: host reimage [13:40:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:50:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:51:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:52:20] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1005.eqiad.wmnet with OS trixie [13:54:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:58:23] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestage1006.eqiad.wmnet with OS trixie [14:03:15] (03PS1) 10Brouberol: idp_test: define turnilo_next client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1240977 [14:03:24] (03CR) 10Brouberol: [C:03+2] idp_test: define turnilo_next client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1240977 (owner: 10Brouberol) [14:03:29] (03CR) 10Brouberol: [V:03+2 C:03+2] idp_test: define turnilo_next client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1240977 (owner: 10Brouberol) [14:05:47] (03PS1) 10Brouberol: idp_test: define the turnilo_next service [puppet] - 10https://gerrit.wikimedia.org/r/1240979 (https://phabricator.wikimedia.org/T417990) [14:08:55] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1240979 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:09:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::5e5e:ab00:d3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:09:31] (03CR) 10Brouberol: [C:03+2] idp_test: define the turnilo_next service [puppet] - 10https://gerrit.wikimedia.org/r/1240979 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:12:47] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1006.eqiad.wmnet with reason: host reimage [14:14:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::5e5e:ab00:d3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:14:13] (03CR) 10Jgreen: [C:03+1] Add spf records for civicrm and frmx hosts [dns] - 10https://gerrit.wikimedia.org/r/1240834 (https://phabricator.wikimedia.org/T417958) (owner: 10Dwisehaupt) [14:19:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1006.eqiad.wmnet with reason: host reimage [14:35:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:36:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1006.eqiad.wmnet with OS trixie [14:36:58] !log jayme@cumin1003 END (PASS) - Cookbook 
sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on A:wikikube-staging-worker-eqiad [14:43:02] (03CR) 10Jsn.sherman: "this totally works, but I wonder if we should create variables for revertrisk language agnostic default/fallback/whatever (naming is hard)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [15:01:58] (03CR) 10Urbanecm: [Growth] Force legacy validation of GrowthMentorList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [15:06:12] (03PS10) 10Arnaudb: gerrit: adapt httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) [15:08:21] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:56] (03CR) 10SBassett: "Yep." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [15:09:10] (03CR) 10Herron: [C:03+1] "Sweet! thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:09:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:28] (03PS11) 10Hashar: gerrit: adapt httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:13:36] (03PS12) 10Hashar: gerrit: adapt httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:15:58] (03CR) 10Hashar: "I have removed T417536 which was to deal with the incident of CI not being able to clone. It has been fully fixed by disabling the TCP con" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:16:28] (03PS1) 10Ssingh: P:bird::anycast: improve the code for IPv6 support (and automatically detect it) [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) [15:17:51] (03CR) 10Muehlenhoff: [C:03+1] "Let's give that a shot next week, we can compare the generated set on a ferm node with an nftables node" [puppet] - 10https://gerrit.wikimedia.org/r/1240931 (owner: 10Ayounsi) [15:18:02] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8100/co" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:22:46] (03PS1) 10Elukey: admin: set home dir for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1241004 (https://phabricator.wikimedia.org/T402512) [15:23:12] (03CR) 10Elukey: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1241004 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:23:39] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726#11636592 (10Jhancock.wm) new part shipped last night [15:23:43] 06SRE, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in operations group for Rsilvola - https://phabricator.wikimedia.org/T418004#11636593 (10Rsilvola) [15:24:09] (03CR) 10Ssingh: [V:03+1] "Yeah PCC reminds me where I am wrong. I forgot that that under a single FQDN, you can have both v4 and v6. Fixing that." [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:29:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1241004 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:31:07] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.130, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:29] (03CR) 10Elukey: [C:03+2] admin: set home dir for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1241004 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:38:21] 06SRE, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11636635 (10taavi) [15:39:54] 06SRE, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11636638 (10taavi) +2 in deployment-charts is linked to the ability to actually deploy those changes, and as far as I can tell... 
[15:42:21] (03PS2) 10Ssingh: P:bird::anycast: improve the code for IPv6 support (and automatically detect it) [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) [15:42:42] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11636656 (10MatthewVernon) [15:42:54] (03CR) 10CI reject: [V:04-1] P:bird::anycast: improve the code for IPv6 support (and automatically detect it) [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:44:16] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster::base_repo [puppet] - 10https://gerrit.wikimedia.org/r/1240856 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:44:41] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster class and related classes [puppet] - 10https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:45:23] (03CR) 10JHathaway: [C:03+1] puppetserver: Update two hooks to the variants from the puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240924 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:48:33] 06SRE, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11636661 (10Rsilvola) Ah, well this clarifies the situation. Thanks @taavi! 
[15:59:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:04:13] (03PS1) 10Brouberol: define new httpd-cas image based on httpd including the cas module [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) [16:05:57] (03PS2) 10Brouberol: define new httpd-cas image based on httpd including the cas module [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) [16:07:52] (03PS3) 10Brouberol: define new httpd-cas image based on httpd including the cas module [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) [16:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:29] (03CR) 10Joal: [C:03+1] "LGTM! 
thanks so much :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [16:14:07] (03CR) 10Brouberol: [C:03+2] define new httpd-cas image based on httpd including the cas module [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [16:14:10] (03CR) 10Brouberol: [V:03+2 C:03+2] define new httpd-cas image based on httpd including the cas module [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1241007 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [16:20:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11636757 (10MatthewVernon) OK, I understand the problem - these nodes are being UEFI booted, but the installer is setup for BIOS still. Sorry, I've got confused be... [16:24:30] (03CR) 10Mmartorana: [C:03+2] Version bump security-landing-page values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [16:25:55] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11636766 (10Dzahn) @Jgreen I removed the dmarc@donate.wikimedia.org line from that alias. I did not find any "dmarc-ruf@" alias in that p... [16:26:57] (03Merged) 10jenkins-bot: Version bump security-landing-page values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [16:27:12] (03PS2) 10Dzahn: phabricator: disable dump job [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) [16:29:15] (03CR) 10Dzahn: [C:04-1] "to my surprise this is noop in the compiler.." 
[puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [16:32:25] (03PS1) 10MVernon: installserver: use EFI booting for new apus frontends [puppet] - 10https://gerrit.wikimedia.org/r/1241010 (https://phabricator.wikimedia.org/T416387) [16:33:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:56] (03PS1) 10Elukey: profile::puppetserver: rework and fix the analytics-sre config [puppet] - 10https://gerrit.wikimedia.org/r/1241012 (https://phabricator.wikimedia.org/T402512) [16:36:58] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11636837 (10RobH) [16:37:44] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:42:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:43:33] jhancock@cumin2002 netbox (PID 4065054) is awaiting input [16:45:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding moss-fe2021 to codfw - jhancock@cumin2002" [16:45:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding moss-fe2021 to codfw - jhancock@cumin2002" [16:45:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host moss-fe2021 [16:46:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host moss-fe2022 [16:46:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host moss-fe2021 [16:46:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host moss-fe2022 [16:46:37] (03PS1) 10Tiziano Fogli: meta-monitoring: add rewrite rule to redirect home to Wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1241014 (https://phabricator.wikimedia.org/T417900) [16:46:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:47:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:47:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:55:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012 (10RobH) 03NEW p:05Triage→03Medium [16:56:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012#11636926 (10RobH) [16:56:49] (03CR) 
10Ssingh: codfw: add the following cp nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [16:57:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:57:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:09:53] (03PS3) 10Ssingh: P:bird::anycast: automatically detect IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) [17:11:20] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8105/console" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:12:14] !log sbassett@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:12:36] 🐦 [17:12:46] !log sbassett@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:13:05] !log sbassett@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:13:21] !log sbassett@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:13:37] !log sbassett@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:13:52] !log sbassett@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:13:55] !log sbassett@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:14:02] !log sbassett@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:14:05] !log sbassett@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:14:10] !log sbassett@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:14:29] 10ops-codfw, 06SRE, 
10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11637021 (10Jhancock.wm) [17:18:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2022.codfw.wmnet with OS bullseye [17:18:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11637028 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-fe2022.codfw.wmnet with OS bullseye [17:18:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2022.codfw.wmnet with OS bullseye [17:18:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11637029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-fe2022.codfw.wmnet with OS bullseye executed with errors: - mos... 
[17:24:03] 06SRE, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637038 (10sbassett) (we just did this for Alex [T418015], if you'd like an example request to work from) [17:27:40] (03PS4) 10Ssingh: P:bird::anycast: automatically detect IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) [17:28:02] (03CR) 10Ssingh: "PCC NOOP for a random selection of hosts in sudo cumin "C:bird%do_ipv6=true"" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:28:40] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015#11637047 (10Dzahn) [17:28:56] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637048 (10Dzahn) [17:29:44] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015#11637063 (10Dzahn) Being able to +2 in the repo should be combined with being able to actually deploy changes. Which then turns it into a request... [17:29:54] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637065 (10Dzahn) Being able to +2 in the repo should be combined with being able to actually deploy... 
[17:30:02] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8106/console" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:30:39] (03CR) 10Ssingh: "Adding a few relevant folks for the review." [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:30:49] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:40:53] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015#11637097 (10ASanford-WMF) I am already in the `deployment` shell group - https://phabricator.wikimedia.org/source/operations-puppet/browse/product... [17:45:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018 (10RobH) 03NEW [17:46:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11637114 (10RobH) [17:47:46] !log gerrit added Alex Sanford to wmf-deployment group - already has deployment shell group T418015 [17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:51] T418015: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015 [17:48:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11637117 (10RobH) @ayounsi or @papaul: Would one of you be best suited to provide the cable diagram and matrix so we know how/where each switch is conne... 
[17:49:04] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015#11637118 (10Dzahn) @ASanford-WMF Oh! I see, yes. Sorry about that. I just added you to the Gerrit group as requested. [17:50:50] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Request membership in wmf-deployment group for alex.sanford - https://phabricator.wikimedia.org/T418015#11637119 (10ASanford-WMF) 05Open→03Resolved a:03ASanford-WMF Great, thanks! 🙌 [17:54:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:14:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:40:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11637204 (10RobH) [18:42:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2021.codfw.wmnet with OS bullseye [18:43:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11637213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-fe2021.codfw.wmnet with OS bullseye [18:43:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2021.codfw.wmnet with OS bullseye 
[18:43:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11637226 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-fe2021.codfw.wmnet with OS bullseye executed with errors: - mos... [18:47:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:48:26] huh [18:48:28] esams [18:48:29] !ack [18:48:30] 7461 (ACKED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [18:48:30] here [18:48:41] faster than the person oncall :D [18:48:42] just ipv6 huh [18:48:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:00] no not really [18:49:01] ok [18:49:08] sukhe: Amir1: we have rising NELs as well [18:49:29] https://logstash.wikimedia.org/goto/74d8bb4b0159485a083635ceeb7941e9 [18:49:32] cdanis: Amir1: see noc@ email [18:49:39] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&from=now-1h&to=now&timezone=utc&var-cluster=$__all&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&refresh=15m [18:49:46] raw req is going down [18:50:15] (03CR) 10Dwisehaupt: [C:03+2] Add spf records for civicrm and frmx hosts [dns] - 10https://gerrit.wikimedia.org/r/1240834 (https://phabricator.wikimedia.org/T417958) (owner: 10Dwisehaupt) [18:50:30] FIRING: LibericaUnhealthyRealserverPooled: ... 
[18:50:30] Liberica service text-httpslb6_443 has 4 unhealthy realservers pooled on lvs3010:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://grafana.wikimedia.org/d/d70d14db-4a71-414d-8425-7a30d7127ca6/liberica-services?orgId=1&var-site=esams&var-service=text-httpslb6_443&var-instance=lvs3010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [18:50:39] !log dwisehaupt@dns1004 START - running authdns-update [18:52:08] (03CR) 10Xcollazo: "At some other time, to disable they `absent`ed the job: I3fc05f5" [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [18:52:11] !log dwisehaupt@dns1004 END - running authdns-update [18:55:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [18:56:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:57:10] (03CR) 10Dwisehaupt: [C:03+1] "Looks good. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1240889 (owner: 10Muehlenhoff) [18:57:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [18:58:10] !ack [18:58:11] 7464 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [18:58:21] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:51] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3068 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [19:01:55] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3068 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:02:13] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [19:02:58] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - 
https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:03:29] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3071 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:06:38] (03PS1) 10Ssingh: config/sites: prepend_as_out to true for esams [homer/public] - 10https://gerrit.wikimedia.org/r/1241024 [19:07:07] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3069 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [19:07:14] !log sukhe@cumin1003 START - Cookbook sre.network.cf [19:07:14] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [19:07:32] !log sukhe@cumin1003 START - Cookbook sre.network.cf [19:07:32] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [19:07:58] FIRING: [3x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:08:23] !log sukhe@cumin1003 START - Cookbook sre.network.cf [19:08:23] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [19:10:05] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3069 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:11:15] (03PS2) 10Ssingh: config/sites: prepend_as_out to true for drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1241024 [19:12:15] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3072 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer 
https://wikitech.wikimedia.org/wiki/HTTPS [19:12:16] !log sukhe@cumin1003 START - Cookbook sre.network.cf [19:12:16] !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [19:12:27] (03PS1) 10Bernard Wang: Migrate default user preference configuration to Community Configuration [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241026 (https://phabricator.wikimedia.org/T415355) [19:12:52] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: no reason specified, ] [19:13:15] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3072 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:13:19] (03CR) 10Ssingh: [V:03+2 C:03+2] config/sites: prepend_as_out to true for drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1241024 (owner: 10Ssingh) [19:14:18] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11637304 (10Shawn) Why are those 429 errors unpredictable? All this afternoon LiveRC (real time monitoring tool of recent changes on... 
[19:14:34] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site esams [reason: no reason specified, ] [19:16:22] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: no reason specified, ] [19:16:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: no reason specified, ] [19:18:21] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:21:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:22:57] FIRING: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:58] RESOLVED: [3x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:23:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:30] FIRING: [16x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 4 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [19:27:57] RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [19:30:30] RESOLVED: [16x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 4 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [19:39:24] (03PS1) 10Aklapper: ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) [19:40:58] (03CR) 10Aklapper: "Note that this is how I imagine Python, and that this is untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [19:41:26] (03CR) 10CI reject: [V:04-1] ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [19:47:04] (03CR) 10Aklapper: [C:04-1] "Yeah, "self.phab" is not how to get the host URI here" [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [19:53:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:19] <_joe_> !ack [19:55:20] 7465 (ACKED) [2x] ProbeDown sre (text-https:443 probes/service drmrs) [19:55:31] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637409 (10Dzahn) a:03thcipriani Hi Tyler, they would need both shell "deployment" and gerrit "wmf... 
[19:58:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:26] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637416 (10Dzahn) Well.. wait. I am saying that but it's actually just about access to deployment... [19:59:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:32] (03PS6) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:06:41] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637421 (10thcipriani) k8s deployment does require `deployment` for read access to the configs to us... 
[20:08:01] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637423 (10Dzahn) a:05thcipriani→03None Thanks for that confirmation and approval:) [20:08:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 5 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:10:50] (03PS7) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:13:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:13:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:26] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637434 (10Dzahn) 
@Rsilvola So.. there are 2 things you need. The shell access to the deployment ser... [20:14:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:07] (03PS8) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:15:30] RESOLVED: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 4 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:16:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 2 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:17:06] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637440 (10Dzahn) [20:18:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:16] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11637442 (10Shawn) I'm not sure if I understood 
properly the issue because I tried to fix LiveRC by changing size of icons but I stil... [20:19:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:27] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11637446 (10Dzahn) @Rsilvola I copied the template for shell access requests into the ticket descript... [20:21:30] RESOLVED: [5x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 2 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:22:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:24:29] (03PS9) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:24:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:26:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:28:21] FIRING: [2x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:31:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:32:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:34:30] RESOLVED: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [20:34:56] (03PS10) 10CDobbins: varnish: clean up 
Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:37:26] (03PS11) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:45:33] (03PS12) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:49:59] (03PS13) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:57:11] (03CR) 10Dzahn: [C:04-1] "That's true and thanks for the link! but $dump_job_ensure is supposed to be set by $dump being true or false." [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [20:57:27] (03PS14) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [21:01:56] (03PS3) 10Dzahn: phabricator: disable dump job [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) [21:02:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 5 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [21:05:31] (03PS15) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 
10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [21:06:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:08:21] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:09:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 5 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [21:09:41] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:10:34] !log eevans@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: no reason specified, ] [21:10:37] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: no reason specified, ] [21:10:45] (03PS16) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [21:11:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:12:57] RESOLVED: [2x] ProbeDown: Service text-https:443 
has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:13:03] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [21:13:21] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:14:30] RESOLVED: [6x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 5 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [21:21:43] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11637548 (10Tacsipacsi) >>! In T414805#11636273, @Ladsgroup wrote: > The rate limit is at the edge layer so it doesn't know whether i... [21:28:38] (03PS3) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 [21:29:31] (03CR) 10Xcollazo: [C:03+1] "Ah, I see it now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [21:29:50] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [21:30:22] (03CR) 10Dzahn: [V:03+1 C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [21:30:49] (03CR) 10Dzahn: [V:03+1 C:03+2] phabricator: disable dump job [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [21:33:21] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:38:33] (03CR) 10CDobbins: codfw: add the following cp nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [21:47:05] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8119/co" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [21:54:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:03:11] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T418027 (10ops-monitoring-bot) 03NEW [22:06:23] (03PS25) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:06:59] (03CR) 10CI 
reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:07:16] (03CR) 10CDobbins: prometheus: add pooled host check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:09:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:12:14] (03PS26) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:13:09] (03PS2) 10Aklapper: ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) [22:13:21] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:13:42] (03CR) 10CDobbins: "I'm sorry, I don't remember the answer to this question, but I should be able to install it on a puppetserver, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:15:54] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8120/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:21:37] (03PS1) 10Eevans: Revert "config/sites: prepend_as_out to true for drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1241040 [22:28:24] (03CR) 10BBlack: [C:03+2] Revert "config/sites: prepend_as_out to true for drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1241040 (owner: 10Eevans) [22:29:07] (03CR) 10BBlack: [C:03+1] "+1 :)" [homer/public] - 10https://gerrit.wikimedia.org/r/1241040 (owner: 10Eevans) [22:29:41] (03CR) 10Eevans: [C:03+2] Revert "config/sites: prepend_as_out to true for drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1241040 (owner: 10Eevans) [22:29:44] (03Merged) 10jenkins-bot: Revert "config/sites: prepend_as_out to true for drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1241040 (owner: 10Eevans) [22:35:01] !log eevans@cumin1003 START - Cookbook sre.network.cf [22:35:07] !log eevans@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [22:35:41] (03CR) 10Dzahn: ProdPasteBot: Call paste.edit instead of deprecated paste.create (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [22:38:26] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1240703 (owner: 10Muehlenhoff) [22:39:10] (03CR) 10CI reject: [V:04-1] Run Gerrit spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240703 (owner: 10Muehlenhoff) [22:40:50] (03CR) 10Paladox: Run Gerrit spec tests on Bullseye/Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240703 (owner: 10Muehlenhoff) [22:55:22] 
(03PS1) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 [22:56:57] (03PS2) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 [22:57:31] (03PS3) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) [22:58:08] (03CR) 10CI reject: [V:04-1] cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [23:02:38] !log eevans@cumin1003 START - Cookbook sre.network.cf [23:02:38] !log eevans@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [23:03:24] !log eevans@cumin1003 START - Cookbook sre.network.cf [23:03:25] !log eevans@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [23:06:37] (03CR) 10Aklapper: ProdPasteBot: Call paste.edit instead of deprecated paste.create (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [23:44:07] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11637743 (10Papaul) The first step was completed by remote hands yesterday but the port number and the cable ID's were not given to me so I just got the informati... [23:47:30] 06SRE, 10Wikimedia-Mailing-lists: Spam filtering rules for mediawiki-api@lists.wikimedia.org failing - https://phabricator.wikimedia.org/T418028#11637744 (10Quiddity) I sent this to that mailing-list's owner list last week: > You're getting a lot of spam at this mailing list, over the last few months. I wonder...