[00:00:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:00:31] MatmaRex: thanks but not really, since this is on the MW side. we can try to ping some folks internally [00:01:06] if there is an emergency deploy, we can support that [00:05:43] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630362 (10Gerges) Additional note: OAuth 2.0 using Owner-only consumers works correctly and allows editing without issues. The problem appears when using standard O... [00:06:25] jhancock@cumin2002 provision (PID 2847571) is awaiting input [00:11:35] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630364 (10LucasWerkmeister) >>! In T417839#11630340, @matmarex wrote: > I am pretty sure it comes from Envoy, and not MediaWiki: https://github.com/envoyproxy/envoy... [00:12:00] MatmaRex: can you check once in the serviceops channel [00:12:11] or I can but since you have more contexy [00:12:15] *context [00:13:42] sukhe: sorry, i'm not sure what you mean [00:14:36] no worries, I am trying to figure out if this is really an envoy error, or just a symptom [00:15:01] because I don't see any changes on our end [00:16:24] jhancock@cumin2002 provision (PID 2847571) is awaiting input [00:20:13] if nothing changed in envoy config, then it could be that MW is emitting malformed JWT tokens [00:22:03] it most likely is that [00:36:17] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630378 (10matmarex) Still debugging locally. 
The JWTs I am getting for OAuth 2 access tokens do not have an issuer (`iss`) field at all. This is probably the problem. [00:38:18] jouncebot: nowandnext [00:38:18] No deployments scheduled for the next 6 hour(s) and 21 minute(s) [00:38:18] In 6 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0700) [00:38:18] In 6 hour(s) and 21 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0700) [00:39:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1240411 [00:39:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1240411 (owner: 10TrainBranchBot) [00:39:24] (03CR) 10Dzahn: [C:03+2] scap3 install provider: Set env vars for deploy_user when running scap [puppet] - 10https://gerrit.wikimedia.org/r/1240372 (https://phabricator.wikimedia.org/T417767) (owner: 10Ahmon Dancy) [00:40:43] (03CR) 10Dzahn: [C:03+2] scap: load scap_source type in specs [puppet] - 10https://gerrit.wikimedia.org/r/1240377 (owner: 10Ahmon Dancy) [00:43:21] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:47:04] (03PS1) 10Dzahn: miscweb: add release for status.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) [00:48:20] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:34] (03PS2) 10Dzahn: miscweb: add release for status.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) [00:51:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:53:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1240411 (owner: 10TrainBranchBot) [00:58:11] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 13Patch-For-Review: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630437 (10matmarex) That fixes the problem for me locally. [00:59:07] (03PS1) 10Dzahn: switch status.wikimedia.org from rackspace to wikimedia [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) [00:59:24] (03CR) 10Dzahn: [C:04-2] "when it's ready" [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [01:04:14] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 13Patch-For-Review: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630452 (10matmarex) The bug is caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1222279, which upgraded lcobucci/jwt from "4.1.5" t... 
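Editor's note: the diagnosis in T417839 (access-token JWTs issued without an `iss` claim) can be checked by decoding the token's payload segment and listing which expected claims are absent. The sketch below is a minimal illustration, not the code used by MediaWiki, the OAuth extension, or Envoy. The function names (`decode_jwt_payload`, `missing_claims`, `make_unsigned_token`) and the sample claim values are hypothetical, and the signature is deliberately not verified, so this is only suitable for inspecting a token while debugging.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a compact JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload_b64 = parts[1]
    # base64url segments omit '=' padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def missing_claims(token: str, required=("iss",)) -> list:
    """Return the required claim names that are absent from the token payload."""
    payload = decode_jwt_payload(token)
    return [claim for claim in required if claim not in payload]


def _b64url(data: dict) -> str:
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def make_unsigned_token(claims: dict) -> str:
    """Hypothetical helper: build a sample token with a placeholder signature."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([_b64url(header), _b64url(claims), "sig"])


if __name__ == "__main__":
    # Sample claim values are made up for illustration only.
    with_issuer = make_unsigned_token({"iss": "https://example.org", "sub": "1"})
    without_issuer = make_unsigned_token({"sub": "1"})
    print(missing_claims(with_issuer))
    print(missing_claims(without_issuer))
```

Passing a real access token to `missing_claims` would show whether `iss` is present before looking further at proxy configuration. Which claims a given validator actually requires depends on its configuration and is not stated in this log.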
[01:05:58] (03PS1) 10Dzahn: trafficserver: add map for status.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) [01:08:20] (03PS1) 10Dzahn: microsites: add monitoring for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240417 (https://phabricator.wikimedia.org/T414098) [01:09:03] (03PS1) 10Jforrester: Do not pass null to AccessTokenEntity::setUserIdentifier() [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240418 (https://phabricator.wikimedia.org/T417820) [01:09:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1240419 [01:09:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1240419 (owner: 10TrainBranchBot) [01:09:50] Looks like they found the cause of the issue - it's indeed not going to be us [01:10:56] (03PS1) 10Dzahn: httpbb/miscweb: add test for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240420 (https://phabricator.wikimedia.org/T414098) [01:19:28] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:20:08] PROBLEM - Bird Internet Routing Daemon on cephosd1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [01:21:09] (03PS1) 10Dzahn: httpbb/miscweb: add tests for wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1240421 (https://phabricator.wikimedia.org/T408592) [01:22:07] (03CR) 10Dzahn: "no effect until DNS is changed (https://gerrit.wikimedia.org/r/c/operations/dns/+/1240414)" [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [01:22:44] (03CR) 10BCornwall: "NB: So I could continue on with getting our trixie hosts up I created the symlink in 
/etc/acmecerts manually" [puppet] - 10https://gerrit.wikimedia.org/r/1240395 (owner: 10BCornwall) [01:23:03] (03CR) 10BCornwall: "(on cp2043/cp2044)" [puppet] - 10https://gerrit.wikimedia.org/r/1240395 (owner: 10BCornwall) [01:24:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:25:44] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 13Patch-For-Review: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630501 (10matmarex) a:03matmarex [01:38:03] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1240419 (owner: 10TrainBranchBot) [02:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:08:20] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:44] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 57s) [02:14:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [02:32:25] (03PS1) 10Reedy: Fix "iss" field missing in OAuth 2 access token JWT [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240430 (https://phabricator.wikimedia.org/T417839) 
[02:33:20] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:35] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:44:09] PROBLEM - Bird Internet Routing Daemon on cephosd1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [03:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:48:21] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:12:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [04:15:39] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:16:07] PROBLEM - Bird Internet Routing Daemon on cephosd1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [04:33:55] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.130, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:41:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:41:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:47:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:47:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:48:40] FIRING: SystemdUnitFailed: 
send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:48] PROBLEM - SSH on cephosd1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:59:44] RECOVERY - SSH on cephosd1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:06:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:07:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [05:14:48] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1240286 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:21:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:22:38] (03CR) 10Marostegui: [C:03+2] check_mariadb_events.sh: Fixes [puppet] - 10https://gerrit.wikimedia.org/r/1240286 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:24:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:28:34] (03CR) 10ArielGlenn: "Haven't tested, think it's generally ok, see my very few comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [05:29:48] PROBLEM - MariaDB Event Scheduler test-s4 on db1176 is CRITICAL: CRIT: event_scheduler: False, expected True: OK: Version 10.11.16-MariaDB-log, Uptime 1099299s, read_only: True, 26.73 QPS, connection latency: 0.025371s, query latency: 0.000548s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:29:54] ^ me testing [05:30:48] RECOVERY - MariaDB Event Scheduler test-s4 on db1176 is OK: Version 10.11.16-MariaDB-log, Uptime 1099359s, read_only: True, event_scheduler: True, 24.75 QPS, connection latency: 0.029218s, query latency: 0.000491s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:31:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:34:28] PROBLEM - MariaDB Events test-s4 on db1176 is CRITICAL: CRITICAL - Events not ENABLED: wmf_slave_purge(DISABLED) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:34:41] ^me testing again [05:35:28] RECOVERY - MariaDB Events test-s4 on db1176 is OK: OK - All 4 events in ops database are ENABLED https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:35:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:45:24] 
(03PS1) 10Marostegui: core.pp: Add alert for query killers [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) [05:46:43] (03PS1) 10KartikMistry: Update Recommendation API to 2026-02-10-184357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240464 (https://phabricator.wikimedia.org/T409482) [05:55:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [05:56:35] (03CR) 10Marostegui: "Also written the wiki page for this: https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FTroubleshooting&diff=2382550&oldid=2361001" [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:57:13] (03CR) 10Marostegui: "This requires deploying a new grant in prod for nagios@localhost defined at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239969/" [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [06:03:47] (03PS2) 10Marostegui: core.pp: Add alert for query killers [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) [06:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:08:42] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:09:10] PROBLEM - Bird Internet Routing Daemon on 
cephosd1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:11:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:18:49] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:19:11] PROBLEM - Bird Internet Routing Daemon on cephosd1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:28:49] (03PS1) 10Marostegui: wmnet: Failover m2 master [dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) [06:31:16] (03PS1) 10Arnaudb: gerrit: change gerrit1003 role [puppet] - 10https://gerrit.wikimedia.org/r/1240471 (https://phabricator.wikimedia.org/T417246) [06:31:16] (03CR) 10Marostegui: [C:03+1] filtered_tables: Drop old categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1239484 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [06:38:20] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:57] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:39:11] RECOVERY - Bird Internet Routing Daemon on cephosd1004 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:40:10] RECOVERY - Bird Internet Routing Daemon on cephosd1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:40:50] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:42:59] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:44:50] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:45:10] RECOVERY - Bird Internet Routing Daemon on cephosd1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:45:58] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:46:58] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:47:12] PROBLEM - Bird Internet Routing Daemon on cephosd1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:49:58] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:50:12] RECOVERY - Bird Internet Routing Daemon on cephosd1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:50:29] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0700). [07:15:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [07:18:21] (03PS2) 10Arnaudb: java: update puppet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) [07:18:21] (03CR) 10Arnaudb: "as mentioned in https://w.wiki/HuGA, gerrit installation by puppet is still depending on using the puppet5_CA that no longer exists. I've " [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [07:35:28] (03CR) 10Federico Ceratto: [C:03+2] orchestrator: disable service on dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1240228 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [07:39:34] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:40:51] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:41:22] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240462 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [07:46:46] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [07:47:23] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[07:48:14] (03CR) 10Federico Ceratto: [C:03+1] "Looks ok, just left a comment." [puppet] - 10https://gerrit.wikimedia.org/r/1240220 (owner: 10Marostegui) [07:52:01] (03CR) 10Federico Ceratto: [C:03+1] "(I checked the current CNAME on DNS and matches the change)" [dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [08:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:07:32] (03CR) 10Elukey: [C:03+1] "I am not super fond of changes like these since git blame and history may become less easy to quickly inspect, but it is also fine to have" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240302 (owner: 10Federico Ceratto) [08:09:06] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:09:18] RECOVERY - Bird Internet Routing Daemon on cephosd1004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:14:06] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:14:18] RECOVERY - Bird Internet Routing Daemon on cephosd1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [08:15:04] 
FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [08:15:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:15:52] (03CR) 10Muehlenhoff: java: update puppet certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [08:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:20:06] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:30:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:32:39] (03CR) 10Federico Ceratto: [C:03+2] Reflow setup.py using Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1240302 (owner: 10Federico Ceratto) [08:34:43] (03CR) 10Elukey: [C:03+2] Drop support for Python 3.7 and 3.8 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239678 (owner: 10Volans) [08:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:35:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11630922 (10ayounsi) The tenant was missing on the two new switches so ` switches = Device.objects.filter(tenant__slug=FRACK_TENANT_SLUG, role__slug='asw',... [08:37:13] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.17; 2026-02-24), 13Patch-For-Review: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11630929 (10matmarex) [08:38:04] (03PS4) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [08:38:20] (03PS5) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [08:39:29] oh, it's the backport window [08:39:41] anyone around who could backport some fixes for deployment blockers? 
[08:40:10] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/1240430 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/1240418 [08:40:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:40:39] (03PS6) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [08:41:00] (03PS7) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [08:41:02] (03Merged) 10jenkins-bot: Drop support for Python 3.7 and 3.8 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239678 (owner: 10Volans) [08:41:13] (03CR) 10Ryan Kemper: "This is a really good call-out. Made a first attempt in PS7." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [08:42:45] (03CR) 10Vgutierrez: [C:03+1] cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1237245 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [08:44:24] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1237245 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [08:47:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:48:00] (03PS1) 10Majavah: P:dumps::distribution::nfs: Use IPs only in /etc/exports [puppet] - 10https://gerrit.wikimedia.org/r/1240605 [08:48:18] (03CR) 10Elukey: [C:03+2] tests: remove fixture require_caplog [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239679 (owner: 10Volans) [08:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:59] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8070/co" [puppet] - 10https://gerrit.wikimedia.org/r/1240605 (owner: 10Majavah) [08:50:38] (03CR) 10Muehlenhoff: java: update puppet certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [08:52:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki 
errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:54:39] (03Merged) 10jenkins-bot: tests: remove fixture require_caplog [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239679 (owner: 10Volans) [08:56:18] (03PS2) 10Arnaudb: gerrit: add gerrit-replica service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417536) [08:56:44] jouncebot: refresh [08:56:45] I refreshed my knowledge about deployments. [08:56:45] (03PS1) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417536) [08:56:46] jouncebot: nowandnext [08:56:46] For the next 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0800) [08:56:46] In 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0900) [08:57:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T415786)', diff saved to https://phabricator.wikimedia.org/P88889 and previous config saved to /var/cache/conftool/dbconfig/20260219-085709-marostegui.json [08:57:14] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:57:15] (03CR) 10CI reject: [V:04-1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [08:57:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [08:57:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T415786)', diff saved to https://phabricator.wikimedia.org/P88890 and 
previous config saved to /var/cache/conftool/dbconfig/20260219-085723-marostegui.json [08:57:55] (03PS1) 10Majavah: P:dumps::distribution::nfs: Apply changes to exports file [puppet] - 10https://gerrit.wikimedia.org/r/1240609 [08:58:23] jnuche: arnaudb: I am going to upgrade the CI Jenkins since it is easier in the morning ( https://phabricator.wikimedia.org/T417791 ) [08:59:20] gotta read the changelog https://www.jenkins.io/changelog-stable/ [08:59:29] hashar: sounds good [08:59:41] ack thanks [09:00:05] dancy and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T0900). [09:02:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:02:49] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/1240430 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/1240418 can be backported to unblock the train [09:03:58] neat :] [09:04:13] they are so lacking in tests but I guess that is straightforward enough [09:04:42] jnuche: I guess we can backport the couple of OAuth patches above right now? 
I haven't stopped CI Jenkins and there is no rush for that upgrade [09:05:01] I don't mind doing it [09:05:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [09:05:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:05:28] hashar: sure, train is happening in US time this week [09:05:33] OH [09:05:36] OF COURSE [09:05:37] :) [09:05:44] MatmaRex: if we backport them now, would you be around to test/verify? [09:05:48] yep [09:06:10] I imagine the order does not matter and I can push them both together [09:06:13] (03CR) 10Elukey: [C:03+1] "Done" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [09:06:20] hashar: yeah [09:06:44] they're not really possible to test on mwdebug though. but i can verify on production test wikis once deployed [09:07:15] (03CR) 10Hashar: [C:03+2] "Train blocker and @dziewonski@fastmail.fm is around to validate the fix." [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240418 (https://phabricator.wikimedia.org/T417820) (owner: 10Jforrester) [09:07:20] (03CR) 10Hashar: [C:03+2] "Train blocker and @dziewonski@fastmail.fm is around to validate the fix." 
[extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240430 (https://phabricator.wikimedia.org/T417839) (owner: 10Reedy) [09:07:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:08:17] MatmaRex: sounds good [09:08:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240418 (https://phabricator.wikimedia.org/T417820) (owner: 10Jforrester) [09:08:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240430 (https://phabricator.wikimedia.org/T417839) (owner: 10Reedy) [09:08:27] the patches are in the pipe [09:08:39] (03Merged) 10jenkins-bot: Do not pass null to AccessTokenEntity::setUserIdentifier() [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240418 (https://phabricator.wikimedia.org/T417820) (owner: 10Jforrester) [09:09:11] (03PS1) 10Federico Ceratto: tox: use generic py3 during local development [cookbooks] - 10https://gerrit.wikimedia.org/r/1240614 [09:09:11] (03CR) 10Federico Ceratto: "As discussed on IRC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240614 (owner: 10Federico Ceratto) [09:09:12] oh [09:09:24] that Success Cache is such a blessing, backports get merged so fast sometime [09:09:28] (03PS2) 10Ayounsi: decom cookbook: use homer on Nokia switches [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) [09:09:51] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [09:09:54] and I think that if those two had been cherry picked in a chain, the second patch would probably have hit the success cache and would have merged 
by now [09:10:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:10:32] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [09:10:38] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:11:16] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:11:58] (03Merged) 10jenkins-bot: Fix "iss" field missing in OAuth 2 access token JWT [extensions/OAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240430 (https://phabricator.wikimedia.org/T417839) (owner: 10Reedy) [09:12:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P88891 and previous config saved to /var/cache/conftool/dbconfig/20260219-091217-marostegui.json [09:12:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:12:40] ah [09:12:45] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppetmaster::common [puppet] - 10https://gerrit.wikimedia.org/r/1240317 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:12:59] (03CR) 10Vgutierrez: [C:04-1] haproxy: symlink /etc/acmechief to cert tmpfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240395 (owner: 10BCornwall) [09:13:11] 09:12:11 The following are unexpected commits pulled from origin for /srv/mediawiki-staging: [09:13:11] d33a9e16db62da4cf4da0bb7292023b9f51117f5 Disable ReaderExperiments on beta commonswiki [09:13:29] I thought scap was smart enough to detect that it only affected beta and happily skip it [09:13:36] then it doesn't hurt to ask human 
confirmation [09:14:09] !log Deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1240406 config change for beta which was left unfetched/undeployed :) [09:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: remove exception for query.wikidata.org, automattic [puppet] - 10https://gerrit.wikimedia.org/r/1240616 (https://phabricator.wikimedia.org/T402959) [09:14:47] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1240418|Do not pass null to AccessTokenEntity::setUserIdentifier() (T417820)]], [[gerrit:1240430|Fix "iss" field missing in OAuth 2 access token JWT (T417839)]] [09:14:53] T417820: TypeError: MediaWiki\Extension\OAuth\Entity\AccessTokenEntity::setUserIdentifier(): Argument #1 ($identifier) must be of type string, null given, called in AccessTokenEntity.php - https://phabricator.wikimedia.org/T417820 [09:14:53] T417839: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839 [09:15:12] (03CR) 10CI reject: [V:04-1] decom cookbook: use homer on Nokia switches [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) (owner: 10Ayounsi) [09:16:10] (03CR) 10CI reject: [V:04-1] cache::haproxy: remove exception for query.wikidata.org, automattic [puppet] - 10https://gerrit.wikimedia.org/r/1240616 (https://phabricator.wikimedia.org/T402959) (owner: 10Giuseppe Lavagetto) [09:16:13] (03CR) 10Elukey: [C:03+2] type hints: use standard types as type hints [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239680 (owner: 10Volans) [09:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:17:03] (03CR) 10Hashar: "This one did not get fetched on the production deployment server and when running scap backport for other patches I have been warned of an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240406 (owner: 10Bvibber) [09:17:04] !log hashar@deploy2002 reedy, jforrester, hashar: Backport for [[gerrit:1240418|Do not pass null to AccessTokenEntity::setUserIdentifier() (T417820)]], [[gerrit:1240430|Fix "iss" field missing in OAuth 2 access token JWT (T417839)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:18:02] MatmaRex: patches are on the test servers [09:18:14] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/1240288 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:19:03] hashar: i can't test much until they're live [09:19:07] oh sorry [09:19:13] I misunderstood what you said earlier [09:19:15] !log hashar@deploy2002 reedy, jforrester, hashar: Continuing with sync [09:19:53] np [09:19:57] * hashar presses Y and hits the Return key [09:20:31] i suppose it would be possible to set up an oauth app on toolforge and make it send the headers so that it only talks to mwdebug servers. maybe something to consider. 
but i don't have one set up now :) [09:21:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:21:17] yeah I am not too worried since those patches look straightforward and low risk, but I always ask nonetheless [09:21:17] (03PS3) 10Ayounsi: decom cookbook: use homer on Nokia switches [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) [09:21:36] (03PS2) 10Vgutierrez: cache::haproxy: remove exception for query.wikidata.org, automattic [puppet] - 10https://gerrit.wikimedia.org/r/1240616 (https://phabricator.wikimedia.org/T402959) (owner: 10Giuseppe Lavagetto) [09:21:54] what I find surprising is we seem to lack a few tests to verify the integration with the 3rd party lib, which makes upgrades a bit fragile [09:22:16] (03Merged) 10jenkins-bot: type hints: use standard types as type hints [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1239680 (owner: 10Volans) [09:23:07] (03PS5) 10Muehlenhoff: Move the puppetmaster puppetdb client class under puppet_compiler [puppet] - 10https://gerrit.wikimedia.org/r/1240278 (https://phabricator.wikimedia.org/T365798) [09:23:24] I think there is also a task pointing out that mediawiki/core and mediawiki/vendor are not gated against changes made with CentralAuth so it can end up being broken [09:23:25] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240418|Do not pass null to AccessTokenEntity::setUserIdentifier() (T417820)]], [[gerrit:1240430|Fix "iss" field missing in OAuth 2 access token JWT (T417839)]] (duration: 08m 37s) [09:23:30] T417820: TypeError: MediaWiki\Extension\OAuth\Entity\AccessTokenEntity::setUserIdentifier(): Argument #1 ($identifier) must be of type string, null given, called in AccessTokenEntity.php - https://phabricator.wikimedia.org/T417820 [09:23:31] 
T417839: Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839 [09:23:34] MatmaRex: patches have been deployed [09:24:19] thanks. i'll verify and comment on the tasks in a moment [09:24:27] (03PS2) 10Muehlenhoff: Fix copy&paste errors in comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240202 [09:24:36] sure, let me know if any follow-up patch is needed over the course of the day [09:24:40] or this morning or whatever [09:24:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:26:11] (03PS1) 10Vgutierrez: acme_chief: Add support for symlinking $certs_path [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) [09:26:24] I am gonna check metrics [09:26:43] (03PS3) 10Arnaudb: java: update puppet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) [09:26:44] (03CR) 10CI reject: [V:04-1] acme_chief: Add support for symlinking $certs_path [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [09:27:17] cause there is an elevated rate of errors https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=000000006&var-level=$__all&var-channel=$__all&refresh=5m&viewPanel=panel-18 [09:27:17] and notice/debugs rate doubled since 8:03 ( https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=000000006&var-level=$__all&var-channel=$__all&refresh=5m&viewPanel=panel-25 ) [09:27:26] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P88893 and previous config saved to /var/cache/conftool/dbconfig/20260219-092726-marostegui.json [09:27:46] (03PS6) 10Muehlenhoff: Move the puppetmaster puppetdb client class under puppet_compiler [puppet] - 10https://gerrit.wikimedia.org/r/1240278 (https://phabricator.wikimedia.org/T365798) [09:29:10] (03PS2) 10Vgutierrez: acme_chief: Add support for symlinking $certs_path [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) [09:29:20] (03CR) 10Arnaudb: java: update puppet certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [09:29:59] (03CR) 10Volans: [C:03+1] "LGTM, checked PCC that resolves both v4 and v6" [puppet] - 10https://gerrit.wikimedia.org/r/1240605 (owner: 10Majavah) [09:30:21] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::nfs: Use IPs only in /etc/exports [puppet] - 10https://gerrit.wikimedia.org/r/1240605 (owner: 10Majavah) [09:31:24] (03CR) 10Elukey: "IIUC the idea is to start rolling out a new hiera key containing info about how to depool a server, or a set of servers, to avoid manual a" [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [09:31:25] the DEBUG raised because of DuplicateParse on commonswiki [09:32:21] (03CR) 10CI reject: [V:04-1] Fix copy&paste errors in comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240202 (owner: 10Muehlenhoff) [09:32:34] (03PS1) 10Vgutierrez: cache::haproxy: Use acme_chief::cert symlink support [puppet] - 10https://gerrit.wikimedia.org/r/1240619 (https://phabricator.wikimedia.org/T384227) [09:33:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [09:34:05] the spike of ERROR is from 
the http channel for 4629 failed fetches, that is commonswiki timing out reaching out to the Flickr API [09:34:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [09:35:04] and a bunch of other issues such as sparql?query= [09:35:08] anyway they are gone now [09:35:30] I am going to brew myself a coffee and do the CI Jenkins upgrade [09:36:06] (03CR) 10Elukey: [C:03+1] tox: use generic py3 during local development [cookbooks] - 10https://gerrit.wikimedia.org/r/1240614 (owner: 10Federico Ceratto) [09:37:12] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240609 (owner: 10Majavah) [09:37:58] !log lsw1-d7-eqiad# tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer - T411054 [09:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:02] T411054: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054 [09:38:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240619 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [09:39:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8072/co" [puppet] - 10https://gerrit.wikimedia.org/r/1240609 (owner: 10Majavah) [09:39:40] (03CR) 10Volans: [C:03+1] "LGTM, PCC also happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1240609 (owner: 10Majavah) [09:40:05] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::nfs: Apply changes to exports file [puppet] - 10https://gerrit.wikimedia.org/r/1240609 (owner: 10Majavah) [09:40:17] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.17; 2026-02-24): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11631101 (10matmarex) Works for me now at . 
Backpo... [09:42:20] (03PS1) 10Fabfur: cache::upload: exclude map tiles from global ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) [09:42:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T415786)', diff saved to https://phabricator.wikimedia.org/P88895 and previous config saved to /var/cache/conftool/dbconfig/20260219-094234-marostegui.json [09:42:39] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:42:45] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.17; 2026-02-24): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11631110 (10Don-vip) The issue is solved for me, thanks a lot @matmarex ! [09:42:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [09:42:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T415786)', diff saved to https://phabricator.wikimedia.org/P88896 and previous config saved to /var/cache/conftool/dbconfig/20260219-094258-marostegui.json [09:43:58] MatmaRex: well done! :-] [09:44:19] I am waiting for some jobs to merge [09:44:26] s/merge/complete/ [09:44:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:44:59] i'm looking at the error rate here for the other one: https://logstash.wikimedia.org/goto/a94ab0f480dc12720a790e50ba76978a i'll give it a few more minutes before declaring that it has dropped to zero [09:45:10] (but it is zero so far) [09:45:27] (03CR) 10Marostegui: "@fceratto@wikimedia.org a good way is to check the .yaml file for both proxies to ensure the config is the same, can you double check?" 
[dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [09:45:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240278 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:46:08] !log upgraded Jenkins 2.541.2 to 2.528.3 on contint2002 (Jenkins does not run there) Upgrade + T417791 [09:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] MatmaRex: great, thank you for monitoring the logs [09:48:19] (03PS3) 10Daniel Kinzler: rest gateway: expose headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [09:48:39] (03CR) 10Vgutierrez: [C:04-1] haproxy: symlink /etc/acmechief to cert tmpfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240395 (owner: 10BCornwall) [09:48:53] (03CR) 10Marostegui: [C:03+2] core_test.pp: Remove read_only check from core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1240220 (owner: 10Marostegui) [09:49:39] (03CR) 10Marostegui: [C:03+2] core_test.pp: Remove read_only check from core_test hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240220 (owner: 10Marostegui) [09:49:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:49] (03CR) 10Fabfur: "vtc tests are ok" [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:49:51] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:49:58] (03CR) 10Marostegui: [C:03+2] "Deployed the grant" [puppet] - 10https://gerrit.wikimedia.org/r/1240462 
(https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [09:52:44] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.17; 2026-02-24): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11631148 (10matmarex) 05Open→03Resolved [09:53:20] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:49] 06SRE, 10Gerrit, 10Wikibugs: Wikibugs should ignore `check experimental` messages for operations/puppet - https://phabricator.wikimedia.org/T417866 (10hashar) 03NEW [09:54:20] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631170 (10MoritzMuehlenhoff) >>! In T417632#11627179, @JMeybohm wrote: > @ayounsi suggested we could remove `linux-sysctl... 
[09:56:12] PROBLEM - MariaDB Event Scheduler s1 on db1163 is CRITICAL: CRIT: event_scheduler: False, expected True: OK: Version 10.11.14-MariaDB-log, Uptime 9776303s, read_only: True, 243.39 QPS, connection latency: 0.029022s, query latency: 0.000552s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [09:56:20] ^ me testing [09:56:37] 06SRE, 10Gerrit, 10Wikibugs: Wikibugs should ignore `check experimental` messages for operations/puppet - https://phabricator.wikimedia.org/T417866#11631180 (10hashar) [09:57:04] (03CR) 10Vgutierrez: "PCC check looks good, only error comes from a WMCS instance where puppet/pcc seems to be broken at the moment (proxy-03.project-proxy)" [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [09:57:12] RECOVERY - MariaDB Event Scheduler s1 on db1163 is OK: Version 10.11.14-MariaDB-log, Uptime 9776363s, read_only: True, event_scheduler: True, 172.25 QPS, connection latency: 0.029042s, query latency: 0.000529s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [09:57:22] (03PS2) 10Fabfur: cache::upload: exclude non-upload domains from global rl [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) [09:57:32] (03PS1) 10Majavah: P:dumps::distribution::nfs: Correctly format IPv6 addresses in exports [puppet] - 10https://gerrit.wikimedia.org/r/1240621 [09:58:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8076/co" [puppet] - 10https://gerrit.wikimedia.org/r/1240621 (owner: 10Majavah) [09:58:58] (03PS2) 10Muehlenhoff: Remove puppetmaster::ca_server and related classes [puppet] - 10https://gerrit.wikimedia.org/r/1240268 (https://phabricator.wikimedia.org/T365798) [10:00:32] (03CR) 10Federico Ceratto: [C:03+2] tox: use generic py3 during local development [cookbooks] - 
10https://gerrit.wikimedia.org/r/1240614 (owner: 10Federico Ceratto) [10:01:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240268 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:01:29] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1240621 (owner: 10Majavah) [10:02:08] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::nfs: Correctly format IPv6 addresses in exports [puppet] - 10https://gerrit.wikimedia.org/r/1240621 (owner: 10Majavah) [10:04:53] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [10:05:23] (03CR) 10Fabfur: [C:03+1] "LGTM, this needs to be merged at the same time as the other one I assume..." [puppet] - 10https://gerrit.wikimedia.org/r/1240619 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [10:05:45] (03CR) 10Federico Ceratto: [C:03+1] "You mean `diff hieradata/hosts/dbproxy1023.yaml hieradata/hosts/dbproxy1025.yaml`? They are identical." [dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [10:06:16] (03CR) 10Marostegui: "Good, that's how it should be!" [dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [10:06:40] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m2 master [dns] - 10https://gerrit.wikimedia.org/r/1240472 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [10:06:43] !log marostegui@dns1006 START - running authdns-update [10:07:01] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11631210 (10Tgr) Would have been nice to get a Phan warning for this. The Builder is marked `@immutable`, and Phan... 
[10:07:04] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:07:09] !log Switchover m2 master proxy host from dbproxy1023 to dbproxy1025 T414656 [10:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:13] T414656: Migrate dbproxy* to Debian Trixie - https://phabricator.wikimedia.org/T414656 [10:07:54] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::upload: exclude non-upload domains from global rl [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:09:00] (03CR) 10Vgutierrez: [C:03+1] cache::upload: exclude non-upload domains from global rl [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:09:48] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631212 (10JMeybohm) >>! In T417632#11631170, @MoritzMuehlenhoff wrote: > Can't we simply simply override net.ipv4.conf.*.... 
[10:10:15] (03CR) 10Elukey: [C:03+1] "This cookbook really needs a refactor, and the netbox functionality may be good in spicerack at some point, but something for the future :" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) (owner: 10Ayounsi) [10:13:40] (03PS1) 10Elukey: Remove the last occurrences of puppet_master attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 [10:16:20] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:17:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 28e6395e06b3d56868534acb99a99b045f49fb40, dns.git is bf419740becdabe449d1eefe1c010625a84f72b5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:17:46] !log marostegui@dns1006 START - running authdns-update [10:19:05] !log marostegui@dns1006 END - running authdns-update [10:19:30] (03CR) 10Fabfur: [C:03+2] cache::upload: exclude non-upload domains from global rl [puppet] - 10https://gerrit.wikimedia.org/r/1240620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:20:19] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::ca_server and related classes [puppet] - 10https://gerrit.wikimedia.org/r/1240268 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:20:22] 06SRE, 10Wikibase GraphQL, 06Wikibase Reuse Team, 10Wikidata: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11631223 (10Silvan_WMDE) Tagging the SRE team here, maybe you'll be able to help us with this server config change for wikidata.org? 
[10:21:18] (03CR) 10CI reject: [V:04-1] Remove the last occurrences of puppet_master attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 (owner: 10Elukey) [10:21:58] PROBLEM - MariaDB Events s1 on db1169 is CRITICAL: CRITICAL - Events not ENABLED: wmf_slave_purge(DISABLED) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [10:22:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:22:58] RECOVERY - MariaDB Events s1 on db1169 is OK: OK - All 4 events in ops database are ENABLED https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [10:26:51] (03CR) 10Ayounsi: [C:03+2] decom cookbook: use homer on Nokia switches [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) (owner: 10Ayounsi) [10:27:03] (03PS1) 10Muehlenhoff: installserver: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240629 [10:27:42] (03PS1) 10Majavah: mailmap: Merge more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/1240631 [10:28:18] (03PS1) 10Muehlenhoff: netbox: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240632 [10:28:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:30:34] (03PS3) 10Vgutierrez: acme_chief: Ensure that default certs_path always exists [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) [10:30:47] ^ fabfur this appears to be caused by your "cache::upload: exclude non-upload domains from global rl" patch [10:30:59] Failed to call refresh: '/usr/local/sbin/reload-vcl -n frontend 
-f /etc/varnish/wikimedia_upload-frontend.vcl -d 2 -a || (touch /var/tmp/reload-vcl-failed-frontend; false)' returned 1 instead of one of [0] [10:31:02] (03Abandoned) 10Vgutierrez: cache::haproxy: Use acme_chief::cert symlink support [puppet] - 10https://gerrit.wikimedia.org/r/1240619 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [10:31:22] (03CR) 10Effie Mouzeli: [C:03+2] mw-on-k8s: do not alert for mw-experimental and mw-parsoid [alerts] - 10https://gerrit.wikimedia.org/r/1239724 (owner: 10Effie Mouzeli) [10:31:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [10:31:41] (03CR) 10Hashar: [C:04-1] "I am relaying my IRC comment from yesterday:" [puppet] - 10https://gerrit.wikimedia.org/r/1239928 (https://phabricator.wikimedia.org/T417497) (owner: 10Jelto) [10:33:03] (03Merged) 10jenkins-bot: mw-on-k8s: do not alert for mw-experimental and mw-parsoid [alerts] - 10https://gerrit.wikimedia.org/r/1239724 (owner: 10Effie Mouzeli) [10:33:23] (03Merged) 10jenkins-bot: decom cookbook: use homer on Nokia switches [cookbooks] - 10https://gerrit.wikimedia.org/r/1240318 (https://phabricator.wikimedia.org/T417428) (owner: 10Ayounsi) [10:35:30] (03CR) 10Effie Mouzeli: [C:03+2] restbase::production: remove mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1239709 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [10:37:12] (03PS2) 10Marco Fossati: Shared stream for reader experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240264 (https://phabricator.wikimedia.org/T415611) [10:39:41] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:59] (03PS3) 10Ayounsi: WIP: create 
cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [10:41:02] (03CR) 10Ayounsi: "Thanks, the idea is to run --dry-run and `--show` first to know what is expected. I'm less convinced of a cookbook that prompts the user a" [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [10:42:48] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) [10:43:05] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631314 (10MoritzMuehlenhoff) Alternatively we could also rebuild linux-base for trixie-wikimedia and drop the rp-filter s... [10:43:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:45:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [10:45:14] (03CR) 10Ayounsi: [C:03+1] installserver: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240629 (owner: 10Muehlenhoff) [10:45:46] (03CR) 10Effie Mouzeli: [C:03+2] deployment_server: add parsoid pinkllama release #4 [puppet] - 10https://gerrit.wikimedia.org/r/1238349 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [10:46:08] (03PS1) 10Fabfur: Revert "cache::upload: exclude non-upload domains 
from global rl" [puppet] - 10https://gerrit.wikimedia.org/r/1240637 [10:46:40] (03CR) 10JMeybohm: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1237258 (https://phabricator.wikimedia.org/T414112) (owner: 10Jelto) [10:46:42] (03CR) 10CI reject: [V:04-1] WIP: create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [10:47:33] (03CR) 10JMeybohm: [C:03+1] "Sorry for getting back late on this, I've replied on the task" [puppet] - 10https://gerrit.wikimedia.org/r/1237258 (https://phabricator.wikimedia.org/T414112) (owner: 10Jelto) [10:48:57] (03CR) 10JMeybohm: [C:04-1] "This needs to be split as Jelto said. You should ensure the deployment works (and there is something actually deployed) before adding the " [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [10:49:16] (03PS4) 10Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [10:54:22] !log installing gnutls28 security updates [10:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:52] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631374 (10JMeybohm) >>! In T417632#11631314, @MoritzMuehlenhoff wrote: > Alternatively we could also rebuild linux-base f... 
[10:58:54] (03PS1) 10Fabfur: cache:upload: fix missing parentheses [puppet] - 10https://gerrit.wikimedia.org/r/1240640 (https://phabricator.wikimedia.org/T406545) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1100) [11:00:05] effie: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:24] (03CR) 10Vgutierrez: [C:03+1] "VTCs are fixes with this CR" [puppet] - 10https://gerrit.wikimedia.org/r/1240640 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:00:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240640 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:01:29] (03CR) 10Joal: [C:03+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) (owner: 10JavierMonton) [11:02:19] (03CR) 10Fabfur: [C:03+2] cache:upload: fix missing parentheses [puppet] - 10https://gerrit.wikimedia.org/r/1240640 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:03:21] (03CR) 10Aqu: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) (owner: 10JavierMonton) [11:06:34] (03CR) 10Joal: [C:03+1] Add new dimensions to banner_activity in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240298 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [11:11:51] (03CR) 10JavierMonton: stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) (owner: 10JavierMonton) [11:12:15] folks please do not run any deployments and scap as I will be doing some work as scheduled on the deployment cal
[11:12:28] well mediawiki related ones that is [11:13:12] (03PS2) 10Elukey: Remove the last occurrences of puppet_master attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 [11:14:04] (03Abandoned) 10Fabfur: Revert "cache::upload: exclude non-upload domains from global rl" [puppet] - 10https://gerrit.wikimedia.org/r/1240637 (owner: 10Fabfur) [11:14:13] (03PS3) 10Elukey: Remove the last occurrences of puppet_master attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 [11:15:33] (03PS6) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 [11:15:34] (03CR) 10Blake: "Hey folks, I'm not quite sure why the tests are failing here - if there's anything I can do to help fix those, please let me know (it does" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (owner: 10Blake) [11:17:50] (03CR) 10Volans: "If I may add my 2 cents to the topic (here as there is no related task):" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240614 (owner: 10Federico Ceratto) [11:18:38] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: mount parsoid-testing via hostPath #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238355 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [11:21:47] (03CR) 10Elukey: "Hey!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (owner: 10Blake) [11:22:10] (03Merged) 10jenkins-bot: mediawiki: mount parsoid-testing via hostPath #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238355 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [11:22:34] (03CR) 10CI reject: [V:04-1] spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (owner: 10Blake) [11:23:34] (03CR) 10Volans: "Setuptools 82.0.0 has removed the deprecated pkg_resources [1], that is still used because was waiting for pywmflib to migrate to importli" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (owner: 10Blake) [11:23:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:25:12] (03CR) 10Elukey: "@blake@wikimedia.org - you can see an example of standard vs premium support! :D" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (owner: 10Blake) [11:26:27] :D [11:28:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:29:01] (03CR) 10Fabfur: [C:03+1] acme_chief: Ensure that default certs_path always exists [puppet] - 10https://gerrit.wikimedia.org/r/1240618 (https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [11:29:46] !log jiji@deploy2002 Started scap sync-world: switching mw-parsoid to pinkllama releases (T386246) [11:29:50] T386246: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246 [11:30:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 (owner: 10Elukey) [11:30:49] !log jiji@deploy2002 jiji: switching mw-parsoid to pinkllama releases (T386246) synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:32:34] !log jiji@deploy2002 jiji: Continuing with sync [11:34:09] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [11:34:21] !log jiji@deploy2002 Finished scap sync-world: switching mw-parsoid to pinkllama releases (T386246) (duration: 06m 12s) [11:34:44] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [11:38:15] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet [11:38:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:41:36] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: repurpose for parsoidtest use #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237472 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [11:43:44] (03Merged) 10jenkins-bot: mw-parsoid: repurpose for parsoidtest use #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237472 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [11:43:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1067.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1281.eqiad.wmne [11:43:44] ube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1076.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, 
wikikube-worker1161.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker105 [11:43:44] wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1106.eqiad.wmnet, wikikube-worker1066.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [11:43:44] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmne [11:43:44] ube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker114 [11:43:44] wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1076.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [11:43:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:43:49] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Ensure that default certs_path always exists [puppet] - 10https://gerrit.wikimedia.org/r/1240618 
(https://phabricator.wikimedia.org/T384227) (owner: 10Vgutierrez) [11:44:08] this is me [11:44:13] and it shouldn't have alerted [11:44:33] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet [11:48:10] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:48:24] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:50:13] (03PS7) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [11:53:31] arnaudb: gerrit1003 is consistently erroring out on backup, should I silence it due to ongoing maintenance? [11:53:43] (03CR) 10Blake: "Thanks very much, y'all! I'm happy to wait until after the Spicerack patch is done, I can start looking at how we'll use this to implement" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [11:54:42] yes please! I thought I disabled backup but it was just monitoring. for maybe a week? [11:55:09] I believe you didn't do either, at least for centralized backup monitoring [11:55:16] let me send a patch for your review [11:57:26] (03CR) 10CI reject: [V:04-1] spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [11:57:56] (03PS1) 10Jcrespo: backup: Temporarilly ignore backup job failures from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1240648 (https://phabricator.wikimedia.org/T417246) [11:58:13] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11631494 (10matmarex) Since PHP 8.5 the `withX()` methods can be annotated with `#[\NoDiscard]`, which will cause a... [11:58:25] (03PS2) 10Jcrespo: backup: Temporarily ignore backup job failures from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1240648 (https://phabricator.wikimedia.org/T417246) [11:59:09] arnaudb: this is how we can ignore individual job failures: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240648 [11:59:44] (03CR) 10Arnaudb: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1240648 (https://phabricator.wikimedia.org/T417246) (owner: 10Jcrespo) [11:59:52] I propose merging that, create a revert and you have the preapproval to revert when finished [12:00:16] (03PS1) 10Effie Mouzeli: Revert "mw-parsoid: repurpose for parsoidtest use #6" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240649 [12:00:25] (03CR) 10Jcrespo: [C:03+2] backup: Temporarily ignore backup job failures from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1240648 (https://phabricator.wikimedia.org/T417246) (owner: 10Jcrespo) [12:00:28] good idea, i'll merge and revert after lunch. thanks for the patch! 
[12:01:29] (03PS1) 10Jcrespo: Revert "backup: Temporarily ignore backup job failures from gerrit1003" [puppet] - 10https://gerrit.wikimedia.org/r/1240650 [12:02:15] (03CR) 10Clément Goubert: [C:03+1] Revert "mw-parsoid: repurpose for parsoidtest use #6" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240649 (owner: 10Effie Mouzeli) [12:02:37] (03CR) 10Jcrespo: [C:03+1] "preaproval to merge his (unless conflict with a future patch) when Arnaud considers adequate, after their team's work concludes." [puppet] - 10https://gerrit.wikimedia.org/r/1240650 (owner: 10Jcrespo) [12:03:04] (03CR) 10Muehlenhoff: [C:03+2] Puppetserver: Update hooks [puppet] - 10https://gerrit.wikimedia.org/r/1104627 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:03:06] arnaudb: revert is precreated here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240650 [12:04:08] (03CR) 10Effie Mouzeli: [C:03+2] Revert "mw-parsoid: repurpose for parsoidtest use #6" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240649 (owner: 10Effie Mouzeli) [12:06:12] (03Merged) 10jenkins-bot: Revert "mw-parsoid: repurpose for parsoidtest use #6" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240649 (owner: 10Effie Mouzeli) [12:07:27] (03CR) 10Muehlenhoff: [C:03+2] installserver: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240629 (owner: 10Muehlenhoff) [12:18:28] (03PS1) 10Effie Mouzeli: mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240663 (https://phabricator.wikimedia.org/T386246) [12:18:39] (03CR) 10CI reject: [V:04-1] mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240663 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [12:19:58] (03Abandoned) 10Effie Mouzeli: mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240663 
(https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [12:22:06] (03PS1) 10Effie Mouzeli: mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240665 (https://phabricator.wikimedia.org/T386246) [12:23:09] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240666 [12:28:28] (03CR) 10JMeybohm: [C:03+1] mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240665 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [12:28:36] (03CR) 10Matthias Mullie: [C:03+2] Shared stream for reader experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240264 (https://phabricator.wikimedia.org/T415611) (owner: 10Marco Fossati) [12:29:48] (03Merged) 10jenkins-bot: Shared stream for reader experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240264 (https://phabricator.wikimedia.org/T415611) (owner: 10Marco Fossati) [12:30:19] (03CR) 10Muehlenhoff: [C:03+2] kafkatee::webrequest::ops: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1237494 (owner: 10Muehlenhoff) [12:31:01] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240665 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [12:32:57] (03Merged) 10jenkins-bot: mw-parsoid: repurpose for parsoidtest use #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240665 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [12:33:34] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:34:09] (03PS1) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) [12:34:23] (03PS1) 10MVernon: ceph: pull in new upstream version of 
reef 18.2.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) [12:35:00] Ah crap, I just noticed I +2'ed a config patch instead of +1. It's a no-op change for now, but I'll go scap it now to avoid confusion [12:35:00] (03CR) 10CI reject: [V:04-1] Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [12:36:26] (03CR) 10MVernon: "fancy a +1 on this, please?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) (owner: 10MVernon) [12:36:43] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1240264|Shared stream for reader experiments (T415611)]] [12:36:47] T415611: Set up measurement plan and instrumentation spec for mobile TOC - https://phabricator.wikimedia.org/T415611 [12:38:40] (03PS1) 10Effie Mouzeli: mw-parsoid/experimental: resize CPU resources temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240677 [12:38:53] !log mlitn@deploy2002 mfossati, mlitn: Backport for [[gerrit:1240264|Shared stream for reader experiments (T415611)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:39:24] (03PS2) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) [12:39:31] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:40:25] !log mlitn@deploy2002 mfossati, mlitn: Continuing with sync [12:40:25] (03PS3) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) [12:42:00] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: remove exception for query.wikidata.org, automattic [puppet] - 10https://gerrit.wikimedia.org/r/1240616 (https://phabricator.wikimedia.org/T402959) (owner: 10Giuseppe Lavagetto) [12:42:08] (03CR) 10Muehlenhoff: [C:03+2] analytics::cluster::client: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1237492 (owner: 10Muehlenhoff) [12:42:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11631650 (10Jclark-ctr) a:03Jclark-ctr [12:43:46] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid/experimental: resize CPU resources temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240677 (owner: 10Effie Mouzeli) [12:43:47] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:44:16] ack noted, thanks jynus [12:44:24] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240264|Shared stream for reader experiments (T415611)]] (duration: 07m 42s) [12:44:29] T415611: Set up measurement plan and instrumentation spec for mobile TOC - https://phabricator.wikimedia.org/T415611 [12:44:51] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T417881 (10phaultfinder) 03NEW 
[12:45:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11631672 (10Jclark-ctr) a:03Jclark-ctr [12:45:44] (03Merged) 10jenkins-bot: mw-parsoid/experimental: resize CPU resources temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240677 (owner: 10Effie Mouzeli) [12:46:24] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [12:46:58] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [12:47:39] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [12:50:19] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239377 (owner: 10PipelineBot) [12:50:19] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225556 (owner: 10PipelineBot) [12:50:19] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229144 (owner: 10PipelineBot) [12:50:19] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227818 (owner: 10PipelineBot) [12:50:20] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1229139 (owner: 10PipelineBot) [12:50:23] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239598 (owner: 10PipelineBot) [12:50:37] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:51:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239692 (https://phabricator.wikimedia.org/T386246) (owner: 10Jgiannelos) [12:53:51] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:53:53] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:55:12] !log imported linux-base 4.12.1+wmf1 to trixie-wikimedia - T417632 [12:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:16] T417632: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632 [12:56:34] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [12:56:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [12:56:53] (03PS1) 10Marostegui: mariadb: Alert on pt-heartbeat not running [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) [12:56:59] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1003.eqiad.wmnet [12:57:26] (03CR) 10CI reject: [V:04-1] mariadb: Alert on pt-heartbeat not running [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [12:59:09] (03PS1) 10Jgiannelos: proofreadpage: 
Enable parsoid for rendering extension output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240682 (https://phabricator.wikimedia.org/T408915) [12:59:31] !log jayme@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubestage1003.eqiad.wmnet [12:59:59] (03CR) 10Elukey: [C:03+2] Remove the last occurrences of puppet_master attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1240624 (owner: 10Elukey) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1300) [13:01:07] (03CR) 10Arnaudb: [C:03+2] java: update puppet certificate [puppet] - 10https://gerrit.wikimedia.org/r/1240477 (https://phabricator.wikimedia.org/T417767) (owner: 10Arnaudb) [13:01:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240682 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [13:01:28] 10SRE-SLO, 06ServiceOps new, 07Essential-Work, 10iPoid-Service (iPoid 1.0), 06Product Safety and Integrity (Sprint Flower (Feb 9 - Feb 27)): IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11631763 (10kostajh) Bringing this into the current... 
[13:02:52] (03CR) 10Elukey: "@blake@wikimedia.org I think that we can pin setuptools<82.0.0 as Riccardo suggested for the moment (in a separate patch chained to this o" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:26] (03CR) 10Sbisson: [C:03+1] Update Recommendation API to 2026-02-10-184357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240464 (https://phabricator.wikimedia.org/T409482) (owner: 10KartikMistry) [13:04:29] (03PS2) 10Marostegui: mariadb: Alert on pt-heartbeat not running [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) [13:05:22] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host kubestage1003.eqiad.wmnet [13:05:30] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240685 [13:06:10] (03PS1) 10JMeybohm: admin/files: Ensure __kube_env_ps1 is always defined [puppet] - 10https://gerrit.wikimedia.org/r/1240686 [13:06:14] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] proofreadpage: Enable parsoid for rendering extension output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240682 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [13:09:44] (03CR) 10JMeybohm: [C:03+2] admin/files: Ensure __kube_env_ps1 is always defined [puppet] - 10https://gerrit.wikimedia.org/r/1240686 (owner: 10JMeybohm) [13:12:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1003.eqiad.wmnet [13:12:29] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.sync-instances sync Gerrit 
data from gerrit2003.wikimedia.org to gerrit1003.wikimedia.org [13:12:58] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1003.eqiad.wmnet [13:13:00] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1003.eqiad.wmnet [13:14:03] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp1001.eqiad.wmnet [13:14:30] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 07Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#11631853 (10hashar) [13:14:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 07Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#11631858 (10hashar) [13:17:54] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2026-02-10-184357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240464 (https://phabricator.wikimedia.org/T409482) (owner: 10KartikMistry) [13:19:57] (03Merged) 10jenkins-bot: Update Recommendation API to 2026-02-10-184357-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240464 (https://phabricator.wikimedia.org/T409482) (owner: 10KartikMistry) [13:20:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp1001.eqiad.wmnet [13:20:45] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631892 (10JMeybohm) 05Open→03Resolved a:03JMeybohm A patched package (from trixie-proposed-updates) has been up... 
[13:20:58] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp2001.codfw.wmnet [13:23:31] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3704.19 ms [13:23:55] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240690 (https://phabricator.wikimedia.org/T417717) [13:24:17] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:24:35] (03PS1) 10Muehlenhoff: Add two git hooks from the puppetmaster module to the pupperserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) [13:24:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:24:50] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:25:00] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:25:03] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:14] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) [13:26:24] (03CR) 10CI reject: [V:04-1] Add two git hooks from the puppetmaster module to the pupperserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) 
(owner: 10Muehlenhoff) [13:26:42] (03PS1) 10Urbanecm: [Growth] Force legacy validation of GrowthMentorList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) [13:27:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp2001.codfw.wmnet [13:27:48] (03PS2) 10Urbanecm: [Growth] Force legacy validation of GrowthMentorList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) [13:28:42] (03PS2) 10Muehlenhoff: Add two git hooks from the puppetmaster module to the pupperserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) [13:28:50] (03CR) 10Brouberol: [C:03+2] Add new dimensions to banner_activity in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240298 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [13:29:19] (03PS1) 10Urbanecm: [Growth] Enable new GrowthMentorList validation on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) [13:29:30] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:29:34] (03PS2) 10Urbanecm: [Growth] beta: Enable new GrowthMentorList validation on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) [13:29:55] (03CR) 10Elukey: [C:03+1] "I was confused at first, buut IIUC you already uploaded 18.2.7-1~bpo12+1 to thirdparty/ceph-reef in Bookworm and this will just pull the d" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) (owner: 10MVernon) [13:30:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:18] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:38] (03CR) 10Marostegui: "Going to be tested on core_test hosts only for now." [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [13:30:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:30:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:31:54] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:09] (03CR) 10MVernon: "weirdly, I _didn't_ - I went to check for updates, and someone else had got there first... 
which is a bit odd, because there's nothing in " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) (owner: 10MVernon) [13:33:18] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:34:51] (03CR) 10MVernon: [V:03+2] ceph: pull in new upstream version of reef 18.2.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) (owner: 10MVernon) [13:34:56] (03CR) 10MVernon: [V:03+2 C:03+2] ceph: pull in new upstream version of reef 18.2.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240673 (https://phabricator.wikimedia.org/T417396) (owner: 10MVernon) [13:40:33] (03PS3) 10Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) [13:41:20] (03PS1) 10Arnaudb: gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1240689 (https://phabricator.wikimedia.org/T417246) [13:41:20] (03CR) 10Arnaudb: "the pcc output looks OK to me:" [puppet] - 10https://gerrit.wikimedia.org/r/1240689 (https://phabricator.wikimedia.org/T417246) (owner: 10Arnaudb) [13:42:42] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [13:43:02] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:43:43] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:44:19] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:44:21] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:44:31] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:44:44] (just catching up on pending admin_ng changes) [13:44:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11632008 (10Jclark-ctr) [13:45:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:45:12] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:45:27] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:45:50] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:46:04] (03PS1) 10MVernon: openjkd-21-jre: fix malformed changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 [13:46:07] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:46:38] (03CR) 10MVernon: "Hi," [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 (owner: 10MVernon) [13:46:48] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:47:28] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:48:22] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:48:34] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:49:07] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 13Patch-For-Review: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396#11632017 (10MatthewVernon) New images built. [13:49:43] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:52:49] (03CR) 10Federico Ceratto: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [13:53:13] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.sync-instances (exit_code=0) sync Gerrit data from gerrit2003.wikimedia.org to gerrit1003.wikimedia.org [13:53:32] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:53:43] (03PS4) 10Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) [13:54:33] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:54:46] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:55:33] (03CR) 10CI reject: [V:04-1] tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:59:20] (03PS5) 10Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1400). [14:00:06] MatmaRex, matthiasmullie, and nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:15] o/ I can’t deploy 😔 [14:00:46] hi [14:01:14] (i don't have deployment access) [14:01:41] and i guess i need to reschedule my things. so, over to you matthiasmullie? 
[14:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240685 (owner: 10Matthias Mullie) [14:02:17] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:02:38] Alright, I've begun mine [14:03:39] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240685 (owner: 10Matthias Mullie) [14:04:09] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1240685|Squashed diff to master]] [14:05:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:05:45] (03PS6) 10Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) [14:06:17] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1240685|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:07:00] !log mlitn@deploy2002 mlitn: Continuing with sync [14:07:14] MatmaRex: i'm happy to help if needed [14:07:41] but we're not on wmf.16, so... 
[14:10:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:10:37] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [14:10:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:10:52] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240685|Squashed diff to master]] (duration: 06m 43s) [14:11:19] I'm done; handing over to ... urbanecm? [14:11:35] depends on whether MatmaRex (or nemo-yiannis) wants me to deploy something [14:11:55] MatmaRex: nemo-yiannis: can you advise? [14:12:51] nothing from me [14:13:12] ok [14:13:19] nemo-yiannis: go ahead then :) [14:13:34] hey sorry, i was afk [14:13:41] i can start [14:13:48] no prob [14:14:27] i think my previous message didn't go through. i might reschedule things for the evening, or for next week [14:14:39] it didn't, thanks for re-sending. [14:15:34] (03CR) 10Jgiannelos: [C:03+2] parsoid: Override test config for parsoid testing env on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239692 (https://phabricator.wikimedia.org/T386246) (owner: 10Jgiannelos) [14:15:43] (03CR) 10Xcollazo: "Will this maybe fix T404202?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1240621 (owner: 10Majavah) [14:16:49] (03Merged) 10jenkins-bot: parsoid: Override test config for parsoid testing env on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239692 (https://phabricator.wikimedia.org/T386246) (owner: 10Jgiannelos) [14:17:45] !log Updated Recommendation API to 2026-02-10-184357-production (T409482) [14:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:49] T409482: Consider omitting section suggestion when the only sections left to translate are appendix - https://phabricator.wikimedia.org/T409482 [14:18:12] (03PS1) 10Muehlenhoff: Run Gerrit spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240703 [14:18:44] !log jgiannelos@deploy2002 Started scap sync-world: Backport for [[gerrit:1239692|parsoid: Override test config for parsoid testing env on k8s (T386246)]] [14:18:48] T386246: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246 [14:18:50] (03PS2) 10Muehlenhoff: Run Gerrit spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240703 [14:20:54] !log jgiannelos@deploy2002 jgiannelos: Backport for [[gerrit:1239692|parsoid: Override test config for parsoid testing env on k8s (T386246)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:21:11] (03PS1) 10Volans: wmcs: infra-tracing-nfs fix home detection [puppet] - 10https://gerrit.wikimedia.org/r/1240704 (https://phabricator.wikimedia.org/T399313) [14:21:37] (03CR) 10Volans: "tested on toolsbeta nfs worker and bastion" [puppet] - 10https://gerrit.wikimedia.org/r/1240704 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [14:22:12] !log jgiannelos@deploy2002 jgiannelos: Continuing with sync [14:22:35] cc effie ^ [14:23:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [14:24:30] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11632158 (10ssingh) @BBlack and I had a long discussion about this (longer than the usual ones!) and some of it carried over to T366193 where we discussed getting two new /24s for the v4 anycasts.... [14:24:59] (03CR) 10Majavah: [V:03+1 C:03+2] "No, that task is about the rsync interface and this only touches the NFS config files. 
However, https://gerrit.wikimedia.org/r/c/operation" [puppet] - 10https://gerrit.wikimedia.org/r/1240621 (owner: 10Majavah) [14:26:07] !log jgiannelos@deploy2002 Finished scap sync-world: Backport for [[gerrit:1239692|parsoid: Override test config for parsoid testing env on k8s (T386246)]] (duration: 07m 23s) [14:26:11] T386246: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246 [14:26:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jgiannelos@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240682 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [14:27:46] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:27:50] (03Merged) 10jenkins-bot: proofreadpage: Enable parsoid for rendering extension output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240682 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [14:28:18] !log jgiannelos@deploy2002 Started scap sync-world: Backport for [[gerrit:1240682|proofreadpage: Enable parsoid for rendering extension output (T408915)]] [14:28:22] T408915: visualdiff testing: Escaped link elements are shown in page view on fr.wikisource.org - https://phabricator.wikimedia.org/T408915 [14:29:51] (03PS1) 10Jgreen: Switch frack bastion to frbast-eqiad [dns] - 10https://gerrit.wikimedia.org/r/1240705 [14:30:03] (03PS1) 10Ladsgroup: OutputPage: Sort language links before storing them [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240706 (https://phabricator.wikimedia.org/T253764) [14:30:29] !log jgiannelos@deploy2002 jgiannelos: Backport for [[gerrit:1240682|proofreadpage: Enable parsoid for rendering extension output (T408915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:31:27] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726#11632190 (10Jhancock.wm) The tracking for the shipment now says they were unable to deliver the package and are sending to back to the shipper. I have asked Dell to resend. [14:32:04] !log jgiannelos@deploy2002 jgiannelos: Continuing with sync [14:32:21] (03PS8) 10Daniel Kinzler: rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) [14:32:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:33:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:34:22] jouncebot: nowandnext [14:34:22] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1400) [14:34:22] In 0 hour(s) and 55 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1530) [14:36:00] !log jgiannelos@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240682|proofreadpage: Enable parsoid for rendering extension output (T408915)]] (duration: 07m 41s) [14:36:03] (03PS1) 10Ladsgroup: OutputPage: Sort language links before storing them [core] (wmf/1.46.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1240707 (https://phabricator.wikimedia.org/T253764) [14:36:04] T408915: visualdiff testing: Escaped link elements are shown in page view on fr.wikisource.org - 
https://phabricator.wikimedia.org/T408915 [14:36:21] (03CR) 10Jgreen: [C:03+2] Switch frack bastion to frbast-eqiad [dns] - 10https://gerrit.wikimedia.org/r/1240705 (owner: 10Jgreen) [14:36:31] ok i think i am done with my backport patches [14:36:38] !log jgreen@dns1004 START - running authdns-update [14:36:52] (03PS2) 10Tiziano Fogli: thanos::rule: add ExecReload to the service unit [puppet] - 10https://gerrit.wikimedia.org/r/1239906 (https://phabricator.wikimedia.org/T414579) [14:36:52] (03PS24) 10Tiziano Fogli: slothslos: add module to build and deploy sloth manifests [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) [14:37:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:37:58] !log jgreen@dns1004 END - running authdns-update [14:38:19] Is there any other deploys needing to happen? 
[14:38:30] if not, I push something forward [14:39:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240706 (https://phabricator.wikimedia.org/T253764) (owner: 10Ladsgroup) [14:39:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1240707 (https://phabricator.wikimedia.org/T253764) (owner: 10Ladsgroup) [14:39:41] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:48] (03CR) 10Filippo Giunchedi: [C:03+1] wmcs: infra-tracing-nfs fix home detection [puppet] - 10https://gerrit.wikimedia.org/r/1240704 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [14:42:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:43:01] (03PS8) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [14:43:59] (03PS2) 10Elukey: setup.py: Pin setuptools < 82.0.0 to make pkg_resources available.
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 (owner: 10Blake) [14:44:01] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11632289 (10MatthewVernon) [14:44:29] (03PS7) 10Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) [14:45:56] 10ops-codfw, 06SRE, 06DC-Ops: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T417336#11632302 (10Jhancock.wm) Yeah, we can do that. When is a good time to do that? [14:46:18] (03PS9) 10Daniel Kinzler: rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) [14:46:32] (03CR) 10Daniel Kinzler: rest route: support multiple rate limit policies at once (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [14:47:00] (03CR) 10Daniel Kinzler: [C:04-1] "needs chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [14:47:18] (03CR) 10Majavah: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:47:20] (03CR) 10Volans: [C:03+2] wmcs: infra-tracing-nfs fix home detection [puppet] - 10https://gerrit.wikimedia.org/r/1240704 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [14:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:48:35] (03PS12) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [14:48:38] (03PS1) 10MVernon: swift: drain 3 more codfw nodes for re-image on new VLAN [puppet] - 10https://gerrit.wikimedia.org/r/1240710 (https://phabricator.wikimedia.org/T354872) [14:48:55] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11632312 (10Arcstur) QuickStatements 3.0 login is working again, thank you!! [14:49:05] (03PS1) 10Muehlenhoff: sslcert::certificate: Remove use_cergen [puppet] - 10https://gerrit.wikimedia.org/r/1240711 (https://phabricator.wikimedia.org/T357750) [14:50:10] (03Merged) 10jenkins-bot: OutputPage: Sort language links before storing them [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240706 (https://phabricator.wikimedia.org/T253764) (owner: 10Ladsgroup) [14:51:41] (03CR) 10CI reject: [V:04-1] setup.py: Pin setuptools < 82.0.0 to make pkg_resources available. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 (owner: 10Blake) [14:52:18] (03CR) 10CI reject: [V:04-1] spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [14:52:23] (03PS9) 10Arnaudb: gerrit: adapt httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417536) [14:52:32] (03PS1) 10Elukey: profile::puppetserver: simplify analytics-sre's authorized key [puppet] - 10https://gerrit.wikimedia.org/r/1240714 (https://phabricator.wikimedia.org/T402512) [14:52:39] (03PS2) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417897) [14:52:47] (03Merged) 10jenkins-bot: OutputPage: Sort language links before storing them [core] (wmf/1.46.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1240707 (https://phabricator.wikimedia.org/T253764) (owner: 10Ladsgroup) [14:53:02] (03PS3) 10Arnaudb: gerrit: add gerrit-replica service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417897) [14:53:12] (03CR) 10Elukey: "I kept the .erb since I'll add later on the command= restriction." 
[puppet] - 10https://gerrit.wikimedia.org/r/1240714 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:53:17] (03PS3) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417897) [14:53:20] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1240706|OutputPage: Sort language links before storing them (T253764)]], [[gerrit:1240707|OutputPage: Sort language links before storing them (T253764)]] [14:53:24] T253764: Undeploy the InterwikiSorting extension from Wikipedia production - https://phabricator.wikimedia.org/T253764 [14:54:05] (03CR) 10Elukey: [C:03+1] tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:54:13] (03CR) 10Ssingh: "Happy to review from Traffic's side but the question in the meantime is and I wonder, how do we proceed on this? 
There's a bunch of servic" [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [14:54:51] (03CR) 10Elukey: [C:03+1] Add two git hooks from the puppetmaster module to the pupperserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:55:01] (03CR) 10Elukey: [C:03+1] netbox: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240632 (owner: 10Muehlenhoff) [14:55:01] (03PS1) 10Muehlenhoff: Remove puppetmaster::geoip [puppet] - 10https://gerrit.wikimedia.org/r/1240715 (https://phabricator.wikimedia.org/T365798) [14:55:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:55:25] (03CR) 10Elukey: [C:03+1] sslcert::certificate: Remove use_cergen [puppet] - 10https://gerrit.wikimedia.org/r/1240711 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:55:36] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [14:55:42] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1240706|OutputPage: Sort language links before storing them (T253764)]], [[gerrit:1240707|OutputPage: Sort language links before storing them (T253764)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:56:17] (03CR) 10Federico Ceratto: "I added the checks as discussed." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [14:57:14] (03PS3) 10Marostegui: mariadb: Alert on pt-heartbeat not running [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) [14:57:18] (03PS25) 10Tiziano Fogli: slothslos: add module to build and deploy sloth manifests [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) [14:57:33] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [14:59:37] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [15:01:26] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240706|OutputPage: Sort language links before storing them (T253764)]], [[gerrit:1240707|OutputPage: Sort language links before storing them (T253764)]] (duration: 08m 06s) [15:01:30] T253764: Undeploy the InterwikiSorting extension from Wikipedia production - https://phabricator.wikimedia.org/T253764 [15:02:25] (03CR) 10Elukey: setup.py: Pin setuptools < 82.0.0 to make pkg_resources available. 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 (owner: 10Blake) [15:03:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240715 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:03:48] (03CR) 10Muehlenhoff: [C:03+2] netbox: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240632 (owner: 10Muehlenhoff) [15:04:12] (03PS4) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) [15:05:42] (03CR) 10Federico Ceratto: [C:03+1] "I see the three host being drained matching the CR description and appearing (unticked) in the related task description" [puppet] - 10https://gerrit.wikimedia.org/r/1240710 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:06:00] (03PS1) 10Esanders: Remove Editing-related config for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240716 (https://phabricator.wikimedia.org/T400063) [15:06:08] (03PS3) 10Blake: setup.py: Pin setuptools < 82.0.0 to make pkg_resources available. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 [15:06:08] (03PS9) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [15:06:36] (03PS1) 10Muehlenhoff: cassandra: Run spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240717 [15:07:44] (03CR) 10MVernon: [C:03+2] swift: drain 3 more codfw nodes for re-image on new VLAN [puppet] - 10https://gerrit.wikimedia.org/r/1240710 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:07:55] (03CR) 10Marostegui: "The failure is because db2230 still has puppet disabled: https://phabricator.wikimedia.org/T416582#11631322" [puppet] - 10https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [15:08:19] (03PS4) 10Blake: setup.py: Pin setuptools < 82.0.0 to make pkg_resources available. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 [15:08:19] (03PS10) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [15:09:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239270 (https://phabricator.wikimedia.org/T417349) (owner: 10Arlolra) [15:09:22] (03PS1) 10Muehlenhoff: openldap::management: Remove spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240719 [15:11:02] (03PS1) 10Muehlenhoff: Remove obsolete mediawiki spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1240720 [15:13:20] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::geoip [puppet] - 10https://gerrit.wikimedia.org/r/1240715 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:13:21] (03PS1) 10Esanders: Stop PasteCheck A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240721 (https://phabricator.wikimedia.org/T417429) [15:14:41] FIRING: [6x] 
CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:15:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [15:15:40] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) (owner: 10JavierMonton) [15:16:18] (03CR) 10Muehlenhoff: [C:03+2] openldap::management: Remove spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240719 (owner: 10Muehlenhoff) [15:17:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:17:54] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240634 (https://phabricator.wikimedia.org/T417184) (owner: 10JavierMonton) [15:18:08] (03CR) 10CI reject: [V:04-1] spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [15:18:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:19:41] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:19:47] (03CR) 10Daniel Kinzler: [C:03+2] rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [15:22:12] (03Merged) 10jenkins-bot: rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [15:22:22] (03CR) 10Blake: setup.py: Pin setuptools < 82.0.0 to make pkg_resources available. 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240702 (owner: 10Blake) [15:22:38] (03CR) 10Muehlenhoff: [C:03+2] sslcert::certificate: Remove use_cergen [puppet] - 10https://gerrit.wikimedia.org/r/1240711 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:22:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:23:12] (03CR) 10Hnowlan: [C:03+1] api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) (owner: 10Clément Goubert) [15:23:18] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1240720 [15:24:01] (03CR) 10Herron: "Thanks! Although seems a bit different than the flow outlined above? As I understood the earlier comment:" [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:24:34] (03PS1) 10JMeybohm: k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) [15:26:41] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox) [15:28:35] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:28:41] (03PS1) 10Muehlenhoff: liberica: Run spec tests on Bookworm/Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1240728 [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1530) [15:30:08] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:30:45] (03CR) 10CI reject: [V:04-1] 
k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [15:30:54] (03CR) 10Muehlenhoff: [C:03+2] tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - 10https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:32:55] (03CR) 10Elukey: "The previous issues seem fixed, now we have:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [15:33:17] (03PS1) 10Muehlenhoff: Remove analytics::cluster_packages spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240730 [15:34:19] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:35:08] 10ops-codfw, 06SRE, 06DC-Ops: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T417336#11632591 (10herron) I'll be available today all day Eastern TZ. If that works, ping me on IRC when ready to start on titan2001? I'll depool and shutdown the host for you, and then fyi after re... [15:35:39] (03CR) 10Ladsgroup: "We need to deploy this after the time has passed for the users to update their scripts. I think not, since we have these as ghost columns " [puppet] - 10https://gerrit.wikimedia.org/r/1239484 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [15:35:53] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:36:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240714 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:36:49] (03PS11) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [15:37:03] 06SRE, 10SRE-Access-Requests: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key - https://phabricator.wikimedia.org/T417594#11632599 (10Raymond_Ndibe) >>! In T417594#11626420, @MatthewVernon wrote: > @Raymond_Ndibe I sent you a slack message yesterday - can you either reply to that with your n... [15:38:03] (03CR) 10Dzahn: "I would slightly prefer if we can dump these custom settings into a file under /etc/apache2/conf-available instead of puppetizing the enti" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417536) (owner: 10Arnaudb) [15:38:49] (03CR) 10Elukey: [C:03+2] profile::puppetserver: simplify analytics-sre's authorized key [puppet] - 10https://gerrit.wikimedia.org/r/1240714 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:38:57] 06SRE, 10SRE-Access-Requests: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key - https://phabricator.wikimedia.org/T417594#11632606 (10MatthewVernon) Confirmed out-of-band, I'll put in a CR shortly. 
[15:39:00] (03CR) 10Dzahn: "Since conf-available should be loaded after the main file and "the last config wins" we would override the settings even if defaults also " [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417536) (owner: 10Arnaudb) [15:39:33] (03PS1) 10MVernon: admin: new FIDO keypair for raymond-ndibe [puppet] - 10https://gerrit.wikimedia.org/r/1240735 (https://phabricator.wikimedia.org/T417594) [15:42:08] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:42:36] (03CR) 10Dzahn: gerrit: add gerrit-replica service to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417897) (owner: 10Arnaudb) [15:42:38] (03CR) 10Ssingh: [C:03+1] admin: new FIDO keypair for raymond-ndibe [puppet] - 10https://gerrit.wikimedia.org/r/1240735 (https://phabricator.wikimedia.org/T417594) (owner: 10MVernon) [15:42:58] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:43:07] (03CR) 10MVernon: [C:03+2] admin: new FIDO keypair for raymond-ndibe [puppet] - 10https://gerrit.wikimedia.org/r/1240735 (https://phabricator.wikimedia.org/T417594) (owner: 10MVernon) [15:43:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:44:36] (03CR) 10Dzahn: "backend.yaml - looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417897) (owner: 10Arnaudb) [15:45:32] (03CR) 10Tiziano Fogli: "Yes, the first flow I described was a rough idea sketched quickly to implement the solution with a flat structure on the manifests reposit" [puppet] - 10https://gerrit.wikimedia.org/r/1239166 
(https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:46:03] (03CR) 10CI reject: [V:04-1] spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [15:47:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key - https://phabricator.wikimedia.org/T417594#11632621 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @Raymond_Ndibe this is done - give it half an hour for puppet to run everyw... [15:47:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:48:53] (03CR) 10Tiziano Fogli: "Sorry, I forgot to close the previous comment… this version has been tested on Pontoon and, although with only a few tests so far, it seem" [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:50:11] (03CR) 10Dzahn: [C:03+1] gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1240689 (https://phabricator.wikimedia.org/T417246) (owner: 10Arnaudb) [15:50:22] (03PS12) 10Blake: spicerack: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [15:51:41] 06SRE-OnFire, 06ServiceOps new, 10ServiceOps-Services-Oids, 06Release-Engineering-Team (Radar), 07Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162#11632635 (10jijiki) [15:55:31] 06SRE, 06serviceops, 06MediaWiki-Platform-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700#11632641 (10jijiki) @ayounsi shall we close this? 
[15:55:35] (03CR) 10Federico Ceratto: "Done" [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [15:56:09] 06SRE, 10Prod-Kubernetes, 06ServiceOps new, 06MediaWiki-Platform-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700#11632642 (10jijiki) p:05Medium→03Low [15:57:06] (03CR) 10Muehlenhoff: [C:03+2] Add two git hooks from the puppetmaster module to the pupperserver module [puppet] - 10https://gerrit.wikimedia.org/r/1240691 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:58:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:00:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:00:05] dancy and jnuche: #bothumor I � Unicode. All rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1600). [16:02:30] (03PS1) 10Hashar: gerrit::sshkey: add discovery.wmnet entry [puppet] - 10https://gerrit.wikimedia.org/r/1240738 (https://phabricator.wikimedia.org/T417497) [16:02:32] (03CR) 10Blake: "Fixed, we needed to allow a None type, because that's the default value for options." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [16:02:33] (03PS1) 10Hashar: zuul: allow varying Gerrit settings between merger and server [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) [16:02:35] (03PS1) 10Hashar: zuul: change server to SSH to gerrit.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1240740 (https://phabricator.wikimedia.org/T417497) [16:02:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:05:37] (03CR) 10Herron: "The only piece I consider blocking here is post-processing sloth generate output. That model is less modular and will be harder to change" [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [16:06:28] 06SRE, 06ServiceOps new, 10ServiceOps-Datastores: Can we replace memkeys? 
- https://phabricator.wikimedia.org/T228970#11632694 (10jijiki) 05Open→03Stalled p:05Medium→03Low [16:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240738 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [16:09:13] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [16:09:15] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240740 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [16:10:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [16:16:55] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#11632805 (10dancy) A burst of these errors happened on F... 
[16:17:25] PROBLEM - Host titan2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:03] FIRING: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:03] (03PS2) 10Bking: opensearch-semantic-search-test: configure egress to liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238306 (https://phabricator.wikimedia.org/T414095) (owner: 10DCausse) [16:23:21] FIRING: [8x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:22] (03Abandoned) 10BCornwall: haproxy: symlink /etc/acmechief to cert tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/1240395 (owner: 10BCornwall) [16:23:34] (03CR) 10Bking: [C:03+2] opensearch-semantic-search-test: configure egress to liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238306 (https://phabricator.wikimedia.org/T414095) (owner: 10DCausse) [16:25:36] (03Merged) 10jenkins-bot: opensearch-semantic-search-test: configure egress to liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238306 (https://phabricator.wikimedia.org/T414095) (owner: 10DCausse) [16:27:02] (03CR) 10Dzahn: [C:03+2] gerrit::sshkey: add discovery.wmnet entry [puppet] - 10https://gerrit.wikimedia.org/r/1240738 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [16:27:11] (03CR) 10Jforrester: [C:03+1] "Yeah, those special wikis almost certainly don't want Wikipedia-style editing config, agreed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240716 (https://phabricator.wikimedia.org/T400063) (owner: 10Esanders) [16:30:29] (03PS1) 10CDanis: gerrit tcp haproxy: rationalize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1240747 (https://phabricator.wikimedia.org/T417497) [16:30:43] RECOVERY - Host titan2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [16:31:23] 10ops-codfw, 06SRE, 06DC-Ops: titan2001: expand ssd storage - https://phabricator.wikimedia.org/T417313#11632856 (10Jhancock.wm) 05Open→03Declined not happening now. revisit with new ticket in the future. (will save some 1.92TB SSDs if recycling happens before this is revisited) [16:32:13] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1240401 (https://phabricator.wikimedia.org/T417456) (owner: 10RLazarus) [16:33:21] FIRING: [8x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:03] RESOLVED: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:21] RESOLVED: [8x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:59] (03CR) 10Hashar: gerrit tcp haproxy: rationalize timeouts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240747 (https://phabricator.wikimedia.org/T417497) (owner: 10CDanis) [16:40:19] (03PS1) 10Effie Mouzeli: 
x-wikimedia-debug-routing: add routing to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) [16:40:24] (03CR) 10CDanis: gerrit tcp haproxy: rationalize timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240747 (https://phabricator.wikimedia.org/T417497) (owner: 10CDanis) [16:40:44] (03PS2) 10JMeybohm: k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) [16:41:05] (03PS2) 10Effie Mouzeli: x-wikimedia-debug-routing: add routing to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) [16:43:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:44:41] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:45:48] (03PS1) 10Daniel Kinzler: rest-gateway: fix x-wmf-ratelimit-policy in access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) [16:47:34] (03CR) 10Hashar: gerrit tcp haproxy: rationalize timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240747 (https://phabricator.wikimedia.org/T417497) (owner: 10CDanis) [16:47:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:48:21] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 
(Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:49:00] (03CR) 10Scott French: "Thanks, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [16:52:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:53:02] (03CR) 10Hashar: "That one does not seem to change any of the Zuul configuration file, only resource parameters so I guess it is indeed a NOOP." [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [16:53:25] (03PS3) 10JMeybohm: k8s.roll-reimage-nodes: Support exclusion of target OS version [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) [16:53:25] (03PS1) 10JMeybohm: k8s.roll-reimage-nodes: Remove --puppet argument when calling reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1240755 (https://phabricator.wikimedia.org/T414417) [16:54:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:55:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2019.codfw.wmnet with OS trixie [16:55:41] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11632939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2019.codfw.wmnet with OS trixie [16:55:50] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.reimage for host backup2020.codfw.wmnet with OS trixie [16:55:59] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11632940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2020.codfw.wmnet with OS trixie [16:58:15] PROBLEM - Host titan2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:03] FIRING: [2x] ProbeDown: Service titan2002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1700). Please do the needful. [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:14] (03CR) 10Hashar: [C:04-1] "I22e3dad017988f22521ce55fee764f1a508fd004 did not provide an entry for hostname `gerrit.discovery.wmnet` :-(" [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:00:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906 (10Milimetric) 03NEW [17:01:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11632969 (10Milimetric) [17:02:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11632978 (10Milimetric) I think I'm technically an approver for this so maybe Ben can approve me. And maybe all analytics-admins should have admin in all airflow admin gr... 
[17:03:21] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:30] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11632990 (10Ladsgroup) >>! In T414805#11630302, @Krinkle wrote: > > Is this about Swift index size or Thumbor capacity? I am not sug... [17:05:53] (03PS2) 10Hashar: zuul: allow varying Gerrit settings between merger and server [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) [17:05:54] (03PS2) 10Hashar: zuul: change server to SSH to gerrit.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1240740 (https://phabricator.wikimedia.org/T417497) [17:05:54] (03PS1) 10Hashar: gerrit::sshkey: add discovery.wmnet as host alias [puppet] - 10https://gerrit.wikimedia.org/r/1240761 (https://phabricator.wikimedia.org/T417497) [17:06:01] 06SRE, 07Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11632997 (10A_smart_kitten) (for want of a better tag) [17:06:22] 06SRE, 10Gerrit, 10Wikibugs: Wikibugs should ignore `check experimental` messages for operations/puppet - https://phabricator.wikimedia.org/T417866#11633002 (10A_smart_kitten) > Wikibugs relay those messages to `#wikimedia-operations` which adds a bit of spam on an already busy enough IRC channel. (cross-li... 
[17:06:23] (03CR) 10Hashar: "Done with I1ffe2f66cae2fe85dc311dec4888d9b1d0d01bde (hopefully)" [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:06:44] RECOVERY - Host titan2002 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [17:07:01] (03CR) 10CI reject: [V:04-1] gerrit::sshkey: add discovery.wmnet as host alias [puppet] - 10https://gerrit.wikimedia.org/r/1240761 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:08:21] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:08:22] (03CR) 10CI reject: [V:04-1] zuul: allow varying Gerrit settings between merger and server [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:10:03] RESOLVED: [2x] ProbeDown: Service titan2002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2019.codfw.wmnet with reason: host reimage [17:10:52] (03CR) 10Ejegg: "Oh shoot, I just spotted an error in the formula for first_campaign. 
Fix forthcoming" [puppet] - 10https://gerrit.wikimedia.org/r/1240298 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [17:11:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2020.codfw.wmnet with reason: host reimage [17:14:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2019.codfw.wmnet with reason: host reimage [17:14:39] (03PS1) 10Ejegg: Fix for new banner activity dimension in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240762 (https://phabricator.wikimedia.org/T414478) [17:15:26] (03CR) 10Ejegg: "Thanks for the review, @Brouberol! I made a quick fix to the first_campaign dimension here: I588b348f9134da675bdffc73fed6f8873d13e155" [puppet] - 10https://gerrit.wikimedia.org/r/1240298 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [17:18:33] (03CR) 10Effie Mouzeli: x-wikimedia-debug-routing: add routing to mw-parsoid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [17:18:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2020.codfw.wmnet with reason: host reimage [17:25:45] 10ops-codfw, 06SRE, 06DC-Ops: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T417336#11633109 (10herron) 05Open→03Resolved Thanks @Jhancock.wm! Ram upgrades on titan200[12] look good! 
[17:30:20] !log bking@wmf restart bg on wdqs2022.codfw.wmnet,wdqs2014.codfw.wmnet,wdqs2007.codfw.wmnet to clear ProbeDown alerts [17:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:49] (03PS2) 10Dzahn: gerrit::sshkey: add discovery.wmnet as host alias [puppet] - 10https://gerrit.wikimedia.org/r/1240761 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:35:37] (03PS3) 10Hashar: gerrit::sshkey: add discovery.wmnet as host alias [puppet] - 10https://gerrit.wikimedia.org/r/1240761 (https://phabricator.wikimedia.org/T417497) [17:36:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [17:36:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:37:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:37:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2019.codfw.wmnet with OS trixie [17:37:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11633179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2019.codfw.wmnet with OS trixie completed: - backup2019 (**WA... 
[17:39:13] (03CR) 10Dzahn: [C:03+2] gerrit::sshkey: add discovery.wmnet as host alias [puppet] - 10https://gerrit.wikimedia.org/r/1240761 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:41:29] (03CR) 10Dzahn: [C:03+2] gerrit tcp haproxy: rationalize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1240747 (https://phabricator.wikimedia.org/T417497) (owner: 10CDanis) [17:42:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:42:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:42:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2020.codfw.wmnet with OS trixie [17:42:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11633208 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2020.codfw.wmnet with OS trixie completed: - backup2020 (**WA... 
[17:43:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11633222 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:46:59] (03CR) 10Scott French: [C:03+1] x-wikimedia-debug-routing: add routing to mw-parsoid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240750 (https://phabricator.wikimedia.org/T386246) (owner: 10Effie Mouzeli) [17:51:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11633244 (10Jhancock.wm) @jcrespo these are ready for you [17:52:02] (03PS3) 10Hashar: zuul: allow varying Gerrit settings between merger and server [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) [17:52:03] (03CR) 10RLazarus: [C:03+1] k8s.roll-reimage-nodes: Remove --puppet argument when calling reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1240755 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [17:53:15] (03PS3) 10Hashar: zuul: change server to SSH to gerrit.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1240740 (https://phabricator.wikimedia.org/T417497) [17:56:43] (03CR) 10Dzahn: [C:03+2] zuul: allow varying Gerrit settings between merger and server [puppet] - 10https://gerrit.wikimedia.org/r/1240739 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [17:59:56] (03CR) 10Dzahn: [C:03+2] zuul: change server to SSH to gerrit.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1240740 (https://phabricator.wikimedia.org/T417497) (owner: 10Hashar) [18:00:05] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1800) [18:01:13] The groundhog sees no shadow. Nothing to deploy in my window today. [18:01:49] great, mutante and I are going to hard restart Zuul [18:01:59] which will flush the queue of changes [18:03:05] !log Hard restarting Zuul and flushing all changes currently in the queue [18:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:13] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11633339 (10Aklapper) [18:05:52] (03CR) 10RLazarus: [C:03+1] "Thanks for this!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240725 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [18:06:51] (03CR) 10Dzahn: [C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [18:09:49] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:11:39] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:16:40] (03CR) 10Harroyo-wmf: [C:03+1] IPReputation: Lower IPoid request and connect timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240769 (https://phabricator.wikimedia.org/T417910) (owner: 10Kosta Harlan) [18:18:15] 06SRE, 10Gerrit, 10Wikibugs: Wikibugs should ignore `check experimental` messages for operations/puppet - https://phabricator.wikimedia.org/T417866#11633432 (10bd808) p:05Triage→03Medium `wikibugs2.gerrit.should_ignore_ci_comment` would be the right place to add filtering logic for this. Here is an exa... [18:19:12] we think the CI problems should be gone now - after changes to both zuul config [18:19:38] (and tcp-proxy timeouts; but zuul is also just not going through the proxy anymore now) [18:21:07] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11633451 (10Aklapper) T417913 might be a potential side effect which I closed too quickly...? 
[18:26:01] (03PS1) 10Aude: Update Qids according to communication with communities (v20260219) [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240773 (https://phabricator.wikimedia.org/T417902) [18:30:13] (03CR) 10RLazarus: [C:03+2] deployment_server: Really read namespaces in charlie --dangerously_fast [puppet] - 10https://gerrit.wikimedia.org/r/1240401 (https://phabricator.wikimedia.org/T417456) (owner: 10RLazarus) [18:39:36] (03PS2) 10Jdlrobson: Enable parser survey for opted out users on all parsoid rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) [18:43:45] (03PS1) 10Dzahn: phabricator: disable dump job [puppet] - 10https://gerrit.wikimedia.org/r/1240778 [18:44:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1240631 (owner: 10Majavah) [18:45:13] (03Abandoned) 10Muehlenhoff: Stop running the IP reputation dump on the Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [18:48:26] (03CR) 10Majavah: [C:03+2] mailmap: Merge more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/1240631 (owner: 10Majavah) [18:49:29] (03PS1) 10Arlolra: Fix finding joiner in the face of pwrapping [extensions/ProofreadPage] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240779 (https://phabricator.wikimedia.org/T411935) [18:50:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ProofreadPage] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240779 (https://phabricator.wikimedia.org/T411935) (owner: 10Arlolra) [18:50:13] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11633671 
(10Ladsgroup) Actually that'll be fixed in a couple of hours 😅 [18:53:01] (03PS1) 10RLazarus: deployment_server: Add to charlie's ignored services and environments [puppet] - 10https://gerrit.wikimedia.org/r/1240780 (https://phabricator.wikimedia.org/T417456) [18:53:26] 06SRE, 10Prod-Kubernetes, 06ServiceOps new, 06MediaWiki-Platform-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700#11633689 (10ayounsi) 05Open→03Declined sure. [18:58:24] (03CR) 10RLazarus: "One last thing for this stack!" [puppet] - 10https://gerrit.wikimedia.org/r/1240780 (https://phabricator.wikimedia.org/T417456) (owner: 10RLazarus) [19:00:05] dancy and jnuche: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T1900). [19:02:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [19:02:28] o/ [19:02:38] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240781 (https://phabricator.wikimedia.org/T413807) [19:02:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240781 (https://phabricator.wikimedia.org/T413807) (owner: 10TrainBranchBot) [19:03:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:04:03] (03PS3) 10C. 
Scott Ananian: Enable parser survey for opted out users on all parsoid rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [19:04:03] (03PS1) 10C. Scott Ananian: Enable parser survey for opted out users on some English-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240782 (https://phabricator.wikimedia.org/T414852) [19:04:03] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240781 (https://phabricator.wikimedia.org/T413807) (owner: 10TrainBranchBot) [19:04:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240782 (https://phabricator.wikimedia.org/T414852) (owner: 10C. Scott Ananian) [19:05:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240782 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [19:10:28] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.16 refs T413807 [19:10:32] T413807: 1.46.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T413807 [19:11:05] (03CR) 10ArielGlenn: rest-gateway: fix x-wmf-ratelimit-policy in access log (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:11:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240783 [19:11:54] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: fix x-wmf-ratelimit-policy in access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:18:21] RESOLVED: [8x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:18:22] (03PS1) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 [19:18:25] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:18:52] (03CR) 10Kamila Součková: [C:04-1] "I'm confused by the CI diff:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:19:26] (03CR) 10Kamila Součková: [C:04-1] "(marking as not resolved :D)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:20:51] (03CR) 10Scott 
French: [C:03+1] deployment_server: Add to charlie's ignored services and environments [puppet] - 10https://gerrit.wikimedia.org/r/1240780 (https://phabricator.wikimedia.org/T417456) (owner: 10RLazarus) [19:22:10] (03CR) 10Kamila Součková: [C:03+1] "Eeep, my bad, it's expected for the fixtures that don't actually have the policies. Maybe my brain needs food '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:22:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-fe2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:23:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-fe2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:23:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:24:29] (03CR) 10Kamila Součková: [C:03+1] "...and marking it as resolved -_- I'm going to go away from the keyboard now '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:26:28] (03CR) 10ArielGlenn: "In looking at the logs after a local "make check", I notice the log entries have a trailing space after the list of policies: "x-wmf-ratel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [19:31:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:33:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2004.codfw.wmnet with OS bookworm [19:33:36] 10ops-codfw, 
06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11633872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host apus-fe2004.codfw.wmnet with OS bookworm [19:33:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2005.codfw.wmnet with OS bookworm [19:34:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11633873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host apus-fe2005.codfw.wmnet with OS bookworm [19:35:22] (03CR) 10RLazarus: [C:03+2] deployment_server: Add to charlie's ignored services and environments [puppet] - 10https://gerrit.wikimedia.org/r/1240780 (https://phabricator.wikimedia.org/T417456) (owner: 10RLazarus) [19:45:01] (03PS2) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 [19:47:41] (03CR) 10JHathaway: [C:03+1] Move the puppetmaster puppetdb client class under puppet_compiler [puppet] - 10https://gerrit.wikimedia.org/r/1240278 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [19:57:01] jhancock@cumin2002 reimage (PID 3442452) is awaiting input [20:01:01] (03PS1) 10CDanis: puppet: temp depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1240791 [20:02:08] 07Puppet, 06SRE: puppetserver1002 /srv/git/operations/private out of sync - https://phabricator.wikimedia.org/T417934 (10herron) 03NEW [20:02:10] 07Puppet, 06SRE: puppetserver1002 /srv/git/operations/private out of sync - https://phabricator.wikimedia.org/T417934#11634099 (10herron) [20:02:31] (03PS2) 10CDanis: puppet: temp depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1240791 (https://phabricator.wikimedia.org/T417934) [20:02:38] (03CR) 10Herron: [C:03+1] puppet: temp depool puppetserver1002 [dns] - 
10https://gerrit.wikimedia.org/r/1240791 (https://phabricator.wikimedia.org/T417934) (owner: 10CDanis) [20:03:22] (03CR) 10CDanis: [C:03+2] puppet: temp depool puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1240791 (https://phabricator.wikimedia.org/T417934) (owner: 10CDanis) [20:03:49] !log cdanis@dns1004 START - running authdns-update [20:04:35] (03PS1) 10JHathaway: dmarc: set policy to reject [dns] - 10https://gerrit.wikimedia.org/r/1240792 (https://phabricator.wikimedia.org/T404884) [20:05:09] !log cdanis@dns1004 END - running authdns-update [20:06:36] 07Puppet, 06SRE, 13Patch-For-Review: puppetserver1002 /srv/git/operations/private out of sync - https://phabricator.wikimedia.org/T417934#11634141 (10herron) [20:07:33] (03CR) 10Eevans: [C:03+1] cassandra: Run spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240717 (owner: 10Muehlenhoff) [20:12:15] 07Puppet, 06SRE, 13Patch-For-Review: puppetserver1002 /srv/git/operations/private out of sync - https://phabricator.wikimedia.org/T417934#11634183 (10CDanis) ` 💙root@puppetserver1002.eqiad.wmnet /srv/git/operations/private 🕒🙃 git reset --hard origin/master HEAD is now at ce722766 (herron) dummy commit to res... 
[20:12:29] (03PS1) 10CDanis: Revert "puppet: temp depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1240795 [20:12:47] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [20:15:20] (03PS2) 10CDanis: Revert "puppet: temp depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1240795 (https://phabricator.wikimedia.org/T417934) [20:17:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:22:23] (03CR) 10Herron: [C:03+1] Revert "puppet: temp depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1240795 (https://phabricator.wikimedia.org/T417934) (owner: 10CDanis) [20:22:38] (03CR) 10CDanis: [C:03+2] Revert "puppet: temp depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1240795 (https://phabricator.wikimedia.org/T417934) (owner: 10CDanis) [20:22:49] !log cdanis@dns1004 START - running authdns-update [20:22:53] (03CR) 10JHathaway: [C:03+1] Revert "puppet: temp depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1240795 (https://phabricator.wikimedia.org/T417934) (owner: 10CDanis) [20:23:40] jhancock@cumin2002 reimage (PID 3442329) is awaiting input [20:24:08] !log cdanis@dns1004 END - running authdns-update [20:24:28] 07Puppet, 06SRE, 13Patch-For-Review: puppetserver1002 /srv/git/operations/private out of sync - https://phabricator.wikimedia.org/T417934#11634250 (10CDanis) 05Open→03Resolved a:03CDanis [20:33:42] (03PS1) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 
(https://phabricator.wikimedia.org/T117618) [20:34:16] (03CR) 10CI reject: [V:04-1] varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:34:29] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941 (10Jgreen) 03NEW [20:36:18] (03PS2) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [20:58:21] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T2100). [21:00:04] toyofuku, arlolra, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:22] o/ [21:00:30] (03PS8) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [21:00:39] (03PS1) 10Kamila Součková: admin: update ssh keys for kamila [puppet] - 10https://gerrit.wikimedia.org/r/1240804 [21:01:45] i'm here for the easter eggs patch [21:02:19] o/ [21:02:59] hi, are you all in need of a deployer? [21:03:02] aude: i think you're first on the list. [21:03:16] (03PS2) 10Kamila Součková: admin: update ssh keys for kamila [puppet] - 10https://gerrit.wikimedia.org/r/1240804 (https://phabricator.wikimedia.org/T411404) [21:03:23] jeena: arlolra and i can spiderpig i think. [21:03:29] yup [21:03:34] i don't know if aude would like a deployer or not [21:03:35] thanks [21:03:39] ok could do mine but not sure about combining stuff [21:03:58] think batching the patches makes it faster [21:04:14] 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11634341 (10Raine) [21:04:32] the config changes are fast by themselves. the wmf.16 patches would probably be faster combined [21:04:47] which is the easter eggs patch? I don't see it on the deployment calendar [21:05:00] arlolra: do you want to batch your 1240779 patch with aude's 1240773 patch? [21:05:08] ok, will do [21:05:09] oh the config one [21:05:10] jeena: the easter eggs are the qids, it's the qids for the various easter eggs [21:05:13] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WP25EasterEggs/+/1240773 [21:05:17] 👍 [21:05:26] cscott: then you can do the config patches [21:05:26] it is updating the json and i checked it is valid [21:05:48] arlolra: sure. [21:06:05] jeena: i think arlolra and i can drive the spiderpig then. we'll yell if something breaks! [21:06:10] thanks! 
[21:06:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240773 (https://phabricator.wikimedia.org/T417902) (owner: 10Aude) [21:06:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240779 (https://phabricator.wikimedia.org/T411935) (owner: 10Arlolra) [21:06:27] thanks [21:06:55] 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11634354 (10Raine) New keys in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240804 (took me this long because I did not realise that the FIDO key was expected to not work with gerrit and I... [21:07:16] (03CR) 10CI reject: [V:04-1] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [21:08:20] (03Merged) 10jenkins-bot: Update Qids according to communication with communities (v20260219) [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240773 (https://phabricator.wikimedia.org/T417902) (owner: 10Aude) [21:08:22] (03Merged) 10jenkins-bot: Fix finding joiner in the face of pwrapping [extensions/ProofreadPage] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240779 (https://phabricator.wikimedia.org/T411935) (owner: 10Arlolra) [21:08:45] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1240773|Update Qids according to communication with communities (v20260219) (T417902)]], [[gerrit:1240779|Fix finding joiner in the face of pwrapping (T411935)]] [21:08:51] T417902: Update of Qids according to communication with communities - https://phabricator.wikimedia.org/T417902 [21:08:52] T411935: Parsoid support does not apply $wgProofreadPagePageJoiner 
logic - https://phabricator.wikimedia.org/T411935 [21:10:39] !log arlolra@deploy2002 arlolra, aude: Backport for [[gerrit:1240773|Update Qids according to communication with communities (v20260219) (T417902)]], [[gerrit:1240779|Fix finding joiner in the face of pwrapping (T411935)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:49] checking [21:11:39] looks good to me [21:11:53] me too [21:12:02] !log arlolra@deploy2002 arlolra, aude: Continuing with sync [21:12:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe2004.codfw.wmnet with OS bookworm [21:12:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11634398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-fe2004.codfw.wmnet with OS bookworm executed with e... [21:12:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe2005.codfw.wmnet with OS bookworm [21:12:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11634399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-fe2005.codfw.wmnet with OS bookworm executed with e... [21:15:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11634406 (10Jhancock.wm) @MatthewVernon i was trying to install these and I think there's an issue with the preseed. not sure what exactly. it looks like it's tryin... 
[21:16:01] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240773|Update Qids according to communication with communities (v20260219) (T417902)]], [[gerrit:1240779|Fix finding joiner in the face of pwrapping (T411935)]] (duration: 07m 16s) [21:16:07] T417902: Update of Qids according to communication with communities - https://phabricator.wikimedia.org/T417902 [21:16:08] T411935: Parsoid support does not apply $wgProofreadPagePageJoiner logic - https://phabricator.wikimedia.org/T411935 [21:16:24] cscott: all you [21:17:11] thanks again! [21:17:18] np [21:19:59] Ok fun! [21:22:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240782 (https://phabricator.wikimedia.org/T414852) (owner: 10C. Scott Ananian) [21:22:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239270 (https://phabricator.wikimedia.org/T417349) (owner: 10Arlolra) [21:24:12] (03Merged) 10jenkins-bot: Enable parser survey for opted out users on some English-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240782 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [21:24:16] (03Merged) 10jenkins-bot: Deploy PRV to 19 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239270 (https://phabricator.wikimedia.org/T417349) (owner: 10Arlolra) [21:24:32] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1240782|Enable parser survey for opted out users on some English-language wikis (T414852)]], [[gerrit:1239270|Deploy PRV to 19 wikis (T417349)]] [21:24:38] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [21:24:38] T417349: Parsoid Read Views to deploy ~2026-02-16 - https://phabricator.wikimedia.org/T417349 [21:26:29] !log cscott@deploy2002 cscott, arlolra: Backport for [[gerrit:1240782|Enable parser survey for opted out users on some English-language wikis (T414852)]], [[gerrit:1239270|Deploy PRV to 19 wikis (T417349)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:28:05] cscott: I checked that avwiki is now parsoidrendered on the debug server so good to go [21:28:37] i checked yuewiki and ltwiki and that enwiki is still *not* parsoid rendered :) [21:28:42] trying to test the quick survey now [21:30:24] it seems to load but it spins forever. i can't tell if that's because it's not passing along x-wikipedia-debug or what. [21:30:39] oh, now it loaded. 
[21:30:59] ok, i'll let it roll forward [21:31:05] !log cscott@deploy2002 cscott, arlolra: Continuing with sync [21:32:00] (03CR) 10JHathaway: [C:03+2] dmarc: set policy to reject [dns] - 10https://gerrit.wikimedia.org/r/1240792 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [21:32:32] !log jhathaway@dns1004 START - running authdns-update [21:33:19] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11634533 (10Dzahn) There is indeed a postfix alias dmarc-rua@wikimedia.org in the private puppet repo. But it forwards to 5 different add... [21:33:54] !log jhathaway@dns1004 END - running authdns-update [21:34:41] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:35:01] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240782|Enable parser survey for opted out users on some English-language wikis (T414852)]], [[gerrit:1239270|Deploy PRV to 19 wikis (T417349)]] (duration: 10m 28s) [21:35:06] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [21:35:06] T417349: Parsoid Read Views to deploy ~2026-02-16 - https://phabricator.wikimedia.org/T417349 [21:37:09] jeena: ok, all done! [21:37:27] thanks cscott ! 
[21:51:11] (03PS3) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260219T2200) [22:02:53] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948 (10RobH) 03NEW [22:03:16] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11634610 (10RobH) a:03BTullis Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the... [22:03:32] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11634618 (10RobH) [22:05:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [22:06:07] (03CR) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [22:12:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:39] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [22:14:40] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop test 
cluster [22:16:46] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [22:18:24] (03PS1) 10Eric Gardner: Minerva TOC: Fix TOC instrumentation selectors [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240814 (https://phabricator.wikimedia.org/T415611) [22:19:27] (03CR) 10Bking: [C:03+1] feat(WDQS)!: disable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [22:21:25] Hey all, just a heads up that I'm planning on backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1240814 onto WMF 16 shortly (in the Web Team late backport window) [22:22:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by egardner@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240814 (https://phabricator.wikimedia.org/T415611) (owner: 10Eric Gardner) [22:24:05] !log T415696 Will be merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237142 shortly, which will permanently decom the LDF endpoint for wdqs services [22:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:09] T415696: Decommission WDQS Linked Data Fragment (LDF) endpoint - https://phabricator.wikimedia.org/T415696 [22:24:13] (03Merged) 10jenkins-bot: Minerva TOC: Fix TOC instrumentation selectors [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1240814 (https://phabricator.wikimedia.org/T415611) (owner: 10Eric Gardner) [22:24:31] !log egardner@deploy2002 Started scap sync-world: Backport for [[gerrit:1240814|Minerva TOC: Fix TOC instrumentation selectors (T415611)]] [22:24:35] T415611: Set up measurement plan and instrumentation spec for mobile TOC - https://phabricator.wikimedia.org/T415611 [22:25:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T415786)', 
diff saved to https://phabricator.wikimedia.org/P88901 and previous config saved to /var/cache/conftool/dbconfig/20260219-222512-marostegui.json [22:25:17] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [22:25:31] (03CR) 10Ryan Kemper: [C:03+2] feat(WDQS)!: disable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [22:26:26] !log egardner@deploy2002 egardner: Backport for [[gerrit:1240814|Minerva TOC: Fix TOC instrumentation selectors (T415611)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:27:18] (03PS4) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [22:27:58] !log egardner@deploy2002 egardner: Continuing with sync [22:31:37] (03PS1) 10SBassett: Version bump security-landing-page values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) [22:32:03] !log egardner@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240814|Minerva TOC: Fix TOC instrumentation selectors (T415611)]] (duration: 07m 31s) [22:32:07] T415611: Set up measurement plan and instrumentation spec for mobile TOC - https://phabricator.wikimedia.org/T415611 [22:32:38] (03CR) 10SBassett: "See also: https://docker-registry.wikimedia.org/repos/sre/miscweb/security-landing-page/tags/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) (owner: 10SBassett) [22:34:26] (03PS2) 10SBassett: Version bump security-landing-page values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 (https://phabricator.wikimedia.org/T415379) [22:35:11] (03PS3) 10SBassett: Version bump security-landing-page values file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240819 
(https://phabricator.wikimedia.org/T415379) [22:37:06] (03PS5) 10CDobbins: varnish: clean up Content-Security-Policy header [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) [22:37:06] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11634713 (10Blake) @elukey and @Volans, do you happen to have thoughts about the best way to go about checking... [22:37:41] (03CR) 10ArielGlenn: [C:03+1] "This definitely improved things for me, 0 failures in 30 runs instead of several in half that." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239669 (owner: 10Daniel Kinzler) [22:40:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [22:40:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P88902 and previous config saved to /var/cache/conftool/dbconfig/20260219-224020-marostegui.json [22:53:21] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:53:58] 06SRE, 10MediaWiki-extensions-OAuth, 06MediaWiki-Platform-Team, 05MW-1.46-notes (1.46.0-wmf.16; 2026-02-17): Editing using OAuth 2 doesn’t work - https://phabricator.wikimedia.org/T417839#11634726 (10LucasWerkmeister) >>! In T417839#11631101, @matmarex wrote: > @LucasWerkmeister @Gerges @Rtconner I'd a... 
[22:55:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T415786)', diff saved to https://phabricator.wikimedia.org/P88903 and previous config saved to /var/cache/conftool/dbconfig/20260219-225512-marostegui.json [22:55:16] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [22:55:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P88904 and previous config saved to /var/cache/conftool/dbconfig/20260219-225529-marostegui.json [23:00:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [23:10:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P88906 and previous config saved to /var/cache/conftool/dbconfig/20260219-231020-marostegui.json [23:10:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T415786)', diff saved to https://phabricator.wikimedia.org/P88907 and previous config saved to /var/cache/conftool/dbconfig/20260219-231037-marostegui.json [23:10:41] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:10:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [23:11:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T415786)', diff saved to https://phabricator.wikimedia.org/P88908 and previous config saved to /var/cache/conftool/dbconfig/20260219-231101-marostegui.json [23:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:25:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P88909 and previous config saved to /var/cache/conftool/dbconfig/20260219-232528-marostegui.json [23:40:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T415786)', diff saved to https://phabricator.wikimedia.org/P88910 and previous config saved to /var/cache/conftool/dbconfig/20260219-234036-marostegui.json [23:40:41] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:40:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [23:41:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T415786)', diff saved to https://phabricator.wikimedia.org/P88911 and previous config saved to /var/cache/conftool/dbconfig/20260219-234101-marostegui.json [23:56:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins