[00:06:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:06:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1192285 [00:08:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1192285 (owner: 10TrainBranchBot) [00:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:55] (03CR) 10BCornwall: [C:03+1] beta: Remove redundant enable_m_redir_except_regex setting [puppet] - 10https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [00:16:13] (03CR) 10BCornwall: [C:03+1] varnish: Enable unified mobile routing on Wikisource [puppet] - 10https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [00:16:23] (03CR) 10BCornwall: [C:03+1] varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [00:17:36] (03PS11) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [00:17:50] (03PS10) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 
(https://phabricator.wikimedia.org/T405490) [00:18:00] (03PS10) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [00:26:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:26:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:30:12] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1192285 (owner: 10TrainBranchBot) [00:35:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:35:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:54:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:58:18] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [00:58:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11227561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with O... 
[00:58:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [00:59:47] PROBLEM - Druid overlord on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:06:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:06:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:06:47] RECOVERY - Druid overlord on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:07:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.21 [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192289 (https://phabricator.wikimedia.org/T405677) [01:07:48] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.21 [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192289 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [01:26:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11227581 (10Papaul) @Jhancock.wm see below why the server is failing. You have 2 options change the role int site.pp to insetup role to finish t... 
[01:26:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:28:21] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.21 [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192289 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [01:28:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:29:26] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192297 [01:35:23] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192298 [01:35:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192299 [01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:36:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:36:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:45:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11227638 (10Jhancock.wm) a:05Papaul→03bking can 
you help me out with what papaul pointed out when you get in? thanks! [01:46:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:52:20] (03PS1) 10Jforrester: Wikifunctions clients: Enable rich text (HTML) output in embedded calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192303 (https://phabricator.wikimedia.org/T397402) [01:52:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192303 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [01:56:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:56:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [01:57:50] 07Puppet, 10MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11227652 (10Jdforrester-WMF) Is this blocking T214998, as currently claimed by the relationship, or is this block... 
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0200) [02:05:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:06:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [02:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:26:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:26:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:35:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:35:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server 
middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:55:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0300) [03:05:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:05:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:20:45] (03PS2) 10Snwachukwu: Replace old ingestion wiki list file with new autoupdated file [puppet] - 10https://gerrit.wikimedia.org/r/1191750 [03:21:56] (03CR) 10Snwachukwu: Replace old ingestion wiki list file with new autoupdated file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu) [03:24:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:25:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:35:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:35:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:56:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:56:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0400) [04:03:53] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.18 (duration: 03m 50s) [04:05:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:05:45] RECOVERY - Druid 
middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:06:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:17:36] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:22:36] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:27:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:27:45] PROBLEM - Druid historical on druid1008 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:28:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:33:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:36:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:36:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:56:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:57:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [05:05:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [05:05:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, 
args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:44] !log stevemunene@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on druid[1007-1008].eqiad.wmnet with reason: Decommissioning druid_public hosts [05:44:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0600). 
[06:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [06:17:47] (03PS1) 10EggRoll97: Add abusefilter-modify-restricted to enwiki EFM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) [06:18:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:20:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) (owner: 10EggRoll97) [06:23:44] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:23:46] (03PS1) 10Kosta Harlan: CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192327 (https://phabricator.wikimedia.org/T405342) [06:24:10] jouncebot: nowandnext [06:24:10] For the next 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0600) [06:24:10] For the next 0 hour(s) and 5 minute(s): Primary database switchover 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0600) [06:24:10] In 0 hour(s) and 35 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0700) [06:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:25:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [06:26:46] (03Merged) 10jenkins-bot: Hooks: Enable overriding the hook instance per action [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [06:27:20] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]] [06:27:30] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [06:27:31] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [06:33:29] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[06:33:38] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [06:33:39] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [06:37:22] !log kharlan@deploy2002 kharlan: Continuing with sync [06:42:29] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]] (duration: 15m 09s) [06:42:38] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [06:42:38] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [06:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:51:05] (03PS2) 10Kosta Harlan: CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192327 (https://phabricator.wikimedia.org/T405342) [06:53:15] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:58:16] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:01] (03CR) 10Joal: [C:03+1] "LGTM! Needs to be synchronized with other patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu) [07:00:04] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0700) [07:00:04] kostajh and EggRoll97: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:19] I’m not around for this window [07:03:25] im around [07:08:51] (03CR) 10Brouberol: [C:03+1] Add 28 new hadoop workers to the analytics_hadoop cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [07:09:41] (03CR) 10Brouberol: [C:03+1] Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [07:09:51] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:10:44] (03CR) 10Brouberol: Customise the imported spark-operator chart for deployment to WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [07:11:44] (03CR) 10Brouberol: [C:03+1] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 
(https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [07:13:10] (03CR) 10Brouberol: [C:03+1] Customise the login.html template of JupyterHub to hide the TLS warning [puppet] - 10https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [07:14:10] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:14:39] (03PS14) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [07:14:48] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [07:21:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:21:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[07:27:52] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192224 (owner: 10CDanis) [07:28:30] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [07:29:36] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [07:35:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [07:36:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [07:43:34] (03CR) 10Majavah: [C:03+2] P:openstack: nova: Drop obsolete settings [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah) [07:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:49:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) (owner: 10EggRoll97) [07:52:35] (03PS5) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [07:55:48] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:00:04] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 
10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [08:00:04] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T0800) [08:09:52] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:34] hashar: can I deploy a config patch? [08:11:03] kostajh: yes go ahead, I haven't started the train yet :) [08:11:11] thanks [08:12:48] I’ll start in a few minutes, rebooting my laptop [08:14:14] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:14:33] (03PS6) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [08:14:54] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:15:11] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:16:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192327 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [08:17:27] (03Merged) 10jenkins-bot: CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192327 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [08:17:52] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192327|CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis (T405342)]] [08:18:00] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 
[08:20:24] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [08:24:12] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192327|CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis (T405342)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:24:19] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [08:26:11] !log kharlan@deploy2002 kharlan: Continuing with sync [08:31:17] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192327|CheckUser/UserInfoCard: Phase 3 enable by default on pilot wikis (T405342)]] (duration: 13m 25s) [08:31:35] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [08:33:42] (03PS7) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [08:34:41] hashar: all done, thanks! 
[08:34:53] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:37:09] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:37:22] (03PS4) 10MVernon: swift: re-add 2 nodes, drain the final 2, leave 1 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1190674 (https://phabricator.wikimedia.org/T404356) [08:38:21] (03PS12) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [08:38:21] (03PS11) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [08:38:21] (03PS11) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [08:38:30] (03CR) 10CI reject: [V:04-1] Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:38:33] (03CR) 10CI reject: [V:04-1] Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:38:38] (03CR) 10CI reject: [V:04-1] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:38:39] (03PS8) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [08:39:30] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['cp2048.codfw.wmnet'] [08:40:05] (03PS13) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [08:40:14] (03PS12) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [08:40:20] (03PS12) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [08:40:48] (03CR) 10Btullis: [V:03+1 C:03+2] Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [08:40:57] (03PS13) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [08:42:12] (03PS14) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [08:43:32] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:43:53] (03CR) 10Jcrespo: [C:03+1] swift: re-add 2 nodes, drain the final 2, leave 1 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1190674 (https://phabricator.wikimedia.org/T404356) (owner: 10MVernon) [08:46:09] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [08:48:18] (03CR) 10Btullis: [V:03+1 C:03+2] Add 28 new hadoop workers to the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [08:48:46] (03PS4) 10D3r1ck01: session: Enable 
MultiBackendSessionStore on `group1` wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187779 (https://phabricator.wikimedia.org/T402808) [08:49:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187779 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [08:52:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1086.eqiad.wmnet with OS bullseye [08:52:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11228077 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1086.eqiad.wmnet... [08:53:51] I am starting the train routine, I am running late since I had to catch up with a lot of things [08:54:37] it looks like the wmf.21 failed overnight due to some patch [08:57:13] (03CR) 10Elukey: statistics: Delete old model-upload script. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [08:59:27] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:59:52] (03PS9) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [08:59:52] (03PS4) 10Bartosz Wójtowicz: statistics: Delete old model-upload script. [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) [09:00:28] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [09:00:41] (03CR) 10Bartosz Wójtowicz: statistics: Delete old model-upload script.
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [09:02:24] (03PS10) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [09:02:36] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [09:04:02] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [09:04:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [09:06:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:06:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:08:22] (03PS1) 10Jcrespo: Revert "dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150" [puppet] - 10https://gerrit.wikimedia.org/r/1192501 [09:09:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [09:09:23] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [09:09:48] (03PS2) 10Jcrespo: Revert "dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150" [puppet] - 10https://gerrit.wikimedia.org/r/1192501 [09:12:33] (03CR) 10Btullis: Customise the imported spark-operator chart for deployment to WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:16:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 2.598 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:16:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.706 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:19:04] (03PS1) 10Jelto: gitlab: fix s3 bucket sync re-download [puppet] - 10https://gerrit.wikimedia.org/r/1192506 (https://phabricator.wikimedia.org/T378922) [09:25:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1086.eqiad.wmnet with OS bullseye [09:25:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11228264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1086.eqiad.wmnet wit... [09:25:41] (03PS1) 10Jgiannelos: rest-gateway: Fix typo in URL rewrite causing breakage on PDF downloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192507 (https://phabricator.wikimedia.org/T405957) [09:30:25] (03CR) 10Clément Goubert: [C:03+1] "LGTM, sorry for the oversight" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192507 (https://phabricator.wikimedia.org/T405957) (owner: 10Jgiannelos) [09:31:43] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:31:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:33:35] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix typo in URL rewrite causing breakage on PDF downloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192507 (https://phabricator.wikimedia.org/T405957) (owner: 10Jgiannelos) [09:33:50] (03CR) 10Btullis: [C:03+2] Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:33:56] (03CR) 10Btullis: [C:03+2] Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:34:00] (03CR) 10Btullis: [C:03+2] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:35:46] (03Merged) 10jenkins-bot: rest-gateway: Fix typo in URL rewrite causing breakage on PDF downloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192507 (https://phabricator.wikimedia.org/T405957) (owner: 10Jgiannelos) [09:36:07] (03PS1) 10Joal: Update data-engineering gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1192509 (https://phabricator.wikimedia.org/T406009) [09:36:08] (03Merged) 10jenkins-bot: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:36:27] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:36:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:37:56] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:38:49] (03PS1) 10Arnaudb: gerrit: disable mod_qos to debug allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1192510 (https://phabricator.wikimedia.org/T406005) [09:38:50] (03CR) 10Arnaudb: [C:03+1] "at least deploy2002 is missing from the allowlist, disabling temporarily mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192510 (https://phabricator.wikimedia.org/T406005) (owner: 10Arnaudb) [09:38:55] (03CR) 10Btullis: [C:03+2] Update data-engineering gobblin alert [alerts] - 
10https://gerrit.wikimedia.org/r/1192509 (https://phabricator.wikimedia.org/T406009) (owner: 10Joal) [09:40:20] (03CR) 10Arnaudb: [C:03+2] gerrit: disable mod_qos to debug allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1192510 (https://phabricator.wikimedia.org/T406005) (owner: 10Arnaudb) [09:40:29] (03Merged) 10jenkins-bot: Update data-engineering gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1192509 (https://phabricator.wikimedia.org/T406009) (owner: 10Joal) [09:41:24] (03Merged) 10jenkins-bot: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:41:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:41:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:42:03] (03PS1) 10Btullis: Add an-launcher1003 to the list of permitted rsync and nfs clients [puppet] - 10https://gerrit.wikimedia.org/r/1192511 (https://phabricator.wikimedia.org/T402943) [09:42:11] (03Merged) 10jenkins-bot: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:44:04] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:44:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1192511 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [09:44:52] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:44:53] 
!log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:44:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:45:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:45:47] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:46:10] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:18] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192512 (https://phabricator.wikimedia.org/T405677) [09:48:21] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192512 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [09:49:28] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192512 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [09:50:00] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.21 refs T405677 [09:50:07] T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677 [09:50:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:55:19] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7140/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192506 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [09:56:53] (03PS1) 10Tiziano Fogli: mirrormaker: fix prometheus expr [puppet] - 10https://gerrit.wikimedia.org/r/1192513 (https://phabricator.wikimedia.org/T370153) [09:58:46] (03CR) 10Tiziano Fogli: [C:03+2] "I’m self-merging this patch since it’s a minor change to help troubleshoot the current alerts." [puppet] - 10https://gerrit.wikimedia.org/r/1192513 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1000) [10:01:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:05:05] 10ops-codfw, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015 (10phaultfinder) 03NEW [10:07:58] (03CR) 10Btullis: [V:03+1 C:03+2] Add an-launcher1003 to the list of permitted rsync and nfs clients [puppet] - 10https://gerrit.wikimedia.org/r/1192511 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [10:08:01] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: fix s3 bucket sync re-download [puppet] - 10https://gerrit.wikimedia.org/r/1192506 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:13:58] (03PS1) 10Btullis: Bump image for 
the spark-operator to use a working dir of /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192514 (https://phabricator.wikimedia.org/T405490) [10:14:05] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [10:14:11] (03PS2) 10Btullis: Bump image for the spark-operator to use a working dir of /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192514 (https://phabricator.wikimedia.org/T405490) [10:14:12] (03CR) 10CI reject: [V:04-1] Bump image for the spark-operator to use a working dir of /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192514 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:14:23] Hmmm [10:14:37] * claime squints at mw-web [10:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 16.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:20:32] hmm indeed [10:21:23] s8 getting busy 
[10:22:35] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:23:12] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:24:05] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [10:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:26:05] (03CR) 10Btullis: [C:03+2] Bump image for the spark-operator to use a working dir of /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192514 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:28:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1087.eqiad.wmnet with OS bullseye [10:28:06] oh, in a much bigger way also s4 [10:28:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11228635 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1087.eqiad.wmnet... [10:28:22] (03PS1) 10Tiziano Fogli: mirrormaker: fix prometheus expr [puppet] - 10https://gerrit.wikimedia.org/r/1192517 (https://phabricator.wikimedia.org/T370153) [10:28:22] (03CR) 10Tiziano Fogli: [C:03+2] "I’m self-merging this patch since it’s a minor change to help troubleshoot the current alerts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192517 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [10:29:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:32:05] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11228650 (10Jelto) All GitLab packages are migrated to APUS object storage. The additional sync from bucket to the backup directory... [10:32:42] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11228653 (10Jelto) [10:34:15] (03Merged) 10jenkins-bot: Bump image for the spark-operator to use a working dir of /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192514 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:34:18] (03CR) 10Clément Goubert: [C:04-1] Add rate limiting for REST gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [10:35:22] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.21 refs T405677 (duration: 45m 21s) [10:35:29] T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677 [10:35:37] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:35:37] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM 
workers for Mediawiki mw-web releases routed via main at codfw: 24.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:35:53] (03CR) 10Btullis: [V:03+1 C:03+2] Customise the login.html template of JupyterHub to hide the TLS warning [puppet] - 10https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [10:39:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:40:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1087.eqiad.wmnet with reason: host reimage [10:42:16] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11228699 (10jcrespo) Do we need daily full backups for objects? Assuming only a few object change per day, cannot we do incremental... [10:43:05] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:44:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1087.eqiad.wmnet with reason: host reimage [10:53:27] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:56:48] !log dropping interwiki table on group0 (T397367) [10:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:55] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [10:57:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [10:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1087.eqiad.wmnet with OS bullseye [11:01:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11228814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1087.eqiad.wmnet wit... 
[11:07:21] (03PS1) 10TheDJ: SVG: do not allow native SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) [11:08:43] bah https://versions.toolforge.org/ is dead [11:08:55] * hashar ignores [11:08:58] I am running the train no [11:08:59] w [11:11:07] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192529 (https://phabricator.wikimedia.org/T405677) [11:11:10] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192529 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:03] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192529 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [11:14:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [11:16:23] hashar: reported the versions issue in #wikimedia-cloud fwiw [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:39] ah great, thank you for your assistance Lucas_WMDE ! 
[11:23:01] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [11:25:48] ERROR | https://%{_server}%{_url} | Division by zero <-- fun times [11:25:57] from /srv/mediawiki/php-1.45.0-wmf.20/extensions/TimedMediaHandler/includes/WebVideoTranscode/WebVideoTranscodeJob.php(394) [11:26:07] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.21 refs T405677 [11:26:07] anyway, I'll file tasks later this evening [11:26:13] T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677 [11:26:38] I ran the train via spider pig ( https://spiderpig.wikimedia.org/jobs/650 ) [11:29:48] and now i'm wondering if the spider pig is driving the train or in front of the train pulling it [11:29:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:29:52] 10ops-codfw, 06DC-Ops: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406025 (10phaultfinder) 03NEW [11:34:02] (03CR) 10MVernon: [C:03+2] swift: re-add 2 nodes, drain the final 2, leave 1 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1190674 (https://phabricator.wikimedia.org/T404356) (owner: 10MVernon) [11:34:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:36:15] (03PS1)
10Kosta Harlan: EventStreamConfig: Fix user-agent exclusion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) [11:36:35] (03PS2) 10Kosta Harlan: EventStreamConfig: Fix user-agent exclusion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) [11:37:44] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7141/co" [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [11:39:00] (03PS1) 10Btullis: Fix the path of the login.html template for jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/1192534 (https://phabricator.wikimedia.org/T403863) [11:40:52] (03CR) 10Elukey: [V:03+1 C:03+2] statistics: Delete old model-upload script. [puppet] - 10https://gerrit.wikimedia.org/r/1190577 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [11:41:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:44:55] (03PS11) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [11:45:12] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [11:45:27] (03CR) 10Dr0ptp4kt: [C:03+1] EventStreamConfig: Fix user-agent exclusion config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: 10Kosta Harlan) [11:46:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:46:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:51:11] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [11:53:59] (03CR) 10Btullis: [C:03+2] Fix the path of the login.html template for jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/1192534 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [11:54:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:59:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [11:59:13] (03PS1) 10Jelto: gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) [11:59:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11229028 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1200) [12:02:01] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11229035 (10Jelto) >>! In T378922#11228699, @jcrespo wrote: > Do we need daily full backups for objects? Assuming only a few object... [12:06:21] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11229074 (10jcrespo) >>! In T378922#11229035, @Jelto wrote: >> I'm not sure which job defaults are used then, but incremental shoul... [12:09:52] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:25] (03PS10) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [12:14:25] (03PS15) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:16:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [12:17:12] (03CR) 10Brouberol: kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [12:17:49] (03PS11) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 
(https://phabricator.wikimedia.org/T304373) [12:17:49] (03PS16) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:17:56] jclark@cumin1002 reimage (PID 967393) is awaiting input [12:27:24] (03PS12) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [12:27:24] (03PS17) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:30:42] jouncebot: nowandnext [12:30:42] For the next 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1200) [12:30:42] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1300) [12:36:49] (03PS9) 10Brouberol: deployment_server: ensure all users can traverse airflow private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192499 [12:38:08] (03PS1) 10Btullis: Use the new image of spark-operator that uses uid/gid 185 for spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192541 (https://phabricator.wikimedia.org/T405490) [12:38:19] (03PS2) 10Btullis: Use the new image of spark-operator that uses uid/gid 185 for spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192541 (https://phabricator.wikimedia.org/T405490) [12:40:03] (03PS13) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [12:40:03] (03PS18) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:40:45] (03PS14) 10Brouberol: kafka-mirrormaker: define business logic 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [12:40:45] (03PS19) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:42:34] jouncebot: nowandnext [12:42:34] For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1200) [12:42:34] In 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1300) [12:42:51] I'll start mine early [12:43:21] Dreamy_Jazz: Ack. [12:45:41] Started my one (it produces no public log entries) [12:47:40] (03CR) 10Btullis: [C:03+2] Use the new image of spark-operator that uses uid/gid 185 for spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192541 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:51:15] !log bking@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2016.codfw.wmnet with OS bullseye [12:51:24] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11229271 (10Jclark-ctr) 1-252356162540 inbound ticket for data center [12:51:57] (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1192499 (owner: 10Brouberol) [12:52:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11229275 (10Jclark-ctr) @Marostegui when this drive arrives can it be swapped at anytime? 
[12:52:52] (03CR) 10Brouberol: [C:03+2] deployment_server: ensure all users can traverse airflow private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192499 (owner: 10Brouberol) [12:53:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 35354416 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 28168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:470:0:1c0::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:55:32] (03Merged) 10jenkins-bot: Use the new image of spark-operator that uses uid/gid 185 for spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192541 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:55:41] Testing... [12:55:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11229282 (10Ladsgroup) Manuel is out. Yes. To my understanding it should be fine. It's live but disk swaps are noop to the system (AFAIK) and also this doesn't get that much writes either. [12:58:04] Finished testing, now on sync-prod-k8s [12:58:21] (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150" [puppet] - 10https://gerrit.wikimedia.org/r/1192501 (owner: 10Jcrespo) [12:58:31] (03PS2) 10Klausman: profile::amd_gpu: roll out new AMD GPU plugin to all LiftWing workers [puppet] - 10https://gerrit.wikimedia.org/r/1191699 (https://phabricator.wikimedia.org/T398600) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1300). [13:00:05] mfossati, James_F, EggRoll97, xSavitar, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] Hey. [13:00:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11229287 (10Gehel) [13:00:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11229289 (10Gehel) [13:00:30] o/ [13:00:40] 80% done on my sync, so shortly over to someone else [13:00:42] (03CR) 10Btullis: kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [13:01:13] o/ [13:01:25] o/ [13:01:36] I can self-deploy [13:01:38] I'm now done [13:01:42] On to the next person [13:02:01] mfossati: You're up. [13:02:11] cool, will do [13:02:34] (03CR) 10Btullis: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [13:03:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11229292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007... 
[13:03:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11229296 (10Gehel) [13:03:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11229297 (10Gehel) [13:03:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [13:04:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [13:04:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11229305 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [13:04:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:04:50] Lucas_WMDE, may I lobby you to deploy my patch? :) [13:05:01] (03Merged) 10jenkins-bot: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [13:05:20] sure, once the other ones are done [13:05:37] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1192138|ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (T403259)]] [13:05:44] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:05:48] Ack! Thanks! [13:06:49] mfossati: just checking, did you see the last comments on the change you’re deploying? 
[13:06:56] (its two CR+1 votes got removed) [13:10:47] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11229318 (10elukey) Tested the new maps codfw postgres stack with the kartotherian diff tool. This is what I got: Similarities: ` print(quantiles.to_markdown()) | |... [13:11:02] Lucas_WMDE: yeah, thanks for the heads-up. I'll follow up with the commenter, as I thought this is already addressed elsewhere (MPIC) [13:11:07] ok [13:11:28] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11229323 (10FCeratto-WMF) Hello @Maria_Lechner_WMDE , in my understanding you should follow https://wikitech.wikimedia.org/wiki/Volunteer_NDA which requires a different Phabricator task... [13:11:57] (03PS1) 10Brouberol: Revert "deployment_server: ensure all users can traverse airflow private directories" [puppet] - 10https://gerrit.wikimedia.org/r/1192545 [13:11:57] (03PS1) 10Brouberol: deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 [13:12:15] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192546 (owner: 10Brouberol) [13:12:21] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1192138|ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (T403259)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:12:28] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:12:28] (03PS4) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:12:30] (03CR) 10Elukey: [C:03+1] "LGTM, you'll need to manually apt-get install amd-k8s-device-plugin on all gpu nodes IIUC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191699 (https://phabricator.wikimedia.org/T398600) (owner: 10Klausman) [13:13:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11229333 (10Jclark-ctr) @Ladsgroup thanks! it should arrive this afternoon will install this afternoon or tomorrow morning [13:13:24] !log mfossati@deploy2002 mfossati: Continuing with sync [13:13:29] gerrit seems unhappy [13:13:40] hm back [13:14:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11229341 (10Gehel) [13:14:46] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:15:53] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7145/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191699 (https://phabricator.wikimedia.org/T398600) (owner: 10Klausman) [13:16:30] (03CR) 10Stevemunene: [C:03+1] Revert "deployment_server: ensure all users can traverse airflow private directories" [puppet] - 10https://gerrit.wikimedia.org/r/1192545 (owner: 10Brouberol) [13:16:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:16:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:17:21] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7146/console" [puppet] - 10https://gerrit.wikimedia.org/r/1191699 (https://phabricator.wikimedia.org/T398600) (owner: 10Klausman) [13:18:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:18:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-main: apply [13:18:33] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192138|ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (T403259)]] (duration: 12m 56s) [13:18:40] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:19:29] James_F, Lucas_WMDE: all done here! [13:19:33] Ack. [13:19:45] James_F: I assume you’ll self-serve? [13:19:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192303 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [13:19:48] Yeah. [13:19:50] ok [13:20:41] (03Merged) 10jenkins-bot: Wikifunctions clients: Enable rich text (HTML) output in embedded calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192303 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [13:20:49] haven’t seen EggRoll97 yet btw [13:21:14] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1192303|Wikifunctions clients: Enable rich text (HTML) output in embedded calls (T397402)]] [13:21:19] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11229354 (10Jclark-ctr) 05Resolved→03Open I Emailed support@servertech.com This morning they Already replied with return label for the ticket Emailed To @VRiley-WMF since it has Address info [13:21:20] T397402: If we enable Wikifunctions to output HTML tables, styling, and links, we will demonstrate through a Function that displays a conjugation table its capability for generating net new knowledge on Wiktionaries beyond simple conversions. 
- https://phabricator.wikimedia.org/T397402 [13:21:27] (03CR) 10Klausman: [V:03+2 C:03+2] profile::amd_gpu: roll out new AMD GPU plugin to all LiftWing workers [puppet] - 10https://gerrit.wikimedia.org/r/1191699 (https://phabricator.wikimedia.org/T398600) (owner: 10Klausman) [13:23:42] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11229375 (10hashar) //I am pasting comments I have made on a Slack thread:// We had [[ https://www.mediawiki.org/wiki/Extension:DumpHTML | Extensio... [13:24:35] (03CR) 10Brouberol: kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [13:25:12] (03CR) 10Elukey: "Closing old review comments since the code was changed by Balthazar." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [13:25:39] (03PS2) 10Brouberol: deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 [13:25:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:26:39] (03PS15) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [13:26:39] (03PS20) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [13:26:57] (03CR) 10Brouberol: [C:03+2] Revert "deployment_server: ensure all users can traverse airflow private directories" [puppet] - 10https://gerrit.wikimedia.org/r/1192545 (owner: 10Brouberol) [13:27:42] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1192303|Wikifunctions clients: Enable rich text (HTML) output in embedded calls (T397402)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:49] T397402: If we enable Wikifunctions to output HTML tables, styling, and links, we will demonstrate through a Function that displays a conjugation table its capability for generating net new knowledge on Wiktionaries beyond simple conversions. - https://phabricator.wikimedia.org/T397402 [13:28:13] !log jforrester@deploy2002 jforrester: Continuing with sync [13:30:06] (03PS3) 10Brouberol: deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 [13:32:40] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for removing the hardcoded hostnames." 
[puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [13:33:29] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192303|Wikifunctions clients: Enable rich text (HTML) output in embedded calls (T397402)]] (duration: 12m 15s) [13:33:36] T397402: If we enable Wikifunctions to output HTML tables, styling, and links, we will demonstrate through a Function that displays a conjugation table its capability for generating net new knowledge on Wiktionaries beyond simple conversions. - https://phabricator.wikimedia.org/T397402 [13:33:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:33:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:34:26] Lucas_WMDE: Over to you. [13:35:38] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11229406 (10MatthewVernon) A couple of notes, so I have a record of what I've done, and in case they're of any help! I've just re-im... [13:35:44] (03PS4) 10Brouberol: deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 [13:36:09] (03CR) 10Stevemunene: [C:03+1] deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 (owner: 10Brouberol) [13:36:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:36:17] thanks! [13:36:50] still no sign of EggRoll97 so let’s continue with xSavitar [13:36:56] Ack [13:36:57] (03CR) 10Elukey: [C:03+1] "LGTM, I left a comment but it is really a nit. 
The tests are good, as suggested before we could have used requests-mock (https://requests-" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [13:37:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187779 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:38:09] (03Merged) 10jenkins-bot: session: Enable MultiBackendSessionStore on `group1` wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187779 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:38:36] (03CR) 10Brouberol: [C:03+2] deployment_server: ensure all users can traverse service private directories [puppet] - 10https://gerrit.wikimedia.org/r/1192546 (owner: 10Brouberol) [13:38:41] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1187779|session: Enable MultiBackendSessionStore on `group1` wikis (T402808)]] [13:38:48] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:39:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11229421 (10MatthewVernon) [13:44:14] (03CR) 10Brouberol: kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [13:45:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Backport for [[gerrit:1187779|session: Enable MultiBackendSessionStore on `group1` wikis (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:45:16] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:45:26] Testing... [13:45:31] ok [13:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:28] (03PS1) 10Btullis: Use the new entrypoint.sh of the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192552 (https://phabricator.wikimedia.org/T405490) [13:47:40] (03PS2) 10Btullis: Use the new entrypoint.sh of the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192552 (https://phabricator.wikimedia.org/T405490) [13:47:48] Lucas_WMDE, feel free to sync. All seems to work as expected. [13:47:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm [13:47:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11229457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007... [13:48:20] thanks! 
[13:48:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Continuing with sync [13:50:48] (03PS3) 10DDesouza: Update reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) [13:52:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191691 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:53:22] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1187779|session: Enable MultiBackendSessionStore on `group1` wikis (T402808)]] (duration: 14m 40s) [13:53:29] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:53:34] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11229481 (10elukey) @MatthewVernon thanks for the write-up! As FYI Jesse is working on T376949, that should address your concerns abo... 
[13:54:04] Lucas_WMDE, thanks for deploying 🙏🏽 [13:54:13] (03PS1) 10DDesouza: Increase coverage of Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192555 (https://phabricator.wikimedia.org/T405577) [13:54:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:54:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192555 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [13:55:05] np [13:55:30] !log UTC afternoon backport+config window done [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] (the enwiki EFM change will have to be rescheduled I guess) [13:55:47] (doesn’t sound like it was super urgent anyway) [13:58:18] (03CR) 10Elukey: [C:03+2] thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1400) [14:01:03] (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:01:18] (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:01:27] (03PS4) 10DDesouza: Update reader foundational 
survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) [14:04:33] (03CR) 10Btullis: [C:03+2] Use the new entrypoint.sh of the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192552 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:12:15] (03PS5) 10Ssingh: P:cache::haproxy: exempt releases.wikimedia.org from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [14:12:29] (03Merged) 10jenkins-bot: Use the new entrypoint.sh of the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192552 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:13:01] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7149/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [14:14:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:470:0:1c0::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:15:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:16:37] (03CR) 10Aklapper: [C:03+2] Update source strings to latest release [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1190699 (https://phabricator.wikimedia.org/T404134) (owner: 10Pppery) [14:16:42] (03CR) 10Aklapper: [V:03+2 C:03+2] Update source strings to latest release [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1190699 (https://phabricator.wikimedia.org/T404134) (owner: 10Pppery) [14:17:04] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly locally on latest wmf/stable head" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1190699 (https://phabricator.wikimedia.org/T404134) (owner: 10Pppery) [14:18:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance [14:18:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P83483 and previous config saved to /var/cache/conftool/dbconfig/20250930-141816-fceratto.json [14:18:24] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:19:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P83484 and previous config saved to /var/cache/conftool/dbconfig/20250930-141925-fceratto.json [14:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:25:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:26:38] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2016.codfw.wmnet [14:26:42] (03PS1) 10Btullis: Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) [14:27:40] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2016.codfw.wmnet [14:27:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:29:26] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:29:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:29:50] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:29:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1430) [14:30:34] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:31:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:32:05] (03PS5) 10Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 [14:32:05] (03PS15) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 
(https://phabricator.wikimedia.org/T405574) [14:33:05] (03PS1) 10CDanis: Add mediawiki-config-state to ignored NodeTextfileStales [alerts] - 10https://gerrit.wikimedia.org/r/1192561 [14:34:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P83485 and previous config saved to /var/cache/conftool/dbconfig/20250930-143433-fceratto.json [14:35:32] (03PS16) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [14:35:32] (03PS21) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:36:11] (03PS22) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:36:35] (03CR) 10Brouberol: kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:36:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:38:09] (03PS23) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:38:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:38:32] (03CR) 10JHathaway: [C:03+1] Add mediawiki-config-state to ignored NodeTextfileStales [alerts] - 10https://gerrit.wikimedia.org/r/1192561 (owner: 10CDanis) [14:38:43] (03CR) 10CDanis: [C:03+2] Add mediawiki-config-state to ignored NodeTextfileStales [alerts] - 10https://gerrit.wikimedia.org/r/1192561 (owner: 10CDanis) [14:38:55] (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:39:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:39:55] (03Merged) 10jenkins-bot: Add mediawiki-config-state to ignored NodeTextfileStales [alerts] - 10https://gerrit.wikimedia.org/r/1192561 (owner: 10CDanis) [14:40:10] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:40:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:40:24] (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:40:26] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: initial scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192109 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:40:30] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:41:23] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:41:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:41:48] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:41:56] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:42:05] (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:42:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:42:59] (03PS1) 10Jelto: gitlab: disable and remove partial backup [puppet] - 10https://gerrit.wikimedia.org/r/1192562 (https://phabricator.wikimedia.org/T378922) [14:43:19] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11229672 (10MatthewVernon) My best theory on that is that one install run writes EFI to one disk (embedding the UUID), then a subsequ... 
[14:43:25] (03PS24) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:43:37] (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:43:54] !log dancy@deploy2002 Installing scap version "4.213.0" for 2 host(s) [14:44:08] (03CR) 10CDanis: [C:03+2] puppetserver::volatile: Default to no XCheeseScore [puppet] - 10https://gerrit.wikimedia.org/r/1192224 (owner: 10CDanis) [14:44:12] (03PS17) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [14:44:12] (03PS25) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:45:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7150/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192562 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:45:43] !log dancy@deploy2002 Installation of scap version "4.213.0" completed for 2 hosts [14:46:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015#11229692 (10Jhancock.wm) a:03Jhancock.wm this looks like it might have been a one time blip. holding to check later. 
[14:47:14] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:47:29] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:48:01] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:48:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:48:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:49:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P83486 and previous config saved to /var/cache/conftool/dbconfig/20250930-144940-fceratto.json [14:49:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11229706 (10VRiley-WMF) Hey @cmooney is there a good time to schedule this move? 
[14:51:45] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11229723 (10VRiley-WMF) Have been in communication with the Senior Account manager and they are investigating why my login isn't able to submit tickets at this time. The Juniper account is created, bu... [14:52:14] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:52:39] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:53:39] (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:54:09] (03Merged) 10jenkins-bot: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:55:54] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [14:57:24] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:58:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:58:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [14:59:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:00:05] jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1500). 
[15:02:14] (03PS1) 10Ahmon Dancy: osm_master: Create /etc/wikimedia directory [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) [15:02:16] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/airflow-main: apply [15:02:27] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:02:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:03:04] !log dzahn@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: version upgrade [15:03:06] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [15:03:55] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11229770 (10elukey) I had a chat with Yiannis, we reviewed the values since the similarities in some cases were on the low 90s. So kartotherian-diff emits image diffs for every reques... 
[15:04:19] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phab deploy [15:04:27] (03CR) 10CI reject: [V:04-1] osm_master: Create /etc/wikimedia directory [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [15:04:44] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phab deploy [15:04:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P83488 and previous config saved to /var/cache/conftool/dbconfig/20250930-150448-fceratto.json [15:04:56] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:05:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:05:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:05:27] (03PS1) 10Ahmon Dancy: Allow deployment group to sudo -u mwbuilder scap clean-images [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) [15:05:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83489 and previous config saved to /var/cache/conftool/dbconfig/20250930-150529-fceratto.json [15:06:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83490 and previous config saved to /var/cache/conftool/dbconfig/20250930-150638-fceratto.json [15:06:49] !log urbanecm@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-experimental: apply [15:07:10] (03PS2) 10Ahmon Dancy: osm_master: Create /etc/wikimedia directory [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) [15:07:41] !log brennen@deploy2002 Started deploy [phabricator/deployment@41325d8]: deploy phab2002 for T406041 [15:07:48] T406041: Deploy Phabricator/Phorge 2025-09-30 - https://phabricator.wikimedia.org/T406041 [15:08:12] !log brennen@deploy2002 Finished deploy [phabricator/deployment@41325d8]: deploy phab2002 for T406041 (duration: 00m 31s) [15:08:53] !log brennen@deploy2002 Started deploy [phabricator/deployment@41325d8]: deploy phab1004 for T406041 [15:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:52] !log brennen@deploy2002 Finished deploy [phabricator/deployment@41325d8]: deploy phab1004 for T406041 (duration: 00m 59s) [15:11:39] FIRING: TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPD [15:16:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:18:56] (03PS1) 10CDanis: intake-logging: also ship ja3n [puppet] - 10https://gerrit.wikimedia.org/r/1192570 [15:19:53] 10ops-eqiad, 06DC-Ops: Alert for device 
ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406043 (10phaultfinder) 03NEW [15:19:54] (03PS1) 10CDanis: EventGate: store x-ja3n req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 [15:21:21] (03CR) 10Giuseppe Lavagetto: [C:03+1] intake-logging: also ship ja3n [puppet] - 10https://gerrit.wikimedia.org/r/1192570 (owner: 10CDanis) [15:21:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P83491 and previous config saved to /var/cache/conftool/dbconfig/20250930-152146-fceratto.json [15:22:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:25:32] (03CR) 10Snwachukwu: "thank you for your time!" [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu) [15:28:47] (03PS1) 10Ahmon Dancy: Add optional scap-clean-images systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) [15:30:38] (03PS2) 10CDanis: intake-logging EventGate: store x-ja3n req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 [15:30:51] (03Abandoned) 10CDanis: intake-logging: also ship ja3n [puppet] - 10https://gerrit.wikimedia.org/r/1192570 (owner: 10CDanis) [15:32:46] (03PS16) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [15:33:04] (03PS2) 10Ahmon Dancy: deployment_server: Add optional scap-clean-images systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) [15:34:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:13] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/airflow-main: apply [15:34:18] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:34:51] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11229937 (10RobH) @wiki_willy, I'm not 100% sure on how to process this and I wanted to check with you... 
[15:36:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P83493 and previous config saved to /var/cache/conftool/dbconfig/20250930-153653-fceratto.json [15:39:35] (03CR) 10Giuseppe Lavagetto: [C:03+1] intake-logging EventGate: store x-ja3n req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 (owner: 10CDanis) [15:39:47] jouncebot: nowandnext [15:39:47] For the next 0 hour(s) and 20 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1500) [15:39:47] In 0 hour(s) and 20 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1600) [15:41:10] (03PS1) 10MVernon: install_server: set ms-be209* to use ms-be_simple-efi.cfg preseed [puppet] - 10https://gerrit.wikimedia.org/r/1192575 (https://phabricator.wikimedia.org/T405958) [15:41:13] (03CR) 10Dzahn: [C:03+1] gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [15:41:54] (03CR) 10Dzahn: [C:03+1] Remove wikimedia.support from ncredir/acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [15:42:15] (03CR) 10Jcrespo: [C:03+1] "lgtm, as long as it deploys correctly" [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [15:43:20] (03CR) 10Hnowlan: "We still obey CSP headers that services themselves provide, so I guess this is just opting out of a default value. 
Still a little awkward," [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan) [15:43:30] 10SRE-SLO, 10EditCheck, 06Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11229971 (10elukey) @DLynch hi! Any updates? The error budget keeps going down :) [15:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:45:18] (03CR) 10CDanis: [C:03+2] intake-logging EventGate: store x-ja3n req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 (owner: 10CDanis) [15:46:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 (owner: 10CDanis) [15:46:24] (03PS2) 10Hnowlan: (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) [15:46:26] (03Merged) 10jenkins-bot: intake-logging EventGate: store x-ja3n req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192572 (owner: 10CDanis) [15:47:01] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1192572|intake-logging EventGate: store x-ja3n req hdr]] [15:51:37] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [15:51:42] (03CR) 10Dzahn: "This seems the right approach. But I am not sure if we are also going to move the keys.txt to a new location with this or not." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [15:52:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P83494 and previous config saved to /var/cache/conftool/dbconfig/20250930-155200-fceratto.json [15:52:08] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:52:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:52:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P83495 and previous config saved to /var/cache/conftool/dbconfig/20250930-155223-fceratto.json [15:52:45] (03PS1) 10Scott French: php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1192563 [15:52:53] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1192572|intake-logging EventGate: store x-ja3n req hdr]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:53:13] (03PS10) 10CDanis: WMF-Uniq -> analytics: better stats & privacy [puppet] - 10https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783) [15:53:13] (03PS1) 10CDanis: benthos: switch to new & improved wmfuniq fields [puppet] - 10https://gerrit.wikimedia.org/r/1192576 (https://phabricator.wikimedia.org/T405783) [15:53:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P83496 and previous config saved to /var/cache/conftool/dbconfig/20250930-155347-fceratto.json [15:55:32] (03PS1) 10Marco Fossati: ReaderExperiments' ImageBrowsing: use edge uniques [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) [15:55:35] !log cdanis@deploy2002 cdanis: Continuing with sync [15:56:22] !log reprepro include php8.3_8.3.25-1+wmf11u2 in component/php83 [15:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:03] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [15:58:34] (03PS4) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) [15:58:37] (03CR) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:58:46] 
(03CR) 10Marco Fossati: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [15:58:47] (03CR) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:59:01] (03CR) 10CI reject: [V:04-1] phabricator: hiera'ize the apc_shm_size variable [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:59:47] (03CR) 10Scott French: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1192563 (owner: 10Scott French) [15:59:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [15:59:55] (03CR) 10Dzahn: [C:03+1] "I did address Antoine's comments. Going to merge now because in the phab deploy window an hour ago we already re-enabled puppet and this i" [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [16:00:05] jhathaway and moritzm: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1600). [16:00:05] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:08] o/ [16:00:21] o/ [16:00:27] (03PS2) 10Btullis: Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) [16:00:42] we might also want to revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1187779 and perhaps also https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1183132 btw, per #mediawiki-core [16:00:45] (cc xSavitar) [16:01:03] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192572|intake-logging EventGate: store x-ja3n req hdr]] (duration: 14m 01s) [16:01:04] maybe after the puppet change is done (shouldn’t take the whole hour, hopefully ^^) [16:01:33] (03PS3) 10Btullis: Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) [16:01:42] (03CR) 10JHathaway: [C:03+2] statistics::wmde: Remove unused graphite_host [puppet] - 10https://gerrit.wikimedia.org/r/1191322 (owner: 10Lucas Werkmeister (WMDE)) [16:01:51] (03CR) 10Dzahn: [C:03+1] phabricator: hiera'ize the apc_shm_size variable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [16:01:55] (03PS1) 10Pmiazga: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [16:02:15] (03CR) 10CI reject: [V:04-1] api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [16:02:15] (03PS5) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) [16:02:18] (03CR) 10Federico Ceratto: [C:03+1] "Matches the task description, configuring all current and future 
hosts `ms-be209*`" [puppet] - 10https://gerrit.wikimedia.org/r/1192575 (https://phabricator.wikimedia.org/T405958) (owner: 10MVernon) [16:02:26] (03CR) 10Hnowlan: [C:03+1] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1192563 (owner: 10Scott French) [16:02:59] 10ops-drmrs: Inbound errors on interface cr1-drmrs:xe-0/1/3 (Transit: Arelion (IC-370330) {#D0068}) - https://phabricator.wikimedia.org/T393228#11230029 (10RobH) 05Open→03Declined [16:03:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:03:26] 10ops-drmrs: Port with no description on access switch - https://phabricator.wikimedia.org/T390028#11230033 (10RobH) 05Open→03Declined [16:03:31] 10ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#11230034 (10RobH) 05Open→03Declined [16:04:03] Lucas_WMDE: merged [16:04:10] thanks! [16:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:04:34] Lucas_WMDE, looking... [16:04:47] jhathaway: could you perhaps do a puppet run on stat1011? so we’ll see if it causes any issues right away, instead of in half an hour [16:04:56] (if I understand puppet correctly) [16:04:58] yup... 
[16:05:44] (03PS1) 10Brouberol: kafka-mirrormaker: simplify the config and set deployent.spec.typle to recreate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192581 [16:05:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11230052 (10Jhancock.wm) cp2056 has had the card replaced and i've assigned the mgmt ip to it. if you needed a clean slate for any reason it's ready. [16:06:04] jouncebot: nowandnext [16:06:04] For the next 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1600) [16:06:04] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1700) [16:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:35] Lucas_WMDE: puppet is disabled, 'btullis-T403863' [16:06:36] T403863: Jupyterhub: Decide on/display escalation paths - https://phabricator.wikimedia.org/T403863 [16:07:07] (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: simplify the config and set deployent.spec.typle to recreate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192581 (owner: 10Brouberol) [16:07:24] (03PS6) 10Pmiazga: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [16:07:24] (03PS17) 10Pmiazga: Add rate limiting for REST gateway (WIP) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [16:07:24] (03PS7) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [16:07:25] (03PS2) 10Pmiazga: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [16:07:28] huh [16:07:28] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1191747/7155/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [16:08:29] jhathaway: maybe you can sudo-edit /srv/analytics-wmde/graphite/src/config on stat1011 to remove the line manually? [16:08:45] that wouldn’t tell us if I’ve written the right puppet change or not [16:08:52] but at least it would increase confidence that it’s not needed in the config file anymore [16:08:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P83497 and previous config saved to /var/cache/conftool/dbconfig/20250930-160855-fceratto.json [16:09:04] (03CR) 10MVernon: [C:03+2] install_server: set ms-be209* to use ms-be_simple-efi.cfg preseed [puppet] - 10https://gerrit.wikimedia.org/r/1192575 (https://phabricator.wikimedia.org/T405958) (owner: 10MVernon) [16:09:17] (03CR) 10Pmiazga: "@dkinzler@wikimedia.org -- hmm, can you check whats wrong here? 
I was rebasing my patch but somehow this patch is also changed ;/ no clue " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [16:09:35] (03CR) 10CI reject: [V:04-1] Add rate limiting for REST gateway (WIP) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [16:09:52] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11230095 (10MatthewVernon) a:05MatthewVernon→03None Done, thanks. [16:10:27] Lucas_WMDE: got the green light to enable, applying [16:10:53] (03PS2) 10Brouberol: kafka-mirrormaker: simplify the config and set deployent.spec.typle to recreate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192581 [16:10:59] \o/ [16:11:24] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host wdqs2016.codfw.wmnet [16:11:29] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts wdqs2016.codfw.wmnet [16:12:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye [16:12:57] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: simplify the config and set deployent.spec.typle to recreate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192581 (owner: 10Brouberol) [16:13:17] (03CR) 10Dzahn: [C:03+1] "also per https://phabricator.wikimedia.org/T378922#11229074" [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [16:13:17] 
the graphite_host seems to be gone from the config file [16:13:55] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11230119 (10RobH) [16:14:14] and `sudo journalctl -u wmde-analytics-minutely.service` looks good so far [16:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:14:21] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11230128 (10RobH) I've typed up the following directions for this removal via remote hands request CS3255331 Support, We would like 3 fibers unplugged and removed from our racks. There is no longer traffic flowing... [16:14:52] jhathaway: I think I’m happy to call that a success :) thank you! [16:14:52] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11230134 (10RobH) a:03RobH [16:15:03] Lucas_WMDE: great [16:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:15:17] is it okay if we use the rest of the window to revert 1-2 config changes? 
[16:15:22] * Lucas_WMDE asks around in #mediawiki-core too [16:15:28] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/airflow-main: apply [16:15:36] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/airflow-main: apply [16:15:53] (03PS1) 10D3r1ck01: Revert "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192583 [16:15:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [16:16:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [16:16:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:16:48] (03PS18) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [16:16:57] (03CR) 10Dr0ptp4kt: [C:03+1] "Seems fine and a good follow up to the previous patch. @mpopov@wikimedia.org look okay to you?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [16:17:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "confirmed noop on phab1004 prod - and fixed test instance:" [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [16:19:54] o/ I'll be deploying the revert Lucas_WMDE mentioned [16:20:08] 👍 [16:20:40] should affect MediaWiki, Kask/Cassandra load, Logstash load, not much else [16:21:14] (the revert patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192583/, xSavitar already uploaded it) [16:21:56] (03CR) 10Daniel Kinzler: Add rate limiting for REST gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [16:21:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192583 (owner: 10D3r1ck01) [16:22:47] (03Merged) 10jenkins-bot: Revert "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192583 (owner: 10D3r1ck01) [16:23:21] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1192583|Revert "session: Enable MultiBackendSessionStore on `group1` wikis"]] [16:24:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P83498 and previous config saved to /var/cache/conftool/dbconfig/20250930-162402-fceratto.json [16:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:28:59] !log reprepro copy bookworm-wikimedia trixie-wikimedia helm3 [16:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:22] (03CR) 10BCornwall: [C:03+2] beta: Remove redundant enable_m_redir_except_regex setting [puppet] - 10https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:29:25] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Wikisource [puppet] - 10https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:29:26] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:30:01] !log tgr@deploy2002 d3r1ck01, tgr: Backport for [[gerrit:1192583|Revert "session: Enable MultiBackendSessionStore on `group1` wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:30:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [16:31:14] !log tgr@deploy2002 d3r1ck01, tgr: Continuing with sync [16:33:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [16:36:10] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192583|Revert "session: Enable MultiBackendSessionStore on `group1` wikis"]] (duration: 12m 49s) [16:37:57] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@f3216ec] (releasing): test [16:38:28] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@f3216ec] (releasing): test (duration: 00m 31s) [16:38:30] done [16:38:51] tgr_, thanks! 
I can confirm that the spikes are going down [16:39:07] And login still works too [16:39:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P83499 and previous config saved to /var/cache/conftool/dbconfig/20250930-163910-fceratto.json [16:39:18] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:39:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [16:39:28] I also see log ingest rate has dropped off. Thank you! [16:39:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P83500 and previous config saved to /var/cache/conftool/dbconfig/20250930-163933-fceratto.json [16:40:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P83501 and previous config saved to /var/cache/conftool/dbconfig/20250930-164043-fceratto.json [16:41:51] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) [16:42:21] (03PS7) 10Pmiazga: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [16:42:26] (03PS19) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [16:42:52] (03CR) 10Daniel Kinzler: "fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [16:43:08] (03CR) 10Brennen Bearnes: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [16:43:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:49:09] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:49:12] (03PS1) 10DCausse: flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) [16:49:15] (03PS1) 10DCausse: [DNM] flink jobs: resume search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 [16:51:52] (03PS2) 10DCausse: flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) [16:51:52] (03PS2) 10DCausse: [DNM] flink jobs: resume search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 [16:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:55:46] !log 
jnuche@deploy2002 Started deploy [releng/jenkins-deploy@f3216ec] (releasing): test [16:55:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P83502 and previous config saved to /var/cache/conftool/dbconfig/20250930-165550-fceratto.json [16:56:49] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@f3216ec] (releasing): test (duration: 01m 02s) [16:56:54] (03CR) 10Scott French: [V:03+2] "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1192563 (owner: 10Scott French) [16:56:57] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1192563 (owner: 10Scott French) [16:59:46] (03CR) 10Hnowlan: [C:03+1] deployment_server: switch next and migration releases to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192227 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:00:05] swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1700). [17:00:22] o/ [17:00:43] (03CR) 10DCausse: [C:04-1] "should be merged before the wikikube@eqiad k8s upgrade." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse) [17:01:13] (03PS1) 10Jon Harald Søby: Enable USERLANGUAGE for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192595 (https://phabricator.wikimedia.org/T406050) [17:01:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192595 (https://phabricator.wikimedia.org/T406050) (owner: 10Jon Harald Søby) [17:01:39] (03PS8) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [17:01:39] (03PS3) 10Pmiazga: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [17:02:39] * swfrench-wmf is waiting on production images builds ... 
[17:03:40] (03CR) 10CI reject: [V:04-1] api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [17:03:43] (03CR) 10CI reject: [V:04-1] api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [17:04:37] !log swfrench@deploy2002 Started scap sync-world: Deployment to pick up new PHP 8.3 production images [17:04:47] (03PS9) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [17:04:47] (03PS4) 10Pmiazga: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [17:04:54] (03CR) 10Hnowlan: [C:03+2] (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan) [17:05:33] (03CR) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [17:05:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [17:05:39] swfrench-wmf: mind if I roll out a change to the rest-gateway at the same time? 
it'll only impact 50% of API requests for test2wiki [17:05:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [17:05:58] hnowlan: not at all! :) [17:06:44] (03CR) 10CI reject: [V:04-1] api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [17:06:47] (03CR) 10CI reject: [V:04-1] api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [17:07:12] (03Merged) 10jenkins-bot: (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan) [17:09:55] (03PS32) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [17:10:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230402 (10Jclark-ctr) @jcrespo Can you change site.pp to insetup it is not passing puppet. It is re imaged and should be correct now. ` Attempt to run 'spicerack.puppet.Pup... 
[17:10:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P83503 and previous config saved to /var/cache/conftool/dbconfig/20250930-171058-fceratto.json [17:11:22] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:11:25] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:11:38] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:12:36] 10SRE-SLO, 06SRE Observability: Thanos: support multiple ruler instances - https://phabricator.wikimedia.org/T406054 (10herron) 03NEW p:05Triage→03Medium [17:12:54] 10SRE-SLO, 06SRE Observability (FY2025/2026-Q1): Thanos: support multiple ruler instances - https://phabricator.wikimedia.org/T406054#11230420 (10herron) [17:13:01] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:13:09] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:13:48] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:14:02] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:15:18] (03PS1) 10Dzahn: jenkins: move apt-get command for scap to a dedicated file [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) [17:16:03] (03CR) 10CI reject: [V:04-1] jenkins: move apt-get command for scap to a dedicated file [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) (owner: 10Dzahn) [17:16:06] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406043#11230429 (10Jclark-ctr) a:03Jclark-ctr ` #1: Phase, AA:L3-L1, Active Power; 
Value: 1426 (power) high: 1400 ` [17:16:37] !log starting refinery-source deployment as part of weekly deployment train [17:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230431 (10jcrespo) This is the current configuration for dbprov1007, do you want something different? Maybe it just needs to remove the existing cert or something? ` # Needs re-s... [17:17:18] swfrench-wmf: all done, thanks! [17:17:28] (03PS2) 10Dzahn: jenkins: move apt-get command for scap to a dedicated file [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) [17:17:49] hnowlan: awesome, thanks! as expected, still waiting on images, heh [17:18:21] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406043#11230432 (10Jclark-ctr) 05Open→03Resolved [17:18:33] (03PS3) 10Dzahn: jenkins: move apt-get command for scap to a dedicated file [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) [17:19:11] (03PS10) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [17:19:11] (03PS5) 10Pmiazga: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [17:19:22] (03CR) 10CI reject: [V:04-1] api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [17:19:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: 
codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11230434 (10Papaul) [17:19:28] (03CR) 10CI reject: [V:04-1] api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [17:20:27] (03CR) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [17:21:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230439 (10Jclark-ctr) Sorry for the ping. I jumped to conclusions. I did not refresh my repo i am unsure why it is failing puppet right this moment. [17:21:15] (03PS17) 10Herron: thanos-rule: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) [17:21:25] (03PS14) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [17:21:34] (03PS2) 10Krinkle: varnish: Enable unified mobile routing on Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/1192264 (https://phabricator.wikimedia.org/T403510) [17:21:48] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/1192264 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [17:21:52] (03CR) 10Jaime Nuche: [C:03+1] jenkins: move apt-get command for scap to a dedicated file [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) (owner: 10Dzahn) [17:23:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - 
https://phabricator.wikimedia.org/T405618#11230453 (10Papaul) [17:26:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P83505 and previous config saved to /var/cache/conftool/dbconfig/20250930-172605-fceratto.json [17:26:14] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:26:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [17:26:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P83506 and previous config saved to /var/cache/conftool/dbconfig/20250930-172628-fceratto.json [17:27:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P83507 and previous config saved to /var/cache/conftool/dbconfig/20250930-172738-fceratto.json [17:28:18] (03CR) 10Dzahn: [C:03+2] "sudo privs are only affected on releases* not contint*" [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) (owner: 10Dzahn) [17:28:48] (03CR) 10Dzahn: [C:03+2] "same privs just in a new place and if anything it's safer than before" [puppet] - 10https://gerrit.wikimedia.org/r/1192597 (https://phabricator.wikimedia.org/T405352) (owner: 10Dzahn) [17:29:41] !log swfrench@deploy2002 Finished scap sync-world: Deployment to pick up new PHP 8.3 production images (duration: 25m 33s) [17:31:17] that was a bit quicker than expected, so I may squeeze an additional change into the remaining 30 minutes of the infa window [17:31:21] *infra [17:31:49] (03CR) 10Scott French: [C:03+2] deployment_server: switch next and migration releases to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192227 
(https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:42:14] !log swfrench@deploy2002 Started scap sync-world: Non-image-build scap run to switch next and migration releases to PHP 8.3 - T405955 [17:42:21] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:42:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P83508 and previous config saved to /var/cache/conftool/dbconfig/20250930-174245-fceratto.json [17:45:05] (03PS2) 10Scott French: trafficserver: enable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955) [17:46:43] !log swfrench@deploy2002 Finished scap sync-world: Non-image-build scap run to switch next and migration releases to PHP 8.3 - T405955 (duration: 04m 29s) [17:48:22] alright, I believe I'm done with the window [17:49:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm [17:49:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230556 (10jcrespo) I know, it is trying to use puppet 5, not puppet 7, I have not idea why, trying to fix it. [17:49:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11230557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007... 
[17:56:44] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2016.codfw.wmnet with OS bullseye [17:57:39] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:57:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:57:48] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [17:57:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P83509 and previous config saved to /var/cache/conftool/dbconfig/20250930-175752-fceratto.json [17:58:21] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:58:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:58:50] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T1800) [18:12:37] o/ [18:12:40] nothing for this window. [18:13:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P83510 and previous config saved to /var/cache/conftool/dbconfig/20250930-181300-fceratto.json [18:13:08] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [18:13:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [18:13:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [18:13:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P83511 and previous config saved to /var/cache/conftool/dbconfig/20250930-181340-fceratto.json [18:14:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P83512 and previous config saved to /var/cache/conftool/dbconfig/20250930-181449-fceratto.json [18:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:27:42] (03CR) 10Ssingh: [V:03+1] "Yeah I am not sure how to tackle that yet, or if we should fix for MW.org for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [18:29:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P83513 and previous config saved to /var/cache/conftool/dbconfig/20250930-182957-fceratto.json [18:30:24] (03CR) 10Dzahn: "maybe it needs both. a short term fix for mediawiki.org, then moving the file around, with a redirect from old to new.. then the permanent" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [18:33:03] (03CR) 10Ssingh: [C:03+1] Remove wikimedia.support from ncredir/acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:36:06] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192607 [18:36:22] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [18:36:44] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [18:37:45] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [18:38:04] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [18:38:10] (03CR) 10Gmodena: "ack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse) [18:45:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P83514 and previous config saved to /var/cache/conftool/dbconfig/20250930-184504-fceratto.json [18:46:08] (03PS1) 10Scott French: Introduce DSL rendering for known_client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192608 [18:47:01] (03CR) 10Scott French: 
[V:03+2] "Tested locally without issue." [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192608 (owner: 10Scott French) [18:47:02] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:49:00] PROBLEM - MD RAID on wikikube-worker2035 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:49:02] ACKNOWLEDGEMENT - MD RAID on wikikube-worker2035 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T406060 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:49:11] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060 (10ops-monitoring-bot) 03NEW [18:50:24] (03CR) 10Scott French: [V:03+2 C:03+2] Introduce DSL rendering for known_client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192608 (owner: 10Scott French) [18:51:09] (03PS1) 10Jforrester: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) [18:51:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:51:18] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [18:52:22] (03CR) 10Dzahn: [C:03+1] wikimedia.support: Rm ncredir, add zendesk records (031 comment) 
[dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:52:35] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy DSL rendering for known_client objects - swfrench@cumin2002" [18:52:37] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy DSL rendering for known_client objects - swfrench@cumin2002 [18:53:23] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy DSL rendering for known_client objects - swfrench@cumin2002 [18:53:25] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy DSL rendering for known_client objects - swfrench@cumin2002" [18:53:31] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:53:44] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:56:25] FIRING: [13x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:40] FIRING: 
KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:59:09] (03CR) 10Elukey: "James could you please add more info about the new bucket and the rationale to move to this value? Just to understand the context :)" [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester) [19:00:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P83515 and previous config saved to /var/cache/conftool/dbconfig/20250930-190012-fceratto.json [19:00:20] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [19:00:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [19:00:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:01:10] (03PS1) 10Herron: vo-escalate: absent timer [puppet] - 10https://gerrit.wikimedia.org/r/1192610 [19:01:37] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet [19:01:53] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet [19:02:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet [19:04:52] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:05:04] (03PS2) 10Jforrester: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) [19:05:50] (03PS1) 10Dzahn: jenkins: follow-up, fix file name for apt_update_jenkins script [puppet] - 10https://gerrit.wikimedia.org/r/1192611 (https://phabricator.wikimedia.org/T405352) [19:06:25] FIRING: [14x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:08:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11230861 (10wiki_willy) Hi @RobH - thanks for checking on this. Can you start off by sending me an ema... 
[19:08:14] (03CR) 10Dzahn: [C:03+2] jenkins: follow-up, fix file name for apt_update_jenkins script [puppet] - 10https://gerrit.wikimedia.org/r/1192611 (https://phabricator.wikimedia.org/T405352) (owner: 10Dzahn) [19:09:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet [19:09:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1002.eqiad.wmnet [19:11:23] (03PS1) 10Gergő Tisza: Enable JWT session cookies on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192613 (https://phabricator.wikimedia.org/T399631) [19:11:25] FIRING: [15x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192613 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [19:12:34] 10ops-esams, 06SRE, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#11230880 (10RobH) a:03RobH [19:13:00] 10ops-esams, 06SRE, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#11230884 (10RobH) Opened CS3256432 with Interxion: > Support, > > We would like 3 fibers unplugged and removed from our racks. There is no longer traffic flowing over any of these three patche... 
[19:16:24] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet [19:16:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1003.eqiad.wmnet [19:17:26] PROBLEM - MegaRAID on db1152 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:17:28] ACKNOWLEDGEMENT - MegaRAID on db1152 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T406063 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:17:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063 (10ops-monitoring-bot) 03NEW [19:19:38] (03PS1) 10Dzahn: zuul: use mysql+pymysql instead of mariadb+mariadbconnector in db URI [puppet] - 10https://gerrit.wikimedia.org/r/1192614 (https://phabricator.wikimedia.org/T405118) [19:20:14] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [19:21:38] (03CR) 10Dzahn: "https://docs.sqlalchemy.org/en/20/dialects/mysql.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192614 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [19:23:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet [19:25:15] (03PS1) 10Dzahn: zuul: let zuul-scheduler also reach zookeeper outside container [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) [19:27:56] (03PS1) 10Dzahn: zuul: set values for zuul auth operator? 
(WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1192617 [19:29:02] (03CR) 10Dzahn: "hmm, "This is a Kubernetes Operator for the Zuul Project Gating System."" [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (owner: 10Dzahn) [19:29:45] (03CR) 10Dzahn: "it seems like we should rather drop this section from the config file?" [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (owner: 10Dzahn) [19:31:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:00] (03PS2) 10Dzahn: zuul: drop section for zuul auth operator from new config [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (https://phabricator.wikimedia.org/T395938) [19:39:10] RESOLVED: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:42:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:44:44] !incidents [19:44:44] 6807 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:44:52] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:47:51] RESOLVED: 
ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:49:11] (03CR) 10Dduvall: [C:03+1] zuul: use mysql+pymysql instead of mariadb+mariadbconnector in db URI [puppet] - 10https://gerrit.wikimedia.org/r/1192614 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [19:50:56] (03PS33) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [19:58:48] (03PS3) 10Dduvall: gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T2000) [20:00:04] danisztls, Jhs, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:36] (03CR) 10Bking: [C:03+2] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:01:00] (03CR) 10Dduvall: "An empty allow list means "allow any gateway source" which is pretty confusing. I've added a comment about that to the file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) (owner: 10Dduvall) [20:02:01] o/ I can self-deploy [20:02:07] (03Merged) 10jenkins-bot: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:03:08] i'm here 👋 [20:03:52] Jhs and tgr_, your changes look trivial do you want me to deploy them (in a batch)? [20:04:02] danisztls, fine with me [20:04:30] Jhs, tgr_ the caveat is that I will not be able to run post-deployment commands as I will use spiderpig [20:05:11] ok, I will start the deploy [20:05:15] i thought spiderpigs did whatever spiderpigs do [20:05:28] Jhs: it does but sometimes there are extra things [20:05:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11231001 (10RobH) a:03wiki_willy @wiki_willy, I stand corrected, I figured out how to get this order... 
[20:05:35] (no post-deployment commands should be necessary for mine) [20:05:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191691 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192555 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:05:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192595 (https://phabricator.wikimedia.org/T406050) (owner: 10Jon Harald Søby) [20:06:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [20:06:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [20:06:50] (03Merged) 10jenkins-bot: Remove reader foundational survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191691 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:06:52] (03Merged) 10jenkins-bot: Increase coverage of Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192555 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:06:55] (03Merged) 10jenkins-bot: Update reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 
(https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:06:58] (03Merged) 10jenkins-bot: Enable USERLANGUAGE for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192595 (https://phabricator.wikimedia.org/T406050) (owner: 10Jon Harald Søby) [20:07:29] !log refinery-source deployment paused due to maven release error [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:34] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1191691|Remove reader foundational survey on enwiki (beta) (T405410)]], [[gerrit:1192555|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]], [[gerrit:1191510|Update reader foundational survey on enwiki (T405410)]], [[gerrit:1192595|Enable USERLANGUAGE for sourceswiki (T406050)]] [20:07:46] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:07:47] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:07:48] T406050: Enable USERLANGUAGE magic word for the multilingual Wikisource - https://phabricator.wikimedia.org/T406050 [20:09:52] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:37] (03CR) 10Dzahn: [C:03+2] zuul: use mysql+pymysql instead of mariadb+mariadbconnector in db URI [puppet] - 10https://gerrit.wikimedia.org/r/1192614 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [20:11:46] The cause of the elevated swift errors is not obvious to me, if anyone has any swift knowledge, and can point me in a direction that would be appreciated. 
[20:12:17] !log Deploying Refinery as part of deployment weekly train [20:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:31] (03PS1) 10Bking: wdqs: add newly-reimaged hosts as scap targets [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) [20:12:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:13:53] !log ebysans@deploy2002 Started deploy [analytics/refinery@c5c78d1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c5c78d17] [20:14:53] !log dani@deploy2002 dani, jhsoby: Backport for [[gerrit:1191691|Remove reader foundational survey on enwiki (beta) (T405410)]], [[gerrit:1192555|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]], [[gerrit:1191510|Update reader foundational survey on enwiki (T405410)]], [[gerrit:1192595|Enable USERLANGUAGE for sourceswiki (T406050)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:14:54] !log ebysans@deploy2002 Finished deploy [analytics/refinery@c5c78d1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c5c78d17] (duration: 01m 00s) [20:14:58] (03PS2) 10Bking: wdqs: add newly-reimaged hosts as scap targets [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) [20:15:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:15:04] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:15:04] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:15:05] T406050: Enable USERLANGUAGE magic word for the multilingual Wikisource - https://phabricator.wikimedia.org/T406050 [20:15:40] (03CR) 10Dzahn: [C:03+2] zuul: let zuul-scheduler also reach zookeeper outside container [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [20:15:42] SandraEbele_: I don't think it's the case but just to be sure, should I wait before your deploy finishes to continue mine? [20:16:03] Jhs: can you test your change? [20:16:15] danisztls, already did, looks fine 👍 [20:16:55] (03PS1) 10Scott French: Use overflow-x: auto when displaying DSL [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192628 [20:17:11] (03CR) 10Scott French: [V:03+2] "Tested locally." [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192628 (owner: 10Scott French) [20:17:40] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11231084 (10Jclark-ctr) @Ladsgroup this server is out of warranty. Can we swap with a disk from decom server. Failed Drive ` Disk 0 in Backplane 1 of Integrated RAID Controller 1 Manufacturer SKhyni... 
[20:18:15] (03CR) 10Scott French: [V:03+2 C:03+2] Use overflow-x: auto when displaying DSL [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192628 (owner: 10Scott French) [20:18:50] @dan [20:19:12] (03CR) 10Dzahn: [V:04-1 C:04-1] "https://puppet-compiler.wmflabs.org/output/1192615/7160/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [20:20:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11231091 (10Jclark-ctr) a:03Jclark-ctr [20:20:21] danisztls which of my deployment? Refinery? Also what are you deploying? [20:21:03] !log ebysans@deploy2002 Started deploy [analytics/refinery@c5c78d1]: Regular analytics weekly train [analytics/refinery@c5c78d17] [20:21:06] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy minor UI tweak for improved DSL viewing - swfrench@cumin2002" [20:21:08] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy minor UI tweak for improved DSL viewing - swfrench@cumin2002 [20:21:28] SandraEbele_: the current, refinery. I'm doing a spiderpig deploy. 
[20:21:40] (03CR) 10Bking: [C:03+2] wdqs: add newly-reimaged hosts as scap targets [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:21:58] (03CR) 10Bking: [C:03+2] "merging based on IRC approval from ryankemper" [puppet] - 10https://gerrit.wikimedia.org/r/1192626 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:22:01] (03PS2) 10Dzahn: zuul: let zuul-scheduler also reach zookeeper outside container [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) [20:22:04] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy minor UI tweak for improved DSL viewing - swfrench@cumin2002 [20:22:06] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy minor UI tweak for improved DSL viewing - swfrench@cumin2002" [20:22:55] SandraEbele_: yours target analytics infra and mine config backports so I don't think they conflict [20:24:13] !log dani@deploy2002 dani, jhsoby: Continuing with sync [20:25:46] !log ebysans@deploy2002 Finished deploy [analytics/refinery@c5c78d1]: Regular analytics weekly train [analytics/refinery@c5c78d17] (duration: 04m 43s) [20:26:32] !log ebysans@deploy2002 Started deploy [analytics/refinery@c5c78d1] (thin): Regular analytics weekly train THIN [analytics/refinery@c5c78d17] [20:27:29] !log ebysans@deploy2002 Finished deploy [analytics/refinery@c5c78d1] (thin): Regular analytics weekly train THIN [analytics/refinery@c5c78d17] (duration: 00m 57s) [20:29:17] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191691|Remove reader foundational survey on enwiki (beta) (T405410)]], [[gerrit:1192555|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]], [[gerrit:1191510|Update reader foundational 
survey on enwiki (T405410)]], [[gerrit:1192595|Enable USERLANGUAGE for sourceswiki (T406050)]] (duration: 21m 42s) [20:29:27] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:29:27] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:29:28] T406050: Enable USERLANGUAGE magic word for the multilingual Wikisource - https://phabricator.wikimedia.org/T406050 [20:29:45] tgr_: all yours [20:30:28] thx [20:30:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192613 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:31:25] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:44] (03Merged) 10jenkins-bot: Enable JWT session cookies on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192613 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:32:19] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1192613|Enable JWT session cookies on group0 (T399631)]] [20:32:25] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:33:44] !log bking@deploy2002 Started deploy [wdqs/wdqs@fea7794]: T405978 [20:33:50] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [20:33:59] !log bking@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: T405978 (duration: 00m 20s) [20:35:40] !log bking@deploy2002 Started deploy [wdqs/wdqs@fea7794]: T405978 [20:35:45] !log bking@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: T405978 (duration: 00m 10s) [20:36:48] !log Deployed refinery using scap, then 
deployed onto hdfs [20:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:28] !log tgr@deploy2002 tgr: Backport for [[gerrit:1192613|Enable JWT session cookies on group0 (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:39:35] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:42:51] !log tgr@deploy2002 tgr: Continuing with sync [20:43:18] (03PS1) 10D3r1ck01: Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 [20:45:50] (03CR) 10Dzahn: [C:03+2] admin: upgrade ebomani from ldap_only to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1191742 (https://phabricator.wikimedia.org/T405124) (owner: 10Dzahn) [20:47:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11231231 (10Dzahn) You have deployment access now. 
Welcome to deployers, @EBomani [20:47:31] (03PS1) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) [20:47:46] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192613|Enable JWT session cookies on group0 (T399631)]] (duration: 15m 27s) [20:47:53] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:49:26] !log UTC late deploys done [20:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:59:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ebomani. 
- https://phabricator.wikimedia.org/T405124#11231243 (10Dzahn) 05In progress→03Resolved [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250930T2100) [21:01:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:01:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:03:34] (03PS1) 10Dzahn: phabricator: drop cluster_search config [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) [21:06:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9310 bytes in 2.331 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:10:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:15:44] 10ops-magru: Power Supply - Status - issue on cp7004:9290 - https://phabricator.wikimedia.org/T405157#11231293 (10RobH) 05Open→03Resolved a:03RobH known power work in this period by magru , went away and both back online shortly after [21:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:16:13] 10ops-magru, 06DC-Ops, 06Traffic: planned power redundancy depreciation 2025-09-20 @ 18:00 GMT to 2025-09-21 @ 21:00 GMT - https://phabricator.wikimedia.org/T402818#11231299 (10RobH) 05Open→03Resolved a:03RobH [21:16:43] (03PS3) 10Ebernhardson: cirrus: Drop absented periodic_job (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1169210 [21:17:03] (03CR) 10Eric Gardner: Add ReaderExperiments extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:17:26] (03CR) 10Bking: [C:03+2] cirrus: Drop absented periodic_job (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1169210 (owner: 10Ebernhardson) [21:31:15] (03CR) 10Bearloga: [C:03+1] "Yup, looks okay to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [21:33:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:40:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:41:40] FIRING: 
KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:44:10] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:26] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:08] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [21:51:35] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [21:51:38] !log urbanecm@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:52:34] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:56:32] !log urbanecm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [21:57:36] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:01:40] (03PS1) 10Scott French: P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1192616 (https://phabricator.wikimedia.org/T403220) [22:04:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [22:04:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [22:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:15:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[22:15:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:20:32] jclark@cumin1002 reimage (PID 1950730) is awaiting input [22:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:20:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11231441 (10Ladsgroup) Sounds good. This is primary db of ms1 but if things go sideways, I can depool the section. 
[22:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:26:08] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:27:08] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:30:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:36:15] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11231467 (10Aklapper) 05Open→03Stalled What does the load chart in the network tab of your browser's developer tools show? Is there a specific call which takes longer? 
[22:39:56] (03PS1) 10Scott French: deployment_server: set the php.version value in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1192287 [22:40:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:42:22] (03Abandoned) 10Aklapper: Update swagger documentation for CertificateSigningRequestSpec and ResourceClass [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1189894 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [22:43:48] (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1192287 (owner: 10Scott French) [22:45:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:49:10] (03CR) 10RLazarus: [C:03+1] deployment_server: set the php.version value in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1192287 (owner: 10Scott French) [22:57:52] (03CR) 10Scott French: [C:03+2] deployment_server: set the php.version value in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1192287 (owner: 10Scott French) [22:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over 
threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:01:08] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:02:08] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192276 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:02:28] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11231516 (10Prototyperspective) * Right now the load quickly again but the whole day it was slow and it was often like the past days. I'll update this with info on which call takes long to l... 
[23:03:15] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192276 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:03:50] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1192276|Disable wmgUseMdotRouting on Wikisource (T403510)]] [23:03:57] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:05:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:10:41] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1192276|Disable wmgUseMdotRouting on Wikisource (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:10:47] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:11:04] (03PS10) 10Krinkle: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [23:12:49] !log krinkle@deploy2002 krinkle: Continuing with sync [23:14:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm [23:14:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007... 
[23:17:52] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192276|Disable wmgUseMdotRouting on Wikisource (T403510)]] (duration: 14m 01s) [23:17:59] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:20:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:22:34] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192277 (https://phabricator.wikimedia.org/T403510) [23:22:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192277 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:23:31] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192277 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:24:07] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1192277|Disable wmgUseMdotRouting on Wiktionary (T403510)]] [23:24:13] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:25:34] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192278 (https://phabricator.wikimedia.org/T403510) [23:25:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:26:40] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:28:59] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1192277|Disable wmgUseMdotRouting on Wiktionary (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:29:26] RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:31:25] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:27] !log krinkle@deploy2002 krinkle: Continuing with sync [23:33:36] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:36:34] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192277|Disable wmgUseMdotRouting on Wiktionary (T403510)]] (duration: 12m 27s) [23:36:41] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:37:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231646 (10Jclark-ctr) @Papaul I continue to have issues with this not passing puppet. Can you assist with this one? [23:38:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192643 [23:38:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192643 (owner: 10TrainBranchBot) [23:39:34] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:44:52] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:45:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192278 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:46:17] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Wikidata
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192278 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:46:52] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1192278|Disable wmgUseMdotRouting on Wikidata (T403510)]] [23:46:59] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:49:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:50:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231670 (10Papaul) The node sent the puppet request to the wrong puppet master. I cleaned it up, you can re-run the cookbook with the --no-pxe flag ` pt1979@puppetmaster1001:~$ su... [23:53:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192643 (owner: 10TrainBranchBot) [23:53:58] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1192278|Disable wmgUseMdotRouting on Wikidata (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:54:05] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:54:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [23:54:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:54:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231676 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [23:55:19] !log krinkle@deploy2002 krinkle: Continuing with sync [23:59:06] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:59:26] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11231682 (10EBomani) Thank you @Dzahn! :)