[00:01:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:03:07] (PS1) Dzahn: zuul::main: set tls_truststore for zookeeper to the copy it owns [puppet] - https://gerrit.wikimedia.org/r/1244021 (https://phabricator.wikimedia.org/T395938)
[00:03:33] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:04:10] (PS1) Ryan Kemper: wdqs: Separate deadlock remediation config from script [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[00:04:12] (PS1) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453)
[00:04:30] (CR) Ryan Kemper: "Uploaded https://gerrit.wikimedia.org/r/c/operations/puppet/+/1244023 to address the review" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:05:53] (CR) Dzahn: [C:+2] zuul::main: set tls_truststore for zookeeper to the copy it owns [puppet] - https://gerrit.wikimedia.org/r/1244021 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:06:30] (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config from script [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:06:59] (CR) CI reject: [V:-1] wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:20:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:45] (PS1) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[00:21:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:22:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:24:24] (PS2) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[00:25:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:28] (CR) BryanDavis: "Thanks for this" [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[00:27:53] ops-eqiad, DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418429 (phaultfinder) NEW
[00:31:23] (PS1) Scardenasmolinar: Deploy PersonalDashboard to new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665)
[00:36:55] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:10] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:53] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055
[00:38:53] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055 (owner: TrainBranchBot)
[00:41:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:41:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[00:43:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:46:52] SRE, Infrastructure-Foundations, netops, Traffic, Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11652803 (AlexisJazz) Open→Resolved
[00:48:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:50:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055 (owner: TrainBranchBot)
[00:51:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[00:52:16] (CR) Cwhite: [C:+1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1243813 (owner: Muehlenhoff)
[00:53:11] (PS2) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[00:53:12] (PS1) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453)
[00:55:35] (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[01:04:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:04:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: DDesouza)
[01:04:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: DDesouza)
[01:05:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:58] (PS2) Scott French: P:cache::haproxy: fix non-default scope key structure [puppet] - https://gerrit.wikimedia.org/r/1244061
[01:09:08] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073
[01:09:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073 (owner: TrainBranchBot)
[01:09:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:10:55] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:11] (CR) Scott French: [V:+2 C:+2] "I think (or at least how I was interpreting it) "rewriting it in Python" was meant as a measure of complexity, rather than a concrete reco" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[01:19:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:21:13] (PS1) Zabe: typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083
[01:23:06] (CR) Zabe: [C:+2] typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083 (owner: Zabe)
[01:23:59] (Merged) jenkins-bot: typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083 (owner: Zabe)
[01:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:26:59] (PS3) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[01:26:59] (PS2) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453)
[01:27:51] (PS3) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[01:29:19] (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[01:34:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:36:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073 (owner: TrainBranchBot)
[01:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:42:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:43:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:47:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:49:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:56:25] (Abandoned) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[01:58:44] (PS4) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[01:58:44] (PS2) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453)
[02:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:00:45] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:01:12] (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[02:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:07:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:54] ops-eqiad, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418433 (phaultfinder) NEW
[02:13:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:13:58] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 12s)
[02:23:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:26:34] (CR) Dwisehaupt: [C:+1] "IPs match the hosts. shipit." [puppet] - https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) (owner: Jgreen)
[02:33:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:43] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:58:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:01:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:01:41] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:05:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:05:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:08:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:08:39] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:14:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:23:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:32:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:37:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:40:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:40:56] SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11652966 (Papaul) @ayounsi The new diagram is now on Wikitech. For the Wikimedia Amsterdam DCs, IP layer do you want for us to update it or just delete it.
[03:45:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:47:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:50:40] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:55:40] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:57:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:02:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:05:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:06:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:07:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:10:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:10:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:12:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:13:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:17:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:17:27] RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:19:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:21:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:22:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:22:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:24:39] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:25:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:26:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:29:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:30:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:39:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:40:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:44:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:44:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:03] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[04:47:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:48:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:50:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:51:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:53:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:54:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:03] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[04:58:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:59:17] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:00:51] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439 (Papaul) NEW
[05:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:01:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[05:03:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:04:40] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:10:53] dr0ptp4kt and I will be investigating that flapping ErrorBudgetBurn alert today ^
[05:10:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:11:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:14:40] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:15:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:19:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:05] PROBLEM - Bird Internet Routing Daemon on cephosd1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [05:20:43] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:21:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:22:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T415786)', diff saved to https://phabricator.wikimedia.org/P89028 and previous config saved to /var/cache/conftool/dbconfig/20260226-052230-marostegui.json [05:22:36] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [05:29:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:37:16] (03CR) 10Giuseppe Lavagetto: [C:03+1] P:cache::haproxy: fix non-default scope key structure [puppet] - 10https://gerrit.wikimedia.org/r/1244061 (owner: 10Scott French) [05:37:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89029 and previous config saved to /var/cache/conftool/dbconfig/20260226-053739-marostegui.json [05:37:59] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:38:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:39:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:39:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but 
pooled https://wikitech.wikimedia.org/wiki/PyBal [05:40:21] (03PS1) 101F616EMO: Wiping accountcreator from zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T303578) [05:40:39] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:41:11] (03PS2) 101F616EMO: Wiping accountcreator from zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) [05:45:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:45:41] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:48:39] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:48:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:49:17] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:49:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:52:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89030 and previous config saved to /var/cache/conftool/dbconfig/20260226-055246-marostegui.json [05:55:22] (03CR) 10Marostegui: [C:03+2] data.yaml: Add ssh-key for the bkup token. [puppet] - 10https://gerrit.wikimedia.org/r/1243889 (owner: 10Marostegui) [05:55:31] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243813 (owner: 10Muehlenhoff) [06:01:23] (03PS1) 10Muehlenhoff: Record LDAP access for hahmed [puppet] - 10https://gerrit.wikimedia.org/r/1244392 [06:02:16] (03CR) 10Marostegui: [C:03+2] mariadb: Add monitor_heartbeat to core hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [06:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:04:17] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for hahmed [puppet] - 10https://gerrit.wikimedia.org/r/1244392 (owner: 10Muehlenhoff) [06:07:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T415786)', diff saved to https://phabricator.wikimedia.org/P89031 and previous config saved to /var/cache/conftool/dbconfig/20260226-060755-marostegui.json [06:08:00] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:08:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1263.eqiad.wmnet with reason: Maintenance [06:08:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89032 and previous config saved to /var/cache/conftool/dbconfig/20260226-060809-marostegui.json [06:08:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:11:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:12:13] (03PS4) 10Muehlenhoff: pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) [06:12:47] (03CR) 10CI reject: [V:04-1] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 
10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [06:14:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:16:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:16:50] !log updated thirdparty/node22 to node 20.20.0 [06:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:42] (03PS5) 10Muehlenhoff: pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) [06:19:17] FIRING: [9x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:20:41] 06SRE, 06Infrastructure-Foundations, 10netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11653051 (10ayounsi) Updating them would be great. 
Thanks [06:21:22] 06SRE: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440 (10MoritzMuehlenhoff) 03NEW [06:21:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:21:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [06:23:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [06:26:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:26:23] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:28:46] (03PS1) 10Marostegui: dbproxy1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1244428 (https://phabricator.wikimedia.org/T414656) [06:29:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:20] (03CR) 10Marostegui: [C:03+2] dbproxy1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1244428 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [06:34:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:34:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:23] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:38:57] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:39:17] FIRING: [11x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:44:17] FIRING: [11x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:40] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[06:45:40] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:46:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:47:24] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:48:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:49:17] FIRING: [13x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:50:06] RECOVERY - Bird Internet Routing Daemon on cephosd1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [06:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:53:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - 
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [06:53:48] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:53:57] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:54:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:57:24] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:59:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0700). 
[07:00:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:02:24] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [07:04:17] FIRING: [7x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:05:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:07:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:09:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:09:43] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/1 (Transport: cr1-codfw:xe-1/1/1:0 (Lumen, 442550294) {#1065}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:12:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:12:24] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [07:13:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:14:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:28:10] (03PS1) 10Muehlenhoff: Remove access for hokwelum [puppet] - 10https://gerrit.wikimedia.org/r/1244459 [07:28:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11653142 (10Aklapper) @DTotten-WMF: Hi and welcome! Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/extern... 
[07:28:55] (03CR) 10CI reject: [V:04-1] Remove access for hokwelum [puppet] - 10https://gerrit.wikimedia.org/r/1244459 (owner: 10Muehlenhoff) [07:31:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:34:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:35:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:38:31] (03PS2) 10Muehlenhoff: Remove access for hokwelum [puppet] - 10https://gerrit.wikimedia.org/r/1244459 [07:39:17] FIRING: [12x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:35] !log installing openssl security updates [07:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:17] FIRING: [14x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:48] (03CR) 10Muehlenhoff: [C:03+2] Remove access for hokwelum [puppet] - 10https://gerrit.wikimedia.org/r/1244459 (owner: 10Muehlenhoff) [07:48:46] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - 
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [07:50:42] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hokwelum out of all services on: 2432 hosts [07:58:22] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0800) [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:10:39] (03CR) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [08:10:55] (03PS4) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [08:25:39] (03CR) 10Gehel: "Minor comments inline. Mostly suggestions, you can ignore them as you want." [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:26:36] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [08:27:24] (03CR) 10Gehel: "It looks that a bunch of changes from the parent commit have ended up in this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:31:45] (03CR) 10Anzx: [C:03+1] "thanks for creating patch for removing usergroup, please schedule this for deploying through https://schedule-deployment.toolforge.org/bac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [08:41:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:44:33] (03CR) 10Aklapper: [C:04-1] "Thanks so much, sorry this takes me a while. I ran both extract and generate locally before and after." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery) [08:49:43] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:58:22] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:58:34] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.8 [puppet] - 10https://gerrit.wikimedia.org/r/1244570 (https://phabricator.wikimedia.org/T418448) [09:00:23] !log mvernon@cumin2002 START - Cookbook 
sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe [09:00:58] !log restart FPM on Phabricator hosts to pick up OpenSSL updates [09:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:01:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1028.eqiad.wmnet with OS trixie [09:01:53] !log mvernon@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [09:04:10] (03CR) 10Elukey: [C:03+1] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [09:05:52] !log mvernon@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [09:06:38] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=1) rolling restart_daemons on A:swift-fe [09:08:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:09:21] (03PS3) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) [09:09:23] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [09:09:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [09:10:06] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:10:13] (03PS1) 10Vgutierrez: traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 [09:10:39] (03CR) 101F616EMO: "Scheduled to this afternoon." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [09:11:48] (03PS1) 10Slyngshede: P:idp release givenName on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214) [09:11:48] (03PS1) 10Hashar: wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424) [09:11:50] (03PS1) 10Hashar: wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424) [09:11:50] (03CR) 10CI reject: [V:04-1] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez) [09:12:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:13:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage [09:13:48] (03CR) 10Hashar: [C:03+2] wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 
10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar) [09:13:53] (03CR) 10Hashar: [C:03+2] wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar) [09:14:12] (03PS1) 10Brouberol: deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) [09:14:24] (03Merged) 10jenkins-bot: wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar) [09:14:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:26] (03Merged) 10jenkins-bot: wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar) [09:15:05] !log hashar@deploy2002 Started deploy [gerrit/gerrit@74473c2]: wm-checks-api: add Rerun command for codehealth + inline documentation [09:15:19] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@74473c2]: wm-checks-api: add Rerun command for codehealth + inline documentation (duration: 00m 14s) [09:16:28] (03PS2) 10Vgutierrez: traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 [09:16:35] (03PS2) 10Brouberol: deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) [09:17:57] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [09:18:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage [09:19:15] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8153/co" [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol) [09:19:48] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.8 [puppet] - 10https://gerrit.wikimedia.org/r/1244570 (https://phabricator.wikimedia.org/T418448) (owner: 10Jelto) [09:21:06] (03PS4) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) [09:21:40] (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [09:22:03] (03PS5) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) [09:22:13] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/resource_config: add resource_title param [puppet] - 10https://gerrit.wikimedia.org/r/1243898 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [09:22:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:22:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, the deployment servers are still on Bullseye but kubetail is already packaged for it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol) [09:22:26] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/ops: monitor thanos store instances with resouce_config [puppet] - 10https://gerrit.wikimedia.org/r/1243899 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [09:22:35] (03CR) 10Brouberol: [V:03+1 C:03+2] deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol) [09:22:35] (03CR) 10JMeybohm: [C:03+1] deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol) [09:23:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:24:05] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:24:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:28:59] (03CR) 10Elukey: [C:03+2] docker_registry: remove the /test prefix special handling [puppet] - 10https://gerrit.wikimedia.org/r/1243726 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [09:30:55] (03CR) 10Fabfur: [C:03+1] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez) [09:31:15] jouncebot: nowandnext [09:31:15] 
No deployments scheduled for the next 1 hour(s) and 28 minute(s) [09:31:15] In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1100) [09:32:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:34:39] (03PS1) 101F616EMO: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) [09:34:43] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:35:07] (03CR) 10Vgutierrez: [C:03+2] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez) [09:35:45] (03CR) 10Volans: [C:03+2] "Tested on toolsbeta static" [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:38:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:39:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:39:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:39:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:41:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1028.eqiad.wmnet with OS trixie [09:42:32] hello deployers! I am going to make a change to the Docker Registry to route MediaWiki docker images to a new (hopefully faster and more reliable) backend [09:42:50] I don't see anything ongoing, but please ping me if you need to deploy MW [09:42:57] (03CR) 10Elukey: [C:03+2] docker_registry: move the /v2/restricted prefix to s3/apus [puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [09:43:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:43:50] !log urbanecm@deploy2002 mwscript-k8s job started: foreachwikiindblist growthexperiments WikimediaMaintenance:createExtensionTables.php growthexperiments [09:44:25] !log jmm@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=pki,name=codfw [09:46:04] (03PS11) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [09:46:04] (03PS1) 10Tiziano Fogli: prometheus::resource_config: replace $facts['site'] with $::site [puppet] - 10https://gerrit.wikimedia.org/r/1244592 (https://phabricator.wikimedia.org/T412924) [09:47:21] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::resource_config: replace $facts['site'] with $::site [puppet] - 10https://gerrit.wikimedia.org/r/1244592 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [09:47:50] 
!log move the Docker Registry's /v2/restricted (MediaWiki Docker image prefix) to s3/apus - T390251 [09:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:54] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [09:48:23] PROBLEM - Bird Internet Routing Daemon on cephosd1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:48:47] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:23] (03CR) 10Muehlenhoff: [C:03+2] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [09:51:18] !log elukey@deploy2002 Started scap sync-world: Test new Docker Registry backend [09:53:23] PROBLEM - Bird Internet Routing Daemon on cephosd1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:53:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:53:47] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:54:23] PROBLEM - Bird Internet Routing Daemon on cephosd1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:54:43] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:47] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:57:27] !log jmm@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=pki,name=codfw [09:58:22] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:58:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:00:47] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:01:11] PROBLEM - Bird Internet Routing Daemon on cephosd1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:02:04] (03PS12) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [10:02:04] (03PS1) 10Tiziano Fogli: prometheus::ops/thanos_store: replace port param with port_parameter [puppet] - 10https://gerrit.wikimedia.org/r/1244595 (https://phabricator.wikimedia.org/T412924) [10:03:47] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:04:09] PROBLEM - Bird Internet Routing Daemon on cephosd1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:05:51] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::ops/thanos_store: replace port param with 
port_parameter [puppet] - 10https://gerrit.wikimedia.org/r/1244595 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [10:08:23] RECOVERY - Bird Internet Routing Daemon on cephosd1004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:08:47] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:10:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:11:23] RECOVERY - Bird Internet Routing Daemon on cephosd1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:11:47] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:11:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [10:12:47] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:13:11] RECOVERY - Bird Internet Routing Daemon on cephosd1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:13:47] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:14:23] RECOVERY - Bird Internet Routing Daemon on cephosd1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:15:25] 
FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:47] (03PS1) 10Fabfur: hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) [10:17:02] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418433#11653532 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [10:18:22] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:18:23] (03PS2) 10Eevans: admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) [10:18:30] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:18:34] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [10:19:09] RECOVERY - Bird Internet Routing Daemon on cephosd1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:19:47] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:20:12] (03CR) 10Clément Goubert: [C:03+2] admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) 
[10:20:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [10:25:36] (03PS1) 10Muehlenhoff: Add DB grant for pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) [10:26:27] (03CR) 10Muehlenhoff: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [10:26:32] (03CR) 10Vgutierrez: [C:03+1] hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:26:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:26:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:54] (03Merged) 10jenkins-bot: admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [10:29:26] (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [10:30:24] !log ammarpad@deploy2002 
mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=wikidatawiki --logwiki=metawiki 'Luftlewis 1' 'Renamed user 4f8e749b4f28ee9e6ebc680c8c3c943d' # T418435 [10:30:29] T418435: Unblock stuck global rename of Renamed user 4f8e749b4f28ee9e6ebc680c8c3c943d - https://phabricator.wikimedia.org/T418435 [10:30:41] (03CR) 10Fabfur: [C:03+2] hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:31:55] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:32:23] !log elukey@deploy2002 Finished scap sync-world: Test new Docker Registry backend (duration: 43m 02s) [10:33:04] !log depooling cp7001 to upgrade haproxy (T417253) [10:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:08] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [10:33:17] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7001.* [10:34:17] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp7001*} and A:cp - 3.0 upgrade () [10:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:38:46] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [10:39:11] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp7001*} and A:cp - 3.0 upgrade () [10:39:52] 
!log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:39:55] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.* [10:40:43] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:40:57] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:41:43] (03PS1) 10Jelto: aptrepo: update gitlab-runner-helper-image architecture to all [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) [10:41:45] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:41:46] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [10:43:06] (03CR) 10Jelto: "This config finds the new package (manually tested with `checkupdate` on `apt1001`). The helper-images package is released as "all". Is th" [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto) [10:43:10] (03CR) 10Muehlenhoff: [C:03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto) [10:43:20] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:43:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [10:44:25] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [10:44:54] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11653669 (10elukey) The /v2/restricted prefix is again served by S3, nothing to report when pushing the new... [10:47:40] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[10:47:55] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:48:28] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:49:23] (03CR) 10Jelto: [C:03+2] aptrepo: update gitlab-runner-helper-image architecture to all [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto) [10:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:57:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1100) [11:00:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:01:18] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:cloudelastic [11:01:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [11:03:14] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [11:04:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:cloudelastic [11:07:28] !log 
jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003" [11:07:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003" [11:07:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:08:11] (03CR) 10Muehlenhoff: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [11:08:22] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:08:40] (03CR) 10Hnowlan: [C:03+1] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [11:08:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:08:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:09:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:28] jclark@cumin1003 provision (PID 2197974) is awaiting input [11:11:29] jclark@cumin1003 provision (PID 2198026) is 
awaiting input [11:11:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [11:12:50] (03CR) 10Aklapper: "From a quick read this looks good, found just two small typos." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) (owner: 10Pppery) [11:13:08] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-fe1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:13:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:13:23] (03CR) 10Clément Goubert: [C:03+2] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [11:14:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653766 (10Jclark-ctr) a:03Jclark-ctr [11:14:31] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244601 [11:14:32] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244602 [11:15:10] (03Merged) 10jenkins-bot: Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [11:17:20] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: running tests on titan1002 [11:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:25] T412924: 
Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [11:17:43] (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [11:17:58] (03PS2) 10Vgutierrez: admin: Add mikez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) [11:19:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) (owner: 10Vgutierrez) [11:19:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653777 (10Jclark-ctr) a:03Jclark-ctr [11:19:32] (03CR) 10Vgutierrez: [C:03+2] admin: Add mikez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) (owner: 10Vgutierrez) [11:19:44] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:19:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:19:56] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:20:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653783 (10Jclark-ctr) @jcrespo Preseed yaml file is not setup for efi booting Can you update? to have for these servers? - partman/standard-efi.cfg - partm... 
[11:21:40] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [11:21:58] (03CR) 10Tiziano Fogli: [C:03+2] Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [11:22:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:23:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:23:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:23:41] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:23:45] jclark@cumin1003 provision (PID 2197974) is awaiting input [11:24:04] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw [11:24:17] FIRING: [20x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:25:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw [11:25:56] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling 
restart_daemons on A:ldap-replicas-eqiad [11:26:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad [11:28:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653820 (10jcrespo) Will, do sorry, these should use standard recipes, so it should be easy to update. [11:28:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:28:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653822 (10jcrespo) [11:29:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:30:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11653828 (10jcrespo) [11:31:39] (03PS1) 10Vgutierrez: traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605 [11:34:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:36:59] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [11:39:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:42:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214) (owner: 10Slyngshede) [11:43:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:44:17] FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:09] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [11:46:33] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [11:49:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:50:27] (03CR) 10Slyngshede: [C:03+2] P:idp release givenName on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214) (owner: 10Slyngshede) [11:51:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653940 (10Jclark-ctr) [11:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:51:19] PROBLEM - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [11:51:38] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [11:52:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11653943 (10Vgutierrez) [11:52:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11653945 (10Vgutierrez) 05Open→03Resolved change has been merged, and it should be live by now [11:52:58] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [11:54:05] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-fe1004.eqiad.wmnet with OS bookworm [11:54:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host apus-fe1004.eqiad.wmnet with OS bookworm [11:55:06] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-fe1005.eqiad.wmnet with OS bookworm [11:55:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host apus-fe1005.eqiad.wmnet with OS bookworm [11:55:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:22] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [11:56:43] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [12:00:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:01:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [12:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:08:19] (03CR) 10Marostegui: "Can I merge this myself? 
So I combine it with the actual grant deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [12:09:33] (03CR) 10Hashar: [C:04-1] "I forgot, I earlier proposed a similar patch but for the Apache > Jetty connection: I9e5167f9c9c2f346d314cb7c3bf410209b1dffce" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [12:11:06] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1004.eqiad.wmnet with reason: host reimage [12:12:35] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1005.eqiad.wmnet with reason: host reimage [12:13:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1004.eqiad.wmnet with reason: host reimage [12:13:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1096.eqiad.wmnet with OS bullseye [12:14:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-be1096.eqiad.wmnet with OS bullseye [12:15:40] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1097.eqiad.wmnet with OS bullseye [12:15:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654021 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye [12:16:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [12:17:52] !log jclark@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1005.eqiad.wmnet with reason: host reimage [12:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:23:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... [12:23:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:24:34] !ack [12:24:34] 7489 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad) [12:25:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit 
wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [12:29:06] (03PS1) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) [12:29:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar) [12:30:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654092 (10Jelto) [12:34:35] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:34:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654094 (10Jelto) [12:34:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:34:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1004.eqiad.wmnet with OS bookworm [12:35:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - 
https://phabricator.wikimedia.org/T416386#11654095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host apus-fe1004.eqiad.wmnet with OS bookworm completed: - apus... [12:35:42] (03CR) 10Anzx: [C:03+1] "recheck, looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [12:35:55] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1016.eqiad.wmnet with OS trixie [12:36:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie [12:38:25] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage [12:38:40] (03CR) 10Volans: locking: Add a mechanism for a global Spicerack lock. 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [12:38:47] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:38:53] (03PS1) 10AikoChou: ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 [12:41:51] jclark@cumin1003 reimage (PID 2246309) is awaiting input [12:42:10] (03PS1) 10AikoChou: httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633 [12:42:49] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:43:17] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:43:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage [12:43:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... 
[12:43:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:43:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:44:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:46:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:46:21] (03PS1) 10Urbanecm: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) [12:46:34] (03PS1) 10Urbanecm: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) [12:46:49] (03CR) 10Dpogorzelski: [C:03+1] httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633 (owner: 10AikoChou) [12:46:49] jouncebot: nowandnext [12:46:49] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [12:46:49] In 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300) [12:46:51] (03PS2) 10Jsn.sherman: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 
(https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar) [12:47:02] (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:47:05] (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:47:17] (03CR) 10Dpogorzelski: [C:03+1] ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou) [12:47:56] (03CR) 10AikoChou: [C:03+2] ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou) [12:48:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:48:55] (03CR) 10Dpogorzelski: [C:03+2] httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633 (owner: 10AikoChou) [12:49:58] (03Merged) 10jenkins-bot: ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou) [12:50:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654131 (10Jelto) p:05Triage→03Medium [12:51:02] !ack [12:51:02] no value provided for parameter incident and no default available [12:51:02] All incidents are already acked. 
[12:51:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [12:51:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:51:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:52:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654136 (10Jelto) Hi, thanks for opening the access request. I think the only missing approval i... [12:53:14] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:54:11] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[12:54:14] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [12:54:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:54:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1005.eqiad.wmnet with OS bookworm [12:55:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host apus-fe1005.eqiad.wmnet with OS bookworm completed: - apus... [12:56:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:56:55] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:56] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097 [12:57:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097 [12:58:42] (03PS3) 10Jsn.sherman: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar) [12:59:38] (03Merged) 10jenkins-bot: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:59:58] (03Merged) 10jenkins-bot: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300) [13:00:07] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:01:07] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] [13:01:07] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116418 bytes in 0.531 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:01:11] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [13:01:38] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:03:27] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [13:03:50] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [13:05:13] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:05:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:05:19] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003" [13:05:55] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:08:24] jclark@cumin1003 netbox (PID 2287740) is awaiting input [13:08:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... [13:08:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [13:08:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003" [13:08:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:08:55] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:08:56] !ack [13:08:57] 7490 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad) [13:09:12] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:09:16] (03PS1) 10Majavah: toolforge: k8s: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1244643 [13:09:16] (03PS1) 10Majavah: toolforge: k8s: Allow observers to read Gateway API resources [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276) [13:10:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:10:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1096.eqiad.wmnet with OS bullseye [13:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:10:30] (03PS1) 10AikoChou: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244645 [13:10:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-be1096.eqiad.wmnet with OS bullseye completed: - ms-be1096 (*... 
[13:10:42] (03PS1) 10Dpogorzelski: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646 [13:10:55] (03CR) 10Dpogorzelski: [C:03+2] Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646 (owner: 10Dpogorzelski) [13:10:58] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646 (owner: 10Dpogorzelski) [13:11:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:11:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:11:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89034 and previous config saved to /var/cache/conftool/dbconfig/20260226-131147-marostegui.json [13:11:52] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:12:06] (03Abandoned) 10AikoChou: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244645 (owner: 10AikoChou) [13:12:07] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] (duration: 11m 00s) [13:12:11] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [13:12:16] FIRING: 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:13:01] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:13:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:13:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654228 (10Jclark-ctr) [13:13:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... [13:13:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [13:13:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89035 and previous config saved to /var/cache/conftool/dbconfig/20260226-131357-marostegui.json [13:14:03] (03CR) 10Blake: locking: Add a mechanism for a global Spicerack lock. 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [13:14:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654242 (10Jclark-ctr) [13:14:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654243 (10Jclark-ctr) 05Open→03Resolved [13:15:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:30] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097 [13:18:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097 [13:19:12] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:38] (03PS1) 10Gergő Tisza: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) [13:20:47] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097 [13:20:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ms-be1097 [13:20:59] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:21:55] (03PS1) 10Gergő Tisza: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) [13:22:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:23:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:25:55] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1097.eqiad.wmnet with OS bullseye [13:26:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye executed with errors: - m... [13:29:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P89036 and previous config saved to /var/cache/conftool/dbconfig/20260226-132905-marostegui.json [13:29:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:31:33] (03PS1) 10Marostegui: Revert "dbproxy1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1244655 [13:31:36] (03CR) 10Elukey: "Sure! I was wondering if it is needed since it is the same as pki1001's afaics, but please go ahead :)" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [13:32:43] (03CR) 10Marostegui: "I didn't realise it is exactly the same username, in that case it shouldn't be needed, you are right." 
[puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [13:33:09] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1244655 (owner: 10Marostegui) [13:34:04] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1017.eqiad.wmnet with OS trixie [13:34:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie [13:35:26] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1018.eqiad.wmnet with OS trixie [13:35:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1019.eqiad.wmnet with OS trixie [13:35:36] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie [13:35:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1019.eqiad.wmnet with OS trixie [13:35:46] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276) (owner: 10Majavah) [13:36:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:36:19] (03CR) 10Majavah: [C:03+2] toolforge: k8s: Remove absented resources [puppet] - 
10https://gerrit.wikimedia.org/r/1244643 (owner: 10Majavah) [13:36:28] (03CR) 10Majavah: [C:03+2] toolforge: k8s: Allow observers to read Gateway API resources [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276) (owner: 10Majavah) [13:37:32] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244643 (owner: 10Majavah) [13:38:15] !log fetch haproxy 3.0.17 on thirdparty/haproxy30 bullseye-wikimedia (apt.wm.o) [13:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:18] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage [13:40:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:41:31] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[7001,7009].*} and A:cp - 3.0.17 upgrade (T417253) [13:41:35] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [13:42:05] I am doing a change on CI to have Wikibase Selenium tests to run in an independent job T287582 [13:42:06] T287582: Move some Wikibase selenium tests to a standalone job - https://phabricator.wikimedia.org/T287582 [13:43:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage [13:43:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:44:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P89038 and previous config saved to 
/var/cache/conftool/dbconfig/20260226-134414-marostegui.json [13:44:33] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:46:23] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1020.eqiad.wmnet with OS trixie [13:46:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1020.eqiad.wmnet with OS trixie [13:48:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:50:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:09] jouncebot: nowandnext [13:51:09] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300) [13:51:09] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1400) [13:51:59] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1016.eqiad.wmnet with OS trixie [13:52:08] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie executed with errors: - backup1... 
[13:52:09] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage [13:52:12] (03PS1) 10AikoChou: httpbb: fix the revscoring-editquality-goodfaith test [puppet] - 10https://gerrit.wikimedia.org/r/1244659 [13:52:39] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage [13:52:54] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1016.eqiad.wmnet with OS trixie [13:52:57] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[7001,7009].*} and A:cp - 3.0.17 upgrade (T417253) [13:53:02] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [13:53:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie [13:53:24] !log jclark@cumin1003 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage [13:53:37] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1018.eqiad.wmnet with OS trixie [13:53:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie executed with errors: - backup1... 
[13:54:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:54:20] !log Deploy schema change on x1 on the master with replication enable T418480 [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:25] T418480: Drop default for sic_updated_timestamp and drop indexes on sic_created_timestamp in the cusi_case table on WMF wikis - https://phabricator.wikimedia.org/T418480 [13:55:14] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418429#11654424 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [13:55:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:34] Hi and nice to meet you all, I'm 1F616EMO and is here for the two patches regarding T418089 (1244373, 1244591). This is my first time dealing with a backport window, so if I have done something wrong, please tell me and I will learn from them. I am ready and will be available for the whole duration of the window. [13:55:34] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [13:58:57] Hi Tran [13:59:02] o/ [13:59:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage [13:59:15] (03CR) 10Volans: locking: Add a mechanism for a global Spicerack lock. 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [13:59:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:59:19] This is my first time on a backport window. If I have done something wrong, please tell me and I will learn from them. [13:59:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89039 and previous config saved to /var/cache/conftool/dbconfig/20260226-135922-marostegui.json [13:59:27] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:59:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:59:43] (and I am already submitting two patches, but no worries, small ones) [13:59:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89040 and previous config saved to /var/cache/conftool/dbconfig/20260226-135946-marostegui.json [13:59:59] nw [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1400) [14:00:05] Tran, itamarWMDE, nya_1F616EMO, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:12] \o [14:00:14] o/ [14:00:19] o/ [14:00:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:26] (03PS1) 10Lucas Werkmeister (WMDE): Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482) [14:00:28] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482 (10Lucas_Werkmeister_WMDE) 03NEW [14:00:41] o/ per ^ I can’t deploy today (but hopefully soon ^^) [14:00:58] hi nya_1F616EMO :) [14:01:19] (03CR) 10Lucas Werkmeister (WMDE): "Follow-up: T418482" [puppet] - 10https://gerrit.wikimedia.org/r/1239955 (owner: 10Lucas Werkmeister (WMDE)) [14:01:37] hi, i can deploy today [14:01:37] If no deployers are around in this window, I will reschedule mine to Monday, March 02 UTC afternoon [14:01:43] thanks urbanecm! [14:01:48] Thanks urbanecm [14:01:57] (03PS3) 10STran: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) [14:01:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89041 and previous config saved to /var/cache/conftool/dbconfig/20260226-140157-marostegui.json [14:02:01] (03CR) 10Urbanecm: [C:03+2] Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) (owner: 10STran) [14:02:14] Are we following the order shown on the wiki page? 
[14:02:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654464 (10karapayneWMDE) Approved on my end [14:02:20] usually [14:02:25] right now, seems so [14:02:25] Nice [14:02:46] I sometimes let volunteer patches take priority over staff ones fwiw ^^ [14:02:51] but up to the deployer [14:02:56] (03Merged) 10jenkins-bot: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) (owner: 10STran) [14:03:21] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1020.eqiad.wmnet with reason: host reimage [14:04:24] (03PS2) 101F616EMO: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) [14:04:28] (03CR) 10Urbanecm: [C:03+2] zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:04:34] itamarWMDE: hi, here as well? [14:05:21] (03Merged) 10jenkins-bot: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:05:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:13] nya_1F616EMO: hi, do you know if `accountcreator` is unset from any other wiki? 
i'm unsure about outright dropping it, because if someone attempts to set it from metawiki, MW will behave in a very crazy way (without telling the user). i'd be ok with removing everyone's ability to manipulate it, but dropping it might have side effects that i'd like to discuss first [14:06:44] urbanecm: No, accountcreator is there on all wikis; but we do have a precedent for removing rights on loginwiki [14:07:03] The if-block right above the zhwiki if-block [14:07:23] yeah, loginwiki is very special, i was looking for a content project. in that case, i'm not going to deploy that patch today, because i'm unsure about its side effects (i'll detail it more on task). [14:07:28] Okay [14:07:39] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage [14:07:41] it's up to you if you want to upload a "remove everyone's ability to add/remove people from it" patch (that i would be OK with even today) [14:07:58] otherwise, we can delay a few days until the side effects can be clarified. [14:07:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:08:07] (03CR) 10Fabfur: [C:03+1] traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605 (owner: 10Vgutierrez) [14:08:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654507 (10Jelto) [14:08:10] I can't write a patch in such a short period, I will do it on Tuesday [14:08:13] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244666 [14:08:16] ok, no problem. 
[14:08:27] (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 [14:08:27] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]] [14:08:34] T413951: Deprecate v1 non emergency flow for IRS - https://phabricator.wikimedia.org/T413951 [14:08:34] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [14:08:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1020.eqiad.wmnet with reason: host reimage [14:08:39] (03PS1) 10Jcrespo: installserver: Migrate ms-backup hosts to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) [14:08:44] (03CR) 10Vgutierrez: [C:03+2] traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605 (owner: 10Vgutierrez) [14:08:54] Thanks for your rejection - rejecting at the right time is crucial to improvements. [14:08:58] (03CR) 10Urbanecm: [C:04-1] "I'm unsure about the effects of this for user rights management from metawiki. I'll write more details on the task, but I'm not comfortabl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:09:19] nya_1F616EMO: it's more of a "let's clarify the impacts" rather than a rejection. thank you for your understanding. 
[14:10:30] !log urbanecm@deploy2002 stran, urbanecm, 1f616emo: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:40] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:55] Tran: nya_1F616EMO: can you verify your patches on mwdebug, please? [14:10:56] Btw, just figured out that I might have forgotten to remove sysop's ability to add this group to users. [14:11:01] testing now [14:11:08] urbanecm: working on it [14:11:12] ty [14:11:18] urbanecm: Working [14:11:27] thanks for the confirmation [14:11:28] ^ I mean, it's ok [14:11:30] (03PS1) 10Slyngshede: P:idp map family_name to SN [puppet] - 10https://gerrit.wikimedia.org/r/1244670 (https://phabricator.wikimedia.org/T338214) [14:11:36] (03PS2) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [14:11:36] (03CR) 10Federico Ceratto: "This can be tested with the next clone, it's a small change." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [14:11:39] i understood [14:12:05] (03CR) 10Federico Ceratto: "(the implementation on the Zarcillo side is done)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [14:12:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage [14:12:08] mine's working [14:12:10] (03CR) 10Marostegui: "If these hosts already exist I believe you have to run a cookbook to migrate them to EFI" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:12:16] !log urbanecm@deploy2002 stran, urbanecm, 1f616emo: Continuing with sync [14:12:20] perf, proceeding [14:12:34] it seems itamarWMDE's not around for their patch [14:12:39] (03CR) 101F616EMO: "Mark this as unresolved until we know for sure what will happen at a global level." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:12:48] (03CR) 10Jcrespo: "It is for new hosts, I don't care much about the existing ones, will be decommissioned soon." [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:12:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654531 (10Lucas_Werkmeister_WMDE) FTR, I wrote down some notes on how I created this SSH key: https://wikitech.wikimedia.org/wiki/User:Lucas_... [14:13:18] (03CR) 10Marostegui: "ok then - then you should be good, do they already exist in site.pp etc?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:13:28] (03PS1) 10Krinkle: labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525) [14:14:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654546 (10Jelto) Hi, thank you for the access request. > - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf st... [14:14:47] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:15:31] (03CR) 10Jcrespo: "They should:" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:16:11] (03PS1) 10Urbanecm: [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194) [14:16:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:16:15] (03PS2) 10Krinkle: labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525) [14:16:15] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]] (duration: 07m 48s) [14:16:21] It is so satisfying to observe how patches flow flawlessly without touching or SSH'ing into the hardware; as an amateur server maintainer myself I feel a bit ashamed by the scale of automation [14:16:22] T413951: Deprecate v1 non emergency flow for IRS - https://phabricator.wikimedia.org/T413951 [14:16:22] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [14:16:49] Tran: nya_1F616EMO: patches deployed [14:16:54] thanks! [14:16:57] Thanks :-D and nice to meet you [14:16:59] still no signs of itamarWMDE [14:17:04] nice to meet you too, nya_1F616EMO! 
[14:17:05] (03CR) 10Marostegui: [C:03+1] installserver: Migrate ms-backup hosts to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:17:05] urbanecm: itamarWMDE should be there in a second [14:17:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P89042 and previous config saved to /var/cache/conftool/dbconfig/20260226-141705-marostegui.json [14:17:08] ok [14:17:13] I will forward your concern onto phab and raise other's attention [14:17:16] (03CR) 10Marostegui: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:17:42] (03CR) 10Jcrespo: [C:03+2] "Thank you, that was useful <3, I sometimes forget steps." [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo) [14:17:51] jclark@cumin1003 reimage (PID 2312336) is awaiting input [14:17:58] Dreamy_Jazz: i'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1244673 for me, maybe itmar's patch if they'll be there then and then hand it over to you if that sounds good. 
[14:18:01] (03CR) 10Urbanecm: [C:03+2] [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [14:18:53] (03Merged) 10jenkins-bot: [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [14:19:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:20:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] [14:20:05] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [14:20:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654588 (10jcrespo) @Jclark-ctr I just marged thew new recipe, please give it 30 minutes to propagate, and should be done. Apologies again for the mist... [14:20:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:20:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1017.eqiad.wmnet with OS trixie [14:20:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie completed: - backup1017 (**PASS... 
[14:20:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654595 (10Lucas_Werkmeister_WMDE) (And I also intend to buy a chain tomorrow which will let me attach the YubiKey to my keychain with a bit m... [14:21:17] (03CR) 10Elukey: "Hey Federico! I saw the change passing by, since it changes setup.py I have a couple of questions:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [14:21:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:03] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:22:10] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:14] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11654603 (10Jelto) @MoritzMuehlenhoff or @SLyngshede-WMF this task sounds like an offboarding procedure... [14:22:24] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:22:25] urbanecm: By "setting it from metawiki", which extension or component are you talking about? [14:23:09] nya_1F616EMO: special:userrights. 
if i have enough permissions, i can change your zhwiki permissions from https://meta.wikimedia.org/wiki/Special:UserRights/1F616EMO@zhwiki [14:23:47] Hmm, that page brings me directly to Special:Userrights on zhwiki instead of staying on the metawiki [14:23:48] (03CR) 10Jelto: [C:03+1] "lgtm thank you, but can you also link Bug: T418483 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (owner: 10AOkoth) [14:24:24] urbanecm: I am not a steward but am a local sysop, could that cause the redirection? [14:24:33] nya_1F616EMO: yeah, that's the difference. [14:24:39] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1018.eqiad.wmnet with OS trixie [14:24:46] i see this https://usercontent.irccloud-cdn.com/file/yvHVEe1Z/image.png [14:24:48] Oh, then I will have no chance to observe the interface [14:24:51] (03CR) 10Taiwanese elephant: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:24:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie [14:25:00] the main problem is it displays the list of meta's groups (not from zhwiki) [14:25:07] people assume "account creator" exists everywhere [14:25:23] so, question is, what will happen if a steward attempts to add someone to account creator group on zhwiki (when it doesn't exist)? [14:25:39] urbanecm: We probably need a larger-scale patch that fetches the exact list of rights from the target wiki [14:25:54] nya_1F616EMO: it would be useful, but fairly challenging. i think a task exists on that. [14:26:15] Apologies! I had some connectivity issues, I can also postpone to Monday if I'm too late. 
[14:26:21] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] (duration: 06m 20s) [14:26:27] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [14:26:29] (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244678 (https://phabricator.wikimedia.org/T418483) [14:26:29] itamarWMDE: thanks for the info! is the connection all right now? [14:26:31] urbanecm: IIRC Special:GlobalContributions ran into such a complication when checking TAIV rights, which they have to use caches to solve [14:26:41] nya_1F616EMO: yeah, indeed. [14:26:50] Seems like it, changed locations. [14:26:50] Which is bad if we have to keep a cache just for stewards to grant a few sysop rights [14:26:59] ^ a few checkuser/ish rights [14:27:01] (03CR) 10CI reject: [V:04-1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244678 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth) [14:27:02] (03PS6) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [14:27:31] nya_1F616EMO: let's move the discussion on the implementation to the task. if you can summarize the topic there, that would be appreciated. i can comment there later, too. 
[14:27:44] (03CR) 10Urbanecm: [C:03+2] Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [14:27:44] Okay, see you there soon [14:27:48] thank you [14:27:56] (03PS2) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) [14:28:10] (03CR) 10AOkoth: "Ack." [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth) [14:28:28] (03CR) 10CI reject: [V:04-1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth) [14:28:38] (03Merged) 10jenkins-bot: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [14:28:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:29:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:29:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:29:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:29:46] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:29:49] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242416|Add configurations for graphql usage survey and its pipeline tests (T414476)]] [14:29:54] T414476: 📚 Add QuickSurvey to the dedicate page on Wikidata for GraphQL - https://phabricator.wikimedia.org/T414476 [14:30:10] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on 
wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:13] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:31:52] !log urbanecm@deploy2002 urbanecm, itamar: Backport for [[gerrit:1242416|Add configurations for graphql usage survey and its pipeline tests (T414476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:32:02] itamarWMDE: can you verify your patch on mwdebug, please? [14:32:10] On it [14:32:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P89043 and previous config saved to /var/cache/conftool/dbconfig/20260226-143213-marostegui.json [14:32:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [14:32:23] jclark@cumin1003 reimage (PID 2324829) is awaiting input [14:32:55] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:57] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.update-replication (exit_code=99) [14:33:02] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:33:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:33:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:33:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:33:54] 10ops-codfw, 
06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#11654657 (10Papaul) 05Open→03Resolved CODFW expansion is complete so we can close this task. [14:34:26] itamarWMDE: i'm seeing a lot of `Failed to find wd-graphql-quick-survey-yes (en)` and similar in logs. not sure if expected. [14:34:28] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:34:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:34:43] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:34:46] expected [14:34:58] which domain are you seeing it for? test wikidata? [14:35:11] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:35:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:35:42] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:35:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1020.eqiad.wmnet with OS trixie [14:35:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:35:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1016.eqiad.wmnet with OS trixie [14:35:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654665 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1020.eqiad.wmnet with OS trixie completed: - backup1020 (**WARN... 
[14:35:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie completed: - backup1016 (**WARN... [14:36:20] itamarWMDE: https://test.wikidata.org/wiki/User:ItamarWMDE/test [14:36:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654667 (10Jclark-ctr) [14:36:36] I'm having some trouble confirming though, using the generic k8s-mwdebug, is that correct? [14:36:42] yes, that should be it [14:36:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:37:03] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1017.eqiad.wmnet with OS trixie [14:37:05] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1019.eqiad.wmnet with OS trixie [14:37:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:37:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie [14:37:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [14:37:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1019.eqiad.wmnet with OS trixie [14:37:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [14:37:30] Okay, well the logspam is a good sign, though the survey should show even if the interface messages are not there. 
[14:38:58] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1018.eqiad.wmnet with reason: host reimage [14:39:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:39:49] I'd say roll back, It's really hard to debug these quicksurveys, I'll try to figure it out again locally before trying again. [14:39:57] okay, reverting [14:39:59] !log urbanecm@deploy2002 Sync cancelled. [14:41:01] (03PS1) 10TrainBranchBot: Revert "Add configurations for graphql usage survey and its pipeline tests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679 [14:41:01] (03CR) 10TrainBranchBot: "urbanecm@deploy2002 created a revert of this change as I7aeb01cab59b990d4a02894bbc7f2ff134479f76" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [14:41:19] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [14:41:21] Thank you! apologies. 
[14:41:28] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:41:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:41:30] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:41:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679 (owner: 10TrainBranchBot) [14:42:15] itamarWMDE: no worries, it happens [14:42:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [14:42:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:28] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:43:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:43:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1018.eqiad.wmnet with reason: host reimage [14:43:40] (03Merged) 10jenkins-bot: Revert "Add configurations for graphql usage survey and its 
pipeline tests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679 (owner: 10TrainBranchBot) [14:44:08] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]] [14:45:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654698 (10Jclark-ctr) [14:46:14] !log urbanecm@deploy2002 trainbranchbot, urbanecm: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:46:35] !log urbanecm@deploy2002 trainbranchbot, urbanecm: Continuing with sync [14:46:45] (03PS1) 10Tiziano Fogli: P::thanos::store::ruler (TMP): select only blocks generated locally [puppet] - 10https://gerrit.wikimedia.org/r/1244680 (https://phabricator.wikimedia.org/T412924) [14:46:50] (03CR) 10Taiwanese elephant: "How about just removing the noratelimit right from the accountcreator in zhwiki, as this has been done in several private wikis?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:47:09] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1003.eqiad.wmnet with OS trixie [14:47:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1003.eqiad.wmnet with OS trixie [14:47:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89044 and previous config saved to /var/cache/conftool/dbconfig/20260226-144721-marostegui.json [14:47:26] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:47:35] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie [14:47:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:47:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie [14:47:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89045 and previous config saved to /var/cache/conftool/dbconfig/20260226-144746-marostegui.json [14:48:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:49:17] urbanecm: Yeah, that's fine with me. I'm in a meeting so the later the possible the better :D [14:49:23] ah [14:49:40] just finishing last sync :) [14:49:45] Thanks [14:49:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89046 and previous config saved to /var/cache/conftool/dbconfig/20260226-144956-marostegui.json [14:50:49] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]] (duration: 06m 41s) [14:50:54] and done [14:50:56] Dreamy_Jazz: over to you [14:51:00] Thanks [14:52:08] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage [14:52:43] (03CR) 10Tiziano Fogli: [C:03+2] P::thanos::store::ruler (TMP): select only blocks generated locally [puppet] - 10https://gerrit.wikimedia.org/r/1244680 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:52:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage [14:53:40] (03PS3) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) [14:54:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:54:27] 06SRE: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11654755 (10Jdforrester-WMF) [14:54:31] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:47] Scap is running [14:56:33] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage [14:58:20] (03PS1) 10Urbanecm: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) [14:58:33] (03PS1) 10Urbanecm: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) [15:01:27] (03PS1) 10Dreamy Jazz: SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) [15:01:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [15:02:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654837 (10Aklapper) CC'ing @Khantstop who may want to set an "Also Known As" value on https://phabricator.wikimedia.org/people/editprofile/42910/ for better discoverability [15:02:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage [15:02:55] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:02:56] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:03:22] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1003.eqiad.wmnet with reason: host reimage [15:03:23] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:03:33] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:03:41] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 in upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) [15:03:44] !log upgrade conf* nodes to facter 4 T381538 [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] T381538: Backport facter to bullseye - https://phabricator.wikimedia.org/T381538 [15:04:12] (03CR) 10Vgutierrez: [C:03+1] hiera: set haproxy version to 3.0 in upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [15:04:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:04:43] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [15:05:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P89047 and previous config saved to /var/cache/conftool/dbconfig/20260226-150504-marostegui.json [15:05:06] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:05:09] (03PS1) 10Elukey: Revert "ml-serve: fix istio/transparentproxy config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244690 
[15:05:14] Dreamy_Jazz: will you be deploying something else? (if not, i'd +2 two backports to aid investigating a bug...) [15:05:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:05:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1018.eqiad.wmnet with OS trixie [15:05:26] !log sudo cumin "C:bird%do_ipv6=true" "disable-puppet 'merging CR 1241003'" [15:05:27] Yes, I have a public backport to make too [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:32] ok, i'll wait then [15:05:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie completed: - backup1018 (**WARN... [15:05:42] (03CR) 10Dreamy Jazz: [C:03+2] SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz) [15:06:31] Mine will take a bit of time to merge [15:06:36] (03CR) 10Ssingh: [V:03+1 C:03+2] P:bird::anycast: automatically detect IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:06:43] So urbanecm if you want to deploy any config changes first? [15:06:52] Dreamy_Jazz: no, just the backports [15:07:01] Ok [15:07:08] Do you want to combine the backports? [15:07:11] To save time? [15:07:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1003.eqiad.wmnet with reason: host reimage [15:07:17] if you're ok with it, sure. 
it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244686 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244685 [15:07:25] (it just changes log level, so should be fairly low-risk too) [15:07:57] (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:08:01] (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:08:06] thanks [15:08:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz) [15:08:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:08:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:08:51] Np [15:12:49] (03PS1) 10D3r1ck01: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) [15:13:01] (03PS5) 10D3r1ck01: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) [15:15:55] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth 
manifests - https://phabricator.wikimedia.org/T418163#11654951 (10herron) [15:16:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654952 (10Jelto) p:05Triage→03Medium [15:16:11] jouncebot: nowandnext [15:16:11] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [15:16:11] In 0 hour(s) and 13 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1530) [15:16:33] I also want to deploy private code changes again after this scap, so hopefully okay with colliding with the next window? [15:16:48] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654958 (10Jelto) Thanks for the access request, we have to confirm the key out of band and another approval from @thcipriani is needed for th... [15:18:22] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:19:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1017.eqiad.wmnet with OS trixie [15:19:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie completed: - backup1017 (**WARN... 
[15:19:45] (03CR) 10Reedy: Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239026 (https://phabricator.wikimedia.org/T416544) (owner: 10Reedy) [15:19:51] !log sudo cumin -b1 -s5 "C:bird%do_ipv6=true" "run-puppet-agent --enable 'merging CR 1241003'" [15:19:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654977 (10Jclark-ctr) [15:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:13] (03Merged) 10jenkins-bot: SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz) [15:20:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P89048 and previous config saved to /var/cache/conftool/dbconfig/20260226-152012-marostegui.json [15:20:15] (03Merged) 10jenkins-bot: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:20:16] Dreamy_Jazz: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244685 failed due to some CI infra issue [15:20:19] (03CR) 10Ssingh: [V:03+1 C:03+2] "Post-merge comments: thanks for the review folks. This is now merged and documentation updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:20:30] Thanks for the heads up [15:20:56] the last job is about to complete [15:21:14] and the change can be +2ed again once that job has been reported as a failure [15:21:19] (03CR) 10CI reject: [V:04-1] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:21:22] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654984 (10Jelto) [15:21:29] (03PS2) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:21:33] (03PS3) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:21:37] (03CR) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:21:39] (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:21:45] <3 [15:21:46] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:22:09] (03CR) 10Muehlenhoff: "Ok, then I'll abandon the patch.
Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [15:22:14] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:22:35] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244693 [15:22:38] (03PS1) 10MVernon: apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386) [15:22:40] (03PS1) 10MVernon: apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386) [15:22:56] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:23:04] (03CR) 10Marostegui: [C:03+1] apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon) [15:23:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:23:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1019.eqiad.wmnet with OS trixie [15:23:22] (03CR) 10Ottomata: "@joal@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [15:23:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1019.eqiad.wmnet with 
OS trixie completed: - backup1019 (**WARN... [15:23:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654993 (10Jclark-ctr) [15:24:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654995 (10Jclark-ctr) [15:24:07] (03PS1) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) [15:24:30] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:24:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:24:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup1003.eqiad.wmnet with OS trixie [15:25:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1003.eqiad.wmnet with OS trixie completed: - ms-backup1003... 
[15:25:06] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654999 (10Jelto) @Lucas_Werkmeister_WMDE is still a member of the `deployment` group. So approval from @thcipriani is not really needed. I'll... [15:25:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655002 (10Jelto) [15:25:54] (03CR) 10Vgutierrez: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [15:26:51] (03CR) 10Jelto: [C:03+1] "key has been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482) (owner: 10Lucas Werkmeister (WMDE)) [15:26:52] (03CR) 10Marostegui: [C:03+1] apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon) [15:26:56] (03PS1) 10Clément Goubert: wmnet: Add rest-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145) [15:27:39] (03CR) 10Jelto: [C:03+2] Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482) (owner: 10Lucas Werkmeister (WMDE)) [15:28:34] (03PS5) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [15:28:41] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655017 (10herron) [15:28:50] (03CR) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [15:29:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:29:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655019 (10Jelto) 05Open→03Resolved a:03Jelto Key is merged into puppet, you should have access in 30 minutes. I'll resolve the task... [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1530) [15:30:22] (03CR) 10MVernon: [C:03+2] apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon) [15:32:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655035 (10Jelto) p:05Triage→03Medium thank you @Aklapper @Khantstop we need your approval here for > - access request (or expansion) has sign off of WMF sponsor/m... [15:33:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11655048 (10Jclark-ctr) 05Open→03Resolved [15:33:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655053 (10Jclark-ctr) [15:33:58] (03Merged) 10jenkins-bot: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [15:34:10] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1244061 (owner: 10Scott French) [15:34:12] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: fix non-default scope key structure [puppet] - 10https://gerrit.wikimedia.org/r/1244061 (owner: 10Scott French) [15:34:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:34:17] (03PS1) 10Clément Goubert: api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) [15:34:28] (03PS2) 10Clément Goubert: api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) [15:34:29] jhancock@cumin2002 provision (PID 4168454) is awaiting input [15:34:37] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] [15:34:41] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [15:35:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89049 and previous config saved to /var/cache/conftool/dbconfig/20260226-153521-marostegui.json [15:35:26] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:35:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:35:46] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Depooling db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89050 and previous config saved to /var/cache/conftool/dbconfig/20260226-153545-marostegui.json [15:35:53] (03PS2) 10Clément Goubert: wmnet: Add api-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145) [15:36:40] !log dreamyjazz@deploy2002 dreamyjazz, urbanecm: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:36:44] voilá [15:36:53] (i have nothing to test, it is running in a job) [15:36:53] :D [15:37:09] Same here. Problem code I'm fixing is running in a job too [15:37:12] !log dreamyjazz@deploy2002 dreamyjazz, urbanecm: Continuing with sync [15:37:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89051 and previous config saved to /var/cache/conftool/dbconfig/20260226-153756-marostegui.json [15:40:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655108 (10Khantstop) Thank you @Jelto! Confirming that I grant approval for Dani to access the items listed in this task as part of her role as a data scientist. Let me k... 
[15:40:33] (03CR) 10Marostegui: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [15:41:06] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] (duration: 06m 29s) [15:41:11] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [15:41:30] (03CR) 10Elukey: [C:03+2] Revert "ml-serve: fix istio/transparentproxy config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244690 (owner: 10Elukey) [15:42:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655121 (10Khantstop) [15:42:46] (03CR) 10Ottomata: "Ah, sorry! I do not know how haproxy log vs haproxykafka logging works. https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_reques" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [15:43:20] Still working on my private code changes [15:43:21] (03CR) 10Vgutierrez: [C:03+1] "+1 on traffic side of things" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [15:44:32] FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:41] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:44:42] (03CR) 10Ottomata: component: mediawiki.page_html_content_change.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [15:44:52] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:45:37] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655135 (10herron) [15:47:17] (03PS2) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) [15:47:22] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe1004.eqiad.wmnet [15:47:28] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe1005.eqiad.wmnet [15:47:44] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe1004.eqiad.wmnet [15:47:51] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe1005.eqiad.wmnet [15:48:44] (03CR) 10CI reject: [V:04-1] component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [15:49:04] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [15:51:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:03] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie [15:52:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie [15:52:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for 
host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:53:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P89052 and previous config saved to /var/cache/conftool/dbconfig/20260226-155304-marostegui.json [15:53:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:54:31] (03CR) 10JavierMonton: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [15:54:47] (03CR) 10Ottomata: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [15:55:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2024.codfw.wmnet with OS bullseye [15:55:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11655197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye [15:56:38] (03Abandoned) 10Muehlenhoff: Add DB grant for pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [15:56:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm [15:56:51] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm [15:59:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 
- https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:00:04] dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1600). [16:01:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:45] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11655260 (10SLyngshede-WMF) We normally don't disable accounts, just remove any special privileges. [16:02:49] (03CR) 10Fabfur: [C:03+1] "Don't know why this has been left back but +1 for me!!" [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [16:03:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11655266 (10Jhancock.wm) [16:03:27] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655267 (10herron) [16:03:59] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11655273 (10A_smart_kitten) FWIW, https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users currently... 
[16:04:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:04:35] (03CR) 10Hnowlan: [C:03+1] wmnet: Add api-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [16:04:36] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11655296 (10Jhancock.wm) 05Open→03Resolved [16:04:36] (03CR) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:05:01] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:05:16] (03CR) 10Hnowlan: [C:03+1] api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [16:05:32] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655302 (10Lucas_Werkmeister_WMDE) It’s working \o/ thank you! [16:05:46] * Lucas_WMDE can deploy again \o/ [16:08:03] uhuuu! 
[16:08:06] welcome back Lucas_WMDE [16:08:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P89053 and previous config saved to /var/cache/conftool/dbconfig/20260226-160812-marostegui.json [16:08:22] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:27] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655310 (10herron) [16:09:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:09:51] (03CR) 10Ottomata: [C:03+1] component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:12:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:12:58] FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [16:13:10] Here [16:13:13] !ack [16:13:14] 7491 (ACKED) [6x] ProbeDown sre (text-https:443 probes/service) [16:14:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:14:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 3 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:14:42] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655349 (10herron) [16:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:17:43] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655370 (10herron) [16:17:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:58] FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:58] FIRING: NELHigh: 
Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:18:22] FIRING: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:30] FIRING: [22x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 6 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:19:55] !ack [16:19:55] 7492 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:20:41] (03CR) 101F616EMO: "That's probably more doable and elegant as it does not require a CommonSettings hack. 
To play safe, I will go for this approach, and the c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [16:21:05] (03PS1) 10Tiziano Fogli: thanos::store: align ruler instance min-time with main instance max-time [puppet] - 10https://gerrit.wikimedia.org/r/1244707 (https://phabricator.wikimedia.org/T412924) [16:21:49] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655387 (10herron) [16:21:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:22:57] FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89054 and previous config saved to /var/cache/conftool/dbconfig/20260226-162321-marostegui.json [16:23:22] FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:27] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:23:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db1201.eqiad.wmnet with reason: Maintenance [16:23:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89055 and previous config saved to /var/cache/conftool/dbconfig/20260226-162346-marostegui.json [16:23:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:24:30] RESOLVED: [20x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:24:43] FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:24:58] FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from FR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:25:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:25:11] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests 
- https://phabricator.wikimedia.org/T418163#11655400 (10herron) [16:25:22] (03CR) 10MVernon: [C:03+2] apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon) [16:25:47] (03CR) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [16:25:56] (03CR) 10Gergő Tisza: [C:04-1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [16:25:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89056 and previous config saved to /var/cache/conftool/dbconfig/20260226-162556-marostegui.json [16:26:18] (03CR) 10Gergő Tisza: [C:03+1] Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [16:26:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:27:58] RESOLVED: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:29:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:29:58] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655424 (10herron) [16:29:58] RESOLVED: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from FR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:30:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:32:29] (03PS2) 10Giuseppe Lavagetto: haproxy: stop current abuse [puppet] - 10https://gerrit.wikimedia.org/r/1243884 [16:32:29] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 [16:32:53] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655446 (10herron) [16:33:05] (03CR) 10CDanis: [C:03+1] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto) [16:33:22] FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-fe[1001-1002].eqiad.wmnet [16:34:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:34:43] FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:57] (03CR) 10Bartosz Dziewoński: haproxy: stop current abuse (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243884 (owner: 10Giuseppe Lavagetto) [16:36:46] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto) [16:37:07] (03PS2) 10Giuseppe Lavagetto: cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 [16:38:22] FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:24] (03PS1) 10Federico Ceratto: orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) [16:39:35] PROBLEM - Memcached on titan1002 is CRITICAL: connect to address 10.64.48.167 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:40:03] (03CR) 10Giuseppe Lavagetto: 
[C:03+2] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto) [16:41:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P89058 and previous config saved to /var/cache/conftool/dbconfig/20260226-164105-marostegui.json [16:41:50] (03CR) 10Muehlenhoff: [C:03+1] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [16:42:20] (03CR) 10Effie Mouzeli: "You may be right, I am unsure how we should proceed tbh" [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff) [16:42:36] jouncebot: nowandnext [16:42:36] For the next 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1600) [16:42:36] In 0 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700) [16:44:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:44:45] (03CR) 10MVernon: [C:03+1] cassandra: Java 8 no longer supported (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:45:53] (03CR) 10MVernon: [C:03+1] cassandra: add new 'linked_artifacts' role (user) [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) (owner: 10Eevans) [16:47:43] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [16:49:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 
(https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:49:20] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest2001.codfw.wmnet with reason: T381919 [16:49:24] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [16:50:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2024 [16:50:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2024 [16:50:41] (03Merged) 10jenkins-bot: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:51:32] !log javiermonton@deploy2002 Started scap sync-world: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]] [16:51:36] T418467: Enrich "parent" HTML using diffs - https://phabricator.wikimedia.org/T418467 [16:51:49] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-fe[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [16:52:36] (03CR) 10Marostegui: [C:03+1] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [16:52:48] (03PS3) 101F616EMO: zhwiki: Remove all rights from accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) [16:53:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-fe[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" 
[16:53:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-fe[1001-1002].eqiad.wmnet [16:53:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11655584 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `moss-fe[1001-1002].eqiad.wmnet` - moss-fe1001.eqiad.w... [16:53:29] (03PS1) 10Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [16:53:35] !log javiermonton@deploy2002 javiermonton: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:54:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:54:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515 (10MatthewVernon) 03NEW [16:54:57] (03CR) 10CI reject: [V:04-1] cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [16:55:04] (03CR) 10Federico Ceratto: [C:03+2] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [16:56:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P89059 and previous config saved to 
/var/cache/conftool/dbconfig/20260226-165613-marostegui.json [16:58:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655610 (10VRiley-WMF) a:03VRiley-WMF [16:58:40] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655614 (10Jhancock.wm) @Andrew i got the installer to work but the config needs an edit. getting this error and it's going to fail. [40/50, retrying in 120.00s] Attempt to run 'cookbooks.s... [16:59:14] !log javiermonton@deploy2002 javiermonton: Continuing with sync [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [17:03:27] !log javiermonton@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]] (duration: 11m 55s) [17:03:32] T418467: Enrich "parent" HTML using diffs - https://phabricator.wikimedia.org/T418467 [17:07:42] (03PS1) 10Kosta Harlan: hcaptcha: Sanitize values of x_is_browser sent on risk_score events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505) [17:08:33] (03PS11) 10Pppery: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) [17:09:16] FIRING: [2x] 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:09:24] (03CR) 10Pppery: "The error was caused by me using syntax from a PHP version newer that what XHPAST supports. Fixed that in the latest patchset." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery) [17:09:49] (03PS5) 10Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529) [17:10:08] (03PS5) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) [17:11:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89061 and previous config saved to /var/cache/conftool/dbconfig/20260226-171121-marostegui.json [17:11:27] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:11:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:12:15] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-backup1004.eqiad.wmnet with OS trixie [17:12:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie executed with errors: - ms-... 
[17:12:51] (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [17:12:52] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11655662 (10Blake) [17:13:01] (03CR) 10DCausse: [C:03+2] opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 (owner: 10DCausse) [17:14:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:14:55] (03Merged) 10jenkins-bot: opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 (owner: 10DCausse) [17:15:24] (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [17:15:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2024.codfw.wmnet with OS bullseye [17:15:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11655677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye executed with errors: - ms-fe... 
[17:16:31] jouncebot: nowandnext [17:16:32] For the next 0 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700) [17:16:32] In 0 hour(s) and 43 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800) [17:16:32] In 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800) [17:16:41] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [17:16:56] can I deploy a MW patch now, or is someone using the window? [17:17:43] I will go ahead [17:18:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505) (owner: 10Kosta Harlan) [17:18:38] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [17:19:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:19:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm [17:19:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:26] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655701 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm executed with errors: - cloudcepho... [17:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:20:25] (03Merged) 10jenkins-bot: hcaptcha: Sanitize values of x_is_browser sent on risk_score events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505) (owner: 10Kosta Harlan) [17:20:58] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]] [17:21:02] T418505: hCaptcha: Fix validation errors for x_is_browser being asigned a string value - https://phabricator.wikimedia.org/T418505 [17:22:05] (03PS1) 10Muehlenhoff: Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 [17:22:51] (03CR) 10CI reject: [V:04-1] Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 (owner: 10Muehlenhoff) [17:23:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:24:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:25:44] (03PS1) 10Muehlenhoff: Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 [17:25:55] (03CR) 10Tiziano Fogli: [C:03+2] thanos::store: align ruler instance min-time with main instance max-time [puppet] - 10https://gerrit.wikimedia.org/r/1244707 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [17:26:38] (03CR) 10CI reject: [V:04-1] Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 (owner: 10Muehlenhoff) [17:27:02] !log kharlan@deploy2002 kharlan: Continuing with sync [17:27:07] (03PS2) 10Muehlenhoff: Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 [17:28:21] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 (owner: 10Muehlenhoff) [17:28:41] (03PS2) 10Muehlenhoff: Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 [17:29:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:29:36] RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.000 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [17:29:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655790 (10MBinder_WMF) @Aklapper done :) @MatthewVernon @Ottomata where I can I see what groups I'm in? 
[17:30:55] (03CR) 10AOkoth: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth) [17:30:58] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]] (duration: 10m 00s) [17:31:03] T418505: hCaptcha: Fix validation errors for x_is_browser being asigned a string value - https://phabricator.wikimedia.org/T418505 [17:34:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:34:35] (03PS1) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) [17:34:58] (03CR) 10BCornwall: [C:03+2] varnishkafka: Only enable for text [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [17:35:15] (03CR) 10Urbanecm: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [17:35:53] (03PS2) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) [17:36:31] (03PS1) 10Muehlenhoff: Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 [17:36:37] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 (owner: 10Muehlenhoff) [17:37:05] (03CR) 10CI reject: [V:04-1] Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff) [17:37:48] (03PS2) 10Muehlenhoff: Unconditionally 
assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 [17:39:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff) [17:39:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655825 (10Aklapper) > where I can I see what groups I'm in? Uhm, none it seems, while I would have expected ldap/wmf at least: https://ldap.toolfo... [17:44:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:44:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:17] (03CR) 10JHathaway: [C:03+1] Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff) [17:49:39] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655856 (10VRiley-WMF) [17:49:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655858 (10VRiley-WMF) 05Open→03Resolved These have been decommissioned [17:53:02] (03CR) 10Muehlenhoff: "These spec tests test a random tiny fraction of a mediawiki install and will break 
randomly if groups get reorganised. Even if still used " [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff) [17:53:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655864 (10Ottomata) > the former is now self-service via IDM. I suppose that would be https://idm.wikimedia.org/. I don't totally see how one can... [17:54:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655866 (10MoritzMuehlenhoff) >>! In T417655#11655864, @Ottomata wrote: >> the former is now self-service via IDM. > > I suppose that would be http... [17:55:49] (03PS2) 10BryanDavis: toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824) [17:56:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:56:18] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! I don't spot any obvious problems. 
Let's merge this :)" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery) [17:58:36] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2043.codfw.wmnet [17:59:15] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824) (owner: 10BryanDavis) [18:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800) [18:00:05] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800). [18:01:04] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824) (owner: 10BryanDavis) [18:01:10] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:01:20] o/ [18:01:30] (03PS1) 10BryanDavis: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 [18:01:32] I'll get started on the work planned for this window in a bit. [18:01:32] PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:02:44] I have updates for both toolhub and developer-portal today. a weird alignment of planets after a couple months of nothing for my window. 
[18:04:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655884 (10DTotten-WMF) Hi @Aklapper - thanks for your help with this ticket. I have updated my Phabricator account to show my LDAP user name on my profile. [18:05:29] (03CR) 10Scott French: [C:03+2] mesh: Copy mesh.configuration 1.15.1 to 1.15.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242517 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:05:33] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2044.codfw.wmnet [18:05:58] (03PS1) 10DCausse: opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729 [18:06:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:06:34] (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729 (owner: 10DCausse) [18:07:24] (03Merged) 10jenkins-bot: mesh: Copy mesh.configuration 1.15.1 to 1.15.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242517 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:07:40] (03CR) 10Scott French: [C:03+2] mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:08:06] (03PS3) 10Scott French: mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) [18:08:34] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 
(https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:08:52] (03Merged) 10jenkins-bot: opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729 (owner: 10DCausse) [18:09:13] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:09:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:09:30] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:09:40] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:09:52] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:10:15] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:11:23] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:11:28] (03Merged) 10jenkins-bot: mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:11:57] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:12:12] (03PS3) 10Scott French: mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) [18:12:35] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:14:33] (03CR) 10Scott French: [C:03+2] mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:14:47] (03CR) 10Btullis: [C:03+2] Clean up list of dumps mirrors [puppet] - 
10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [18:15:12] (03PS2) 10BryanDavis: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 [18:15:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie [18:15:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie [18:16:27] (03Merged) 10jenkins-bot: mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:17:25] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:17:32] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:18:49] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 (owner: 10BryanDavis) [18:19:26] (03PS4) 10Scott French: mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) [18:20:54] (03Merged) 10jenkins-bot: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 (owner: 10BryanDavis) [18:21:16] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:21:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:21:38] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:22:00] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:22:17] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:22:37] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:22:43] (03CR) 10Scott French: [C:03+2] mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:23:37] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:24:20] (03PS1) 10Dzahn: site: add contint1003/2003 with insetup collab role [puppet] - 10https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521) [18:24:32] (03Merged) 10jenkins-bot: mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:25:07] I am done with my window today. 
[18:25:22] (03PS4) 10Scott French: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) [18:26:30] (03CR) 10Dzahn: [C:03+2] ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [18:29:31] (03CR) 10Scott French: [C:03+2] mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:34:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [18:34:51] (03Merged) 10jenkins-bot: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:35:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11656047 (10MBinder_WMF) Thanks! The only one on that list that I see might be relevant is Wmf, so I requested access and referenced this ticket. 
[18:35:22] (03PS1) 10BPirkle: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) [18:36:14] * swfrench-wmf is waiting for chart-museum [18:36:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:37:45] pt1979@cumin2002 dhcp (PID 65743) is awaiting input [18:39:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [18:39:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [18:41:28] (03CR) 10Dzahn: [C:03+2] "[phab1004:~] $ echo "teste mich" | /usr/local/bin/phaste --config /etc/phaste.conf -t "teste mich"" [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [18:42:02] (03CR) 10Dzahn: [C:03+2] "tested. still working." [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [18:42:12] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deploy for mesh module updates - T364245 [18:42:16] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245 [18:42:42] pt1979@cumin2002 dhcp (PID 67882) is awaiting input [18:43:22] !log swfrench@deploy2002 swfrench: helmfile-only deploy for mesh module updates - T364245 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:44:14] alright, let's see if tracing still works ... 
[18:48:04] !log swfrench@deploy2002 swfrench: Continuing with sync [18:49:30] (03PS5) 10Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) [18:50:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [18:50:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [18:51:56] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deploy for mesh module updates - T364245 (duration: 11m 13s) [18:52:00] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245 [18:54:02] pt1979@cumin2002 dhcp (PID 72585) is awaiting input [18:55:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [18:56:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1900). 
[19:00:08] I'll pause my work here and pick up during a quiet spot after the train [19:00:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [19:01:02] thanks swfrench-wmf [19:02:52] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808) [19:02:54] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:03:25] pt1979@cumin2002 dhcp (PID 78796) is awaiting input [19:03:47] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:04:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [19:06:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:09:31] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.17 refs T413808 [19:09:36] T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808 [19:11:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:11:20] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp2043.codfw.wmnet [reason: NIC firmware issues] [19:11:40] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp2044.codfw.wmnet [reason: NIC firmware issues] [19:12:27] (03PS1) 10DCausse: opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764 [19:15:28] (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764 (owner: 10DCausse) [19:16:01] dduvall: once the dust settles, if you could give me a heads-up when it might be alright to mess with mw-debug a bit, that would be swell (no rush) [19:16:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:16:25] (03CR) 10Bking: [C:03+1] openjkd-21-jre: fix malformed changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 (owner: 10MVernon) [19:16:26] swfrench-wmf: i think we're good to call it a train [19:16:29] !log hoo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [19:16:41] i.e. go ahead and thanks again [19:16:51] awesome, thanks dduvall [19:17:14] !log hoo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [19:17:22] (03Merged) 10jenkins-bot: opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764 (owner: 10DCausse) [19:18:13] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:18:15] 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656209 (10ssingh) [19:18:24] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:19:21] (03CR) 10Scott French: [C:03+2] mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:20:09] (03PS1) 10Gergő Tisza: 
EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) [19:20:27] (03PS1) 10Gergő Tisza: EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) [19:20:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [19:21:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:21:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [19:21:37] (03Merged) 10jenkins-bot: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:21:48] 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656222 (10BCornwall) [19:22:01] 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656224 (10BCornwall) [19:22:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, 
February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [19:22:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [19:23:01] (03CR) 10CI reject: [V:04-1] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [19:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:25:01] (03CR) 10MVernon: [V:03+2 C:03+2] openjkd-21-jre: fix malformed changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 (owner: 10MVernon) [19:27:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:27:58] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:28:38] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:29:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:29:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89066 and previous config saved to 
/var/cache/conftool/dbconfig/20260226-192927-marostegui.json [19:29:32] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [19:30:46] (03CR) 10Ssingh: [C:03+1] "Looks good to proceed since you are confident about the status of the cluster itself. I think we will first need to merge this patch to up" [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [19:31:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:31:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656259 (10BCornwall) FWIW `perf` has the majority of dropped packets as `QUEUE_PURGE` (and `NOT_SPECIFIED`) [19:35:29] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-backup1004.eqiad.wmnet with OS trixie [19:35:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11656284 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie executed with errors: - ms-... 
[19:35:48] (03PS2) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:36:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:36:38] (03CR) 10CI reject: [V:04-1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:43:50] (03PS1) 10BCornwall: hardware.upgrade-firmware: Fix usage path [cookbooks] - 10https://gerrit.wikimedia.org/r/1244788 [19:44:07] !log brett@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2043.codfw.wmnet [19:44:19] !log brett@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2043.codfw.wmnet [19:44:21] (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 [19:44:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P89067 and previous config saved to /var/cache/conftool/dbconfig/20260226-194435-marostegui.json [19:47:03] (03PS3) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:47:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) 
[19:47:25] (03PS2) 10BPirkle: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) [19:49:29] (03PS4) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:50:43] (03PS6) 10Gergő Tisza: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:50:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [19:51:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:51:29] (03PS2) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 [19:51:50] (03CR) 10CI reject: [V:04-1] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 (owner: 10CDobbins) [19:55:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:55:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:56:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 
- https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:56:56] (03PS1) 10Scott French: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245) [19:57:43] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet [19:59:12] (03CR) 10Scott French: [C:03+2] Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:59:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P89068 and previous config saved to /var/cache/conftool/dbconfig/20260226-195943-marostegui.json [20:01:01] (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244802 [20:01:13] (03Abandoned) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 (owner: 10CDobbins) [20:01:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:01:21] (03Merged) 10jenkins-bot: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:01:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:46] (03Abandoned) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244802 (owner: 10CDobbins) [20:04:26] 
!log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:04:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:04:57] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:05:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:06:10] alright, I'm done with mw-debug for now [20:06:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:07:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:07:44] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet [20:09:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656360 (10BCornwall) [20:10:14] (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810 [20:11:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2095'] [20:12:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2095'] [20:12:18] (03CR) 10BCornwall: [C:03+1] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810 (owner: 10CDobbins) [20:13:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2095.codfw.wmnet with OS bullseye [20:13:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye [20:13:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2096.codfw.wmnet with OS bullseye [20:13:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye [20:14:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89069 and previous config saved to /var/cache/conftool/dbconfig/20260226-201451-marostegui.json [20:14:57] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:15:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:15:26] (03CR) 10CDobbins: [C:03+2] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810 (owner: 10CDobbins) [20:15:46] (03PS1) 10Jforrester: plugins/wm-pcc: Switch commands from experimental to new puppet [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621) [20:16:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:16:27] (03CR) 10CI reject: [V:04-1] plugins/wm-pcc: Switch commands from experimental to new puppet [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621) (owner: 10Jforrester) [20:20:16] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - 
https://phabricator.wikimedia.org/T416253#11656415 (10VRiley-WMF) a:03VRiley-WMF [20:23:28] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan1001 [20:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [20:31:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:31:41] (03PS1) 10Scott French: mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) [20:36:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:38:22] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:38:48] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2046.codfw.wmnet with OS trixie [20:39:54] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2047.codfw.wmnet with OS trixie [20:45:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:24] (03CR) 10RLazarus: [C:03+1] mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 
(https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:50:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11656482 (10VRiley-WMF) @Marostegui So, we can start with dbproxy1029. Are there specific dates that would be preferred? Also, just to... [20:51:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:52:13] (03CR) 10D3r1ck01: [C:03+1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [20:52:47] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2046.codfw.wmnet with reason: host reimage [20:52:55] (03CR) 10D3r1ck01: [C:03+1] Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [20:54:04] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [20:58:27] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2046.codfw.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T2100). [21:00:05] RoanKattouw, danisztls, JSherman, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:22] here [21:00:46] I can deploy [21:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:01:11] o/ [21:01:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:02:28] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [21:03:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar) [21:03:43] This is a cool new Spiderpig feature: it warned me that the Depends-On of this config change was only recently deployed and might be rolled back https://usercontent.irccloud-cdn.com/file/B4VIxlJs/image.png [21:03:56] (03Merged) 10jenkins-bot: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar) [21:04:12] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]] [21:04:16] I'm proceeding anyway, because the worst that could happen is that the config sets a setting that is ignored by MW [21:04:17] T417665: Deploy 
Extension:PersonalDashboard to id.wiki, tr.wiki, simple.wiki, and th.wiki - https://phabricator.wikimedia.org/T417665 [21:04:39] FIRING: CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:05:19] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan2002 [21:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:23] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [21:06:05] !log catrope@deploy2002 catrope, suecarmol: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:30] testing [21:09:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:10:06] RoanKattouw: we're good [21:10:10] !log catrope@deploy2002 catrope, suecarmol: Continuing with sync [21:11:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:11:26] sry, had IRC problems [21:11:33] (03CR) 10Catrope: [C:03+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:11:37] (03CR) 10Catrope: [C:03+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:14:09] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]] (duration: 09m 57s) [21:14:13] T417665: Deploy Extension:PersonalDashboard to id.wiki, tr.wiki, simple.wiki, and th.wiki - https://phabricator.wikimedia.org/T417665 [21:14:46] (03CR) 10CI reject: [V:04-1] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:15:55] CI keeps failing on 
https://gerrit.wikimedia.org/r/1244770 without actually saying why it fails, and the change is trivial, so I'm going to force-merge it [21:16:00] (03Merged) 10jenkins-bot: EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:16:00] (03CR) 10Catrope: [V:03+2 C:03+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:16:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:16:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza) [21:17:07] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]] [21:17:08] (Also Spiderpig just reminded me that wmf.16 isn't live anywhere anymore anyway) [21:17:12] T418512: Newly created Wikimedia Vote Wiki accounts unable to log in – Fatal exception “Error” - https://phabricator.wikimedia.org/T418512 [21:17:37] (03CR) 10Catrope: [C:03+2] Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:17:39] (03CR) 10Catrope: [C:03+2] 
Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:18:56] !log catrope@deploy2002 tgr, catrope: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:19:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:01] 22:13:52 ...............SSSSSSSSSSS.....................EEEEEEE... 183 / 183 [21:20:27] seems like there's a bunch of errors in the WikimediaEvents tests but they don't make it into the output somehow? 
[21:20:59] I have seen something like this happen in a transient way where a maximum test runtime was enforced and all the tests that took too long failed [21:21:07] There's also a PHP notice right after that [21:21:15] Anyway, CI passed on wmf.17 and that's the branch that matters [21:21:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:21:31] the whole test footer (runtime, number of tests etc) is missing, maybe a fatal error? [21:21:38] not sure if that's still a thing in PHP 8 [21:21:50] tgr_: Also please test on the test servers (if this is even testable) [21:23:02] testing [21:24:09] can I make myself an account on votewiki? [21:24:22] not sure how secret that wiki is [21:24:40] Uhhhh, probably not? [21:24:46] There's only like 6 people on it [21:24:58] So maybe we just proceed with the deployment, then ask the reporter to test again [21:26:11] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan2001 [21:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:16] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [21:26:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:26:59] I'll just test on a normal wiki then [21:27:21] (03PS1) 10BCornwall: cp2043: Set use_noflow_iface_preup to true [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) [21:28:15] RoanKattouw: works [21:28:22] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:28:23] !log catrope@deploy2002 tgr, catrope: Continuing with sync [21:28:27] (on normal wikis anyway, so no worse than before) [21:28:32] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8154/co" [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [21:29:06] (03CR) 10BCornwall: cp2043: Set use_noflow_iface_preup to true [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [21:31:14] (03Merged) 10jenkins-bot: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:31:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:31:22] (03Merged) 10jenkins-bot: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:31:49] (03CR) 10Ssingh: [C:03+1] cp2043: Set use_noflow_iface_preup to true [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [21:32:11] am I next? 
[21:32:22] (03CR) 10CDobbins: [C:03+1] cp2043: Set use_noflow_iface_preup to true [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [21:32:26] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]] (duration: 15m 19s) [21:32:31] T418512: Newly created Wikimedia Vote Wiki accounts unable to log in – Fatal exception “Error” - https://phabricator.wikimedia.org/T418512 [21:32:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:32:59] (03CR) 10BCornwall: [C:03+2] cp2043: Set use_noflow_iface_preup to true [puppet] - 10https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [21:33:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2095.codfw.wmnet with OS bullseye [21:33:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye executed with errors: -... 
[21:33:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2096.codfw.wmnet with OS bullseye [21:34:00] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]] [21:34:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye executed with errors: -... [21:34:05] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [21:34:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:40] danisztls: Yes, you'll be next after I'm done with tgr_'s patches. Sorry for the delay [21:35:55] !log catrope@deploy2002 catrope, tgr: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:36:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:37:10] RoanKattouw: no problem at all [21:39:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:39] sorry, this one is a bit cumbersome to test [21:40:13] (03CR) 10BCornwall: prometheus: add pooled host check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:41:37] RoanKattouw: looks good [21:41:46] !log catrope@deploy2002 catrope, tgr: Continuing with sync [21:42:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:43:33] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Certificate grafana.discovery.wmnet expires in 7 day(s) (Fri 06 Mar 2026 09:43:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:43:43] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Certificate grafana.discovery.wmnet expires in 7 day(s) (Fri 06 Mar 2026 09:43:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:45:39] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]] (duration: 11m 39s) [21:45:44] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [21:46:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:46:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:46:49] danisztls: Starting yours now, I'll ask you to test in a few minutes [21:47:06] RoanKattouw: thanks! 
[21:47:25] (03Merged) 10jenkins-bot: Deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:47:29] (03Merged) 10jenkins-bot: Deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:47:49] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]] [21:47:55] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:47:55] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:49:45] !log catrope@deploy2002 dani, catrope: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:50:00] danisztls: Alright please test, and let me know when you're done [21:50:02] (03PS1) 10CDobbins: hieradata: add haproxy version for new cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1244857 [21:50:26] RoanKattouw: looks good [21:50:55] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:23] !log catrope@deploy2002 dani, catrope: Continuing with sync [21:51:51] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: rollout complete. 
[21:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:55] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [21:52:11] (03CR) 10BCornwall: [C:03+1] hieradata: add haproxy version for new cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1244857 (owner: 10CDobbins) [21:53:20] (03CR) 10BCornwall: prometheus: add pooled host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:54:03] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8155/console" [puppet] - 10https://gerrit.wikimedia.org/r/1244857 (owner: 10CDobbins) [21:54:25] (03CR) 10CDobbins: [V:03+1 C:03+2] hieradata: add haproxy version for new cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1244857 (owner: 10CDobbins) [21:54:40] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:51] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet [21:55:16] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]] (duration: 07m 28s) [21:55:22] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:55:23] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:56:16] RoanKattouw: thanks! 
[21:56:16] Whoops sorry tgr_ I forgot your two config changes ("Set $wgJwtSessionCookieIssuer for bot passwords" and "Enable JWT session cookie for bot passwords (all wikis)"). Can those go out together, or should I deploy them separately? [21:56:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [21:56:36] they can go together [21:56:44] thanks [21:56:46] OK great, I will deploy them together after my config change [21:57:20] (03Merged) 10jenkins-bot: Remove workaround for T370517, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [21:57:40] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]] [21:57:44] T370517: Search button message text changes - https://phabricator.wikimedia.org/T370517 [21:59:35] !log catrope@deploy2002 catrope: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T2200) [22:00:40] !log catrope@deploy2002 catrope: Continuing with sync [22:01:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:02:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet [22:03:06] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet [22:04:43] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]] (duration: 07m 03s) [22:04:48] T370517: Search button message text changes - https://phabricator.wikimedia.org/T370517 [22:06:12] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2047.codfw.wmnet with OS trixie [22:06:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [22:06:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [22:07:46] (03Merged) 10jenkins-bot: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) [22:07:49] (03Merged) 10jenkins-bot: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01) 
[22:08:07] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]] [22:08:12] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [22:09:06] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2046.codfw.wmnet with OS trixie [22:09:58] !log catrope@deploy2002 catrope, d3r1ck01: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:10:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet [22:11:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:11:52] (03PS2) 10Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [22:12:41] (03CR) 10CI reject: [V:04-1] cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [22:13:46] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2048.codfw.wmnet with OS trixie [22:14:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2049.codfw.wmnet with OS trixie [22:14:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2050.codfw.wmnet with OS trixie [22:14:16] !log brett@cumin2002 START - 
Cookbook sre.hosts.reimage for host cp2051.codfw.wmnet with OS trixie [22:14:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2052.codfw.wmnet with OS trixie [22:15:01] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2053.codfw.wmnet with OS trixie [22:15:45] RoanKattouw: looks good [22:15:51] thanks for the deploys! [22:15:59] (03PS1) 10BCornwall: Revert "cp2043: Set use_noflow_iface_preup to true" [puppet] - 10https://gerrit.wikimedia.org/r/1244869 [22:16:01] !log catrope@deploy2002 catrope, d3r1ck01: Continuing with sync [22:18:50] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission puppetmaster2001 - https://phabricator.wikimedia.org/T416606#11656726 (10BCornwall) Is https://gerrit.wikimedia.org/r/c/operations/dns/+/1237463 still needed to be merged? [22:19:56] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]] (duration: 11m 48s) [22:20:00] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [22:27:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2048.codfw.wmnet with reason: host reimage [22:28:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2049.codfw.wmnet with reason: host reimage [22:28:27] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2051.codfw.wmnet with reason: host reimage [22:28:30] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2050.codfw.wmnet with reason: host reimage [22:28:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2052.codfw.wmnet with reason: host reimage [22:29:24] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cp2053.codfw.wmnet with reason: host reimage [22:30:19] (03PS3) 10Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [22:33:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2048.codfw.wmnet with reason: host reimage [22:34:06] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11656753 (10jhathaway) @DamianZaremba I tried with a couple of my test accounts, but I was unable to duplicate your r... [22:37:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2052.codfw.wmnet with reason: host reimage [22:40:24] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp2050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [22:40:26] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2053 is CRITICAL: connect to address 10.192.56.3 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:40:26] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2053 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [22:41:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2053.codfw.wmnet with reason: host reimage [22:45:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2050.codfw.wmnet with reason: host reimage [22:45:26] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [22:49:39] 
!log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2049.codfw.wmnet with reason: host reimage [22:50:26] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2051 is CRITICAL: connect to address 10.192.40.25 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:50:26] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [22:52:26] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2053 is OK: HTTP OK: HTTP/1.0 200 OK - 36064 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:53:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2051.codfw.wmnet with reason: host reimage [22:54:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2048.codfw.wmnet with OS trixie [22:58:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2052.codfw.wmnet with OS trixie [22:58:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [23:01:44] pt1979@cumin2002 dhcp (PID 209464) is awaiting input [23:02:04] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2053 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS [23:04:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2053.codfw.wmnet with OS trixie [23:04:26] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2051 is OK: HTTP OK: HTTP/1.0 200 OK - 36018 
bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:04:33] jouncebot: nowandnext [23:04:33] No deployments scheduled for the next 7 hour(s) and 55 minute(s) [23:04:33] In 7 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0700) [23:05:51] unless anyone has conflicting changes planned, I'd like to trigger a noop mediawiki deployment to clear a helm chart version diff - should be a pretty quick operation [23:07:02] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp2050 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-04-05 04:22:55 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/HTTPS [23:07:05] (03CR) 10Scott French: [C:03+2] mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [23:07:13] * swfrench-wmf will proceed with deployment [23:09:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2050.codfw.wmnet with OS trixie [23:10:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2049.codfw.wmnet with OS trixie [23:10:41] (03Merged) 10jenkins-bot: mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [23:12:58] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2051 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/HTTPS [23:12:58] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2051 is OK: SSL OK - Certificate wikipedia25.org contains all required 
SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS [23:13:13] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to clear chart version diff [23:14:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2051.codfw.wmnet with OS trixie [23:15:44] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to clear chart version diff (duration: 02m 31s) [23:15:53] all done [23:16:14] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:16:16] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:16:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:26:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:28:24] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the 
systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:30:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:32:02] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:33:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:35:25] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:02] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:38:02] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:40:25] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:40:26] PROBLEM - Blazegraph Port for 
wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:41:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:41:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:41:54] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:42:26] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:42:26] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:43:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:43:26] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:45:25] FIRING: [7x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:47:02] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:48:24] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:50:25] FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:51:30] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1018 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:51:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:52:02] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:53:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit 
wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:53:40] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:55:25] FIRING: [10x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:58:02] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock