[00:01:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:03:07] <wikibugs>	 (PS1) Dzahn: zuul::main: set tls_truststore for zookeeper to the copy it owns [puppet] - https://gerrit.wikimedia.org/r/1244021 (https://phabricator.wikimedia.org/T395938)
[00:03:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:04:10] <wikibugs>	 (PS1) Ryan Kemper: wdqs: Separate deadlock remediation config from script [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[00:04:12] <wikibugs>	 (PS1) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453)
[00:04:30] <wikibugs>	 (CR) Ryan Kemper: "Uploaded https://gerrit.wikimedia.org/r/c/operations/puppet/+/1244023 to address the review" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:05:53] <wikibugs>	 (CR) Dzahn: [C:+2] zuul::main: set tls_truststore for zookeeper to the copy it owns [puppet] - https://gerrit.wikimedia.org/r/1244021 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:06:30] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config from script [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:06:59] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:20:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:45] <wikibugs>	 (PS1) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[00:21:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:22:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:24:24] <wikibugs>	 (PS2) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[00:25:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:28] <wikibugs>	 (CR) BryanDavis: "Thanks for this" [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[00:27:53] <wikibugs>	 ops-eqiad, DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418429 (phaultfinder) NEW
[00:31:23] <wikibugs>	 (PS1) Scardenasmolinar: Deploy PersonalDashboard to new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665)
[00:36:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:10] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055
[00:38:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055 (owner: TrainBranchBot)
[00:41:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:41:37] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[00:43:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:46:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11652803 (AlexisJazz) Open→Resolved
[00:48:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:50:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244055 (owner: TrainBranchBot)
[00:51:37] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[00:52:16] <wikibugs>	 (CR) Cwhite: [C:+1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1243813 (owner: Muehlenhoff)
[00:53:11] <wikibugs>	 (PS2) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[00:53:12] <wikibugs>	 (PS1) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453)
[00:55:35] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[00:59:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[01:04:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:04:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: DDesouza)
[01:04:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: DDesouza)
[01:05:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:58] <wikibugs>	 (PS2) Scott French: P:cache::haproxy: fix non-default scope key structure [puppet] - https://gerrit.wikimedia.org/r/1244061
[01:09:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073
[01:09:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073 (owner: TrainBranchBot)
[01:09:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:10:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:11] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] "I think (or at least how I was interpreting it) "rewriting it in Python" was meant as a measure of complexity, rather than a concrete reco" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[01:19:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:21:13] <wikibugs>	 (PS1) Zabe: typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083
[01:23:06] <wikibugs>	 (CR) Zabe: [C:+2] typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083 (owner: Zabe)
[01:23:59] <wikibugs>	 (Merged) jenkins-bot: typos: Add magru wrongly numbered hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1244083 (owner: Zabe)
[01:24:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:26:59] <wikibugs>	 (PS3) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[01:26:59] <wikibugs>	 (PS2) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453)
[01:27:51] <wikibugs>	 (PS3) Dzahn: zuul::main: add extra Java opts to debug zookeeper TLS [puppet] - https://gerrit.wikimedia.org/r/1244033 (https://phabricator.wikimedia.org/T395938)
[01:29:19] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[01:34:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:34:27] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:36:18] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244073 (owner: TrainBranchBot)
[01:37:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:42:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:43:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:47:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:49:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:50:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:56:25] <wikibugs>	 (Abandoned) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244058 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[01:58:44] <wikibugs>	 (PS4) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[01:58:44] <wikibugs>	 (PS2) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453)
[02:00:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:00:45] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:01:12] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[02:02:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:07:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:08:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:54] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418433 (phaultfinder) NEW
[02:13:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:13:58] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 12s)
[02:23:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:26:34] <wikibugs>	 (CR) Dwisehaupt: [C:+1] "IPs match the hosts. shipit." [puppet] - https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) (owner: Jgreen)
[02:33:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:43] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:58:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:01:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:01:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:05:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:05:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:08:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:08:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:14:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:23:22] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:32:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:33:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:37:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:40:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11652966 (Papaul)  @ayounsi The new diagram is now on Wikitech. For the Wikimedia Amsterdam DCs, IP layer do you want for us to update it or just delete it.
[03:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:47:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:55:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:57:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:02:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:02:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:05:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:06:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:07:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:10:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:10:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:12:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:13:22] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:17:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:17:27] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:19:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:21:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:22:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:22:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:24:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:25:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:26:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:29:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:30:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:39:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:40:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:44:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:44:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:03] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[04:47:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:48:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:50:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:51:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:53:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:54:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:03] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[04:58:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:59:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:00:51] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439 (Papaul) NEW
[05:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:01:25] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[05:03:22] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:04:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:22] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:10:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:10:53] <phuedx>	 dr0ptp4kt and I will be investigating that flapping ErrorBudgetBurn alert today ^
[05:10:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:11:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:14:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:19:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:05] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[05:20:43] <icinga-wm>	 PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:21:25] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[05:22:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T415786)', diff saved to https://phabricator.wikimedia.org/P89028 and previous config saved to /var/cache/conftool/dbconfig/20260226-052230-marostegui.json
[05:22:36] <stashbot>	 T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:29:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:34:27] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:37:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] P:cache::haproxy: fix non-default scope key structure [puppet] - https://gerrit.wikimedia.org/r/1244061 (owner: Scott French)
[05:37:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89029 and previous config saved to /var/cache/conftool/dbconfig/20260226-053739-marostegui.json
[05:37:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:38:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:39:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:39:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:40:21] <wikibugs>	 (PS1) 1F616EMO: Wiping accountcreator from zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T303578)
[05:40:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:41:11] <wikibugs>	 (PS2) 1F616EMO: Wiping accountcreator from zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089)
[05:45:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:45:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:48:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:48:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:49:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:49:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:52:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89030 and previous config saved to /var/cache/conftool/dbconfig/20260226-055246-marostegui.json
[05:55:22] <wikibugs>	 (CR) Marostegui: [C:+2] data.yaml: Add ssh-key for the bkup token. [puppet] - https://gerrit.wikimedia.org/r/1243889 (owner: Marostegui)
[05:55:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] package_builder: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1243813 (owner: Muehlenhoff)
[06:01:23] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access for hahmed [puppet] - https://gerrit.wikimedia.org/r/1244392
[06:02:16] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add monitor_heartbeat to core hosts. [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:04:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:04:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access for hahmed [puppet] - https://gerrit.wikimedia.org/r/1244392 (owner: Muehlenhoff)
[06:07:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T415786)', diff saved to https://phabricator.wikimedia.org/P89031 and previous config saved to /var/cache/conftool/dbconfig/20260226-060755-marostegui.json
[06:08:00] <stashbot>	 T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[06:08:01] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1263.eqiad.wmnet with reason: Maintenance
[06:08:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89032 and previous config saved to /var/cache/conftool/dbconfig/20260226-060809-marostegui.json
[06:08:22] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:11:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:12:13] <wikibugs>	 (PS4) Muehlenhoff: pki::multirootca: Adapt firewall config to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664)
[06:12:47] <wikibugs>	 (CR) CI reject: [V:-1] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: Muehlenhoff)
[06:14:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:16:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:16:50] <moritzm>	 !log updated thirdparty/node22 to node 20.20.0
[06:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:42] <wikibugs>	 (PS5) Muehlenhoff: pki::multirootca: Adapt firewall config to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664)
[06:19:17] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:20:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11653051 (ayounsi) Updating them would be great. Thanks
[06:21:22] <wikibugs>	 SRE: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440 (MoritzMuehlenhoff) NEW
[06:21:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:21:55] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: Muehlenhoff)
[06:23:46] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[06:26:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:26:23] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[06:28:46] <wikibugs>	 (PS1) Marostegui: dbproxy1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1244428 (https://phabricator.wikimedia.org/T414656)
[06:29:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:20] <wikibugs>	 (CR) Marostegui: [C:+2] dbproxy1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1244428 (https://phabricator.wikimedia.org/T414656) (owner: Marostegui)
[06:34:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:23] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[06:38:57] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:39:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:39:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:44:17] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:45:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:45:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:45:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:46:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:46:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:47:24] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[06:48:22] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:49:17] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:50:06] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[06:50:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:53:46] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[06:53:48] <icinga-wm>	 RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:53:57] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:54:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:57:24] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[06:59:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0700).
[07:00:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:02:24] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[07:04:17] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:05:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:07:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:09:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:43] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/1 (Transport: cr1-codfw:xe-1/1/1:0 (Lumen, 442550294) {#1065}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:12:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:12:24] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[07:13:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:14:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:28:10] <wikibugs>	 (PS1) Muehlenhoff: Remove access for hokwelum [puppet] - https://gerrit.wikimedia.org/r/1244459
[07:28:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11653142 (Aklapper) @DTotten-WMF: Hi and welcome! Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/extern...
[07:28:55] <wikibugs>	 (CR) CI reject: [V:-1] Remove access for hokwelum [puppet] - https://gerrit.wikimedia.org/r/1244459 (owner: Muehlenhoff)
[07:31:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:34:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:35:46] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:38:31] <wikibugs>	 (PS2) Muehlenhoff: Remove access for hokwelum [puppet] - https://gerrit.wikimedia.org/r/1244459
[07:39:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:40:35] <moritzm>	 !log installing openssl security updates
[07:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:17] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:44:48] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for hokwelum [puppet] - https://gerrit.wikimedia.org/r/1244459 (owner: Muehlenhoff)
[07:48:46] <jinxer-wm>	 FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[07:50:42] <logmsgbot>	 !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Hokwelum out of all services on: 2432 hosts
[07:58:22] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T0800)
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:10:39] <wikibugs>	 (CR) Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur)
[08:10:55] <wikibugs>	 (PS4) Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864)
[08:25:39] <wikibugs>	 (CR) Gehel: "Minor comments inline. Mostly suggestions, you can ignore them as you want." [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:26:36] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 1F616EMO)
[08:27:24] <wikibugs>	 (CR) Gehel: "It looks that a bunch of changes from the parent commit have ended up in this one." [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:31:45] <wikibugs>	 (CR) Anzx: [C:+1] "thanks for creating patch for removing usergroup, please schedule this for deploying through https://schedule-deployment.toolforge.org/bac" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 1F616EMO)
[08:41:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:44:33] <wikibugs>	 (CR) Aklapper: [C:-1] "Thanks so much, sorry this takes me a while. I ran both extract and generate locally before and after." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: Pppery)
[08:49:43] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:51:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:58:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:58:34] <wikibugs>	 (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.8 [puppet] - 10https://gerrit.wikimedia.org/r/1244570 (https://phabricator.wikimedia.org/T418448)
[09:00:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe
[09:00:58] <moritzm>	 !log restart FPM on Phabricator hosts to pick up OpenSSL updates
[09:01:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:01:38] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1028.eqiad.wmnet with OS trixie
[09:01:53] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[09:04:10] <wikibugs>	 (03CR) 10Elukey: [C:03+1] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[09:05:52] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[09:06:38] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=1) rolling restart_daemons on A:swift-fe
[09:08:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:21] <wikibugs>	 (03PS3) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664)
[09:09:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[09:09:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[09:10:06] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:10:13] <wikibugs>	 (03PS1) 10Vgutierrez: traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575
[09:10:39] <wikibugs>	 (03CR) 101F616EMO: "Scheduled to this afternoon." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[09:11:48] <wikibugs>	 (03PS1) 10Slyngshede: P:idp release givenName on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214)
[09:11:48] <wikibugs>	 (03PS1) 10Hashar: wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424)
[09:11:50] <wikibugs>	 (03PS1) 10Hashar: wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424)
[09:11:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez)
[09:12:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:13:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage
[09:13:48] <wikibugs>	 (03CR) 10Hashar: [C:03+2] wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar)
[09:13:53] <wikibugs>	 (03CR) 10Hashar: [C:03+2] wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar)
[09:14:12] <wikibugs>	 (03PS1) 10Brouberol: deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450)
[09:14:24] <wikibugs>	 (03Merged) 10jenkins-bot: wm-checks-api: document mapping of bot to rerun command [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244577 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar)
[09:14:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:26] <wikibugs>	 (03Merged) 10jenkins-bot: wm-checks-api: add Rerun command for codehealth [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244578 (https://phabricator.wikimedia.org/T418424) (owner: 10Hashar)
[09:15:05] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@74473c2]: wm-checks-api: add Rerun command for codehealth + inline documentation
[09:15:19] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@74473c2]: wm-checks-api: add Rerun command for codehealth + inline documentation (duration: 00m 14s)
[09:16:28] <wikibugs>	 (03PS2) 10Vgutierrez: traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575
[09:16:35] <wikibugs>	 (03PS2) 10Brouberol: deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450)
[09:17:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[09:18:54] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1028.eqiad.wmnet with reason: host reimage
[09:19:15] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8153/co" [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol)
[09:19:48] <wikibugs>	 (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.8 [puppet] - 10https://gerrit.wikimedia.org/r/1244570 (https://phabricator.wikimedia.org/T418448) (owner: 10Jelto)
[09:21:06] <wikibugs>	 (03PS4) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664)
[09:21:40] <wikibugs>	 (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[09:22:03] <wikibugs>	 (03PS5) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664)
[09:22:13] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus/resource_config: add resource_title param [puppet] - 10https://gerrit.wikimedia.org/r/1243898 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[09:22:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:22:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, the deployment servers are still on Bullseye but kubetail is already packaged for it." [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol)
[09:22:26] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus/ops: monitor thanos store instances with resouce_config [puppet] - 10https://gerrit.wikimedia.org/r/1243899 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[09:22:35] <wikibugs>	 (03CR) 10Brouberol: [V:03+1 C:03+2] deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol)
[09:22:35] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] deployment_server: install kubetail to be able to stream multicontainer pod logs [puppet] - 10https://gerrit.wikimedia.org/r/1244582 (https://phabricator.wikimedia.org/T418450) (owner: 10Brouberol)
[09:23:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:24:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:24:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:28:59] <wikibugs>	 (03CR) 10Elukey: [C:03+2] docker_registry: remove the /test prefix special handling [puppet] - 10https://gerrit.wikimedia.org/r/1243726 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey)
[09:30:55] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez)
[09:31:15] <elukey>	 jouncebot: nowandnext
[09:31:15] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 28 minute(s)
[09:31:15] <jouncebot>	 In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1100)
[09:32:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:34:27] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:34:39] <wikibugs>	 (03PS1) 101F616EMO: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089)
[09:34:43] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:35:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] traffic: Exclude stats frontend on HaproxyKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/1244575 (owner: 10Vgutierrez)
[09:35:45] <wikibugs>	 (03CR) 10Volans: [C:03+2] "Tested on toolsbeta static" [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans)
[09:38:22] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:39:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:39:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:39:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:41:11] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1028.eqiad.wmnet with OS trixie
[09:42:32] <elukey>	 hello deployers! I am going to make a change to the Docker Registry to route MediaWiki docker images to a new (hopefully faster and more reliable) backend
[09:42:50] <elukey>	 I don't see anything ongoing, but please ping me if you need to deploy MW
[09:42:57] <wikibugs>	 (03CR) 10Elukey: [C:03+2] docker_registry: move the /v2/restricted prefix to s3/apus [puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey)
[09:43:46] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:43:50] <logmsgbot>	 !log urbanecm@deploy2002 mwscript-k8s job started: foreachwikiindblist growthexperiments WikimediaMaintenance:createExtensionTables.php growthexperiments
[09:44:25] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=pki,name=codfw
[09:46:04] <wikibugs>	 (03PS11) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924)
[09:46:04] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::resource_config: replace $facts['site'] with $::site [puppet] - 10https://gerrit.wikimedia.org/r/1244592 (https://phabricator.wikimedia.org/T412924)
[09:47:21] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::resource_config: replace $facts['site'] with $::site [puppet] - 10https://gerrit.wikimedia.org/r/1244592 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[09:47:50] <elukey>	 !log move the Docker Registry's /v2/restricted (MediaWiki Docker image prefix) to s3/apus - T390251
[09:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:54] <stashbot>	 T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[09:48:23] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:48:47] <icinga-wm>	 PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:50:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[09:51:18] <logmsgbot>	 !log elukey@deploy2002 Started scap sync-world: Test new Docker Registry backend
[09:53:23] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:53:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:53:47] <icinga-wm>	 PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:54:23] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:54:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:54:47] <icinga-wm>	 PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:57:27] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=pki,name=codfw
[09:58:22] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:58:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:00:47] <icinga-wm>	 PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:01:11] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:02:04] <wikibugs>	 (03PS12) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924)
[10:02:04] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::ops/thanos_store: replace port param with port_parameter [puppet] - 10https://gerrit.wikimedia.org/r/1244595 (https://phabricator.wikimedia.org/T412924)
[10:03:47] <icinga-wm>	 PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:04:09] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on cephosd1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:05:51] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::ops/thanos_store: replace port param with port_parameter [puppet] - 10https://gerrit.wikimedia.org/r/1244595 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[10:08:23] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:08:47] <icinga-wm>	 RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:10:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:11:23] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:11:47] <icinga-wm>	 RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:11:53] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[10:12:47] <icinga-wm>	 RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:13:11] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:13:47] <icinga-wm>	 RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:14:23] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:47] <wikibugs>	 (03PS1) 10Fabfur: hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253)
[10:17:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418433#11653532 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[10:18:22] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:18:23] <wikibugs>	 (03PS2) 10Eevans: admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112)
[10:18:30] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur)
[10:18:34] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None
[10:19:09] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on cephosd1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:19:47] <icinga-wm>	 RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:20:12] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[10:20:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:53] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[10:25:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Add DB grant for pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664)
[10:26:27] <wikibugs>	 (03CR) 10Muehlenhoff: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[10:26:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur)
[10:26:46] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:26:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:54] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[10:29:26] <wikibugs>	 (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[10:30:24] <logmsgbot>	 !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=wikidatawiki --logwiki=metawiki 'Luftlewis 1' 'Renamed user 4f8e749b4f28ee9e6ebc680c8c3c943d'  # T418435
[10:30:29] <stashbot>	 T418435: Unblock stuck global rename of Renamed user 4f8e749b4f28ee9e6ebc680c8c3c943d - https://phabricator.wikimedia.org/T418435
[10:30:41] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1244597 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur)
[10:31:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:23] <logmsgbot>	 !log elukey@deploy2002 Finished scap sync-world: Test new Docker Registry backend (duration: 43m 02s)
[10:33:04] <fabfur>	 !log depooling cp7001 to upgrade haproxy (T417253)
[10:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:08] <stashbot>	 T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253
[10:33:17] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7001.*
[10:34:17] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp7001*} and A:cp - 3.0 upgrade ()
[10:35:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:38:46] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[10:39:11] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp7001*} and A:cp - 3.0 upgrade ()
[10:39:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:39:55] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.*
[10:40:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:40:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:41:43] <wikibugs>	 (03PS1) 10Jelto: aptrepo: update gitlab-runner-helper-image architecture to all [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344)
[10:41:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:41:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[10:43:06] <wikibugs>	 (03CR) 10Jelto: "This config finds the new package (manually tested with `checkupdate` on `apt1001`). The helper-images package is released as "all". Is th" [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto)
[10:43:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto)
[10:43:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:43:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[10:44:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[10:44:54] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11653669 (10elukey) The /v2/restricted prefix is again served by S3, nothing to report when pushing the new...
[10:47:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:47:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:48:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:49:23] <wikibugs>	 (03CR) 10Jelto: [C:03+2] aptrepo: update gitlab-runner-helper-image architecture to all [puppet] - 10https://gerrit.wikimedia.org/r/1244600 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto)
[10:55:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:57:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1100)
[11:00:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:01:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:cloudelastic
[11:01:41] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[11:03:14] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[11:04:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:cloudelastic
[11:07:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003"
[11:07:35] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003"
[11:07:36] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:08:11] <wikibugs>	 (03CR) 10Muehlenhoff: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[11:08:22] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:08:40] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz)
[11:08:42] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:08:56] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:09:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:28] <logmsgbot>	 jclark@cumin1003 provision (PID 2197974) is awaiting input
[11:11:29] <logmsgbot>	 jclark@cumin1003 provision (PID 2198026) is awaiting input
[11:11:41] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[11:12:50] <wikibugs>	 (03CR) 10Aklapper: "From a quick read this looks good, found just two small typos." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) (owner: 10Pppery)
[11:13:08] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-fe1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:13:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:13:23] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz)
[11:14:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653766 (10Jclark-ctr) a:03Jclark-ctr
[11:14:31] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244601
[11:14:32] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244602
[11:15:10] <wikibugs>	 (03Merged) 10jenkins-bot: Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz)
[11:17:20] <tappof>	 !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: running tests on titan1002
[11:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:25] <stashbot>	 T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924
[11:17:43] <wikibugs>	 (03CR) 10Elukey: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[11:17:58] <wikibugs>	 (03PS2) 10Vgutierrez: admin: Add mikez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098)
[11:19:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) (owner: 10Vgutierrez)
[11:19:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653777 (10Jclark-ctr) a:03Jclark-ctr
[11:19:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] admin: Add mikez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) (owner: 10Vgutierrez)
[11:19:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:19:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:19:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:20:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:20:49] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653783 (10Jclark-ctr) @jcrespo      Preseed yaml file is not setup for efi booting  Can you update? to have for these servers?        - partman/standard-efi.cfg     - partm...
[11:21:40] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None
[11:21:58] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[11:22:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:23:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:23:26] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:23:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:23:45] <logmsgbot>	 jclark@cumin1003 provision (PID 2197974) is awaiting input
[11:24:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[11:24:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:24:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:25:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[11:25:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad
[11:26:42] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:26:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad
[11:28:04] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653820 (10jcrespo) Will, do sorry, these should use standard recipes, so it should be easy to update.
[11:28:23] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:28:32] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11653822 (10jcrespo)
[11:29:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:30:29] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11653828 (10jcrespo)
[11:31:39] <wikibugs>	 (03PS1) 10Vgutierrez: traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605
[11:34:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:36:59] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:39:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:42:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214) (owner: 10Slyngshede)
[11:43:02] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:44:17] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:46:09] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:46:33] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[11:49:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:50:27] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp release givenName on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1244576 (https://phabricator.wikimedia.org/T338214) (owner: 10Slyngshede)
[11:51:13] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653940 (10Jclark-ctr)
[11:51:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:51:19] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[11:51:38] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None
[11:52:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11653943 (10Vgutierrez)
[11:52:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11653945 (10Vgutierrez) 05Open→03Resolved change has been merged, and it should be live by now
[11:52:58] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None
[11:54:05] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-fe1004.eqiad.wmnet with OS bookworm
[11:54:13] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host apus-fe1004.eqiad.wmnet with OS bookworm
[11:55:06] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-fe1005.eqiad.wmnet with OS bookworm
[11:55:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11653955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host apus-fe1005.eqiad.wmnet with OS bookworm
[11:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:22] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[11:56:43] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[12:00:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:01:19] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[12:02:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:08:19] <wikibugs>	 (03CR) 10Marostegui: "Can I merge this myself? So I combine it with the actual grant deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[12:09:33] <wikibugs>	 (03CR) 10Hashar: [C:04-1] "I forgot, I earlier proposed a similar patch but for the Apache > Jetty connection: I9e5167f9c9c2f346d314cb7c3bf410209b1dffce" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb)
[12:11:06] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1004.eqiad.wmnet with reason: host reimage
[12:12:35] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1005.eqiad.wmnet with reason: host reimage
[12:13:52] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1004.eqiad.wmnet with reason: host reimage
[12:13:59] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1096.eqiad.wmnet with OS bullseye
[12:14:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-be1096.eqiad.wmnet with OS bullseye
[12:15:40] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1097.eqiad.wmnet with OS bullseye
[12:15:49] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654021 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye
[12:16:19] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[12:17:52] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1005.eqiad.wmnet with reason: host reimage
[12:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:23:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[12:24:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:24:34] <claime>	 !ack 
[12:24:34] <sirenbot>	 7489 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad)
[12:25:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:26:19] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[12:29:06] <wikibugs>	 (03PS1) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467)
[12:29:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar)
[12:30:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654092 (10Jelto)
[12:34:35] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:34:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654094 (10Jelto)
[12:34:49] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:34:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1004.eqiad.wmnet with OS bookworm
[12:35:00] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host apus-fe1004.eqiad.wmnet with OS bookworm completed: - apus...
[12:35:42] <wikibugs>	 (03CR) 10Anzx: [C:03+1] "recheck, looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[12:35:55] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1016.eqiad.wmnet with OS trixie
[12:36:03] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie
[12:38:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage
[12:38:40] <wikibugs>	 (03CR) 10Volans: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake)
[12:38:47] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:38:53] <wikibugs>	 (03PS1) 10AikoChou: ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632
[12:41:51] <logmsgbot>	 jclark@cumin1003 reimage (PID 2246309) is awaiting input
[12:42:10] <wikibugs>	 (03PS1) 10AikoChou: httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633
[12:42:49] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:43:17] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:43:35] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage
[12:43:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[12:43:58] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:44:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:46:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:46:21] <wikibugs>	 (03PS1) 10Urbanecm: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194)
[12:46:34] <wikibugs>	 (03PS1) 10Urbanecm: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194)
[12:46:49] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633 (owner: 10AikoChou)
[12:46:49] <urbanecm>	 jouncebot: nowandnext
[12:46:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[12:46:49] <jouncebot>	 In 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300)
[12:46:51] <wikibugs>	 (03PS2) 10Jsn.sherman: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar)
[12:47:02] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[12:47:05] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[12:47:17] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou)
[12:47:56] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou)
[12:48:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:48:55] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] httpbb: remove the revertrisk-wikidata test from staging [puppet] - 10https://gerrit.wikimedia.org/r/1244633 (owner: 10AikoChou)
[12:49:58] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: make inference_services a list in values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244632 (owner: 10AikoChou)
[12:50:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654131 (10Jelto) p:05Triage→03Medium
[12:51:02] <claime>	 !ack
[12:51:02] <sirenbot>	 no value provided for parameter incident and no default available
[12:51:02] <sirenbot>	 All incidents are already acked.
[12:51:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[12:51:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[12:51:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[12:52:13] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11654136 (10Jelto) Hi, thanks for opening the access request.  I think the only missing approval i...
[12:53:14] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:54:11] <logmsgbot>	 jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[12:54:14] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[12:54:58] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:54:59] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1005.eqiad.wmnet with OS bookworm
[12:55:06] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host apus-fe1005.eqiad.wmnet with OS bookworm completed: - apus...
[12:56:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:56:55] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:56:56] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097
[12:57:05] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097
[12:58:42] <wikibugs>	 (03PS3) 10Jsn.sherman: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar)
[12:59:38] <wikibugs>	 (03Merged) 10jenkins-bot: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244639 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[12:59:58] <wikibugs>	 (03Merged) 10jenkins-bot: ReassignMentees: Log more information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244638 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300)
[13:00:07] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[13:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:01:07] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]]
[13:01:07] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116418 bytes in 0.531 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[13:01:11] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[13:01:38] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:03:27] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[13:03:50] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[13:05:13] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:05:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:05:19] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003"
[13:05:55] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[13:08:24] <logmsgbot>	 jclark@cumin1003 netbox (PID 2287740) is awaiting input
[13:08:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ...
[13:08:51] <jinxer-wm>	 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[13:08:53] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt apus-fe1004,5 - jclark@cumin1003"
[13:08:53] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:08:55] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[13:08:56] <claime>	 !ack
[13:08:57] <sirenbot>	 7490 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad)
[13:09:12] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:09:16] <wikibugs>	 (03PS1) 10Majavah: toolforge: k8s: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1244643
[13:09:16] <wikibugs>	 (03PS1) 10Majavah: toolforge: k8s: Allow observers to read Gateway API resources [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276)
[13:10:10] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[13:10:11] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1096.eqiad.wmnet with OS bullseye
[13:10:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:10:30] <wikibugs>	 (03PS1) 10AikoChou: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244645
[13:10:37] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-be1096.eqiad.wmnet with OS bullseye completed: - ms-be1096 (*...
[13:10:42] <wikibugs>	 (03PS1) 10Dpogorzelski: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646
[13:10:55] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646 (owner: 10Dpogorzelski)
[13:10:58] <wikibugs>	 (03CR) 10Dpogorzelski: [V:03+2 C:03+2] Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244646 (owner: 10Dpogorzelski)
[13:11:19] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[13:11:39] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:11:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89034 and previous config saved to /var/cache/conftool/dbconfig/20260226-131147-marostegui.json
[13:11:52] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[13:12:06] <wikibugs>	 (03Abandoned) 10AikoChou: Revert "ml-services: make inference_services a list in values-ml-staging-codfw.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244645 (owner: 10AikoChou)
[13:12:07] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] (duration: 11m 00s)
[13:12:11] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[13:12:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:13:01] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:13:26] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:13:38] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654228 (10Jclark-ctr)
[13:13:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ...
[13:13:51] <jinxer-wm>	 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[13:13:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89035 and previous config saved to /var/cache/conftool/dbconfig/20260226-131357-marostegui.json
[13:14:03] <wikibugs>	 (03CR) 10Blake: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake)
[13:14:14] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654242 (10Jclark-ctr)
[13:14:23] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386#11654243 (10Jclark-ctr) 05Open→03Resolved
[13:15:44] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:17:30] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097
[13:18:02] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097
[13:19:12] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:20:38] <wikibugs>	 (03PS1) 10Gergő Tisza: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007)
[13:20:47] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097
[13:20:48] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ms-be1097
[13:20:59] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:21:55] <wikibugs>	 (03PS1) 10Gergő Tisza: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007)
[13:22:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:23:38] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:25:55] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1097.eqiad.wmnet with OS bullseye
[13:26:02] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11654282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye executed with errors: - m...
[13:29:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P89036 and previous config saved to /var/cache/conftool/dbconfig/20260226-132905-marostegui.json
[13:29:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:31:33] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1244655
[13:31:36] <wikibugs>	 (03CR) 10Elukey: "Sure! I was wondering if it is needed since it is the same as pki1001's afaics, but please go ahead :)" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[13:32:43] <wikibugs>	 (03CR) 10Marostegui: "I didn't realise it is exactly the same username, in that case it shouldn't be needed, you are right." [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[13:33:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1244655 (owner: 10Marostegui)
[13:34:04] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1017.eqiad.wmnet with OS trixie
[13:34:11] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie
[13:35:26] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1018.eqiad.wmnet with OS trixie
[13:35:31] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1019.eqiad.wmnet with OS trixie
[13:35:36] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie
[13:35:40] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1019.eqiad.wmnet with OS trixie
[13:35:46] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276) (owner: 10Majavah)
[13:36:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:36:19] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: k8s: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1244643 (owner: 10Majavah)
[13:36:28] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: k8s: Allow observers to read Gateway API resources [puppet] - 10https://gerrit.wikimedia.org/r/1244644 (https://phabricator.wikimedia.org/T418276) (owner: 10Majavah)
[13:37:32] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244643 (owner: 10Majavah)
[13:38:15] <vgutierrez>	 !log fetch haproxy 3.0.17 on thirdparty/haproxy30 bullseye-wikimedia (apt.wm.o)
[13:38:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:18] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage
[13:40:31] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:41:31] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[7001,7009].*} and A:cp - 3.0.17 upgrade (T417253)
[13:41:35] <stashbot>	 T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253
[13:42:05] <hashar>	 I am making a change on CI to have Wikibase Selenium tests run in an independent job T287582
[13:42:06] <stashbot>	 T287582: Move some Wikibase selenium tests to a standalone job - https://phabricator.wikimedia.org/T287582
[13:43:04] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage
[13:43:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:44:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P89038 and previous config saved to /var/cache/conftool/dbconfig/20260226-134414-marostegui.json
[13:44:33] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:46:23] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1020.eqiad.wmnet with OS trixie
[13:46:30] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1020.eqiad.wmnet with OS trixie
[13:48:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:09] <Dreamy_Jazz>	 jouncebot: nowandnext
[13:51:09] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1300)
[13:51:09] <jouncebot>	 In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1400)
[13:51:59] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1016.eqiad.wmnet with OS trixie
[13:52:08] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie executed with errors: - backup1...
[13:52:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage
[13:52:12] <wikibugs>	 (03PS1) 10AikoChou: httpbb: fix the revscoring-editquality-goodfaith test [puppet] - 10https://gerrit.wikimedia.org/r/1244659
[13:52:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage
[13:52:54] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1016.eqiad.wmnet with OS trixie
[13:52:57] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[7001,7009].*} and A:cp - 3.0.17 upgrade (T417253)
[13:53:02] <stashbot>	 T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253
[13:53:06] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie
[13:53:24] <logmsgbot>	 !log jclark@cumin1003 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage
[13:53:37] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1018.eqiad.wmnet with OS trixie
[13:53:45] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie executed with errors: - backup1...
[13:54:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:54:20] <marostegui>	 !log Deploy schema change on x1 on the master with replication enabled T418480
[13:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:25] <stashbot>	 T418480: Drop default for sic_updated_timestamp and drop indexes on sic_created_timestamp in the cusi_case table on WMF wikis - https://phabricator.wikimedia.org/T418480
[13:55:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418429#11654424 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[13:55:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:34] <nya_1F616EMO>	 Hi and nice to meet you all, I'm 1F616EMO and am here for the two patches regarding T418089 (1244373, 1244591). This is my first time dealing with a backport window, so if I do something wrong, please tell me and I will learn from it. I am ready and will be available for the whole duration of the window.
[13:55:34] <stashbot>	 T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089
[13:58:57] <nya_1F616EMO>	 Hi Tran
[13:59:02] <Tran>	 o/
[13:59:14] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage
[13:59:15] <wikibugs>	 (03CR) 10Volans: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake)
[13:59:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:59:19] <nya_1F616EMO>	 This is my first time on a backport window. If I do something wrong, please tell me and I will learn from it.
[13:59:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T418465)', diff saved to https://phabricator.wikimedia.org/P89039 and previous config saved to /var/cache/conftool/dbconfig/20260226-135922-marostegui.json
[13:59:27] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[13:59:39] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:59:43] <nya_1F616EMO>	 (and I am already submitting two patches, but no worries, small ones)
[13:59:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89040 and previous config saved to /var/cache/conftool/dbconfig/20260226-135946-marostegui.json
[13:59:59] <Tran>	 nw
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1400)
[14:00:05] <jouncebot>	 Tran, itamarWMDE, nya_1F616EMO, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:12] <Dreamy_Jazz>	 \o
[14:00:14] <Tran>	 o/
[14:00:19] <nya_1F616EMO>	 o/
[14:00:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:26] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482)
[14:00:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482 (10Lucas_Werkmeister_WMDE) 03NEW
[14:00:41] <Lucas_WMDE>	 o/ per ^ I can’t deploy today (but hopefully soon ^^)
[14:00:58] <Lucas_WMDE>	 hi nya_1F616EMO :)
[14:01:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Follow-up: T418482" [puppet] - 10https://gerrit.wikimedia.org/r/1239955 (owner: 10Lucas Werkmeister (WMDE))
[14:01:37] <urbanecm>	 hi, i can deploy today
[14:01:37] <nya_1F616EMO>	 If no deployers are around in this window, I will re-schedule mine to Monday, March 02 UTC afternoon
[14:01:43] <Lucas_WMDE>	 thanks urbanecm!
[14:01:48] <nya_1F616EMO>	 Thanks urbanecm
[14:01:57] <wikibugs>	 (03PS3) 10STran: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951)
[14:01:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89041 and previous config saved to /var/cache/conftool/dbconfig/20260226-140157-marostegui.json
[14:02:01] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) (owner: 10STran)
[14:02:14] <nya_1F616EMO>	 Are we following the order shown on the wiki page?
[14:02:19] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654464 (10karapayneWMDE) Approved on my end
[14:02:20] <Tran>	 usually
[14:02:25] <Tran>	 right now, seems so
[14:02:25] <nya_1F616EMO>	 Nice
[14:02:46] <Lucas_WMDE>	 I sometimes let volunteer patches take priority over staff ones fwiw ^^
[14:02:51] <Lucas_WMDE>	 but up to the deployer
[14:02:56] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) (owner: 10STran)
[14:03:21] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1020.eqiad.wmnet with reason: host reimage
[14:04:24] <wikibugs>	 (03PS2) 101F616EMO: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089)
[14:04:28] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:04:34] <urbanecm>	 itamarWMDE: hi, here as well?
[14:05:21] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: drop event organizer's duplicated right to remove eventparticipant from self [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:05:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:13] <urbanecm>	 nya_1F616EMO: hi, do you know if `accountcreator` is unset from any other wiki? i'm unsure about outright dropping it, because if someone attempts to set it from metawiki, MW will behave in a very crazy way (without telling the user). i'd be ok with removing everyone's ability to manipulate it, but dropping it might have side effects that i'd like to discuss first
[14:06:44] <nya_1F616EMO>	 urbanecm: No, accountcreator is there on all wikis; but we do have precedent for removing rights on loginwiki
[14:07:03] <nya_1F616EMO>	 The if-block right above the zhwiki if-block
[14:07:23] <urbanecm>	 yeah, loginwiki is very special, i was looking for a content project. in that case, i'm not going to deploy that patch today, because i'm unsure about its side effects (i'll detail it more on task).
[14:07:28] <nya_1F616EMO>	 Okay
[14:07:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage
[14:07:41] <urbanecm>	 it's up to you if you want to upload a "remove everyone's ability to add/remove people from it" patch (that i would be OK with even today)
[14:07:58] <urbanecm>	 otherwise, we can delay a few days until the side effects can be clarified.
[14:07:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244591 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:08:07] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605 (owner: 10Vgutierrez)
[14:08:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654507 (10Jelto)
[14:08:10] <nya_1F616EMO>	 I can't write a patch in such a short period, I will do it on Tuesday
[14:08:13] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244666
[14:08:16] <urbanecm>	 ok, no problem.
[14:08:27] <wikibugs>	 (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667
[14:08:27] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]]
[14:08:34] <stashbot>	 T413951: Deprecate v1 non emergency flow for IRS - https://phabricator.wikimedia.org/T413951
[14:08:34] <stashbot>	 T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089
[14:08:35] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1020.eqiad.wmnet with reason: host reimage
[14:08:39] <wikibugs>	 (03PS1) 10Jcrespo: installserver: Migrate ms-backup hosts to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718)
[14:08:44] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] traffic: Avoid division by zero on HaproxyKafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1244605 (owner: 10Vgutierrez)
[14:08:54] <nya_1F616EMO>	 Thanks for your rejection - rejecting at the right time is crucial to improvements.
[14:08:58] <wikibugs>	 (03CR) 10Urbanecm: [C:04-1] "I'm unsure about the effects of this for user rights management from metawiki. I'll write more details on the task, but I'm not comfortabl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:09:19] <urbanecm>	 nya_1F616EMO: it's more of a "let's clarify the impacts" rather than a rejection. thank you for your understanding.
[14:10:30] <logmsgbot>	 !log urbanecm@deploy2002 stran, urbanecm, 1f616emo: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:10:40] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:55] <urbanecm>	 Tran: nya_1F616EMO: can you verify your patches on mwdebug, please?
[14:10:56] <nya_1F616EMO>	 Btw, just figured out that I might have forgotten to remove sysop's ability to add this group to users. 
[14:11:01] <Tran>	 testing now
[14:11:08] <nya_1F616EMO>	 urbanecm: working on it
[14:11:12] <urbanecm>	 ty
[14:11:18] <nya_1F616EMO>	 urbanecm: Working
[14:11:27] <urbanecm>	 thanks for the confirmation
[14:11:28] <nya_1F616EMO>	 ^ I mean, it's ok
[14:11:30] <wikibugs>	 (03PS1) 10Slyngshede: P:idp map family_name to SN [puppet] - 10https://gerrit.wikimedia.org/r/1244670 (https://phabricator.wikimedia.org/T338214)
[14:11:36] <wikibugs>	 (03PS2) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608)
[14:11:36] <wikibugs>	 (03CR) 10Federico Ceratto: "This can be tested with the next clone, it's a small change." [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto)
[14:11:39] <urbanecm>	 i understood
[14:12:05] <wikibugs>	 (03CR) 10Federico Ceratto: "(the implementation on the Zarcillo side is done)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto)
[14:12:07] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1016.eqiad.wmnet with reason: host reimage
[14:12:08] <Tran>	 mine's working
[14:12:10] <wikibugs>	 (03CR) 10Marostegui: "If these hosts already exists I believe you have to run a cookbook to migrate them to EFI" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:12:16] <logmsgbot>	 !log urbanecm@deploy2002 stran, urbanecm, 1f616emo: Continuing with sync
[14:12:20] <urbanecm>	 perf, proceeding
[14:12:34] <urbanecm>	 it seems itamarWMDE's not around for their patch
[14:12:39] <wikibugs>	 (03CR) 101F616EMO: "Mark this as unresolved before we know for sure what will happen in a global level." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:12:48] <wikibugs>	 (03CR) 10Jcrespo: "It is for new hosts, I don't care much about the existing ones, will be decommissioned soon." [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:12:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654531 (10Lucas_Werkmeister_WMDE) FTR, I wrote down some notes on how I created this SSH key: https://wikitech.wikimedia.org/wiki/User:Lucas_...
[14:13:18] <wikibugs>	 (03CR) 10Marostegui: "ok then - then you should be good, do they already exist in site.pp etc?" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:13:28] <wikibugs>	 (03PS1) 10Krinkle: labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525)
[14:14:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654546 (10Jelto) Hi, thank you for the access request.  >  - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf st...
[14:14:47] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:15:31] <wikibugs>	 (03CR) 10Jcrespo: "They should:" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:16:11] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194)
[14:16:12] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:16:15] <wikibugs>	 (03PS2) 10Krinkle: labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525)
[14:16:15] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243847|Revert^2 "Remove deprecated IRS v2 configurations" (T413951)]], [[gerrit:1244591|zhwiki: drop event organizer's duplicated right to remove eventparticipant from self (T418089)]] (duration: 07m 48s)
[14:16:21] <nya_1F616EMO>	 It is so satisfying to observe how patches flow flawlessly without touching or SSH'ing onto the hardware; as an amateur server maintainer myself I feel a bit ashamed by the scale of automation
[14:16:22] <stashbot>	 T413951: Deprecate v1 non emergency flow for IRS - https://phabricator.wikimedia.org/T413951
[14:16:22] <stashbot>	 T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089
[14:16:49] <urbanecm>	 Tran: nya_1F616EMO: patches deployed
[14:16:54] <Tran>	 thanks!
[14:16:57] <nya_1F616EMO>	 Thanks :-D and nice to meet you
[14:16:59] <urbanecm>	 still no signs of itamarWMDE 
[14:17:04] <urbanecm>	 nice to meet you too, nya_1F616EMO!
[14:17:05] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] installserver: Migrate ms-backup hosts to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:17:05] <Lucas_WMDE>	 urbanecm: itamarWMDE should be there in a second
[14:17:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P89042 and previous config saved to /var/cache/conftool/dbconfig/20260226-141705-marostegui.json
[14:17:08] <urbanecm>	 ok
[14:17:13] <nya_1F616EMO>	 I will forward your concern onto phab and bring it to others' attention
[14:17:16] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:17:42] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "Thank you, that was useful <3, I sometimes forget steps." [puppet] - 10https://gerrit.wikimedia.org/r/1244668 (https://phabricator.wikimedia.org/T414718) (owner: 10Jcrespo)
[14:17:51] <logmsgbot>	 jclark@cumin1003 reimage (PID 2312336) is awaiting input
[14:17:58] <urbanecm>	 Dreamy_Jazz: i'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1244673 for me, maybe itamar's patch if they'll be there by then, and then hand it over to you if that sounds good.
[14:18:01] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[14:18:53] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244673 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[14:19:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:20:00] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]]
[14:20:05] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[14:20:13] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654588 (10jcrespo) @Jclark-ctr I just marged thew new recipe, please give it 30 minutes to propagate, and should be done. Apologies again for the mist...
[14:20:20] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:20:21] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1017.eqiad.wmnet with OS trixie
[14:20:27] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie completed: - backup1017 (**PASS...
[14:20:59] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654595 (10Lucas_Werkmeister_WMDE) (And I also intend to buy a chain tomorrow which will let me attach the YubiKey to my keychain with a bit m...
[14:21:17] <wikibugs>	 (03CR) 10Elukey: "Hey Federico! I saw the change passing by, since it changes setup.py I have a couple of questions:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto)
[14:21:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:03] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:22:10] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:14] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11654603 (10Jelto) @MoritzMuehlenhoff or @SLyngshede-WMF this task sounds like an offboarding procedure...
[14:22:24] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[14:22:25] <nya_1F616EMO>	 urbanecm: By "setting it from metawiki", which extension or component are you talking about?
[14:23:09] <urbanecm>	 nya_1F616EMO: special:userrights. if i have enough permissions, i can change your zhwiki permissions from https://meta.wikimedia.org/wiki/Special:UserRights/1F616EMO@zhwiki
[14:23:47] <nya_1F616EMO>	 Hmm, that page brings me directly to Special:Userrights on zhwiki instead of staying on the metawiki
[14:23:48] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm thank you, but can you also link Bug: T418483 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (owner: 10AOkoth)
[14:24:24] <nya_1F616EMO>	 urbanecm: I am not a steward but am a local sysop, could that cause the redirection?
[14:24:33] <urbanecm>	 nya_1F616EMO: yeah, that's the difference.
[14:24:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1018.eqiad.wmnet with OS trixie
[14:24:46] <urbanecm>	 i see this https://usercontent.irccloud-cdn.com/file/yvHVEe1Z/image.png
[14:24:48] <nya_1F616EMO>	 Oh, then I will have no chance to observe the interface then
[14:24:51] <wikibugs>	 (03CR) 10Taiwanese elephant: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:24:52] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie
[14:25:00] <urbanecm>	 the main problem is it displays the list of meta's groups (not from zhwiki)
[14:25:07] <urbanecm>	 people assume "account creator" exists everywhere
[14:25:23] <urbanecm>	 so, question is, what will happen if a steward attempts to add someone to account creator group on zhwiki (when it doesn't exist)?
[14:25:39] <nya_1F616EMO>	 urbanecm: We probably need a larger-scaled patch that fetches the exact list of rights from the target wiki
[14:25:54] <urbanecm>	 nya_1F616EMO: it would be useful, but fairly challenging. i think a task exists on that.
[14:26:15] <itamarWMDE>	 Apologies! I had some connectivity issues, I can also postpone to Monday if I'm too late.
[14:26:21] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] (duration: 06m 20s)
[14:26:27] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[14:26:29] <wikibugs>	 (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244678 (https://phabricator.wikimedia.org/T418483)
[14:26:29] <urbanecm>	 itamarWMDE: thanks for the info! is the connection all right now?
[14:26:31] <nya_1F616EMO>	 urbanecm: IIRC Special:GlobalContributions ran into such a complication when checking TAIV rights, which they had to use caches to solve
[14:26:41] <urbanecm>	 nya_1F616EMO: yeah, indeed.
[14:26:50] <itamarWMDE>	 Seems like it, changed locations.
[14:26:50] <nya_1F616EMO>	 Which is bad if we have to keep a cache just for stewards to grant a few sysop rights
[14:26:59] <nya_1F616EMO>	 ^ a few checkuser/ish rights
[14:27:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244678 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth)
[14:27:02] <wikibugs>	 (03PS6) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476)
[14:27:31] <urbanecm>	 nya_1F616EMO: let's move the discussion on the implementation to the task. if you can summarize the topic there, that would be appreciated. i can comment there later, too.
[14:27:44] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon)
[14:27:44] <nya_1F616EMO>	 Okay, see you there soon
[14:27:48] <urbanecm>	 thank you
[14:27:56] <wikibugs>	 (03PS2) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483)
[14:28:10] <wikibugs>	 (03CR) 10AOkoth: "Ack." [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth)
[14:28:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth)
[14:28:38] <wikibugs>	 (03Merged) 10jenkins-bot: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon)
[14:28:54] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:29:08] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:29:18] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:29:35] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:29:46] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:29:49] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242416|Add configurations for graphql usage survey and its pipeline tests (T414476)]]
[14:29:54] <stashbot>	 T414476: 📚 Add QuickSurvey to the dedicate page on Wikidata for GraphQL - https://phabricator.wikimedia.org/T414476
[14:30:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:30:13] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:31:52] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, itamar: Backport for [[gerrit:1242416|Add configurations for graphql usage survey and its pipeline tests (T414476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:32:02] <urbanecm>	 itamarWMDE: can you verify your patch on mwdebug, please?
[14:32:10] <itamarWMDE>	 On it
[14:32:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P89043 and previous config saved to /var/cache/conftool/dbconfig/20260226-143213-marostegui.json
[14:32:20] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[14:32:23] <logmsgbot>	 jclark@cumin1003 reimage (PID 2324829) is awaiting input
[14:32:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:32:57] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.update-replication (exit_code=99)
[14:33:02] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:33:10] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:33:27] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:33:35] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:33:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#11654657 (10Papaul) 05Open→03Resolved CODFW expansion is complete so we can close this task.
[14:34:26] <urbanecm>	 itamarWMDE: i'm seeing a lot of `Failed to find wd-graphql-quick-survey-yes (en)` and similar in logs. not sure if expected.
[14:34:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:34:35] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:34:43] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:34:46] <itamarWMDE>	 expected
[14:34:58] <itamarWMDE>	 which domain are you seeing it for? test wikidata?
[14:35:11] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:35:21] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:35:42] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:35:43] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1020.eqiad.wmnet with OS trixie
[14:35:47] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:35:48] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1016.eqiad.wmnet with OS trixie
[14:35:51] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654665 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1020.eqiad.wmnet with OS trixie completed: - backup1020 (**WARN...
[14:35:54] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1016.eqiad.wmnet with OS trixie completed: - backup1016 (**WARN...
[14:36:20] <urbanecm>	 itamarWMDE: https://test.wikidata.org/wiki/User:ItamarWMDE/test
[14:36:26] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654667 (10Jclark-ctr)
[14:36:36] <itamarWMDE>	 I'm having some trouble confirming though, using the generic k8s-mwdebug, is that correct?
[14:36:42] <urbanecm>	 yes, that should be it
[14:36:54] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:37:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1017.eqiad.wmnet with OS trixie
[14:37:05] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1019.eqiad.wmnet with OS trixie
[14:37:05] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:37:11] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie
[14:37:16] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication
[14:37:20] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1019.eqiad.wmnet with OS trixie
[14:37:24] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0)
[14:37:30] <itamarWMDE>	 Okay, well the logspam is a good sign, though the survey should show even if the interface messages are not there.
[14:38:58] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1018.eqiad.wmnet with reason: host reimage
[14:39:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:39:49] <itamarWMDE>	 I'd say roll back. It's really hard to debug these quicksurveys, I'll try to figure it out again locally before trying again.
[14:39:57] <urbanecm>	 okay, reverting
[14:39:59] <logmsgbot>	 !log urbanecm@deploy2002 Sync cancelled.
[14:41:01] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Add configurations for graphql usage survey and its pipeline tests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679
[14:41:01] <wikibugs>	 (03CR) 10TrainBranchBot: "urbanecm@deploy2002 created a revert of this change as I7aeb01cab59b990d4a02894bbc7f2ff134479f76" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon)
[14:41:19] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None
[14:41:21] <itamarWMDE>	 Thank you! apologies.
[14:41:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:41:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:41:30] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:41:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679 (owner: 10TrainBranchBot)
[14:42:15] <urbanecm>	 itamarWMDE: no worries, it happens
[14:42:20] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[14:42:26] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:42:28] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:42:28] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:43:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:43:24] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1018.eqiad.wmnet with reason: host reimage
[14:43:40] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add configurations for graphql usage survey and its pipeline tests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244679 (owner: 10TrainBranchBot)
[14:44:08] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]]
[14:45:19] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654698 (10Jclark-ctr)
[14:46:14] <logmsgbot>	 !log urbanecm@deploy2002 trainbranchbot, urbanecm: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:46:35] <logmsgbot>	 !log urbanecm@deploy2002 trainbranchbot, urbanecm: Continuing with sync
[14:46:45] <wikibugs>	 (03PS1) 10Tiziano Fogli: P::thanos::store::ruler (TMP): select only blocks generated locally [puppet] - 10https://gerrit.wikimedia.org/r/1244680 (https://phabricator.wikimedia.org/T412924)
[14:46:50] <wikibugs>	 (03CR) 10Taiwanese elephant: "How about just removing the noratelimit right from the accountcreator in zhwiki, as this has been done in several private wikis?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[14:47:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1003.eqiad.wmnet with OS trixie
[14:47:17] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1003.eqiad.wmnet with OS trixie
[14:47:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89044 and previous config saved to /var/cache/conftool/dbconfig/20260226-144721-marostegui.json
[14:47:26] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[14:47:35] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie
[14:47:38] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[14:47:42] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11654717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie
[14:47:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89045 and previous config saved to /var/cache/conftool/dbconfig/20260226-144746-marostegui.json
[14:48:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:49:17] <Dreamy_Jazz>	 urbanecm: Yeah, that's fine with me. I'm in a meeting so the later the better :D
[14:49:23] <urbanecm>	 ah
[14:49:40] <urbanecm>	 just finishing last sync :)
[14:49:45] <Dreamy_Jazz>	 Thanks
[14:49:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89046 and previous config saved to /var/cache/conftool/dbconfig/20260226-144956-marostegui.json
[14:50:49] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244679|Revert "Add configurations for graphql usage survey and its pipeline tests"]] (duration: 06m 41s)
[14:50:54] <urbanecm>	 and done
[14:50:56] <urbanecm>	 Dreamy_Jazz: over to you
[14:51:00] <Dreamy_Jazz>	 Thanks
[14:52:08] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage
[14:52:43] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] P::thanos::store::ruler (TMP): select only blocks generated locally [puppet] - 10https://gerrit.wikimedia.org/r/1244680 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[14:52:50] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage
[14:53:40] <wikibugs>	 (03PS3) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483)
[14:54:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:54:27] <wikibugs>	 06SRE: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11654755 (10Jdforrester-WMF)
[14:54:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:55:47] <Dreamy_Jazz>	 Scap is running
[14:56:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:58:11] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1017.eqiad.wmnet with reason: host reimage
[14:58:20] <wikibugs>	 (03PS1) 10Urbanecm: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194)
[14:58:33] <wikibugs>	 (03PS1) 10Urbanecm: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194)
[15:01:27] <wikibugs>	 (03PS1) 10Dreamy Jazz: SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118)
[15:01:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle)
[15:02:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11654837 (10Aklapper) CC'ing @Khantstop who may want to set an "Also Known As" value on https://phabricator.wikimedia.org/people/editprofile/42910/ for better discoverability
[15:02:25] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1019.eqiad.wmnet with reason: host reimage
[15:02:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:02:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:03:22] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1003.eqiad.wmnet with reason: host reimage
[15:03:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:03:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:03:41] <wikibugs>	 (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 in upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253)
[15:03:44] <moritzm>	 !log upgrade conf* nodes to facter 4 T381538
[15:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:49] <stashbot>	 T381538: Backport facter to bullseye - https://phabricator.wikimedia.org/T381538
[15:04:12] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: set haproxy version to 3.0 in upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur)
[15:04:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:04:43] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur)
[15:05:05] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P89047 and previous config saved to /var/cache/conftool/dbconfig/20260226-150504-marostegui.json
[15:05:06] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:05:09] <wikibugs>	 (03PS1) 10Elukey: Revert "ml-serve: fix istio/transparentproxy config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244690
[15:05:14] <urbanecm>	 Dreamy_Jazz: will you be deploying something else? (if not, i'd +2 two backports to aid investigating a bug...)
[15:05:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:05:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1018.eqiad.wmnet with OS trixie
[15:05:26] <sukhe>	 !log sudo cumin "C:bird%do_ipv6=true" "disable-puppet 'merging CR 1241003'"
[15:05:27] <Dreamy_Jazz>	 Yes, I have a public backport to make too
[15:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:32] <urbanecm>	 ok, i'll wait then
[15:05:40] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1018.eqiad.wmnet with OS trixie completed: - backup1018 (**WARN...
[15:05:42] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz)
[15:06:31] <Dreamy_Jazz>	 Mine will take a bit of time to merge
[15:06:36] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:bird::anycast: automatically detect IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh)
[15:06:43] <Dreamy_Jazz>	 So urbanecm if you want to deploy any config changes first?
[15:06:52] <urbanecm>	 Dreamy_Jazz: no, just the backports
[15:07:01] <Dreamy_Jazz>	 Ok
[15:07:08] <Dreamy_Jazz>	 Do you want to combine the backports?
[15:07:11] <Dreamy_Jazz>	 To save time?
[15:07:17] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1003.eqiad.wmnet with reason: host reimage
[15:07:17] <urbanecm>	 if you're ok with it, sure. it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244686 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244685
[15:07:25] <urbanecm>	 (it just changes log level, so should be fairly low-risk too)
[15:07:57] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:08:01] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:08:06] <urbanecm>	 thanks
[15:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz)
[15:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:08:51] <Dreamy_Jazz>	 Np
[15:12:49] <wikibugs>	 (03PS1) 10D3r1ck01: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007)
[15:13:01] <wikibugs>	 (03PS5) 10D3r1ck01: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007)
[15:15:55] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11654951 (10herron)
[15:16:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654952 (10Jelto) p:05Triage→03Medium
[15:16:11] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:16:11] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[15:16:11] <jouncebot>	 In 0 hour(s) and 13 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1530)
[15:16:33] <Dreamy_Jazz>	 I also want to deploy private code changes again after this scap, so hopefully okay with colliding with the next window?
[15:16:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654958 (10Jelto) Thanks for the access request, we have to confirm the key out of band and another approval from @thcipriani is needed for th...
[15:18:22] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:19:19] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1017.eqiad.wmnet with OS trixie
[15:19:26] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1017.eqiad.wmnet with OS trixie completed: - backup1017 (**WARN...
[15:19:45] <wikibugs>	 (03CR) 10Reedy: Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239026 (https://phabricator.wikimedia.org/T416544) (owner: 10Reedy)
[15:19:51] <sukhe>	 !log sudo cumin -b1 -s5 "C:bird%do_ipv6=true" "run-puppet-agent --enable 'merging CR 1241003'"
[15:19:52] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654977 (10Jclark-ctr)
[15:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:13] <wikibugs>	 (03Merged) 10jenkins-bot: SI: Populate siu_info in cusi_user from matched signals [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244688 (https://phabricator.wikimedia.org/T411118) (owner: 10Dreamy Jazz)
[15:20:13] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P89048 and previous config saved to /var/cache/conftool/dbconfig/20260226-152012-marostegui.json
[15:20:15] <wikibugs>	 (03Merged) 10jenkins-bot: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244686 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:20:16] <hashar>	 Dreamy_Jazz: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1244685 failed due to some CI infra issue
[15:20:19] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] "Post-merge comments: thanks for the review folks. This is now merged and documentation updated." [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh)
[15:20:30] <Dreamy_Jazz>	 Thanks for the heads up
[15:20:56] <hashar>	 the last job is about to complete
[15:21:14] <hashar>	 and the change can be +2ed again once that job has been reported as a failure
[15:21:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:21:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654984 (10Jelto)
[15:21:29] <wikibugs>	 (03PS2) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:21:33] <wikibugs>	 (03PS3) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:21:37] <wikibugs>	 (03CR) 10Dreamy Jazz: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:21:39] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:21:45] <hashar>	 <3
[15:21:46] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:22:09] <wikibugs>	 (03CR) 10Muehlenhoff: "Ok, then I'll abandon the patch. Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[15:22:14] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:22:35] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244693
[15:22:38] <wikibugs>	 (03PS1) 10MVernon: apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386)
[15:22:40] <wikibugs>	 (03PS1) 10MVernon: apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386)
[15:22:56] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:23:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon)
[15:23:17] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:23:18] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1019.eqiad.wmnet with OS trixie
[15:23:22] <wikibugs>	 (03CR) 10Ottomata: "@joal@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur)
[15:23:28] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1019.eqiad.wmnet with OS trixie completed: - backup1019 (**WARN...
[15:23:35] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654993 (10Jclark-ctr)
[15:24:07] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11654995 (10Jclark-ctr)
[15:24:07] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273)
[15:24:30] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:24:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:24:49] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[15:24:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup1003.eqiad.wmnet with OS trixie
[15:25:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1003.eqiad.wmnet with OS trixie completed: - ms-backup1003...
[15:25:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11654999 (10Jelto) @Lucas_Werkmeister_WMDE is still a member of the `deployment` group. So approval from @thcipriani is not really needed. I'll...
[15:25:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655002 (10Jelto)
[15:25:54] <wikibugs>	 (03CR) 10Vgutierrez: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur)
[15:26:51] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "key has been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482) (owner: 10Lucas Werkmeister (WMDE))
[15:26:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon)
[15:26:56] <wikibugs>	 (03PS1) 10Clément Goubert: wmnet: Add rest-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145)
[15:27:39] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1244661 (https://phabricator.wikimedia.org/T418482) (owner: 10Lucas Werkmeister (WMDE))
[15:28:34] <wikibugs>	 (03PS5) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864)
[15:28:41] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655017 (10herron)
[15:28:50] <wikibugs>	 (03CR) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur)
[15:29:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:29:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655019 (10Jelto) 05Open→03Resolved a:03Jelto Key is merged into puppet, you should have access in 30 minutes. I'll resolve the task...
[15:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1530)
[15:30:22] <wikibugs>	 (03CR) 10MVernon: [C:03+2] apus: add two new frontends in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1244694 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon)
[15:32:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655035 (10Jelto) p:05Triage→03Medium thank you @Aklapper   @Khantstop we need your approval here for  >  - access request (or expansion) has sign off of WMF sponsor/m...
[15:33:01] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11655048 (10Jclark-ctr) 05Open→03Resolved
[15:33:28] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655053 (10Jclark-ctr)
[15:33:58] <wikibugs>	 (03Merged) 10jenkins-bot: ReassignMentees: Adjust logging level [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244685 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm)
[15:34:10] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1244061 (owner: 10Scott French)
[15:34:12] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:cache::haproxy: fix non-default scope key structure [puppet] - 10https://gerrit.wikimedia.org/r/1244061 (owner: 10Scott French)
[15:34:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:34:17] <wikibugs>	 (03PS1) 10Clément Goubert: api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145)
[15:34:28] <wikibugs>	 (03PS2) 10Clément Goubert: api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145)
[15:34:29] <logmsgbot>	 jhancock@cumin2002 provision (PID 4168454) is awaiting input
[15:34:37] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]]
[15:34:41] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[15:35:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89049 and previous config saved to /var/cache/conftool/dbconfig/20260226-153521-marostegui.json
[15:35:26] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[15:35:37] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[15:35:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89050 and previous config saved to /var/cache/conftool/dbconfig/20260226-153545-marostegui.json
[15:35:53] <wikibugs>	 (03PS2) 10Clément Goubert: wmnet: Add api-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145)
[15:36:40] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, urbanecm: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:36:44] <urbanecm>	 voilà
[15:36:53] <urbanecm>	 (i have nothing to test, it is running in a job)
[15:36:53] <Dreamy_Jazz>	 :D
[15:37:09] <Dreamy_Jazz>	 Same here. Problem code I'm fixing is running in a job too
[15:37:12] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, urbanecm: Continuing with sync
[15:37:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89051 and previous config saved to /var/cache/conftool/dbconfig/20260226-153756-marostegui.json
[15:40:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655108 (10Khantstop) Thank you @Jelto! Confirming that I grant approval for Dani to access the items listed in this task as part of her role as a data scientist. Let me k...
[15:40:33] <wikibugs>	 (03CR) 10Marostegui: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[15:41:06] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] (duration: 06m 29s)
[15:41:11] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[15:41:30] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Revert "ml-serve: fix istio/transparentproxy config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244690 (owner: 10Elukey)
[15:42:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655121 (10Khantstop)
[15:42:46] <wikibugs>	 (03CR) 10Ottomata: "Ah, sorry! I do not know how haproxy log vs haproxykafka logging works. https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_reques" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur)
[15:43:20] <Dreamy_Jazz>	 Still working on my private code changes
[15:43:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "+1 on traffic side of things" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur)
[15:44:32] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:44:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:44:42] <wikibugs>	 (03CR) 10Ottomata: component: mediawiki.page_html_content_change.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[15:44:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:45:37] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655135 (10herron)
[15:47:17] <wikibugs>	 (03PS2) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467)
[15:47:22] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe1004.eqiad.wmnet
[15:47:28] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe1005.eqiad.wmnet
[15:47:44] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe1004.eqiad.wmnet
[15:47:51] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe1005.eqiad.wmnet
[15:48:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[15:49:04] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[15:51:46] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:52:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie
[15:52:12] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie
[15:52:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:53:05] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P89052 and previous config saved to /var/cache/conftool/dbconfig/20260226-155304-marostegui.json
[15:53:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:54:31] <wikibugs>	 (03CR) 10JavierMonton: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[15:54:47] <wikibugs>	 (03CR) 10Ottomata: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[15:55:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2024.codfw.wmnet with OS bullseye
[15:55:22] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11655197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye
[15:56:38] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Add DB grant for pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1244599 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[15:56:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm
[15:56:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm
[15:59:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:00:04] <jouncebot>	 dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1600).
[16:01:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:45] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11655260 (10SLyngshede-WMF) We normally don't disable accounts, just remove any special privileges.
[16:02:49] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Don't know why this has been left back but +1 for me!!" [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall)
[16:03:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11655266 (10Jhancock.wm)
[16:03:27] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655267 (10herron)
[16:03:59] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11655273 (10A_smart_kitten) FWIW, https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users currently...
[16:04:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:04:35] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] wmnet: Add api-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[16:04:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11655296 (10Jhancock.wm) 05Open→03Resolved
[16:04:36] <wikibugs>	 (03CR) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[16:05:01] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:05:16] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[16:05:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T418482#11655302 (10Lucas_Werkmeister_WMDE) It’s working \o/ thank you!
[16:05:46] * Lucas_WMDE can deploy again \o/
[16:08:03] <urbanecm>	 uhuuu!
[16:08:06] <urbanecm>	 welcome back Lucas_WMDE 
[16:08:13] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P89053 and previous config saved to /var/cache/conftool/dbconfig/20260226-160812-marostegui.json
[16:08:22] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:08:27] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655310 (10herron)
[16:09:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:09:51] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[16:12:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:12:58] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:03] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:13:10] <claime>	 Here
[16:13:13] <claime>	 !ack 
[16:13:14] <sirenbot>	 7491 (ACKED)  [6x] ProbeDown sre (text-https:443 probes/service)
[16:14:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:14:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 3 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[16:14:42] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655349 (10herron)
[16:16:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:17:43] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655370 (10herron)
[16:17:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:17:58] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:17:58] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[16:18:22] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:19:30] <jinxer-wm>	 FIRING: [22x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 6 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[16:19:55] <claime>	 !ack
[16:19:55] <sirenbot>	 7492 (ACKED)  NELHigh sre (thanos-rule@main tcp.timed_out)
[16:20:41] <wikibugs>	 (03CR) 101F616EMO: "That's probably more doable and elegant as it does not require a CommonSettings hack. To play safe, I will go for this approach, and the c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO)
[16:21:05] <wikibugs>	 (03PS1) 10Tiziano Fogli: thanos::store: align ruler instance min-time with main instance max-time [puppet] - 10https://gerrit.wikimedia.org/r/1244707 (https://phabricator.wikimedia.org/T412924)
[16:21:49] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655387 (10herron)
[16:21:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1  - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:22:57] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T418465)', diff saved to https://phabricator.wikimedia.org/P89054 and previous config saved to /var/cache/conftool/dbconfig/20260226-162321-marostegui.json
[16:23:22] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:27] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[16:23:38] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[16:23:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89055 and previous config saved to /var/cache/conftool/dbconfig/20260226-162346-marostegui.json
[16:23:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:24:30] <jinxer-wm>	 RESOLVED: [20x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[16:24:43] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:24:58] <jinxer-wm>	 FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from FR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[16:25:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:25:11] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655400 (10herron)
[16:25:22] <wikibugs>	 (03CR) 10MVernon: [C:03+2] apus: remove two eqiad frontends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1244695 (https://phabricator.wikimedia.org/T416386) (owner: 10MVernon)
[16:25:47] <wikibugs>	 (03CR) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[16:25:56] <wikibugs>	 (03CR) 10Gergő Tisza: [C:04-1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[16:25:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89056 and previous config saved to /var/cache/conftool/dbconfig/20260226-162556-marostegui.json
[16:26:18] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[16:26:51] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1  - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:27:58] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:27:58] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[16:29:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:29:58] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655424 (10herron)
[16:29:58] <jinxer-wm>	 RESOLVED: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from FR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[16:30:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:32:29] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: haproxy: stop current abuse [puppet] - 10https://gerrit.wikimedia.org/r/1243884
[16:32:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709
[16:32:53] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11655446 (10herron)
[16:33:05] <wikibugs>	 (03CR) 10CDanis: [C:03+1] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto)
[16:33:22] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:33:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-fe[1001-1002].eqiad.wmnet
[16:34:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:34:43] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:35:57] <wikibugs>	 (03CR) 10Bartosz Dziewoński: haproxy: stop current abuse (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243884 (owner: 10Giuseppe Lavagetto)
[16:36:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto)
[16:37:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709
[16:38:22] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:38:24] <wikibugs>	 (03PS1) 10Federico Ceratto: orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582)
[16:39:35] <icinga-wm>	 PROBLEM - Memcached on titan1002 is CRITICAL: connect to address 10.64.48.167 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[16:40:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: do not log requests dropped due to too much recent concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1244709 (owner: 10Giuseppe Lavagetto)
[16:41:05] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P89058 and previous config saved to /var/cache/conftool/dbconfig/20260226-164105-marostegui.json
[16:41:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[16:42:20] <wikibugs>	 (03CR) 10Effie Mouzeli: "You may be right, I am unsure how we should proceed tbh" [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff)
[16:42:36] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:42:36] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1600)
[16:42:36] <jouncebot>	 In 0 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700)
[16:44:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:44:45] <wikibugs>	 (03CR) 10MVernon: [C:03+1] cassandra: Java 8 no longer supported (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans)
[16:45:53] <wikibugs>	 (03CR) 10MVernon: [C:03+1] cassandra: add new 'linked_artifacts' role (user) [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) (owner: 10Eevans)
[16:47:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[16:49:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[16:49:20] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest2001.codfw.wmnet with reason: T381919
[16:49:24] <stashbot>	 T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[16:50:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2024
[16:50:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2024
[16:50:41] <wikibugs>	 (03Merged) 10jenkins-bot: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244629 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton)
[16:51:32] <logmsgbot>	 !log javiermonton@deploy2002 Started scap sync-world: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]]
[16:51:36] <stashbot>	 T418467: Enrich "parent" HTML using diffs - https://phabricator.wikimedia.org/T418467
[16:51:49] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-fe[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[16:52:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[16:52:48] <wikibugs>	 (03PS3) 101F616EMO: zhwiki: Remove all rights from accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089)
[16:53:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-fe[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[16:53:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:53:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-fe[1001-1002].eqiad.wmnet
[16:53:14] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11655584 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `moss-fe[1001-1002].eqiad.wmnet` - moss-fe1001.eqiad.w...
[16:53:29] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969)
[16:53:35] <logmsgbot>	 !log javiermonton@deploy2002 javiermonton: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:54:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:54:25] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515 (10MatthewVernon) 03NEW
[16:54:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson)
[16:55:04] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] orchestrator: install orchestrator-client [puppet] - 10https://gerrit.wikimedia.org/r/1244710 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[16:56:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P89059 and previous config saved to /var/cache/conftool/dbconfig/20260226-165613-marostegui.json
[16:58:11] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655610 (10VRiley-WMF) a:03VRiley-WMF
[16:58:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655614 (10Jhancock.wm) @Andrew i got the installer to work but the config needs an edit. getting this error and it's going to fail.   [40/50, retrying in 120.00s] Attempt to run 'cookbooks.s...
[16:59:14] <logmsgbot>	 !log javiermonton@deploy2002 javiermonton: Continuing with sync
[17:00:05] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[17:03:27] <logmsgbot>	 !log javiermonton@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244629|component: mediawiki.page_html_content_change.dev0 (T418467)]] (duration: 11m 55s)
[17:03:32] <stashbot>	 T418467: Enrich "parent" HTML using diffs - https://phabricator.wikimedia.org/T418467
[17:07:42] <wikibugs>	 (03PS1) 10Kosta Harlan: hcaptcha: Sanitize values of x_is_browser sent on risk_score events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505)
[17:08:33] <wikibugs>	 (03PS11) 10Pppery: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421)
[17:09:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:09:24] <wikibugs>	 (03CR) 10Pppery: "The error was caused by me using syntax from a PHP version newer than what XHPAST supports. Fixed that in the latest patchset." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery)
[17:09:49] <wikibugs>	 (03PS5) 10Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529)
[17:10:08] <wikibugs>	 (03PS5) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532)
[17:11:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T418465)', diff saved to https://phabricator.wikimedia.org/P89061 and previous config saved to /var/cache/conftool/dbconfig/20260226-171121-marostegui.json
[17:11:27] <stashbot>	 T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[17:11:38] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[17:12:15] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-backup1004.eqiad.wmnet with OS trixie
[17:12:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie executed with errors: - ms-...
[17:12:51] <wikibugs>	 (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler)
[17:12:52] <wikibugs>	 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11655662 (10Blake)
[17:13:01] <wikibugs>	 (03CR) 10DCausse: [C:03+2] opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 (owner: 10DCausse)
[17:14:28] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[17:14:55] <wikibugs>	 (03Merged) 10jenkins-bot: opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 (owner: 10DCausse)
[17:15:24] <wikibugs>	 (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler)
[17:15:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2024.codfw.wmnet with OS bullseye
[17:15:38] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11655677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye executed with errors: - ms-fe...
[17:16:31] <kostajh>	 jouncebot: nowandnext
[17:16:32] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1700)
[17:16:32] <jouncebot>	 In 0 hour(s) and 43 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800)
[17:16:32] <jouncebot>	 In 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800)
[17:16:41] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[17:16:56] <kostajh>	 can I deploy a MW patch now, or is someone using the window? 
[17:17:43] <kostajh>	 I will go ahead
[17:18:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505) (owner: 10Kosta Harlan)
[17:18:38] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[17:19:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:19:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm
[17:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11655701 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS bookworm executed with errors: - cloudcepho...
[17:19:27] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:20:25] <wikibugs>	 (03Merged) 10jenkins-bot: hcaptcha: Sanitize values of x_is_browser sent on risk_score events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244718 (https://phabricator.wikimedia.org/T418505) (owner: 10Kosta Harlan)
[17:20:58] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]]
[17:21:02] <stashbot>	 T418505: hCaptcha: Fix validation errors for x_is_browser being asigned a string value - https://phabricator.wikimedia.org/T418505
[17:22:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720
[17:22:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 (owner: 10Muehlenhoff)
[17:23:07] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:24:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:25:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722
[17:25:55] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] thanos::store: align ruler instance min-time with main instance max-time [puppet] - 10https://gerrit.wikimedia.org/r/1244707 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[17:26:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 (owner: 10Muehlenhoff)
[17:27:02] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[17:27:07] <wikibugs>	 (03PS2) 10Muehlenhoff: Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720
[17:28:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1244720 (owner: 10Muehlenhoff)
[17:28:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722
[17:29:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:29:36] <icinga-wm>	 RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.000 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[17:29:40] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655790 (10MBinder_WMF) @Aklapper done :)  @MatthewVernon @Ottomata where can I see what groups I'm in?
[17:30:55] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244667 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth)
[17:30:58] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244718|hcaptcha: Sanitize values of x_is_browser sent on risk_score events (T418505)]] (duration: 10m 00s)
[17:31:03] <stashbot>	 T418505: hCaptcha: Fix validation errors for x_is_browser being asigned a string value - https://phabricator.wikimedia.org/T418505
[17:34:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:34:35] <wikibugs>	 (03PS1) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518)
[17:34:58] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] varnishkafka: Only enable for text [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall)
[17:35:15] <wikibugs>	 (03CR) 10Urbanecm: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm)
[17:35:53] <wikibugs>	 (03PS2) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518)
[17:36:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725
[17:36:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for shoffmanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1244722 (owner: 10Muehlenhoff)
[17:37:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff)
[17:37:48] <wikibugs>	 (03PS2) 10Muehlenhoff: Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725
[17:39:08] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff)
[17:39:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:08] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655825 (10Aklapper) > where can I see what groups I'm in?  Uhm, none it seems, while I would have expected ldap/wmf at least: https://ldap.toolfo...
[17:44:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:44:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:47:17] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Unconditionally assume Facter 4 [puppet] - 10https://gerrit.wikimedia.org/r/1244725 (owner: 10Muehlenhoff)
[17:49:39] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655856 (10VRiley-WMF)
[17:49:58] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, and 2 others: decommission moss-fe100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T418515#11655858 (10VRiley-WMF) 05Open→03Resolved These have been decommissioned
[17:53:02] <wikibugs>	 (03CR) 10Muehlenhoff: "These spec tests test a random tiny fraction of a mediawiki install and will break randomly if groups get reorganised. Even if still used " [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff)
[17:53:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655864 (10Ottomata) > the former is now self-service via IDM.  I suppose that would be https://idm.wikimedia.org/. I don't totally see how one can...
[17:54:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11655866 (10MoritzMuehlenhoff) >>! In T417655#11655864, @Ottomata wrote: >> the former is now self-service via IDM. >  > I suppose that would be http...
[17:55:49] <wikibugs>	 (03PS2) 10BryanDavis: toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824)
[17:56:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:56:18] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! I don't spot any obvious problems. Let's merge this :)" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery)
[17:58:36] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2043.codfw.wmnet
[17:59:15] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824) (owner: 10BryanDavis)
[18:00:05] <jouncebot>	 bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800)
[18:00:05] <jouncebot>	 swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1800).
[18:01:04] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824) (owner: 10BryanDavis)
[18:01:10] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:01:20] <swfrench-wmf>	 o/
[18:01:30] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728
[18:01:32] <swfrench-wmf>	 I'll get started on the work planned for this window in a bit.
[18:01:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:02:44] <bd808>	 I have updates for both toolhub and developer-portal today. a weird alignment of planets after a couple months of nothing for my window.
[18:04:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11655884 (10DTotten-WMF) Hi @Aklapper - thanks for your help with this ticket. I have updated my Phabricator account to show my LDAP user name on my profile.
[18:05:29] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mesh: Copy mesh.configuration 1.15.1 to 1.15.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242517 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:05:33] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2044.codfw.wmnet
[18:05:58] <wikibugs>	 (03PS1) 10DCausse: opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729
[18:06:10] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:06:34] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729 (owner: 10DCausse)
[18:07:24] <wikibugs>	 (03Merged) 10jenkins-bot: mesh: Copy mesh.configuration 1.15.1 to 1.15.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242517 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:07:40] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:08:06] <wikibugs>	 (03PS3) 10Scott French: mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245)
[18:08:34] <wikibugs>	 (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:08:52] <wikibugs>	 (03Merged) 10jenkins-bot: opensearch-semantic-search: set library path to knn native libs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244729 (owner: 10DCausse)
[18:09:13] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply
[18:09:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:09:30] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[18:09:40] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[18:09:52] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[18:10:15] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[18:11:23] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[18:11:28] <wikibugs>	 (03Merged) 10jenkins-bot: mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:11:57] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[18:12:12] <wikibugs>	 (03PS3) 10Scott French: mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245)
[18:12:35] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[18:14:33] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:14:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Clean up list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo)
[18:15:12] <wikibugs>	 (03PS2) 10BryanDavis: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728
[18:15:16] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie
[18:15:24] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11655936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie
[18:16:27] <wikibugs>	 (03Merged) 10jenkins-bot: mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:17:25] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[18:17:32] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[18:18:49] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 (owner: 10BryanDavis)
[18:19:26] <wikibugs>	 (03PS4) 10Scott French: mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245)
[18:20:54] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump to 2026-02-23-122916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244728 (owner: 10BryanDavis)
[18:21:16] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:21:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:21:38] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:22:00] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:22:17] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:22:37] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:22:43] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:23:37] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:24:20] <wikibugs>	 (03PS1) 10Dzahn: site: add contint1003/2003 with insetup collab role [puppet] - 10https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521)
[18:24:32] <wikibugs>	 (03Merged) 10jenkins-bot: mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:25:07] <bd808>	 I am done with my window today.</window>
[18:25:22] <wikibugs>	 (03PS4) 10Scott French: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245)
[18:26:30] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper)
[18:29:31] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:34:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet
[18:34:51] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[18:35:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11656047 (10MBinder_WMF) Thanks! The only one on that list that I see might be relevant is Wmf, so I requested access and referenced this ticket.
[18:35:22] <wikibugs>	 (03PS1) 10BPirkle: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522)
[18:36:14] * swfrench-wmf is waiting for chart-museum
[18:36:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:37:45] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 65743) is awaiting input
[18:39:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet
[18:39:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet
[18:41:28] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "[phab1004:~] $ echo "teste mich" | /usr/local/bin/phaste --config /etc/phaste.conf -t "teste mich"" [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper)
[18:42:02] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "tested. still working." [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper)
[18:42:12] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: helmfile-only deploy for mesh module updates - T364245
[18:42:16] <stashbot>	 T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245
[18:42:42] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 67882) is awaiting input
[18:43:22] <logmsgbot>	 !log swfrench@deploy2002 swfrench: helmfile-only deploy for mesh module updates - T364245 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:44:14] <swfrench-wmf>	 alright, let's see if tracing still works ...
[18:48:04] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[18:49:30] <wikibugs>	 (03PS5) 10Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245)
[18:50:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet
[18:50:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet
[18:51:56] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deploy for mesh module updates - T364245 (duration: 11m 13s)
[18:52:00] <stashbot>	 T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245
[18:54:02] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 72585) is awaiting input
[18:55:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet
[18:56:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:00:05] <jouncebot>	 dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T1900).
[19:00:08] <swfrench-wmf>	 I'll pause my work here and pick up during a quiet spot after the train
[19:00:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet
[19:01:02] <dduvall>	 thanks swfrench-wmf 
[19:02:52] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808)
[19:02:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot)
[19:03:25] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 78796) is awaiting input
[19:03:47] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244760 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot)
[19:04:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet
[19:06:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:09:31] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.17  refs T413808
[19:09:36] <stashbot>	 T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808
[19:11:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:11:20] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp2043.codfw.wmnet [reason: NIC firmware issues]
[19:11:40] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp2044.codfw.wmnet [reason: NIC firmware issues]
[19:12:27] <wikibugs>	 (03PS1) 10DCausse: opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764
[19:15:28] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764 (owner: 10DCausse)
[19:16:01] <swfrench-wmf>	 dduvall: once the dust settles, if you could give me a heads-up when it might be alright to mess with mw-debug a bit, that would be swell (no rush)
[19:16:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:16:25] <wikibugs>	 (03CR) 10Bking: [C:03+1] openjkd-21-jre: fix malformed changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 (owner: 10MVernon)
[19:16:26] <dduvall>	 swfrench-wmf: i think we're good to call it a train
[19:16:29] <logmsgbot>	 !log hoo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[19:16:41] <dduvall>	 i.e. go ahead and thanks again
[19:16:51] <swfrench-wmf>	 awesome, thanks dduvall
[19:17:14] <logmsgbot>	 !log hoo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[19:17:22] <wikibugs>	 (03Merged) 10jenkins-bot: opensearch-semantic-search: set LD_LIBRARY_PATH with knn lib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244764 (owner: 10DCausse)
[19:18:13] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[19:18:15] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656209 (10ssingh)
[19:18:24] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[19:19:21] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[19:20:09] <wikibugs>	 (03PS1) 10Gergő Tisza: EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512)
[19:20:27] <wikibugs>	 (03PS1) 10Gergő Tisza: EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512)
[19:20:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[19:21:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:21:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[19:21:37] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[19:21:48] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656222 (10BCornwall)
[19:22:01] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656224 (10BCornwall)
[19:22:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza)
[19:22:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza)
[19:23:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[19:24:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[19:25:01] <wikibugs>	 (03CR) 10MVernon: [V:03+2 C:03+2] openjkd-21-jre: fix malformed changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1240701 (owner: 10MVernon)
[19:27:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[19:27:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[19:28:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:29:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:29:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89066 and previous config saved to /var/cache/conftool/dbconfig/20260226-192927-marostegui.json
[19:29:32] <stashbot>	 T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[19:30:46] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good to proceed since you are confident about the status of the cluster itself. I think we will first need to merge this patch to up" [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking)
[19:31:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:31:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656259 (10BCornwall) FWIW `perf` has the majority of dropped packets as `QUEUE_PURGE` (and `NOT_SPECIFIED`)
[19:35:29] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-backup1004.eqiad.wmnet with OS trixie
[19:35:35] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11656284 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie executed with errors: - ms-...
[19:35:48] <wikibugs>	 (03PS2) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:36:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:36:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:43:50] <wikibugs>	 (03PS1) 10BCornwall: hardware.upgrade-firmware: Fix usage path [cookbooks] - 10https://gerrit.wikimedia.org/r/1244788
[19:44:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2043.codfw.wmnet
[19:44:19] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2043.codfw.wmnet
[19:44:21] <wikibugs>	 (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789
[19:44:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P89067 and previous config saved to /var/cache/conftool/dbconfig/20260226-194435-marostegui.json
[19:47:03] <wikibugs>	 (03PS3) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:47:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:47:25] <wikibugs>	 (03PS2) 10BPirkle: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522)
[19:49:29] <wikibugs>	 (03PS4) 10Gergő Tisza: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:50:43] <wikibugs>	 (03PS6) 10Gergő Tisza: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:50:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[19:51:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:51:29] <wikibugs>	 (03PS2) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789
[19:51:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 (owner: 10CDobbins)
[19:55:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:55:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:56:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:56:56] <wikibugs>	 (03PS1) 10Scott French: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245)
[19:57:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet
[19:59:12] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[19:59:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P89068 and previous config saved to /var/cache/conftool/dbconfig/20260226-195943-marostegui.json
[20:01:01] <wikibugs>	 (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244802
[20:01:13] <wikibugs>	 (03Abandoned) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244789 (owner: 10CDobbins)
[20:01:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:01:21] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244799 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[20:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:02:46] <wikibugs>	 (03Abandoned) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244802 (owner: 10CDobbins)
[20:04:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:04:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:04:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:05:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:06:10] <swfrench-wmf>	 alright, I'm done with mw-debug for now
[20:06:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:07:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:07:44] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet
[20:09:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11656360 (10BCornwall)
[20:10:14] <wikibugs>	 (03PS1) 10CDobbins: site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810
[20:11:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2095']
[20:12:12] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2095']
[20:12:18] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810 (owner: 10CDobbins)
[20:13:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2095.codfw.wmnet with OS bullseye
[20:13:26] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye
[20:13:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2096.codfw.wmnet with OS bullseye
[20:13:45] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye
[20:14:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T415786)', diff saved to https://phabricator.wikimedia.org/P89069 and previous config saved to /var/cache/conftool/dbconfig/20260226-201451-marostegui.json
[20:14:57] <stashbot>	 T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[20:15:08] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:15:26] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] site.pp: add cp hosts to text & upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1244810 (owner: 10CDobbins)
[20:15:46] <wikibugs>	 (03PS1) 10Jforrester: plugins/wm-pcc: Switch commands from experimental to new puppet [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621)
[20:16:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:16:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] plugins/wm-pcc: Switch commands from experimental to new puppet [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621) (owner: 10Jforrester)
[20:20:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11656415 (10VRiley-WMF) a:03VRiley-WMF
[20:23:28] <tappof>	 !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan1001
[20:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:33] <stashbot>	 T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924
[20:31:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:31:41] <wikibugs>	 (03PS1) 10Scott French: mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245)
[20:36:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:38:22] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:38:48] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2046.codfw.wmnet with OS trixie
[20:39:54] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2047.codfw.wmnet with OS trixie
[20:45:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:24] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[20:50:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:50:28] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11656482 (10VRiley-WMF) @Marostegui So, we can start with dbproxy1029. Are there specific dates that would be preferred? Also, just to...
[20:51:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:52:13] <wikibugs>	 (03CR) 10D3r1ck01: [C:03+1] CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[20:52:47] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2046.codfw.wmnet with reason: host reimage
[20:52:55] <wikibugs>	 (03CR) 10D3r1ck01: [C:03+1] Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: 10D3r1ck01)
[20:54:04] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage
[20:58:27] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2046.codfw.wmnet with reason: host reimage
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T2100).
[21:00:05] <jouncebot>	 RoanKattouw, danisztls, JSherman, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:22] <JSherman>	 here
[21:00:46] <RoanKattouw>	 I can deploy
[21:01:04] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[21:01:11] <tgr_>	 o/
[21:01:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:02:28] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage
[21:03:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar)
[21:03:43] <RoanKattouw>	 This is a cool new Spiderpig feature: it warned me that the Depends-On of this config change was only recently deployed and might be rolled back https://usercontent.irccloud-cdn.com/file/B4VIxlJs/image.png
[21:03:56] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy PersonalDashboard to new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244046 (https://phabricator.wikimedia.org/T417665) (owner: 10Scardenasmolinar)
[21:04:12] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]]
[21:04:16] <RoanKattouw>	 I'm proceeding anyway, because the worst that could happen is that the config sets a setting that is ignored by MW
[21:04:17] <stashbot>	 T417665: Deploy Extension:PersonalDashboard to id.wiki, tr.wiki, simple.wiki, and th.wiki - https://phabricator.wikimedia.org/T417665
[21:04:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:05:19] <tappof>	 !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan2002
[21:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:23] <stashbot>	 T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924
[21:06:05] <logmsgbot>	 !log catrope@deploy2002 catrope, suecarmol: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:06:30] <JSherman>	 testing
[21:09:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:10:06] <JSherman>	 RoanKattouw: we're good
[21:10:10] <logmsgbot>	 !log catrope@deploy2002 catrope, suecarmol: Continuing with sync
[21:11:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:11:26] <danisztls>	 sry, had IRC problems
[21:11:33] <wikibugs>	 (03CR) 10Catrope: [C:03+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[21:11:37] <wikibugs>	 (03CR) 10Catrope: [C:03+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[21:14:09] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244046|Deploy PersonalDashboard to new wikis (T417665)]] (duration: 09m 57s)
[21:14:13] <stashbot>	 T417665: Deploy Extension:PersonalDashboard to id.wiki, tr.wiki, simple.wiki, and th.wiki - https://phabricator.wikimedia.org/T417665
[21:14:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: 10Gergő Tisza)
[21:15:55] <RoanKattouw>	 CI keeps failing on https://gerrit.wikimedia.org/r/1244770 without actually saying why it fails, and the change is trivial, so I'm going to force-merge it
[21:16:00] <wikibugs>	 (Merged) jenkins-bot: EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: Gergő Tisza)
[21:16:00] <wikibugs>	 (CR) Catrope: [V:+2 C:+2] EmailAuthHookHandler: Fix LoginNotify being an optional dependency [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: Gergő Tisza)
[21:16:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1244770 (https://phabricator.wikimedia.org/T418512) (owner: Gergő Tisza)
[21:16:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1244771 (https://phabricator.wikimedia.org/T418512) (owner: Gergő Tisza)
[21:17:07] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]]
[21:17:08] <RoanKattouw>	 (Also Spiderpig just reminded me that wmf.16 isn't live anywhere anymore anyway)
[21:17:12] <stashbot>	 T418512: Newly created Wikimedia Vote Wiki accounts unable to log in – Fatal exception “Error” - https://phabricator.wikimedia.org/T418512
[21:17:37] <wikibugs>	 (CR) Catrope: [C:+2] Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[21:17:39] <wikibugs>	 (CR) Catrope: [C:+2] Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[21:18:56] <logmsgbot>	 !log catrope@deploy2002 tgr, catrope: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:19:27] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:19:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:01] <tgr_>	 22:13:52 ...............SSSSSSSSSSS.....................EEEEEEE...       183 / 183
[21:20:27] <tgr_>	 seems like there's a bunch of errors in the WikimediaEvents tests but they don't make it into the output somehow?
[21:20:59] <RoanKattouw>	 I have seen something like this happen in a transient way where a maximum test runtime was enforced and all the tests that took too long failed
[21:21:07] <RoanKattouw>	 There's also a PHP notice right after that
[21:21:15] <RoanKattouw>	 Anyway, CI passed on wmf.17 and that's the branch that matters
[21:21:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:21:31] <tgr_>	 the whole test footer (runtime, number of tests etc) is missing, maybe a fatal error?
[21:21:38] <tgr_>	 not sure if that's still a thing in PHP 8
[21:21:50] <RoanKattouw>	 tgr_: Also please test on the test servers (if this is even testable)
[21:23:02] <tgr_>	 testing
[21:24:09] <tgr_>	 can I make myself an account on votewiki?
[21:24:22] <tgr_>	 not sure how secret that wiki is
[21:24:40] <RoanKattouw>	 Uhhhh, probably not?
[21:24:46] <RoanKattouw>	 There's only like 6 people on it
[21:24:58] <RoanKattouw>	 So maybe we just proceed with the deployment, then ask the reporter to test again
[21:26:11] <tappof>	 !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: starting the rollout on titan2001
[21:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:16] <stashbot>	 T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924
[21:26:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:26:59] <tgr_>	 I'll just test on a normal wiki then
[21:27:21] <wikibugs>	 (PS1) BCornwall: cp2043: Set use_noflow_iface_preup to true [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527)
[21:28:15] <tgr_>	 RoanKattouw: works
[21:28:22] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:28:23] <logmsgbot>	 !log catrope@deploy2002 tgr, catrope: Continuing with sync
[21:28:27] <tgr_>	 (on normal wikis anyway, so no worse than before)
[21:28:32] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8154/co" [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: BCornwall)
[21:29:06] <wikibugs>	 (CR) BCornwall: cp2043: Set use_noflow_iface_preup to true [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: BCornwall)
[21:31:14] <wikibugs>	 (Merged) jenkins-bot: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1244649 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[21:31:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:31:22] <wikibugs>	 (Merged) jenkins-bot: Session: Emit JWT cookie in ImmutableSessionProviderWithCookie [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1244650 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[21:31:49] <wikibugs>	 (CR) Ssingh: [C:+1] cp2043: Set use_noflow_iface_preup to true [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: BCornwall)
[21:32:11] <danisztls>	 am I next?
[21:32:22] <wikibugs>	 (CR) CDobbins: [C:+1] cp2043: Set use_noflow_iface_preup to true [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: BCornwall)
[21:32:26] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244770|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]], [[gerrit:1244771|EmailAuthHookHandler: Fix LoginNotify being an optional dependency (T418512)]] (duration: 15m 19s)
[21:32:31] <stashbot>	 T418512: Newly created Wikimedia Vote Wiki accounts unable to log in – Fatal exception “Error” - https://phabricator.wikimedia.org/T418512
[21:32:41] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[21:32:59] <wikibugs>	 (CR) BCornwall: [C:+2] cp2043: Set use_noflow_iface_preup to true [puppet] - https://gerrit.wikimedia.org/r/1244854 (https://phabricator.wikimedia.org/T418527) (owner: BCornwall)
[21:33:37] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2095.codfw.wmnet with OS bullseye
[21:33:43] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye executed with errors: -...
[21:33:55] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2096.codfw.wmnet with OS bullseye
[21:34:00] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]]
[21:34:01] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11656611 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye executed with errors: -...
[21:34:05] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[21:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:40] <RoanKattouw>	 danisztls: Yes, you'll be next after I'm done with tgr_'s patches. Sorry for the delay
[21:35:55] <logmsgbot>	 !log catrope@deploy2002 catrope, tgr: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:36:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:37:10] <danisztls>	 RoanKattouw: no problem at all
[21:39:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:39] <tgr_>	 sorry, this one is a bit cumbersome to test
[21:40:13] <wikibugs>	 (CR) BCornwall: prometheus: add pooled host check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins)
[21:41:37] <tgr_>	 RoanKattouw: looks good
[21:41:46] <logmsgbot>	 !log catrope@deploy2002 catrope, tgr: Continuing with sync
[21:42:41] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[21:43:33] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Certificate grafana.discovery.wmnet expires in 7 day(s) (Fri 06 Mar 2026 09:43:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:43:43] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Certificate grafana.discovery.wmnet expires in 7 day(s) (Fri 06 Mar 2026 09:43:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:45:39] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244649|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]], [[gerrit:1244650|Session: Emit JWT cookie in ImmutableSessionProviderWithCookie (T415007)]] (duration: 11m 39s)
[21:45:44] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[21:46:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: DDesouza)
[21:46:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: DDesouza)
[21:46:49] <RoanKattouw>	 danisztls: Starting yours now, I'll ask you to test in a few minutes
[21:47:06] <danisztls>	 RoanKattouw: thanks!
[21:47:25] <wikibugs>	 (Merged) jenkins-bot: Deploy Comparative Reader Research survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) (owner: DDesouza)
[21:47:29] <wikibugs>	 (Merged) jenkins-bot: Deploy Comparative Reader Research survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) (owner: DDesouza)
[21:47:49] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]]
[21:47:55] <stashbot>	 T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834
[21:47:55] <stashbot>	 T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829
[21:49:45] <logmsgbot>	 !log catrope@deploy2002 dani, catrope: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:50:00] <RoanKattouw>	 danisztls: Alright please test, and let me know when you're done
[21:50:02] <wikibugs>	 (PS1) CDobbins: hieradata: add haproxy version for new cp hosts [puppet] - https://gerrit.wikimedia.org/r/1244857
[21:50:26] <danisztls>	 RoanKattouw: looks good
[21:50:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:23] <logmsgbot>	 !log catrope@deploy2002 dani, catrope: Continuing with sync
[21:51:51] <tappof>	 !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: rollout complete.
[21:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:55] <stashbot>	 T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924
[21:52:11] <wikibugs>	 (CR) BCornwall: [C:+1] hieradata: add haproxy version for new cp hosts [puppet] - https://gerrit.wikimedia.org/r/1244857 (owner: CDobbins)
[21:53:20] <wikibugs>	 (CR) BCornwall: prometheus: add pooled host check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins)
[21:54:03] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8155/console" [puppet] - https://gerrit.wikimedia.org/r/1244857 (owner: CDobbins)
[21:54:25] <wikibugs>	 (CR) CDobbins: [V:+1 C:+2] hieradata: add haproxy version for new cp hosts [puppet] - https://gerrit.wikimedia.org/r/1244857 (owner: CDobbins)
[21:54:40] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:54:51] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet
[21:55:16] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243929|Deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1243930|Deploy Comparative Reader Research survey on enwiki (T417829)]] (duration: 07m 28s)
[21:55:22] <stashbot>	 T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834
[21:55:23] <stashbot>	 T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829
[21:56:16] <danisztls>	 RoanKattouw: thanks!
[21:56:16] <RoanKattouw>	 Whoops sorry tgr_ I forgot your two config changes ("Set $wgJwtSessionCookieIssuer for bot passwords" and "Enable JWT session cookie for bot passwords (all wikis)"). Can those go out together, or should I deploy them separately?
[21:56:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[21:56:36] <tgr_>	 they can go together
[21:56:44] <tgr_>	 thanks
[21:56:46] <RoanKattouw>	 OK great, I will deploy them together after my config change
[21:57:20] <wikibugs>	 (Merged) jenkins-bot: Remove workaround for T370517, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[21:57:40] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]]
[21:57:44] <stashbot>	 T370517: Search button message text changes - https://phabricator.wikimedia.org/T370517
[21:59:35] <logmsgbot>	 !log catrope@deploy2002 catrope: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260226T2200)
[22:00:40] <logmsgbot>	 !log catrope@deploy2002 catrope: Continuing with sync
[22:01:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:02:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet
[22:03:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet
[22:04:43] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242543|Remove workaround for T370517, no longer needed (T370517)]] (duration: 07m 03s)
[22:04:48] <stashbot>	 T370517: Search button message text changes - https://phabricator.wikimedia.org/T370517
[22:06:12] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2047.codfw.wmnet with OS trixie
[22:06:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: D3r1ck01)
[22:06:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: D3r1ck01)
[22:07:46] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/1244692 (https://phabricator.wikimedia.org/T415007) (owner: D3r1ck01)
[22:07:49] <wikibugs>	 (Merged) jenkins-bot: Enable JWT session cookie for bot passwords (all wikis) [mediawiki-config] - https://gerrit.wikimedia.org/r/1244647 (https://phabricator.wikimedia.org/T415007) (owner: D3r1ck01)
[22:08:07] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]]
[22:08:12] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[22:09:06] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2046.codfw.wmnet with OS trixie
[22:09:58] <logmsgbot>	 !log catrope@deploy2002 catrope, d3r1ck01: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:10:04] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2043.codfw.wmnet
[22:11:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:11:52] <wikibugs>	 (PS2) Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969)
[22:12:41] <wikibugs>	 (CR) CI reject: [V:-1] cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: Ebernhardson)
[22:13:46] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2048.codfw.wmnet with OS trixie
[22:14:00] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2049.codfw.wmnet with OS trixie
[22:14:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2050.codfw.wmnet with OS trixie
[22:14:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2051.codfw.wmnet with OS trixie
[22:14:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2052.codfw.wmnet with OS trixie
[22:15:01] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2053.codfw.wmnet with OS trixie
[22:15:45] <tgr_>	 RoanKattouw: looks good
[22:15:51] <tgr_>	 thanks for the deploys!
[22:15:59] <wikibugs>	 (PS1) BCornwall: Revert "cp2043: Set use_noflow_iface_preup to true" [puppet] - https://gerrit.wikimedia.org/r/1244869
[22:16:01] <logmsgbot>	 !log catrope@deploy2002 catrope, d3r1ck01: Continuing with sync
[22:18:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission puppetmaster2001 - https://phabricator.wikimedia.org/T416606#11656726 (BCornwall) Is https://gerrit.wikimedia.org/r/c/operations/dns/+/1237463 still needed to be merged?
[22:19:56] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244692|CommonSettings: Set $wgJwtSessionCookieIssuer for bot passwords (T415007)]], [[gerrit:1244647|Enable JWT session cookie for bot passwords (all wikis) (T415007)]] (duration: 11m 48s)
[22:20:00] <stashbot>	 T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[22:27:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2048.codfw.wmnet with reason: host reimage
[22:28:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2049.codfw.wmnet with reason: host reimage
[22:28:27] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2051.codfw.wmnet with reason: host reimage
[22:28:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2050.codfw.wmnet with reason: host reimage
[22:28:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2052.codfw.wmnet with reason: host reimage
[22:29:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2053.codfw.wmnet with reason: host reimage
[22:30:19] <wikibugs>	 (PS3) Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969)
[22:33:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2048.codfw.wmnet with reason: host reimage
[22:34:06] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11656753 (jhathaway) @DamianZaremba I tried with a couple of my test accounts, but I was unable to duplicate your r...
[22:37:36] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2052.codfw.wmnet with reason: host reimage
[22:40:24] <icinga-wm>	 PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp2050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[22:40:26] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2053 is CRITICAL: connect to address 10.192.56.3 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:40:26] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2053 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[22:41:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2053.codfw.wmnet with reason: host reimage
[22:45:12] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2050.codfw.wmnet with reason: host reimage
[22:45:26] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[22:49:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2049.codfw.wmnet with reason: host reimage
[22:50:26] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2051 is CRITICAL: connect to address 10.192.40.25 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:50:26] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[22:52:26] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2053 is OK: HTTP OK: HTTP/1.0 200 OK - 36064 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:53:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2051.codfw.wmnet with reason: host reimage
[22:54:50] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2048.codfw.wmnet with OS trixie
[22:58:24] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2052.codfw.wmnet with OS trixie
[22:58:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet
[23:01:44] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 209464) is awaiting input
[23:02:04] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2053 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS
[23:04:22] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2053.codfw.wmnet with OS trixie
[23:04:26] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2051 is OK: HTTP OK: HTTP/1.0 200 OK - 36018 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:04:33] <swfrench-wmf>	 jouncebot: nowandnext
[23:04:33] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 55 minute(s)
[23:04:33] <jouncebot>	 In 7 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0700)
[23:05:51] <swfrench-wmf>	 unless anyone has conflicting changes planned, I'd like to trigger a noop mediawiki deployment to clear a helm chart version diff - should be a pretty quick operation
[23:07:02] <icinga-wm>	 RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp2050 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-04-05 04:22:55 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/HTTPS
[23:07:05] <wikibugs>	 (CR) Scott French: [C:+2] mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:07:13] * swfrench-wmf will proceed with deployment
[23:09:17] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2050.codfw.wmnet with OS trixie
[23:10:40] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2049.codfw.wmnet with OS trixie
[23:10:41] <wikibugs>	 (Merged) jenkins-bot: mediawiki: refresh mesh.deployment 1.3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1244807 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:12:58] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2051 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/HTTPS
[23:12:58] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2051 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS
[23:13:13] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to clear chart version diff
[23:14:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2051.codfw.wmnet with OS trixie
[23:15:44] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to clear chart version diff (duration: 02m 31s)
[23:15:53] <swfrench-wmf>	 all done
[23:16:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:16:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:16:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:24:43] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:26:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:28:24] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:30:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:32:02] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:33:22] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:35:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:02] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:38:02] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:40:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:26] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:41:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:41:26] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:41:54] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:42:26] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:42:26] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:43:20] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:43:26] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:45:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:46:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:47:02] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:48:24] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:50:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:30] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1018 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:51:54] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:52:02] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:53:20] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:53:40] <icinga-wm>	 PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[23:55:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:56:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:58:02] <icinga-wm>	 RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock