[00:00:25] FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:01:14] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:01:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:01:30] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1018 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:01:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [00:01:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe2024.codfw.wmnet [00:02:54] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:03:26] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:03:57] (PS1) Dzahn: zookeeper: support TLS by loading Netty jars into class path [puppet] - https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) [00:04:14] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [00:04:16] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:04:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:04:43] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:04:58] pt1979@cumin2002 dhcp (PID 246129) is awaiting input [00:05:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:08:22] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:09:02] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:10:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:38] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:12:54] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:12:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:13:23] (PS1) Dzahn: zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) [00:13:40] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:14:01] (CR) CI reject: [V:-1] zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [00:14:43] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:14:59] (PS2) Dzahn: 
zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) [00:15:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:17:12] (CR) CI reject: [V:-1] zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [00:18:14] (PS1) Dzahn: zuul::main: add debugging extra_java_opts: "-Djavax.net.debug=ssl,handshake" [puppet] - https://gerrit.wikimedia.org/r/1244944 (https://phabricator.wikimedia.org/T395938) [00:19:02] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:19:24] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:20:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:22:14] PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 29 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [00:22:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:24:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:24:58] (PS3) Dzahn: zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) [00:25:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:26:14] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:26:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:26:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:27:14] RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [00:29:24] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:30:14] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:30:16] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:30:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:32:38] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [00:37:14] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:39:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1244961 [00:39:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244961 (owner: TrainBranchBot) [00:40:14] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:47:14] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:47:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:49:55] (PS1) Dzahn: zuul::main: build full chain of trust for Java Netty TLS [puppet] - https://gerrit.wikimedia.org/r/1244969 (https://phabricator.wikimedia.org/T395938) [00:50:16] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:51:14] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:51:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:52:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1244961 (owner: TrainBranchBot) [00:56:20] (PS2) Dzahn: zuul::main: build full chain of trust for Java Netty TLS [puppet] - 
https://gerrit.wikimedia.org/r/1244969 (https://phabricator.wikimedia.org/T395938) [01:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [01:01:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:02:19] ops-eqiad, DC-Ops: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544 (Dzahn) NEW [01:03:06] ops-codfw, DC-Ops: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545 (Dzahn) NEW [01:03:36] ops-eqiad, DC-Ops: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11656919 (Dzahn) [01:03:44] ops-codfw, DC-Ops: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11656921 (Dzahn) [01:04:43] ops-codfw, DC-Ops: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11656923 (Dzahn) [01:04:52] ops-eqiad, DC-Ops: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11656926 (Dzahn) [01:06:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:06:35] (PS2) Dzahn: site: add contint1003/2003 with insetup collab role [puppet] - https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521) [01:09:12] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244974 [01:09:12] 
(CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244974 (owner: TrainBranchBot) [01:09:16] ops-codfw, collaboration-services, Continuous-Integration-Infrastructure, DC-Ops, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11656939 (Dzahn) [01:10:03] ops-eqiad, collaboration-services, Continuous-Integration-Infrastructure, DC-Ops, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11656940 (Dzahn) [01:11:46] ops-codfw, collaboration-services, Continuous-Integration-Infrastructure, DC-Ops, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11656951 (Dzahn) requesting public IPs at T418520 [01:11:50] ops-eqiad, collaboration-services, Continuous-Integration-Infrastructure, DC-Ops, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11656954 (Dzahn) requesting public IPs at T418520 [01:13:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:15:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:19:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on 
maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:20:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:27:11] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1244974 (owner: TrainBranchBot) [01:30:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:34:43] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:35:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2012:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:40:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:45:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe2024.codfw.wmnet [01:45:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:50:27] FIRING: [3x] 
SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:00:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:01:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:05:27] RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:06:21] FIRING: [6x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[02:10:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:11:21] FIRING: [6x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:14:04] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 18s) [02:14:43] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:15:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:15:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:17] FIRING: [4x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:20:24] PROBLEM - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:20:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:02] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:23:38] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:25:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:25:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:30:24] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:30:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:31:17] FIRING: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:33:02] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:38] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:35:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:40:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:41:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:43:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:44:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:45:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:46:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:48:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[02:50:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:51:32] FIRING: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:54:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [02:55:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:56:17] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:01:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:03:16] RESOLVED: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:03:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:04:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:05:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:08:03] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:10:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:13:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:14:01] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:14:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:15:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:16:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:18:03] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:20:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:22:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:23:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:24:01] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:24:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:25:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:25:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit 
wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:26:17] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:27:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:27:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:30:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:30:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:31:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:32:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:34:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit 
wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:34:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:35:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:36:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:37:27] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:38:27] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:40:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:40:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:17] FIRING: [4x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:43:17] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:44:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:45:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:47:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:50:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [03:56:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:59:03] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:00:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:03:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:04:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit 
wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:05:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:06:27] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:06:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:07:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:07:27] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:08:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:08:29] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:09:03] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:10:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:13:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:14:13] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:14:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:15:11] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:15:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:18:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:18:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:20:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:22:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:23:29] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:25:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:25:17] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:25:25] FIRING: [10x] 
SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:26:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:27:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:28:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:28:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:30:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:36:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:40:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:40:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:40:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:43:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:43:17] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:45:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:45:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:50:16] FIRING: 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:50:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:54:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:54:43] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:55:17] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:55:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [04:58:15] 
PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:58:17] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:58:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:58:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:00:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [05:01:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:04:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:05:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:08:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:10:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:15:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:15:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:16:07] PROBLEM - MegaRAID on db1162 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:16:08] ACKNOWLEDGEMENT - MegaRAID on db1162 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T418550 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:16:23] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550 (ops-monitoring-bot) NEW [05:17:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:17:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:18:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:18:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:18:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:19:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:25] FIRING: [11x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:21:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:23:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:25:25] FIRING: [13x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:26:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:27:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:29:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:30:25] FIRING: [12x] SystemdUnitFailed: 
wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:32] SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11657121 (Papaul) @ayounsi please see below for the IP layer diagram let me know if we need to modify it thanks {F72443222} [05:33:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:35:25] FIRING: [12x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:36:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:36:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:37:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:38:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[05:39:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:39:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:40:25] FIRING: [13x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:43:16] (PS1) Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1245099 (https://phabricator.wikimedia.org/T418553) [05:43:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:43:34] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11657148 (Marostegui) [05:44:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1222 with weight 0 T418553', diff saved to https://phabricator.wikimedia.org/P89072 and previous config saved to /var/cache/conftool/dbconfig/20260227-054410-marostegui.json [05:44:15] T418553: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T418553 [05:44:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T418553 [05:44:58] (CR) Marostegui: [C:+2] mariadb: Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1245099 
(https://phabricator.wikimedia.org/T418553) (owner: Gerrit maintenance bot) [05:45:25] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:46:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:47:13] !log Starting s2 eqiad failover from db1162 to db1222 - T418553 [05:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1222 to s2 primary T418553', diff saved to https://phabricator.wikimedia.org/P89073 and previous config saved to /var/cache/conftool/dbconfig/20260227-054750-marostegui.json [05:48:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1162 T418553', diff saved to https://phabricator.wikimedia.org/P89074 and previous config saved to /var/cache/conftool/dbconfig/20260227-054833-marostegui.json [05:48:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1162: Repooling after switchover [05:49:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [05:49:29] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1162: Repooling after switchover [05:49:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1162: Repooling after switchover [05:50:37] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11657171 (Marostegui) 
p:Triage→Medium Can we get a disk for this host? It is fine if it is a used disk. [05:51:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [05:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:53:00] ops-eqiad, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 3 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11657176 (VRiley-WMF) a:VRiley-WMF [05:53:25] (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/1245110 (https://phabricator.wikimedia.org/T401966) [05:54:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:55:03] (CR) Marostegui: [C:+2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/1245110 (https://phabricator.wikimedia.org/T401966) (owner: Marostegui) [05:55:07] !log marostegui@dns1006 START - running authdns-update [05:55:20] !log Failover m5-master T401966 [05:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:25] T401966: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966 [05:56:07] (PS1) Marostegui: Revert "wmnet: Failover m5-master" [dns] - https://gerrit.wikimedia.org/r/1245113 [05:56:34] !log marostegui@dns1006 END - running authdns-update [05:56:48] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11657186 (Marostegui) >>! 
In T401966#11656482, @VRiley-WMF wrote: > @Marostegui So, we can start with dbproxy1029. Are there specif... [05:57:19] (CR) Marostegui: [C:+2] Revert "wmnet: Failover m5-master" [dns] - https://gerrit.wikimedia.org/r/1245113 (owner: Marostegui) [05:57:23] !log marostegui@dns1006 START - running authdns-update [05:57:27] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11657187 (Marostegui) dbproxy1028 can now be done any time as it is not a master anymore, so you could do that one whenever it is c... [05:58:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [05:58:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T418465)', diff saved to https://phabricator.wikimedia.org/P89076 and previous config saved to /var/cache/conftool/dbconfig/20260227-055835-marostegui.json [05:58:40] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [05:58:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1160 with weight 0 T418079', diff saved to https://phabricator.wikimedia.org/P89077 and previous config saved to /var/cache/conftool/dbconfig/20260227-055845-marostegui.json [05:58:49] !log marostegui@dns1006 END - running authdns-update [05:58:50] T418079: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T418079 [05:59:04] (CR) Marostegui: [C:+2] mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1242167 (https://phabricator.wikimedia.org/T418079) (owner: Gerrit maintenance bot) [05:59:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 42 hosts with reason: Primary 
switchover s4 T418079 [06:00:00] !log Starting s4 eqiad failover from db1244 to db1160 - T418079 [06:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:20] !log revert: Failover m5-master T401966 [06:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:25] T401966: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966 [06:00:59] (PS1) Marostegui: db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1245120 (https://phabricator.wikimedia.org/T418079) [06:02:28] (CR) Marostegui: [C:+2] db1162: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1245120 (https://phabricator.wikimedia.org/T418079) (owner: Marostegui) [06:03:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T418465)', diff saved to https://phabricator.wikimedia.org/P89078 and previous config saved to /var/cache/conftool/dbconfig/20260227-060331-marostegui.json [06:04:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1160 to s4 primary T418079', diff saved to https://phabricator.wikimedia.org/P89080 and previous config saved to /var/cache/conftool/dbconfig/20260227-060455-marostegui.json [06:05:01] T418079: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T418079 [06:05:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1244 T418079', diff saved to https://phabricator.wikimedia.org/P89081 and previous config saved to /var/cache/conftool/dbconfig/20260227-060534-marostegui.json [06:07:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [06:08:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:13:27] (PS1) Marostegui: dbproxy1027: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1245135 (https://phabricator.wikimedia.org/T414656) [06:16:18] (CR) Marostegui: [C:+2] dbproxy1027: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1245135 (https://phabricator.wikimedia.org/T414656) (owner: Marostegui) [06:18:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89082 and previous config saved to /var/cache/conftool/dbconfig/20260227-061840-marostegui.json [06:23:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:25:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:30:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:33:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89084 and previous config saved to /var/cache/conftool/dbconfig/20260227-063348-marostegui.json [06:35:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1162: Repooling after switchover [06:36:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:38:54] PROBLEM - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:40:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:45:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:48:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [06:48:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T418465)', diff saved to https://phabricator.wikimedia.org/P89086 and previous config saved to /var/cache/conftool/dbconfig/20260227-064856-marostegui.json [06:49:02] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:49:14] !log marostegui@cumin1003 DONE (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [06:49:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89087 and previous config saved to /var/cache/conftool/dbconfig/20260227-064922-marostegui.json [06:54:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89088 and previous config saved to /var/cache/conftool/dbconfig/20260227-065417-marostegui.json [06:54:23] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:57:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0700) [07:02:56] (CR) Taiwanese elephant: [C:+1] zhwiki: Remove all rights from accountcreator [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 1F616EMO) [07:04:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:04:26] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - 
https://phabricator.wikimedia.org/T417655#11657276 (MoritzMuehlenhoff) You're using the wrong account: Your shell access is for mbinder, but you requested access for the "wmf" group with "ma... [07:09:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P89089 and previous config saved to /var/cache/conftool/dbconfig/20260227-070926-marostegui.json [07:13:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:14:32] SRE, Infrastructure Security, Infrastructure-Foundations, LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11657283 (MoritzMuehlenhoff) >>! In T418068#11654602, @Jelto wrote: > @MoritzMuehlenhoff or @SLyngshe... [07:15:50] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [07:18:19] !log aokoth@cumin1003 END (ERROR) - Cookbook sre.gitlab.upgrade (exit_code=97) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [07:19:06] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [07:19:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:24:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P89090 and previous config saved to /var/cache/conftool/dbconfig/20260227-072434-marostegui.json [07:24:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down 
- pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:25:18] !log aokoth@cumin1003 END (ERROR) - Cookbook sre.gitlab.upgrade (exit_code=97) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [07:27:46] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T418483 [07:29:07] (PS1) Ayounsi: k8s: add missing accept statement [homer/public] - https://gerrit.wikimedia.org/r/1245200 (https://phabricator.wikimedia.org/T417817) [07:30:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:38:04] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T418483 [07:38:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:39:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89091 and previous config saved to 
/var/cache/conftool/dbconfig/20260227-073943-marostegui.json [07:39:48] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:39:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [07:39:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T418465)', diff saved to https://phabricator.wikimedia.org/P89092 and previous config saved to /var/cache/conftool/dbconfig/20260227-073957-marostegui.json [07:40:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:54] (PS1) Slyngshede: P:idm explicitly set email validators [puppet] - https://gerrit.wikimedia.org/r/1245205 [07:41:17] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:04] (PS6) Muehlenhoff: Apply role to pki1002 [puppet] - https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: Elukey) [07:44:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T418465)', diff saved to https://phabricator.wikimedia.org/P89094 and previous config saved to /var/cache/conftool/dbconfig/20260227-074454-marostegui.json [07:45:00] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis 
- https://phabricator.wikimedia.org/T418465 [07:45:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:50] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1245205 (owner: Slyngshede) [07:53:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:54:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:57:18] (CR) Slyngshede: [C:+2] P:idm explicitly set email validators [puppet] - https://gerrit.wikimedia.org/r/1245205 (owner: Slyngshede) [07:58:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:59:42] (PS1) Muehlenhoff: Re-add Hiera config files [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T365798) [08:00:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P89095 and previous config saved to /var/cache/conftool/dbconfig/20260227-080002-marostegui.json [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0800) [08:00:22] (CR) CI reject: [V:-1] Re-add Hiera config files [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:00:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:02:22] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:03:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:03:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:03:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[08:04:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:04:43] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:04:59] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:05:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:05:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:05:31] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:08:57] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11657347 (MoritzMuehlenhoff) Resolved→Open a:MatthewVernon→None @MBinder_WMF I did a little digging in account history: Your original m... 
[08:09:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:09:12] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:09:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:10:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:11:56] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11657352 (MoritzMuehlenhoff) a:MoritzMuehlenhoff [08:14:03] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:14:10] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:15:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P89096 and previous config saved to /var/cache/conftool/dbconfig/20260227-081510-marostegui.json [08:15:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:15:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [08:17:09] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Remove Puppet 5 dependencies from PCC - https://phabricator.wikimedia.org/T418559 (MoritzMuehlenhoff) NEW [08:18:01] (PS2) Muehlenhoff: Re-add Hiera config files still used by PCC [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) [08:18:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:18:17] (PS3) Muehlenhoff: Re-add Hiera config files still used by PCC [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) [08:18:52] (CR) CI reject: [V:-1] Re-add Hiera config files still used by PCC [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) (owner: Muehlenhoff) [08:19:01] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:19:07] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:19:18] (PS6) Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [08:19:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:22:10] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:22:17] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host 
gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:24:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:24:36] (PS4) Muehlenhoff: Re-add Hiera config files still used by PCC [puppet] - https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) [08:26:28] (PS1) Slyngshede: P:idm initial spec for IDM profile [puppet] - https://gerrit.wikimedia.org/r/1245258 [08:28:46] (CR) CI reject: [V:-1] P:idm initial spec for IDM profile [puppet] - https://gerrit.wikimedia.org/r/1245258 (owner: Slyngshede) [08:29:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:29:40] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur) [08:30:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T418465)', diff saved to https://phabricator.wikimedia.org/P89097 and previous config saved to /var/cache/conftool/dbconfig/20260227-083018-marostegui.json [08:30:24] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:30:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [08:30:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89098 and previous config saved to /var/cache/conftool/dbconfig/20260227-083043-marostegui.json [08:32:28] !log 
aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:32:36] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:32:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89099 and previous config saved to /var/cache/conftool/dbconfig/20260227-083257-marostegui.json [08:33:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:33:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:35:51] (03PS2) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [08:38:15] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [08:38:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:39:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS trixie [08:39:54] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [08:40:25] 
FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:30] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:40:47] (03CR) 10Dpogorzelski: [C:03+1] httpbb: fix the revscoring-editquality-goodfaith test [puppet] - 10https://gerrit.wikimedia.org/r/1244659 (owner: 10AikoChou) [08:43:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:44:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:45:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:43] !log restart corto on alert1002 [08:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P89100 and previous config saved to /var/cache/conftool/dbconfig/20260227-084806-marostegui.json [08:48:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:49:54] RECOVERY - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [08:50:06] (03PS3) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [08:50:30] (03PS1) 10Matthias Mullie: Limit additional whitespace to sticky header version only [extensions/MobileFrontend] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245265 (https://phabricator.wikimedia.org/T416598) [08:51:04] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T418483 [08:51:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245265 (https://phabricator.wikimedia.org/T416598) (owner: 10Matthias Mullie) [08:52:33] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [08:55:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [08:58:30] (03PS4) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [08:59:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [09:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:01:06] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile 
[puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:03:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P89101 and previous config saved to /var/cache/conftool/dbconfig/20260227-090314-marostegui.json [09:04:41] (03CR) 10Elukey: [C:03+1] Re-add Hiera config files still used by PCC [puppet] - 10https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) (owner: 10Muehlenhoff) [09:04:54] (03PS5) 10Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) [09:04:54] (03PS3) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [09:05:06] (03PS5) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:05:52] (03CR) 10CI reject: [V:04-1] wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:06:49] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T418483 [09:08:08] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:10:53] (03PS6) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:11:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [09:12:20] (03PS7) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:13:16] RESOLVED: 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:14:45] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on all magru hosts [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) [09:14:53] (03PS1) 10Marostegui: Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1245275 [09:14:55] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:15:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:29] (03CR) 10Jelto: "one typo in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [09:15:40] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [09:18:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T418465)', diff saved to https://phabricator.wikimedia.org/P89102 and previous config saved to /var/cache/conftool/dbconfig/20260227-091822-marostegui.json [09:18:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:18:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [09:18:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T418465)', diff saved 
to https://phabricator.wikimedia.org/P89103 and previous config saved to /var/cache/conftool/dbconfig/20260227-091847-marostegui.json [09:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:19:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1027.eqiad.wmnet with OS trixie [09:20:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T418465)', diff saved to https://phabricator.wikimedia.org/P89104 and previous config saved to /var/cache/conftool/dbconfig/20260227-092101-marostegui.json [09:21:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11657417 (10mikez-WMF) Thank you very much! 
[09:21:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [09:23:36] (03PS8) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:24:21] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:24:51] FIRING: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:28] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1245275 (owner: 10Marostegui) [09:28:17] (03PS2) 10Fabfur: hiera: set haproxy version to 3.0 on all magru hosts [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) [09:29:29] (03CR) 10Muehlenhoff: [C:03+2] Re-add Hiera config files still used by PCC [puppet] - 10https://gerrit.wikimedia.org/r/1245253 (https://phabricator.wikimedia.org/T418559) (owner: 10Muehlenhoff) [09:29:50] RESOLVED: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:30:54] (03PS9) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:31:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [09:31:58] !log brouberol@deploy2002 
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [09:32:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:32:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:33:11] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:33:53] (03PS1) 10Muehlenhoff: Enable Java 21 on build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1245280 (https://phabricator.wikimedia.org/T418109) [09:36:01] (03PS1) 10DCausse: opensearch-semantic-search: setup egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245284 [09:36:01] (03PS10) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:36:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P89105 and previous config saved to /var/cache/conftool/dbconfig/20260227-093609-marostegui.json [09:36:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [09:38:16] (03CR) 10Elukey: [C:03+2] k8s: add missing accept statement [homer/public] - 10https://gerrit.wikimedia.org/r/1245200 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [09:38:22] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:39:27] (03CR) 10Brouberol: [C:03+1] opensearch-semantic-search: setup egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245284 (owner: 10DCausse) [09:39:52] (03CR) 10DCausse: [C:03+2] opensearch-semantic-search: setup egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245284 (owner: 10DCausse) [09:41:20] (03PS3) 10Muehlenhoff: ldap::client::sssd: Only 
support socket activation [puppet] - 10https://gerrit.wikimedia.org/r/1243795 [09:41:59] (03Merged) 10jenkins-bot: opensearch-semantic-search: setup egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245284 (owner: 10DCausse) [09:42:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [09:42:36] (03PS11) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:44:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:44:54] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:45:06] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:45:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:52] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:48:50] (03PS12) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:48:54] (03PS1) 10JMeybohm: admin/data: Add second YubiKey [puppet] - 10https://gerrit.wikimedia.org/r/1245286 [09:49:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[09:50:21] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1245287 (https://phabricator.wikimedia.org/T401966) [09:50:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:43] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:50:58] (03CR) 10Ryan Kemper: "Oops, hosts line got clobbered. cancelling pcc run" [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:51:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P89106 and previous config saved to /var/cache/conftool/dbconfig/20260227-095117-marostegui.json [09:51:39] (03CR) 10Jelto: [C:03+1] "verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1245286 (owner: 10JMeybohm) [09:52:09] (03CR) 10JMeybohm: [C:03+2] admin/data: Add second YubiKey [puppet] - 10https://gerrit.wikimedia.org/r/1245286 (owner: 10JMeybohm) [09:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:52:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [09:52:33] (03PS13) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:53:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in 
the [Tuesday, March 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [09:54:49] (03CR) 10CI reject: [V:04-1] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [09:55:14] (03PS4) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [09:55:26] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1245287 (https://phabricator.wikimedia.org/T401966) (owner: 10Marostegui) [09:56:01] (03CR) 10CI reject: [V:04-1] wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:56:01] !log marostegui@dns1004 START - running authdns-update [09:56:02] (03PS6) 10Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) [09:56:02] (03PS5) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [09:56:17] FIRING: [3x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11657541 (10Marostegui) I've failed over dbproxy1029 so on Monday it will be ready to get done. As I stated before, dbproxy1028 can b... 
[09:56:56] (03PS14) 10Slyngshede: P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 [09:56:58] (03CR) 10CI reject: [V:04-1] wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:57:28] !log marostegui@dns1004 END - running authdns-update [09:57:37] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11657543 (10Jelto) 05Open→03Declined Thanks for the quick feedback. I double-checked the member... [09:57:40] (03PS6) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [09:57:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243795 (owner: 10Muehlenhoff) [09:59:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:00:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11657555 (10Jelto) Thank you for the quick sign off @Khantstop! 
I reach out out of band to confirm the ssh key [10:01:17] FIRING: [14x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:01] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [10:04:43] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade. [10:06:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T418465)', diff saved to https://phabricator.wikimedia.org/P89107 and previous config saved to /var/cache/conftool/dbconfig/20260227-100625-marostegui.json [10:06:32] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:06:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2197.codfw.wmnet with reason: Maintenance [10:07:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:07:43] aokoth@cumin1003 upgrade (PID 2733177) is awaiting input [10:08:01] (03CR) 10Muehlenhoff: [C:03+1] "As beautiful as a Puppet spec test can be!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [10:08:26] (03CR) 10Slyngshede: [C:03+2] P:idm initial spec for IDM profile [puppet] - 10https://gerrit.wikimedia.org/r/1245258 (owner: 10Slyngshede) [10:09:31] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:09:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2214.codfw.wmnet with reason: Maintenance [10:09:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2214 (T418465)', diff saved to https://phabricator.wikimedia.org/P89108 and previous config saved to /var/cache/conftool/dbconfig/20260227-100944-marostegui.json [10:10:51] !log Failover m5-master T401966 [10:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:55] T401966: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966 [10:11:17] FIRING: [16x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:44] (03PS1) 10Elukey: admin_ng: restructure Istio sidecar's config for Istio 1.24+ [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245289 [10:11:44] (03PS1) 10Elukey: ml-services: restore transparent proxy functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245290 [10:11:45] (03PS1) 10Elukey: kserve-inference: allow both lists and dicts as inference_services values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245291 [10:11:45] (03PS1) 10Elukey: ml-services: reduce the revertrisk's isvcs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245292 [10:14:23] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T418465)', diff saved to https://phabricator.wikimedia.org/P89110 and previous config saved to /var/cache/conftool/dbconfig/20260227-101422-marostegui.json [10:14:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:14:31] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [10:14:37] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11657592 (10A_smart_kitten) In that case, should the docs at https://wikitech.wikimedia.org/wiki/SR... [10:15:31] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116112 bytes in 0.790 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [10:18:17] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T418483 [10:19:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:53] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: save x-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [10:21:17] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host dse-k8s-worker1026.eqiad.wmnet 
[10:21:39] (03CR) 10Dpogorzelski: [C:03+1] admin_ng: restructure Istio sidecar's config for Istio 1.24+ [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245289 (owner: 10Elukey) [10:24:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:18] (03CR) 10Jelto: "so this change can be abandoned then?" [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:29:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P89111 and previous config saved to /var/cache/conftool/dbconfig/20260227-102931-marostegui.json [10:30:34] (03CR) 10Jelto: "@vgutierrez@wikimedia.org does this change look reasonable to you? Can we use the `gerrit-https` `realserver::pools` also for the replica " [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:30:45] (03CR) 10Dpogorzelski: [C:03+1] ml-services: restore transparent proxy functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245290 (owner: 10Elukey) [10:31:12] (03CR) 10Elukey: [C:03+2] admin_ng: restructure Istio sidecar's config for Istio 1.24+ [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245289 (owner: 10Elukey) [10:31:16] (03CR) 10Jelto: gerrit: add gerrit-replica backend to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:31:46] (03CR) 10Muehlenhoff: "The PCC failure on that one node seems unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/1243795 (owner: 10Muehlenhoff) [10:33:03] (03CR) 10Elukey: [C:03+2] ml-services: restore transparent proxy functionality 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1245290 (owner: 10Elukey) [10:33:29] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [10:33:43] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab1003.wikimedia.org [10:33:53] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [10:34:29] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [10:34:51] (03CR) 10Vgutierrez: [C:04-2] "yes" [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:35:08] (03PS8) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) [10:38:10] (03CR) 10Vgutierrez: [C:04-1] gerrit: add gerrit-replica backend to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:38:37] (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: allow both lists and dicts as inference_services values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245291 (owner: 10Elukey) [10:38:52] (03CR) 10Vgutierrez: [C:04-1] gerrit: add gerrit-replica backend to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:38:56] (03CR) 10Dpogorzelski: [C:03+1] ml-services: reduce the revertrisk's isvcs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245292 (owner: 10Elukey) [10:39:12] (03CR) 10Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [10:39:39] (03CR) 10Fabfur: cache::haproxy: save 
x-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [10:40:14] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [10:42:05] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [10:44:31] (03PS9) 10Jelto: trafficserver: Add gerrit-replica backend [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:44:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P89112 and previous config saved to /var/cache/conftool/dbconfig/20260227-104439-marostegui.json [10:44:44] (03PS2) 10Majavah: Add toolsbeta-acme-chief private key [labs/private] - 10https://gerrit.wikimedia.org/r/1240325 [10:44:44] (03PS2) 10Majavah: Add fake metricsinfra Grafana admin password [labs/private] - 10https://gerrit.wikimedia.org/r/1240326 [10:44:45] (03PS1) 10Majavah: Add fake Docker registry passwrod for cloudinfra [labs/private] - 10https://gerrit.wikimedia.org/r/1245297 [10:45:40] (03CR) 10Jelto: trafficserver: Add gerrit-replica backend (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:45:53] (03CR) 10Majavah: [C:03+1] "indeed, sent https://gerrit.wikimedia.org/r/c/labs/private/+/1245297 to fix that" [puppet] - 10https://gerrit.wikimedia.org/r/1243795 (owner: 10Muehlenhoff) [10:47:59] (03CR) 10AikoChou: [C:03+1] kserve-inference: allow both lists and dicts as inference_services values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245291 (owner: 10Elukey) [10:48:05] (03CR) 10Elukey: [C:03+2] kserve-inference: allow both lists and dicts as inference_services values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245291 (owner: 
10Elukey) [10:48:13] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [10:48:14] (03CR) 10Elukey: [C:03+2] ml-services: reduce the revertrisk's isvcs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245292 (owner: 10Elukey) [10:48:31] (03Abandoned) 10Jelto: gerrit: add gerrit-replica service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:49:18] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:49:27] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:50:21] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! After another round of export and generate on top of 1217844 I do not spot any unexpected differences so this seems to work :)" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529) (owner: 10Pppery) [10:50:27] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:51:13] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[10:52:52] (03PS7) 10Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [10:53:04] (03PS1) 10Elukey: ml-services: bump resources for rr-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245301 [10:54:40] (03CR) 10Dpogorzelski: [C:03+1] ml-services: bump resources for rr-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245301 (owner: 10Elukey) [10:54:50] (03CR) 10AikoChou: [C:03+1] ml-services: bump resources for rr-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245301 (owner: 10Elukey) [10:55:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:38] (03CR) 10Elukey: [C:03+2] ml-services: bump resources for rr-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245301 (owner: 10Elukey) [10:56:25] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:59:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T418465)', diff saved to https://phabricator.wikimedia.org/P89113 and previous config saved to /var/cache/conftool/dbconfig/20260227-105947-marostegui.json [10:59:52] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:00:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2217.codfw.wmnet with reason: Maintenance [11:00:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T418465)', diff saved to https://phabricator.wikimedia.org/P89114 and previous config saved to /var/cache/conftool/dbconfig/20260227-110011-marostegui.json [11:00:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:54] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:01:53] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [11:02:20] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[11:04:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T418465)', diff saved to https://phabricator.wikimedia.org/P89115 and previous config saved to /var/cache/conftool/dbconfig/20260227-110447-marostegui.json [11:04:53] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:06:04] (03CR) 10Aklapper: [C:04-1] "Thanks! Seems to work as expected. :) Only setting -1 because of three small issues" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) (owner: 10Pppery) [11:09:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff) [11:09:31] (03CR) 10Clément Goubert: [C:03+1] "Trusting your judgement on this 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff) [11:12:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:19:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P89116 and previous config saved to /var/cache/conftool/dbconfig/20260227-111956-marostegui.json [11:22:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:23:23] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[11:25:14] (03CR) 10Filippo Giunchedi: check_timedatectl: Drop support for old systemd versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff) [11:25:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1142.eqiad.wmnet [11:25:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11657915 (10ops-monitoring-bot) Host an-worker1142.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:27:15] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add gerrit-replica backend [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [11:28:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:34:18] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11657970 (10ajhalili2006) >>! In T418068#11657543, @Jelto wrote: > Thanks for the quick feedback. I... 
[11:34:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1142.eqiad.wmnet [11:34:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1143.eqiad.wmnet [11:35:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P89117 and previous config saved to /var/cache/conftool/dbconfig/20260227-113504-marostegui.json [11:35:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11657971 (10ops-monitoring-bot) Host an-worker1143.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:38:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:39:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:43:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:43:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1143.eqiad.wmnet [11:43:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1144.eqiad.wmnet [11:44:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658001 (10ops-monitoring-bot) Host an-worker1144.eqiad.wmnet rebooted 
by btullis@cumin1003 with reason: Rebo... [11:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:46:22] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers - https://phabricator.wikimedia.org/T414948#11658011 (10BTullis) a:05BTullis→03None [11:50:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T418465)', diff saved to https://phabricator.wikimedia.org/P89118 and previous config saved to /var/cache/conftool/dbconfig/20260227-115012-marostegui.json [11:50:18] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:50:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2224.codfw.wmnet with reason: Maintenance [11:50:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T418465)', diff saved to https://phabricator.wikimedia.org/P89119 and previous config saved to /var/cache/conftool/dbconfig/20260227-115026-marostegui.json [11:55:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T418465)', diff saved to https://phabricator.wikimedia.org/P89120 and previous config saved to /var/cache/conftool/dbconfig/20260227-115504-marostegui.json [11:55:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1144.eqiad.wmnet [11:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:55:18] !log 
btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1145.eqiad.wmnet [11:55:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658034 (10ops-monitoring-bot) Host an-worker1145.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:56:46] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: save x-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0800) [12:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T1200). [12:00:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:03:58] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11658036 (10A_smart_kitten) >>! In T418068#11657543, @Jelto wrote: > So I'll habe to decline the ta... 
[12:06:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1145.eqiad.wmnet [12:06:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1146.eqiad.wmnet [12:06:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658051 (10ops-monitoring-bot) Host an-worker1146.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:08:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658053 (10BTullis) I have started configuring the first handful of hadoop worker nodes with the new server p... [12:10:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P89121 and previous config saved to /var/cache/conftool/dbconfig/20260227-121012-marostegui.json [12:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:15:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1146.eqiad.wmnet [12:15:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1147.eqiad.wmnet [12:15:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658087 (10ops-monitoring-bot) Host an-worker1147.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:15:57] FIRING: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:16:42] !ack [12:16:43] 7496 (ACKED) [3x] ProbeDown sre (text-https:443 probes/service) [12:16:44] here [12:16:58] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:30] FIRING: [10x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 4 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [12:20:57] RESOLVED: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:21:58] RESOLVED: [5x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:30] RESOLVED: [10x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 4 unhealthy realservers pooled on lvs6001:3003 
- https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [12:24:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1147.eqiad.wmnet [12:24:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1148.eqiad.wmnet [12:25:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658098 (10ops-monitoring-bot) Host an-worker1148.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:25:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P89122 and previous config saved to /var/cache/conftool/dbconfig/20260227-122521-marostegui.json [12:26:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:30:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:31:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:34:43] FIRING: [2x] CoreRouterInterfaceDown: Core router 
interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:35:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1148.eqiad.wmnet [12:35:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1149.eqiad.wmnet [12:36:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11658114 (10ops-monitoring-bot) Host an-worker1149.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:36:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:40:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T418465)', diff saved to https://phabricator.wikimedia.org/P89123 and previous config saved to /var/cache/conftool/dbconfig/20260227-124029-marostegui.json [12:40:34] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:43:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2229.codfw.wmnet with reason: Maintenance [12:46:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:47:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1149.eqiad.wmnet [12:47:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [12:47:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T418465)', diff saved to https://phabricator.wikimedia.org/P89124 and previous config saved to /var/cache/conftool/dbconfig/20260227-124711-marostegui.json [12:47:16] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:47:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:48:16] FIRING: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:55:32] (03PS1) 10Slyngshede: LDAPBackend: Test LDAP email validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1245351 [12:56:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:56:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89125 and previous config saved to /var/cache/conftool/dbconfig/20260227-125654-marostegui.json [12:56:59] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:58:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:00:14] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [13:00:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:01:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:02:11] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:02:26] :( [13:02:44] gitlab needs a reboot, should be resolved in 5 mins [13:02:51] thx [13:03:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89126 and previous config saved to /var/cache/conftool/dbconfig/20260227-130306-marostegui.json [13:03:12] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:03:39] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:45] PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:05:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:06:52] !log jelto@cumin1003 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [13:07:15] GitLab maintenance done [13:08:24] (03PS1) 10JMeybohm: admin/data: Refresh YubiKey-5C [puppet] - 10https://gerrit.wikimedia.org/r/1245353 [13:08:43] FIRING: [4x] ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:09:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T418465)', diff saved to https://phabricator.wikimedia.org/P89127 and previous config saved to /var/cache/conftool/dbconfig/20260227-130926-marostegui.json [13:09:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:12:39] (03CR) 10Jelto: [C:03+1] "lgtm, verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1245353 (owner: 10JMeybohm) [13:13:57] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:15:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:18:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P89128 and previous config saved to /var/cache/conftool/dbconfig/20260227-131815-marostegui.json [13:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:04] (03CR) 10Vgutierrez: [C:03+1] "nitpick: commit message isn't accurate" [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [13:20:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:23] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P89129 and previous config saved to /var/cache/conftool/dbconfig/20260227-132434-marostegui.json [13:25:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:25] FIRING: [5x] SystemdUnitFailed: 
wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P89130 and previous config saved to /var/cache/conftool/dbconfig/20260227-133323-marostegui.json [13:34:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:35:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [13:38:43] RESOLVED: ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab1004:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P89131 and previous config saved to /var/cache/conftool/dbconfig/20260227-133943-marostegui.json [13:40:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:40:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:16] (CR) JMeybohm: [C:+2] admin/data: Refresh YubiKey-5C [puppet] - https://gerrit.wikimedia.org/r/1245353 (owner: JMeybohm) [13:45:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:17] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:21] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [13:47:22] (PS8) Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [13:48:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89132 and previous config saved to /var/cache/conftool/dbconfig/20260227-134831-marostegui.json [13:48:37] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF 
wikis - https://phabricator.wikimedia.org/T418465 [13:48:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:48:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89133 and previous config saved to /var/cache/conftool/dbconfig/20260227-134855-marostegui.json [13:50:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:28] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers - https://phabricator.wikimedia.org/T414948#11658338 (Jclark-ctr) a:Jclark-ctr [13:54:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T418465)', diff saved to https://phabricator.wikimedia.org/P89134 and previous config saved to /var/cache/conftool/dbconfig/20260227-135451-marostegui.json [13:54:56] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:55:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [13:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:55:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T418465)', diff saved to https://phabricator.wikimedia.org/P89135 and previous config saved to 
/var/cache/conftool/dbconfig/20260227-135516-marostegui.json [13:55:25] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:17] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:31] (PS1) Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) [13:58:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89136 and previous config saved to /var/cache/conftool/dbconfig/20260227-135827-marostegui.json [14:00:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:51] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1245351 (owner: Slyngshede) [14:04:42] (CR) Slyngshede: [C:+2] LDAPBackend: Test LDAP email validator [software/bitu] - https://gerrit.wikimedia.org/r/1245351 (owner: Slyngshede) [14:04:55] (PS1) DCausse: opensearch-semantic-search: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245366 [14:05:48] (CR) Ebernhardson: [C:+2] opensearch-semantic-search: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245366 (owner: 
DCausse) [14:07:07] PROBLEM - Host maps1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:07:46] (Merged) jenkins-bot: LDAPBackend: Test LDAP email validator [software/bitu] - https://gerrit.wikimedia.org/r/1245351 (owner: Slyngshede) [14:07:54] (Merged) jenkins-bot: opensearch-semantic-search: increase cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245366 (owner: DCausse) [14:10:43] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:10:54] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:11:13] (PS1) Btullis: Add an analytics PSP permitting access to certain hostPaths [deployment-charts] - https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) [14:12:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:13:14] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11658466 (Jclark-ctr) a:Jclark-ctr @Marostegui Can this be swapped at any time? 
[14:13:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P89137 and previous config saved to /var/cache/conftool/dbconfig/20260227-141336-marostegui.json [14:15:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:47] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11658480 (Marostegui) Yeah, you can go for it! [14:17:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T418465)', diff saved to https://phabricator.wikimedia.org/P89138 and previous config saved to /var/cache/conftool/dbconfig/20260227-141732-marostegui.json [14:17:38] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:18:02] fabfur: claime: thcipriani: dduvall: dancy: I need to do an emergency config change, any concerns? [14:18:15] unflipping the config flag for T415007, it's behaving weirdly [14:18:16] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [14:18:27] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11658499 (Jclark-ctr) @Marostegui Drive has been swapped i see in idrac it is Rebuilding. 
will leave ticket open till it finishes [14:20:07] (PS1) Gergő Tisza: Revert "Enable JWT session cookie for bot passwords (all wikis)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245368 (https://phabricator.wikimedia.org/T415007) [14:20:20] (PS1) Btullis: Apply the analytics pod security profile to several namespaces in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [14:20:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:41] tgr_: not from me [14:21:49] jouncebot: nowandnext [14:21:49] For the next 17 hour(s) and 38 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260227T0800) [14:21:49] In 17 hour(s) and 38 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260228T0800) [14:21:54] oh right it's friday lol [14:21:59] cdanis: x) [14:22:20] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11658508 (Marostegui) Thanks, I see it too! ` root@db1162:~# megacli -PDRbld -ShowProg -PhysDrv[32:6] -a0 Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 3% in 4 Minutes. 
` [14:25:53] (CR) D3r1ck01: [C:+1] "Per T415007#11658470" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245368 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza) [14:26:21] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245368 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza) [14:27:10] (Merged) jenkins-bot: Revert "Enable JWT session cookie for bot passwords (all wikis)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245368 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza) [14:27:33] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1245368|Revert "Enable JWT session cookie for bot passwords (all wikis)" (T415007)]] [14:27:38] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [14:28:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P89139 and previous config saved to /var/cache/conftool/dbconfig/20260227-142844-marostegui.json [14:29:28] !log tgr@deploy2002 tgr: Backport for [[gerrit:1245368|Revert "Enable JWT session cookie for bot passwords (all wikis)" (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:31:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:32:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P89140 and previous config saved to /var/cache/conftool/dbconfig/20260227-143241-marostegui.json [14:37:04] !log tgr@deploy2002 tgr: Continuing with sync [14:40:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:41:06] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245368|Revert "Enable JWT session cookie for bot passwords (all wikis)" (T415007)]] (duration: 13m 32s) [14:41:11] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [14:43:17] SRE, Infrastructure-Foundations: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11658580 (MLechvien-WMF) [14:43:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89141 and previous config saved to /var/cache/conftool/dbconfig/20260227-144353-marostegui.json [14:43:58] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:43:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:44:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T418465)', diff saved to https://phabricator.wikimedia.org/P89142 and previous 
config saved to /var/cache/conftool/dbconfig/20260227-144407-marostegui.json [14:44:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:45:17] !log emergency deploy for T415007#11658252 done [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P89143 and previous config saved to /var/cache/conftool/dbconfig/20260227-144749-marostegui.json [14:48:01] (CR) Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler) [14:49:22] (PS1) DCausse: opensearch-semantic-search: increase mem limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245373 [14:50:05] (CR) Ebernhardson: [C:+2] opensearch-semantic-search: increase mem limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245373 (owner: DCausse) [14:50:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T418465)', diff saved to https://phabricator.wikimedia.org/P89144 and previous config saved to /var/cache/conftool/dbconfig/20260227-145016-marostegui.json [14:50:22] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:50:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[14:51:20] (CR) Vgutierrez: [C:+1] cache::haproxy: save x-ratelimit-class content for webrequest [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur) [14:52:06] (Merged) jenkins-bot: opensearch-semantic-search: increase mem limits [deployment-charts] - https://gerrit.wikimedia.org/r/1245373 (owner: DCausse) [14:53:47] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:53:57] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:54:03] (CR) Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler) [14:54:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:55:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:56:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:58:12] (CR) Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur) [15:02:52] claime: that didn't help, and I'm not really sure what's going on, so I'll do another deploy (just extra logging for now) [15:02:58] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T418465)', diff saved to https://phabricator.wikimedia.org/P89146 and previous config saved to /var/cache/conftool/dbconfig/20260227-150257-marostegui.json [15:03:03] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:03:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:03:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89147 and previous config saved to /var/cache/conftool/dbconfig/20260227-150322-marostegui.json [15:05:03] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:05:15] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:05:21] (PS1) Gergő Tisza: session: Log stack trace for JWT errors [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245381 [15:05:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P89148 and previous config saved to /var/cache/conftool/dbconfig/20260227-150525-marostegui.json [15:05:28] (PS1) Gergő Tisza: session: Log stack trace for JWT errors [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 [15:05:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:05:59] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:06:02] 
(PS1) JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1245383 (https://phabricator.wikimedia.org/T418467) [15:06:07] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:07:20] tgr_: ack [15:07:54] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers - https://phabricator.wikimedia.org/T414948#11658657 (Jclark-ctr) All cables have been removed from servers currently still in racks at th... [15:08:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:09:12] (PS1) Bking: opensearch-semantic-search: Permit pods up to 32GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) [15:09:21] (CR) CI reject: [V:-1] session: Log stack trace for JWT errors [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 (owner: Gergő Tisza) [15:09:26] (CR) Snwachukwu: [C:+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1245383 (https://phabricator.wikimedia.org/T418467) (owner: JavierMonton) [15:10:02] (PS2) Bking: opensearch-semantic-search: Permit pods up to 32GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) [15:10:35] (CR) DCausse: opensearch-semantic-search: Permit pods up to 32GB RAM (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) (owner: Bking) [15:10:51] (CR) D3r1ck01: "Failure is caused by 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1244684. Does it seem we have to create a backpor" [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 (owner: Gergő Tisza) [15:12:29] (PS1) Gergő Tisza: tests: Fix missing JWT issuer for CentralAuthSessionProvider [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245387 (https://phabricator.wikimedia.org/T418487) [15:12:45] (PS3) Bking: opensearch-semantic-search: Permit pods up to 32GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) [15:12:51] (CR) Bking: opensearch-semantic-search: Permit pods up to 32GB RAM (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) (owner: Bking) [15:12:52] (PS2) Gergő Tisza: session: Log stack trace for JWT errors [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 [15:13:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:14:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:16:02] (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1245383 (https://phabricator.wikimedia.org/T418467) (owner: JavierMonton) [15:16:52] (CR) Dpogorzelski: [C:+2] httpbb: fix the revscoring-editquality-goodfaith test [puppet] - https://gerrit.wikimedia.org/r/1244659 (owner: AikoChou) [15:17:55] (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 
https://gerrit.wikimedia.org/r/1245383 (https://phabricator.wikimedia.org/T418467) (owner: JavierMonton) [15:20:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P89149 and previous config saved to /var/cache/conftool/dbconfig/20260227-152033-marostegui.json [15:22:08] (CR) Ebernhardson: [C:+2] opensearch-semantic-search: Permit pods up to 32GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) (owner: Bking) [15:24:12] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:24:23] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:25:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89150 and previous config saved to /var/cache/conftool/dbconfig/20260227-152538-marostegui.json [15:25:44] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:26:17] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[15:26:33] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245381 (owner: Gergő Tisza) [15:26:33] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245387 (https://phabricator.wikimedia.org/T418487) (owner: Gergő Tisza) [15:26:33] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 (owner: Gergő Tisza) [15:26:40] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:29:51] (Merged) jenkins-bot: opensearch-semantic-search: Permit pods up to 32GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1245384 (https://phabricator.wikimedia.org/T413969) (owner: Bking) [15:30:56] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [15:31:04] (Merged) jenkins-bot: session: Log stack trace for JWT errors [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245381 (owner: Gergő Tisza) [15:31:20] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:31:27] (Merged) jenkins-bot: tests: Fix missing JWT issuer for CentralAuthSessionProvider [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245387 (https://phabricator.wikimedia.org/T418487) (owner: Gergő Tisza) [15:31:30] (Merged) jenkins-bot: session: Log stack trace for JWT errors [extensions/CentralAuth] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245382 (owner: Gergő Tisza) [15:31:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:31:54] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1245381|session: Log stack trace for JWT errors]], [[gerrit:1245387|tests: Fix missing JWT issuer for CentralAuthSessionProvider (T418487 T415007)]], [[gerrit:1245382|session: Log stack trace for JWT errors]] [15:32:00] T418487: Extension CI broken: CentralAuthSessionProviderTest: undefined option: 'JwtSessionCookieIssuer' - https://phabricator.wikimedia.org/T418487 [15:32:01] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [15:32:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:33:49] !log tgr@deploy2002 tgr: Backport for [[gerrit:1245381|session: Log stack trace for JWT errors]], [[gerrit:1245387|tests: Fix missing JWT issuer for CentralAuthSessionProvider (T418487 T415007)]], [[gerrit:1245382|session: Log stack trace for JWT errors]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:34:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:34:50] (PS2) Btullis: Apply the analytics pod security profile to several namespaces in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [15:35:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T418465)', diff saved to https://phabricator.wikimedia.org/P89151 and previous config saved to /var/cache/conftool/dbconfig/20260227-153541-marostegui.json [15:35:47] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:35:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:36:05] !log tgr@deploy2002 tgr: Continuing with sync [15:36:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T418465)', diff saved to https://phabricator.wikimedia.org/P89152 and previous config saved to /var/cache/conftool/dbconfig/20260227-153606-marostegui.json [15:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:40:05] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245381|session: Log stack trace for JWT errors]], [[gerrit:1245387|tests: Fix missing JWT issuer for CentralAuthSessionProvider (T418487 T415007)]], [[gerrit:1245382|session: Log stack trace for JWT errors]] (duration: 08m 11s) [15:40:10] T418487: Extension CI broken: CentralAuthSessionProviderTest: undefined option: 'JwtSessionCookieIssuer' - 
https://phabricator.wikimedia.org/T418487 [15:40:11] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [15:40:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P89153 and previous config saved to /var/cache/conftool/dbconfig/20260227-154046-marostegui.json [15:42:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T418465)', diff saved to https://phabricator.wikimedia.org/P89154 and previous config saved to /var/cache/conftool/dbconfig/20260227-154216-marostegui.json [15:42:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:45:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:17] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:48:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:48:40] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on 
wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [15:48:41] (03PS1) 10Marostegui: orchestrator: Monitor for non-FQDNs in the host resolve cache [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) [15:49:40] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [15:50:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:34] (03CR) 10CI reject: [V:04-1] orchestrator: Monitor for non-FQDNs in the host resolve cache [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [15:51:39] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:51:45] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:53:11] (03PS2) 10Marostegui: orchestrator: Monitor for non-FQDNs in the host resolve cache [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) [15:53:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:54:52] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:55:12] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:55:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:33] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [15:55:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P89155 and previous config saved to /var/cache/conftool/dbconfig/20260227-155554-marostegui.json [15:57:04] (03CR) 10Marostegui: "PCC which looks good: https://puppet-compiler.wmflabs.org/output/1245393/5955/" [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [15:57:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P89156 and previous config saved to /var/cache/conftool/dbconfig/20260227-155724-marostegui.json [15:58:40] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [15:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:00:25] RESOLVED: [5x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch2066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:11:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89157 and previous config saved to /var/cache/conftool/dbconfig/20260227-161103-marostegui.json [16:11:08] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:11:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance [16:11:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89158 and previous config saved to /var/cache/conftool/dbconfig/20260227-161127-marostegui.json [16:12:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P89159 and previous config saved to /var/cache/conftool/dbconfig/20260227-161233-marostegui.json [16:16:25] (03PS1) 10Btullis: Add a ValidatingAdmissionPolicy permitting access to /srv/spark 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [16:23:18] (03CR) 10JHathaway: "Is this still in use somewhere, it looks like it was removed in [P:systemd::timesyncd remove deprecated Icinga check](https://gerrit.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff) [16:24:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:24:55] 06SRE, 06ServiceOps new, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11658883 (10hnowlan) a:05hnowlan→03None [16:27:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T418465)', diff saved to https://phabricator.wikimedia.org/P89160 and previous config saved to /var/cache/conftool/dbconfig/20260227-162741-marostegui.json [16:27:47] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:27:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:27:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 6 hosts with reason: Maintenance [16:28:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T418465)', diff saved to https://phabricator.wikimedia.org/P89161 and previous config saved to /var/cache/conftool/dbconfig/20260227-162806-marostegui.json [16:33:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:34:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:34:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89162 and previous config saved to /var/cache/conftool/dbconfig/20260227-163446-marostegui.json [16:34:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:35:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T418465)', diff saved to https://phabricator.wikimedia.org/P89163 and previous config saved to /var/cache/conftool/dbconfig/20260227-163514-marostegui.json [16:35:49] RESOLVED: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [16:42:59] (03PS1) 10JavierMonton: stream: mediawiki.page_html_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245410 (https://phabricator.wikimedia.org/T418467) [16:44:31] FIRING: [2x] ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:45:10] (03CR) 10JavierMonton: stream: mediawiki.page_html_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245410 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:48:34] (03PS2) 10Btullis: Add a ValidatingAdmissionPolicy permitting access to /srv/spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [16:48:34] (03PS3) 10Btullis: Apply the new PSP and VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [16:49:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:49:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P89164 and previous config saved to /var/cache/conftool/dbconfig/20260227-164954-marostegui.json [16:50:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P89165 and previous config saved to /var/cache/conftool/dbconfig/20260227-165022-marostegui.json [16:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:52:15] (03CR) 10Xcollazo: [C:03+1] stream: mediawiki.page_html_content_change [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1245410 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [16:54:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:56:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:59:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:04:11] (03CR) 10Vgutierrez: [C:03+1] ncmonitor: Add ncmonitor sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1243258 (owner: 10BCornwall) [17:04:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:04:45] (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Add ncmonitor sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1243258 (owner: 10BCornwall) [17:05:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P89166 and previous config saved to /var/cache/conftool/dbconfig/20260227-170503-marostegui.json [17:05:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:31] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P89167 and previous config saved to /var/cache/conftool/dbconfig/20260227-170530-marostegui.json [17:09:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:10:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:00] (03PS2) 10Btullis: Add an analytics PSP permitting access to certain hostPaths [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) [17:11:00] (03PS3) 10Btullis: Add a ValidatingAdmissionPolicy permitting access to /srv/spark [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [17:11:00] (03PS4) 10Btullis: Apply the new PSP and VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [17:13:19] (03PS1) 10Btullis: Add a /srv/spark managed directory on dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1245413 (https://phabricator.wikimedia.org/T412925) [17:13:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8158/console" [puppet] - 10https://gerrit.wikimedia.org/r/1245413 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [17:14:14] (03PS1) 10Herron: centrallog: opt poolcounter log into hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/1245414 (https://phabricator.wikimedia.org/T418612) [17:14:31] FIRING: 
[2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:14:58] (03PS2) 10Herron: centrallog: opt poolcounter log into hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/1245414 (https://phabricator.wikimedia.org/T418612) [17:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:19:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89168 and previous config saved to /var/cache/conftool/dbconfig/20260227-172011-marostegui.json [17:20:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:20:16] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:20:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2194.codfw.wmnet with reason: Maintenance [17:20:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T418465)', diff saved to 
https://phabricator.wikimedia.org/P89169 and previous config saved to /var/cache/conftool/dbconfig/20260227-172036-marostegui.json [17:20:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T418465)', diff saved to https://phabricator.wikimedia.org/P89170 and previous config saved to /var/cache/conftool/dbconfig/20260227-172046-marostegui.json [17:21:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:21:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89171 and previous config saved to /var/cache/conftool/dbconfig/20260227-172111-marostegui.json [17:23:57] (03CR) 10Hnowlan: [C:03+1] centrallog: opt poolcounter log into hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/1245414 (https://phabricator.wikimedia.org/T418612) (owner: 10Herron) [17:24:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:24:55] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:28:28] (03PS8) 10Elukey: profile::kafka::broker: support new confluent distributions [puppet] - 10https://gerrit.wikimedia.org/r/1239135 (https://phabricator.wikimedia.org/T416670) [17:28:28] (03PS7) 10Elukey: role::kafka::test: prepare the cluster for the Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1239142 (https://phabricator.wikimedia.org/T416670) [17:28:41] (03CR) 10Herron: [C:03+2] centrallog: opt poolcounter log into hourly rotation [puppet] - 
10https://gerrit.wikimedia.org/r/1245414 (https://phabricator.wikimedia.org/T418612) (owner: 10Herron) [17:29:14] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1239142 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [17:30:38] (03CR) 10Elukey: "I added some extra ensure_resources to properly clean up packages without the need for a manual intervention, and tested those in pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1239135 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [17:30:51] (03PS1) 10Btullis: Disable the systemd timer that pulls the latest phabricator dump [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) [17:31:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89172 and previous config saved to /var/cache/conftool/dbconfig/20260227-173107-marostegui.json [17:31:12] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:31:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8159/console" [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [17:34:23] (03PS2) 10Btullis: Disable the systemd timer that pulls the latest phabricator dump [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) [17:34:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:34:45] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8160/console" [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [17:37:36] (03CR) 10Btullis: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [17:39:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:40:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:44:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T418465)', diff saved to https://phabricator.wikimedia.org/P89173 and previous config saved to /var/cache/conftool/dbconfig/20260227-174448-marostegui.json [17:44:54] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:46:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P89174 and previous config saved to /var/cache/conftool/dbconfig/20260227-174615-marostegui.json [17:50:51] (03PS3) 10Btullis: Disable the systemd timer that pulls the latest phabricator dump [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) [17:50:51] (03PS1) 10Btullis: Remove the job that synced the phab dumps to the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1245419 (https://phabricator.wikimedia.org/T417824) [17:50:55] 10SRE-SLO, 13Patch-For-Review: Sloth: 
onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11659214 (10herron) [17:51:35] (03PS2) 10Btullis: Remove the job that synced the phab dumps to the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1245419 (https://phabricator.wikimedia.org/T417824) [17:53:55] (03CR) 10BPirkle: [C:03+1] [DNM] Add growthexperiments.v0 to $wgRestSandboxSpecs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [17:55:46] RECOVERY - Confd vcl based reload on cp2033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [17:56:13] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11659217 (10herron) [17:57:17] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [17:59:21] (03CR) 10Btullis: [C:03+2] Disable the systemd timer that pulls the latest phabricator dump [puppet] - 10https://gerrit.wikimedia.org/r/1245416 (https://phabricator.wikimedia.org/T417824) (owner: 10Btullis) [17:59:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P89175 and previous config saved to /var/cache/conftool/dbconfig/20260227-175957-marostegui.json [18:00:02] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:00:12] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:01:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P89176 and previous config saved to /var/cache/conftool/dbconfig/20260227-180123-marostegui.json [18:01:38] 10SRE-SLO, 
13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11659235 (10herron) [18:02:46] SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11659238 (herron) [18:03:04] (CR) ArielGlenn: rest-gateway: assign ratelimit class by network range (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler) [18:04:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:06:43] (PS1) Btullis: Add the five new dse-k8s-worker nodes to the cluster [puppet] - https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) [18:07:53] ops-esams, ops-magru, SRE, DC-Ops: Data Required for Energy Efficiency Directive: Due March 31 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11659263 (BCornwall) @ssingh Are we providing information on traffic hosts or all hosts? 
[18:09:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:14:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:14:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [18:14:44] Deployment miscweb-bugzilla in miscweb at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=miscweb&var-deployment=miscweb-bugzilla - ... 
[18:14:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:15:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P89177 and previous config saved to /var/cache/conftool/dbconfig/20260227-181505-marostegui.json [18:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89178 and previous config saved to /var/cache/conftool/dbconfig/20260227-181632-marostegui.json [18:16:37] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:16:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [18:19:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:24:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:24:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on 
wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T418465)', diff saved to https://phabricator.wikimedia.org/P89179 and previous config saved to /var/cache/conftool/dbconfig/20260227-183013-marostegui.json [18:30:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:30:19] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:30:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance [18:30:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T418465)', diff saved to https://phabricator.wikimedia.org/P89180 and previous config saved to /var/cache/conftool/dbconfig/20260227-183038-marostegui.json [18:32:19] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:33:57] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:34:05] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:34:14] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:34:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:39:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [18:39:44] Deployment miscweb-bugzilla in miscweb at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=miscweb&var-deployment=miscweb-bugzilla - ... [18:39:47] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:46:17] FIRING: [8x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:50:25] 06SRE, 07Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11659366 (10herron) We could consider setting bots to use direct messages to reduce the amount of chatter in the channel without losing notification... 
[18:51:17] RESOLVED: [6x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:12] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:53:24] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:55:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T418465)', diff saved to https://phabricator.wikimedia.org/P89181 and previous config saved to /var/cache/conftool/dbconfig/20260227-185500-marostegui.json [18:55:06] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:59:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:04:06] (03PS1) 10CDobbins: conftool: remove ats-be from cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) [19:04:42] (03CR) 10CI reject: [V:04-1] conftool: remove ats-be from cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:05:33] (03PS2) 10CDobbins: conftool: remove ats-be from cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) [19:10:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P89182 and previous config saved to 
/var/cache/conftool/dbconfig/20260227-191009-marostegui.json [19:13:34] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8161/console" [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:13:37] 06SRE, 07SRE-Unowned, 10Deployments, 06Release-Engineering-Team: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#11659440 (10A_smart_kitten) (in case anyone subscribed to this ticket is interested in following it, a new/similar task was filed recently as {... [19:14:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:15:41] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:15:47] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:15:53] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:15:56] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:16:02] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:16:20] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:16:30] !log 
ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:16:39] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:18:16] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:18:20] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:19:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:25:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P89183 and previous config saved to /var/cache/conftool/dbconfig/20260227-192517-marostegui.json [19:25:46] 06SRE, 06Infrastructure-Foundations: Avoid dhcpcd-base on trixie hosts - https://phabricator.wikimedia.org/T414341#11659461 (10BCornwall) FWIW, The journal gets spammed with `dhcpcd is not running` due to this. 
[19:29:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:30:03] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:30:17] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:31:59] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:32:09] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:40:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T418465)', diff saved to https://phabricator.wikimedia.org/P89185 and previous config saved to /var/cache/conftool/dbconfig/20260227-194026-marostegui.json [19:40:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:40:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2227.codfw.wmnet with reason: Maintenance [19:40:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T418465)', diff saved to https://phabricator.wikimedia.org/P89186 and previous config saved to /var/cache/conftool/dbconfig/20260227-194051-marostegui.json [19:42:37] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8166/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:47:19] (03CR) 10BCornwall: [C:03+1] conftool: remove ats-be from cp20[43-58] [puppet] - 
10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:57:41] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:57:51] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:59:28] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:59:36] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:05:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T418465)', diff saved to https://phabricator.wikimedia.org/P89187 and previous config saved to /var/cache/conftool/dbconfig/20260227-200507-marostegui.json [20:05:13] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:06:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11659574 (10VRiley-WMF) [20:07:28] !log [WDQS] `ryankemper@wdqs1014:~$ sudo systemctl restart wdqs-blazegraph` (lag was high, see https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=2026-02-27T18:05:46.506Z&to=2026-02-27T20:03:42.806Z&timezone=utc&var-cluster_name=wdqs-main&var-graph_type=%289102%7C919%5B35%5D%29&refresh=1m&viewPanel=panel-8) [20:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11659582 (10VRiley-WMF) @Marostegui It seems when I try to run it on dbproxy1028, it 
tries to run it as sudo. I may not have permission... [20:14:26] (03CR) 10CDobbins: [V:03+1 C:03+2] conftool: remove ats-be from cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1245426 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [20:20:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P89188 and previous config saved to /var/cache/conftool/dbconfig/20260227-202015-marostegui.json [20:25:49] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2054.codfw.wmnet with OS trixie [20:26:36] (03CR) 10Ryan Kemper: wdqs: Separate deadlock remediation config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [20:26:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2055.codfw.wmnet with OS trixie [20:27:23] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2056.codfw.wmnet with OS trixie [20:34:28] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 3 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11659632 (10VRiley-WMF) Hey @Dzahn we have decommed Dell R440's, but none that were in the "Config C" would it be po... 
[20:34:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:35:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P89189 and previous config saved to /var/cache/conftool/dbconfig/20260227-203523-marostegui.json [20:39:45] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2054.codfw.wmnet with reason: host reimage [20:40:39] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2055.codfw.wmnet with reason: host reimage [20:41:00] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2056.codfw.wmnet with reason: host reimage [20:44:35] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2054.codfw.wmnet with reason: host reimage [20:48:15] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2055.codfw.wmnet with reason: host reimage [20:50:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T418465)', diff saved to https://phabricator.wikimedia.org/P89190 and previous config saved to /var/cache/conftool/dbconfig/20260227-205031-marostegui.json [20:50:37] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:50:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2239.codfw.wmnet with reason: Maintenance [20:51:50] !log cdobbins@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2056.codfw.wmnet with reason: host reimage [20:55:08] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2057.codfw.wmnet with OS trixie [20:57:11] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1245465 -- context is T418549 which is corrupting edits on zhwiki, are SRE ok with a deployment? (cc: thcipriani dduvall denisse cwhite ). I can self-deploy with spiderpig. [20:57:12] T418549: VisualEditor may add excessive LanguageConverter tags since 1.46.0-wmf.17 - https://phabricator.wikimedia.org/T418549 [20:57:22] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2058.codfw.wmnet with OS trixie [20:57:57] (patch isn't merged to master yet, but I thought I'd get the approval process under way while I waited for jenkins) [20:58:40] (03PS1) 10Cwhite: validator: add note about dot-delimited root fields [software/ecs] - 10https://gerrit.wikimedia.org/r/1245467 [20:59:08] cscott: ack and thanks [20:59:55] cscott: please feel free to deploy [21:00:07] Here. [21:00:08] ^ what dduvall and cwhite said [21:00:08] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:00:23] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:01:44] thanks! i'll let you know before i start, i'm going to poke at it on beta before i pull the trigger. 
[21:08:32] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2054.codfw.wmnet with OS trixie [21:09:18] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2057.codfw.wmnet with reason: host reimage [21:11:24] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2055.codfw.wmnet with OS trixie [21:11:52] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2058.codfw.wmnet with reason: host reimage [21:12:07] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2056.codfw.wmnet with OS trixie [21:14:05] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:14:50] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:15:00] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2057.codfw.wmnet with reason: host reimage [21:17:55] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2058.codfw.wmnet with reason: host reimage [21:19:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:36:45] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2057.codfw.wmnet with OS trixie [21:37:53] (03PS1) 10Cwhite: logging: set poolcounter channel log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) [21:38:19] (03PS37) 10CDobbins: prometheus: add pooled host check [puppet] - 
10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:38:58] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2058.codfw.wmnet with OS trixie [21:43:28] (03CR) 10CI reject: [V:04-1] logging: set poolcounter channel log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) (owner: 10Cwhite) [21:43:39] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:46:34] (03CR) 10CDobbins: [V:03+1] prometheus: add pooled host check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:48:57] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:57:49] (03CR) 10BCornwall: prometheus: add pooled host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:15:04] (03PS1) 10C. 
Scott Ananian: Ensure that Parsoid canonical HTML is not language converted [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245477 (https://phabricator.wikimedia.org/T418549) [22:16:08] (03PS6) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) [22:16:29] (03CR) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code (033 comments) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) (owner: 10Pppery) [22:18:09] (03PS11) 10Krinkle: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [22:18:34] (03CR) 10Subramanya Sastry: [C:03+1] Ensure that Parsoid canonical HTML is not language converted [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245477 (https://phabricator.wikimedia.org/T418549) (owner: 10C. Scott Ananian) [22:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:25:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:45:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245477 (https://phabricator.wikimedia.org/T418549) (owner: 10C. 
Scott Ananian) [22:46:18] i'm going ahead with the backport. something like 2-3% of all visual editor edits on zhwiki are being corrupted. [22:46:21] (03CR) 10JHathaway: [C:03+1] firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [22:46:26] (that's only 111 edits so far, but still) [22:53:20] (03Merged) 10jenkins-bot: Ensure that Parsoid canonical HTML is not language converted [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1245477 (https://phabricator.wikimedia.org/T418549) (owner: 10C. Scott Ananian) [22:53:41] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]] [22:53:47] T418549: VisualEditor may add excessive LanguageConverter tags since 1.46.0-wmf.17 - https://phabricator.wikimedia.org/T418549 [22:55:29] !log cscott@deploy2002 cscott: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:58:25] (03CR) 10JHathaway: wmflib: hosts2ips: Allow passing in IP ranges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211650 (owner: 10Majavah) [22:58:28] !log cscott@deploy2002 cscott: Continuing with sync [23:02:29] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]] (duration: 08m 47s) [23:02:33] T418549: VisualEditor may add excessive LanguageConverter tags since 1.46.0-wmf.17 - https://phabricator.wikimedia.org/T418549 [23:03:26] ok, backport tested & done [23:16:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:26:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:29:21] (03PS5) 10Pppery: Set up `arc lint`, make it pass, update README [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) [23:30:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:47:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [23:49:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:54:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:57] (03PS1) 10Pppery: Remove `projects/phabricator_ext/README` [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1245511 [23:57:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock