[00:08:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553
[00:08:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553 (owner: TrainBranchBot)
[00:10:51] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:16:27] <icinga-wm>	 PROBLEM - Host prometheus1008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:43] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service prometheus1008:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:43] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553 (owner: TrainBranchBot)
[00:32:43] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:49:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job smoke/dns in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:50:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:54:13] <icinga-wm>	 PROBLEM - SSH on centrallog2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:54:42] <jinxer-wm>	 FIRING: [20x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:57:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:57:55] <icinga-wm>	 PROBLEM - SSH on prometheus1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:57:57] <icinga-wm>	 PROBLEM - SSH on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:58:43] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:02:43] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:04:42] <jinxer-wm>	 FIRING: [21x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:08:19] <icinga-wm>	 PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:10:57] <icinga-wm>	 PROBLEM - SSH on prometheus2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:13:57] <icinga-wm>	 PROBLEM - SSH on vrts1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:19:00] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1011.eqiad.wmnet with OS bullseye
[01:27:42] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:28:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:42] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:36:35] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[01:39:31] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[01:44:13] <icinga-wm>	 PROBLEM - SSH on prometheus1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:55:22] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1011.eqiad.wmnet with OS bullseye
[01:56:25] <icinga-wm>	 PROBLEM - SSH on centrallog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:04:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:05:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:20:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:29:42] <jinxer-wm>	 FIRING: [22x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:38:09] <icinga-wm>	 PROBLEM - SSH on vrts2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:38:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:48:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[02:53:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[02:53:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:33] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye
[03:17:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:18:48] <wikibugs>	 (PS1) Andrew Bogott: cloudrabbit200x-dev: fix fqdn [puppet] - https://gerrit.wikimedia.org/r/1141564 (https://phabricator.wikimedia.org/T392539)
[03:20:13] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudrabbit200x-dev: fix fqdn [puppet] - https://gerrit.wikimedia.org/r/1141564 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[03:24:10] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[03:26:55] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage
[03:27:36] <wikibugs>	 (PS1) Andrew Bogott: Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539)
[03:28:49] <wikibugs>	 (PS2) Andrew Bogott: Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539)
[03:31:25] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[03:36:43] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2007-dev to cloudrabbit2001-dev
[03:37:05] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[03:41:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2007-dev to cloudrabbit2001-dev - andrew@cumin1002"
[03:42:13] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2007-dev to cloudrabbit2001-dev - andrew@cumin1002"
[03:42:13] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:42:14] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2001-dev
[03:42:23] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye
[03:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:30] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2001-dev
[03:43:09] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cloudcontrol2007-dev to cloudrabbit2001-dev
[03:43:29] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[03:43:50] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2008-dev to cloudrabbit2002-dev
[03:43:52] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2009-dev to cloudrabbit2003-dev
[03:44:13] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[03:46:15] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm
[03:47:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:48:39] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2008-dev to cloudrabbit2002-dev - andrew@cumin1002"
[03:49:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2008-dev to cloudrabbit2002-dev - andrew@cumin1002"
[03:49:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:49:04] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2002-dev
[03:49:22] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2002-dev
[03:49:44] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[03:50:00] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cloudcontrol2008-dev to cloudrabbit2002-dev
[03:52:21] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:52:21] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2003-dev
[03:52:42] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2003-dev
[03:53:21] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cloudcontrol2009-dev to cloudrabbit2003-dev
[03:54:37] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm
[03:54:37] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm
[03:57:34] <wikibugs>	 (PS1) Andrew Bogott: Remove refs to cloudcontrol200[789] [puppet] - https://gerrit.wikimedia.org/r/1141568 (https://phabricator.wikimedia.org/T392539)
[03:57:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:58:32] <jinxer-wm>	 RESOLVED: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:59:42] <jinxer-wm>	 FIRING: [22x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:00:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:02:18] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:27] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage
[04:08:52] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage
[04:12:22] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage
[04:13:16] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage
[04:15:43] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage
[04:18:23] <wikibugs>	 (PS1) DDesouza: Design Research Participant Survey: Undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325)
[04:19:01] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage
[04:19:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[04:19:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza)
[04:28:57] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm
[04:31:42] <wikibugs>	 ops-codfw, cloud-services-team, DC-Ops: Update labels on cloudcontrol200[789]-dev.codfw - https://phabricator.wikimedia.org/T393347 (Andrew) NEW
[04:34:56] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm
[04:38:29] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm
[04:58:11] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2008.codfw.wmnet with OS bullseye
[04:58:41] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2008
[05:00:58] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[05:04:17] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[05:05:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2008 - ryankemper@cumin2002"
[05:05:58] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2008 - ryankemper@cumin2002"
[05:05:58] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:05:59] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2008.codfw.wmnet 194.32.192.10.in-addr.arpa 4.9.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[05:06:02] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2008.codfw.wmnet 194.32.192.10.in-addr.arpa 4.9.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[05:06:03] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2008
[05:06:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2008
[05:06:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2008
[05:21:01] <kart_>	 Around 4.5 hours ago, we noticed that traffic to the MinT (machinetranslation) service had dropped, and found that only 3 pods are running per DC. Is that a known outage, or is there work going on?
[05:22:29] <kart_>	 In fact, 2 per DC. 3 workers per pod.
[05:25:51] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage
[05:32:07] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage
[05:39:42] <jinxer-wm>	 RESOLVED: [11x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:49:06] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:49:28] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2008.codfw.wmnet with OS bullseye
[05:50:42] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:53:11] <wikibugs>	 (PS2) Anzx: nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299)
[05:53:14] <wikibugs>	 (PS2) Anzx: nupwiki: add timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711)
[05:53:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: Anzx)
[05:53:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: Anzx)
[05:54:01] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 2270920) is awaiting input
[05:54:06] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:55:18] <jinxer-wm>	 FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[05:55:18] <jinxer-wm>	 FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[05:55:42] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:57:20] <wikibugs>	 (CR) Bunnypranav: "Hey folks, it's my first time uploading a patch to mediawiki-config. Even though it's a simple enough change, do I need to schedule a backpo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav)
[06:02:45] <wikibugs>	 (CR) Anzx: [C:+1] "https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav)
[06:04:06] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:05:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav)
[06:05:42] <jinxer-wm>	 FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:08:48] <Dreamy_Jazz>	 jouncebot: nowandnext
[06:08:48] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250504T0700)
[06:08:48] <jouncebot>	 In 0 hour(s) and 51 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T0700)
[06:16:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet
[06:17:35] <moritzm>	 FYI, aux-k8s-etcd1003, dse-k8s-etcd1001 and kubestagemaster1005 will briefly go down for a Ganeti reboot
[06:17:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[06:19:36] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:19:40] <icinga-wm>	 PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[06:20:16] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:21:21] <wikibugs>	 (PS2) Anzx: ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803)
[06:21:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: Anzx)
[06:22:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[06:23:05] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1140795 (owner: JHathaway)
[06:23:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet
[06:24:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:25:28] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING WARNING - Packet loss = 90%, RTA = 6.15 ms
[06:25:42] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[06:25:56] <icinga-wm>	 RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[06:26:34] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2014.codfw.wmnet with OS bullseye
[06:27:03] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2014
[06:27:29] <wikibugs>	 (PS1) Muehlenhoff: Install linux-sysctl-defaults on trixie [puppet] - https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083)
[06:29:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:30:06] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 2270920) is awaiting input
[06:30:11] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[06:35:49] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 2270920) is awaiting input
[06:37:22] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2014 - ryankemper@cumin2002"
[06:37:27] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2014 - ryankemper@cumin2002"
[06:37:28] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:37:28] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2014.codfw.wmnet 192.16.192.10.in-addr.arpa 2.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[06:37:32] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2014.codfw.wmnet 192.16.192.10.in-addr.arpa 2.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[06:37:32] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2014
[06:39:36] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2014
[06:39:36] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2014
[06:41:45] <wikibugs>	 06SRE, 06serviceops, 06Traffic-Icebox, 06Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#10790816 (10kostajh) 05Open→03Declined
[06:42:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Install linux-sysctl-defaults on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff)
[06:44:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] icinga: frack: adjust fran* groupings and add host [puppet] - 10https://gerrit.wikimedia.org/r/1140775 (https://phabricator.wikimedia.org/T386259) (owner: 10Dwisehaupt)
[06:45:31] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gerrit: enable bacula backups on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn)
[06:57:06] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage
[06:58:38] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141697
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T0700).
[07:00:05] <jouncebot>	 abijeet, anzx, and bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:08] <anzx>	 o/
[07:01:41] <kart_>	 I can deploy abijeet's patch..
[07:02:09] <abijeet>	 kart_, thanks
[07:02:23] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage
[07:03:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) (owner: 10Abijeet Patro)
[07:09:01] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141697 (owner: 10PipelineBot)
[07:11:14] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:11:39] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:13:14] <wikibugs>	 (03PS1) 10Gergő Tisza: CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700
[07:13:35] <wikibugs>	 (03Merged) 10jenkins-bot: Mobile frequent languages entrypoint: Add dependency to sitemapper [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) (owner: 10Abijeet Patro)
[07:14:17] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]]
[07:14:21] <stashbot>	 T393144: TypeError: undefined is not an object (evaluating 'new mw.cx.SiteMapper') / TypeError: Cannot read properties of undefined (reading 'SiteMapper') / TypeError: mw.cx is undefined - https://phabricator.wikimedia.org/T393144
[07:14:22] <stashbot>	 T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223
[07:14:36] <wikibugs>	 (03PS2) 10Gergő Tisza: CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700
[07:15:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:15:27] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:18:15] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "this can't be merged till a new version of the debian package gets deployed, currently haproxykafka package deploys the systemd unit on /u" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[07:19:15] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:46] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2014.codfw.wmnet with OS bullseye
[07:20:32] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2015.codfw.wmnet with OS bullseye
[07:21:00] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2015
[07:21:10] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[07:21:48] <kart_>	 abijeet: you can test the patch
[07:22:48] <abijeet>	 kart_, ok, scap took a while
[07:23:03] <kart_>	 yeah
[07:24:43] <bunnypranav>	 Hi
[07:24:57] <abijeet>	 kart_, looks god
[07:24:58] <abijeet>	 kart_, looks good
[07:25:03] <kart_>	 cool
[07:25:09] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:25:34] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 (owner: 10Gergő Tisza)
[07:25:44] <bunnypranav>	 kart_, can you do mine as well?
[07:26:54] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 2360871) is awaiting input
[07:27:19] <kart_>	 bunnypranav: sadly, I have to go to meetings after abijeet's deployment is done :/
[07:27:45] <bunnypranav>	 Ok, fine.
[07:30:09] <wikibugs>	 (03PS7) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437
[07:31:45] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]] (duration: 17m 27s)
[07:31:48] <Dreamy_Jazz>	 I can deploy bunnypranav's change
[07:31:49] <stashbot>	 T393144: TypeError: undefined is not an object (evaluating 'new mw.cx.SiteMapper') / TypeError: Cannot read properties of undefined (reading 'SiteMapper') / TypeError: mw.cx is undefined - https://phabricator.wikimedia.org/T393144
[07:31:49] <stashbot>	 T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223
[07:32:31] <wikibugs>	 (03CR) 10Ayounsi: "PS 6..7 :" [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[07:33:05] <anzx>	 Dreamy_Jazz: could you deploy my changes as well, or should I move mine to the next window?
[07:33:23] <Dreamy_Jazz>	 Let me take a look
[07:33:58] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Add checkuserwiki favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: 10Bunnypranav)
[07:34:45] <wikibugs>	 (03Merged) 10jenkins-bot: Add checkuserwiki favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: 10Bunnypranav)
[07:35:33] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] nupwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: 10Anzx)
[07:36:29] <Dreamy_Jazz>	 Yeah. I should be able to deploy your changes.
[07:36:30] <wikibugs>	 (03Merged) 10jenkins-bot: nupwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: 10Anzx)
[07:37:09] <anzx>	 Dreamy_Jazz: ty
[07:38:03] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: 10Anzx)
[07:38:35] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: 10Anzx)
[07:39:21] <wikibugs>	 (03Merged) 10jenkins-bot: nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: 10Anzx)
[07:39:38] <wikibugs>	 (03Merged) 10jenkins-bot: ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: 10Anzx)
[07:40:06] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]]
[07:40:14] <stashbot>	 T393299: Convert reference lists over to `responsive` on nnwiki - https://phabricator.wikimedia.org/T393299
[07:40:15] <stashbot>	 T392803: VE in namespace in ruWikibooks - https://phabricator.wikimedia.org/T392803
[07:40:15] <stashbot>	 T393246: Change favicon on the CheckUser wiki - https://phabricator.wikimedia.org/T393246
[07:40:15] <stashbot>	 T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711
[07:41:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[07:43:11] <icinga-wm>	 RECOVERY - SSH on prometheus1005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:44:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:44:43] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, bunnypranav, anzx: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:44:53] <anzx>	 Dreamy_Jazz: checking
[07:44:56] <Dreamy_Jazz>	 Thanks!
[07:45:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:46:11] <Dreamy_Jazz>	 I'll check the checkuserwiki change as I don't think bunnypranav has access to that wiki.
[07:47:17] <Dreamy_Jazz>	 checkuser.wikimedia.org favicon appears to work
[07:47:19] <anzx>	 Dreamy_Jazz: all looks good, checkuserwiki as well
[07:47:28] <Dreamy_Jazz>	 Thanks
[07:47:31] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, bunnypranav, anzx: Continuing with sync
[07:49:14] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:49:31] <icinga-wm>	 RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:49:59] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:50:08] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:50:30] <jinxer-wm>	 RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[07:50:34] <jinxer-wm>	 RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[07:50:46] <jinxer-wm>	 FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:51:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:53:43] <wikibugs>	 (03CR) 10Fabfur: "yeah, that's the plan, I also modified the task to be more clear in the needed steps" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[07:53:57] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gerrit: enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn)
[07:54:06] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:54:18] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]] (duration: 14m 11s)
[07:54:22] <anzx>	 Dreamy_Jazz: thanks again for deploying 
[07:54:24] <stashbot>	 T393299: Convert reference lists over to `responsive` on nnwiki - https://phabricator.wikimedia.org/T393299
[07:54:25] <stashbot>	 T392803: VE in namespace in ruWikibooks - https://phabricator.wikimedia.org/T392803
[07:54:25] <stashbot>	 T393246: Change favicon on the CheckUser wiki - https://phabricator.wikimedia.org/T393246
[07:54:25] <stashbot>	 T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711
[07:54:32] <Dreamy_Jazz>	 Np
[07:54:46] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:54:50] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:54:59] <Dreamy_Jazz>	 !log UTC morning backport window finished
[07:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:01] <icinga-wm>	 PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4)
[07:55:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:55:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:55:46] <jinxer-wm>	 FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:56:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:58:49] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[07:59:06] <jinxer-wm>	 RESOLVED: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:59:26] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[07:59:34] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[07:59:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:59:46] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[08:00:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2015 - ryankemper@cumin2002"
[08:00:03] <icinga-wm>	 RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.00 ms
[08:00:04] <jinxer-wm>	 FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:08] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2015 - ryankemper@cumin2002"
[08:00:08] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:00:08] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2015.codfw.wmnet 209.48.192.10.in-addr.arpa 9.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:00:12] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2015.codfw.wmnet 209.48.192.10.in-addr.arpa 9.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:00:13] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015
[08:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:30] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015
[08:00:30] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2015
[08:00:42] <jinxer-wm>	 RESOLVED: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:01:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:02:17] <icinga-wm>	 RECOVERY - SSH on prometheus1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:02:41] <tappof>	 !log rebooting prometheus1005 prometheus1006 and prometheus2006
[08:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:04:50] <jinxer-wm>	 FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:05:06] <jinxer-wm>	 FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:05:12] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[08:05:49] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[08:06:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:09:31] <wikibugs>	 (03PS5) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[08:09:40] <wikibugs>	 (03CR) 10Fabfur: haproxykafka: service unit brought by deb package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[08:09:50] <jinxer-wm>	 FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:10:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:11:12] <wikibugs>	 (03PS2) 10Msz2001: [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353)
[08:11:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:04-1] "I tested this in Pontoon and it doesn't seem to work (reply comes from opensearch not apache)" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[08:11:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[08:11:45] <elukey>	 !log powercycle prometheus1008 - no ssh, mgmt console showing cpu soft lockup continuously
[08:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] <elukey>	 !log powercycle prometheus2005 - no ssh, mgmt console showing systemd units being deactivated, no root login
[08:15:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:16:21] <icinga-wm>	 RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:16:59] <godog>	 thank you elukey tappof 
[08:17:12] <tappof>	 !log powercycle prometheus2008 - no ssh, mgmt console showing systemd units being deactivated, no root login
[08:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:44] <elukey>	 ciao godog, buongiorno
[08:17:55] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage
[08:18:03] <godog>	 buongiorno to you too
[08:18:19] <icinga-wm>	 RECOVERY - SSH on prometheus2005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:15] <icinga-wm>	 PROBLEM - Host prometheus2008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:50] <jinxer-wm>	 FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:21:19] <icinga-wm>	 RECOVERY - SSH on prometheus2008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:21:21] <icinga-wm>	 RECOVERY - Host prometheus2008 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms
[08:21:45] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage
[08:24:50] <jinxer-wm>	 FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:25:06] <jinxer-wm>	 FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:25:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:29:50] <jinxer-wm>	 FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:30:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:32:02] <godog>	 !log powercycle centrallog1002 - cannot log in via ssh or console
[08:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:22] <tappof>	 !log rebooting prometheus2007 - no ssh, com2 via racadm hangs
[08:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:21] <wikibugs>	 (03CR) 10Hashar: "We had a thread on Slack with QTE about having a RTL wiki.  That `en_rtl` filled that niche at the time https://wikimedia.slack.com/archiv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester)
[08:33:29] <icinga-wm>	 PROBLEM - Host centrallog1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:34:17] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:37] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:50] <jinxer-wm>	 FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:35:35] <icinga-wm>	 RECOVERY - SSH on centrallog1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:37] <icinga-wm>	 RECOVERY - Host centrallog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[08:35:54] <wikibugs>	 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10791244 (10Nikerabbit) 05Stalled→03In progress
[08:36:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:36:19] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:37:17] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:37:19] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3790) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:37:37] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:38:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:50] <jinxer-wm>	 FIRING: [26x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:40:07] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2015.codfw.wmnet with OS bullseye
[08:40:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:41:01] <wikibugs>	 (03PS2) 10Hashar: python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442
[08:41:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:42:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:44:50] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:45:10] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): Improve function and property documentation for php code (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[08:46:54] <wikibugs>	 (03CR) 10Hashar: "@jhathaway@wikimedia.org may you puppet-merge this one for me please? I don't have +2 or access to the Puppet servers." [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar)
[08:47:35] <icinga-wm>	 PROBLEM - SSH on prometheus2007 is CRITICAL: connect to address 10.192.9.11 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:47:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:40] <wikibugs>	 (03PS11) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[08:49:50] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:46] <bunnypranav>	 Thanks a lot, Dreamy_Jazz, for the deploy. Sorry for my absence; I went offline as kart said they were busy.
[08:51:25] <bunnypranav>	 Btw, I can see the main page of checkuserwiki, so the favicon is visible to public.
[08:51:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[08:52:31] <wikibugs>	 06SRE: soft lockup on prometheus and centrallog hosts with the new kernel - https://phabricator.wikimedia.org/T393357 (10fgiunchedi) 03NEW
[08:53:31] <anzx>	 bunnypranav: generally during deployment we use https://wikitech.wikimedia.org/wiki/WikimediaDebug to check changes on the test servers before they go live
[08:54:50] <jinxer-wm>	 FIRING: [23x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:55:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10791322 (10Stevemunene) 05Open→03Resolved Hosts look ok after 2 days, I think it is safe to close this and move...
[08:55:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:56:05] <godog>	 !log powercycle centrallog2002 - cannot log in via ssh or console
[08:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:09] <icinga-wm>	 PROBLEM - Host prometheus2007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:47] <icinga-wm>	 PROBLEM - Host centrallog2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:15] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:58:15] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:58:49] <wikibugs>	 (03PS12) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[08:59:19] <icinga-wm>	 RECOVERY - SSH on centrallog2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:21] <icinga-wm>	 RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[08:59:37] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5447/co" [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[08:59:43] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:59:50] <wikibugs>	 (03CR) 10Bunnypranav: "@marcinszwarc@hotmail.com You have set the perm to false, which basically restricts them from seeing the private filters. Is that what you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[08:59:50] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:00:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[09:01:13] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on centrallog2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:01:43] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is OK: OK: UP (pid=3794) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[09:02:13] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on centrallog2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:02:15] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:02:15] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:02:31] <wikibugs>	 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791363 (10fgiunchedi)
[09:03:10] <godog>	 !log powercycle vrts1003 + vrts2002 - soft lockup T393357
[09:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:12] <stashbot>	 T393357: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357
[09:03:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:04:33] <icinga-wm>	 PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:04:35] <icinga-wm>	 RECOVERY - SSH on prometheus2007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:04:37] <icinga-wm>	 RECOVERY - Host prometheus2007 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[09:04:41] <icinga-wm>	 PROBLEM - Host vrts1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:04:51] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:05:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:06:20] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service vrts1003:1443 has failed probes (http_ticket_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:07:05] <icinga-wm>	 RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms
[09:07:11] <icinga-wm>	 RECOVERY - Host vrts1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[09:07:19] <icinga-wm>	 RECOVERY - SSH on vrts1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:07:19] <icinga-wm>	 RECOVERY - SSH on vrts2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:07:53] <wikibugs>	 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791398 (10fgiunchedi) Another correlation (maybe causation) is the fact that all hosts locking up so far have mdadm raid10
[09:08:13] <icinga-wm>	 PROBLEM - freshclam running on vrts1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[09:09:13] <icinga-wm>	 RECOVERY - freshclam running on vrts1003 is OK: PROCS OK: 1 process with UID = 110 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[09:11:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:11:20] <jinxer-wm>	 RESOLVED: [5x] ProbeDown: Service vrts1003:1443 has failed probes (http_ticket_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:12:17] <Dreamy_Jazz>	 bunnypranav: No problem on being away. Thanks for the patch.
[09:12:32] <Dreamy_Jazz>	 jouncebot: nowandnext
[09:12:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[09:12:32] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000)
[09:12:49] <Dreamy_Jazz>	 Going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1141844/2
[09:13:00] <wikibugs>	 (03PS13) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[09:14:28] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[09:15:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[09:15:15] <wikibugs>	 (03Merged) 10jenkins-bot: [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[09:16:56] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] "From what I can see, this change now removes the `false` definition. The right is given to `sysop` group by default, so this change should" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[09:17:14] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]]
[09:17:17] <stashbot>	 T393353: Add (abusefilter-view-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T393353
[09:18:20] <wikibugs>	 (03CR) 10Bunnypranav: "Oh, I mistook the removal for an added line, my bad." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001)
[09:19:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10791431 (10elukey) My 2c: before starting we should decide what controller we want to use, because in T391854 it seems that we may be oriented in buying...
[09:21:37] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, msz2001: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:23:35] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, msz2001: Continuing with sync
[09:24:39] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10791445 (10elukey) @MatthewVernon I think that the new controller costs the same as the old one, so the config-J price shouldn't chang...
[09:26:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:30:19] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]] (duration: 13m 04s)
[09:30:21] <stashbot>	 T393353: Add (abusefilter-view-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T393353
[09:31:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:35:52] <wikibugs>	 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791471 (10MoritzMuehlenhoff) RAID 10 is a good lead! It seems the same was already reported in Debian a few days ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460
[09:36:19] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140140 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[09:38:26] <elukey>	 !log depool inference/codfw from DNS discovery to safely apply new pod/container security settings - T369493
[09:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:29] <stashbot>	 T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493
[09:39:01] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:39:22] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:41:10] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:51:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5448/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[09:55:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000)
[10:05:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:06:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366 (10MoritzMuehlenhoff) 03NEW
[10:09:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10791559 (10phaultfinder)
[10:11:20] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791564 (10MoritzMuehlenhoff)
[10:15:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[10:15:54] <jem>	 Hi, can anyone help in deploying asap a fix for the main menu text in eswiki? Or in "forcing" a local text to override the mistake?
[10:16:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[10:17:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet
[10:18:01] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791574 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel
[10:20:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Please change the commit message, specifically we are temporarily disabling dashboard_sync for the grafana upgrade and not add the feature" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[10:24:14] <tappof>	 !log rebooting prometheus1007 into linux-image-6.1.0-33-amd64
[10:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:37] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T393368 (10phaultfinder) 03NEW
[10:24:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:05] <icinga-wm>	 PROBLEM - SSH on prometheus1007 is CRITICAL: connect to address 10.64.48.171 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:26:36] <jynus>	 jem: https://es.wikipedia.org/w/index.php?title=MediaWiki:Vector-opt-out&action=edit
[10:27:45] <jynus>	 jem: it was fixed already on translatewiki, so it should be fixed soon: https://translatewiki.net/w/i.php?title=MediaWiki:Vector-opt-out/es&diff=next&oldid=13050998
[10:29:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:50] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:46] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791622 (10MoritzMuehlenhoff)
[10:32:10] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet
[10:32:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet
[10:32:38] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791624 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel
[10:34:27] <icinga-wm>	 PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100%
[10:34:37] <jem>	 jynus: thanks and yes, I had checked in translatewiki.net
[10:35:34] <jem>	 Usually I would just wait, but this is visible on every article and I thought it would be worth a quicker fix
[10:35:51] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw
[10:36:45] <jem>	 I have created the local message for eswiki, it seems a purge is needed... but I don't know where (not in MediaWiki:Sidebar, it seems)
[10:37:14] <jynus>	 yeah, that won't work, as it would need a purge to every cached page
[10:37:19] <jem>	 Ugh
[10:37:23] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370 (10LSobanski) 03NEW
[10:37:25] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370#10791636 (10LSobanski)
[10:37:48] <jynus>	 jem: that's why I believe strings are only updated every so often
[10:37:53] <RhinosF1>	 I think sidebar cache is turned on
[10:38:17] <RhinosF1>	 But then also cache for logged out users would still be polluted even with that purged
[10:38:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:40:05] <icinga-wm>	 RECOVERY - SSH on prometheus1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:40:07] <icinga-wm>	 RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[10:40:42] <jynus>	 jem: I don't know enough about the deployment process, but if you are around during the train update someone more knowledgeable may be able to answer you
[10:41:11] <jem>	 Thanks, jynus... in this channel, I guess
[10:41:39] <jynus>	 yes, check the schedule at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000
[10:42:03] <RhinosF1>	 Could try pinging releng
[10:42:19] <RhinosF1>	 But not sure how much they'd know about the specifics of the sidebar
[10:42:55] <RhinosF1>	 hashar, jeena: any ideas ^
[10:44:51] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:52] <wikibugs>	 (03PS1) 10Elukey: admin_ng: disable PSP mutations for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141858 (https://phabricator.wikimedia.org/T369493)
[10:46:30] <jem>	 Thanks... I'll be checking from time to time
[10:49:35] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791662 (10MoritzMuehlenhoff)
[10:49:53] <icinga-wm>	 PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100%
[10:51:17] <wikibugs>	 06SRE, 07SRE-Unowned, 07SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#10791665 (10BZPN2) Maybe it's worth trying to index Wikinews through the Google Publisher Center panel, maybe that speeds up the process somehow? Also, for the site to be indexed in Google Ne...
[10:51:26] <wikibugs>	 06SRE, 07SRE-Unowned, 06WMF-Legal, 07SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437#10791667 (10BZPN2) Maybe it's worth trying to index Wikinews through the Google Publisher Center panel, maybe that speeds up the pr...
[10:52:07] <icinga-wm>	 RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:52:29] <icinga-wm>	 PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:53:35] <wikibugs>	 (03PS14) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[10:54:09] <icinga-wm>	 RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[10:54:48] <jem>	 Anyway: I have created the right local message (without the /es, my mistake) and now it is fixed for me in all pages
[10:55:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[10:55:50] <jem>	 ... and it seems the main menu isn't shown to logged out users (!)
[10:56:42] <jynus>	 jem: it is, it just defaults to the hamburger menu on top
[10:57:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet
[10:57:51] <jem>	 Ah, yes
[10:57:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet
[10:58:16] <jynus>	 jem: https://i.imgur.com/DsSUrQp.png
[10:58:16] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791671 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel
[10:58:17] <jem>	 I'm really trying to get used to the new Vector, but...
[10:58:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:58:48] <jem>	 Anyway, the "Switch to old version" text doesn't appear there
[11:01:14] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10791672 (10Silvan_WMDE) True, the deployment had not actually happened when [[ https:...
[11:04:34] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet
[11:05:01] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370#10791680 (10Volans) 05Open→03Resolved a:03Volans There was no last run reported on the script page, re-run it manually and that r...
[11:05:13] <logmsgbot>	 !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on backup[1010-1014].eqiad.wmnet with reason: Upgrade and restart
[11:05:25] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791683 (10MoritzMuehlenhoff)
[11:05:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet
[11:05:54] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791684 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel
[11:09:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10791692 (10MoritzMuehlenhoff) >>! In T393146#10791431, @elukey wrote: > My 2c: before starting we should decide what controller we want to use, because i...
[11:11:44] <wikibugs>	 (03PS15) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[11:12:05] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791698 (10Jelto)
[11:12:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet
[11:13:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[11:14:21] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791702 (10MoritzMuehlenhoff)
[11:15:17] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791703 (10MoritzMuehlenhoff)
[11:15:18] <wikibugs>	 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791704 (10MoritzMuehlenhoff)
[11:17:26] <wikibugs>	 (03PS16) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071
[11:21:04] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10791709 (10Volans) Have you considered just downtiming the affected...
[11:26:05] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede)
[11:26:37] <wikibugs>	 (03PS1) 10Abijeet Patro: Disable Special:ContentTranslationStats page [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839)
[11:26:53] <wikibugs>	 (03PS1) 10Abijeet Patro: Disable APIs used in Special:ContentTranslationStats [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839)
[11:27:12] <wikibugs>	 (03PS1) 10Abijeet Patro: Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839)
[11:27:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[11:28:25] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[11:28:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[11:31:11] <wikibugs>	 (03PS1) 10Hoo man: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532)
[11:31:45] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey)
[11:34:20] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791722 (10MoritzMuehlenhoff) p:05Triage→03High
[11:34:27] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[11:38:33] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
[11:39:20] <wikibugs>	 (03PS1) 10Slyngshede: P:ldap::client::ldaptui move files [puppet] - 10https://gerrit.wikimedia.org/r/1141871
[11:40:19] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5449/console" [puppet] - 10https://gerrit.wikimedia.org/r/1141871 (owner: 10Slyngshede)
[11:41:54] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ldap::client::ldaptui move files [puppet] - 10https://gerrit.wikimedia.org/r/1141871 (owner: 10Slyngshede)
[11:43:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[11:44:31] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[11:44:43] <wikibugs>	 (03Abandoned) 10Cyndywikime: Regenerate speed-test snapshot without GENewcomerTasksGuidanceEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138350 (https://phabricator.wikimedia.org/T379568) (owner: 10Cyndywikime)
[11:44:50] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:54] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
[11:46:34] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[11:46:35] <logmsgbot>	 !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host prometheus2006.codfw.wmnet
[11:49:04] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[11:49:23] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet
[11:49:37] <wikibugs>	 (03PS1) 10Slyngshede: P:ldap::client::ldaptui correct paths [puppet] - 10https://gerrit.wikimedia.org/r/1141875
[11:49:46] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[11:49:46] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:49:50] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:50:43] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5450/co" [puppet] - 10https://gerrit.wikimedia.org/r/1141875 (owner: 10Slyngshede)
[11:51:58] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ldap::client::ldaptui correct paths [puppet] - 10https://gerrit.wikimedia.org/r/1141875 (owner: 10Slyngshede)
[11:52:21] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:52:21] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:53:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[11:55:21] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:55:21] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:55:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:55:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:56:00] <wikibugs>	 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791747 (10fgiunchedi) 05Open→03Invalid I'm resolving this in favor of  T393366
[11:56:16] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[11:56:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:04] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet
[11:58:56] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet
[11:59:41] <logmsgbot>	 !log aqu@deploy1003 Started deploy [analytics/refinery@dbfa557] (hadoop-test): Deploying new refinery/source artifacts TEST [analytics/refinery@dbfa557d]
[11:59:50] <jinxer-wm>	 FIRING: [23x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:00:03] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2637159) is awaiting input
[12:00:05] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791759 (10fgiunchedi)
[12:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:34] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [analytics/refinery@dbfa557] (hadoop-test): Deploying new refinery/source artifacts TEST [analytics/refinery@dbfa557d] (duration: 00m 53s)
[12:01:15] <logmsgbot>	 !log aqu@deploy1003 Started deploy [analytics/refinery@dbfa557]: Deploying new refinery/source artifacts [analytics/refinery@dbfa557d]
[12:22:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[12:27:22] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2666612) is awaiting input
[12:27:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[12:28:07] <tappof>	 !log Rolling reboot of Prometheus nodes in eqiad (1005, 1006, 1008) to rollback the kernel
[12:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:45] <icinga-wm>	 PROBLEM - Host prometheus1005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:32:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Install linux-sysctl-defaults on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff)
[12:32:04] <wikibugs>	 (03PS1) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141893 (https://phabricator.wikimedia.org/T392539)
[12:33:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141893 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott)
[12:33:19] <icinga-wm>	 RECOVERY - Host prometheus1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[12:34:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[12:34:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet
[12:34:40] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Make cloudrabbit200[123] into rabbitmq nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1141894
[12:34:50] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:35:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Revert "Make cloudrabbit200[123] into rabbitmq nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1141894 (owner: 10Andrew Bogott)
[12:39:02] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[12:39:37] <icinga-wm>	 PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:39:50] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:40:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10791808 (10elukey) Maybe I got the wrong PCI via lspci, but I see:  ` elukey@ms-be1091:~$ lspci -nn | grep -i sas 98:00.0 Serial Attached SCSI controller [0...
[12:41:11] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[12:41:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:59] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans)
[12:42:11] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[12:42:19] <icinga-wm>	 RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[12:42:28] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[12:43:56] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[12:44:31] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans)
[12:44:50] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:44:53] <icinga-wm>	 PROBLEM - Host prometheus1008 is DOWN: PING CRITICAL - Packet loss = 100%
[12:45:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:23] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[12:47:48] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[12:48:07] <wikibugs>	 (03Abandoned) 10Hashar: CI: diff against parent commit instead of remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar)
[12:48:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10791818 (10Gehel)
[12:48:21] <icinga-wm>	 RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[12:48:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10791832 (10Gehel)
[12:49:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:49:25] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10791850 (10Gehel)
[12:49:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:49:50] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service prometheus1008:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:45] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10791918 (10Gehel)
[12:52:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10791933 (10Gehel)
[12:52:23] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10791927 (10Gehel)
[12:52:29] <wikibugs>	 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10791931 (10Gehel)
[12:52:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10791937 (10Gehel)
[12:52:45] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[12:52:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10791935 (10Gehel)
[12:52:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10791939 (10Gehel)
[12:52:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10791941 (10Gehel)
[12:53:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10791943 (10Gehel)
[12:53:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10791945 (10Gehel)
[12:54:03] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10791951 (10Gehel)
[12:55:04] <wikibugs>	 (03PS2) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577)
[12:55:08] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] [analytics] Refine deterministic transform deduplication [puppet] - 10https://gerrit.wikimedia.org/r/1141884 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[12:55:52] <wikibugs>	 (03PS1) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539)
[12:56:04] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[12:57:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:57:44] <wikibugs>	 (03PS2) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539)
[12:57:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Stop passing krb2002 to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[12:57:52] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott)
[12:57:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1193 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:59:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:59:30] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[13:00:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1300).
[13:00:05] <jouncebot>	 abijeet, tchin, and Cyndywikime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <abijeet>	 o/
[13:00:21] <tchin>	 o/
[13:00:37] <kart_>	 here. I'll deploy abijeet's patches
[13:01:56] <Cyndywikime>	 o/
[13:02:49] <kart_>	 abijeet Not sure why IRC is not showing autocompletion of your nick ;)
[13:03:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:03:40] <abijeet>	 hi kart_, thanks
[13:03:55] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2706496) is awaiting input
[13:04:11] <tappof>	 !log rebooting centrallog1002 to rollback the kernel
[13:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:04:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:05:15] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10792018 (10tappof)
[13:05:27] <icinga-wm>	 PROBLEM - Host centrallog1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:05:54] <wikibugs>	 (03Merged) 10jenkins-bot: Disable Special:ContentTranslationStats page [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:06:12] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]]
[13:06:12] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:06:17] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[13:06:17] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:06:17] <stashbot>	 T325790: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790
[13:06:29] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[13:06:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[13:06:37] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:07:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Initial Puppet agent apt config for Puppet 7 in Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140659 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff)
[13:07:53] <icinga-wm>	 RECOVERY - Host centrallog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[13:07:58] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:08:06] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[13:08:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:08:29] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:08:31] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[13:08:59] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:09:00] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[13:09:19] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:09:27] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on centrallog1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:09:39] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: disable PSP mutations for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141858 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[13:09:40] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:09:41] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[13:09:50] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:57] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[13:10:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1193 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:10:55] <logmsgbot>	 !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:11:09] <kart_>	 abijeet, Please test
[13:11:19] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3993) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:11:19] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:11:20] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:11:24] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:11:27] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on centrallog1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:11:33] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans)
[13:11:39] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:11:46] <abijeet>	 kart_, on it
[13:12:15] <kart_>	 Special:CX seems disabled.
[13:12:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[13:12:47] <kart_>	 We will get some 404s with the first patch, but that seems fine?
[13:12:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[13:12:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10792050 (10Jclark-ctr) a:03VRiley-WMF
[13:13:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:13:40] <wikibugs>	 (03CR) 10TChin: [C:03+1] Stream config for edge uniques on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[13:14:25] <wikibugs>	 (03CR) 10TChin: [C:03+1] "Ah! Can I just +2 this then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[13:14:41] <abijeet>	 kart_, looks good
[13:14:46] <wikibugs>	 (03CR) 10Volans: "@sukhe: the hiddenparma cookbook is currently unowned awaiting for an official owner, see T383809. If Traffic wants to own it that would b" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:14:47] <kart_>	 cool.
[13:14:50] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:14:51] <logmsgbot>	 !log kartik@deploy1003 kartik, abi: Continuing with sync
[13:14:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: 10Novem Linguae)
[13:16:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:17:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:18:39] <wikibugs>	 (03PS6) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146)
[13:18:55] <fabfur>	 !log depooling cp7001 to test new haproxykafka version (T393016)
[13:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:58] <stashbot>	 T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016
[13:19:15] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans)
[13:19:23] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[13:20:06] <fabfur>	 !log disabled puppet on cp7001 to test haproxykafka version (T393016)
[13:20:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10792119 (10Jclark-ctr) This Server is out of warranty Please advise if you would like me to and if i am able to Swap with drive from Decom server
[13:21:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10792120 (10Jclark-ctr) a:03Jclark-ctr
[13:21:42] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]] (duration: 15m 29s)
[13:21:46] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[13:21:46] <stashbot>	 T325790: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790
[13:22:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:23:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:23:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10792126 (10Jclark-ctr) 05Open→03Resolved ` Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] md2 : active raid10 sdg2[4] sdd2[0] sdh2[3] sdf2[1]       3701655552 blo...
[13:23:10] <kart_>	 abijeet, on 2nd patch now
[13:24:55] <wikibugs>	 (03PS2) 10Volans: Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836
[13:25:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:27:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:27:32] <wikibugs>	 (03CR) 10AOkoth: "That's the ID of the junk queue. If you run `SELECT * FROM queue WHERE id = 3` on the database you'll see it." [puppet] - 10https://gerrit.wikimedia.org/r/1140207 (https://phabricator.wikimedia.org/T389079) (owner: 10AOkoth)
[13:27:46] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:27:56] <abijeet>	 kart_, ok
[13:28:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudcontrol200[789] [puppet] - 10https://gerrit.wikimedia.org/r/1141568 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott)
[13:29:20] <wikibugs>	 (03PS9) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759)
[13:29:21] <wikibugs>	 (03CR) 10Volans: "@fceratto@wikimedia.org are you ok too with the change? Just making sure to not step on other refactors that might be happening." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans)
[13:29:28] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[13:29:59] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:32:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:32:45] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10792170 (10Jclark-ctr) @matthewvernon thanos-fe100[1-3] are R440's but no the XD2  servers use a 730mini raid...
[13:33:14] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10792184 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All RAID10 servers which were upgraded to 6.1.135, are...
[13:33:17] <wikibugs>	 (03PS3) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[13:33:21] <wikibugs>	 (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[13:33:26] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:33:27] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:34:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[13:34:15] <wikibugs>	 (03Merged) 10jenkins-bot: Disable APIs used in Special:ContentTranslationStats [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:34:32] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]]
[13:34:35] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[13:34:48] <wikibugs>	 (03CR) 10Ssingh: "Thanks, the changes look good but I also wanted to keep the options around in case --help was passed. But I am guessing the expectation is" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:36:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:36:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:39:02] <logmsgbot>	 !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:39:13] <kart_>	 abijeet, Please test 2nd patch. I'll also +2 the 3rd patch to save time.
[13:39:42] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2740892) is awaiting input
[13:39:55] <abijeet>	 kart_, yes, please
[13:40:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[13:40:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[13:40:13] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:40:51] <abijeet>	 kart_, looks ok
[13:41:09] <kart_>	 cool
[13:41:12] <logmsgbot>	 !log kartik@deploy1003 kartik, abi: Continuing with sync
[13:41:13] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:41:14] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:42:09] <wikibugs>	 (03CR) 10Volans: "They options are added to the parser automatically, the output of `--help` will not change much (at most order and wording) and in additio" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:42:26] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:42:28] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=1) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:43:07] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Ah, my bad then. And yes, this is even better!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:43:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:43:22] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad
[13:43:22] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=1) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:43:35] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:43:38] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[13:44:59] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#10792278 (10SLyngshede-WMF) 05Open→03Invalid
[13:46:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[13:46:10] <wikibugs>	 (03CR) 10Volans: [C:03+2] Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:46:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[13:47:41] <wikibugs>	 (03CR) 10Volans: "Kind ping to the data platform team seeking a review" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans)
[13:47:56] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]] (duration: 13m 23s)
[13:47:59] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[13:48:10] <kart_>	 abijeet, on 3rd patch now..
[13:48:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:48:36] <abijeet>	 kart_, ok
[13:49:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:49:35] <Cyndywikime>	 ok
[13:51:19] <wikibugs>	 (03CR) 10Elukey: [C:03+2] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey)
[13:51:37] <kart_>	 tchin: Your patch can be +2 directly, AFAIK, but let's follow deployment protocol :)
[13:52:12] <Cyndywikime>	 :)
[13:52:16] * tchin sounds good to me
[13:52:20] <wikibugs>	 (03Merged) 10jenkins-bot: Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro)
[13:52:38] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141868|Remove links to Special:ContentTranslationStats from dashboards (T392839)]]
[13:53:32] <wikibugs>	 (03Merged) 10jenkins-bot: Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[13:54:47] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "No need to make things smaller. I will just move these backups to a dedicated server to handle the extra data. I think this is ok, just ne" [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn)
[13:56:13] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Outside of a deployment window, yes (else the person currently deploying will have an extra patch showing up and that might raise a warnin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[13:57:03] <wikibugs>	 (03CR) 10Volans: [C:03+2] Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans)
[13:57:12] <kart_>	 I think the 3rd patch will be slow, due to l10n changes - abijeet
[13:58:27] <abijeet>	 kart_, uh.
[14:00:48] <wikibugs>	 (03PS1) 10Bking: cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610)
[14:02:09] <wikibugs>	 (03Merged) 10jenkins-bot: k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey)
[14:02:20] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10792400 (10jcrespo) Thank you, @Marostegui for taking care about this.
[14:04:51] <wikibugs>	 (03Merged) 10jenkins-bot: Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans)
[14:07:40] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:08:31] <abijeet>	 hmm, that's a lot of time
[14:09:44] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792415 (10ssingh) >>! In T392848#10791709, @Volans wrote: > Have y...
[14:10:53] <logmsgbot>	 !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141868|Remove links to Special:ContentTranslationStats from dashboards (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:10:55] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[14:11:22] <wikibugs>	 (03PS6) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[14:11:24] <kart_>	 abijeet, Please test :)
[14:12:32] <abijeet>	 kart_, checking
[14:13:58] <wikibugs>	 (03PS1) 10Jdlrobson: Nearby should show file namespace on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133)
[14:14:01] <abijeet>	 kart_, we'll have to rollback this one. This needs a CX build.
[14:14:12] <kart_>	 ah
[14:14:32] <logmsgbot>	 !log kartik@deploy1003 Sync cancelled.
[14:14:33] <abijeet>	 kart_, apologies.
[14:14:46] <kart_>	 I'll do the revert and deploy. No worries.
[14:14:58] <wikibugs>	 (03PS1) 10Kamila Součková: benthos/mw-accesslog-metrics: set start_offset to latest [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366)
[14:15:53] <wikibugs>	 (03PS1) 10KartikMistry: Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906
[14:16:34] <kart_>	 tchin: let's go with your change if you're around?
[14:16:37] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[14:17:01] <kart_>	 or Cyndywikime if you're around.
[14:17:06] <Cyndywikime>	 yes
[14:17:17] <kart_>	 OK. Let's go with your change first.
[14:17:29] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10792478 (10Volans) p:05Triage→03Medium
[14:17:32] <Cyndywikime>	 :)
[14:18:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime)
[14:19:43] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime)
[14:19:54] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková)
[14:19:56] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]]
[14:19:59] <stashbot>	 T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566
[14:20:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10792524 (10Volans) p:05Triage→03Medium
[14:23:06] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[14:23:37] <fabfur>	 !log uploading haproxykafka 0.3.10 on apt repo (T393016)
[14:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:39] <stashbot>	 T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016
[14:23:41] <wikibugs>	 (03PS10) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759)
[14:23:41] <wikibugs>	 (03PS1) 10Andrew Bogott: nova policy: permit GET for os-server-groups and and os-flavor-extra-specs [puppet] - 10https://gerrit.wikimedia.org/r/1141909
[14:24:21] <wikibugs>	 (03PS1) 10Elukey: admin_ng: enforce PSS on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141910 (https://phabricator.wikimedia.org/T369493)
[14:24:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] nova policy: permit GET for os-server-groups and and os-flavor-extra-specs [puppet] - 10https://gerrit.wikimedia.org/r/1141909 (owner: 10Andrew Bogott)
[14:25:56] <logmsgbot>	 !log kartik@deploy1003 cyndywikime, kartik: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:25:59] <stashbot>	 T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566
[14:27:09] <kart_>	 Cyndywikime: Please test!
[14:27:20] <Cyndywikime>	 ok :)
[14:27:27] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet
[14:27:39] <fabfur>	 !log enable puppet and repooled cp7001 (T393016)
[14:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:26] <Cyndywikime>	 @kart, LGTM! Thanks :)
[14:29:13] <kart_>	 cool. Deploying.
[14:29:16] <logmsgbot>	 !log kartik@deploy1003 cyndywikime, kartik: Continuing with sync
[14:31:29] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906 (owner: 10KartikMistry)
[14:32:03] <kart_>	 Apologies, I have to deploy this revert as well ^^
[14:32:27] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10792668 (10Jelto) >>! In T378922#10780806, @MatthewVernon wrote: > Second, I am not an expert at this, but I think you need `"Prin...
[14:32:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10792670 (10Jelto)
[14:34:11] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906 (owner: 10KartikMistry)
[14:38:29] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: enforce PSS on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141910 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[14:38:42] <fabfur>	 !log upgrading haproxykafka to version 0.3.10 on A:cp (T393016)
[14:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:44] <stashbot>	 T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016
[14:39:28] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]] (duration: 19m 32s)
[14:39:31] <stashbot>	 T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566
[14:40:32] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]]
[14:42:36] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:44:22] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:45:03] <wikibugs>	 (03CR) 10SBassett: [C:03+1] "I generally support this, as most folks on the secteam would, I assume.  But this should at least tie back to a phabricator task and/or be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 (owner: 10Zabe)
[14:45:26] <wikibugs>	 (03PS2) 10Filippo Giunchedi: benthos/mw-accesslog-metrics: set start_offset to latest [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková)
[14:45:34] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[14:45:54] <wikibugs>	 (03PS3) 10Filippo Giunchedi: benthos/mw-accesslog-metrics: start_from_oldest: false [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková)
[14:47:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I've edited the option name to match our benthos version" [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková)
[14:47:23] <wikibugs>	 (03PS1) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966)
[14:48:29] <wikibugs>	 (03PS2) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966)
[14:49:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[14:49:31] <wikibugs>	 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Update labels on cloudcontrol200[789]-dev.codfw - https://phabricator.wikimedia.org/T393347#10792749 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm We just need to change the external labels on the server. This has been done. Thank you for t...
[14:55:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[14:58:47] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking)
[14:58:47] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:00:38] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[15:01:04] <wikibugs>	 (03PS3) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966)
[15:01:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[15:01:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[15:02:58] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking)
[15:03:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[15:04:50] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1028:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:05:35] <wikibugs>	 (03PS1) 10Brouberol: mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784)
[15:05:59] <wikibugs>	 (03CR) 10Andrea Denisse: "Thanks for taking a look! I wrote this commit message as adding the feature flag because it specifically introduces and declares the flag " [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:09] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927
[15:07:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[15:07:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10792806 (10Jhancock.wm) @Clement_Goubert i believe this server is yours. Is this still failed or did it heal? I'm not finding any evidence in the idrac of a failed disk. If it is valid, I can put in a re...
[15:07:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10792808 (10Jhancock.wm) a:03Jhancock.wm
[15:09:37] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927 (owner: 10PipelineBot)
[15:11:00] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]] (duration: 30m 27s)
[15:11:06] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927 (owner: 10PipelineBot)
[15:11:53] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:12:03] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:12:16] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:12:39] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:13:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[15:13:58] <wikibugs>	 (03PS1) 10Elukey: kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493)
[15:14:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[15:15:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[15:15:43] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5452/co" [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[15:17:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[15:20:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[15:21:27] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792865 (10Volans) No, you're right, `current_state` in the icinga...
[15:21:42] <wikibugs>	 (03PS4) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966)
[15:23:09] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[15:23:14] <wikibugs>	 (03CR) 10Klausman: [C:03+1] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[15:23:19] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[15:25:00] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[15:25:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5
[15:25:25] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Setting up permissions and view database sanitization for wikis nupwiki in section s5
[15:26:26] <wikibugs>	 (03PS7) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146)
[15:26:28] <wikibugs>	 (03CR) 10Hoo man: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man)
[15:26:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[15:26:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[15:28:33] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man)
[15:28:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[15:29:49] <logmsgbot>	 !log hoo@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[15:29:50] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1030:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1530).
[15:30:38] <logmsgbot>	 !log hoo@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[15:31:12] <logmsgbot>	 !log hoo@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[15:31:24] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.160.0" for 2 host(s)
[15:31:47] <logmsgbot>	 !log hoo@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[15:32:00] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:32:01] <wikibugs>	 (03CR) 10Bking: [C:03+1] mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol)
[15:32:11] <logmsgbot>	 !log hoo@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[15:32:34] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1031.eqiad.wmnet
[15:32:37] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:32:41] <logmsgbot>	 !log hoo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[15:32:46] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:33:06] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10792927 (10Eevans) Ok, I forgot to factor something in: Node data //other// than the SSTables.  So commitlogs...
[15:33:12] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.160.0" completed for 2 hosts
[15:33:20] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:34:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792930 (10Jhancock.wm)
[15:34:52] <tchin>	 Is the backport still happening? Or can I merge myself, since it's just a beta cluster change?
[15:35:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10792937 (10RobH) @MatthewVernon,  Please note that we've ordered 4 new hosts to replace ms-be10[60-63], but t...
[15:35:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm
[15:35:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err...
[15:35:34] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2048.codfw.wmnet with OS bookworm
[15:35:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with err...
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:25] <wikibugs>	 (03PS4) 10Scott French: hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938)
[15:37:29] <wikibugs>	 (03PS3) 10Scott French: hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938)
[15:40:38] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10792967 (10Eevans) >>! In T391544#10792925, @Eevans wrote: >  > [ ... ] >  > If we say 60G (for sake of even...
[15:40:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:41:02] <wikibugs>	 (03PS1) 10Hoo man: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532)
[15:41:33] <wikibugs>	 (03CR) 10Hoo man: [C:03+2] Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man)
[15:42:26] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792981 (10ssingh) >>! In T392848#10792865, @Volans wrote: > No, yo...
[15:42:59] <wikibugs>	 (03PS2) 10Scott French: P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1141916 (https://phabricator.wikimedia.org/T388536)
[15:43:16] <wikibugs>	 (03PS5) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966)
[15:43:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man)
[15:44:43] <logmsgbot>	 !log hoo@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[15:45:01] <logmsgbot>	 !log hoo@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[15:45:45] <logmsgbot>	 !log hoo@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[15:46:05] <logmsgbot>	 !log hoo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[15:46:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:46:22] <logmsgbot>	 !log hoo@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[15:46:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[15:46:45] <logmsgbot>	 !log hoo@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[15:46:47] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10793011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[15:47:51] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:48:03] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:41] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:46] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:49:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Extend package list to be installed from component/puppet7 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790)
[15:49:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:51:13] <wikibugs>	 (03CR) 10TChin: [C:03+2] Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[15:52:02] <wikibugs>	 (03Merged) 10jenkins-bot: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt)
[15:57:41] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] "LGTM, in particular with `retry_on: gateway-error`, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[15:57:41] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[15:59:26] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[16:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10793083 (10Papaul) 05Open→03Resolved Complete
[16:02:50] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:03:09] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:06:31] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:07:44] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:09:14] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[16:09:15] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[16:09:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] site.pp changes for aux-k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/1140701 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris)
[16:10:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Thanks for taking a look! I wrote this commit message as adding the feature flag because it specifically introduces and declares the fla" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[16:10:56] <jinxer-wm>	 FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:56] <jinxer-wm>	 RESOLVED: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:19] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[16:20:21] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[16:22:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10793153 (10wiki_willy) It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Con...
[16:22:11] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] "Ah, OK, thanks a lot for the fix and the clarification!" [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková)
[16:22:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking)
[16:24:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:25:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10793160 (10Papaul) 05Open→03Resolved a:03Papaul @ayounsi the solution here was to start a shell  and run the commands below ` star...
[16:29:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[16:30:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[16:30:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:30:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking)
[16:30:55] <Raine>	 akosiaris: if your puppet-merge comes across my benthos change, feel free to merge it
[16:31:08] <akosiaris>	 Raine: my bad, sorry, thanks
[16:31:23] <Raine>	 np, thanks too
[16:31:24] <inflatador>	 akosiaris what Raine said ;P
[16:31:41] <akosiaris>	 {{done}} for both
[16:31:42] <Raine>	 :D
[16:31:44] <Raine>	 ty!
[16:31:51] <inflatador>	 {◕ ◡ ◕}
[16:33:31] <wikibugs>	 (03PS4) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[16:33:51] <wikibugs>	 (03PS2) 10Andrea Denisse: grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841)
[16:33:51] <wikibugs>	 (03PS4) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841)
[16:34:34] <wikibugs>	 (03PS5) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[16:35:21] <wikibugs>	 (03CR) 10Andrea Denisse: "Thanks for the explanation, I've inverted the order of the commits." [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[16:35:41] <wikibugs>	 (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[16:38:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1111 to cirrussearch1111
[16:39:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:40:02] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1140262 (owner: 10Ncmonitor)
[16:40:26] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results after inverting the commit order: https://puppet-compiler.wmflabs.org/output/1140760/5453/" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[16:40:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:40:40] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5453/console" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[16:41:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:41:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[16:41:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:42:04] <wikibugs>	 (03CR) 10Andrea Denisse: "I ran PCC on the change that uses the default value (true) and it's a NOOP, thanks for the suggestion on testing it like this." [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[16:43:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:44:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:45:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002"
[16:45:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:46:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:48:27] <logmsgbot>	 bking@cumin2002 rename (PID 2927008) is awaiting input
[16:48:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002"
[16:48:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:48:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047
[16:49:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:49:08] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047
[16:49:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:50:23] <wikibugs>	 (03CR) 10Volans: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[16:54:45] <logmsgbot>	 bking@cumin2002 rename (PID 2927008) is awaiting input
[16:54:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:25] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:56:53] <wikibugs>	 (03PS6) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[16:57:05] <wikibugs>	 (03PS7) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[16:57:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:58:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047
[16:58:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047
[17:00:05] <jouncebot>	 swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1700).
[17:00:05] <jouncebot>	 ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1700).
[17:00:14] <swfrench-wmf>	 o/
[17:01:12] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw:periodic_job:kubernetes: quote job description [puppet] - 10https://gerrit.wikimedia.org/r/1140548 (owner: 10Scott French)
[17:02:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:03:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[17:04:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:09:13] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1141916 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French)
[17:13:41] <wikibugs>	 (03CR) 10Zabe: [C:03+1] manage-dblist: Fix indentation and stray blank line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE))
[17:14:29] <wikibugs>	 (03CR) 10Zabe: [C:03+1] manage-dblist: Fix some random phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE))
[17:15:48] <wikibugs>	 (03PS8) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[17:16:03] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:16:11] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:17:39] <wikibugs>	 (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[17:19:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rolling back cirrussearch1111 to elastic1111 - bking@cumin2002"
[17:19:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rolling back cirrussearch1111 to elastic1111 - bking@cumin2002"
[17:19:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:19:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from elastic1111 to cirrussearch1111
[17:20:25] <wikibugs>	 (03PS4) 10Jdlrobson: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia)
[17:23:56] <wikibugs>	 (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo)
[17:24:03] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French)
[17:24:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:27:32] <wikibugs>	 (03PS1) 10Volans: setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941
[17:29:30] <wikibugs>	 (03CR) 10Volans: [C:04-1] sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[17:30:24] <wikibugs>	 (03PS9) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[17:30:49] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:30:56] <wikibugs>	 (03PS1) 10Ssingh: CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943
[17:30:57] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:31:10] <wikibugs>	 (03PS3) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194)
[17:31:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh)
[17:32:42] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10793398 (10VRiley-WMF) a:03VRiley-WMF
[17:32:58] <wikibugs>	 (03CR) 10Ssingh: "So it fails as expected and as it should. Not sure why it didn't show up in I85edb13bf2b678e3414de9bfd7383ac877145f49 but abandoning." [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh)
[17:33:17] <wikibugs>	 (03Abandoned) 10Ssingh: CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh)
[17:38:08] <swfrench-wmf>	 no more changes planned on my end for this infra window
[17:38:50] <wikibugs>	 (03PS10) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729)
[17:41:15] <wikibugs>	 (03PS1) 10Herron: add dummy write group for testing [labs/private] - 10https://gerrit.wikimedia.org/r/1141944
[17:43:15] <wikibugs>	 (03CR) 10Herron: [V:03+2 C:03+2] add dummy write group for testing [labs/private] - 10https://gerrit.wikimedia.org/r/1141944 (owner: 10Herron)
[17:45:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[17:45:59] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:46:25] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:47:28] <wikibugs>	 (03CR) 10Herron: "Thanks for checking it out.  I switched the config from <limit> to <requireall> and require method which is looking better to me.  Also ad" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[17:47:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[17:48:25] <wikibugs>	 (03PS4) 10Brouberol: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[17:48:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[17:49:37] <logmsgbot>	 vriley@cumin1002 provision (PID 2538605) is awaiting input
[17:50:12] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol)
[17:52:46] <wikibugs>	 (03CR) 10Brouberol: [C:04-1] "We first need to absent all resources before we can remove the code" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[17:52:53] <wikibugs>	 (03PS5) 10Brouberol: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[17:54:25] <logmsgbot>	 vriley@cumin1002 provision (PID 2538605) is awaiting input
[17:57:12] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE-OnFire, 10Cassandra, and 4 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster - https://phabricator.wikimedia.org/T393406 (10Eevans) 03NEW
[17:57:16] <wikibugs>	 (03PS7) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115)
[17:57:21] <wikibugs>	 (03CR) 10Umherirrender: Improve function and property documentation for php code (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[18:00:02] <wikibugs>	 (03CR) 10Xcollazo: "Thanks for review Joal." [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo)
[18:02:04] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:02:26] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:02:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:03:28] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:05:21] <wikibugs>	 (03PS1) 10Scott French: alertmanager: add receiver and routing for moderator-tools tasks [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395)
[18:05:21] <wikibugs>	 (03CR) 10Scott French: "It turns out PageTriage is actually owned by Moderator Tools, rather than Community Tech, so this and the next patch update the notificati" [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French)
[18:05:24] <wikibugs>	 (03PS1) 10Scott French: mw::maintenance: update team for pagetriage jobs [puppet] - 10https://gerrit.wikimedia.org/r/1141946 (https://phabricator.wikimedia.org/T393395)
[18:07:28] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[18:07:29] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[18:07:42] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:09:34] <wikibugs>	 (03PS1) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401)
[18:11:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[18:12:20] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10793553 (10VRiley-WMF) Upon request, I have added (2) 480 Gig SSDs per server sessionstore1004, sessionstore1...
[18:12:22] <logmsgbot>	 !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:15:40] <wikibugs>	 (03PS2) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401)
[18:22:36] <wikibugs>	 (03PS2) 10Volans: setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941
[18:22:36] <wikibugs>	 (03PS1) 10Volans: setup.py: update redis dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141949
[18:37:08] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10793611 (10Eevans)
[18:37:35] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1140795 (owner: 10JHathaway)
[18:37:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:38:19] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (eqiad) - https://phabricator.wikimedia.org/T393408 (10Eevans) 03NEW
[18:41:37] <wikibugs>	 (03CR) 10Jdrewniak: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia)
[18:44:22] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10793638 (10Jhancock.wm) installed and detected by servers
[18:44:24] <wikibugs>	 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409 (10JVanderhoop-WMF) 03NEW
[18:48:36] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (eqiad) - https://phabricator.wikimedia.org/T393408#10793653 (10VRiley-WMF) 05Open→03Resolved Added (x2) 480Gig SSD drives to sessionstore1004, sessionstore1005, sess...
[18:48:55] <wikibugs>	 (03PS1) 10JHathaway: Revert "systemd::sysuser: create the user synchronously in the define" [puppet] - 10https://gerrit.wikimedia.org/r/1141952
[18:49:26] <wikibugs>	 (03CR) 10Kgraessle: [C:03+1] "LGTM based on the diff; was unable to test the loading of the survey using the js module." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[18:52:01] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Revert "systemd::sysuser: create the user synchronously in the define" [puppet] - 10https://gerrit.wikimedia.org/r/1141952 (owner: 10JHathaway)
[19:01:47] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[19:04:35] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-main: switch old internal hosts to main graph [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134)
[19:04:37] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-main: bring old internal hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134)
[19:17:56] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10793728 (10VRiley-WMF)
[19:18:27] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10793730 (10VRiley-WMF) 05Open→03Resolved These have been decommed
[19:23:18] <wikibugs>	 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10793743 (10Ahoelzl) Approved.
[19:26:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793748 (10VRiley-WMF)
[19:26:20] <wikibugs>	 (03PS1) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963
[19:26:37] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[19:26:43] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10793749 (10Scott_French) After a bit of thought and some back-testing over the last 2 months of data, ht...
[19:27:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1184.eqiad.wmnet with OS bullseye
[19:27:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS b...
[19:28:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[19:29:50] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:32:50] <wikibugs>	 (03CR) 10Ladsgroup: "Can I merge this now?" [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas)
[19:35:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:37:30] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:43:08] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage
[19:43:20] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10793784 (10VirginiaPoundstone) Approved.
[19:46:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage
[19:47:21] <wikibugs>	 (03PS3) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401)
[19:47:54] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "self merge, hosts not in prod" [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[19:47:55] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs-main: switch old internal hosts to main graph [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[19:49:46] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:52:05] <ryankemper>	 herron: cool if I merge b412424? (labs/private)
[19:52:22] <herron>	 ryankemper: thanks please do
[19:52:31] <ryankemper>	 done
[19:53:23] <wikibugs>	 (03PS2) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963
[19:53:31] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] sre.wdqs.data-transfer: improve graph type checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper)
[19:55:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[19:57:04] <wikibugs>	 (03PS4) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401)
[19:58:06] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[19:58:34] <wikibugs>	 (03PS5) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2000).
[20:00:05] <jouncebot>	 danisztls and JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:41] <JSherman>	 here
[20:01:56] <jinxer-wm>	 FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:03:32] <danisztls>	 o/
[20:05:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:06:12] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French)
[20:06:37] <JSherman>	 do we have a deployer around? I can self deploy if not.
[20:06:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:07:10] <JSherman>	 danisztls: I can probably deploy for you too; it looks like config?
[20:07:13] <RhinosF1>	 JSherman: doesn't look like it
[20:07:39] <JSherman>	 RhinosF1: ack
[20:07:47] <danisztls>	 JSherman: it's just config, thanks
[20:08:02] <JSherman>	 mmk, let me get myself setup
[20:08:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:10:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:10:28] <swfrench-wmf>	 !incidents
[20:10:29] <sirenbot>	 No incidents occurred in the past 24 hours for team SRE
[20:10:36] <sukhe>	 yeah
[20:10:46] <sukhe>	 this is the page from the weekend
[20:10:46] <swfrench-wmf>	 do we need a downtime?
[20:11:01] <sukhe>	 I marked it as resolved as the host is depooled
[20:11:07] <sukhe>	 otherwise it would ping again
[20:11:09] <JSherman>	 danisztls: okay, I'm going to deploy us together since we're both doing survey config
[20:11:13] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:11:15] <danisztls>	 JSherman: ok
[20:11:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza)
[20:11:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:11:50] <swfrench-wmf>	 sukhe: ah, great - thanks for marking resolved
[20:12:07] <wikibugs>	 (03Merged) 10jenkins-bot: Design Research Participant Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza)
[20:12:09] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:13:16] <JSherman>	 hmm, we had an unexpected commit from friday
[20:13:27] <rzl>	 sukhe, swfrench-wmf: thanks both!
[20:13:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:14:20] <logmsgbot>	 vriley@cumin1002 reimage (PID 2549027) is awaiting input
[20:14:49] <JSherman>	 looks like it's for labs settings; proceeding
[20:15:06] <logmsgbot>	 !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]]
[20:15:10] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:15:11] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:15:25] <jinxer-wm>	 FIRING: [15x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:33] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:15:34] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1184.eqiad.wmnet with OS bullseye
[20:15:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793857 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS bulls...
[20:18:28] <jinxer-wm>	 FIRING: [5x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:20:25] <jinxer-wm>	 FIRING: [19x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:11] <logmsgbot>	 !log jsn@deploy1003 dani, jsn: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:21:14] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:21:15] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:21:29] <JSherman>	 test servers barfed the first time; were happy on retry
[20:21:44] <JSherman>	 danisztls: please test
[20:21:58] <danisztls>	 JSherman: done, looks good
[20:22:39] <JSherman>	 excellent; it's going to take me a minute to test mine
[20:23:28] <jinxer-wm>	 RESOLVED: [5x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:24:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:25:25] <jinxer-wm>	 FIRING: [23x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:07] <JSherman>	 okay, good on my end; proceeding
[20:28:10] <logmsgbot>	 !log jsn@deploy1003 dani, jsn: Continuing with sync
[20:28:41] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French)
[20:29:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:30:25] <jinxer-wm>	 FIRING: [23x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:32:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:33:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:35:05] <logmsgbot>	 !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]] (duration: 19m 58s)
[20:35:08] <stashbot>	 T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325
[20:35:09] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:35:48] <JSherman>	 danisztls: okay, we're done!
[20:35:48] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:35:48] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:35:50] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:35:50] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:36:40] <danisztls>	 JSherman: thanks again
[20:36:58] <JSherman>	 no prob
[20:38:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:39:05] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793912 (10VRiley-WMF)
[20:40:06] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:40:06] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:40:06] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:40:06] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:40:06] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:41:35] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[20:41:55] <JSherman>	 aaand I see that I messed up the privacy statement link for my backport
[20:42:20] <JSherman>	 I'm going to backport that too since we're still in the window
[20:42:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:43:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:44:22] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:44:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:44:39] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs[2008,2014-2015].codfw.wmnet,wdqs[1011,1016].eqiad.wmnet with reason: T388134
[20:44:42] <stashbot>	 T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[20:44:54] <ryankemper>	 Sorry for the wdqs noise, downtimed these hosts. Their systemd units won't be happy until their data transfers complete
[20:46:34] <wikibugs>	 (03PS1) 10Jsn.sherman: Fix link for first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401)
[20:46:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:47:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:48:03] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  apus-fe1003 - vriley@cumin1002"
[20:48:09] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  apus-fe1003 - vriley@cumin1002"
[20:48:09] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:48:33] <wikibugs>	 (03Merged) 10jenkins-bot: Fix link for first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:48:41] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793945 (10VRiley-WMF)
[20:48:49] <logmsgbot>	 !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]]
[20:48:51] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:49:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe1003
[20:49:47] <wikibugs>	 (03CR) 10Bking: [C:03+1] "Conditional +1. Feel free to merge if you're OK with using 1017 as a main host, as opposed to internal-main." [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[20:49:53] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 3129320) is awaiting input
[20:50:23] <wikibugs>	 (03PS1) 10Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036)
[20:50:26] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:50:30] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe1003
[20:51:51] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:52:38] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036) (owner: 10Clare Ming)
[20:54:18] <wikibugs>	 (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036) (owner: 10Clare Ming)
[20:55:26] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:55:47] <logmsgbot>	 !log jsn@deploy1003 jsn: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:55:47] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[20:55:50] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:56:09] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[20:56:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:56:52] <logmsgbot>	 !log jsn@deploy1003 jsn: Continuing with sync
[20:58:10] <JSherman>	 verified that the patch made things happy
[20:58:53] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793954 (10VRiley-WMF)
[20:59:45] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[20:59:57] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm
[21:00:03] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793957 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2100).
[21:01:10] <JSherman>	 noting that this is running slightly over the time window; currently @ 60% for k8s deployment
[21:03:33] <logmsgbot>	 !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]] (duration: 14m 43s)
[21:03:35] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[21:04:18] <JSherman>	 okay, done!
[21:04:49] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10793975 (10ArthurPSmith) Yes, P13552 did not result in delay, although I did wait som...
[21:04:54] <JSherman>	 things look happy and I'm outta here. Feel free to @ me on slack if followup is needed.
[21:13:39] <logmsgbot>	 vriley@cumin1002 reimage (PID 2563356) is awaiting input
[21:14:02] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.wikimedia.org with OS bookworm
[21:14:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors: - apus-...
[21:15:19] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm
[21:15:28] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793986 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm
[21:20:25] <wikibugs>	 (03PS11) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[21:20:54] <wikibugs>	 (03PS12) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[21:21:26] <wikibugs>	 (03PS13) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[21:23:46] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10794000 (10Jhancock.wm) 05Open→03Resolved
[21:29:00] <wikibugs>	 (03PS14) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[21:34:09] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[21:37:47] <wikibugs>	 (03CR) 10Ryan Kemper: "Fixed the method invocations. Should be ready for another round of review (cc @volans)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking)
[21:39:44] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone: update policy.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/1141977 (https://phabricator.wikimedia.org/T330759)
[21:39:44] <wikibugs>	 (03PS1) 10Andrew Bogott: nova policy.yaml: update with advice from oslopolicy-validator [puppet] - 10https://gerrit.wikimedia.org/r/1141978 (https://phabricator.wikimedia.org/T330759)
[21:39:46] <wikibugs>	 (03PS1) 10Andrew Bogott: nova policy.json: remove a bunch of redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141979 (https://phabricator.wikimedia.org/T330759)
[21:39:47] <wikibugs>	 (03PS1) 10Andrew Bogott: glance: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141980 (https://phabricator.wikimedia.org/T330759)
[21:39:49] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: explicitly use new policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141981 (https://phabricator.wikimedia.org/T330759)
[21:39:50] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder policy.yaml: update, remove redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141982 (https://phabricator.wikimedia.org/T330759)
[21:39:52] <wikibugs>	 (03PS1) 10Andrew Bogott: Neutron: update policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141983 (https://phabricator.wikimedia.org/T330759)
[21:39:54] <wikibugs>	 (03PS1) 10Andrew Bogott: Designate: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141984 (https://phabricator.wikimedia.org/T330759)
[21:40:22] <sbassett>	 Hey all - have a couple of security patches for T392341 I’d like to get out today.
[21:40:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:50:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:53:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:57:38] <sbassett>	 !log Deployed security fix (1) for T392341
[21:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:04:44] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable, confirm it fixes the error in my local dev." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141516 (owner: 10Krinkle)
[22:05:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:12:14] <sbassett>	 !log Deployed security fix (2) for T392341
[22:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:30] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "The tagline is too big and overlaps search on Vector 2022. It shouldn't exceed the max size of the wordmark ( 124px). https://www.mediawik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky)
[22:33:08] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 3226688) is awaiting input
[22:35:33] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.wikimedia.org with OS bookworm
[22:35:45] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10794191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors: - apus-...
[22:46:19] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[22:46:31] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[22:46:42] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[22:47:05] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[22:48:40] <zabe>	 jouncebot: nowandnext
[22:48:40] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2100)
[22:48:40] <jouncebot>	 In 0 hour(s) and 11 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2300)
[22:50:12] <logmsgbot>	 ryankemper@cumin2002 upgrade-firmware (PID 3300272) is awaiting input
[22:50:48] <wikibugs>	 (03CR) 10Zabe: [C:03+2] core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140661 (owner: 10Novem Linguae)
[22:51:36] <wikibugs>	 (03Merged) 10jenkins-bot: core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140661 (owner: 10Novem Linguae)
[22:52:47] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1140661|core-Permissions: refactor enwiki wgRemoveGroups]]
[22:54:03] <logmsgbot>	 ryankemper@cumin2002 upgrade-firmware (PID 3300272) is awaiting input
[22:56:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:57:16] <logmsgbot>	 !log zabe@deploy1003 zabe, novemlinguae: Backport for [[gerrit:1140661|core-Permissions: refactor enwiki wgRemoveGroups]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:57:22] <logmsgbot>	 !log zabe@deploy1003 zabe, novemlinguae: Continuing with sync
[22:59:23] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[22:59:45] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[23:00:06] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2300)
[23:00:52] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: "@brouberol@wikimedia.org so is that supposed to be done in another patch, deployed, and then we come back to this one?" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[23:00:57] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[23:01:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[23:01:17] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[23:04:00] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140661|core-Permissions: refactor enwiki wgRemoveGroups]] (duration: 11m 13s)
[23:06:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:13:00] <wikibugs>	 (03CR) 10Cwhite: grafana: Toggle data sync using feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[23:13:13] <wikibugs>	 (03CR) 10Cwhite: grafana: Add enable_dashboard_sync feature flag in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[23:14:40] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php enwiki --delete /home/zabe/afl_text_table_deletedump/enwiki --sleep 0.3 # T381599
[23:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:46] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[23:25:47] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10794276 (10Eevans) OK, after having two new 480G SSDs added to each machine (used devices from decomm'd machi...
[23:29:51] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:32:48] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[23:38:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000
[23:38:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000 (owner: 10TrainBranchBot)
[23:44:56] <jinxer-wm>	 FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:49:46] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:49:51] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000 (owner: 10TrainBranchBot)
[23:49:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown