[00:08:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553 [00:08:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553 (owner: TrainBranchBot) [00:10:51] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:16:27] PROBLEM - Host prometheus1008 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:43] FIRING: [8x] ProbeDown: Service prometheus1008:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:28:43] FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:29:33] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1141553 (owner: TrainBranchBot) [00:32:43] FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:49:42] FIRING: [2x] JobUnavailable: Reduced availability for job smoke/dns in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:50:42] FIRING: [2x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:54:13] PROBLEM - SSH on centrallog2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:54:42] FIRING: [20x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:57:42] FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:57:55] PROBLEM - SSH on prometheus1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:57:57] PROBLEM - SSH on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:58:43] FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:02:43] FIRING: [9x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:04:42] FIRING: [21x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:08:19] PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:10:57] PROBLEM - SSH on prometheus2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:13:57] PROBLEM - SSH on vrts1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:19:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1011.eqiad.wmnet with OS bullseye [01:27:42] FIRING: [10x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:42] RESOLVED: [10x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:36:35] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage [01:39:31] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage [01:44:13] PROBLEM - SSH on prometheus1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:55:22] !log ryankemper@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wdqs1011.eqiad.wmnet with OS bullseye [01:56:25] PROBLEM - SSH on centrallog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:04:53] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:05:11] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:20:51] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:29:42] FIRING: [22x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:53] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:38:09] PROBLEM - SSH on vrts2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:38:51] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:48:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [02:53:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [02:53:25] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:25] RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:33] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS bullseye [03:17:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:18:48] (PS1) Andrew Bogott: cloudrabbit200x-dev: fix fqdn [puppet] - https://gerrit.wikimedia.org/r/1141564 (https://phabricator.wikimedia.org/T392539) [03:20:13] (CR) Andrew Bogott: [C:+2] cloudrabbit200x-dev: fix fqdn [puppet] - https://gerrit.wikimedia.org/r/1141564 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott) [03:24:10] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [03:26:55] !log ryankemper@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [03:27:36] (PS1) Andrew Bogott: Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539) [03:28:49] (PS2) Andrew Bogott: Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539) [03:31:25] (CR) Andrew Bogott: [C:+2] Add cloudrabbit200[1-3]-dev to preseed [puppet] - https://gerrit.wikimedia.org/r/1141566 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott) [03:36:43] !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2007-dev to cloudrabbit2001-dev [03:37:05] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [03:41:51] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2007-dev to cloudrabbit2001-dev - andrew@cumin1002" [03:42:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2007-dev to cloudrabbit2001-dev - andrew@cumin1002" [03:42:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:42:14] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2001-dev [03:42:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS bullseye [03:42:25] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:42:30] !log andrew@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2001-dev [03:43:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cloudcontrol2007-dev to cloudrabbit2001-dev [03:43:29] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [03:43:50] !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2008-dev to cloudrabbit2002-dev [03:43:52] !log andrew@cumin1002 START - Cookbook sre.hosts.rename from cloudcontrol2009-dev to cloudrabbit2003-dev [03:44:13] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [03:46:15] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [03:47:25] RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:48:39] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2008-dev to cloudrabbit2002-dev - andrew@cumin1002" [03:49:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cloudcontrol2008-dev to cloudrabbit2002-dev - andrew@cumin1002" [03:49:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:49:04] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2002-dev [03:49:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2002-dev [03:49:44] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [03:50:00] !log andrew@cumin1002 END (PASS) - Cookbook 
sre.hosts.rename (exit_code=0) from cloudcontrol2008-dev to cloudrabbit2002-dev [03:52:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:52:21] !log andrew@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit2003-dev [03:52:42] !log andrew@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit2003-dev [03:53:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cloudcontrol2009-dev to cloudrabbit2003-dev [03:54:37] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [03:54:37] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm [03:57:34] (PS1) Andrew Bogott: Remove refs to cloudcontrol200[789] [puppet] - https://gerrit.wikimedia.org/r/1141568 (https://phabricator.wikimedia.org/T392539) [03:57:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:58:32] RESOLVED: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:59:42] FIRING: [22x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:00:42] FIRING: [3x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:02:18] RESOLVED: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:27] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [04:08:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [04:12:22] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [04:13:16] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [04:15:43] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [04:18:23] (PS1) DDesouza: Design Research Participant Survey: Undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) [04:19:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [04:19:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [04:19:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: DDesouza) [04:28:57] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet with OS bookworm [04:31:42] ops-codfw, cloud-services-team, DC-Ops: Update labels on cloudcontrol200[789]-dev.codfw - https://phabricator.wikimedia.org/T393347 (Andrew) NEW [04:34:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet with OS bookworm [04:38:29] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet with OS bookworm [04:58:11] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2008.codfw.wmnet with OS bullseye [04:58:41] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2008 [05:00:58] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [05:04:17] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [05:05:52] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2008 - ryankemper@cumin2002" [05:05:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2008 - ryankemper@cumin2002" [05:05:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:05:59] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2008.codfw.wmnet 194.32.192.10.in-addr.arpa 4.9.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [05:06:02] !log ryankemper@cumin2002 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2008.codfw.wmnet 194.32.192.10.in-addr.arpa 4.9.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [05:06:03] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2008 [05:06:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2008 [05:06:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2008 [05:21:01] Around 4.5 hours ago, we noticed that traffic to the MinT (machinetranslation) service was reduced and found only 3 pods running per DC. Is that a known outage or is there work going on? [05:22:29] In fact, 2 per DC. 3 workers per pod. [05:25:51] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage [05:32:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage [05:39:42] RESOLVED: [11x] JobUnavailable: Reduced availability for job blackbox/icmp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:40:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job blackbox/pingthing in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:49:06] FIRING: SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:49:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2008.codfw.wmnet with OS bullseye [05:50:42] FIRING: [2x] 
SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:53:11] (PS2) Anzx: nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) [05:53:14] (PS2) Anzx: nupwiki: add timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) [05:53:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: Anzx) [05:53:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: Anzx) [05:54:01] ryankemper@cumin2002 reimage (PID 2270920) is awaiting input [05:54:06] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:55:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [05:55:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [05:55:42] FIRING: [3x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:57:20] (CR) Bunnypranav: "Hey folks, it's my first time uploading a patch to mediawiki-config. Even though it's a simple enough change, do I need to schedule a backpo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav) [06:02:45] (CR) Anzx: [C:+1] "https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav) [06:04:06] FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:05:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: Bunnypranav) [06:05:42] FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:08:48] jouncebot: nowandnext [06:08:48] For the next 0 hour(s) and 51 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250504T0700) [06:08:48] In 0 hour(s) and 51 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T0700) [06:16:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [06:17:35] FYI, aux-k8s-etcd1003, dse-k8s-etcd1001 and kubestagemaster1005 will briefly go down for a Ganeti reboot [06:17:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [06:19:36] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [06:19:40] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100% [06:20:16] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:21:21] (PS2) Anzx: ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) [06:21:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: Anzx) [06:22:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [06:23:05] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1140795 (owner: JHathaway) [06:23:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [06:24:57] FIRING: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:25:28] RECOVERY - Host dse-k8s-etcd1001 is UP: PING WARNING - Packet loss = 90%, RTA = 6.15 ms [06:25:42] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [06:25:56] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [06:26:34] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2014.codfw.wmnet with OS bullseye [06:27:03] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2014 [06:27:29] (PS1) Muehlenhoff: Install linux-sysctl-defaults on trixie [puppet] - https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083) [06:29:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:30:06] ryankemper@cumin2002 reimage (PID 2270920) is awaiting input [06:30:11] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [06:35:49] ryankemper@cumin2002 reimage (PID 2270920) is awaiting input [06:37:22] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2014 - ryankemper@cumin2002" [06:37:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2014 - ryankemper@cumin2002" [06:37:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:37:28] !log ryankemper@cumin2002 
START - Cookbook sre.dns.wipe-cache wdqs2014.codfw.wmnet 192.16.192.10.in-addr.arpa 2.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [06:37:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2014.codfw.wmnet 192.16.192.10.in-addr.arpa 2.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [06:37:32] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2014 [06:39:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2014 [06:39:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2014 [06:41:45] SRE, serviceops, Traffic-Icebox, Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#10790816 (kostajh) Open→Declined [06:42:24] (CR) Filippo Giunchedi: [C:+1] Install linux-sysctl-defaults on trixie [puppet] - https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [06:44:09] (CR) Filippo Giunchedi: [V:+2 C:+2] icinga: frack: adjust fran* groupings and add host [puppet] - https://gerrit.wikimedia.org/r/1140775 (https://phabricator.wikimedia.org/T386259) (owner: Dwisehaupt) [06:45:31] (CR) Arnaudb: [C:+1] gerrit: enable bacula backups on gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: Dzahn) [06:57:06] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage [06:58:38] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1141697 [07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T0700). [07:00:05] abijeet, anzx, and bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:08] o/ [07:01:41] I can deploy abijeet's patch.. [07:02:09] kart_, thanks [07:02:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage [07:03:07] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) (owner: Abijeet Patro) [07:09:01] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1141697 (owner: PipelineBot) [07:11:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [07:11:39] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [07:13:14] (PS1) Gergő Tisza: CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - https://gerrit.wikimedia.org/r/1141700 [07:13:35] (Merged) jenkins-bot: Mobile frequent languages entrypoint: Add dependency to sitemapper [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) (owner: Abijeet Patro) [07:14:17] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]] [07:14:21] T393144: TypeError: undefined is not an object (evaluating 'new mw.cx.SiteMapper') / TypeError: Cannot read properties of undefined (reading 'SiteMapper') / TypeError: mw.cx is 
undefined - https://phabricator.wikimedia.org/T393144 [07:14:22] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [07:14:36] (03PS2) 10Gergő Tisza: CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 [07:15:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:15:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:18:15] (03CR) 10Vgutierrez: [C:04-1] "this can't be merged till a new version of the debian package gets deployed, currently haproxykafka package deploys the systemd unit on /u" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [07:19:15] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:19:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2014.codfw.wmnet with OS bullseye [07:20:32] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2015.codfw.wmnet with OS bullseye [07:21:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2015 [07:21:10] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [07:21:48] abijeet: you can test the patch [07:22:48] kart_, ok, scap took a while [07:23:03] yeah [07:24:43] Hi [07:24:57] kart_, looks god [07:24:58] kart_, looks good [07:25:03] cool [07:25:09] !log kartik@deploy1003 abi, kartik: Continuing with sync [07:25:34] (03CR) 10Kosta Harlan: [C:03+1] CommonSettings: Document wmfGetPrivilegedGroups usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141700 (owner: 10Gergő Tisza) [07:25:44] kart_ can you do mine as well?. 
[07:26:54] ryankemper@cumin2002 reimage (PID 2360871) is awaiting input [07:27:19] bunnypranav: sadly, I have to go to meetings after abijeet's deployment is done :/ [07:27:45] Ok, fine. [07:30:09] (03PS7) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [07:31:45] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140703|Mobile frequent languages entrypoint: Add dependency to sitemapper (T393144 T386223)]] (duration: 17m 27s) [07:31:48] I can deploy bunnypranav's change [07:31:49] T393144: TypeError: undefined is not an object (evaluating 'new mw.cx.SiteMapper') / TypeError: Cannot read properties of undefined (reading 'SiteMapper') / TypeError: mw.cx is undefined - https://phabricator.wikimedia.org/T393144 [07:31:49] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [07:32:31] (03CR) 10Ayounsi: "PS 6..7 :" [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [07:33:05] Dreamy_Jazz: could you deploy my changes as well, or should I move mine to the next window? [07:33:23] Let me take a look [07:33:58] (03CR) 10Dreamy Jazz: [C:03+2] Add checkuserwiki favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: 10Bunnypranav) [07:34:45] (03Merged) 10jenkins-bot: Add checkuserwiki favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: 10Bunnypranav) [07:35:33] (03CR) 10Dreamy Jazz: [C:03+2] nupwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: 10Anzx) [07:36:29] Yeah. I should be able to deploy your changes. 
[07:36:30] (03Merged) 10jenkins-bot: nupwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141574 (https://phabricator.wikimedia.org/T390711) (owner: 10Anzx) [07:37:09] Dreamy_Jazz: ty [07:38:03] (03CR) 10Dreamy Jazz: [C:03+2] nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: 10Anzx) [07:38:35] (03CR) 10Dreamy Jazz: [C:03+2] ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: 10Anzx) [07:39:21] (03Merged) 10jenkins-bot: nnwiki: enable wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141573 (https://phabricator.wikimedia.org/T393299) (owner: 10Anzx) [07:39:38] (03Merged) 10jenkins-bot: ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141582 (https://phabricator.wikimedia.org/T392803) (owner: 10Anzx) [07:40:06] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]] [07:40:14] T393299: Convert reference lists over to `responsive` on nnwiki - https://phabricator.wikimedia.org/T393299 [07:40:15] T392803: VE in namespace in ruWikibooks - https://phabricator.wikimedia.org/T392803 [07:40:15] T393246: Change favicon on the CheckUser wiki - https://phabricator.wikimedia.org/T393246 [07:40:15] T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711 [07:41:46] (03CR) 10CI reject: [V:04-1] netbox: add fetch_device_interfaces using GraphQL [software/homer] - 
10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [07:43:11] RECOVERY - SSH on prometheus1005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:01] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:44:43] !log dreamyjazz@deploy1003 dreamyjazz, bunnypranav, anzx: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:44:53] Dreamy_Jazz: checking [07:44:56] Thanks! [07:45:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:46:11] I'll check the checkuserwiki change as I don't think bunnypranav has access to that wiki. 
[07:47:17] checkuser.wikimedia.org favicon appears to work [07:47:19] Dreamy_Jazz: all looks good, checkuserwiki as well [07:47:28] Thanks [07:47:31] !log dreamyjazz@deploy1003 dreamyjazz, bunnypranav, anzx: Continuing with sync [07:49:14] FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:49:31] RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:49:59] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:50:08] FIRING: [11x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:50:30] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [07:50:34] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [07:50:46] FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:51:25] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:42] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:53:43] (03CR) 10Fabfur: "yeah, that's the plan, I also modified the task to be more clear in the needed steps" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [07:53:57] (03CR) 10Arnaudb: [C:03+1] gerrit: enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [07:54:06] FIRING: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:54:18] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141573|nnwiki: enable wgCiteResponsiveReferences (T393299)]], [[gerrit:1141582|ruwikibooks: enable VisualEditorAvailableNamespaces for Рецепт (recipe) namespace (T392803)]], [[gerrit:1141089|Add checkuserwiki favicon (T393246)]], [[gerrit:1141574|nupwiki: add timezone (T390711)]] (duration: 14m 11s) [07:54:22] Dreamy_Jazz: thanks again for deploying [07:54:24] T393299: Convert reference lists over to `responsive` on nnwiki - https://phabricator.wikimedia.org/T393299 [07:54:25] T392803: VE in namespace in ruWikibooks - https://phabricator.wikimedia.org/T392803 [07:54:25] T393246: Change favicon on the CheckUser wiki - https://phabricator.wikimedia.org/T393246 [07:54:25] T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711 [07:54:32] Np [07:54:46] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:54:50] FIRING: [18x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:59] !log UTC morning backport window finished [07:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:01] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4) [07:55:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:55:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:55:46] FIRING: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:56:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:49] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [07:59:06] RESOLVED: [4x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:59:26] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [07:59:34] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [07:59:40] RESOLVED: 
KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:59:46] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [08:00:02] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2015 - ryankemper@cumin2002" [08:00:03] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.00 ms [08:00:04] FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2015 - ryankemper@cumin2002" [08:00:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:00:08] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2015.codfw.wmnet 209.48.192.10.in-addr.arpa 9.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:00:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2015.codfw.wmnet 209.48.192.10.in-addr.arpa 9.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:00:13] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015 [08:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015 [08:00:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2015 [08:00:42] RESOLVED: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:01:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:02:17] RECOVERY - SSH on prometheus1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:02:41] !log rebooting prometheus1005 prometheus1006 and prometheus2006 [08:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:42] FIRING: [2x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:50] FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:05:06] FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:05:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [08:05:49] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [08:06:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:09:31] (03PS5) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) [08:09:40] (03CR) 10Fabfur: haproxykafka: service unit brought by deb package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [08:09:50] FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:11:12] (03PS2) 10Msz2001: [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) [08:11:35] (03CR) 10Filippo Giunchedi: [C:04-1] "I tested this in Pontoon and it doesn't seem to work (reply comes from opensearch not apache)" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [08:11:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, 
May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [08:11:45] !log powercycle prometheus1008 - no ssh, mgmt console showing cpu soft lockup continuously [08:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:32] !log powercycle prometheus2005 - no ssh, mgmt console showing systemd units being deactivated, no root login [08:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:16:21] RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [08:16:59] thank you elukey tappof [08:17:12] !log powercycle prometheus2008 - no ssh, mgmt console showing systemd units being deactivated, no root login [08:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:44] ciao godog, buongiorno [08:17:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage [08:18:03] buongiorno to you too [08:18:19] RECOVERY - SSH on prometheus2005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:19:15] PROBLEM - Host prometheus2008 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:50] FIRING: [30x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:21:19] RECOVERY - SSH on prometheus2008 is OK: SSH OK - OpenSSH_9.2p1 
Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:21] RECOVERY - Host prometheus2008 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [08:21:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage [08:24:50] FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:25:06] FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:25:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:29:50] FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:30:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:32:02] !log powercycle centrallog1002 - can not login on ssh or console [08:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:22] !log rebooting prometheus2007 - no ssh, com2 via racadm hangs [08:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] (03CR) 10Hashar: "We had a 
thread on Slack with QTE about having a RTL wiki. That `en_rtl` filled that niche at the time https://wikimedia.slack.com/archiv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [08:33:29] PROBLEM - Host centrallog1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:17] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:34:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:34:50] FIRING: [28x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:35] RECOVERY - SSH on centrallog1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:35:37] RECOVERY - Host centrallog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [08:35:54] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10791244 (10Nikerabbit) 05Stalled→03In progress [08:36:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:36:19] PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:37:17] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:37:19] RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3790) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:37:37] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:38:42] FIRING: [2x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:39:50] FIRING: [26x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:40:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2015.codfw.wmnet with OS bullseye [08:40:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:41:01] (03PS2) 10Hashar: python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 [08:41:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:42:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:50] FIRING: [22x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:45:10] (03CR) 10Thiemo Kreuz (WMDE): Improve function and property documentation for php code (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [08:46:54] (03CR) 10Hashar: "@jhathaway@wikimedia.org may you puppet-merge this one for me please? I don't have +2 or access to the Puppet servers." [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar) [08:47:35] PROBLEM - SSH on prometheus2007 is CRITICAL: connect to address 10.192.9.11 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:47:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:40] (03PS11) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [08:49:50] FIRING: [24x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:50:46] Thanks a lot, Dreamy_Jazz, for the deploy. Sorry for my absence; I went offline as kart said they were busy. [08:51:25] Btw, I can see the main page of checkuserwiki, so the favicon is visible to the public. 
[08:51:47] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [08:52:31] 06SRE: soft lockup on prometheus and centrallog hosts with the new kernel - https://phabricator.wikimedia.org/T393357 (10fgiunchedi) 03NEW [08:53:31] bunnypranav: generally, during deployment we use https://wikitech.wikimedia.org/wiki/WikimediaDebug to check changes on the test servers before they go live [08:54:50] FIRING: [23x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10791322 (10Stevemunene) 05Open→03Resolved Hosts look ok after 2 days, I think it is safe to close this and move... 
[08:55:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:56:05] !log powercycle centrallog2002 - can not login on ssh or console [08:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:09] PROBLEM - Host prometheus2007 is DOWN: PING CRITICAL - Packet loss = 100% [08:57:47] PROBLEM - Host centrallog2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:15] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:58:15] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:58:49] (03PS12) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [08:59:19] RECOVERY - SSH on centrallog2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:21] RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [08:59:37] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5447/co" [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [08:59:43] PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [08:59:50] (03CR) 10Bunnypranav: "@marcinszwarc@hotmail.com You have set the perm to false, which basically restricts them from seeing 
the private filters. Is that what you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [08:59:50] FIRING: [22x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:00:56] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [09:01:13] PROBLEM - Bird Internet Routing Daemon on centrallog2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:01:43] RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is OK: OK: UP (pid=3794) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [09:02:13] RECOVERY - Bird Internet Routing Daemon on centrallog2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:02:15] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:02:15] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:02:31] 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791363 (10fgiunchedi) [09:03:10] !log powercycle vrts1003 + vrts2002 - soft lockup T393357 [09:03:12] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:12] T393357: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357 [09:03:42] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:04:33] PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:35] RECOVERY - SSH on prometheus2007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:37] RECOVERY - Host prometheus2007 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [09:04:41] PROBLEM - Host vrts1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:51] FIRING: [22x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:05:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:06:20] FIRING: [5x] ProbeDown: Service vrts1003:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:05] RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [09:07:11] RECOVERY - Host vrts1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:07:19] RECOVERY - SSH on vrts1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:19] RECOVERY - SSH on vrts2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:53] 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791398 (10fgiunchedi) Another correlation (maybe causation) is the fact that all hosts locking up so far have mdadm raid10 [09:08:13] PROBLEM - freshclam running on vrts1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [09:09:13] RECOVERY - freshclam running on vrts1003 is OK: PROCS OK: 1 process with UID = 110 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [09:11:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:11:20] RESOLVED: [5x] ProbeDown: Service vrts1003:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:17] bunnypranav: No problem on being away. Thanks for the patch. 
[09:12:32] jouncebot: nowandnext [09:12:32] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [09:12:32] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000) [09:12:49] Going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1141844/2 [09:13:00] (03PS13) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [09:14:28] (03CR) 10Dreamy Jazz: [C:03+2] [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [09:15:05] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [09:15:15] (03Merged) 10jenkins-bot: [plwiki] Add 'abusefilter-view-private' to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [09:16:56] (03CR) 10Dreamy Jazz: [C:03+2] "From what I can see, this change now removes the `false` definition. The right is given to `sysop` group by default, so this change should" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [09:17:14] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]] [09:17:17] T393353: Add (abusefilter-view-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T393353 [09:18:20] (03CR) 10Bunnypranav: "Oh, I misinterpreted the remove for a added line, my bad." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141844 (https://phabricator.wikimedia.org/T393353) (owner: 10Msz2001) [09:19:06] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10791431 (10elukey) My 2c: before starting we should decide if what controller we want to use, because in T391854 it seems that we may be oriented in buying... [09:21:37] !log dreamyjazz@deploy1003 dreamyjazz, msz2001: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:23:35] !log dreamyjazz@deploy1003 dreamyjazz, msz2001: Continuing with sync [09:24:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10791445 (10elukey) @MatthewVernon I think that the new controller costs the same as the old one, so the config-J price shouldn't chang... 
[09:26:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:30:19] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141844|[plwiki] Add 'abusefilter-view-private' to sysop (T393353)]] (duration: 13m 04s) [09:30:21] T393353: Add (abusefilter-view-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T393353 [09:31:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:35:52] 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791471 (10MoritzMuehlenhoff) RAID 10 is a good lead! It seems the same was already reported in Debian a few days ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460 [09:36:19] (03CR) 10Elukey: [C:03+2] admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140140 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:38:26] !log depool inference/codfw from DNS discovery to safely apply new pod/container security settings - T369493 [09:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:29] T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493 [09:39:01] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:39:22] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[09:41:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:51:37] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5448/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [09:55:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:51] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000) [10:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:26] 06SRE, 06Infrastructure-Foundations: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366 (10MoritzMuehlenhoff) 03NEW [10:09:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10791559 (10phaultfinder) [10:11:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791564 (10MoritzMuehlenhoff) [10:15:39] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [10:15:54] Hi, can anyone help in deploying asap a fix for the main menu text in eswiki? Or in "forcing" a local text to override the mistake? 
[10:16:44] (03CR) 10Filippo Giunchedi: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [10:17:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [10:18:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791574 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel [10:20:42] (03CR) 10Filippo Giunchedi: "Please change the commit message, specifically we are temporarily disabling dashboard_sync for the grafana upgrade and not add the feature" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [10:24:14] !log rebooting prometheus1007 into linux-image-6.1.0-33-amd64 [10:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:37] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T393368 (10phaultfinder) 03NEW [10:24:42] FIRING: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:05] PROBLEM - SSH on prometheus1007 is CRITICAL: connect to address 10.64.48.171 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:26:36] jem: https://es.wikipedia.org/w/index.php?title=MediaWiki:Vector-opt-out&action=edit [10:27:45] jem: it was fixed already on translatewiki, so it should be fixed soon: https://translatewiki.net/w/i.php?title=MediaWiki:Vector-opt-out/es&diff=next&oldid=13050998 [10:29:42] RESOLVED: JobUnavailable: 
Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:50] FIRING: [20x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:46] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791622 (10MoritzMuehlenhoff) [10:32:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [10:32:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [10:32:38] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791624 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel [10:34:27] PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100% [10:34:37] jynus: thanks and yes, I had checked in translatewiki.net [10:35:34] Usually I would just wait, but this is being seen from every article and I thought it would be worth a quicker fix [10:35:51] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw [10:36:45] I have created the local message for eswiki, it seems a purge is needed... 
but I don't know where (not in MediaWiki:Sidebar, it seems) [10:37:14] yeah, that won't work, as it would need a purge to every cached page [10:37:19] Ugh [10:37:23] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370 (10LSobanski) 03NEW [10:37:25] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370#10791636 (10LSobanski) [10:37:48] jem: that's why I believe strings are only updated every some time [10:37:53] I think sidebar cache is turned on [10:38:17] But then also cache for logged out users would still be polluted even with that purged [10:38:42] FIRING: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:05] RECOVERY - SSH on prometheus1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:07] RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [10:40:42] jem: I don't know enough about the deployment process, but if you are around during the train update someone more knowledgeable may be able to answer you [10:41:11] Thanks, jynus... 
in this channel, I guess [10:41:39] yes, check the schedule at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1000 [10:42:03] Could try pinging releng [10:42:19] But not sure how much they'd know about the specifics of the sidebar [10:42:55] hashar, jeena: any ideas ^ [10:44:51] FIRING: [20x] ProbeDown: Service prometheus1007:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:52] (03PS1) 10Elukey: admin_ng: disable PSP mutations for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141858 (https://phabricator.wikimedia.org/T369493) [10:46:30] Thanks... I'll be checking from time to time [10:49:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791662 (10MoritzMuehlenhoff) [10:49:53] PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:17] 06SRE, 07SRE-Unowned, 07SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#10791665 (10BZPN2) Maybe it's worth trying to index Wikinews through the Google Publisher Center panel, maybe that speeds up the process somehow? Also, for the site to be indexed in Google Ne... [10:51:26] 06SRE, 07SRE-Unowned, 06WMF-Legal, 07SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437#10791667 (10BZPN2) Maybe it's worth trying to index Wikinews through the Google Publisher Center panel, maybe that speeds up the pr... 
[10:52:07] RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:52:29] PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:53:35] (03PS14) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [10:54:09] RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [10:54:48] Anyway: I have created the right local message (without the /es, my mistake) and now it is fixed for me in all pages [10:55:40] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [10:55:50] ... and it seems the main menu isn't shown to logged out users (!) [10:56:42] jem: it is, it just defaults to the hamburger menu on top [10:57:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [10:57:51] Ah, yes [10:57:57] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [10:58:16] jem: https://i.imgur.com/DsSUrQp.png [10:58:16] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791671 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel [10:58:17] I'm really trying to get used to the new Vector, but... 
[10:58:42] RESOLVED: JobUnavailable: Reduced availability for job sql_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:48] Anyway, the "Switch to old version" text doesn't appear there [11:01:14] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10791672 (10Silvan_WMDE) True, the deployment had not actually happened when [[ https:... [11:04:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [11:05:01] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report puppetdb_virtual (instance netbox1003) - https://phabricator.wikimedia.org/T393370#10791680 (10Volans) 05Open→03Resolved a:03Volans There was no last run reported on the script page, re-run it manually and that r... 
[11:05:13] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on backup[1010-1014].eqiad.wmnet with reason: Upgrade and restart [11:05:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791683 (10MoritzMuehlenhoff) [11:05:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet [11:05:54] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791684 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: revert kernel [11:09:56] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10791692 (10MoritzMuehlenhoff) >>! In T393146#10791431, @elukey wrote: > My 2c: before starting we should decide if what controller we want to use, because i... 
[11:11:44] (03PS15) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:12:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791698 (10Jelto) [11:12:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet [11:13:48] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:14:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791702 (10MoritzMuehlenhoff) [11:15:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791703 (10MoritzMuehlenhoff) [11:15:18] 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791704 (10MoritzMuehlenhoff) [11:17:26] (03PS16) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:21:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10791709 (10Volans) Have you considered just downtiming the affected... 
[11:26:05] (03CR) 10Slyngshede: [C:03+2] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:26:37] (03PS1) 10Abijeet Patro: Disable Special:ContentTranslationStats page [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) [11:26:53] (03PS1) 10Abijeet Patro: Disable APIs used in Special:ContentTranslationStats [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) [11:27:12] (03PS1) 10Abijeet Patro: Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) [11:27:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [11:28:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [11:28:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [11:31:11] (03PS1) 10Hoo man: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532) [11:31:45] (03CR) 10JMeybohm: [C:03+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey) [11:34:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791722 (10MoritzMuehlenhoff) p:05Triage→03High [11:34:27] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [11:38:33] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [11:39:20] (03PS1) 10Slyngshede: P:ldap::client::ldaptui move files [puppet] - 10https://gerrit.wikimedia.org/r/1141871 [11:40:19] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5449/console" [puppet] - 10https://gerrit.wikimedia.org/r/1141871 (owner: 10Slyngshede) [11:41:54] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ldap::client::ldaptui move files [puppet] - 10https://gerrit.wikimedia.org/r/1141871 (owner: 10Slyngshede) [11:43:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [11:44:31] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [11:44:43] (03Abandoned) 10Cyndywikime: Regenerate speed-test snapshot without GENewcomerTasksGuidanceEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138350 (https://phabricator.wikimedia.org/T379568) (owner: 10Cyndywikime) [11:44:50] FIRING: [20x] 
ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:54] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [11:46:34] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [11:46:35] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host prometheus2006.codfw.wmnet [11:49:04] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [11:49:23] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [11:49:37] (03PS1) 10Slyngshede: P:ldap::client::ldaptui correct paths [puppet] - 10https://gerrit.wikimedia.org/r/1141875 [11:49:46] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [11:49:46] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:49:50] FIRING: [22x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5450/co" [puppet] - 10https://gerrit.wikimedia.org/r/1141875 (owner: 10Slyngshede) [11:51:58] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ldap::client::ldaptui correct paths [puppet] - 
10https://gerrit.wikimedia.org/r/1141875 (owner: 10Slyngshede) [11:52:21] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:52:21] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:53:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [11:55:21] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:55:21] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:55:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:55:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:56:00] 06SRE: soft lockup on prometheus, centrallog, vrts hosts with the new kernel - https://phabricator.wikimedia.org/T393357#10791747 (10fgiunchedi) 05Open→03Invalid I'm resolving this in favor of T393366 [11:56:16] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [11:56:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:58:04] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [11:58:56] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host prometheus2006.codfw.wmnet [11:59:41] !log aqu@deploy1003 Started deploy [analytics/refinery@dbfa557] (hadoop-test): Deploying new refinery/source artifacts TEST [analytics/refinery@dbfa557d] [11:59:50] FIRING: [23x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:03] jmm@cumin2002 drain-node (PID 2637159) is awaiting input [12:00:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10791759 (10fgiunchedi) [12:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:34] !log aqu@deploy1003 Finished deploy [analytics/refinery@dbfa557] (hadoop-test): Deploying new refinery/source artifacts TEST [analytics/refinery@dbfa557d] (duration: 00m 53s) [12:01:15] !log aqu@deploy1003 Started deploy [analytics/refinery@dbfa557]: Deploying new refinery/source artifacts [analytics/refinery@dbfa557d] [12:22:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [12:27:22] jmm@cumin2002 drain-node (PID 2666612) is awaiting input [12:27:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [12:28:07] !log Rolling reboot of Prometheus nodes in eqiad (1005, 1006, 1008) to rollback the kernel [12:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:45] PROBLEM - Host prometheus1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:00] (03CR) 10Muehlenhoff: [C:03+2] Install linux-sysctl-defaults on trixie 
[puppet] - 10https://gerrit.wikimedia.org/r/1141585 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:32:04] (03PS1) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141893 (https://phabricator.wikimedia.org/T392539) [12:33:15] (03CR) 10Andrew Bogott: [C:03+2] Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141893 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [12:33:19] RECOVERY - Host prometheus1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [12:34:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [12:34:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [12:34:40] (03PS1) 10Andrew Bogott: Revert "Make cloudrabbit200[123] into rabbitmq nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1141894 [12:34:50] FIRING: [21x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:49] (03CR) 10Andrew Bogott: [C:03+2] Revert "Make cloudrabbit200[123] into rabbitmq nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1141894 (owner: 10Andrew Bogott) [12:39:02] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [12:39:37] PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:50] FIRING: [21x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:23] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add support for Broadcom RAID controllers 
using storcli - https://phabricator.wikimedia.org/T393146#10791808 (10elukey) Maybe I got the wrong PCI via lspci, but I see: ` elukey@ms-be1091:~$ lspci -nn | grep -i sas 98:00.0 Serial Attached SCSI controller [0... [12:41:11] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:41:25] RESOLVED: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:59] (03CR) 10Ayounsi: [C:03+1] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [12:42:11] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:42:19] RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [12:42:28] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [12:43:56] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [12:44:31] (03CR) 10Ayounsi: [C:03+1] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [12:44:50] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:53] PROBLEM - Host prometheus1008 is DOWN: 
PING CRITICAL - Packet loss = 100% [12:45:29] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [12:47:48] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [12:48:07] (03Abandoned) 10Hashar: CI: diff against parent commit instead of remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [12:48:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10791818 (10Gehel) [12:48:21] RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [12:48:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10791832 (10Gehel) [12:49:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:25] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10791850 (10Gehel) [12:49:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:50] FIRING: [20x] ProbeDown: Service 
prometheus1008:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:45] 07sre-alert-triage, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10791918 (10Gehel) [12:52:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10791933 (10Gehel) [12:52:23] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10791927 (10Gehel) [12:52:29] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10791931 (10Gehel) [12:52:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10791937 (10Gehel) [12:52:45] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [12:52:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10791935 (10Gehel) [12:52:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10791939 (10Gehel) [12:52:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade 
an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10791941 (10Gehel) [12:53:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10791943 (10Gehel) [12:53:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10791945 (10Gehel) [12:54:03] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025-05-02 - 2025-05-23): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10791951 (10Gehel) [12:55:04] (03PS2) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) [12:55:08] (03CR) 10Stevemunene: [C:03+2] [analytics] Refine deterministic transform deduplication [puppet] - 10https://gerrit.wikimedia.org/r/1141884 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:55:52] (03PS1) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) [12:56:04] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [12:57:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:44] (03PS2) 10Andrew Bogott: Make cloudrabbit200[123] into rabbitmq nodes [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) [12:57:47] (03CR) 10Muehlenhoff: [C:03+2] Stop passing 
krb2002 to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:57:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141896 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [12:57:53] PROBLEM - Hadoop NodeManager on an-worker1193 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:59:05] PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:59:30] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [13:00:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1300). [13:00:05] abijeet, tchin, and Cyndywikime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:21] o/ [13:00:37] here. 
I'll deploy abijeet's patches [13:01:56] o/ [13:02:49] abijeet Not sure why IRC is not showing autocompletion of your nick ;) [13:03:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:03:40] hi kart_, thanks [13:03:55] jmm@cumin2002 drain-node (PID 2706496) is awaiting input [13:04:11] !log rebooting centrallog1002 to rollback the kernel [13:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:04:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10792018 (10tappof) [13:05:27] PROBLEM - Host centrallog1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:54] (03Merged) 10jenkins-bot: Disable Special:ContentTranslationStats page [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141866 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:06:12] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]] [13:06:12] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:06:17] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [13:06:17] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:06:17] T325790: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 [13:06:29] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [13:06:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [13:06:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:07:21] (03CR) 10Muehlenhoff: [C:03+2] Initial Puppet agent apt config for Puppet 7 in Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140659 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [13:07:53] RECOVERY - Host centrallog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [13:07:58] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:08:06] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [13:08:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:08:29] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:08:31] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [13:08:59] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:09:00] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [13:09:19] PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down 
as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:09:27] PROBLEM - Bird Internet Routing Daemon on centrallog1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:09:39] (03CR) 10Elukey: [C:03+2] admin_ng: disable PSP mutations for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141858 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [13:09:40] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:09:41] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [13:09:50] FIRING: [20x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:57] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [13:10:53] RECOVERY - Hadoop NodeManager on an-worker1193 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:10:55] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:09] abijeet, Please test [13:11:19] RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3993) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:11:19] RECOVERY - 
BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:20] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:11:24] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:11:27] RECOVERY - Bird Internet Routing Daemon on centrallog1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:11:33] (03CR) 10Volans: [C:03+2] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [13:11:39] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:46] kart_, on it [13:12:15] Special:CX seems disabled. [13:12:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [13:12:47] We will get some 404s with the first patch, but that seems fine? [13:12:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [13:12:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10792050 (10Jclark-ctr) a:03VRiley-WMF [13:13:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:13:40] (03CR) 10TChin: [C:03+1] Stream config for edge uniques on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [13:14:25] (03CR) 10TChin: [C:03+1] "Ah! Can I just +2 this then?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [13:14:41] kart_, looks good [13:14:46] (03CR) 10Volans: "@sukhe: the hiddenparma cookbook is currently unowned awaiting for an official owner, see T383809. If Traffic wants to own it that would b" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:14:47] cool. [13:14:50] FIRING: [21x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:51] !log kartik@deploy1003 kartik, abi: Continuing with sync [13:14:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: 10Novem Linguae) [13:16:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:39] (03PS6) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [13:18:55] !log depooling cp7001 to test new haproxykafka version (T393016) [13:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016 
[13:19:15] (03Merged) 10jenkins-bot: sre.hosts: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [13:19:23] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [13:20:06] !log disabled puppet on cp7001 to test haproxykafka version (T393016) [13:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10792119 (10Jclark-ctr) This Server is out of warranty Please advise if you would like me to and if i am able to Swap with drive from Decom server [13:21:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10792120 (10Jclark-ctr) a:03Jclark-ctr [13:21:42] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141866|Disable Special:ContentTranslationStats page (T392839 T325790)]] (duration: 15m 29s) [13:21:46] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [13:21:46] T325790: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 [13:22:05] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:23:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:23:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10792126 (10Jclark-ctr) 05Open→03Resolved ` Personalities : [raid10] [linear] 
[multipath] [raid0] [raid1] [raid6] [raid5] [raid4] md2 : active raid10 sdg2[4] sdd2[0] sdh2[3] sdf2[1] 3701655552 blo... [13:23:10] abijeet, on 2nd patch now [13:24:55] (03PS2) 10Volans: Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 [13:25:05] RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:27:05] PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:27:32] (03CR) 10AOkoth: "That's the ID of the junk queue. If you run `SELECT * FROM queue WHERE id = 3` on the database you'll see it." [puppet] - 10https://gerrit.wikimedia.org/r/1140207 (https://phabricator.wikimedia.org/T389079) (owner: 10AOkoth) [13:27:46] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:27:56] kart_, ok [13:28:19] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudcontrol200[789] [puppet] - 10https://gerrit.wikimedia.org/r/1141568 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [13:29:20] (03PS9) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [13:29:21] (03CR) 10Volans: "@fceratto@wikimedia.org are you ok too with the change? Just making sure to not step on other refactors that might be happening." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans) [13:29:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:29:59] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:32:05] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:32:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10792170 (10Jclark-ctr) @matthewvernon thanos-fe100[1-3] are R440's but no the XD2 servers use a 730mini raid... [13:33:14] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10792184 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All RAID10 servers which were upgraded to 6.1.135, are... 
[13:33:17] (03PS3) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [13:33:21] (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [13:33:26] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:33:27] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:34:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [13:34:15] (03Merged) 10jenkins-bot: Disable APIs used in Special:ContentTranslationStats [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141867 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:34:32] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]] [13:34:35] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [13:34:48] (03CR) 10Ssingh: "Thanks, the changes look good but I also wanted to keep the options around in case --help was passed. 
But I am guessing the expectation is" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:36:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:02] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:39:13] abijeet, Please test 2nd patch. I'll also +2 the 3rd patch to save time. [13:39:42] jmm@cumin2002 drain-node (PID 2740892) is awaiting input [13:39:55] kart_, yes, please [13:40:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [13:40:12] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [13:40:13] (03CR) 10KartikMistry: [C:03+2] Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:40:51] kart_, looks ok [13:41:09] cool [13:41:12] !log kartik@deploy1003 kartik, abi: Continuing with sync [13:41:13] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:41:14] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:42:09] (03CR) 10Volans: "They options are added 
to the parser automatically, the output of `--help` will not change much (at most order and wording) and in additio" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:42:26] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:42:28] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=1) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:43:07] (03CR) 10Ssingh: [C:03+1] "Ah, my bad then. And yes, this is even better!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:43:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:43:22] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [13:43:22] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=1) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:43:35] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:43:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [13:44:59] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#10792278 (10SLyngshede-WMF) 05Open→03Invalid [13:46:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [13:46:10] (03CR) 10Volans: [C:03+2] Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 
10Volans) [13:46:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [13:47:41] (03CR) 10Volans: "Kind ping to the data platform team seeking a review" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans) [13:47:56] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141867|Disable APIs used in Special:ContentTranslationStats (T392839)]] (duration: 13m 23s) [13:47:59] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [13:48:10] abijeet, on 3rd patch now.. [13:48:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:48:36] kart_, ok [13:49:05] RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:49:35] ok [13:51:19] (03CR) 10Elukey: [C:03+2] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey) [13:51:37] tchin: Your patch can be +2 directly, AFAIK, but let's follow deployment protocol :) [13:52:12] :) [13:52:16] * tchin sounds good to me [13:52:20] (03Merged) 10jenkins-bot: Remove links to Special:ContentTranslationStats from dashboards [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141868 (https://phabricator.wikimedia.org/T392839) (owner: 10Abijeet Patro) [13:52:38] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1141868|Remove links to Special:ContentTranslationStats from dashboards 
(T392839)]] [13:53:32] (03Merged) 10jenkins-bot: Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:54:47] (03CR) 10Jcrespo: [C:03+1] "No need to make things smaller. I will just move these backups to a dedicated server to handle the extra data. I think this is ok, just ne" [puppet] - 10https://gerrit.wikimedia.org/r/1140506 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [13:56:13] (03CR) 10Hashar: [C:03+1] "Outside of a deployment window, yes (else the person currently deploying will have an extra patch showing up and that might raise a warnin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [13:57:03] (03CR) 10Volans: [C:03+2] Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans) [13:57:12] I think 3rd patch will slow, due to l10n changes - abijeet [13:58:27] kart_, uh. [14:00:48] (03PS1) 10Bking: cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610) [14:02:09] (03Merged) 10jenkins-bot: k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey) [14:02:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10792400 (10jcrespo) Thank you, @Marostegui for taking care about this. 
[14:04:51] (03Merged) 10jenkins-bot: Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans) [14:07:40] (03CR) 10Ebernhardson: [C:03+1] cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:08:31] hmm, thats a lot of time [14:09:44] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792415 (10ssingh) >>! In T392848#10791709, @Volans wrote: > Have y... [14:10:53] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1141868|Remove links to Special:ContentTranslationStats from dashboards (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:55] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [14:11:22] (03PS6) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) [14:11:24] abijeet, Please test :) [14:12:32] kart_, checking [14:13:58] (03PS1) 10Jdlrobson: Nearby should show file namespace on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133) [14:14:01] kart_, we'll have to rollback this one. This needs a CX build. [14:14:12] ah [14:14:32] !log kartik@deploy1003 Sync cancelled. [14:14:33] kart_, apologies. [14:14:46] I'll do revert and deploy. No worries. 
[14:14:58] (03PS1) 10Kamila Součková: benthos/mw-accesslog-metrics: set start_offset to latest [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) [14:15:53] (03PS1) 10KartikMistry: Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906 [14:16:34] tchin: let's go with your change if you're around? [14:16:37] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [14:17:01] or Cyndywikime if you're around. [14:17:06] yes [14:17:17] OK. Let's go with your change first. [14:17:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10792478 (10Volans) p:05Triage→03Medium [14:17:32] :) [14:18:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime) [14:19:43] (03Merged) 10jenkins-bot: Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime) [14:19:54] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [14:19:56] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]] [14:19:59] T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566 [14:20:47] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: 
Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10792524 (10Volans) p:05Triage→03Medium [14:23:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [14:23:37] !log uploading haproxykafka 0.3.10 on apt repo (T393016) [14:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:39] T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016 [14:23:41] (03PS10) 10Andrew Bogott: cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) [14:23:41] (03PS1) 10Andrew Bogott: nova policy: permit GET for os-server-groups and and os-flavor-extra-specs [puppet] - 10https://gerrit.wikimedia.org/r/1141909 [14:24:21] (03PS1) 10Elukey: admin_ng: enforce PSS on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141910 (https://phabricator.wikimedia.org/T369493) [14:24:47] (03CR) 10Andrew Bogott: [C:03+2] nova policy: permit GET for os-server-groups and and os-flavor-extra-specs [puppet] - 10https://gerrit.wikimedia.org/r/1141909 (owner: 10Andrew Bogott) [14:25:56] !log kartik@deploy1003 cyndywikime, kartik: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:59] T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566 [14:27:09] Cyndywikime: Please test! [14:27:20] ok :) [14:27:27] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [14:27:39] !log enable puppet and repooled cp7001 (T393016) [14:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:26] @kart, LGTM!Thanks :) [14:29:13] cool. Deploying. 
[14:29:16] !log kartik@deploy1003 cyndywikime, kartik: Continuing with sync [14:31:29] (03CR) 10KartikMistry: [C:03+2] Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906 (owner: 10KartikMistry) [14:32:03] Apologies, I've to deploy this revert as well ^^ [14:32:27] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10792668 (10Jelto) >>! In T378922#10780806, @MatthewVernon wrote: > Second, I am not an expert at this, but I think you need `"Prin... [14:32:40] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10792670 (10Jelto) [14:34:11] (03Merged) 10jenkins-bot: Revert "Remove links to Special:ContentTranslationStats from dashboards" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141906 (owner: 10KartikMistry) [14:38:29] (03CR) 10Elukey: [C:03+2] admin_ng: enforce PSS on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141910 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:38:42] !log upgrading haproxykafka to version 0.3.10 on A:cp (T393016) [14:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:44] T393016: haproxykafka service isn't restarted when upgraded - https://phabricator.wikimedia.org/T393016 [14:39:28] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131696|Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags (T379566)]] (duration: 19m 32s) [14:39:31] T379566: Remove obsolete Feature Flags - https://phabricator.wikimedia.org/T379566 [14:40:32] !log kartik@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]] [14:42:36] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:44:22] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:45:03] (03CR) 10SBassett: [C:03+1] "I generally support this, as most folks on the secteam would, I assume. But this should at least tie back to a phabricator task and/or be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140702 (owner: 10Zabe) [14:45:26] (03PS2) 10Filippo Giunchedi: benthos/mw-accesslog-metrics: set start_offset to latest [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [14:45:34] (03CR) 10Fabfur: [C:03+2] haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [14:45:54] (03PS3) 10Filippo Giunchedi: benthos/mw-accesslog-metrics: start_from_oldest: false [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [14:47:04] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I've edited the option name to match our benthos version" [puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [14:47:23] (03PS1) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) [14:48:29] (03PS2) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) [14:49:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [14:49:31] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Update labels on 
cloudcontrol200[789]-dev.codfw - https://phabricator.wikimedia.org/T393347#10792749 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm We just need to change the external labels on the server. This has been done. Thank you for t... [14:55:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [14:58:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [14:58:47] !log kartik@deploy1003 kartik: Backport for [[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:00:38] !log kartik@deploy1003 kartik: Continuing with sync [15:01:04] (03PS3) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) [15:01:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [15:01:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [15:02:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [15:03:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [15:04:50] FIRING: [19x] ProbeDown: Service ganeti1028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:35] (03PS1) 10Brouberol: mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784) [15:05:59] (03CR) 
10Andrea Denisse: "Thanks for taking a look! I wrote this commit message as adding the feature flag because it specifically introduces and declares the flag " [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:09] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927 [15:07:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [15:07:51] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10792806 (10Jhancock.wm) @Clement_Goubert i believe this server is yours. Is this still failed or did it heal? I'm not finding any evidence in the idrac of a failed disk. If it is valid, I can put in a re... 
[15:07:59] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10792808 (10Jhancock.wm) a:03Jhancock.wm [15:09:37] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927 (owner: 10PipelineBot) [15:11:00] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141906|Revert "Remove links to Special:ContentTranslationStats from dashboards"]] (duration: 30m 27s) [15:11:06] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141927 (owner: 10PipelineBot) [15:11:53] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:12:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:12:16] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:12:39] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:13:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [15:13:58] (03PS1) 10Elukey: kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) [15:14:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [15:15:14] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: clean up a couple of hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1140773 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [15:15:43] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5452/co" [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) 
(owner: 10Elukey) [15:17:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [15:20:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [15:21:27] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792865 (10Volans) No, you're right, `current_state` in the icinga... [15:21:42] (03PS4) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) [15:23:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [15:23:14] (03CR) 10Klausman: [C:03+1] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:23:19] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [15:25:00] (03CR) 10Elukey: [V:03+1 C:03+2] kubernetes: disable PSP for ml-serve-codfw and ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1141928 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:25:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Setting up permissions and view database sanitization for wikis nupwiki in section s5 [15:25:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Setting up permissions and view database sanitization for wikis nupwiki in section s5 [15:26:26] (03PS7) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 
10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [15:26:28] (03CR) 10Hoo man: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man) [15:26:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [15:26:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [15:28:33] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141869 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man) [15:28:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [15:29:49] !log hoo@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:29:50] FIRING: [19x] ProbeDown: Service ganeti1030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1530). 
[15:30:38] !log hoo@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [15:31:12] !log hoo@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [15:31:24] !log dancy@deploy1003 Installing scap version "4.160.0" for 2 host(s) [15:31:47] !log hoo@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [15:32:00] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:32:01] (03CR) 10Bking: [C:03+1] mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol) [15:32:11] !log hoo@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:32:34] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1031.eqiad.wmnet [15:32:37] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:32:41] !log hoo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:32:46] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:33:06] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10792927 (10Eevans) Ok, I forgot to factor something in: Node data //other// than the SSTables. So commitlogs... [15:33:12] !log dancy@deploy1003 Installation of scap version "4.160.0" completed for 2 hosts [15:33:20] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:34:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792930 (10Jhancock.wm) [15:34:52] Is the backport still happening? 
Or can I merge myself since it's just a beta cluster change [15:35:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10792937 (10RobH) @MatthewVernon, Please note that we've ordered 4 new hosts to replace ms-be10[60-63], but t... [15:35:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm [15:35:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err... [15:35:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2048.codfw.wmnet with OS bookworm [15:35:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10792942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with err... 
[15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:25] (03PS4) 10Scott French: hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938) [15:37:29] (03PS3) 10Scott French: hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938) [15:40:38] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10792967 (10Eevans) >>! In T391544#10792925, @Eevans wrote: > > [ ... ] > > If we say 60G (for sake of even... [15:40:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:02] (03PS1) 10Hoo man: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532) [15:41:33] (03CR) 10Hoo man: [C:03+2] Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man) [15:42:26] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10792981 (10ssingh) >>! In T392848#10792865, @Volans wrote: > No, yo... 
[15:42:59] (03PS2) 10Scott French: P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1141916 (https://phabricator.wikimedia.org/T388536) [15:43:16] (03PS5) 10Bking: elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) [15:43:17] (03Merged) 10jenkins-bot: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141935 (https://phabricator.wikimedia.org/T391532) (owner: 10Hoo man) [15:44:43] !log hoo@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:45:01] !log hoo@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [15:45:45] !log hoo@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:46:05] !log hoo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:46:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:46:22] !log hoo@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [15:46:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [15:46:45] !log hoo@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [15:46:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10793011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [15:47:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:48:03] PROBLEM - mailman list info on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:46] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:49:52] (03PS2) 10Muehlenhoff: Extend package list to be installed from component/puppet7 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790) [15:49:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:13] (03CR) 10TChin: [C:03+2] Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [15:52:02] (03Merged) 10jenkins-bot: Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [15:57:41] (03CR) 10RLazarus: [C:03+1] "LGTM, in particular with `retry_on: gateway-error`, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [15:57:41] (03CR) 10AOkoth: [C:03+2] miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:59:26] (03Merged) 10jenkins-bot: miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [16:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:34] 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10793083 (10Papaul) 05Open→03Resolved Complete [16:02:50] !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:03:09] !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:06:31] !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:07:44] !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:09:14] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [16:09:15] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [16:09:32] (03CR) 10Alexandros Kosiaris: [C:03+2] site.pp changes for aux-k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/1140701 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [16:10:43] (03CR) 10Filippo Giunchedi: "> Thanks for taking a look! 
I wrote this commit message as adding the feature flag because it specifically introduces and declares the fla" [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:10:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:56] RESOLVED: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:19] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [16:20:21] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [16:22:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10793153 (10wiki_willy) It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Con... [16:22:11] (03CR) 10Kamila Součková: [C:03+2] "Ah, OK, thanks a lot for the fix and the clarification!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1141905 (https://phabricator.wikimedia.org/T393366) (owner: 10Kamila Součková) [16:22:32] (03CR) 10Brouberol: [C:03+1] elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:24:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:25:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001#10793160 (10Papaul) 05Open→03Resolved a:03Papaul @ayounsi the solution here was to start a shell and run the commands below ` star... [16:29:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [16:30:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [16:30:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:47] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: add cirrussearch11[11-25] to cluster, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1141914 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:30:55] akosiaris: if your puppet-merge comes across my benthos change, feel free to merge it [16:31:08] Raine: my bad, sorry, thanks [16:31:23] np, thanks too [16:31:24] akosiaris what Raine said ;P [16:31:41] {{done}} for both [16:31:42] :D [16:31:44] ty! 
[16:31:51] {◕ ◡ ◕} [16:33:31] (03PS4) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [16:33:51] (03PS2) 10Andrea Denisse: grafana: Toggle data sync using feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) [16:33:51] (03PS4) 10Andrea Denisse: grafana: Add enable_dashboard_sync feature flag in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) [16:34:34] (03PS5) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [16:35:21] (03CR) 10Andrea Denisse: "Thanks for the explanation, I've inverted the order of the commits." [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:35:41] (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [16:38:43] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1111 to cirrussearch1111 [16:39:07] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:40:02] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1140262 (owner: 10Ncmonitor) [16:40:26] (03CR) 10Andrea Denisse: "PCC results after inverting the commit order: https://puppet-compiler.wmflabs.org/output/1140760/5453/" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:40:35] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[16:40:40] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5453/console" [puppet] - 10https://gerrit.wikimedia.org/r/1140760 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:41:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:41:34] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [16:41:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:42:04] (03CR) 10Andrea Denisse: "I ran PCC on the change that uses the default value (true) and it's a NOOP, thanks for the suggestion on testing it like this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [16:43:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:44:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:45:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002" [16:45:25] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:48:27] bking@cumin2002 rename (PID 2927008) is awaiting input [16:48:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1111 to cirrussearch1111 - bking@cumin2002" [16:48:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [16:49:08] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:49:08] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [16:49:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:50:23] (03CR) 10Volans: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [16:54:45] bking@cumin2002 rename (PID 2927008) is awaiting input [16:54:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:56:53] (03PS6) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [16:57:05] (03PS7) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [16:57:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:03] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [16:58:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [17:00:05] swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late) 
deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1700). [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T1700). [17:00:14] o/ [17:01:12] (03CR) 10Scott French: [C:03+2] mw:periodic_job:kubernetes: quote job description [puppet] - 10https://gerrit.wikimedia.org/r/1140548 (owner: 10Scott French) [17:02:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:28] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [17:04:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:13] (03CR) 10Scott French: [C:03+2] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1141916 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [17:13:41] (03CR) 10Zabe: [C:03+1] manage-dblist: Fix indentation and stray blank line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139487 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [17:14:29] (03CR) 10Zabe: [C:03+1] manage-dblist: Fix some random phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139488 
(https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [17:15:48] (03PS8) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [17:16:03] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:16:11] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:17:39] (03CR) 10Bking: sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [17:19:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rolling back cirrussearch1111 to elastic1111 - bking@cumin2002" [17:19:51] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rolling back cirrussearch1111 to elastic1111 - bking@cumin2002" [17:19:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:19:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from elastic1111 to cirrussearch1111 [17:20:25] (03PS4) 10Jdlrobson: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [17:23:56] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [17:24:03] (03CR) 10Scott French: [C:03+2] P:mediawiki::maintenance::pageassessments: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140266 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [17:24:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:32] (03PS1) 10Volans: setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941 [17:29:30] (03CR) 10Volans: [C:04-1] sre.hosts.rename: wipe DNS cache after rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [17:30:24] (03PS9) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [17:30:49] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:30:56] (03PS1) 10Ssingh: CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943 [17:30:57] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:31:10] (03PS3) 10Herron: logs-api: add write/delete acl via htgroup [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) [17:31:35] (03CR) 10CI reject: [V:04-1] CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh) [17:32:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10793398 (10VRiley-WMF) a:03VRiley-WMF [17:32:58] 
(03CR) 10Ssingh: "So it fails as expected and as it should. Not sure why it didn't show up in I85edb13bf2b678e3414de9bfd7383ac877145f49 but abandoning." [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh) [17:33:17] (03Abandoned) 10Ssingh: CI check: see if tabs fail [dns] - 10https://gerrit.wikimedia.org/r/1141943 (owner: 10Ssingh) [17:38:08] no more changes planned on my end for this infra window [17:38:50] (03PS10) 10Bking: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) [17:41:15] (03PS1) 10Herron: add dummy write group for testing [labs/private] - 10https://gerrit.wikimedia.org/r/1141944 [17:43:15] (03CR) 10Herron: [V:03+2 C:03+2] add dummy write group for testing [labs/private] - 10https://gerrit.wikimedia.org/r/1141944 (owner: 10Herron) [17:45:18] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [17:45:59] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:25] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:28] (03CR) 10Herron: "Thanks for checking it out. I switched the config from to and require method which is looking better to me. 
Also ad" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [17:47:49] (03CR) 10Brouberol: [C:03+1] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [17:48:25] (03PS4) 10Brouberol: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [17:48:27] (03CR) 10Brouberol: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [17:49:37] vriley@cumin1002 provision (PID 2538605) is awaiting input [17:50:12] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: rename Secret key associated to private key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141926 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol) [17:52:46] (03CR) 10Brouberol: [C:04-1] "We first need to absent all resources before we can remove the code" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [17:52:53] (03PS5) 10Brouberol: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [17:54:25] vriley@cumin1002 provision (PID 2538605) is awaiting input [17:57:12] 10ops-codfw, 10ops-eqiad, 06SRE-OnFire, 10Cassandra, and 4 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster - https://phabricator.wikimedia.org/T393406 (10Eevans) 03NEW [17:57:16] (03PS7) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [17:57:21] (03CR) 10Umherirrender: Improve function and property 
documentation for php code (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [18:00:02] (03CR) 10Xcollazo: "Thanks for review Joal." [puppet] - 10https://gerrit.wikimedia.org/r/1140765 (https://phabricator.wikimedia.org/T393202) (owner: 10Xcollazo) [18:02:04] !log aokoth@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [18:02:26] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:02:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:28] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:05:21] (03PS1) 10Scott French: alertmanager: add receiver and routing for moderator-tools tasks [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395) [18:05:21] (03CR) 10Scott French: "It turns out PageTriage is actually owned by Moderator Tools, rather than Community Tech, so this and the next patch update the notificati" [puppet] - 10https://gerrit.wikimedia.org/r/1141945 (https://phabricator.wikimedia.org/T393395) (owner: 10Scott French) [18:05:24] (03PS1) 10Scott French: mw::maintenance: update team for pagetriage jobs [puppet] - 10https://gerrit.wikimedia.org/r/1141946 (https://phabricator.wikimedia.org/T393395) [18:07:28] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [18:07:29] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE 
helmfile.d/services/miscweb: apply [18:07:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:09:34] (03PS1) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) [18:11:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [18:12:20] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10793553 (10VRiley-WMF) Upon request, I have added (2) 480 Gig SSDs per server sessionstore1004, sessionstore1... 
[18:12:22] !log aokoth@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:15:40] (03PS2) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) [18:22:36] (03PS2) 10Volans: setup.py: update kubernetes dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141941 [18:22:36] (03PS1) 10Volans: setup.py: update redis dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1141949 [18:37:08] 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10793611 (10Eevans) [18:37:35] (03CR) 10JHathaway: [C:03+2] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1140795 (owner: 10JHathaway) [18:37:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:19] 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (eqiad) - https://phabricator.wikimedia.org/T393408 (10Eevans) 03NEW [18:41:37] (03CR) 10Jdrewniak: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [18:44:22] 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10793638 (10Jhancock.wm) installed and detected by servers [18:44:24] 10SRE-Access-Requests, 06Experimentation Lab: 
Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409 (10JVanderhoop-WMF) 03NEW [18:48:36] 10ops-eqiad, 06SRE-OnFire, 10Cassandra, 06DC-Ops, and 3 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (eqiad) - https://phabricator.wikimedia.org/T393408#10793653 (10VRiley-WMF) 05Open→03Resolved Added (x2) 480Gig SSD drives to sessionstore1004, sessionstore1005, sess... [18:48:55] (03PS1) 10JHathaway: Revert "systemd::sysuser: create the user synchronously in the define" [puppet] - 10https://gerrit.wikimedia.org/r/1141952 [18:49:26] (03CR) 10Kgraessle: [C:03+1] "LGTM based on the diff; was unable to test the loading of the survey using the js module." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [18:52:01] (03CR) 10JHathaway: [C:03+2] Revert "systemd::sysuser: create the user synchronously in the define" [puppet] - 10https://gerrit.wikimedia.org/r/1141952 (owner: 10JHathaway) [19:01:47] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [19:04:35] (03PS1) 10Ryan Kemper: wdqs-main: switch old internal hosts to main graph [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134) [19:04:37] (03PS1) 10Ryan Kemper: wdqs-main: bring old internal hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) [19:17:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10793728 (10VRiley-WMF) [19:18:27] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd100[1-3] - 
https://phabricator.wikimedia.org/T393196#10793730 (10VRiley-WMF) 05Open→03Resolved These have been decommed [19:23:18] 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10793743 (10Ahoelzl) Approved. [19:26:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793748 (10VRiley-WMF) [19:26:20] (03PS1) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 [19:26:37] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [19:26:43] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10793749 (10Scott_French) After a bit of thought and some back-testing over the last 2 months of data, ht... [19:27:37] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1184.eqiad.wmnet with OS bullseye [19:27:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS b... 
[19:28:28] (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [19:29:50] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:50] (03CR) 10Ladsgroup: "Can I merge this now?" [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [19:35:17] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:37:30] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:43:08] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage [19:43:20] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10793784 (10VirginiaPoundstone) Approved. 
[19:46:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1184.eqiad.wmnet with reason: host reimage [19:47:21] (03PS3) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) [19:47:54] (03CR) 10Ryan Kemper: [C:03+1] "self merge, hosts not in prod" [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [19:47:55] (03CR) 10Ryan Kemper: [C:03+2] wdqs-main: switch old internal hosts to main graph [puppet] - 10https://gerrit.wikimedia.org/r/1141956 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [19:49:46] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:52:05] herron: cool if I merge b412424? 
(labs/private) [19:52:22] ryankemper: thanks please do [19:52:31] done [19:53:23] (03PS2) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 [19:53:31] (03CR) 10Ryan Kemper: [C:03+2] sre.wdqs.data-transfer: improve graph type checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [19:55:28] (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [19:57:04] (03PS4) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) [19:58:06] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [19:58:34] (03PS5) 10Jsn.sherman: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2000). [20:00:05] danisztls and JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:25] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:41] here [20:01:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:32] o/ [20:05:25] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:12] (03CR) 10RLazarus: [C:03+1] hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French) [20:06:37] do we have a deployer around? I can self deploy if not. [20:06:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:10] danisztls: I can probably deploy for you too; it looks like config?
[20:07:13] JSherman: doesn't look like it [20:07:39] RhinosF1: ack [20:07:47] JSherman: it's just config, thanks [20:08:02] mmk, let me get myself set up [20:08:28] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:10:25] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:28] !incidents [20:10:29] No incidents occurred in the past 24 hours for team SRE [20:10:36] yeah [20:10:46] this is the page from the weekend [20:10:46] do we need a downtime? [20:11:01] I marked as resolved as host is depooled [20:11:07] otherwise it would ping again [20:11:09] danisztls: okay, I'm going to deploy us together since we're both doing survey config [20:11:13] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:11:15] JSherman: ok [20:11:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [20:11:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [20:11:50] sukhe: ah, great - thanks for marking resolved [20:12:07] (03Merged) 10jenkins-bot: Design Research Participant Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141569 (https://phabricator.wikimedia.org/T392325) (owner: 10DDesouza) [20:12:09]
(03Merged) 10jenkins-bot: Deploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141947 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [20:13:16] hmm, we had an unexpected commit from friday [20:13:27] sukhe, swfrench-wmf: thanks both! [20:13:28] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:14:20] vriley@cumin1002 reimage (PID 2549027) is awaiting input [20:14:49] looks like it's for labs settings; proceeding [20:15:06] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]] [20:15:10] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:15:11] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:15:25] FIRING: [15x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:33] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:15:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1184.eqiad.wmnet with OS bullseye [20:15:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10793857 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS bulls... [20:18:28] FIRING: [5x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:20:25] FIRING: [19x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:11] !log jsn@deploy1003 dani, jsn: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:14] T392325: QuickSurvey request for Design Research participant database recruitment - https://phabricator.wikimedia.org/T392325 [20:21:15] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:21:29] test servers barfed the first time; were happy on retry [20:21:44] danisztls: please test [20:21:58] JSherman: done, looks good [20:22:39] excellent; it's going to take me a minute to test mine [20:23:28] RESOLVED: [5x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:24:17] FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:25] FIRING: [23x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on 
apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:07] okay, good on my end; proceeding [20:28:10] !log jsn@deploy1003 dani, jsn: Continuing with sync [20:28:41] (03CR) 10RLazarus: [C:03+1] hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French) [20:29:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:25] FIRING: [23x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:27] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:33:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:05] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141569|Design Research Participant Survey: Undeploy (T392325)]], [[gerrit:1141947|Deploy first set of Patroller Tools surveys (T389401)]] (duration: 19m 58s) [20:35:08] T392325: QuickSurvey request for Design Research participant 
database recruitment - https://phabricator.wikimedia.org/T392325 [20:35:09] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:35:48] danisztls: okay, we're done! [20:35:48] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:35:48] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:35:50] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:35:50] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:36:40] JSherman: thanks again [20:36:58] no prob [20:38:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:05] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793912 (10VRiley-WMF) [20:40:06] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:40:06] PROBLEM - Blazegraph process -wdqs-categories- on 
wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:40:06] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:40:06] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:40:06] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:41:35] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:41:55] aaand I see that I messed up the privacy statement link for my backport [20:42:20] I'm going to backport that too since we're still in the window [20:42:27] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:43:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:44:22] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:44:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2015:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:44:39] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs[2008,2014-2015].codfw.wmnet,wdqs[1011,1016].eqiad.wmnet with reason: T388134 [20:44:42] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:44:54] Sorry for the wdqs noise, downtimed these hosts. Their systemd units won't be happy until their data transfers complete [20:46:34] (03PS1) 10Jsn.sherman: Fix link for first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401) [20:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:47:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [20:48:03] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt apus-fe1003 - vriley@cumin1002" [20:48:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt apus-fe1003 - vriley@cumin1002" [20:48:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:48:33] (03Merged) 10jenkins-bot: Fix link for first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141969 (https://phabricator.wikimedia.org/T389401) 
(owner: 10Jsn.sherman) [20:48:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793945 (10VRiley-WMF) [20:48:49] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]] [20:48:51] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:49:22] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe1003 [20:49:47] (03CR) 10Bking: [C:03+1] "Conditional +1. Feel free to merge if you're OK with using 1017 as a main host, as opposed to internal-main." [puppet] - 10https://gerrit.wikimedia.org/r/1141957 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:49:53] ryankemper@cumin2002 reimage (PID 3129320) is awaiting input [20:50:23] (03PS1) 10Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036) [20:50:26] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe1003 [20:51:51] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:52:38] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036) (owner: 10Clare Ming) [20:54:18] (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141970 (https://phabricator.wikimedia.org/T390036) (owner: 10Clare Ming) [20:55:26] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:47] !log jsn@deploy1003 jsn: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:55:47] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:55:50] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:56:09] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:56:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:52] !log jsn@deploy1003 jsn: Continuing with sync [20:58:10] verified that the patch made things happy [20:58:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793954 (10VRiley-WMF) [20:59:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [20:59:57] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm [21:00:03] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: 
Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793957 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm [21:00:04] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2100). [21:01:10] noting that this is running slightly over the time window; currently @ 60% for k8s deployment [21:03:33] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141969|Fix link for first set of Patroller Tools surveys (T389401)]] (duration: 14m 43s) [21:03:35] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [21:04:18] okay, done! [21:04:49] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10793975 (10ArthurPSmith) Yes, P13552 did not result in delay, although I did wait som... [21:04:54] things look happy and I'm outta here. Feel free to @ me on slack if followup is needed. [21:13:39] vriley@cumin1002 reimage (PID 2563356) is awaiting input [21:14:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.wikimedia.org with OS bookworm [21:14:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors: - apus-... 
[21:15:19] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.wikimedia.org with OS bookworm [21:15:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10793986 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm [21:20:25] (03PS11) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [21:20:54] (03PS12) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [21:21:26] (03PS13) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [21:23:46] 10ops-codfw, 10ops-eqiad, 06SRE, 06SRE-OnFire, and 5 others: Provision some spare SSDs (decomm'd servers) to sessionstore cluster (codfw) - https://phabricator.wikimedia.org/T393406#10794000 (10Jhancock.wm) 05Open→03Resolved [21:29:00] (03PS14) 10Ryan Kemper: sre.hosts.rename: wipe DNS cache after rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [21:34:09] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [21:37:47] (03CR) 10Ryan Kemper: "Fixed the method invocations. 
Should be ready for another round of review (cc @volans)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1140504 (https://phabricator.wikimedia.org/T392729) (owner: 10Bking) [21:39:44] (03PS1) 10Andrew Bogott: keystone: update policy.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/1141977 (https://phabricator.wikimedia.org/T330759) [21:39:44] (03PS1) 10Andrew Bogott: nova policy.yaml: update with advice from oslopolicy-validator [puppet] - 10https://gerrit.wikimedia.org/r/1141978 (https://phabricator.wikimedia.org/T330759) [21:39:46] (03PS1) 10Andrew Bogott: nova policy.json: remove a bunch of redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141979 (https://phabricator.wikimedia.org/T330759) [21:39:47] (03PS1) 10Andrew Bogott: glance: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141980 (https://phabricator.wikimedia.org/T330759) [21:39:49] (03PS1) 10Andrew Bogott: Cinder: explicitly use new policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141981 (https://phabricator.wikimedia.org/T330759) [21:39:50] (03PS1) 10Andrew Bogott: cinder policy.yaml: update, remove redundant rules [puppet] - 10https://gerrit.wikimedia.org/r/1141982 (https://phabricator.wikimedia.org/T330759) [21:39:52] (03PS1) 10Andrew Bogott: Neutron: update policy rules [puppet] - 10https://gerrit.wikimedia.org/r/1141983 (https://phabricator.wikimedia.org/T330759) [21:39:54] (03PS1) 10Andrew Bogott: Designate: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1141984 (https://phabricator.wikimedia.org/T330759) [21:40:22] Hey all - have a couple of security patches for T392341 I’d like to get out today. 
[21:40:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:50:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:38] !log Deployed security fix (1) for T392341 [21:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:44] (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable, confirm it fixes the error in my local dev." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141516 (owner: 10Krinkle) [22:05:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:12:14] !log Deployed security fix (2) for T392341 [22:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:30] (03CR) 10Jdlrobson: [C:04-1] "The tagline is too big and overlaps search on Vector 2022. It shouldn't exceed the max size of the wordmark ( 124px). https://www.mediawik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [22:33:08] ryankemper@cumin2002 reimage (PID 3226688) is awaiting input [22:35:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.wikimedia.org with OS bookworm [22:35:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10794191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors: - apus-... 
[22:46:19] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [22:46:31] !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [22:46:42] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [22:47:05] !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [22:48:40] jouncebot: nowandnext [22:48:40] For the next 0 hour(s) and 11 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2100) [22:48:40] In 0 hour(s) and 11 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2300) [22:50:12] ryankemper@cumin2002 upgrade-firmware (PID 3300272) is awaiting input [22:50:48] (03CR) 10Zabe: [C:03+2] core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140661 (owner: 10Novem Linguae) [22:51:36] (03Merged) 10jenkins-bot: core-Permissions: refactor enwiki wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140661 (owner: 10Novem Linguae) [22:52:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1140661|core-Permissions: refactor enwiki wgRemoveGroups]] [22:54:03] ryankemper@cumin2002 upgrade-firmware (PID 3300272) is awaiting input [22:56:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:16] !log zabe@deploy1003 zabe, novemlinguae: Backport for [[gerrit:1140661|core-Permissions: refactor 
enwiki wgRemoveGroups]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:57:22] !log zabe@deploy1003 zabe, novemlinguae: Continuing with sync [22:59:23] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [22:59:45] !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [23:00:06] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250505T2300) [23:00:52] (03CR) 10Aleksandar Mastilovic: "@brouberol@wikimedia.org so is that supposed to be done in another patch, deployed, and then we come back to this one?" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [23:00:57] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [23:01:02] !log ryankemper@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [23:01:17] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [23:04:00] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140661|core-Permissions: refactor enwiki wgRemoveGroups]] (duration: 11m 13s) [23:06:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:13:00] (03CR) 10Cwhite: grafana: Toggle data sync using feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140760 
(https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [23:13:13] (03CR) 10Cwhite: grafana: Add enable_dashboard_sync feature flag in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140523 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [23:14:40] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php enwiki --delete /home/zabe/afl_text_table_deletedump/enwiki --sleep 0.3 # T381599 [23:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:46] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [23:25:47] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10794276 (10Eevans) OK, after having two new 480G SSDs added to each machine (used devices from decomm'd machi... 
[23:29:51] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:32:48] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000 (owner: 10TrainBranchBot) [23:44:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:49:46] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:49:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1142000 (owner: 10TrainBranchBot) [23:49:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown