[00:01:32] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [00:02:23] RESOLVED: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.464s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:03:36] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [00:07:11] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 155545432 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:08:11] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 321944 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:08:16] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.174s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:10:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [00:10:57] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [00:14:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker - jclark@cumin1002" [00:15:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker - 
jclark@cumin1002" [00:15:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:16:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.482s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:16:43] SRE-swift-storage, Commons, MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10589479 (Sreejithk2000) Its happening for this file as well. https://commons.wikimedia.org/w/index.php?title=File:WTA_logo_2025.svg [00:18:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1188.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1189.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1187.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1190.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1191.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1192.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:21:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.277s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:21:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.462s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:26:46] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 964.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:30:17] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.241s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:33:51] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1193 [00:34:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1193 [00:34:08] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1194 [00:34:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1194 [00:34:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1195 [00:34:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1195 [00:34:30] !log jclark@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host an-worker1196 [00:34:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1196 [00:34:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1197 [00:34:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1197 [00:34:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1198 [00:35:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1198 [00:35:07] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1199 [00:35:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1199 [00:35:17] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.241s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:35:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1200 [00:35:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1200 [00:35:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1201 [00:35:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1201 [00:35:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [00:35:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [00:36:03] !log jclark@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host an-worker1203 [00:36:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1203 [00:36:14] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1204 [00:36:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1204 [00:36:25] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1205 [00:36:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1205 [00:36:36] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1206 [00:36:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1206 [00:36:47] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1207 [00:36:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1207 [00:36:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1208 [00:37:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1208 [00:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123493 [00:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123493 (owner: TrainBranchBot) [00:39:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1191.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:39:50] !log jclark@cumin1002 
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1192.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:40:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1188.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:40:17] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.145s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:40:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1187.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:41:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.282s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:42:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1189.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:43:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1190.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1193.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1194.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1195.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1196.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1197.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:32] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1198.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1195.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:46:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1195.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:46:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.28s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:49:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123493 (owner: TrainBranchBot) [00:51:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.307s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:57:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.075s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:01:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122683 (owner: Bartosz Dziewoński) [01:01:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński) [01:02:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.075s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:05:17] (PS1) Bartosz Dziewoński: Change license for 
Russian Wikinews to CC-BY-4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) [01:05:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: Bartosz Dziewoński) [01:08:24] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123497 [01:08:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123497 (owner: TrainBranchBot) [01:12:51] (CR) Bartosz Dziewoński: Deduplicate JsonConfig config (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński) [01:13:58] (PS3) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 [01:19:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10589572 (phaultfinder) [01:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:30:53] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123497 (owner: TrainBranchBot) [01:53:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1195.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:53:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1193.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [01:54:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1194.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:54:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1197.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:54:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1196.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1201.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1203.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:47] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1200.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1199.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:59:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
an-worker1198.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:03:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1204.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:09:37] (PS1) Pppery: Use MediaWikiServices hook for push-subscription-manager changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) [02:11:54] (PS1) Eevans: cassandra_dev: cleanup unused param [puppet] - https://gerrit.wikimedia.org/r/1123500 [02:14:11] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123500 (owner: Eevans) [02:16:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1208.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:16:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1206.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:17:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1207.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:17:03] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1205.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:17:30] (CR) Eevans: [C:+2] cassandra_dev: cleanup unused param [puppet] - https://gerrit.wikimedia.org/r/1123500 (owner: Eevans) [02:18:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1201.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:18:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1200.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and 
with Dell SCP reboot policy FORCED [02:18:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:18:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1199.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:19:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1203.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:25:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1204.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:29:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1208.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:30:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1205.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:34:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10589651 (phaultfinder) [02:39:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1207.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:39:42] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
an-worker1206.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:42:39] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10589660 (Jclark-ctr) [02:42:48] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10589661 (Jclark-ctr) a:Jclark-ctr [02:44:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1187.eqiad.wmnet with OS bullseye [02:44:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1188.eqiad.wmnet with OS bullseye [02:55:15] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10589667 (Scott_French) Ah, that's good to know, @Jhancock.wm. If leaving it in place isn't causing any troubl... 
[03:04:42] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:02] (PS4) Abijeet Patro: metawiki: Enable Chinese variant translation for message bundles [mediawiki-config] - https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) [04:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:44:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:28:43] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:32:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:32:11] PROBLEM - Router 
interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:43] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250228T0700) [07:04:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:16:25] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1122951 (owner: Slyngshede) [07:19:42] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:21:14] (CR) Slyngshede: [C:+2] Add option to delete a single signup [software/bitu] - https://gerrit.wikimedia.org/r/1122951 (owner: Slyngshede) [07:21:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1030.eqiad.wmnet with OS bookworm [07:21:36] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10589812 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1030.eqiad.wmnet with OS bookworm [07:26:42] (PS1) Brouberol: airflow: inject the AIRFLOW_APPOWNER environment variable in all containers [deployment-charts] - 
https://gerrit.wikimedia.org/r/1123524 (https://phabricator.wikimedia.org/T386282) [07:26:49] (Merged) jenkins-bot: Add option to delete a single signup [software/bitu] - https://gerrit.wikimedia.org/r/1122951 (owner: Slyngshede) [07:36:51] (PS1) Brouberol: airflow: mount the hadoop configuration in the webserver and scheduler pods [deployment-charts] - https://gerrit.wikimedia.org/r/1123527 (https://phabricator.wikimedia.org/T386282) [07:44:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [07:48:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [07:58:59] (CR) Vgutierrez: [C:+2] sre:loadbalancer:migrate-service-ipip: Fix format strings [cookbooks] - https://gerrit.wikimedia.org/r/1123322 (owner: Vgutierrez) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250228T0800) [08:05:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:06:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:07:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1030.eqiad.wmnet with OS bookworm [08:07:55] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10589845 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1030.eqiad.wmnet with OS bookworm completed: - ganeti103... 
[08:24:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS bookworm [08:24:15] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10589865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm [08:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2195 db1178', diff saved to https://phabricator.wikimedia.org/P73839 and previous config saved to /var/cache/conftool/dbconfig/20250228-082500-marostegui.json [08:25:09] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1178.eqiad.wmnet [08:25:15] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2195.codfw.wmnet [08:26:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1241 db2237', diff saved to https://phabricator.wikimedia.org/P73840 and previous config saved to /var/cache/conftool/dbconfig/20250228-082603-marostegui.json [08:26:25] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1241.eqiad.wmnet [08:26:30] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2237.codfw.wmnet [08:27:54] (CR) Marostegui: [C:+2] mariadb: Move db1153, db2143 to ms3 [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui) [08:29:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2143.codfw.wmnet,db1153.eqiad.wmnet with reason: Setup [08:30:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2195.codfw.wmnet [08:32:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1241.eqiad.wmnet [08:32:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1178.eqiad.wmnet [08:32:42] !log fceratto@cumin1002 
START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [08:32:42] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test [08:32:45] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2195.codfw.wmnet with reason: Index rebuild [08:33:06] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Index rebuild [08:33:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2237.codfw.wmnet [08:33:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Index rebuild [08:33:37] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2237.codfw.wmnet with reason: Index rebuild [08:33:38] (03PS11) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [08:34:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10589890 (10elukey) @Papaul I am currently waiting since the host is being used by Jesse for another test. I tried the solution outlined... 
[08:38:26] (03CR) 10Elukey: [C:03+2] admin_ng: upgrade knative's docker images on ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123412 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:40:27] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [08:45:45] (03PS1) 10Marostegui: instance.schema: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123587 (https://phabricator.wikimedia.org/T387332) [08:46:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage [08:46:06] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:47:38] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:49:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage [08:50:28] (03CR) 10Federico Ceratto: [C:03+2] instance.schema: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123587 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [08:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:03:08] (03PS1) 10Marostegui: dbconfig.schema: Add ms1,ms2,ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123589 (https://phabricator.wikimedia.org/T387332) [09:05:24] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1123589 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [09:05:26] (03CR) 10Marostegui: [C:03+2] dbconfig.schema: Add ms1,ms2,ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123589 
(https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [09:07:18] !log T387445 Ran mwscript-k8s --comment="T387445" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Nasir.uddin682' 'Renamed user 93d5fcb2f4862bda0383cf97a6cfeb7f' [09:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:22] T387445: Unblock stuck global rename of Renamed_user_93d5fcb2f4862bda0383cf97a6cfeb7f - https://phabricator.wikimedia.org/T387445 [09:07:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1027.eqiad.wmnet with OS bookworm [09:07:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10589943 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm completed: - ganeti102... [09:08:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add ms3 T387332', diff saved to https://phabricator.wikimedia.org/P73841 and previous config saved to /var/cache/conftool/dbconfig/20250228-090838-marostegui.json [09:08:43] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [09:10:22] (03PS1) 10Muehlenhoff: Switch ganeti1027 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1123591 [09:14:22] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1027 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1123591 (owner: 10Muehlenhoff) [09:15:19] (03PS1) 10Marostegui: db2143: Add testing comment [puppet] - 10https://gerrit.wikimedia.org/r/1123592 [09:15:57] (03CR) 10Marostegui: [C:03+2] db2143: Add testing comment [puppet] - 10https://gerrit.wikimedia.org/r/1123592 (owner: 10Marostegui) [09:18:00] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1027.eqiad.wmnet [09:20:21] (03CR) 10Elukey: [C:03+1] "LGTM, but 
please follow up with Traffic to double check if anything else needs to be done. The fact that we are adding a new IP with the s" [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [09:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:25:41] (03PS1) 10JMeybohm: Remove upgrade checking and notice [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) [09:29:54] jayme, vgutierrez we have a full blown outage https://www.wikipedia.org/ [09:30:09] jynus: excuse me? [09:30:12] what? [09:30:20] jynus: that's working as expected here [09:30:28] the portal is not working it is redirecting 20+ times [09:30:38] definitely not happening here [09:30:38] and I am not the only one: https://www.reddit.com/r/wikipedia/comments/1j037hg/anybody_else_having_an_issue_on_the_iphone/ [09:30:57] reported on phab as T387549 [09:30:58] T387549: Wikipedia central page (https://www.wikipedia.org) fails to load with Too Many Redirects error - https://phabricator.wikimedia.org/T387549 [09:31:59] I am going to report it on the status page [09:32:02] I can repro as well [09:32:07] elukey: how? [09:32:09] That an iPhone only thing jynus? or can you repro somewhere else? [09:32:17] I can repor on my firefox on linux [09:32:20] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1123594 (https://phabricator.wikimedia.org/T387552) [09:32:20] *repro [09:32:41] can you share the network trace of firefox requests? [09:33:12] and/or send your traffic through a VPN temporarily? [09:33:29] yeah.. 
I can reproduce with firefox [09:33:40] not with chrome [09:33:41] this is curl: https://phabricator.wikimedia.org/P73842 [09:33:47] I still can't [09:33:48] vgutierrez: simply using Chrome, I get the redirects (I am going through Marseille, x-cache-status hit front) [09:34:27] it is a 301 loop [09:34:27] should we try to purge the page? [09:34:40] works fine for me with Chromium and Firefox as well [09:34:41] (03PS1) 10Marostegui: wmnet: Add ms3-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1123596 (https://phabricator.wikimedia.org/T387332) [09:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3611 MB (3% inode=98%): /tmp 3611 MB (3% inode=98%): /var/tmp 3611 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:34:56] so in drmrs some CDN nodes are affected and some aren't [09:35:48] but that doesn't explain why firefox isn't working for me and chrome works [09:36:32] wait what.. [09:36:32] (03CR) 10Marostegui: [C:03+2] wmnet: Add ms3-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1123596 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [09:36:35] !log marostegui@dns1006 START - running authdns-update [09:36:37] firefox is sending me to codfw [09:37:02] !log updated status page https://www.wikimediastatus.net/ [09:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:33] while using wikimedia-dns.org as DNS-over-HTTPS provider [09:38:11] I'm sent to esams and that seems to be fine [09:38:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [09:38:35] hmmm let's see [09:38:38] !log marostegui@dns1006 END - running authdns-update [09:39:10] the expected response for https://www.wikipedia.org is a 200 right? [09:39:20] we shouldn't redirect anywhere else [09:39:21] :?
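The failure mode being discussed here (a 301 whose target is the same URL, where a plain 200 is expected) can be checked mechanically. The following is only a minimal sketch, not anything run in the channel: `follow_redirects`, the `fetch` callable and the two stand-in portals are hypothetical names introduced for illustration, and no real network request is made.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# (status code, Location header or None): a hypothetical shape for this sketch
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(
    url: str, fetch: Callable[[str], Response], max_hops: int = 20
) -> Tuple[str, List[str]]:
    """Follow redirects and report how the chain ends.

    Returns (outcome, chain), where outcome is "final" (a non-redirect
    response was reached), "loop" (a URL repeated) or "too_many".
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return "final", chain
        url = urljoin(url, location)  # Location may be relative
        chain.append(url)
        if url in seen:
            return "loop", chain
        seen.add(url)
    return "too_many", chain


def looping_portal(url: str) -> Response:
    # Stand-in for an affected cache node: the portal 301s to itself.
    return 301, "https://www.wikipedia.org/"


def healthy_portal(url: str) -> Response:
    # Stand-in for an unaffected node: plain 200, no Location header.
    return 200, None
```

Passing a real HTTP client as `fetch` (with automatic redirect following disabled) would let the same function be pointed at individual cache nodes; that wiring is left out because it depends on assumptions about the environment.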
[09:39:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10590144 (10fgiunchedi) Today `grafana-ldap-sync-users` broke mentioning that it couldn't find `uid=HCoplin-WMF` in LDAP. Indeed the user is (was,... [09:40:32] jayme: what is the x-cache-status and server field that you get from esams? [09:40:36] (03PS12) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [09:40:53] vgutierrez: I don't recall, I get redirected to www.wikipedia.org/ [09:41:07] * vgutierrez probing all text nodes [09:41:22] the 301 has "mw-web.eqiad.main-XXXXX" as server [09:41:23] https://www.irccloud.com/pastebin/DSB13IXW/ [09:41:27] but ulsfo is affected as well [09:41:43] 19 nodes reply with a 200 right now and 37 with a 301 [09:41:45] vgutierrez: yes [09:41:49] so no...no redirect :D [09:41:51] x-cache: cp3072 miss, cp3072 hit/2355310 and hit-front elukey [09:42:05] across all DCs [09:43:18] jayme: do you have a "server" field? [09:43:30] yeah, but ATS given it's hit-front [09:44:22] I have a hit-front as well, but I see what appears to be a mw pod in the server field [09:44:41] I am wondering if a specific bad backend polluted some cache nodes, for some obscure reason [09:44:45] < server: mw-web.eqiad.main-5d9658d95-vh888 [09:44:45] < x-powered-by: PHP/7.4.33 [09:44:55] that served a 301 to ATS [09:45:13] I have mw-web.eqiad.main-5d9658d95-jwqh8 [09:45:52] eqiad is now serving 200s [09:46:01] urg [09:46:03] wrong url [09:46:06] should we switch chan?
[09:46:07] I was using en.wp.o [09:46:26] ok, I can reproduce against eqiad right now [09:47:01] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [09:48:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [09:48:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1027.eqiad.wmnet [09:54:00] (03PS1) 10Elukey: Revert "www.wikipedia.org: fix "search" URL parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1123599 [09:54:21] (03CR) 10Vgutierrez: [C:03+1] Revert "www.wikipedia.org: fix "search" URL parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1123599 (owner: 10Elukey) [09:54:47] (03CR) 10JMeybohm: [C:03+1] Revert "www.wikipedia.org: fix "search" URL parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1123599 (owner: 10Elukey) [09:56:24] (03CR) 10Elukey: [C:03+2] Revert "www.wikipedia.org: fix "search" URL parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1123599 (owner: 10Elukey) [09:56:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [10:04:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [10:05:04] !log elukey@deploy2002 Started scap sync-world: Revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357 [10:08:14] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:28] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[10:10:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [10:13:16] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:14:52] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:15:05] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:15:14] (03PS13) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:15:40] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:16:16] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:16:37] (03PS1) 10Ladsgroup: etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 [10:17:27] (03CR) 10CI reject: [V:04-1] etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 (owner: 10Ladsgroup) [10:17:50] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:17:51] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:18:40] (03PS2) 10Ladsgroup: etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 [10:18:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [10:19:22] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:19:23] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [10:20:04] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [10:20:33] (03CR) 10Marostegui: [C:03+1] etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 (owner: 10Ladsgroup) [10:21:21] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:21:43] (03CR) 10Ladsgroup: [C:03+2] etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 (owner: 10Ladsgroup) [10:21:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 (owner: 10Ladsgroup) [10:22:23] (03Merged) 10jenkins-bot: etcd: Ignore ms hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123601 (owner: 10Ladsgroup) [10:22:51] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123601|etcd: Ignore ms hosts]] [10:23:19] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002 [10:25:42] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1123601|etcd: Ignore ms hosts]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:26:09] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:27:08] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:28:39] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:31:10] (03CR) 10Marostegui: "Can we get this merged? Thanks. I want to start deploying all the hosts."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:32:29] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123601|etcd: Ignore ms hosts]] (duration: 09m 37s) [10:34:14] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:34:18] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:35:01] (03PS1) 10Cathal Mooney: Add BGP peering from codfw CRs to test Nokia Spines [homer/public] - 10https://gerrit.wikimedia.org/r/1123604 (https://phabricator.wikimedia.org/T371088) [10:35:31] (03PS2) 10Cathal Mooney: Add BGP peering from codfw CRs to test Nokia Spines [homer/public] - 10https://gerrit.wikimedia.org/r/1123604 (https://phabricator.wikimedia.org/T371088) [10:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73843 and previous config saved to /var/cache/conftool/dbconfig/20250228-103549-root.json [10:38:22] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10590297 (10elukey) 05Resolved→03Open Hi folks! I am really sorry to ruin the... 
[10:41:49] (03PS1) 10Sergio Gimeno: beta: add mediawiki.product_metrics.growth_product_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123606 (https://phabricator.wikimedia.org/T387286) [10:41:50] (03PS1) 10Sergio Gimeno: [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) [10:42:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73844 and previous config saved to /var/cache/conftool/dbconfig/20250228-104246-root.json [10:45:01] (03PS2) 10Sergio Gimeno: [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) [10:45:49] (03CR) 10Simon04: "@ltoscano@wikimedia.org, @Ladsgroup@gmail.com, how can this patch be repaired?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [10:45:59] (03Abandoned) 10Sergio Gimeno: beta: add mediawiki.product_metrics.growth_product_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123606 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [10:46:54] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:49:48] (03PS3) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) [10:49:48] (03PS3) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) [10:49:54] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for codfw nokia lab to core router links - cmooney@cumin1002" [10:50:01] (03PS1) 10Cathal Mooney: IPv6 entries for new /64 networks on nokia lab links [dns] - 10https://gerrit.wikimedia.org/r/1123609 (https://phabricator.wikimedia.org/T385217) [10:50:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for codfw nokia lab to core router links - cmooney@cumin1002" [10:50:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73845 and previous config saved to /var/cache/conftool/dbconfig/20250228-105055-root.json [10:52:01] (03CR) 10Cathal Mooney: [C:03+2] IPv6 entries for new /64 networks on nokia lab links [dns] - 10https://gerrit.wikimedia.org/r/1123609 (https://phabricator.wikimedia.org/T385217) (owner: 10Cathal Mooney) 
[10:52:20] !log cmooney@dns2005 START - running authdns-update [10:53:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1237', diff saved to https://phabricator.wikimedia.org/P73846 and previous config saved to /var/cache/conftool/dbconfig/20250228-105321-root.json [10:53:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1237.eqiad.wmnet [10:54:04] !log cmooney@dns2005 END - running authdns-update [10:54:09] (03PS1) 10Vgutierrez: migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) [10:54:57] 06SRE, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, 10Sustainability (Incident Followup), 07Wikimedia-production-error: Wikipedia central page (https://www.wikipedia.org) fails to load with Too Many Redirects error - https://phabricator.wikimedia.org/T387549#10590328 (10jcrespo) 05Open→0... [10:55:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:55:22] (03PS1) 10Marostegui: db1179: Make it x1 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1123612 [10:56:29] (03CR) 10Marostegui: [C:03+2] db1179: Make it x1 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1123612 (owner: 10Marostegui) [10:57:14] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:57:49] (03PS1) 10Marostegui: Revert "db1179: Make it x1 candidate master" [puppet] - 10https://gerrit.wikimedia.org/r/1123613 [10:57:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73847 and 
previous config saved to /var/cache/conftool/dbconfig/20250228-105751-root.json [10:58:13] (03CR) 10CI reject: [V:04-1] Revert "db1179: Make it x1 candidate master" [puppet] - 10https://gerrit.wikimedia.org/r/1123613 (owner: 10Marostegui) [10:59:01] (03PS2) 10Marostegui: Revert "db1179: Make it x1 candidate master" [puppet] - 10https://gerrit.wikimedia.org/r/1123613 [10:59:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1237.eqiad.wmnet [10:59:40] (03CR) 10Marostegui: [C:03+2] Revert "db1179: Make it x1 candidate master" [puppet] - 10https://gerrit.wikimedia.org/r/1123613 (owner: 10Marostegui) [11:00:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73848 and previous config saved to /var/cache/conftool/dbconfig/20250228-110003-root.json [11:01:59] (03PS1) 10Marostegui: db1237.yaml: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1123614 [11:02:24] (03CR) 10Marostegui: "This is a noop, but used by switchover tool to generate the task and patches." [puppet] - 10https://gerrit.wikimedia.org/r/1123614 (owner: 10Marostegui) [11:02:27] (03CR) 10Marostegui: [C:03+2] db1237.yaml: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1123614 (owner: 10Marostegui) [11:02:30] (03PS2) 10Vgutierrez: migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) [11:03:59] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1237 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1123615 (https://phabricator.wikimedia.org/T387557) [11:04:20] (03CR) 10Ladsgroup: [C:04-2] "This is waiting for the core patch (I80da12396858ee4fc58ae) to be merged and deployed which is not done yet. 
That's waiting for @krinkle@f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [11:04:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:05:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:06:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73849 and previous config saved to /var/cache/conftool/dbconfig/20250228-110559-root.json [11:06:10] (03PS3) 10Vgutierrez: migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) [11:07:51] (03PS1) 10Esanders: Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) [11:08:17] (03PS3) 10Filippo Giunchedi: pontoon: reorganize cloudvps / ctl code interactions [puppet] - 10https://gerrit.wikimedia.org/r/1123024 [11:08:29] (03CR) 10Cathal Mooney: [C:03+2] Add BGP peering from codfw CRs to test Nokia Spines [homer/public] - 10https://gerrit.wikimedia.org/r/1123604 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [11:09:05] (03Merged) 10jenkins-bot: Add BGP peering from codfw CRs to test Nokia Spines [homer/public] - 10https://gerrit.wikimedia.org/r/1123604 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [11:10:37] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: reorganize cloudvps / ctl code interactions [puppet] - 10https://gerrit.wikimedia.org/r/1123024 (owner: 10Filippo Giunchedi) [11:12:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73850 and 
previous config saved to /var/cache/conftool/dbconfig/20250228-111256-root.json [11:15:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73851 and previous config saved to /var/cache/conftool/dbconfig/20250228-111508-root.json [11:18:01] (03PS1) 10Vgutierrez: migrate-service-ipip: Move realserver validation to its own function [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 [11:19:02] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10590415 (10MatthewVernon) @Sreejithk2000 please __always__ include the full error message (and at least an approximate timestamp). @Sreejithk2000... [11:19:24] PROBLEM - Host lvs5004 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:26] (03PS1) 10Simon04: www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) [11:19:36] RECOVERY - Host lvs5004 is UP: PING WARNING - Packet loss = 66%, RTA = 222.90 ms [11:19:43] wow [11:20:13] !log depooling lvs5004 [11:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:46] (03CR) 10Simon04: www.wikipedia.org: fix "search" URL parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [11:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73852 and previous config saved to /var/cache/conftool/dbconfig/20250228-112105-root.json [11:21:38] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:21:47] (03CR) 10Klausman: [C:03+1] cassandra: reset '4.x' to be 4.1.8 [puppet] - 
10https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [11:23:12] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:23:16] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [11:24:26] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [11:25:22] !log repooling lvs5004 [11:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:12] RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:26:16] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73853 and previous config saved to /var/cache/conftool/dbconfig/20250228-112801-root.json [11:29:24] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 8 connections established with conf2006.codfw.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [11:30:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73854 and previous config saved to /var/cache/conftool/dbconfig/20250228-113014-root.json [11:30:18] (03PS1) 10Marostegui: control-mariadb-10.6-bookworm: Updated version [software] - 10https://gerrit.wikimedia.org/r/1123623 (https://phabricator.wikimedia.org/T385678) [11:30:58] (03CR) 10Marostegui: [C:03+2] 
control-mariadb-10.6-bookworm: Updated version [software] - 10https://gerrit.wikimedia.org/r/1123623 (https://phabricator.wikimedia.org/T385678) (owner: 10Marostegui) [11:31:26] (03Merged) 10jenkins-bot: control-mariadb-10.6-bookworm: Updated version [software] - 10https://gerrit.wikimedia.org/r/1123623 (https://phabricator.wikimedia.org/T385678) (owner: 10Marostegui) [11:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73855 and previous config saved to /var/cache/conftool/dbconfig/20250228-113610-root.json [11:36:17] (03PS5) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [11:36:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1238', diff saved to https://phabricator.wikimedia.org/P73856 and previous config saved to /var/cache/conftool/dbconfig/20250228-113646-marostegui.json [11:37:01] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1238.eqiad.wmnet [11:37:23] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:51] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1123624 [11:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2236', diff saved to https://phabricator.wikimedia.org/P73857 and previous config saved to /var/cache/conftool/dbconfig/20250228-114127-marostegui.json [11:43:03] !log fceratto@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: preparing - T385141 [11:43:06] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73858 and previous config saved to /var/cache/conftool/dbconfig/20250228-114306-root.json [11:43:07] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [11:43:20] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2236.codfw.wmnet [11:44:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1238.eqiad.wmnet [11:44:45] PROBLEM - Host db1238 #page is DOWN: PING CRITICAL - Packet loss = 100% [11:44:51] RECOVERY - Host db1238 #page is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [11:44:57] !incidents [11:44:57] Another downtime that got lost [11:44:57] 5702 (UNACKED) Host db1238 (paged) - PING - Packet loss = 100% [11:44:58] 5701 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams) [11:45:02] !resolve 5702 [11:45:02] 5702 (RESOLVED) Host db1238 (paged) - PING - Packet loss = 100% [11:45:03] !resolve 5702 [11:45:03] Attempt to resolve incident 5702 failed. [11:45:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Index rebuild [11:45:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73859 and previous config saved to /var/cache/conftool/dbconfig/20250228-114520-root.json [11:46:45] (03PS1) 10Hnowlan: trafficserver: send PUTs to the write datacentre [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) [11:47:24] (03CR) 10Hnowlan: "For completeness, should we do the same for PATCH and DELETE also?"
[puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [11:49:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2236.codfw.wmnet [11:49:41] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2236.codfw.wmnet with reason: Index rebuild [11:51:38] (03CR) 10Vgutierrez: trafficserver: send PUTs to the write datacentre (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [11:54:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10590471 (10MatthewVernon) @elukey when you get to doing that test, can you make sure the `/srv/swift-storage` partitions are all mounte... [11:54:24] FIRING: ProbeDown: Service ml-serve-ctrl1001:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:34] !incidents [11:54:34] 5703 (UNACKED) ProbeDown sre (10.64.16.202 ip4 ml-serve-ctrl1001:6443 probes/custom http_ml_serve_eqiad_kube_apiserver_ip4 eqiad) [11:54:34] 5702 (RESOLVED) Host db1238 (paged) - PING - Packet loss = 100% [11:54:35] 5701 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams) [11:54:38] !ack 5703 [11:54:38] 5703 (ACKED) ProbeDown sre (10.64.16.202 ip4 ml-serve-ctrl1001:6443 probes/custom http_ml_serve_eqiad_kube_apiserver_ip4 eqiad) [11:55:43] klausman: ^^ [11:56:42] klausman: kube-apiserver taking too long to restart (like last time) [11:59:24] RESOLVED: ProbeDown: Service ml-serve-ctrl1001:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250228T0800) [12:00:05] jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250228T1200). [12:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73860 and previous config saved to /var/cache/conftool/dbconfig/20250228-120025-root.json [12:01:33] (03PS2) 10Hnowlan: trafficserver: send PUTs to the write datacentre [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) [12:03:05] jayme: hrm. I wonder what's going on there [12:04:07] well...as said last time I would assume it takes too long to restart like it did at some point for wikikube clusters due to the amount of API objects, CPU load on apiserver etc. [12:05:45] did you bump resources for the wikikube vms? [12:06:04] We moved them to hardware, together with the etcd nodes [12:06:14] mhmm. [12:06:42] when you say together, do you mean co-hosting kubnrapi and etcd? [12:06:52] kubeapi* [12:07:12] yes [12:07:57] https://phabricator.wikimedia.org/T353464 [12:07:57] yes https://phabricator.wikimedia.org/T363307 [12:08:06] merci! 
[12:08:35] I would again suggest to take a look at historic restart times and resource usage on the ml apiservers [12:08:47] might as well be that bumping resources of the VM is good enough [12:12:24] yeah, given the lead time for hardware, I'd try that first anyway [12:27:38] !log Deployed security patch for T386826 [12:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:54] I think this might be a memory- rather than a cpu-problem. https://grafana.wikimedia.org/goto/hUlDzDpNR?orgId=1 shows allocstalls for both ctrl machines. I haven't checked the etcd nodes yet [12:39:52] (my initial hypothesis was CPU starvation, but that's probably not it) [12:40:06] The etcd nodes seem to be just fine™ [12:41:10] The ctrl nodes have 4G each, and while there doesn't seem to be super-tight memory pressure (plenty of page cache that could be reclaimed), maybe having only 1G of active page cache is a bit too tight for this workload. [12:41:40] I'll bump the ctrl nodes to 6G RAM each, see if that helps.
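(Side note on the allocstall hypothesis above: a minimal sketch, assuming a Linux host where /proc/vmstat exposes counters whose names start with `allocstall`; the exact per-zone names vary by kernel version. The function names are hypothetical and not part of any tooling mentioned in this log. The idea is to sample the counters twice and compare, since a total that keeps growing suggests allocations are stalling in direct reclaim, which would point at memory rather than CPU.)

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed 'name <non-negative integer>' lines.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def allocstall_total(counters):
    """Sum every counter whose name starts with 'allocstall'."""
    return sum(v for k, v in counters.items() if k.startswith("allocstall"))


def allocstall_delta(before_text, after_text):
    """Return how many allocation stalls occurred between two samples."""
    before = allocstall_total(parse_vmstat(before_text))
    after = allocstall_total(parse_vmstat(after_text))
    return after - before
```

(To use it, read /proc/vmstat into a string, wait for the interval of interest such as an apiserver restart, read it again, and pass both strings to `allocstall_delta`. A delta of zero over a restart would argue against the memory explanation.)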
[12:42:14] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1123624 (owner: 10Marostegui) [12:42:30] sgtm [12:42:40] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1123624 (owner: 10Marostegui) [12:43:49] !log klausman@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [12:46:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:46:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:46:11] (03PS2) 10JMeybohm: Remove upgrade checking and notice [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) [12:48:05] !log klausman@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [12:48:57] !log klausman@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [12:51:03] (03PS1) 10JMeybohm: Update to new upstream version 3.10.0 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T341984) [12:51:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:51:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:52:21] !log jclark@cumin1002 END (FAIL) - 
Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1187.eqiad.wmnet with OS bullseye [12:52:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1188.eqiad.wmnet with OS bullseye [12:54:45] !log klausman@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [12:56:47] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:56:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73861 and previous config saved to /var/cache/conftool/dbconfig/20250228-125652-root.json [12:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:01:54] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [13:02:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [13:02:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:02:54] (03PS2) 10JMeybohm: Update to new upstream version 3.10.0 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) [13:05:10] (03CR) 10Muehlenhoff: [C:03+2] idm-test: Add airflow-search-ops group request config [puppet] - 10https://gerrit.wikimedia.org/r/1123307 (owner: 10Muehlenhoff) [13:11:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73862 and previous config saved to 
/var/cache/conftool/dbconfig/20250228-131157-root.json [13:13:41] (03CR) 10CI reject: [V:04-1] Update to new upstream version 3.10.0 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [13:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:26:45] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10590752 (10Gehel) [13:26:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: Relabel Elastic hosts to Relforge hosts - https://phabricator.wikimedia.org/T386358#10590766 (10Gehel) [13:27:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73863 and previous config saved to /var/cache/conftool/dbconfig/20250228-132701-root.json [13:27:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10590784 (10Gehel) [13:31:33] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10590877 (10Gehel) [13:31:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10590886 (10Gehel) [13:32:01] (03CR) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [13:33:07] 07sre-alert-triage, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10590910 (10Gehel) [13:37:33] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10590993 (10Gehel) [13:40:12] (03PS1) 10Ebrahim: Make 'automatic' default vector theme option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) [13:40:20] (03PS2) 10Ebrahim: Make 'automatic' default vector theme option in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) [13:41:53] (03PS3) 10Ebrahim: Make 'automatic' default vector theme option in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) [13:42:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73864 and previous config saved to /var/cache/conftool/dbconfig/20250228-134206-root.json [13:47:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73865 and previous config saved to /var/cache/conftool/dbconfig/20250228-134737-root.json [13:50:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73866 and previous config saved to /var/cache/conftool/dbconfig/20250228-135029-root.json [13:51:05] (03PS1) 10Vgutierrez: hiera,cirrus: 
Enable IPIP on search*@codfw services [puppet] - 10https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) [13:51:11] (03PS1) 10Vgutierrez: hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - 10https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez) [13:52:39] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez) [13:53:42] (03PS1) 10Federico Ceratto: db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1123654 (https://phabricator.wikimedia.org/T385141) [13:54:33] (03CR) 10Marostegui: "Remember this host needs to be added also to zarcillo database" [puppet] - 10https://gerrit.wikimedia.org/r/1123654 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [13:54:59] (03CR) 10Marostegui: [C:03+1] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1123654 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [13:55:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73867 and previous config saved to /var/cache/conftool/dbconfig/20250228-135529-root.json [13:56:33] RESOLVED: KubernetesAPILatency: High 
Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73868 and previous config saved to /var/cache/conftool/dbconfig/20250228-135711-root.json [13:57:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2181', diff saved to https://phabricator.wikimedia.org/P73869 and previous config saved to /var/cache/conftool/dbconfig/20250228-135741-marostegui.json [13:57:50] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2181.codfw.wmnet [13:57:59] (03PS1) 10Elukey: admin_ng: enable monitoring for knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123656 [13:58:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Setup [14:02:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73870 and previous config saved to /var/cache/conftool/dbconfig/20250228-140242-root.json [14:03:05] (03CR) 10Federico Ceratto: [C:03+2] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod [puppet] - 10https://gerrit.wikimedia.org/r/1123654 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [14:05:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2181.codfw.wmnet [14:05:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73871 and previous config 
saved to /var/cache/conftool/dbconfig/20250228-140534-root.json [14:05:41] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Index rebuild [14:08:10] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10591134 (10Ladsgroup) These are eqiad hosts which I haven't been deleting the thumbnails from. Do you want me to start the script ther... [14:10:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73872 and previous config saved to /var/cache/conftool/dbconfig/20250228-141035-root.json [14:17:14] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10591147 (10RobH) [14:17:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73873 and previous config saved to /var/cache/conftool/dbconfig/20250228-141748-root.json [14:19:06] (03CR) 10Ssingh: migrate-service-ipip: Allow picking realservers using a cumin alias (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [14:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73874 and previous config saved to /var/cache/conftool/dbconfig/20250228-142039-root.json [14:25:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73875 and previous config saved to /var/cache/conftool/dbconfig/20250228-142540-root.json [14:28:15] (03CR) 10Ssingh: 
[C:03+1] "Looks good, one question in-line." [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 (owner: 10Vgutierrez) [14:32:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1177', diff saved to https://phabricator.wikimedia.org/P73876 and previous config saved to /var/cache/conftool/dbconfig/20250228-143256-marostegui.json [14:33:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73877 and previous config saved to /var/cache/conftool/dbconfig/20250228-143300-root.json [14:33:13] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1177.eqiad.wmnet [14:35:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73878 and previous config saved to /var/cache/conftool/dbconfig/20250228-143544-root.json [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1177.eqiad.wmnet [14:40:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73879 and previous config saved to /var/cache/conftool/dbconfig/20250228-144046-root.json [14:40:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Index rebuild [14:41:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2219', diff saved to https://phabricator.wikimedia.org/P73880 and previous config saved to /var/cache/conftool/dbconfig/20250228-144128-marostegui.json [14:41:39] 
!log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2219.codfw.wmnet [14:43:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1199', diff saved to https://phabricator.wikimedia.org/P73881 and previous config saved to /var/cache/conftool/dbconfig/20250228-144309-marostegui.json [14:43:25] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1199.eqiad.wmnet [14:46:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2219.codfw.wmnet [14:46:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Index rebuild [14:47:12] (03CR) 10Ssingh: [C:03+1] migrate-service-ipip: Move realserver validation to its own function (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 (owner: 10Vgutierrez) [14:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73882 and previous config saved to /var/cache/conftool/dbconfig/20250228-144807-root.json [14:48:47] (03PS4) 10Vgutierrez: migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) [14:48:47] (03PS2) 10Vgutierrez: migrate-service-ipip: Move realserver validation to its own function [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 [14:49:43] (03CR) 10Vgutierrez: migrate-service-ipip: Allow picking realservers using a cumin alias (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [14:50:40] (03CR) 10Ssingh: [C:03+1] migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [14:50:51] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1238 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73883 and previous config saved to /var/cache/conftool/dbconfig/20250228-145050-root.json [14:50:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1199.eqiad.wmnet [14:51:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Index rebuild [14:51:40] (03CR) 10Vgutierrez: [C:04-2] "blocked by T387569" [puppet] - 10https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez) [14:51:48] (03CR) 10Vgutierrez: "blocked by T387569" [puppet] - 10https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez) [14:52:49] (03CR) 10Eevans: [C:03+2] cassandra: reset '4.x' to be 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [14:52:50] !log vriley@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:52:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1256.eqiad.wmnet with OS bookworm [14:52:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10591251 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1256.eqiad.wmnet with OS bookworm completed: - db1256 (**WARN**) -... 
[14:53:22] (03PS1) 10Ssingh: fail CI to test operations-dnslist update to bookworm [dns] - 10https://gerrit.wikimedia.org/r/1123661 [14:53:55] (03CR) 10CI reject: [V:04-1] fail CI to test operations-dnslist update to bookworm [dns] - 10https://gerrit.wikimedia.org/r/1123661 (owner: 10Ssingh) [14:55:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73884 and previous config saved to /var/cache/conftool/dbconfig/20250228-145551-root.json [14:56:04] (03CR) 10Vgutierrez: [C:03+2] migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [14:56:09] (03CR) 10Vgutierrez: [C:03+2] migrate-service-ipip: Move realserver validation to its own function [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 (owner: 10Vgutierrez) [14:56:34] (03CR) 10Ssingh: "registry.wikimedia.org/releng/operations-dnslint:0.1.0 using the latest image." 
[dns] - 10https://gerrit.wikimedia.org/r/1123661 (owner: 10Ssingh) [14:56:44] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387528#10591258 (10VRiley-WMF) →14Duplicate dup:03T385251 [14:56:48] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - an-worker1098 - https://phabricator.wikimedia.org/T385251#10591260 (10VRiley-WMF) [14:58:50] (03Abandoned) 10Ssingh: fail CI to test operations-dnslist update to bookworm [dns] - 10https://gerrit.wikimedia.org/r/1123661 (owner: 10Ssingh) [15:03:42] (03Merged) 10jenkins-bot: migrate-service-ipip: Allow picking realservers using a cumin alias [cookbooks] - 10https://gerrit.wikimedia.org/r/1123611 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [15:03:46] (03Merged) 10jenkins-bot: migrate-service-ipip: Move realserver validation to its own function [cookbooks] - 10https://gerrit.wikimedia.org/r/1123621 (owner: 10Vgutierrez) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:45] (03CR) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:08:31] (03CR) 10Jdlrobson: [C:04-1] "sorry for the confusion: i meant this as advice for third parties. 
for production we want all projects to come out of beta at the same tim" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) (owner: 10Ebrahim) [15:10:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:11:05] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - an-worker1098 - https://phabricator.wikimedia.org/T385251#10591294 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated the cable, this issue should be resolved. [15:11:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10591297 (10elukey) I am going to try UEFI with this node to see if the same PXE issue comes up. [15:11:25] (03PS1) 10Vgutierrez: hiera,wcqs: Enable IPIP on wcqs@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) [15:11:26] (03PS1) 10Vgutierrez: hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) [15:12:15] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [15:12:19] (03Abandoned) 10Ebrahim: Make 'automatic' default vector theme option in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) (owner: 10Ebrahim) [15:12:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [15:15:16] (03CR) 10Ebrahim: "> automatic mode is incompatible with many browser extensions people use and breaking their experience suddenly would be confusing to the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) 
(owner: 10Ebrahim) [15:20:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10591313 (10VRiley-WMF) Power has been rebalanced. Closing this for now. [15:20:17] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10591314 (10VRiley-WMF) 05Open→03Resolved [15:20:27] (03CR) 10Herron: "Thanks, yes good call!" [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:20:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:21:01] (03PS1) 10Muehlenhoff: idm: Add approval rule for airflow-search-ops in production [puppet] - 10https://gerrit.wikimedia.org/r/1123665 [15:22:51] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387515#10591324 (10Jhancock.wm) →14Duplicate dup:03T387431 [15:22:51] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10591326 (10Jhancock.wm) [15:24:07] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387392#10591341 (10Jhancock.wm) →14Duplicate dup:03T387257 [15:24:08] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387257#10591343 (10Jhancock.wm) [15:24:22] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [dns] - 10https://gerrit.wikimedia.org/r/1123666 [15:25:45] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [dns] - 10https://gerrit.wikimedia.org/r/1123666 (owner: 10Hashar) [15:27:35] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10591355 (10Jhancock.wm) 05Open→03Resolved we can leave it. 
The last server should be getting decommissi... [15:29:13] (03PS10) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [15:34:21] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) [15:34:26] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) [15:35:29] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [15:35:30] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:35:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [15:38:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10591428 (10cmooney) Just want to confirm all the links are in place and working (the only ones I have not tested are the 100G t... 
[15:38:43] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [15:38:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10591433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm [15:39:07] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [15:41:10] (03CR) 10Muehlenhoff: "(To be merged after we've refined the display of pending approvals for multi approval processes)" [puppet] - 10https://gerrit.wikimedia.org/r/1123665 (owner: 10Muehlenhoff) [15:41:16] (03PS11) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [15:41:40] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10591449 (10VRiley-WMF) I would like to check in on this and see if there has been any further warnings? [15:43:35] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1248.eqiad.wmnet onto db1252.eqiad.wmnet [15:44:42] (03CR) 10Elukey: "Hi! If needed I am going to help in the review of the mysql cookbooks, so we can speed up your time-to-production :) I left a couple of co" [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:46:32] (03CR) 10Arlolra: "Ok. It wasn't going to be deployed before Monday anyways" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [15:46:52] (03CR) 10Elukey: "Hi!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:47:37] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [15:47:43] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye [15:47:45] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10591464 (10Jhancock.wm) elastic2089 in A4 and , elastic2083, 2102, and 2103 in C7 are t... [15:47:57] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10591467 (10fnegri) 05Open→03Resolved @VRiley-WMF No more warnings since 16:00 UTC yesterday! {F58519560} I will resolve this again, I... [15:48:45] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:52:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:52:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10591475 (10elukey) If I try to use UEFI, and recheck the BIOS settings via Redfish, I see: ` P1_AIOMAOC_ATG_b2TMLAN1OPROM = Disabled P1_AIOMAOC_ATG_b2TMLAN2OPROM... 
[15:52:12] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2004.codfw.wmnet with OS bookworm [15:52:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10591476 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm executed wit... [15:52:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:54:24] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123672 (https://phabricator.wikimedia.org/T387315) [15:54:26] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) [15:54:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123672 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [15:54:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [15:59:19] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [16:03:17] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [16:04:57] (03CR) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [16:07:14] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123676 
(https://phabricator.wikimedia.org/T387316) [16:07:15] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) [16:07:59] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123676 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [16:08:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [16:09:43] 06SRE, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, 07Wikimedia-production-error: Wikipedia central page (https://www.wikipedia.org) fails to load with Too Many Redirects error - https://phabricator.wikimedia.org/T387549#10591538 (10jcrespo) [16:10:29] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10591539 (10jcrespo) [16:17:48] (03CR) 10Elukey: sre.mysql.pool: sanity check for depool operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [16:18:18] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123678 (https://phabricator.wikimedia.org/T387318) [16:18:20] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) [16:18:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123678 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [16:18:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [16:25:14] 
10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: Relabel Elastic hosts to Relforge hosts - https://phabricator.wikimedia.org/T386358#10591563 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF These have been relabeled. [16:36:28] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) [16:36:30] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) [16:36:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [16:36:55] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [16:39:03] 10ops-eqiad, 06SRE, 06DC-Ops: Update the labels on an-presto100[1-5] to be an-worker106[5-9] - https://phabricator.wikimedia.org/T382482#10591601 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF These have been relabeled [16:42:36] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) [16:42:37] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) [16:42:48] 06SRE, 06Infrastructure-Foundations: Provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#10591609 (10jhathaway) I found this task while pondering similar functionality, as I have been using SystemRescue to troubleshoot some issues on our Supermicro hosts. A couple of questions: 1. D... 
[16:44:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [16:44:09] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [16:45:42] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123524 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [16:48:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73887 and previous config saved to /var/cache/conftool/dbconfig/20250228-164828-root.json [16:49:10] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye [16:52:41] (03CR) 10Michael Große: [C:03+1] "I don't really feel qualified to review this, but maybe we can just deploy it all on Monday and check in test-wiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [16:52:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10591659 (10elukey) Before proceeding further I'd like to solve T387577 first! 
[16:54:17] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123688 (https://phabricator.wikimedia.org/T387320) [16:54:18] (03PS1) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) [16:55:15] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123688 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [16:55:21] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [16:56:18] (03CR) 10Vgutierrez: [C:04-2] hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - 10https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez) [16:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:02:35] (03CR) 10BryanDavis: "Scheduled for 2025-03-04 Puppet request window, but happy to see this merged at any time. 
The instance that this config went with has been" [puppet] - 10https://gerrit.wikimedia.org/r/1117997 (https://phabricator.wikimedia.org/T385849) (owner: 10BryanDavis) [17:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73888 and previous config saved to /var/cache/conftool/dbconfig/20250228-170333-root.json [17:15:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73890 and previous config saved to /var/cache/conftool/dbconfig/20250228-171534-root.json [17:18:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73891 and previous config saved to /var/cache/conftool/dbconfig/20250228-171838-root.json [17:21:43] (03PS1) 10Scott French: shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) [17:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:15] (03CR) 10JHathaway: [C:03+1] deployment-prep: Remove parsoid things from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1117997 (https://phabricator.wikimedia.org/T385849) (owner: 10BryanDavis) [17:29:29] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10591796 (10Bugreporter) [17:30:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 25%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73892 and previous config saved to /var/cache/conftool/dbconfig/20250228-173040-root.json [17:32:21] (03PS1) 10Elukey: kserve: allow Prometheus metrics to be fetched [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) [17:33:06] (03CR) 10CI reject: [V:04-1] kserve: allow Prometheus metrics to be fetched [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [17:33:24] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387581 (10phaultfinder) 03NEW [17:33:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73893 and previous config saved to /var/cache/conftool/dbconfig/20250228-173343-root.json [17:36:45] (03PS1) 10Scott French: Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) [17:45:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73894 and previous config saved to /var/cache/conftool/dbconfig/20250228-174545-root.json [17:48:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73895 and previous config saved to /var/cache/conftool/dbconfig/20250228-174849-root.json [17:59:41] (03PS14) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [18:00:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73896 and previous config 
saved to /var/cache/conftool/dbconfig/20250228-180050-root.json [18:06:36] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [18:12:15] 06SRE, 06collaboration-services, 10Phabricator, 06Traffic, 13Patch-For-Review: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228#10592025 (10Dzahn) In this context T240297 also seems relevant. Specifically comments like T240297#5749688 and T... [18:15:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73897 and previous config saved to /var/cache/conftool/dbconfig/20250228-181556-root.json [18:22:42] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.028e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [18:26:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1188.eqiad.wmnet with OS bullseye [18:35:53] (03PS15) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [18:36:58] (03CR) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [18:42:30] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [18:49:19] (03PS1) 
10Eevans: cassandra: obsolete secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1123703 (https://phabricator.wikimedia.org/T387586) [18:59:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1188.eqiad.wmnet with OS bullseye [19:12:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10592186 (10Jhancock.wm) ah man that disk 16 coming back is no bueno. i was going to suggest making 24 and 25 a raid but with that coming back, i'm not sure. another thing we... [19:13:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10592188 (10Neobeta61) Do a quick test. OS drive did not connect to AOC-S39xx. Able to run command "storcli /c0 restart" in Administrato... [19:14:17] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387581#10592191 (10Jhancock.wm) →14Duplicate dup:03T387431 [19:14:19] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10592193 (10Jhancock.wm) [19:15:04] (03PS2) 10Scott French: Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) [19:15:05] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10592195 (10Jhancock.wm) I tried to reboot the idrac. that didn't work. next fix is to shut off the whole server for a few minutes, drain the power, and then reboot. @MoritzMuehlenhoff would you be able to... [19:21:45] (03CR) 10Scott French: "Thanks in advance for the review, Effie! 
This is one of a couple of patches for the migration steps planned for Monday, the other two I wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:22:02] (03PS2) 10Scott French: mw-(api-ext|web): scale next to 40% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) [19:22:02] (03CR) 10Scott French: "Thanks in advance for the review, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:22:14] (03PS2) 10Scott French: mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) [19:23:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73899 and previous config saved to /var/cache/conftool/dbconfig/20250228-192306-root.json [19:38:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73900 and previous config saved to /var/cache/conftool/dbconfig/20250228-193812-root.json [19:43:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:43:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:49:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73901 and previous config saved to /var/cache/conftool/dbconfig/20250228-195317-root.json [19:55:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73902 and previous config saved to /var/cache/conftool/dbconfig/20250228-195517-root.json [19:57:23] (03PS1) 10Dwisehaupt: community_crm: Add trusted_host_patterns to settings template [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) [19:59:00] (03CR) 10Scott French: [C:04-1] "Still looks good in principle, just needs some additional coordination before it can be put to use as intended. Thanks again, Hugh!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [19:59:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:01:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:06:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:08:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73903 and previous config saved to /var/cache/conftool/dbconfig/20250228-200822-root.json [20:10:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73904 and previous config saved to /var/cache/conftool/dbconfig/20250228-201023-root.json [20:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/main at eqiad: 23.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/main at eqiad: 21.99% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:16:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/main at eqiad: 24.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:55] (03PS1) 10Jdlrobson: Revert "styles: Remove transparent PNG fallback for `.vector-icon`" [skins/Vector] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123713 (https://phabricator.wikimedia.org/T358910) [20:19:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10592354 (10VRiley-WMF) 05Open→03Resolved These have been relabled [20:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/main at eqiad: 23.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:23:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73905 and previous config saved to /var/cache/conftool/dbconfig/20250228-202328-root.json [20:25:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:25:29] !log marostegui@cumin1002 
dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73906 and previous config saved to /var/cache/conftool/dbconfig/20250228-202528-root.json [20:30:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:32:58] (03CR) 10Ebrahim: "*your data" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123650 (https://phabricator.wikimedia.org/T387382) (owner: 10Ebrahim) [20:35:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:38:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:38:26] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:39:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:40:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73907 and previous config saved to /var/cache/conftool/dbconfig/20250228-204034-root.json [20:47:43] (03PS1) 10Andrew Bogott: nova vendordata: also install python3-openstackclient on startup [puppet] - 10https://gerrit.wikimedia.org/r/1123715 [20:48:27] (03CR) 10Andrew Bogott: [C:03+2] nova vendordata: also install python3-openstackclient on startup [puppet] - 10https://gerrit.wikimedia.org/r/1123715 (owner: 10Andrew Bogott) [20:55:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73908 and previous config saved to /var/cache/conftool/dbconfig/20250228-205539-root.json [20:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:46] (03CR) 10Dwisehaupt: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5012/co" [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: 10Dwisehaupt) [21:00:03] (03PS2) 10Dwisehaupt: community_crm: Add trusted_host_patterns to settings template [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) [21:03:14] (03CR) 10Dwisehaupt: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5013/co" [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: 10Dwisehaupt) [21:03:42] RECOVERY - Kafka 
MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:20:02] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host doc2003.codfw.wmnet [21:20:04] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [21:22:27] (PS1) Dzahn: site: add future bookworm doc hosts with insetup role [puppet] - https://gerrit.wikimedia.org/r/1123717 (https://phabricator.wikimedia.org/T384595) [21:23:42] (CR) Dzahn: [C:+2] site: add future bookworm doc hosts with insetup role [puppet] - https://gerrit.wikimedia.org/r/1123717 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn) [21:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:52] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2003.codfw.wmnet - dzahn@cumin1002" [21:23:57] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2003.codfw.wmnet - dzahn@cumin1002" [21:23:57] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:23:57] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache doc2003.codfw.wmnet on all recursors [21:24:00] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc2003.codfw.wmnet on all recursors [21:24:30] !log dzahn@cumin1002 START - Cookbook
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doc2003.codfw.wmnet - dzahn@cumin1002" [21:24:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doc2003.codfw.wmnet - dzahn@cumin1002" [21:28:32] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host doc2003.codfw.wmnet with OS bookworm [21:33:40] ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387597 (phaultfinder) NEW [21:36:09] ops-codfw, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10592495 (Dzahn) [21:39:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.035s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:41:37] (PS1) Andrew Bogott: nova metadata api: pass in params needed for a proper clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1123720 [21:41:43] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123720 (owner: Andrew Bogott) [21:41:58] (CR) CI reject: [V:-1] nova metadata api: pass in params needed for a proper clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1123720 (owner: Andrew Bogott) [21:42:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:42:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:43:07] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doc2003.codfw.wmnet with reason: host
reimage [21:43:13] (PS2) Andrew Bogott: nova metadata api: pass in params needed for a proper clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1123720 [21:45:21] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123720 (owner: Andrew Bogott) [21:46:48] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc2003.codfw.wmnet with reason: host reimage [21:48:10] (CR) Andrew Bogott: [C:+2] nova metadata api: pass in params needed for a proper clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1123720 (owner: Andrew Bogott) [21:49:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.522s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:49:56] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:54:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.434s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:59:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:02:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1103*,elastic1107* for ban hosts to change threadpool settings - bking@cumin2002 - T387176 [22:02:21] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1103*,elastic1107* for ban hosts to change threadpool settings -
bking@cumin2002 - T387176 [22:02:22] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176 [22:04:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:04:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:04:15] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doc2003.codfw.wmnet with OS bookworm [22:04:15] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc2003.codfw.wmnet [22:04:34] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:11:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:12:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.113s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:12:20] 
FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [22:13:04] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [22:13:07] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [22:14:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:17:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.121s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:19:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:21:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:22:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [22:24:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.146s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:29:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.379s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:30:47] !log bking@elastic1103 restart elastic-chi to apply thread pool settings T387176 [22:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:51] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176 [22:34:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.176s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:37:46] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 81, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:56:15] ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install
ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10592644 (Jhancock.wm) a:Jhancock.wm [22:59:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:00:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:02:23] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:20:44] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.008e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad