[00:05:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1059174 (owner: 10TrainBranchBot) [00:08:31] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage [00:08:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [00:08:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage [00:09:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage [00:09:26] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1254.eqiad.wmnet with reason: host reimage [00:09:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage [00:10:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1259.eqiad.wmnet with reason: host reimage [00:10:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1258.eqiad.wmnet with reason: host reimage [00:11:26] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [00:11:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage [00:11:37] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage [00:13:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [00:13:24] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - 
https://phabricator.wikimedia.org/T370112#10037275 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host alert2002.wikimedia.org with OS bookworm [00:14:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [00:15:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [00:17:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [00:19:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [00:22:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage [00:24:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1254.eqiad.wmnet with reason: host reimage [00:27:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1258.eqiad.wmnet with reason: host reimage [00:28:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1251.eqiad.wmnet with OS bullseye [00:28:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bullseye... 
[00:29:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2002.wikimedia.org with OS bookworm [00:29:28] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10037290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (**... [00:30:08] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1250.eqiad.wmnet with OS bullseye [00:30:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bullseye... [00:31:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage [00:32:45] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:33:07] (03PS1) 10Zabe: Further configurations for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059178 (https://phabricator.wikimedia.org/T371452) [00:33:41] (03PS2) 10Zabe: Further configurations for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059178 (https://phabricator.wikimedia.org/T371452) [00:33:43] (03CR) 10CI reject: [V:04-1] Further configurations for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059178 (https://phabricator.wikimedia.org/T371452) (owner: 10Zabe) [00:33:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1259.eqiad.wmnet with reason: host reimage [00:34:26] !log 
jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1256.eqiad.wmnet with OS bullseye [00:34:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037310 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bullseye... [00:35:23] (03CR) 10Zabe: [C:03+2] Further configurations for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059178 (https://phabricator.wikimedia.org/T371452) (owner: 10Zabe) [00:35:58] (03Merged) 10jenkins-bot: Further configurations for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059178 (https://phabricator.wikimedia.org/T371452) (owner: 10Zabe) [00:36:20] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1059178|Further configurations for u4cwiki (T371452)]] [00:36:22] T371452: Configuration changes for u4cwiki - https://phabricator.wikimedia.org/T371452 [00:36:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage [00:36:59] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10037317 (10Dwisehaupt) [00:37:09] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10037318 (10Dwisehaupt) [00:37:20] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10037319 (10Dwisehaupt) [00:38:01] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php u4cwiki translate # T371452 [00:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:30] !log 
zabe@deploy1003 zabe: Backport for [[gerrit:1059178|Further configurations for u4cwiki (T371452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:39:13] !log zabe@deploy1003 zabe: Continuing with sync [00:40:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1255.eqiad.wmnet with OS bullseye [00:40:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage [00:40:28] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037322 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1255.eqiad.wmnet with OS bullseye... [00:41:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1254.eqiad.wmnet with OS bullseye [00:41:56] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037323 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1254.eqiad.wmnet with OS bullseye...
[00:43:44] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1059178|Further configurations for u4cwiki (T371452)]] (duration: 07m 24s) [00:43:48] T371452: Configuration changes for u4cwiki - https://phabricator.wikimedia.org/T371452 [00:44:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1258.eqiad.wmnet with OS bullseye [00:44:42] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037327 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1258.eqiad.wmnet with OS bullseye... [00:48:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1253.eqiad.wmnet with OS bullseye [00:48:27] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037329 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bullseye... [00:50:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1250.mgmt.eqiad.wmnet with reboot policy FORCED [00:51:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1259.eqiad.wmnet with OS bullseye [00:51:08] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037330 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1259.eqiad.wmnet with OS bullseye...
[00:51:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1250.mgmt.eqiad.wmnet with reboot policy FORCED [00:52:45] RESOLVED: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:53:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1250.eqiad.wmnet with OS bullseye [00:53:16] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037331 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bull... [00:53:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1252.eqiad.wmnet with OS bullseye [00:53:46] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037332 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bullseye... [00:55:27] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [00:57:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1257.eqiad.wmnet with OS bullseye [00:57:25] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037333 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1257.eqiad.wmnet with OS bullseye...
[00:58:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [01:07:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037348 (10Jclark-ctr) [01:08:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1250.eqiad.wmnet with OS bullseye [01:08:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bullseye... [01:22:45] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [01:25:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1260-9 - jclark@cumin1002" [01:25:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1260-9 - jclark@cumin1002" [01:25:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:26:47] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED [01:26:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1261.mgmt.eqiad.wmnet with reboot policy FORCED [01:26:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1262.mgmt.eqiad.wmnet with reboot policy FORCED [01:27:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED 
[01:27:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1261.mgmt.eqiad.wmnet with reboot policy FORCED [01:28:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1263.mgmt.eqiad.wmnet with reboot policy FORCED [01:30:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED [01:31:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1261.mgmt.eqiad.wmnet with reboot policy FORCED [01:33:53] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1265.mgmt.eqiad.wmnet with reboot policy FORCED [01:35:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1264.mgmt.eqiad.wmnet with reboot policy FORCED [01:37:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1266.mgmt.eqiad.wmnet with reboot policy FORCED [01:38:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1269.mgmt.eqiad.wmnet with reboot policy FORCED [01:39:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1268.mgmt.eqiad.wmnet with reboot policy FORCED [01:44:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1267.mgmt.eqiad.wmnet with reboot policy FORCED [01:57:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1264.mgmt.eqiad.wmnet with reboot policy FORCED [01:57:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1265.mgmt.eqiad.wmnet with reboot policy FORCED [01:57:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1262.mgmt.eqiad.wmnet with reboot policy FORCED [01:59:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host wikikube-worker1268.mgmt.eqiad.wmnet with reboot policy FORCED [01:59:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1263.mgmt.eqiad.wmnet with reboot policy FORCED [02:01:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED [02:03:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED [02:04:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1269.mgmt.eqiad.wmnet with reboot policy FORCED [02:05:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1266.mgmt.eqiad.wmnet with reboot policy FORCED [02:07:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1261.mgmt.eqiad.wmnet with reboot policy FORCED [02:07:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED [02:07:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1267.mgmt.eqiad.wmnet with reboot policy FORCED [02:08:57] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1260 [02:09:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1260 [02:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:40] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:08:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 2.398s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:11:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:13:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.817s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:44:22] FIRING: SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:45:40] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:22] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:22] RESOLVED: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240802T0600) [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:31:47] (03CR) 10Jelto: [C:04-1] "This should be configured using the defaultPadText in Puppet: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/head" [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1059036 (https://phabricator.wikimedia.org/T371591) (owner: 10Aklapper) [07:00:04] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240802T0700) [07:00:42] (03CR) 10Jelto: [C:03+2] add byteplus to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1058558 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [07:00:57] (03PS2) 10Jelto: add byteplus to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1058558 (https://phabricator.wikimedia.org/T371418) [07:04:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 5.957s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:11:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.52s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:12:49] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1058558 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [07:14:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:33:10] (CR) Slyngshede: [C:+2] data.yaml: Offboarding sbailey [puppet] - https://gerrit.wikimedia.org/r/1058952 (owner: Slyngshede) [07:34:14] (CR) Alexandros Kosiaris: "2 years later, is this now redundant? could the entire stanza be removed?" [mediawiki-config] - https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) (owner: Daniel Kinzler) [07:36:19] (CR) Stevemunene: [C:+1] cloudnative-pg: create operator namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [07:36:39] !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging Sbailey out of all services on: 2241 hosts [07:36:46] (CR) Stevemunene: [C:+1] cloudnative-pg: create a test namespace and make the operator watch it [deployment-charts] - https://gerrit.wikimedia.org/r/1059093 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [07:37:25] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Sbailey out of all services on: 2241 hosts [07:49:44] (CR) Filippo Giunchedi: [C:+1] nrpe::monitor_service: clarify interval is in minutes [puppet] - https://gerrit.wikimedia.org/r/1059131 (owner: Ssingh) [07:50:24] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10037668 (Fabfur) Hi @NBaca-WMF @thcipriani, could you confirm this request please?
[07:50:48] (CR) Filippo Giunchedi: [C:+1] "lol re: "kinda gross" but if it works well it works well" [deployment-charts] - https://gerrit.wikimedia.org/r/1059158 (https://phabricator.wikimedia.org/T371390) (owner: CDanis) [07:51:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10037676 (Fabfur) a:Fabfur [07:51:40] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10037677 (Fabfur) a:Fabfur [07:53:36] (CR) Brouberol: [C:+2] cloudnative-pg: create a test namespace and make the operator watch it [deployment-charts] - https://gerrit.wikimedia.org/r/1059093 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [07:53:46] (CR) Brouberol: [C:+2] cloudnative-pg: create operator namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [07:59:20] (PS1) Brouberol: cloudnative-pg: rename chart [deployment-charts] - https://gerrit.wikimedia.org/r/1059244 (https://phabricator.wikimedia.org/T364797) [08:03:46] (CR) Stevemunene: [C:+1] "missed that, lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1059244 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [08:05:52] (CR) Brouberol: [C:+2] cloudnative-pg: rename chart [deployment-charts] - https://gerrit.wikimedia.org/r/1059244 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [08:09:57] (CR) Alexandros Kosiaris: "Hi, my multiversion -> single version adventures led me to this change. I am wondering currently what exactly uses this file.
My codesearc" [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall) [08:10:02] (PS1) Brouberol: cloudnative-pg: rename chart [deployment-charts] - https://gerrit.wikimedia.org/r/1059246 (https://phabricator.wikimedia.org/T364797) [08:14:28] (CR) Brouberol: [C:+2] cloudnative-pg: rename chart [deployment-charts] - https://gerrit.wikimedia.org/r/1059246 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol) [08:19:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:19:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:20:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:20:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T367856)', diff saved to https://phabricator.wikimedia.org/P67199 and previous config saved to /var/cache/conftool/dbconfig/20240802-082105-marostegui.json [08:21:08] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:22:51] (CR) Vgutierrez: [C:+1] varnish: fix bug causing %error_body_content% to appear in response body (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins) [08:26:41] (PS1) Ayounsi: Link most log_ messages to relevant objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) [08:26:42] (PS1) Ayounsi: Remove log_success line from Capirca script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) [08:26:44] (PS1) Ayounsi: raise AbortScript when needed [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059250 [08:27:35] (CR) CI reject:
[V:-1] Link most log_ messages to relevant objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) (owner: Ayounsi) [08:27:40] (CR) CI reject: [V:-1] raise AbortScript when needed [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059250 (owner: Ayounsi) [08:27:43] (CR) CI reject: [V:-1] Remove log_success line from Capirca script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) (owner: Ayounsi) [08:29:51] (PS1) Brouberol: cloudnative-pg: update networkpolicy clounative-pg -> kubeapi selector [deployment-charts] - https://gerrit.wikimedia.org/r/1059251 (https://phabricator.wikimedia.org/T364797) [08:30:31] (PS2) Ayounsi: Link most log_ messages to relevant objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) [08:30:31] (PS2) Ayounsi: Remove log_success line from Capirca script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) [08:30:31] (PS2) Ayounsi: raise AbortScript when needed [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059250 [08:31:35] (CR) CI reject: [V:-1] raise AbortScript when needed [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059250 (owner: Ayounsi) [08:36:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P67200 and previous config saved to /var/cache/conftool/dbconfig/20240802-083612-marostegui.json [08:36:21] (PS1) Elukey: external_clouds_vendors: set logging to DEBUG [puppet] - https://gerrit.wikimedia.org/r/1059253 (https://phabricator.wikimedia.org/T368023) [08:38:28] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3513/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059253 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:39:24] (03PS86) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [08:39:46] (03CR) 10Ayounsi: [V:03+1] "Tested on netbox-next, no (obvious) issues." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [08:40:53] (03PS2) 10Elukey: external_clouds_vendors: set logging to DEBUG [puppet] - 10https://gerrit.wikimedia.org/r/1059253 (https://phabricator.wikimedia.org/T368023) [08:44:05] (03CR) 10Elukey: [C:03+2] external_clouds_vendors: set logging to DEBUG [puppet] - 10https://gerrit.wikimedia.org/r/1059253 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:51:02] (03PS3) 10Ayounsi: raise AbortScript when needed [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059250 [08:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P67201 and previous config saved to /var/cache/conftool/dbconfig/20240802-085119-marostegui.json [09:06:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T367856)', diff saved to https://phabricator.wikimedia.org/P67202 and previous config saved to /var/cache/conftool/dbconfig/20240802-090627-marostegui.json [09:06:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance [09:06:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:06:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance [09:06:50] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T367856)', diff saved to https://phabricator.wikimedia.org/P67203 and previous config saved to /var/cache/conftool/dbconfig/20240802-090649-marostegui.json [09:09:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:14:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:23:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:34:08] (03Abandoned) 10Brouberol: cloudnative-pg: update networkpolicy clounative-pg -> kubeapi selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059251 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:36:01] (03PS1) 10Brouberol: cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) [09:37:08] (03CR) 10Giuseppe Lavagetto: "I overall like how this shaped up in the end. Only thing I'd take care of is removing the repetition of the cronjob name in the data struc" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:38:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10037891 (10MatthewVernon) Good question, and yes, I think per T368930 it's due to be decommed some time in this quarter or next (depending a bit on when the replacement hardware ar... [09:38:39] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10037894 (10hashar) You can add the ssh key and its passphrase to the Jenkins credentials store. You first need administrative... 
[09:38:42] (03PS1) 10Brouberol: cloudnative-pg: allow the operator to perform actions in its own namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059258 (https://phabricator.wikimedia.org/T364797) [09:39:16] (03CR) 10CI reject: [V:04-1] cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:41:59] (03CR) 10CI reject: [V:04-1] cloudnative-pg: allow the operator to perform actions in its own namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059258 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:42:23] (03CR) 10Hashar: ci: add new ECDSA ssh key for jenkins to connect to itself (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [09:42:36] (03PS1) 10MVernon: swift: mark sdh1 in ms-be1056 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1059259 (https://phabricator.wikimedia.org/T371192) [09:58:21] (03PS2) 10Brouberol: cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) [09:58:35] (03PS2) 10Brouberol: cloudnative-pg: allow the operator to perform actions in its own namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059258 (https://phabricator.wikimedia.org/T364797) [09:59:14] (03PS3) 10Brouberol: cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) [10:04:16] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:04:21] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: allow the operator to perform actions 
in its own namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059258 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:04:39] (03PS1) 10Filippo Giunchedi: benthos: add ensure support [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) [10:07:29] (03Merged) 10jenkins-bot: cloudnative-pg: revert back to chart name = cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059257 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:11:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:11:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:13:55] (03PS1) 10Brouberol: cloudnative-pg: bump chart after rename [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059267 (https://phabricator.wikimedia.org/T364797) [10:18:37] !log manually start dump_cloud_ip_ranges.service on puppetmaster1001 as test [10:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "alert2002 - ayounsi@cumin1002" [10:19:59] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: bump chart after rename [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059267 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:23:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "alert2002 - ayounsi@cumin1002" [10:28:00] (03PS1) 10Brouberol: cloudnative-pg: bump major versiom [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059276 (https://phabricator.wikimedia.org/T364797) [10:32:45] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: bump major versiom [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059276 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:39:30] (03CR) 10Elukey: [C:03+1] 
Link most log_ messages to relevant objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [10:39:52] (03CR) 10Elukey: [C:03+1] Remove log_success line from Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [10:41:59] (03PS2) 10MVernon: swift: mark sdh1 in ms-be1056 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1059259 (https://phabricator.wikimedia.org/T371192) [10:42:11] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059259 (https://phabricator.wikimedia.org/T371192) (owner: 10MVernon) [10:42:35] (03CR) 10Elukey: [C:03+1] "LGTM! Can you run PCC before merging to double check that all works?" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [10:45:46] (03PS1) 10Brouberol: cloudnative-pg: rename cloudnative-pg dse-k8s-eqiad value file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059277 (https://phabricator.wikimedia.org/T364797) [10:52:05] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: rename cloudnative-pg dse-k8s-eqiad value file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059277 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:55:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240802T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240802T1100). 
[11:00:43] (03PS1) 10Brouberol: cloudnative-pg: hardcode release values path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059284 (https://phabricator.wikimedia.org/T364797) [11:03:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:03:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:04:54] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: hardcode release values path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059284 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [11:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:20] (03CR) 10Jelto: [V:03+1] gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [11:15:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:24] (03PS4) 10Effie Mouzeli: mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) [11:38:38] (03PS1) 10Marostegui: installserver: Do not reimage db2221 [puppet] - 10https://gerrit.wikimedia.org/r/1059300 [11:39:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689 (10WMDECyn) 03NEW [11:41:18] (03PS2) 10Filippo Giunchedi: benthos: add ensure support [puppet] - 10https://gerrit.wikimedia.org/r/1059265 
(https://phabricator.wikimedia.org/T371492) [11:41:35] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [11:44:36] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db2221 [puppet] - 10https://gerrit.wikimedia.org/r/1059300 (owner: 10Marostegui) [11:45:47] (03CR) 10Marostegui: [C:03+1] "Confirmed via dmesg that's the broken disk" [puppet] - 10https://gerrit.wikimedia.org/r/1059259 (https://phabricator.wikimedia.org/T371192) (owner: 10MVernon) [11:46:20] (03CR) 10MVernon: [C:03+2] swift: mark sdh1 in ms-be1056 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1059259 (https://phabricator.wikimedia.org/T371192) (owner: 10MVernon) [11:47:09] (03PS1) 10Filippo Giunchedi: o11y: higher thresholds for webrequest-live benthos kafka lag alert [alerts] - 10https://gerrit.wikimedia.org/r/1059304 [11:53:39] (03PS2) 10Filippo Giunchedi: o11y: higher thresholds for webrequest-live benthos kafka lag alert [alerts] - 10https://gerrit.wikimedia.org/r/1059304 [11:54:37] (03CR) 10Filippo Giunchedi: "{{done}}" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [12:14:02] (03PS3) 10Filippo Giunchedi: o11y: higher thresholds for webrequest-live benthos kafka lag alert [alerts] - 10https://gerrit.wikimedia.org/r/1059304 [12:18:02] (03CR) 10Ayounsi: [V:03+1 C:03+2] Link most log_ messages to relevant objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [12:18:13] (03CR) 10Ayounsi: [C:03+2] Remove log_success line from Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [12:19:13] (03Merged) 10jenkins-bot: Link most log_ messages to relevant objects [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1059248 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [12:19:14] (03Merged) 10jenkins-bot: Remove log_success line from Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059249 (https://phabricator.wikimedia.org/T371653) (owner: 10Ayounsi) [12:21:22] (03PS1) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [12:21:58] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [12:23:49] (03PS2) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [12:24:27] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [12:26:11] (03PS3) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [12:30:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038469 (10Fabfur) a:03Fabfur [12:33:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038494 (10Fabfur) [12:33:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10038496 (10Fabfur) [12:34:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access 
to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10038497 (10Fabfur) As [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users | per procedure ]], req... [12:34:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038498 (10Fabfur) As [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users | per procedure... [12:42:48] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T371694 (10seanleong-WMDE) 03NEW [12:44:20] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10038534 (10seanleong-WMDE) [12:44:40] (03PS1) 10Ayounsi: ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 [12:51:35] (03CR) 10Ssingh: [C:03+2] nrpe::monitor_service: clarify interval is in minutes [puppet] - 10https://gerrit.wikimedia.org/r/1059131 (owner: 10Ssingh) [12:56:06] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10038550 (10WMDECyn) Approving request as Sean's manager [12:57:41] (03PS1) 10Fabfur: hiera:benthos: partially revert benthos removal [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) [12:58:30] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:04:23] (03CR) 10Ayounsi: "Yeah, we need a pynetbox with your fix from https://github.com/netbox-community/pynetbox/pull/632" [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [13:05:21] (03CR) 10Ayounsi: "But this file update can get merged anytime anyway. 
As it's a pre-requisite." [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [13:06:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038585 (10WMDE-leszek) I approve this request on WMDE's behalf. Thank you! [13:07:08] (03PS1) 10Fabfur: haproxy: remove template switch for benthos extended logging [puppet] - 10https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) [13:10:25] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [13:10:51] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "prometheus - ayounsi@cumin1002" [13:11:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "prometheus - ayounsi@cumin1002" [13:12:37] (03CR) 10Fabfur: "I would wait first 1059358 and 1059355 before merging this so we can have the hiera keys in place to test it and be sure that haproxy conf" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [13:14:46] (03CR) 10Fabfur: "Merging this before everything else should keep us away from potential errors on Benthos/HAProxy side" [puppet] - 10https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [13:20:40] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 10MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10038621 (10Marostegui) We just had another lag spike caused by dumps on en... 
[13:21:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [13:22:04] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10038626 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host alert2002.wikimedia.org with OS bookworm [13:24:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [13:27:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [13:30:01] (03PS1) 10Ssingh: wikimedia.org: update DKIM and add SPF for dayforce.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1059362 (https://phabricator.wikimedia.org/T371304) [13:30:36] (03PS5) 10Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - 10https://gerrit.wikimedia.org/r/1059042 [13:30:36] (03PS2) 10Ayounsi: Netbox add libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 [13:30:36] (03PS1) 10Ayounsi: Netbox: enable netbox_more_metrics plugin [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) [13:31:12] (03PS2) 10Ayounsi: Netbox: enable netbox_more_metrics plugin [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) [13:33:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [13:33:18] (03PS4) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:33:20] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
pt1979@cumin2002 for host prometheus2007.codfw.wmnet with OS bookworm [13:34:16] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:35:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [13:36:04] (03PS5) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:36:45] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:37:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host alert2002.wikimedia.org with OS bookworm [13:37:35] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10038646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host alert2002.wikimedia.org with OS bookworm completed: - alert2002 (**PASS**) -... [13:38:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [13:39:15] (03PS2) 10Elukey: provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) [13:39:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10038649 (10Ottomata) Approved. 
[13:40:32] (03CR) 10Elukey: "Manually tested, results in https://netbox-next.wikimedia.org/dcim/interfaces/35261/ (see the mac address field)." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:41:04] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10038662 (10Papaul) [13:41:14] (03PS2) 10Ssingh: wikimedia.org: update DKIM and add SPF for dayforce.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1059362 (https://phabricator.wikimedia.org/T371304) [13:41:55] (03PS6) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:42:01] (03CR) 10Ayounsi: provision_server.py: add mac address to network provision script (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:42:28] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:42:31] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:43:07] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10038668 (10Papaul) 05Stalled→03Resolved @Jhancock.wm this is done the issue was fixed by @ayounsi in https://phabricator.wikimedia.org/T371653 [13:43:28] (03PS7) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:43:54] (03CR) 10BBlack: [C:03+1] wikimedia.org: update DKIM and add SPF for 
dayforce.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1059362 (https://phabricator.wikimedia.org/T371304) (owner: 10Ssingh) [13:44:07] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:44:16] (03CR) 10Ssingh: [C:03+2] wikimedia.org: update DKIM and add SPF for dayforce.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1059362 (https://phabricator.wikimedia.org/T371304) (owner: 10Ssingh) [13:44:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038674 (10Ottomata) Approved [13:44:39] !log running authdns-update for CR: T3713041059362 [13:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:49] !log running authdns-update for CR: 1059362 T371304 [13:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:51] T371304: Adding IP Addresses to SPF (Dayforce) - https://phabricator.wikimedia.org/T371304 [13:46:42] (03PS1) 10Urbanecm: [Growth] enwiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059370 (https://phabricator.wikimedia.org/T370802) [13:48:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2007.codfw.wmnet with OS bookworm [13:49:10] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2007.codfw.wmnet with OS bookworm completed: - prometheus200... 
[13:50:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2008.codfw.wmnet with OS bookworm [13:50:17] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038696 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host prometheus2008.codfw.wmnet with OS bookworm [13:50:25] (03PS5) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [13:51:16] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038698 (10Papaul) [13:51:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:52:44] (03PS1) 10Fabfur: admin: added two users to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) [13:52:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [13:53:34] (03CR) 10CI reject: [V:04-1] admin: added two users to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: 10Fabfur) [13:54:22] (03PS8) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:55:00] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:56:07] 
(03CR) 10Hnowlan: "lgtm, but I think we could just drop the redirect for /view/" [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [13:56:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [14:00:21] (03PS3) 10Elukey: provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) [14:01:19] (03CR) 10CI reject: [V:04-1] provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:01:23] (03PS4) 10Elukey: provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) [14:01:29] (03CR) 10Elukey: provision_server.py: add mac address to network provision script (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:02:18] (03CR) 10CI reject: [V:04-1] provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:02:31] (03CR) 10CDanis: [C:03+2] jaeger: very basic archive traces support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:02:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10038734 (10Fabfur) Do you need access to private data? Because the procedure will be different in case... 
have a l... [14:02:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10038735 (10Fabfur) Do you need access to private data? Because the procedure will be different in case... have a look at h... [14:02:46] (03PS5) 10Elukey: provision_server.py: add mac address to network provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) [14:03:24] (03Merged) 10jenkins-bot: jaeger: very basic archive traces support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:05:55] (03CR) 10Ssingh: [C:03+1] "Looks good! Let's roll it on next week." [puppet] - 10https://gerrit.wikimedia.org/r/1059152 (owner: 10Dzahn) [14:08:42] (03PS34) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [14:11:11] (03PS1) 10Joely Rooke WMDE: Add wikibase client interaction stream to Event Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) [14:11:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2008.codfw.wmnet with OS bookworm [14:11:22] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038751 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2008.codfw.wmnet with OS bookworm completed: - prometheus200... 
[14:20:51] (PS35) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[14:24:19] (CR) Ayounsi: [C:+1] "nice!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[14:27:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) (owner: Joely Rooke WMDE)
[14:27:55] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2231.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:29:24] (CR) Elukey: [C:+2] provision_server.py: add mac address to network provision script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[14:30:24] (Merged) jenkins-bot: provision_server.py: add mac address to network provision script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1057826 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[14:34:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2231.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:34:56] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2232.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:35:20] (PS36) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[14:35:47] (CR) Ahmon Dancy: [C:+1] gitlab: enable throttling for all GitLab instances [puppet] - https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2232.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:41:30] (CR) Ahmon Dancy: [C:+1] "This stuff was removed in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/876013" [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall)
[14:49:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2233.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:52:53] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2233.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:53:21] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2234.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:59:22] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2234.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:03:44] (CR) Herron: [C:+1] grafana: set timeinterval 60s for Thanos [puppet] - https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) (owner: Filippo Giunchedi)
[15:04:29] (CR) Herron: [C:+1] "SGTM" [alerts] - https://gerrit.wikimedia.org/r/1059304 (owner: Filippo Giunchedi)
[15:05:13] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2235.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:09:22] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2235.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:14:39] (PS6) Ottomata: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T1048385) (owner: Btullis)
[15:15:29] (PS37) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[15:17:09] (CR) CI reject: [V:-1] Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T1048385) (owner: Btullis)
[15:17:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:31] (PS7) Ottomata: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T1048385) (owner: Btullis)
[15:24:05] (CR) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[15:25:20] (CR) BBlack: [C:+1] P:dns::auth::update: maintain admin_state via confd [puppet] - https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[15:26:12] (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3519/co" [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[15:29:25] (PS1) Jelto: gitlab: add missing parameter description in profile::gitlab [puppet] - https://gerrit.wikimedia.org/r/1059389
[15:33:39] (PS1) CDanis: jaeger: enable archive support in query and ui [deployment-charts] - https://gerrit.wikimedia.org/r/1059390 (https://phabricator.wikimedia.org/T371390)
[15:34:42] ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038994 (Papaul)
[15:36:58] (PS6) Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366)
[15:37:19] (CR) Ssingh: "No code change; updated comments in admin_state.tpl.erb to show examples for depool." [puppet] - https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[15:37:27] (CR) CDobbins: "https://puppet-compiler.wmflabs.org/output/1059123/3519/" [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[15:37:30] ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10038998 (Papaul) Open→Resolved @fgiunchedi this is complete
[15:38:01] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[15:39:07] (PS38) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[15:40:10] (PS39) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[15:42:24] (CR) Dzahn: [C:+1] gitlab: add missing parameter description in profile::gitlab [puppet] - https://gerrit.wikimedia.org/r/1059389 (owner: Jelto)
[15:42:26] (PS40) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[15:49:51] SRE, collaboration-services, Continuous-Integration-Infrastructure, Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10039040 (Dzahn) Adding users to LDAP admin groups would require an access request process and approvals which I would like to...
[15:52:54] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10039043 (thcipriani)
[15:53:47] (CR) Ssingh: "bblack and I decided to do away with generic-map for this to not only shorten the command for the user but also to not worry about knowing" [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[15:54:37] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10039048 (thcipriani) >>! In T371650#10037668, @Fabfur wrote: > Hi @NBaca-WMF @thcipriani, could you confirm this request please? 👍 good from my side. @SToyofuku-WMF has sat...
[15:55:07] (PS6) Ssingh: P:conftool: add schema for geodns [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366)
[16:00:00] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@d573c40]: Deploy latest DAGs for analytics Airflow instance. T368756
[16:00:06] T368756: Airflow job to orchestrate the emission mechanism - https://phabricator.wikimedia.org/T368756
[16:01:02] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@d573c40]: Deploy latest DAGs for analytics Airflow instance. T368756 (duration: 01m 02s)
[16:04:22] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:43] (PS1) Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020)
[16:06:09] (PS1) Hnowlan: rpc: add script for running jobs from stdin rather than http [mediawiki-config] - https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048)
[16:06:51] (CR) CI reject: [V:-1] rpc: add script for running jobs from stdin rather than http [mediawiki-config] - https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: Hnowlan)
[16:08:18] (CR) Lucas Werkmeister (WMDE): Move section mapping to separate file (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[16:08:28] (PS2) Hnowlan: rpc: add script for running jobs from stdin rather than http [mediawiki-config] - https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048)
[16:30:50] (CR) Ssingh: varnish: fix bug causing %error_body_content% to appear in response body (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[16:42:10] (CR) Vgutierrez: varnish: fix bug causing %error_body_content% to appear in response body (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[16:43:43] (CR) BBlack: [C:+1] P:conftool: add schema for geodns [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[16:47:38] (CR) Zabe: [C:+2] Move section mapping to separate file (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[16:51:08] (PS1) Zabe: noc: Provide db-sections.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1059402
[16:51:32] (CR) Zabe: [C:+2] Move section mapping to separate file (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[17:23:39] (CR) Ssingh: varnish: fix bug causing %error_body_content% to appear in response body (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[17:26:43] (CR) Ssingh: [C:+1] "One comment that is not a big deal but looks good otherwise! Let's merge on Monday." [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[17:29:39] (PS1) Andrew Bogott: New files, templates and manifests for OpenStack Caracal [puppet] - https://gerrit.wikimedia.org/r/1059408 (https://phabricator.wikimedia.org/T369044)
[17:29:41] (PS1) Andrew Bogott: wmf_sink: rip out the proxy-cleanup code [puppet] - https://gerrit.wikimedia.org/r/1059409 (https://phabricator.wikimedia.org/T371707)
[17:30:13] (CR) CI reject: [V:-1] wmf_sink: rip out the proxy-cleanup code [puppet] - https://gerrit.wikimedia.org/r/1059409 (https://phabricator.wikimedia.org/T371707) (owner: Andrew Bogott)
[17:47:02] (PS1) Dzahn: Revert "lists: switch firewall provider to nftables" [puppet] - https://gerrit.wikimedia.org/r/1059411
[17:48:51] (CR) CDanis: [C:+1] "I have some of Scott's same concerns about `weight` being included as a no-op, but, seems ok." [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[17:51:29] (CR) Dzahn: [C:+2] "There isn't a simple fix to the puppet issue that removes and then reinstalls spamd every other puppet run. Details in the linked ticket." [puppet] - https://gerrit.wikimedia.org/r/1059411 (owner: Dzahn)
[17:58:59] (CR) Dzahn: [C:+2] "reverting because of https://phabricator.wikimedia.org/T371575" [puppet] - https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:18:43] (CR) Dzahn: [C:+2] "maybe can be improved but definitely fixes the issue for now that we have syncs again https://puppet-compiler.wmflabs.org/output/1058264/3" [puppet] - https://gerrit.wikimedia.org/r/1058264 (https://phabricator.wikimedia.org/T257741) (owner: Dzahn)
[18:22:36] (PS1) Dwisehaupt: Add yahoo-verification-key for Complaint Feedback Loop [dns] - https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963)
[18:27:16] (CR) Dzahn: [C:+2] "sync service working again. the replica host is the dest host and is also where the service runs. it pulls from the active host." [puppet] - https://gerrit.wikimedia.org/r/1058264 (https://phabricator.wikimedia.org/T257741) (owner: Dzahn)
[18:30:53] (PS41) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[18:41:36] (PS42) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424)
[18:42:15] (CR) CDobbins: varnish: fix bug causing %error_body_content% to appear in response body (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[18:54:26] (PS1) Dzahn: gerrit: enable nft throttling on role level, but just log [puppet] - https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259)
[18:55:34] (PS2) Dzahn: gerrit: enable nft throttling on role level, but just log [puppet] - https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259)
[18:57:37] (PS1) Dzahn: gerrit: set nft throttling policy to drop, only on replica host [puppet] - https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259)
[19:01:24] (PS1) Dzahn: miscweb: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677)
[19:10:43] (CR) Scott French: [C:+1] "That sounds quite a bit easier to reason about, vs. having to remember or look up precedence rules between service- and map-level override" [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[19:17:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:19:44] (CR) Ssingh: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: CDobbins)
[19:58:02] (PS1) Dreamy Jazz: Define wgVirtualDomainsMapping for virtual-checkuser-global [mediawiki-config] - https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724)
[20:07:26] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10039534 (NBaca-WMF) approved from my side as well! Thanks all
[20:24:57] (PS1) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[20:29:45] (PS2) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[20:31:29] (PS3) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[20:33:21] (PS4) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[20:34:26] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10039579 (VRiley-WMF) Open→Resolved Thanks for the update. With this information, I'll go ahead and mark this as resolved due to future upgrade planned. If there is an...
[20:43:21] (PS5) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[20:51:38] (PS6) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[21:10:14] (PS7) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[21:10:41] (PS8) CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)
[21:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:16] (CR) Dzahn: [C:+2] gitlab: add missing parameter description in profile::gitlab [puppet] - https://gerrit.wikimedia.org/r/1059389 (owner: Jelto)
[21:15:28] (Abandoned) Dzahn: firewall: if provider is nft and not pulling requestctl, remove confd [puppet] - https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: Dzahn)
[21:17:06] (Abandoned) Dzahn: scap: remove scandium from dsh groups [puppet] - https://gerrit.wikimedia.org/r/1053791 (https://phabricator.wikimedia.org/T363402) (owner: Dzahn)
[22:44:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED
[22:44:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED
[22:48:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED
[22:50:40] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:22] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1260.mgmt.eqiad.wmnet with reboot policy FORCED
[23:19:56] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371741 (phaultfinder) NEW
[23:21:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1250.eqiad.wmnet with OS bullseye
[23:21:59] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039923 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bull...
[23:24:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage
[23:24:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1251.eqiad.wmnet with OS bullseye
[23:24:49] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bull...
[23:26:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage
[23:26:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1252.eqiad.wmnet with OS bullseye
[23:26:44] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bull...
[23:26:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage
[23:29:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage
[23:29:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage
[23:30:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1253.eqiad.wmnet with OS bullseye
[23:31:03] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bull...
[23:33:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage
[23:33:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage
[23:35:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage
[23:36:20] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:36:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:36:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1250.eqiad.wmnet with OS bullseye
[23:37:05] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039942 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bullseye...
[23:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059436
[23:38:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059436 (owner: TrainBranchBot)
[23:40:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:40:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:40:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1251.eqiad.wmnet with OS bullseye
[23:41:07] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039947 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bullseye...
[23:44:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:44:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:44:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1252.eqiad.wmnet with OS bullseye
[23:44:39] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039975 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bullseye...
[23:45:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:46:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:46:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1253.eqiad.wmnet with OS bullseye
[23:46:34] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bullseye...
[23:48:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1257.eqiad.wmnet with OS bullseye
[23:48:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1254.eqiad.wmnet with OS bullseye
[23:48:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1256.eqiad.wmnet with OS bullseye
[23:48:48] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039987 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1257.eqiad.wmnet with OS bull...
[23:48:49] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039988 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1254.eqiad.wmnet with OS bull...
[23:48:56] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039992 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bull...
[23:49:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1255.eqiad.wmnet with OS bullseye
[23:49:39] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10039994 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1255.eqiad.wmnet with OS bull...
[23:51:11] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage
[23:51:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1254.eqiad.wmnet with reason: host reimage
[23:51:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage
[23:51:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage
[23:54:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage
[23:54:22] RESOLVED: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:56:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage