[00:00:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1279.eqiad.wmnet with reason: host reimage
[00:07:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060183 (owner: TrainBranchBot)
[00:10:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[00:16:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:20:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage
[00:26:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage
[00:27:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage
[00:30:41] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage
[00:33:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:34:25] FIRING: SystemdUnitFailed: prometheus-ipmi-exporter.service on wikikube-worker1282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:37:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1283.eqiad.wmnet with OS bullseye
[00:38:00] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1283.eqiad.wmnet with OS bullseye...
[00:38:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:39:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:39:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1279.eqiad.wmnet with OS bullseye
[00:39:25] RESOLVED: SystemdUnitFailed: prometheus-ipmi-exporter.service on wikikube-worker1282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:29] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047162 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1279.eqiad.wmnet with OS bullseye...
[00:41:07] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:41:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:41:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1282.eqiad.wmnet with OS bullseye
[00:41:32] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1282.eqiad.wmnet with OS bullseye...
[00:41:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:43:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:44:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:44:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1281.eqiad.wmnet with OS bullseye
[00:44:57] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1281.eqiad.wmnet with OS bullseye...
[00:45:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:46:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:46:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1280.eqiad.wmnet with OS bullseye
[00:46:30] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047165 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1280.eqiad.wmnet with OS bullseye...
[00:47:52] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047178 (Jclark-ctr)
[00:48:22] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:50:11] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047179 (Jclark-ctr)
[00:50:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:50:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1284.eqiad.wmnet with OS bullseye
[00:50:28] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047180 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1284.eqiad.wmnet with OS bullseye...
[00:50:31] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047181 (Jclark-ctr)
[00:54:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1295.eqiad.wmnet with OS bullseye
[00:55:04] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1288.eqiad.wmnet with OS bullseye
[00:55:04] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047186 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1295.eqiad.wmnet with OS bull...
[00:55:10] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1288.eqiad.wmnet with OS bull...
[00:55:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1289.eqiad.wmnet with OS bullseye
[00:55:19] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047188 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1289.eqiad.wmnet with OS bull...
[00:56:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1290.eqiad.wmnet with OS bullseye
[00:56:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1291.eqiad.wmnet with OS bullseye
[00:56:11] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1290.eqiad.wmnet with OS bull...
[00:56:15] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047190 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1291.eqiad.wmnet with OS bull...
[00:56:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1292.eqiad.wmnet with OS bullseye
[00:57:01] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047191 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1292.eqiad.wmnet with OS bull...
[00:57:39] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1293.eqiad.wmnet with OS bullseye
[00:57:48] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047192 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1293.eqiad.wmnet with OS bull...
[00:58:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1294.eqiad.wmnet with OS bullseye
[00:58:25] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047193 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1294.eqiad.wmnet with OS bull...
[00:58:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1287.eqiad.wmnet with OS bullseye
[00:59:03] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047194 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1287.eqiad.wmnet with OS bull...
[01:01:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED
[01:02:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED
[01:02:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
[01:02:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
[01:11:45] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
[01:11:46] (PS1) Arlolra: Enabled KartographerParsoidSupport on (cs|hi|shn|ps|tr)wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1060186 (https://phabricator.wikimedia.org/T371936)
[01:11:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage
[01:12:03] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage
[01:13:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1290.eqiad.wmnet with reason: host reimage
[01:13:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
[01:13:46] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
[01:14:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
[01:14:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[01:15:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
[01:15:40] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
[01:18:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
[01:19:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1285.eqiad.wmnet with OS bullseye
[01:19:49] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047215 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1285.eqiad.wmnet with OS bullseye...
[01:19:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1286.eqiad.wmnet with OS bullseye
[01:19:54] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1286.eqiad.wmnet with OS bullseye...
[01:21:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
[01:24:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
[01:25:43] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949 (phaultfinder) NEW
[01:26:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
[01:33:06] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:33:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1290.eqiad.wmnet with reason: host reimage
[01:33:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:33:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1295.eqiad.wmnet with OS bullseye
[01:33:52] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1295.eqiad.wmnet with OS bullseye...
[01:35:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:35:46] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047240 (phaultfinder)
[01:36:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage
[01:36:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:36:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1291.eqiad.wmnet with OS bullseye
[01:36:57] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1291.eqiad.wmnet with OS bullseye...
[01:39:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[01:39:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:40:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:40:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1293.eqiad.wmnet with OS bullseye
[01:40:11] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047242 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1293.eqiad.wmnet with OS bullseye...
[01:41:13] (PS1) Andrew Bogott: Make cloudcephosd103[578] into ceph osd nodes [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344)
[01:41:41] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:42:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:42:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1292.eqiad.wmnet with OS bullseye
[01:42:19] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1292.eqiad.wmnet with OS bullseye...
[01:42:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye
[01:42:54] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047250 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull...
[01:43:11] (PS2) Andrew Bogott: Make cloudcephosd103[578] into ceph osd nodes [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344)
[01:43:22] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[01:43:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:44:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:44:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1287.eqiad.wmnet with OS bullseye
[01:44:13] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047251 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1287.eqiad.wmnet with OS bullseye...
[01:44:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage
[01:48:28] (CR) Andrew Bogott: [C:+2] "pcc failures are because the systems are new" [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[01:51:05] (CR) Andrew Bogott: [C:+2] " It wasn't." [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[01:51:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:52:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:52:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1290.eqiad.wmnet with OS bullseye
[01:52:20] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047253 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1290.eqiad.wmnet with OS bullseye...
[01:53:07] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:55:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:55:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1289.eqiad.wmnet with OS bullseye
[01:55:43] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1289.eqiad.wmnet with OS bullseye...
[01:56:49] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:57:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:57:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1294.eqiad.wmnet with OS bullseye
[01:57:54] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047255 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1294.eqiad.wmnet with OS bullseye...
[02:02:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[02:02:29] (PS1) Andrew Bogott: Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344)
[02:02:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[02:02:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1288.eqiad.wmnet with OS bullseye
[02:02:46] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047258 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1288.eqiad.wmnet with OS bullseye...
[02:02:57] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047259 (Jclark-ctr)
[02:03:11] (PS2) Andrew Bogott: Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344)
[02:03:41] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[02:04:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
[02:06:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
[02:06:42] (CR) Andrew Bogott: [C:+2] Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[02:21:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED
[02:21:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED
[02:35:45] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047267 (phaultfinder)
[02:39:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:46] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047268 (phaultfinder)
[02:45:48] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047271 (phaultfinder)
[02:50:48] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047272 (phaultfinder)
[02:59:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye
[03:03:10] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047274 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye...
[03:35:43] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047278 (phaultfinder)
[03:40:44] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047279 (phaultfinder)
[03:45:46] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047280 (phaultfinder)
[03:50:49] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047281 (phaultfinder)
[03:55:48] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047283 (phaultfinder)
[04:00:49] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047284 (phaultfinder)
[04:34:23] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:52] (PS1) Giuseppe Lavagetto: haproxy: fallback to global requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1060194
[05:19:09] (CR) Giuseppe Lavagetto: [C:+2] haproxy: fallback to global requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1060194 (owner: Giuseppe Lavagetto)
[05:23:33] (PS1) Giuseppe Lavagetto: haproxy: fix text/template [puppet] - https://gerrit.wikimedia.org/r/1060195
[05:23:57] (CR) Giuseppe Lavagetto: [V:+2 C:+2] haproxy: fix text/template [puppet] - https://gerrit.wikimedia.org/r/1060195 (owner: Giuseppe Lavagetto)
[05:35:10] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:49:15] (PS1) Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:10] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:24:25] (PS1) Jelto: gerrit: disable logging for nftables rules [puppet] - https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951)
[06:26:22] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3570/co" [puppet] - https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951) (owner: Jelto)
[06:28:00] (CR) Jelto: [V:+1 C:+2] gerrit: disable logging for nftables rules [puppet] - https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951) (owner: Jelto)
[06:40:08] (CR) Fabfur: "Do we need to define this in the http frontend too?" [puppet] - https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[06:44:10] (CR) Giuseppe Lavagetto: "I don't think so, the http frontend just does redirects, adding this would only add unneeded complexity IMHO. We can revisit later." [puppet] - https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[06:47:17] (CR) Fabfur: [C:+1] haproxy: change behaviour for requestctl filters [puppet] - https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[06:48:04] (CR) Fabfur: [C:+2] haproxy: remove template switch for benthos extended logging [puppet] - https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) (owner: Fabfur)
[06:49:21] (CR) Ayounsi: [C:+2] Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[06:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:59:52] (CR) David Caro: Add ceph config for cloudcephosd103[5-8] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:01:54] (03CR) 10Vgutierrez: [C:04-1] haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:05:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:52] (03CR) 10Fabfur: "thanks @slyngshede@wikimedia.org for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: 10Fabfur) [07:08:13] (03PS1) 10David Caro: ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks [puppet] - 10https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) [07:08:43] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [07:09:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:59] (03Abandoned) 10David Caro: ceph: add new cloudcephosd1035 [puppet] - 10https://gerrit.wikimedia.org/r/1060146 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [07:10:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:11:57] (03CR) 10David Caro: [C:03+2] ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks [puppet] - 10https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [07:12:16] (03CR) 10David Caro: [C:03+2] "PCC looks good, no more duplicated ips, and each host has ip on it's own rack's block" [puppet] - 10https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [07:12:39] (03CR) 10Fabfur: [C:03+1] "ok for me" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [07:14:24] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:15:34] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: add ensure support [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [07:15:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:19:05] (03PS2) 10Fabfur: hiera:benthos: partially revert benthos removal [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) [07:20:02] (03CR) 10Ayounsi: [C:03+2] Netbox: use standard STORAGE_BACKEND/CONFIG keys [puppet] - 10https://gerrit.wikimedia.org/r/983716 (https://phabricator.wikimedia.org/T310717) (owner: 
10Ayounsi) [07:20:40] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [07:21:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10047409 (10SLyngshede-WMF) [07:21:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [07:23:27] (03PS1) 10Slyngshede: data.yaml: Add toyofuku to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/1060338 (https://phabricator.wikimedia.org/T371650) [07:24:23] FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:41] FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:28:40] hey hey. This is probably outside of ops scope, but does anyone here know where wikimedia.de stuff is hosted / runs? 
I can't seem to connect to anything wikimedia.de (wiki.wikimedia.de, mattermost.wikimedia.de, www.wikimedia.de) [07:29:26] (03CR) 10Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:32:28] (03PS1) 10Kevin Bazira: ml-services: use cxserver host header in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) [07:33:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10047442 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03High [07:33:37] codders: yes definitely a question for WMDE folks [07:35:10] k - thanks! [07:36:05] (03CR) 10Ayounsi: [C:03+2] Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [07:36:31] (03CR) 10Vgutierrez: [C:04-1] haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:39:23] (03CR) 10Fabfur: [C:03+2] hiera:benthos: partially revert benthos removal [puppet] - 10https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [07:40:12] (03Merged) 10jenkins-bot: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [07:43:37] (03PS1) 10Kevin Bazira: ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) [07:43:42] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add role to mgmt devices - 
ayounsi@cumin1002" [07:44:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add role to mgmt devices - ayounsi@cumin1002" [07:46:17] (03CR) 10Cathal Mooney: [C:03+2] common: add dcaro user for access to cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [07:46:46] (03Merged) 10jenkins-bot: common: add dcaro user for access to cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [07:49:38] (03CR) 10Cathal Mooney: [C:03+2] common: add dcaro user for access to cloudsw (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [07:50:58] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: use fully qualified kafka cluster name [puppet] - 10https://gerrit.wikimedia.org/r/1060070 (owner: 10Filippo Giunchedi) [07:51:09] (03CR) 10Filippo Giunchedi: [C:03+2] webperf: use fully qualified kafka cluster names [puppet] - 10https://gerrit.wikimedia.org/r/1060069 (owner: 10Filippo Giunchedi) [07:51:29] (03PS1) 10Cathal Mooney: Remove taavi user from network devices [homer/public] - 10https://gerrit.wikimedia.org/r/1060379 [07:51:40] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: restore Benthos instances functionality [puppet] - 10https://gerrit.wikimedia.org/r/1060071 (owner: 10Filippo Giunchedi) [07:52:07] (03CR) 10Ayounsi: [C:03+2] Prometheus SSH probe: ignore network devices [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [07:53:12] (03CR) 10Cathal Mooney: [C:03+2] Remove taavi user from network devices [homer/public] - 10https://gerrit.wikimedia.org/r/1060379 (owner: 10Cathal Mooney) [07:53:43] (03Merged) 10jenkins-bot: Remove taavi user from network devices [homer/public] - 10https://gerrit.wikimedia.org/r/1060379 (owner: 10Cathal Mooney) [07:56:01] (03PS1) 10Fabfur: hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - 
10https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) [07:56:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [08:00:05] jnuche and brennen: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0800). [08:00:43] (03CR) 10Filippo Giunchedi: [C:03+1] hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - 10https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [08:00:58] hi, I'll be deploying the train in a few minutes [08:01:36] (03PS1) 10David Caro: cloudceph.osd: remove 1036 as we are not adding it yet [puppet] - 10https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) [08:02:39] (03PS1) 10Ayounsi: Revert "Prometheus SSH probe: ignore network devices" [puppet] - 10https://gerrit.wikimedia.org/r/1060382 [08:04:35] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3572/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [08:05:39] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962) [08:05:41] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [08:06:19] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [08:07:46] (03CR) 10David Caro: [V:03+1] "pcc" [puppet] - 
10https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [08:09:00] (03PS2) 10Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) [08:09:07] (03CR) 10David Caro: [C:03+2] "pcc looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [08:09:40] (03PS1) 10Ayounsi: Add role to type Netbox::Device::Location::BareMetal [puppet] - 10https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) [08:10:03] (03CR) 10Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:10:28] (03PS2) 10Ayounsi: Add role to type Netbox::Device::Location::BareMetal [puppet] - 10https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) [08:10:35] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:11:03] (03CR) 10Filippo Giunchedi: [C:03+1] Add role to type Netbox::Device::Location::BareMetal [puppet] - 10https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:12:45] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3574/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:13:32] (03CR) 10Vgutierrez: haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:13:44] (03CR) 
10Giuseppe Lavagetto: [V:03+1] "PCC output: https://puppet-compiler.wmflabs.org/output/1060198/3574/cp4044.ulsfo.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:16:12] (03PS9) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [08:17:19] (03PS3) 10Hashar: cumin: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) [08:17:35] (03CR) 10Ayounsi: [C:03+2] Add role to type Netbox::Device::Location::BareMetal [puppet] - 10https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:18:18] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.17 refs T366962 [08:18:21] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962 [08:18:45] (03CR) 10Hashar: "I have rebased by mistake but there is no other change :)" [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:19:37] (03PS1) 10Ayounsi: Revert "Add role to type Netbox::Device::Location::BareMetal" [puppet] - 10https://gerrit.wikimedia.org/r/1060386 [08:20:19] (03CR) 10Ayounsi: [C:03+2] Revert "Add role to type Netbox::Device::Location::BareMetal" [puppet] - 10https://gerrit.wikimedia.org/r/1060386 (owner: 10Ayounsi) [08:20:26] (03CR) 10Ayounsi: [C:03+2] Revert "Prometheus SSH probe: ignore network devices" [puppet] - 10https://gerrit.wikimedia.org/r/1060382 (owner: 10Ayounsi) [08:20:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:21:23] (03PS23) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [08:22:48] (03CR) 10Giuseppe Lavagetto: [V:03+1] haproxy: change behaviour for requestctl filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:25:10] (03PS3) 10Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) [08:25:25] (03PS10) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [08:26:09] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3575/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:31:26] (03PS1) 10Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) [08:31:47] !log openjdk-11 upgrades for bullseye rolled out to prod [08:31:52] (03CR) 10CI reject: [V:04-1] Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:33:55] (03CR) 10Vgutierrez: [C:03+1] haproxy: change behaviour for requestctl filters [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:34:23] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367856)', diff saved to https://phabricator.wikimedia.org/P67237 and previous config saved to /var/cache/conftool/dbconfig/20240807-083434-marostegui.json [08:35:11] (03CR) 10Elukey: [C:03+2] cumin: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:38:47] (03PS27) 10Elukey: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:39:23] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:26] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10047547 (10elukey) Rolled out the change to the hadoop cluster, this is the only error that I got: ` [2024-08-07T08:38:59] Unable to update host 'an-worker110... 
[08:41:28] (03PS2) 10Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) [08:42:01] (03PS3) 10Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) [08:42:20] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:44:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10047553 (10dcaro) cloudcephosd1035 has one drive that wrongly assigned as 'os raid': ` sdb... [08:45:28] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] haproxy: change behaviour for requestctl filters [puppet] - 10https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [08:46:22] (03PS1) 10Ayounsi: Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - 10https://gerrit.wikimedia.org/r/1060391 [08:46:52] (03CR) 10Elukey: Netbox script proxy: set to absent where possible (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [08:47:57] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - 10https://gerrit.wikimedia.org/r/1060391 (owner: 10Ayounsi) [08:49:23] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to 
https://phabricator.wikimedia.org/P67238 and previous config saved to /var/cache/conftool/dbconfig/20240807-084942-marostegui.json [08:50:11] (03CR) 10Elukey: [C:03+1] "Left a couple of comments related to the #TODOs, but the rest looks good! Feel free to merge anytime" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059250 (owner: 10Ayounsi) [08:51:00] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10047576 (10hashar) [08:51:19] (03CR) 10Elukey: [C:03+1] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 (owner: 10Ayounsi) [08:51:39] (03CR) 10Ayounsi: [C:03+2] Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - 10https://gerrit.wikimedia.org/r/1060391 (owner: 10Ayounsi) [08:53:18] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10047580 (10hashar) After discussing with Simon (`@SLyngshede-WMF`), the `jenkins-deploy` account hits so... 
[08:53:50] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [08:54:32] !log upgrade debmonitor-client to 0.4.0 fleetwide - T368744 [08:55:34] (03CR) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [08:55:37] (03PS1) 10Btullis: Add a record of the kerberos enablement of ifrahkh [puppet] - 10https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) [08:55:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:57:08] (03Merged) 10jenkins-bot: Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - 10https://gerrit.wikimedia.org/r/1060391 (owner: 10Ayounsi) [08:58:54] the debmonitor1003 failures are surely due to me rolling out the new debmonitor-client [08:59:14] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rollback adding role to mgmt devices - ayounsi@cumin1002" [08:59:14] it is updating a lot of things in the db (first time only that runs) and the server may suffer a bit [08:59:23] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rollback adding role to mgmt devices - ayounsi@cumin1002" [09:00:41] RESOLVED: [2x] 
ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:02:32] (03CR) 10Fabfur: [C:03+2] hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - 10https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [09:03:26] (03PS1) 10Slyngshede: P:idp More precise base_dn for user lookup [puppet] - 10https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) [09:04:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P67239 and previous config saved to /var/cache/conftool/dbconfig/20240807-090449-marostegui.json [09:12:13] (03PS2) 10Slyngshede: P:idp More precise base_dn for user lookup [puppet] - 10https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) [09:13:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3577/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) (owner: 10Slyngshede) [09:19:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367856)', diff saved to https://phabricator.wikimedia.org/P67240 and previous config saved to /var/cache/conftool/dbconfig/20240807-091956-marostegui.json [09:19:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [09:20:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [09:20:18] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Depooling db1219 (T367856)', diff saved to https://phabricator.wikimedia.org/P67241 and previous config saved to /var/cache/conftool/dbconfig/20240807-092018-marostegui.json [09:20:32] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [09:20:43] (03CR) 10Brouberol: [C:03+1] "Access was authorized in https://phabricator.wikimedia.org/T366558" [puppet] - 10https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [09:21:07] (03PS11) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [09:25:15] (03CR) 10Stevemunene: [C:03+1] Add a record of the kerberos enablement of ifrahkh [puppet] - 10https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [09:27:56] (03CR) 10Btullis: [V:03+2 C:03+2] Update the beta cluster scap targets for dumps [dumps/scap] - 10https://gerrit.wikimedia.org/r/1059891 (https://phabricator.wikimedia.org/T370465) (owner: 10Btullis) [09:28:13] (03CR) 10Btullis: [C:03+2] Update the mediawiki-installation dsh group with new beta snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/1059893 (https://phabricator.wikimedia.org/T370465) (owner: 10Btullis) [09:29:23] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:33] (03PS12) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [09:33:01] (03PS13) 10Effie Mouzeli: (DNM WIP) wikitech: 
de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [09:33:52] (03PS1) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) [09:33:52] (03CR) 10Klausman: "Feel free to redirect review to someone else." [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [09:34:23] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:36] (03PS11) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [09:40:41] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:54] (03CR) 10Btullis: hiera/manifest/partman: Add configuration for new ML hosts in codfw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [09:43:32] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netbox2002.codfw.wmnet [09:44:23] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:50] (03PS1) 10David Caro: parted: add a recipe to autouse the two smaller disks [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [09:45:22] (03CR) 10CI reject: [V:04-1] parted: add a recipe to autouse the two smaller disks [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [09:45:41] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:46:04] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2036.mgmt.codfw.wmnet with reboot policy GRACEFUL [09:46:47] (03PS1) 10Ayounsi: Remove Netbox 3 from MariaDB ferm ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957) [09:46:54] (03PS2) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) [09:48:02] (03CR) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [09:49:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2036.mgmt.codfw.wmnet with reboot policy GRACEFUL [09:53:05] (03CR) 10Elukey: [C:03+1] Remove Netbox 3 from MariaDB ferm ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [09:53:20] (03CR) 10Ayounsi: 
[C:03+2] Remove Netbox 3 from MariaDB ferm ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [09:54:10] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2037.mgmt.codfw.wmnet with reboot policy GRACEFUL [09:55:18] (03PS1) 10Giuseppe Lavagetto: haproxy: make indentation from go template more readable [puppet] - 10https://gerrit.wikimedia.org/r/1060406 [09:57:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2037.mgmt.codfw.wmnet with reboot policy GRACEFUL [09:57:51] (03CR) 10Btullis: [C:03+2] Add a record of the kerberos enablement of ifrahkh [puppet] - 10https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [09:58:07] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: make indentation from go template more readable [puppet] - 10https://gerrit.wikimedia.org/r/1060406 (owner: 10Giuseppe Lavagetto) [09:58:46] (03PS2) 10David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [09:59:59] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000) [10:00:26] (03PS3) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) [10:05:10] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:05:17] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2038.mgmt.codfw.wmnet with reboot 
policy GRACEFUL [10:05:41] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:06:34] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958 (10ArthurTaylor) 03NEW [10:06:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:06:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox2002.codfw.wmnet [10:07:41] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [10:07:47] <_joe_> jouncebot: now [10:07:47] For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000) [10:08:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2038.mgmt.codfw.wmnet with reboot policy GRACEFUL [10:09:11] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [10:09:13] !log dcausse@deploy1003 Started deploy [airflow-dags/search@5569f85]: search: bump rdf artifact to 0.3.146 [10:09:34] !log dcausse@deploy1003 Finished deploy [airflow-dags/search@5569f85]: search: bump rdf artifact to 0.3.146 (duration: 00m 21s) [10:09:48] (03CR) 10Klausman: [C:03+2] hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [10:11:13] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2039.mgmt.codfw.wmnet with reboot policy GRACEFUL [10:11:55] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netbox1002.eqiad.wmnet [10:11:59] (03PS1) 10Filippo Giunchedi: mw-jobrunner: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885) [10:12:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2039.mgmt.codfw.wmnet with reboot policy GRACEFUL [10:13:33] (03PS1) 10Ayounsi: Remove netbox 3 references [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) [10:14:23] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:53] (03PS3) 10David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [10:15:19] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-jobrunner: bump limit/request for statsd-exporter 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [10:17:34] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [10:18:25] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:19:23] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:54] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [10:20:20] (03PS2) 10Kosta Harlan: AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) [10:20:32] (03CR) 10Dreamy Jazz: [C:03+1] AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [10:20:41] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:46] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:22:49] jouncebot: nowandnext [10:22:50] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000) 
[10:22:50] In 0 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100) [10:23:21] Anyone mind if I deploy now? [10:24:11] (03PS4) 10David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [10:24:12] (03PS1) 10Effie Mouzeli: mw-mcrouter: balance resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060413 [10:24:23] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:35] !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-main-codfw cluster: Roll restart of jvm daemons. [10:24:36] (03CR) 10David Caro: "Tested on cloudcephosd1035, generates this:" [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [10:24:37] (03PS1) 10Ayounsi: Remove "netbox4" upgrade flag [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) [10:24:48] Dreamy_Jazz: lol you and I have a history of bumping into each other :p [10:24:57] I want to attempt to rollout a mcrouter change [10:24:59] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [10:25:10] Mine isn't particularly urgent, but would like to deploy today [10:25:32] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:25:32] !log 
ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:25:32] lets see how mine will go [10:25:33] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netbox1002.eqiad.wmnet [10:25:41] (03PS1) 10C. Scott Ananian: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) [10:25:50] So you'll want to do that first? [10:26:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: 10C. Scott Ananian) [10:27:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [10:27:27] I'll schedule it in to the backport window in a few hours :) [10:27:41] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove netbox1002 - ayounsi@cumin1002" [10:27:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove netbox1002 - ayounsi@cumin1002" [10:28:02] Dreamy_Jazz: yes please, it will take a while though [10:28:44] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netboxdb1002.eqiad.wmnet [10:30:56] !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-main-codfw cluster: Roll restart of jvm daemons. [10:31:26] Sure. I've placed my change into the backport window. 
[10:33:14] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: 10C. Scott Ananian) [10:34:19] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:34:37] (03PS2) 10Effie Mouzeli: mw-mcrouter: balance resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060413 [10:36:16] (03CR) 10JMeybohm: [C:03+1] mw-mcrouter: balance resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060413 (owner: 10Effie Mouzeli) [10:36:44] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: balance resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060413 (owner: 10Effie Mouzeli) [10:37:35] (03Merged) 10jenkins-bot: mw-mcrouter: balance resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060413 (owner: 10Effie Mouzeli) [10:37:42] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:37:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:37:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:37:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb1002.eqiad.wmnet [10:38:17] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netboxdb2002.codfw.wmnet [10:38:39] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:43:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:43:35] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:43:39] this is me^ [10:47:01] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:48:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:49:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [10:49:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:49:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb2002.codfw.wmnet [10:50:38] (03PS1) 10Fabfur: hiera:benthos: finally removing all hiera relative to Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) [10:50:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:52:17] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060416 
(https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [10:52:24] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [10:53:58] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [10:54:05] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [10:55:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:57:26] (03PS12) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [10:59:46] (03PS1) 10Giuseppe Lavagetto: haproxy: improve management of x-requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060417 [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100). nyaa~ [11:01:35] !log btullis@deploy1003 Started deploy [dumps/dumps@0d1f9be]: (no justification provided) [11:01:36] !log btullis@deploy1003 Finished deploy [dumps/dumps@0d1f9be]: (no justification provided) (duration: 00m 00s) [11:03:17] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047874 (10SLyngshede-WMF) p:05Triage→03Medium [11:08:27] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047902 (10SLyngshede-WMF) You shouldn't need access to the WMF group to access or contribute to repos/mediawiki. @dancy / @Jelto is there a mechanism in Gitlab to grant that access, or some alte... 
[11:10:35] Dreamy_Jazz: I am done, so you may want to use the current window [11:10:41] which is not mine :p [11:12:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:13:00] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:15:03] (03PS13) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [11:16:11] (03CR) 10Btullis: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [11:17:22] (03PS14) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [11:17:28] (03CR) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [11:17:33] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047933 (10Jelto) afaik there is [automation](https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/blob/main/group-management/sync-gitlab-group-with-ldap?ref_type=heads) which syncs ldap use... 
[11:17:45] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:19:14] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047936 (10SLyngshede-WMF) The WMF group is for staff and contractor, so I suspect there's another one. [11:29:30] Thanks! [11:29:34] jouncebot: nowandnext [11:29:34] For the next 0 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100) [11:29:34] In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300) [11:34:16] (03PS2) 10Ayounsi: Remove netbox 3 references [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) [11:34:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [11:34:27] (03PS2) 10Ayounsi: Remove "netbox4" upgrade flag [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) [11:34:29] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [11:36:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [11:37:09] (03CR) 10Ayounsi: raise AbortScript when needed (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059250 (owner: 10Ayounsi) [11:37:25] (03Merged) 10jenkins-bot: AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan) [11:37:29] (03CR) 10Ayounsi: [C:03+2] raise AbortScript when needed [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059250 (owner: 10Ayounsi) [11:37:59] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]] [11:38:42] (03Merged) 10jenkins-bot: raise AbortScript when needed [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059250 (owner: 10Ayounsi) [11:40:05] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:40:19] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:41:53] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:42:44] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Continuing with sync [11:47:12] !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]] (duration: 09m 13s) [11:47:18] Done my deploy [11:49:02] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:49:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:49:38] (03CR) 10Ayounsi: [C:03+2] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 (owner: 10Ayounsi) [11:49:44] (03CR) 10CI reject: [V:04-1] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 (owner: 10Ayounsi) [11:49:58] (03PS2) 10Ayounsi: ImportPuppetDB: Run Validate on VMs too 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 [11:52:17] (03CR) 10Ayounsi: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 (owner: 10Ayounsi) [11:53:24] (03Merged) 10jenkins-bot: ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059349 (owner: 10Ayounsi) [11:53:56] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:54:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:56:44] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:57:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:03:01] (03CR) 10FNegri: [C:04-1] "I would rename "partman/raid1-2dev-autodetect.cfg" to "partman/custom/cloudcephosd.cfg", for consistency with "partman/custom/cephosd.cfg"" [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [12:14:36] (03CR) 10FNegri: [C:04-1] partman: add a recipe for using the smallest 2 drives for cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [12:17:10] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/927986/1626/" [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [12:25:57] (03CR) 10FNegri: [C:04-1] partman: add a recipe for using the smallest 2 drives for cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [12:34:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:54] (03PS5) 10David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [12:34:55] (03CR) 10David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [12:40:48] (03CR) 10FNegri: [C:04-1] "the path should be partman/custom/cloudcephosd.cfg instead of partman/cloudcephosd.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [12:40:52] !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-main-eqiad cluster: Roll restart of jvm daemons. [12:40:57] <_joe_> !log adding conftool 3.2.2 to apt [12:42:28] <_joe_> uhm !log not working [12:42:33] <_joe_> is stashbot down? [12:47:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-main-eqiad cluster: Roll restart of jvm daemons. [12:48:00] seems to be. can someone with the right perms restart it? 
[12:54:00] (03PS15) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) [12:54:49] (03PS2) 10Giuseppe Lavagetto: haproxy: improve management of x-requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060417 [12:55:23] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: improve management of x-requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060417 (owner: 10Giuseppe Lavagetto) [12:56:20] (03CR) 10Elukey: [C:03+2] git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [12:57:08] (03PS1) 10Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300). nyaa~ [13:00:04] cscott and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:02] (03PS1) 10Giuseppe Lavagetto: haproxy: remove redundant "end" stanza [puppet] - 10https://gerrit.wikimedia.org/r/1060425 [13:01:13] I'm here [13:01:25] I am going to restart jenkins [13:01:47] I am waiting for a couple jobs to finish ;) [13:01:50] (03PS1) 10Ayounsi: Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) [13:01:51] (03PS6) 10David Caro: partman: use the same recipe for cloudcephosd than cephosd [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) [13:02:10] (I'm also at wikimania) [13:02:11] (03CR) 10David Caro: partman: use the same recipe for cloudcephosd than cephosd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [13:03:17] (03PS2) 10Giuseppe Lavagetto: haproxy: remove redundant "end" stanza [puppet] - 10https://gerrit.wikimedia.org/r/1060425 [13:03:59] cscott: looks like your change is in merge conflict https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060415 [13:04:01] (03CR) 10FNegri: [C:03+1] partman: use the same recipe for cloudcephosd than cephosd [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [13:04:08] most probably cause some other patch touched InitialiseSettings.php [13:05:04] (03CR) 10David Caro: [C:03+2] partman: use the same recipe for cloudcephosd than cephosd [puppet] - 10https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [13:05:16] (03CR) 10CI reject: [V:04-1] Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: 10Ayounsi) [13:05:40] (03PS2) 10Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) [13:06:49] (03PS2) 10C. Scott Ananian: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) [13:07:12] (03CR) 10Hashar: "I have rebased the change since Gerrit marked it as being in conflict." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: 10C. Scott Ananian) [13:08:13] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:08:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1037.eqi... [13:08:58] cscott: I am doing the backport [13:09:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: 10C. Scott Ananian) [13:09:13] !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [13:10:11] !log rollout openjdk-17 upgrades to prod [13:10:25] (03Merged) 10jenkins-bot: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: 10C. 
Scott Ananian) [13:10:44] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]] [13:11:23] !log Restarting CI Jenkins [13:11:37] stashbot is broken [13:11:47] I am looking for someone with access to restart it [13:11:51] if you are that person, please do it :) [13:12:03] this job is never ending [13:12:05] I don't have access :/ [13:12:15] hashar: sadly me neither, I just requested it as well [13:12:26] but I guess people in #wikimedia-cloud-admin would be able? [13:12:38] good idea, going there [13:12:48] dont tell them I have sent you! ;-] [13:13:12] I will tell them hashar told me to not tell them that hashar sent me [13:13:15] :] [13:13:22] * hashar grins [13:14:14] (03CR) 10Btullis: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [13:14:19] !log hashar@deploy1003 cscott, hashar: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:35] cscott: should be good now [13:14:40] well on debug servers [13:14:47] then I am not quite sure if anything has to be tested? [13:15:02] I can test hang on [13:15:05] !log Restarted CI Jenkins [13:15:05] hashar: Failed to log message to wiki. Somebody should check the error logs. [13:15:10] oops [13:15:20] it did log it though, ha [13:15:31] https://sal.toolforge.org/log/HQ_6LJEBKFqumxvtlfYt [13:15:32] yeah [13:15:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [13:15:33] elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[13:15:41] (03CR) 10Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [13:15:46] but we lost fivish hours of `!log` [13:16:01] which should probably be logged [13:16:05] yeah [13:16:45] the other message from stash bot is that it apparently cant write to wikitech.wikimedia.org [13:17:52] !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [13:17:52] elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:18:05] (03PS1) 10DCausse: search: index stems for mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) [13:18:07] !log stashbot got restarted since it was not processing anything [13:18:07] hashar: Failed to log message to wiki. Somebody should check the error logs. [13:18:58] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp3073*} and A:cp for 9.2.5-1wm2 [13:18:59] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:19:03] sukhe: interestingly I am a member of `stashbot` so I could have restarted it ;) [13:19:08] hashar: haha [13:19:27] Hashar: is my patch in codfw now? [13:19:34] (03PS3) 10Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) [13:19:36] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [13:20:00] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [13:20:23] cscott: it is only on the mwdebug servers but there is one in codfw as well? [13:20:41] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10048179 (10elukey) Buster and Bookworm rollouts done, no big issues registered. The only drawback is that due to the high volume of writes to the db (since we... [13:21:38] (03PS2) 10Ayounsi: Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) [13:21:38] (03PS1) 10Ayounsi: Add validators for console(server) and power ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) [13:22:20] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp3073*} and A:cp for 9.2.5-1wm2 [13:22:20] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:22:28] I am ignoring the stashbot error log, cause there are at least 3 tasks I could file as follow-ups [13:22:36] and well E_TOO_MANY_THINGS [13:22:46] hashar: dhinus is on it [13:22:53] cool ;) [13:23:05] thank you! [13:23:20] Hashar: ok ship it, tested and looks good [13:23:27] !log hashar@deploy1003 cscott, hashar: Continuing with sync [13:23:27] hashar@deploy1003: Failed to log message to wiki. Somebody should check the error logs. 
[13:23:30] * hashar ships [13:23:43] I restarted stashbot but it's only half alive :) -- it's now writing to sal.toolforge.org, but not to wiki SAL [13:23:54] (03PS1) 10DCausse: search: use the stem field when search mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) [13:24:13] !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [13:24:13] elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:24:24] (03PS2) 10DCausse: search: use the stem field when searching mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) [13:24:49] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: remove redundant "end" stanza [puppet] - 10https://gerrit.wikimedia.org/r/1060425 (owner: 10Giuseppe Lavagetto) [13:25:26] (03CR) 10CI reject: [V:04-1] Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: 10Ayounsi) [13:25:33] (03PS1) 10Ayounsi: Enable validators on Netbox-next for console(server) and power ports [puppet] - 10https://gerrit.wikimedia.org/r/1060435 (https://phabricator.wikimedia.org/T310590) [13:25:35] (03PS1) 10Ayounsi: Enable validators on Netbox for console(server) and power ports [puppet] - 10https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590) [13:25:36] (03CR) 10CI reject: [V:04-1] Add validators for console(server) and power ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:25:45] I checked the stashbot error logs and the exception is "mwclient.errors.NoWriteApi" [13:26:22] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - 
https://phabricator.wikimedia.org/T360356#10048193 (10elukey) 05Open→03Resolved a:03elukey [13:26:34] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [13:26:34] dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:26:52] jouncebot: now and next [13:26:52] For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300) [13:27:12] !log sudo cumin "lvs3009*" 'disable-puppet "rebooting" && systemctl stop pybal.service' [13:27:12] sukhe: Failed to log message to wiki. Somebody should check the error logs. [13:28:10] !log hashar@deploy1003 Finished scap: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]] (duration: 17m 26s) [13:28:11] hashar@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:28:13] (03PS1) 10AikoChou: ml-services: update readability model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) [13:28:22] T371823: Turn on wgKartographerParsoidSupport on all wikivoyage wikis - https://phabricator.wikimedia.org/T371823 [13:28:51] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [13:28:51] dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[13:29:24] (03PS1) 10Tiziano Fogli: icinga: add Tiziano Fogli to authorized_for_system_information, authorized_for_configuration_information, authorized_for_all_service_commands, authorized_for_all_host_commands [puppet] - 10https://gerrit.wikimedia.org/r/1060438 [13:30:06] (03CR) 10CI reject: [V:04-1] icinga: add Tiziano Fogli to authorized_for_system_information, authorized_for_configuration_information, authorized_for_all_service_commands, authorized_for_all_host_commands [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli) [13:30:18] (03CR) 10Filippo Giunchedi: [C:03+1] hiera:benthos: finally removing all hiera relative to Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:30:45] the other scheduled patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1056146 was deployed earlier today [13:31:04] by Dreamy_Jazz ;) [13:31:15] Yeah. It was deployed already. [13:31:19] !log UTC afternoon backport window is completed [13:31:20] hashar: Failed to log message to wiki. Somebody should check the error logs. [13:31:23] \o/ [13:32:56] (03PS1) 10Fabfur: cache:benthos: remove Benthos references from cache files [puppet] - 10https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) [13:34:14] hnowlan and I have a statsd-exporter resource change to deploy to k8s then scap test, ok to do it now hashar even though the window hasn't closed yet technically ? 
[13:34:23] this guy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060411?usp=email [13:35:45] I'll take that as a yes [13:36:01] (03CR) 10Filippo Giunchedi: [C:03+2] mw-jobrunner: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [13:36:13] (03CR) 10Fabfur: [C:03+2] hiera:benthos: finally removing all hiera relative to Benthos [puppet] - 10https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:38:07] (03PS2) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 [13:38:48] !log filippo@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [13:38:49] filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:39:00] !log filippo@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [13:39:00] filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:39:07] !log filippo@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [13:39:08] filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:39:17] !log filippo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [13:39:17] filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs. 
[13:39:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:39:48] ok stashbot is busted but we're good otherwise [13:39:59] (03CR) 10Klausman: [C:03+1] ml-services: update readability model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [13:42:25] !log hnowlan@deploy1003 Started scap sync-world: sync to test mw-jobrunner resource increase [13:42:25] hnowlan@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:43:17] (03CR) 10Filippo Giunchedi: [C:03+1] cache:benthos: remove Benthos references from cache files [puppet] - 10https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:43:55] !log hnowlan@deploy1003 Finished scap: sync to test mw-jobrunner resource increase (duration: 02m 22s) [13:43:55] hnowlan@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:44:27] 07Puppet, 06Release-Engineering-Team, 13Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277#10048286 (10hashar) 05Open→03Resolved The series of patch has led to the removal of `umask` from `git::clone` In roughly the order the patc... [13:45:57] grafana down? [13:46:07] yeah [13:46:20] curious, checking [13:46:33] mmhh we're back ? [13:46:39] back indeed yep [13:46:54] spike in 503s [13:46:57] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:46:57] dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[13:47:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048325 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1037.eqiad.w... [13:47:24] (03CR) 10Fabfur: [C:03+2] cache:benthos: remove Benthos references from cache files [puppet] - 10https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: 10Fabfur) [13:47:34] (03PS4) 10Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) [13:50:58] (03PS5) 10Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) [13:51:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet [13:51:02] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:54:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet [13:54:18] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:55:43] !log start pybal on lvs3009 [13:55:43] sukhe: Failed to log message to wiki. Somebody should check the error logs. 
[13:56:12] (03PS1) 10Clare Ming: Fix labs config for Metrics Platform vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) [13:59:21] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [14:00:05] (03PS1) 10DCausse: search: use mul fallback for manually-tuned search profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1400) [14:00:21] (03PS1) 10David Caro: cloudcephosd: use the new partitions on the new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344) [14:00:56] (03PS2) 10David Caro: cloudcephosd: use the new partitions on the new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344) [14:01:40] !log import Jenkins 2.462.1 on bullseye-wikimedia:thirdparty/ci [14:01:41] elukey: Failed to log message to wiki. Somebody should check the error logs. [14:01:51] (03CR) 10David Caro: [C:03+2] cloudcephosd: use the new partitions on the new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344) (owner: 10David Caro) [14:03:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:03:51] brouberol@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:04:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:04:04] brouberol@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:07:49] Why no server admin logs? 
[14:10:19] (03CR) 10Phuedx: [C:03+1] "Running" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:13:40] (03PS1) 10Klausman: knative-serving: Switch components to use Calic Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 [14:14:23] (03PS2) 10Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 [14:21:28] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) [14:21:29] jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:22:05] 07Puppet, 06Infrastructure-Foundations, 06Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980 (10hashar) 03NEW [14:22:22] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 53s) [14:22:22] jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:22:56] Dreamy_Jazz: https://phabricator.wikimedia.org/T371977 [14:23:58] Thanks [14:23:58] (03Abandoned) 10Arlolra: Enabled KartographerParsoidSupport on (cs|hi|shn|ps|tr)wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060186 (https://phabricator.wikimedia.org/T371936) (owner: 10Arlolra) [14:24:00] !log sudo cumin "lvs3008*" 'disable-puppet "rebooting" && systemctl stop pybal.service' [14:24:00] sukhe: Failed to log message to wiki. Somebody should check the error logs. 
[14:24:23] FIRING: JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:28] (03PS8) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) [14:25:12] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) [14:25:13] jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:26:24] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 01m 12s) [14:26:25] jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [14:27:39] (03CR) 10Btullis: "I have updated the patch so that it links to the relevant ticket for the current work." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: 10Btullis) [14:29:23] RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3578/co" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: 10Btullis) [14:31:28] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10048445 (10elukey) Sent an email to all SREs, the move will happen on Aug 12th 13:00 UTC. [14:33:38] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Openjdk upgrade - elukey@cumin1002 [14:33:39] elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[14:39:23] FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:42] (03CR) 10Scott French: [C:03+1] mediawiki: Bump ttlSecondsAfterFinished for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060184 (owner: 10RLazarus) [14:49:06] (03PS2) 10Filippo Giunchedi: data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching [alerts] - 10https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) [14:50:11] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3008.esams.wmnet [14:50:11] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [14:50:51] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T371923#10048515 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm alerts cleared [14:53:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3008.esams.wmnet [14:53:20] sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[14:54:29] (03PS1) 10Papaul: Add new payments node to DNS file [dns] - 10https://gerrit.wikimedia.org/r/1060457 [14:56:06] (03CR) 10Papaul: [C:03+2] Add new payments node to DNS file [dns] - 10https://gerrit.wikimedia.org/r/1060457 (owner: 10Papaul) [14:57:14] (03CR) 10Klausman: [C:03+1] ml-services: use cxserver host header in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [14:57:40] (03CR) 10Klausman: [C:03+1] ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [14:58:08] !log start pybal on lvs3008 [14:58:08] sukhe: Failed to log message to wiki. Somebody should check the error logs. [14:59:23] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:20] (03PS1) 10Cathal Mooney: Add mtr to standard packages for WMF hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060458 [15:02:53] (03CR) 10Brouberol: [C:03+1] Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: 10Btullis) [15:03:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10048545 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt 2004 and 2005 are ready when. you get them online we can decom 2003 and rack/install 200... 
[15:11:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [15:12:17] (03CR) 10Kevin Bazira: [C:03+2] ml-services: use cxserver host header in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [15:13:24] (03Merged) 10jenkins-bot: ml-services: use cxserver host header in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [15:13:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048594 (10Jhancock.wm) this one has been out of warranty for more than a half a year. We do have a spare DIMM on hand to repl... [15:14:55] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: 10Btullis) [15:15:13] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [15:15:14] kevinbazira@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [15:20:30] (03CR) 10Kevin Bazira: [C:03+2] ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [15:21:17] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye [15:21:17] andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[15:21:29] 10ops-codfw, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984 (10RobH) 03NEW [15:21:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... [15:21:35] (03Merged) 10jenkins-bot: ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [15:21:46] 10ops-codfw, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10048637 (10RobH) [15:23:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048641 (10Jhancock.wm) a:03Jhancock.wm [15:25:08] 10ops-codfw, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10048645 (10RobH) a:03ABran-WMF @ABran-WMF, This racking task lists you as your teams point of contact. As this has now been escalated to order, the new wor... [15:25:11] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:25:11] kevinbazira@deploy1003: Failed to log message to wiki. Somebody should check the error logs. 
[15:33:24] (03CR) 10AikoChou: [C:03+2] ml-services: update readability model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [15:34:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048669 (10klausman) @Jhancock.wm machine is drained, feel free to proceed. [15:34:23] (03Merged) 10jenkins-bot: ml-services: update readability model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [15:36:13] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1038.eqiad.wmnet with OS bullseye [15:36:15] andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:36:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... [15:36:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:37:05] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye [15:37:05] andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[15:37:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... [15:40:18] !log stop pybal on lvs2013 for server reboot [15:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:09] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10048688 (10Dzahn) All sounds good. Thank you! Also T371930#10047573 sounds like good progress is already... [15:43:40] I have hacked stashbot to work around the problem from T371977 that this week's train has triggered in the mwclient python library. My hack is very hacky, but should be fine until a proper fix is introduced upstream. [15:43:41] T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977 [15:47:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:49:31] (03CR) 10Dzahn: [C:03+1] "looks all good and approved to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1060338 (https://phabricator.wikimedia.org/T371650) (owner: 10Slyngshede) [15:51:02] (03CR) 10Elukey: "LGTM! Could you run the puppet compiler on some random nodes (including librenms etc..) so we double check that we are good?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1060458 (owner: 10Cathal Mooney) [15:52:50] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:54:16] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [15:54:41] (03PS2) 10Dzahn: zuul: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [15:56:32] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1057930/3581/contint2002.wikimedia.org/change.contint2002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [15:57:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [15:57:05] (03PS1) 10Ahmon Dancy: mw-web: train-dev: Supply placeholder for STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060464 [15:57:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048716 (10klausman) 05Open→03Resolved Machine has had DIMM replaced and is back in service. 
[15:58:38] (03PS3) 10Dzahn: zuul: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [15:59:53] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987 (10RobH) 03NEW [16:00:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10048757 (10JMeybohm) [16:01:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Openjdk upgrade - elukey@cumin1002 [16:01:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10048760 (10JMeybohm) [16:03:14] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10048763 (10RobH) a:03jijiki Effie, The workflow for racking tasks has changed this quarter, once I create the racking task I assign it to the SRE sub-teams point of contact (for this task... [16:03:22] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10048767 (10dancy) >>! In T371958#10047933, @Jelto wrote: > afaik there is [automation](https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/blob/main/group-management/sync-gitlab-group-with-... [16:03:44] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10048782 (10RobH) [16:03:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10048783 (10JMeybohm) The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best. 
[16:03:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10048784 (10JMeybohm) The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best. [16:05:14] (03CR) 10Vgutierrez: ACMEChiefConfig: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1055232 (owner: 10Ncmonitor) [16:08:10] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [16:09:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048826 (10VRiley-WMF) a:03VRiley-WMF [16:11:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [16:12:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048831 (10VRiley-WMF) Warranty on server has expired. Located another SSD from Decommed servers. Swapped drive in slot 6 as per iDRAC error indicated. [16:13:21] (03CR) 10Dzahn: [V:03+1 C:03+1] "at last a simple one again: https://puppet-compiler.wmflabs.org/output/1057928/3583/contint2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:13:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048834 (10Ladsgroup) let me depool it. Let me know when you want it shut off. [16:14:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P67246 and previous config saved to /var/cache/conftool/dbconfig/20240807-161452-ladsgroup.json [16:15:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048835 (10Ladsgroup) depooled. 
[16:15:35] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1038.eqiad.wmnet with OS bullseye [16:15:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... [16:17:11] (03CR) 10Scott French: [C:03+2] "This seems like a reasonable fix, but also suggests a subtle difference in the inherited configuration between mw-web and mw-debug." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060464 (owner: 10Ahmon Dancy) [16:18:13] (03Merged) 10jenkins-bot: mw-web: train-dev: Supply placeholder for STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060464 (owner: 10Ahmon Dancy) [16:20:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048839 (10Ladsgroup) I will do some checks before repooling [16:21:16] (03PS1) 10BryanDavis: Revert "Drop writeapi flag from siteinfo API" [core] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414) [16:27:03] !log start pybal on lvs2013 [16:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:53] !log milimetric@deploy1003 Started deploy [analytics/refinery@0d25645]: Syncing browser general script, and refinery-source 0.2.45 apparently [16:37:41] !log puppetserver1002 systemctl start dump_ip_reputation [16:37:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:23] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:36] !log stop pybal on lvs2014 for server reboot [16:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:49] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10048986 (10Jhancock.wm) [16:54:04] jouncebot nowandnext [16:54:05] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [16:54:05] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1700) [16:56:23] i'm going to roll out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1060468 for a train blocker [16:56:28] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10049000 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @Papaul ready for your part civi2002 ETH1 <> FASW-C8A eth-0/0/37 ETH2 <> FASW-C8B eth-1/0/37 frpig200... 
[16:56:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414) (owner: 10BryanDavis) [16:58:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10049008 (10Jhancock.wm) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1700) [17:03:31] (03PS2) 10DCausse: search: index stems for mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) [17:03:31] (03PS3) 10DCausse: search: use the stem field when searching mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) [17:03:31] (03PS2) 10DCausse: search: use mul fallback for manually-tuned search profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) [17:07:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye [17:07:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull... 
[17:08:23] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [17:11:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [17:11:48] (03CR) 10Dzahn: [V:03+1 C:03+2] ci: replace ferm::service with firewall::service in data_rsync [puppet] - 10https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:14:50] !log start pybal on lvs2014 [17:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:32] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop - all that happens here is that a config file got renamed (underscore vs hyphen) - no change to actual firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:16:06] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049067 (10Jhancock.wm) [17:17:01] !log stop pybal on lvs1019 for server reboot [17:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:58] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049070 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @Papaul this one is ready for you. 
ETH1 <> FASW-C8A eth-0/0/36 ETH2 <> FASW-C8B eth-0/1/36 [17:27:13] (03Merged) 10jenkins-bot: Revert "Drop writeapi flag from siteinfo API" [core] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414) (owner: 10BryanDavis) [17:27:31] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]] [17:28:38] T115414: Remove the ability to disable the API with $wgEnableAPI - https://phabricator.wikimedia.org/T115414 [17:28:38] T294397: Drop writeapi MediaWiki right - https://phabricator.wikimedia.org/T294397 [17:28:38] T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977 [17:29:14] !log milimetric@deploy1003 Finished deploy [analytics/refinery@0d25645]: Syncing browser general script, and refinery-source 0.2.45 apparently (duration: 54m 21s) [17:29:44] !log brennen@deploy1003 brennen, bd808: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:30:34] !log milimetric@deploy1003 Started deploy [analytics/refinery@0d25645] (thin): Syncing browser general script, and refinery-source 0.2.45 apparently [17:31:08] !log brennen@deploy1003 brennen, bd808: Continuing with sync [17:34:56] !log milimetric@deploy1003 Finished deploy [analytics/refinery@0d25645] (thin): Syncing browser general script, and refinery-source 0.2.45 apparently (duration: 04m 21s) [17:35:37] !log brennen@deploy1003 Finished scap: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]] (duration: 08m 06s) [17:35:41] T115414: Remove the ability to disable the API with $wgEnableAPI - https://phabricator.wikimedia.org/T115414 [17:35:42] T294397: Drop writeapi MediaWiki right - 
https://phabricator.wikimedia.org/T294397 [17:35:42] T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977 [17:35:55] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10049124 (10Dzahn) Turns out there is another jenkins SSH key here: https://phabricator.wikimedia.org/au... [17:36:03] (03CR) 10Bking: [C:03+1] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [17:40:07] (03CR) 10Ssingh: [C:03+1] "I am not sure if this is supposed to go under @ or under a specific spop1024 record so going with this for now and we can see. Since it's " [dns] - 10https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: 10Dwisehaupt) [17:40:14] (03CR) 10Ssingh: [C:03+2] Add yahoo-verification-key for Complaint Feedback Loop [dns] - 10https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: 10Dwisehaupt) [17:40:26] (03PS2) 10Ssingh: Add yahoo-verification-key for Complaint Feedback Loop [dns] - 10https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: 10Dwisehaupt) [17:41:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:41:29] !log running authdns-update for Yahoo CFL TXT record: T370963 [17:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:31] T370963: Add a TXT record to the Yahoo sending domain - https://phabricator.wikimedia.org/T370963 [17:44:44] (03CR) 10Ssingh: "recheck" [dns] - 
10https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: 10Dwisehaupt) [17:45:21] (03PS1) 10Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) [17:49:56] (03PS1) 10Ssingh: wikimedia.org: dummy change to check auto-review [dns] - 10https://gerrit.wikimedia.org/r/1060484 [17:50:03] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049142 (10Papaul) ` [edit interfaces interface-range disabled] - member ge-0/0/36; - member ge-1/0/36; [edit interfaces interface-range vlan-administration] member... [17:53:40] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1060483/3584/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:54:17] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye [17:54:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq... [17:56:57] (03PS2) 10Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) [17:57:35] (03CR) 10Ssingh: "Nice, that worked. Abandoning." 
[dns] - 10https://gerrit.wikimedia.org/r/1060484 (owner: 10Ssingh) [17:58:29] (03Abandoned) 10Ssingh: wikimedia.org: dummy change to check auto-review [dns] - 10https://gerrit.wikimedia.org/r/1060484 (owner: 10Ssingh) [18:00:04] jnuche and brennen: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1800). [18:02:12] (03PS24) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:05:21] (03PS1) 10BCornwall: dummy change to check auto-review [dns] - 10https://gerrit.wikimedia.org/r/1060485 [18:06:12] (03CR) 10Pppery: "Ideally some of these domains would point to more specific places rather than wikimedia.org, like wiktionary.app -> wiktionary.org instead" [puppet] - 10https://gerrit.wikimedia.org/r/1055231 (owner: 10Ncmonitor) [18:06:29] (03CR) 10Ebernhardson: [C:03+1] search: index stems for mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [18:06:50] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:07:34] (03PS3) 10Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) [18:09:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [18:10:56] (03PS4) 10Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 
10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) [18:11:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [18:12:05] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [18:12:33] (03CR) 10Dzahn: "@hashar No more need to do the resolve part and no more need to join the array elements. It all just works now when passing an array strai" [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:12:39] (03CR) 10Dzahn: [V:03+1] ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:13:10] (03Abandoned) 10Ssingh: dummy change to check auto-review [dns] - 10https://gerrit.wikimedia.org/r/1060485 (owner: 10BCornwall) [18:14:17] (03PS25) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:14:40] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [18:15:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049179 (10XiaoXiao-WMF) Hi! I have followed the email instruction and I have done this step on May 23rd, and now I log into the stat machine I still... 
[18:15:26] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049184 (10XiaoXiao-WMF) 05Resolved→03Open [18:17:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049185 (10XiaoXiao-WMF) a:05Clement_Goubert→03None [18:17:26] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [18:18:37] (03PS26) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:19:26] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10049204 (10Jclark-ctr) p:05Triage→03Low a:03Jclark-ctr These can be ignored i am process of imaging these servers and are single power at this time [18:20:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [18:21:59] (03PS27) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:22:59] !log milimetric@deploy1003 Started deploy [analytics/refinery@fe20690]: Syncing browser general script hive version [18:28:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye [18:29:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye... 
[18:30:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye [18:30:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull... [18:32:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10049244 (10Jclark-ctr) a:03VRiley-WMF [18:32:05] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1038.eqiad.wmnet with OS bullseye [18:32:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049245 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.... 
[18:32:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [18:33:36] !log start pybal on lvs1019 [18:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [18:36:54] (03CR) 10Scott French: [C:03+2] "I'll go ahead and approve / merge this now, as there is no change in sudo rights with this patch - only preparation for a change in entryp" [puppet] - 10https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [18:39:05] !log milimetric@deploy1003 Finished deploy [analytics/refinery@fe20690]: Syncing browser general script hive version (duration: 16m 05s) [18:40:26] !log stop pybal on lvs1018 for server reboot [18:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye [18:45:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye... 
[18:55:19] (03CR) 10Andrew Bogott: [C:03+2] wmfsink: hook delete.end rather than delete.start [puppet] - 10https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [18:55:38] (03PS2) 10Andrew Bogott: wmfsink: hook delete.end rather than delete.start [puppet] - 10https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) [18:55:38] (03PS10) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [18:55:38] (03PS12) 10Andrew Bogott: wmf_sink: replace targeted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:56:14] (03PS1) 10Brennen Bearnes: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) [18:56:21] (03PS2) 10Jforrester: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: 10Brennen Bearnes) [18:56:43] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [18:57:17] James_F: jinx [18:57:34] Oops, sorry for the clash pick brennen. The perils of doing this from my phone at Wikimania. :-) [18:57:50] Thank you for looking after the train! [18:58:06] i shall deploy, you go enjoy wikimania. 
:) [18:59:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [19:00:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: 10Brennen Bearnes) [19:00:39] (03CR) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:00:49] !log start pybal on lvs1018 [19:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:25] !log stop pybal on lvs1017 for server reboot [19:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:06] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:10:00] (03Merged) 10jenkins-bot: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: 10Brennen Bearnes) [19:10:22] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]] [19:10:30] T371986: TypeError: Argument 1 passed to PendingChanges::parseParams() must be of the type string, null given - https://phabricator.wikimedia.org/T371986 [19:11:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt gerrit1004 - jclark@cumin1002" [19:11:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt gerrit1004 - jclark@cumin1002" [19:11:11] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [19:11:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:12:29] !log brennen@deploy1003 brennen: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:14:13] !log brennen@deploy1003 brennen: Continuing with sync [19:18:46] !log brennen@deploy1003 Finished scap: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]] (duration: 08m 23s) [19:18:53] T371986: TypeError: Argument 1 passed to PendingChanges::parseParams() must be of the type string, null given - https://phabricator.wikimedia.org/T371986 [19:29:09] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [19:32:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [19:33:04] !log start pybal on lvs1017 [19:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:38:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host gerrit1004.wikimedia.org with OS bookworm [19:38:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm [19:39:27] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@049c09e]: workaround process_sparql_query oom issues [19:39:48] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@049c09e]: workaround process_sparql_query oom issues (duration: 00m 20s) 
[19:39:52] 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T372001 (10phaultfinder) 03NEW [19:42:14] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [19:43:53] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049413 (10Jclark-ctr) [19:45:36] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job [19:46:17] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job (duration: 00m 41s) [19:47:33] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job [19:47:36] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job (duration: 00m 02s) [19:51:17] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@216348d]: (no justification provided) [19:52:04] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@216348d]: (no justification provided) (duration: 00m 47s) [19:52:48] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: (no justification provided) [19:53:47] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: (no justification provided) (duration: 00m 59s) [19:55:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage [19:59:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T2000). [20:00:04] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] i will self-deploy! [20:00:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [20:01:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:01:38] (03Merged) 10jenkins-bot: Fix labs config for Metrics Platform vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [20:02:25] I'll hang out for a little bit if anyone needs anything [20:04:41] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt vrts1003 - jclark@cumin1002" [20:04:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt vrts1003 - jclark@cumin1002" [20:04:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:08:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host vrts1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:27] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: (no justification provided) [20:11:31] !log milimetric@deploy1003 Finished deploy 
[airflow-dags/analytics@049c09e]: (no justification provided) (duration: 00m 03s) [20:15:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1004.wikimedia.org with OS bookworm [20:16:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049555 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm execut... [20:17:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host gerrit1004.wikimedia.org with OS bookworm [20:17:32] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm [20:19:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:19:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:20:08] hi [20:20:33] !incidents [20:20:33] 4954 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [20:20:33] 4955 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-f4-eqiad.mgmt.eqiad.wmnet) [20:20:44] !ack 4954 [20:20:44] 4954 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [20:20:53] !ack 4955 [20:20:53] 4955 (ACKED) Primary inbound port 
utilisation over 80% (paged) global noc (cloudsw1-f4-eqiad.mgmt.eqiad.wmnet) [20:21:02] !log end of UTC late backport window [20:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:04] that is nice [20:21:17] that is the first time I notice sirenbot and it LOOKS SO RAD [20:21:35] still looking [20:23:28] here. acked that from mobile. limited to cloud [20:24:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:24:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:24:50] :P [20:25:05] and that was that... [20:26:04] well that was a spike alright, looking at librenms [20:29:28] looking at the netstats section for that device.. i don't even see it? [20:30:22] mutante: this https://librenms.wikimedia.org/graphs/to=1723062300/device=242/type=device_bits/from=1722975900/legend=no/ [20:31:16] it's only the management switch.. [20:31:16] mutante: anything from the cloud folks? [20:31:44] 20:31 < andrewbogott> We have a bad switch so are migrating lots of things away from it. In theory that's not disruptive [20:31:47] :) [20:31:50] :) [20:31:57] thanks for following up [20:31:58] mutante: the management switch isn't pushing 60Gbit [20:32:19] the context is T371878 [20:32:19] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878 [20:32:23] for what I'm doing [20:32:30] but right now all I'm doing is very gradually pooling new ceph nodes [20:32:41] that's the prod sw1-f4 being polled on its management IPs [20:32:53] ok, so nothing to do with me it sounds like? 
[20:33:21] is pooling the ceph nodes causing data to be resilvered? [20:33:22] well, "bad switch" and reboot of the exact device that just alerted [20:33:38] I have to run for daycare pickup. I will be back later [20:34:26] it was "d5" and "f4" [20:34:29] cdanis: yes, it rebalances whenever new drives are added. [20:34:33] andrewbogott: something was nearly maxing out the 40G interconnect between cloudsw1-f4 and cloudsw1-d5 https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/ [20:34:40] so I'm guessing that was Ceph [20:35:15] possible although I'm not sure how we'd get to 40G [20:35:24] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage [20:35:29] there's only one new host coming online and it only has a 10G nic [20:35:38] and usually cpu bottlenecks before network bandwidth [20:35:44] well, *something* exceeded it -- if you look at the errors on the other side of the port, there were a lot of discards https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/ [20:35:48] is it still happening? 
[20:36:24] no, but you've had two large spikes of discards in the past 24h [20:36:35] ok [20:37:00] that could be ceph, maybe coupled with that badly-behaving switch [20:37:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host vrts1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:39:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host vrts1003.eqiad.wmnet with OS bookworm [20:39:08] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host vrts1003.eqiad.wmnet with OS bookworm [20:39:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage [20:40:18] so I have no hands-on experience with the system, but, I thought one of the drawbacks of Ceph was that the CRUSH algorithm often caused a lot of churn when fresh nodes/disks were added to the system? [20:40:24] do you think that might be happening here andrewbogott [20:41:07] It's definitely causing churn but only a reasonable amount (according to the mon: "recovery: 536 MiB/s, 134 objects/s") [20:41:18] hm [20:41:21] But if the switch malfunctions and discards 99% of our traffic then all bets are off [20:42:16] cursed switch info is at https://phabricator.wikimedia.org/T371879 [20:42:22] *both* switches were saying "this link is 36Gbps+", but only one switch was saying "I'm discarding traffic because my output buffer is full" ... 
which is expected when you're saturating such a link [20:42:30] ah, I see [20:42:41] * andrewbogott looks for a timestamp [20:43:10] sees reports like " [20:43:10] Ceph went haywire after a switch hiccup [20:43:22] mutante: yeah I think Ceph caused a network saturation event [20:43:37] distributed storage systems are often very good at that :) [20:44:05] cdanis: ack, thanks for that. at least it felt like it might be related to maintenance [20:44:47] as Andrew said, in combination with the switch issue [20:45:39] if BFD is running over the links that are saturating, then, Ceph *is* the "switch issue" [20:45:43] this is about the alert at 15:24 right? [20:45:45] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10049671 (10phaultfinder) [20:45:46] is what I am saying, mutante [20:46:11] I didn't repool the thing I am repooling until 20 minutes later than that [20:46:31] hm [20:46:37] But there has been some amount of ceph rebalancing ongoing for several days [20:47:07] andrewbogott: alert went out at 8:19 UTC [20:47:20] (and of course adding new nodes isn't unusual, we have hundreds of disks in play and they were all added sometime) [20:49:05] cdanis: but there is an actual action that was taken by Cathal per the comment "things remain stable since the changes earlier on" [20:49:16] mutante: do you mean 20:19 UTC? [20:49:44] andrewbogott: yes [20:49:48] so 30 minutes ago [20:49:54] yes [20:50:24] I don't think I was doing anything interesting then other than waiting for a previous drive to finish rebalancing which it had been doing for an hour+ at that point. But let me look in the logs some more [20:52:49] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049699 (10Dzahn) We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 today. Shortly after it resolved. But somethi... 
[20:53:11] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@4cf9922]: (no justification provided) [20:53:25] andrewbogott: the real question is what you were doing at 20:06 [20:53:29] https://grafana.wikimedia.org/goto/aqyMilrIg?orgId=1 [20:53:50] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@4cf9922]: (no justification provided) (duration: 00m 38s) [20:53:54] which is when the cloudcephosd hosts themselves started reporting their network usage to be 120Gbps+ [20:54:01] gigabit/second [20:54:10] which is almost definitely a problem for your switches [20:54:57] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049702 (10Dzahn) {F57154133} [20:55:06] have you been rebalancing these hosts 'gradually' since about 2024-08-06 16:12? because that's when the crazy spikes in cloudcephosd self-reported NIC usage begin https://grafana.wikimedia.org/goto/6Ay4ilrIg?orgId=1 [20:55:39] there is also https://phabricator.wikimedia.org/T371869 [20:56:02] cross-switch link saturation would absolutely explain that as well, potentially [20:56:16] yep, yesterday (my AM) was when we started evacuating things that use that switch so we can upgrade and reboot it. [20:56:17] and, the thing that BFD does is it tells the control plane about neighbor links that are dropping packets [20:56:40] and I don't think we have any QoS for it (or anywhere) atm [20:56:43] "we have increased the timeouts and changed the LACP mode from 'fast' to 'slow' keepalive messages and that seems to have stabilized the network " [20:56:47] yeah... 
[20:56:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1004.wikimedia.org with OS bookworm [20:57:06] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm executed with errors: - ge... [20:57:07] I would guess you are wrecking the network with microbursts at the beginning of each rebalance [20:57:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts1003.eqiad.wmnet with reason: host reimage [20:57:23] anyway I'm sorry, I have to go [20:57:28] daycare closes soon :) [20:57:32] could be if the 'decide what to do' stage is somehow not throttled properly [20:57:44] Best to consult topranks about all this during overlapping hours [20:58:13] please feel free to cc me on the tasks as well, if you want :) [20:58:16] anyway afk [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T2100) [21:02:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts1003.eqiad.wmnet with reason: host reimage [21:09:18] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049724 (10Dzahn) Even though the comment here says the cookbook failed.. I can see gerrit1004 is up on mgmt interface. I can also login as root on mgmt. Only thing that s... 
[21:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host vrts1003.eqiad.wmnet with OS bookworm [21:19:59] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049745 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host vrts1003.eqiad.wmnet with OS bookworm executed with errors: - vrts1003... [21:24:45] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10049753 (10bd808) 05Open→03Invalid T371888#10049750 [21:29:07] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049761 (10Jhancock.wm) [21:30:42] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049766 (10Jhancock.wm) a:05Jhancock.wm→03Papaul this one server is ready for @Papaul frdc2004 ETH1 <> FASW-C8A eth-0/0/20 ETH2 <> FASW-C8B eth-1/0/20 [21:41:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:41:41] (03PS5) 10Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - 10https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) [21:44:59] (03PS1) 10Dzahn: gerrit: increase allowed requests from 300 to 600 for throttling [puppet] - 10https://gerrit.wikimedia.org/r/1060502 
(https://phabricator.wikimedia.org/T365259) [21:46:34] (03CR) 10Dzahn: [C:03+2] "Nothing gets actually dropped - it's just to observe the content of the created host sets." [puppet] - 10https://gerrit.wikimedia.org/r/1060502 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [21:52:18] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049806 (10Dzahn) This appears to be T371653. I reopened that ticket and left a comment. Meanwhile I manually changed the status for this host to "active" in netbox. So I t... [21:53:04] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049811 (10Dzahn) Issue above same as T369671#10049724 Manually changed the status to "active" in netbox. [22:07:05] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:32:44] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) [22:35:24] (03CR) 10CI reject: [V:04-1] scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [22:36:23] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) [23:00:09] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049925 (10Papaul) [23:07:34] 10ops-codfw, 06SRE, 
06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049939 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member "ge-[0-1]/0/20"; [edit interfaces interface-range vlan-fundr... [23:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1060508 [23:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1060508 (owner: 10TrainBranchBot) [23:45:30] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049961 (10Jclark-ctr) [23:46:32] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049962 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr