[00:00:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage
[00:01:12] !log krinkle@deploy2002 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 06m 37s)
[00:02:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2098.codfw.wmnet with reason: host reimage
[00:03:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:05:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:05:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1107.eqiad.wmnet with OS bookworm
[00:05:45] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm completed: - elastic1107 (**PASS**)...
[00:05:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2098.codfw.wmnet with reason: host reimage
[00:08:05] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:09:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:09:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1105.eqiad.wmnet with OS bookworm
[00:09:45] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm completed: - elastic1105 (**PASS**)...
[00:12:21] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[00:12:40] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[00:14:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2099.codfw.wmnet with reason: host reimage
[00:14:21] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:17:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2099.codfw.wmnet with reason: host reimage
[00:19:31] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:27] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm
[00:22:33] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[00:23:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1107.eqiad.wmnet with OS bookworm
[00:23:10] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1107...
[00:25:04] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:29:41] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:32:13] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:15] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:34] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:36:53] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:38:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2094.codfw.wmnet with OS bookworm
[00:38:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage
- pt1979@cumin2002"
[00:38:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2097.codfw.wmnet with OS bookworm
[00:38:21] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm completed: - elastic2094 (**PASS**)...
[00:38:25] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2097.codfw.wmnet with OS bookworm completed: - elastic2097 (**WARN**)...
[00:38:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:38:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2099.codfw.wmnet with OS bookworm
[00:38:34] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2099.codfw.wmnet with OS bookworm completed: - elastic2099 (**WARN**)...
[00:38:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:38:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2098.codfw.wmnet with OS bookworm
[00:38:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978662
[00:38:48] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978662 (owner: TrainBranchBot)
[00:38:54] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2098.codfw.wmnet with OS bookworm completed: - elastic2098 (**WARN**)...
[00:40:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2100.codfw.wmnet with OS bookworm
[00:40:19] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2100.codfw.wmnet with OS bookworm
[00:41:12] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[00:42:00] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[00:44:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2101.codfw.wmnet with OS bookworm
[00:44:32] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2101.codfw.wmnet with OS bookworm
[00:47:25] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:21] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2102.codfw.wmnet with OS bookworm
[00:49:37] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2102.codfw.wmnet with OS bookworm
[00:53:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2103.codfw.wmnet with OS bookworm
[00:54:02] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2103.codfw.wmnet with OS bookworm
[00:57:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978662 (owner: TrainBranchBot)
[00:59:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2104.codfw.wmnet with OS bookworm
[00:59:57] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2104.codfw.wmnet with OS bookworm
[01:02:34] !log
pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2101.codfw.wmnet with reason: host reimage
[01:05:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2101.codfw.wmnet with reason: host reimage
[01:06:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2100.codfw.wmnet with reason: host reimage
[01:07:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2102.codfw.wmnet with reason: host reimage
[01:09:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2100.codfw.wmnet with reason: host reimage
[01:09:40] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[01:11:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2103.codfw.wmnet with reason: host reimage
[01:14:56] !log removing 120 files for legal compliance
[01:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:09] (not a typo, unfortunately)
[01:17:05] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[01:18:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage
[01:19:05] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ceph2001-3 to codfw - jhancock@cumin2002"
[01:21:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:21:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding
ceph2001-3 to codfw - jhancock@cumin2002"
[01:21:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:21:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage
[01:21:32] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:22:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:24:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2001.mgmt.codfw.wmnet with reboot policy FORCED
[01:24:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED
[01:24:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2003.mgmt.codfw.wmnet with reboot policy FORCED
[01:28:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:29:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:29:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2101.codfw.wmnet with OS bookworm
[01:29:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:29:59] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
pt1979@cumin2002 for host elastic2101.codfw.wmnet with OS bookworm completed: - elastic2101 (**WARN**)...
[01:30:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2100.codfw.wmnet with OS bookworm
[01:30:05] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2100.codfw.wmnet with OS bookworm completed: - elastic2100 (**PASS**)...
[01:30:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:31:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:31:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2102.codfw.wmnet with OS bookworm
[01:31:37] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2102.codfw.wmnet with OS bookworm completed: - elastic2102 (**PASS**)...
[01:32:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:32:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2105.codfw.wmnet with OS bookworm
[01:33:00] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2105.codfw.wmnet with OS bookworm
[01:33:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:34:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2103.codfw.wmnet with OS bookworm
[01:34:07] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2103.codfw.wmnet with OS bookworm completed: - elastic2103 (**PASS**)...
[01:36:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ceph2001.mgmt.codfw.wmnet with reboot policy FORCED
[01:36:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ceph2003.mgmt.codfw.wmnet with reboot policy FORCED
[01:36:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2106.codfw.wmnet with OS bookworm
[01:36:42] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2106.codfw.wmnet with OS bookworm
[01:38:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED
[01:38:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:39:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2003']
[01:39:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2001']
[01:39:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2002']
[01:39:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ceph2001']
[01:39:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ceph2002']
[01:39:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ceph2003']
[01:39:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2001']
[01:40:00] !log jhancock@cumin2002 START -
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2002']
[01:40:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2003']
[01:40:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2002']
[01:40:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2107.codfw.wmnet with OS bookworm
[01:40:26] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2107.codfw.wmnet with OS bookworm
[01:40:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2002']
[01:40:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:40:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2104.codfw.wmnet with OS bookworm
[01:40:40] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2104.codfw.wmnet with OS bookworm completed: - elastic2104 (**PASS**)...
[01:40:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2002']
[01:43:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2108.codfw.wmnet with OS bookworm
[01:43:33] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2108.codfw.wmnet with OS bookworm
[01:46:21] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:49:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2109.codfw.wmnet with OS bookworm
[01:49:43] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2109.codfw.wmnet with OS bookworm
[01:50:48] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm)
[01:51:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage
[01:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:51:18] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) note to self to check the network port on ceph2002
[01:51:21]
SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) a:Jhancock.wm
[01:54:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage
[01:54:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage
[01:56:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2001']
[01:56:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2003']
[01:58:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2106.codfw.wmnet with reason: host reimage
[01:58:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2107.codfw.wmnet with reason: host reimage
[02:01:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2108.codfw.wmnet with reason: host reimage
[02:01:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2107.codfw.wmnet with reason: host reimage
[02:04:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2108.codfw.wmnet with reason: host reimage
[02:07:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2109.codfw.wmnet with reason: host reimage
[02:10:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2109.codfw.wmnet with reason: host reimage
[02:11:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:16:44] !log pt1979@cumin2002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:16:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2105.codfw.wmnet with OS bookworm
[02:16:51] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2105.codfw.wmnet with OS bookworm completed: - elastic2105 (**PASS**)...
[02:16:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:17:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:17:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2106.codfw.wmnet with OS bookworm
[02:18:03] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2106.codfw.wmnet with OS bookworm completed: - elastic2106 (**PASS**)...
[02:18:30] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:24:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:27:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:27:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2108.codfw.wmnet with OS bookworm
[02:27:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:27:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2107.codfw.wmnet with OS bookworm
[02:27:29] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2108.codfw.wmnet with OS bookworm completed: - elastic2108 (**WARN**)...
[02:27:34] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2107.codfw.wmnet with OS bookworm completed: - elastic2107 (**PASS**)...
[02:28:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:31:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:31:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2109.codfw.wmnet with OS bookworm
[02:31:46] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2109.codfw.wmnet with OS bookworm completed: - elastic2109 (**PASS**)...
[02:34:53] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[02:35:32] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul) Open→Resolved @bking all your's
[02:39:02] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:00:14] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:00:39] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[03:09:03] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:03:14] (PS3) Clare Ming: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298)
[04:05:59] (CR) Clare Ming: "@KimberlySarabia mind taking another look?
I combined schemas so there is only one custom schema for both desktop and mobile webuiactions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [04:10:04] (03PS4) 10Clare Ming: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) [04:19:32] (03PS5) 10Clare Ming: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) [04:21:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:30:42] I am going to put phabricator in RO for a few seconds to switch its database master [05:31:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149 [05:32:03] T352149: Switchover m3 master db1159 -> db1119 - https://phabricator.wikimedia.org/T352149 [05:32:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149 [05:33:02] (03PS1) 10Marostegui: Revert "mariadb: Promote db1119 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/979085 [05:33:42] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/979085 (owner: 10Marostegui) [05:33:53] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:34:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, 
down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:34:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:35:26] (03CR) 10Marostegui: Revert "mariadb: Promote db1119 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/979085 (owner: 10Marostegui) [05:35:29] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote db1119 to m3 master" [puppet] - 10https://gerrit.wikimedia.org/r/979085 (owner: 10Marostegui) [05:37:10] !log Failover m3 from db1119 to db1159 - T352360 [05:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:14] T352360: Switchover m3 master db1119 -> db1159 - https://phabricator.wikimedia.org/T352360 [05:37:40] (03CR) 10Clare Ming: Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [05:41:39] (03PS1) 10Marostegui: db1119: Future m5 master [puppet] - 10https://gerrit.wikimedia.org/r/979192 (https://phabricator.wikimedia.org/T352361) [05:42:16] (03CR) 10Marostegui: [C: 03+2] db1119: Future m5 master [puppet] - 10https://gerrit.wikimedia.org/r/979192 (https://phabricator.wikimedia.org/T352361) (owner: 10Marostegui) [05:46:21] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:55:22] (03PS1) 10Marostegui: db2135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979194 (https://phabricator.wikimedia.org/T352361) [05:55:56] (03CR) 10Marostegui: [C: 03+2] db2135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979194 (https://phabricator.wikimedia.org/T352361) (owner: 10Marostegui) 
[05:56:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2135.codfw.wmnet with OS bookworm [06:06:57] (03PS1) 10Marostegui: Revert "db2135: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979206 [06:07:03] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/979206 (owner: 10Marostegui) [06:08:22] 10SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10odimitrijevic) Approved [06:12:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2135.codfw.wmnet with reason: host reimage [06:15:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2135.codfw.wmnet with reason: host reimage [06:20:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:20:50] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:21:00] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:30:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2135.codfw.wmnet with OS bookworm [06:31:02] (03CR) 10Marostegui: [C: 03+2] Revert "db2135: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979206 (owner: 10Marostegui) [06:33:00] (03PS1) 10Marostegui: db2135: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/979196 (https://phabricator.wikimedia.org/T352361) [06:34:09] (03CR) 10Marostegui: [C: 03+2] db2135: Remove package declaration [puppet] - 
10https://gerrit.wikimedia.org/r/979196 (https://phabricator.wikimedia.org/T352361) (owner: 10Marostegui) [06:36:44] (03PS1) 10Marostegui: mariadb: Move db1119 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/979197 (https://phabricator.wikimedia.org/T352361) [06:36:56] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:37:12] ^ expected [06:37:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1119 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/979197 (https://phabricator.wikimedia.org/T352361) (owner: 10Marostegui) [06:37:28] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:42:04] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:42:44] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:55:00] (03PS1) 10Marostegui: dbproxy1021,27: Test db1119 [puppet] - 10https://gerrit.wikimedia.org/r/979198 (https://phabricator.wikimedia.org/T352505) [06:56:11] (03CR) 10Marostegui: [C: 03+2] dbproxy1021,27: Test db1119 [puppet] - 10https://gerrit.wikimedia.org/r/979198 (https://phabricator.wikimedia.org/T352505) (owner: 10Marostegui) [06:58:13] (03PS1) 10Marostegui: Revert "dbproxy1021,27: Test db1119" [puppet] - 10https://gerrit.wikimedia.org/r/979207 [06:58:24] (03CR) 10Marostegui: "Test was good" [puppet] - 10https://gerrit.wikimedia.org/r/979207 (owner: 10Marostegui) [06:58:43] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1021,27: Test db1119" [puppet] - 10https://gerrit.wikimedia.org/r/979207 (owner: 10Marostegui) [07:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231201T0700) [07:00:14] (PuppetFailure) firing: Puppet has failed on 
lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:00:39] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [07:09:59] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:12:11] (03PS1) 10Marostegui: dbproxy1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979199 (https://phabricator.wikimedia.org/T351864) [07:12:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1026.eqiad.wmnet with OS bookworm [07:12:50] (03CR) 10Marostegui: [C: 03+2] dbproxy1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979199 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [07:26:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1026.eqiad.wmnet with reason: host reimage [07:29:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1026.eqiad.wmnet with reason: host reimage [07:40:26] (03PS1) 10Marostegui: Revert "dbproxy1026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979208 [07:42:30] (03CR) 10Filippo Giunchedi: [C: 03+2] "No need for a window, I'll merge this now" [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) (owner: 10Bartosz Dziewoński) [07:43:34] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979208 
(owner: 10Marostegui) [07:46:06] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:47:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1026.eqiad.wmnet with OS bookworm [07:54:06] (03CR) 10Slyngshede: [V: 03+1] P:url_downloader add blackbox exporter. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:54:24] (03CR) 10Slyngshede: [V: 03+1] P:url_downloader add blackbox exporter. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:58:43] (03CR) 10Jelto: [C: 03+2] aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.4 [puppet] - 10https://gerrit.wikimedia.org/r/979162 (https://phabricator.wikimedia.org/T352480) (owner: 10Jelto) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231201T0800) [08:00:47] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:43] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:12:23] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:09] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:11] (03CR) 10Volans: "question inline, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [08:26:11] (03CR) 10JMeybohm: [C: 03+1] wikifunctions: Reduce helm deploy timeout from 600s default to 120s [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [08:27:50] (03PS1) 10Muehlenhoff: Remove John's shell access [puppet] - 10https://gerrit.wikimedia.org/r/979201 (https://phabricator.wikimedia.org/T352508) [08:27:55] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:57] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:51] (03PS2) 10Filippo Giunchedi: centralserver: reintroduce tls-remedy for centralserver [puppet] - 
10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) [08:29:53] (03PS1) 10Filippo Giunchedi: pontoon: remove deprecated pontoon-log-01 [puppet] - 10https://gerrit.wikimedia.org/r/979202 [08:31:27] (03CR) 10Muehlenhoff: [C: 03+2] Remove John's shell access [puppet] - 10https://gerrit.wikimedia.org/r/979201 (https://phabricator.wikimedia.org/T352508) (owner: 10Muehlenhoff) [08:31:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:10] (03CR) 10Brouberol: "I wonder if we should generalize the pattern: _any_ change to the hieradata should trigger the profile specs. WDYT Jesse?" [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [08:33:34] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: remove deprecated pontoon-log-01 [puppet] - 10https://gerrit.wikimedia.org/r/979202 (owner: 10Filippo Giunchedi) [08:34:27] (03CR) 10Filippo Giunchedi: "Taavi, please see this change re: centralserver_syslog role (is it used?)" [puppet] - 10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [08:36:19] (03CR) 10Filippo Giunchedi: [C: 03+1] P:url_downloader add blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:39:00] (03CR) 10Filippo Giunchedi: "Have you tested this with queries on thanos.w.o and auto downsampling?" 
[puppet] - 10https://gerrit.wikimedia.org/r/979163 (owner: 10Herron) [08:39:51] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:42:37] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:48:47] (03CR) 10Majavah: centralserver: reintroduce tls-remedy for centralserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [08:50:42] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:54:19] (03PS1) 10Muehlenhoff: Remove John from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/979296 (https://phabricator.wikimedia.org/T352508) [08:57:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:57:05] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:10] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:01:43] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:27] (03CR) 10Volans: [C: 03+1] "LGTM, to be followed by the change in the private repo" [puppet] - 10https://gerrit.wikimedia.org/r/979296 (https://phabricator.wikimedia.org/T352508) (owner: 10Muehlenhoff) [09:04:29] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:51] (03PS1) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls [puppet] - 
10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) [09:05:28] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:05:29] (03CR) 10CI reject: [V: 04-1] prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:05:41] (03CR) 10Muehlenhoff: [C: 03+2] Remove John from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/979296 (https://phabricator.wikimedia.org/T352508) (owner: 10Muehlenhoff) [09:10:01] (03CR) 10Alexandros Kosiaris: "This is technically correct, so consider this a +1, but since is the first exception we got from the default of 600s, I am willing to bet " [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [09:12:28] (03CR) 10Southparkfan: centralserver: reintroduce tls-remedy for centralserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [09:13:49] (03PS3) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) [09:13:51] (03CR) 10Vgutierrez: "output:" [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:20:09] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [09:23:09] (03PS1) 10Muehlenhoff: Remove John from network device access [homer/public] - 10https://gerrit.wikimedia.org/r/979299 (https://phabricator.wikimedia.org/T352508) [09:23:37] (03PS4) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) 
[09:25:48] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/795/con" [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:29:08] (03CR) 10Ayounsi: [C: 03+1] Remove John from network device access [homer/public] - 10https://gerrit.wikimedia.org/r/979299 (https://phabricator.wikimedia.org/T352508) (owner: 10Muehlenhoff) [09:34:25] (03CR) 10Fabfur: prometheus::sysctl: Support configurable sysctls (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:37:16] (03CR) 10Muehlenhoff: [C: 03+2] Remove John from network device access [homer/public] - 10https://gerrit.wikimedia.org/r/979299 (https://phabricator.wikimedia.org/T352508) (owner: 10Muehlenhoff) [09:37:25] (03CR) 10Volans: "first pass" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:37:58] (03PS5) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) [09:38:13] (03CR) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:42:46] (03CR) 10Fabfur: prometheus::sysctl: Support configurable sysctls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:44:28] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jbond out of all services on: 2211 hosts [09:44:58] (03CR) 10Fabfur: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:45:17] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jbond out of all services on: 2211 hosts [09:51:08] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jbond out of all services on: 2 hosts [09:51:12] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jbond out of all services on: 2 hosts [09:51:43] (03CR) 10Volans: "Much better thanks! Few other questions and you should be ready to start writing the tests 😊" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:57:50] (03PS1) 10Majavah: Remove John's root key [labs/private] - 10https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508) [09:57:54] (03PS1) 10Majavah: Remove root keys for some former staff [labs/private] - 10https://gerrit.wikimedia.org/r/979304 [09:59:07] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall though" [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:03:42] (03PS1) 10Muehlenhoff: Add cn=project-cloudinfra to list of NDA-sensitive groups [puppet] - 10https://gerrit.wikimedia.org/r/979305 [10:04:09] (03CR) 10Filippo Giunchedi: centralserver: reintroduce tls-remedy for centralserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [10:05:06] (03CR) 10Filippo Giunchedi: "I'd rather not require every user script that writes .prom files to also declare its files in puppet" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [10:07:59] (PuppetFailure) firing: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:08:52] !log add 60GB to prometheus/k8s in codfw [10:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:16] (03PS1) 10Alexandros Kosiaris: Move new ganeti hosts to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/979306 [10:10:41] (03CR) 10Muehlenhoff: [C: 03+1] Move new ganeti hosts to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/979306 (owner: 10Alexandros Kosiaris) [10:10:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move new ganeti hosts to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/979306 (owner: 10Alexandros Kosiaris) [10:12:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] mediawiki: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979102 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [10:13:19] (03Merged) 10jenkins-bot: mediawiki: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979102 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [10:14:59] (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:19:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:20:59] (PuppetFailure) firing: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:21:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] redis_lock: 
Switch from rdb1009 to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979101 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [10:22:00] (PuppetFailure) firing: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:22:32] (03Merged) 10jenkins-bot: redis_lock: Switch from rdb1009 to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979101 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [10:22:59] (PuppetFailure) resolved: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:23:00] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:25:00] (03PS3) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [10:26:05] (03PS4) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [10:27:46] (03CR) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:27:57] !draining codfw<->ulsfo transport link to reconfigure card 1/1 in cr1-codfw T350159 [10:28:21] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (2) The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [10:29:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [labs/private] - 10https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508) (owner: 10Majavah) [10:29:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [labs/private] - 10https://gerrit.wikimedia.org/r/979304 (owner: 10Majavah) [10:29:58] (03CR) 10Majavah: [V: 03+2 C: 03+2] Remove John's root key [labs/private] - 10https://gerrit.wikimedia.org/r/979303 (https://phabricator.wikimedia.org/T352508) (owner: 10Majavah) [10:30:04] (03CR) 10Majavah: [V: 03+2 C: 03+2] Remove root keys for some former staff [labs/private] - 10https://gerrit.wikimedia.org/r/979304 (owner: 10Majavah) [10:30:58] !log akosiaris@deploy2002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 07m 12s) [10:32:10] !draining codfw<->eqdfw transport link to reconfigure card 1/1 in cr1-codfw T350159 [10:34:30] !log draining codfw<->eqdfw transport link to reconfigure card 1/1 in cr1-codfw T350159 [10:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:32] !log draining codfw<->eqiad transport link to reconfigure card 1/1 in cr1-codfw T350159 [10:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:41] !log Moving VRRP acrtive gateway for codfw row A/B vlans from cr1-codfw to cr2-codfw to reconfigure card 1/1 T350159 [10:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:40:05] (03PS1) 
10Muehlenhoff: systemd-logind logout script: Terminate sessions with a vengeance [puppet] - 10https://gerrit.wikimedia.org/r/979308 [10:40:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:29] (03PS1) 10Muehlenhoff: logoutd: Remove now obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/979309 [10:42:33] (03CR) 10CI reject: [V: 04-1] systemd-logind logout script: Terminate sessions with a vengeance [puppet] - 10https://gerrit.wikimedia.org/r/979308 (owner: 10Muehlenhoff) [10:44:03] (03CR) 10CI reject: [V: 04-1] logoutd: Remove now obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/979309 (owner: 10Muehlenhoff) [10:45:45] (03PS2) 10Muehlenhoff: systemd-logind logout script: Terminate sessions with a vengeance [puppet] - 10https://gerrit.wikimedia.org/r/979308 [10:47:30] (03CR) 10Hnowlan: [C: 03+1] restbase: set production role and add config for restbase2028 [puppet] - 10https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [10:49:24] !log installing wireshark security updates on bookworm [10:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:12] (03CR) 10Volans: [C: 03+1] "LGTM, log nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/979308 (owner: 10Muehlenhoff) [10:53:43] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979305 (owner: 10Muehlenhoff) [10:55:01] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:55:07] (03CR) 10Volans: logoutd: Remove now obsolete check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979309 (owner: 10Muehlenhoff) [10:57:07] (03PS3) 10Muehlenhoff: systemd-logind logout script: Terminate sessions with a vengeance [puppet] - 10https://gerrit.wikimedia.org/r/979308 
[10:57:13] (03CR) 10Muehlenhoff: systemd-logind logout script: Terminate sessions with a vengeance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979308 (owner: 10Muehlenhoff) [10:59:15] (03CR) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [10:59:23] !log Resetting circuit preference for transports landing on card 1/1 cr1-codfw T350159 [10:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:00:14] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:00:39] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [11:00:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979308 (owner: 10Muehlenhoff) [11:00:50] !log Draining cr1-codfw transport to cr3-eqsin to reset card 1/0 T350159 [11:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:27] (03CR) 10Muehlenhoff: [C: 03+2] systemd-logind logout script: Terminate sessions with a vengeance [puppet] - 10https://gerrit.wikimedia.org/r/979308 (owner: 10Muehlenhoff) [11:02:10] (03PS2) 10Muehlenhoff: logoutd: Remove now obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/979309 [11:03:07] (03PS3) 10Muehlenhoff: logoutd: Remove now 
obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/979309 [11:04:26] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jbond out of all services on: 2 hosts [11:04:31] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jbond out of all services on: 2 hosts [11:06:03] (03PS1) 10Majavah: P:prometheus::cloud: get openstack exporter from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/979312 (https://phabricator.wikimedia.org/T350010) [11:06:05] (03PS1) 10Majavah: O:prometheus: provision cloud instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/979313 (https://phabricator.wikimedia.org/T350010) [11:09:30] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jbond out of all services on: 2 hosts [11:09:33] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jbond out of all services on: 2 hosts [11:09:59] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:14] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/796/con" [puppet] - 10https://gerrit.wikimedia.org/r/979313 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [11:13:05] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979309 (owner: 10Muehlenhoff) [11:13:42] (03CR) 10Muehlenhoff: [C: 03+2] logoutd: Remove now obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/979309 (owner: 10Muehlenhoff) [11:17:27] (03PS2) 10Effie Mouzeli: (WIP) mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 [11:19:46] (03PS12) 10Effie Mouzeli: (WIP)mcrouter: add chart [deployment-charts] - 
10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:20:17] (03CR) 10CI reject: [V: 04-1] (WIP)mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:20:31] (03PS1) 10Muehlenhoff: offboard-user.py: Privileged openstack groups need different handling [puppet] - 10https://gerrit.wikimedia.org/r/979317 [11:22:04] !log Disabling BGP peering to AS1299 prior to reset of card 1/0 in cr1-codfw T350159 [11:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:55] (03PS13) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:23:08] (03PS2) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [11:23:19] (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [11:23:50] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:25:18] (03PS14) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:26:07] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:27:23] (03CR) 10MVernon: [C: 03+1] "LGTM, though the amount of manual work here still makes my lazy self a bit sad..." 
[puppet] - 10https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [11:27:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:09] ^^ sry this is me, will re-enable shortly [11:29:23] !log Reset card 1/0 in cr1-codfw T350159 [11:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:29:59] (PuppetFailure) resolved: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:30:09] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:30:37] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: resetting line card [11:30:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: resetting line card [11:32:46] (03PS15) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:35:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 193, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:35:27] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 115 probes of 799 (alerts on 35) - 
https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:35:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:35:53] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:43:20] (03PS1) 10Awight: Remove outdated stretch exclusion for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/979319 [11:45:57] (03CR) 10Majavah: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979317 (owner: 10Muehlenhoff) [11:46:16] (03CR) 10Majavah: [C: 03+1] "LGTM once Ic69ef5f9be35721e3eda36a032845434d81f9e5d has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/979305 (owner: 10Muehlenhoff) [11:46:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:49:08] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [11:50:45] (03CR) 10Muehlenhoff: [C: 03+2] offboard-user.py: Privileged openstack groups need different handling [puppet] - 10https://gerrit.wikimedia.org/r/979317 (owner: 10Muehlenhoff) [11:51:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:27] (03CR) 10Muehlenhoff: [C: 03+2] Add cn=project-cloudinfra to list of NDA-sensitive groups [puppet] - 10https://gerrit.wikimedia.org/r/979305 
(owner: 10Muehlenhoff) [11:59:42] ^^ above router down alert is expected, relates to new 100G port for circuit currently being wired up [11:59:47] I'll ack the alerts for now [12:01:08] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Port et-1/0/2 is down as its been configured for new Arelion 100G but not patched. - The acknowledgement expires at: 2023-12-08 12:00:29. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:02:30] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jbond out of all services on: 2211 hosts [12:03:35] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jbond out of all services on: 2211 hosts [12:03:54] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [12:07:20] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [12:10:59] (03CR) 10Mabualruz: "LGTM in the functional side the event are showing for me, still not sure about the schema and the data I guess maybe @Kim can check on thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [12:17:57] !log add BGP custom field to Netbox - T306649 [12:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:00] T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 [12:21:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts flerovium.eqiad.wmnet [12:24:22] (03PS1) 10Muehlenhoff: firewall: Remove special case handling for flerovium [puppet] - 10https://gerrit.wikimedia.org/r/979333 (https://phabricator.wikimedia.org/T352193) [12:25:14] (03PS1) 10Muehlenhoff: 
Remove hadoop-hdfs-backup alias [puppet] - 10https://gerrit.wikimedia.org/r/979334 [12:27:56] (03CR) 10CI reject: [V: 04-1] Remove hadoop-hdfs-backup alias [puppet] - 10https://gerrit.wikimedia.org/r/979334 (owner: 10Muehlenhoff) [12:29:38] (03PS2) 10Muehlenhoff: Remove hadoop-hdfs-backup alias [puppet] - 10https://gerrit.wikimedia.org/r/979334 (https://phabricator.wikimedia.org/T352193) [12:31:00] (PuppetFailure) resolved: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:33:24] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) a:03BTullis [12:33:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:34:04] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS bookworm [12:35:02] (03PS1) 10EoghanGaffney: [admin] Add ldap user for sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/979336 (https://phabricator.wikimedia.org/T352334) [12:35:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] changeprop: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979103 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [12:35:53] (03CR) 10CI reject: [V: 04-1] [admin] Add ldap user for sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/979336 (https://phabricator.wikimedia.org/T352334) (owner: 10EoghanGaffney) [12:35:58] (03CR) 10Filippo Giunchedi: "I believe we should be able to get gitpuppet to write .prom files by adding it to prometheus-node-exporter unix group" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [12:36:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flerovium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:36:22] (03Merged) 10jenkins-bot: changeprop: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979103 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [12:36:24] (03Merged) 10jenkins-bot: api-gateway: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979104 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [12:36:33] (03PS2) 10EoghanGaffney: [admin] Add ldap user for sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/979336 (https://phabricator.wikimedia.org/T352334) [12:36:59] (PuppetFailure) resolved: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:37:05] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::cloud: get openstack exporter from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/979312 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [12:37:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flerovium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:37:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flerovium.eqiad.wmnet [12:37:42] 10SRE, 10ops-eqiad, 10Patch-For-Review: decommission flerovium - https://phabricator.wikimedia.org/T352193 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `flerovium.eqiad.wmnet` - flerovium.eqiad.wmnet (**PASS**) - Downtimed 
host on Icinga/Alertmanager - Foun... [12:38:42] (03PS1) 10Muehlenhoff: Remove flerovium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/979337 (https://phabricator.wikimedia.org/T352193) [12:39:20] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:39:37] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:39:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, of course remember do run provision-fs.sh before this change (possibly even changing it locally on the prometheus codfw hosts, seems" [puppet] - 10https://gerrit.wikimedia.org/r/979313 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [12:41:43] (03PS1) 10Muehlenhoff: Remove analytics_cluster::hadoop::client role [puppet] - 10https://gerrit.wikimedia.org/r/979338 [12:41:45] (03CR) 10Muehlenhoff: [C: 03+2] Remove flerovium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/979337 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff) [12:43:19] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) This is confirmed. ` Enclosure Device ID: 32 Slot Number: 9 Drive's position: DiskGroup: 12, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 9... 
[12:46:37] (03PS1) 10Effie Mouzeli: deployment_server: add mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) [12:47:16] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:47:55] (03CR) 10Kamila Součková: [C: 03+1] Deploy kube-state-metrics to the dse-k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis) [12:49:41] 10SRE, 10ops-eqiad, 10Patch-For-Review: decommission flerovium - https://phabricator.wikimedia.org/T352193 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:50:41] 10SRE, 10ops-eqiad, 10Patch-For-Review: decommission flerovium - https://phabricator.wikimedia.org/T352193 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Jclark-ctr [12:50:43] (03CR) 10Majavah: [C: 03+2] P:prometheus::cloud: get openstack exporter from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/979312 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [12:51:35] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace 4TB SATA disk in an-worker1086 - https://phabricator.wikimedia.org/T352529 (10BTullis) [12:51:48] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace 4TB SATA disk in an-worker1086 - https://phabricator.wikimedia.org/T352529 (10BTullis) [12:51:50] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) [12:53:03] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) I've created {T352168} and tagged it with #ops-eqiad so I'll move this ticket to waiting. 
[12:53:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979336 (https://phabricator.wikimedia.org/T352334) (owner: 10EoghanGaffney) [12:53:42] (03CR) 10Muehlenhoff: [C: 03+2] Remove hadoop-hdfs-backup alias [puppet] - 10https://gerrit.wikimedia.org/r/979334 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff) [12:55:43] (03PS1) 10Effie Mouzeli: Add namespace for mcrouter service [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) [12:56:19] (03PS2) 10Effie Mouzeli: Add namespace for mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) [12:57:02] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10BTullis) a:03BTullis [12:57:55] (03CR) 10EoghanGaffney: [C: 03+2] [admin] Add ldap user for sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/979336 (https://phabricator.wikimedia.org/T352334) (owner: 10EoghanGaffney) [13:02:22] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Grant Access to wmf, releng, ciadmin for sandeeps - https://phabricator.wikimedia.org/T352334 (10eoghan) 05Open→03Resolved a:03eoghan I've added `sandeeps` to the LDAP groups. Feel free to reopen if there's anythin... 
[13:02:48] (03PS3) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [13:13:27] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:13:40] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:16:48] (03PS2) 10Majavah: O:prometheus: provision cloud instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/979313 (https://phabricator.wikimedia.org/T350010) [13:16:50] (03PS1) 10Majavah: prometheus: provision-fs: create cloud fs on codfw [puppet] - 10https://gerrit.wikimedia.org/r/979343 (https://phabricator.wikimedia.org/T350010) [13:19:08] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10eoghan) a:03odimitrijevic [13:19:29] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10eoghan) a:03odimitrijevic [13:19:46] (03CR) 10Majavah: [C: 03+2] prometheus: provision-fs: create cloud fs on codfw [puppet] - 10https://gerrit.wikimedia.org/r/979343 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [13:27:40] !log run prometheus provision-fs on prometheus2* to create file system for cloud instance T350010 [13:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:44] T350010: Evaluate whether to deploy cloud Prometheus instance to codfw - https://phabricator.wikimedia.org/T350010 [13:28:45] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:28:58] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:30:11] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply 
[13:30:28] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [13:30:41] (03CR) 10Brouberol: [C: 03+1] Deploy kube-state-metrics to the dse-k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis) [13:31:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:31:43] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [13:32:00] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [13:32:30] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10BTullis) I'm having a look at this now. I believe that it is related to the dumps architecture and specifically with wikid... 
[13:32:49] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10BTullis) 05Open→03Resolved [13:33:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:35:20] (03CR) 10Majavah: [C: 03+2] O:prometheus: provision cloud instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/979313 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [13:40:18] (03PS3) 10Jcrespo: add_recent_uploads: Be more resilient against errors [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979160 [13:46:32] (03PS16) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [13:47:27] (03PS5) 10Brouberol: Explicitly link the apt_repo.yaml hirea file to the modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119 [13:47:36] (03PS6) 10Brouberol: Explicitly link the apt_repo.yaml hiera file to the modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119 [13:48:19] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [13:48:25] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [13:50:04] (03CR) 10CI reject: [V: 04-1] Explicitly link the apt_repo.yaml hiera file to the modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [13:50:08] (03PS4) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [13:57:44] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - 
https://phabricator.wikimedia.org/T352475 (10eoghan) @Gehel Could you please approve this request, as @pfischer's manager? [13:58:23] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - https://phabricator.wikimedia.org/T352475 (10eoghan) a:03Gehel [13:58:42] (03PS2) 10Alexandros Kosiaris: cp-jobqueue: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979105 (https://phabricator.wikimedia.org/T326171) [13:58:44] (03PS2) 10Alexandros Kosiaris: Remove rdb1009 unused references from repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/979106 (https://phabricator.wikimedia.org/T326171) [13:59:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] cp-jobqueue: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979105 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [14:00:48] (03Merged) 10jenkins-bot: cp-jobqueue: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979105 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [14:03:36] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:03:53] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:05:27] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:05:53] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:07:16] (03PS1) 10Jelto: add wmf-debci image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) [14:07:46] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:46] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:47] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:54] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:09:00] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:09:21] (03PS17) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:11:06] (03PS18) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:12:12] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:12] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:40] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:14:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove rdb1009 unused references from repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/979106 (https://phabricator.wikimedia.org/T326171) (owner: 
10Alexandros Kosiaris) [14:15:53] (03Merged) 10jenkins-bot: Remove rdb1009 unused references from repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/979106 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [14:18:17] !log hashar@deploy2002 Started deploy [integration/docroot@1c2de6b]: doc: link to Disovery parent pom [14:18:24] !log hashar@deploy2002 Finished deploy [integration/docroot@1c2de6b]: doc: link to Disovery parent pom (duration: 00m 06s) [14:19:33] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:20:06] !log hashar@deploy2002 Started deploy [integration/docroot@88f69cc]: doc: link to the Gearman Java library [14:20:12] !log hashar@deploy2002 Finished deploy [integration/docroot@88f69cc]: doc: link to the Gearman Java library (duration: 00m 05s) [14:21:07] (03PS10) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [14:24:37] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [14:25:50] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [14:25:59] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [14:26:00] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:26:08] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:26:09] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:26:11] !log cleanup rdb1009 from all deployment charts 
[14:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:26:29] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:26:37] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:26:38] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:26:45] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:26:47] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:26:55] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:27:04] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:27:09] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:27:10] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:27:15] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:27:16] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:27:22] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:28:29] (03PS1) 10JMeybohm: docker_registry: Don't add control planes as authorized nodes [puppet] - 10https://gerrit.wikimedia.org/r/979360 [14:30:31] (03CR) 10MVernon: [C: 03+1] "Looks good to me, thanks, but I've not done anything with this repo before." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto) [14:31:45] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/797/con" [puppet] - 10https://gerrit.wikimedia.org/r/979360 (owner: 10JMeybohm) [14:31:57] (03PS1) 10Alexandros Kosiaris: nextbox: Switch from rdb1009 to rdb1013 [puppet] - 10https://gerrit.wikimedia.org/r/979361 (https://phabricator.wikimedia.org/T326171) [14:34:57] (03PS1) 10Jforrester: wikifunctionswiki: Disable thumbnail in Vector search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979362 (https://phabricator.wikimedia.org/T352532) [14:35:44] (03PS1) 10Effie Mouzeli: mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [14:36:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] nextbox: Switch from rdb1009 to rdb1013 [puppet] - 10https://gerrit.wikimedia.org/r/979361 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [14:36:14] (03CR) 10Jforrester: [C: 03+1] "Thanks! Sorry for the oversight. Will deploy on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:38:41] (03PS4) 10Jforrester: wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:39:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker_registry: Don't add control planes as authorized nodes [puppet] - 10https://gerrit.wikimedia.org/r/979360 (owner: 10JMeybohm) [14:39:02] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:46] (03CR) 10Clare Ming: Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [14:40:14] (03Abandoned) 10Jforrester: Disable DoubleWiki extension everywhere, at least for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902388 (https://phabricator.wikimedia.org/T332850) (owner: 10Jforrester) [14:41:38] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:41:39] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:41:46] (03PS19) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:41:58] (03PS3) 10Jforrester: [BETA CLUSTER] testwiki: Disable PageTriage's extended features 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [14:42:03] (03PS1) 10Muehlenhoff: vrts: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/979364 [14:42:07] (03PS4) 10Jforrester: [BETA CLUSTER] testwiki: Disable PageTriage's extended features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [14:42:23] (03CR) 10Jforrester: [C: 03+2] "I'll pull this one into prod manually without a deploy, as it's a Beta-only change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [14:42:41] (03CR) 10CI reject: [V: 04-1] mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:43:06] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:43:32] (03Merged) 10jenkins-bot: [BETA CLUSTER] testwiki: Disable PageTriage's extended features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [14:43:49] (03PS1) 10Hnowlan: jobqueue: migrate another medium-weight job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979365 (https://phabricator.wikimedia.org/T349796) [15:19:56] !log moving esams CR interconnect to 4x10G breakout cable T347403 [15:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts rdb[1009-1010].eqiad.wmnet [15:22:15] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:15] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:28:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] redis_lock: Actually switch from rdb1009 to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979377 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [15:28:39] (03PS7) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) [15:28:46] !log added Kamila to pwstore [15:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:09] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-query: enable auto-downsampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979163 (owner: 10Herron) [15:29:12] (03CR) 10Vgutierrez: prometheus::sysctl: Support configurable sysctls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:29:39] (03CR) 10Herron: thanos-query: enable auto-downsampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979163 (owner: 10Herron) [15:31:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:31:39] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:31:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979368 (owner: 10Muehlenhoff) [15:31:43] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:33:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifunctions: Reduce helm deploy timeout from 600s default to 120s [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [15:36:19] !log akosiaris@deploy2002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 07m 24s) [15:38:20] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:38:46] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:39:03] (03PS2) 10Muehlenhoff: superset: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/979368 [15:42:02] (03PS1) 10Hnowlan: Revert "jobqueue: migrate another medium-weight job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979219 [15:42:17] !log mwmaint2002: mwscript extensions/Flow/maintenance/FlowFixInconsistentBoards.php --wiki=frwiki # T352550 [15:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:21] T352550: Deactivating Flow using the Beta feature makes the talk page inaccessible - Flow\Exception\InvalidDataException - https://phabricator.wikimedia.org/T352550 [15:43:30] (03PS5) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [15:45:56] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [15:48:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host 
cloudcontrol1005.eqiad.wmnet with OS bookworm [15:49:02] (03PS1) 10Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) [15:49:09] (03CR) 10CI reject: [V: 04-1] Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:50:07] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb[1009-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1001" [15:51:41] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb[1009-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1001" [15:51:41] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:42] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb[1009-1010].eqiad.wmnet [15:51:46] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `rdb[1009-1010].eqiad.wmnet` - rdb1009.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/... 
[15:52:11] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[15:54:18] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA records to the rest of the 50% of rdb hosts - akosiaris@cumin1001"
[15:55:09] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA records to the rest of the 50% of rdb hosts - akosiaris@cumin1001"
[15:55:09] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:56:27] (CR) Hnowlan: [C: +2] Revert "jobqueue: migrate another medium-weight job" [deployment-charts] - https://gerrit.wikimedia.org/r/979219 (owner: Hnowlan)
[15:56:53] !log dancy@deploy2002 Installing scap version "4.65.0" for 570 hosts
[15:57:21] (Merged) jenkins-bot: Revert "jobqueue: migrate another medium-weight job" [deployment-charts] - https://gerrit.wikimedia.org/r/979219 (owner: Hnowlan)
[15:57:33] !log give AAAA and PTR records to all rdb hosts (only 50% had it previously)
[15:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:12] !log give AAAA and PTR records to scandium T271142
[15:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:16] T271142: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142
[15:58:32] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[15:58:51] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris)
[15:59:06] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:59:10] PROBLEM - SSH on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:00:10] (03PS2) 10Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) [16:00:43] !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to scandium - akosiaris@cumin1001" [16:01:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Give AAAA and PTR records to scandium - akosiaris@cumin1001" [16:01:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:50] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) @Volans, since dumpsdata[1001-1003].eqiad.wmnet and snapshot[1005-1010].eqiad.wmnet are no longe... [16:03:04] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10akosiaris) a:05akosiaris→03None [16:03:46] (03PS1) 10Majavah: team-wmcs: Add alert when Galera is not applying any writes [alerts] - 10https://gerrit.wikimedia.org/r/979385 (https://phabricator.wikimedia.org/T352552) [16:04:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [16:04:19] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:06:24] (03CR) 10Andrew Bogott: [C: 03+1] "Looks great -- thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/979385 (https://phabricator.wikimedia.org/T352552) (owner: 10Majavah) [16:07:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [16:08:37] (03CR) 10Majavah: [C: 03+2] team-wmcs: Add alert when Galera is not applying any writes [alerts] - 10https://gerrit.wikimedia.org/r/979385 (https://phabricator.wikimedia.org/T352552) (owner: 10Majavah) [16:08:48] (03CR) 10FNegri: [C: 03+1] "Thanks, LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/979385 (https://phabricator.wikimedia.org/T352552) (owner: 10Majavah) [16:09:31] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:10:00] (03Merged) 10jenkins-bot: team-wmcs: Add alert when Galera is not applying any writes [alerts] - 10https://gerrit.wikimedia.org/r/979385 (https://phabricator.wikimedia.org/T352552) (owner: 10Majavah) [16:10:53] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1046.eqiad.wmnet [16:11:08] (03PS1) 10Bking: wdqs: Add blackbox check for LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) [16:11:36] (03CR) 10CI reject: [V: 04-1] wdqs: Add blackbox check for LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:11:49] PROBLEM - SSH on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:12:15] (03PS2) 10Bking: wdqs: Add blackbox check for LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) [16:14:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 
(10Jclark-ctr) ms-be1076. f5. U1. Port1 cableid 230304500018 ms-be1077. f6 U1. Port1 cableid 230304500068 ms-be1078. f7 U1. Port1 cableid 230304500009 ms-be1079. e5 U1. Port1 ca... [16:18:27] (03PS1) 10EoghanGaffney: [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) [16:20:01] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:20:44] (03PS1) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [16:21:14] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Patch-For-Review, 10Sustainability (Incident Followup): Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10Ladsgroup) a:03Ladsgroup [16:23:16] (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [16:23:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan) a:03eoghan [16:24:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) [16:24:11] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:12] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt1046.eqiad.wmnet [16:24:23] !log dancy@deploy2002 Installing scap version "4.65.0" for 569 hosts [16:25:11] 
!log dancy@deploy2002 install-world aborted: (duration: 00m 50s) [16:25:18] !log dancy@deploy2002 Installing scap version "4.65.0" for 537 hosts [16:26:16] !log dancy@deploy2002 Installation of scap version "4.65.0" completed for 537 hosts [16:26:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:16] (03PS7) 10Vgutierrez: lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) [16:30:29] (03Abandoned) 10Ladsgroup: Add IntelliJ files to .gitignore [debs/pybal] - 10https://gerrit.wikimedia.org/r/644036 (owner: 10Ladsgroup) [16:31:29] (03Abandoned) 10Ladsgroup: rsyslog: Add mailman3 to list of accepted daemons [puppet] - 10https://gerrit.wikimedia.org/r/681648 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [16:31:56] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) [16:32:04] 10SRE, 10Observability-Logging, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Ladsgroup) 05Open→03Declined We decided not to collect logs in logstash due to sensitive nature of such logs. 
[16:32:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:25] (03Abandoned) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [16:34:35] (03Abandoned) 10Ladsgroup: lists: Send error logs of apache2/exim4 to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [16:34:59] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Volans) @akosiaris I see that: * `mw[1349-1413]` * `mw[2259-2376]` * `mc[2042-2055]` * `parse[2001-2020]` a... [16:36:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:43] (03Abandoned) 10Ladsgroup: [WIP] mediawiki-cache-warmup: Add support for POST requests [puppet] - 10https://gerrit.wikimedia.org/r/737498 (https://phabricator.wikimedia.org/T290989) (owner: 10Ladsgroup) [16:39:01] (03CR) 10Vgutierrez: [V: 03+1] "@Filippo I got a question for you... 
this should be used to alert if the observed MSS is higher than the configured one on tcp-mss-clamper" [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:39:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1005.eqiad.wmnet with OS bookworm [16:40:31] (03Abandoned) 10Ladsgroup: acme_chief: Migrate cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/691634 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:40:35] (03CR) 10Vgutierrez: [V: 03+1] lvs::realserver::ipip: Check that TCP MSS clamping is working (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:43:27] (03Abandoned) 10Ladsgroup: Add wikimedia.org.tr template pointing out to another NS [dns] - 10https://gerrit.wikimedia.org/r/634925 (https://phabricator.wikimedia.org/T259792) (owner: 10Ladsgroup) [16:44:15] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:24] (03Abandoned) 10Ladsgroup: Move tests to a proper directory structure [debs/pybal] - 10https://gerrit.wikimedia.org/r/644050 (owner: 10Ladsgroup) [16:49:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:27] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:57] (03PS1) 10Papaul: Add new ceph node to site.pp and 
apt_repo.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979394 (https://phabricator.wikimedia.org/T349934) [16:51:54] (03CR) 10Papaul: [C: 03+2] Add new ceph node to site.pp and apt_repo.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979394 (https://phabricator.wikimedia.org/T349934) (owner: 10Papaul) [16:55:02] (03PS1) 10Hnowlan: jobqueue: switch a medium weight job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) [16:58:19] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:58:20] (03PS1) 10Hnowlan: jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) [16:59:12] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) @jcrespo , would it be possible to use the [internal reverse proxy](https://gitlab.wikimedia.org/repos/research/resea... 
[16:59:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ceph2001.codfw.wmnet with OS bullseye [16:59:23] (03PS1) 10Hnowlan: jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) [16:59:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye [17:01:04] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978665 [17:06:21] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:39] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:16:55] (03PS1) 10Volans: reports: network, remove rdb from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/979399 (https://phabricator.wikimedia.org/T271142) [17:18:03] RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/802/console" [puppet] - 10https://gerrit.wikimedia.org/r/979388 
(https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:22:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/803/console" [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:23:21] (03PS3) 10Bking: wdqs: Add blackbox check for LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) [17:23:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:23:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:36] (03CR) 10Bking: [C: 03+2] wdqs: Add blackbox check for LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979388 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:30:34] (03CR) 10Gergő Tisza: Revert "Do not try to use Thumbor on beta" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza) [17:45:01] (ProbeDown) firing: (2) Service wdqs1007:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:47:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ceph2001.codfw.wmnet with OS bullseye [17:47:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye executed with errors: - ceph200... [17:48:35] (03PS1) 10Bking: wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) [17:49:03] (ProbeDown) firing: (10) Service wdqs1013:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:20] (03CR) 10CI reject: [V: 04-1] wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:50:01] (ProbeDown) firing: (12) Service wdqs1013:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:33] ^^ probedown alerts have been silenced, sorry for the spam [17:51:15] (03PS2) 10Bking: wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) [17:51:43] (03CR) 10CI reject: [V: 04-1] wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:53:00] (03PS3) 10Bking: wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) [18:16:30] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10odimitrijevic) Approved [18:17:04] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - 
https://phabricator.wikimedia.org/T350918 (10odimitrijevic) a:05odimitrijevic→03eoghan [18:17:28] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10odimitrijevic) a:05odimitrijevic→03eoghan [18:17:46] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10odimitrijevic) Approved [18:22:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ceph2001.codfw.wmnet with OS bullseye [18:22:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ceph2001.codfw.wmnet with OS bullseye [18:26:01] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:42] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:01] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:40] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:04:20] (03CR) 10Dzahn: [C: 03+1] wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:11:25] (03CR) 10Bking: [C: 03+2] wdqs: fix blackbox check for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979401 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:13:28] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) >>! In T350020#9375684, @mfossati wrote: > @jcrespo , would it be possible to use the [internal reverse proxy](https:/... 
[19:17:38] (03CR) 10Clare Ming: [C: 03+1] "lgtm - tested locally in conjunction with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/977783/ \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [19:18:16] (03CR) 10Clare Ming: [C: 03+1] Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [19:24:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:29:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:29:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:29:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:09] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:47] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:39] !log bking@wdqs1006 rebooting unresponsive host [19:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:47] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:59] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:09] RECOVERY - SSH on wdqs1006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:39:42] (03CR) 10JHathaway: profile: create in module data for profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [19:40:15] (03CR) 10Jcrespo: [C: 03+2] Migrate TLS configuration to separate file and prepare for puppet call [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978133 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [19:40:22] (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.2.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978643 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [19:40:28] (03CR) 10Jcrespo: [C: 03+2] 
add_recent_uploads: Be more resilient against errors [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979160 (owner: 10Jcrespo) [19:48:58] (03PS1) 10Bking: wdqs: Monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) [19:49:35] (03CR) 10CI reject: [V: 04-1] wdqs: Monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:50:33] (03PS2) 10Bking: wdqs: Monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) [19:54:23] (03PS3) 10Bking: wdqs: Monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) [19:54:50] (03CR) 10Dzahn: wdqs: Monitor ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:58:07] (03CR) 10Bking: wdqs: Monitor ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:22:29] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: create a necessary parent dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [20:25:06] (03CR) 10Clare Ming: Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [20:56:20] (03PS4) 10Kimberly Sarabia: Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) [21:05:12] (03CR) 10Clare Ming: [C: 03+1] Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [21:05:44] (03CR) 10Xiaoxiao: [C: 03+1] [admin] 
Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) (owner: 10EoghanGaffney) [21:31:56] (03PS4) 10Bking: wdqs: remove ldf endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) [21:45:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED [21:45:28] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED [21:45:30] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED [21:45:31] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED [22:04:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) [22:04:31] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [22:06:38] (03PS5) 10Ryan Kemper: wdqs: remove ldf endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:07:30] (03CR) 10Dzahn: [C: 03+1] wdqs: remove ldf endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/979408 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:09:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED [22:09:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED [22:09:38] (03CR) 10Bking: [C: 03+2] wdqs: remove ldf endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/979408 
(https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:10:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED [22:10:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED [22:11:20] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [22:13:28] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [22:14:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [22:14:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:15:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED [22:16:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:25] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED [22:17:29] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED [22:29:45] (03PS1) 10Ssingh: wikimedia.org: add 1Password site verification [dns] - 10https://gerrit.wikimedia.org/r/979421 (https://phabricator.wikimedia.org/T352579) [22:36:58] 10SRE-swift-storage, 10Commons, 10UploadWizard: Incomplete files uploaded (10 MB interruption) - https://phabricator.wikimedia.org/T350917 (10RoyZuo) >>! In T350917#9357740, @MatthewVernon wrote: > Looking at recent uploads, there are definitely >10MB files being uploaded: > https://commons.wikimedia.org/wik... 
[22:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:55:01] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:00:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:00:40] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [23:07:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources