[00:00:44] <wikibugs>	 (PS1) Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/909763
[00:01:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:01:53] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:02:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:03:33] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (phaultfinder)
[00:04:46] <wikibugs>	 (PS1) EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771)
[00:05:17] <wikibugs>	 (PS1) Dzahn: phabricator: add parameter for db_datadir in cloud and use default path [puppet] - https://gerrit.wikimedia.org/r/909786
[00:10:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[00:12:42] <wikibugs>	 (PS1) Dzahn: phorge: add parameter for db_datadir and use default path [puppet] - https://gerrit.wikimedia.org/r/909787
[00:13:54] <wikibugs>	 (PS2) Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/909763
[00:13:56] <wikibugs>	 (PS5) Aaron Schulz: Set "templateOverridesBySection" in an etcd.php loop [mediawiki-config] - https://gerrit.wikimedia.org/r/893834
[00:14:09] <wikibugs>	 (PS2) Aaron Schulz: Use pt-heartbeat for all non-static external clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/893835 (https://phabricator.wikimedia.org/T129093)
[00:14:11] <wikibugs>	 (PS1) Dzahn: mariadb::generic_server: change default datadir path [puppet] - https://gerrit.wikimedia.org/r/909788
[00:14:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47155 and previous config saved to /var/cache/conftool/dbconfig/20230419-001423-ladsgroup.json
[00:15:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:18:18] <wikibugs>	 (PS1) Dzahn: acme_chief: add gerrit1003 to hosts allowed for gerrit certs [puppet] - https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368)
[00:19:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[00:22:19] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:15] <wikibugs>	 (PS1) Dzahn: replace gerrit1001 with gerrit1003 as ping target for blackbox smoke [puppet] - https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368)
[00:24:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[00:28:48] <wikibugs>	 (PS1) Dzahn: logstash: replace gerrit1001 with gerrit1003 in tests [puppet] - https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368)
[00:29:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1072.eqiad.wmnet with OS bullseye
[00:29:27] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1072.eqiad.wmnet with OS bullseye
[00:29:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T333332)', diff saved to https://phabricator.wikimedia.org/P47156 and previous config saved to /var/cache/conftool/dbconfig/20230419-002929-ladsgroup.json
[00:29:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[00:29:35] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:29:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[00:29:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47157 and previous config saved to /var/cache/conftool/dbconfig/20230419-002952-ladsgroup.json
[00:30:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[00:30:29] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:01] <wikibugs>	 (PS1) Dzahn: cloudgw: fix IP address for gerrit-replica.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/909794
[00:32:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47158 and previous config saved to /var/cache/conftool/dbconfig/20230419-003235-ladsgroup.json
[00:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:35:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[00:36:35] <wikibugs>	 (PS1) Dzahn: cloudgw: allow VMs to speak to new gerrit server (gerrit1003) [puppet] - https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368)
[00:37:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1073.eqiad.wmnet with OS bullseye
[00:37:32] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1073.eqiad.wmnet with OS bullseye
[00:38:45] <wikibugs>	 (PS1) Dzahn: gerrit: add host-based Hiera keys for gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368)
[00:39:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768
[00:39:25] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768 (owner: TrainBranchBot)
[00:39:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1074.eqiad.wmnet with OS bullseye
[00:39:45] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1074.eqiad.wmnet with OS bullseye
[00:44:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1072.eqiad.wmnet with reason: host reimage
[00:47:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P47159 and previous config saved to /var/cache/conftool/dbconfig/20230419-004741-ladsgroup.json
[00:47:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1072.eqiad.wmnet with reason: host reimage
[00:50:15] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:30] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:54:31] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:57:31] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768 (owner: TrainBranchBot)
[01:00:29] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1073.eqiad.wmnet with reason: host reimage
[01:02:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P47160 and previous config saved to /var/cache/conftool/dbconfig/20230419-010247-ladsgroup.json
[01:04:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1073.eqiad.wmnet with reason: host reimage
[01:05:16] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1074.eqiad.wmnet with reason: host reimage
[01:12:31] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[01:13:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1074.eqiad.wmnet with reason: host reimage
[01:15:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:30] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[01:16:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:17:18] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm) @Papaul dns2003 already exists in netbox. It's in A2.
[01:17:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47161 and previous config saved to /var/cache/conftool/dbconfig/20230419-011754-ladsgroup.json
[01:17:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:18:00] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:18:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:18:39] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Papaul) @Jhancock.wm go from dns2004 up
[01:18:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:18:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1072.eqiad.wmnet with OS bullseye
[01:18:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[01:18:48] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1072.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:19:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[01:20:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[01:21:05] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[01:21:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47162 and previous config saved to /var/cache/conftool/dbconfig/20230419-012114-ladsgroup.json
[01:23:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1075.eqiad.wmnet with OS bullseye
[01:23:15] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1075.eqiad.wmnet with OS bullseye
[01:25:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47163 and previous config saved to /var/cache/conftool/dbconfig/20230419-012509-ladsgroup.json
[01:25:15] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:30:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:32] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:33:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:34:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:36:15] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:37:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1074.eqiad.wmnet with OS bullseye
[01:37:10] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1074.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:38:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1075.eqiad.wmnet with reason: host reimage
[01:40:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47164 and previous config saved to /var/cache/conftool/dbconfig/20230419-014016-ladsgroup.json
[01:42:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1075.eqiad.wmnet with reason: host reimage
[01:44:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:45:11] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:46:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:46:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1073.eqiad.wmnet with OS bullseye
[01:46:26] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1073.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:48:00] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:50:41] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47165 and previous config saved to /var/cache/conftool/dbconfig/20230419-015522-ladsgroup.json
[01:59:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:01:01] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:03:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:03:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1075.eqiad.wmnet with OS bullseye
[02:03:58] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1075.eqiad.wmnet with OS bullseye completed: - ms-be...
[02:04:15] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Papaul)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:41] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Papaul) Open→Resolved This is complete
[02:06:53] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[02:10:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47166 and previous config saved to /var/cache/conftool/dbconfig/20230419-021028-ladsgroup.json
[02:10:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[02:10:35] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:10:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[02:10:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47167 and previous config saved to /var/cache/conftool/dbconfig/20230419-021051-ladsgroup.json
[02:16:33] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47168 and previous config saved to /var/cache/conftool/dbconfig/20230419-021646-ladsgroup.json
[02:16:52] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:31:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47170 and previous config saved to /var/cache/conftool/dbconfig/20230419-023152-ladsgroup.json
[02:46:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47171 and previous config saved to /var/cache/conftool/dbconfig/20230419-024658-ladsgroup.json
[02:50:57] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:09] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:47] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47172 and previous config saved to /var/cache/conftool/dbconfig/20230419-030205-ladsgroup.json
[03:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[03:02:11] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[03:02:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:02:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:02:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T333332)', diff saved to https://phabricator.wikimedia.org/P47173 and previous config saved to /var/cache/conftool/dbconfig/20230419-030234-ladsgroup.json
[03:05:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T333332)', diff saved to https://phabricator.wikimedia.org/P47174 and previous config saved to /var/cache/conftool/dbconfig/20230419-030530-ladsgroup.json
[03:07:23] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:37] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47175 and previous config saved to /var/cache/conftool/dbconfig/20230419-032036-ladsgroup.json
[03:32:19] <icinga-wm_>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47176 and previous config saved to /var/cache/conftool/dbconfig/20230419-033542-ladsgroup.json
[03:40:01] <icinga-wm_>	 PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[03:47:19] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:50:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T333332)', diff saved to https://phabricator.wikimedia.org/P47177 and previous config saved to /var/cache/conftool/dbconfig/20230419-035048-ladsgroup.json
[03:50:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:50:55] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:51:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:51:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47178 and previous config saved to /var/cache/conftool/dbconfig/20230419-035112-ladsgroup.json
[03:55:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47180 and previous config saved to /var/cache/conftool/dbconfig/20230419-035507-ladsgroup.json
[04:08:35] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47181 and previous config saved to /var/cache/conftool/dbconfig/20230419-041013-ladsgroup.json
[04:25:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47182 and previous config saved to /var/cache/conftool/dbconfig/20230419-042520-ladsgroup.json
[04:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:40:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47183 and previous config saved to /var/cache/conftool/dbconfig/20230419-044027-ladsgroup.json
[04:40:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:40:31] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:40:33] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[04:40:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:40:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47184 and previous config saved to /var/cache/conftool/dbconfig/20230419-044050-ladsgroup.json
[04:44:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47185 and previous config saved to /var/cache/conftool/dbconfig/20230419-044445-ladsgroup.json
[04:53:43] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:16] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:59:17] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:59:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47186 and previous config saved to /var/cache/conftool/dbconfig/20230419-045951-ladsgroup.json
[05:00:13] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:33] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:41] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:02:07] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:04:39] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.405 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:11] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:39] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47187 and previous config saved to /var/cache/conftool/dbconfig/20230419-051457-ladsgroup.json
[05:16:03] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:14] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (Volans) >>! In T334680#8791310, @Dzahn wrote: > But since the compilers are running in cloud VPS and there it's neither of the...
[05:17:17] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[05:20:16] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[05:30:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47188 and previous config saved to /var/cache/conftool/dbconfig/20230419-053003-ladsgroup.json
[05:30:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[05:30:10] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[05:30:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[05:30:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47189 and previous config saved to /var/cache/conftool/dbconfig/20230419-053027-ladsgroup.json
[05:31:16] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) >>! In T296832#8791457, @cmooney wrote: > In terms of next steps we obviously need to keep things consistent....
[05:33:23] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:34:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47190 and previous config saved to /var/cache/conftool/dbconfig/20230419-053425-ladsgroup.json
[05:37:18] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[05:38:16] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[05:44:03] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:48:17] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[05:49:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47191 and previous config saved to /var/cache/conftool/dbconfig/20230419-054931-ladsgroup.json
[05:50:36] <wikibugs>	 (CR) Volans: "Post-merge comments" [cookbooks] - https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) (owner: Bking)
[05:51:03] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0600)
[06:00:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47192 and previous config saved to /var/cache/conftool/dbconfig/20230419-060437-ladsgroup.json
[06:07:23] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:30] <wikibugs>	 (PS1) Marostegui: db1212: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909851 (https://phabricator.wikimedia.org/T326669)
[06:08:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P47193 and previous config saved to /var/cache/conftool/dbconfig/20230419-060803-root.json
[06:08:10] <wikibugs>	 (CR) Marostegui: [C: +2] db1212: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909851 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:08:34] <wikibugs>	 (CR) Volans: "A question and few comments inline" [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[06:12:26] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1219 to dbctl [puppet] - https://gerrit.wikimedia.org/r/909853 (https://phabricator.wikimedia.org/T326669)
[06:13:02] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1219 to dbctl [puppet] - https://gerrit.wikimedia.org/r/909853 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:14:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1219 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P47194 and previous config saved to /var/cache/conftool/dbconfig/20230419-061414-marostegui.json
[06:14:17] <wikibugs>	 (CR) Volans: [C: +1] service: add comment for spicerack field addition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909605 (owner: Clément Goubert)
[06:14:20] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:15:37] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:42] <wikibugs>	 SRE, Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (ayounsi) Last time I checked it was not possible/recommended to edit a cable, but instead delete/create it. We could also store the cable IDs keyed...
[06:17:39] <wikibugs>	 (PS1) Marostegui: db1119: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909855
[06:18:05] <wikibugs>	 (CR) Marostegui: [C: +2] db1119: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909855 (owner: Marostegui)
[06:19:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47195 and previous config saved to /var/cache/conftool/dbconfig/20230419-061944-ladsgroup.json
[06:19:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[06:19:50] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[06:20:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[06:20:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47196 and previous config saved to /var/cache/conftool/dbconfig/20230419-062007-ladsgroup.json
[06:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:21:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6)', diff saved to https://phabricator.wikimedia.org/P47197 and previous config saved to /var/cache/conftool/dbconfig/20230419-062123-root.json
[06:21:57] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:07] <wikibugs>	 (PS1) Marostegui: db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909856 (https://phabricator.wikimedia.org/T326669)
[06:23:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P47200 and previous config saved to /var/cache/conftool/dbconfig/20230419-062307-root.json
[06:23:42] <wikibugs>	 (CR) Marostegui: [C: +2] db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909856 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:24:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47201 and previous config saved to /var/cache/conftool/dbconfig/20230419-062401-ladsgroup.json
[06:29:53] <wikibugs>	 (PS1) Marostegui: db1213: Add it to s5 and s6 [puppet] - https://gerrit.wikimedia.org/r/909941 (https://phabricator.wikimedia.org/T326669)
[06:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:30:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:54] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:23] <wikibugs>	 (CR) Marostegui: [C: +2] db1213: Add it to s5 and s6 [puppet] - https://gerrit.wikimedia.org/r/909941 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:36:41] <wikibugs>	 (PS1) Marostegui: db1219: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909942 (https://phabricator.wikimedia.org/T326669)
[06:37:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1219: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909942 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:37:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P47202 and previous config saved to /var/cache/conftool/dbconfig/20230419-063713-root.json
[06:38:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P47203 and previous config saved to /var/cache/conftool/dbconfig/20230419-063812-root.json
[06:38:19] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:38:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:39:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47204 and previous config saved to /var/cache/conftool/dbconfig/20230419-063907-ladsgroup.json
[06:39:19] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1223 [puppet] - https://gerrit.wikimedia.org/r/909943
[06:39:51] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1223 [puppet] - https://gerrit.wikimedia.org/r/909943 (owner: Marostegui)
[06:41:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 T335011', diff saved to https://phabricator.wikimedia.org/P47205 and previous config saved to /var/cache/conftool/dbconfig/20230419-064122-root.json
[06:41:27] <stashbot>	 T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011
[06:42:50] <wikibugs>	 (PS1) Marostegui: db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909944 (https://phabricator.wikimedia.org/T326683)
[06:43:17] <wikibugs>	 (CR) Marostegui: [C: +2] db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909944 (https://phabricator.wikimedia.org/T326683) (owner: Marostegui)
[06:45:50] <wikibugs>	 (PS10) Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - https://gerrit.wikimedia.org/r/895182
[06:46:14] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:42] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1216 [puppet] - https://gerrit.wikimedia.org/r/909945
[06:50:14] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1216 [puppet] - https://gerrit.wikimedia.org/r/909945 (owner: Marostegui)
[06:50:32] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:04] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for KBach - https://phabricator.wikimedia.org/T334931 (KBach) Thanks @Clement_Goubert!
[06:52:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P47206 and previous config saved to /var/cache/conftool/dbconfig/20230419-065218-root.json
[06:53:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P47207 and previous config saved to /var/cache/conftool/dbconfig/20230419-065317-root.json
[06:54:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47208 and previous config saved to /var/cache/conftool/dbconfig/20230419-065413-ladsgroup.json
[06:55:02] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1116 - https://phabricator.wikimedia.org/T334926 (jcrespo) a:Jclark-ctr
[06:56:30] <wikibugs>	 (PS3) KartikMistry: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102)
[06:59:08] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1102 - https://phabricator.wikimedia.org/T334927 (jcrespo) a:Jclark-ctr
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0700). Please do the needful.
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:12] <kart_>	 I'm here!
[07:01:37] <kart_>	 I'll go ahead with deployment for my patch.
[07:01:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) (owner: KartikMistry)
[07:02:36] <wikibugs>	 (Merged) jenkins-bot: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) (owner: KartikMistry)
[07:03:25] <wikibugs>	 (CR) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. (3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede)
[07:03:33] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]]
[07:03:39] <stashbot>	 T327102: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T327102
[07:04:53] <wikibugs>	 (PS9) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - https://gerrit.wikimedia.org/r/899519
[07:05:06] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:05:07] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P47209 and previous config saved to /var/cache/conftool/dbconfig/20230419-070723-root.json
[07:08:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P47210 and previous config saved to /var/cache/conftool/dbconfig/20230419-070822-root.json
[07:09:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47211 and previous config saved to /var/cache/conftool/dbconfig/20230419-070920-ladsgroup.json
[07:09:25] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[07:10:20] <XioNoX>	 !log push pfw policies - T334983
[07:10:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:40] <wikibugs>	 (CR) EoghanGaffney: [C: +1] "Looks good. Optionally would consider moving this to `/opt/bin` and `/opt/etc`, but we can do that later" [puppet] - https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[07:13:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (ayounsi)
[07:13:07] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]] (duration: 09m 33s)
[07:13:12] <stashbot>	 T327102: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T327102
[07:14:32] <kart_>	 I'm done with my config deployment. And, there are no more patches in the backport/config window.
[07:15:21] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:39] <XioNoX>	 !log update TLS cert on pfw - T334676
[07:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[07:18:01] <wikibugs>	 (PS1) Marostegui: site.pp: Add db1213 to s5 and s6 [puppet] - https://gerrit.wikimedia.org/r/909947 (https://phabricator.wikimedia.org/T326683)
[07:22:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P47212 and previous config saved to /var/cache/conftool/dbconfig/20230419-072228-root.json
[07:23:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P47213 and previous config saved to /var/cache/conftool/dbconfig/20230419-072326-root.json
[07:25:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) Slightly relevant - https://wikitech.wikimedia.org/wiki/Juniper_TLS_certificate_install
[07:26:55] <wikibugs>	 (CR) Muehlenhoff: SSH Keymanagement, allow user to manage ssh keys. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede)
[07:31:28] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1224 [puppet] - https://gerrit.wikimedia.org/r/909949
[07:31:59] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1224 [puppet] - https://gerrit.wikimedia.org/r/909949 (owner: Marostegui)
[07:37:24] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:37:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P47214 and previous config saved to /var/cache/conftool/dbconfig/20230419-073732-root.json
[07:38:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P47215 and previous config saved to /var/cache/conftool/dbconfig/20230419-073831-root.json
[07:39:21] <wikibugs>	 (PS1) Muehlenhoff: Remove access for ktsouroupidou [puppet] - https://gerrit.wikimedia.org/r/909950
[07:41:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for ktsouroupidou [puppet] - https://gerrit.wikimedia.org/r/909950 (owner: Muehlenhoff)
[07:50:12] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Add db1213 to s5 and s6 [puppet] - https://gerrit.wikimedia.org/r/909947 (https://phabricator.wikimedia.org/T326683) (owner: Marostegui)
[07:50:47] <wikibugs>	 (PS1) Marostegui: Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/909871
[07:51:13] <wikibugs>	 (PS1) Stevemunene: Add Product analytics airflow instance [puppet] - https://gerrit.wikimedia.org/r/909951 (https://phabricator.wikimedia.org/T333000)
[07:51:39] <wikibugs>	 (PS1) Marostegui: site.pp: Remove insetup from db1213 [puppet] - https://gerrit.wikimedia.org/r/909952 (https://phabricator.wikimedia.org/T326669)
[07:52:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47216 and previous config saved to /var/cache/conftool/dbconfig/20230419-075203-root.json
[07:52:37] <wikibugs>	 SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (WMDE-leszek)
[07:52:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 6%: Pooling', diff saved to https://phabricator.wikimedia.org/P47217 and previous config saved to /var/cache/conftool/dbconfig/20230419-075237-root.json
[07:52:47] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove insetup from db1213 [puppet] - https://gerrit.wikimedia.org/r/909952 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:53:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P47218 and previous config saved to /var/cache/conftool/dbconfig/20230419-075336-root.json
[07:53:58] <wikibugs>	 (PS1) Marostegui: site.pp: Fix typo [puppet] - https://gerrit.wikimedia.org/r/909953
[07:54:31] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Fix typo [puppet] - https://gerrit.wikimedia.org/r/909953 (owner: Marostegui)
[07:55:52] <wikibugs>	 SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (WMDE-leszek) Hello, regarding the wikibase/termbox  service -- we'd be fine with a move to gitlab but have a question for ourselves to find answer fo...
[07:57:26] <wikibugs>	 (PS1) Elukey: services: add kafka-logging200[4,5] IPs to eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510)
[07:58:10] <wikibugs>	 (CR) Clément Goubert: service: add comment for spicerack field addition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909605 (owner: Clément Goubert)
[07:58:26] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:00:06] <jouncebot>	 jnuche and ^demon: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0800).
[08:00:18] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/909871 (owner: Marostegui)
[08:00:25] <jnuche>	 good morning, I'll be deploying in 10m
[08:00:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47219 and previous config saved to /var/cache/conftool/dbconfig/20230419-080030-root.json
[08:03:33] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[08:07:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P47220 and previous config saved to /var/cache/conftool/dbconfig/20230419-080708-root.json
[08:07:26] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 7%: Pooling', diff saved to https://phabricator.wikimedia.org/P47221 and previous config saved to /var/cache/conftool/dbconfig/20230419-080742-root.json
[08:08:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P47222 and previous config saved to /var/cache/conftool/dbconfig/20230419-080841-root.json
[08:10:18] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Add second tracking category for Graph (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: Lucas Werkmeister (WMDE))
[08:10:50] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Marostegui)
[08:11:02] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211)
[08:11:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211) (owner: TrainBranchBot)
[08:12:02] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211) (owner: TrainBranchBot)
[08:15:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P47223 and previous config saved to /var/cache/conftool/dbconfig/20230419-081535-root.json
[08:15:38] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:30] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.5  refs T330211
[08:18:35] <stashbot>	 T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211
[08:22:02] <wikibugs>	 (03PS2) 10Elukey: services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510)
[08:22:12] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P47224 and previous config saved to /var/cache/conftool/dbconfig/20230419-082213-root.json
[08:22:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 8%: Pooling', diff saved to https://phabricator.wikimedia.org/P47225 and previous config saved to /var/cache/conftool/dbconfig/20230419-082247-root.json
[08:23:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[08:23:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[08:23:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P47226 and previous config saved to /var/cache/conftool/dbconfig/20230419-082345-root.json
[08:23:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10ItamarWMDE) Thank you @Clement_Goubert!
[08:24:13] <logmsgbot>	 !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.5  refs T330211 (duration: 05m 43s)
[08:24:18] <stashbot>	 T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211
[08:27:28] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/909872
[08:27:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47227 and previous config saved to /var/cache/conftool/dbconfig/20230419-082738-root.json
[08:27:56] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/909872 (owner: Marostegui)
[08:30:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P47228 and previous config saved to /var/cache/conftool/dbconfig/20230419-083040-root.json
[08:30:52] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:34:14] <wikibugs>	 (03PS1) 10Slyngshede: Enable emailing for signup and password reset [software/bitu] - 10https://gerrit.wikimedia.org/r/909959
[08:34:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:01:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Test
[08:35:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:01:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Test
[08:35:18] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) >>! In T334987#8792130, @ayounsi wrote: > We could also store the cable IDs keyed by remote interface ID and re-use that when re-creating t...
[08:37:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P47229 and previous config saved to /var/cache/conftool/dbconfig/20230419-083717-root.json
[08:37:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 9%: Pooling', diff saved to https://phabricator.wikimedia.org/P47230 and previous config saved to /var/cache/conftool/dbconfig/20230419-083753-root.json
[08:39:34] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kevinbazira)
[08:40:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[08:41:43] <wikibugs>	 (03Abandoned) 10Muehlenhoff: httpd: Let Puppet pick the init provider [puppet] - 10https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[08:42:33] <wikibugs>	 (PS1) Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - https://gerrit.wikimedia.org/r/909873
[08:42:40] <wikibugs>	 (CR) CI reject: [V: -1] Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - https://gerrit.wikimedia.org/r/909873 (owner: Clément Goubert)
[08:42:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney)
[08:42:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47231 and previous config saved to /var/cache/conftool/dbconfig/20230419-084243-root.json
[08:43:02] <wikibugs>	 (03PS1) 10Stevemunene: Configure product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000)
[08:43:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cable labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney)
[08:43:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[08:45:26] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:45:29] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:45:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P47232 and previous config saved to /var/cache/conftool/dbconfig/20230419-084545-root.json
[08:46:01] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:55] <wikibugs>	 (03PS2) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873
[08:47:01] <wikibugs>	 (03CR) 10Btullis: "Looks good. The PCC failure for the aqs node looks like it's just a missing dummy secret, so it's a +1 from me in principle, as soon as th" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans)
[08:47:51] <wikibugs>	 (CR) Clément Goubert: [C: -2] "Holding for switchback" [dns] - https://gerrit.wikimedia.org/r/909873 (owner: Clément Goubert)
[08:48:41] <wikibugs>	 (PS1) Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/909874
[08:49:50] <wikibugs>	 (PS2) Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/909874
[08:50:06] <wikibugs>	 (CR) Clément Goubert: [C: -2] "Hold for switchback" [puppet] - https://gerrit.wikimedia.org/r/909874 (owner: Clément Goubert)
[08:50:23] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:33] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert)
[08:50:41] <wikibugs>	 (03PS3) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015)
[08:50:50] <wikibugs>	 (03PS3) 10Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015)
[08:51:25] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[08:52:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47233 and previous config saved to /var/cache/conftool/dbconfig/20230419-085222-root.json
[08:52:47] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[08:52:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Do...
[08:52:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P47234 and previous config saved to /var/cache/conftool/dbconfig/20230419-085257-root.json
[08:56:17] <wikibugs>	 (CR) JMeybohm: [C: +1] services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: Elukey)
[08:57:13] <wikibugs>	 (CR) Elukey: [C: +2] services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: Elukey)
[08:57:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[08:57:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47235 and previous config saved to /var/cache/conftool/dbconfig/20230419-085748-root.json
[08:59:30] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[08:59:35] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:59:38] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:59:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync
[08:59:55] <wikibugs>	 (CR) Clément Goubert: P:lists:monitoring: Raise process count for uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909247 (owner: Clément Goubert)
[09:00:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync
[09:00:32] <wikibugs>	 (03PS11) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182
[09:00:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47236 and previous config saved to /var/cache/conftool/dbconfig/20230419-090050-root.json
[09:00:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync
[09:01:09] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync
[09:03:46] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:03:49] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:05:07] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/909673 (https://phabricator.wikimedia.org/T333550) (owner: Clément Goubert)
[09:05:10] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:05:14] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:07:05] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:07:09] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:07:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47237 and previous config saved to /var/cache/conftool/dbconfig/20230419-090727-root.json
[09:07:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:07:36] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:08:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P47238 and previous config saved to /var/cache/conftool/dbconfig/20230419-090802-root.json
[09:08:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10AndrewTavis_WMDE) I just signed the NDA :)  @Aklapper I'll connect with @karapayneWMDE about changing the templates. Thanks for bringing this to our attention!
[09:12:49] <wikibugs>	 (CR) Clément Goubert: [C: +2] admin: Add atieno to to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/909673 (https://phabricator.wikimedia.org/T333550) (owner: Clément Goubert)
[09:12:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47239 and previous config saved to /var/cache/conftool/dbconfig/20230419-091252-root.json
[09:15:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47240 and previous config saved to /var/cache/conftool/dbconfig/20230419-091554-root.json
[09:17:31] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[09:19:30] <wikibugs>	 SRE, SRE-Access-Requests, API Platform (Sprint 07), Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Clement_Goubert) @Atieno Your access request has been merged and should be operational within the next half hour, you have...
[09:19:34] <wikibugs>	 SRE, SRE-Access-Requests, API Platform (Sprint 07), Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Clement_Goubert) In progress→Resolved
[09:20:30] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[09:20:36] <wikibugs>	 (03PS9) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774)
[09:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47241 and previous config saved to /var/cache/conftool/dbconfig/20230419-092232-root.json
[09:23:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P47242 and previous config saved to /var/cache/conftool/dbconfig/20230419-092307-root.json
[09:26:52] <wikibugs>	 (PS1) Marostegui: db1117: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909963 (https://phabricator.wikimedia.org/T335017)
[09:27:18] <wikibugs>	 (CR) Marostegui: [C: +2] db1117: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909963 (https://phabricator.wikimedia.org/T335017) (owner: Marostegui)
[09:27:38] <wikibugs>	 (CR) Func: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/909875 (owner: Func)
[09:27:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47243 and previous config saved to /var/cache/conftool/dbconfig/20230419-092757-root.json
[09:29:18] <wikibugs>	 (PS1) Marostegui: section: Update zarcillo location [software] - https://gerrit.wikimedia.org/r/909964 (https://phabricator.wikimedia.org/T334455)
[09:29:37] <wikibugs>	 (CR) Func: cleanup: Remove duplicate permission config of confirmed users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/909875 (owner: Func)
[09:30:03] <wikibugs>	 (CR) Marostegui: [C: +2] section: Update zarcillo location [software] - https://gerrit.wikimedia.org/r/909964 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:31:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47244 and previous config saved to /var/cache/conftool/dbconfig/20230419-093059-root.json
[09:31:39] <wikibugs>	 (PS1) Marostegui: host-to-instance: Change zarcillo location [software] - https://gerrit.wikimedia.org/r/909965 (https://phabricator.wikimedia.org/T334455)
[09:32:11] <wikibugs>	 (CR) Marostegui: [C: +2] host-to-instance: Change zarcillo location [software] - https://gerrit.wikimedia.org/r/909965 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:32:15] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:32:40] <wikibugs>	 (03Merged) 10jenkins-bot: host-to-instance: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909965 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui)
[09:33:29] <wikibugs>	 (PS1) Gerrit maintenance bot: Add fat to langlist helper [dns] - https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016)
[09:35:30] <wikibugs>	 (CR) Zabe: [C: +1] Add fat to langlist helper [dns] - https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016) (owner: Gerrit maintenance bot)
[09:37:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[09:37:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47245 and previous config saved to /var/cache/conftool/dbconfig/20230419-093737-root.json
[09:38:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P47246 and previous config saved to /var/cache/conftool/dbconfig/20230419-093812-root.json
[09:38:29] <wikibugs>	 (03PS1) 10Marostegui: check-master-heartbeat.sh: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455)
[09:38:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[09:39:21] <wikibugs>	 (CR) Marostegui: [C: +2] check-master-heartbeat.sh: Change zarcillo location [software] - https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:39:30] <wikibugs>	 (CR) CI reject: [V: -1] check-master-heartbeat.sh: Change zarcillo location [software] - https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:40:27] <wikibugs>	 (03PS10) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774)
[09:41:57] <wikibugs>	 (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:42:44] <wikibugs>	 (CR) Marostegui: [C: +2] check-master-heartbeat.sh: Change zarcillo location [software] - https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:43:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47247 and previous config saved to /var/cache/conftool/dbconfig/20230419-094302-root.json
[09:44:28] <wikibugs>	 (03PS1) 10Marostegui: common.yaml: Add db1215 to mysql clients [puppet] - 10https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455)
[09:46:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47248 and previous config saved to /var/cache/conftool/dbconfig/20230419-094604-root.json
[09:46:14] <wikibugs>	 (CR) Ladsgroup: [C: +1] common.yaml: Add db1215 to mysql clients [puppet] - https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[09:46:43] <wikibugs>	 (CR) Marostegui: [C: +2] common.yaml: Add db1215 to mysql clients [puppet] - https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:48:15] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[09:48:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[09:48:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[09:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47249 and previous config saved to /var/cache/conftool/dbconfig/20230419-094836-ladsgroup.json
[09:48:42] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[09:50:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47250 and previous config saved to /var/cache/conftool/dbconfig/20230419-095044-ladsgroup.json
[09:52:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47252 and previous config saved to /var/cache/conftool/dbconfig/20230419-095241-root.json
[09:53:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P47253 and previous config saved to /var/cache/conftool/dbconfig/20230419-095316-root.json
[09:58:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47254 and previous config saved to /var/cache/conftool/dbconfig/20230419-095807-root.json
[09:58:35] <wikibugs>	 10ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (10MoritzMuehlenhoff)
[09:59:32] <icinga-wm_>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:32] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1000)
[10:00:40] <wikibugs>	 (03PS1) 10Elukey: amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009)
[10:00:42] <wikibugs>	 (03PS1) 10Elukey: role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009)
[10:01:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47255 and previous config saved to /var/cache/conftool/dbconfig/20230419-100109-root.json
[10:01:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:01:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[10:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[10:03:21] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40746/console" [puppet] - https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[10:03:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[10:03:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[10:03:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:04:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:04:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[10:04:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[10:05:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance
[10:05:35] <wikibugs>	 (03PS2) 10Elukey: amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009)
[10:05:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance
[10:05:37] <wikibugs>	 (PS2) Elukey: role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009)
[10:05:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47256 and previous config saved to /var/cache/conftool/dbconfig/20230419-100550-ladsgroup.json
[10:07:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[10:07:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47257 and previous config saved to /var/cache/conftool/dbconfig/20230419-100746-root.json
[10:07:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[10:08:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance
[10:08:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance
[10:09:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[10:09:15] <wikibugs>	 (PS1) Elukey: amd-gpu-tester: add more ROCm packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/909970 (https://phabricator.wikimedia.org/T333009)
[10:09:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[10:10:50] <wikibugs>	 (PS1) Marostegui: prometheus.yaml: Change zarcillo location [puppet] - https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455)
[10:11:52] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:13:50] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:13:58] <icinga-wm_>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:12] <icinga-wm_>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[10:15:38] <icinga-wm_>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1741 days) https://wikitech.wikimedia.org/wiki/Logs
[10:16:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47258 and previous config saved to /var/cache/conftool/dbconfig/20230419-101614-root.json
[10:16:33] <wikibugs>	 (CR) Hnowlan: [C: +1] cassandra: add de-init to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[10:17:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[10:17:12] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) Open→In progress p:Triage→Medium a:jbond
[10:17:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[10:18:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:18:15] <wikibugs>	 (PS1) Jbond: pcc_facts_processor: skip invalid names [puppet] - https://gerrit.wikimedia.org/r/909973 (https://phabricator.wikimedia.org/T334680)
[10:20:14] <icinga-wm_>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[10:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[10:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47259 and previous config saved to /var/cache/conftool/dbconfig/20230419-102057-ladsgroup.json
[10:21:46] <icinga-wm_>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1257 days) https://wikitech.wikimedia.org/wiki/Logs
[10:22:24] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:23:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:26:47] <wikibugs>	 (CR) Jbond: "Thanks for this, there are a few places where this pattern has been reinvented.  CR lgtm just a couple of things to check with wmcs" [puppet] - https://gerrit.wikimedia.org/r/909756 (owner: JHathaway)
[10:27:02] <wikibugs>	 (CR) Jbond: "https://puppet-compiler.wmflabs.org/output/909756/40745/pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud/change.pcc-worker1001.puppet-di" [puppet] - https://gerrit.wikimedia.org/r/909756 (owner: JHathaway)
[10:28:29] <wikibugs>	 (PS1) Klausman: hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414)
[10:28:38] <wikibugs>	 (CR) Muehlenhoff: "Looks good, a few comments inline" [software/bitu] - https://gerrit.wikimedia.org/r/909959 (owner: Slyngshede)
[10:29:08] <wikibugs>	 (CR) Jcrespo: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[10:29:10] <wikibugs>	 (CR) Jbond: [C: +2] pcc_facts_processor: skip invalid names [puppet] - https://gerrit.wikimedia.org/r/909973 (https://phabricator.wikimedia.org/T334680) (owner: Jbond)
[10:29:14] <wikibugs>	 (PS2) Klausman: hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414)
[10:29:18] <wikibugs>	 (CR) Marostegui: [C: +2] prometheus.yaml: Change zarcillo location [puppet] - https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[10:29:41] <jbond>	 marostegui: happy for me to merge yours?
[10:29:46] <marostegui>	 jbond: go for it!
[10:29:51] <marostegui>	 thanks
[10:29:59] <jbond>	 np, done
[10:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[10:31:10] <wikibugs>	 (CR) Elukey: [C: +1] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[10:32:50] <wikibugs>	 (CR) Klausman: [C: +2] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[10:33:10] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) Thanks for the debugging, the issue was because the facts were not updating, which happened because there was/is an i...
[10:33:36] <wikibugs>	 (CR) Klausman: [V: +2 C: +2] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[10:34:16] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) In progress→Resolved going to tentatively close this but please reopen if you still see the issue
[10:34:39] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.e4 in eqiad
[10:36:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47260 and previous config saved to /var/cache/conftool/dbconfig/20230419-103603-ladsgroup.json
[10:36:09] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[10:37:16] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.e4 in eqiad
[10:39:58] <wikibugs>	 (CR) Muehlenhoff: "This looks fine per se, but note that setting the raid fact to "perccli" currently also enables RAID checks (see raid::perccli), are those" [puppet] - https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[10:40:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.1a in eqiad
[10:41:22] <wikibugs>	 SRE, DBA, DiscussionTools, Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (Ladsgroup) In this specific case, it wasn't slo...
[10:42:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-en-local-public.1a in eqiad
[10:43:35] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.1a in codfw
[10:45:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:45:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[10:46:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-en-local-public.1a in codfw
[10:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[10:47:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[10:47:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[10:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[10:48:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance
[10:48:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[10:49:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[10:49:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance
[10:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance
[11:00:20] <wikibugs>	 (CR) Jbond: core_modules: add core modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908326 (owner: Jbond)
[11:00:23] <wikibugs>	 (PS4) Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490)
[11:00:25] <wikibugs>	 (PS11) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[11:00:27] <wikibugs>	 (PS13) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[11:00:29] <wikibugs>	 (PS10) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490)
[11:00:31] <wikibugs>	 (PS54) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356
[11:00:33] <wikibugs>	 (CR) Jbond: puppet::agent: rename the enable_puppet7 flag (2 comments) [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[11:00:49] <wikibugs>	 (CR) Jbond: wmflib: updat ipresolv to work with puppet7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond)
[11:02:49] <sukhe>	 jouncebot: nowandnext
[11:02:49] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 57 minute(s)
[11:02:49] <jouncebot>	 In 1 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300)
[11:03:30] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:47] <sukhe>	 jouncebot: refresh
[11:03:48] <jouncebot>	 I refreshed my knowledge about deployments.
[11:03:52] <sukhe>	 jouncebot: nowandnext
[11:03:52] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 56 minute(s)
[11:03:52] <jouncebot>	 In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300)
[11:04:08] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:24] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:28] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:42] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (ssingh) Thanks to everyone who worked on debugging/resolving this! I will try it again for the reimages in eqiad to see how it...
[11:05:44] <wikibugs>	 (CR) Muehlenhoff: puppet::agent: rename the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[11:05:55] <wikibugs>	 (PS1) Ayounsi: mgmt: allow prometheus [homer/public] - https://gerrit.wikimedia.org/r/909980 (https://phabricator.wikimedia.org/T335027)
[11:07:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi) {P47077}
[11:08:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi)
[11:09:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi)
[11:18:05] <wikibugs>	 (CR) Muehlenhoff: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[11:18:41] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=7; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:19:27] <wikibugs>	 (PS1) Ssingh: depool eqiad (emergency patch, do not merge) [dns] - https://gerrit.wikimedia.org/r/909985 (https://phabricator.wikimedia.org/T321309)
[11:34:27] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, but we need to merge" [puppet] - https://gerrit.wikimedia.org/r/909658 (owner: Slyngshede)
[11:36:45] <wikibugs>	 (PS55) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[11:36:57] <wikibugs>	 ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (elukey)
[11:37:24] <wikibugs>	 ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (elukey)
[11:38:18] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:47:54] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[11:50:17] <wikibugs>	 (CR) Btullis: amd_gpu: add udev rules to bypass the 'render' group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[11:50:55] <wikibugs>	 (CR) Btullis: [C: +1] Add Product analytics airflow instance [puppet] - https://gerrit.wikimedia.org/r/909951 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene)
[11:56:32] <wikibugs>	 (CR) Elukey: amd_gpu: add udev rules to bypass the 'render' group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[11:58:56] <wikibugs>	 (CR) Btullis: Configure product analytics airflow instance (4 comments) [puppet] - https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene)
[11:59:28] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:31] <wikibugs>	 (CR) Ottomata: "OH!  I had suspected T326419, but ruled it out because at least one live broker was still in the list of bootstrap servers, and the rest o" [deployment-charts] - https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: Elukey)
[12:05:50] <wikibugs>	 (PS4) KartikMistry: Add MinT support to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/905579
[12:10:46] <wikibugs>	 (PS5) Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490)
[12:10:48] <wikibugs>	 (PS12) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[12:10:50] <wikibugs>	 (PS14) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[12:10:52] <wikibugs>	 (PS11) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490)
[12:10:54] <wikibugs>	 (PS56) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[12:12:21] <wikibugs>	 (PS1) Zabe: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921)
[12:12:30] <wikibugs>	 (PS2) Zabe: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921)
[12:13:33] <wikibugs>	 (PS4) Hokwelum: make dumpsdata1006 the xmlfallback host [puppet] - https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232)
[12:13:35] <wikibugs>	 (PS1) Hokwelum: Add orb1.de1.scatter.red to rsync config [puppet] - https://gerrit.wikimedia.org/r/909990
[12:19:38] <wikibugs>	 (CR) David Caro: [C: +2] build: add helper scripts [software/tools-webservice] - https://gerrit.wikimedia.org/r/907890 (owner: David Caro)
[12:20:29] <wikibugs>	 (Merged) jenkins-bot: build: add helper scripts [software/tools-webservice] - https://gerrit.wikimedia.org/r/907890 (owner: David Caro)
[12:21:08] <wikibugs>	 (PS2) Hokwelum: Add orb1.de1.scatter.red to rsync config for dumps [puppet] - https://gerrit.wikimedia.org/r/909990
[12:23:12] <icinga-wm_>	 PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:15] <wikibugs>	 (PS4) Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061)
[12:24:32] <wikibugs>	 (PS1) David Caro: build_deb: use wikimedia images [software/tools-webservice] - https://gerrit.wikimedia.org/r/909991
[12:25:26] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:27:40] <wikibugs>	 (CR) CI reject: [V: -1] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) (owner: Clément Goubert)
[12:28:34] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (phaultfinder)
[12:30:23] <wikibugs>	 (PS1) Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414)
[12:31:20] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:31:24] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add orb1.de1.scatter.red to rsync config for dumps [puppet] - https://gerrit.wikimedia.org/r/909990 (owner: Hokwelum)
[12:31:55] <wikibugs>	 (PS3) ArielGlenn: Add orb1.de1.scatter.red to rsync config for dumps [puppet] - https://gerrit.wikimedia.org/r/909990 (owner: Hokwelum)
[12:32:28] <wikibugs>	 (PS2) Slyngshede: Enable emailing for signup and password reset [software/bitu] - https://gerrit.wikimedia.org/r/909959
[12:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:34:06] <wikibugs>	 (CR) CI reject: [V: -1] Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[12:36:31] <wikibugs>	 (PS5) Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061)
[12:39:46] <wikibugs>	 (PS1) Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414)
[12:40:13] <wikibugs>	 (Abandoned) Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[12:41:56] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:44:15] <wikibugs>	 (CR) Slyngshede: "I needed to move a few things around to have the templates be configurable." [software/bitu] - https://gerrit.wikimedia.org/r/909959 (owner: Slyngshede)
[12:53:46] <wikibugs>	 (CR) Elukey: Lift Wing: Add new namespace for ores-legacy service (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[12:54:29] <wikibugs>	 (CR) Elukey: Lift Wing: Add new namespace for ores-legacy service (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[12:55:37] <wikibugs>	 (PS2) Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414)
[12:55:54] <wikibugs>	 (CR) Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[12:58:06] <wikibugs>	 (CR) Elukey: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[13:00:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (let's still open the task to figure out from which address to send the mails in production, though?)" [software/bitu] - https://gerrit.wikimedia.org/r/909959 (owner: Slyngshede)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300). nyaa~
[13:00:05] <jouncebot>	 Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] <Func>	 o/
[13:00:18] <wikibugs>	 (PS3) Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414)
[13:00:46] <wikibugs>	 (CR) Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[13:00:54] <taavi>	 o/ I can deploy
[13:00:56] <urbanecm>	 i can deploy today
[13:01:00] <urbanecm>	 well, taavi was quicker :)
[13:01:01] <Lucas_WMDE>	 I can’t, so go ahead ^^
[13:01:18] <Lucas_WMDE>	 (excellent jouncebot message though)
[13:01:32] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/909875 (owner: Func)
[13:01:46] <wikibugs>	 (CR) Ladsgroup: P:lists:monitoring: Raise process count for uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909247 (owner: Clément Goubert)
[13:01:53] <wikibugs>	 (PS2) Ladsgroup: P:lists:monitoring: Raise process count for uwsgi [puppet] - https://gerrit.wikimedia.org/r/909247 (owner: Clément Goubert)
[13:01:55] <Sario>	 Lucas_WMDE: I hang out in this channel primarily for jouncebot's messages
[13:01:58] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] P:lists:monitoring: Raise process count for uwsgi [puppet] - https://gerrit.wikimedia.org/r/909247 (owner: Clément Goubert)
[13:02:17] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[13:02:18] <TheresNoTime>	 ooh we had that one twice in a row I think :3
[13:02:19] <wikibugs>	 (Merged) jenkins-bot: cleanup: Remove duplicate permission config of confirmed users [mediawiki-config] - https://gerrit.wikimedia.org/r/909875 (owner: Func)
[13:02:45] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]]
[13:03:32] <wikibugs>	 (PS1) JMeybohm: Move kubernetes cluster config to dedicated common file [puppet] - https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268)
[13:04:06] <logmsgbot>	 !log taavi@deploy2002 func and taavi: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:04:15] <taavi>	 Func: please test
[13:04:16] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[13:04:34] <Func>	 taavi: I don't have sufficient rights to test, but this is just a cleanup. if you think it is worth a test, could you help with that?
[13:04:54] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40747/console" [puppet] - https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[13:05:12] <icinga-wm_>	 RECOVERY - mailman3-web on lists1001 is OK: PROCS OK: 5 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:06:13] <taavi>	 let's see
[13:07:00] <urbanecm>	 should be "view special:usergrouprights" and checking skipcaptcha stays where it is
[13:07:20] <wikibugs>	 (Abandoned) JMeybohm: Move kubernetes cluster config to dedicated common file [puppet] - https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[13:07:43] <wikibugs>	 (CR) Elukey: [C: +1] admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[13:07:43] <Func>	 urbanecm: oh yeah my bad
[13:08:09] <wikibugs>	 (CR) Klausman: [C: +2] admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[13:08:29] <taavi>	 I manually confirmed https://test.wikipedia.org/wiki/Special:UserRights/Taavi_test_account_20230419_01 and still see skipcaptcha via meta=userinfo, I think we're good, syncing
[13:08:39] <urbanecm>	 at checkuserwiki, skipcaptcha disappears from autoconfirmed, but stays granted to user. the extension doesn't seem to be installed there, so...shouldn't be an issue.
[13:09:22] <moritzm>	 !log installing lldpd security updates
[13:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:17] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]] (duration: 11m 32s)
[13:14:30] <taavi>	 {{done}}, anyone have anything else to deploy?
[13:15:03] <Func>	 taavi: thanks
[13:16:20] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:16:37] <wikibugs>	 (CR) TheDJ: Add separate config for enabling JsonConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/909603 (owner: Zabe)
[13:16:46] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:20:40] <sukhe>	 jouncebot: next
[13:20:40] <jouncebot>	 In 0 hour(s) and 39 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400)
[13:21:19] <wikibugs>	 (CR) Elukey: ml-services: deployment of ores-legacy app in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:21:24] <wikibugs>	 (PS5) Elukey: ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:21:29] <wikibugs>	 (PS7) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[13:22:17] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[13:25:09] <wikibugs>	 (CR) Stevemunene: Configure product analytics airflow instance (4 comments) [puppet] - https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene)
[13:25:16] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[13:25:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw
[13:27:57] <wikibugs>	 (PS57) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[13:28:12] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw
[13:28:26] <sukhe>	 taavi: all done for the current deployment window?
[13:28:36] <taavi>	 sukhe: yes!
[13:28:47] <wikibugs>	 (CR) CI reject: [V: -1] puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[13:29:10] <wikibugs>	 (CR) Majavah: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[13:29:25] <sukhe>	 taavi: thanks
[13:30:15] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:07] <wikibugs>	 (CR) Eevans: cassandra: add de-init to systemd unit file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[13:32:21] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[13:33:33] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:51] <icinga-wm_>	 PROBLEM - Check systemd state on cp5022 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:55] <icinga-wm_>	 PROBLEM - Check systemd state on cp5021 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:17] <icinga-wm_>	 PROBLEM - Check systemd state on cp5020 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:29] <sukhe>	 something has to be up with these, looking
[13:35:45] <icinga-wm_>	 PROBLEM - Check systemd state on cp5017 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:51] <icinga-wm_>	 PROBLEM - Check systemd state on cp5019 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:53] <icinga-wm_>	 PROBLEM - Check systemd state on cp5023 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:27] <icinga-wm_>	 RECOVERY - Check systemd state on cp5021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job varnish-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:40:51] <icinga-wm_>	 RECOVERY - Check systemd state on cp5022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:23] <logmsgbot>	 !log sukhe@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309
[13:41:29] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[13:41:29] <wikibugs>	 (PS1) Eevans: Missing aqs cluster secrets [labs/private] - https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754)
[13:41:39] <logmsgbot>	 !log sukhe@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 (duration: 00m 16s)
[13:41:46] <logmsgbot>	 !log sukhe@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309
[13:42:15] <sukhe>	 BGP alerts in eqiad expected
[13:42:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[13:42:57] <wikibugs>	 (PS2) Btullis: Add the perccli utility to the new Ceph servers [puppet] - https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151)
[13:43:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[13:43:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job varnish-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:44:10] <wikibugs>	 (CR) Btullis: [C: +1] cassandra: add de-init to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[13:44:35] <icinga-wm_>	 RECOVERY - Check systemd state on cp5017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:02] <wikibugs>	 (CR) Ilias Sarantopoulos: Lift Wing: Add new namespace for ores-legacy service (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: Klausman)
[13:45:51] <icinga-wm_>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:03] <icinga-wm_>	 RECOVERY - Check systemd state on cp5019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:47:45] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[13:48:15] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:51:59] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[13:53:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (Jhancock.wm)
[13:56:07] <wikibugs>	 (PS6) Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414)
[13:56:09] <icinga-wm_>	 RECOVERY - Check systemd state on cp5023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:39] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[13:57:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (Jhancock.wm)
[14:00:04] <jouncebot>	 Deploy window LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400)
[14:00:07] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:11] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:48] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:03:29] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:03:33] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS bullseye
[14:04:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye
[14:09:00] <wikibugs>	 (CR) Jbond: puppet::agent: rename the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:10:40] <wikibugs>	 (CR) Eevans: [C: +2] Missing aqs cluster secrets [labs/private] - https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[14:11:25] <wikibugs>	 (CR) Eevans: [V: +2 C: +2] Missing aqs cluster secrets [labs/private] - https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[14:12:16] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[14:12:27] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: lvs1019: update iface names for bullseye (eqiad) [puppet] - https://gerrit.wikimedia.org/r/909325 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:16:25] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: Eevans)
[14:17:44] <wikibugs>	 (PS1) Ssingh: hiera: remove lvs1019's bgp-med override [puppet] - https://gerrit.wikimedia.org/r/910004 (https://phabricator.wikimedia.org/T321309)
[14:19:16] <wikibugs>	 (CR) Muehlenhoff: [C: +1] puppet::agent: rename the enable_puppet7 flag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:19:30] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:19:40] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage
[14:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[14:20:59] <icinga-wm_>	 RECOVERY - Check systemd state on cp5020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Jhancock.wm)
[14:22:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage
[14:26:23] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] amd-gpu-tester: add more ROCm packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/909970 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[14:30:13] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[14:34:29] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:57] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:01] <wikibugs>	 (PS1) Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093)
[14:36:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm)
[14:37:27] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:38:27] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 78 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[14:38:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS bullseye
[14:39:33] <icinga-wm_>	 RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:40:22] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye completed: - lvs1019 (**PASS**)   - Downtimed on Icinga/Aler...
[14:41:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm) according to the change log on dns2003, it was the old authdns. updated the ticket to reflect the new naming
[14:42:23] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:42:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1019
[14:42:53] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1019
[14:43:39] <wikibugs>	 (PS2) EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771)
[14:45:21] <wikibugs>	 (CR) EoghanGaffney: [gitlab/failover] Add check for DNS records update (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[14:45:23] <wikibugs>	 (PS2) Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093)
[14:45:41] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:18] <wikibugs>	 (CR) CI reject: [V: -1] [gitlab/failover] Add check for DNS records update [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[14:47:36] <wikibugs>	 (PS3) EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771)
[14:47:42] <wikibugs>	 (CR) BBlack: [C: +1] varnish: bump size of varnish shared memory log to 160M [puppet] - https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: Ssingh)
[14:49:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Jhancock.wm)
[14:52:59] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:53:20] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: remove lvs1019's bgp-med override [puppet] - https://gerrit.wikimedia.org/r/910004 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:54:48] <sukhe>	 !log restart pybal on lvs1019 to pick up bgp-med change
[14:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:37] <wikibugs>	 (PS1) Ssingh: hiera: lvs1018: update iface names for bullseye (eqiad) [puppet] - https://gerrit.wikimedia.org/r/910027 (https://phabricator.wikimedia.org/T321309)
[14:58:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[15:01:24] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:02:47] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: replace gerrit1001 with gerrit1003 in tests [puppet] - https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[15:03:33] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (phaultfinder)
[15:07:04] <wikibugs>	 (CR) Muehlenhoff: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[15:09:07] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul) p:Triage→Medium
[15:09:48] <wikibugs>	 (PS1) Klausman: hiera: Add ores-legacy user for k8s/deployment_server [puppet] - https://gerrit.wikimedia.org/r/910030
[15:12:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[15:13:11] <wikibugs>	 (CR) Elukey: "Looks good, what does pcc say?" [puppet] - https://gerrit.wikimedia.org/r/910030 (owner: Klausman)
[15:16:26] <wikibugs>	 SRE, Data-Engineering, SRE Observability, Event-Platform Value Stream (sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (lbowmaker)
[15:20:36] <sukhe>	 !log stop pybal on lvs1018 for reimaging: T321309
[15:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:41] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[15:22:58] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40751/console" [puppet] - https://gerrit.wikimedia.org/r/910030 (owner: Klausman)
[15:24:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40752/console" [puppet] - https://gerrit.wikimedia.org/r/910030 (owner: Klausman)
[15:24:47] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:25:25] <wikibugs>	 (CR) Klausman: [V: +1] hiera: Add ores-legacy user for k8s/deployment_server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910030 (owner: Klausman)
[15:25:28] <wikibugs>	 (CR) Dzahn: [C: +2] "approved by langcom - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Fante" [dns] - https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016) (owner: Gerrit maintenance bot)
[15:25:32] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] hiera: Add ores-legacy user for k8s/deployment_server [puppet] - https://gerrit.wikimedia.org/r/910030 (owner: Klausman)
[15:25:45] <icinga-wm_>	 PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:25:47] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:26:26] <herzog>	 Which channel would be best to discuss a possible EventBus issue? Which team handles it?
[15:26:53] <sukhe>	 jouncebot: next
[15:26:53] <jouncebot>	 In 1 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[15:28:33] <mutante>	 herzog: the admin group eventbus-admins is empty but pointed me to T232122 which tells me it's analytics, now "Data Engineering"
[15:28:33] <stashbot>	 T232122: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122
[15:28:37] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[15:28:57] <herzog>	 thanks mutante - do they idle in a particular channel?
[15:29:05] <mutante>	 herzog: https://wikitech.wikimedia.org/wiki/Data_Engineering
[15:29:09] <herzog>	 checking
[15:29:39] <mutante>	 herzog: see the "contact us" tab
[15:29:54] <herzog>	 mutante: heh - "in our public IRC channel, . You can use the keyword a-team to ping us, so we notice your question."
[15:30:06] <herzog>	 they forgot to mention the channel though :)
[15:30:09] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:21] <mutante>	 herzog: yea, I thought the same, but it does appear further down
[15:30:26] <herzog>	 I'll see -analytics
[15:30:31] <mutante>	 yea, that one
[15:30:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:30:44] <mutante>	 I would expect it to redirect you if they renamed it
[15:30:54] <joal>	 that's right herzog - you'll find ottomata over there (here as well but better discussed over there)
[15:32:15] <TheresNoTime>	 Wonderfully https://www.mediawiki.org/wiki/Platform_Engineering_Team/Skill_Matrix also lists EventBus as in their scope :p
[15:33:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for dns200[4-6] - pt1979@cumin2002"
[15:34:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for dns200[4-6] - pt1979@cumin2002"
[15:34:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:34:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:34:55] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:49] <herzog>	 thanks joal - I posted there :)
[15:36:28] <mutante>	 !log DNS - added new project language "fat" (fat.wikipedia.org) - the "Fante" language, a dialect of Akan, spoken by 2.8 million people in Ghana - https://en.wikipedia.org/wiki/Fante_dialect  T335016
[15:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:36] <stashbot>	 T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016
[15:39:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:40:17] <icinga-wm_>	 PROBLEM - IPMI Sensor Status on ms-be2043 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:42:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:44:40] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Assign proper insetup Puppet roles to machines [puppet] - 10https://gerrit.wikimedia.org/r/906023
[15:44:50] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:45:59] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:47:26] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: thanos-fe: proper insetup Puppet roles to machine [puppet] - 10https://gerrit.wikimedia.org/r/906023
[15:47:37] <wikibugs>	 (03PS1) 10Bking: elasticsearch: handle cloudelastic URLs [cookbooks] - 10https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303)
[15:47:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1018
[15:47:49] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1018
[15:48:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS bullseye
[15:48:19] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye
[15:49:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:51:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs1018: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910027 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[15:53:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:53:38] <wikibugs>	 (03PS1) 10Papaul: Add dns200[4-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910039 (https://phabricator.wikimedia.org/T326688)
[15:57:09] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add dns200[4-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910039 (https://phabricator.wikimedia.org/T326688) (owner: 10Papaul)
[15:58:08] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:00:08] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:47] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage
[16:02:51] <sukhe>	 jouncebot: next
[16:02:51] <jouncebot>	 In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[16:03:22] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:08] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:04:26] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:04:29] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:04:53] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:05:12] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:05:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10KFrancis) Hi all, I am confirming the NDA has been signed.  Please proceed with the access request.  Thanks!
[16:05:50] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:06:02] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:06:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage
[16:06:32] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:09:07] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:09:11] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:09:20] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:14:51] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041
[16:16:42] <icinga-wm_>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:17:08] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup2010.codfw.wmnet']
[16:17:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 (owner: 10Jbond)
[16:17:16] <wikibugs>	 (03PS4) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756
[16:19:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway)
[16:19:38] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40753/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway)
[16:21:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) Thanks @KFrancis @AndrewTavis_WMDE Can you provide me with your wmde email address please?
[16:21:59] <wikibugs>	 (03PS7) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix)
[16:23:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS bullseye
[16:23:15] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye completed: - lvs1018 (**PASS**)   - Downtimed on Icinga/Aler...
[16:26:07] <wikibugs>	 (03PS5) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756
[16:26:26] <wikibugs>	 (03CR) 10JHathaway: "pcc output, https://puppet-compiler.wmflabs.org/output/909756/40753/" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway)
[16:27:50] <wikibugs>	 (03CR) 10JHathaway: "Andrew could you take a look at the change to, modules/profile/manifests/wmcs/instance.pp, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway)
[16:30:10] <wikibugs>	 (CR) JHathaway: [C: +1] core_modules: add core modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[16:30:14] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:58] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond)
[16:31:11] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Data-Persistence, Observability-Alerting, observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (Ladsgroup) Open→Resolved Database alerting in general needs improvements and we made a lot of progress since this...
[16:31:13] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation)    Hi with @FNavas-foundation —  Current access —  Superset   - no - "Service access denied due to missing privileges." Turnilo     -  no - "Service...
[16:31:54] <icinga-wm_>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:36] <wikibugs>	 (CR) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: Clément Goubert)
[16:32:47] <Amir1>	 jouncebot: nowandnext
[16:32:47] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400)
[16:32:47] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[16:33:08] <Amir1>	 sukhe: ping me once you're done
[16:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:33:23] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[16:33:26] <sukhe>	 Amir1: I will be done in 27 minutes for sure
[16:33:27] <sukhe>	 more like five
[16:33:32] <sukhe>	 but, is something upcoming?
[16:33:44] <sukhe>	 asking because I have one left but of course I don't want to take the slot of anyone else
[16:33:50] <sukhe>	 so I can do that last one later
[16:33:55] <Amir1>	 not anything major, I just want to deploy an easy non urgent patch
[16:33:59] <sukhe>	 sure
[16:34:10] <sukhe>	 just finishing this one and then I will let you know when I release the lock
[16:34:18] <Amir1>	 finish your work, this patch has been sitting for months
[16:34:26] <Amir1>	 it can wait for a day more if needs to
[16:35:05] <wikibugs>	 (03PS1) 10Ssingh: hiera: remove lvs1018's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910047 (https://phabricator.wikimedia.org/T321309)
[16:35:06] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:30] <sukhe>	 er, are you sure? it will take at least an hour to reimage the last one, it's high-traffic1 so draining takes time :)
[16:35:47] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1018
[16:35:56] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1018
[16:36:26] <sukhe>	 Amir1: I will also take a break so will resume when you are done
[16:36:55] <Amir1>	 it's fine, seriously
[16:36:58] <wikibugs>	 (03CR) 10Dzahn: "oh, already merged:) thanks! I wasn't sure if it matters for the tests when exactly this happens. was just preparing it :)" [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[16:37:03] <Amir1>	 go take a break, do your work
[16:37:12] <sukhe>	 ok, 2kind.gif :)
[16:37:31] <Amir1>	 I've already picked up something else to do, it's not like there is a shortage of fires to put out
[16:37:38] <sukhe>	 oh yeah...
[16:38:30] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[16:38:32] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs1018's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910047 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[16:38:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: add host-based Hiera keys for gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[16:39:01] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[16:39:15] <sukhe>	 !log restart pybal on lvs1018 to remove bgp-med change: T321309
[16:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:20] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[16:41:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[16:41:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "This host is up (as in "can be pinged"), doesn't have gerrit prod role yet but it can be expected to be up and trying to get things done t" [puppet] - 10https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[16:44:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[16:46:25] <wikibugs>	 (03CR) 10Dzahn: "I am going" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[16:50:09] <wikibugs>	 (03CR) 10Dzahn: "I was about to say "I am going ahead with this" :)" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[16:53:12] <wikibugs>	 (03PS1) 10Dzahn: site: add gerrit prod role to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/910049 (https://phabricator.wikimedia.org/T326368)
[16:53:30] <jinxer-wm>	 (Access port speed <= 100Mbps) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[16:57:46] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: replace gerrit1001 with gerrit1003 in tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[16:58:09] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs1017: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910050 (https://phabricator.wikimedia.org/T321309)
[16:58:36] <wikibugs>	 (03CR) 10Dzahn: "gotcha!:) thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[16:59:04] <icinga-wm_>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[17:00:38] <sukhe>	 anyone here to deploy? please let me know
[17:00:44] <sukhe>	 I haven't started the last LVS reimaging so can pause
[17:00:50] <sukhe>	 seems like Amir.1 was the one but he said it's fine
[17:01:04] <sukhe>	 I will wait for 10 mins to be sure
[17:01:09] <sukhe>	 (scap is locked)
[17:01:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED
[17:02:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[17:04:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[17:05:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2005.mgmt.codfw.wmnet with reboot policy FORCED
[17:09:01] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite)
[17:09:39] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite)
[17:12:00] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[17:14:17] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2005.mgmt.codfw.wmnet with reboot policy FORCED
[17:14:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2006.mgmt.codfw.wmnet with reboot policy FORCED
[17:14:56] <sukhe>	 cool, proceeding with the last reimage then
[17:14:59] <sukhe>	 please hold off deploys
[17:15:41] <mutante>	 jouncebot: stall it
[17:15:59] <mutante>	 jouncebot: nowandnext
[17:15:59] <jouncebot>	 For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[17:15:59] <jouncebot>	 In 0 hour(s) and 44 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:16:00] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:18:14] <mutante>	 jouncebot: nowandnext
[17:18:14] <jouncebot>	 For the next 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[17:18:14] <jouncebot>	 In 0 hour(s) and 41 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:18:14] <jouncebot>	 In 0 hour(s) and 41 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:19:06] <icinga-wm_>	 PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:19:27] <mutante>	 jouncebot: nowandnext
[17:19:27] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700)
[17:19:27] <jouncebot>	 In 0 hour(s) and 40 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:19:27] <jouncebot>	 In 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:19:38] <mutante>	 sukhe: I tried to fix it but failed
[17:19:41] <sukhe>	 mutante: thanks
[17:19:45] <sukhe>	 I will keep an eye out
[17:19:47] * mutante edited the Deployment calendar page 
[17:19:54] <mutante>	 but the bot doesn't get it yet
[17:20:22] <mutante>	 on wiki it shows only your thing as "happening now" now
[17:20:31] <sukhe>	 jouncebot: refresh
[17:20:33] <jouncebot>	 I refreshed my knowledge about deployments.
[17:20:37] <mutante>	 jouncebot: nowandnext
[17:20:38] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400)
[17:20:38] <jouncebot>	 In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745)
[17:21:04] <sukhe>	 !log stop pybal in lvs1017 for reimaging
[17:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:15] <mutante>	 you got 39 minutes, stole from the "MW on Kubernetes"  window . heh
[17:21:19] <sukhe>	 ha
[17:21:56] <icinga-wm_>	 PROBLEM - Query Service HTTP Port on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:22:16] <icinga-wm_>	 RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:22:31] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[17:23:35] <icinga-wm_>	 RECOVERY - Query Service HTTP Port on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:25:32] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[17:27:00] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:28:08] <icinga-wm_>	 PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:28:18] <sukhe>	 ^ expected
[17:29:20] <wikibugs>	 (03CR) 10Cmelo: [C: 04-1] "Just to avoid it to get merged before we are really ready to deploy this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo)
[17:30:06] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:20] <icinga-wm_>	 RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:26] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[17:33:30] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[17:33:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10colewhite) Getting Prometheus to scrape a new metrics endpoint is pretty straightforward.  When the exporter is up and running and firewall r...
[17:35:04] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2006.mgmt.codfw.wmnet with reboot policy FORCED
[17:38:40] <wikibugs>	 (03PS1) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088)
[17:42:32] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[17:43:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[17:44:03] <wikibugs>	 (03PS1) 10Cmelo: Set multi organizer feature flag to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088)
[17:45:05] <jouncebot>	 Deploy window LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400)
[17:45:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745)
[17:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[17:46:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[17:46:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye
[17:46:46] <sukhe>	 jouncebot: next
[17:46:46] <jouncebot>	 In 0 hour(s) and 13 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:46:46] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800)
[17:46:49] <sukhe>	 hmm
[17:46:50] <sukhe>	 fun
[17:46:52] <sukhe>	 let's see
[17:47:17] <sukhe>	 draining took a longer time than expected, but was expected
[17:48:15] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[17:48:50] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs1017: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910050 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[17:49:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:49:45] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF) Open→In progress p:Triage→High
[17:49:53] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF)
[17:50:17] <wikibugs>	 (03PS1) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057
[17:50:27] <wikibugs>	 (03PS1) 10Ssingh: hiera: remove lvs1017's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910058 (https://phabricator.wikimedia.org/T321309)
[17:50:34] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2004']
[17:51:48] <wikibugs>	 (03PS2) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057
[17:54:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:55:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2004']
[17:55:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005']
[17:55:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2006']
[17:56:19] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2005']
[17:56:47] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2006']
[17:56:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005']
[17:57:07] <wikibugs>	 (03PS58) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[17:57:09] <wikibugs>	 (03PS1) 10Jbond: git-sync-upstream: add support for g10k and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059
[17:57:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2005']
[17:57:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005']
[17:57:32] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:57:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2005']
[17:57:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2006']
[17:58:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2006']
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745)
[18:00:05] <jouncebot>	 jnuche and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800).
[18:00:05] <jouncebot>	 jnuche and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800).
[18:00:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2004.wikimedia.org with OS bullseye
[18:00:24] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye
[18:00:31] <mutante>	 win 71
[18:00:40] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:01] <mutante>	 is fighting flying ants invading the apartment
[18:01:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[18:01:19] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[18:01:24] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:01:42] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:01:46] <wikibugs>	 (PS59) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[18:02:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T334901 (Papaul) Open→Resolved a:Papaul @jcrespo thanks we will ignore this alert then
[18:03:19] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:03:30] <jinxer-wm>	 (Access port speed <= 100Mbps) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[18:03:58] <wikibugs>	 (PS60) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[18:04:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:04:18] <wikibugs>	 SRE, ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (Papaul) p:Triage→Medium a:Jhancock.wm
[18:04:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[18:04:40] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:07:15] <wikibugs>	 (PS61) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[18:07:22] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:07:54] <wikibugs>	 (CR) JHathaway: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:09:48] <wikibugs>	 (CR) JHathaway: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:15:18] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:16:35] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:19:14] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[18:21:06] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:21:56] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:22:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bullseye
[18:22:47] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye completed: - lvs1017 (**PASS**)   - Downtimed on Icinga/Aler...
[18:23:21] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: remove lvs1017's bgp-med override [puppet] - https://gerrit.wikimedia.org/r/910058 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[18:23:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[18:23:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye
[18:25:32] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:25:40] <sukhe>	 !log restart pybal on lvs1017 to pick up bgp-med change: T321309
[18:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:44] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[18:26:30] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[18:27:58] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:28:26] <logmsgbot>	 !log sukhe@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 (duration: 286m 39s)
[18:28:41] <sukhe>	 ^ Traffic LVS work completed in eqiad. thanks to all for your patience
[18:28:54] <wikibugs>	 (Abandoned) Ssingh: depool eqiad (emergency patch, do not merge) [dns] - https://gerrit.wikimedia.org/r/909985 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[18:29:37] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF) a:Trizek-WMF→sgrabarczuk I did whatever I can that doesn't require checking....
[18:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[18:31:26] <icinga-wm_>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:31:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2004.wikimedia.org with reason: host reimage
[18:33:00] <icinga-wm_>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:33:01] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:35:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2004.wikimedia.org with reason: host reimage
[18:36:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[18:36:33] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:37:35] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:39:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[18:40:55] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:41:28] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) This is now complete and we have upgraded all 176 Traffic hosts to bullseye. WE would like to thank @MoritzMuehlenhoff for helping with the Pybal backport that made the LVS reimaging...
[18:43:04] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Aklapper) > Mediawiki - no That's via https://meta.wikimedia.org/wiki/Special:CentralAuth?target=FNavas-WMF instead and unrelated to this task?
[18:44:58] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:46:24] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) @jbond for the firmware reimaging cookbook that saved us a lot of time by automating the iDRAC and NIC firmwares and deferring having the defer reboot option.
[18:46:30] <wikibugs>	 (PS62) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[18:48:51] <wikibugs>	 (CR) JHathaway: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:50:08] <wikibugs>	 (CR) CI reject: [V: -1] puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[18:50:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:51:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) I just attempted to build frbast1002 and frpig1002 and neither got a dhcp offer. Could we please verify that all the hosts are in the corre...
[18:52:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:52:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2004.wikimedia.org with OS bullseye
[18:52:13] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:52:14] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye completed: - dns2004 (**PASS**)   - Removed from Pup...
[18:53:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bullseye
[18:53:24] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye
[18:54:03] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit1003 to rsync dest hosts when using prod role [puppet] - https://gerrit.wikimedia.org/r/910064 (https://phabricator.wikimedia.org/T326368)
[18:56:15] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:56:41] <wikibugs>	 (PS6) Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490)
[18:56:43] <wikibugs>	 (PS13) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[18:56:45] <wikibugs>	 (PS15) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[18:56:47] <wikibugs>	 (PS12) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490)
[18:56:49] <wikibugs>	 (PS2) Jbond: git-sync-upstream: add support for g10k and alternate base directories [puppet] - https://gerrit.wikimedia.org/r/910059
[18:56:51] <wikibugs>	 (PS63) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[18:58:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) a:Jgreen→Cmjohnson
[18:59:09] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:00:09] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:21] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/910064/40755/" [puppet] - https://gerrit.wikimedia.org/r/910064 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[19:01:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:01:45] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:02:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul) @Jhancock.wm hey if you a chance can you please check network cable on dns2006? link is showing down Thanks ` ge-1/0/8        up    down dns2006
[19:02:16] <wikibugs>	 SRE, DBA, DiscussionTools, Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (aaron) >>! In T334023#8792693, @Ladsgroup wrote...
[19:02:19] <wikibugs>	 (CR) RLazarus: [C: +1] "LGTM for switchback time" [dns] - https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) (owner: Clément Goubert)
[19:02:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:02:55] <wikibugs>	 (CR) RLazarus: [C: +1] "LGTM for switchback time" [puppet] - https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015) (owner: Clément Goubert)
[19:03:17] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:04:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:04:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dns2005.wikimedia.org with OS bullseye
[19:04:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**FAIL**)   - Removed from Pup...
[19:04:28] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye executed with errors: - dns2005 (**FAIL**)   - Remov...
[19:04:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:04:43] <icinga-wm_>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:05:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:05:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[19:05:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye
[19:06:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[19:09:49] <wikibugs>	 (PS1) Dzahn: add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - https://gerrit.wikimedia.org/r/910065
[19:09:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[19:11:08] <wikibugs>	 (CR) Dzahn: "noticed when swiching gerrit1003 from migration role to actual prod role that the role owner changes, so added us for the special Phabrica" [puppet] - https://gerrit.wikimedia.org/r/910065 (owner: Dzahn)
[19:12:48] <wikibugs>	 (CR) Dzahn: "currently this happens when making a change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/910049" [puppet] - https://gerrit.wikimedia.org/r/910065 (owner: Dzahn)
[19:14:08] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "much more readable in https://puppet-compiler.wmflabs.org/output/910049/40756/gerrit1003.wikimedia.org/index.html now after https://gerrit" [puppet] - https://gerrit.wikimedia.org/r/910049 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[19:18:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2005.wikimedia.org with OS bullseye
[19:18:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**PASS**)   - Downtimed on Ici...
[19:20:15] <icinga-wm_>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:21:13] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:21:53] <icinga-wm_>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:25:29] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:28:27] <wikibugs>	 (PS64) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[19:38:04] <wikibugs>	 (CR) Daimona Eaytoy: [C: -1] Add new user right campaignevents-organize-events (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:39:17] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) Correct! @Aklapper
[19:40:09] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:42:10] <wikibugs>	 (PS65) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490)
[19:42:19] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:45:09] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:48:29] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:49:36] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2006.wikimedia.org with OS bullseye
[19:49:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye executed with errors: - dns2006 (**FAIL**)   - Remov...
[19:50:02] <wikibugs>	 (CR) Daimona Eaytoy: Set multi organizer feature flag to true (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:52:19] <wikibugs>	 (CR) Daimona Eaytoy: [C: -1] "Sent the other comments too early... I also wanted to add that this change should be made dependent on I4caf9ab8170a83d8d81922adb10915c6df" [mediawiki-config] - https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: Cmelo)
[19:54:24] <wikibugs>	 (CR) Btullis: [C: +2] Add the perccli utility to the new Ceph servers [puppet] - https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[19:57:09] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[20:00:07] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T2000).
[20:00:07] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:02:17] <wikibugs>	 (CR) Jbond: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[20:07:29] <wikibugs>	 (CR) Zabe: [C: +2] Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) (owner: Zabe)
[20:08:18] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) (owner: Zabe)
[20:09:30] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]]
[20:09:36] <stashbot>	 T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921
[20:10:56] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:13:39] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:14:33] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:14:48] <wikibugs>	 (PS1) Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654)
[20:14:54] <wikibugs>	 (CR) CI reject: [V: -1] auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: Ladsgroup)
[20:15:30] <wikibugs>	 (CR) Ladsgroup: "recheck" [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: Ladsgroup)
[20:15:51] <wikibugs>	 (PS2) Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654)
[20:15:56] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic-Icebox, TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (Umherirrender) There was a recent improvement of thumbnails purge for similiar reasons on T331138.  For me the thum...
[20:16:20] <wikibugs>	 (CR) CI reject: [V: -1] auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: Ladsgroup)
[20:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:16:56] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]] (duration: 07m 26s)
[20:17:03] <stashbot>	 T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921
[20:17:36] <wikibugs>	 (PS3) Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654)
[20:17:42] <wikibugs>	 (CR) CI reject: [V: -1] auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: Ladsgroup)
[20:17:53] <wikibugs>	 (PS4) Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654)
[20:21:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:21:58] <wikibugs>	 SRE, WMF-General-or-Unknown: some file thumbs fail to purge on upload of a new version - https://phabricator.wikimedia.org/T35672 (Umherirrender) Open→Resolved Please do not reopen very old tasks. Please create new tasks for new issues even if they look the same (after some years it should b...
[20:33:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:35:06] <wikibugs>	 (PS3) Eevans: Do not de-init node prior to restart [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754)
[20:56:14] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Ottomata) > Turnilo - no - "Service access denied due to missing privileges.  Turnilo only uses LDAP for authentication (no posix group membership), so this hints that...
[21:07:16] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[21:09:17] <wikibugs>	 ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[21:12:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[21:27:16] <wikibugs>	 ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[21:30:17] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[21:32:35] <wikibugs>	 (CR) Ryan Kemper: [C: +1] elasticsearch: handle cloudelastic URLs [cookbooks] - https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: Bking)
[21:32:39] <wikibugs>	 (CR) Bking: [C: +2] elasticsearch: handle cloudelastic URLs [cookbooks] - https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: Bking)
[21:33:42] <wikibugs>	 (PS1) Cwhite: prometheus: Change zarcillo location [puppet] - https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455)
[21:35:36] <wikibugs>	 (CR) Cwhite: [C: +2] prometheus: Change zarcillo location [puppet] - https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) (owner: Cwhite)
[21:35:43] <wikibugs>	 (Merged) jenkins-bot: elasticsearch: handle cloudelastic URLs [cookbooks] - https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: Bking)
[21:38:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye
[21:40:03] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:48] <wikibugs>	 (PS2) Dzahn: acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368)
[21:46:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[21:46:55] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:47:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[21:47:39] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:48:15] <jinxer-wm>	 (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[21:48:17] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[21:52:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:57:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:00:34] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:06:41] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic-Icebox, TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (JoKalliauer)
[22:10:35] <tzatziki>	 !log removing 5 files for legal compliance
[22:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:12] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic-Icebox, TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (JoKalliauer) @Umherirrender ; You have to compare the PNG not the SVG, because the rendering has several rendering...
[22:14:35] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic-Icebox, TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (JoKalliauer)
[22:14:44] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[22:20:26] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Seems like there are not just 2 users, there are actually 3 different users!    [mwmaint1002:~] $  ldapsearch -x uid=fnavas* | grep uidNumber uidNumber: 43544 ui...
[22:24:56] <icinga-wm_>	 RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:29] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) To remove any ambiguity, let's refer to them by uidNumbers. Starting with the oldest:  43544 | uid = fnavas | sn = Francisco Navas | cn = Francisco Navas | mail...
[22:28:54] <icinga-wm_>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:15] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[22:34:53] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2022.codfw.wmnet with OS bullseye
[22:35:15] <wikibugs>	 (CR) EoghanGaffney: [C: +1] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[22:36:11] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Hey @FNavas-foundation can you do these things:  - set an email address for the fnavas-foundation user (login at wikitech and go to preferences, set an address)...
[22:38:22] <wikibugs>	 (CR) EoghanGaffney: [C: +1] add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - https://gerrit.wikimedia.org/r/910065 (owner: Dzahn)
[22:38:48] <wikibugs>	 (CR) EoghanGaffney: [C: +1] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[22:39:06] <wikibugs>	 (PS2) Andrea Denisse: prometheus: Added support for syncing data between instances [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979)
[22:43:36] <wikibugs>	 (CR) Dzahn: prometheus: Added support for syncing data between instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[22:45:39] <wikibugs>	 (CR) Dzahn: prometheus: Added support for syncing data between instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[22:47:06] <wikibugs>	 (CR) Dzahn: prometheus: Added support for syncing data between instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[22:51:08] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) Thanks @Dzahn   - email added to fnavas-foundation  - let's use 43670 | uid = fnavas-foundation | sn = FNavas-foundation | cn = FNavas-foundation   -...
[22:54:48] <wikibugs>	 (PS1) Cwhite: logstash: webrequest ecs: move backend to label [puppet] - https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816)
[22:55:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Papaul) @Dwisehaupt I checked on the switch: all the interfaces are configured and up. Maybe the servers were not added to DNS since we do not manage Frac...
[22:55:52] <wikibugs>	 (CR) EoghanGaffney: [C: +1] cloudgw: allow VMs to speak to new gerrit server (gerrit1003) [puppet] - https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[22:56:01] <wikibugs>	 (CR) EoghanGaffney: [C: +1] cloudgw: fix IP address for gerrit-replica.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/909794 (owner: Dzahn)
[22:56:51] <wikibugs>	 (CR) CI reject: [V: -1] logstash: webrequest ecs: move backend to label [puppet] - https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: Cwhite)
[23:02:52] <tzatziki>	 !log removing 3 files for legal compliance
[23:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:00] <tzatziki>	 !log removing 1 file for legal compliance
[23:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:46] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) @FNavas-foundation Thank you for the prompt reply. I can confirm that all users have an (the same) email address now, cool!   It's possible that you need both a...
[23:20:40] <wikibugs>	 (CR) Cwhite: "Jenkins says "cp0000.eqiad.wmnet" is a typo but it is intentional." [puppet] - https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: Cwhite)
[23:22:05] <wikibugs>	 (CR) Dzahn: "I was thinking about this earlier when you mentioned the "impossible number", heh. Maybe use 9999 instead? Should be fine to steal that on" [puppet] - https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: Cwhite)
[23:25:05] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) To quickly answer your last question - Abhas Tripathi has access to those supersets (I know for a fact) and @SDelbecque-WMF (who is the other PM on m...
[23:29:07] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/910078
[23:29:09] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/910078 (owner: Zabe)
[23:29:47] <logmsgbot>	 !log zabe@deploy2002 Started scap: [[gerrit:910078]]
[23:29:55] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/910078 (owner: Zabe)
[23:32:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) @Papaul Thanks, I have verified they are in DNS.  I think there may be some crossing in cables or vlans. When I try to build a host, I'm se...
[23:35:52] <icinga-wm_>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:54] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Thanks! This was very valuable information. With that we are able to track it down, luckily.  So when I look at Abhas Tripathi, they have membership in analytics...
[23:36:28] <logmsgbot>	 !log zabe@deploy2002 Finished scap: [[gerrit:910078]] (duration: 06m 40s)
[23:37:00] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) a:FNavas-foundation→None
[23:37:06] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Stalled→Open
[23:37:09] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) p:Medium→High
[23:37:19] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Open→In progress
[23:37:33] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn)
[23:39:23] <wikibugs>	 (CR) Dzahn: "Doesn't seem like this is what was needed. Instead all they needed was "add to wmf LDAP group" and this group isn't even needed. https://p" [puppet] - https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: Ssingh)
[23:40:09] <wikibugs>	 (PS1) Dzahn: Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/910017
[23:40:21] <wikibugs>	 (CR) CI reject: [V: -1] Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/910017 (owner: Dzahn)
[23:42:39] <wikibugs>	 (PS2) Dzahn: Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/910017
[23:43:09] <wikibugs>	 (CR) CI reject: [V: -1] Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/910017 (owner: Dzahn)
[23:44:00] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:44:16] <wikibugs>	 (CR) Dzahn: "ok, so they need to be moved to the ldap_only section, not be completely removed. I will make a new change that converts them" [puppet] - https://gerrit.wikimedia.org/r/910017 (owner: Dzahn)
[23:53:41] <wikibugs>	 (PS2) Dzahn: add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - https://gerrit.wikimedia.org/r/910065
[23:53:43] <wikibugs>	 (PS1) Dzahn: admin: move fnavas to ldap_only admins, remove from a-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482)
[23:54:00] <wikibugs>	 (PS2) Dzahn: admin: move fnavas to ldap_only admins, remove from a-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482)