[00:00:44] (PS1) Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/909763
[00:01:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:01:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:02:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:03:33] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (phaultfinder)
[00:04:46] (PS1) EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771)
[00:05:17] (PS1) Dzahn: phabricator: add parameter for db_datadir in cloud and use default path [puppet] - https://gerrit.wikimedia.org/r/909786
[00:10:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[00:12:42] (PS1) Dzahn: phorge: add parameter for db_datadir and use default path [puppet] - https://gerrit.wikimedia.org/r/909787
[00:13:54] (PS2) Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/909763
[00:13:56] (PS5) Aaron Schulz: Set "templateOverridesBySection" in an etcd.php loop [mediawiki-config] - https://gerrit.wikimedia.org/r/893834
[00:14:09] (PS2) Aaron Schulz: Use pt-heartbeat for all non-static external clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/893835 (https://phabricator.wikimedia.org/T129093)
[00:14:11] (PS1) Dzahn: mariadb::generic_server: change default datadir path [puppet] - https://gerrit.wikimedia.org/r/909788
[00:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to
https://phabricator.wikimedia.org/P47155 and previous config saved to /var/cache/conftool/dbconfig/20230419-001423-ladsgroup.json
[00:15:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[00:18:18] (PS1) Dzahn: acme_chief: add gerrit1003 to hosts allowed for gerrit certs [puppet] - https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368)
[00:19:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[00:22:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:15] (PS1) Dzahn: replace gerrit1001 with gerrit1003 as ping target for blackbox smoke [puppet] - https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368)
[00:24:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[00:28:48] (PS1) Dzahn: logstash: replace gerrit1001 with gerrit1003 in tests [puppet] - https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368)
[00:29:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1072.eqiad.wmnet with OS bullseye
[00:29:27] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1072.eqiad.wmnet with OS bullseye
[00:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T333332)', diff saved to https://phabricator.wikimedia.org/P47156 and previous config saved to
/var/cache/conftool/dbconfig/20230419-002929-ladsgroup.json
[00:29:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[00:29:35] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:29:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[00:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47157 and previous config saved to /var/cache/conftool/dbconfig/20230419-002952-ladsgroup.json
[00:30:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[00:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:01] (PS1) Dzahn: cloudgw: fix IP address for gerrit-replica.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/909794
[00:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47158 and previous config saved to /var/cache/conftool/dbconfig/20230419-003235-ladsgroup.json
[00:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:35:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[00:36:35] (PS1) Dzahn: cloudgw: allow VMs to speak to new gerrit server (gerrit1003) [puppet] - https://gerrit.wikimedia.org/r/909795
(https://phabricator.wikimedia.org/T326368)
[00:37:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1073.eqiad.wmnet with OS bullseye
[00:37:32] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1073.eqiad.wmnet with OS bullseye
[00:38:45] (PS1) Dzahn: gerrit: add host-based Hiera keys for gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368)
[00:39:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768
[00:39:25] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768 (owner: TrainBranchBot)
[00:39:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1074.eqiad.wmnet with OS bullseye
[00:39:45] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1074.eqiad.wmnet with OS bullseye
[00:44:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1072.eqiad.wmnet with reason: host reimage
[00:47:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P47159 and previous config saved to /var/cache/conftool/dbconfig/20230419-004741-ladsgroup.json
[00:47:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1072.eqiad.wmnet with reason: host reimage
[00:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL -
degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:30] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:54:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:57:31] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/909768 (owner: TrainBranchBot)
[01:00:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1073.eqiad.wmnet with reason: host reimage
[01:02:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P47160 and previous config saved to /var/cache/conftool/dbconfig/20230419-010247-ladsgroup.json
[01:04:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1073.eqiad.wmnet with reason: host reimage
[01:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1074.eqiad.wmnet with reason: host reimage
[01:12:31] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[01:13:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1074.eqiad.wmnet with reason: host reimage
[01:15:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:30] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[01:16:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:17:18] SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm) @Papaul dns2003 already exists in netbox. It's in A2.
[01:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T333332)', diff saved to https://phabricator.wikimedia.org/P47161 and previous config saved to /var/cache/conftool/dbconfig/20230419-011754-ladsgroup.json
[01:17:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:18:00] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:18:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:18:39] SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Papaul) @Jhancock.wm go from dns2004 up
[01:18:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:18:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1072.eqiad.wmnet with OS bullseye
[01:18:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[01:18:48] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] -
https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1072.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:19:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[01:20:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[01:21:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[01:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47162 and previous config saved to /var/cache/conftool/dbconfig/20230419-012114-ladsgroup.json
[01:23:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1075.eqiad.wmnet with OS bullseye
[01:23:15] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be1075.eqiad.wmnet with OS bullseye
[01:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47163 and previous config saved to /var/cache/conftool/dbconfig/20230419-012509-ladsgroup.json
[01:25:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:32] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:33:31] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:34:04] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:36:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:37:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1074.eqiad.wmnet with OS bullseye
[01:37:10] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1074.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:38:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1075.eqiad.wmnet with reason: host reimage
[01:40:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47164 and previous config saved to /var/cache/conftool/dbconfig/20230419-014016-ladsgroup.json
[01:42:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1075.eqiad.wmnet with reason: host reimage
[01:44:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:45:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:46:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:46:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1073.eqiad.wmnet with OS bullseye
[01:46:26] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1073.eqiad.wmnet with OS bullseye completed: - ms-be...
[01:48:00] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:50:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47165 and previous config saved to /var/cache/conftool/dbconfig/20230419-015522-ladsgroup.json
[01:59:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:03:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:03:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1075.eqiad.wmnet with OS bullseye
[02:03:58] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (ops-monitoring-bot)
Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be1075.eqiad.wmnet with OS bullseye completed: - ms-be...
[02:04:15] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Papaul)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:41] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Papaul) Open→Resolved This is complete
[02:06:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:59] SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[02:10:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T333332)', diff saved to https://phabricator.wikimedia.org/P47166 and previous config saved to /var/cache/conftool/dbconfig/20230419-021028-ladsgroup.json
[02:10:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[02:10:35] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:10:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[02:10:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47167 and previous config
saved to /var/cache/conftool/dbconfig/20230419-021051-ladsgroup.json
[02:16:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47168 and previous config saved to /var/cache/conftool/dbconfig/20230419-021646-ladsgroup.json
[02:16:52] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:31:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47170 and previous config saved to /var/cache/conftool/dbconfig/20230419-023152-ladsgroup.json
[02:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to
https://phabricator.wikimedia.org/P47171 and previous config saved to /var/cache/conftool/dbconfig/20230419-024658-ladsgroup.json
[02:50:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T333332)', diff saved to https://phabricator.wikimedia.org/P47172 and previous config saved to /var/cache/conftool/dbconfig/20230419-030205-ladsgroup.json
[03:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[03:02:11] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:02:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[03:02:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:02:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:02:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T333332)', diff saved to https://phabricator.wikimedia.org/P47173 and previous config saved to /var/cache/conftool/dbconfig/20230419-030234-ladsgroup.json
[03:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T333332)', diff saved to
https://phabricator.wikimedia.org/P47174 and previous config saved to /var/cache/conftool/dbconfig/20230419-030530-ladsgroup.json
[03:07:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47175 and previous config saved to /var/cache/conftool/dbconfig/20230419-032036-ladsgroup.json
[03:32:19] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47176 and previous config saved to /var/cache/conftool/dbconfig/20230419-033542-ladsgroup.json
[03:40:01] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[03:47:19] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:50:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db2128 (T333332)', diff saved to https://phabricator.wikimedia.org/P47177 and previous config saved to /var/cache/conftool/dbconfig/20230419-035048-ladsgroup.json
[03:50:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:50:55] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:51:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:51:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47178 and previous config saved to /var/cache/conftool/dbconfig/20230419-035112-ladsgroup.json
[03:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47180 and previous config saved to /var/cache/conftool/dbconfig/20230419-035507-ladsgroup.json
[04:08:35] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47181 and previous config saved to /var/cache/conftool/dbconfig/20230419-041013-ladsgroup.json
[04:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47182 and previous config saved to /var/cache/conftool/dbconfig/20230419-042520-ladsgroup.json
[04:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile -
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:40:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47183 and previous config saved to /var/cache/conftool/dbconfig/20230419-044027-ladsgroup.json
[04:40:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:40:31] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:40:33] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[04:40:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[04:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47184 and previous config saved to /var/cache/conftool/dbconfig/20230419-044050-ladsgroup.json
[04:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47185 and previous config saved to /var/cache/conftool/dbconfig/20230419-044445-ladsgroup.json
[04:53:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:16] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:59:17] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:59:52]
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47186 and previous config saved to /var/cache/conftool/dbconfig/20230419-045951-ladsgroup.json
[05:00:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:41] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:02:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:04:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.405 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47187 and previous config saved to /var/cache/conftool/dbconfig/20230419-051457-ladsgroup.json
[05:16:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:14] SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS
host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10Volans) >>! In T334680#8791310, @Dzahn wrote: > But since the compilers are running in cloud VPS and there it's neither of the... [05:17:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [05:20:16] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [05:30:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T333332)', diff saved to https://phabricator.wikimedia.org/P47188 and previous config saved to /var/cache/conftool/dbconfig/20230419-053003-ladsgroup.json [05:30:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:30:10] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [05:30:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:30:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47189 and previous config saved to /var/cache/conftool/dbconfig/20230419-053027-ladsgroup.json [05:31:16] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) >>! In T296832#8791457, @cmooney wrote: > In terms of next steps we obviously need to keep things consistent.... 
[05:33:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47190 and previous config saved to /var/cache/conftool/dbconfig/20230419-053425-ladsgroup.json [05:37:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [05:38:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [05:44:03] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:48:17] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:49:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47191 and previous config saved to /var/cache/conftool/dbconfig/20230419-054931-ladsgroup.json 
[05:50:36] (03CR) 10Volans: "Post-merge comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [05:51:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0600) [06:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47192 and previous config saved to /var/cache/conftool/dbconfig/20230419-060437-ladsgroup.json [06:07:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:30] (03PS1) 10Marostegui: db1212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909851 (https://phabricator.wikimedia.org/T326669) [06:08:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P47193 and previous config saved to /var/cache/conftool/dbconfig/20230419-060803-root.json [06:08:10] (03CR) 10Marostegui: [C: 03+2] db1212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909851 
(https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:08:34] (03CR) 10Volans: "A question and few comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [06:12:26] (03PS1) 10Marostegui: instances.yaml: Add db1219 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909853 (https://phabricator.wikimedia.org/T326669) [06:13:02] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1219 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909853 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1219 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P47194 and previous config saved to /var/cache/conftool/dbconfig/20230419-061414-marostegui.json [06:14:17] (03CR) 10Volans: [C: 03+1] service: add comment for spicerack field addition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909605 (owner: 10Clément Goubert) [06:14:20] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:42] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10ayounsi) Last time I checked it was not possible/recommended to edit a cable, but instead delete/create it. We could also store the cable IDs keyed... 
[06:17:39] (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909855 [06:18:05] (03CR) 10Marostegui: [C: 03+2] db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909855 (owner: 10Marostegui) [06:19:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T333332)', diff saved to https://phabricator.wikimedia.org/P47195 and previous config saved to /var/cache/conftool/dbconfig/20230419-061944-ladsgroup.json [06:19:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [06:19:50] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [06:20:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [06:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47196 and previous config saved to /var/cache/conftool/dbconfig/20230419-062007-ladsgroup.json [06:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6)', diff saved to https://phabricator.wikimedia.org/P47197 and previous config saved to /var/cache/conftool/dbconfig/20230419-062123-root.json [06:21:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:07] (03PS1) 10Marostegui: db1113: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909856 (https://phabricator.wikimedia.org/T326669) [06:23:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P47200 and previous config saved to /var/cache/conftool/dbconfig/20230419-062307-root.json [06:23:42] (03CR) 10Marostegui: [C: 03+2] db1113: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909856 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47201 and previous config saved to /var/cache/conftool/dbconfig/20230419-062401-ladsgroup.json [06:29:53] (03PS1) 10Marostegui: db1213: Add it to s5 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/909941 (https://phabricator.wikimedia.org/T326669) [06:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:23] (03CR) 10Marostegui: [C: 03+2] db1213: Add it to s5 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/909941 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:36:41] (03PS1) 10Marostegui: db1219: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/909942 (https://phabricator.wikimedia.org/T326669) [06:37:13] (03CR) 10Marostegui: [C: 03+2] db1219: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909942 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:37:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P47202 and previous config saved to /var/cache/conftool/dbconfig/20230419-063713-root.json [06:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P47203 and previous config saved to /var/cache/conftool/dbconfig/20230419-063812-root.json [06:38:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [06:38:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [06:39:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47204 and previous config saved to /var/cache/conftool/dbconfig/20230419-063907-ladsgroup.json [06:39:19] (03PS1) 10Marostegui: install_server: Do not reimage db1223 [puppet] - 10https://gerrit.wikimedia.org/r/909943 [06:39:51] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1223 [puppet] - 10https://gerrit.wikimedia.org/r/909943 (owner: 10Marostegui) [06:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 T335011', diff saved to https://phabricator.wikimedia.org/P47205 and previous config saved to /var/cache/conftool/dbconfig/20230419-064122-root.json [06:41:27] T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 [06:42:50] (03PS1) 10Marostegui: db1110: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909944 
(https://phabricator.wikimedia.org/T326683) [06:43:17] (03CR) 10Marostegui: [C: 03+2] db1110: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909944 (https://phabricator.wikimedia.org/T326683) (owner: 10Marostegui) [06:45:50] (03PS10) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [06:46:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:42] (03PS1) 10Marostegui: install_server: Do not reimage db1216 [puppet] - 10https://gerrit.wikimedia.org/r/909945 [06:50:14] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1216 [puppet] - 10https://gerrit.wikimedia.org/r/909945 (owner: 10Marostegui) [06:50:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:04] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for KBach - https://phabricator.wikimedia.org/T334931 (10KBach) Thanks @Clement_Goubert! 
[06:52:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P47206 and previous config saved to /var/cache/conftool/dbconfig/20230419-065218-root.json [06:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P47207 and previous config saved to /var/cache/conftool/dbconfig/20230419-065317-root.json [06:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47208 and previous config saved to /var/cache/conftool/dbconfig/20230419-065413-ladsgroup.json [06:55:02] 10ops-eqiad, 10decommission-hardware: decommission db1116 - https://phabricator.wikimedia.org/T334926 (10jcrespo) a:03Jclark-ctr [06:56:30] (03PS3) 10KartikMistry: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) [06:59:08] 10ops-eqiad, 10decommission-hardware: decommission db1102 - https://phabricator.wikimedia.org/T334927 (10jcrespo) a:03Jclark-ctr [07:00:04] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0700). Please do the needful. [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:12] I'm here! [07:01:37] I'll go ahead with deployment for my patch. 
[07:01:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) (owner: 10KartikMistry) [07:02:36] (03Merged) 10jenkins-bot: Enable Content/Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909607 (https://phabricator.wikimedia.org/T327102) (owner: 10KartikMistry) [07:03:25] (03CR) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [07:03:33] !log kartik@deploy2002 Started scap: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]] [07:03:39] T327102: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T327102 [07:04:53] (03PS9) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 [07:05:06] !log kartik@deploy2002 kartik: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:05:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P47209 and previous config saved to /var/cache/conftool/dbconfig/20230419-070723-root.json [07:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P47210 and previous config saved to /var/cache/conftool/dbconfig/20230419-070822-root.json [07:09:20] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T333332)', diff saved to https://phabricator.wikimedia.org/P47211 and previous config saved to /var/cache/conftool/dbconfig/20230419-070920-ladsgroup.json [07:09:25] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [07:10:20] !log push pfw policies - T334983 [07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:40] (03CR) 10EoghanGaffney: [C: 03+1] "Looks good. Optionally would consider moving this to `/opt/bin` and `/opt/etc`, but we can do that later" [puppet] - 10https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [07:13:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10ayounsi) [07:13:07] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:909607|Enable Content/Section translation on 6 Wikipedias (T327102)]] (duration: 09m 33s) [07:13:12] T327102: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T327102 [07:14:32] I'm done with my config deployment. And, there are no more patches in the backport/config window. [07:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:39] !log update TLS cert on pfw - T334676 [07:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/909749 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [07:18:01] (03PS1) 10Marostegui: site.pp: Add db1213 to s5 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/909947 (https://phabricator.wikimedia.org/T326683) [07:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P47212 and previous config saved to /var/cache/conftool/dbconfig/20230419-072228-root.json [07:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P47213 and previous config saved to /var/cache/conftool/dbconfig/20230419-072326-root.json [07:25:53] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) Slightly relevant - https://wikitech.wikimedia.org/wiki/Juniper_TLS_certificate_install [07:26:55] (03CR) 10Muehlenhoff: SSH Keymanagement, allow user to manage ssh keys. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [07:31:28] (03PS1) 10Marostegui: install_server: Do not reimage db1224 [puppet] - 10https://gerrit.wikimedia.org/r/909949 [07:31:59] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1224 [puppet] - 10https://gerrit.wikimedia.org/r/909949 (owner: 10Marostegui) [07:37:24] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P47214 and previous config saved to /var/cache/conftool/dbconfig/20230419-073732-root.json [07:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P47215 and previous config saved to /var/cache/conftool/dbconfig/20230419-073831-root.json [07:39:21] (03PS1) 10Muehlenhoff: Remove access for ktsouroupidou [puppet] - 10https://gerrit.wikimedia.org/r/909950 [07:41:38] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for ktsouroupidou [puppet] - 10https://gerrit.wikimedia.org/r/909950 (owner: 10Muehlenhoff) [07:50:12] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db1213 to s5 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/909947 (https://phabricator.wikimedia.org/T326683) (owner: 10Marostegui) [07:50:47] (03PS1) 10Marostegui: Revert "db1113: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/909871 [07:51:13] (03PS1) 10Stevemunene: Add Product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909951 (https://phabricator.wikimedia.org/T333000) [07:51:39] (03PS1) 10Marostegui: 
site.pp: Remove insetup from db1213 [puppet] - 10https://gerrit.wikimedia.org/r/909952 (https://phabricator.wikimedia.org/T326669) [07:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47216 and previous config saved to /var/cache/conftool/dbconfig/20230419-075203-root.json [07:52:37] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10WMDE-leszek) [07:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 6%: Pooling', diff saved to https://phabricator.wikimedia.org/P47217 and previous config saved to /var/cache/conftool/dbconfig/20230419-075237-root.json [07:52:47] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1213 [puppet] - 10https://gerrit.wikimedia.org/r/909952 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:53:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P47218 and previous config saved to /var/cache/conftool/dbconfig/20230419-075336-root.json [07:53:58] (03PS1) 10Marostegui: site.pp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/909953 [07:54:31] (03CR) 10Marostegui: [C: 03+2] site.pp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/909953 (owner: 10Marostegui) [07:55:52] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10WMDE-leszek) Hello, regarding the wikibase/termbox service -- we'd be fine with a move to gitlab but have a question for ourselves to find answer fo... 
[07:57:26] (03PS1) 10Elukey: services: add kafka-logging200[4,5] IPs to eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) [07:58:10] (03CR) 10Clément Goubert: service: add comment for spicerack field addition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909605 (owner: 10Clément Goubert) [07:58:26] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:00:06] jnuche and ^demon: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T0800). [08:00:18] (03CR) 10Marostegui: [C: 03+2] Revert "db1113: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/909871 (owner: 10Marostegui) [08:00:25] good morning, I'll be deploying in 10m [08:00:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47219 and previous config saved to /var/cache/conftool/dbconfig/20230419-080030-root.json [08:03:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [08:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P47220 and previous config saved to /var/cache/conftool/dbconfig/20230419-080708-root.json [08:07:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 7%: Pooling', diff saved to 
https://phabricator.wikimedia.org/P47221 and previous config saved to /var/cache/conftool/dbconfig/20230419-080742-root.json [08:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P47222 and previous config saved to /var/cache/conftool/dbconfig/20230419-080841-root.json [08:10:18] (03CR) 10Lucas Werkmeister (WMDE): Add second tracking category for Graph (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909700 (https://phabricator.wikimedia.org/T334895) (owner: 10Lucas Werkmeister (WMDE)) [08:10:50] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Marostegui) [08:11:02] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211) [08:11:04] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [08:12:02] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909958 (https://phabricator.wikimedia.org/T330211) (owner: 10TrainBranchBot) [08:15:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P47223 and previous config saved to /var/cache/conftool/dbconfig/20230419-081535-root.json [08:15:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:30] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.5 refs T330211 [08:18:35] T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 
[08:22:02] (03PS2) 10Elukey: services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) [08:22:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P47224 and previous config saved to /var/cache/conftool/dbconfig/20230419-082213-root.json [08:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 8%: Pooling', diff saved to https://phabricator.wikimedia.org/P47225 and previous config saved to /var/cache/conftool/dbconfig/20230419-082247-root.json [08:23:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:23:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P47226 and previous config saved to /var/cache/conftool/dbconfig/20230419-082345-root.json [08:23:54] 10SRE, 10SRE-Access-Requests, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10ItamarWMDE) Thank you @Clement_Goubert! 
[08:24:13] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.5 refs T330211 (duration: 05m 43s) [08:24:18] T330211: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 [08:27:28] (03PS1) 10Marostegui: Revert "db1110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/909872 [08:27:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47227 and previous config saved to /var/cache/conftool/dbconfig/20230419-082738-root.json [08:27:56] (03CR) 10Marostegui: [C: 03+2] Revert "db1110: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/909872 (owner: 10Marostegui) [08:30:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P47228 and previous config saved to /var/cache/conftool/dbconfig/20230419-083040-root.json [08:30:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:34:14] (03PS1) 10Slyngshede: Enable emailing for signup and password reset [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 [08:34:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:01:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Test [08:35:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:01:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Test [08:35:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:52] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cabel labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) >>! In T334987#8792130, @ayounsi wrote: > We could also store the cable IDs keyed by remote interface ID and re-use that when re-creating t... [08:37:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P47229 and previous config saved to /var/cache/conftool/dbconfig/20230419-083717-root.json [08:37:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 9%: Pooling', diff saved to https://phabricator.wikimedia.org/P47230 and previous config saved to /var/cache/conftool/dbconfig/20230419-083753-root.json [08:39:34] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kevinbazira) [08:40:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [08:41:43] (03Abandoned) 10Muehlenhoff: httpd: Let Puppet pick the init provider [puppet] - 10https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [08:42:33] (03PS1) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 [08:42:40] (03CR) 10CI reject: [V: 04-1] Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (owner: 10Clément Goubert) [08:42:43] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney) [08:42:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: Repooling', diff saved 
to https://phabricator.wikimedia.org/P47231 and previous config saved to /var/cache/conftool/dbconfig/20230419-084243-root.json [08:43:02] (03PS1) 10Stevemunene: Configure product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) [08:43:03] 10SRE, 10Infrastructure-Foundations: Netbox PuppetDB import script deletes cable labels when interfaces are renamed - https://phabricator.wikimedia.org/T334987 (10cmooney) [08:43:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [08:45:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:45:29] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P47232 and previous config saved to /var/cache/conftool/dbconfig/20230419-084545-root.json [08:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:55] (03PS2) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 [08:47:01] (03CR) 10Btullis: "Looks good. 
The PCC failure for the aqs node looks like it's just a missing dummy secret, so it's a +1 from me in principle, as soon as th" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [08:47:51] (03CR) 10Clément Goubert: [C: 04-2] "Holding for switchback" [dns] - 10https://gerrit.wikimedia.org/r/909873 (owner: 10Clément Goubert) [08:48:41] (03PS1) 10Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 [08:49:50] (03PS2) 10Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 [08:50:06] (03CR) 10Clément Goubert: [C: 04-2] "Hold for switchback" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (owner: 10Clément Goubert) [08:50:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:33] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (10Clement_Goubert) [08:50:41] (03PS3) 10Clément Goubert: Revert "wmnet: Switch deployment CNAMEs to codfw" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) [08:50:50] (03PS3) 10Clément Goubert: Revert "Switch deployment server to deploy2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015) [08:51:25] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [08:52:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47233 and previous config saved 
to /var/cache/conftool/dbconfig/20230419-085222-root.json [08:52:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:52:53] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Do... [08:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P47234 and previous config saved to /var/cache/conftool/dbconfig/20230419-085257-root.json [08:56:17] (03CR) 10JMeybohm: [C: 03+1] services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: 10Elukey) [08:57:13] (03CR) 10Elukey: [C: 03+2] services: modify Kafka logging IPs in eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: 10Elukey) [08:57:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47235 and previous config saved to /var/cache/conftool/dbconfig/20230419-085748-root.json [08:59:30] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:59:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:59:38] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:59:52] !log elukey@deploy2002 helmfile [eqiad] START 
helmfile.d/services/eventgate-logging-external: sync [08:59:55] (03CR) 10Clément Goubert: P:lists:monitoring: Raise process count for uwsgi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [09:00:11] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [09:00:32] (03PS11) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [09:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47236 and previous config saved to /var/cache/conftool/dbconfig/20230419-090050-root.json [09:00:56] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [09:01:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:13] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [09:03:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:03:49] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:05:07] (03CR) 10MVernon: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/909673 (https://phabricator.wikimedia.org/T333550) (owner: 10Clément Goubert) [09:05:10] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:05:14] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:07:05] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: 
apply [09:07:09] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:07:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47237 and previous config saved to /var/cache/conftool/dbconfig/20230419-090727-root.json [09:07:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:07:36] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:08:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P47238 and previous config saved to /var/cache/conftool/dbconfig/20230419-090802-root.json [09:08:45] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10AndrewTavis_WMDE) I just signed the NDA :) @Aklapper I'll connect with @karapayneWMDE about changing the templates. Thanks for bringing this to our attention! 
[09:12:49] (03CR) 10Clément Goubert: [C: 03+2] admin: Add atieno to to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/909673 (https://phabricator.wikimedia.org/T333550) (owner: 10Clément Goubert) [09:12:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47239 and previous config saved to /var/cache/conftool/dbconfig/20230419-091252-root.json [09:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47240 and previous config saved to /var/cache/conftool/dbconfig/20230419-091554-root.json [09:17:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [09:19:30] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07), 10Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) @Atieno Your access request has been merged and should be operational within the next half hour, you have... 
[09:19:34] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07), 10Patch-For-Review: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) 05In progress→03Resolved [09:20:30] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [09:20:36] (03PS9) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [09:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47241 and previous config saved to /var/cache/conftool/dbconfig/20230419-092232-root.json [09:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P47242 and previous config saved to /var/cache/conftool/dbconfig/20230419-092307-root.json [09:26:52] (03PS1) 10Marostegui: db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909963 (https://phabricator.wikimedia.org/T335017) [09:27:18] (03CR) 10Marostegui: [C: 03+2] db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909963 (https://phabricator.wikimedia.org/T335017) (owner: 10Marostegui) [09:27:38] (03CR) 10Func: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/909875 (owner: 10Func) [09:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47243 and previous config saved to /var/cache/conftool/dbconfig/20230419-092757-root.json [09:29:18] (03PS1) 10Marostegui: section: Update zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909964 (https://phabricator.wikimedia.org/T334455) [09:29:37] (03CR) 10Func: cleanup: Remove duplicate permission config of confirmed users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909875 (owner: 10Func) [09:30:03] (03CR) 10Marostegui: [C: 03+2] section: Update zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909964 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:31:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47244 and previous config saved to /var/cache/conftool/dbconfig/20230419-093059-root.json [09:31:39] (03PS1) 10Marostegui: host-to-instance: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909965 (https://phabricator.wikimedia.org/T334455) [09:32:11] (03CR) 10Marostegui: [C: 03+2] host-to-instance: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909965 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:32:15] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:32:40] (03Merged) 10jenkins-bot: host-to-instance: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909965 
(https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:33:29] (03PS1) 10Gerrit maintenance bot: Add fat to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016) [09:35:30] (03CR) 10Zabe: [C: 03+1] Add fat to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016) (owner: 10Gerrit maintenance bot) [09:37:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:37:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47245 and previous config saved to /var/cache/conftool/dbconfig/20230419-093737-root.json [09:38:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P47246 and previous config saved to /var/cache/conftool/dbconfig/20230419-093812-root.json [09:38:29] (03PS1) 10Marostegui: check-master-heartbeat.sh: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) [09:38:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:39:21] (03CR) 10Marostegui: [C: 03+2] check-master-heartbeat.sh: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:39:30] (03CR) 10CI reject: [V: 04-1] check-master-heartbeat.sh: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:40:27] (03PS10) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [09:41:57] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/909966 
(https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:42:44] (03CR) 10Marostegui: [C: 03+2] check-master-heartbeat.sh: Change zarcillo location [software] - 10https://gerrit.wikimedia.org/r/909966 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:43:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47247 and previous config saved to /var/cache/conftool/dbconfig/20230419-094302-root.json [09:44:28] (03PS1) 10Marostegui: common.yaml: Add db1215 to mysql clients [puppet] - 10https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455) [09:46:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47248 and previous config saved to /var/cache/conftool/dbconfig/20230419-094604-root.json [09:46:14] (03CR) 10Ladsgroup: [C: 03+1] common.yaml: Add db1215 to mysql clients [puppet] - 10https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:46:43] (03CR) 10Marostegui: [C: 03+2] common.yaml: Add db1215 to mysql clients [puppet] - 10https://gerrit.wikimedia.org/r/909967 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [09:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:48:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:48:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47249 and previous config saved to /var/cache/conftool/dbconfig/20230419-094836-ladsgroup.json [09:48:42] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47250 and previous config saved to /var/cache/conftool/dbconfig/20230419-095044-ladsgroup.json [09:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47252 and previous config saved to /var/cache/conftool/dbconfig/20230419-095241-root.json [09:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P47253 and previous config saved to /var/cache/conftool/dbconfig/20230419-095316-root.json [09:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47254 and previous config saved to /var/cache/conftool/dbconfig/20230419-095807-root.json [09:58:35] 10ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (10MoritzMuehlenhoff) [09:59:32] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:32] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1000) [10:00:40] (03PS1) 10Elukey: amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) [10:00:42] (03PS1) 10Elukey: role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) [10:01:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47255 and previous config saved to /var/cache/conftool/dbconfig/20230419-100109-root.json [10:01:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:01:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance [10:02:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance [10:03:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40746/console" [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:03:30] !log ladsgroup@cumin1001 START 
- Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:03:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:03:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:04:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance [10:04:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance [10:05:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [10:05:35] (03PS2) 10Elukey: amd_gpu: add udev rules to bypass the 'render' group [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) [10:05:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [10:05:37] (03PS2) 10Elukey: role:dse_k8s::worker: set allow_gpu_broader_access [puppet] - 10https://gerrit.wikimedia.org/r/909969 (https://phabricator.wikimedia.org/T333009) [10:05:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47256 and previous config saved to /var/cache/conftool/dbconfig/20230419-100550-ladsgroup.json [10:07:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:07:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47257 and previous config saved to 
/var/cache/conftool/dbconfig/20230419-100746-root.json [10:07:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:08:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [10:08:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [10:09:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:09:15] (03PS1) 10Elukey: amd-gpu-tester: add more ROCm packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909970 (https://phabricator.wikimedia.org/T333009) [10:09:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:10:50] (03PS1) 10Marostegui: prometheus.yaml: Change zarcillo location [puppet] - 10https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455) [10:11:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:50] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:13:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [10:15:38] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1741 days) https://wikitech.wikimedia.org/wiki/Logs [10:16:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47258 and previous config saved to /var/cache/conftool/dbconfig/20230419-101614-root.json [10:16:33] (03CR) 10Hnowlan: [C: 03+1] cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [10:17:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:17:12] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10jbond) 05Open→03In progress p:05Triage→03Medium a:03jbond [10:17:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:18:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:15] (03PS1) 10Jbond: pcc_facts_processor: skip invalid names [puppet] - 10https://gerrit.wikimedia.org/r/909973 (https://phabricator.wikimedia.org/T334680) [10:20:14] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - 
failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [10:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P47259 and previous config saved to /var/cache/conftool/dbconfig/20230419-102057-ladsgroup.json [10:21:46] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1257 days) https://wikitech.wikimedia.org/wiki/Logs [10:22:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:23:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:47] (03CR) 10Jbond: "Thanks for this, there are a few places where this pattern has been reinvented. 
CR lgtm just a couple of things to check with wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [10:27:02] (03CR) 10Jbond: "https://puppet-compiler.wmflabs.org/output/909756/40745/pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud/change.pcc-worker1001.puppet-di" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [10:28:29] (03PS1) 10Klausman: hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - 10https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) [10:28:38] (03CR) 10Muehlenhoff: "Looks good, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 (owner: 10Slyngshede) [10:29:08] (03CR) 10Jcrespo: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [10:29:10] (03CR) 10Jbond: [C: 03+2] pcc_facts_processor: skip invalid names [puppet] - 10https://gerrit.wikimedia.org/r/909973 (https://phabricator.wikimedia.org/T334680) (owner: 10Jbond) [10:29:14] (03PS2) 10Klausman: hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - 10https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) [10:29:18] (03CR) 10Marostegui: [C: 03+2] prometheus.yaml: Change zarcillo location [puppet] - 10https://gerrit.wikimedia.org/r/909972 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [10:29:41] marostegui: happy for me to merge yours? [10:29:46] jbond: go for it! 
[10:29:51] thanks [10:29:59] np, done [10:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:31:10] (03CR) 10Elukey: [C: 03+1] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - 10https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [10:32:50] (03CR) 10Klausman: [C: 03+2] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - 10https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [10:33:10] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10jbond) Thanks for the debugging, the issues was because the facts where not updating, which happened because there was/is an i... 
[10:33:36] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: Add faux secrets for ores-legacy service on Lift Wing [labs/private] - 10https://gerrit.wikimedia.org/r/909974 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [10:34:16] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10jbond) 05In progress→03Resolved going to tentatively close this but please reopen if you still see the issue [10:34:39] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.e4 in eqiad [10:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P47260 and previous config saved to /var/cache/conftool/dbconfig/20230419-103603-ladsgroup.json [10:36:09] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:37:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.e4 in eqiad [10:39:58] (03CR) 10Muehlenhoff: "This looks fine per se, but note that setting the raid fact to "perccli" currently also enables RAID checks (see raid::perccli), are those" [puppet] - 10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:40:01] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.1a in eqiad [10:41:22] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) In this specific case, it wasn't slo... 
[10:42:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-en-local-public.1a in eqiad [10:43:35] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.1a in codfw [10:45:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:45:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:46:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [10:46:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-en-local-public.1a in codfw [10:46:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [10:47:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:47:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:48:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [10:48:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [10:48:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1126.eqiad.wmnet with reason: Maintenance [10:49:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1126.eqiad.wmnet with reason: Maintenance [10:49:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance [10:49:29] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance [11:00:20] (03CR) 10Jbond: core_modules: add core modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908326 (owner: 10Jbond) [11:00:23] (03PS4) 10Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) [11:00:25] (03PS11) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [11:00:27] (03PS13) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [11:00:29] (03PS10) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) [11:00:31] (03PS54) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 [11:00:33] (03CR) 10Jbond: puppet::agent: rename the enable_puppet7 flag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:00:49] (03CR) 10Jbond: wmflib: updat ipresolv to work with puppet7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [11:02:49] jouncebot: nowandnext [11:02:49] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [11:02:49] In 1 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300) [11:03:30] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:47] jouncebot: refresh [11:03:48] I refreshed my knowledge about 
deployments. [11:03:52] jouncebot: nowandnext [11:03:52] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [11:03:52] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300) [11:04:08] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:24] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:28] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:42] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10ssingh) Thanks to everyone who worked on debugging/resolving this! I will try it again for the reimages in eqiad to see how it... 
[11:05:44] (03CR) 10Muehlenhoff: puppet::agent: rename the enable_puppet7 flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:05:55] (03PS1) 10Ayounsi: mgmt: allow prometheus [homer/public] - 10https://gerrit.wikimedia.org/r/909980 (https://phabricator.wikimedia.org/T335027) [11:07:16] 10SRE, 10Infrastructure-Foundations, 10netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) {P47077} [11:08:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10ayounsi) [11:09:27] 10SRE, 10Infrastructure-Foundations, 10netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [11:09:40] 10SRE, 10Infrastructure-Foundations, 10netops: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [11:09:46] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) [11:18:05] (03CR) 10Muehlenhoff: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [11:18:41] !log hnowlan@puppetmaster1001 conftool action : set/weight=7; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:19:27] (03PS1) 10Ssingh: depool eqiad (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/909985 (https://phabricator.wikimedia.org/T321309) [11:34:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but we need to merge" [puppet] - 10https://gerrit.wikimedia.org/r/909658 (owner: 10Slyngshede) [11:36:45] (03PS55) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:36:57] 10ops-eqiad: Move two GPUs from Hadoop 
to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) [11:37:24] 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) [11:38:18] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:47:54] (03CR) 10Jbond: puppetserver: add puppetserver module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:50:17] (03CR) 10Btullis: amd_gpu: add udev rules to bypass the 'render' group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:50:55] (03CR) 10Btullis: [C: 03+1] Add Product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909951 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [11:56:32] (03CR) 10Elukey: amd_gpu: add udev rules to bypass the 'render' group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:58:56] (03CR) 10Btullis: Configure product analytics airflow instance (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [11:59:28] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:04:31] (03CR) 10Ottomata: "OH! 
I had suspected T326419, but ruled it out because at least one live broker was still in the list of bootstrap servers, and the rest o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/909954 (https://phabricator.wikimedia.org/T334510) (owner: 10Elukey) [12:05:50] (03PS4) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [12:10:46] (03PS5) 10Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) [12:10:48] (03PS12) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [12:10:50] (03PS14) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [12:10:52] (03PS11) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) [12:10:54] (03PS56) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [12:12:21] (03PS1) 10Zabe: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) [12:12:30] (03PS2) 10Zabe: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) [12:13:33] (03PS4) 10Hokwelum: make dumpsdata1006 the xmlfallback host [puppet] - 10https://gerrit.wikimedia.org/r/908995 (https://phabricator.wikimedia.org/T325232) [12:13:35] (03PS1) 10Hokwelum: Add orb1.de1.scatter.red to rsync config [puppet] - 10https://gerrit.wikimedia.org/r/909990 [12:19:38] (03CR) 10David Caro: [C: 03+2] build: add helper scripts [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [12:20:29] (03Merged) 10jenkins-bot: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [12:21:08] (03PS2) 10Hokwelum: Add orb1.de1.scatter.red to rsync config for dumps [puppet] - 10https://gerrit.wikimedia.org/r/909990 [12:23:12] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:15] (03PS4) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) [12:24:32] (03PS1) 10David Caro: build_deb: use wikimedia images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/909991 [12:25:26] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:27:40] (03CR) 10CI reject: [V: 04-1] push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) (owner: 10Clément Goubert) [12:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10phaultfinder) [12:30:23] (03PS1) 10Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) [12:31:20] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:31:24] (03CR) 10ArielGlenn: [C: 03+2] Add orb1.de1.scatter.red to rsync config for dumps [puppet] - 10https://gerrit.wikimedia.org/r/909990 (owner: 10Hokwelum) [12:31:55] (03PS3) 10ArielGlenn: Add orb1.de1.scatter.red to rsync config for dumps [puppet] - 10https://gerrit.wikimedia.org/r/909990 (owner: 10Hokwelum) [12:32:28] (03PS2) 10Slyngshede: Enable emailing for signup and password reset [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 [12:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:34:06] (03CR) 10CI reject: [V: 04-1] Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [12:36:31] (03PS5) 10Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061) [12:39:46] (03PS1) 10Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) [12:40:13] (03Abandoned) 10Klausman: Lift Wing: Add new namespace for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [12:41:56] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:44:15] (03CR) 10Slyngshede: "I needed to move a few things around to have the templates be configurable." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/909959 (owner: 10Slyngshede) [12:53:46] (03CR) 10Elukey: Lift Wing: Add new namespace for ores-legacy service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [12:54:29] (03CR) 10Elukey: Lift Wing: Add new namespace for ores-legacy service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [12:55:37] (03PS2) 10Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) [12:55:54] (03CR) 10Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [12:58:06] (03CR) 10Elukey: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [13:00:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (let's still open the task to figure out from which address to send the mails in production, though?)" [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 (owner: 10Slyngshede) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1300). nyaa~ [13:00:05] Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:16] o/ [13:00:18] (03PS3) 10Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) [13:00:46] (03CR) 10Klausman: admin_ng: Add new namespace for the ores-legacy service on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [13:00:54] o/ I can deploy [13:00:56] i can deploy today [13:01:00] well, taavi was quicker :) [13:01:01] I can’t, so go ahead ^^ [13:01:18] (excellent jouncebot message though) [13:01:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909875 (owner: 10Func) [13:01:46] (03CR) 10Ladsgroup: P:lists:monitoring: Raise process count for uwsgi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [13:01:53] (03PS2) 10Ladsgroup: P:lists:monitoring: Raise process count for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [13:01:55] Lucas_WMDE: I hang out in this channel primarily for jouncebot's messages [13:01:58] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] P:lists:monitoring: Raise process count for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [13:02:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:02:18] ooh we had that one twice in a row I think :3 [13:02:19] (03Merged) 10jenkins-bot: cleanup: Remove duplicate permission config of confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909875 (owner: 10Func) [13:02:45] !log taavi@deploy2002 Started scap: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]] [13:03:32] (03PS1) 10JMeybohm: Move kubernetes cluster config to dedicated common file [puppet] - 
10https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268) [13:04:06] !log taavi@deploy2002 func and taavi: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:04:15] Func: please test [13:04:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:04:34] taavi: I don't have sufficient rights to test, but this is just a cleanup. if you think it is worth a test, could you help with that? [13:04:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40747/console" [puppet] - 10https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:05:12] RECOVERY - mailman3-web on lists1001 is OK: PROCS OK: 5 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:13] let's see [13:07:00] should be "view special:usergrouprights" and checking skipcatcha stays where it is [13:07:20] (03Abandoned) 10JMeybohm: Move kubernetes cluster config to dedicated common file [puppet] - 10https://gerrit.wikimedia.org/r/909994 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:07:43] (03CR) 10Elukey: [C: 03+1] admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [13:07:43] urbanecm: oh yeah my bad [13:08:09] (03CR) 10Klausman: [C: 03+2] admin_ng: Add new namespace for the ores-legacy service on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/909993 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [13:08:29] I manually confirmed 
https://test.wikipedia.org/wiki/Special:UserRights/Taavi_test_account_20230419_01 and still see skipcaptcha via meta=userinfo, I think we're good, syncing [13:08:39] at checkuserwiki, skipcaptcha disappears from autoconfirmed, but stays granted to user. the extension doesn't seem to be installed there, so...shouldn't be an issue. [13:09:22] !log installing lldpd security updates [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:17] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:909875|cleanup: Remove duplicate permission config of confirmed users]] (duration: 11m 32s) [13:14:30] {{done}}, anyone have anything else to deploy? [13:15:03] taavi: thanks [13:16:20] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:16:37] (03CR) 10TheDJ: Add separate config for enabling JsonConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909603 (owner: 10Zabe) [13:16:46] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:20:40] jouncebot: next [13:20:40] In 0 hour(s) and 39 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400) [13:21:19] (03CR) 10Elukey: ml-services: deployment of ores-legacy app in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [13:21:24] (03PS5) 10Elukey: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [13:21:29] (03PS7) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [13:22:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [13:25:09] (03CR) 10Stevemunene: Configure product analytics airflow instance (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [13:25:16] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [13:25:42] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.a8 in codfw [13:27:57] (03PS57) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [13:28:12] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects (exit_code=99) from container wikipedia-en-local-public.a8 in codfw [13:28:26] taavi: all done for the current deployment window? [13:28:36] sukhe: yes! 
[13:28:47] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:29:10] (03CR) 10Majavah: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:29:25] taavi: thanks [13:30:15] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:07] (03CR) 10Eevans: cassandra: add de-init to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [13:32:21] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:33:33] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:51] PROBLEM - Check systemd state on cp5022 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:55] PROBLEM - Check systemd state on cp5021 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:17] PROBLEM - Check systemd state on cp5020 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:29] something has to be up with these, looking [13:35:45] PROBLEM - 
Check systemd state on cp5017 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:51] PROBLEM - Check systemd state on cp5019 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:53] PROBLEM - Check systemd state on cp5023 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:27] RECOVERY - Check systemd state on cp5021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:33] (JobUnavailable) firing: Reduced availability for job varnish-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:40:51] RECOVERY - Check systemd state on cp5022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:23] !log sukhe@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 [13:41:29] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:41:29] (03PS1) 10Eevans: Missing aqs cluster secrets [labs/private] - 10https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754) [13:41:39] !log sukhe@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 (duration: 00m 16s) [13:41:46] !log sukhe@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 [13:42:15] BGP 
alerts in eqiad expected [13:42:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [13:42:57] (03PS2) 10Btullis: Add the perccli utility to the new Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) [13:43:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [13:43:33] (JobUnavailable) resolved: Reduced availability for job varnish-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:10] (03CR) 10Btullis: [C: 03+1] cassandra: add de-init to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [13:44:35] RECOVERY - Check systemd state on cp5017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:02] (03CR) 10Ilias Sarantopoulos: Lift Wing: Add new namespace for ores-legacy service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/909992 (https://phabricator.wikimedia.org/T330414) (owner: 10Klausman) [13:45:51] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:03] RECOVERY - Check systemd state on cp5019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:47:45] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [13:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:51:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:53:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10Jhancock.wm) [13:56:07] (03PS6) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) [13:56:09] RECOVERY - Check systemd state on cp5023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:39] (03CR) 10Ilias Sarantopoulos: ml-services: deployment of ores-legacy app in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [13:57:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (10Jhancock.wm) [14:00:04] Deploy window LVS reimages in eqiad (no deployments during this time, please) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400) [14:00:07] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:11] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:48] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:03:29] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:33] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS bullseye [14:04:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye [14:09:00] (03CR) 10Jbond: puppet::agent: rename the enable_puppet7 flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:10:40] (03CR) 10Eevans: [C: 03+2] Missing aqs cluster secrets [labs/private] - 10https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [14:11:25] (03CR) 10Eevans: [V: 03+2 C: 03+2] Missing aqs 
cluster secrets [labs/private] - 10https://gerrit.wikimedia.org/r/909997 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [14:12:16] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [14:12:27] (03CR) 10Ssingh: [C: 03+2] hiera: lvs1019: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/909325 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:16:25] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909737 (https://phabricator.wikimedia.org/T334754) (owner: 10Eevans) [14:17:44] (03PS1) 10Ssingh: hiera: remove lvs1019's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910004 (https://phabricator.wikimedia.org/T321309) [14:19:16] (03CR) 10Muehlenhoff: [C: 03+1] puppet::agent: rename the enable_puppet7 flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:19:30] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:19:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage [14:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:20:59] RECOVERY - Check systemd state on cp5020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, 
lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Jhancock.wm) [14:22:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage [14:26:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: add more ROCm packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909970 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [14:30:13] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:57] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:01] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) [14:36:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Jhancock.wm) [14:37:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:38:27] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 78 
connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [14:38:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS bullseye [14:39:33] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye completed: - lvs1019 (**PASS**) - Downtimed on Icinga/Aler... [14:41:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Jhancock.wm) according to the change log on dns2003, it was the old authdns. updated the ticket to reflect the new naming [14:42:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:42:44] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1019 [14:42:53] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1019 [14:43:39] (03PS2) 10EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) [14:45:21] (03CR) 10EoghanGaffney: [gitlab/failover] Add check for DNS records update (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 
10EoghanGaffney) [14:45:23] (03PS2) 10Ssingh: varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) [14:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:18] (03CR) 10CI reject: [V: 04-1] [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:47:36] (03PS3) 10EoghanGaffney: [gitlab/failover] Add check for DNS records update [cookbooks] - 10https://gerrit.wikimedia.org/r/909765 (https://phabricator.wikimedia.org/T330771) [14:47:42] (03CR) 10BBlack: [C: 03+1] varnish: bump size of varnish shared memory log to 160M [puppet] - 10https://gerrit.wikimedia.org/r/910005 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:49:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Jhancock.wm) [14:52:59] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:53:20] (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs1019's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910004 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:54:48] !log restart pybal on lvs1019 to pick up bgp-med change [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:37] (03PS1) 10Ssingh: hiera: lvs1018: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910027 (https://phabricator.wikimedia.org/T321309) [14:58:14] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) 
[15:01:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:02:47] (03CR) 10Cwhite: [C: 03+2] logstash: replace gerrit1001 with gerrit1003 in tests [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [15:03:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T334964 (10phaultfinder) [15:07:04] (03CR) 10Muehlenhoff: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:09:07] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) p:05Triage→03Medium [15:09:48] (03PS1) 10Klausman: hiera: Add ores-legacy user for k8s/deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/910030 [15:12:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [15:13:11] (03CR) 10Elukey: "Looks good, what does pcc say?" 
[puppet] - 10https://gerrit.wikimedia.org/r/910030 (owner: 10Klausman) [15:16:26] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10lbowmaker) [15:20:36] !log stop pybal on lvs1018 for reimaging: T321309 [15:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:41] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:22:58] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40751/console" [puppet] - 10https://gerrit.wikimedia.org/r/910030 (owner: 10Klausman) [15:24:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40752/console" [puppet] - 10https://gerrit.wikimedia.org/r/910030 (owner: 10Klausman) [15:24:47] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:25:25] (03CR) 10Klausman: [V: 03+1] hiera: Add ores-legacy user for k8s/deployment_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910030 (owner: 10Klausman) [15:25:28] (03CR) 10Dzahn: [C: 03+2] "approved by langcom - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Fante" [dns] - 10https://gerrit.wikimedia.org/r/909771 (https://phabricator.wikimedia.org/T335016) (owner: 10Gerrit maintenance bot) [15:25:32] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Add ores-legacy user for k8s/deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/910030 (owner: 10Klausman) 
[15:25:45] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:25:47] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:26:26] Which channel would be best to discuss a possible EventBus issue? Which team handles it? [15:26:53] jouncebot: next [15:26:53] In 1 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [15:28:33] herzog: the admin group eventbus-admins is empty but pointed me to T232122 which tells me it's analytics, now "Data Engineering" [15:28:33] T232122: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 [15:28:37] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [15:28:57] thanks mutante - do they idle in a particular channel? [15:29:05] herzog: https://wikitech.wikimedia.org/wiki/Data_Engineering [15:29:09] checking [15:29:39] herzog: see the "contact us" tab [15:29:54] mutante: heh - "in our public IRC channel, . You can use the keyword a-team to ping us, so we notice your question." 
[15:30:06] they forgot to mention the channel though :) [15:30:09] RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:21] herzog: yea, I thought the same, but it does appear further down [15:30:26] I'll see -analytics [15:30:31] yea, that one [15:30:39] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:30:44] I would expect it to redirect you if they renamed it [15:30:54] that's right herzog - you'll find ottomata over there (here as well but better discussed over there) [15:32:15] Wonderfully https://www.mediawiki.org/wiki/Platform_Engineering_Team/Skill_Matrix also lists EventBus as in their scope :p [15:33:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for dns200[4-6] - pt1979@cumin2002" [15:34:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for dns200[4-6] - pt1979@cumin2002" [15:34:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:34:55] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:49] thanks joal - I posted there :) [15:36:28] !log DNS - added new project language "fat" (fat.wikipedia.org) - the "Fante" language, a dialect of Akan, spoken by 2.8 million people in Ghana - 
https://en.wikipedia.org/wiki/Fante_dialect T335016 [15:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:36] T335016: Create Wikipedia Fante - https://phabricator.wikimedia.org/T335016 [15:39:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:17] PROBLEM - IPMI Sensor Status on ms-be2043 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:42:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [15:44:40] (03PS2) 10Alexandros Kosiaris: Assign proper insetup Puppet roles to machines [puppet] - 10https://gerrit.wikimedia.org/r/906023 [15:44:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [15:45:59] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:47:26] (03PS3) 10Alexandros Kosiaris: thanos-fe: proper insetup Puppet roles to machine [puppet] - 10https://gerrit.wikimedia.org/r/906023 [15:47:37] (03PS1) 10Bking: elasticsearch: handle cloudelastic URLs [cookbooks] - 10https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) [15:47:41] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1018 [15:47:49] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1018 [15:48:08] !log sukhe@cumin2002 START - Cookbook 
sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS bullseye [15:48:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye [15:49:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [15:51:10] (03CR) 10Ssingh: [C: 03+2] hiera: lvs1018: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910027 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:53:08] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:38] (03PS1) 10Papaul: Add dns200[4-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910039 (https://phabricator.wikimedia.org/T326688) [15:57:09] (03CR) 10Papaul: [C: 03+2] Add dns200[4-6] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/910039 (https://phabricator.wikimedia.org/T326688) (owner: 10Papaul) [15:58:08] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:08] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage [16:02:51] jouncebot: next [16:02:51] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [16:03:22] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:08] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:04:26] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:04:29] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:04:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:05:12] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:05:46] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10KFrancis) Hi all, I am confirming the NDA has been signed. Please proceed with the access request. Thanks! 
[16:05:50] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:06:02] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:06:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: host reimage [16:06:32] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:09:07] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:09:11] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:09:20] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:14:51] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 [16:16:42] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:17:08] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup2010.codfw.wmnet'] [16:17:09] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/910041 (owner: 10Jbond) [16:17:16] (03PS4) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 [16:19:27] (03CR) 10CI reject: [V: 04-1] replace puppet::config 
with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [16:19:38] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40753/console" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [16:21:32] 10SRE, 10LDAP-Access-Requests: Request to add Andrew McAllister to ldap/wmde group - https://phabricator.wikimedia.org/T334960 (10Clement_Goubert) Thanks @KFrancis @AndrewTavis_WMDE Can you provide me with your wmde email address please? [16:21:59] (03PS7) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [16:23:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS bullseye [16:23:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye completed: - lvs1018 (**PASS**) - Downtimed on Icinga/Aler... [16:26:07] (03PS5) 10JHathaway: replace puppet::config with concat [puppet] - 10https://gerrit.wikimedia.org/r/909756 [16:26:26] (03CR) 10JHathaway: "pcc output, https://puppet-compiler.wmflabs.org/output/909756/40753/" [puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [16:27:50] (03CR) 10JHathaway: "Andrew could you take a look at the change to, modules/profile/manifests/wmcs/instance.pp, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/909756 (owner: 10JHathaway) [16:30:10] (03CR) 10JHathaway: [C: 03+1] core_modules: add core modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:30:14] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:58] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [16:31:11] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Observability-Alerting, 10observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (10Ladsgroup) 05Open→03Resolved Database alerting in general needs improvements and we made a lot of progress since this... [16:31:13] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Hi with @FNavas-foundation — Current access — Superset - no - "Service access denied due to missing privileges." Turnilo - no - "Service... 
[16:31:54] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:36] (03CR) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [16:32:47] jouncebot: nowandnext [16:32:47] For the next 0 hour(s) and 27 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400) [16:32:47] In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [16:33:08] sukhe: ping me once you're done [16:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:33:23] (03CR) 10JHathaway: [C: 03+1] puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:33:26] Amir1: I will be done in 27 minutes for sure [16:33:27] more like five [16:33:32] but, is something upcoming? 
[16:33:44] asking because I have one left but of course I don't want to take the slot of anyone else [16:33:50] so I can do that last one later [16:33:55] not anything major, I just want to deploy an easy non urgent patch [16:33:59] sure [16:34:10] just finishing this one and then I will let you know when I release the lock [16:34:18] finish your work, this patch has been sitting for months [16:34:26] it can wait for a day more if needs to [16:35:05] (03PS1) 10Ssingh: hiera: remove lvs1018's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910047 (https://phabricator.wikimedia.org/T321309) [16:35:06] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:30] er, are you sure? it will take at least an hour to reimage the last one, it's high-traffic1 so draining takes time :) [16:35:47] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1018 [16:35:56] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1018 [16:36:26] Amir1: I will also take a break so will resume when you are done [16:36:55] it's fine, seriously [16:36:58] (03CR) 10Dzahn: "oh, already merged:) thanks! I wasn't sure if it matters for the tests when exactly this happens. was just preparing it :)" [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:37:03] go take a break, do your work [16:37:12] ok, 2kind.gif :) [16:37:31] I've already picked up something else to do, it's not like there is shortage of fires to put [16:37:38] oh yeah... 
[16:38:30] (Access port speed <= 100Mbps) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [16:38:32] (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs1018's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910047 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:38:55] (03CR) 10Dzahn: [C: 03+2] gerrit: add host-based Hiera keys for gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:39:01] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [16:39:15] !log restart pybal on lvs1018 to remove bgp-med change: T321309 [16:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:20] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [16:41:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [16:41:20] (03CR) 10Dzahn: [C: 03+1] "This host is up (as in "can be pinged"), doesn't have gerrit prod role yet but it can be expected to be up and trying to get things done t" [puppet] - 10https://gerrit.wikimedia.org/r/909791 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:44:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:46:25] (03CR) 10Dzahn: "I am going" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:50:09] (03CR) 10Dzahn: "I was about to say "I am going ahead with this" :)" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:53:12] (03PS1) 10Dzahn: site: add gerrit prod role to gerrit1003 [puppet] - 
10https://gerrit.wikimedia.org/r/910049 (https://phabricator.wikimedia.org/T326368) [16:53:30] (Access port speed <= 100Mbps) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [16:57:46] (03CR) 10Cwhite: [C: 03+2] logstash: replace gerrit1001 with gerrit1003 in tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:58:09] (03PS1) 10Ssingh: hiera: lvs1017: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910050 (https://phabricator.wikimedia.org/T321309) [16:58:36] (03CR) 10Dzahn: "gotcha!:) thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/909792 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:59:04] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [17:00:38] anyone here to deploy? 
please let me know [17:00:44] I haven't started the last LVS reimaging so can pause [17:00:50] seems like Amir.1 was the one but he said it's fine [17:01:04] I will wait for 10 mins to be sure [17:01:09] (scap is locked) [17:01:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2004.mgmt.codfw.wmnet with reboot policy FORCED [17:02:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [17:04:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [17:05:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2005.mgmt.codfw.wmnet with reboot policy FORCED [17:09:01] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite) [17:09:39] 10SRE, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite) [17:12:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:14:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2005.mgmt.codfw.wmnet with reboot policy FORCED [17:14:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dns2006.mgmt.codfw.wmnet with reboot policy FORCED [17:14:56] cool, proceeding with the last reimage then [17:14:59] please hold off deploys [17:15:41] jouncebot: stall it [17:15:59] jouncebot: nowandnext [17:15:59] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [17:15:59] In 0 hour(s) and 44 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:16:00] In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:18:14] jouncebot: nowandnext [17:18:14] For the next 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [17:18:14] In 0 hour(s) and 41 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:18:14] In 0 hour(s) and 41 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:19:06] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:19:27] jouncebot: nowandnext [17:19:27] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1700) [17:19:27] In 0 hour(s) and 40 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:19:27] In 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:19:38] sukhe: I tried to fix it but failed [17:19:41] mutante: thanks [17:19:45] I will keep an eye out [17:19:47] * mutante edited the Deployment calendar page [17:19:54] but the bot doesnt get it yet [17:20:22] on wiki it shows only your thing as "happening now" now [17:20:31] jouncebot: refresh [17:20:33] I refreshed my knowledge about deployments. 
[17:20:37] jouncebot: nowandnext [17:20:38] For the next 0 hour(s) and 39 minute(s): LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400) [17:20:38] In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745) [17:21:04] !log stop pybal in lvs1017 for reimaging [17:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:15] you got 39 minutes, stole from the "MW on Kubernetes" window . heh [17:21:19] ha [17:21:56] PROBLEM - Query Service HTTP Port on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:22:16] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:22:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [17:23:35] RECOVERY - Query Service HTTP Port on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:25:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [17:27:00] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:28:08] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:28:18] ^ expected [17:29:20] (03CR) 10Cmelo: [C: 04-1] "Just to avoid it to get merged before we are really ready to deploy this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909401 
(https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [17:30:06] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:20] RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:26] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:33:30] (Access port speed <= 100Mbps) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [17:33:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10colewhite) Getting Prometheus to scrape a new metrics endpoint is pretty straightforward. When the exporter is up and running and firewall r... 
[17:35:04] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns2006.mgmt.codfw.wmnet with reboot policy FORCED [17:38:40] (03PS1) 10Cmelo: Add new user right campaignevents-organize-events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) [17:42:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:43:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [17:44:03] (03PS1) 10Cmelo: Set multi organizer feature flag to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) [17:45:05] Deploy window LVS reimages in eqiad (no deployments during this time, please) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1400) [17:45:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745) [17:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:46:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [17:46:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS 
bullseye [17:46:46] jouncebot: next [17:46:46] In 0 hour(s) and 13 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:46:46] In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800) [17:46:49] hmm [17:46:50] fun [17:46:52] let's see [17:47:17] draining took a longer time than expected, but was expected [17:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:48:50] (03CR) 10Ssingh: [C: 03+2] hiera: lvs1017: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/910050 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:49:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:45] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) 05Open→03In progress p:05Triage→03High [17:49:53] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF) [17:50:17] (03PS1) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 [17:50:27] (03PS1) 
10Ssingh: hiera: remove lvs1017's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910058 (https://phabricator.wikimedia.org/T321309) [17:50:34] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2004'] [17:51:48] (03PS2) 10Ladsgroup: auto_schema: Get rid of concept of skipping replicas [software] - 10https://gerrit.wikimedia.org/r/910057 [17:54:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2004'] [17:55:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005'] [17:55:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2006'] [17:56:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2005'] [17:56:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2006'] [17:56:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005'] [17:57:07] (03PS58) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [17:57:09] (03PS1) 10Jbond: git-sync-upstream: add support for g10k and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [17:57:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns2005'] [17:57:28] !log 
pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2005'] [17:57:32] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:57:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2005'] [17:57:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns2006'] [17:58:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns2006'] [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1745) [18:00:05] jnuche and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800). [18:00:05] jnuche and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T1800). 
[18:00:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2004.wikimedia.org with OS bullseye [18:00:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye [18:00:31] win 71 [18:00:40] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:01] is fighting flying ants invading the apartment [18:01:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage [18:01:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) [18:01:24] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:01:42] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:01:46] (03PS59) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [18:02:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T334901 (10Papaul) 05Open→03Resolved a:03Papaul @jcrespo thanks we will ignore this alert then [18:03:19] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:03:30] (Access port 
speed <= 100Mbps) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:03:58] (03PS60) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [18:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:04:18] 10SRE, 10ops-codfw: Broken PSU on ganeti2019 - https://phabricator.wikimedia.org/T335026 (10Papaul) p:05Triage→03Medium a:03Jhancock.wm [18:04:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage [18:04:40] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:15] (03PS61) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [18:07:22] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:54] (03CR) 10JHathaway: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:09:48] (03CR) 10JHathaway: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:15:18] RECOVERY - Router interfaces on 
cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:16:35] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:19:14] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:21:06] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:21:56] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:22:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bullseye [18:22:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye completed: - lvs1017 (**PASS**) - Downtimed on Icinga/Aler... 
[18:23:21] (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs1017's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/910058 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:23:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [18:23:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye [18:25:32] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:25:40] !log restart pybal on lvs1017 to pick up bgp-med change: T321309 [18:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:44] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [18:26:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:27:58] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:28:26] !log sukhe@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in eqiad, blocking deploys T321309 (duration: 286m 39s) [18:28:41] ^ Traffic LVS work completed in eqiad. 
thanks to all for your patience [18:28:54] (03Abandoned) 10Ssingh: depool eqiad (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/909985 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:29:37] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) a:05Trizek-WMF→03sgrabarczuk I did whatever I can that doesn't require checking.... [18:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:31:26] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:31:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [18:33:00] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:33:01] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover, 10User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Trizek-WMF) [18:35:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [18:36:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [18:36:33] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:35] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:39:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [18:40:55] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:41:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) This is now complete and we have upgraded all 176 Traffic hosts to bullseye. 
WE would like to thank @MoritzMuehlenhoff for helping with the Pybal backport that made the LVS reimaging... [18:43:04] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Aklapper) > Mediawiki - no That's via https://meta.wikimedia.org/wiki/Special:CentralAuth?target=FNavas-WMF instead and unrelated to this task? [18:44:58] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:46:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) @jbond for the firmware reimaging cookbook that saved us a lot of time by automating the iDRAC and NIC firmwares and deferring having the defer reboot option. [18:46:30] (03PS62) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [18:48:51] (03CR) 10JHathaway: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:50:08] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:50:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:51:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) I just attempted to build frbast1002 and frpig1002 and neither got a dhcp offer. Could we please verify that all the hosts are in the corre... 
[18:52:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:52:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2004.wikimedia.org with OS bullseye [18:52:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:52:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye completed: - dns2004 (**PASS**) - Removed from Pup... [18:53:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bullseye [18:53:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye [18:54:03] (03PS1) 10Dzahn: gerrit: add gerrit1003 to rsync dest hosts when using prod role [puppet] - 10https://gerrit.wikimedia.org/r/910064 (https://phabricator.wikimedia.org/T326368) [18:56:15] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:56:41] (03PS6) 10Jbond: puppet::agent: rename the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 (https://phabricator.wikimedia.org/T330490) [18:56:43] (03PS13) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 
10https://gerrit.wikimedia.org/r/907991 [18:56:45] (03PS15) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:56:47] (03PS12) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (https://phabricator.wikimedia.org/T330490) [18:56:49] (03PS2) 10Jbond: git-sync-upstream: add support for g10k and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [18:56:51] (03PS63) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [18:58:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) a:05Jgreen→03Cmjohnson [18:59:09] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:00:09] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:21] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/910064/40755/" [puppet] - 10https://gerrit.wikimedia.org/r/910064 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:01:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) [19:01:45] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:02:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - 
https://phabricator.wikimedia.org/T326688 (10Papaul) @Jhancock.wm hey if you a chance can you please check network cable on dns2006? link is showing down Thanks ` ge-1/0/8 up down dns2006 [19:02:16] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10aaron) >>! In T334023#8792693, @Ladsgroup wrote... [19:02:19] (03CR) 10RLazarus: [C: 03+1] "LGTM for switchback time" [dns] - 10https://gerrit.wikimedia.org/r/909873 (https://phabricator.wikimedia.org/T335015) (owner: 10Clément Goubert) [19:02:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:02:55] (03CR) 10RLazarus: [C: 03+1] "LGTM for switchback time" [puppet] - 10https://gerrit.wikimedia.org/r/909874 (https://phabricator.wikimedia.org/T335015) (owner: 10Clément Goubert) [19:03:17] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:04:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dns2005.wikimedia.org with OS bullseye [19:04:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**FAIL**) - Removed from Pup... 
[19:04:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye executed with errors: - dns2005 (**FAIL**) - Remov... [19:04:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) [19:04:43] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10Papaul) [19:05:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [19:05:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye [19:06:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [19:09:49] (03PS1) 10Dzahn: add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/910065 [19:09:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [19:11:08] (03CR) 10Dzahn: "noticed when swiching gerrit1003 from migration role to actual prod role that the role owner changes, so added us for the special Phabrica" [puppet] - 10https://gerrit.wikimedia.org/r/910065 (owner: 10Dzahn) [19:12:48] (03CR) 10Dzahn: "currently this happens when making a change like 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/910049" [puppet] - 10https://gerrit.wikimedia.org/r/910065 (owner: 10Dzahn) [19:14:08] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "much more readable in https://puppet-compiler.wmflabs.org/output/910049/40756/gerrit1003.wikimedia.org/index.html now after https://gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/910049 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:18:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2005.wikimedia.org with OS bullseye [19:18:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**PASS**) - Downtimed on Ici... [19:20:15] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:21:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:21:53] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:25:29] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:28:27] (03PS64) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [19:38:04] (03CR) 10Daimona Eaytoy: [C: 04-1] Add new user right campaignevents-organize-events (032 comments) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [19:39:17] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Correct! @Aklapper [19:40:09] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:42:10] (03PS65) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [19:42:19] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:09] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:48:29] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:49:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2006.wikimedia.org with OS bullseye [19:49:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye executed with errors: - dns2006 (**FAIL**) - Remov... 
[19:50:02] (03CR) 10Daimona Eaytoy: Set multi organizer feature flag to true (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [19:52:19] (03CR) 10Daimona Eaytoy: [C: 04-1] "Sent the other comments too early... I also wanted to add that this change should be made dependent on I4caf9ab8170a83d8d81922adb10915c6df" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [19:54:24] (03CR) 10Btullis: [C: 03+2] Add the perccli utility to the new Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/909707 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [19:57:09] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230419T2000). [20:00:07] No Gerrit patches in the queue for this window AFAICS. 
[20:02:17] (03CR) 10Jbond: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [20:07:29] (03CR) 10Zabe: [C: 03+2] Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe) [20:08:18] (03Merged) 10jenkins-bot: Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909884 (https://phabricator.wikimedia.org/T331921) (owner: 10Zabe) [20:09:30] !log zabe@deploy2002 Started scap: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]] [20:09:36] T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921 [20:10:56] !log zabe@deploy2002 zabe: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:13:39] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:14:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:14:48] (03PS1) 10Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) [20:14:54] (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [20:15:30] (03CR) 10Ladsgroup: "recheck" [software] - 
10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [20:15:51] (03PS2) 10Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) [20:15:56] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10Umherirrender) There was a recent improvement of thumbnails purge for similiar reasons on T331138. For me the thum... [20:16:20] (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [20:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:16:56] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:909884|Revert "Revert "dewiki: Allow 'crats to remove sysopship and manage importers"" (T331921)]] (duration: 07m 26s) [20:17:03] T331921: enable de-wp bureaucrats to remove adminflag and to grant importer rights - https://phabricator.wikimedia.org/T331921 [20:17:36] (03PS3) 10Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) [20:17:42] (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for more straightforward check functions [software] - 10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) (owner: 10Ladsgroup) [20:17:53] (03PS4) 10Ladsgroup: auto_schema: Add support for more straightforward check functions [software] - 
10https://gerrit.wikimedia.org/r/910089 (https://phabricator.wikimedia.org/T304654) [20:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:21:58] 10SRE, 10WMF-General-or-Unknown: some file thumbs fail to purge on upload of a new version - https://phabricator.wikimedia.org/T35672 (10Umherirrender) 05Open→03Resolved Please do not reopen very old tasks. Please create new tasks for new issues even there are looking the same (after some years it should b... [20:33:15] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:35:06] (03PS3) 10Eevans: Do not de-init node prior to restart [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/909403 (https://phabricator.wikimedia.org/T334754) [20:56:14] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) > Turnilo - no - "Service access denied due to missing privileges. Turnilo only uses LDAP for authentication (no posix group membership), so this hints that... 
[21:07:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:09:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:27:16] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [21:30:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [21:32:35] (03CR) 10Ryan Kemper: [C: 03+1] elasticsearch: handle cloudelastic URLs [cookbooks] - 10https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [21:32:39] (03CR) 10Bking: [C: 03+2] elasticsearch: handle cloudelastic URLs [cookbooks] - 10https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [21:33:42] (03PS1) 10Cwhite: prometheus: Change zarcillo location [puppet] - 10https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) [21:35:36] (03CR) 10Cwhite: [C: 03+2] prometheus: Change zarcillo location [puppet] - 10https://gerrit.wikimedia.org/r/910076 (https://phabricator.wikimedia.org/T334455) (owner: 10Cwhite) [21:35:43] (03Merged) 10jenkins-bot: elasticsearch: handle cloudelastic URLs [cookbooks] - 10https://gerrit.wikimedia.org/r/910037 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [21:38:17] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye [21:40:03] RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:48] (03PS2) 10Dzahn: acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - 10https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368) [21:46:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ganeti2019:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ganeti2019 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:46:55] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:47:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [21:47:39] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:48:15] (PowerSupply) firing: Power Supply - PS Redundancy - issue on parse2010:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=parse2010 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:48:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [21:52:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:00:34] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:06:41] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10JoKalliauer) [22:10:35] !log removing 5 files for legal compliance [22:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:12] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10JoKalliauer) @Umherirrender ; You have to compare the PNG not the SVG, because the rendering has several rendering... 
[22:14:35] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic-Icebox, 10TestMe: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10JoKalliauer) [22:14:44] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:20:26] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Seems like there are not just 2 users, there are actually 3 different users! [mwmaint1002:~] $ ldapsearch -x uid=fnavas* | grep uidNumber uidNumber: 43544 ui... [22:24:56] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:29] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) To remove any ambiguity, let's refer to them by uidNumbers. Starting with the oldest: 43544 | uid = fnavas | sn = Francisco Navas | cn = Francisco Navas | mail... 
[22:28:54] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on cp2031:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:34:53] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2022.codfw.wmnet with OS bullseye [22:35:15] (03CR) 10EoghanGaffney: [C: 03+1] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:36:11] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Hey @FNavas-foundation can you do these things: - set an email address for the fnavas-foundation user (login at wikitech and go to preferences, set an address)... 
[22:38:22] (03CR) 10EoghanGaffney: [C: 03+1] add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/910065 (owner: 10Dzahn) [22:38:48] (03CR) 10EoghanGaffney: [C: 03+1] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI [puppet] - 10https://gerrit.wikimedia.org/r/909790 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:39:06] (03PS2) 10Andrea Denisse: prometheus: Added support for syncing data between instances [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) [22:43:36] (03CR) 10Dzahn: prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [22:45:39] (03CR) 10Dzahn: prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [22:47:06] (03CR) 10Dzahn: prometheus: Added support for syncing data between instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [22:51:08] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Thanks @Dzahn - email added to fnavas-foundation - lets use 43670 | uid = fnavas-foundation | sn = FNavas-foundation | cn = FNavas-foundation -... 
[22:54:48] (03PS1) 10Cwhite: logstash: webrequest ecs: move backend to label [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) [22:55:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Dwisehaupt i checked on the switch all the interfaces are configured and up maybe the server were not added to DNS since we do not manage Frac... [22:55:52] (03CR) 10EoghanGaffney: [C: 03+1] cloudgw: allow VMs to speak to new gerrit server (gerrit1003) [puppet] - 10https://gerrit.wikimedia.org/r/909795 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [22:56:01] (03CR) 10EoghanGaffney: [C: 03+1] cloudgw: fix IP address for gerrit-replica.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/909794 (owner: 10Dzahn) [22:56:51] (03CR) 10CI reject: [V: 04-1] logstash: webrequest ecs: move backend to label [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [23:02:52] !log removing 3 files for legal compliance [23:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:00] !log removing 1 file for legal compliance [23:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:46] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) @FNavas-foundation Thank you for the prompt reply. I can confirm that all users have an (the same) email address now, cool!. It's possible that you need both a... [23:20:40] (03CR) 10Cwhite: "Jenkins says "cp0000.eqiad.wmnet" is a typo but it is intentional." 
[puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [23:22:05] (03CR) 10Dzahn: "I was thinking about this earlier when you mentioned the "impossible number", heh. Maybe use 9999 instead? Should be fine to steal that on" [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [23:25:05] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) To quickly answer your last question - Abhas Tripathi has access to those supersets (I know for a fact) and @SDelbecque-WMF (who is the other PM on m... [23:29:07] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910078 [23:29:09] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910078 (owner: 10Zabe) [23:29:47] !log zabe@deploy2002 Started scap: [[gerrit:910078]] [23:29:55] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910078 (owner: 10Zabe) [23:32:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) @Papaul Thanks, I have verified they are in DNS. I think there may be some crossing in cables or vlans. When I try to build a host, I'm se... [23:35:52] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:54] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Thanks! This was very valuable information. With that we are able to track it down, luckily. 
So when I look at Abhas Tripathi, they have membership in analytics... [23:36:28] !log zabe@deploy2002 Finished scap: [[gerrit:910078]] (duration: 06m 40s) [23:37:00] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) a:05FNavas-foundation→03None [23:37:06] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) 05Stalled→03Open [23:37:09] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) p:05Medium→03High [23:37:19] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) 05Open→03In progress [23:37:33] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) [23:39:23] (03CR) 10Dzahn: "Doesn't seem like this is what was needed. Instead all they needed was "add to wmf LDAP group" and this group isn't even needed. 
https://p" [puppet] - 10https://gerrit.wikimedia.org/r/905767 (https://phabricator.wikimedia.org/T331482) (owner: 10Ssingh) [23:40:09] (03PS1) 10Dzahn: Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910017 [23:40:21] (03CR) 10CI reject: [V: 04-1] Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910017 (owner: 10Dzahn) [23:42:39] (03PS2) 10Dzahn: Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910017 [23:43:09] (03CR) 10CI reject: [V: 04-1] Revert "admin: add fnavas-foundation to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/910017 (owner: 10Dzahn) [23:44:00] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:44:16] (03CR) 10Dzahn: "ok, so they need to be moved to the ldap_only section, not be completely removed. I will make a new change that converts them" [puppet] - 10https://gerrit.wikimedia.org/r/910017 (owner: 10Dzahn) [23:53:41] (03PS2) 10Dzahn: add ServiceOps-Collab as contact for gerrit/phab migration roles and peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/910065 [23:53:43] (03PS1) 10Dzahn: admin: move fnavas to ldap_only admins, remove from a-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482) [23:54:00] (03PS2) 10Dzahn: admin: move fnavas to ldap_only admins, remove from a-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/910104 (https://phabricator.wikimedia.org/T331482)