[00:01:18] (PS1) JHathaway: puppet-merge: don't symlink environments [puppet] - https://gerrit.wikimedia.org/r/976878 (https://phabricator.wikimedia.org/T350809)
[00:02:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[00:03:08] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/976878 (https://phabricator.wikimedia.org/T350809) (owner: JHathaway)
[00:08:30] RECOVERY - Check systemd state on logstash2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:02] (CR) JHathaway: [C: +2] puppet-merge: don't symlink environments [puppet] - https://gerrit.wikimedia.org/r/976878 (https://phabricator.wikimedia.org/T350809) (owner: JHathaway)
[00:10:06] (CR) Dzahn: "I don't know much about this but I am giving moral support to attempt and fix the current puppet-merge issue with this." [puppet] - https://gerrit.wikimedia.org/r/976878 (https://phabricator.wikimedia.org/T350809) (owner: JHathaway)
[00:12:38] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:18] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:25] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for SToyofuku-WMF - https://phabricator.wikimedia.org/T351857 (SToyofuku-WMF)
[00:17:10] RECOVERY - Check systemd state on bast2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:07] jhathaway: looks promising :)
[00:18:22] great
[00:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:18:34] jhathaway: I ran the command that failed on all of them.. and it works again
[00:18:37] on a random host
[00:18:47] it was sudo /usr/local/sbin/smart-data-dump --syslog --outfile /var/lib/prometheus/node.d/device_smart.prom
[00:18:54] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:59] and it was: /var/lib/puppet/lib/facter/raid.rb' (No such file or directory)
[00:19:11] but now that file is back!
[00:19:25] you fixed it :)
[00:19:26] RECOVERY - Check systemd state on krb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:48] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet1003.eqiad.wmnet with OS bookworm
[00:19:53] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors: - planet1003 (**FAIL**)...
[00:20:05] trying another reimage of my VM now
[00:20:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm
[00:20:19] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm
[00:20:56] RECOVERY - Check systemd state on ms-be2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:04] RECOVERY - Check systemd state on kubernetes2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:56] RECOVERY - Check systemd state on backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:08] RECOVERY - Check systemd state on ganeti-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:28] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:28] RECOVERY - Check systemd state on kubernetes2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:14] RECOVERY - Check systemd state on kubernetes1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:20] RECOVERY - Check systemd state on dbprov2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:26] RECOVERY - Check systemd state on kubernetes1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:38] RECOVERY - Check systemd state on ganeti1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:44] RECOVERY - Check systemd state on kafka-main2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:35] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet2003.codfw.wmnet with OS bookworm
[00:28:35] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host planet2003.codfw.wmnet
[00:28:40] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm executed with errors: - planet2003 (**FAIL**)...
[00:29:16] RECOVERY - Check systemd state on kubernetes2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet2003.codfw.wmnet with OS bookworm
[00:29:32] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm
[00:30:04] RECOVERY - Check systemd state on kubestage1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:28] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:30] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:36] RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:22] RECOVERY - Check systemd state on kafka-main2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:32] RECOVERY - Check systemd state on ganeti4005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[00:31:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 10m 44s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:32:06] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:12] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:16] RECOVERY - Check systemd state on ganeti5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:42] RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:48] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:52] RECOVERY - Check systemd state on sessionstore2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:59] (PuppetFailure) firing: (2) Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:34:00] RECOVERY - Check systemd state on backup2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[00:34:44] RECOVERY - Check systemd state on ml-cache1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:45] (CR) JHathaway: [C: +2] dev env: PS1 function for to show the puppet env [puppet] - https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[00:34:59] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:35:02] RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:18] RECOVERY - Check systemd state on ganeti-test2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:10] RECOVERY - Check systemd state on kafka-jumbo1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:44] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:36:52] RECOVERY - Check systemd state on kubernetes1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:04] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:02] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:30] RECOVERY - Check systemd state on ganeti5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:32] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:42] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:44] RECOVERY - Check systemd state on an-worker1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:02] RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976710
[00:39:05] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976710 (owner: TrainBranchBot)
[00:39:06] RECOVERY - Check systemd state on dumpsdata1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:14] RECOVERY - Check systemd state on an-presto1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:14] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:08] RECOVERY - Check systemd state on ganeti1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:20] RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:24] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:28] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 6m 34s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:41:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage
[00:41:56] RECOVERY - Check systemd state on an-presto1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:02] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:10] RECOVERY - Check systemd state on ganeti2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:34] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:59] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetserver1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:43:28] RECOVERY - Check systemd state on kubernetes2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:42] (SystemdUnitFailed) resolved: (2) export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:52] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:59] (PuppetFailure) firing: (2) Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:44:38] RECOVERY - Check systemd state on ganeti1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage
[00:44:44] RECOVERY - Check systemd state on kubernetes1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:04] RECOVERY - Check systemd state on ms-backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:42] RECOVERY - Check systemd state on ml-serve1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:20] RECOVERY - Check systemd state on kubernetes2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:58] RECOVERY - Check systemd state on kubernetes1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:42] RECOVERY - Check systemd state on an-presto1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:24] RECOVERY - Check systemd state on backup2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[00:53:02] RECOVERY - Check systemd state on ganeti-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:26] RECOVERY - Check systemd state on kubernetes2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:56] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Jhancock.wm)
[00:56:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976710 (owner: TrainBranchBot)
[00:57:34] RECOVERY - Check systemd state on cp4037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetserver1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:59:42] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[01:02:45] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2033']
[01:03:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2033']
[01:08:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 14m 45s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[01:09:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2028.codfw.wmnet with OS bullseye
[01:09:26] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host restbase2028.codfw.wmnet with OS bullseye
[01:13:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 35m 38s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[01:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:04] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 28m 16s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[01:25:04] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 25m 11s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[01:26:02] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet1003.eqiad.wmnet with OS bookworm
[01:26:06] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet2003.codfw.wmnet with OS bookworm
[01:26:07] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors: - planet1003 (**FAIL**)...
[01:26:11] SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm executed with errors: - planet2003 (**FAIL**)...
[01:32:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:16] RECOVERY - Check systemd state on an-conf1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2028.codfw.wmnet with OS bullseye
[02:26:26] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host restbase2028.codfw.wmnet with OS bullseye executed with errors: - restbase...
[02:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:23] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:34:59] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:36:45] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:44:14] (PuppetFailure) firing: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:30:36] SRE-swift-storage, Commons, Data-Persistence-Backup, MediaWiki-File-management, media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (Bawolff) As a note My expectation is that the primary usage for a higher limit would be: * Long (or HD) videos w...
[06:19:08] PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:11] PROBLEM - Check systemd state on kubernetes2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:18] SRE-swift-storage, Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (Bawolff) > add: TIFFs and PDFs (and sometimes even PNGs) are bigger than 4GiB too. It should be noted, that many (not all) of the tiff cases are due to using no lossless...
[06:19:43] SRE-swift-storage, Commons, Data-Persistence-Backup, MediaWiki-File-management, media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (AlexisJazz) >>! In T191804#9354216, @Bawolff wrote: > As a note > > My expectation is that the primary usage for...
[06:23:57] !log Restarting CI Jenkins for plugin update # T282893
[06:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:02] T282893: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T282893
[06:25:03] yes, I wrote a patch for a java code base
[06:32:49] (PS1) Marostegui: pc2014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/976883 (https://phabricator.wikimedia.org/T351786)
[06:34:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:36:00] (CR) Marostegui: [C: +2] pc2014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/976883 (https://phabricator.wikimedia.org/T351786) (owner: Marostegui)
[06:37:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switch
[06:38:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switch
[06:41:17] (PS1) Marostegui: mariadb: Promote db1119 to m2 master [puppet] - https://gerrit.wikimedia.org/r/976884 (https://phabricator.wikimedia.org/T351638)
[06:42:54] (CR) Marostegui: [C: +2] mariadb: Promote db1119 to m2 master [puppet] - https://gerrit.wikimedia.org/r/976884 (https://phabricator.wikimedia.org/T351638) (owner: Marostegui)
[06:44:23] !log Failover m2 from db1195 to db1119 - T351638
[06:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:28] T351638: Switchover m2 master db1195 -> db1119 - https://phabricator.wikimedia.org/T351638
[06:44:31] SRE-swift-storage, Commons, Data-Persistence-Backup, MediaWiki-File-management, media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (Bawolff) > There's one more use I think: the 4K transcode of https://commons.wikimedia.org/wiki/File:Politpa...
[06:48:48] (CR) Muehlenhoff: "Looks good, two nits inline. One thing that will need to be added in a followup when you add a role is to include the firewall base profil" [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[06:50:14] !log Restarting CI Jenkins for plugins removals
[06:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:54] (PS1) Marostegui: db1195: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976885 (https://phabricator.wikimedia.org/T351386)
[06:51:32] (CR) Marostegui: [C: +2] db1195: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976885 (https://phabricator.wikimedia.org/T351386) (owner: Marostegui)
[06:52:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bookworm
[06:52:32] (CR) Andrea Denisse: "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[06:52:51] marostegui: I am restarting Gerrit :)
[06:53:08] !log Restarting Gerrit
[06:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:19] hashar: go for it!
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T0700)
[07:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T0700)
[07:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:04:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
[07:05:45] (PS1) Marostegui: mariadb: Do not reimage db1241 [puppet] - https://gerrit.wikimedia.org/r/976926
[07:06:20] (CR) Marostegui: [C: +2] mariadb: Do not reimage db1241 [puppet] - https://gerrit.wikimedia.org/r/976926 (owner: Marostegui)
[07:08:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
[07:08:58] !log Restarted CI Jenkins to upgrade Rebuilder plugin
[07:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:11:04] (PS1) Marostegui: db1119: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976927
[07:12:07] PROBLEM - jenkins_service_running on contint2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:13:07] RECOVERY - jenkins_service_running on contint2002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:13:41] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/976927 (owner: Marostegui)
[07:13:51] (CR) Marostegui: [C: +2] db1119: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976927 (owner: Marostegui)
[07:19:57] <_joe_> !log restarted sirenbot
[07:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bookworm
[07:22:21] the Jenkins issue on contint2002 was me restarting the service and systemd not actually restarting it. Filed as https://phabricator.wikimedia.org/T351865
[07:28:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:29:49] (PS1) Arnaudb: mariadb: enable notifications [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674)
[07:31:20] (CR) Marostegui: mariadb: enable notifications (1 comment) [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[07:36:58] (CR) Arnaudb: [C: +2] mariadb: enable notifications [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[07:37:37] (PS2) Arnaudb: mariadb: enable notifications [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674)
[07:37:58] (CR) Muehlenhoff: [C: +2] mediawiki::packages: Drop python-pil [puppet] - https://gerrit.wikimedia.org/r/976202 (https://phabricator.wikimedia.org/T268468) (owner: Muehlenhoff)
[07:41:46] (CR) Marostegui: [C: +1] mariadb: enable notifications [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[07:42:10] (CR) Arnaudb: [C: +2] mariadb: enable notifications [puppet] - https://gerrit.wikimedia.org/r/976946 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[07:43:26] (PS1) Arnaudb: mariadb: hieradata ordering [puppet] - https://gerrit.wikimedia.org/r/976947 (https://phabricator.wikimedia.org/T343674)
[07:43:40] (CR) Marostegui: [C: +1] mariadb: hieradata ordering [puppet] - https://gerrit.wikimedia.org/r/976947 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[07:44:07] (CR) Arnaudb: [C: +2] mariadb: hieradata ordering [puppet] - https://gerrit.wikimedia.org/r/976947 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:00:05] Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T0800).
[08:01:40] No patch owners are signed up to deploy today, and no one is signed up to learn how to deploy, so have a nice long weekend for those of you who have the holiday, and for everyone else, have a quiet few days and we'll see you next time!
[08:08:42] apergos: thank you for the check! :)
[08:09:30] sure thing!
[08:13:10] (03CR) 10Ryan Kemper: [C: 03+1] sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [08:15:18] (03CR) 10Stevemunene: [C: 03+2] set druid hosts to use the reuse partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [08:17:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 36.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:22:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 43.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:33:00] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:34:50] (ProbeDown) firing: (6) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:15] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:35:47] (03PS2) 10Majavah: rsync: do not included config for absented modules [puppet] - 10https://gerrit.wikimedia.org/r/976835 [08:36:42] (03CR) 10Kosta Harlan: [C: 03+1] ores extension: set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos) [08:36:45] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:37:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:37:35] !log Restarting CI Jenkins for plugins removals [08:37:38] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:54] * Emperor here [08:38:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:38:23] (ProbeDown) firing: (9) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:38:24] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:40:33] titan1002 is hosed, I can login via the serial console, but the root login over the serial console fails, probably under heavy load [08:41:00] that's one of the new dedicated-thanos-frontends [08:41:03] (03CR) 10Majavah: [C: 03+2] rsync: do not included config for absented modules [puppet] - 10https://gerrit.wikimedia.org/r/976835 (owner: 10Majavah) [08:41:14] I acked via phone [08:41:30] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:41:48] I've depooled titan1002 from thanos [08:42:33] ah snap folks I was trying some queries on the thanos UI, it is probably me [08:42:34] titan1001 seems to be struggling too (I'm still waiting to see if I can get in over ssh) [08:42:47] I'm here too, likely safe to reboot titan1002 if it is hosed [08:42:49] we should have some cgroups settings to avoid this though [08:43:12] I checked SEL, there's at least no SEL-logged signs of hardware trouble [08:43:39] yeah probably heavy queries, curious the memory limits didn't kick in tho [08:43:56] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 261 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:44:05] godog: I was trying the rules in https://gerrit.wikimedia.org/r/c/operations/puppet/+/975846/3/modules/profile/files/thanos/recording_rules.yaml via thanos ui, I guess that they were too heavy, will refrain from merging :( [08:44:14] (PuppetFailure) firing: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[08:44:15] should we wait for the expensive query to time out? otherwise I can also powercycle to unbreak it [08:44:22] nono please go ahead [08:44:36] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:44:39] elukey: seems plausible [08:44:47] moritzm: +1 on my end to just reboot [08:44:54] still trying to get into titan1001 [08:44:59] ok, doing that now [08:45:26] !log powercycling titan1002 [08:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:44] (03PS3) 10Stevemunene: C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [08:45:52] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 215 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:45:57] apergos: I was going to ask to sync a config patch, but I see on https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar that there should be no deploys today or tomorrow [08:46:09] (waiting for a password prompt on the serial console, ssh still not getting in) [08:46:18] (03PS1) 10Phuedx: Remove mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976930 (https://phabricator.wikimedia.org/T351195) [08:47:12] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:47:25] kostajh: we got an email recently with the schedule, about this [08:47:56] but the bot pings people regardless ;-) [08:48:08] trying to log in on the serial console gets me "Login timed out after 60 seconds" [08:48:13] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop switch to new topology script. 
[puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [08:48:24] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:48:55] titan1002 is back now [08:48:58] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 69.44 ms [08:49:04] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:31] should I powercycle titan1001? [08:49:36] thank you moritzm [08:49:39] Emperor: +1 [08:49:50] (ProbeDown) firing: (9) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:53] !log powercycle titan1001 [08:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:21] the thanos-query is back up, I'll repool titan1002 now [08:51:52] ack [08:52:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:52:55] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [08:53:26] (ProbeDown) firing: (9) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:44] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:53:48] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [08:53:51] (03PS1) 10Arnaudb: mariadb: add a new host to s3 [puppet] - 10https://gerrit.wikimedia.org/r/976948 (https://phabricator.wikimedia.org/T343674) [08:53:56] titan1001 back [08:54:46] thanks! [08:54:50] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:54:58] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:07] thank you folks, appreciate it, I'll take a closer look at what when sideways [08:55:10] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:55:12] went even [08:56:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:30] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 59 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:56:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:40] (03CR) 10Marostegui: mariadb: add a new host to s3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/976948 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [08:57:25] (03CR) 10Elukey: [C: 03+1] ml-services: update docker images to latest versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [08:58:23] (ProbeDown) firing: (9) Service centrallog1002:6514 has 
failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:24] (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:59:05] (03CR) 10Volans: "post-merge comment" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [09:01:09] godog: I had a quick look at titan1002 and it seems the kernel deadlocked under memory pressure, apparently while trying to oom-kill thanos [09:01:43] sigh [09:01:51] thank you moritzm that makes sense to me [09:04:20] (03CR) 10Filippo Giunchedi: [C: 03+2] team-o11y: alert on Prometheus storing a few days of data [alerts] - 10https://gerrit.wikimedia.org/r/975832 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [09:05:23] (03PS1) 10Marostegui: report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976932 [09:05:35] (03CR) 10CI reject: [V: 04-1] report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976932 (owner: 10Marostegui) [09:05:57] (03CR) 10Marostegui: [C: 03+1] mariadb: add a new host to s3 [puppet] - 10https://gerrit.wikimedia.org/r/976948 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:07:20] (03CR) 10Arnaudb: mariadb: add a new host to s3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976948 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:07:22] (03CR) 10Volans: "I've left some new comments, most of my previous questions on the old file are still unanswered." 
[puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [09:07:24] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host to s3 [puppet] - 10https://gerrit.wikimedia.org/r/976948 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:07:33] (03CR) 10Filippo Giunchedi: [C: 03+2] varnishkafka: move to rsyslog::conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [09:09:10] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update docker images to latest versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [09:09:13] (03PS1) 10Marostegui: report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976933 [09:10:43] !log add 80G to prometheus/k8s in eqiad [09:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:46] (03CR) 10Marostegui: [C: 03+2] report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976933 (owner: 10Marostegui) [09:10:47] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [09:11:05] (03Abandoned) 10Marostegui: report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976932 (owner: 10Marostegui) [09:11:18] (03Merged) 10jenkins-bot: report_users: Add cloudlb hosts [software] - 10https://gerrit.wikimedia.org/r/976933 (owner: 10Marostegui) [09:12:22] !log add 50G to prometheus/services in codfw [09:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: provisionning db2190.codfw.wmnet - T343674 [09:13:07] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [09:13:17] !log 
arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: provisionning db2190.codfw.wmnet - T343674 [09:13:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: provisionning db2190.codfw.wmnet - T343674 [09:13:32] (03PS1) 10Muehlenhoff: profile::mediawiki::php: Remove support for PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/976934 [09:13:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: provisionning db2190.codfw.wmnet - T343674 [09:15:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2149 in db2190 for T343674', diff saved to https://phabricator.wikimedia.org/P53736 and previous config saved to /var/cache/conftool/dbconfig/20231123-091514-arnaudb.json [09:16:14] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 30 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:16:38] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: support alternative base in ::conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [09:17:24] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2190.codfw.wmnet [09:18:05] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2149.codfw.wmnet onto db2190.codfw.wmnet [09:18:49] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2190.codfw.wmnet [09:19:10] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2149.codfw.wmnet onto db2190.codfw.wmnet [09:19:49] (03CR) 10JMeybohm: [C: 03+1] Define the spark-history/spark-history-test k8s 
namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [09:20:38] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2190.codfw.wmnet [09:22:47] (03PS2) 10Ayounsi: Don't alert for v6 AAAA for logstash and kafka-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 [09:22:52] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog_exporter: move to a define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [09:23:03] (03PS3) 10Filippo Giunchedi: rsyslog_exporter: move to a define [puppet] - 10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) [09:24:02] (03CR) 10Ayounsi: [C: 03+2] Don't alert for v6 AAAA for logstash and kafka-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi) [09:24:34] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 47 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:24:42] (03Merged) 10jenkins-bot: Don't alert for v6 AAAA for logstash and kafka-logging [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/976110 (owner: 10Ayounsi) [09:26:54] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:27:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:27:17] (03PS1) 10Brouberol: Monitor kafka topic with replication factor == 1 [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) [09:29:02] FYI there is a rolling-restart of rsyslog across the fleet as a side effect of 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/976742 [09:32:37] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:25] 10Puppet, 10SRE, 10Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10Vgutierrez) 05Resolved→03Open It looks like we are having some issues with the raid fact: ` Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/us... [09:38:59] (PuppetFailure) resolved: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:43:40] (03CR) 10Ayounsi: [C: 03+1] Switch netboxdb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [09:43:53] (03PS1) 10Arnaudb: mariadb: add a new host on s2 [puppet] - 10https://gerrit.wikimedia.org/r/976949 (https://phabricator.wikimedia.org/T343674) [09:44:03] (03PS2) 10Jbond: Gemfile: update to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976304 [09:44:07] (03CR) 10Ayounsi: [C: 03+1] Move mw appservers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [09:44:35] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 5 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:45:42] (03CR) 10CI reject: [V: 04-1] Gemfile: update to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976304 (owner: 10Jbond) [09:46:32] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Gehel) 
05Declined→03Open Re-opening after discussion with @brouberol, having better auto discovery is still interesting. [09:46:38] (03CR) 10Ayounsi: [C: 03+1] cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 (owner: 10Majavah) [09:47:08] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Gehel) a:05Ottomata→03brouberol [09:47:11] (03CR) 10Marostegui: [C: 03+1] mariadb: add a new host on s2 [puppet] - 10https://gerrit.wikimedia.org/r/976949 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:50:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host on s2 [puppet] - 10https://gerrit.wikimedia.org/r/976949 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:51:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:54:38] (03CR) 10Ayounsi: [C: 03+1] "LGTM but I'd rather have someone else from I/F to review it before merging." [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [09:55:02] (03PS1) 10Jbond: puppetserver: Manage the whole environments dir not just production [puppet] - 10https://gerrit.wikimedia.org/r/976938 [09:55:56] (03CR) 10Btullis: [C: 03+1] Define the spark-history/spark-history-test k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [09:56:07] (03CR) 10Brouberol: "Btullis: do you know whether we have isio mesh enabled in dse?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [09:57:54] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: ship a separate 'receiver' instance [puppet] - 10https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [09:58:33] (03CR) 10Btullis: [C: 03+2] Fix Matomo TagManager functionality [puppet] - 10https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [09:59:53] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host druid1008.eqiad.wmnet with OS bullseye [10:00:32] (03Abandoned) 10Ayounsi: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/974538 (owner: 10Ayounsi) [10:04:27] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: provisionning db2189.codfw.wmnet - T343674 [10:04:32] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [10:04:40] (03PS2) 10Jbond: puppetserver: Manage the whole environments dir not just production [puppet] - 10https://gerrit.wikimedia.org/r/976938 [10:04:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: provisionning db2189.codfw.wmnet - T343674 [10:04:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: provisionning db2189.codfw.wmnet - T343674 [10:05:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: provisionning db2189.codfw.wmnet - T343674 [10:06:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2175 in db2189 for T343674', diff saved to https://phabricator.wikimedia.org/P53737 and previous config saved to /var/cache/conftool/dbconfig/20231123-100638-arnaudb.json [10:07:03] (03CR) 10Jbond: 
[C: 03+1] rsync: ensure daemon is started after config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [10:07:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fetch from rsyslog-receiver exporter [puppet] - 10https://gerrit.wikimedia.org/r/976744 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [10:09:53] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2175.codfw.wmnet onto db2189.codfw.wmnet [10:13:04] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [10:16:31] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: swift::proxy [10:16:44] (03PS8) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [10:16:46] (03PS8) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [10:17:34] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update docker images to latest versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [10:19:05] (03Merged) 10jenkins-bot: ml-services: update docker images to latest versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [10:20:04] (03PS1) 10Muehlenhoff: Switch swift::proxy to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976941 (https://phabricator.wikimedia.org/T349619) [10:20:45] (03PS1) 10Arnaudb: mariadb: add a new host on s1 [puppet] - 10https://gerrit.wikimedia.org/r/976951 (https://phabricator.wikimedia.org/T343674) [10:21:42] (03CR) 10Marostegui: [C: 03+1] mariadb: add a 
new host on s1 [puppet] - 10https://gerrit.wikimedia.org/r/976951 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:22:16] (03CR) 10Muehlenhoff: [C: 03+2] Switch swift::proxy to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976941 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:22:17] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [10:23:02] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host on s1 [puppet] - 10https://gerrit.wikimedia.org/r/976951 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:24:35] (03CR) 10Clément Goubert: [C: 03+1] profile::mediawiki::php: Remove support for PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/976934 (owner: 10Muehlenhoff) [10:25:00] (03CR) 10Jbond: "lgtm minr nit and just need confirmation on a comment" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:26:49] RECOVERY - Check systemd state on kubernetes2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: provisionning db2188.codfw.wmnet - T343674 [10:27:20] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [10:27:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: provisionning db2188.codfw.wmnet - T343674 [10:27:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: provisionning db2188.codfw.wmnet - T343674 [10:27:35] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:49] !log 
arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: provisionning db2188.codfw.wmnet - T343674 [10:28:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2146 in db2188 for T343674', diff saved to https://phabricator.wikimedia.org/P53738 and previous config saved to /var/cache/conftool/dbconfig/20231123-102840-arnaudb.json [10:30:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: swift::proxy [10:31:33] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2146.codfw.wmnet onto db2188.codfw.wmnet [10:34:21] !log stevemunene@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host druid1008.eqiad.wmnet with OS bullseye [10:37:46] (03PS1) 10Stevemunene: update druid100[7-8] reuse partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/976943 (https://phabricator.wikimedia.org/T332589) [10:38:10] (03PS1) 10Arnaudb: mariadb: new host on S8 [puppet] - 10https://gerrit.wikimedia.org/r/976953 (https://phabricator.wikimedia.org/T343674) [10:38:25] (03CR) 10Ayounsi: [C: 03+1] Generate subnet DHCP configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:39:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:39:06] (03CR) 10Marostegui: [C: 03+1] mariadb: new host on S8 [puppet] - 10https://gerrit.wikimedia.org/r/976953 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:39:23] (03CR) 10Arnaudb: [C: 03+2] mariadb: new host on S8 [puppet] - 10https://gerrit.wikimedia.org/r/976953 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:39:43] (03CR) 10Hnowlan: [C: 03+1] api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 
10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:40:47] (03CR) 10Hnowlan: [C: 03+1] Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:41:35] (03CR) 10Brouberol: [C: 03+1] update druid100[7-8] reuse partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/976943 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [10:44:24] (03PS12) 10Brouberol: Generate subnet DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) [10:45:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: provisionning db2195.codfw.wmnet - T343674 [10:45:17] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [10:45:25] (03CR) 10Brouberol: Generate subnet DHCP configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:45:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: provisionning db2195.codfw.wmnet - T343674 [10:45:40] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: provisionning db2195.codfw.wmnet - T343674 [10:45:49] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/661/con" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:45:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: provisionning db2195.codfw.wmnet - T343674 [10:45:59] (03PS4) 10Majavah: 
interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 [10:46:01] (03PS3) 10Majavah: interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 [10:46:03] (03PS1) 10Majavah: interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 [10:46:12] (03CR) 10Stevemunene: [C: 03+2] update druid100[7-8] reuse partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/976943 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [10:47:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/662/con" [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [10:47:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2181 in db2195 for T343674', diff saved to https://phabricator.wikimedia.org/P53739 and previous config saved to /var/cache/conftool/dbconfig/20231123-104724-arnaudb.json [10:49:38] (03CR) 10CI reject: [V: 04-1] interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [10:50:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:50:15] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2181.codfw.wmnet onto db2195.codfw.wmnet [10:50:32] (03CR) 10CI reject: [V: 04-1] interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [10:52:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: orchestrator [10:53:06] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:53:51] (03PS1) 
10Muehlenhoff: Switch orchestrator to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976966 (https://phabricator.wikimedia.org/T349619) [10:54:28] (03PS5) 10Majavah: interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 [10:54:30] (03PS4) 10Majavah: interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 [10:54:32] (03PS2) 10Majavah: interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 [10:54:34] (03PS1) 10Majavah: interface: fix absenting of post_up_command [puppet] - 10https://gerrit.wikimedia.org/r/976967 [10:57:52] (03CR) 10Muehlenhoff: [C: 03+2] Switch orchestrator to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976966 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:59:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2149.codfw.wmnet onto db2190.codfw.wmnet [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1100). [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1100) [11:03:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: orchestrator [11:04:40] (03CR) 10Vgutierrez: [C: 03+1] "thanks for catching that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/976967 (owner: 10Majavah) [11:05:37] (03CR) 10Majavah: [C: 03+2] interface: fix absenting of post_up_command [puppet] - 10https://gerrit.wikimedia.org/r/976967 (owner: 10Majavah) [11:06:29] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:06:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53740 and previous config saved to /var/cache/conftool/dbconfig/20231123-110630-arnaudb.json [11:09:20] (03PS6) 10Filippo Giunchedi: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [11:09:22] (03PS1) 10Filippo Giunchedi: rsyslog: add syslogidentifier for rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977043 (https://phabricator.wikimedia.org/T351799) [11:10:02] (03PS2) 10Filippo Giunchedi: rsyslog: add syslogidentifier for rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977043 (https://phabricator.wikimedia.org/T351799) [11:10:08] (03CR) 10Filippo Giunchedi: "Please ignore the last PS by me, mistake" [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [11:10:55] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:11:04] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:11:12] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[11:11:18] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:11:24] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:11:32] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:11:39] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [11:11:44] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:11:52] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:12:00] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [11:12:05] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:12:13] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:12:22] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:12:28] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[11:13:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [11:15:39] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: add syslogidentifier for rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977043 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [11:16:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [11:17:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:20:00] (03PS1) 10Majavah: P:acme_chief: cloud: require package for config file [puppet] - 10https://gerrit.wikimedia.org/r/977044 [11:21:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53741 and previous config saved to /var/cache/conftool/dbconfig/20231123-112135-arnaudb.json [11:22:02] (03PS9) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [11:22:04] (03PS9) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [11:22:57] (03PS2) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:23:22] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host lists1004.eqiad.wmnet [11:23:36] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:24:30] (03PS1) 10Muehlenhoff: Switch lists1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977045 
(https://phabricator.wikimedia.org/T349619) [11:24:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2175.codfw.wmnet onto db2189.codfw.wmnet [11:25:45] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [11:26:28] (03CR) 10Muehlenhoff: P:acme_chief: cloud: require package for config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977044 (owner: 10Majavah) [11:26:59] (03CR) 10Muehlenhoff: [C: 03+2] Switch lists1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977045 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:30:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host lists1004.eqiad.wmnet [11:32:15] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976804 (https://phabricator.wikimedia.org/T308142) (owner: 10Sergio Gimeno) [11:33:28] (03PS2) 10Majavah: secret: dkim: move wmcs dkim keys to correct location [labs/private] - 10https://gerrit.wikimedia.org/r/969690 [11:33:32] (03PS2) 10Majavah: hieradata: fix cloudinfra webproxy password location [labs/private] - 10https://gerrit.wikimedia.org/r/969689 [11:33:38] (03PS2) 10Majavah: hieradata: add fake metricsinfra grafana password [labs/private] - 10https://gerrit.wikimedia.org/r/969691 [11:33:44] (03PS1) 10Majavah: secret: add the project-proxy acme-chief account [labs/private] - 10https://gerrit.wikimedia.org/r/977047 [11:33:52] (03PS1) 10Muehlenhoff: Correct insetup role for lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/977048 [11:35:11] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, 10media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10AlexisJazz) >>! In T191804#9354264, @Bawolff wrote: >> There's one more use I think: the 4K transcode of htt... 
[11:35:35] (03CR) 10Muehlenhoff: [C: 03+2] Correct insetup role for lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/977048 (owner: 10Muehlenhoff) [11:36:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53742 and previous config saved to /var/cache/conftool/dbconfig/20231123-113640-arnaudb.json [11:38:46] (03PS1) 10Majavah: P:ssh::client: allow using extra_ssh_keys in cloud [puppet] - 10https://gerrit.wikimedia.org/r/977049 [11:40:01] (03CR) 10Majavah: [V: 03+2 C: 03+2] secret: add the project-proxy acme-chief account [labs/private] - 10https://gerrit.wikimedia.org/r/977047 (owner: 10Majavah) [11:43:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:45:33] godog: ^ that anycast alert is centrallog1002 [11:46:11] hah interesting, thank you taavi I'll take a look [11:46:38] the anycast healthchecker check is failing [11:47:22] 10SRE, 10Wikimedia-Mailing-lists: Revert accidental unsubscribe of all members of wikino-admin-l - https://phabricator.wikimedia.org/T351881 (10jhsoby) 05Open→03Resolved a:03jhsoby Solved by manually resubscribing everyone based off the unsubscription notifications I received as mailing list owner. All g... [11:47:36] taavi: were you able to find logs from anycast-healthchecker ? 
[11:47:51] godog: sudo tail -f /var/log/anycast-healthchecker/anycast-healthchecker.log [11:48:16] thank you [11:51:17] I think I got it, sending a patch [11:51:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53743 and previous config saved to /var/cache/conftool/dbconfig/20231123-115145-arnaudb.json [11:53:05] (03PS1) 10Filippo Giunchedi: rsyslog: deploy netdev_kafka_relay to rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977050 (https://phabricator.wikimedia.org/T351799) [11:53:43] taavi: ^ [11:53:58] looking [11:54:37] (03CR) 10Majavah: [C: 03+1] rsyslog: deploy netdev_kafka_relay to rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977050 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [11:54:40] +1 [11:54:57] cheers [11:56:00] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: deploy netdev_kafka_relay to rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977050 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [11:59:37] ok nevermind I'm going to revert that, too hasty [11:59:54] (03PS1) 10Filippo Giunchedi: Revert "rsyslog: deploy netdev_kafka_relay to rsyslog-receiver" [puppet] - 10https://gerrit.wikimedia.org/r/976912 [12:00:02] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "rsyslog: deploy netdev_kafka_relay to rsyslog-receiver" [puppet] - 10https://gerrit.wikimedia.org/r/976912 (owner: 10Filippo Giunchedi) [12:01:40] (03CR) 10Hnowlan: [C: 03+1] Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:04:08] taavi: ok now I got it :) patch incoming [12:06:02] (03PS1) 10Filippo Giunchedi: rsyslog: add missing modload to netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/977051 (https://phabricator.wikimedia.org/T351799) [12:06:08] that's it ^ 
[12:06:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53744 and previous config saved to /var/cache/conftool/dbconfig/20231123-120650-arnaudb.json [12:07:06] (03CR) 10Majavah: [C: 03+1] rsyslog: add missing modload to netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/977051 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [12:07:28] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] rsyslog: add missing modload to netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/977051 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [12:08:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2146.codfw.wmnet onto db2188.codfw.wmnet [12:10:25] ok we're back [12:10:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53745 and previous config saved to /var/cache/conftool/dbconfig/20231123-121054-arnaudb.json [12:11:08] (03CR) 10Vgutierrez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/977046 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [12:11:09] anycast-healthchecker is serious about logging, ngl [12:11:22] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host druid1008.eqiad.wmnet with OS bullseye [12:11:26] (03CR) 10Elukey: [C: 04-1] "I tried to test the rules via Thanos UI and we got into memory pressure issues, stalling these changes for the moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [12:11:44] i mean at least it's better than "withdrawing a prefix, go figure it out by yourself lol" [12:12:10] hahah indeed [12:13:46] (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-eqiad.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [12:14:02] (03CR) 10Majavah: [C: 03+2] P:toolforge::mailrelay: fix root@ on toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/971890 (owner: 10Majavah) [12:16:17] going to lunch [12:17:35] (03PS1) 10Vgutierrez: interface::ipip: Fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/977056 (https://phabricator.wikimedia.org/T351069) [12:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:19:48] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [12:21:00] (03PS1) 10Btullis: Matomo: permit public retrieval of specific CSS and JS files [puppet] - 10https://gerrit.wikimedia.org/r/977057 (https://phabricator.wikimedia.org/T349910) [12:21:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 60%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53746 and previous config saved to /var/cache/conftool/dbconfig/20231123-122155-arnaudb.json [12:22:23] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/666/con" [puppet] - 10https://gerrit.wikimedia.org/r/977057 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [12:24:21] (03CR) 10Btullis: Monitor kafka 
topic with replication factor == 1 (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [12:24:59] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:26:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53747 and previous config saved to /var/cache/conftool/dbconfig/20231123-122559-arnaudb.json [12:26:47] (03CR) 10Btullis: [C: 03+1] Define the spark-history/spark-history-test k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [12:30:16] !log depooling ncredir4001 till puppet is fixed [12:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:57] (03PS1) 10Majavah: P:puppetserver: enable CA monitoring [puppet] - 10https://gerrit.wikimedia.org/r/977066 [12:32:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/667/con" [puppet] - 10https://gerrit.wikimedia.org/r/977066 (owner: 10Majavah) [12:37:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53748 and previous config saved to /var/cache/conftool/dbconfig/20231123-123700-arnaudb.json [12:40:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bookworm [12:41:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53749 and previous config saved to 
/var/cache/conftool/dbconfig/20231123-124104-arnaudb.json [12:43:56] (03PS1) 10Phuedx: Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 [12:45:37] (03CR) 10Btullis: [C: 03+2] Remove oozie configuration from core hadoop configuration files [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [12:45:50] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2181.codfw.wmnet onto db2195.codfw.wmnet [12:48:43] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 213076 seconds left:Certificate wikipedia.com valid until 2024-02-05 03:06:03 +0000 (expires in 73 days) https://wikitech.wikimedia.org/wiki/Ncredir [12:49:15] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 342644 seconds left:Certificate wikimedia.is valid until 2024-02-11 10:59:48 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/Ncredir [12:49:25] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 324634 seconds left:Certificate *.wikipedia.bg valid until 2024-02-13 06:30:10 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [12:49:55] (03CR) 10Fabfur: [C: 03+1] interface::ipip: Fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/977056 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [12:49:57] (03PS1) 10Majavah: team-sre: add alert for unsigned puppet certificates [alerts] - 10https://gerrit.wikimedia.org/r/977076 [12:50:05] (03CR) 10Fabfur: [C: 03+1] interface::manual: Fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/977046 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [12:52:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 80%: 
Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53750 and previous config saved to /var/cache/conftool/dbconfig/20231123-125205-arnaudb.json [12:52:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53751 and previous config saved to /var/cache/conftool/dbconfig/20231123-125240-arnaudb.json [12:53:23] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:50] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53752 and previous config saved to /var/cache/conftool/dbconfig/20231123-125537-arnaudb.json [12:56:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53753 and previous config saved to /var/cache/conftool/dbconfig/20231123-125609-arnaudb.json [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1300) [13:01:28] (03PS1) 10MVernon: hiera: use envoy by default in ms clusters (nfc) [puppet] - 10https://gerrit.wikimedia.org/r/977077 (https://phabricator.wikimedia.org/T317616) [13:04:12] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977077 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [13:06:36] (03PS1) 10Marostegui: apt_repo.yaml: Allow dbproxy2* reimage [puppet] - 10https://gerrit.wikimedia.org/r/977078 (https://phabricator.wikimedia.org/T351864) [13:07:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 90%: Post clone repooling', diff saved 
to https://phabricator.wikimedia.org/P53754 and previous config saved to /var/cache/conftool/dbconfig/20231123-130710-arnaudb.json [13:07:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53755 and previous config saved to /var/cache/conftool/dbconfig/20231123-130745-arnaudb.json [13:10:36] (03CR) 10Jcrespo: [C: 03+1] "Probably a lefover" [puppet] - 10https://gerrit.wikimedia.org/r/977078 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [13:10:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53756 and previous config saved to /var/cache/conftool/dbconfig/20231123-131042-arnaudb.json [13:11:12] (03CR) 10Marostegui: [C: 03+2] apt_repo.yaml: Allow dbproxy2* reimage [puppet] - 10https://gerrit.wikimedia.org/r/977078 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [13:11:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53757 and previous config saved to /var/cache/conftool/dbconfig/20231123-131114-arnaudb.json [13:11:29] (03CR) 10Marostegui: [C: 03+1] hiera: use envoy by default in ms clusters (nfc) [puppet] - 10https://gerrit.wikimedia.org/r/977077 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [13:11:55] (03CR) 10MVernon: [C: 03+2] hiera: use envoy by default in ms clusters (nfc) [puppet] - 10https://gerrit.wikimedia.org/r/977077 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [13:16:48] (03PS1) 10Arnaudb: mariadb: notification toggle on dbhosts [puppet] - 10https://gerrit.wikimedia.org/r/976954 (https://phabricator.wikimedia.org/T343674) [13:18:22] (03PS1) 10Muehlenhoff: Show the list of hosts which will be affected by a role conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/977080 [13:21:12] (03CR) 
10Brouberol: [C: 03+2] Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [13:21:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/977080 (owner: 10Muehlenhoff) [13:21:55] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy2004.codfw.wmnet with OS bookworm [13:22:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53758 and previous config saved to /var/cache/conftool/dbconfig/20231123-132215-arnaudb.json [13:22:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53759 and previous config saved to /var/cache/conftool/dbconfig/20231123-132250-arnaudb.json [13:22:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bookworm [13:23:38] (03PS1) 10Majavah: team-wmcs: Adapt cloudlb alerts for wiki replicas [alerts] - 10https://gerrit.wikimedia.org/r/977081 (https://phabricator.wikimedia.org/T346947) [13:24:04] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Generate subnet DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:24:55] (03CR) 10Muehlenhoff: [C: 03+2] Show the list of hosts which will be affected by a role conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/977080 (owner: 10Muehlenhoff) [13:25:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53760 and previous config saved to /var/cache/conftool/dbconfig/20231123-132547-arnaudb.json [13:26:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 60%: Post 
clone repooling', diff saved to https://phabricator.wikimedia.org/P53761 and previous config saved to /var/cache/conftool/dbconfig/20231123-132619-arnaudb.json [13:28:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53762 and previous config saved to /var/cache/conftool/dbconfig/20231123-132840-arnaudb.json [13:29:17] (03PS2) 10Arnaudb: mariadb: notification toggle on dbhosts [puppet] - 10https://gerrit.wikimedia.org/r/976954 (https://phabricator.wikimedia.org/T343674) [13:30:13] (03PS3) 10Arnaudb: mariadb: notification toggle on dbhosts [puppet] - 10https://gerrit.wikimedia.org/r/976954 (https://phabricator.wikimedia.org/T343674) [13:30:15] (03CR) 10Marostegui: [C: 03+1] mariadb: notification toggle on dbhosts [puppet] - 10https://gerrit.wikimedia.org/r/976954 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:30:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host irc2002.wikimedia.org [13:31:22] (03CR) 10Arnaudb: [C: 03+2] mariadb: notification toggle on dbhosts [puppet] - 10https://gerrit.wikimedia.org/r/976954 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:32:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:33:19] (03PS1) 10Muehlenhoff: Switch irc2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977082 (https://phabricator.wikimedia.org/T349619) [13:34:01] (03PS1) 10Majavah: network: update WMCS network data [puppet] - 10https://gerrit.wikimedia.org/r/977083 [13:35:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.414 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/668/con" [puppet] - 10https://gerrit.wikimedia.org/r/977083 (owner: 10Majavah) [13:36:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch irc2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977082 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:37:13] (03CR) 10Jbond: "i thikn this could make puppet-merge a little slower (O)N where n= number of branches" [puppet] - 10https://gerrit.wikimedia.org/r/976938 (owner: 10Jbond) [13:37:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53763 and previous config saved to /var/cache/conftool/dbconfig/20231123-133755-arnaudb.json [13:39:07] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host druid1008.eqiad.wmnet with OS bullseye [13:39:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977066 (owner: 10Majavah) [13:40:30] (03PS2) 10Brouberol: Monitor kafka topic with replication factor == 1 [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) [13:40:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53764 and previous config saved to /var/cache/conftool/dbconfig/20231123-134052-arnaudb.json [13:40:57] (03CR) 10Brouberol: Monitor kafka topic with replication factor == 1 (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) 
[13:41:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2191.codfw.wmnet - T343674 [13:41:14] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [13:41:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2191.codfw.wmnet - T343674 [13:41:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53765 and previous config saved to /var/cache/conftool/dbconfig/20231123-134124-arnaudb.json [13:41:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: provisionning db2191.codfw.wmnet - T343674 [13:41:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: provisionning db2191.codfw.wmnet - T343674 [13:43:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2131 in db2191 for T343674', diff saved to https://phabricator.wikimedia.org/P53766 and previous config saved to /var/cache/conftool/dbconfig/20231123-134316-arnaudb.json [13:43:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53767 and previous config saved to /var/cache/conftool/dbconfig/20231123-134345-arnaudb.json [13:45:16] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2131.codfw.wmnet onto db2191.codfw.wmnet [13:49:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host irc2002.wikimedia.org [13:53:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53768 and previous config saved to 
/var/cache/conftool/dbconfig/20231123-135300-arnaudb.json [13:53:13] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:puppetserver: enable CA monitoring [puppet] - 10https://gerrit.wikimedia.org/r/977066 (owner: 10Majavah) [13:53:54] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy2004.codfw.wmnet with OS bookworm [13:54:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bookworm [13:55:45] (03PS1) 10Arnaudb: mariadb: replace db1147 by db1247 on s4 [puppet] - 10https://gerrit.wikimedia.org/r/976956 (https://phabricator.wikimedia.org/T344036) [13:55:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53769 and previous config saved to /var/cache/conftool/dbconfig/20231123-135557-arnaudb.json [13:56:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53770 and previous config saved to /var/cache/conftool/dbconfig/20231123-135629-arnaudb.json [13:58:20] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) 05In progress→03Resolved a:03MatthewVernon I think this is now done - ms clusters default to using envoy (I've not done anything to beta, but it should carry on using... 
[13:58:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host irc1002.wikimedia.org [13:58:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53771 and previous config saved to /var/cache/conftool/dbconfig/20231123-135850-arnaudb.json [13:59:49] (03PS1) 10Muehlenhoff: Switch irc1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977086 (https://phabricator.wikimedia.org/T349619) [14:00:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch irc1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977086 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:08:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 60%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53772 and previous config saved to /var/cache/conftool/dbconfig/20231123-140805-arnaudb.json [14:09:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host irc1002.wikimedia.org [14:11:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53773 and previous config saved to /var/cache/conftool/dbconfig/20231123-141102-arnaudb.json [14:11:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53774 and previous config saved to /var/cache/conftool/dbconfig/20231123-141134-arnaudb.json [14:11:48] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [14:12:19] jouncebot: help [14:12:19] **** JounceBot Help **** [14:12:19] JounceBot is a deployment helper bot for the Wikimedia movement. 
[14:12:20] Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot [14:12:20] Available commands: [14:12:20] HELP Print all commands known to the server. [14:12:20] NEXT Get the next deployment event(s if they happen at the same time). [14:12:20] NOW Get the current deployment event(s) or the time until the next. [14:12:20] NOWANDNEXT Get the current and next deployment event(s). [14:12:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2004.codfw.wmnet with reason: host reimage [14:12:21] REFRESH Refresh my knowledge about deployments. [14:12:27] jouncebot: nowandnext [14:12:28] No deployments scheduled for the next 2 hour(s) and 47 minute(s) [14:12:28] In 2 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1700) [14:13:18] (03CR) 10Btullis: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [14:13:19] (it’s also Thanksgiving, so it’s supposed to be a no-deploy day AFAIK) [14:13:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53775 and previous config saved to /var/cache/conftool/dbconfig/20231123-141355-arnaudb.json [14:14:54] I was planning to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976804. It enables a recommendation suggester that we announced we would enable yesterday, but we were waiting for the recommendation model to produce results. Would that be OK with any SREs around?
[14:15:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2004.codfw.wmnet with reason: host reimage [14:16:34] (03CR) 10Ssingh: [C: 03+1] interface::manual: Fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/977046 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:16:43] (03CR) 10JMeybohm: [C: 03+2] Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:16:50] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1114.eqiad.wmnet [14:16:51] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1114.eqiad.wmnet [14:17:38] (03Merged) 10jenkins-bot: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:17:41] !log swap cp1114 <-> cp1089 (T349244) [14:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:46] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [14:20:58] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:21:15] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:21:20] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:21:39] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:21:56] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1115.eqiad.wmnet [14:21:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1115.eqiad.wmnet [14:22:42] !log swap cp1115 <-> cp1090 (T349244) [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] T349244: Q1:Install 
cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [14:23:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53776 and previous config saved to /var/cache/conftool/dbconfig/20231123-142310-arnaudb.json [14:23:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [14:23:24] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:37] Lucas_WMDE: you are right, I missed that yesterday. I was hoping for some non-US SRE to be around and give a green light. [14:24:00] (03PS1) 10Cathal Mooney: Remove some networks from hiera no longer in use, rename others [puppet] - 10https://gerrit.wikimedia.org/r/977087 (https://phabricator.wikimedia.org/T351059) [14:24:50] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:09] sergi0: ok, good luck :) (I can’t help with that) [14:25:49] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:26:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53777 and previous config saved to /var/cache/conftool/dbconfig/20231123-142607-arnaudb.json [14:26:16] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:26:23] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:26:39] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:26:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53778 
and previous config saved to /var/cache/conftool/dbconfig/20231123-142639-arnaudb.json [14:26:57] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [14:27:12] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host druid1008.eqiad.wmnet with OS bullseye [14:28:58] Any SREs around have objections with the deploy of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976804? [14:29:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53779 and previous config saved to /var/cache/conftool/dbconfig/20231123-142900-arnaudb.json [14:29:41] sergi0: why is that urgent enough to be deployed on a no-deploy day? [14:29:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2175 repooling api', diff saved to https://phabricator.wikimedia.org/P53780 and previous config saved to /var/cache/conftool/dbconfig/20231123-142950-arnaudb.json [14:30:32] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:30:52] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:30:57] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:31:07] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:31:40] sergi0: for reference: https://wikitech.wikimedia.org/wiki/Deployments/Emergencies [14:31:59] (03PS10) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [14:32:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'temporary depool of db1242 to fix API', diff saved to https://phabricator.wikimedia.org/P53781 and previous config saved to /var/cache/conftool/dbconfig/20231123-143238-arnaudb.json [14:32:48] taavi: the urgency is because of 
its release date was announced in Tech/News. But it's not strictly falling into those categories. That's why I'm asking explicitly :) [14:34:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53782 and previous config saved to /var/cache/conftool/dbconfig/20231123-143427-arnaudb.json [14:35:07] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 (10cmooney) [14:36:16] (03CR) 10Majavah: [C: 03+1] "great minds think alike, etc, etc, https://gerrit.wikimedia.org/r/c/operations/puppet/+/977083" [puppet] - 10https://gerrit.wikimedia.org/r/977087 (https://phabricator.wikimedia.org/T351059) (owner: 10Cathal Mooney) [14:36:35]  [14:36:40] wrong paste [14:37:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2004.codfw.wmnet with OS bookworm [14:38:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53783 and previous config saved to /var/cache/conftool/dbconfig/20231123-143815-arnaudb.json [14:38:24] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:40] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1008.eqiad.wmnet with OS bullseye [14:40:11] (03CR) 10Btullis: [V: 03+1 C: 03+2] Matomo: permit public retrieval of specific CSS and JS files [puppet] - 10https://gerrit.wikimedia.org/r/977057 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:40:55] sergi0: i know it's a shiny new feature, but to me it seems like the exact opposite of what the procedure says is ok 
(prioritize availability over new features), and it's starting to near the end of the EU work day [14:41:02] (03Abandoned) 10Cathal Mooney: Remove some networks from hiera no longer in use, rename others [puppet] - 10https://gerrit.wikimedia.org/r/977087 (https://phabricator.wikimedia.org/T351059) (owner: 10Cathal Mooney) [14:41:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53784 and previous config saved to /var/cache/conftool/dbconfig/20231123-144112-arnaudb.json [14:41:27] (03CR) 10Ssingh: [C: 03+1] interface::ipip: Fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/977056 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:42:01] (03CR) 10Cathal Mooney: [C: 03+1] "Nice! I forgot about constants.pp good stuff :)" [puppet] - 10https://gerrit.wikimedia.org/r/977083 (owner: 10Majavah) [14:42:03] (03CR) 10Brouberol: [C: 03+2] Monitor kafka topic with replication factor == 1 [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [14:42:17] (03CR) 10Majavah: [V: 03+1 C: 03+2] network: update WMCS network data [puppet] - 10https://gerrit.wikimedia.org/r/977083 (owner: 10Majavah) [14:43:09] (03PS3) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:43:13] (03Merged) 10jenkins-bot: Monitor kafka topic with replication factor == 1 [alerts] - 10https://gerrit.wikimedia.org/r/976935 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [14:43:46] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:44:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 60%: Post clone repooling', diff saved to 
https://phabricator.wikimedia.org/P53785 and previous config saved to /var/cache/conftool/dbconfig/20231123-144405-arnaudb.json [14:44:29] (03PS2) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) [14:44:58] (03CR) 10Ssingh: "[revised]: added a script that checks for minimum pooled threshold." [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:45:27] (03PS1) 10Muehlenhoff: statistics::web: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977088 [14:46:43] taavi: alright, I understand. I will schedule it early next week. Thanks for answering :) [14:47:16] RECOVERY - Host lsw1-e6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 7.45 ms [14:47:22] RECOVERY - Host lsw1-e6-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:49:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53786 and previous config saved to /var/cache/conftool/dbconfig/20231123-144932-arnaudb.json [14:50:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[14:51:43] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1008.eqiad.wmnet with reason: host reimage [14:52:49] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10ayounsi) [14:53:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53787 and previous config saved to /var/cache/conftool/dbconfig/20231123-145320-arnaudb.json [14:53:24] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:00] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1008.eqiad.wmnet with reason: host reimage [14:56:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53788 and previous config saved to /var/cache/conftool/dbconfig/20231123-145617-arnaudb.json [14:58:42] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove all remaining references to oozie and clean up [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [14:59:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53789 and previous config saved to /var/cache/conftool/dbconfig/20231123-145910-arnaudb.json [14:59:27] PROBLEM - Check systemd state on kubernetes2036 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:46] Does anyone know where/if the user counts for
https://www.mediawiki.org/wiki/Extension:BetaFeatures are visible? I am curious about the job for updating them [15:01:12] hnowlan: those are shown on the preferences section for each feature, see for example https://www.mediawiki.org/wiki/Special:Preferences#mw-prefsection-betafeatures [15:01:27] PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:19] taavi: excellent, thank you! [15:02:44] (03CR) 10Filippo Giunchedi: [C: 03+1] team-wmcs: Adapt cloudlb alerts for wiki replicas (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/977081 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [15:04:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53790 and previous config saved to /var/cache/conftool/dbconfig/20231123-150437-arnaudb.json [15:05:55] (03PS2) 10Majavah: team-wmcs: Adapt cloudlb alerts for wiki replicas [alerts] - 10https://gerrit.wikimedia.org/r/977081 (https://phabricator.wikimedia.org/T346947) [15:06:23] (03CR) 10Majavah: [C: 03+2] team-wmcs: Adapt cloudlb alerts for wiki replicas (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/977081 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [15:06:29] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:08:24] (03Merged) 10jenkins-bot: team-wmcs: Adapt cloudlb alerts for wiki replicas [alerts] - 10https://gerrit.wikimedia.org/r/977081 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [15:08:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Post clone repooling', diff saved to 
https://phabricator.wikimedia.org/P53792 and previous config saved to /var/cache/conftool/dbconfig/20231123-150825-arnaudb.json [15:11:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53793 and previous config saved to /var/cache/conftool/dbconfig/20231123-151122-arnaudb.json [15:11:58] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1008.eqiad.wmnet with OS bullseye [15:13:35] (03PS1) 10Filippo Giunchedi: rsyslog: move centrallog to ossl [puppet] - 10https://gerrit.wikimedia.org/r/977090 (https://phabricator.wikimedia.org/T351710) [15:14:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53794 and previous config saved to /var/cache/conftool/dbconfig/20231123-151415-arnaudb.json [15:16:56] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) When we introduced the sre.hosts.provision cookbook we envision Piling many changes together simplifies the user interaction but leaves a lot of open questions... [15:18:05] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: move centrallog to ossl [puppet] - 10https://gerrit.wikimedia.org/r/977090 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [15:19:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53795 and previous config saved to /var/cache/conftool/dbconfig/20231123-151942-arnaudb.json [15:25:39] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) Great work! I had some thoughts on this, more around the latter pieces than the workflow itself. 
In terms of the proposed cookbook, do you envision it running... [15:25:45] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1048 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:27:37] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:29:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53796 and previous config saved to /var/cache/conftool/dbconfig/20231123-152920-arnaudb.json [15:29:20] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) >>! In T351891#9355490, @Volans wrote: > In addition I think that we need to solve first another problem, that is a pre-requisite for this and other similar req... [15:30:57] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) >>! In T351891#9355523, @cmooney wrote: >>>! In T351891#9355490, @Volans wrote: >> In addition I think that we need to solve first another problem, that is a pre... 
[15:34:19] SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi)
[15:34:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53797 and previous config saved to /var/cache/conftool/dbconfig/20231123-153447-arnaudb.json
[15:35:39] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (cmooney) >>! In T351891#9355525, @Volans wrote: > How does the cookbook know which spec table to use for a given host? User-input? Then we're back to square one. As Arz...
[15:44:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1243 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53798 and previous config saved to /var/cache/conftool/dbconfig/20231123-154425-arnaudb.json
[15:49:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53799 and previous config saved to /var/cache/conftool/dbconfig/20231123-154952-arnaudb.json
[15:53:10] (CR) Klausman: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/977092 (owner: Klausman)
[15:54:58] (PS1) Volans: sre.hosts.reimage: improve puppet 5to7 migrtion [cookbooks] - https://gerrit.wikimedia.org/r/977094
[15:59:19] (CR) Elukey: [C: +1] ml-services: remove experimental article-descriptions service [deployment-charts] - https://gerrit.wikimedia.org/r/977092 (owner: Klausman)
[15:59:53] (CR) Jbond: [C: +1] "lgtm thanks" [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:00:12] (CR) Elukey: sre.hosts.reimage: improve puppet 5to7 migrtion (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:00:32] (CR) JMeybohm: [C: -1] sre.hosts.reimage: improve puppet 5to7 migrtion (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:04:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53801 and previous config saved to /var/cache/conftool/dbconfig/20231123-160457-arnaudb.json
[16:07:27] (PS2) Volans: sre.hosts.reimage: improve puppet 5to7 migration [cookbooks] - https://gerrit.wikimedia.org/r/977094
[16:07:40] (CR) Klausman: [C: +2] ml-services: remove experimental article-descriptions service [deployment-charts] - https://gerrit.wikimedia.org/r/977092 (owner: Klausman)
[16:08:37] (Merged) jenkins-bot: ml-services: remove experimental article-descriptions service [deployment-charts] - https://gerrit.wikimedia.org/r/977092 (owner: Klausman)
[16:08:43] (PS1) Filippo Giunchedi: rsyslog: get ::conf to notify the correct instance [puppet] - https://gerrit.wikimedia.org/r/977095 (https://phabricator.wikimedia.org/T351799)
[16:08:45] (PS1) Filippo Giunchedi: rsyslog: move netdev_kafka_relay to rsyslog-receiver [puppet] - https://gerrit.wikimedia.org/r/977096 (https://phabricator.wikimedia.org/T351799)
[16:13:20] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:13:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1160.eqiad.wmnet with OS bullseye
[16:13:53] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye
[16:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:20:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53802 and previous config saved to /var/cache/conftool/dbconfig/20231123-162002-arnaudb.json
[16:21:43] (CR) JMeybohm: [C: +1] sre.hosts.reimage: improve puppet 5to7 migration (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:22:42] (CR) Elukey: "Left some comments to understand the code in a better way, hope to not have asked silly questions :)" [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[16:24:20] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (Volans) >>! In T351891#9355538, @cmooney wrote: > As Arzhel defined it there would be one table, and the host the script (be that existing Netbox ProvisionServerNetwork...
[16:24:49] (CR) Volans: [C: +2] sre.hosts.reimage: improve puppet 5to7 migration (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:25:15] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:29:19] (Merged) jenkins-bot: sre.hosts.reimage: improve puppet 5to7 migration [cookbooks] - https://gerrit.wikimedia.org/r/977094 (owner: Volans)
[16:30:40] (CR) Giuseppe Lavagetto: "Overall LGTM; I would suggest splitting off activating one job to a second patch for ease of revert, but up to you how to proceed. See a c" [deployment-charts] - https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:32:36] (PS7) Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796)
[16:33:56] (CR) Hnowlan: changeprop: add config support for migration to k8s jobrunners (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:34:04] (PS1) Hnowlan: jobqueue: migrate first job to Kubernetes jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/977099 (https://phabricator.wikimedia.org/T349796)
[16:34:40] (CR) Hnowlan: [C: +2] changeprop: add config support for migration to k8s jobrunners [deployment-charts] - https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:35:07] (PS1) Volans: sre.hosts.reimage: fqdn required [cookbooks] - https://gerrit.wikimedia.org/r/977100
[16:35:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53804 and previous config saved to /var/cache/conftool/dbconfig/20231123-163507-arnaudb.json
[16:35:21] (CR) JMeybohm: [C: +1] sre.hosts.reimage: fqdn required [cookbooks] - https://gerrit.wikimedia.org/r/977100 (owner: Volans)
[16:35:30] (Merged) jenkins-bot: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:39:00] (CR) Volans: [C: +2] sre.hosts.reimage: fqdn required [cookbooks] - https://gerrit.wikimedia.org/r/977100 (owner: Volans)
[16:40:51] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[16:41:08] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:41:43] (PS1) Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054)
[16:41:59] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[16:42:27] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:43:32] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw2420.codfw.wmnet with OS bullseye
[16:43:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[16:44:32] SRE, Cloud-VPS, Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (M2k_dewiki) Resolved→Open Hello, https://templatetransclusioncheck.toolforge.org/ https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vorlage:...
[16:44:43] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:47:09] (PS2) Hnowlan: jobqueue: migrate first job to Kubernetes jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/977099 (https://phabricator.wikimedia.org/T349796)
[16:49:05] (PS9) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[16:49:07] (PS6) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363)
[16:49:09] (PS5) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363)
[16:52:05] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (cmooney) >>! In T351891#9355575, @Volans wrote: >>>! In T351891#9355538, @cmooney wrote: >> As Arzhel defined it there would be one table, and the host the script (be th...
[17:00:05] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1700). nyaa~
[17:00:05] No Gerrit patches in the queue for this window AFAICS.
[17:01:45] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2420.codfw.wmnet with reason: host reimage
[17:03:00] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-worker1160.eqiad.wmnet with OS bullseye
[17:04:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2420.codfw.wmnet with reason: host reimage
[17:06:22] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (cmooney) @robh @Jclark-ctr I kicked off the reimage of an-worker1160 again. I think the problem here wasn't actually an error on the DHCP config, but a problem we have...
[17:11:00] RECOVERY - Host lsw1-e5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms
[17:11:00] RECOVERY - Host lsw1-e5-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 22.94 ms
[17:11:00] RECOVERY - Host lsw1-e7-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 24.01 ms
[17:11:06] RECOVERY - Host lsw1-e7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 8.84 ms
[17:13:45] (Device rebooted) firing: Alert for device ps1-a6-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[17:18:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2131.codfw.wmnet onto db2191.codfw.wmnet
[17:18:45] (Device rebooted) resolved: Device ps1-a6-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[17:19:39] (PS10) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[17:19:41] (PS7) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[17:25:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2420.codfw.wmnet with OS bullseye
[17:30:51] SRE-tools, Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (Volans) Some random additions: * I would probably add a grep for the IP on at least `/etc` on the host too to check if it's hardcoded somewhere else in addition to `/etc/nework/interf...
[17:34:57] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/976958
[17:44:28] !log repool ncredir4001
[17:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:42] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:00:07] bd808: May I have your attention please! Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1800)
[18:00:07] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231123T1800)
[18:24:50] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:46:59] (CR) Andrea Denisse: [C: +1] "LGTM, thank you." [puppet] - https://gerrit.wikimedia.org/r/977095 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[18:48:07] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/977096 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[18:58:52] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 15871MiB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[19:06:45] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:19:20] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[20:08:50] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 11.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:25:15] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:25:19] (PS2) Majavah: Add virtual domain mapping for OATHAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/966598 (https://phabricator.wikimedia.org/T348484)
[20:31:42] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:56:26] SRE, Infrastructure-Foundations, netbox: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (Volans) Open→Resolved p:Triage→High Thanks for reporting this. The issue was caused by a bug in one of the new custom validators that was hit only during the creation...
[21:00:07] SRE, SRE-tools, Infrastructure-Foundations, Patch-For-Review: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (Volans) Trying to run the import puppetdb script on `cloudgw1002 ` is now a noop, but for `cloudgw2002-dev` fails with this exception: `lang=python...
[21:01:25] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (Volans) Open→Resolved The change has been merged and released with Spicerack v7.3.0 on Oct. 4th. Res...
[21:03:57] SRE-tools, Infrastructure-Foundations: wmflib: improve interactive.ask_input to support free-form responses - https://phabricator.wikimedia.org/T327408 (Volans) Open→Resolved This was fixed in wmflib v1.2.1 released on Feb. 2nd.
[21:28:34] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:36:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:53:22] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 28032MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[22:13:50] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[22:25:44] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:06:46] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:30:56] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state