[00:01:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:04:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [00:07:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67649 and previous config saved to /var/cache/conftool/dbconfig/20240823-000738-ladsgroup.json [00:09:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [00:12:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T371742)', diff saved to https://phabricator.wikimedia.org/P67650 and previous config saved to /var/cache/conftool/dbconfig/20240823-001219-ladsgroup.json [00:12:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [00:12:29] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:12:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2139.codfw.wmnet with reason: Maintenance [00:14:51] (CR) Andrew Bogott: [C:+2] pdns.conf.erb: secondary=yes [puppet] - https://gerrit.wikimedia.org/r/1065037 (owner: Andrew Bogott) [00:20:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1065038 (owner: TrainBranchBot) [00:22:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67651 and previous config saved to /var/cache/conftool/dbconfig/20240823-002245-ladsgroup.json [00:28:17] !log rebooting puppetserver1003.eqiad.wmnet from mgmt console; It's unresponsive and causing puppet errors on clients. [00:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:04] jhathaway: fyi ^^ [00:34:27] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:10] (CR) Andrew Bogott: [C:+2] cinderutils: add --allow-unattended-format when preparing volumes [puppet] - https://gerrit.wikimedia.org/r/1056606 (https://phabricator.wikimedia.org/T371573) (owner: Dzahn) [00:37:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T370903)', diff saved to https://phabricator.wikimedia.org/P67652 and previous config saved to /var/cache/conftool/dbconfig/20240823-003753-ladsgroup.json [00:37:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1249.eqiad.wmnet with reason: Maintenance [00:37:57] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:38:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
8:00:00 on db1249.eqiad.wmnet with reason: Maintenance [00:38:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T370903)', diff saved to https://phabricator.wikimedia.org/P67653 and previous config saved to /var/cache/conftool/dbconfig/20240823-003815-ladsgroup.json [00:44:27] RESOLVED: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:27] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:49:51] !log krinkle@deploy1003 Started deploy [integration/docroot@da4dac4]: (no justification provided) [00:49:57] !log krinkle@deploy1003 Finished deploy [integration/docroot@da4dac4]: (no justification provided) (duration: 00m 06s) [01:01:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T370903)', diff saved to https://phabricator.wikimedia.org/P67655 and previous config saved to /var/cache/conftool/dbconfig/20240823-010144-ladsgroup.json [01:01:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [01:16:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P67656 and previous config saved to /var/cache/conftool/dbconfig/20240823-011651-ladsgroup.json [01:31:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P67657 and previous config saved to 
/var/cache/conftool/dbconfig/20240823-013158-ladsgroup.json [01:42:07] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10086764 (tstarling) >>! In T292322#9614342, @Joe wrote: > @tstarling I think we determined that the expensive part of handling large files in shellbox was mostly the do... [01:47:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T370903)', diff saved to https://phabricator.wikimedia.org/P67658 and previous config saved to /var/cache/conftool/dbconfig/20240823-014706-ladsgroup.json [01:47:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:47:10] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [01:47:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:52:56] (CR) Dzahn: "thanks!:)" [puppet] - https://gerrit.wikimedia.org/r/1056606 (https://phabricator.wikimedia.org/T371573) (owner: Dzahn) [01:53:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [01:54:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [01:54:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T371742)', diff saved to https://phabricator.wikimedia.org/P67659 and previous config saved to /var/cache/conftool/dbconfig/20240823-015417-ladsgroup.json [01:54:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:54:36] (PS3) Srishakatux: Add site entry for mnwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [01:55:29] (CR) Srishakatux: "@amir.aharoni@mail.huji.ac.il There is a patch for namespace fix here: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1060895" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux) [02:12:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [02:12:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [02:12:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T370903)', diff saved to https://phabricator.wikimedia.org/P67660 and previous config saved to /var/cache/conftool/dbconfig/20240823-021231-ladsgroup.json [02:12:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:40:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T370903)', diff saved to https://phabricator.wikimedia.org/P67661 and previous config saved to /var/cache/conftool/dbconfig/20240823-024058-ladsgroup.json [02:41:03] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:56:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P67662 and previous config saved to /var/cache/conftool/dbconfig/20240823-025605-ladsgroup.json [03:02:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:05] FIRING: 
KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:11:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P67663 and previous config saved to /var/cache/conftool/dbconfig/20240823-031113-ladsgroup.json [03:26:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T370903)', diff saved to https://phabricator.wikimedia.org/P67664 and previous config saved to /var/cache/conftool/dbconfig/20240823-032620-ladsgroup.json [03:26:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:26:24] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [03:26:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:26:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T370903)', diff saved to https://phabricator.wikimedia.org/P67665 and previous config saved to /var/cache/conftool/dbconfig/20240823-032642-ladsgroup.json [03:39:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T371742)', diff saved to https://phabricator.wikimedia.org/P67666 and previous config saved to /var/cache/conftool/dbconfig/20240823-033932-ladsgroup.json [03:39:36] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:54:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to 
https://phabricator.wikimedia.org/P67667 and previous config saved to /var/cache/conftool/dbconfig/20240823-035439-ladsgroup.json [03:56:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T370903)', diff saved to https://phabricator.wikimedia.org/P67668 and previous config saved to /var/cache/conftool/dbconfig/20240823-035611-ladsgroup.json [03:56:15] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:01:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:09:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P67669 and previous config saved to /var/cache/conftool/dbconfig/20240823-040947-ladsgroup.json [04:11:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P67670 and previous config saved to /var/cache/conftool/dbconfig/20240823-041118-ladsgroup.json [04:22:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:24:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T371742)', diff saved to https://phabricator.wikimedia.org/P67671 and previous config saved to /var/cache/conftool/dbconfig/20240823-042454-ladsgroup.json [04:24:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:24:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - 
https://phabricator.wikimedia.org/T371742 [04:25:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:25:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:25:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:25:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T371742)', diff saved to https://phabricator.wikimedia.org/P67672 and previous config saved to /var/cache/conftool/dbconfig/20240823-042531-ladsgroup.json [04:26:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P67673 and previous config saved to /var/cache/conftool/dbconfig/20240823-042625-ladsgroup.json [04:37:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:41:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T370903)', diff saved to https://phabricator.wikimedia.org/P67674 and previous config saved to /var/cache/conftool/dbconfig/20240823-044132-ladsgroup.json [04:41:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [04:41:37] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:41:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [04:44:27] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed 
probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:16:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [05:17:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [05:17:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T370903)', diff saved to https://phabricator.wikimedia.org/P67675 and previous config saved to /var/cache/conftool/dbconfig/20240823-051718-ladsgroup.json [05:17:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [05:49:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T370903)', diff saved to https://phabricator.wikimedia.org/P67676 and previous config saved to /var/cache/conftool/dbconfig/20240823-054940-ladsgroup.json [05:49:50] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [05:50:22] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10086879 (tstarling) Open→In progress a:tstarling [05:52:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240823T0600) [06:04:21] FIRING: 
PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P67677 and previous config saved to /var/cache/conftool/dbconfig/20240823-060447-ladsgroup.json [06:07:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr3-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T371742)', diff saved to https://phabricator.wikimedia.org/P67678 and previous config saved to /var/cache/conftool/dbconfig/20240823-061235-ladsgroup.json [06:12:39] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:19:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P67679 and previous config saved to /var/cache/conftool/dbconfig/20240823-061954-ladsgroup.json [06:19:57] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cephosd1005to failed in Netbox - ayounsi@cumin1002" 
[06:20:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cephosd1005to failed in Netbox - ayounsi@cumin1002" [06:23:02] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Juniper alarms (instance cr1-eqiad) - https://phabricator.wikimedia.org/T373166 (LSobanski) NEW [06:27:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P67680 and previous config saved to /var/cache/conftool/dbconfig/20240823-062742-ladsgroup.json [06:35:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T370903)', diff saved to https://phabricator.wikimedia.org/P67681 and previous config saved to /var/cache/conftool/dbconfig/20240823-063502-ladsgroup.json [06:35:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [06:35:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [06:35:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [06:35:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [06:35:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [06:35:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T370903)', diff saved to https://phabricator.wikimedia.org/P67682 and previous config saved to /var/cache/conftool/dbconfig/20240823-063539-ladsgroup.json [06:42:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P67683 and previous config saved to 
/var/cache/conftool/dbconfig/20240823-064249-ladsgroup.json [06:52:10] (CR) Isabelle Hurbain-Palatin: [C:+1] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: C. Scott Ananian) [06:57:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T371742)', diff saved to https://phabricator.wikimedia.org/P67684 and previous config saved to /var/cache/conftool/dbconfig/20240823-065756-ladsgroup.json [06:57:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [06:58:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:58:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [06:58:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T371742)', diff saved to https://phabricator.wikimedia.org/P67685 and previous config saved to /var/cache/conftool/dbconfig/20240823-065819-ladsgroup.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240823T0700) [07:05:05] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:08:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T370903)', diff saved to https://phabricator.wikimedia.org/P67686 and previous config saved to /var/cache/conftool/dbconfig/20240823-070832-ladsgroup.json [07:08:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [07:20:50] (CR) Filippo Giunchedi: [C:+1] curator: free up space to safely restart daemons [puppet] - https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: Tiziano Fogli) [07:22:43] (CR) Filippo Giunchedi: [C:+1] "LGTM, can be merged at any time" [puppet] - https://gerrit.wikimedia.org/r/1064821 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse) [07:22:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr3-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:23:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P67687 and previous config saved to /var/cache/conftool/dbconfig/20240823-072339-ladsgroup.json [07:25:33] (CR) Filippo Giunchedi: [C:+1] "LGTM, should be merged ahead of failovers" [puppet] - https://gerrit.wikimedia.org/r/1064818 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse) [07:25:49] FIRING: [2x] 
PuppetFailure: Puppet has failed on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:27:16] !log start prometheus1006 bookworm upgrade - T326657 [07:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:20] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [07:35:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:38:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P67688 and previous config saved to /var/cache/conftool/dbconfig/20240823-073846-ladsgroup.json [07:39:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:53:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T370903)', diff saved to https://phabricator.wikimedia.org/P67689 and previous config saved to /var/cache/conftool/dbconfig/20240823-075353-ladsgroup.json [07:53:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [07:53:57] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [07:54:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [07:54:16] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T370903)', diff saved to https://phabricator.wikimedia.org/P67690 and previous config saved to /var/cache/conftool/dbconfig/20240823-075415-ladsgroup.json [07:58:19] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [08:02:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:04:52] (PS1) Jgiannelos: restbase: Update mobileapps service hostname on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/1065123 (https://phabricator.wikimedia.org/T370460) [08:07:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [08:10:44] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:33] (PS1) Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 
https://gerrit.wikimedia.org/r/1065126 (https://phabricator.wikimedia.org/T373168) [08:17:11] (PS1) Filippo Giunchedi: jaeger: enable tags-as-fields for query and collector [deployment-charts] - https://gerrit.wikimedia.org/r/1065127 (https://phabricator.wikimedia.org/T372411) [08:17:18] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:19:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T370903)', diff saved to https://phabricator.wikimedia.org/P67691 and previous config saved to /var/cache/conftool/dbconfig/20240823-082707-ladsgroup.json [08:27:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [08:27:28] (CR) Jelto: [C:+1] "lgtm, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1064755 (https://phabricator.wikimedia.org/T371222) (owner: EoghanGaffney) [08:27:43] (CR) EoghanGaffney: [C:+2] gitlab: Allow backup script metrics call to fail [puppet] - https://gerrit.wikimedia.org/r/1064755 (https://phabricator.wikimedia.org/T371222) (owner: EoghanGaffney) [08:34:25] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:42:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P67692 and previous config saved to /var/cache/conftool/dbconfig/20240823-084214-ladsgroup.json [08:45:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T371742)', diff saved to https://phabricator.wikimedia.org/P67693 and previous config saved to /var/cache/conftool/dbconfig/20240823-084506-ladsgroup.json [08:45:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:48:49] (CR) Btullis: [C:+2] cephosd: Assemble the MD RAID arrays, so that they can be removed [puppet] - https://gerrit.wikimedia.org/r/1064807 (https://phabricator.wikimedia.org/T372783) (owner: Btullis) [08:49:38] (PS2) Btullis: cephosd: Assemble the MD RAID arrays, so that they can be removed [puppet] - https://gerrit.wikimedia.org/r/1064807 (https://phabricator.wikimedia.org/T372783) [08:49:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:59] (03CR) 10Btullis: [C:03+2] cephosd: Assemble the MD RAID arrays, so that they can be removed [puppet] - 10https://gerrit.wikimedia.org/r/1064807 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [08:54:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [08:54:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P67694 and previous config saved to /var/cache/conftool/dbconfig/20240823-085722-ladsgroup.json [08:59:31] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [08:59:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [09:00:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P67695 and previous config saved to /var/cache/conftool/dbconfig/20240823-090014-ladsgroup.json [09:04:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:40] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T370903)', diff saved to https://phabricator.wikimedia.org/P67696 and previous config saved to /var/cache/conftool/dbconfig/20240823-091229-ladsgroup.json [09:12:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:12:33] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:12:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:12:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T370903)', diff saved to https://phabricator.wikimedia.org/P67697 and previous config saved to /var/cache/conftool/dbconfig/20240823-091251-ladsgroup.json [09:14:24] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1065132 (https://phabricator.wikimedia.org/T373173) [09:15:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P67698 and previous config saved to /var/cache/conftool/dbconfig/20240823-091521-ladsgroup.json [09:16:42] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1065133 (https://phabricator.wikimedia.org/T373174) [09:18:07] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1065134 (https://phabricator.wikimedia.org/T373175) [09:24:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T371742)', diff saved to https://phabricator.wikimedia.org/P67699 and previous config saved to /var/cache/conftool/dbconfig/20240823-093028-ladsgroup.json [09:30:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [09:30:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:30:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [09:30:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67700 and previous config saved to /var/cache/conftool/dbconfig/20240823-093050-ladsgroup.json [09:35:51] (03CR) 10Btullis: [C:03+1] deployment_server: change the PG image tag to timestamp-sha@checksum [puppet] - 10https://gerrit.wikimedia.org/r/1064779 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [09:36:08] (03CR) 10Brouberol: [C:03+2] deployment_server: change the PG image tag to timestamp-sha@checksum [puppet] - 10https://gerrit.wikimedia.org/r/1064779 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [09:38:10] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2025.codfw.wmnet [09:38:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2025.codfw.wmnet [09:39:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2025.codfw.wmnet with OS bullseye [09:39:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row 
A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [09:39:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [09:39:47] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:41:13] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065140 (https://phabricator.wikimedia.org/T372869) [09:41:53] (03PS1) 10Btullis: cephosd: Fix the grep for finding MD array members [puppet] - 10https://gerrit.wikimedia.org/r/1065143 (https://phabricator.wikimedia.org/T372783) [09:42:35] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [09:44:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T370903)', diff saved to https://phabricator.wikimedia.org/P67701 and previous config saved to /var/cache/conftool/dbconfig/20240823-094445-ladsgroup.json [09:44:49] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:45:33] (03CR) 10Btullis: [C:03+2] cephosd: Fix the grep for finding MD array members [puppet] - 10https://gerrit.wikimedia.org/r/1065143 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [09:47:53] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T373133#10087200 (10BTullis) a:05BTullis→03None [09:48:13] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-tool1010.eqiad.wmnet - 
https://phabricator.wikimedia.org/T373177 (10BTullis) 03NEW [09:49:12] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [09:49:50] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2025 - cgoubert@cumin1002" [09:49:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2025 - cgoubert@cumin1002" [09:49:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:55] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2025.codfw.wmnet 168.0.192.10.in-addr.arpa 8.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:49:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2025.codfw.wmnet 168.0.192.10.in-addr.arpa 8.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:49:59] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2025 [09:50:08] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission dbproxy101[8-9].eqiad.wmnet - https://phabricator.wikimedia.org/T373178 (10BTullis) 03NEW [09:50:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2025 [09:50:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [09:51:32] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-coord100[1-2] - https://phabricator.wikimedia.org/T373179 (10BTullis) 03NEW [09:51:54] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-coord100[1-2] - https://phabricator.wikimedia.org/T373179#10087254 (10BTullis) 
[09:52:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2012.codfw.wmnet [09:59:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P67702 and previous config saved to /var/cache/conftool/dbconfig/20240823-095952-ladsgroup.json [10:00:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2012.codfw.wmnet [10:00:20] (CR) Kosta Harlan: [C:+1] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1065140 (https://phabricator.wikimedia.org/T372869) (owner: STran) [10:00:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2012.codfw.wmnet with OS bullseye [10:01:00] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087262 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[10:01:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [10:01:28] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:04:04] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [10:04:25] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [10:06:10] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2012 - cgoubert@cumin1002" [10:06:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2012 - cgoubert@cumin1002" [10:06:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:15] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2012.codfw.wmnet 67.0.192.10.in-addr.arpa 7.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:06:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2012.codfw.wmnet 67.0.192.10.in-addr.arpa 7.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:06:18] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2012 [10:06:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2012 [10:06:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [10:07:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage [10:09:53] !log btullis@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling 
restart_daemons on A:datahubsearch [10:11:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage [10:12:32] (PS1) Btullis: cephosd: Do not fail if no MD RAID arrays are dicovered [puppet] - https://gerrit.wikimedia.org/r/1065146 (https://phabricator.wikimedia.org/T372783) [10:13:34] (CR) Btullis: [C:+2] cephosd: Do not fail if no MD RAID arrays are dicovered [puppet] - https://gerrit.wikimedia.org/r/1065146 (https://phabricator.wikimedia.org/T372783) (owner: Btullis) [10:14:27] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [10:14:35] SRE, Infrastructure-Foundations, netbox, Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997#10087280 (ayounsi) Stalled→Declined An active/active Netbox is not really doable for now. For both Redis and Postgres the extra cross-DC latency makes it prac... 
[10:15:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P67704 and previous config saved to /var/cache/conftool/dbconfig/20240823-101459-ladsgroup.json [10:17:22] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [10:19:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [10:22:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [10:26:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [10:30:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T370903)', diff saved to https://phabricator.wikimedia.org/P67705 and previous config saved to /var/cache/conftool/dbconfig/20240823-103006-ladsgroup.json [10:30:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2199.codfw.wmnet with reason: Maintenance [10:30:13] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:30:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2199.codfw.wmnet with reason: Maintenance [10:30:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2025.codfw.wmnet with OS bullseye [10:31:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[10:34:16] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2025.codfw.wmnet [10:34:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2025.codfw.wmnet [10:35:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2026.codfw.wmnet [10:35:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2026.codfw.wmnet [10:35:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:36:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [10:36:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[10:36:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [10:36:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:37:22] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [10:39:48] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2003.codfw.wmnet [10:40:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [10:40:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2003.codfw.wmnet [10:40:48] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2026 - cgoubert@cumin1002" [10:40:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2026 - cgoubert@cumin1002" [10:40:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:53] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2026.codfw.wmnet 170.0.192.10.in-addr.arpa 0.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:40:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2026.codfw.wmnet 170.0.192.10.in-addr.arpa 0.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:40:56] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2026 [10:41:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2026 [10:41:07] !log cgoubert@cumin1002 END 
(PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [10:41:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2003.codfw.wmnet with OS bullseye [10:42:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [10:42:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [10:42:27] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:44:15] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:45:02] (03PS1) 10Slyngshede: 2FA: Use username as foreign key to security token table. [software/bitu] - 10https://gerrit.wikimedia.org/r/1065166 [10:45:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2012.codfw.wmnet with OS bullseye [10:45:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[10:46:49] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2003 - cgoubert@cumin1002" [10:46:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2003 - cgoubert@cumin1002" [10:46:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:54] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2003.codfw.wmnet 177.16.192.10.in-addr.arpa 7.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:46:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2003.codfw.wmnet 177.16.192.10.in-addr.arpa 7.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:46:57] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2003 [10:48:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2003 [10:48:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [10:48:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:51:30] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10087394 (10Tgr) 
Related: {T198755} >>!... [10:53:16] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:53:33] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:53:56] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2012.codfw.wmnet [10:53:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2012.codfw.wmnet [10:54:21] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [10:54:43] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [10:55:17] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:55:27] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [10:55:45] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [10:56:37] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [10:58:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [10:58:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2206.codfw.wmnet with reason: Maintenance [10:59:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2206.codfw.wmnet with reason: Maintenance [10:59:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T370903)', diff saved to https://phabricator.wikimedia.org/P67706 and previous config saved to 
/var/cache/conftool/dbconfig/20240823-105938-ladsgroup.json [10:59:42] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240823T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240823T1100). [11:00:48] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10087406 (10Ladsgroup) >>! In T372943#10... [11:01:08] !log running homer 'cr*codfw*' commit T372878 [11:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:11] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:02:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [11:03:39] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [11:05:05] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:05:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [11:05:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bookworm [11:05:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage [11:05:58] 06SRE, 10SRE-swift-storage: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185 (10MatthewVernon) 03NEW [11:07:14] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bookworm [11:07:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:08:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67707 and previous config saved to /var/cache/conftool/dbconfig/20240823-110813-ladsgroup.json [11:08:17] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:08:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage [11:09:14] 06SRE, 10SRE-swift-storage: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185#10087438 (10MatthewVernon) The relevant code is [[ https://github.com/ceph/ceph/blob/1606e7f6687026ef9e2196416ed4d243749d4303/src/cephadm/cephadm.py#L558 | infer_local_ceph_image in ce... 
[11:16:10] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch from test-s1 to test-s1 [11:16:15] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch from test-s1 to test-s1 [11:16:32] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1004.eqiad.wmnet with OS bookworm [11:16:57] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bookworm [11:21:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bullseye [11:21:59] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[11:23:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P67708 and previous config saved to /var/cache/conftool/dbconfig/20240823-112320-ladsgroup.json [11:23:45] !log Running homer 'lsw1-a3-codfw*' commit 'T372878' [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:51] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:27:05] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2026.codfw.wmnet [11:27:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2026.codfw.wmnet [11:27:17] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch from test-s1 to test-s1 [11:27:21] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch from test-s1 to test-s1 [11:28:06] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2033.codfw.wmnet [11:28:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2033.codfw.wmnet [11:28:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2003.codfw.wmnet with OS bullseye [11:28:48] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1004.eqiad.wmnet with OS bookworm [11:28:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[11:29:18] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087484 (Clement_Goubert) [11:30:24] !log Running homer 'lsw1-b3-codfw*' commit 'T372878' [11:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:31:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2033.codfw.wmnet with OS bullseye [11:31:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T370903)', diff saved to https://phabricator.wikimedia.org/P67709 and previous config saved to /var/cache/conftool/dbconfig/20240823-113109-ladsgroup.json [11:31:20] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087486 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[11:31:23] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:31:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [11:32:07] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2003.codfw.wmnet [11:32:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2003.codfw.wmnet [11:32:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:35:27] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2033 - cgoubert@cumin1002" [11:35:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2033 - cgoubert@cumin1002" [11:35:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:35:32] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2033.codfw.wmnet 55.0.192.10.in-addr.arpa 5.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:35:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2033.codfw.wmnet 55.0.192.10.in-addr.arpa 5.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:35:36] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2033 [11:35:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2033 [11:35:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [11:38:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2190', diff saved to https://phabricator.wikimedia.org/P67710 and previous config saved to /var/cache/conftool/dbconfig/20240823-113829-ladsgroup.json [11:39:24] (03PS1) 10Btullis: cephosd: Remove LVM signatures in addition to MD RAID metadata [puppet] - 10https://gerrit.wikimedia.org/r/1065180 (https://phabricator.wikimedia.org/T372783) [11:40:17] (03CR) 10Btullis: [C:03+2] cephosd: Remove LVM signatures in addition to MD RAID metadata [puppet] - 10https://gerrit.wikimedia.org/r/1065180 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [11:44:51] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bookworm [11:46:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P67711 and previous config saved to /var/cache/conftool/dbconfig/20240823-114616-ladsgroup.json [11:48:13] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch from test-s1 to test-s1 [11:48:16] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch from test-s1 to test-s1 [11:49:34] (03PS1) 10MVernon: ceph: add the LABEL ceph=True to the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065187 (https://phabricator.wikimedia.org/T279621) [11:50:34] (03CR) 10MVernon: "I did a test build locally, and now:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065187 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [11:52:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage [11:52:17] (03PS1) 10Dbrant: Turn account vanishing contact form into a redirect. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) [11:53:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67712 and previous config saved to /var/cache/conftool/dbconfig/20240823-115336-ladsgroup.json [11:53:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [11:53:40] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:53:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [11:53:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T371742)', diff saved to https://phabricator.wikimedia.org/P67713 and previous config saved to /var/cache/conftool/dbconfig/20240823-115358-ladsgroup.json [11:54:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage [12:01:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P67714 and previous config saved to /var/cache/conftool/dbconfig/20240823-120124-ladsgroup.json [12:04:28] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [12:08:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [12:08:25] (03PS1) 10Dbrant: Turn account vanishing contact form into a redirect. (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) [12:10:26] (03PS2) 10Dbrant: Turn account vanishing contact form into a redirect. 
(beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) [12:12:11] (03PS1) 10Brouberol: airflow-test-k8s: Move pooler.imageTag to a value file common to all PG clusters in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065203 (https://phabricator.wikimedia.org/T372286) [12:13:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2033.codfw.wmnet with OS bullseye [12:13:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087551 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [12:13:50] (03PS1) 10Brouberol: superset-next: use immutable image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065206 (https://phabricator.wikimedia.org/T373000) [12:14:24] (03PS4) 10Brouberol: cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 (https://phabricator.wikimedia.org/T373000) [12:14:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:16:20] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [12:16:23] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch [12:16:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T370903)', diff saved to https://phabricator.wikimedia.org/P67715 and previous config saved to 
/var/cache/conftool/dbconfig/20240823-121631-ladsgroup.json [12:16:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2210.codfw.wmnet with reason: Maintenance [12:16:35] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:16:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2210.codfw.wmnet with reason: Maintenance [12:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67716 and previous config saved to /var/cache/conftool/dbconfig/20240823-121653-ladsgroup.json [12:17:44] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [12:17:48] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch [12:20:20] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [12:20:32] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch [12:30:16] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [12:31:12] (03CR) 10Brouberol: [C:03+2] superset-next: use immutable image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065206 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [12:31:17] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bookworm [12:31:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:32:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/superset-next: apply [12:34:33] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bookworm [12:39:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:39:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:42:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67717 and previous config saved to /var/cache/conftool/dbconfig/20240823-124243-ladsgroup.json [12:42:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:54:32] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [12:57:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [12:57:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P67718 and previous config saved to /var/cache/conftool/dbconfig/20240823-125750-ladsgroup.json [13:03:24] (03CR) 10Ssingh: [C:03+1] wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [13:03:47] (03PS1) 10Nik Gkountas: admin: add new ssh key for ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) [13:05:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10087670 (10ngkountas) @Fabfur @jhathaway I have created a new key and created a new Gerrit patch for it here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/106... 
[13:07:50] !log milimetric@deploy1003 Started deploy [analytics/refinery@e5d0d48]: Special deploy to make sure sqoop logic matches schema change [13:08:59] !log Running homer 'lsw1-a3-codfw*' commit 'T372878' [13:08:59] (03CR) 10Ssingh: [C:03+1] wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [13:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:09:27] (03CR) 10Ssingh: [C:03+1] wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [13:09:47] !log milimetric@deploy1003 Finished deploy [analytics/refinery@e5d0d48]: Special deploy to make sure sqoop logic matches schema change (duration: 01m 57s) [13:10:21] !log Running homer 'cr*codfw*' commit 'T372878' [13:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P67719 and previous config saved to /var/cache/conftool/dbconfig/20240823-131257-ladsgroup.json [13:15:21] (03CR) 10Ssingh: [C:03+1] wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [13:16:30] (03CR) 10CDanis: [C:03+1] jaeger: enable tags-as-fields for query and collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065127 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [13:17:39] (03CR) 10Filippo Giunchedi: "'alertmanagers' variable needs to be changed too, to instruct prometheus to send alerts to the new host too" [puppet] - 
10https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [13:17:45] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2033.codfw.wmnet [13:17:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker2033.codfw.wmnet [13:18:02] (03CR) 10Filippo Giunchedi: "'alertmanagers' variable needs to be changed too, to instruct prometheus to send alerts to the new host too" [puppet] - 10https://gerrit.wikimedia.org/r/1064828 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [13:18:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2033.codfw.wmnet [13:18:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2033.codfw.wmnet [13:18:49] (03CR) 10MVernon: "Hi," [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065187 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:19:07] (03CR) 10Clément Goubert: [C:03+1] ceph: add the LABEL ceph=True to the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065187 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:19:24] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194 (10KCVelaga_WMF) 03NEW [13:19:45] (03CR) 10MVernon: [V:03+2 C:03+2] ceph: add the LABEL ceph=True to the image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065187 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:21:17] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bookworm [13:25:35] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2007.codfw.wmnet [13:27:05] 
(03PS1) 10MVernon: php7.4 - fix formatting error in changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065218 [13:27:57] (03CR) 10MVernon: "Hi," [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065218 (owner: 10MVernon) [13:28:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67720 and previous config saved to /var/cache/conftool/dbconfig/20240823-132804-ladsgroup.json [13:28:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2219.codfw.wmnet with reason: Maintenance [13:28:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:28:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2219.codfw.wmnet with reason: Maintenance [13:28:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T370903)', diff saved to https://phabricator.wikimedia.org/P67721 and previous config saved to /var/cache/conftool/dbconfig/20240823-132838-ladsgroup.json [13:28:47] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi) [13:28:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2007.codfw.wmnet [13:29:57] (03CR) 10Bking: [C:03+1] airflow-test-k8s: Move pooler.imageTag to a value file common to all PG clusters in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065203 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:30:09] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: Move pooler.imageTag to a value file common to all PG clusters in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065203 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:30:16] (03CR) 10Cathal Mooney: [C:03+1] IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi) [13:30:17] !log milimetric@deploy1003 Started deploy [analytics/refinery@e5d0d48]: Special deploy to make sure sqoop logic matches schema change [13:30:20] (03CR) 10Clément Goubert: [C:03+1] php7.4 - fix formatting error in changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065218 (owner: 10MVernon) [13:30:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2007.codfw.wmnet with OS bullseye [13:30:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087782 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[13:31:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [13:31:27] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:31:32] (03CR) 10Hnowlan: [C:03+1] php7.4 - fix formatting error in changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065218 (owner: 10MVernon) [13:31:34] (03CR) 10MVernon: [V:03+2 C:03+2] php7.4 - fix formatting error in changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1065218 (owner: 10MVernon) [13:32:54] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Decom next week [13:32:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Decom next week [13:32:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T371742)', diff saved to https://phabricator.wikimedia.org/P67722 and previous config saved to /var/cache/conftool/dbconfig/20240823-133258-ladsgroup.json [13:33:03] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:34:51] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2007 - cgoubert@cumin1002" [13:34:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2007 - cgoubert@cumin1002" [13:34:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:34:55] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2007.codfw.wmnet 195.16.192.10.in-addr.arpa 5.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:34:58] !log 
cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2007.codfw.wmnet 195.16.192.10.in-addr.arpa 5.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:34:59] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2007 [13:36:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2007 [13:36:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [13:37:29] (03PS1) 10AikoChou: ml-services: add new revertrisk isvcs for pre-save context [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) [13:37:40] !log milimetric@deploy1003 Finished deploy [analytics/refinery@e5d0d48]: Special deploy to make sure sqoop logic matches schema change (duration: 07m 22s) [13:40:34] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10087813 (10Papaul) @Clement_Goubert thank you [13:41:22] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10087815 (10Papaul) [13:42:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:48:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P67723 and previous config saved to /var/cache/conftool/dbconfig/20240823-134805-ladsgroup.json [13:49:31] !log milimetric@deploy1003 Started deploy [analytics/refinery@e5d0d48] (thin): Special deploy to make sure sqoop logic matches schema change [13:49:41] (03CR) 10STran: 
[C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065140 (https://phabricator.wikimedia.org/T372869) (owner: 10STran) [13:49:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2027.codfw.wmnet [13:50:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2027.codfw.wmnet [13:50:37] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065140 (https://phabricator.wikimedia.org/T372869) (owner: 10STran) [13:50:49] (03PS2) 10AikoChou: ml-services: add new revertrisk isvcs for pre-save context [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) [13:51:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2027.codfw.wmnet with OS bullseye [13:51:21] (03PS1) 10Brouberol: cloudnative-pg-cluster: stop printing variable in resource manifest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065227 (https://phabricator.wikimedia.org/T372286) [13:51:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087860 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[13:51:29] !log stran@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [13:51:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [13:52:11] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:52:13] !log stran@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:52:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:58] !log stran@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:53:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [13:53:38] !log stran@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [13:54:08] !log stran@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [13:54:09] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg-cluster: stop printing variable in resource manifest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065227 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:54:19] !log milimetric@deploy1003 Finished deploy [analytics/refinery@e5d0d48] (thin): Special deploy to make sure sqoop logic matches schema change (duration: 04m 48s) [13:54:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T370903)', diff saved to https://phabricator.wikimedia.org/P67724 and previous config saved to /var/cache/conftool/dbconfig/20240823-135431-ladsgroup.json [13:54:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:54:36] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: stop printing variable in resource manifest [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1065227 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:54:45] !log stran@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [13:55:00] (03CR) 10AikoChou: ml-services: add new revertrisk isvcs for pre-save context (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [13:55:33] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2027 - cgoubert@cumin1002" [13:55:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2027 - cgoubert@cumin1002" [13:55:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:38] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2027.codfw.wmnet 176.0.192.10.in-addr.arpa 6.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:55:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2027.codfw.wmnet 176.0.192.10.in-addr.arpa 6.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:55:42] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2027 [13:56:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [13:57:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:58:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2027 [13:58:26] !log cgoubert@cumin1002 END 
(PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [13:58:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:01:56] !log Running homer 'cr*codfw*' commit 'T372878' [14:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:03:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P67725 and previous config saved to /var/cache/conftool/dbconfig/20240823-140312-ladsgroup.json [14:09:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P67726 and previous config saved to /var/cache/conftool/dbconfig/20240823-140938-ladsgroup.json [14:10:13] (03CR) 10Klausman: ml-services: add new revertrisk isvcs for pre-save context (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: 10AikoChou) [14:15:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage [14:16:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2007.codfw.wmnet with OS bullseye [14:16:40] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087989 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[14:17:19] !log Running homer 'lsw1-b6-codfw*' commit 'T372878' [14:17:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage [14:18:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T371742)', diff saved to https://phabricator.wikimedia.org/P67727 and previous config saved to /var/cache/conftool/dbconfig/20240823-141819-ladsgroup.json [14:18:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:18:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:18:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:18:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2007.codfw.wmnet [14:18:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2007.codfw.wmnet [14:18:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T371742)', diff saved to https://phabricator.wikimedia.org/P67728 and previous config saved to /var/cache/conftool/dbconfig/20240823-141841-ladsgroup.json [14:20:15] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10087998 (10mpopov) Approved [14:22:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bookworm [14:23:11] (03PS4) 10Brouberol: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) [14:24:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', 
diff saved to https://phabricator.wikimedia.org/P67729 and previous config saved to /var/cache/conftool/dbconfig/20240823-142445-ladsgroup.json [14:30:53] (03CR) 10JHathaway: "Overall looks good, a couple of suggested fixes and a few questions." [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [14:31:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T367856)', diff saved to https://phabricator.wikimedia.org/P67730 and previous config saved to /var/cache/conftool/dbconfig/20240823-143140-marostegui.json [14:31:49] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:34:37] (03PS2) 10Hnowlan: use shellbox-video globally (adding group2, including commons) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) [14:36:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:37:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2027.codfw.wmnet with OS bullseye [14:37:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10088060 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[14:39:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T370903)', diff saved to https://phabricator.wikimedia.org/P67731 and previous config saved to /var/cache/conftool/dbconfig/20240823-143952-ladsgroup.json [14:41:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:42:17] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:45:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:46:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P67732 and previous config saved to /var/cache/conftool/dbconfig/20240823-144649-marostegui.json [14:52:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:56:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088114 (10klausman) The problem with the alert is that it's in a very spammy channel, and this particular alert happened about 1h after I had left for the day. I guess... [14:58:52] (03PS1) 10Andrew Bogott: Remove obsolete files for openstack v. 
antelope [puppet] - 10https://gerrit.wikimedia.org/r/1065235 [15:00:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088119 (10Jhancock.wm) I can get the offending DIMM card replaced today. Just need a little bit. @klausman [15:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P67733 and previous config saved to /var/cache/conftool/dbconfig/20240823-150156-marostegui.json [15:03:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088121 (10Dzahn) fwiw, I keep thinking that if the alert would simply be an email to the right list then it would be much more effective, not require realtime monitorin... [15:03:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088125 (10klausman) >>! In T365291#10088119, @Jhancock.wm wrote: > I can get the offending DIMM card replaced today. Just need a little bit. @klausman Thank you! The... 
[15:05:05] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:09:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bookworm [15:11:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bookworm [15:17:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T367856)', diff saved to https://phabricator.wikimedia.org/P67734 and previous config saved to /var/cache/conftool/dbconfig/20240823-151704-marostegui.json [15:17:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2164.codfw.wmnet with reason: Maintenance [15:17:08] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:17:18] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 10Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799#10088179 (10CDanis) @Jelto and @fgiunchedi handled a repeat Jio set-top box hotlink issue between 0600-0730 UTC today. The new requestct... 
[15:17:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2164.codfw.wmnet with reason: Maintenance [15:17:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 14:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:17:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 14:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:17:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T367856)', diff saved to https://phabricator.wikimedia.org/P67735 and previous config saved to /var/cache/conftool/dbconfig/20240823-151730-marostegui.json [15:19:00] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10088195 (10Jhancock.wm) [15:19:35] (03PS16) 10Btullis: Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [15:21:09] (03CR) 10CI reject: [V:04-1] Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:24:27] (03PS17) 10Btullis: Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [15:26:15] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3734/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:28:26] (03CR) 10Btullis: [V:03+1] Add a matomo_plugins component to the apt private repo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062401 
(https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:28:55] (03PS1) 10CDanis: haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) [15:32:29] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1001.eqiad.wmnet with OS bookworm [15:33:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bookworm [15:33:58] (03CR) 10CDanis: [C:03+2] jaeger: enable tags-as-fields for query and collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065127 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [15:34:51] (03Merged) 10jenkins-bot: jaeger: enable tags-as-fields for query and collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1065127 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [15:35:25] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:35:35] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:38:00] !log Running homer 'lsw1-a6-codfw*' commit T372878 [15:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:03] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:38:19] (03CR) 10JHathaway: "just noticed this typo" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:39:27] (03PS18) 10Btullis: Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [15:40:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:41:01] (03CR) 10Btullis: [V:03+1] Add a matomo_plugins component to the apt private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:45:15] (03CR) 10JHathaway: [C:03+1] "looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1065166 (owner: 10Slyngshede) [15:46:53] (03PS2) 10CDanis: haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) [15:47:00] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [15:48:37] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10088261 (10Dwisehaupt) @Clement_Goubert Oh. Thanks for that. I must have forgot it last night. S... 
[15:52:05] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [15:52:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 17:00:00 on wdqs[1023-1024].eqiad.wmnet with reason: noisy alerts related to graph split T337013 [15:53:00] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [15:53:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 17:00:00 on wdqs[1023-1024].eqiad.wmnet with reason: noisy alerts related to graph split T337013 [15:53:41] 14SRE-Sprint-Week-Sustainability-March2023, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799#10088277 (10CDanis) `lang=irc 11:47:53 hmmm if those are megabytes my intention was to set it to 300 /... [15:54:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [15:54:59] (03PS1) 10Dwisehaupt: icinga: remove frdb2001 frqueue2001 payments2003 [puppet] - 10https://gerrit.wikimedia.org/r/1064942 (https://phabricator.wikimedia.org/T373149) [15:59:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2027.codfw.wmnet [15:59:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2027.codfw.wmnet [15:59:57] !log Running homer 'cr*codfw*' commit T372878 [16:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:00] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:00:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T371742)', diff saved to https://phabricator.wikimedia.org/P67736 and previous config saved to 
/var/cache/conftool/dbconfig/20240823-160033-ladsgroup.json [16:00:52] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:04:16] (03PS3) 10CDanis: haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) [16:07:25] (03PS3) 10Scott French: php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) [16:07:25] (03PS3) 10Scott French: php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) [16:07:25] (03PS3) 10Scott French: php8.1-fpm-multiversion-base: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) [16:10:04] (03CR) 10Vgutierrez: [C:03+1] haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [16:14:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P67737 and previous config saved to /var/cache/conftool/dbconfig/20240823-161540-ladsgroup.json [16:16:25] !log bearloga@deploy1003 Started deploy [airflow-dags/wmde@c55c7de]: (no justification provided) [16:16:32] !log bearloga@deploy1003 Finished deploy [airflow-dags/wmde@c55c7de]: (no justification provided) (duration: 00m 06s) 
[16:18:01] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10088365 (10Clement_Goubert) >>! In T373149#10088261, @Dwisehaupt wrote: > @Clement_Goubert Oh. T... [16:19:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bookworm [16:21:31] (03PS1) 10Andrew Bogott: Add openstack magnum and heat logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1065247 (https://phabricator.wikimedia.org/T268175) [16:22:44] (03PS2) 10Andrew Bogott: Add openstack magnum and heat logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1065247 (https://phabricator.wikimedia.org/T268175) [16:23:13] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065247 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [16:25:27] (03CR) 10Vgutierrez: prometheus: add script to check TCP MSS clamping value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:26:06] (03PS1) 10Bking: Data-platform: change severity of stat host high load alerts [alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) [16:26:22] (03CR) 10Andrew Bogott: [C:03+2] Add openstack magnum and heat logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1065247 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [16:27:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10088394 (10Jhancock.wm) [16:28:01] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10088395 (10Jhancock.wm) [16:28:16] 10ops-codfw, 06SRE, 
06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10088396 (10Jhancock.wm) [16:30:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P67738 and previous config saved to /var/cache/conftool/dbconfig/20240823-163047-ladsgroup.json [16:32:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [16:33:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [16:34:04] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:37:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating mgmt for frack servers in codfw - jhancock@cumin2002" [16:37:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating mgmt for frack servers in codfw - jhancock@cumin2002" [16:37:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:40:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10088411 (10VRiley-WMF) Was able to find the drive. 
We can replace at anytime @MatthewVernon [16:42:05] (03CR) 10Ssingh: [C:03+1] Data-platform: change severity of stat host high load alerts [alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [16:45:26] !log nettrom@deploy1003 Started deploy [airflow-dags/analytics_product@c55c7de]: (no justification provided) [16:45:44] !log nettrom@deploy1003 Finished deploy [airflow-dags/analytics_product@c55c7de]: (no justification provided) (duration: 00m 17s) [16:45:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T371742)', diff saved to https://phabricator.wikimedia.org/P67740 and previous config saved to /var/cache/conftool/dbconfig/20240823-164554-ladsgroup.json [16:45:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:50:10] !log conniecc1@deploy1003 Started deploy [airflow-dags/analytics_product@c55c7de]: (no justification provided) [16:50:14] !log conniecc1@deploy1003 Finished deploy [airflow-dags/analytics_product@c55c7de]: (no justification provided) (duration: 00m 03s) [17:07:22] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10088494 (10Jhancock.wm) a:05Jhancock.wm→03Papaul This one is ready for ya. ports are on 41 of both frack switches. [17:08:01] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10088497 (10Jhancock.wm) frdb2005 is racked and ready for @Papaul. ports are on port 40 of the FR switches. [17:08:38] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10088498 (10Jhancock.wm) @Papaul payments2006 is racked and ready. on ports 39 of the FR switches. 
[17:18:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088511 (10Jhancock.wm) @klausman it's been replaced and booted up. looks like the alert has cleared. lmk if you need any further assistance! [17:20:09] (03CR) 10Brouberol: [C:03+1] Data-platform: change severity of stat host high load alerts [alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [17:21:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10088520 (10klausman) >>! In T365291#10088511, @Jhancock.wm wrote: > @klausman it's been replaced and booted up. looks like the alert has cleared. lmk if you need any fur... [17:23:17] (03PS2) 10Dwisehaupt: frack: decommission a few codfw hosts [dns] - 10https://gerrit.wikimedia.org/r/1064941 (https://phabricator.wikimedia.org/T373149) [17:29:50] RESOLVED: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:52:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:05] (03PS2) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1065258 (https://phabricator.wikimedia.org/T372418) [17:57:22] (03CR) 10Dwisehaupt: "These hosts are all off. Ready to clean up the DNS for them." 
[dns] - 10https://gerrit.wikimedia.org/r/1064941 (https://phabricator.wikimedia.org/T373149) (owner: 10Dwisehaupt) [17:57:52] (03CR) 10Dwisehaupt: "These hosts are all off. Just cleaning them out of icinga." [puppet] - 10https://gerrit.wikimedia.org/r/1064942 (https://phabricator.wikimedia.org/T373149) (owner: 10Dwisehaupt) [17:59:23] (03CR) 10Andrea Denisse: "Hi Filippo, thanks for taking a look. This patch is part of the Stage 3: "Make alert2002 the active alertmanager host"." [puppet] - 10https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [17:59:44] (03CR) 10Eevans: [C:03+2] restbase: Update mobileapps service hostname on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1065123 (https://phabricator.wikimedia.org/T370460) (owner: 10Jgiannelos) [18:00:10] (03CR) 10Andrea Denisse: [C:03+2] alert: Allow Apache2 connections for the alert[12]002 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1064821 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [18:01:13] (03CR) 10Andrea Denisse: [C:03+2] alert: Allow connections from the alert[12]002 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1064818 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [18:02:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [18:09:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:14:05] (03CR) 10Ssingh: [C:03+1] frack: decommission a few codfw hosts [dns] - 10https://gerrit.wikimedia.org/r/1064941 (https://phabricator.wikimedia.org/T373149) (owner: 10Dwisehaupt) [18:14:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:19:45] (03CR) 10Dwisehaupt: [C:03+2] "Thanks @ssingh@wikimedia.org I can finish this out and deploy." [dns] - 10https://gerrit.wikimedia.org/r/1064941 (https://phabricator.wikimedia.org/T373149) (owner: 10Dwisehaupt) [18:31:43] (03CR) 10Bartosz Dziewoński: "It seems that there was an attempt to fix this in another commit somewhere. 
However, the changes don't seem to have been deployed to https" [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [18:32:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [18:37:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:07:57] (03PS1) 10Kgraessle: Enable AutoModerator on id.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) [19:14:28] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10088802 (10CDanis) >>! In T372943#10083... 
[19:15:32] (03PS1) 10Eevans: Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) [19:16:10] (03CR) 10CI reject: [V:04-1] Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: 10Eevans) [19:17:26] (03CR) 10Bking: [C:03+2] Data-platform: change severity of stat host high load alerts [alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [19:18:37] (03Merged) 10jenkins-bot: Data-platform: change severity of stat host high load alerts [alerts] - 10https://gerrit.wikimedia.org/r/1065248 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [19:51:25] (03PS4) 10CDanis: haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) [19:56:20] (03PS20) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [19:56:50] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:59:07] (03PS21) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [19:59:36] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:00:21] (03PS22) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:00:52] (03CR) 10CI reject: 
[V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:04:44] (03PS23) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:14:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:07] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Enable require_tty_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) [20:54:17] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10089050 (10Tgr) >>! In T372943#10087406... [21:01:29] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10089060 (10jhathaway) [21:02:15] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10089061 (10jhathaway) [21:06:52] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10089064 (10jhathaway) T236481 has some related background work that @jbond performed to prepare us for the pson deprecation. Also, in order to compile catalog's on the command line, yo... 
[21:07:55] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10089066 (10jhathaway) [21:09:25] (03CR) 10JHathaway: [C:03+1] icinga: remove frdb2001 frqueue2001 payments2003 [puppet] - 10https://gerrit.wikimedia.org/r/1064942 (https://phabricator.wikimedia.org/T373149) (owner: 10Dwisehaupt) [21:11:37] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [21:50:11] (03PS1) 10CDanis: add historical API [software/klaxon] - 10https://gerrit.wikimedia.org/r/1065279 [21:50:11] (03PS1) 10CDanis: WIP: export annotations [software/klaxon] - 10https://gerrit.wikimedia.org/r/1065280 [21:51:13] (03PS2) 10CDanis: WIP: export annotations [software/klaxon] - 10https://gerrit.wikimedia.org/r/1065280 (https://phabricator.wikimedia.org/T373230) [21:51:18] (03CR) 10CI reject: [V:04-1] add historical API [software/klaxon] - 10https://gerrit.wikimedia.org/r/1065279 (owner: 10CDanis) [21:52:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:57] (03PS1) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) [21:55:16] (03PS1) 10JHathaway: puppet8: ensure type is binary [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) [21:55:44] (03PS1) 10JHathaway: puppet8: drop explicity metaparams [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) [22:15:27] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway) [22:15:28] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [22:15:30] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [23:29:31] (03PS2) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [23:29:51] (03CR) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [23:30:13] (03CR) 10CI reject: [V:04-1] Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [23:30:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [23:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1065298 [23:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1065298 (owner: 10TrainBranchBot) [23:41:54] (03PS3) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [23:42:35] (03CR) 10CI reject: [V:04-1] Add Chart extension, enable in beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [23:45:43] (03PS4) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [23:50:49] (03PS5) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [23:53:13] (03PS6) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945)