[00:16:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti-test2004.codfw.wmnet with OS bullseye [00:17:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [00:18:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [00:19:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [00:19:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [00:24:04] (03PS1) 10Andrew Bogott: glance: update init script from upstream package [puppet] - 10https://gerrit.wikimedia.org/r/963849 [00:24:39] (03CR) 10Andrew Bogott: [C: 03+2] glance: update init script from upstream package [puppet] - 10https://gerrit.wikimedia.org/r/963849 (owner: 10Andrew Bogott) [00:26:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [00:30:14] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [00:31:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:35:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [00:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962245 [00:38:51] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962245 (owner: 
10TrainBranchBot) [00:39:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [00:39:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [00:46:01] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/962245 (owner: 10TrainBranchBot) [01:00:03] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:02] (03PS1) 10Andrew Bogott: Glance: listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/963850 [01:02:55] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:34] (03PS2) 10Andrew Bogott: Glance: listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/963850 [01:09:36] (03CR) 10Andrew Bogott: [C: 03+2] Glance: listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/963850 (owner: 10Andrew Bogott) [01:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52839 and previous config 
saved to /var/cache/conftool/dbconfig/20231006-011928-arnaudb.json [01:19:35] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [01:20:53] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 41.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:25:29] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 44.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:30:03] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:24] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:34:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P52840 and previous config saved to /var/cache/conftool/dbconfig/20231006-013434-arnaudb.json [01:46:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P52841 and previous config saved to /var/cache/conftool/dbconfig/20231006-014941-arnaudb.json [02:04:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52842 and previous config saved to /var/cache/conftool/dbconfig/20231006-020447-arnaudb.json [02:04:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:04:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:05:03] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:05:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:05:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52843 and previous config saved to /var/cache/conftool/dbconfig/20231006-020509-arnaudb.json [02:10:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:38:32] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:34] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:09:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:10:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:59] 
PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:39:09] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gitlab2002), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:55:27] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:47] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:39] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:44:15] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:45:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:51:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:56:03] 
RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231006T0600) [06:00:37] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:30] (03CR) 10Tim Starling: [C: 03+1] "MediaWiki has SvgHandler::SVG_DEFAULT_RENDER_LANG = 'en'. In SvgHandler::makeParamString():" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/962563 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [06:25:57] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:33] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:18] (03CR) 10Muehlenhoff: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [06:52:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2023.codfw.wmnet with OS bullseye [06:53:01] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye [06:53:37] !log installing bind9 security updates (client side libs/tools only) [06:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:21] (03CR) 10Stevemunene: airflow-wmde: 
configure wmde airflow instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231006T0700) [07:04:21] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [07:09:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2023.codfw.wmnet with reason: host reimage [07:12:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:12:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2023.codfw.wmnet with reason: host reimage [07:17:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:25:53] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:41] (03PS1) 10Majavah: replica_cnf_api: Don't try to access variable before assignment [puppet] - 
10https://gerrit.wikimedia.org/r/963939 [07:27:43] (03PS1) 10Majavah: maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 [07:28:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2023.codfw.wmnet with OS bullseye [07:28:43] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye completed: - ganeti2023 (**PASS**) - Downtimed on... [07:29:57] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:02] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 (owner: 10Majavah) [07:32:21] (03PS2) 10Majavah: maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 [07:34:55] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 (owner: 10Majavah) [07:36:09] (03PS3) 10Majavah: maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 [07:36:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:32] (03PS1) 10Elukey: modules: duplicate app:generic:1.0.0 to 1.0.1 to ease reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/963943 [07:36:34] (03PS1) 10Elukey: modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 
[07:36:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:18] (03PS2) 10Elukey: modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 [07:37:55] (03PS3) 10Elukey: modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 [07:38:45] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 (owner: 10Majavah) [07:39:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:40:21] (03PS1) 10Elukey: charts: update app.generic module for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/963945 [07:41:13] (03PS4) 10Elukey: modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 [07:41:39] (03PS2) 10Elukey: charts: update app.generic module for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/963945 [07:42:01] (03PS3) 10Elukey: charts: update app.generic module for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/963945 [07:44:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:45:02] (03PS4) 10Majavah: maintain_dbusers: Do not try to get file path on deletion [puppet] - 
10https://gerrit.wikimedia.org/r/963940 [07:45:19] (03CR) 10Kevin Bazira: [C: 03+1] charts: update app.generic module for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/963945 (owner: 10Elukey) [07:47:45] (03CR) 10Elukey: "Post merge review :)" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [07:56:03] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:19] (03CR) 10Majavah: [C: 03+2] hieradata: acme_chief: update openstack cert config [puppet] - 10https://gerrit.wikimedia.org/r/963752 (owner: 10Majavah) [08:00:33] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: 10Cwhite) [08:03:45] (03CR) 10Gehel: "Minor comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [08:04:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [08:12:01] (03PS1) 10Gehel: to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 [08:12:41] (03CR) 10CI reject: [V: 04-1] to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 (owner: 10Gehel) [08:12:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [08:13:48] (03PS2) 10Gehel: to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 [08:14:24] (03Abandoned) 10Arturo Borrero Gonzalez: [DON'T MERGE UNLESS IN EMERGENCY] cloudgw: revert recent changes [puppet] - 10https://gerrit.wikimedia.org/r/963742 (owner: 10Arturo Borrero Gonzalez) [08:14:27] (03CR) 10CI reject: [V: 04-1] to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 (owner: 10Gehel) [08:15:08] (03CR) 10David Caro: [C: 03+1] replica_cnf_api: Don't try to access variable before assignment [puppet] - 10https://gerrit.wikimedia.org/r/963939 (owner: 10Majavah) [08:15:44] (03PS3) 10Gehel: to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 [08:16:46] (03CR) 10Majavah: [C: 03+2] replica_cnf_api: Don't try to access variable before assignment [puppet] - 10https://gerrit.wikimedia.org/r/963939 (owner: 10Majavah) [08:16:52] (03CR) 10David Caro: [C: 03+1] maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 (owner: 10Majavah) [08:17:00] (03CR) 10Majavah: [C: 03+2] maintain_dbusers: Do not try to get file path on deletion [puppet] - 10https://gerrit.wikimedia.org/r/963940 (owner: 10Majavah) [08:18:20] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[08:18:25] (03CR) 10CI reject: [V: 04-1] to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 (owner: 10Gehel) [08:21:28] (03CR) 10Clément Goubert: [C: 03+1] modules: duplicate app:generic:1.0.0 to 1.0.1 to ease reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/963943 (owner: 10Elukey) [08:22:07] (03CR) 10Clément Goubert: [C: 03+1] modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 (owner: 10Elukey) [08:22:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2023.codfw.wmnet to cluster codfw and group A [08:22:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2023.codfw.wmnet to cluster codfw and group A [08:24:06] !log elukey@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [08:26:23] !log elukey@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[08:34:35] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Switch firewall check use an IP address [cookbooks] - 10https://gerrit.wikimedia.org/r/963951 [08:34:50] (03PS1) 10Ayounsi: Add Auto-Submitted: auto-generated to Icinga emails [puppet] - 10https://gerrit.wikimedia.org/r/963952 (https://phabricator.wikimedia.org/T347835) [08:34:52] (03PS1) 10Ayounsi: Add Auto-Submitted: auto-generated to check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/963953 (https://phabricator.wikimedia.org/T347835) [08:34:54] (03PS1) 10Ayounsi: Add Auto-Submitted: auto-generated to Phabricator reports [puppet] - 10https://gerrit.wikimedia.org/r/963954 (https://phabricator.wikimedia.org/T347835) [08:43:29] !log installing vim security updates [08:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:06] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Increase alert downtime duration [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 (owner: 10EoghanGaffney) [08:48:28] (03PS1) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) [08:50:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:31] (03Merged) 10jenkins-bot: [gitlab/failover] Increase alert downtime duration [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 (owner: 10EoghanGaffney) [08:50:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:42] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10ayounsi) [08:52:17] 
10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10ayounsi) [08:52:21] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10ayounsi) [08:53:44] (03PS2) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) [08:54:24] (03CR) 10Filippo Giunchedi: [C: 03+1] Add Auto-Submitted: auto-generated to Icinga emails [puppet] - 10https://gerrit.wikimedia.org/r/963952 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [08:55:01] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:55:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43930/console" [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [08:55:27] (03CR) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [08:55:29] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:54] (03CR) 10Jelto: [C: 03+2] admin: change email of bawolff [puppet] - 10https://gerrit.wikimedia.org/r/963741 (https://phabricator.wikimedia.org/T348216) (owner: 10Jelto) [08:59:51] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (10Jelto) [08:59:51] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:21] (03PS1) 10Muehlenhoff: Add new apt servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963958 [09:03:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:03:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:03:42] (03CR) 10DCausse: [C: 03+1] flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130 (owner: 10Ebernhardson) [09:03:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:03:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:05:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host apt2002.wikimedia.org [09:05:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:06:14] (03PS2) 10Slyngshede: Style ssh key management using Codex. [software/bitu] - 10https://gerrit.wikimedia.org/r/963779 [09:10:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002" [09:11:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002" [09:11:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:11:02] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache apt2002.wikimedia.org on all recursors [09:11:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt2002.wikimedia.org on all recursors [09:11:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:15:58] (03CR) 10DCausse: "is this something we still need?" 
[puppet] - 10https://gerrit.wikimedia.org/r/961182 (owner: 10Ebernhardson) [09:18:43] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM apt2002.wikimedia.org - jmm@cumin2002" [09:19:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM apt2002.wikimedia.org - jmm@cumin2002" [09:19:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:19:33] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache apt2002.wikimedia.org on all recursors [09:19:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt2002.wikimedia.org on all recursors [09:19:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host apt2002.wikimedia.org [09:22:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host apt1002.wikimedia.org [09:22:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:23:24] (03CR) 10Muehlenhoff: [C: 03+2] Add new apt servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963958 (owner: 10Muehlenhoff) [09:25:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt1002.wikimedia.org - jmm@cumin2002" [09:26:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt1002.wikimedia.org - jmm@cumin2002" [09:26:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:26:09] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache apt1002.wikimedia.org on all recursors [09:26:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt1002.wikimedia.org on all recursors [09:26:23] RECOVERY - 
Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:24] (03CR) 10Aklapper: [C: 03+1] Add Auto-Submitted: auto-generated to Phabricator reports [puppet] - 10https://gerrit.wikimedia.org/r/963954 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [09:26:38] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt1002.wikimedia.org - jmm@cumin2002" [09:27:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt1002.wikimedia.org - jmm@cumin2002" [09:27:46] (03CR) 10Jbond: mariadb: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [09:30:51] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:06] (03CR) 10Ayounsi: [C: 03+2] Add Auto-Submitted: auto-generated to Icinga emails [puppet] - 10https://gerrit.wikimedia.org/r/963952 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [09:31:15] (03CR) 10Klausman: profile::statistics::explorer: ensure /srv/published/wmf-ml-models (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:31:53] (03CR) 10Jbond: [C: 03+1] "LGTM sorry for the noise" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [09:32:33] (03CR) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963956 
(https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:33:25] (03PS3) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) [09:37:32] (03PS3) 10Jbond: puppet agent: protect against a missing client bucket path [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [09:37:52] (03CR) 10Btullis: profile::statistics::explorer: ensure /srv/published/wmf-ml-models (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:38:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [09:39:44] 10SRE, 10SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (10Jelto) 05Open→03Resolved Email address for bewolff was updated, error message is gone when using `cross-validate-accounts`. I'll close this task. Thanks again for the qui... 
[09:41:13] (03PS2) 10Ayounsi: Add Auto-Submitted: auto-generated to Phabricator reports [puppet] - 10https://gerrit.wikimedia.org/r/963954 (https://phabricator.wikimedia.org/T347835) [09:42:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host apt1002.wikimedia.org with OS bookworm [09:43:22] (03PS4) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) [09:43:26] (03PS1) 10Elukey: role::statistics::explorer: add ml-team-admins to stat100x nodes [puppet] - 10https://gerrit.wikimedia.org/r/963962 (https://phabricator.wikimedia.org/T347838) [09:43:39] (03CR) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:43:51] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10jbond) 05Open→03Resolved @ATsay-WMF please create a new ticket using the [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ | sre access request template ]] [09:44:57] (03CR) 10Ayounsi: [C: 03+2] Add Auto-Submitted: auto-generated to Phabricator reports [puppet] - 10https://gerrit.wikimedia.org/r/963954 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [09:45:09] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/963962 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:45:23] (03PS5) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) [09:45:52] (03CR) 10Elukey: profile::statistics::explorer: ensure /srv/published/wmf-ml-models (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:46:00] (03CR) 10Klausman: [C: 03+1] "Modulo Ben's question re: group below, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:46:32] (03CR) 10Klausman: [C: 03+1] role::statistics::explorer: add ml-team-admins to stat100x nodes [puppet] - 10https://gerrit.wikimedia.org/r/963962 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:46:34] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:47:23] (03PS1) 10Brouberol: Install kafka-kit-prometheus-metricsfetcher on kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/963964 (https://phabricator.wikimedia.org/T348315) [09:48:48] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10jbond) >>! In T344199#9229747, @ATsay-WMF wrote: > Hello, I'd like to request access to analytics-privatedata-users as well. Thanks! Please disregard my last comment however... 
[09:49:08] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10jbond) 05Resolved→03Open [09:49:12] (03PS1) 10Majavah: P:toolforge: remove obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/963965 [09:51:02] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: add ml-team-admins to stat100x nodes [puppet] - 10https://gerrit.wikimedia.org/r/963962 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:51:09] (03CR) 10Elukey: [C: 03+2] profile::statistics::explorer: ensure /srv/published/wmf-ml-models [puppet] - 10https://gerrit.wikimedia.org/r/963956 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [09:51:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apt1002.wikimedia.org with reason: host reimage [09:51:43] (03CR) 10CI reject: [V: 04-1] P:toolforge: remove obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/963965 (owner: 10Majavah) [09:53:33] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/963779 (owner: 10Slyngshede) [09:54:10] (03PS2) 10Majavah: P:toolforge: remove obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/963965 [09:54:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apt1002.wikimedia.org with reason: host reimage [09:55:31] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:35] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:22] (03CR) 10Jbond: [C: 03+1] "LGTM, fyi wmflib has a dns module that wraps dnspython e.g." [cookbooks] - 10https://gerrit.wikimedia.org/r/963951 (owner: 10Muehlenhoff) [10:01:21] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement Codex design for properties page. [software/bitu] - 10https://gerrit.wikimedia.org/r/963681 (owner: 10Slyngshede) [10:01:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Style ssh key management using Codex. [software/bitu] - 10https://gerrit.wikimedia.org/r/963779 (owner: 10Slyngshede) [10:04:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:02] (03PS2) 10Slyngshede: Add URI validator [software/bitu] - 10https://gerrit.wikimedia.org/r/961738 [10:05:13] (03CR) 10Slyngshede: Add URI validator (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/961738 (owner: 10Slyngshede) [10:05:18] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add URI validator [software/bitu] - 10https://gerrit.wikimedia.org/r/961738 (owner: 10Slyngshede) [10:07:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apt1002.wikimedia.org with OS bookworm [10:07:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host apt1002.wikimedia.org [10:10:24] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Switch firewall check use an IP address (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/963951 (owner: 10Muehlenhoff) [10:13:12] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [10:13:22] !log 
dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [10:14:14] (03PS1) 10Filippo Giunchedi: prometheus: add 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) [10:15:37] (03PS1) 10Muehlenhoff: Remove ORES Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/963988 [10:16:21] (03CR) 10Elukey: [C: 03+1] Remove ORES Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/963988 (owner: 10Muehlenhoff) [10:16:38] (03CR) 10CI reject: [V: 04-1] prometheus: add 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:19:41] (03CR) 10Muehlenhoff: [C: 03+2] Remove ORES Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/963988 (owner: 10Muehlenhoff) [10:20:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2023.codfw.wmnet to cluster codfw and group A [10:20:43] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Trizek-WMF) [10:21:01] (03PS1) 10Btullis: Configure the spark3 defaults with the default yarn shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [10:21:04] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) 05Open→03Resolved Notes for CRS: * Update the template we have at Office wiki, as T345265 was i... 
[10:21:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2023.codfw.wmnet to cluster codfw and group A [10:21:40] (03CR) 10CI reject: [V: 04-1] Configure the spark3 defaults with the default yarn shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:23:06] (03PS2) 10Btullis: Configure the spark3 defaults with the default yarn shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [10:25:51] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:47] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43933/console" [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:29:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:11] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:56] (03PS2) 10Filippo Giunchedi: prometheus: add 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) [10:32:31] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) I'll take care of the CRS part, and follow up with SRE for the GitLab switch. 
[10:32:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [10:33:44] (03CR) 10Elukey: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/963964 (https://phabricator.wikimedia.org/T348315) (owner: 10Brouberol) [10:34:14] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43934/console" [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:34:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:38:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [10:39:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [10:41:24] (03CR) 10Brouberol: [C: 03+2] Install kafka-kit-prometheus-metricsfetcher on kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/963964 (https://phabricator.wikimedia.org/T348315) (owner: 10Brouberol) [10:43:42] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Fix duplicated nginx entry [puppet] - 10https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [10:44:53] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:46:36] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Adapt profile::nginx to new packaging 
scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (10MoritzMuehlenhoff) [10:49:27] (03Abandoned) 10Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [10:49:53] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:52:47] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [10:53:19] (03PS1) 10Slyngshede: Float footer to the window buttom. [software/bitu] - 10https://gerrit.wikimedia.org/r/963991 [10:54:23] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Float footer to the window buttom. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/963991 (owner: 10Slyngshede) [11:04:21] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:06] (03PS5) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) [11:10:29] (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [11:13:30] (03PS6) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) [11:15:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43936/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [11:20:37] (03PS7) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [11:22:11] (03PS28) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:22:13] (03PS3) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [11:25:31] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, 
unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:29:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:29:49] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:02] (03PS1) 10Muehlenhoff: Remove profile::ganeti::ganeti3 option [puppet] - 10https://gerrit.wikimedia.org/r/963995 [11:34:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52847 and previous config saved to /var/cache/conftool/dbconfig/20231006-113441-arnaudb.json [11:34:46] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:39:56] (03CR) 10Jbond: [V: 03+1] "updated to add the original task (T314776) where we discovered this issue" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [11:40:13] (03PS1) 10Muehlenhoff: Automatically restart parsoid-rt if it crashes [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) [11:42:53] (03PS1) 10Ayounsi: ganeti-test2004: add to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) [11:43:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963995 (owner: 10Muehlenhoff) [11:43:47] (03PS2) 10Ayounsi: ganeti-test2004: add to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) [11:46:44] (03CR) 10Muehlenhoff: ganeti-test2004: add to Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963998 
(https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [11:47:38] (03PS3) 10Ayounsi: ganeti-test2004: add to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) [11:48:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove profile::ganeti::ganeti3 option [puppet] - 10https://gerrit.wikimedia.org/r/963995 (owner: 10Muehlenhoff) [11:48:27] (03PS4) 10Ayounsi: ganeti-test2004: add to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) [11:48:44] (03CR) 10Ayounsi: "thx" [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [11:49:21] (03PS1) 10Jelto: admin: add amyt to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/964000 (https://phabricator.wikimedia.org/T344199) [11:49:22] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Jelto) p:05Triage→03Medium We have approval from manager and group owner already for `analytics-privatedata-users`. So we can proceed with adding `amyt` to `analytics-priv... 
[11:49:35] (03CR) 10Ayounsi: [C: 03+2] ganeti-test2004: add to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/963998 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [11:49:39] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Jelto) [11:49:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P52848 and previous config saved to /var/cache/conftool/dbconfig/20231006-114947-arnaudb.json [11:55:15] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye [11:55:35] (03Abandoned) 10Muehlenhoff: Remove profile::ganeti::ganeti3 setting [puppet] - 10https://gerrit.wikimedia.org/r/838172 (owner: 10Muehlenhoff) [11:55:45] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:19] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:06] !log rebalancing ganeti row D/eqiad [12:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:13] RECOVERY - Ganeti memory on ganeti1019 is OK: OK Memory 88% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [12:04:29] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P52850 and previous config saved to /var/cache/conftool/dbconfig/20231006-120454-arnaudb.json [12:09:58] (03CR) 10Jelto: [C: 03+2] 
gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [12:10:36] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:11:20] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:11:25] (03CR) 10Jbond: "i personally think (but perhaps im being to cautious) we should get another approval from the analytics group. The original request was f" [puppet] - 10https://gerrit.wikimedia.org/r/964000 (https://phabricator.wikimedia.org/T344199) (owner: 10Jelto) [12:12:31] (03Merged) 10jenkins-bot: gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [12:13:13] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti-test2004.codfw.wmnet with OS bullseye [12:13:59] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye [12:14:52] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:15:12] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:15:45] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:15:52] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:16:57] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:17:11] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:18:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on 
k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:18:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10jbond) [12:18:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10jbond) 05Open→03In progress p:05Triage→03Medium [12:18:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [12:20:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52851 and previous config saved to /var/cache/conftool/dbconfig/20231006-122000-arnaudb.json [12:20:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:20:06] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [12:20:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:20:20] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Jelto) Per discussion in https://gerrit.wikimedia.org/r/c/operations/puppet/+/964000/1#message-e8640992dffe84a96eb65107b8dcfbff5... 
[12:20:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52852 and previous config saved to /var/cache/conftool/dbconfig/20231006-122022-arnaudb.json [12:20:31] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Jelto) [12:22:28] (03PS4) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [12:23:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:25:20] (03PS5) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [12:25:57] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:16] (03PS1) 10Jbond: puppetdb-microservice: add puppetversion to listof allowed facts [puppet] - 10https://gerrit.wikimedia.org/r/964001 (https://phabricator.wikimedia.org/T348319) [12:27:48] (03CR) 10Jbond: [C: 03+2] puppetdb-microservice: add puppetversion to listof allowed facts [puppet] - 10https://gerrit.wikimedia.org/r/964001 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:29:44] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:29:52] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:30:31] PROBLEM - Check systemd state on restbase1030 is 
CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:17] (03Abandoned) 10Slyngshede: Replace the look with Wikimedia UI [software/bitu] - 10https://gerrit.wikimedia.org/r/941535 (owner: 10Ladsgroup) [12:33:45] (03PS1) 10Jelto: gitlab: install warning banner only on replicas when doing a restore [puppet] - 10https://gerrit.wikimedia.org/r/964003 (https://phabricator.wikimedia.org/T345531) [12:54:01] (03PS6) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [12:54:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:34] (03CR) 10CI reject: [V: 04-1] Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:55:12] (03PS1) 10Slyngshede: C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 [12:55:39] (03PS7) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [12:55:57] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43944/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [12:58:49] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [12:59:37] (03PS2) 10Slyngshede: C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 [13:00:15] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43946/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:00:41] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:45] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [13:01:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [13:02:46] (03CR) 10Muehlenhoff: [C: 03+1] C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:03:13] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti-test2004.codfw.wmnet with OS bullseye [13:03:59] (03PS3) 10Slyngshede: C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 [13:04:01] (03PS8) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 
(https://phabricator.wikimedia.org/T344910) [13:05:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43947/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:06:12] (03PS4) 10Slyngshede: C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 [13:07:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43949/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:10:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43950/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:12:27] (03PS5) 10Slyngshede: C:IDM Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 [13:13:16] (03PS1) 10Slyngshede: Avoid displaying empty LDAP values as an array [] [software/bitu] - 10https://gerrit.wikimedia.org/r/964006 [13:13:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43951/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:13:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Avoid displaying empty LDAP values as an array [] [software/bitu] - 10https://gerrit.wikimedia.org/r/964006 (owner: 10Slyngshede) [13:15:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43952/console" [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:15:11] (03PS1) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [13:15:39] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:IDM 
Enable email update in test [puppet] - 10https://gerrit.wikimedia.org/r/964004 (owner: 10Slyngshede) [13:17:42] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [13:17:43] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:17:49] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:18:04] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-master1004.eqiad.wmnet with reason: host reimage [13:18:49] (03PS29) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:18:51] (03PS9) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [13:18:53] (03PS1) 10Btullis: [WIP] Deploy multiple spark shufflers for yarn to production [puppet] - 10https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) [13:21:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-master1004.eqiad.wmnet with reason: host reimage [13:21:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [13:21:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [13:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:36] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43953/console" [puppet] - 10https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:26:02] (03CR) 10Btullis: Support multiple spark yarn shufflers in parallel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:26:15] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "ganeti-test2004 - ayounsi@cumin1001" [13:26:28] (03PS17) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:27:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "ganeti-test2004 - ayounsi@cumin1001" [13:28:16] (03CR) 10CI reject: [V: 04-1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:51] (03PS2) 10Btullis: [WIP] Deploy multiple spark shufflers for yarn to production [puppet] - 10https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) [13:34:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:35:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [13:38:31] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:41:24] (03PS30) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:41:26] (03PS10) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [13:49:19] (03CR) 10Elukey: [C: 03+2] modules: duplicate app:generic:1.0.0 to 1.0.1 to ease reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/963943 (owner: 10Elukey) [13:49:24] (03CR) 10Elukey: [C: 03+2] modules: add quotes to args rendered by app:generic [deployment-charts] - 10https://gerrit.wikimedia.org/r/963944 (owner: 10Elukey) [13:49:30] (03CR) 10Elukey: [C: 03+2] charts: update app.generic module for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/963945 (owner: 10Elukey) [13:52:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2054.codfw.wmnet with OS bullseye [13:52:10] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye [13:53:03] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:54:41] (03PS18) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:54:50] (03CR) 10JHathaway: [C: 03+2] prometheus-postgres-exporter: install configs before service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [13:55:09] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:55:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-master1004.eqiad.wmnet with OS bullseye [13:55:16] (03CR) 10JHathaway: [C: 03+2] puppet agent: protect against a missing client bucket path [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [13:55:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye completed:... [13:55:27] (03CR) 10JHathaway: [C: 03+2] "thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [13:55:47] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:49] (03PS3) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) [13:55:56] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43955/console" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:55:59] (03CR) 10CI reject: [V: 04-1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:56:42] (03PS2) 10Muehlenhoff: testreduce: Automatically restart parsoid-rt server/client and mariadb on failures [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) [13:59:11] (03CR) 10CI reject: [V: 04-1] testreduce: Automatically restart parsoid-rt server/client and mariadb on failures [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [14:00:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:11] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:33] (03CR) 10Subramanya Sastry: [C: 03+1] testreduce: 
Automatically restart parsoid-rt server/client and mariadb on failures [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [14:00:50] (03PS2) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [14:01:39] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:34] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:02:42] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:03:06] (03PS1) 10Jbond: config_master: add an installer directory [puppet] - 10https://gerrit.wikimedia.org/r/964011 (https://phabricator.wikimedia.org/T348319) [14:03:08] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:10:15] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm) @Papaul the image took on this one, but I am not progressing past this point. `Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagi... 
[14:11:54] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:57] (03PS1) 10Urbanecm: Growth: Enable Welcome survey user research for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964013 (https://phabricator.wikimedia.org/T342353) [14:13:39] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:51] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [14:19:13] (03CR) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:19:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:25] (03PS1) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [14:20:59] (03CR) 10CI reject: [V: 04-1] late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:21:10] (03PS3) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [14:21:41] 10SRE, 10Traffic, 10Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) After some consideration about what to rename as part of this effort, 
I think I am now in favour of not renaming anything for the simpl... [14:22:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-master1003.eqiad.wmnet with reason: host reimage [14:22:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [14:23:07] (03PS2) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [14:23:40] (03CR) 10jenkins-bot: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:23:45] (03PS4) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [14:24:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [14:25:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-master1003.eqiad.wmnet with reason: host reimage [14:26:03] (03PS3) 10Muehlenhoff: testreduce: Auto-restart parsoid-rt server/client and mariadb on failures [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) [14:26:17] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:26:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [14:27:39] (03CR) 10David Caro: [C: 03+1] P:toolforge: remove obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/963965 (owner: 10Majavah) [14:28:30] (03CR) 10Majavah: [C: 03+2] 
P:toolforge: remove obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/963965 (owner: 10Majavah) [14:28:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [14:29:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [14:29:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye [14:29:56] (03CR) 10Ssingh: [C: 03+1] "Thanks for the patch and sorry about the delay. Let's merge this next week!" [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [14:34:26] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [14:35:08] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:36:38] (03CR) 10Ilias Sarantopoulos: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:36:41] (03PS5) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [14:38:32] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:32] (03PS3) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 
10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [14:38:44] (03PS6) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [14:39:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [14:40:48] (03CR) 10Jbond: [C: 03+2] config_master: add an installer directory [puppet] - 10https://gerrit.wikimedia.org/r/964011 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:41:56] (03CR) 10Jbond: "Ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:42:01] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:42:20] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:42:29] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:44:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:44:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-master1003.eqiad.wmnet with OS bullseye [14:44:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye completed:... 
[14:48:09] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:21] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2054.codfw.wmnet with OS bullseye [14:54:07] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye executed with errors: - kubernetes2054 (**FAIL**) - Downtimed on... [14:54:58] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm) [49/50, retrying in 147.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title kubernetes2054 not found... 
[14:55:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED [14:55:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED [14:55:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [14:59:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [14:59:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:59:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [14:59:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [14:59:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - 
https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [15:01:04] (03Abandoned) 10Ebernhardson: k8s config: Include the cluster name in the exported configuration [puppet] - 10https://gerrit.wikimedia.org/r/961182 (owner: 10Ebernhardson) [15:06:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43958/console" [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:07:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:29] (03PS1) 10Elukey: team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 [15:10:42] (03CR) 10CI reject: [V: 04-1] team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 (owner: 10Elukey) [15:12:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:50] (03PS2) 10Elukey: team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 [15:14:02] (03CR) 10CI reject: [V: 04-1] team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 (owner: 10Elukey) [15:14:32] (03PS2) 10Hnowlan: thumbor: add imagemagick policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) [15:15:39] 
(03PS3) 10Elukey: team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 [15:18:31] (03CR) 10Klausman: [C: 03+1] team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 (owner: 10Elukey) [15:19:43] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43959/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:20:37] (03CR) 10Hnowlan: thumbor: add imagemagick policy file (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [15:21:26] (03PS2) 10Hnowlan: rest-gateway: only pass requests for knowledge-gap on wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/960575 (https://phabricator.wikimedia.org/T342213) [15:22:36] (03CR) 10JHathaway: late_command.sh: Add logic to rerad puppet version from config-master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [15:23:10] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: only pass requests for knowledge-gap on wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/960575 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [15:23:59] (03Merged) 10jenkins-bot: rest-gateway: only pass requests for knowledge-gap on wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/960575 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [15:24:15] (03PS31) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [15:24:17] (03PS11) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 
(https://phabricator.wikimedia.org/T344910) [15:24:19] (03PS41) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [15:26:56] (03CR) 10Hnowlan: [C: 03+2] svg: default to "en" when a language is not specified [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/962563 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [15:31:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [15:31:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)... [15:32:18] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 (owner: 10Elukey) [15:34:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jhancock.wm) db1229 has an image but is not passing part of the reimage process. here is the latest error. provision script was run again before this run and bios was checked.... 
[15:35:33] (03Merged) 10jenkins-bot: svg: default to "en" when a language is not specified [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/962563 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [15:41:03] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:46:03] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:53:29] (03CR) 10Btullis: [C: 03+1] "Adding ottomata as a reviewer, since gmodena is still out for a little while." [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [15:54:45] (03CR) 10Btullis: [C: 03+1] airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:55:15] (03CR) 10Btullis: [C: 03+1] druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [16:01:02] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Papaul) @Eevans hello when do you think it will be the best day for us to coordinate with you on relocating this node so that we are not block by it during the codfw switch migration from VC to VXLAN/EVPN? Th... 
[16:01:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:02] (03CR) 10Clément Goubert: [C: 03+1] team-sre: improve k8s high api latency monitor [alerts] - 10https://gerrit.wikimedia.org/r/964025 (owner: 10Elukey) [16:06:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:09] (03PS1) 10JHathaway: dev env: move hiera configs into roles [puppet] - 10https://gerrit.wikimedia.org/r/964034 (https://phabricator.wikimedia.org/T337970) [16:09:18] (03CR) 10Herron: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [16:10:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964034 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:10:43] (03CR) 10CI reject: [V: 04-1] dev env: move hiera configs into roles [puppet] - 10https://gerrit.wikimedia.org/r/964034 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:13:37] (03PS2) 10JHathaway: dev env: move hiera configs into roles [puppet] - 10https://gerrit.wikimedia.org/r/964034 (https://phabricator.wikimedia.org/T337970) [16:13:37] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:19:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:19:11] 
!log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1063.eqiad.wmnet with OS bullseye [16:22:15] (03PS1) 10Ayounsi: Change ganeti-test2004's role to ganeti_test [puppet] - 10https://gerrit.wikimedia.org/r/964036 (https://phabricator.wikimedia.org/T345602) [16:26:36] (03CR) 10Peter Fischer: "LGTM, thank you! Double-checked but as David said, kafka-test might not be the right choice unless we want manually produce events." [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:27:45] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:28:29] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:28:54] (03CR) 10JHathaway: [C: 03+2] dev env: move hiera configs into roles [puppet] - 10https://gerrit.wikimedia.org/r/964034 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:34:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:37:22] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:40:31] (03PS18) 10Ebernhardson: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T326328) [16:40:33] (03CR) 10Ebernhardson: cirrus streaming updater service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T326328) (owner: 10Ebernhardson) [16:41:28] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:46:18] (03PS1) 10Andrew Bogott: Remove file for nova-placement 
service [puppet] - 10https://gerrit.wikimedia.org/r/964040 [16:46:20] (03PS1) 10Andrew Bogott: Update nova init scripts from Antelope packages [puppet] - 10https://gerrit.wikimedia.org/r/964041 [16:49:49] (03CR) 10Andrew Bogott: [C: 03+2] Remove file for nova-placement service [puppet] - 10https://gerrit.wikimedia.org/r/964040 (owner: 10Andrew Bogott) [16:50:04] (03CR) 10Andrew Bogott: [C: 03+2] Update nova init scripts from Antelope packages [puppet] - 10https://gerrit.wikimedia.org/r/964041 (owner: 10Andrew Bogott) [16:54:53] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [16:58:21] (03PS1) 10Hnowlan: rest-gateway: route edit-,editor- and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/964044 (https://phabricator.wikimedia.org/T336391) [16:58:29] (03PS1) 10Andrew Bogott: placement-api: update init script for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/964045 [16:59:04] (03CR) 10Andrew Bogott: [C: 03+2] placement-api: update init script for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/964045 (owner: 10Andrew Bogott) [16:59:06] (03CR) 10CI reject: [V: 04-1] rest-gateway: route edit-,editor- and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/964044 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [17:01:46] (03PS2) 10Hnowlan: rest-gateway: route edit-,editor- and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/964044 (https://phabricator.wikimedia.org/T336391) [17:02:03] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [17:02:17] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100'] [17:03:50] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [17:03:58] !log vriley@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100'] [17:05:13] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [17:05:19] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100'] [17:05:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [17:08:04] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [17:08:11] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [17:08:17] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:10:39] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [17:10:55] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100'] [17:10:58] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 Gb - https://phabricator.wikimedia.org/T191804 (10Bawolff) @MatthewVernon I think you're the right person to ask. With work being done to make MediaWiki no longer be limited to 4GB files, I was wondering what SR... 
[17:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:14:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964036 (https://phabricator.wikimedia.org/T345602) (owner: 10Ayounsi) [17:39:39] (03CR) 10Peter Fischer: [C: 03+1] cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T326328) (owner: 10Ebernhardson) [17:49:18] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Tgr) [18:30:35] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1101 [18:30:38] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1101 [18:31:38] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [18:31:42] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:42] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [18:42:36] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [18:42:56] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3b7df78]: Update 
rdf-spark-tools to 0.3.135 to fix query mapping job failure [18:43:26] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3b7df78]: Update rdf-spark-tools to 0.3.135 to fix query mapping job failure (duration: 00m 29s) [18:53:32] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:11] (03PS2) 10Ryan Kemper: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:00:19] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:17:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:22:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:22:45] (03PS1) 10Bking: wikikube: prepare new cirrus-streaming-updater service [puppet] - 10https://gerrit.wikimedia.org/r/964069 (https://phabricator.wikimedia.org/T347075) [19:29:10] (03PS1) 10Bking: cirrus: create new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/964071 (https://phabricator.wikimedia.org/T347075) [19:30:03] (03PS2) 10Bking: cirrus: 
create new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/964071 (https://phabricator.wikimedia.org/T347075) [19:31:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:20] (03CR) 10Ryan Kemper: [C: 03+1] cirrus: create new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/964071 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:33:25] (03CR) 10Bking: [C: 03+2] wikikube: prepare new cirrus-streaming-updater service [puppet] - 10https://gerrit.wikimedia.org/r/964069 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:33:27] (03CR) 10Ryan Kemper: [C: 03+1] wikikube: prepare new cirrus-streaming-updater service [puppet] - 10https://gerrit.wikimedia.org/r/964069 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:36:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:06] (03CR) 10Bking: [C: 03+2] cirrus: create new namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/964071 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:39:49] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:40:57] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:41:20] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[19:42:30] (03PS1) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [19:42:57] (03CR) 10CI reject: [V: 04-1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [19:43:05] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:43:18] (03PS2) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [19:43:44] (03CR) 10CI reject: [V: 04-1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [19:43:57] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:44:39] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:45:00] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:46:29] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10Peachey88) [19:46:33] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[19:53:07] (03PS3) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [19:53:34] (03CR) 10CI reject: [V: 04-1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [19:56:46] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [19:59:32] (03PS4) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [19:59:58] (03CR) 10CI reject: [V: 04-1] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [20:01:39] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [20:05:59] (03CR) 10Bking: [C: 03+2] cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T326328) (owner: 10Ebernhardson) [20:10:45] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:11:24] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:14:54] (03PS5) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [20:15:59] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [20:21:52] (03PS1) 10Ebernhardson: cirrus-streaming-updater: Update swift 
account to match prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/964081 [20:26:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:28:19] (03CR) 10Bking: [C: 03+2] cirrus-streaming-updater: Update swift account to match prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/964081 (owner: 10Ebernhardson) [20:29:18] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:29:35] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:31:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:32:17] (03PS1) 10Andrew Bogott: Horizon: new version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/964083 [20:33:03] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: new version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/964083 (owner: 10Andrew Bogott) [20:34:56] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:35:19] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:42:01] (03PS1) 10Andrew Bogott: codfw1dev Horizon: update docker version again [puppet] - 10https://gerrit.wikimedia.org/r/964087 [20:42:52] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev Horizon: update docker version again [puppet] - 10https://gerrit.wikimedia.org/r/964087 (owner: 10Andrew Bogott) [20:42:56] (03PS1) 10Ebernhardson: cirrus-streaming-updater: Set job manager replicas to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/964088 [20:43:22] (03CR) 10Bking: [C: 03+2] cirrus-streaming-updater: Set job manager replicas to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/964088 
(owner: 10Ebernhardson) [20:44:04] (03Merged) 10jenkins-bot: cirrus-streaming-updater: Set job manager replicas to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/964088 (owner: 10Ebernhardson) [20:45:10] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10KStoller-WMF) p:05Triage→03High [20:45:14] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:45:23] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:48:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [20:50:22] (03PS1) 10Andrew Bogott: Horizon: new version for eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/964090 (https://phabricator.wikimedia.org/T347927) [20:51:15] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: new version for eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/964090 (https://phabricator.wikimedia.org/T347927) (owner: 10Andrew Bogott) [21:32:41] 10SRE, 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/4 Draft: Implement Gitlab CI and Blubber config [21:39:38] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:39:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:57:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2168:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52855 and previous config saved to /var/cache/conftool/dbconfig/20231006-215725-arnaudb.json [21:57:30] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:12:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P52856 and previous config saved to /var/cache/conftool/dbconfig/20231006-221232-arnaudb.json [22:12:54] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Papaul) @Jhancock.wm when i was setting the other kubernetes node i had the line below for all the nodes, bit for some reason that line was replace with ` 1603 node /^kubernetes20(0[5-9]|[1-4][0-9]|5[01235... [22:19:39] (03PS1) 10Papaul: Add kubernetes2054 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/964094 (https://phabricator.wikimedia.org/T345650) [22:21:04] (03CR) 10Papaul: [C: 03+2] Add kubernetes2054 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/964094 (https://phabricator.wikimedia.org/T345650) (owner: 10Papaul) [22:26:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2054.codfw.wmnet with OS bullseye [22:26:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye [22:27:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P52857 and previous config saved to /var/cache/conftool/dbconfig/20231006-222738-arnaudb.json [22:28:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [22:42:45] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T343198)', diff saved to https://phabricator.wikimedia.org/P52858 and previous config saved to /var/cache/conftool/dbconfig/20231006-224245-arnaudb.json [22:42:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [22:42:49] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:43:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [22:43:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T343198)', diff saved to https://phabricator.wikimedia.org/P52859 and previous config saved to /var/cache/conftool/dbconfig/20231006-224306-arnaudb.json [22:47:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2054.codfw.wmnet with reason: host reimage [22:50:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2054.codfw.wmnet with reason: host reimage [22:53:32] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:03:32] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:04:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:04:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2054.codfw.wmnet with OS bullseye [23:04:39] 10SRE, 
10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye completed: - kubernetes2054 (**WARN*... [23:05:06] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Papaul) [23:06:50] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Papaul) 05Open→03Resolved a:03Papaul @akosiaris this is ready for service. Please note this is the first time we are putting a Supermicro server in production any feedback will be great . Thanks. [23:40:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:53:08] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1