[00:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:25:15] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:27:46] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:39:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976959 [00:39:06] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976959 (owner: TrainBranchBot) [00:45:38] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/976959 (owner: TrainBranchBot) [01:14:22] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:15:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:35:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:24] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:46] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:08:24] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM 
workers for Mediawiki mw-api-int at codfw: 47.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:05:12] (LVSHighRX) firing: Excessive RX traffic on lvs2013:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [04:10:12] (LVSHighRX) resolved: Excessive RX traffic on lvs2013:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [04:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:25:15] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:59:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.12% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:04:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:16:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 42.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 41.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:35:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 43.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:40:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 43.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:08:06] (PS1) Marostegui: apt_repo.yaml: Do not reimage db1242 [puppet] - https://gerrit.wikimedia.org/r/977119 [06:12:40] (CR) Marostegui: [C: +2] apt_repo.yaml: Do not reimage db1242 [puppet] - https://gerrit.wikimedia.org/r/977119 (owner: Marostegui) [06:14:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2003.codfw.wmnet with OS bookworm [06:15:28] (CR) Marostegui: [C: +1] mariadb: replace db1147 by db1247 on s4 [puppet] - https://gerrit.wikimedia.org/r/976956 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb) [06:24:24] (PS1) Marostegui: control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/977120 (https://phabricator.wikimedia.org/T351283) [06:25:45] (CR) Marostegui: [C: +2] control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/977120 (https://phabricator.wikimedia.org/T351283) (owner: Marostegui) [06:26:18] (Merged) jenkins-bot: 
control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/977120 (https://phabricator.wikimedia.org/T351283) (owner: Marostegui) [06:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:30:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2003.codfw.wmnet with reason: host reimage [06:31:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122', diff saved to https://phabricator.wikimedia.org/P53805 and previous config saved to /var/cache/conftool/dbconfig/20231124-063152-root.json [06:33:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2003.codfw.wmnet with reason: host reimage [06:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P53806 and previous config saved to /var/cache/conftool/dbconfig/20231124-063424-root.json [06:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53807 and previous config saved to /var/cache/conftool/dbconfig/20231124-063450-root.json [06:37:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53808 and previous config saved to /var/cache/conftool/dbconfig/20231124-063723-root.json [06:43:15] (CR) Vgutierrez: [C: +2] interface::manual: Fix absenting [puppet] - https://gerrit.wikimedia.org/r/977046 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [06:43:26] (CR) Vgutierrez: [C: +2] interface::ipip: Fix absenting [puppet] - https://gerrit.wikimedia.org/r/977056 
(https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [06:49:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53809 and previous config saved to /var/cache/conftool/dbconfig/20231124-064955-root.json [06:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53810 and previous config saved to /var/cache/conftool/dbconfig/20231124-065228-root.json [06:52:51] (PS1) Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/976960 (https://phabricator.wikimedia.org/T343123) [06:53:12] (PS1) Physikerwelt: Enable native MathML rendering on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/977121 (https://phabricator.wikimedia.org/T350787) [06:55:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2003.codfw.wmnet with OS bookworm [06:55:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2002.codfw.wmnet with OS bookworm [06:56:14] (PS2) Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/976960 (https://phabricator.wikimedia.org/T343123) [06:59:40] (PS1) Marostegui: oathauth_users: Prepare for removal [puppet] - https://gerrit.wikimedia.org/r/977123 (https://phabricator.wikimedia.org/T348693) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231124T0700) [07:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53811 and previous config saved to /var/cache/conftool/dbconfig/20231124-070500-root.json [07:06:46] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:07:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53812 and previous config saved to /var/cache/conftool/dbconfig/20231124-070733-root.json [07:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:11:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2002.codfw.wmnet with reason: host reimage [07:15:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2002.codfw.wmnet with reason: host reimage [07:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53813 and previous config saved to /var/cache/conftool/dbconfig/20231124-072005-root.json [07:22:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Upgrade to 10.6.16', diff saved to 
https://phabricator.wikimedia.org/P53814 and previous config saved to /var/cache/conftool/dbconfig/20231124-072238-root.json [07:32:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2001.codfw.wmnet with OS bookworm [07:32:59] PROBLEM - Host es2032 #page is DOWN: PING CRITICAL - Packet loss = 100% [07:33:06] arnaudb: ^ [07:33:12] uh :) [07:33:14] it is me [07:33:18] downtiming, sorry [07:34:19] RECOVERY - Host es2032 #page is UP: PING OK - Packet loss = 0%, RTA = 53.23 ms [07:34:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:34:21] PROBLEM - mysqld processes #page on es2032 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:34:33] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:34:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:34:46] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53815 and previous config saved to /var/cache/conftool/dbconfig/20231124-073510-root.json [07:35:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:35:15] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: reboot [07:35:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2002.codfw.wmnet with OS bookworm [07:35:32] weird, it returns the task id is not valid oO [07:35:53] (CR) 
Muehlenhoff: [C: +1] "Looks good" [alerts] - https://gerrit.wikimedia.org/r/977076 (owner: Majavah) [07:37:05] arnaudb: cause it is a private task [07:37:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Upgrade to 10.6.16', diff saved to https://phabricator.wikimedia.org/P53816 and previous config saved to /var/cache/conftool/dbconfig/20231124-073743-root.json [07:38:21] RECOVERY - mysqld processes #page on es2032 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:38:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53817 and previous config saved to /var/cache/conftool/dbconfig/20231124-073838-arnaudb.json [07:47:36] (PS1) Muehlenhoff: pki: Add new intermediate for ganeti-rapi [puppet] - https://gerrit.wikimedia.org/r/977129 (https://phabricator.wikimedia.org/T350686) [07:48:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2001.codfw.wmnet with reason: host reimage [07:50:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2001.codfw.wmnet with reason: host reimage [07:51:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'repool API on db2181', diff saved to https://phabricator.wikimedia.org/P53818 and previous config saved to /var/cache/conftool/dbconfig/20231124-075137-arnaudb.json [07:53:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53819 and previous config saved to /var/cache/conftool/dbconfig/20231124-075343-arnaudb.json [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231124T0800) [08:02:46] (PS1) Muehlenhoff: Add new Hiera option to be used to selectively test defs_from_etcd on nftables [puppet] - https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) [08:08:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff) [08:08:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53820 and previous config saved to /var/cache/conftool/dbconfig/20231124-080848-arnaudb.json [08:11:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2001.codfw.wmnet with OS bookworm [08:11:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:13:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2181 and db2193 to fix their config, will repool them asap', diff saved to https://phabricator.wikimedia.org/P53821 and previous config saved to /var/cache/conftool/dbconfig/20231124-081304-arnaudb.json [08:14:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2180 and db2193 to fix their config, will repool them asap', diff saved to https://phabricator.wikimedia.org/P53822 and previous config saved to /var/cache/conftool/dbconfig/20231124-081422-arnaudb.json [08:15:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 20%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53823 and previous config saved to 
/var/cache/conftool/dbconfig/20231124-081541-arnaudb.json [08:16:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53824 and previous config saved to /var/cache/conftool/dbconfig/20231124-081601-arnaudb.json [08:16:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53825 and previous config saved to /var/cache/conftool/dbconfig/20231124-081619-arnaudb.json [08:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:21:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53826 and previous config saved to /var/cache/conftool/dbconfig/20231124-082115-arnaudb.json [08:22:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53827 and previous config saved to /var/cache/conftool/dbconfig/20231124-082208-arnaudb.json [08:23:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53828 and previous config saved to /var/cache/conftool/dbconfig/20231124-082353-arnaudb.json [08:24:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 
(re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53829 and previous config saved to /var/cache/conftool/dbconfig/20231124-082436-arnaudb.json [08:25:15] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:25:45] (CR) Majavah: [C: +2] team-sre: add alert for unsigned puppet certificates [alerts] - https://gerrit.wikimedia.org/r/977076 (owner: Majavah) [08:27:33] (Merged) jenkins-bot: team-sre: add alert for unsigned puppet certificates [alerts] - https://gerrit.wikimedia.org/r/977076 (owner: Majavah) [08:30:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 40%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53830 and previous config saved to /var/cache/conftool/dbconfig/20231124-083046-arnaudb.json [08:30:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53831 and previous config saved to /var/cache/conftool/dbconfig/20231124-083049-arnaudb.json [08:31:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53832 and previous config saved to /var/cache/conftool/dbconfig/20231124-083106-arnaudb.json [08:31:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53833 and previous config saved to /var/cache/conftool/dbconfig/20231124-083124-arnaudb.json [08:31:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53834 and previous config saved to 
/var/cache/conftool/dbconfig/20231124-083147-arnaudb.json [08:36:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:36:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53835 and previous config saved to /var/cache/conftool/dbconfig/20231124-083620-arnaudb.json [08:37:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53836 and previous config saved to /var/cache/conftool/dbconfig/20231124-083713-arnaudb.json [08:38:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53837 and previous config saved to /var/cache/conftool/dbconfig/20231124-083858-arnaudb.json [08:39:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53838 and previous config saved to /var/cache/conftool/dbconfig/20231124-083941-arnaudb.json [08:41:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:43:43] (CR) Arnaudb: [C: +2] mariadb: replace db1147 by db1247 on s4 [puppet] - https://gerrit.wikimedia.org/r/976956 (https://phabricator.wikimedia.org/T344036) (owner: 
Arnaudb) [08:45:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 60%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53839 and previous config saved to /var/cache/conftool/dbconfig/20231124-084551-arnaudb.json [08:45:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53840 and previous config saved to /var/cache/conftool/dbconfig/20231124-084554-arnaudb.json [08:46:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53841 and previous config saved to /var/cache/conftool/dbconfig/20231124-084611-arnaudb.json [08:46:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53842 and previous config saved to /var/cache/conftool/dbconfig/20231124-084629-arnaudb.json [08:46:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53843 and previous config saved to /var/cache/conftool/dbconfig/20231124-084652-arnaudb.json [08:47:10] (PS1) Arnaudb: mariadb: toggle notifications on cloned hosts [puppet] - https://gerrit.wikimedia.org/r/976962 (https://phabricator.wikimedia.org/T343674) [08:51:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53844 and previous config saved to /var/cache/conftool/dbconfig/20231124-085125-arnaudb.json [08:52:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53845 and previous config saved to /var/cache/conftool/dbconfig/20231124-085218-arnaudb.json [08:54:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 
'es2032 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53846 and previous config saved to /var/cache/conftool/dbconfig/20231124-085403-arnaudb.json [08:54:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53847 and previous config saved to /var/cache/conftool/dbconfig/20231124-085446-arnaudb.json [09:00:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 80%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53848 and previous config saved to /var/cache/conftool/dbconfig/20231124-090056-arnaudb.json [09:01:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53849 and previous config saved to /var/cache/conftool/dbconfig/20231124-090059-arnaudb.json [09:01:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 80%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53850 and previous config saved to /var/cache/conftool/dbconfig/20231124-090116-arnaudb.json [09:01:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 80%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53851 and previous config saved to /var/cache/conftool/dbconfig/20231124-090134-arnaudb.json [09:01:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53852 and previous config saved to /var/cache/conftool/dbconfig/20231124-090157-arnaudb.json [09:06:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53853 and previous config saved to /var/cache/conftool/dbconfig/20231124-090630-arnaudb.json [09:07:24] !log arnaudb@cumin1001 dbctl commit 
(dc=all): 'db2189 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53854 and previous config saved to /var/cache/conftool/dbconfig/20231124-090723-arnaudb.json [09:09:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 35%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53855 and previous config saved to /var/cache/conftool/dbconfig/20231124-090908-arnaudb.json [09:09:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53856 and previous config saved to /var/cache/conftool/dbconfig/20231124-090951-arnaudb.json [09:13:12] (LVSHighRX) firing: Excessive RX traffic on lvs2013:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [09:16:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53857 and previous config saved to /var/cache/conftool/dbconfig/20231124-091601-arnaudb.json [09:16:02] here, looking [09:16:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53858 and previous config saved to /var/cache/conftool/dbconfig/20231124-091604-arnaudb.json [09:16:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53859 and previous config saved to /var/cache/conftool/dbconfig/20231124-091621-arnaudb.json [09:16:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Post reconfig repooling', diff saved to https://phabricator.wikimedia.org/P53860 and previous config saved to /var/cache/conftool/dbconfig/20231124-091639-arnaudb.json 
[09:16:54] (03PS1) 10Majavah: puppetserver: use deploy script on initial setup [puppet] - 10https://gerrit.wikimedia.org/r/977140 [09:17:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53861 and previous config saved to /var/cache/conftool/dbconfig/20231124-091702-arnaudb.json [09:18:12] (LVSHighRX) resolved: Excessive RX traffic on lvs2013:9100 (eno12399np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [09:21:14] (03PS2) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [09:21:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53862 and previous config saved to /var/cache/conftool/dbconfig/20231124-092135-arnaudb.json [09:22:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53864 and previous config saved to /var/cache/conftool/dbconfig/20231124-092228-arnaudb.json [09:24:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53865 and previous config saved to /var/cache/conftool/dbconfig/20231124-092413-arnaudb.json [09:24:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53866 and previous config saved to /var/cache/conftool/dbconfig/20231124-092456-arnaudb.json [09:27:00] (03CR) 10Elukey: [C: 03+1] "We can test, I think that the memory value will need some adjusting, but we can do some tests manually after the deploy to 
staging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/976960 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [09:27:29] (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications on cloned hosts [puppet] - 10https://gerrit.wikimedia.org/r/976962 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:30:41] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications on cloned hosts [puppet] - 10https://gerrit.wikimedia.org/r/976962 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:31:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53867 and previous config saved to /var/cache/conftool/dbconfig/20231124-093109-arnaudb.json [09:32:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53868 and previous config saved to /var/cache/conftool/dbconfig/20231124-093207-arnaudb.json [09:36:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53869 and previous config saved to /var/cache/conftool/dbconfig/20231124-093640-arnaudb.json [09:36:46] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw2421.codfw.wmnet with OS bullseye [09:37:25] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw2425.codfw.wmnet with OS bullseye [09:37:29] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS bullseye [09:37:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53870 and previous config saved to /var/cache/conftool/dbconfig/20231124-093733-arnaudb.json [09:37:56] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw1472.eqiad.wmnet with OS bullseye [09:38:14] 
!log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw1473.eqiad.wmnet with OS bullseye [09:39:12] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw1474.eqiad.wmnet with OS bullseye [09:39:18] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host mw1475.eqiad.wmnet with OS bullseye [09:39:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53871 and previous config saved to /var/cache/conftool/dbconfig/20231124-093918-arnaudb.json [09:40:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53872 and previous config saved to /var/cache/conftool/dbconfig/20231124-094001-arnaudb.json [09:42:05] (PS1) Muehlenhoff: Add support to write out blocked networks from requestctl [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) [09:43:14] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/976960 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira) [09:43:39] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (ayounsi) Wow, nice to see all the comments ! First this proposal is not about replacing the "ProvisionServerNetwork" Netbox script, so it makes sens to me to keep impro...
[09:44:21] (Merged) jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/976960 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira) [09:46:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53873 and previous config saved to /var/cache/conftool/dbconfig/20231124-094614-arnaudb.json [09:47:11] (PS3) Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [09:47:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53874 and previous config saved to /var/cache/conftool/dbconfig/20231124-094713-arnaudb.json [09:47:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff) [09:49:29] (PS1) Muehlenhoff: Enable requestctl-driven network blocks for sretest [puppet] - https://gerrit.wikimedia.org/r/977166 (https://phabricator.wikimedia.org/T348734) [09:50:14] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1472.eqiad.wmnet with reason: host reimage [09:50:38] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:50:49] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1473.eqiad.wmnet with reason: host reimage [09:51:12] (03PS2) 10Muehlenhoff: Add new Hiera option to be used to selectively test defs_from_etcd on nftables [puppet] - 10https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) [09:51:33] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1474.eqiad.wmnet with reason: host reimage [09:51:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53875 and previous config saved to /var/cache/conftool/dbconfig/20231124-095145-arnaudb.json [09:51:49] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage [09:52:08] (03CR) 10Jbond: "AFAIK we just need tls termination not mTLS. As such i think we ca just us the discovery CI?" [puppet] - 10https://gerrit.wikimedia.org/r/977129 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [09:52:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53876 and previous config saved to /var/cache/conftool/dbconfig/20231124-095238-arnaudb.json [09:53:02] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1473.eqiad.wmnet with reason: host reimage [09:53:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1472.eqiad.wmnet with reason: host reimage [09:53:29] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:54:09] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2421.codfw.wmnet 
with reason: host reimage [09:54:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53877 and previous config saved to /var/cache/conftool/dbconfig/20231124-095423-arnaudb.json [09:54:32] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage [09:54:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage [09:55:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53878 and previous config saved to /var/cache/conftool/dbconfig/20231124-095508-arnaudb.json [09:55:44] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:55:46] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage [09:56:10] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage [09:56:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [09:58:47] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1474.eqiad.wmnet with reason: host reimage [10:00:51] (03CR) 10Jbond: [C: 04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/977140 (owner: 10Majavah) [10:01:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53879 and previous config saved to 
/var/cache/conftool/dbconfig/20231124-100120-arnaudb.json [10:01:45] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage [10:02:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53880 and previous config saved to /var/cache/conftool/dbconfig/20231124-100218-arnaudb.json [10:04:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2421.codfw.wmnet with reason: host reimage [10:06:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53881 and previous config saved to /var/cache/conftool/dbconfig/20231124-100650-arnaudb.json [10:07:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53882 and previous config saved to /var/cache/conftool/dbconfig/20231124-100743-arnaudb.json [10:09:25] PROBLEM - Host mw2425 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53883 and previous config saved to /var/cache/conftool/dbconfig/20231124-100928-arnaudb.json [10:10:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53884 and previous config saved to /var/cache/conftool/dbconfig/20231124-101013-arnaudb.json [10:10:33] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/977049 (owner: 10Majavah) [10:10:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1473.eqiad.wmnet with OS bullseye [10:11:10] !log jayme@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host mw1472.eqiad.wmnet with OS bullseye [10:11:34] (KubernetesCalicoDown) firing: mw2425.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2425.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:12:14] (03PS4) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [10:12:39] RECOVERY - Host mw2425 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [10:15:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2425.codfw.wmnet with OS bullseye [10:16:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1475.eqiad.wmnet with OS bullseye [10:16:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53885 and previous config saved to /var/cache/conftool/dbconfig/20231124-101625-arnaudb.json [10:16:34] (KubernetesCalicoDown) resolved: mw2425.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2425.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:16:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1474.eqiad.wmnet with OS bullseye [10:17:18] (03PS5) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [10:17:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 80%: Post warmup repooling', diff saved to 
https://phabricator.wikimedia.org/P53886 and previous config saved to /var/cache/conftool/dbconfig/20231124-101722-arnaudb.json [10:20:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2431.codfw.wmnet with OS bullseye [10:20:38] (03PS6) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [10:21:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53887 and previous config saved to /var/cache/conftool/dbconfig/20231124-102155-arnaudb.json [10:22:29] (03CR) 10JMeybohm: [C: 03+2] Move mw appservers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [10:22:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2421.codfw.wmnet with OS bullseye [10:22:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53888 and previous config saved to /var/cache/conftool/dbconfig/20231124-102248-arnaudb.json [10:23:03] (03PS7) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [10:23:08] (03Merged) 10jenkins-bot: Move mw appservers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [10:24:17] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/673/con" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:24:33] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53889 and previous config saved to /var/cache/conftool/dbconfig/20231124-102433-arnaudb.json [10:25:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53890 and previous config saved to /var/cache/conftool/dbconfig/20231124-102518-arnaudb.json [10:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53891 and previous config saved to /var/cache/conftool/dbconfig/20231124-103130-arnaudb.json [10:32:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53892 and previous config saved to /var/cache/conftool/dbconfig/20231124-103228-arnaudb.json [10:37:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2188 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53893 and previous config saved to /var/cache/conftool/dbconfig/20231124-103700-arnaudb.json [10:37:51] (03CR) 10Jbond: [C: 03+1] "LGTM minor suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [10:37:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53894 and previous config saved to /var/cache/conftool/dbconfig/20231124-103753-arnaudb.json [10:39:25] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply 
[10:39:28] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:39:37] (03CR) 10JMeybohm: [C: 03+2] api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:39:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53895 and previous config saved to /var/cache/conftool/dbconfig/20231124-103938-arnaudb.json [10:40:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53896 and previous config saved to /var/cache/conftool/dbconfig/20231124-104023-arnaudb.json [10:40:44] (03Merged) 10jenkins-bot: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:42:52] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [10:43:08] (03CR) 10Jbond: [C: 03+1] "lgtm minor suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [10:45:47] (03CR) 10JMeybohm: [C: 03+1] "Does not look ridiculous ;-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977099 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:46:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53897 and previous config saved to /var/cache/conftool/dbconfig/20231124-104635-arnaudb.json [10:46:59] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:47:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: Post warmup repooling', diff 
saved to https://phabricator.wikimedia.org/P53898 and previous config saved to /var/cache/conftool/dbconfig/20231124-104733-arnaudb.json [10:48:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/674/con" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [10:49:07] (03CR) 10Jbond: [C: 04-1] "could looks good to me but the type checking seems a bit off" [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [10:49:59] (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate first job to Kubernetes jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/977099 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:50:58] (03Merged) 10jenkins-bot: jobqueue: migrate first job to Kubernetes jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/977099 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:51:48] (03CR) 10Btullis: "It's not quite right yet, because the pcc run shows that the filename of the wrapper script isn't added to core-site.xml:" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [10:53:04] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:53:21] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:54:37] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [10:54:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53899 and previous config saved to /var/cache/conftool/dbconfig/20231124-105443-arnaudb.json [10:55:18] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [10:58:30] (03PS1) 10Muehlenhoff: openstack::base::wikitech::web: Avoid Ferm-specific syntax [puppet] - 
https://gerrit.wikimedia.org/r/977170 [11:00:59] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:02:18] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:02:30] (PS1) JMeybohm: Revert "api-gateway,rest-gateway: Switch to cert-manager certificates" [deployment-charts] - https://gerrit.wikimedia.org/r/977163 [11:02:53] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:03:52] (CR) JMeybohm: [C: +2] Revert "api-gateway,rest-gateway: Switch to cert-manager certificates" [deployment-charts] - https://gerrit.wikimedia.org/r/977163 (owner: JMeybohm) [11:04:41] (Merged) jenkins-bot: Revert "api-gateway,rest-gateway: Switch to cert-manager certificates" [deployment-charts] - https://gerrit.wikimedia.org/r/977163 (owner: JMeybohm) [11:05:47] (CR) Jbond: [C: -1] "-1: perhaps i misunderstood but it seems like the not all failures are captured."
[puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [11:08:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [11:09:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53900 and previous config saved to /var/cache/conftool/dbconfig/20231124-110948-arnaudb.json [11:11:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'set es2032 back as es1 master for T344589', diff saved to https://phabricator.wikimedia.org/P53901 and previous config saved to /var/cache/conftool/dbconfig/20231124-111109-arnaudb.json [11:17:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53902 and previous config saved to /var/cache/conftool/dbconfig/20231124-111715-arnaudb.json [11:17:47] (03CR) 10Majavah: "I'm fairly sure we can just drop this term entirely, traffic from the caches to wikitech is now encrypted and the Envoy profile manages th" [puppet] - 10https://gerrit.wikimedia.org/r/977170 (owner: 10Muehlenhoff) [11:32:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53903 and previous config saved to /var/cache/conftool/dbconfig/20231124-113220-arnaudb.json [11:36:48] 10SRE-swift-storage, 10Commons, 10Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (10MatthewVernon) p:05Triage→03High [11:47:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53904 and previous config saved to /var/cache/conftool/dbconfig/20231124-114725-arnaudb.json [11:48:30] 
(PS1) Muehlenhoff: ganeti: Add option to enable PKI-based RAPI cert [puppet] - https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) [11:51:14] (PS1) Slyngshede: Test commit [puppet] - https://gerrit.wikimedia.org/r/977176 [11:53:21] (PS2) Muehlenhoff: ganeti: Add option to enable PKI-based RAPI cert [puppet] - https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) [11:54:53] SRE-swift-storage, Commons, Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) ...but profile::tlsproxy::envoy doesn't have that configuation available as far as I can see... [11:58:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff) [12:02:11] SRE-swift-storage, Commons, Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) Let's add it as an optional parameter, and try and pass it through.
[12:02:17] (PS2) Muehlenhoff: Add support to write out blocked networks from requestctl [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) [12:02:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53905 and previous config saved to /var/cache/conftool/dbconfig/20231124-120230-arnaudb.json [12:05:06] (PS1) Majavah: base: puppet_env_ps1: only try to use color when possible [puppet] - https://gerrit.wikimedia.org/r/977177 [12:08:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff) [12:11:54] (PS1) MVernon: tlsproxy::envoy allow setting stream_idle_timeout, 180s for swift [puppet] - https://gerrit.wikimedia.org/r/977178 (https://phabricator.wikimedia.org/T351876) [12:12:49] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977178 (https://phabricator.wikimedia.org/T351876) (owner: MVernon) [12:15:09] (PS1) Muehlenhoff: analytics::postgresql: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/977181 [12:17:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53906 and previous config saved to /var/cache/conftool/dbconfig/20231124-121735-arnaudb.json [12:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:21:16] (PS2) KartikMistry: Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458) [12:21:18] (PS1) KartikMistry: Update Apertium to 2023-11-23-055425-production
[deployment-charts] - 10https://gerrit.wikimedia.org/r/977183 (https://phabricator.wikimedia.org/T346997) [12:30:30] (03PS1) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 [12:32:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 60%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53907 and previous config saved to /var/cache/conftool/dbconfig/20231124-123240-arnaudb.json [12:36:02] (03PS2) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 [12:36:04] (03PS1) 10Jbond: puppet-merge: Fix up help message [puppet] - 10https://gerrit.wikimedia.org/r/977185 (https://phabricator.wikimedia.org/T350809) [12:41:09] !bash Lucas_WMDE> find someone who loves you like sammy loves the word transient https://bash.toolforge.org/search?p=0&q=transient [12:41:09] Amir1: Stored quip at https://bash.toolforge.org/quip/PcBZAYwBhuQtenzvJ-CA [12:42:08] very important operations business being conducted here [12:43:55] lol [12:44:16] ^ courtesy ping TheresNoTime I suppose :P [12:45:40] (03PS3) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 [12:45:42] (03PS1) 10Jbond: README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/977206 [12:45:58] (03PS2) 10Jbond: README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/977206 [12:46:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/977206 (owner: 10Jbond) [12:47:19] (03CR) 10JMeybohm: [C: 03+1] "This sounds reasonable to me and PCC looks good too. No-ops everywhere else so I don't feel like this is very invasive." 
[puppet] - 10https://gerrit.wikimedia.org/r/977178 (https://phabricator.wikimedia.org/T351876) (owner: 10MVernon) [12:47:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53908 and previous config saved to /var/cache/conftool/dbconfig/20231124-124745-arnaudb.json [12:50:38] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10CodeReviewBot) jynus opened https://gitlab.wikimedia.org/repos/sre/wmfbackups/-/merge_requests/5... [12:52:21] (03PS1) 10Jbond: Revert "README: test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/977194 [12:52:36] (03CR) 10Hnowlan: [C: 03+1] tlsproxy::envoy allow setting stream_idle_timeout, 180s for swift [puppet] - 10https://gerrit.wikimedia.org/r/977178 (https://phabricator.wikimedia.org/T351876) (owner: 10MVernon) [12:52:57] (03CR) 10Jbond: [C: 03+2] Revert "README: test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/977194 (owner: 10Jbond) [12:53:10] (03PS4) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 [12:54:45] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Drop config which is the same as the default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974695 (owner: 10Awight) [12:55:11] (03PS5) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) [12:58:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [12:59:21] (03PS1) 10JMeybohm: Revert "Revert "api-gateway,rest-gateway: Switch to cert-manager certificates"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977195 [12:59:53] (03PS2) 10JMeybohm: Revert "Revert 
"api-gateway,rest-gateway: Switch to cert-manager certificates"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977195 (https://phabricator.wikimedia.org/T300033) [12:59:54] (03PS1) 10JMeybohm: api-gateway: Add env variables required for envoy SDS [deployment-charts] - 10https://gerrit.wikimedia.org/r/977207 (https://phabricator.wikimedia.org/T300033) [13:00:47] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977185 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:01:00] (03CR) 10Jbond: "I put a -1 in the comments as there is an error. however as this is not used and it is made up of a go template which is hard to test unl" [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:01:20] (03CR) 10Jbond: [C: 03+1] Enable requestctl-driven network blocks for sretest [puppet] - 10https://gerrit.wikimedia.org/r/977166 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:01:22] (03CR) 10Hnowlan: [C: 03+1] api-gateway: Add env variables required for envoy SDS [deployment-charts] - 10https://gerrit.wikimedia.org/r/977207 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:02:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53909 and previous config saved to /var/cache/conftool/dbconfig/20231124-130250-arnaudb.json [13:04:02] (03CR) 10JMeybohm: [C: 03+2] api-gateway: Add env variables required for envoy SDS [deployment-charts] - 10https://gerrit.wikimedia.org/r/977207 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:04:19] (03CR) 10MVernon: [C: 03+2] tlsproxy::envoy allow setting stream_idle_timeout, 180s for swift [puppet] - 10https://gerrit.wikimedia.org/r/977178 (https://phabricator.wikimedia.org/T351876) (owner: 10MVernon) [13:04:54] (03Merged) 10jenkins-bot: api-gateway: Add env variables required 
for envoy SDS [deployment-charts] - 10https://gerrit.wikimedia.org/r/977207 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:06:18] (03CR) 10Jbond: [C: 03+1] "lgtm but see question inline" [puppet] - 10https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:08:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977177 (owner: 10Majavah) [13:10:11] (03CR) 10Hnowlan: [C: 03+1] Revert "Revert "api-gateway,rest-gateway: Switch to cert-manager certificates"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977195 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:10:47] (03CR) 10Majavah: [C: 03+2] base: puppet_env_ps1: only try to use color when possible [puppet] - 10https://gerrit.wikimedia.org/r/977177 (owner: 10Majavah) [13:11:49] 10SRE-swift-storage, 10Commons, 10Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (10AlexisJazz) >>! In T351876#9356445, @MatthewVernon wrote: > I think, per [[https://github.com/wikimedia/operations-puppet/b... 
[13:12:37] RECOVERY - Check systemd state on kubernetes2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:35] (03CR) 10Muehlenhoff: Add support to write out blocked networks from requestctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:13:43] (03PS3) 10Muehlenhoff: Add support to write out blocked networks from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) [13:14:37] 10SRE-swift-storage, 10Commons, 10Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (10MatthewVernon) @AlexisJazz it's a time for how long the connect has no data going over it "Each time an encode/decode event... [13:15:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2036 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:17:39] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:17:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53910 and previous config saved to /var/cache/conftool/dbconfig/20231124-131755-arnaudb.json [13:17:57] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:18:10] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [13:18:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:19:22] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [13:19:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:23:59] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [13:24:18] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:24:32] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [13:24:51] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:27:38] 10SRE-swift-storage, 10Commons, 10Traffic: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I can confirm that I can now download this file, even though it ta... [13:30:40] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol) [13:32:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:33:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P53911 and previous config saved to /var/cache/conftool/dbconfig/20231124-133300-arnaudb.json [13:35:55] (03CR) 10Muehlenhoff: ganeti: Add option to enable PKI-based RAPI cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:35:57] (03PS3) 10Muehlenhoff: ganeti: Add option to enable PKI-based RAPI cert [puppet] - 10https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) [13:40:57] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:41:18] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [13:41:18] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:41:42] !log jayme@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/api-gateway: apply [13:42:06] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I'm not sure why we have a puppet 7 pcc failure for deploy1002m but we can probably ignore it. I would deploy the chang" [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol) [13:43:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:44:47] (03CR) 10Vgutierrez: [V: 03+1] "Adding Filippo for the prometheus::ops side of things, thanks in advance!" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:45:11] (03CR) 10Filippo Giunchedi: "I like idea and LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:48:51] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:49:01] (03PS6) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) [13:49:57] (03PS1) 10Jcrespo: dbbackups: Unify configuration of checks and prepare for 0.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) [13:55:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:58:13] (03PS2) 10Jcrespo: dbbackups: Unify configuration of checks and prepare for 0.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) [14:01:47] (03CR) 10Jcrespo: "This looks as intended: https://puppet-compiler.wmflabs.org/output/977208/677/backupmon1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/977208 
(https://phabricator.wikimedia.org/T340741) (owner: 10Jcrespo) [14:02:14] (03CR) 10Jcrespo: [C: 04-2] "DO NOT MERGE UNTIL PACKAGE UPDATE" [puppet] - 10https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) (owner: 10Jcrespo) [14:03:24] (03CR) 10Jcrespo: [C: 04-2] "Context: I am integrating the check away from puppet and into the wmfbackups-check debian package." [puppet] - 10https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) (owner: 10Jcrespo) [14:05:09] (03PS1) 10Hnowlan: mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/977209 (https://phabricator.wikimedia.org/T349796) [14:06:07] (03CR) 10JMeybohm: [C: 03+2] Revert "Revert "api-gateway,rest-gateway: Switch to cert-manager certificates"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977195 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:07:00] (03Merged) 10jenkins-bot: Revert "Revert "api-gateway,rest-gateway: Switch to cert-manager certificates"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977195 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:07:33] (03PS1) 10Hnowlan: jobqueue: migrate thumbnailrender to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/977210 (https://phabricator.wikimedia.org/T349796) [14:16:39] (03PS7) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) [14:16:40] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:19:40] (03CR) 10Filippo Giunchedi: [C: 03+1] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [14:25:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/976737 
(https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:27:00] (03CR) 10Vgutierrez: [V: 03+1] "thanks for the review, comment answered inline" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:28:18] (03CR) 10Filippo Giunchedi: [C: 03+1] lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:46] (03CR) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:31:34] (03PS8) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [14:31:36] (03CR) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:38:24] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:26] (03PS2) 10Majavah: puppetserver: use deploy script on initial setup [puppet] - 10https://gerrit.wikimedia.org/r/977140 [14:41:29] (03PS9) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 
(https://phabricator.wikimedia.org/T351069) [14:41:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/680/con" [puppet] - 10https://gerrit.wikimedia.org/r/977140 (owner: 10Majavah) [14:42:38] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/681/con" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:43:55] (03CR) 10Majavah: [V: 03+1] puppetserver: use deploy script on initial setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977140 (owner: 10Majavah) [14:47:42] (03CR) 10Jbond: [C: 04-1] "thanks for the info" [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:49:00] (03PS1) 10Majavah: P:puppetserver::git: add wrapper for using git as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/977212 [14:49:24] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/977140 (owner: 10Majavah) [14:49:37] (03CR) 10Majavah: [V: 03+1 C: 03+2] puppetserver: use deploy script on initial setup [puppet] - 10https://gerrit.wikimedia.org/r/977140 (owner: 10Majavah) [14:50:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/682/con" [puppet] - 10https://gerrit.wikimedia.org/r/977212 (owner: 10Majavah) [14:52:11] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/977212 (owner: 10Majavah) [14:52:37] (03PS2) 10Majavah: P:puppetserver::git: add wrapper for using git as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/977212 [14:53:24] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:42] (03CR) 10Majavah: [C: 03+2] P:puppetserver::git: add wrapper for using git as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/977212 (owner: 10Majavah) [14:58:23] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:58:53] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:00:14] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:01:00] (03PS1) 10Elukey: istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) [15:01:01] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:02:11] (03PS2) 10Elukey: istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) [15:16:15] (03PS1) 10Jbond: Do not Merge: Testing CI [puppet] - 10https://gerrit.wikimedia.org/r/977216 (https://phabricator.wikimedia.org/T265633) [15:16:42] (03CR) 10CI reject: [V: 04-1] Do not Merge: Testing CI [puppet] - 10https://gerrit.wikimedia.org/r/977216 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [15:16:57] (03Abandoned) 10Jbond: puppet: try to deal with existing puppet runs [software/spicerack] - 10https://gerrit.wikimedia.org/r/971133 (owner: 10Jbond) [15:18:57] (03Abandoned) 10Muehlenhoff: pki: Add new intermediate for ganeti-rapi [puppet] - 10https://gerrit.wikimedia.org/r/977129 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [15:19:11] (KubernetesAPINotScrapable) firing: k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:21:06] (03PS3) 10Jbond: interface: attempt to resolve ordering issues with tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/971406 (owner: 10Majavah) [15:21:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977227 [15:22:22] (03CR) 10Jbond: [C: 03+1] "nice lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971406 (owner: 10Majavah) [15:22:49] (03CR) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:24:11] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:25:27] (03PS1) 10Ladsgroup: beta: Stop writing to the old columns of pagelinks in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977218 (https://phabricator.wikimedia.org/T299947) [15:26:43] (03CR) 10Ladsgroup: [C: 03+2] beta: Stop writing to the old columns of pagelinks in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977218 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [15:27:28] (03Merged) 10jenkins-bot: beta: Stop writing to the old columns of pagelinks in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977218 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [15:33:22] (03CR) 10Ssingh: [C: 03+1] lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:34:08] (03PS3) 10Elukey: istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) 
[15:34:10] (03PS1) 10Elukey: cert-manager: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) [15:35:19] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/istio ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [15:36:01] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/cert-manager/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [15:41:46] (03CR) 10Jbond: [C: 03+1] "See inline for response and suggestion" [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:46:40] (03PS10) 10Vgutierrez: lvs,pybal: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [15:47:29] (03CR) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:47:53] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/684/con" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:51:01] (03CR) 10Vgutierrez: [V: 03+1] lvs,pybal: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:56:36] (03PS1) 10Jbond: tox.ini: drop limit on commit-message-validator [puppet] - 10https://gerrit.wikimedia.org/r/977221 (https://phabricator.wikimedia.org/T265633) 
[15:56:48] (03PS11) 10Vgutierrez: lvs,pybal: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [15:57:35] (03CR) 10Vgutierrez: "PS11 just drops the hieradata changes required to get a PCC output with both ensure => absent and ensure => present" [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:58:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:59:26] (03PS2) 10Jbond: Do not Merge: Testing CI [puppet] - 10https://gerrit.wikimedia.org/r/977216 (https://phabricator.wikimedia.org/T265633) [16:03:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:09:48] (03CR) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:10:03] (03PS1) 10Hashar: rake: add support for tox v4 [puppet] - 10https://gerrit.wikimedia.org/r/977223 (https://phabricator.wikimedia.org/T345152) [16:10:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:10:28] (03Abandoned) 10Hashar: taskgen: update for tox 4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [16:15:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 41.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:18:23] (03CR) 10Ssingh: [C: 03+1] lvs,pybal: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:21:59] (03CR) 10Ssingh: [C: 03+2] "This came up in Ifcf00272c69fd9e8b03345eff1c74d7599b2385a and we forgot to merge (that's on me!)." 
[puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff) [16:22:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:23:41] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) @ayounsi what would be the required TCP MSS clamping values? per https://phabricator.wikimedia.org/T348837#9256494 It seems that around ~1400 bytes for both IPv4/IPv6 should... [16:33:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:38:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:38:55] (03CR) 10Ssingh: [C: 03+2] gdnsd: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff) [16:40:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 41.83% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:45:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 41.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:03:05] (03PS1) 10Ladsgroup: Disable VipsScaler in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) [17:16:10] (03PS1) 10Ssingh: admin: reserve uid/gid for authdns user [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) [17:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:21:50] (03PS2) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) [17:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:32:29] (03PS1) 10Ssingh: P:dns::auth::update::account: switch to systemd::sysuser [puppet] - 
10https://gerrit.wikimedia.org/r/977259 (https://phabricator.wikimedia.org/T347054) [17:35:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/685/con" [puppet] - 10https://gerrit.wikimedia.org/r/977259 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:40:05] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:17] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:40:49] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:57] Looking ^ [17:44:43] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:45:05] (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:45:13] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:48:51] RECOVERY - mailman 
list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:06] (03CR) 10Muehlenhoff: admin: reserve uid/gid for authdns user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:08:03] (03CR) 10Ssingh: admin: reserve uid/gid for authdns user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:11] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:18:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:28:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:15] RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:41] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:24:11] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is 
failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable