[00:15:10] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10287153 (Platonides) The bug for multiple mailing lists was fixed several years ago: https://gitlab.com/mailman/mailman/-/issues/955 (so, hopefully, the fix is included...
[00:38:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574
[00:38:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574 (owner: TrainBranchBot)
[01:07:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:08:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576
[01:08:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576 (owner: TrainBranchBot)
[01:11:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574 (owner: TrainBranchBot)
[01:43:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576 (owner: TrainBranchBot)
[01:51:49] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[02:02:05] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/49bd37b8f4f3d94e484accd8635c9153243ed147994c71222a2ed5739293bf63/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:16:26] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10287188 (AlphaLemur) I have seen this message on two different lists:  * Wikimedia-AU-Members, October 7, 2024, 07:53 UTC. - I can confirm this was not a cross-posted me...
[02:22:05] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:37:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:37] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:01:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:02:37] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:19] <wikibugs>	 SRE, Huggle, Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10287194 (Crazycomputers) I tracked down the issue on the Huggle side. The library Huggle uses for IRC (libirc) expects the MYINFO co...
[05:32:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:37:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:39:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:46:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:49:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:49] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:55:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:57:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:00:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:13:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:34:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:37:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:49:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:29:33] <wikibugs>	 ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916 (phaultfinder) NEW
[07:31:44] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[07:39:09] <wikibugs>	 ops-codfw, SRE, DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10287339 (MoritzMuehlenhoff) I had a look at the IPMI logs and there are still two more of these errors logged after you reseated the memory on Friday, so it seems this was...
[07:57:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[07:57:28] <wikibugs>	 (CR) Arnaudb: "no massive gain to get from this, mostly quality of life improvements that are not crucial!" [puppet] - https://gerrit.wikimedia.org/r/1084730 (owner: Arnaudb)
[07:57:49] <wikibugs>	 (Abandoned) Arnaudb: mariadb: add mycli [puppet] - https://gerrit.wikimedia.org/r/1084730 (owner: Arnaudb)
[07:59:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2013.codfw.wmnet
[07:59:31] <wikibugs>	 (PS1) Arnaudb: mariadb: fix mycnf [puppet] - https://gerrit.wikimedia.org/r/1087120
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[08:03:20] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: fix mycnf [puppet] - https://gerrit.wikimedia.org/r/1087120 (owner: Arnaudb)
[08:03:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:05:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:06:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:09:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:11:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:11:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:11:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2013.codfw.wmnet
[08:11:28] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287430 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2013.codfw.wmnet` - ganeti2013.codfw.wmnet (*...
[08:12:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2014.codfw.wmnet
[08:15:32] <XioNoX>	 !log push Drop labtestwikitech return traffic term to eqiad routers - CR1083589
[08:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:41] <XioNoX>	 /cc taavi ^
[08:16:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:21:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: waiting for productionnization T373579
[08:21:26] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[08:21:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: waiting for productionnization T373579
[08:22:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:23:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:23:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:23:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2014.codfw.wmnet
[08:23:16] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287437 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2014.codfw.wmnet` - ganeti2014.codfw.wmnet (*...
[08:24:27] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287438 (MoritzMuehlenhoff)
[08:24:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware, Infrastructure-Foundations: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596#10287452 (MoritzMuehlenhoff)
[08:26:42] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: enable Kerberos auth on the API backend [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: Brouberol)
[08:38:40] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti1039 to ganeti1052 to nftables [puppet] - https://gerrit.wikimedia.org/r/1087123
[08:50:56] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:51:07] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:53:00] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti1039 to ganeti1052 to nftables [puppet] - https://gerrit.wikimedia.org/r/1087123 (owner: Muehlenhoff)
[08:57:15] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:57:36] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:59:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[09:00:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10287518 (elukey) Fixed 1044. For some reason IPv6 support was disabled, so our settings like `IPv6AutoConfigEnabled: False` led to a HTTP 400. I connecte...
[09:04:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:04:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[09:06:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1045.eqiad.wmnet with reason: reboots for nftables
[09:06:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1045.eqiad.wmnet with reason: reboots for nftables
[09:06:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: reboots for nftables
[09:07:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: reboots for nftables
[09:09:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:10:17] <wikibugs>	 (CR) David Caro: [C:+2] P:toolforge::proxy: use svc.toolforge.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1080056 (owner: Majavah)
[09:20:11] <wikibugs>	 (CR) Arnaudb: [C:+1] service::catalog: mark apus service as paging [puppet] - https://gerrit.wikimedia.org/r/1085617 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:21:05] <wikibugs>	 (CR) MVernon: [C:+2] service::catalog: mark apus service as paging [puppet] - https://gerrit.wikimedia.org/r/1085617 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:21:46] <wikibugs>	 (CR) Ayounsi: [C:+1] "Good idea!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1085590 (https://phabricator.wikimedia.org/T378751) (owner: Cathal Mooney)
[09:21:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 (MoritzMuehlenhoff) NEW
[09:21:58] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10287549 (MatthewVernon)
[09:22:07] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10287561 (MoritzMuehlenhoff) p:Triage→Medium
[09:23:50] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10287562 (MatthewVernon) Open→Resolved a:MatthewVernon Aiming to migrate first production user this quarter.
[09:25:02] <wikibugs>	 (CR) Elukey: "Hello! I don't particularly love the -bookworm suffix in the dir name, in other places we have a specific /bookworm/etc.. structure, but t" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:25:17] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10287573 (MatthewVernon) It's now behaving itself properly.
[09:28:32] <wikibugs>	 (CR) Brouberol: "I haven't tried to build these. Is there a process I could follow to run a build before we merge?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:29:18] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048)
[09:29:37] <wikibugs>	 (PS2) Brouberol: Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928)
[09:30:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:31:16] <wikibugs>	 SRE, SRE-swift-storage, Ceph: Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922 (MatthewVernon) NEW
[09:32:43] <wikibugs>	 SRE, SRE-swift-storage, Ceph: Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10287610 (MatthewVernon) p:Triage→Medium
[09:33:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:33:48] <wikibugs>	 (PS1) Muehlenhoff: Send check-cumin-aliases output only to SRE IF alias [puppet] - https://gerrit.wikimedia.org/r/1087133
[09:34:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:35:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:36:52] <wikibugs>	 (CR) Muehlenhoff: Publish JDK8 images based on Debian Bookworm (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:37:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:37:17] <wikibugs>	 (CR) Volans: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1087133 (owner: Muehlenhoff)
[09:37:39] <wikibugs>	 (CR) Brouberol: Publish JDK8 images based on Debian Bookworm (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:37:45] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:40:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:40:57] <wikibugs>	 (PS7) Elukey: sre.hosts.provision: initial UEFI support [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:41:05] <wikibugs>	 (CR) Elukey: sre.hosts.provision: initial UEFI support (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:41:52] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:42:59] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:45:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:47:21] <wikibugs>	 (PS1) Brouberol: global_config: expose additional ports on hadoop masters/workers [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928)
[09:47:40] <wikibugs>	 (PS2) Brouberol: global_config: expose additional ports on hadoop masters/workers [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928)
[09:48:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Send check-cumin-aliases output only to SRE IF alias [puppet] - https://gerrit.wikimedia.org/r/1087133 (owner: Muehlenhoff)
[09:50:02] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Schedule daily runs of WikimediaEvents UpdatePeriodicMetrics.php [puppet] - https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) (owner: Dreamy Jazz)
[09:50:37] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:51:35] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.provision: initial UEFI support [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:51:49] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:53:37] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:54:23] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4443/co" [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:54:38] <wikibugs>	 (PS1) Brouberol: global_config: define external services entries for the hive metastore servers [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928)
[09:54:48] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:55:51] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:56:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:56:41] <wikibugs>	 (CR) Volans: [C:+2] Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[09:56:59] <volans>	 !log deploying spicerack v8.15.2 to cumin[12]002
[09:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:15] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:57:36] <wikibugs>	 (PS2) Brouberol: global_config: define external services entries for the hive metastore servers [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928)
[09:59:23] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:59:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Swift roles [puppet] - https://gerrit.wikimedia.org/r/1083158 (owner: Muehlenhoff)
[09:59:50] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4444/co" [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[10:00:48] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:01:28] <wikibugs>	 (Merged) jenkins-bot: ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[10:01:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:01:48] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:02:22] <wikibugs>	 (Merged) jenkins-bot: Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[10:02:30] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:06:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:06:48] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:07:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[10:08:06] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[10:08:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70850 and previous config saved to /var/cache/conftool/dbconfig/20241104-100813-ladsgroup.json
[10:08:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:14:29] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Sounds like a good idea +1" [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:15:33] <wikibugs>	 SRE, Charts, Infrastructure-Foundations, Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939#10287851 (MatthewVernon)
[10:15:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70851 and previous config saved to /var/cache/conftool/dbconfig/20241104-101552-ladsgroup.json
[10:16:33] <wikibugs>	 SRE, Infrastructure-Foundations, Epic, Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10287863 (MatthewVernon)
[10:17:03] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:17:18] <wikibugs>	 (PS1) Ladsgroup: dbtools: Drop depool and repool bashes [software] - https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738)
[10:18:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:18:48] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:20:52] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:21:07] <wikibugs>	 (PS1) Slyngshede: Provide the option to run an embedded Redis server. [software/bitu] - https://gerrit.wikimedia.org/r/1087140
[10:21:16] <wikibugs>	 (CR) Slyngshede: [C:+2] Fix unblock bug. [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:22:07] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287892 (MoritzMuehlenhoff)
[10:23:13] <wikibugs>	 (Merged) jenkins-bot: Fix unblock bug. [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:26:52] <wikibugs>	 (CR) Ayounsi: [C:+2] Prefer Lumen to reach ATT [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:27:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:27:39] <wikibugs>	 SRE, SRE-Unowned, wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10287901 (MatthewVernon) @jijiki can you expand on what you mean, please? This task is currently too broad...
[10:27:51] <wikibugs>	 (Merged) jenkins-bot: Prefer Lumen to reach ATT [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:29:47] <wikibugs>	 (CR) Arnaudb: "I was unaware of `depool-and-wait`! Otherwise, LGTM" [software] - https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: Ladsgroup)
[10:30:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P70852 and previous config saved to /var/cache/conftool/dbconfig/20241104-103059-ladsgroup.json
[10:31:07] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:31:12] <moritzm>	 !log installing libseccomp updates from Bookworm point release
[10:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:32] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:35:50] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10287958 (jcrespo) Was db2190 taken care, data-wise/repooled? Not super worried or super-urgent, but to track it somewhere and making sure it doesn't fall into the cracks of depooled host...
[10:37:01] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10287959 (Ladsgroup) Yes :)
[10:38:57] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:08] <wikibugs>	 (CR) Brouberol: "`" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[10:39:12] <wikibugs>	 (PS3) Brouberol: Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928)
[10:39:29] <wikibugs>	 (CR) Brouberol: [C:+2] Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[10:39:33] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[10:40:41] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287970 (MoritzMuehlenhoff)
[10:41:22] <moritzm>	 !log installing libtool updates from Bookworm point release
[10:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:46] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:42:20] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:43:31] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/1084129 (owner: Ayounsi)
[10:46:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P70853 and previous config saved to /var/cache/conftool/dbconfig/20241104-104606-ladsgroup.json
[10:47:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:48:38] <XioNoX>	 !log eqiad: Prefer Lumen to reach ATT - T377844
[10:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:18] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!  We could make the 'multihop' an optional attribute for the YAML dict but for these few I think it's fine in the Jinja." [homer/public] - https://gerrit.wikimedia.org/r/1084125 (owner: Ayounsi)
[10:50:20] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Add temporary LVS community for liberica test [homer/public] - https://gerrit.wikimedia.org/r/1084760 (https://phabricator.wikimedia.org/T378453) (owner: Ayounsi)
[10:50:28] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287999 (MoritzMuehlenhoff)
[10:52:34] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:54:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1100)
[11:01:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70854 and previous config saved to /var/cache/conftool/dbconfig/20241104-110113-ladsgroup.json
[11:01:21] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance
[11:01:34] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance
[11:01:38] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1080069 (owner: EoghanGaffney)
[11:01:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70855 and previous config saved to /var/cache/conftool/dbconfig/20241104-110141-ladsgroup.json
[11:05:01] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:06:28] <wikibugs>	 (PS1) Muehlenhoff: Assign ganeti role to ganeti1039/ganeti1040 [puppet] - https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921)
[11:08:01] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[11:09:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70856 and previous config saved to /var/cache/conftool/dbconfig/20241104-110953-ladsgroup.json
[11:11:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809#10288046 (MoritzMuehlenhoff) >>! In T378809#10284244, @cmooney wrote: >>>! In T378809#10284231, @CDanis wrote: >> I'm pretty confident this is the same as T348730, and I thi...
[11:12:35] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:13:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Assign ganeti role to ganeti1039/ganeti1040 [puppet] - https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[11:14:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10288061 (Clement_Goubert) Open→In progress a:Clement_Goubert
[11:17:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[11:19:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10288086 (phaultfinder)
[11:22:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[11:22:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:22:49] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:24:51] <wikibugs>	 (CR) Vgutierrez: [C:+1] trafficserver: Lua script for routing 8.1-enrolled traffic [puppet] - https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[11:25:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P70857 and previous config saved to /var/cache/conftool/dbconfig/20241104-112501-ladsgroup.json
[11:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:04] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update articlequality staging [deployment-charts] - https://gerrit.wikimedia.org/r/1087157
[11:33:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10288128 (Clement_Goubert) Partition table copied to the new disk and added it to the software raid. Rebuild in progress. ` cgoubert@wikikube-worker2068:~$ cat /proc/mdstat  Person...
[11:34:05] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:37:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:14] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: Mhorsey)
[11:40:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P70858 and previous config saved to /var/cache/conftool/dbconfig/20241104-114008-ladsgroup.json
[11:42:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:20] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:45:51] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:54:14] <wikibugs>	 (PS1) Marostegui: Revert "db2190: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/1087160
[11:55:10] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2190: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/1087160 (owner: Marostegui)
[11:55:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70859 and previous config saved to /var/cache/conftool/dbconfig/20241104-115514-ladsgroup.json
[11:56:06] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:58:13] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:58:55] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10288156 (Marostegui) Notifications were disabled, I have enabled them as the host is serving queries.
[12:01:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[12:08:28] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[12:08:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[12:10:10] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:11:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B
[12:11:43] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B
[12:16:28] <wikibugs>	 (PS1) Slyngshede: P:idp enable Redis TGT backend [puppet] - https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728)
[12:19:32] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:19:58] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:20:22] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[12:22:12] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:22:40] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:24:51] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[12:26:08] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update articlequality staging [deployment-charts] - https://gerrit.wikimedia.org/r/1087157 (owner: Ilias Sarantopoulos)
[12:32:41] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: update articlequality staging [deployment-charts] - https://gerrit.wikimedia.org/r/1087157 (owner: Ilias Sarantopoulos)
[12:33:49] <wikibugs>	 (Merged) jenkins-bot: ml-services: update articlequality staging [deployment-charts] - https://gerrit.wikimedia.org/r/1087157 (owner: Ilias Sarantopoulos)
[12:34:10] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10288300 (MoritzMuehlenhoff)
[12:34:33] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:35:06] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[12:37:32] <wikibugs>	 (CR) Ladsgroup: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1087160 (owner: Marostegui)
[12:44:54] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[12:45:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[12:45:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:45:26] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:45:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70860 and previous config saved to /var/cache/conftool/dbconfig/20241104-124533-ladsgroup.json
[12:49:30] <XioNoX>	 !log deploy "Add temporary LVS community for liberica test" - T378453
[12:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:43] <stashbot>	 T378453: Testing liberica with ncredir@eqiad - https://phabricator.wikimedia.org/T378453
[12:55:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70861 and previous config saved to /var/cache/conftool/dbconfig/20241104-125459-ladsgroup.json
[13:06:46] <Dreamy_Jazz>	 !log Started MediaModeration scan on all wikis other than s4 (commonswiki + testcommonswiki) - https://wikitech.wikimedia.org/wiki/MediaModeration
[13:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P70862 and previous config saved to /var/cache/conftool/dbconfig/20241104-131006-ladsgroup.json
[13:11:25] <Dreamy_Jazz>	 !log Started slow MediaModeration scan for commonswiki to be scanning as close to upload as possible - https://wikitech.wikimedia.org/wiki/MediaModeration
[13:11:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B
[13:25:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P70864 and previous config saved to /var/cache/conftool/dbconfig/20241104-132513-ladsgroup.json
[13:25:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B
[13:31:22] <wikibugs>	 (CR) Clément Goubert: "Indeed it does, thanks for that!" [puppet] - https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[13:36:49] <jinxer-wm>	 RESOLVED: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:38:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:39:18] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: productionize db2235 [puppet] - https://gerrit.wikimedia.org/r/1087179 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[13:39:31] <wikibugs>	 (CR) Marostegui: "thanks I commented on the new one!" [puppet] - https://gerrit.wikimedia.org/r/1084128 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[13:40:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70865 and previous config saved to /var/cache/conftool/dbconfig/20241104-134021-ladsgroup.json
[13:40:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:40:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:45:45] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[13:45:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[13:46:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70866 and previous config saved to /var/cache/conftool/dbconfig/20241104-134605-ladsgroup.json
[13:49:09] <wikibugs>	 (PS1) Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380)
[13:49:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:49:36] <wikibugs>	 (CR) CI reject: [V:-1] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[13:50:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Schema change T367856
[13:50:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Schema change T367856
[13:50:37] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:51:04] <marostegui>	 !log Start schema change on redacteddb1001:s8 T367856 (this will make replication in s8 lag for around 2-3 days)
[13:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70867 and previous config saved to /var/cache/conftool/dbconfig/20241104-135516-ladsgroup.json
[13:56:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:56:52] <wikibugs>	 (PS1) Muehlenhoff: spark: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1087186
[13:56:52] <wikibugs>	 (PS1) Muehlenhoff: Remove spark2 profile [puppet] - https://gerrit.wikimedia.org/r/1087187
[13:59:42] <HouseOfM>	 o/
[13:59:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1400).
[14:00:05] <jouncebot>	 HouseOfM: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:29] <Lucas_WMDE>	 o/
[14:01:09] <Lucas_WMDE>	 I can deploy!
[14:01:32] <HouseOfM>	 Thanks :)
[14:01:43] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087186 (owner: Muehlenhoff)
[14:06:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10288826 (Lucas_Werkmeister_WMDE) IMHO it’s a bit of an awkward time to add someone to `restricted`, given the status of T378429, but sure ^^ let’s see how far that gets us.
[14:06:23] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288830 (Marostegui) >>! In T378143#10266787, @ABran-WMF wrote: > I've tried to reproduce what's been done in T355269 which is quite close to what we...
[14:07:04] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: Mhorsey)
[14:07:33] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288831 (ABran-WMF) basically a validation of the picked up positions, I stuck to the existing topology as there was a 1:1 match between hosts and ea...
[14:08:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:09:00] <wikibugs>	 (03CR) 10Mhorsey: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey)
[14:09:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff)
[14:10:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] spark: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087186 (owner: 10Muehlenhoff)
[14:10:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P70868 and previous config saved to /var/cache/conftool/dbconfig/20241104-141023-ladsgroup.json
[14:10:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey)
[14:11:20] <wikibugs>	 (03Merged) 10jenkins-bot: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey)
[14:11:39] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:11:51] <Lucas_WMDE>	 scap is fetching lots of submodules
[14:11:55] <Lucas_WMDE>	 first deployment of the week, maybe ^^
[14:12:19] <HouseOfM>	 Oh good, lol
[14:12:31] <wikibugs>	 (03PS11) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127)
[14:12:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]]
[14:12:35] <stashbot>	 T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252
[14:12:44] <wikibugs>	 (03Abandoned) 10Volans: sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans)
[14:12:49] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10288851 (10MatthewVernon) I think, given @jhathaway's [[ https://phabricator.wikimedia.org/T378584#10284180 | update ]] on T378584 we should try booting...
[14:13:06] <wikibugs>	 (03CR) 10CDanis: [C:03+1] mesh.service: introduce a way to further specify the service label selectors (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:13:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:13:53] <wikibugs>	 (03PS12) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127)
[14:13:59] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10288874 (10MatthewVernon) @jhathaway great, thanks. With the new thanos backends hopefully arriving this week (which are also...
[14:17:27] <wikibugs>	 (03PS8) 10Vgutierrez: role,site: Provide a liberica role and use it on lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127)
[14:18:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mesh.service: introduce a way to further specify the service label selectors (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:18:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:18:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:19:48] <wikibugs>	 (03Merged) 10jenkins-bot: Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:19:52] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[14:21:15] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez)
[14:22:45] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288906 (10Marostegui) So, we have 6 rows available, so let's place one per row.  For A3, there's already an external store host there, so if there's a...
[14:22:50] <wikibugs>	 (03PS14) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881)
[14:23:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:23:20] <stashbot>	 T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252
[14:23:25] <Lucas_WMDE>	 HouseOfM: please test :)
[14:23:43] <HouseOfM>	 Will do :)
[14:24:32] <moritzm>	 !log uploaded php7.4 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u3 to component/icu67 (backports of latest security fixes to our PHP 7.4 build)
[14:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:45] <Lucas_WMDE>	 marostegui: question about https://phabricator.wikimedia.org/T367856#10288593 – does this affect the public replicas (quarry etc.)? or is this replication lag in a different kind of database?
[14:25:01] <marostegui>	 Lucas_WMDE: it will yes
[14:25:07] <Lucas_WMDE>	 ok
[14:25:22] <marostegui>	 Lucas_WMDE: They'll get around 2 days of lag for s8, but not yet
[14:25:22] <Lucas_WMDE>	 we’ll probably put a brief mention of it in the wikidata weekly summary (linking to that phab task) if that’s okay
[14:25:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P70869 and previous config saved to /var/cache/conftool/dbconfig/20241104-142530-ladsgroup.json
[14:25:47] <marostegui>	 Lucas_WMDE: yeah, i can tell you when that will happen if you like
[14:25:53] <marostegui>	 (unlikely this week)
[14:25:57] <Lucas_WMDE>	 ah ok
[14:26:01] <Lucas_WMDE>	 sure
[14:26:07] <Lucas_WMDE>	 then I’ll take it out of the summary for this week again :)
[14:26:35] <marostegui>	 Lucas_WMDE: We are going to alter each wikireplica so they will be depooled, but at some point we will alter their master and then their master-master so there will be two periods of 2 days of lag, but that will take a few days to happen.
[14:27:09] <Lucas_WMDE>	 ok
[14:27:46] <Lucas_WMDE>	 it’s not super important that we announce it, I think, but I saw it fly past and thought “I remember that causing some confusion before” and figured we might as well include it in the weekly summary
[14:27:53] <Lucas_WMDE>	 but also, we shouldn’t do that too early ^^
[14:27:55] <Lucas_WMDE>	 so I’m glad I asked :)
[14:27:58] <HouseOfM>	 Lucas_WMDE: all good
[14:28:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Continuing with sync
[14:28:32] <Lucas_WMDE>	 if it’s convenient, you can ping me (or e.g. Lydia) for inclusion in the next weekly summary
[14:28:38] <Lucas_WMDE>	 if not, we’ll survive too :)
[14:28:49] <marostegui>	 Lucas_WMDE: yeah, I think it is a very good idea to include it there :)
[14:29:00] <marostegui>	 So thanks for that! I will keep you posted
[14:29:31] <Lucas_WMDE>	 okay, thanks!
[14:30:04] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: toggle notifications for db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1087194 (https://phabricator.wikimedia.org/T373579)
[14:30:05] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1087194 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb)
[14:32:05] <wikibugs>	 (03PS14) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519)
[14:35:19] <wikibugs>	 (03PS2) 10Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380)
[14:36:05] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Overall it looks good to me, and even if it contains hacks they are self-contained for this use case. I am inclined to proceed, we have so" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway)
[14:36:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]] (duration: 23m 39s)
[14:36:15] <stashbot>	 T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252
[14:36:39] <wikibugs>	 (03PS3) 10Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380)
[14:37:37] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:58] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4446/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[14:38:36] <Lucas_WMDE>	 looks like that’s everything for now
[14:38:44] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:58] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi)
[14:39:03] <HouseOfM>	 Thanks Lucas_WMDE
[14:39:31] <wikibugs>	 (03Merged) 10jenkins-bot: Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi)
[14:40:24] <Lucas_WMDE>	 np :)
[14:40:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70870 and previous config saved to /var/cache/conftool/dbconfig/20241104-144037-ladsgroup.json
[14:40:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[14:40:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[14:41:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70871 and previous config saved to /var/cache/conftool/dbconfig/20241104-144101-ladsgroup.json
[14:42:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:46:02] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289015 (10elukey) We do have support for UEFI in the provision cookbook and in reimage (after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/10...
[14:50:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70872 and previous config saved to /var/cache/conftool/dbconfig/20241104-145027-ladsgroup.json
[14:59:57] <wikibugs>	 (03PS1) 10Tchanders: temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336)
[15:02:28] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders)
[15:02:37] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P70873 and previous config saved to /var/cache/conftool/dbconfig/20241104-150534-ladsgroup.json
[15:07:50] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10289123 (10phaultfinder)
[15:10:47] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, comparing PS 9 to current 12!" [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez)
[15:20:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P70874 and previous config saved to /var/cache/conftool/dbconfig/20241104-152041-ladsgroup.json
[15:20:46] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez)
[15:23:43] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289174 (10MatthewVernon) I think from that the two big issues are the partman cookbooks (which we'd obviously need the one we're using for these nodes t...
[15:25:02] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Request Kerberos identity for jsn.sherman - https://phabricator.wikimedia.org/T378786#10289193 (10Ottomata)
[15:25:13] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] role,site: Provide a liberica role and use it on lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez)
[15:25:35] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: admin/data.yaml: Add researchers to users of ml-lab100x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[15:25:48] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10289196 (10LSobanski)
[15:29:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] geo-maps: switch CN to to eqsin (from ulsfo) [dns] - 10https://gerrit.wikimedia.org/r/1085456 (https://phabricator.wikimedia.org/T378744) (owner: 10Ssingh)
[15:29:14] <sukhe>	 !log running authdns-update to move CN traffic to eqsin from ulsfo: T378744
[15:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:16] <stashbot>	 T378744: GeoDNS: consider sending CN to eqsin - https://phabricator.wikimedia.org/T378744
[15:31:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Good eyes!" [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff)
[15:32:32] <wikibugs>	 (03CR) 10Klausman: [V:03+1] admin/data.yaml: Add researchers to users of ml-lab100x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[15:34:27] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[15:34:41] <wikibugs>	 (03PS2) 10Majavah: Drop support for s11 MariaDB section [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260)
[15:35:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70876 and previous config saved to /var/cache/conftool/dbconfig/20241104-153548-ladsgroup.json
[15:35:53] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:35:53] <vgutierrez>	 !log upload liberica 0.1 to apt.wm.o (bookworm) - T377127
[15:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:02] <stashbot>	 T377127: liberica puppetization - https://phabricator.wikimedia.org/T377127
[15:36:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:36:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70877 and previous config saved to /var/cache/conftool/dbconfig/20241104-153613-ladsgroup.json
[15:37:10] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Drop support for s11 MariaDB section [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah)
[15:40:05] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986 (10herron) 03NEW
[15:40:08] <wikibugs>	 (03PS1) 10Marostegui: mariadb: prod dbproxy200[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb)
[15:40:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Looks good, you may have a race condition if you productionize db2235 before you merge this as db2235 will replace db2135" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb)
[15:40:09] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987 (10herron) 03NEW
[15:40:13] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988 (10herron) 03NEW
[15:40:13] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[15:40:17] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989 (10herron) 03NEW
[15:44:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10289310 (10herron)
[15:44:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French)
[15:45:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70878 and previous config saved to /var/cache/conftool/dbconfig/20241104-154543-ladsgroup.json
[15:46:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:46:32] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[15:46:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596#10289345 (10Jhancock.wm) 05Open→03Resolved
[15:51:32] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[15:52:00] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[15:52:49] <icinga-wm>	 PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:45] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[15:57:46] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1087179 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb)
[15:58:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:00:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db[2135,2235].codfw.wmnet with reason: cloning db2135@db2235
[16:00:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10289440 (10CDanis) p:05Triage→03Medium
[16:00:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db[2135,2235].codfw.wmnet with reason: cloning db2135@db2235
[16:00:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P70879 and previous config saved to /var/cache/conftool/dbconfig/20241104-160050-ladsgroup.json
[16:00:52] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10289443 (10elukey) p:05Triage→03Medium
[16:01:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:01:04] <wikibugs>	 (03CR) 10Marostegui: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[16:01:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10289446 (10Jhancock.wm) removed CPU 2. gonna let it run for a little and see if it generates errors. then we'll at least know which one is the problem
[16:02:18] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2135.codfw.wmnet onto db2235.codfw.wmnet
[16:03:28] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:51] <wikibugs>	 (03CR) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[16:05:12] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2135.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2135.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:05:24] <arnaudb>	 it's me, I failed to downtime a node
[16:05:25] <arnaudb>	 fixing
[16:05:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[16:05:38] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:05:49] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2160.codfw.wmnet with reason: cloning db2135@db2235
[16:06:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2160.codfw.wmnet with reason: cloning db2135@db2235
[16:06:04] <arnaudb>	 sorry for the noise!
[16:06:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:07:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:08:13] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289477 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm
[16:08:33] <wikibugs>	 (03PS15) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881)
[16:08:59] <wikibugs>	 (03CR) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[16:10:12] <wikibugs>	 (03PS3) 10Brouberol: airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377)
[16:10:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10289485 (10bking) a:03VRiley-WMF
[16:11:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10289487 (10bking)
[16:11:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol)
[16:12:11] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude)
[16:12:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2135.codfw.wmnet onto db2235.codfw.wmnet
[16:13:06] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[16:13:12] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[16:14:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:14:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:15:03] <wikibugs>	 (03PS1) 10Slyngshede: P:idp add default Redis database for cloud. [puppet] - 10https://gerrit.wikimedia.org/r/1087203
[16:15:06] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:15:27] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10289524 (10LSobanski) a:03eoghan
[16:15:29] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10289525 (10MoritzMuehlenhoff)
[16:15:51] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: productionize db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579)
[16:15:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P70880 and previous config saved to /var/cache/conftool/dbconfig/20241104-161557-ladsgroup.json
[16:16:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1087203 (owner: 10Slyngshede)
[16:16:54] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp add default Redis database for cloud. [puppet] - 10https://gerrit.wikimedia.org/r/1087203 (owner: 10Slyngshede)
[16:21:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[16:23:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[16:23:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[16:25:06] <wikibugs>	 (PS1) Elukey: profile::docker::report: use the internal registry endpoint [puppet] - https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618)
[16:25:07] <wikibugs>	 (PS1) Elukey: docker_registry_ha: reduce from 300 to 180 the nginx timeout [puppet] - https://gerrit.wikimedia.org/r/1087206 (https://phabricator.wikimedia.org/T378618)
[16:27:27] <wikibugs>	 (PS16) Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881)
[16:27:32] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4447/console" [puppet] - https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[16:28:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[16:28:55] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4449/co" [puppet] - https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[16:30:05] <jouncebot>	 jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1630).
[16:31:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70881 and previous config saved to /var/cache/conftool/dbconfig/20241104-163104-ladsgroup.json
[16:31:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[16:31:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[16:31:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70882 and previous config saved to /var/cache/conftool/dbconfig/20241104-163129-ladsgroup.json
[16:34:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[16:37:46] <wikibugs>	 (CR) Clément Goubert: [C:+1] profile::docker::report: use the internal registry endpoint [puppet] - https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[16:37:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[16:38:04] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm complet...
[16:38:17] <wikibugs>	 (CR) Clément Goubert: [C:+1] docker_registry_ha: reduce from 300 to 180 the nginx timeout [puppet] - https://gerrit.wikimedia.org/r/1087206 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[16:40:09] <wikibugs>	 (PS1) DLynch: Set Flow to read-only on remaining phase 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990)
[16:40:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70883 and previous config saved to /var/cache/conftool/dbconfig/20241104-164051-ladsgroup.json
[16:42:17] <wikibugs>	 (CR) Scott French: "Thanks, Hugh!" [deployment-charts] - https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: Scott French)
[16:42:20] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: add optional .spec.strategy override [deployment-charts] - https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: Scott French)
[16:44:02] <wikibugs>	 (Merged) jenkins-bot: shellbox: add optional .spec.strategy override [deployment-charts] - https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: Scott French)
[16:50:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289692 (VRiley-WMF) @Clement_Goubert It seems that there are servers already named wikikube-worker1240, wikikube-worker1241, and wikikube-worker12...
[16:53:16] <wikibugs>	 SRE, SRE-Unowned, wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10289712 (bd808) I can't speak for Effie, but my imagined reimplementation of wikitech-static would be to produce an OCI container image daily containing Wikitech's content along with a Medi...
[16:53:49] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable puppet7 for liberica role [puppet] - https://gerrit.wikimedia.org/r/1087210
[16:55:07] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1087210 (owner: Vgutierrez)
[16:55:37] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087210 (owner: Vgutierrez)
[16:55:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P70885 and previous config saved to /var/cache/conftool/dbconfig/20241104-165558-ladsgroup.json
[16:56:08] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10289749 (MoritzMuehlenhoff)
[16:56:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289750 (Clement_Goubert) Yeah sorry about that, I got confused with the host renaming we've been doing. Thanks for catching it. I'll amend the tas...
[16:59:44] <logmsgbot>	 !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bookworm
[17:00:33] <wikibugs>	 (PS15) JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519)
[17:01:24] <wikibugs>	 (PS2) Vgutierrez: hiera: Enable puppet7 for liberica role [puppet] - https://gerrit.wikimedia.org/r/1087210
[17:02:10] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Enable puppet7 for liberica role [puppet] - https://gerrit.wikimedia.org/r/1087210 (owner: Vgutierrez)
[17:02:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install wikikube-worker12[43-58] - https://phabricator.wikimedia.org/T378185#10289766 (Clement_Goubert) @VRiley-WMF I've messed up the hostnames on that task as well as {T377021}, I'm so sorry. I'll sort all of this out first thing tomorrow.
[17:02:15] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable puppet7 for liberica role [puppet] - https://gerrit.wikimedia.org/r/1087210 (owner: Vgutierrez)
[17:03:57] <wikibugs>	 SRE, vm-requests, Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10289779 (CDanis) LGTM! please use groups C and D if possible, that would give full diversity across ganeti groups
[17:04:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:06:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:07:16] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[17:07:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10289800 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm
[17:11:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P70886 and previous config saved to /var/cache/conftool/dbconfig/20241104-171105-ladsgroup.json
[17:12:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289838 (Clement_Goubert)
[17:13:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10289850 (Clement_Goubert)
[17:13:44] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms
[17:16:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10289885 (Clement_Goubert)
[17:16:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10289882 (Clement_Goubert)
[17:18:23] <wikibugs>	 (PS1) Clément Goubert: kubernetes: fix hostnames for eqiad refresh and expansion [puppet] - https://gerrit.wikimedia.org/r/1087216 (https://phabricator.wikimedia.org/T376185)
[17:20:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[17:23:31] <wikibugs>	 (PS1) Vgutierrez: liberica: Set gobgpd router-id [puppet] - https://gerrit.wikimedia.org/r/1087219
[17:23:56] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[17:23:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[17:24:13] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087219 (owner: Vgutierrez)
[17:25:56] <wikibugs>	 (CR) Ssingh: [C:+1] liberica: Set gobgpd router-id [puppet] - https://gerrit.wikimedia.org/r/1087219 (owner: Vgutierrez)
[17:26:06] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: Set gobgpd router-id [puppet] - https://gerrit.wikimedia.org/r/1087219 (owner: Vgutierrez)
[17:26:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70887 and previous config saved to /var/cache/conftool/dbconfig/20241104-172612-ladsgroup.json
[17:26:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[17:26:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[17:26:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70888 and previous config saved to /var/cache/conftool/dbconfig/20241104-172638-ladsgroup.json
[17:26:39] <wikibugs>	 (CR) Calbon: [V:+1] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[17:35:37] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.puppet.migrate-host for host lvs1013.eqiad.wmnet
[17:35:52] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host lvs1013.eqiad.wmnet
[17:36:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70889 and previous config saved to /var/cache/conftool/dbconfig/20241104-173604-ladsgroup.json
[17:37:16] <wikibugs>	 (CR) Klausman: [V:+2] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[17:37:23] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[17:37:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[17:37:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10290044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm completed: - srete...
[17:38:33] <wikibugs>	 (CR) Ladsgroup: [C:+2] dbtools: Drop depool and repool bashes [software] - https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: Ladsgroup)
[17:38:42] <wikibugs>	 (CR) Kosta Harlan: [C:+1] temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: Tchanders)
[17:39:53] <wikibugs>	 (Merged) jenkins-bot: dbtools: Drop depool and repool bashes [software] - https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: Ladsgroup)
[17:43:16] <vgutierrez>	 !log upload liberica 0.2 to apt.wm.o (bookworm) - T377127
[17:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:19] <stashbot>	 T377127: liberica puppetization - https://phabricator.wikimedia.org/T377127
[17:43:54] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[17:47:27] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: Brouberol)
[17:49:56] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] ats: Route rest_v1/page/(html|title) to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[17:51:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P70890 and previous config saved to /var/cache/conftool/dbconfig/20241104-175111-ladsgroup.json
[17:56:07] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1800)
[18:00:05] <jouncebot>	 ryankemper: Time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1800).
[18:01:55] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[18:01:59] <wikibugs>	 (PS1) Alexandros Kosiaris: Revert "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087222 (https://phabricator.wikimedia.org/T374683)
[18:02:43] <wikibugs>	 (CR) Alexandros Kosiaris: [V:+2 C:+2] Revert "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087222 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[18:06:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P70891 and previous config saved to /var/cache/conftool/dbconfig/20241104-180618-ladsgroup.json
[18:07:15] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10290152 (Krd) Just happened again:  Request from <redacted>.43.46 via cp3066 cp3066, Varnish XID 230643229 Upstream caches: cp3066 int Error: 429, at Mon, 04 Nov 2024 18:05:37 GMT
[18:12:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: DLynch)
[18:13:01] <wikibugs>	 (PS1) Vgutierrez: liberica: gobgpd router-id value needs to be quoted [puppet] - https://gerrit.wikimedia.org/r/1087224
[18:13:28] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087224 (owner: Vgutierrez)
[18:15:25] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: gobgpd router-id value needs to be quoted [puppet] - https://gerrit.wikimedia.org/r/1087224 (owner: Vgutierrez)
[18:21:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70892 and previous config saved to /var/cache/conftool/dbconfig/20241104-182125-ladsgroup.json
[18:21:31] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[18:21:34] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[18:21:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70893 and previous config saved to /var/cache/conftool/dbconfig/20241104-182140-ladsgroup.json
[18:25:34] <logmsgbot>	 !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bookworm
[18:29:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70894 and previous config saved to /var/cache/conftool/dbconfig/20241104-182933-ladsgroup.json
[18:31:50] <wikibugs>	 (CR) Ssingh: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1083913 (https://phabricator.wikimedia.org/T370200) (owner: BCornwall)
[18:35:18] <wikibugs>	 (CR) BCornwall: "recheck" [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: Fabfur)
[18:41:19] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: vgutierrez
[18:41:32] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: vgutierrez
[18:41:48] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2013.codfw.wmnet
[18:41:49] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2013.codfw.wmnet
[18:42:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:44:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P70895 and previous config saved to /var/cache/conftool/dbconfig/20241104-184440-ladsgroup.json
[18:45:58] <wikibugs>	 (CR) Scott French: "Thanks, Hugh!" [deployment-charts] - https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: Scott French)
[18:46:10] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: add migration release (all) [deployment-charts] - https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: Scott French)
[18:47:12] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: known issues with liberica-hcforwarder and ipip-multiqueue-optimizer
[18:47:13] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: known issues with liberica-hcforwarder and ipip-multiqueue-optimizer
[18:47:17] <wikibugs>	 (Merged) jenkins-bot: shellbox: add migration release (all) [deployment-charts] - https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: Scott French)
[18:52:43] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1085515 (https://phabricator.wikimedia.org/T374827) (owner: Slyngshede)
[18:53:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[18:54:02] <wikibugs>	 (PS1) Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683)
[18:54:21] <wikibugs>	 (PS2) Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683)
[18:54:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[18:54:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[18:55:27] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[18:55:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[18:56:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[18:56:13] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[18:56:23] <wikibugs>	 (CR) CI reject: [V:-1] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[18:56:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[18:56:45] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[18:57:27] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[18:57:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[18:58:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[18:59:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P70896 and previous config saved to /var/cache/conftool/dbconfig/20241104-185947-ladsgroup.json
[18:59:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:00:47] <wikibugs>	 (PS3) Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683)
[19:02:48] <wikibugs>	 (CR) CI reject: [V:-1] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[19:03:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[19:04:12] <wikibugs>	 (PS1) Urbanecm: Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876)
[19:04:25] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[19:04:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:07:30] <urbanecm>	 jouncebot: nowandnext
[19:07:30] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 52 minute(s)
[19:07:30] <jouncebot>	 In 1 hour(s) and 52 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2100)
[19:09:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[19:09:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[19:14:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70897 and previous config saved to /var/cache/conftool/dbconfig/20241104-191454-ladsgroup.json
[19:14:59] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[19:15:12] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[19:15:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70898 and previous config saved to /var/cache/conftool/dbconfig/20241104-191519-ladsgroup.json
[19:17:16] <wikibugs>	 (CR) Urbanecm: [C:+2] Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876) (owner: Urbanecm)
[19:17:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[19:18:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[19:18:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[19:19:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[19:20:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[19:21:03] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[19:21:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[19:22:00] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[19:22:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[19:23:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[19:23:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70899 and previous config saved to /var/cache/conftool/dbconfig/20241104-192319-ladsgroup.json
[19:38:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P70900 and previous config saved to /var/cache/conftool/dbconfig/20241104-193826-ladsgroup.json
[19:50:20] <wikibugs>	 (Merged) jenkins-bot: Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876) (owner: Urbanecm)
[19:51:34] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]]
[19:51:38] <stashbot>	 T378876: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier (October 2024) - https://phabricator.wikimedia.org/T378876
[19:53:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P70901 and previous config saved to /var/cache/conftool/dbconfig/20241104-195333-ladsgroup.json
[19:54:55] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:55:57] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[20:00:46] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]] (duration: 09m 12s)
[20:01:10] <stashbot>	 T378876: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier (October 2024) - https://phabricator.wikimedia.org/T378876
[20:07:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to snapshot* with group snapshot-admins for ebernhardson - https://phabricator.wikimedia.org/T379025 (Gehel) NEW
[20:08:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70902 and previous config saved to /var/cache/conftool/dbconfig/20241104-200840-ladsgroup.json
[20:08:45] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[20:08:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[20:09:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70903 and previous config saved to /var/cache/conftool/dbconfig/20241104-200905-ladsgroup.json
[20:09:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to snapshot* with group snapshot-admins for ebernhardson - https://phabricator.wikimedia.org/T379025#10290622 (Gehel) I approve this request for @EBernhardson
[20:17:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70904 and previous config saved to /var/cache/conftool/dbconfig/20241104-201703-ladsgroup.json
[20:17:26] <wikibugs>	 (PS1) Eevans: aqs1013: decommission [puppet] - https://gerrit.wikimedia.org/r/1087238 (https://phabricator.wikimedia.org/T379026)
[20:19:24] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[20:20:05] <wikibugs>	 (CR) Eevans: [C:+2] aqs1013: decommission [puppet] - https://gerrit.wikimedia.org/r/1087238 (https://phabricator.wikimedia.org/T379026) (owner: Eevans)
[20:20:17] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[20:20:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[20:21:13] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[20:21:24] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[20:21:56] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts aqs1013.eqiad.wmnet
[20:22:01] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[20:22:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[20:22:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[20:23:01] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[20:23:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[20:26:29] <wikibugs>	 (PS1) Gehel: admin: add ebernhardson as a member of the snapshot-admins group [puppet] - https://gerrit.wikimedia.org/r/1087239 (https://phabricator.wikimedia.org/T379025)
[20:26:33] <swfrench-wmf>	 !log zero-replica "migration" releases created for all shellbox instances - T375243
[20:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:41] <stashbot>	 T375243: Turn up PHP 8.1 Shellbox deployments - https://phabricator.wikimedia.org/T375243
[20:27:50] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.dns.netbox
[20:30:44] <wikibugs>	 (PS1) Eevans: aqs1013 replaced by aqs1022 (hardware refresh) [deployment-charts] - https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026)
[20:32:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P70905 and previous config saved to /var/cache/conftool/dbconfig/20241104-203210-ladsgroup.json
[20:32:37] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[20:34:11] <wikibugs>	 (CR) Volans: [C:+2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1087239 (https://phabricator.wikimedia.org/T379025) (owner: Gehel)
[20:34:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:35:04] <wikibugs>	 (CR) Eevans: "On a related note, we should probably do something about: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/re" [deployment-charts] - https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: Eevans)
[20:35:36] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[20:35:36] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:37] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1013.eqiad.wmnet
[20:38:08] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission aqs1013 - https://phabricator.wikimedia.org/T379026#10290723 (Eevans) a:Eevans→None
[20:39:16] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10290732 (Eevans) Open→Resolved We can close this now; aqs1013 is no more (T379026) 🪦
[20:42:15] <wikibugs>	 SRE, Data-Persistence, serviceops: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10290736 (Eevans) Open→Resolved a:Eevans aqs1013 has been decommissioned (T379026), and aqs1014 fixed; Closing
[20:42:45] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10290759 (Eevans) Open→Resolved
[20:47:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P70906 and previous config saved to /var/cache/conftool/dbconfig/20241104-204717-ladsgroup.json
[20:47:58] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002
[20:48:57] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:52:37] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:53:41] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290804 (Jhancock.wm)
[20:57:45] <wikibugs>	 SRE-OnFire, Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10290817 (Eevans) Open→Resolved a:Eevans
[20:57:46] <wikibugs>	 SRE-OnFire, Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10290814 (Eevans)
[20:59:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2100).
[21:00:05] <jouncebot>	 kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:12] <Kemayo>	 o/
[21:00:35] <wikibugs>	 (CR) Máté Szabó: "This is now unblocked." [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[21:00:39] <wikibugs>	 (PS2) Máté Szabó: Unify IPInfo access levels [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086)
[21:02:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70907 and previous config saved to /var/cache/conftool/dbconfig/20241104-210224-ladsgroup.json
[21:02:30] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[21:02:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[21:02:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kubestage2003 to codfw - jhancock@cumin2002"
[21:02:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kubestage2003 to codfw - jhancock@cumin2002"
[21:02:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:03:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubestage2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:03:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubestage2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:05:49] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002
[21:07:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:07:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:08:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70908 and previous config saved to /var/cache/conftool/dbconfig/20241104-210800-ladsgroup.json
[21:10:00] <Kemayo>	 Anyone around for deployments?
[21:12:18] <RhinosF1>	 I will try to annoy a few
[21:14:06] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002
[21:14:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubestage2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:14:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubestage2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:15:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70909 and previous config saved to /var/cache/conftool/dbconfig/20241104-211505-ladsgroup.json
[21:15:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubestage2003']
[21:15:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubestage2004']
[21:15:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubestage2003']
[21:15:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubestage2004']
[21:17:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2003.codfw.wmnet with OS bookworm
[21:17:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2004.codfw.wmnet with OS bookworm
[21:17:24] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290906 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubestage2003.codfw.wmnet with OS bookworm
[21:17:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290907 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubestage2004.codfw.wmnet with OS bookworm
[21:25:19] <Kemayo>	 RoanKattouw / Urbanecm / cjming / TheresNoTime / kindrobot / mutante / denisse: anyone able to run the deployment window?
[21:26:16] <TheresNoTime>	 I can't this evening, sorry! Hope you get a response soon!
[21:28:45] <tgr|away>	 I can run it
[21:28:51] <Kemayo>	 tgr|away: Thanks!
[21:30:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P70910 and previous config saved to /var/cache/conftool/dbconfig/20241104-213012-ladsgroup.json
[21:31:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: DLynch)
[21:31:59] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002
[21:32:06] <wikibugs>	 (Merged) jenkins-bot: Set Flow to read-only on remaining phase 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: DLynch)
[21:32:25] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]]
[21:32:28] <stashbot>	 T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990
[21:34:59] <logmsgbot>	 !log tgr@deploy2002 tgr, kemayo: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:35:33] <tgr|away>	 Kemayo: do you want to check or is it OK to continue?
[21:35:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage
[21:35:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage
[21:35:51] <Kemayo>	 tgr|away: I did a quick check and it looks good.
[21:35:58] <tgr|away>	 thx
[21:36:01] <logmsgbot>	 !log tgr@deploy2002 tgr, kemayo: Continuing with sync
[21:38:46] <Kemayo>	 tgr|away: Thanks for jumping in for it!
[21:38:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage
[21:41:06] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]] (duration: 08m 40s)
[21:41:11] <stashbot>	 T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990
[21:41:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage
[21:41:55] <tgr|away>	 !log UTC late deploys done
[21:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P70911 and previous config saved to /var/cache/conftool/dbconfig/20241104-214519-ladsgroup.json
[21:57:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:58:22] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:58:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2004.codfw.wmnet with OS bookworm
[21:58:30] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290996 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubestage2004.codfw.wmnet with OS bookworm completed: - kubestage2004 (...
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2200).
[22:00:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:00:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70912 and previous config saved to /var/cache/conftool/dbconfig/20241104-220026-ladsgroup.json
[22:00:30] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:00:31] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2003.codfw.wmnet with OS bookworm
[22:00:37] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubestage2003.codfw.wmnet with OS bookworm completed: - kubestage2003 (...
[22:04:13] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10291008 (Jhancock.wm)
[22:06:01] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10291009 (Jhancock.wm) Open→Resolved @Clement_Goubert this pair is ready
[22:12:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[22:15:54] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291022 (Jhancock.wm)
[22:16:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-gp2004 to codfw - jhancock@cumin2002"
[22:16:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-gp2004 to codfw - jhancock@cumin2002"
[22:16:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:17:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:17:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:17:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:25:24] <wikibugs>	 (CR) JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway)
[22:27:05] <wikibugs>	 (CR) JHathaway: [C:+2] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway)
[22:29:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:29:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:30:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:32:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2004']
[22:32:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2005']
[22:33:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2006']
[22:33:31] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2004']
[22:33:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2005']
[22:33:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2006']
[22:35:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2004.codfw.wmnet with OS bookworm
[22:35:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2005.codfw.wmnet with OS bookworm
[22:35:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2006.codfw.wmnet with OS bookworm
[22:35:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2004.codfw.wmnet with OS bookworm
[22:35:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291073 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2005.codfw.wmnet with OS bookworm
[22:35:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm
[22:50:02] <wikibugs>	 (CR) Scott French: [C:+1] "Looks good! Confirmed that these hosts are already in the network policy, and indeed that makes sense if they've already joined the cluste" [deployment-charts] - https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: Eevans)
[22:53:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2004.codfw.wmnet with reason: host reimage
[22:53:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2005.codfw.wmnet with reason: host reimage
[22:56:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2004.codfw.wmnet with reason: host reimage
[22:59:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2005.codfw.wmnet with reason: host reimage
[23:07:02] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, claime! Looks good - two optional comments." [puppet] - https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[23:15:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:17:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:17:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2004.codfw.wmnet with OS bookworm
[23:17:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291133 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2004.codfw.wmnet with OS bookworm completed: - mc-gp2004 (**PASS**)...
[23:18:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:18:11] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:18:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2005.codfw.wmnet with OS bookworm
[23:18:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2005.codfw.wmnet with OS bookworm completed: - mc-gp2005 (**WARN**)...
[23:21:06] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291139 (Jhancock.wm)
[23:56:16] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc-gp2006.codfw.wmnet with OS bookworm
[23:56:21] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291268 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm executed with errors: - mc-gp2006 (*...
[23:56:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-gp2006
[23:56:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-gp2006