[00:15:10] SRE, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10287153 (Platonides) The bug for multiple mailing lists was fixed several years ago: https://gitlab.com/mailman/mailman/-/issues/955 (so, hopefully, the fix is included...
[00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574
[00:38:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574 (owner: TrainBranchBot)
[01:07:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:08:28] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576
[01:08:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576 (owner: TrainBranchBot)
[01:11:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1086574 (owner: TrainBranchBot)
[01:43:20] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1086576 (owner: TrainBranchBot)
[01:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[02:02:05] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/49bd37b8f4f3d94e484accd8635c9153243ed147994c71222a2ed5739293bf63/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:16:26] SRE, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10287188 (AlphaLemur) I have seen this message on two different lists: * Wikimedia-AU-Members, October 7, 2024, 07:53 UTC. - I can confirm this was not a cross-posted me...
[02:22:05] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:37:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:37] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:11] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:01:11] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:02:37] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:19] SRE, Huggle, Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10287194 (Crazycomputers) I tracked down the issue on the Huggle side. The library Huggle uses for IRC (libirc) expects the MYINFO co...
[05:32:45] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:37:45] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:39:27] PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:46:45] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:49:27] RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:45] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:55:45] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:57:45] PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:00:45] RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:13:45] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:34:27] PROBLEM - Hadoop NodeManager on an-worker1164 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:37:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:49:27] RECOVERY - Hadoop NodeManager on an-worker1164 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:29:33] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916 (phaultfinder) NEW
[07:31:44] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[07:39:09] ops-codfw, SRE, DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10287339 (MoritzMuehlenhoff) I had a look at the IPMI logs and there are still two more of these errors logged after you reseated the memory on Friday, so it seems this was...
[07:57:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[07:57:28] (CR) Arnaudb: "no massive gain to get from this, mostly quality of life improvements that are not crucial!" [puppet] - https://gerrit.wikimedia.org/r/1084730 (owner: Arnaudb)
[07:57:49] (Abandoned) Arnaudb: mariadb: add mycli [puppet] - https://gerrit.wikimedia.org/r/1084730 (owner: Arnaudb)
[07:59:11] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2013.codfw.wmnet
[07:59:31] (PS1) Arnaudb: mariadb: fix mycnf [puppet] - https://gerrit.wikimedia.org/r/1087120
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[08:03:20] (CR) Arnaudb: [C:+2] mariadb: fix mycnf [puppet] - https://gerrit.wikimedia.org/r/1087120 (owner: Arnaudb)
[08:03:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:05:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:06:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:09:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:11:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:11:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:11:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2013.codfw.wmnet
[08:11:28] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287430 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2013.codfw.wmnet` - ganeti2013.codfw.wmnet (*...
[08:12:09] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2014.codfw.wmnet
[08:15:32] !log push Drop labtestwikitech return traffic term to eqiad routers - CR1083589
[08:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:41] /cc taavi ^
[08:16:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:21:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: waiting for productionnization T373579
[08:21:26] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[08:21:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2239.codfw.wmnet with reason: waiting for productionnization T373579
[08:22:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:23:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:23:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:23:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2014.codfw.wmnet
[08:23:16] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287437 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2014.codfw.wmnet` - ganeti2014.codfw.wmnet (*...
[08:24:27] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10287438 (MoritzMuehlenhoff)
[08:24:46] ops-codfw, SRE, DC-Ops, decommission-hardware, Infrastructure-Foundations: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596#10287452 (MoritzMuehlenhoff)
[08:26:42] (CR) Brouberol: [C:+2] airflow: enable Kerberos auth on the API backend [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: Brouberol)
[08:38:40] (PS1) Muehlenhoff: Switch ganeti1039 to ganeti1052 to nftables [puppet] - https://gerrit.wikimedia.org/r/1087123
[08:50:56] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:51:07] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:53:00] (CR) Muehlenhoff: [C:+2] Switch ganeti1039 to ganeti1052 to nftables [puppet] - https://gerrit.wikimedia.org/r/1087123 (owner: Muehlenhoff)
[08:57:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:57:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:59:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[09:00:18] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10287518 (elukey) Fixed 1044. For some reason IPv6 support was disabled, so our settings like `IPv6AutoConfigEnabled: False` led to a HTTP 400. I connecte...
[09:04:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:04:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[09:06:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti1045.eqiad.wmnet with reason: reboots for nftables
[09:06:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti1045.eqiad.wmnet with reason: reboots for nftables
[09:06:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: reboots for nftables
[09:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: reboots for nftables
[09:09:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:10:17] (CR) David Caro: [C:+2] P:toolforge::proxy: use svc.toolforge.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1080056 (owner: Majavah)
[09:20:11] (CR) Arnaudb: [C:+1] service::catalog: mark apus service as paging [puppet] - https://gerrit.wikimedia.org/r/1085617 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:21:05] (CR) MVernon: [C:+2] service::catalog: mark apus service as paging [puppet] - https://gerrit.wikimedia.org/r/1085617 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:21:46] (CR) Ayounsi: [C:+1] "Good idea!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1085590 (https://phabricator.wikimedia.org/T378751) (owner: Cathal Mooney)
[09:21:56] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 (MoritzMuehlenhoff) NEW
[09:21:58] SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10287549 (MatthewVernon)
[09:22:07] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10287561 (MoritzMuehlenhoff) p:Triage→Medium
[09:23:50] SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10287562 (MatthewVernon) Open→Resolved a:MatthewVernon Aiming to migrate first production user this quarter.
[09:25:02] (CR) Elukey: "Hello! I don't particularly love the -bookworm suffix in the dir name, in other places we have a specific /bookworm/etc.. structure, but t" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:25:17] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10287573 (MatthewVernon) It's now behaving itself properly.
[09:28:32] (CR) Brouberol: "I haven't tried to build these. Is there a process I could follow to run a build before we merge?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:29:18] (PS1) Ilias Sarantopoulos: ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048)
[09:29:37] (PS2) Brouberol: Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928)
[09:30:12] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:31:16] SRE, SRE-swift-storage, Ceph: Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922 (MatthewVernon) NEW
[09:32:43] SRE, SRE-swift-storage, Ceph: Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10287610 (MatthewVernon) p:Triage→Medium
[09:33:12] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:33:48] (PS1) Muehlenhoff: Send check-cumin-aliases output only to SRE IF alias [puppet] - https://gerrit.wikimedia.org/r/1087133
[09:34:12] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:35:46] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:36:52] (CR) Muehlenhoff: Publish JDK8 images based on Debian Bookworm (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:37:12] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:37:17] (CR) Volans: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1087133 (owner: Muehlenhoff)
[09:37:39] (CR) Brouberol: Publish JDK8 images based on Debian Bookworm (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:37:45] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:40:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:40:57] (PS7) Elukey: sre.hosts.provision: initial UEFI support [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:41:05] (CR) Elukey: sre.hosts.provision: initial UEFI support (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:41:52] (CR) Muehlenhoff: [C:+1] "Looks good" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:42:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:45:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:47:21] (PS1) Brouberol: global_config: expose additional ports on hadoop masters/workers [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928)
[09:47:40] (PS2) Brouberol: global_config: expose additional ports on hadoop masters/workers [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928)
[09:48:42] (CR) Muehlenhoff: [C:+2] Send check-cumin-aliases output only to SRE IF alias [puppet] - https://gerrit.wikimedia.org/r/1087133 (owner: Muehlenhoff)
[09:50:02] (CR) Kosta Harlan: [C:+1] Schedule daily runs of WikimediaEvents UpdatePeriodicMetrics.php [puppet] - https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) (owner: Dreamy Jazz)
[09:50:37] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:51:35] (CR) Elukey: [C:+2] sre.hosts.provision: initial UEFI support [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[09:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:53:37] (CR) Kevin Bazira: [C:+1] ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:54:23] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4443/co" [puppet] - https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[09:54:38] (PS1) Brouberol: global_config: define external services entries for the hive metastore servers [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928)
[09:54:48] RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:55:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:56:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:56:41] (CR) Volans: [C:+2] Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[09:56:59] !log deploying spicerack v8.15.2 to cumin[12]002
[09:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:57:36] (PS2) Brouberol: global_config: define external services entries for the hive metastore servers [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928)
[09:59:23] (CR) Ilias Sarantopoulos: [C:+2] ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[09:59:29] (CR) Muehlenhoff: [C:+2] Deprecate system::role for Swift roles [puppet] - https://gerrit.wikimedia.org/r/1083158 (owner: Muehlenhoff)
[09:59:50] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4444/co" [puppet] - https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol)
[10:00:48] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:01:28] (Merged) jenkins-bot: ml-services: update revscoring to kserve 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087130 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[10:01:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:01:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:02:22] (Merged) jenkins-bot: Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[10:02:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:06:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:06:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:07:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[10:08:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[10:08:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70850 and previous config saved to /var/cache/conftool/dbconfig/20241104-100813-ladsgroup.json
[10:08:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:14:29] (CR) Cathal Mooney: [C:+1] "Sounds like a good idea +1" [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:15:33] SRE, Charts, Infrastructure-Foundations, Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939#10287851 (MatthewVernon)
[10:15:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70851 and previous config saved to /var/cache/conftool/dbconfig/20241104-101552-ladsgroup.json
[10:16:33] SRE, Infrastructure-Foundations, Epic, Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10287863 (MatthewVernon)
[10:17:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:17:18] (PS1) Ladsgroup: dbtools: Drop depool and repool bashes [software] - https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738)
[10:18:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:18:48] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:20:52] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:21:07] (PS1) Slyngshede: Provide the option to run an embedded Redis server. [software/bitu] - https://gerrit.wikimedia.org/r/1087140
[10:21:16] (CR) Slyngshede: [C:+2] Fix unblock bug. [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:22:07] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287892 (MoritzMuehlenhoff)
[10:23:13] (Merged) jenkins-bot: Fix unblock bug. [software/bitu] - https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:26:52] (CR) Ayounsi: [C:+2] Prefer Lumen to reach ATT [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:27:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[10:27:39] SRE, SRE-Unowned, wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10287901 (MatthewVernon) @jijiki can you expand on what you mean, please? This task is currently too broad...
[10:27:51] (Merged) jenkins-bot: Prefer Lumen to reach ATT [homer/public] - https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) (owner: Ayounsi)
[10:29:47] (CR) Arnaudb: "I was unaware of `depool-and-wait`!
Otherwise, LGTM" [software] - 10https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: 10Ladsgroup) [10:30:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P70852 and previous config saved to /var/cache/conftool/dbconfig/20241104-103059-ladsgroup.json [10:31:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:31:12] !log installing libseccomp updates from Bookworm point release [10:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:32] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:35:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10287958 (10jcrespo) Was db2190 taken care, data-wise/repooled? Not super worried or super-urgent, but to track it somewhere and making sure it doesn't fall into the cracks of depooled host... 
[10:37:01] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10287959 (10Ladsgroup) Yes :) [10:38:57] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:08] (03CR) 10Brouberol: "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:39:12] (03PS3) 10Brouberol: Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) [10:39:29] (03CR) 10Brouberol: [C:03+2] Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:39:33] (03CR) 10Brouberol: [V:03+2 C:03+2] Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:40:41] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287970 (10MoritzMuehlenhoff) [10:41:22] !log installing libtool updates from Bookworm point release [10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:42:20] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:43:31] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [10:46:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P70853 and previous config saved to /var/cache/conftool/dbconfig/20241104-104606-ladsgroup.json [10:47:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [10:48:38] !log eqiad: Prefer Lumen to reach ATT - T377844 [10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:18] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! We could make the 'multihop' an optional attribute for the YAML dict but for these few I think it's fine in the Jinja." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi) [10:50:20] (03CR) 10Cathal Mooney: [C:03+1] Add temporary LVS community for liberica test [homer/public] - 10https://gerrit.wikimedia.org/r/1084760 (https://phabricator.wikimedia.org/T378453) (owner: 10Ayounsi) [10:50:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10287999 (10MoritzMuehlenhoff) [10:52:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:54:47] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1100) [11:01:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T376905)', diff saved to https://phabricator.wikimedia.org/P70854 and previous config saved to /var/cache/conftool/dbconfig/20241104-110113-ladsgroup.json [11:01:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [11:01:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [11:01:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1080069 (owner: 10EoghanGaffney) [11:01:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70855 and previous config saved to /var/cache/conftool/dbconfig/20241104-110141-ladsgroup.json [11:05:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:06:28] 
(03PS1) 10Muehlenhoff: Assign ganeti role to ganeti1039/ganeti1040 [puppet] - 10https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921) [11:08:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [11:09:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70856 and previous config saved to /var/cache/conftool/dbconfig/20241104-110953-ladsgroup.json [11:11:00] 06SRE, 06Infrastructure-Foundations, 10netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809#10288046 (10MoritzMuehlenhoff) >>! In T378809#10284244, @cmooney wrote: >>>! In T378809#10284231, @CDanis wrote: >> I'm pretty confident this is the same as T348730, and I thi... [11:12:35] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:13:19] (03CR) 10Muehlenhoff: [C:03+2] Assign ganeti role to ganeti1039/ganeti1040 [puppet] - 10https://gerrit.wikimedia.org/r/1087153 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [11:14:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10288061 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [11:17:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:19:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10288086 (10phaultfinder) 
[11:22:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:22:29] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:24:51] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Lua script for routing 8.1-enrolled traffic [puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [11:25:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P70857 and previous config saved to /var/cache/conftool/dbconfig/20241104-112501-ladsgroup.json [11:27:25] FIRING: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:04] (03PS1) 10Ilias Sarantopoulos: ml-services: update articlequality staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087157 [11:33:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10288128 (10Clement_Goubert) Partition table copied to the new disk and added it to the software raid. 
Rebuild in progress. ` cgoubert@wikikube-worker2068:~$ cat /proc/mdstat Person... [11:34:05] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:37:25] FIRING: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [11:40:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P70858 and previous config saved to /var/cache/conftool/dbconfig/20241104-114008-ladsgroup.json [11:42:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:20] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:45:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:54:14] (03PS1) 10Marostegui: Revert "db2190: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/1087160 [11:55:10] (03CR) 10Marostegui: [C:03+2] Revert "db2190: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/1087160 (owner: 10Marostegui) [11:55:15] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70859 and previous config saved to /var/cache/conftool/dbconfig/20241104-115514-ladsgroup.json [11:56:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:58:13] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:58:55] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10288156 (10Marostegui) Notifications were disabled, I have enabled them as the host is serving queries. [12:01:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [12:08:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:08:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [12:10:10] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:11:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B [12:11:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B [12:16:28] (03PS1) 10Slyngshede: P:idp enable Redis TGT backend [puppet] - 10https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728) [12:19:32] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:19:58] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:20:22] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:22:12] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:22:40] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:24:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:26:08] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update articlequality staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087157 (owner: 10Ilias Sarantopoulos) [12:32:41] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update articlequality staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087157 (owner: 10Ilias Sarantopoulos) [12:33:49] (03Merged) 10jenkins-bot: ml-services: update articlequality staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087157 (owner: 10Ilias Sarantopoulos) [12:34:10] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10288300 (10MoritzMuehlenhoff) [12:34:33] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:35:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:37:32] (03CR) 10Ladsgroup: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087160 (owner: 10Marostegui) [12:44:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [12:45:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [12:45:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:45:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:45:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70860 and previous config saved to /var/cache/conftool/dbconfig/20241104-124533-ladsgroup.json [12:49:30] !log deploy "Add temporary LVS community for liberica test" - T378453 [12:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:43] T378453: Testing liberica with ncredir@eqiad - https://phabricator.wikimedia.org/T378453 [12:55:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70861 and previous config saved to /var/cache/conftool/dbconfig/20241104-125459-ladsgroup.json [13:06:46] !log Started MediaModeration scan on all wikis other than s4 (commonswiki + testcommonswiki) - https://wikitech.wikimedia.org/wiki/MediaModeration [13:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P70862 and previous config saved to 
/var/cache/conftool/dbconfig/20241104-131006-ladsgroup.json [13:11:25] !log Started slow MediaModeration scan for commonswiki to be scanning as close to upload as possible - https://wikitech.wikimedia.org/wiki/MediaModeration [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B [13:25:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P70864 and previous config saved to /var/cache/conftool/dbconfig/20241104-132513-ladsgroup.json [13:25:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1039.eqiad.wmnet to cluster eqiad and group B [13:31:22] (03CR) 10Clément Goubert: "Indeed it does, thanks for that!" [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:36:49] RESOLVED: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:38:49] PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:39:18] (03CR) 10Marostegui: [C:03+1] mariadb: productionize db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1087179 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [13:39:31] (03CR) 10Marostegui: "thanks I commented on the new one!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1084128 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [13:40:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70865 and previous config saved to /var/cache/conftool/dbconfig/20241104-134021-ladsgroup.json [13:40:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:40:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:45:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:45:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:46:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70866 and previous config saved to /var/cache/conftool/dbconfig/20241104-134605-ladsgroup.json [13:49:09] (03PS1) 10Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) [13:49:21] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:49:36] (03CR) 10CI reject: [V:04-1] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [13:50:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with 
reason: Schema change T367856 [13:50:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Schema change T367856 [13:50:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:51:04] !log Start schema change on redacteddb1001:s8 T367856 (this will make replication in s8 lag for around 2-3 days) [13:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70867 and previous config saved to /var/cache/conftool/dbconfig/20241104-135516-ladsgroup.json [13:56:45] PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:56:52] (03PS1) 10Muehlenhoff: spark: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087186 [13:56:52] (03PS1) 10Muehlenhoff: Remove spark2 profile [puppet] - 10https://gerrit.wikimedia.org/r/1087187 [13:59:42] o/ [13:59:50] RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1400). [14:00:05] HouseOfM: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:29] o/ [14:01:09] I can deploy! [14:01:32] Thanks :) [14:01:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087186 (owner: 10Muehlenhoff) [14:06:17] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10288826 (10Lucas_Werkmeister_WMDE) IMHO it’s a bit of an awkward time to add someone to `restricted`, given the status of T378429, but sure ^^ let’s see how far that gets us. [14:06:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288830 (10Marostegui) >>! In T378143#10266787, @ABran-WMF wrote: > I've tried to reproduce what's been done in T355269 which is quite close to what we... [14:07:04] (03CR) 10Lucas Werkmeister (WMDE): Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [14:07:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288831 (10ABran-WMF) basically a validation of the picked up positions, I stuck to the existing topology as there was a 1:1 match between hosts and ea... 
[14:08:22] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:09:00] (03CR) 10Mhorsey: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [14:09:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff) [14:10:19] (03CR) 10Brouberol: [C:03+1] spark: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087186 (owner: 10Muehlenhoff) [14:10:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P70868 and previous config saved to /var/cache/conftool/dbconfig/20241104-141023-ladsgroup.json [14:10:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [14:11:20] (03Merged) 10jenkins-bot: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [14:11:39] (03CR) 10CDanis: [C:03+1] Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:11:51] scap is fetching lots of submodules [14:11:55] first deployment of the week, maybe ^^ [14:12:19] Oh good, lol [14:12:31] (03PS11) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 
(https://phabricator.wikimedia.org/T377127) [14:12:33] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]] [14:12:35] T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252 [14:12:44] (03Abandoned) 10Volans: sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans) [14:12:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10288851 (10MatthewVernon) I think, given @jhathaway's [[ https://phabricator.wikimedia.org/T378584#10284180 | update ]] on T378584 we should try booting... [14:13:06] (03CR) 10CDanis: [C:03+1] mesh.service: introduce a way to further specify the service label selectors (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:13:46] RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:13:53] (03PS12) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) [14:13:59] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10288874 (10MatthewVernon) @jhathaway great, thanks. With the new thanos backends hopefully arriving this week (which are also... 
[14:17:27] (03PS8) 10Vgutierrez: role,site: Provide a liberica role and use it on lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) [14:18:32] (03CR) 10Brouberol: [C:03+2] mesh.service: introduce a way to further specify the service label selectors (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:18:38] (03CR) 10Brouberol: [C:03+2] Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:18:41] (03CR) 10Brouberol: [C:03+2] airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:19:48] (03Merged) 10jenkins-bot: Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:19:52] (03Merged) 10jenkins-bot: mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [14:21:15] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [14:22:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10288906 (10Marostegui) So, we have 6 rows available, so let's place one per row. For A3, there's already an external store host there, so if there's a... 
[14:22:50] (03PS14) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) [14:23:13] !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:20] T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252 [14:23:25] HouseOfM: please test :) [14:23:43] Will do :) [14:24:32] !log uploaded php7.4 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u3 to component/icu67 (backports of latest security fixes to our PHP 7.4 build) [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:45] marostegui: question about https://phabricator.wikimedia.org/T367856#10288593 – does this affect the public replicas (quarry etc.)? or is this replication lag in a different kind of database? 
[14:25:01] Lucas_WMDE: it will, yes [14:25:07] ok [14:25:22] Lucas_WMDE: They'll get around 2 days of lag for s8, but not yet [14:25:22] we’ll probably put a brief mention of it in the wikidata weekly summary (linking to that phab task) if that’s okay [14:25:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P70869 and previous config saved to /var/cache/conftool/dbconfig/20241104-142530-ladsgroup.json [14:25:47] Lucas_WMDE: yeah, I can tell you when that will happen if you like [14:25:53] (unlikely this week) [14:25:57] ah ok [14:26:01] sure [14:26:07] then I’ll take it out of the summary for this week again :) [14:26:35] Lucas_WMDE: We are going to alter each wikireplica so they will be depooled, but at some point we will alter their master and then their master-master so there will be two periods of 2 days of lag, but that will take a few days to happen. [14:27:09] ok [14:27:46] it’s not super important that we announce it, I think, but I saw it fly past and thought “I remember that causing some confusion before” and figured we might as well include it in the weekly summary [14:27:53] but also, we shouldn’t do that too early ^^ [14:27:55] so I’m glad I asked :) [14:27:58] Lucas_WMDE: all good [14:28:15] !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Continuing with sync [14:28:32] if it’s convenient, you can ping me (or e.g. Lydia) for inclusion in the next weekly summary [14:28:38] if not, we’ll survive too :) [14:28:49] Lucas_WMDE: yeah, I think it is a very good idea to include it there :) [14:29:00] So thanks for that! I will keep you posted [14:29:31] okay, thanks! 
[14:30:04] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1087194 (https://phabricator.wikimedia.org/T373579) [14:30:05] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1087194 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:32:05] (03PS14) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [14:35:19] (03PS2) 10Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) [14:36:05] (03CR) 10Elukey: [C:03+1] "Overall it looks good to me, and even if it contains hacks they are self-contained for this use case. I am inclined to proceed, we have so" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [14:36:12] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084765|Exclude affiliates from P&E dashboard integration for CampaignEvents Extension (T377252)]] (duration: 23m 39s) [14:36:15] T377252: Disable Program and Events Dashboard integration for unsupported wikis - https://phabricator.wikimedia.org/T377252 [14:36:39] (03PS3) 10Klausman: admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) [14:37:37] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:58] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4446/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:38:36] looks like that’s everything for now [14:38:44] !log UTC afternoon backport+config window done [14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:58] (03CR) 10Ayounsi: [C:03+2] Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi) [14:39:03] Thanks Lucas_WMDE [14:39:31] (03Merged) 10jenkins-bot: Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi) [14:40:24] np :) [14:40:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70870 and previous config saved to /var/cache/conftool/dbconfig/20241104-144037-ladsgroup.json [14:40:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:40:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:41:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70871 and previous config saved to /var/cache/conftool/dbconfig/20241104-144101-ladsgroup.json [14:42:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289015 (10elukey) We do have 
support for UEFI in the provision cookbook and in reimage (after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/10... [14:50:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70872 and previous config saved to /var/cache/conftool/dbconfig/20241104-145027-ladsgroup.json [14:59:57] (03PS1) 10Tchanders: temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) [15:02:28] (03CR) 10Dreamy Jazz: [C:03+1] temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders) [15:02:37] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P70873 and previous config saved to /var/cache/conftool/dbconfig/20241104-150534-ladsgroup.json [15:07:50] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10289123 (10phaultfinder) [15:10:47] (03CR) 10Ssingh: [C:03+1] "Looks good, comparing PS 9 to current 12!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:20:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P70874 and previous config saved to /var/cache/conftool/dbconfig/20241104-152041-ladsgroup.json [15:20:46] (03CR) 10Vgutierrez: [C:03+2] profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:23:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289174 (10MatthewVernon) I think from that the two big issues are the partman cookbooks (which we'd obviously need the one we're using for these nodes t... [15:25:02] 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Request Kerberos identity for jsn.sherman - https://phabricator.wikimedia.org/T378786#10289193 (10Ottomata) [15:25:13] (03CR) 10Vgutierrez: [C:03+2] role,site: Provide a liberica role and use it on lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:25:35] (03CR) 10Ilias Sarantopoulos: admin/data.yaml: Add researchers to users of ml-lab100x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:25:48] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10289196 (10LSobanski) [15:29:01] (03CR) 10Ssingh: [C:03+2] geo-maps: switch CN to to eqsin (from ulsfo) [dns] - 10https://gerrit.wikimedia.org/r/1085456 (https://phabricator.wikimedia.org/T378744) (owner: 10Ssingh) [15:29:14] !log running authdns-update to move CN traffic to eqsin from ulsfo: T378744 [15:29:16] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:16] T378744: GeoDNS: consider sending CN to eqsin - https://phabricator.wikimedia.org/T378744 [15:31:24] (03CR) 10Brouberol: [C:03+1] "Good eyes!" [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff) [15:32:32] (03CR) 10Klausman: [V:03+1] admin/data.yaml: Add researchers to users of ml-lab100x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:34:27] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [15:34:41] (03PS2) 10Majavah: Drop support for s11 MariaDB section [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) [15:35:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70876 and previous config saved to /var/cache/conftool/dbconfig/20241104-153548-ladsgroup.json [15:35:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:35:53] !log upload liberica 0.1 to apt.wm.o (bookworm) - T377127 [15:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:02] T377127: liberica puppetization - https://phabricator.wikimedia.org/T377127 [15:36:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:36:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70877 and previous config saved to /var/cache/conftool/dbconfig/20241104-153613-ladsgroup.json [15:37:10] (03CR) 10Majavah: [C:03+2] Drop support for 
s11 MariaDB section [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [15:40:05] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986 (10herron) 03NEW [15:40:08] (03PS1) 10Marostegui: mariadb: prod dbproxy200[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [15:40:08] (03CR) 10Marostegui: [C:03+1] "Looks good, you may have a race condition if you productionize db2235 before you merge this as db2235 will replace db2135" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [15:40:09] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987 (10herron) 03NEW [15:40:13] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988 (10herron) 03NEW [15:40:13] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [15:40:17] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989 (10herron) 03NEW [15:44:11] 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10289310 (10herron) [15:44:34] (03CR) 10Hnowlan: [C:03+1] shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [15:45:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70878 and previous config saved to /var/cache/conftool/dbconfig/20241104-154543-ladsgroup.json [15:46:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox 
[15:46:32] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:46:43] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596#10289345 (10Jhancock.wm) 05Open→03Resolved [15:51:32] (03CR) 10Ilias Sarantopoulos: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:52:00] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [15:52:49] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [15:57:46] (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1087179 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [15:58:50] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:00:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db[2135,2235].codfw.wmnet with reason: cloning db2135@db2235 [16:00:44] 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10289440 (10CDanis) p:05Triage→03Medium [16:00:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db[2135,2235].codfw.wmnet with reason: cloning db2135@db2235 [16:00:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P70879 and previous config saved to /var/cache/conftool/dbconfig/20241104-160050-ladsgroup.json [16:00:52] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for 
Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10289443 (10elukey) p:05Triage→03Medium [16:01:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:04] (03CR) 10Marostegui: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [16:01:10] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10289446 (10Jhancock.wm) removed CPU 2. gonna let it run for a little and see if it generates errors. then we'll at least know which one is the problem [16:02:18] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2135.codfw.wmnet onto db2235.codfw.wmnet [16:03:28] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:51] (03CR) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [16:05:12] PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2135.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2135.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:05:24] its me, I failed to downtime a node [16:05:25] fixing [16:05:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:05:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:05:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2160.codfw.wmnet with reason: cloning db2135@db2235 [16:06:02] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 3:00:00 on db2160.codfw.wmnet with reason: cloning db2135@db2235 [16:06:04] sorry for the noise! [16:06:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:07:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:08:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289477 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm [16:08:33] (03PS15) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) [16:08:59] (03CR) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [16:10:12] (03PS3) 10Brouberol: airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) [16:10:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10289485 (10bking) a:03VRiley-WMF [16:11:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10289487 (10bking) [16:11:31] (03CR) 10Brouberol: [C:03+2] airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [16:12:11] (03CR) 10CDanis: [C:03+1] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [16:12:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2135.codfw.wmnet onto db2235.codfw.wmnet [16:13:06] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [16:13:12] RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:10] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [16:14:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:14:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:15:03] (03PS1) 10Slyngshede: P:idp add default Redis database for cloud. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087203 [16:15:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:15:27] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10289524 (10LSobanski) a:03eoghan [16:15:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10289525 (10MoritzMuehlenhoff) [16:15:51] (03PS2) 10Arnaudb: mariadb: productionize db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) [16:15:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P70880 and previous config saved to /var/cache/conftool/dbconfig/20241104-161557-ladsgroup.json [16:16:15] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1087203 (owner: 10Slyngshede) [16:16:54] (03CR) 10Slyngshede: [C:03+2] P:idp add default Redis database for cloud. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087203 (owner: 10Slyngshede) [16:21:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:23:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:23:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:25:06] (03PS1) 10Elukey: profile::docker::report: use the internal registry endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) [16:25:07] (03PS1) 10Elukey: docker_registry_ha: reduce from 300 to 180 the nginx timeout [puppet] - 10https://gerrit.wikimedia.org/r/1087206 (https://phabricator.wikimedia.org/T378618) [16:27:27] (03PS16) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) [16:27:32] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4447/console" [puppet] - 10https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [16:28:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:28:55] (03CR) 
10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4449/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [16:30:05] jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1630). [16:31:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70881 and previous config saved to /var/cache/conftool/dbconfig/20241104-163104-ladsgroup.json [16:31:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [16:31:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [16:31:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70882 and previous config saved to /var/cache/conftool/dbconfig/20241104-163129-ladsgroup.json [16:34:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:37:46] (03CR) 10Clément Goubert: [C:03+1] profile::docker::report: use the internal registry endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [16:37:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS 
bookworm [16:38:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10289661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm complet... [16:38:17] (03CR) 10Clément Goubert: [C:03+1] docker_registry_ha: reduce from 300 to 180 the nginx timeout [puppet] - 10https://gerrit.wikimedia.org/r/1087206 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [16:40:09] (03PS1) 10DLynch: Set Flow to read-only on remaining phase 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) [16:40:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70883 and previous config saved to /var/cache/conftool/dbconfig/20241104-164051-ladsgroup.json [16:42:17] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [16:42:20] (03CR) 10Scott French: [C:03+2] shellbox: add optional .spec.strategy override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [16:44:02] (03Merged) 10jenkins-bot: shellbox: add optional .spec.strategy override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [16:50:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289692 (10VRiley-WMF) @Clement_Goubert It seems that there are servers already named wikikube-worker1240, wikikube-worker1241, and wikikube-worker12... 
[16:53:16] 06SRE, 07SRE-Unowned, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10289712 (10bd808) I can't speak for Effie, but my imagined reimplementation of wikitech-static would be to produce an OCI container image daily containing Wikitech's content along with a Medi... [16:53:49] (03PS1) 10Vgutierrez: hiera: Enable puppet7 for liberica role [puppet] - 10https://gerrit.wikimedia.org/r/1087210 [16:55:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087210 (owner: 10Vgutierrez) [16:55:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087210 (owner: 10Vgutierrez) [16:55:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P70885 and previous config saved to /var/cache/conftool/dbconfig/20241104-165558-ladsgroup.json [16:56:08] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10289749 (10MoritzMuehlenhoff) [16:56:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289750 (10Clement_Goubert) Yeah sorry about that, I got confused with the host renaming we've been doing. Thanks for catching it. I'll amend the tas... 
[16:59:44] !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bookworm [17:00:33] (03PS15) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [17:01:24] (03PS2) 10Vgutierrez: hiera: Enable puppet7 for liberica role [puppet] - 10https://gerrit.wikimedia.org/r/1087210 [17:02:10] (03CR) 10Ssingh: [C:03+1] hiera: Enable puppet7 for liberica role [puppet] - 10https://gerrit.wikimedia.org/r/1087210 (owner: 10Vgutierrez) [17:02:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker12[43-58] - https://phabricator.wikimedia.org/T378185#10289766 (10Clement_Goubert) @VRiley-WMF I've messed up the hostnames on that task as well as {T377021}, I'm so sorry. I'll sort all of this out first thing tomorrow. [17:02:15] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable puppet7 for liberica role [puppet] - 10https://gerrit.wikimedia.org/r/1087210 (owner: 10Vgutierrez) [17:03:57] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10289779 (10CDanis) LGTM! 
please use groups C and D if possible, that would give full diversity across ganeti groups [17:04:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:06:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:16] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [17:07:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10289800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm [17:11:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P70886 and previous config saved to /var/cache/conftool/dbconfig/20241104-171105-ladsgroup.json [17:12:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10289838 (10Clement_Goubert) [17:13:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10289850 (10Clement_Goubert) [17:13:44] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [17:16:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10289885 (10Clement_Goubert) [17:16:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10289882 (10Clement_Goubert) [17:18:23] (03PS1) 10Clément Goubert: kubernetes: fix hostnames for eqiad refresh and expansion [puppet] - 10https://gerrit.wikimedia.org/r/1087216 
(https://phabricator.wikimedia.org/T376185) [17:20:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [17:23:31] (03PS1) 10Vgutierrez: liberica: Set gobgpd router-id [puppet] - 10https://gerrit.wikimedia.org/r/1087219 [17:23:56] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [17:23:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [17:24:13] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087219 (owner: 10Vgutierrez) [17:25:56] (03CR) 10Ssingh: [C:03+1] liberica: Set gobgpd router-id [puppet] - 10https://gerrit.wikimedia.org/r/1087219 (owner: 10Vgutierrez) [17:26:06] (03CR) 10Vgutierrez: [C:03+2] liberica: Set gobgpd router-id [puppet] - 10https://gerrit.wikimedia.org/r/1087219 (owner: 10Vgutierrez) [17:26:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70887 and previous config saved to /var/cache/conftool/dbconfig/20241104-172612-ladsgroup.json [17:26:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [17:26:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [17:26:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70888 and previous config saved to /var/cache/conftool/dbconfig/20241104-172638-ladsgroup.json [17:26:39] (03CR) 10Calbon: [V:03+1] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) 
(owner: 10Klausman) [17:35:37] !log vgutierrez@cumin1002 START - Cookbook sre.puppet.migrate-host for host lvs1013.eqiad.wmnet [17:35:52] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host lvs1013.eqiad.wmnet [17:36:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70889 and previous config saved to /var/cache/conftool/dbconfig/20241104-173604-ladsgroup.json [17:37:16] (03CR) 10Klausman: [V:03+2] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [17:37:23] (03CR) 10Klausman: [V:03+2 C:03+2] admin/data.yaml: Add researchers to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1087182 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [17:37:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [17:37:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10290044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm completed: - srete... 
[17:38:33] (03CR) 10Ladsgroup: [C:03+2] dbtools: Drop depool and repool bashes [software] - 10https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: 10Ladsgroup) [17:38:42] (03CR) 10Kosta Harlan: [C:03+1] temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders) [17:39:53] (03Merged) 10jenkins-bot: dbtools: Drop depool and repool bashes [software] - 10https://gerrit.wikimedia.org/r/1087138 (https://phabricator.wikimedia.org/T377738) (owner: 10Ladsgroup) [17:43:16] !log upload liberica 0.2 to apt.wm.o (bookworm) - T377127 [17:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:19] T377127: liberica puppetization - https://phabricator.wikimedia.org/T377127 [17:43:54] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [17:47:27] (03CR) 10Alexandros Kosiaris: [C:03+1] mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [17:49:56] (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Route rest_v1/page/(html|title) to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [17:51:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P70890 and previous config saved to /var/cache/conftool/dbconfig/20241104-175111-ladsgroup.json [17:56:07] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1800) 
[18:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T1800). [18:01:55] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [18:01:59] (03PS1) 10Alexandros Kosiaris: Revert "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087222 (https://phabricator.wikimedia.org/T374683) [18:02:43] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Revert "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087222 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [18:06:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P70891 and previous config saved to /var/cache/conftool/dbconfig/20241104-180618-ladsgroup.json [18:07:15] 06SRE, 06serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10290152 (10Krd) Just happened again: Request from .43.46 via cp3066 cp3066, Varnish XID 230643229 Upstream caches: cp3066 int Error: 429, at Mon, 04 Nov 2024 18:05:37 GMT [18:12:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: 10DLynch) [18:13:01] (03PS1) 10Vgutierrez: liberica: gobgpd router-id value needs to be quoted [puppet] - 10https://gerrit.wikimedia.org/r/1087224 [18:13:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087224 (owner: 10Vgutierrez) [18:15:25] (03CR) 10Vgutierrez: [C:03+2] liberica: gobgpd router-id value needs to be quoted 
[puppet] - 10https://gerrit.wikimedia.org/r/1087224 (owner: 10Vgutierrez) [18:21:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T376905)', diff saved to https://phabricator.wikimedia.org/P70892 and previous config saved to /var/cache/conftool/dbconfig/20241104-182125-ladsgroup.json [18:21:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [18:21:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [18:21:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70893 and previous config saved to /var/cache/conftool/dbconfig/20241104-182140-ladsgroup.json [18:25:34] !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bookworm [18:29:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70894 and previous config saved to /var/cache/conftool/dbconfig/20241104-182933-ladsgroup.json [18:31:50] (03CR) 10Ssingh: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1083913 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [18:35:18] (03CR) 10BCornwall: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [18:41:19] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: vgutierrez [18:41:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: vgutierrez [18:41:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2013.codfw.wmnet [18:41:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2013.codfw.wmnet [18:42:37] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P70895 and previous config saved to /var/cache/conftool/dbconfig/20241104-184440-ladsgroup.json [18:45:58] (03CR) 10Scott French: "Thanks, Hugh!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [18:46:10] (03CR) 10Scott French: [C:03+2] shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [18:47:12] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: known issues with liberica-hcforwarder and ipip-multiqueue-optimizer [18:47:13] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: known issues with liberica-hcforwarder and ipip-multiqueue-optimizer [18:47:17] (03Merged) 10jenkins-bot: shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [18:52:43] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1085515 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [18:53:50] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [18:54:02] (03PS1) 10Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) [18:54:21] (03PS2) 10Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) [18:54:30] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [18:54:32] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [18:55:27] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [18:55:28] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:56:12] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [18:56:13] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:56:23] (03CR) 10CI reject: [V:04-1] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [18:56:44] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:56:45] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [18:57:27] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [18:57:28] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [18:58:06] !log swfrench@deploy2002 helmfile [staging] DONE 
helmfile.d/services/shellbox-video: apply [18:59:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P70896 and previous config saved to /var/cache/conftool/dbconfig/20241104-185947-ladsgroup.json [18:59:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [19:00:47] (03PS3) 10Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) [19:02:48] (03CR) 10CI reject: [V:04-1] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [19:03:57] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [19:04:12] (03PS1) 10Urbanecm: Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876) [19:04:25] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [19:04:55] RESOLVED: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [19:07:30] jouncebot: nowandnext [19:07:30] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [19:07:30] In 1 hour(s) and 52 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2100) [19:09:32] !log swfrench@deploy2002 helmfile [codfw] START 
helmfile.d/services/shellbox-syntaxhighlight: apply [19:09:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [19:14:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T376905)', diff saved to https://phabricator.wikimedia.org/P70897 and previous config saved to /var/cache/conftool/dbconfig/20241104-191454-ladsgroup.json [19:14:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [19:15:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [19:15:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70898 and previous config saved to /var/cache/conftool/dbconfig/20241104-191519-ladsgroup.json [19:17:16] (03CR) 10Urbanecm: [C:03+2] Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876) (owner: 10Urbanecm) [19:17:52] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [19:18:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [19:18:58] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [19:19:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [19:20:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [19:21:03] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [19:21:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [19:22:00] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [19:22:11] !log 
swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [19:23:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [19:23:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70899 and previous config saved to /var/cache/conftool/dbconfig/20241104-192319-ladsgroup.json [19:38:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P70900 and previous config saved to /var/cache/conftool/dbconfig/20241104-193826-ladsgroup.json [19:50:20] (03Merged) 10jenkins-bot: Message: Downgrade exception on bool/null param to warning [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087231 (https://phabricator.wikimedia.org/T378876) (owner: 10Urbanecm) [19:51:34] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]] [19:51:38] T378876: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier (October 2024) - https://phabricator.wikimedia.org/T378876 [19:53:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P70901 and previous config saved to /var/cache/conftool/dbconfig/20241104-195333-ladsgroup.json [19:54:55] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:55:57] !log urbanecm@deploy2002 urbanecm: Continuing with sync [20:00:46] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087231|Message: Downgrade exception on bool/null param to warning (T378876)]] (duration: 09m 12s) [20:01:10] T378876: 
InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier (October 2024) - https://phabricator.wikimedia.org/T378876 [20:07:27] 06SRE, 10SRE-Access-Requests: Requesting access to snapshot* with group snapshot-admins for ebernhardson - https://phabricator.wikimedia.org/T379025 (10Gehel) 03NEW [20:08:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70902 and previous config saved to /var/cache/conftool/dbconfig/20241104-200840-ladsgroup.json [20:08:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [20:08:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [20:09:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70903 and previous config saved to /var/cache/conftool/dbconfig/20241104-200905-ladsgroup.json [20:09:21] 06SRE, 10SRE-Access-Requests: Requesting access to snapshot* with group snapshot-admins for ebernhardson - https://phabricator.wikimedia.org/T379025#10290622 (10Gehel) I approve this request for @EBernhardson [20:17:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70904 and previous config saved to /var/cache/conftool/dbconfig/20241104-201703-ladsgroup.json [20:17:26] (03PS1) 10Eevans: aqs1013: decommission [puppet] - 10https://gerrit.wikimedia.org/r/1087238 (https://phabricator.wikimedia.org/T379026) [20:19:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [20:20:05] (03CR) 10Eevans: [C:03+2] aqs1013: decommission [puppet] - 10https://gerrit.wikimedia.org/r/1087238 (https://phabricator.wikimedia.org/T379026) (owner: 
10Eevans) [20:20:17] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [20:20:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [20:21:13] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [20:21:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [20:21:56] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts aqs1013.eqiad.wmnet [20:22:01] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [20:22:12] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [20:22:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [20:23:01] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [20:23:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [20:26:29] (03PS1) 10Gehel: admin: add ebernhardson as a member of the snapshot-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1087239 (https://phabricator.wikimedia.org/T379025) [20:26:33] !log zero-replica "migration" releases created for all shellbox instances - T375243 [20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:41] T375243: Turn up PHP 8.1 Shellbox deployments - https://phabricator.wikimedia.org/T375243 [20:27:50] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [20:30:44] (03PS1) 10Eevans: aqs1013 replaced by aqs1022 (hardware refresh) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) [20:32:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P70905 and previous config saved to /var/cache/conftool/dbconfig/20241104-203210-ladsgroup.json [20:32:37] 
!log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [20:34:11] (03CR) 10Volans: [C:03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087239 (https://phabricator.wikimedia.org/T379025) (owner: 10Gehel) [20:34:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:35:04] (03CR) 10Eevans: "On a related note, we should probably do something about: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/re" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: 10Eevans) [20:35:36] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [20:35:36] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:37] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1013.eqiad.wmnet [20:38:08] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission aqs1013 - https://phabricator.wikimedia.org/T379026#10290723 (10Eevans) a:05Eevans→03None [20:39:16] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10290732 (10Eevans) 05Open→03Resolved We can close this now; aqs1013 is no more (T379026) 🪦 [20:42:15] 06SRE, 06Data-Persistence, 06serviceops: 
DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10290736 (10Eevans) 05Open→03Resolved a:03Eevans aqs1013 has been decommissioned (T379026), and aqs1014 fixed; Closing [20:42:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10290759 (10Eevans) 05Open→03Resolved [20:47:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P70906 and previous config saved to /var/cache/conftool/dbconfig/20241104-204717-ladsgroup.json [20:47:58] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [20:48:57] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:37] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290804 (10Jhancock.wm) [20:57:45] 06SRE-OnFire, 10Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10290817 (10Eevans) 05Open→03Resolved a:03Eevans [20:57:46] 06SRE-OnFire, 10Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451#10290814 (10Eevans) [20:59:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My 
software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2100). [21:00:05] kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] o/ [21:00:35] (03CR) 10Máté Szabó: "This is now unblocked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [21:00:39] (03PS2) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) [21:02:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T376905)', diff saved to https://phabricator.wikimedia.org/P70907 and previous config saved to /var/cache/conftool/dbconfig/20241104-210224-ladsgroup.json [21:02:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:02:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:02:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kubestage2003 to codfw - jhancock@cumin2002" [21:02:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding kubestage2003 to codfw - jhancock@cumin2002" [21:02:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:03:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubestage2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:03:55] 
!log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubestage2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:05:49] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [21:07:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:07:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:08:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70908 and previous config saved to /var/cache/conftool/dbconfig/20241104-210800-ladsgroup.json [21:10:00] Anyone around for deployments? [21:12:18] I will try to annoy a few [21:14:06] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [21:14:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubestage2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:14:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubestage2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:15:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70909 and previous config saved to /var/cache/conftool/dbconfig/20241104-211505-ladsgroup.json [21:15:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubestage2003'] [21:15:12] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubestage2004'] [21:15:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubestage2003'] [21:15:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubestage2004'] [21:17:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2003.codfw.wmnet with OS bookworm [21:17:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2004.codfw.wmnet with OS bookworm [21:17:24] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290906 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubestage2003.codfw.wmnet with OS bookworm [21:17:27] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290907 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubestage2004.codfw.wmnet with OS bookworm [21:25:19] RoanKattouw / Urbanecm / cjming / TheresNoTime / kindrobot / mutante / denisse: anyone able to run the deployment window? [21:26:16] I can't this evening, sorry! Hope you get a response soon! [21:28:45] I can run it [21:28:51] tgr|away: Thanks!
[21:30:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P70910 and previous config saved to /var/cache/conftool/dbconfig/20241104-213012-ladsgroup.json [21:31:24] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: DLynch) [21:31:59] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [21:32:06] (Merged) jenkins-bot: Set Flow to read-only on remaining phase 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1087207 (https://phabricator.wikimedia.org/T377990) (owner: DLynch) [21:32:25] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]] [21:32:28] T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990 [21:34:59] !log tgr@deploy2002 tgr, kemayo: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:35:33] Kemayo: do you want to check or is it OK to continue? [21:35:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage [21:35:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage [21:35:51] tgr|away: I did a quick check and it looks good. [21:35:58] thx [21:36:01] !log tgr@deploy2002 tgr, kemayo: Continuing with sync [21:38:46] tgr|away: Thanks for jumping in for it!
[21:38:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2004.codfw.wmnet with reason: host reimage [21:41:06] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087207|Set Flow to read-only on remaining phase 0 wikis (T377990)]] (duration: 08m 40s) [21:41:11] T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990 [21:41:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2003.codfw.wmnet with reason: host reimage [21:41:55] !log UTC late deploys done [21:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P70911 and previous config saved to /var/cache/conftool/dbconfig/20241104-214519-ladsgroup.json [21:57:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:58:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:58:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2004.codfw.wmnet with OS bookworm [21:58:30] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290996 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubestage2004.codfw.wmnet with OS bookworm completed: - kubestage2004 (... [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241104T2200). [22:00:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:00:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T376905)', diff saved to https://phabricator.wikimedia.org/P70912 and previous config saved to /var/cache/conftool/dbconfig/20241104-220026-ladsgroup.json [22:00:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:00:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2003.codfw.wmnet with OS bookworm [22:00:37] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10290998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubestage2003.codfw.wmnet with OS bookworm completed: - kubestage2003 (...
[22:04:13] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10291008 (Jhancock.wm) [22:06:01] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10291009 (Jhancock.wm) Open→Resolved @Clement_Goubert this pair is ready [22:12:49] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:15:54] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291022 (Jhancock.wm) [22:16:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-gp2004 to codfw - jhancock@cumin2002" [22:16:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-gp2004 to codfw - jhancock@cumin2002" [22:16:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:17:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:17:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:17:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:24] (CR) JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway) [22:27:05] (CR) JHathaway: [C:+2] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] -
https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway) [22:29:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:29:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:32:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2004'] [22:32:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2005'] [22:33:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2006'] [22:33:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2004'] [22:33:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2005'] [22:33:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2006'] [22:35:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2004.codfw.wmnet with OS bookworm [22:35:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2005.codfw.wmnet with OS bookworm [22:35:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2006.codfw.wmnet with OS bookworm [22:35:48] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] -
https://phabricator.wikimedia.org/T376968#10291072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2004.codfw.wmnet with OS bookworm [22:35:50] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291073 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2005.codfw.wmnet with OS bookworm [22:35:53] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm [22:50:02] (CR) Scott French: [C:+1] "Looks good! Confirmed that these hosts are already in the network policy, and indeed that makes sense if they've already joined the cluste" [deployment-charts] - https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: Eevans) [22:53:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2004.codfw.wmnet with reason: host reimage [22:53:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2005.codfw.wmnet with reason: host reimage [22:56:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2004.codfw.wmnet with reason: host reimage [22:59:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2005.codfw.wmnet with reason: host reimage [23:07:02] (CR) Scott French: [C:+1] "Thanks, claime! Looks good - two optional comments."
[puppet] - https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert) [23:15:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:17:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:17:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2004.codfw.wmnet with OS bookworm [23:17:26] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291133 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2004.codfw.wmnet with OS bookworm completed: - mc-gp2004 (**PASS**)... [23:18:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:18:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:18:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2005.codfw.wmnet with OS bookworm [23:18:20] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2005.codfw.wmnet with OS bookworm completed: - mc-gp2005 (**WARN**)...
[23:21:06] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291139 (Jhancock.wm) [23:56:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc-gp2006.codfw.wmnet with OS bookworm [23:56:21] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291268 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm executed with errors: - mc-gp2006 (*... [23:56:42] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-gp2006 [23:56:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-gp2006