[00:00:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:00:23] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:08:08] <wikibugs>	 (PS1) Krinkle: Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - https://gerrit.wikimedia.org/r/1008474
[00:08:13] <wikibugs>	 (CR) Krinkle: [C: +2] Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - https://gerrit.wikimedia.org/r/1008474 (owner: Krinkle)
[00:08:55] <wikibugs>	 (Merged) jenkins-bot: Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - https://gerrit.wikimedia.org/r/1008474 (owner: Krinkle)
[00:12:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2036.codfw.wmnet with reason: host reimage
[00:13:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58416 and previous config saved to /var/cache/conftool/dbconfig/20240305-001345-arnaudb.json
[00:13:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[00:13:50] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[00:14:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[00:14:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58417 and previous config saved to /var/cache/conftool/dbconfig/20240305-001408-arnaudb.json
[00:14:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:14:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:15:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2036.codfw.wmnet with reason: host reimage
[00:17:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2039.codfw.wmnet with reason: host reimage
[00:17:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2038.codfw.wmnet with reason: host reimage
[00:18:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2037.codfw.wmnet with reason: host reimage
[00:18:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:18:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:19:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58418 and previous config saved to /var/cache/conftool/dbconfig/20240305-001918-arnaudb.json
[00:19:23] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[00:20:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: host reimage
[00:21:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:21:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:22:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2038.codfw.wmnet with reason: host reimage
[00:25:08] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2037.codfw.wmnet with reason: host reimage
[00:29:36] <wikibugs>	 (PS1) Dzahn: ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[00:29:57] <wikibugs>	 (PS2) Dzahn: ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[00:30:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2040.codfw.wmnet with reason: host reimage
[00:30:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:31:15] <wikibugs>	 (CR) CI reject: [V: -1] ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[00:33:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2040.codfw.wmnet with reason: host reimage
[00:34:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:34:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2036.codfw.wmnet with OS bookworm
[00:34:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P58419 and previous config saved to /var/cache/conftool/dbconfig/20240305-003425-arnaudb.json
[00:34:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:38:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:38:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:38:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008081
[00:38:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2039.codfw.wmnet with OS bookworm
[00:38:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008081 (owner: TrainBranchBot)
[00:40:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:40:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2038.codfw.wmnet with OS bookworm
[00:41:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:42:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:42:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2037.codfw.wmnet with OS bookworm
[00:43:07] <wikibugs>	 (PS3) Dzahn: ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[00:44:28] <wikibugs>	 (CR) CI reject: [V: -1] ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[00:46:12] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:48:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:48:39] <wikibugs>	 (PS4) Dzahn: ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[00:49:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P58420 and previous config saved to /var/cache/conftool/dbconfig/20240305-004931-arnaudb.json
[00:50:01] <wikibugs>	 (CR) CI reject: [V: -1] ci_test: include scap::ferm class directly [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[00:52:12] <icinga-wm_>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:55:49] <mutante>	 !log contint1003 -rebooting
[00:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008081 (owner: TrainBranchBot)
[01:04:02] <wikibugs>	 (PS5) Dzahn: ci_test: switch firewall::provider to ferm [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[01:04:34] <wikibugs>	 (PS6) Dzahn: ci_test: switch firewall::provider to ferm [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237)
[01:04:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58421 and previous config saved to /var/cache/conftool/dbconfig/20240305-010438-arnaudb.json
[01:04:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[01:04:42] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:04:54] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[01:05:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58422 and previous config saved to /var/cache/conftool/dbconfig/20240305-010459-arnaudb.json
[01:10:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58423 and previous config saved to /var/cache/conftool/dbconfig/20240305-011008-arnaudb.json
[01:10:12] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:10:14] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:10:25] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2035.codfw.wmnet with OS bookworm
[01:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:12:49] <wikibugs>	 (CR) Dzahn: [C: +2] ci_test: switch firewall::provider to ferm [puppet] - https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[01:13:34] <wikibugs>	 (PS2) Jdlrobson: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679)
[01:17:48] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:21:29] <wikibugs>	 (PS1) Dzahn: ci_test: include profile::firewall in test role [puppet] - https://gerrit.wikimedia.org/r/1008579
[01:21:57] <wikibugs>	 (CR) CI reject: [V: -1] ci_test: include profile::firewall in test role [puppet] - https://gerrit.wikimedia.org/r/1008579 (owner: Dzahn)
[01:22:45] <wikibugs>	 (PS2) Dzahn: ci_test: include profile::firewall in test role [puppet] - https://gerrit.wikimedia.org/r/1008579
[01:23:40] <wikibugs>	 (PS3) Dzahn: ci_test: include profile::firewall in test role [puppet] - https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237)
[01:24:26] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:25:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P58424 and previous config saved to /var/cache/conftool/dbconfig/20240305-012514-arnaudb.json
[01:26:29] <wikibugs>	 (PS4) Dzahn: ci_test: include profile::firewall in test role [puppet] - https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237)
[01:27:22] <wikibugs>	 (CR) Dzahn: [C: +2] "wrong provider name -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008576" [puppet] - https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[01:27:38] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/1008579/1582/contint1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[01:31:20] <icinga-wm_>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:31:52] <icinga-wm_>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:32:02] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "@jnuche deploy2002 can now ssh to contint1003. You can try to scap zuul again." [puppet] - https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[01:40:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P58425 and previous config saved to /var/cache/conftool/dbconfig/20240305-014020-arnaudb.json
[01:48:30] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:55:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58426 and previous config saved to /var/cache/conftool/dbconfig/20240305-015527-arnaudb.json
[01:55:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[01:55:31] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:55:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[01:55:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58427 and previous config saved to /var/cache/conftool/dbconfig/20240305-015550-arnaudb.json
[02:00:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58428 and previous config saved to /var/cache/conftool/dbconfig/20240305-020049-arnaudb.json
[02:00:54] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:15:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P58429 and previous config saved to /var/cache/conftool/dbconfig/20240305-021556-arnaudb.json
[02:31:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P58430 and previous config saved to /var/cache/conftool/dbconfig/20240305-023102-arnaudb.json
[02:34:03] <icinga-wm_>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 72027072 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:35:05] <icinga-wm_>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 127248 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:38:04] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58431 and previous config saved to /var/cache/conftool/dbconfig/20240305-024608-arnaudb.json
[02:46:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[02:46:14] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:46:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[02:46:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[02:46:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[02:46:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58432 and previous config saved to /var/cache/conftool/dbconfig/20240305-024657-arnaudb.json
[02:52:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58433 and previous config saved to /var/cache/conftool/dbconfig/20240305-025212-arnaudb.json
[02:52:16] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:58:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0300)
[03:01:33] <wikibugs>	 (PS1) RLazarus: k8s: Add getter for the Batch API [software/spicerack] - https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130)
[03:02:07] <wikibugs>	 (PS1) RLazarus: sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130)
[03:07:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P58434 and previous config saved to /var/cache/conftool/dbconfig/20240305-030719-arnaudb.json
[03:07:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439)
[03:07:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439) (owner: TrainBranchBot)
[03:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:22:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P58435 and previous config saved to /var/cache/conftool/dbconfig/20240305-032225-arnaudb.json
[03:28:11] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439) (owner: TrainBranchBot)
[03:37:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58436 and previous config saved to /var/cache/conftool/dbconfig/20240305-033732-arnaudb.json
[03:37:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[03:37:37] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[03:37:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[03:37:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58437 and previous config saved to /var/cache/conftool/dbconfig/20240305-033755-arnaudb.json
[03:42:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:42:52] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:46:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58438 and previous config saved to /var/cache/conftool/dbconfig/20240305-034614-arnaudb.json
[03:46:18] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0400)
[04:01:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P58439 and previous config saved to /var/cache/conftool/dbconfig/20240305-040120-arnaudb.json
[04:06:36] <wikibugs>	 (PS1) Jdlrobson: Partial Revert "Set background/color to inherit for common templates" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164)
[04:16:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P58440 and previous config saved to /var/cache/conftool/dbconfig/20240305-041626-arnaudb.json
[04:31:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58441 and previous config saved to /var/cache/conftool/dbconfig/20240305-043133-arnaudb.json
[04:31:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[04:31:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[04:31:50] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[04:31:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58442 and previous config saved to /var/cache/conftool/dbconfig/20240305-043155-arnaudb.json
[04:33:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[04:33:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[04:37:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58443 and previous config saved to /var/cache/conftool/dbconfig/20240305-043718-arnaudb.json
[04:37:22] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[04:46:43] <icinga-wm_>	 PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[04:47:43] <icinga-wm_>	 RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 1 process with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[04:47:52] * kart_ deploying cxserver..
[04:48:05] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2024-03-04-113412-production [deployment-charts] - https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773) (owner: KartikMistry)
[04:49:11] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2024-03-04-113412-production [deployment-charts] - https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773) (owner: KartikMistry)
[04:51:59] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:52:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[04:52:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P58444 and previous config saved to /var/cache/conftool/dbconfig/20240305-045225-arnaudb.json
[05:01:11] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:01:43] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:02:47] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:03:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:07:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P58445 and previous config saved to /var/cache/conftool/dbconfig/20240305-050731-arnaudb.json
[05:15:46] <kart_>	 !log Updated cxserver to 2024-03-04-113412-production (T350773)
[05:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:50] <stashbot>	 T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773
[05:17:05] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:11] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:43] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:22:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58446 and previous config saved to /var/cache/conftool/dbconfig/20240305-052237-arnaudb.json
[05:22:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[05:22:42] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[05:22:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[05:23:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58447 and previous config saved to /var/cache/conftool/dbconfig/20240305-052259-arnaudb.json
[05:24:26] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58448 and previous config saved to /var/cache/conftool/dbconfig/20240305-052741-arnaudb.json
[05:27:46] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[05:35:57] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:36:17] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:36:21] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:42:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P58449 and previous config saved to /var/cache/conftool/dbconfig/20240305-054247-arnaudb.json
[05:45:39] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:45:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:48:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:51:18] <wikibugs>	 (03PS1) 10Tim Starling: SwiftTooManyMediaUploads: use subtraction instead of increase() [alerts] - 10https://gerrit.wikimedia.org/r/1008590
[05:52:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:52:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:57:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P58450 and previous config saved to /var/cache/conftool/dbconfig/20240305-055754-arnaudb.json
[06:04:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:04:58] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:13:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58451 and previous config saved to /var/cache/conftool/dbconfig/20240305-061300-arnaudb.json
[06:13:14] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[06:17:49] <icinga-wm_>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1412 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:18:04] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:19:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:51:03] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not reimage es1040 [puppet] - 10https://gerrit.wikimedia.org/r/1008741
[06:55:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage es1040 [puppet] - 10https://gerrit.wikimedia.org/r/1008741 (owner: 10Marostegui)
[06:57:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[06:57:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:59:26] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0700).
[07:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:12:55] <icinga-wm_>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:15:03] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1024.eqiad.wmnet with OS bullseye
[07:17:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:17:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:27:36] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage
[07:27:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[07:31:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage
[07:32:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[07:32:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:32:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:33:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[07:33:52] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2016.codfw.wmnet with OS bullseye
[07:36:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[07:48:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:48:20] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:49:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1024.eqiad.wmnet with OS bullseye
[07:49:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2016.codfw.wmnet with reason: host reimage
[07:52:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2016.codfw.wmnet with reason: host reimage
[07:52:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::multiinstance
[07:54:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 (https://phabricator.wikimedia.org/T349619)
[07:54:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[07:57:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:02:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] k8s: Add getter for the Batch API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus)
[08:09:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::multiinstance
[08:12:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2016.codfw.wmnet with OS bullseye
[08:14:35] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2018.codfw.wmnet with OS bullseye
[08:30:08] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[08:30:22] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[08:30:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58452 and previous config saved to /var/cache/conftool/dbconfig/20240305-083028-arnaudb.json
[08:30:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2018.codfw.wmnet with reason: host reimage
[08:30:32] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[08:33:15] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2018.codfw.wmnet with reason: host reimage
[08:35:16] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2017.codfw.wmnet with OS bullseye
[08:36:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58453 and previous config saved to /var/cache/conftool/dbconfig/20240305-083621-arnaudb.json
[08:36:26] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[08:47:25] <godog>	 !log add new disk to titan2001 /srv - T359068
[08:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:28] <stashbot>	 T359068: Not enough space on titan hosts for thanos-compact - https://phabricator.wikimedia.org/T359068
[08:51:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58454 and previous config saved to /var/cache/conftool/dbconfig/20240305-085128-arnaudb.json
[08:51:30] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2017.codfw.wmnet with reason: host reimage
[08:52:01] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2018.codfw.wmnet with OS bullseye
[08:54:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2017.codfw.wmnet with reason: host reimage
[08:56:07] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2019.codfw.wmnet with OS bullseye
[09:00:04] <jouncebot>	 jnuche and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0900).
[09:00:29] <jnuche>	 morning, train and backports are currently blocked by T359114
[09:00:30] <stashbot>	 T359114: Slow and failed deployments - https://phabricator.wikimedia.org/T359114
[09:06:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58455 and previous config saved to /var/cache/conftool/dbconfig/20240305-090634-arnaudb.json
[09:08:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2117.codfw.wmnet
[09:11:59] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2019.codfw.wmnet with reason: host reimage
[09:12:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2117 T359141', diff saved to https://phabricator.wikimedia.org/P58456 and previous config saved to /var/cache/conftool/dbconfig/20240305-091244-marostegui.json
[09:12:49] <stashbot>	 T359141: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141
[09:12:57] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2017.codfw.wmnet with OS bullseye
[09:13:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[09:14:46] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2019.codfw.wmnet with reason: host reimage
[09:15:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2117.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:16:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2117.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:16:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:16:20] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2117.codfw.wmnet
[09:18:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] puppetboard: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1008513 (owner: 10Muehlenhoff)
[09:21:02] <wikibugs>	 (03PS1) 10Slyngshede: P:openldap::management Unbreak cross validation script. [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142)
[09:21:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58457 and previous config saved to /var/cache/conftool/dbconfig/20240305-092140-arnaudb.json
[09:21:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[09:21:45] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:21:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[09:21:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142) (owner: 10Slyngshede)
[09:22:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58458 and previous config saved to /var/cache/conftool/dbconfig/20240305-092202-arnaudb.json
[09:22:06] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:openldap::management Unbreak cross validation script. [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142) (owner: 10Slyngshede)
[09:23:18] <wikibugs>	 (03CR) 10Muehlenhoff: "Hmmh, good point. There's no good reason for conntrack to be absented along with iptables if the firewall provider doesn't use "ferm". Thi" [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah)
[09:23:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney)
[09:23:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney)
[09:23:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Install conntrack via profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1008814
[09:24:11] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141)
[09:24:21] <wikibugs>	 (03CR) 10Muehlenhoff: "Alternative patch proposal at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008814" [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah)
[09:24:27] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:29] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141) (owner: 10Marostegui)
[09:24:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141) (owner: 10Marostegui)
[09:24:45] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[09:25:01] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff)
[09:25:09] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff)
[09:25:25] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "LGTM, I don't see any dependencies on conntrack that would cause issues on hosts without a firewall atm." [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff)
[09:26:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Install conntrack via profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff)
[09:26:53] <wikibugs>	 (03Abandoned) 10Majavah: conntrackd: fix CLI installation [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah)
[09:27:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58459 and previous config saved to /var/cache/conftool/dbconfig/20240305-092721-arnaudb.json
[09:27:26] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:27:48] <wikibugs>	 (03CR) 10Volans: "Did a first pass on the code only, once we finalize the code I'll pass to the tests" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[09:28:04] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:32:52] <wikibugs>	 (03PS1) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818
[09:33:30] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2019.codfw.wmnet with OS bullseye
[09:33:38] <wikibugs>	 06SRE, 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141#9599660 (10MoritzMuehlenhoff) >>! In T359141#9599610, @Marostegui wrote: > @Volans @MoritzMuehlenhoff is anything else required in this situation?  I think that's fine...
[09:33:44] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2019.codfw.wmnet with OS bullseye comp...
[09:34:31] <wikibugs>	 06SRE, 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141#9599661 (10Marostegui) Thanks! @Jhancock.wm see above, you can proceed whenever you want.
[09:38:04] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:41:43] <wikibugs>	 (03PS2) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818
[09:42:03] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2020.codfw.wmnet with OS bullseye
[09:42:21] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2020.codfw.wmnet with OS bullseye
[09:42:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58460 and previous config saved to /var/cache/conftool/dbconfig/20240305-094228-arnaudb.json
[09:43:32] <akosiaris>	 jnuche: I'll need another 30 minutes or so and I'll throw some 200 CPUs at the 2 wikikube clusters, unblocking the train
[09:44:12] <jnuche>	 akosiaris: sounds good, thank you so much
[09:52:56] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:toolforge: image_builder: refresh for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006516 (https://phabricator.wikimedia.org/T358483) (owner: 10Majavah)
[09:53:02] <wikibugs>	 (03PS3) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498
[09:54:27] <wikibugs>	 (03PS4) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498
[09:56:42] <wikibugs>	 (03PS5) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498
[09:57:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58461 and previous config saved to /var/cache/conftool/dbconfig/20240305-095734-arnaudb.json
[09:58:04] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2020.codfw.wmnet with reason: host reimage
[10:02:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2020.codfw.wmnet with reason: host reimage
[10:04:53] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host
[10:04:54] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 01s)
[10:06:53] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host
[10:07:17] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 24s)
[10:08:11] <moritzm>	 !log installing glib2.0 security updates
[10:11:18] <akosiaris>	 !log homer commit T358752
[10:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:21] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[10:12:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58462 and previous config saved to /var/cache/conftool/dbconfig/20240305-101241-arnaudb.json
[10:12:45] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:16:40] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:17:28] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:17:37] <wikibugs>	 (03PS1) 10Jaime Nuche: ci_test: do not remove python 2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237)
[10:17:40] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 342, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:18:31] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] k8s: Add getter for the Batch API [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus)
[10:21:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2020.codfw.wmnet with OS bullseye
[10:21:16] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2020.codfw.wmnet with OS bullseye comp...
[10:21:47] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422)
[10:21:54] <akosiaris>	 !log uncordon parse20{16..20}.codfw.wmnet T358752
[10:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:57] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[10:22:55] <akosiaris>	 !log uncordon  parse10{20..24}.eqiad.wmnet  parse10{10..12}.eqiad.wmnet T358752
[10:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:06] <akosiaris>	 jnuche: I think you are clear.
[10:23:40] <jnuche>	 akosiaris: thanks again, I'll deploy in the next few minutes
[10:24:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[10:24:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche)
[10:25:04] <moritzm>	 jnuche: I can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008823 if that unblocks you?
[10:25:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58463 and previous config saved to /var/cache/conftool/dbconfig/20240305-102516-root.json
[10:25:32] <wikibugs>	 (03PS4) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614
[10:25:42] <jnuche>	 moritzm: definitely, thank you!
[10:25:56] <wikibugs>	 (03CR) 10Ayounsi: "Thanks, reply inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[10:27:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ci_test: do not remove python 2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche)
[10:28:35] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Remember to add it to zarcillo database" [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[10:29:43] <jnuche>	 akosiaris, moritzm: since you are around, can either of you kill process 3272 on deploy2002? I don't have permissions and that process is holding a scap lock at the moment
[10:34:03] <claime>	 jnuche: doing
[10:34:35] <claime>	 jnuche: done
[10:34:46] <jnuche>	 claime: thx 👍
[10:36:16] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439)
[10:36:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[10:37:06] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[10:39:38] <jnuche>	 mmmh, deploy failed, it seems I still need to run the presync first
[10:40:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58464 and previous config saved to /var/cache/conftool/dbconfig/20240305-104021-root.json
[10:41:10] <moritzm>	 jnuche: merged the patch and forced a puppet run on contint1003
[10:41:21] <jnuche>	 danke
[10:50:21] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10; selector: service=kubesvc,name=parse2.*
[10:50:32] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubesvc,name=parse2.*
[10:50:45] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10; selector: service=kubesvc,name=parse1.*
[10:51:02] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubesvc,name=parse1.*
[10:53:55] <logmsgbot>	 !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.18 (duration: 03m 25s)
[10:55:08] <Amir1>	 jouncebot: nowandnext
[10:55:08] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0900)
[10:55:08] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1100)
[10:55:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58465 and previous config saved to /var/cache/conftool/dbconfig/20240305-105526-root.json
[10:56:25] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.21  refs T354439
[10:56:29] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[10:58:58] <jnuche>	 the train deploy is going to overlap with the MW infrastructure window starting in 2 minutes. apologies if that causes any disruption
[10:59:26] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:27] <jnuche>	 I'm currently running the presync, once that's done I can hold the actual deploy to group0 if necessary
[10:59:31] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[10:59:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[10:59:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58466 and previous config saved to /var/cache/conftool/dbconfig/20240305-105950-ladsgroup.json
[10:59:55] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1100)
[11:10:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58467 and previous config saved to /var/cache/conftool/dbconfig/20240305-111031-root.json
[11:13:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9599986 (10ayounsi) I'd recommend to start by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideall...
[11:13:55] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392)
[11:15:24] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:15:40] <wikibugs>	 (03PS1) 10Jaime Nuche: Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829
[11:15:48] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829 (owner: 10Jaime Nuche)
[11:15:56] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:16:09] <kamila_>	 morning wikibugs :D
[11:16:20] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829 (owner: 10Jaime Nuche)
[11:16:52] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[11:17:28] <wikibugs>	 (03PS1) 10Urbanecm: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379)
[11:19:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:19:41] <wikibugs>	 (03Merged) 10jenkins-bot: mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:20:41] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439)
[11:20:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[11:20:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:21:05] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[11:21:13] <wikibugs>	 (03PS11) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:22:01] <wikibugs>	 (03PS12) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:22:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:22:53] <wikibugs>	 06SRE, 10MW-on-K8s, 06Release-Engineering-Team, 06Traffic, 06serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9600053 (10Clement_Goubert)
[11:23:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:24:19] <wikibugs>	 (03PS11) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418
[11:25:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'm sure I lack context though it seems the kafka PKI defaults to 1y expiration and we'll reduce it here to 1mo ?" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron)
[11:30:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58469 and previous config saved to /var/cache/conftool/dbconfig/20240305-113027-ladsgroup.json
[11:30:34] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:30:40] <logmsgbot>	 !log jnuche@deploy2002 sync-world aborted: testwikis wikis to 1.42.0-wmf.21  refs T354439 (duration: 34m 15s)
[11:30:49] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[11:32:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: webperf: move statsv metrics to prometheus 'ext' only [puppet] - 10https://gerrit.wikimedia.org/r/1008833 (https://phabricator.wikimedia.org/T359153)
[11:32:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:32:39] <wikibugs>	 (03PS13) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:32:44] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:33:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:33:50] <wikibugs>	 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9600154 (10JMeybohm) To clarify why this happened/happens: kubemaster2001 refreshed the certs used by the apiserver in one puppet run at ~00:51:  ` Mar  1 00:51:28 Exec[renew cer...
[11:34:06] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:34:14] <wikibugs>	 (03Merged) 10jenkins-bot: APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:34:52] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9600169 (10jcrespo) Replacing the cable can be done any time between 6:00 and 23:55 UTC. Let me know if it will be for a period of extended time so I can downtime it.  If it needs hard down let me know in advance so I can...
[11:36:41] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600195 (10cmooney) >>! In T358658#9598742, @odimitrijevic wrote: > Yes, approved  Thanks Olja.  Just to update I've been working with KC on this and we...
[11:36:53] <wikibugs>	 (03PS14) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:37:01] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[11:38:18] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:38:39] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:39:23] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2024-03-05-082211-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008836 (https://phabricator.wikimedia.org/T353136)
[11:42:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[11:42:36] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:42:40] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:42:56] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:45:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58471 and previous config saved to /var/cache/conftool/dbconfig/20240305-114533-ladsgroup.json
[11:46:02] <wikibugs>	 (03PS2) 10Klausman: APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654)
[11:46:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752)
[11:47:26] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:51:03] <wikibugs>	 06SRE, 06Machine-Learning-Team, 13Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516#9600294 (10klausman) 05Open→03Resolved
[11:52:18] <wikibugs>	 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9600321 (10Clement_Goubert) >>! In T358117#9598846, @dancy wrote: > @Clement_Goubert We have some questions: > 1) Does `mwdebug.discovery.wmnet` resolve to a random...
[11:52:42] <claime>	 jnuche: It's timing out again?
[11:52:47] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:52:52] <jnuche>	 akosiaris, claime: I ran into another timeout while deploying to mw-on-k8s, testservers now: https://phabricator.wikimedia.org/T359155
[11:52:57] <jnuche>	 yep
[11:53:19] <claime>	 jnuche: all right I'll revert a patch quickly, see if it improves things
[11:53:27] <jnuche>	 thx
[11:53:37] <wikibugs>	 (03Merged) 10jenkins-bot: APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman)
[11:54:08] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:54:48] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:56:34] <wikibugs>	 (03PS1) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154)
[11:57:27] <wikibugs>	 (03PS2) 10EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041)
[11:57:42] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[11:58:15] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[12:00:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58472 and previous config saved to /var/cache/conftool/dbconfig/20240305-120040-ladsgroup.json
[12:01:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[12:02:02] <wikibugs>	 (03PS15) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:02:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:05:22] <wikibugs>	 (03PS1) 10Clément Goubert: mw-api-int: Scale down to 206 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008843
[12:07:04] <claime>	 jnuche: I'll try a scap no-build k8s only deployment because we're not finding a root cause
[12:07:06] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: (no justification provided)
[12:07:18] <jnuche>	 ack
[12:08:07] <claime>	 jnuche: what's the image version that was supposed to be deployed by your earlier deployment?
[12:08:25] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney)
[12:08:30] <wikibugs>	 (03PS16) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:08:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:09:35] <jnuche>	 claime: judging by https://phabricator.wikimedia.org/P58470 then I think `docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-05-110738-webserver`:
[12:09:40] <jnuche>	 https://www.irccloud.com/pastebin/E0acriJs/
[12:09:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:11:05] <jnuche>	 or `docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2024-03-05-105734-publish`
[12:11:11] <wikibugs>	 (03PS1) 10Krinkle: Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756)
[12:12:23] <wikibugs>	 (03PS17) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:13:14] <claime>	 ty
[12:14:21] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "There are some issues in its current format, see details inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[12:14:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[12:15:04] <claime>	 jnuche: So what is happening right now is that it didn't redeploy anything on mw-debug and mw-misc
[12:15:17] <claime>	 they're still on 2024-02-29-215143
[12:15:35] <claime>	 But it is deploying 2024-03-05-110738 to all the other deployments
[12:15:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58473 and previous config saved to /var/cache/conftool/dbconfig/20240305-121546-ladsgroup.json
[12:16:04] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:16:27] <_joe_>	 claime: same problem for the actual mediawiki images
[12:16:46] <_joe_>	 I'm looking at /etc/helmfile-defaults/mediawiki/release/mw-debug-pinkunicorn.yaml
[12:16:49] <claime>	 /etc/helmfile-defaults/mediawiki/release/mw-debug-pinkunicorn.yaml and /etc/helmfile-defaults/mediawiki/release/mw-api-int-canary.yaml have different versions
[12:16:51] <claime>	 exactly
[12:17:15] <jnuche>	 claime: that's odd, I canceled before it could get past mw-debug and misc
[12:17:31] <_joe_>	 jnuche: the problem is scap
[12:17:35] <logmsgbot>	 !log cgoubert@deploy2002 scap failed: KeyError 'canaries' (duration: 10m 29s)
[12:17:40] <claime>	 aaaaan it failed
[12:17:51] <_joe_>	 scap seems not to be updating releases with debug: true
[12:18:00] <_joe_>	 since the 29th of february
[12:18:13] <_joe_>	 I'd go look at the code released around that date
[12:18:39] <jnuche>	 ah, maybe it's the rollback? scap did perform the rollback for debug
[12:18:49] <jnuche>	 ok, gonna look into late scap changes
[12:18:51] <_joe_>	 it's possible
[12:19:00] <_joe_>	 let me look at the git history
[12:19:18] <wikibugs>	 (03CR) 10JMeybohm: mw-mcrouter: update namespace resource limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[12:19:36] <claime>	 What is *not* a scap bug is the fact we can't deploy canary releases
[12:19:40] <_joe_>	 jnuche: yes, you're right
[12:19:42] <claime>	 because the helmfile times out
[12:20:11] <_joe_>	 jnuche: but why rollback to a version that is so old
[12:20:29] <_joe_>	 ah because it was the previous functioning one
[12:20:44] <_joe_>	 claime: and why is helmfile timing out?
[12:20:51] <claime>	 that's what we're trying to find out
[12:21:02] <claime>	 I'm in videochat with akosiaris rn, we're looking
[12:21:06] <akosiaris>	 kubernetes events are empty of anything useful btw
[12:21:21] <_joe_>	 sigh
[12:21:32] <_joe_>	 and this has been happening since yesterday?
[12:22:28] <jnuche>	 yesterday it got past the testservers AFAIK, the deployment timed out for parsoid
[12:22:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[12:23:59] <claime>	 mw-api-ext.codfw.canary-f6c699fb7-hhhll    7/9     CrashLoopBackOff   2 (29s ago)     49s  
[12:24:15] <_joe_>	 ok that doesn't look good
[12:24:57] <akosiaris>	 [05-Mar-2024 12:24:27] ERROR: [/etc/php/7.4/fpm/php-fpm.conf:15] Array are not allowed in the global section
[12:24:57] <akosiaris>	 [05-Mar-2024 12:24:27] ERROR: failed to load configuration file '/etc/php/7.4/fpm/php-fpm.conf'
[12:24:57] <akosiaris>	 [05-Mar-2024 12:24:27] ERROR: FPM initialization failed
[12:25:03] <akosiaris>	 found it in the logs of the application
[12:25:12] <_joe_>	 ok, what changed there?
[12:25:27] <akosiaris>	 effie: ^
[12:25:35] <akosiaris>	 any chance this has something to do with mcrouter?
[12:26:30] <effie>	 yes it does 
[12:26:39] <_joe_>	 env['MCROUTER_SERVER'] = ${MW__MCROUTER_SERVER}
[12:26:41] <_joe_>	 yep
[12:26:44] <effie>	 but this change in the image was merged days ago 
[12:26:57] <_joe_>	 effie: but we only use a new image when there is a release
[12:27:01] <claime>	 ^
[12:27:02] <_joe_>	 when did you make your change?
[12:27:07] <effie>	 last week 
[12:27:14] <_joe_>	 yeah, checks out
[12:27:22] <_joe_>	 let's revert that quickly
[12:27:24] <effie>	 ok let me revert this
[12:27:36] <_joe_>	 effie: bump the image version in the changelog
[12:27:50] <_joe_>	 and claime, we'll need to rebuild from scratch the mediawiki image
[12:28:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753
[12:28:22] <_joe_>	 (in scap, I mean)
[12:28:23] <claime>	 _joe_: that should be done by scap once we've updated the php-fpm image
[12:28:37] <_joe_>	 it auto-detects? uhmmm
[12:28:44] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli)
[12:28:45] <_joe_>	 anyways, you'll see pretty quickly
[12:29:11] <_joe_>	 effie: you also need to bump the changelog
[12:29:16] <jnuche>	 _joe_, claime: a full rebuild can be forced with `-Dfull_image_build:True`
[12:29:19] <effie>	 _joe_: I was going to
[12:29:23] <claime>	 jnuche: thanks
[12:29:26] <_joe_>	 I gotta go lunch
[12:32:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[12:34:55] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Scale down to 206 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008843 (owner: 10Clément Goubert)
[12:35:27] <wikibugs>	 (03PS2) 10Effie Mouzeli: Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753
[12:35:49] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: Scale down to 206 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008843 (owner: 10Clément Goubert)
[12:35:54] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli)
[12:35:59] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.decommission for hosts vrts1002.eqiad.wmnet
[12:36:05] <claime>	 effie: want me to do the image rebuild etc.?
[12:36:08] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli)
[12:37:23] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli)
[12:39:58] <wikibugs>	 (03PS1) 10Jaime Nuche: ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237)
[12:40:03] <wikibugs>	 (03PS1) 10Jaime Nuche: ci_test.pp: remove explicit installation of Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008850 (https://phabricator.wikimedia.org/T358237)
[12:40:20] <effie>	 claime: via scap you mean ?
[12:40:25] <claime>	 yeah
[12:40:35] <claime>	 once you're done with build-production-images
[12:41:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche)
[12:41:29] <_joe_>	 claime: I can run it just for that image 
[12:41:41] <_joe_>	 if there's dangling images that fail to build
[12:42:06] <_joe_>	 is anyone running it?
[12:42:14] <wikibugs>	 (03CR) 10Jgiannelos: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos)
[12:42:41] <effie>	 I am on the build host 
[12:42:54] <effie>	 claime:  I will do it no problem 
[12:42:58] <claime>	 ack
[12:43:37] <_joe_>	 effie: then let me paste you the command to just rebuild that image
[12:44:39] <effie>	 _joe_: I already ran build-production-images
[12:45:32] <_joe_>	 ah ok
[12:46:13] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.dns.netbox
[12:46:43] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos)
[12:47:34] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos)
[12:48:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[12:48:06] <jnuche>	 moritzm: the previous patch for ci_test wasn't enough, the packages still need to be installed on the host. Could you take a look at these two followups?:
[12:48:06] <jnuche>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008849
[12:48:06] <jnuche>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008850
[12:48:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[12:48:30] <jnuche>	 (if you have the time)
[12:50:12] <claime>	 nemo-yiannis: can you wait a bit before actually deploying that change?
[12:50:29] <nemo-yiannis>	 ok
[12:50:35] <claime>	 we'd like to put the mw-on-k8s deployments back into a stable state, all at the same version, before that
[12:51:11] <claime>	 I've also scaled back a bit from the 240 replicas, so I'd like to make sure I'm around to ramp up if needed, and right now I can't do that because our images are borked
[12:51:34] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002"
[12:51:53] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9600557 (10dr0ptp4kt) @VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of `wdqs1025.eqiad.wmnet`?...
[12:51:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P58474 and previous config saved to /var/cache/conftool/dbconfig/20240305-125152-root.json
[12:52:03] <nemo-yiannis>	 claime: is there a ticket to track when this work is going to be complete so I deploy after?
[12:52:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Optimize revision table T354015
[12:52:22] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[12:52:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Optimize revision table T354015
[12:52:51] <claime>	 nemo-yiannis: https://phabricator.wikimedia.org/T359155#9600551
[12:52:52] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002"
[12:52:52] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:52:53] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts vrts1002.eqiad.wmnet
[12:54:16] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[12:54:30] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[12:54:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:54:49] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:56:03] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not reimage db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008853
[12:56:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[12:57:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[12:59:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Fixes for rebuild of php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008854
[13:00:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008853 (owner: 10Marostegui)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1300)
[13:01:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600588 (10Jelto) >>! In T358658#9596119, @KCVelaga_WMF wrote: > @MoritzMuehlenhoff When I change my email to wikimedia.org for the developer account, I...
[13:03:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fixes for rebuild of php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008854 (owner: 10Giuseppe Lavagetto)
[13:04:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[13:11:24] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[13:12:43] <logmsgbot>	 !log jiji@deploy2002 Started scap: (no justification provided)
[13:17:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: Silence for cloning
[13:17:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: Silence for cloning
[13:17:47] <wikibugs>	 (03PS1) 10Cory Massaro: Allow custom type converters in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859
[13:18:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422
[13:18:21] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422
[13:18:21] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[13:18:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422
[13:18:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422
[13:19:04] <wikibugs>	 (03PS2) 10Cory Massaro: Allow custom type converters in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098)
[13:21:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2151 in db2217 for T355422', diff saved to https://phabricator.wikimedia.org/P58475 and previous config saved to /var/cache/conftool/dbconfig/20240305-132106-arnaudb.json
[13:21:18] <wikibugs>	 (03Abandoned) 10Jaime Nuche: ci_test.pp: remove explicit installation of Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008850 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche)
[13:21:48] <wikibugs>	 (03Abandoned) 10Jaime Nuche: ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche)
[13:24:27] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2151.codfw.wmnet onto db2217.codfw.wmnet
[13:28:49] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host
[13:28:54] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 04s)
[13:28:58] <effie>	 jnuche: I am rebuilding still 
[13:29:02] <effie>	 I will let you know when it is done 
[13:29:19] <jnuche>	 effie: that wasn't a train deployment
[13:29:21] <effie>	 nemo-yiannis ^ same 
[13:29:25] <effie>	 jnuche: I know :)
[13:29:38] <jnuche>	 ah, silly coincidence, sry :)
[13:29:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600741 (10cmooney) Taavi advised on IRC about the gerrit issue:  > gerrit enforces that user emails are unique. they need to update the email on the ol...
[13:31:21] <wikibugs>	 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9600748 (10dr0ptp4kt) Originally, the thought was to be able to simply count relative volume of these types of inbound taps/clicks. Although we want fidelit...
[13:33:08] <jynus>	 !log running refreshImageMetadata.php on commons for Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf
[13:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:03] <Bsadowski1>	 "Error: 502, Server Hangup at 2024-03-05 13:34:41 GMT"
[13:35:05] <Bsadowski1>	 :(
[13:35:31] <claime>	 Bsadowski1: context?
[13:35:31] <logmsgbot>	 !log jiji@deploy2002 Finished scap: (no justification provided) (duration: 22m 47s)
[13:35:32] <jynus>	 Bsadowski1: what url?
[13:35:45] <Bsadowski1>	 It was a checkuser request
[13:35:50] <Bsadowski1>	 https://login.wikimedia.org/wiki/Special:CheckUser
[13:35:57] <Bsadowski1>	 (steward action)
[13:36:15] <claime>	 Can't check that, no perm :/
[13:36:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600760 (10KCVelaga_WMF) Thanks @Jelto! GitLab works. I mistakenly assumed that updating the email at idm.wikimedia.org will get reflected across the bo...
[13:36:45] <Bsadowski1>	 Okay I retried the action and it seemed to work.
[13:36:53] <Bsadowski1>	 Weird.
[13:37:01] <jynus>	 Actually it has an explanation
[13:37:32] <Bsadowski1>	 Well... there are a ton of results for the range I checked..
[13:37:41] <jynus>	 seldom used functions are sometimes not well optimized, so the db needs to warm up to succeed
[13:37:49] <jynus>	 yes, that would explain it
[13:38:09] <jynus>	 but on a second run it is possible that the data is in memory, succeeding
[13:38:12] <Bsadowski1>	 Maybe Dreamy_Jazz could help with CheckUser things
[13:38:16] <Bsadowski1>	 :D
[13:38:20] <Bsadowski1>	 hehe :)
[13:38:29] <claime>	 Databases are cold-blooded animals
[13:38:40] <claime>	 They need some warmth to function properly x)
[13:38:52] <Bsadowski1>	 I believe there are projects or tasks to make checkuser more... reliable?
[13:39:28] <jynus>	 it shouldn't be like this, but things that run often are noticed more often than functions that are only used occasionally, regardless of their importance
[13:40:00] <jynus>	 yes, also I believe it is not a core feature, so it may not have as much support as other stuff
[13:40:41] <jynus>	 and with core I mean the technical meaning (it is an extension), not its importance
[13:40:44] <Bsadowski1>	 Ah
[13:40:53] <Bsadowski1>	 yep yep :)
[13:41:45] <jynus>	 my suggestion would be: if it is a fast query (e.g. < 1 minute), try a couple of times; if it fails consistently (and there is no ongoing outage), file a task
[13:44:58] <wikibugs>	 (03CR) 10Jforrester: "I was just dropping the flag: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/141" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro)
[13:45:44] <effie>	 jnuche: nemo-yiannis you are free to do whatever you wanted to do 
[13:46:04] <jnuche>	 effie: thx!
[13:46:07] <effie>	 sorry for the trouble I caused 
[13:46:36] <jnuche>	 claime: should I try to go ahead with the train, or is there something else you wanted to check/do first?
[13:46:43] <claime>	 jnuche: nope, good on my end
[13:46:46] <jnuche>	 effie: no worries :)
[13:46:54] <claime>	 nemo-yiannis: please wait for the train, and then you're good to go
[13:46:58] <MatmaRex>	 there's supposed to be a backport window in 15 minutes
[13:47:03] <claime>	 augh.
[13:47:56] <claime>	 PCS and backports should not conflict too much with the changes I made to maxSurge etc.
[13:47:56] <jnuche>	 yeah, there's a patch there, unfortunately first we need to get the train stuff out of the way
[13:48:02] <claime>	 But train needs to happen before backport
[13:48:16] <claime>	 (it should be all right)
[13:49:01] <jnuche>	 ok, doing the deed
[13:49:13] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.21  refs T354439
[13:49:17] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[13:51:40] <wikibugs>	 (03CR) 10Effie Mouzeli: "done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[13:52:23] <wikibugs>	 (03PS6) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498
[13:58:25] <wikibugs>	 (03PS1) 10Majavah: aptrepo: Drop apt.kubernetes.io updates [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169)
[13:59:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1400).
[14:00:05] <jouncebot>	 dbrant and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:09] <Lucas_WMDE>	 o/
[14:00:18] <Lucas_WMDE>	 waiting for jnuche to finish first, I assume
[14:00:22] <jnuche>	 👋 we ran into multiple issues with the train today and we are still running it, backports cannot happen at the moment, I'm sorry about that
[14:00:32] <Lucas_WMDE>	 ack
[14:00:35] <wikibugs>	 (03Abandoned) 10Cory Massaro: Allow custom type converters in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro)
[14:00:40] <wikibugs>	 (03CR) 10Cory Massaro: "Oh, nice. That's definitely a better solution! I'll close this one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro)
[14:00:51] <Lucas_WMDE>	 do you think we’ll be able to do backports later in the window or will there not be enough time?
[14:01:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169) (owner: 10Majavah)
[14:01:09] <jnuche>	 there's also a good chance the train is gonna eat up the entire hour and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005161 will have to be rescheduled
[14:01:37] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] aptrepo: Drop apt.kubernetes.io updates [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169) (owner: 10Majavah)
[14:01:43] <Lucas_WMDE>	 ok
[14:02:18] <claime>	 I'm ok with the backports happening after the window if need be
[14:02:28] <wikibugs>	 (03PS18) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[14:02:35] <claime>	 I'll have to juggle a bit with the network migration happening at 1600UTC
[14:02:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[14:02:44] <claime>	 busy busy day
[14:03:13] <jnuche>	 indeed
[14:03:16] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10media-backups: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#9601034 (10jcrespo)
[14:04:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:04:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:04:54] <wikibugs>	 (03PS1) 10Majavah: hieradata: update striker to 2024-02-28-214103-production [puppet] - 10https://gerrit.wikimedia.org/r/1008865 (https://phabricator.wikimedia.org/T358615)
[14:05:27] <wikibugs>	 (03PS2) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154)
[14:05:48] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10media-backups: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#9601050 (10jcrespo)
[14:06:36] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-02-28-214103-production [puppet] - 10https://gerrit.wikimedia.org/r/1008865 (https://phabricator.wikimedia.org/T358615) (owner: 10Majavah)
[14:06:41] <wikibugs>	 (03CR) 10Elukey: "IIUC the renew_seconds parameter should force puppet to renew the cert earlier, and the idea is to allow more time for an admin to perform" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron)
[14:06:50] <nemo-yiannis>	 claime: ok
[14:07:18] <jnuche>	 now we got past the testservers :)
[14:07:27] <jnuche>	 https://www.irccloud.com/pastebin/ph4Rkix3/
[14:08:05] <jnuche>	 claime, akosiaris, effie, _joe_: thank you all for your help
[14:08:10] <effie>	 cheers 
[14:08:11] <claime>	 \o/
[14:08:13] <akosiaris>	 \o/
[14:09:34] <wikibugs>	 (03PS4) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560)
[14:09:39] <wikibugs>	 (03PS4) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560)
[14:09:44] <wikibugs>	 (03PS4) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560)
[14:09:49] <wikibugs>	 (03PS4) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560)
[14:09:56] <wikibugs>	 (03PS4) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560)
[14:10:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:11:34] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans)
[14:12:33] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867
[14:13:17] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2151.codfw.wmnet onto db2217.codfw.wmnet
[14:14:33] <wikibugs>	 (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:14:40] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:15:20] <wikibugs>	 (03PS1) 10Jelto: aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868
[14:15:48] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392)
[14:15:53] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392)
[14:15:58] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Switch the remaining parsoid hosts to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392)
[14:16:04] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392)
[14:16:12] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392)
[14:16:20] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392)
[14:16:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[14:16:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[14:16:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58476 and previous config saved to /var/cache/conftool/dbconfig/20240305-141649-arnaudb.json
[14:17:22] <wikibugs>	 (03PS2) 10Jelto: aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868
[14:17:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[14:18:31] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:25] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: toggle notifications for db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422)
[14:21:26] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:22:46] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[14:24:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:52] <claime>	 ^it's lying it's fine
[14:26:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:26:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:27:53] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney)
[14:28:23] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:28:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:29:17] <wikibugs>	 (03PS1) 10EoghanGaffney: Revert "[vrts] Remove ticket-test.wm.o and vrts1002" [puppet] - 10https://gerrit.wikimedia.org/r/1008756
[14:29:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008868 (owner: 10Jelto)
[14:29:51] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[14:30:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert)
[14:31:03] <claime>	 jnuche: almost there x0
[14:31:21] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.21  refs T354439 (duration: 42m 08s)
[14:31:22] <jnuche>	 yep yep yep
[14:31:26] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:31:30] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[14:31:31] <jnuche>	 train presync done, rolling forward to group0 in a sec
[14:31:47] <jnuche>	 (should be relatively fast)
[14:31:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58477 and previous config saved to /var/cache/conftool/dbconfig/20240305-143154-arnaudb.json
[14:31:59] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:32:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:32:36] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Revert "[vrts] Remove ticket-test.wm.o and vrts1002" [puppet] - 10https://gerrit.wikimedia.org/r/1008756 (owner: 10EoghanGaffney)
[14:32:51] <claime>	 jnuche: I took the liberty to attach to your screen, I've never watched a train rollout, hope you don't mind
[14:33:09] <jnuche>	 no problemo :)
[14:33:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002"
[14:33:48] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439)
[14:33:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[14:33:59] <jnuche>	 here we go
[14:34:15] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002"
[14:34:16] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:34:25] <wikibugs>	 (03PS1) 10EoghanGaffney: [vrts] Remove vrts1002 reverences [puppet] - 10https://gerrit.wikimedia.org/r/1008872
[14:34:28] <claime>	 choo choo
[14:34:31] <wikibugs>	 (03PS3) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154)
[14:34:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:34:47] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[14:34:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:35:10] <logmsgbot>	 !log fabfur@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns2004.wikimedia.org with reason: T355873
[14:35:15] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[14:35:25] <logmsgbot>	 !log fabfur@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns2004.wikimedia.org with reason: T355873
[14:35:27] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868 (owner: 10Jelto)
[14:35:28] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[14:36:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:36:30] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:36:50] <icinga-wm_>	 PROBLEM - cassandra-a CQL 10.64.16.28:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.28 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:37:12] <fabfur>	 !log depooling dns2004 for T355873
[14:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:29] <claime>	 jnuche: I should have merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1008867 before you rolled forward, it would have made the deployment faster 
[14:37:41] <claime>	 I'll merge it right quick afterwards
[14:37:47] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002"
[14:37:48] <logmsgbot>	 !log fabfur@cumin2002 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org
[14:37:59] <jnuche>	 ack
[14:38:04] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:07] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9601203 (10akosiaris)
[14:38:26] <claime>	 Right now at every deployment we exceed our capacity by around 800CPUs because of maxSurge/maxUnavailable settings, which means more wait for containers to be ready, etc.
[14:38:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002"
[14:38:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:39:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:40:22] <akosiaris>	 !log remove all but 1 host from parsoid@eqiad
[14:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:29] <claime>	 hmm https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&viewPanel=23
[14:40:32] <akosiaris>	 !log remove all but 1 host from parsoid@eqiad T358752
[14:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:40] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:40:43] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[14:40:56] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601214 (10elukey) a:05klausman→03None
[14:40:56] <akosiaris>	 claime: hmmm
[14:41:09] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9601211 (10akosiaris) We at [~50% mw-parsoid](https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetr...
[14:41:37] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601234 (10elukey) Removed Tobias as assignee so the new node can be initialized.
[14:42:00] <claime>	 akosiaris: bump in captcha displayed at the same time
[14:42:45] <fabfur>	 topranks: dns2004 is depooled and downtimed ready for T355873
[14:42:46] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[14:42:50] <icinga-wm_>	 PROBLEM - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.32 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:43:02] <topranks>	 fabfur: super thanks!
[14:43:27] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601236 (10cmooney) >>! In T358727#9600557, @dr0ptp4kt wrote: > @VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish...
[14:44:08] <wikibugs>	 (03CR) 10Volans: "Nice! One typo and a small formatting issue, looks sane otherwise to me, but I'll leave to ServiceOps to review the helmfile command." [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:44:12] <claime>	 urandom: something going on with this restbase node? ^^
[14:44:25] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-02-26-150614 to 2024-03-05-140533 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008874 (https://phabricator.wikimedia.org/T296937)
[14:44:40] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:44:51] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.21  refs T354439
[14:44:55] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[14:44:58] <jnuche>	 group0 completed, give me a min to check a couple things
[14:45:25] <akosiaris>	 claime: cassandra appears to be running
[14:45:52] <icinga-wm_>	 PROBLEM - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:46:27] <akosiaris>	 Condition check resulted in distributed storage system for structured data being skipped ?
[14:46:54] <wikibugs>	 (03PS19) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[14:46:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58480 and previous config saved to /var/cache/conftool/dbconfig/20240305-144658-arnaudb.json
[14:47:03] <jnuche>	 claime: all done, you can go ahead with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1008867 if you want
[14:47:03] <akosiaris>	 4 log entries for cassandra-b and -c since 14:16 today
[14:47:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[14:47:10] <claime>	 jnuche: awesome thanks
[14:47:29] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert)
[14:48:21] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[14:48:50] <icinga-wm_>	 PROBLEM - cassandra-c CQL 10.64.16.35:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.35 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:48:53] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert)
[14:49:19] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: (no justification provided)
[14:49:42] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 00m 23s)
[14:51:15] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: (no justification provided)
[14:51:50] <icinga-wm_>	 PROBLEM - cassandra-c SSL 10.64.16.35:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:52:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[14:52:09] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1038.eqiad.wmnet with reason: Bootstrapping — T354560
[14:52:12] <stashbot>	 T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560
[14:52:23] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1038.eqiad.wmnet with reason: Bootstrapping — T354560
[14:53:42] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9601326 (10MoritzMuehlenhoff) @KCVelaga_WMF Can you try logging into https://idm.wikimedia.org with your old account? Under "e-mail" you can click "Updat...
[14:54:10] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [zuul/deploy@cadc625]: test deployment for new host
[14:55:54] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2260 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:56:34] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 05m 18s)
[14:57:01] <claime>	 ok we're good
[14:57:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[14:57:12] <claime>	 dbrant, MatmaRex, Lucas_WMDE, you can proceed with backports
[14:57:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:57:21] <wikibugs>	 (03PS4) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154)
[14:57:22] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:57:24] <claime>	 sorry it took so long, we can overflow the window
[14:58:04] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:10] <MatmaRex>	 thanks, i'm around if anyone can deploy
[14:58:18] <dbrant>	 same
[14:59:08] <wikibugs>	 (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:59:25] <Lucas_WMDE>	 o/
[14:59:28] <Lucas_WMDE>	 jouncebot: nowandnext
[14:59:28] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1400)
[14:59:28] <jouncebot>	 In 1 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600)
[14:59:34] <Lucas_WMDE>	 we still have a free hour
[14:59:42] <Lucas_WMDE>	 so I guess we’ll just do the deployments now then
[14:59:48] <Lucas_WMDE>	 just a sec, need to finish a comment on phab first
[15:00:04] <_joe_>	 Lucas_WMDE: please hold a sec
[15:00:15] <Lucas_WMDE>	 ok
[15:02:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58481 and previous config saved to /var/cache/conftool/dbconfig/20240305-150203-arnaudb.json
[15:02:09] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422)
[15:02:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM cookbook/python wise :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[15:02:42] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host mw1357.eqiad.wmnet with OS bullseye
[15:02:56] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host mw1357.eqiad.wmnet with OS bullseye
[15:03:13] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host mw1356.eqiad.wmnet with OS bullseye
[15:03:27] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host mw1356.eqiad.wmnet with OS bullseye
[15:06:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1003.eqiad.wmnet with OS bullseye
[15:06:45] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1003.eqiad.wmnet with OS bullseye
[15:07:02] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1004.eqiad.wmnet with OS bullseye
[15:07:18] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1004.eqiad.wmnet with OS bullseye
[15:08:52] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:08:58] <denisse>	 !log disable meta-monitoring for alert1001 - T333615
[15:08:59] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:11] <stashbot>	 T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615
[15:09:58] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422)
[15:10:03] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[15:10:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse)
[15:10:25] <_joe_>	 jouncebot: now
[15:10:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[15:10:40] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:10:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[15:11:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:11:18] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422)
[15:11:20] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:12:42] <wikibugs>	 (03CR) 10Arnaudb: mariadb: add all missing hosts from T355422 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[15:14:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:47] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[15:18:55] <Lucas_WMDE>	 dbrant, MatmaRex: just FYI, the deployment won’t happen now after all, sorry for the troubles
[15:19:17] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[15:19:30] <dbrant>	 no worries, will move to the next window
[15:19:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:21:33] <wikibugs>	 (03CR) 10Marostegui: "All these hosts will start showing up on icinga when puppet starts running - they won't page as they correctly have notifications disabled" [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb)
[15:24:37] <wikibugs>	 (03PS2) 10Andrew Bogott: role::puppetserver::cloud_vps_project: remove firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450)
[15:24:45] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455)
[15:25:00] <jinxer-wm>	 (ProbeDown) firing: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:05] <wikibugs>	 (03PS3) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455)
[15:25:13] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::puppetserver::wmcs: parametrize a few hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1008879
[15:25:27] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:25:35] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:46] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[15:27:28] <wikibugs>	 (03PS2) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615)
[15:28:08] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[15:29:12] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:13] <wikibugs>	 (03PS1) 10Majavah: hieradata: fix alert2001 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1008880
[15:29:20] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:30:11] <wikibugs>	 (03PS20) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:30:27] <wikibugs>	 (03Abandoned) 10Majavah: hieradata: fix alert2001 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1008880 (owner: 10Majavah)
[15:30:33] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422)
[15:32:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:32:46] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui)
[15:32:51] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui)
[15:34:45] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wikimedia.org: failover icinga to alert2001 too [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615)
[15:35:12] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi)
[15:35:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] wikimedia.org: failover icinga to alert2001 too [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi)
[15:37:02] <wikibugs>	 (03CR) 10Bking: "We're still getting alert spam, so I'm going to merge this. Happy to follow up on suggestions in a future patch." [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking)
[15:37:12] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking)
[15:37:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:37:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:38:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:38:57] <wikibugs>	 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601519 (10Joe)
[15:39:23] <_joe_>	 !log draining kubernetes2035 T355873
[15:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:27] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[15:39:43] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601524 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host mw1357.eqiad.wmnet with OS bullseye complet...
[15:40:14] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9601526 (10Jhancock.wm) @jcrespo I replaced that cable. It was quick enough it didn't even notice. I remember we tried this in the past and it didn't work. But I have a brand new cable, so maybe that will be the difference.
[15:40:22] <wikibugs>	 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9601527 (10andrea.denisse)
[15:41:15] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1356.eqiad.wmnet with OS bullseye
[15:41:29] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601529 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host mw1356.eqiad.wmnet with OS bullseye complet...
[15:43:13] <_joe_>	 !log draining kubernetes2054 T355873
[15:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:19] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:43:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2040.codfw.wmnet with OS bookworm
[15:43:29] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2040.codfw.wmnet with OS bookworm completed: - es2040 (**WARN**)   -...
[15:43:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873
[15:43:40] <Jeff_Green>	 We're seeing a flood of nagios/icinga "passive check is awol" alerts from alert1002, has nsca or icinga fallen over?
[15:43:45] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:43:45] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1003.eqiad.wmnet with OS bullseye
[15:43:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873
[15:43:51] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1008872 (owner: 10EoghanGaffney)
[15:44:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355873 - depooling db2148 db2163 db2185 db2164 db2189 es2025 es2029 es2030', diff saved to https://phabricator.wikimedia.org/P58489 and previous config saved to /var/cache/conftool/dbconfig/20240305-154400-arnaudb.json
[15:44:12] <Jeff_Green>	 Also, I think the icinga alerts are flapping between warning and recovery.
[15:44:17] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601543 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1003.eqiad.wmnet with OS bullseye comp...
[15:44:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:03] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:45:05] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1004.eqiad.wmnet with OS bullseye
[15:45:10] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601554 (10Jhancock.wm)
[15:45:24] <godog>	 Jeff_Green: we did an alert host failover, likely that
[15:45:32] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:45:47] <Jeff_Green>	 godog: oh, huh, I wonder if we're able to report to the new host properly
[15:46:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:46:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:46:25] <godog>	 Jeff_Green: could be, one sec
[15:46:34] <denisse>	 Hi Jeff_Green, can you share where those alerts are going?
[15:46:47] <denisse>	 I'd like to see them to understand the problem further.
[15:46:47] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1004.eqiad.wmnet with OS bullseye comp...
[15:46:57] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:47:05] <Jeff_Green>	 denisse: do you mean the email alerts, or where our hosts post the nsca reports?
[15:47:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[15:47:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58490 and previous config saved to /var/cache/conftool/dbconfig/20240305-154718-arnaudb.json
[15:47:27] <Jeff_Green>	 the email alerts are going to fr-tech-ops@wikimedia.org
[15:47:53] <denisse>	 Thank you Jeff_Green, taking a look.
[15:48:03] <wikibugs>	 (03PS21) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:48:03] <godog>	 !log bounce ircecho on alert2001
[15:48:04] <Jeff_Green>	 denisse: great
[15:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:30] <_joe_>	 !log draining mw2434 T355873
[15:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:33] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[15:48:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 23:00:00 on db2096.codfw.wmnet with reason: Silence for cloning
[15:48:59] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 23:00:00 on db2096.codfw.wmnet with reason: Silence for cloning
[15:49:16] <Jeff_Green>	 fwiw we have two hosts configured for nsca reporting: 208.80.154.88 and 208.80.153.84
[15:49:17] <godog>	 ok ircecho should be back in some fashion
[15:49:27] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 23:00:00 on db2196.codfw.wmnet with reason: Silence for cloning
[15:49:42] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 23:00:00 on db2196.codfw.wmnet with reason: Silence for cloning
[15:49:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:43] <_joe_>	 !log draining mw2435 T355873
[15:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:51] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601629 (10VRiley-WMF) Thank you @cmooney ! I have also relabeled this unit to match the name. Closing this ticket as per our discussion s...
[15:51:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[15:52:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 20 hosts with reason: Silence for cloning
[15:52:35] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601633 (10VRiley-WMF) 05Open→03Resolved
[15:52:43] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9601636 (10jcrespo) I didn't ask for a cable change, and so far I haven't observed any problem with the host, TBH, it was @ayounsi who requested it, but I wonder if the metrics are too sensitive- we do the backup as fast a...
[15:52:52] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.32 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:53:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 20 hosts with reason: Silence for cloning
[15:53:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db[2219-2220].codfw.wmnet with reason: Silence for cloning
[15:53:42] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db[2219-2220].codfw.wmnet with reason: Silence for cloning
[15:53:45] <denisse>	 icinga-wm: <3
[15:54:18] <wikibugs>	 (03PS22) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:54:21] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,lsw1-b8-codfw.mgmt asw-b-codfw with reason: prepping for server uplink migration codfw rack b8
[15:54:22] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on cr[1-2]-codfw,lsw1-b8-codfw.mgmt asw-b-codfw with reason: prepping for server uplink migration codfw rack b8
[15:54:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2035.codfw.wmnet with OS bookworm
[15:54:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney)
[15:54:38] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm
[15:54:45] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt with reason: prepping for server uplink migration codfw rack b8
[15:54:48] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:54:54] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt with reason: prepping for server uplink migration codfw rack b8
[15:54:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:55:06] <wikibugs>	 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601663 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=19e5ce18-f2ba-4d9e-a80a-2c957c2eecad) set by cmoon...
[15:55:21] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS bullseye
[15:55:37] <godog>	 !log bounce ircecho on alert2001 one last time
[15:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B8 to lsw1-b8-codfw
[15:55:53] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:56:00] <_joe_>	 !log depooled parse2008-10 T355873
[15:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:04] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[15:56:06] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1005.eqiad.wmnet with OS bullseye
[15:56:19] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B8 to lsw1-b8-codfw
[15:56:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1006.eqiad.wmnet with OS bullseye
[15:57:04] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1002.eqiad.wmnet with OS bullseye
[15:57:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1007.eqiad.wmnet with OS bullseye
[15:58:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1008.eqiad.wmnet with OS bullseye
[15:58:16] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601682 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f241631d-4830-4ac7-b5c1-29790ccbb916) set by cmoon...
[15:58:28] <_joe_>	 !log depooled mw2434-5, T355873
[15:58:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:40] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601678 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1005.eqiad.wmnet with OS bullseye
[15:58:55] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:59:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:59:32] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye
[16:00:05] <jouncebot>	 eoghan, jelto, and arnoldokoth: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600).
[16:00:24] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1007.eqiad.wmnet with OS bullseye
[16:00:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:00:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:00:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:00:56] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1009.eqiad.wmnet with OS bullseye
[16:01:17] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601700 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1008.eqiad.wmnet with OS bullseye
[16:01:21] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:02:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:02:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:03:50] <wikibugs>	 (PS1) Andrew Bogott: eqiad1::pdns: remove check_dns alerts [puppet] - https://gerrit.wikimedia.org/r/1008887
[16:04:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:04:30] <wikibugs>	 (PS1) Arnaudb: mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422)
[16:04:52] <topranks>	 !log commencing migration of servers in codfw rack b8 to lsw1-b8-codfw T355873
[16:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:56] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1009.eqiad.wmnet with OS bullseye
[16:05:09] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[16:05:49] <wikibugs>	 (CR) CI reject: [V: -1] mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[16:05:57] <wikibugs>	 (CR) Majavah: [C: +1] eqiad1::pdns: remove check_dns alerts [puppet] - https://gerrit.wikimedia.org/r/1008887 (owner: Andrew Bogott)
[16:06:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:06:19] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51595 bytes in 0.823 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:06:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.870 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:06:21] <wikibugs>	 (PS2) Arnaudb: mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422)
[16:06:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:06:32] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:07:05] <wikibugs>	 (CR) EoghanGaffney: [C: +2] [vrts] Remove vrts1002 reverences [puppet] - https://gerrit.wikimedia.org/r/1008872 (owner: EoghanGaffney)
[16:07:25] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[16:07:41] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[16:08:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[16:08:49] <icinga-wm>	 PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:49] <icinga-wm>	 PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:53] <icinga-wm>	 PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:55] <icinga-wm>	 PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:57] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage
[16:09:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:09:38] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:10:01] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage
[16:10:38] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[16:11:14] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage
[16:11:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:11:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:12:19] <icinga-wm>	 RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[16:12:19] <icinga-wm>	 RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[16:12:53] <icinga-wm>	 RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[16:13:21] <icinga-wm>	 RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[16:13:25] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage
[16:13:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage
[16:13:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422
[16:13:58] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422
[16:14:02] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422
[16:14:05] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[16:14:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422
[16:15:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2103 in db2203 for T355422', diff saved to https://phabricator.wikimedia.org/P58492 and previous config saved to /var/cache/conftool/dbconfig/20240305-161517-arnaudb.json
[16:15:21] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601818 (cmooney) All links moved without problem, servers back online and responding to ping now.
[16:15:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage
[16:15:41] <claime>	 !log Repooling mw2433.codfw.wmnet mw2432.codfw.wmnet parse2008.codfw.wmnet parse2009.codfw.wmnet parse2010.codfw.wmnet
[16:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:55] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org
[16:16:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] eqiad1::pdns: remove check_dns alerts [puppet] - https://gerrit.wikimedia.org/r/1008887 (owner: Andrew Bogott)
[16:16:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2035.codfw.wmnet with reason: host reimage
[16:16:18] <jnuche>	 jouncebot: nowandnext
[16:16:18] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600)
[16:16:19] <jouncebot>	 In 0 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700)
[16:16:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2103.codfw.wmnet onto db2203.codfw.wmnet
[16:16:42] <wikibugs>	 (PS1) Majavah: hieradata: update test VM without floating IP [puppet] - https://gerrit.wikimedia.org/r/1008892
[16:16:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org
[16:16:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org
[16:17:17] <claime>	 !log uncordon kubernetes2035.codfw.wmnet kubernetes2034.codfw.wmnet mw2434.codfw.wmnet mw2435.codfw.wmnet
[16:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:03] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage
[16:19:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58493 and previous config saved to /var/cache/conftool/dbconfig/20240305-161921-arnaudb.json
[16:19:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58494 and previous config saved to /var/cache/conftool/dbconfig/20240305-161932-arnaudb.json
[16:19:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58495 and previous config saved to /var/cache/conftool/dbconfig/20240305-161955-arnaudb.json
[16:20:00] <jinxer-wm>	 (ProbeDown) resolved: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58496 and previous config saved to /var/cache/conftool/dbconfig/20240305-162011-arnaudb.json
[16:20:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58497 and previous config saved to /var/cache/conftool/dbconfig/20240305-162025-arnaudb.json
[16:20:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58498 and previous config saved to /var/cache/conftool/dbconfig/20240305-162043-arnaudb.json
[16:20:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58499 and previous config saved to /var/cache/conftool/dbconfig/20240305-162056-arnaudb.json
[16:22:29] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage
[16:23:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422
[16:23:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422
[16:23:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422
[16:23:21] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422
[16:23:22] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[16:24:04] <jynus>	 !log patching oldimage table for commons T359176
[16:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:18] <stashbot>	 T359176: Long-titled archived files can get its path metadata truncated due to not having enough storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata) - https://phabricator.wikimedia.org/T359176
[16:24:24] <wikibugs>	 (PS1) Muehlenhoff: Point apt discovery records to apt1002/apt2002 (new bookworm hosts) [puppet] - https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613)
[16:24:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2104 in db2204 for T355422~', diff saved to https://phabricator.wikimedia.org/P58500 and previous config saved to /var/cache/conftool/dbconfig/20240305-162442-arnaudb.json
[16:25:13] <wikibugs>	 (CR) Herron: "Yes this is my understanding as well, essentially two settings:" [puppet] - https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: Herron)
[16:25:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2104.codfw.wmnet onto db2204.codfw.wmnet
[16:27:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2035.codfw.wmnet with reason: host reimage
[16:28:41] <brennen>	 jouncebot nowandnext
[16:28:41] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600)
[16:28:41] <jouncebot>	 In 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700)
[16:28:55] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS bullseye
[16:29:19] <brennen>	 mutante: just fyi going to do a backport of https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1008476
[16:29:45] <brennen>	 ^ cc: Jdlrobson 
[16:29:47] <mutante>	 brennen: alright, thanks
[16:30:08] <mutante>	 the window is empty
[16:31:09] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602016 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1002.eqiad.wmnet with OS bullseye comp...
[16:31:11] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1007.eqiad.wmnet with OS bullseye
[16:31:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164) (owner: Jdlrobson)
[16:32:51] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1007.eqiad.wmnet with OS bullseye comp...
[16:33:41] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1008.eqiad.wmnet with OS bullseye
[16:34:24] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1008.eqiad.wmnet with OS bullseye comp...
[16:34:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58501 and previous config saved to /var/cache/conftool/dbconfig/20240305-163426-arnaudb.json
[16:34:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58502 and previous config saved to /var/cache/conftool/dbconfig/20240305-163437-arnaudb.json
[16:35:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58503 and previous config saved to /var/cache/conftool/dbconfig/20240305-163501-arnaudb.json
[16:35:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58504 and previous config saved to /var/cache/conftool/dbconfig/20240305-163516-arnaudb.json
[16:35:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58505 and previous config saved to /var/cache/conftool/dbconfig/20240305-163530-arnaudb.json
[16:35:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58506 and previous config saved to /var/cache/conftool/dbconfig/20240305-163548-arnaudb.json
[16:36:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58507 and previous config saved to /var/cache/conftool/dbconfig/20240305-163601-arnaudb.json
[16:36:03] <wikibugs>	 (PS1) Ebernhardson: cirrus: Check backfill status prior to reindexing [deployment-charts] - https://gerrit.wikimedia.org/r/1008895
[16:36:06] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1009.eqiad.wmnet with OS bullseye
[16:36:20] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1009.eqiad.wmnet with OS bullseye comp...
[16:38:35] <wikibugs>	 (CR) Majavah: [C: +1] role::puppetserver::cloud_vps_project: remove firewall config [puppet] - https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450) (owner: Andrew Bogott)
[16:38:54] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1006.eqiad.wmnet with OS bullseye
[16:39:07] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602077 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye exec...
[16:39:21] <denisse>	 !log enabling meta-monitoring for the alert* hosts - T333615
[16:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:25] <stashbot>	 T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615
[16:39:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1006.eqiad.wmnet with OS bullseye
[16:39:49] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602087 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye
[16:40:45] <wikibugs>	 (PS1) Muehlenhoff: Fix airflow firewall setting for an-launcher [puppet] - https://gerrit.wikimedia.org/r/1008896
[16:40:54] <wikibugs>	 (PS2) Muehlenhoff: Fix airflow firewall setting for an-launcher [puppet] - https://gerrit.wikimedia.org/r/1008896
[16:41:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1005.eqiad.wmnet with OS bullseye
[16:41:19] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1005.eqiad.wmnet with OS bullseye comp...
[16:42:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:42:16] <wikibugs>	 (PS1) Daniel Kinzler: Rest: allow Handlers to disable body parsing. [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008758 (https://phabricator.wikimedia.org/T357025)
[16:42:45] <wikibugs>	 (CR) Brouberol: [C: +1] Fix airflow firewall setting for an-launcher [puppet] - https://gerrit.wikimedia.org/r/1008896 (owner: Muehlenhoff)
[16:43:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1010.eqiad.wmnet with OS bullseye
[16:43:42] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye
[16:43:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix airflow firewall setting for an-launcher [puppet] - https://gerrit.wikimedia.org/r/1008896 (owner: Muehlenhoff)
[16:44:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1011.eqiad.wmnet with OS bullseye
[16:44:24] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602112 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye
[16:44:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1012.eqiad.wmnet with OS bullseye
[16:45:05] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602116 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye
[16:45:28] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1013.eqiad.wmnet with OS bullseye
[16:45:43] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602117 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1013.eqiad.wmnet with OS bullseye
[16:46:05] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye
[16:46:20] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602123 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye
[16:47:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:47:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2035.codfw.wmnet with OS bookworm
[16:47:30] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm completed: - es2035 (**PASS**)   -...
[16:48:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602143 (Jhancock.wm) @Marostegui this is completed
[16:49:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602140 (Jhancock.wm) Open→Resolved
[16:49:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58508 and previous config saved to /var/cache/conftool/dbconfig/20240305-164931-arnaudb.json
[16:49:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58509 and previous config saved to /var/cache/conftool/dbconfig/20240305-164942-arnaudb.json
[16:49:58] <wikibugs>	 (PS1) Hnowlan: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907)
[16:50:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58510 and previous config saved to /var/cache/conftool/dbconfig/20240305-165006-arnaudb.json
[16:50:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58511 and previous config saved to /var/cache/conftool/dbconfig/20240305-165022-arnaudb.json
[16:50:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58512 and previous config saved to /var/cache/conftool/dbconfig/20240305-165035-arnaudb.json
[16:50:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58513 and previous config saved to /var/cache/conftool/dbconfig/20240305-165053-arnaudb.json
[16:51:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58514 and previous config saved to /var/cache/conftool/dbconfig/20240305-165106-arnaudb.json
[16:51:55] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage
[16:51:56] <wikibugs>	 (Merged) jenkins-bot: Partial Revert "Set background/color to inherit for common templates" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164) (owner: Jdlrobson)
[16:52:44] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]]
[16:52:49] <stashbot>	 T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164
[16:53:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:53:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602175 (Marostegui) Thank you so much!
[16:53:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:53:57] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[16:54:04] <wikibugs>	 (PS1) Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - https://gerrit.wikimedia.org/r/1008759
[16:54:19] <wikibugs>	 (PS1) Andrea Denisse: Revert "wikimedia.org: failover icinga to alert2001 too" [dns] - https://gerrit.wikimedia.org/r/1008760
[16:54:32] <wikibugs>	 (PS1) Andrea Denisse: Revert "alert: Failover Icinga and Alertmanager to alert2001" [puppet] - https://gerrit.wikimedia.org/r/1008761
[16:55:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage
[16:55:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602183 (Marostegui)
[16:56:25] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage
[16:56:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage
[16:56:53] <brennen>	 akosiaris: i am getting some errors for parse* hosts from scap here; guessing this is an indicator i shouldn't be deploying at present?
[16:57:29] <wikibugs>	 (CR) Fabfur: [V: +1 C: +2] cache: start using benthos on single host for haproxy log parsing [puppet] - https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[16:57:37] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage
[16:57:41] <claime>	 brennen: which nodes?
[16:58:12] <brennen>	 parse1010, 1013, 1011, 1014, 1012
[16:58:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage
[16:58:29] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:58:31] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602232 (bdgreenlee) I'm told I'll need `analytics-privatedata-users` too. Can I tack that onto this ticket, or should I file a new one?
[16:58:34] <stashbot>	 T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164
[16:58:45] <claime>	 brennen: gimme a second, i'll fix it
[16:59:11] <brennen>	 claime: thanks, holding remainder of sync until i hear back.  Jdlrobson, if there's testing to do i think you can go ahead and do it now.
[16:59:14] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage
[16:59:16] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: cluster=parsoid
[16:59:56] <brennen>	 claime: for what it's worth, it seems like maybe a few things pooled that shouldn't have been?  errors were changed keys and a couple of timeouts.
[17:00:04] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:38] <claime>	 brennen: yeah basically puppet didn't run on deploy host between a.kosiaris removing nodes from prod for reimage and you running your deployment
[17:00:40] <brennen>	 jhathaway, rzl: apologies for stepping on your window, in the midst of a backport for a could-be train blocker.
[17:00:46] <rzl>	 (nothing to do in the puppet window, feel free to-- haha
[17:00:51] <rzl>	 you're good, it's all yours :)
[17:00:53] <brennen>	 rzl: right on. :)
[17:01:01] <claime>	 I'm running it now
[17:01:08] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602244 (Dzahn) If you don't mind please file a new one since that's a different tag/board/process.
[17:01:09] <brennen>	 claime: cool, thx.
[17:01:36] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage
[17:01:59] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=mw243(2|3).*
[17:02:36] <wikibugs>	 (PS10) Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507)
[17:02:41] <wikibugs>	 (PS2) Hnowlan: mobileapps: add cassandra config in staging [deployment-charts] - https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507)
[17:02:57] <wikibugs>	 (CR) Hnowlan: mobileapps: add cassandra config in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[17:03:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602269 (odimitrijevic) Approved
[17:03:27] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197 (bdgreenlee)
[17:04:01] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage
[17:04:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58517 and previous config saved to /var/cache/conftool/dbconfig/20240305-170437-arnaudb.json
[17:04:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58518 and previous config saved to /var/cache/conftool/dbconfig/20240305-170448-arnaudb.json
[17:05:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58519 and previous config saved to /var/cache/conftool/dbconfig/20240305-170511-arnaudb.json
[17:05:19] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan
[17:05:21] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9602283 (odimitrijevic) Approved
[17:05:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58520 and previous config saved to /var/cache/conftool/dbconfig/20240305-170527-arnaudb.json
[17:05:31] <claime>	 brennen: should be good now, those parse nodes are not in dsh anymore
[17:05:32] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on lvs2012.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan
[17:05:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58521 and previous config saved to /var/cache/conftool/dbconfig/20240305-170540-arnaudb.json
[17:05:47] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan
[17:05:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58522 and previous config saved to /var/cache/conftool/dbconfig/20240305-170558-arnaudb.json
[17:06:03] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan
[17:06:11] <claime>	 brennen: they're all being reimaged as k8s nodes, so sync errors to them are not a problem
[17:06:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58523 and previous config saved to /var/cache/conftool/dbconfig/20240305-170611-arnaudb.json
[17:06:13] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602285 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6010131f-b756-49c6-8082-62badba41...
[17:06:16] <brennen>	 claime: thanks, going ahead since this is a revert.
[17:06:23] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync
[17:06:30] <claime>	 it's basically a race condition
[17:06:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage
[17:07:32] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: moving lvs2011 which will disrupt bgp
[17:07:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: moving lvs2011 which will disrupt bgp
[17:08:16] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602297 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c0fe6035-a553-49f8-8b94-3d7840e51...
[17:09:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198 (cmooney) p:Triage→Medium
[17:10:15] <topranks>	 !log disabling pybal on lvs2011 (traffic will move to lvs2014) in advance of reimage T352920
[17:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:30] <stashbot>	 T352920: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920
[17:11:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2096.codfw.wmnet onto db2196.codfw.wmnet
[17:11:39] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1006.eqiad.wmnet with OS bullseye
[17:11:52] <icinga-wm>	 RECOVERY - MariaDB Replica IO: x1 #page on db2096 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:11:52] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602348 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye comp...
[17:12:09] <claime>	 brennen: going all right?
[17:12:29] <brennen>	 claime: yep, all smooth so far.
[17:12:33] <claime>	 fantastic
[17:16:27] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]] (duration: 23m 42s)
[17:16:33] <wikibugs>	 (CR) Dzahn: [V: -1] "This is just a question to Antoine: "Are you going to need a copy of /var/lib/zuul prod data on the test host to test zuul?"" [puppet] - https://gerrit.wikimedia.org/r/1007433 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[17:16:37] <stashbot>	 T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164
[17:17:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1010.eqiad.wmnet with OS bullseye
[17:17:20] <brennen>	 Jdlrobson: should be good to go.
[17:19:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1012.eqiad.wmnet with OS bullseye
[17:19:54] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602418 (andrea.denisse) a:andrea.denisse
[17:21:35] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1011.eqiad.wmnet with OS bullseye
[17:21:45] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602443 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye comp...
[17:22:50] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: x1 #page on db2096 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:23:21] <claime>	 Didn't that already recover like 10 minutes ago?
[17:23:33] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye comp...
[17:23:58] <sukhe>	 Slave_IO_Running / Slave_SQL_Running but otherwise same host yep
[17:24:33] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602463 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye comp...
[17:24:44] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1013.eqiad.wmnet with OS bullseye
[17:25:34] <mutante>	 denisse: I think I can help with that icinga issue and the bfd check
[17:25:52] <mutante>	 there is this package on alert hosts:  snmp-mibs-downloader
[17:26:20] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602476 (dcaro) Affecting also the cloudswitches  {F42399814}
[17:26:30] <mutante>	 I think we have to use that to download the missing MIB file, and what looks for it is the "Snimpy" "load" https://snimpy.readthedocs.io/en/latest/usage.html
[17:26:52] <mutante>	 if that would download the file at the top of https://www.circitor.fr/Mibs/Html/B/BFD-STD-MIB.php
[17:26:55] <mutante>	 then the check should work again
[17:27:20] <wikibugs>	 (CR) RLazarus: [C: +2] k8s: Add getter for the Batch API (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[17:27:27] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:27:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602493 (dcaro) It's gone now :)
[17:28:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58524 and previous config saved to /var/cache/conftool/dbconfig/20240305-172834-arnaudb.json
[17:29:07] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[17:29:14] <wikibugs>	 SRE, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602507 (andrea.denisse)
[17:29:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602506 (andrea.denisse)
[17:29:43] <wikibugs>	 (CR) Clément Goubert: [C: +1] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: Hnowlan)
[17:29:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602503 (fgiunchedi) I've bandaided the issue on alert2001, we'll need a more proper fix:  ` # download-mibs # cd /var/lib/snmp && ln -s ../mibs `
[17:30:44] <claime>	 sukhe: Ah thanks I missed that diff.
[17:30:45] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602522 (Dzahn) There is this package on the alert hosts:   ` ii  snmp-mibs-downloader                 1.2                             all          install and manage Management Information B...
[17:31:03] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2011 - cmooney@cumin1002"
[17:31:08] <denisse>	 mutante: Thanks for your comments, we were indeed missing those files.
[17:31:11] <wikibugs>	 SRE, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602535 (fgiunchedi) Something else that didn't work well: the current version of `ircecho` doesn't seem to attempt reopening the files it is supposed to look for in `/var/log/icinga`. I...
[17:31:23] <denisse>	 I'll send a patch to automate it. :)
[17:31:48] <mutante>	 denisse: :))
[17:31:56] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2011 - cmooney@cumin1002"
[17:31:56] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:32:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:32:24] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: Hnowlan)
[17:32:27] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:32:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602540 (cmooney) >>! In T359198#9602522, @Dzahn wrote: > I guess the snmp-mibs-downloader just has to be automated to download stuff?  Yeah on its own that package installs but doesn't do a...
[17:33:21] <wikibugs>	 (CR) Cwhite: [C: +2] "Yep, we should definitely stop doing that." [puppet] - https://gerrit.wikimedia.org/r/1008833 (https://phabricator.wikimedia.org/T359153) (owner: Filippo Giunchedi)
[17:33:31] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: Hnowlan)
[17:34:24] <wikibugs>	 (Merged) jenkins-bot: k8s: Add getter for the Batch API [software/spicerack] - https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[17:35:11] <wikibugs>	 (Abandoned) Daniel Kinzler: Rest: allow Handlers to disable body parsing. [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008758 (https://phabricator.wikimedia.org/T357025) (owner: Daniel Kinzler)
[17:39:53] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[17:40:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[17:40:12] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:40:20] <inflatador>	 !log bking@prometheus1006 reload prometheus service as part of troubleshooting T358029
[17:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:24] <stashbot>	 T358029: Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team - https://phabricator.wikimedia.org/T358029
[17:40:25] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2011.codfw.wmnet on all recursors
[17:40:29] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2011.codfw.wmnet on all recursors
[17:40:54] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/1007703 (https://phabricator.wikimedia.org/T352920) (owner: Cathal Mooney)
[17:41:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[17:41:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[17:42:45] <wikibugs>	 (PS2) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636)
[17:43:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58526 and previous config saved to /var/cache/conftool/dbconfig/20240305-174339-arnaudb.json
[17:44:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[17:44:33] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1015.eqiad.wmnet with OS bullseye
[17:44:34] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[17:44:46] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1015.eqiad.wmnet with OS bullseye
[17:44:56] <wikibugs>	 (PS3) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636)
[17:45:04] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS bullseye
[17:45:19] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1016.eqiad.wmnet with OS bullseye
[17:46:11] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1017.eqiad.wmnet with OS bullseye
[17:46:38] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602586 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1017.eqiad.wmnet with OS bullseye
[17:46:42] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[17:46:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1018.eqiad.wmnet with OS bullseye
[17:47:22] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1019.eqiad.wmnet with OS bullseye
[17:47:27] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:48:03] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host...
[17:48:17] <wikibugs>	 (CR) Scott French: "Apologies in advance for the long commit message - wanted to make sure the tradeoffs w.r.t. replication index key are explicit. Happy to r" [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: Scott French)
[17:48:25] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1018.eqiad.wmnet with OS bullseye
[17:49:02] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602595 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1019.eqiad.wmnet with OS bullseye
[17:49:48] <wikibugs>	 (PS1) Btullis: Restrict the set of URLS serviced by Archiva [puppet] - https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031)
[17:53:02] <wikibugs>	 (CR) Btullis: "Currently testing this manually on archiva1002.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: Btullis)
[17:53:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602613 (Dzahn) Looks like it's:  `man 1 download-mibs` `download-mibs --help`   and the config is at `/etc/snmp-mibs-downloader/snmp-mibs-downloader.conf` which has some kind of "AUTOLOAD" c...
[17:55:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602620 (cmooney) >>! In T359198#9602613, @Dzahn wrote: > Looks like it's: >  > `man 1 download-mibs` > `download-mibs --help` >  >  and the config is at `/etc/snmp-mibs-downloader/snmp-mibs-...
[17:57:19] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage
[17:58:01] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage
[17:58:32] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage
[17:58:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58527 and previous config saved to /var/cache/conftool/dbconfig/20240305-175844-arnaudb.json
[17:59:59] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1800)
[18:00:09] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage
[18:00:22] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage
[18:00:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:02:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage
[18:04:43] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage
[18:06:28] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1014.eqiad.wmnet with OS bullseye
[18:06:59] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage
[18:07:06] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602700 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye exec...
[18:09:52] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage
[18:10:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage
[18:11:43] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:11:49] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:12:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:13:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58528 and previous config saved to /var/cache/conftool/dbconfig/20240305-181349-arnaudb.json
[18:13:52] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage
[18:15:31] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:17:59] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1015.eqiad.wmnet with OS bullseye
[18:18:13] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1015.eqiad.wmnet with OS bullseye comp...
[18:19:54] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1016.eqiad.wmnet with OS bullseye
[18:19:55] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:20:10] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9602719 (10bking)
[18:20:16] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1016.eqiad.wmnet with OS bullseye comp...
[18:22:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1019.eqiad.wmnet with OS bullseye
[18:22:27] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:22:35] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1019.eqiad.wmnet with OS bullseye comp...
[18:24:51] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1018.eqiad.wmnet with OS bullseye
[18:25:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[18:25:07] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1018.eqiad.wmnet with OS bullseye comp...
[18:26:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:27:34] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1017.eqiad.wmnet with OS bullseye
[18:27:49] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1017.eqiad.wmnet with OS bullseye comp...
[18:28:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1025
[18:28:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1025
[18:28:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[18:30:47] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye
[18:30:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602746 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs...
[18:31:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602756 (10cmooney) Reimage looks good, BGP up and lvs2011 handling traffic again: ` cmooney@cumin1002:~$ sud...
[18:37:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:37:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:37:42] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2103.codfw.wmnet onto db2203.codfw.wmnet
[18:37:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208 (10FBellamy-WMF)
[18:40:38] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:40:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:46:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:46:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:47:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2104.codfw.wmnet onto db2204.codfw.wmnet
[18:54:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:54:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:56:19] <Daimona>	 !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt
[18:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:23] <stashbot>	 T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007
[18:57:31] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.16.28:9042 on restbase1038 is OK: TCP OK - 0.039 second response time on 10.64.16.28 port 9042 https://phabricator.wikimedia.org/T93886
[18:59:33] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is OK: SSL OK - Certificate restbase1038-b valid until 2026-02-20 21:34:07 +0000 (expires in 717 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:00:05] <jouncebot>	 jnuche and dduvall: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1900).
[19:03:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:03:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:06:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#9602910 (10cmooney)
[19:06:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602909 (10cmooney) 05Open→03Resolved
[19:16:27] <wikibugs>	 06SRE, 10netops, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602958 (10andrea.denisse)
[19:17:54] <wikibugs>	 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Icinga Log Permission Conflict with Puppet Configuration - https://phabricator.wikimedia.org/T358539#9602963 (10andrea.denisse) 05Open→03Resolved
[19:17:57] <wikibugs>	 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602964 (10andrea.denisse)
[19:17:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:18:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:29:27] <houseblaster>	 Do we know what happened yesterday with the late UTC backport window / if today's backport window is good to go? Sorry if there is a better place to ask…
[19:30:20] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:31:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:31:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:41:48] <wikibugs>	 06SRE, 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9603028 (10Andrew) Notes from today's (unproductive) meeting:  We met with several Dell reps including an engineer n...
[19:44:07] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[19:46:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025']
[19:46:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1025']
[19:47:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:48:10] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:50:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:50:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:57:29] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: flink-zk reboots
[19:57:35] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on 6 hosts with reason: flink-zk reboots
[19:58:00] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: flink-zk reboots T356239
[19:58:06] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: flink-zk reboots T356239
[20:00:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:00:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:04:56] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[20:05:36] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[20:06:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[20:07:09] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[20:07:11] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[20:08:15] <wikibugs>	 06SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9603076 (10andrea.denisse) Thanks for your comments @ayounsi and @cmooney.  While Janitor looks promising, I believe {icon globe} [[ https://developers.google.com/apps-script | Google Apps Script ]]. would b...
[20:08:31] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[20:10:55] <wikibugs>	 (03PS2) 10Scott French: Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944
[20:11:43] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] partman: configure wdqs1025 partioning [puppet] - 10https://gerrit.wikimedia.org/r/1008943 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking)
[20:14:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[20:19:56] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[20:35:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: add ssldir_on_srv param for cloud-vps [puppet] - 10https://gerrit.wikimedia.org/r/1008940 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott)
[20:40:43] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) (owner: 10Gmodena)
[20:43:12] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9603154 (10Volans) 05Resolved→03Open a:03Volans Re-opening as AAAA records were erroneously added to the hosts (AAAA records:**N**). I'll remove them programmatically.
[20:46:07] <brett>	 !log Start rolling out updated fifo-log-demux and configuration to A:cp and A:ncredir - T355905
[20:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:12] <stashbot>	 T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905
[20:46:27] <brett>	 !log Disable puppet on A:cp and A:ncredir - T355905
[20:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:35] <wikibugs>	 (03CR) 10Bking: [C: 03+2] partman: configure wdqs1025 partioning [puppet] - 10https://gerrit.wikimedia.org/r/1008943 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking)
[20:50:05] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall)
[20:50:05] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.dns.netbox
[20:52:13] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - volans@cumin1002"
[20:52:34] <brett>	 !log upload fifo-log-demux 0.6.5 to bookworm-wikimedia
[20:52:35] <wikibugs>	 (03PS6) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455)
[20:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:40] <wikibugs>	 (03PS6) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455)
[20:52:48] <wikibugs>	 (03PS1) 10Andrew Bogott: puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951
[20:53:03] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - volans@cumin1002"
[20:53:03] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:53:54] <wikibugs>	 (03PS2) 10Andrew Bogott: puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951
[20:53:56] <wikibugs>	 (03PS7) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455)
[20:54:00] <wikibugs>	 (03PS7) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455)
[20:54:07] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es2035.codfw.wmnet es2036.codfw.wmnet es2037.codfw.wmnet es2038.codfw.wmnet es2039.codfw.wmnet es2040.codfw.wmnet on all recursors
[20:54:10] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es2035.codfw.wmnet es2036.codfw.wmnet es2037.codfw.wmnet es2038.codfw.wmnet es2039.codfw.wmnet es2040.codfw.wmnet on all recursors
[20:58:33] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9603243 (10Volans) 05Open→03Resolved Got the list of affected hosts with `nodeset -S '","' -e "es20[35-40]"` on a cumin host, then I run the following code on Netbox: `lang=...
[20:58:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951 (owner: 10Andrew Bogott)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T2100)
[21:00:04] <jouncebot>	 houseblaster, dbrant, MatmaRex, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:15] <MatmaRex>	 hi
[21:00:41] <dbrant>	 o/
[21:00:51] <houseblaster>	 hi!
[21:01:45] <wikibugs>	 (03PS1) 10Jdlrobson: Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164)
[21:01:45] <urbanecm>	 i can deploy
[21:01:49] <urbanecm>	 good evening everyone!
[21:01:57] <Jdlrobson>	 o/
[21:02:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810) (owner: 10Bartosz Dziewoński)
[21:02:27] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[21:02:29] <urbanecm>	 Jdlrobson: i see you uploaded a backport – do you want to do that in this window?
[21:02:47] <Jdlrobson>	 urbanecm: yep just added to calendar along with the config change
[21:03:01] <urbanecm>	 houseblaster: i see your patch is already merged (and supposedly deployed). is there anything else to do?
[21:03:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[21:03:21] <wikibugs>	 (03PS3) 10Urbanecm: Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant)
[21:03:28] <urbanecm>	 dbrant: going with your patch
[21:03:30] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant)
[21:04:16] <urbanecm>	 Jdlrobson: ah, thanks for the info. i didn't reload the calendar apparently. just double-checking, on the calendar you say wmf.20, but the patch is for wmf.21. can you confirm which version you want to backport to?
[21:04:20] <wikibugs>	 (03Merged) 10jenkins-bot: Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant)
[21:04:39] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-logging2001 is CRITICAL: SSL CRITICAL - Certificate kafka-logging2001.codfw.wmnet valid until 2024-03-12 21:04:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[21:04:46] <Jdlrobson>	 Sorry 1.42.0-wmf.21
[21:04:59] <houseblaster>	 Huh. Yesterday it was scheduled to be deployed, but was told it failed. Let me try testing it without debug enabled
[21:05:01] <Jdlrobson>	 (corrected)
[21:05:21] <wikibugs>	 (03PS3) 10Urbanecm: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson)
[21:05:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson)
[21:05:42] <urbanecm>	 Jdlrobson: no worries, just wanted to confirm because i hit the button :)
[21:05:50] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson)
[21:05:57] <Jdlrobson>	 🫡
[21:06:04] <houseblaster>	 Working. Nothing further to do, and sorry for the confusion! :)
[21:06:19] <urbanecm>	 houseblaster: no worries. thanks for confirming!
[21:06:35] <wikibugs>	 (03Merged) 10jenkins-bot: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson)
[21:07:20] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. (T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]]
[21:07:27] <stashbot>	 T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536
[21:07:28] <stashbot>	 T331679: Disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679
[21:10:29] <logmsgbot>	 !log urbanecm@deploy2002 jdlrobson and urbanecm and dbrant: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. (T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:10:41] <urbanecm>	 dbrant: Jdlrobson: can you test yours at mwdebug, please?
[21:10:45] <Jdlrobson>	 urbanecm: on it
[21:11:08] <dbrant>	 urbanecm: mine looks good!
[21:11:34] <urbanecm>	 ty!
[21:11:45] <JJMC89>	 that was tagged to the wrong task btw
[21:12:11] <Jdlrobson>	 urbanecm: LGTM please sync
[21:12:16] <urbanecm>	 ty
[21:12:17] <JJMC89>	 oh, nvm - it was two unrelated
[21:12:17] <urbanecm>	 proceeding
[21:12:20] <logmsgbot>	 !log urbanecm@deploy2002 jdlrobson and urbanecm and dbrant: Continuing with sync
[21:17:15] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[21:17:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025']
[21:20:01] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir5001.eqsin.wmnet
[21:20:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2107']
[21:20:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2107']
[21:21:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson)
[21:22:05] <urbanecm>	 wonderful
[21:22:07] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. (T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]] (duration: 14m 46s)
[21:22:11] <wikibugs>	 (03Merged) 10jenkins-bot: HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810) (owner: 10Bartosz Dziewoński)
[21:22:12] <stashbot>	 T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536
[21:22:12] <stashbot>	 T331679: Disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679
[21:22:24] <Jdlrobson>	 CI issue seems unrelated urbanecm 
[21:22:27] <urbanecm>	 22:02:49 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/InputBox
[21:22:29] <urbanecm>	 yeah, appears so
[21:22:33] <urbanecm>	 let's see what gate-and-submit will do
[21:25:32] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir5001.eqsin.wmnet
[21:25:42] <wikibugs>	 (03Merged) 10jenkins-bot: Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson)
[21:26:31] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]]
[21:26:37] <stashbot>	 T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164
[21:26:37] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir4001.ulsfo.wmnet
[21:26:37] <stashbot>	 T358810: Having <> in headings leads to errors - https://phabricator.wikimedia.org/T358810
[21:27:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025']
[21:28:01] <logmsgbot>	 !log urbanecm@deploy2002 matmarex and jdlrobson and urbanecm: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:28:31] <urbanecm>	 Jdlrobson: MatmaRex: can you test at mwdebug, please?
[21:29:09] <MatmaRex>	 looking
[21:30:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[21:30:28] <Jdlrobson>	 urbanecm: yep looking
[21:30:55] <MatmaRex>	 my change looks good
[21:31:30] <urbanecm>	 thanks for confirming MatmaRex 
[21:32:33] <Jdlrobson>	 urbanecm: LGTM please sync
[21:32:38] <urbanecm>	 ty, proceeding
[21:32:40] <logmsgbot>	 !log urbanecm@deploy2002 matmarex and jdlrobson and urbanecm: Continuing with sync
[21:38:07] <wikibugs>	 (03PS5) 10Ahmon Dancy: scap.cfg.erb: Settestservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117)
[21:38:21] <wikibugs>	 (03PS6) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117)
[21:41:11] <wikibugs>	 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603410 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/47  Dockerfile.deploy: Add httpbb
[21:41:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye
[21:41:53] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9603411 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye
[21:42:06] <wikibugs>	 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603412 (10CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/47  Dockerfile.deploy: Add httpbb
[21:42:21] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]] (duration: 15m 50s)
[21:42:26] <stashbot>	 T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164
[21:42:26] <stashbot>	 T358810: Having <> in headings leads to errors - https://phabricator.wikimedia.org/T358810
[21:42:28] <urbanecm>	 and deployed
[21:42:31] <urbanecm>	 anything else?
[21:45:47] <MatmaRex>	 thanks urbanecm!
[21:47:05] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[21:47:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025']
[21:48:10] <urbanecm>	 any time
[21:49:17] <brett>	 !log Remove fifo-log-demux from bookworm-wikimedia (dist version needs revision)
[21:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:39] <wikibugs>	 SRE, MW-on-K8s, Scap, serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603420 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/48  exp/files/php/scap.cfg: Set testservers_check_cmd_*...
[21:51:23] <Jdlrobson>	 thanks urbanecm for your help today!
[22:03:10] <brett>	 !log upload fifo-log-demux 0.6.5+deb12u1 to bookworm-wikimedia
[22:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:42] <brett>	 !log upload fifo-log-demux 0.6.5+deb11u1 to bullseye-wikimedia
[22:07:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025']
[22:18:49] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet
[22:19:55] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:43] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir4037.ulsfo.wmnet
[22:22:41] <volans>	 q
[22:25:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:26:11] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir4037.ulsfo.wmnet
[22:27:56] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[22:32:48] <jinxer-wm>	 (PuppetDisabled) firing: (2) Puppet disabled on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-test&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[22:33:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[22:34:53] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[22:35:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 41.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:37:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 60 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: T337013
[22:37:13] <stashbot>	 T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013
[22:37:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 60 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: T337013
[22:37:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:37:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:01:56] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1014.eqiad.wmnet with OS bullseye
[23:02:09] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9603617 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye exec...
[23:03:25] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is OK: TCP OK - 0.030 second response time on 10.64.16.32 port 9042 https://phabricator.wikimedia.org/T93886
[23:08:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:08:22] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:22:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:22:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:26:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:26:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:27:31] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.35:7000 on restbase1038 is OK: SSL OK - Certificate restbase1038-c valid until 2026-02-20 21:34:09 +0000 (expires in 716 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[23:30:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:34:36] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9603722 (cmooney) Open→Resolved a:cmooney
[23:34:41] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603724 (cmooney)
[23:35:25] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9603728 (cmooney)
[23:35:29] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603729 (cmooney)
[23:35:48] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:35:49] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603725 (cmooney) Open→Resolved a:cmooney Closing task.  Big thanks to all the SRE teams for the help and co-operation getting this o...
[23:35:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:42:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:42:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:45:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:45:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 46.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:45:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:46:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 39.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:48:10] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:48:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:50:59] <wikibugs>	 (PS1) RLazarus: deployment_server: Typo fix in mwscript_k8s.py [puppet] - https://gerrit.wikimedia.org/r/1008975 (https://phabricator.wikimedia.org/T341553)
[23:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:53:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye
[23:54:58] <wikibugs>	 (CR) RLazarus: [C: +2] deployment_server: Typo fix in mwscript_k8s.py [puppet] - https://gerrit.wikimedia.org/r/1008975 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)