[00:00:16] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:00:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:08:08] (03PS1) 10Krinkle: Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008474 [00:08:13] (03CR) 10Krinkle: [C: 03+2] Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008474 (owner: 10Krinkle) [00:08:55] (03Merged) 10jenkins-bot: Revert not-deployed "Profiler: Silence RedisException" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008474 (owner: 10Krinkle) [00:12:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2036.codfw.wmnet with reason: host reimage [00:13:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58416 and previous config saved to /var/cache/conftool/dbconfig/20240305-001345-arnaudb.json [00:13:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:13:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [00:14:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:14:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58417 and previous config saved to /var/cache/conftool/dbconfig/20240305-001408-arnaudb.json [00:14:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:14:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:15:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
es2036.codfw.wmnet with reason: host reimage [00:17:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2039.codfw.wmnet with reason: host reimage [00:17:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2038.codfw.wmnet with reason: host reimage [00:18:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2037.codfw.wmnet with reason: host reimage [00:18:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:18:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:19:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58418 and previous config saved to /var/cache/conftool/dbconfig/20240305-001918-arnaudb.json [00:19:23] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [00:20:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: host reimage [00:21:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:21:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:22:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2038.codfw.wmnet with reason: host reimage [00:25:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2037.codfw.wmnet with reason: host reimage [00:29:36] (03PS1) 10Dzahn: ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [00:29:57] (03PS2) 10Dzahn: ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [00:30:22] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2040.codfw.wmnet with reason: host reimage [00:30:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:31:15] (03CR) 10CI reject: [V: 04-1] ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [00:33:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2040.codfw.wmnet with reason: host reimage [00:34:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:34:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2036.codfw.wmnet with OS bookworm [00:34:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P58419 and previous config saved to /var/cache/conftool/dbconfig/20240305-003425-arnaudb.json [00:34:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:38:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:38:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008081 [00:38:40] !log jhancock@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2039.codfw.wmnet with OS bookworm [00:38:42] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008081 (owner: 10TrainBranchBot) [00:40:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:40:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2038.codfw.wmnet with OS bookworm [00:41:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:42:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:42:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2037.codfw.wmnet with OS bookworm [00:43:07] (03PS3) 10Dzahn: ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [00:44:28] (03CR) 10CI reject: [V: 04-1] ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [00:46:12] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:48:03] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:48:39] (03PS4) 10Dzahn: ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [00:49:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P58420 and previous config saved to /var/cache/conftool/dbconfig/20240305-004931-arnaudb.json [00:50:01] (03CR) 10CI reject: [V: 04-1] ci_test: include scap::ferm class directly [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [00:52:12] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:55:49] !log contint1003 -rebooting [00:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008081 (owner: 10TrainBranchBot) [01:04:02] (03PS5) 10Dzahn: ci_test: switch firewall::provider to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [01:04:34] (03PS6) 10Dzahn: ci_test: switch firewall::provider to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) [01:04:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T357189)', diff saved to https://phabricator.wikimedia.org/P58421 and previous config saved to 
/var/cache/conftool/dbconfig/20240305-010438-arnaudb.json [01:04:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:04:42] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [01:04:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:05:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58422 and previous config saved to /var/cache/conftool/dbconfig/20240305-010459-arnaudb.json [01:10:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58423 and previous config saved to /var/cache/conftool/dbconfig/20240305-011008-arnaudb.json [01:10:12] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [01:10:14] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:10:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2035.codfw.wmnet with OS bookworm [01:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:12:49] (03CR) 10Dzahn: [C: 03+2] ci_test: switch firewall::provider to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1008576 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [01:13:34] (03PS2) 10Jdlrobson: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) [01:17:48] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:21:29] (03PS1) 10Dzahn: ci_test: include profile::firewall in test role [puppet] - 10https://gerrit.wikimedia.org/r/1008579 [01:21:57] (03CR) 10CI reject: [V: 04-1] ci_test: include profile::firewall in test role [puppet] - 10https://gerrit.wikimedia.org/r/1008579 (owner: 10Dzahn) [01:22:45] (03PS2) 10Dzahn: ci_test: include profile::firewall in test role [puppet] - 10https://gerrit.wikimedia.org/r/1008579 [01:23:40] (03PS3) 10Dzahn: ci_test: include profile::firewall in test role [puppet] - 10https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) [01:24:26] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:25:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P58424 and previous config saved to /var/cache/conftool/dbconfig/20240305-012514-arnaudb.json [01:26:29] (03PS4) 10Dzahn: ci_test: include profile::firewall in test role [puppet] - 10https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) [01:27:22] (03CR) 10Dzahn: [C: 03+2] "wrong provider name -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008576" [puppet] - 10https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [01:27:38] (03CR) 10Dzahn: [V: 03+1 C: 03+2] 
"https://puppet-compiler.wmflabs.org/output/1008579/1582/contint1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [01:31:20] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:31:52] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:32:02] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "@jnuche deploy2002 can now ssh to contint1003. You can try to scap zuul again." [puppet] - 10https://gerrit.wikimedia.org/r/1008579 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [01:40:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P58425 and previous config saved to /var/cache/conftool/dbconfig/20240305-014020-arnaudb.json [01:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:55:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T357189)', diff saved to https://phabricator.wikimedia.org/P58426 and previous config saved to /var/cache/conftool/dbconfig/20240305-015527-arnaudb.json [01:55:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:55:31] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [01:55:45] 
!log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [01:55:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58427 and previous config saved to /var/cache/conftool/dbconfig/20240305-015550-arnaudb.json [02:00:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58428 and previous config saved to /var/cache/conftool/dbconfig/20240305-020049-arnaudb.json [02:00:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [02:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:15:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P58429 and previous config saved to /var/cache/conftool/dbconfig/20240305-021556-arnaudb.json [02:31:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P58430 and previous config saved to /var/cache/conftool/dbconfig/20240305-023102-arnaudb.json [02:34:03] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 72027072 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:35:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 127248 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:38:04] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T357189)', diff saved to https://phabricator.wikimedia.org/P58431 and previous config saved to /var/cache/conftool/dbconfig/20240305-024608-arnaudb.json [02:46:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [02:46:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [02:46:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [02:46:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [02:46:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [02:46:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58432 and previous config saved to /var/cache/conftool/dbconfig/20240305-024657-arnaudb.json [02:52:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58433 and previous config saved to /var/cache/conftool/dbconfig/20240305-025212-arnaudb.json [02:52:16] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [02:58:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:25] 
(SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0300) [03:01:33] (03PS1) 10RLazarus: k8s: Add getter for the Batch API [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) [03:02:07] (03PS1) 10RLazarus: sre.switchdc.mediawiki: Stop maintenance scripts on Kubernetes [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) [03:07:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P58434 and previous config saved to /var/cache/conftool/dbconfig/20240305-030719-arnaudb.json [03:07:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439) [03:07:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [03:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:22:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P58435 and previous config saved to /var/cache/conftool/dbconfig/20240305-032225-arnaudb.json [03:28:11] (03Merged) 
10jenkins-bot: Branch commit for wmf/1.42.0-wmf.21 [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008082 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [03:37:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T357189)', diff saved to https://phabricator.wikimedia.org/P58436 and previous config saved to /var/cache/conftool/dbconfig/20240305-033732-arnaudb.json [03:37:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [03:37:37] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [03:37:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [03:37:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58437 and previous config saved to /var/cache/conftool/dbconfig/20240305-033755-arnaudb.json [03:42:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:42:52] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:46:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58438 and previous config saved to /var/cache/conftool/dbconfig/20240305-034614-arnaudb.json [03:46:18] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0400) [04:01:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P58439 and 
previous config saved to /var/cache/conftool/dbconfig/20240305-040120-arnaudb.json [04:06:36] (03PS1) 10Jdlrobson: Partial Revert "Set background/color to inherit for common templates" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164) [04:16:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P58440 and previous config saved to /var/cache/conftool/dbconfig/20240305-041626-arnaudb.json [04:31:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T357189)', diff saved to https://phabricator.wikimedia.org/P58441 and previous config saved to /var/cache/conftool/dbconfig/20240305-043133-arnaudb.json [04:31:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [04:31:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [04:31:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [04:31:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58442 and previous config saved to /var/cache/conftool/dbconfig/20240305-043155-arnaudb.json [04:33:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:33:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:37:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58443 and previous config saved to /var/cache/conftool/dbconfig/20240305-043718-arnaudb.json [04:37:22] T357189: Drop iwl_prefix_from_title 
from iwlinks - https://phabricator.wikimedia.org/T357189 [04:46:43] PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [04:47:43] RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 1 process with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [04:47:52] * kart_ deploying cxserver.. [04:48:05] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-04-113412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [04:49:11] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-04-113412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [04:51:59] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:52:23] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [04:52:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P58444 and previous config saved to /var/cache/conftool/dbconfig/20240305-045225-arnaudb.json [05:01:11] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:01:43] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:02:47] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:03:23] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:07:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P58445 and previous config saved to /var/cache/conftool/dbconfig/20240305-050731-arnaudb.json [05:15:46] !log Updated cxserver to 2024-03-04-113412-production 
(T350773) [05:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:50] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773 [05:17:05] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:43] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:22:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T357189)', diff saved to https://phabricator.wikimedia.org/P58446 and previous config saved to /var/cache/conftool/dbconfig/20240305-052237-arnaudb.json [05:22:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [05:22:42] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [05:22:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [05:23:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58447 and previous config saved to /var/cache/conftool/dbconfig/20240305-052259-arnaudb.json [05:24:26] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:27:42] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58448 and previous config saved to /var/cache/conftool/dbconfig/20240305-052741-arnaudb.json [05:27:46] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [05:35:57] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:36:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:36:21] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:42:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P58449 and previous config saved to /var/cache/conftool/dbconfig/20240305-054247-arnaudb.json [05:45:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:45:46] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:48:31] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:51:18] (03PS1) 10Tim Starling: SwiftTooManyMediaUploads: use subtraction instead of increase() [alerts] - 10https://gerrit.wikimedia.org/r/1008590 [05:52:50] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:52:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[05:57:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P58450 and previous config saved to /var/cache/conftool/dbconfig/20240305-055754-arnaudb.json [06:04:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:04:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:13:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58451 and previous config saved to /var/cache/conftool/dbconfig/20240305-061300-arnaudb.json [06:13:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [06:17:49] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1412 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [06:18:04] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:19:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:51:03] (03PS1) 10Marostegui: installserver: Do not reimage es1040 [puppet] - 10https://gerrit.wikimedia.org/r/1008741 [06:55:40] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage es1040 [puppet] - 10https://gerrit.wikimedia.org/r/1008741 (owner: 10Marostegui) [06:57:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:57:16] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [06:59:26] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0700) [07:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0700). [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:12:55] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [07:15:03] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1024.eqiad.wmnet with OS bullseye [07:17:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:17:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:27:36] !log 
akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage [07:27:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [07:31:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage [07:32:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [07:32:37] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:32:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:33:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [07:33:52] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2016.codfw.wmnet with OS bullseye [07:36:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [07:48:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:48:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:49:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1024.eqiad.wmnet with OS bullseye [07:49:48] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2016.codfw.wmnet with reason: host reimage [07:52:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2016.codfw.wmnet with reason: host reimage [07:52:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::multiinstance [07:54:34] (03PS1) 10Muehlenhoff: Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 
(https://phabricator.wikimedia.org/T349619) [07:54:49] (03CR) 10Marostegui: [C: 03+1] Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:57:53] (03CR) 10Volans: [C: 03+1] "LGTM." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [08:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:38] (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::misc::multiinstance to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1008806 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:02:23] (03CR) 10Volans: [C: 03+1] k8s: Add getter for the Batch API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [08:09:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::multiinstance [08:12:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2016.codfw.wmnet with OS bullseye [08:14:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2018.codfw.wmnet with OS bullseye [08:30:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [08:30:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [08:30:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58452 and previous config saved to 
/var/cache/conftool/dbconfig/20240305-083028-arnaudb.json [08:30:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2018.codfw.wmnet with reason: host reimage [08:30:32] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:33:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2018.codfw.wmnet with reason: host reimage [08:35:16] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2017.codfw.wmnet with OS bullseye [08:36:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58453 and previous config saved to /var/cache/conftool/dbconfig/20240305-083621-arnaudb.json [08:36:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:47:25] !log add new disk to titan2001 /srv - T359068 [08:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:28] T359068: Not enough space on titan hosts for thanos-compact - https://phabricator.wikimedia.org/T359068 [08:51:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58454 and previous config saved to /var/cache/conftool/dbconfig/20240305-085128-arnaudb.json [08:51:30] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2017.codfw.wmnet with reason: host reimage [08:52:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2018.codfw.wmnet with OS bullseye [08:54:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2017.codfw.wmnet with reason: host reimage [08:56:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2019.codfw.wmnet with OS bullseye [09:00:04] jnuche and dduvall: 
#bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0900). [09:00:29] morning, train and backports are currently blocked by T359114 [09:00:30] T359114: Slow and failed deployments - https://phabricator.wikimedia.org/T359114 [09:06:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58455 and previous config saved to /var/cache/conftool/dbconfig/20240305-090634-arnaudb.json [09:08:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2117.codfw.wmnet [09:11:59] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2019.codfw.wmnet with reason: host reimage [09:12:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2117 T359141', diff saved to https://phabricator.wikimedia.org/P58456 and previous config saved to /var/cache/conftool/dbconfig/20240305-091244-marostegui.json [09:12:49] T359141: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141 [09:12:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2017.codfw.wmnet with OS bullseye [09:13:24] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:14:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2019.codfw.wmnet with reason: host reimage [09:15:18] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2117.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:16:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2117.codfw.wmnet decommissioned, 
removing all IPs except the asset tag one - marostegui@cumin1002" [09:16:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:16:20] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2117.codfw.wmnet [09:18:04] (03CR) 10Muehlenhoff: [C: 03+2] puppetboard: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1008513 (owner: 10Muehlenhoff) [09:21:02] (03PS1) 10Slyngshede: P:openldap::management Unbreak cross validation script. [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142) [09:21:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T357189)', diff saved to https://phabricator.wikimedia.org/P58457 and previous config saved to /var/cache/conftool/dbconfig/20240305-092140-arnaudb.json [09:21:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:21:45] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:21:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:21:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142) (owner: 10Slyngshede) [09:22:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58458 and previous config saved to /var/cache/conftool/dbconfig/20240305-092202-arnaudb.json [09:22:06] (03CR) 10Slyngshede: [C: 03+2] P:openldap::management Unbreak cross validation script. [puppet] - 10https://gerrit.wikimedia.org/r/1008809 (https://phabricator.wikimedia.org/T359142) (owner: 10Slyngshede) [09:23:18] (03CR) 10Muehlenhoff: "Hmmh, good point. 
There's no good reason for conntrack to be absented along with iptables if the firewall provider doesn't use "ferm". Thi" [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah) [09:23:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [09:23:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [09:23:51] (03PS1) 10Muehlenhoff: Install conntrack via profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1008814 [09:24:11] (03PS1) 10Marostegui: mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141) [09:24:21] (03CR) 10Muehlenhoff: "Alternative patch proposal at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008814" [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah) [09:24:27] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:29] (03CR) 10Arnaudb: [C: 03+1] mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141) (owner: 10Marostegui) [09:24:37] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2117 [puppet] - 10https://gerrit.wikimedia.org/r/1008815 (https://phabricator.wikimedia.org/T359141) (owner: 10Marostegui) [09:24:45] (03PS10) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 
(https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [09:25:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff) [09:25:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff) [09:25:25] (03CR) 10Majavah: [V: 03+1 C: 03+1] "LGTM, I don't see any dependencies on conntract that would cause issues on hosts without a firewall atm." [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff) [09:26:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Install conntrack via profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1008814 (owner: 10Muehlenhoff) [09:26:53] (03Abandoned) 10Majavah: conntrackd: fix CLI installation [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah) [09:27:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58459 and previous config saved to /var/cache/conftool/dbconfig/20240305-092721-arnaudb.json [09:27:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:27:48] (03CR) 10Volans: "Did a first pass on the code only, once we finalize the code I'll pass to the tests" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [09:28:04] (JobUnavailable) firing: (2) Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:32:52] (03PS1) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 [09:33:30] !log akosiaris@cumin1002 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2019.codfw.wmnet with OS bullseye [09:33:38] 06SRE, 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141#9599660 (10MoritzMuehlenhoff) >>! In T359141#9599610, @Marostegui wrote: > @Volans @MoritzMuehlenhoff is anything else required in this situation? I think that's fine... [09:33:44] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2019.codfw.wmnet with OS bullseye comp... [09:34:31] 06SRE, 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2117.codfw.wmnet - https://phabricator.wikimedia.org/T359141#9599661 (10Marostegui) Thanks! @Jhancock.wm see above, you can proceed whenever you want. [09:38:04] (JobUnavailable) resolved: (2) Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:41:43] (03PS2) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 [09:42:03] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2020.codfw.wmnet with OS bullseye [09:42:21] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2020.codfw.wmnet with OS bullseye [09:42:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58460 and previous 
config saved to /var/cache/conftool/dbconfig/20240305-094228-arnaudb.json [09:43:32] jnuche: I 'll need another 30 minutes or so and I 'll throw some 200 CPUs at the 2 wikikube clusters unblocking the train [09:44:12] akosiaris: sounds good, thank you so much [09:52:56] (03CR) 10Majavah: [C: 03+2] P:toolforge: image_builder: refresh for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006516 (https://phabricator.wikimedia.org/T358483) (owner: 10Majavah) [09:53:02] (03PS3) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [09:54:27] (03PS4) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [09:56:42] (03PS5) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [09:57:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58461 and previous config saved to /var/cache/conftool/dbconfig/20240305-095734-arnaudb.json [09:58:04] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2020.codfw.wmnet with reason: host reimage [10:02:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2020.codfw.wmnet with reason: host reimage [10:04:53] !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host [10:04:54] !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 01s) [10:06:53] !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host [10:07:17] !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 24s) [10:08:11] !og installing glib2.0 security updates [10:11:18] !log homer commit T358752 [10:11:21] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:21] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 [10:12:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58462 and previous config saved to /var/cache/conftool/dbconfig/20240305-101241-arnaudb.json [10:12:45] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:16:40] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:37] (03PS1) 10Jaime Nuche: ci_test: do not remove python 2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237) [10:17:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 342, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:31] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:32] (03CR) 10JMeybohm: [C: 03+1] k8s: Add getter for the Batch API [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [10:21:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2020.codfw.wmnet with OS bullseye [10:21:16] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 
others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9599881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2020.codfw.wmnet with OS bullseye comp... [10:21:47] (03PS1) 10Arnaudb: mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) [10:21:54] !log uncordon parse20{16..20}.codfw.wmnet T358752 [10:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:57] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 [10:22:55] !log uncordon parse10{20..24}.eqiad.wmnet parse10{10..12}.eqiad.wmnet T358752 [10:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] jnuche: I think you are clear. [10:23:40] akosiaris: thanks again, I'll deploy in the next few minutes [10:24:03] (03CR) 10Ladsgroup: [C: 03+1] mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [10:24:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche) [10:25:04] jnuche: I can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008823 if that unblocks you? [10:25:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58463 and previous config saved to /var/cache/conftool/dbconfig/20240305-102516-root.json [10:25:32] (03PS4) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [10:25:42] moritzm: definitely, thank you! [10:25:56] (03CR) 10Ayounsi: "Thanks, reply inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [10:27:03] (03CR) 10Muehlenhoff: [C: 03+2] ci_test: do not remove python 2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008823 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche) [10:28:35] (03CR) 10Marostegui: [C: 03+1] "Remember to add it to zarcillo database" [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [10:29:43] akosiaris, moritzm: since you are around, can either of you kill process 3272 on deploy2002? I don't have permissions and that process is holding a scap lock at the moment [10:34:03] jnuche: doing [10:34:35] jnuche: done [10:34:46] claime: thx 👍 [10:36:16] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439) [10:36:21] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [10:37:06] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008826 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [10:39:38] mmmh, deploy failed, it seems I still need to run the presync first [10:40:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58464 and previous config saved to /var/cache/conftool/dbconfig/20240305-104021-root.json [10:41:10] jnuche: merged the patch and forced a puppet run on contint1003 [10:41:21] danke [10:50:21] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: service=kubesvc,name=parse2.* [10:50:32] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubesvc,name=parse2.* [10:50:45] !log akosiaris@cumin1002 conftool 
action : set/weight=10; selector: service=kubesvc,name=parse1.* [10:51:02] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubesvc,name=parse1.* [10:53:55] !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.18 (duration: 03m 25s) [10:55:08] jouncebot: nowandnext [10:55:08] For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T0900) [10:55:08] In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1100) [10:55:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58465 and previous config saved to /var/cache/conftool/dbconfig/20240305-105526-root.json [10:56:25] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.21 refs T354439 [10:56:29] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [10:58:58] the train deploy is going to overlap with the MW infrastructure window starting in 2 minutes. 
apologies if that causes any disruption [10:59:26] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:27] I'm currently running the presync, once that's done I can hold the actual deploy to group0 if necessary [10:59:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [10:59:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [10:59:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58466 and previous config saved to /var/cache/conftool/dbconfig/20240305-105950-ladsgroup.json [10:59:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1100) [11:10:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58467 and previous config saved to /var/cache/conftool/dbconfig/20240305-111031-root.json [11:13:47] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9599986 (10ayounsi) I'd recommend to start by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideall... 
[11:13:55] (03PS1) 10Alexandros Kosiaris: mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) [11:15:24] (03CR) 10Kamila Součková: [C: 03+1] mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:15:40] (03PS1) 10Jaime Nuche: Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829 [11:15:48] (03CR) 10Jaime Nuche: [C: 03+2] Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829 (owner: 10Jaime Nuche) [11:15:56] (03CR) 10Clément Goubert: [C: 03+1] mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:16:09] morning wikibugs :D [11:16:20] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.42.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008829 (owner: 10Jaime Nuche) [11:16:52] (03CR) 10Arnaudb: [C: 03+2] mariadb: add db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008083 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [11:17:28] (03PS1) 10Urbanecm: wikimaniawiki: Update logos to 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) [11:19:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:19:41] (03Merged) 10jenkins-bot: mw-parsoid: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008828 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:20:41] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439) [11:20:49] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [11:20:57] (03CR) 10Hnowlan: [C: 03+1] APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:21:05] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008831 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [11:21:13] (03PS11) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:22:01] (03PS12) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:22:41] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:22:53] 06SRE, 10MW-on-K8s, 06Release-Engineering-Team, 06Traffic, 06serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9600053 
(10Clement_Goubert) [11:23:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:24:19] (03PS11) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [11:25:36] (03CR) 10Filippo Giunchedi: "I'm sure I lack context though it seems the kafka PKI defaults to 1y expiration and we'll reduce it here to 1mo ?" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [11:30:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58469 and previous config saved to /var/cache/conftool/dbconfig/20240305-113027-ladsgroup.json [11:30:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:30:40] !log jnuche@deploy2002 sync-world aborted: testwikis wikis to 1.42.0-wmf.21 refs T354439 (duration: 34m 15s) [11:30:49] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [11:32:15] (03PS1) 10Filippo Giunchedi: webperf: move statsv metrics to prometheus 'ext' only [puppet] - 10https://gerrit.wikimedia.org/r/1008833 (https://phabricator.wikimedia.org/T359153) [11:32:26] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:32:39] (03PS13) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:32:44] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:33:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:33:50] 
06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9600154 (10JMeybohm) To clarify why this happened/happens: kubemaster2001 refreshed the certs used by the apiserver in one puppet run at ~00:51: ` Mar 1 00:51:28 Exec[renew cer... [11:34:06] (03CR) 10Klausman: [C: 03+2] APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:34:14] (03Merged) 10jenkins-bot: APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:34:52] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9600169 (10jcrespo) Replacing the cable can be done any time between 6:00 and 23:55 UTC. Let me know if it will be for a period of extended time so I can downtime it. If it needs hard down let me know in advance so I can... [11:36:41] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600195 (10cmooney) >>! In T358658#9598742, @odimitrijevic wrote: > Yes, approved Thanks Olja. Just to update I've been working with KC on this and we... 
[11:36:53] (03PS14) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:37:01] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [11:38:18] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:38:39] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:39:23] (03PS1) 10KartikMistry: Update cxserver to 2024-03-05-082211-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008836 (https://phabricator.wikimedia.org/T353136) [11:42:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:42:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:42:40] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:42:56] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:45:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58471 and previous config saved to /var/cache/conftool/dbconfig/20240305-114533-ladsgroup.json [11:46:02] (03PS2) 10Klausman: APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) [11:46:12] (03PS1) 10Alexandros Kosiaris: eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752) [11:47:26] (03CR) 10Hnowlan: [C: 03+1] APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 
(https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:51:03] 06SRE, 06Machine-Learning-Team, 13Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516#9600294 (10klausman) 05Open→03Resolved [11:52:18] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9600321 (10Clement_Goubert) >>! In T358117#9598846, @dancy wrote: > @Clement_Goubert We have some questions: > 1) Does `mwdebug.discovery.wmnet` resolve to a random... [11:52:42] jnuche: It's timing out again? [11:52:47] (03CR) 10Klausman: [C: 03+2] APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:52:52] akosiaris, claime: I ran into another timeout will deploying to mw-on-k8s, testservers now: https://phabricator.wikimedia.org/T359155 [11:52:57] yep [11:53:12] s/will/while [11:53:19] jnuche: all right I'll revert a patch quickly, see if it improves things [11:53:27] thx [11:53:37] (03Merged) 10jenkins-bot: APIGW: double up host header elemement for art-desc on LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008840 (https://phabricator.wikimedia.org/T358654) (owner: 10Klausman) [11:54:08] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:54:48] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:56:34] (03PS1) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) [11:57:27] (03PS2) 10EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) [11:57:42] !log klausman@deploy2002 helmfile [codfw] START 
helmfile.d/services/api-gateway: apply [11:58:15] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:00:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58472 and previous config saved to /var/cache/conftool/dbconfig/20240305-120040-ladsgroup.json [12:01:20] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [12:02:02] (03PS15) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:02:25] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:05:22] (03PS1) 10Clément Goubert: mw-api-int: Scale down to 206 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008843 [12:07:04] jnuche: I'll try a scap no-build k8s only deployment because we're not finding a root cause [12:07:06] !log cgoubert@deploy2002 Started scap: (no justification provided) [12:07:18] ack [12:08:07] jnuche: what's the image version that was supposed to be deployed by your earlier deployment [12:08:25] (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney) [12:08:30] (03PS16) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:08:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:09:35] claime: judging by https://phabricator.wikimedia.org/P58470 then I think `docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-05-110738-webserver`: [12:09:40] https://www.irccloud.com/pastebin/E0acriJs/ [12:09:45] (03CR) 10CI reject: [V: 04-1] cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:11:05] or `docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2024-03-05-105734-publish` [12:11:11] (03PS1) 10Krinkle: Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756) [12:12:23] (03PS17) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:13:14] ty [12:14:21] (03CR) 10Volans: [C: 04-1] "There are some issues in its current format, see details inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [12:14:26] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:15:04] jnuche: So what is happening right now is that it didn't redeploy anything on mw-debug and mw-mis [12:15:07] misc* [12:15:17] they're still on 2024-02-29-215143 [12:15:35] But it is deploying 2024-03-05-110738 to all the other deployments [12:15:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T343718)', diff saved to https://phabricator.wikimedia.org/P58473 and previous config saved to /var/cache/conftool/dbconfig/20240305-121546-ladsgroup.json [12:16:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:16:27] <_joe_> claime: same problem for the actual mediawiki images [12:16:46] <_joe_> I'm looking at /etc/helmfile-defaults/mediawiki/release/mw-debug-pinkunicorn.yaml [12:16:49] /etc/helmfile-defaults/mediawiki/release/mw-debug-pinkunicorn.yaml and /etc/helmfile-defaults/mediawiki/release/mw-api-int-canary.yaml have different versions [12:16:51] exactly [12:17:15] claime: that's odd, I canceled before it could get past mw-debug and misc [12:17:31] <_joe_> jnuche: the problem is scap [12:17:35] !log cgoubert@deploy2002 scap failed: KeyError 'canaries' (duration: 10m 29s) [12:17:40] aaaaan it failed [12:17:51] <_joe_> scap seems not to be updating releases with debug: true [12:18:00] <_joe_> since the 29th of february [12:18:13] <_joe_> I'd go look at the code released around that date [12:18:39] ah, maybe it's the rollback? 
scap did perform the rollback for debug [12:18:49] ok, gonna look into late scap changes [12:18:51] <_joe_> it's possible [12:19:00] <_joe_> let me look at the git history [12:19:18] (03CR) 10JMeybohm: mw-mcrouter: update namespace resource limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [12:19:36] What is *not* a scap bug is the fact we can't deploy canary releases [12:19:40] <_joe_> jnuche: yes, you're right [12:19:42] because the helmfile times out [12:20:11] <_joe_> jnuche: but why rollback to a version that is so old [12:20:29] <_joe_> ah because it was the previous functioning one [12:20:44] <_joe_> claime: and why is helmfile timing out? [12:20:51] that's what we're trying to find out [12:21:02] I'm in videochat with akosiaris rn, we're looking [12:21:06] kubernetes events are empty of anything useful btw [12:21:21] <_joe_> sigh [12:21:32] <_joe_> and this has been happening since yesterday? [12:22:28] yesterday it got past the testservers AFAIK, the deployment timed out for parsoid [12:22:49] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:23:59] mw-api-ext.codfw.canary-f6c699fb7-hhhll 7/9 CrashLoopBackOff 2 (29s ago) 49s [12:24:15] <_joe_> ok that doesn't look good [12:24:57] [05-Mar-2024 12:24:27] ERROR: [/etc/php/7.4/fpm/php-fpm.conf:15] Array are not allowed in the global section [12:24:57] [05-Mar-2024 12:24:27] ERROR: failed to load configuration file '/etc/php/7.4/fpm/php-fpm.conf' [12:24:57] [05-Mar-2024 12:24:27] ERROR: FPM initialization failed [12:25:03] found it in the logs of the application [12:25:12] <_joe_> ok, what changed there? [12:25:27] effie: ^ [12:25:35] any chance this has something to do with mcrouteR? 
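The php-fpm error quoted above says that an array-style directive (a key of the form `name[index] = value`) was found in the `[global]` section, where php-fpm only accepts plain keys. A minimal sketch of a pre-deploy check for that pattern follows. The function name, the regular expressions, and the sample config are illustrative assumptions, not part of scap or the production images, and the log does not show the actual contents of `/etc/php/7.4/fpm/php-fpm.conf`.

```python
import re

# "[section]" header on a line of its own.
SECTION_RE = re.compile(r"^\[([^\]]+)\]\s*$")
# Array-style key such as "env[NAME] = value" or "php_admin_value[x] = y".
ARRAY_KEY_RE = re.compile(r"^([A-Za-z_][\w.]*)\[([^\]]*)\]\s*=")


def find_global_array_directives(conf_text):
    """Return (line_number, line) pairs for array-style keys under [global]."""
    section = None
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and ';' comments.
        if not line or line.startswith(";"):
            continue
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1).strip().lower()
            continue
        if section == "global" and ARRAY_KEY_RE.match(line):
            hits.append((lineno, line))
    return hits


# Hypothetical sample config; the variable name and values are made up.
SAMPLE = """\
[global]
error_log = /dev/stderr
env[EXAMPLE_VAR] = example

[www]
env[EXAMPLE_VAR] = example
"""

print(find_global_array_directives(SAMPLE))
# -> [(3, 'env[EXAMPLE_VAR] = example')]
```

Only the entry under `[global]` is reported; the same key inside a pool section such as `[www]` is left alone. A check like this could run against the rendered config at image build time, though where exactly it would fit in the build pipeline is not something the log settles.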
[12:26:30] yes it does [12:26:39] <_joe_> env['MCROUTER_SERVER'] = ${MW__MCROUTER_SERVER} [12:26:41] <_joe_> yep [12:26:44] but this change in the image was merged days ago [12:26:57] <_joe_> effie: but we only use a new image when there is a release [12:27:01] ^ [12:27:02] <_joe_> when did you make your change? [12:27:07] last week [12:27:14] <_joe_> yeah, checks out [12:27:22] <_joe_> let's revert that quickly [12:27:24] ok let me revert this [12:27:36] <_joe_> effie: bump the image version in the changelog [12:27:50] <_joe_> and claime, we'll need to rebuild from scratch the mediawiki image [12:28:20] (03PS1) 10Effie Mouzeli: Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 [12:28:22] <_joe_> (in scap, I mean) [12:28:23] _joe_: that should be done by scap once we've update the php-fpm image [12:28:37] <_joe_> it auto-detects? uhmmm [12:28:44] (03CR) 10Clément Goubert: [C: 03+1] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli) [12:28:45] <_joe_> anyways, you'll see pretty quickly [12:29:11] <_joe_> effie: you also need to bump the changelog [12:29:16] _joe_, claime: a dull rebuild can be forced with ` -Dfull_image_build:True ` [12:29:19] _joe_: I was going to [12:29:23] jnuche: thanks [12:29:24] s/dull/full/ [12:29:26] <_joe_> I gotta go lunch [12:32:57] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:34:55] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Scale down to 206 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008843 (owner: 10Clément Goubert) [12:35:27] (03PS2) 10Effie Mouzeli: Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 [12:35:49] (03Merged) 10jenkins-bot: mw-api-int: Scale down to 206 replicas [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1008843 (owner: 10Clément Goubert) [12:35:54] (03CR) 10Clément Goubert: [C: 03+1] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli) [12:35:59] !log eoghan@cumin1002 START - Cookbook sre.hosts.decommission for hosts vrts1002.eqiad.wmnet [12:36:05] effie: want me to do the image rebuild etc.? [12:36:08] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli) [12:37:23] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Revert "php: add env[MCROUTER_SERVER] variable" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008753 (owner: 10Effie Mouzeli) [12:39:58] (03PS1) 10Jaime Nuche: ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237) [12:40:03] (03PS1) 10Jaime Nuche: ci_test.pp: remove explicit installation of Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008850 (https://phabricator.wikimedia.org/T358237) [12:40:20] claime: via scap you mean ? [12:40:25] yeah [12:40:35] once you're done with build-production-images [12:41:13] (03CR) 10CI reject: [V: 04-1] ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche) [12:41:29] <_joe_> claime: I can run it just for that image [12:41:41] <_joe_> if there's dangling images that fail to build [12:42:06] <_joe_> is anyone running it? 
[12:42:14] (03CR) 10Jgiannelos: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:42:41] I am on the build host [12:42:54] claime: I will do it no problem [12:42:58] ack [12:43:37] <_joe_> effie: then let me paste you the command to just rebuild that image [12:44:39] _joe_: I already run build-production-images [12:45:32] <_joe_> ah ok [12:46:13] !log eoghan@cumin1002 START - Cookbook sre.dns.netbox [12:46:43] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:47:34] (03Merged) 10jenkins-bot: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:48:03] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:48:06] moritzm: the previous patch for ci_test wasn't enough, the packages still need to be installed on the host. Could you take a look at these two followups?: [12:48:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008849 [12:48:06] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008850 [12:48:16] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:48:30] (if you have the time) [12:50:12] nemo-yiannis: can you wait a bit before actually deploying that change? 
[12:50:29] ok [12:50:35] we'd like to put the mw-on-k8s deployments back into a stable, all at the same version state before [12:51:11] I've also scaled back a bit from the 240 replicas, so I'd like to make sure I'm around to ramp up if needed, and right now I can't do that because our images are borked [12:51:34] !log eoghan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002" [12:51:53] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9600557 (10dr0ptp4kt) @VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of `wdqs1025.eqiad.wmnet`?... [12:51:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P58474 and previous config saved to /var/cache/conftool/dbconfig/20240305-125152-root.json [12:52:03] claime: is there a ticket to track when this work is going to be complete so I deploy after? 
[12:52:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Optimize revision table T354015 [12:52:22] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [12:52:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Optimize revision table T354015 [12:52:51] nemo-yiannis: https://phabricator.wikimedia.org/T359155#9600551 [12:52:52] !log eoghan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002" [12:52:52] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:53] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts vrts1002.eqiad.wmnet [12:54:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:54:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:54:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:54:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:56:03] (03PS1) 10Marostegui: installserver: Do not reimage db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008853 [12:56:59] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:57:08] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:59:54] (03PS1) 10Giuseppe Lavagetto: Fixes for rebuild of php-fpm 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008854 [13:00:01] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008853 (owner: 10Marostegui) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1300) [13:01:18] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600588 (10Jelto) >>! In T358658#9596119, @KCVelaga_WMF wrote: > @MoritzMuehlenhoff When I change my email to wikimedia.org for the developer account, I... [13:03:14] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fixes for rebuild of php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008854 (owner: 10Giuseppe Lavagetto) [13:04:26] (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:08] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [13:11:24] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [13:12:43] !log jiji@deploy2002 Started scap: (no justification provided) [13:17:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: Silence for cloning [13:17:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: Silence for cloning [13:17:47] (03PS1) 10Cory Massaro: Allow custom type converters in the orchestrator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 [13:18:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422 [13:18:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422 [13:18:21] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [13:18:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422 [13:18:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: provisionning db2217.codfw.wmnet - T355422 [13:19:04] (03PS2) 10Cory Massaro: Allow custom type converters in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) [13:21:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2151 in db2217 for T355422', diff saved to https://phabricator.wikimedia.org/P58475 and previous config saved to /var/cache/conftool/dbconfig/20240305-132106-arnaudb.json [13:21:18] (03Abandoned) 10Jaime Nuche: ci_test.pp: remove explicit installation of Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008850 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche) [13:21:48] (03Abandoned) 10Jaime Nuche: ci_test.pp: add missing Python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/1008849 (https://phabricator.wikimedia.org/T358237) (owner: 10Jaime Nuche) [13:24:27] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2151.codfw.wmnet onto db2217.codfw.wmnet [13:28:49] !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: test deployment for new host [13:28:54] !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: test deployment for new host (duration: 00m 04s) [13:28:58] jnuche: I am rebuilding still [13:29:02] I will let you know when it is done [13:29:19] effie: that wasn't a train deployment [13:29:21] nemo-yiannis ^ same [13:29:25] jnuche: I know :) [13:29:38] ah, silly coincidence, sry :) [13:29:52] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600741 (10cmooney) Taavi advised on IRC about the gerrit issue: > gerrit enforces that user emails are unique. they need to update the email on the ol... [13:31:21] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9600748 (10dr0ptp4kt) Originally, the thought was to be able to simply count relative volume of these types of inbound taps/clicks. Although we want fidelit... [13:33:08] !log running refreshImageMetadata.php on commons for Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:03] "Error: 502, Server Hangup at 2024-03-05 13:34:41 GMT" [13:35:05] :( [13:35:31] Bsadowski1: context? [13:35:31] !log jiji@deploy2002 Finished scap: (no justification provided) (duration: 22m 47s) [13:35:32] Bsadowski1: what url? 
[13:35:45] It was a checkuser request [13:35:50] https://login.wikimedia.org/wiki/Special:CheckUser [13:35:57] (steward action) [13:36:15] Can't check that, no perm :/ [13:36:15] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9600760 (10KCVelaga_WMF) Thanks @Jelto! GitLab works. I mistakenly assumed that updating the email at idm.wikimedia.org will get reflected across the bo... [13:36:45] Okay I retried the action and it seemed to work. [13:36:53] Weird. [13:37:01] Actually it has an explanation [13:37:32] Well... there are a ton of results for the range I checked.. [13:37:41] seldom used functions are sometimes not well optimized, so the db needs to warm up to succeed [13:37:49] yes, that would explain it [13:38:09] but on a second run it is possible that the data is in memory, succeeding [13:38:12] Maybe Dreamy_Jazz could help with CheckUser things [13:38:16] :D [13:38:20] hehe :) [13:38:29] Databases are cold-blooded animals [13:38:40] They need some warmth to function properly x) [13:38:52] I believe there are projects or tasks to make checkuser more... reliable? [13:39:28] it shouldn't be like this, but things that run often are noticed more often than functions that are only used occasionally, independently of the importance [13:40:00] yes, also I believe it is not a core feature, so it may not have as much support as other stuff [13:40:41] and with core I mean the technical meaning (it is an extension) not its importance [13:40:44] Ah [13:40:53] yep yep :) [13:41:45] my suggestion would be: if it is a fast query (e.g.
< 1 minute) try a couple of times, if it fails consistently (and there is no ongoing outage), file a task [13:44:58] (03CR) 10Jforrester: "I was just dropping the flag: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/141" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro) [13:45:44] jnuche: nemo-yiannis you are free to do whatever you wanted to do [13:46:04] effie: thx! [13:46:07] sorry for the trouble I caused [13:46:36] claime: should I try to go ahead with the train or there's something else you wanted to check/do first? [13:46:43] jnuche: nope, good on my end [13:46:46] effie: no worries :) [13:46:54] nemo-yiannis: please wait for the train, and then you're good to go [13:46:58] there's supposed to be a backport window in 15 minutes [13:47:03] augh. [13:47:56] PCS and backports should not conflict too much with the changes I made to maxSurge etc. [13:47:56] yeah, there's a patch there, unfortunately first we need to get the train stuff out of the way [13:48:02] But train needs to happen before backport [13:48:16] (it should be all right) [13:49:01] ok, doing the deed [13:49:13] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.21 refs T354439 [13:49:17] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [13:51:40] (03CR) 10Effie Mouzeli: "done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [13:52:23] (03PS6) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [13:58:25] (03PS1) 10Majavah: aptrepo: Drop apt.kubernetes.io updates [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169) [13:59:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 
(https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1400). [14:00:05] dbrant and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:18] waiting for jnuche to finish first, I assume [14:00:22] 👋 we ran into multiple issues with the train today and we are still running it, backports cannot happen at the moment, I'm sorry about that [14:00:32] ack [14:00:35] (03Abandoned) 10Cory Massaro: Allow custom type converters in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro) [14:00:40] (03CR) 10Cory Massaro: "Oh, nice. That's definitely a better solution! I'll close this one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008859 (https://phabricator.wikimedia.org/T359098) (owner: 10Cory Massaro) [14:00:51] do you think we’ll be able to do backports later in the window or will there not be enough time? 
[14:01:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169) (owner: 10Majavah) [14:01:09] there's also a good chance the train is gonna eat up the entire hour and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005161 will have to be rescheduled [14:01:37] (03CR) 10Majavah: [C: 03+2] aptrepo: Drop apt.kubernetes.io updates [puppet] - 10https://gerrit.wikimedia.org/r/1008864 (https://phabricator.wikimedia.org/T359169) (owner: 10Majavah) [14:01:43] ok [14:02:18] I'm ok with the backports happening after the window if need be [14:02:28] (03PS18) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [14:02:35] I'll have to juggle a bit with the network migration happening at 1600UTC [14:02:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [14:02:44] busy busy day [14:03:13] indeed [14:03:16] 10SRE-swift-storage, 10MediaWiki-File-management, 10media-backups: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#9601034 (10jcrespo) [14:04:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:04:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:04:54] (03PS1) 10Majavah: hieradata: update striker to 2024-02-28-214103-production [puppet] - 10https://gerrit.wikimedia.org/r/1008865 (https://phabricator.wikimedia.org/T358615) [14:05:27] (03PS2) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) [14:05:48] 
10SRE-swift-storage, 10MediaWiki-File-management, 10media-backups: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#9601050 (10jcrespo) [14:06:36] (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-02-28-214103-production [puppet] - 10https://gerrit.wikimedia.org/r/1008865 (https://phabricator.wikimedia.org/T358615) (owner: 10Majavah) [14:06:41] (03CR) 10Elukey: "IIUC the renew_seconds parameter should force puppet to renew the cert earlier, and the idea is to allow more time for an admin to perform" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [14:06:50] claime: ok [14:07:18] now we got past the testservers :) [14:07:27] https://www.irccloud.com/pastebin/ph4Rkix3/ [14:08:05] claime, akosiaris, effie, _joe_: thank you all for your help [14:08:10] cheers [14:08:11] \o/ [14:08:13] \o/ [14:09:34] (03PS4) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) [14:09:39] (03PS4) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) [14:09:44] (03PS4) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) [14:09:49] (03PS4) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [14:09:56] (03PS4) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [14:10:04] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 
(https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:11:34] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [14:12:33] (03PS1) 10Clément Goubert: mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 [14:13:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2151.codfw.wmnet onto db2217.codfw.wmnet [14:14:33] (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:14:40] PROBLEM - Check whether ferm is active by checking the default input chain on parse2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:15:20] (03PS1) 10Jelto: aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868 [14:15:48] (03PS2) 10Alexandros Kosiaris: Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) [14:15:53] (03PS2) 10Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) [14:15:58] (03PS2) 10Alexandros Kosiaris: Switch the remaining parsoid hosts to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392) [14:16:04] (03PS2) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392) [14:16:12] (03PS2) 
10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392) [14:16:20] (03PS2) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392) [14:16:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:16:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Switch restbase102[2345], restbase202[4567] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006895 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:16:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58476 and previous config saved to /var/cache/conftool/dbconfig/20240305-141649-arnaudb.json [14:17:22] (03PS2) 10Jelto: aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868 [14:17:34] (03CR) 10CI reject: [V: 04-1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:18:31] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:25] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422) [14:21:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on 
cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:22:46] (03CR) 10Clément Goubert: [C: 03+1] eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [14:24:27] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:52] ^it's lying it's fine [14:26:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:26:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:27:53] (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney) [14:28:23] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:28:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:29:17] (03PS1) 10EoghanGaffney: Revert "[vrts] Remove ticket-test.wm.o and vrts1002" [puppet] - 10https://gerrit.wikimedia.org/r/1008756 [14:29:27] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008868 (owner: 10Jelto) [14:29:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:30:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] 
mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert) [14:31:03] jnuche: almost there x0 [14:31:21] !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.21 refs T354439 (duration: 42m 08s) [14:31:22] yep yep yep [14:31:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:31:30] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [14:31:31] train presync done, rolling forward to group0 in a sec [14:31:47] (should be relatively fast) [14:31:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58477 and previous config saved to /var/cache/conftool/dbconfig/20240305-143154-arnaudb.json [14:31:59] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:32:23] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:32:36] (03CR) 10EoghanGaffney: [C: 03+2] Revert "[vrts] Remove ticket-test.wm.o and vrts1002" [puppet] - 10https://gerrit.wikimedia.org/r/1008756 (owner: 10EoghanGaffney) [14:32:51] jnuche: I took the liberty to attach to your screen, I've never watched a train rollout, hope you don't mind [14:33:09] no problemo :) [14:33:24] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002" [14:33:48] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439) [14:33:53] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [14:33:59] here we go [14:34:15] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002" [14:34:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:25] (03PS1) 10EoghanGaffney: [vrts] Remove vrts1002 reverences [puppet] - 10https://gerrit.wikimedia.org/r/1008872 [14:34:28] choo choo [14:34:31] (03PS3) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) [14:34:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:34:47] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008871 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [14:34:50] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:35:10] !log 
fabfur@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns2004.wikimedia.org with reason: T355873 [14:35:15] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [14:35:25] !log fabfur@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns2004.wikimedia.org with reason: T355873 [14:35:27] (03CR) 10Jelto: [C: 03+2] aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1008868 (owner: 10Jelto) [14:35:28] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:36:07] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:36:30] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:36:50] PROBLEM - cassandra-a CQL 10.64.16.28:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.28 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:37:12] !log depooling dns2004 for T355873 [14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:29] jnuche: I should have merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1008867 before you rolled forward, it would have made the deployment faster [14:37:41] I'll merge it right quick afterwards [14:37:47] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002" [14:37:48] !log fabfur@cumin2002 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org [14:37:59] ack [14:38:04] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:07] 06SRE, 06Content-Transform-Team, 
10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9601203 (10akosiaris) [14:38:26] Right now at every deployment we exceed our capacity by around 800CPUs because of maxSurge/maxUnavailable settings, which means more wait for containers to be ready, etc. [14:38:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for wqds1025 - cmooney@cumin1002" [14:38:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:31] (03CR) 10CI reject: [V: 04-1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:40:22] !log remove all but 1 host from parsoid@eqiad [14:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:29] hmm https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&viewPanel=23 [14:40:32] !log remove all but 1 host from parsoid@eqiad T358752 [14:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:40] PROBLEM - Check whether ferm is active by checking the default input chain on parse1020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:40:43] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 [14:40:56] 06SRE, 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601214 (10elukey) a:05klausman→03None [14:40:56] claime: hmmm [14:41:09] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment 
and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9601211 (10akosiaris) We at [~50% mw-parsoid](https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetr... [14:41:37] 06SRE, 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9601234 (10elukey) Removed Tobias as assignee so the new node can be initialized. [14:42:00] akosiaris: bump in captcha displayed at the same time [14:42:45] topranks: dns2004 is depooled and downtimed ready for T355873 [14:42:46] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [14:42:50] PROBLEM - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.32 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:43:02] fabfur: super thanks! [14:43:27] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601236 (10cmooney) >>! In T358727#9600557, @dr0ptp4kt wrote: > @VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish... [14:44:08] (03CR) 10Volans: "Nice! One typo and a small formatting issue, looks sane otherwise to me, but I'll leave to ServiceOps to review the helmfile command." [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:44:12] urandom: something going on with this restbase node? 
^^ [14:44:25] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-02-26-150614 to 2024-03-05-140533 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008874 (https://phabricator.wikimedia.org/T296937) [14:44:40] RECOVERY - Check whether ferm is active by checking the default input chain on parse2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:44:51] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.21 refs T354439 [14:44:55] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [14:44:58] group0 completed, give me a min to check a couple things [14:45:25] claime: cassandra appears to be running [14:45:52] PROBLEM - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:46:27] Condition check resulted in distributed storage system for structured data being skipped ? 
[14:46:54] (03PS19) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [14:46:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58480 and previous config saved to /var/cache/conftool/dbconfig/20240305-144658-arnaudb.json [14:47:03] claime: all done, you can go ahead with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1008867 if you want [14:47:03] 4 log entries for cassandra-b and -c since 14:16 today [14:47:08] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [14:47:10] jnuche: awesome thanks [14:47:29] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert) [14:48:21] (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db2217 [puppet] - 10https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [14:48:50] PROBLEM - cassandra-c CQL 10.64.16.35:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.35 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:48:53] (03Merged) 10jenkins-bot: mediawiki: Fine-tune deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008867 (owner: 10Clément Goubert) [14:49:19] !log cgoubert@deploy2002 Started scap: (no justification provided) [14:49:42] !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 00m 23s) [14:51:15] !log cgoubert@deploy2002 Started scap: (no justification provided) [14:51:50] PROBLEM - cassandra-c SSL 10.64.16.35:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or 
SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:52:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [14:52:09] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1038.eqiad.wmnet with reason: Bootstrapping — T354560 [14:52:12] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [14:52:23] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1038.eqiad.wmnet with reason: Bootstrapping — T354560 [14:53:42] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9601326 (10MoritzMuehlenhoff) @KCVelaga_WMF Can you try logging into https:/idm.wikimedia.org with your old account? Under "e-mail" you can click "Updat... 
[14:54:10] !log jnuche@deploy2002 Started deploy [zuul/deploy@cadc625]: test deployment for new host [14:55:54] PROBLEM - Check whether ferm is active by checking the default input chain on mw2260 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:56:34] !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 05m 18s) [14:57:01] ok we're good [14:57:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] eqiad: Move all but 1 parsoid node to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008841 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [14:57:12] dbrant, MatmaRex, Lucas_WMDE, you can proceed with backports [14:57:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:57:21] (03PS4) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) [14:57:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:57:24] sorry it took so long, we can overflow the window [14:58:04] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:10] thanks, i'm around if anyone can deploy [14:58:18] same [14:59:08] (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:59:25] o/ [14:59:28] jouncebot: nowandnext [14:59:28] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1400) [14:59:28] In 1 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600) [14:59:34] we still have a free hour [14:59:42] so I guess we’ll just do the deployments now then [14:59:48] just a sec, need to finish a comment on phab first [15:00:04] <_joe_> Lucas_WMDE: please hold a sec [15:00:15] ok [15:02:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58481 and previous config saved to /var/cache/conftool/dbconfig/20240305-150203-arnaudb.json [15:02:09] (03PS1) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) [15:02:23] (03CR) 10Volans: [C: 03+1] "LGTM cookbook/python wise :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [15:02:42] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host mw1357.eqiad.wmnet with OS bullseye [15:02:56] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host mw1357.eqiad.wmnet with OS bullseye [15:03:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host mw1356.eqiad.wmnet with OS bullseye [15:03:27] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host mw1356.eqiad.wmnet with OS bullseye [15:06:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage 
for host parse1003.eqiad.wmnet with OS bullseye [15:06:45] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1003.eqiad.wmnet with OS bullseye [15:07:02] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1004.eqiad.wmnet with OS bullseye [15:07:18] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1004.eqiad.wmnet with OS bullseye [15:08:52] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:08:58] !log disable meta-monitoring for alert1001 - T333615 [15:08:59] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:11] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [15:09:58] (03PS2) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) [15:10:03] (03CR) 10Andrea Denisse: [C: 03+2] alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [15:10:21] (03CR) 10Andrea Denisse: [C: 03+2] icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [15:10:25] <_joe_> jouncebot: now [15:10:26] No deployments scheduled for the next 0 hour(s) and 49 minute(s) 
[15:10:40] RECOVERY - Check whether ferm is active by checking the default input chain on parse1020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:10:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [15:11:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:11:18] (03PS3) 10Arnaudb: mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) [15:11:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:12:42] (03CR) 10Arnaudb: mariadb: add all missing hosts from T355422 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:14:55] (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:47] (03CR) 10Jcrespo: [C: 03+1] mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:18:55] dbrant, MatmaRex: just FYI, the deployment won’t happen now after all, sorry for the troubles [15:19:17] (03CR) 10Arnaudb: [C: 03+2] mariadb: add all missing hosts from T355422 [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:19:30] no worries, will move to the next window [15:19:55] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1020:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:33] (03CR) 10Marostegui: "All these hosts will start showing up on icinga when puppet starts running - they won't page as they correctly have notifications disabled" [puppet] - 10https://gerrit.wikimedia.org/r/1008085 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [15:24:37] (03PS2) 10Andrew Bogott: role::puppetserver::cloud_vps_project: remove firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450) [15:24:45] (03PS3) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [15:25:00] (ProbeDown) firing: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:05] (03PS3) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [15:25:13] (03PS1) 10Andrew Bogott: profile::puppetserver::wmcs: parametrize a few hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1008879 [15:25:27] (JobUnavailable) firing: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:35] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:46] (03CR) 10Andrea Denisse: [C: 03+2] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [15:27:28] (03PS2) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) [15:28:08] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [15:29:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:13] (03PS1) 10Majavah: hieradata: fix alert2001 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1008880 [15:29:20] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:11] (03PS20) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:30:27] (03Abandoned) 10Majavah: hieradata: fix alert2001 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1008880 (owner: 10Majavah) [15:30:33] (03PS1) 10Marostegui: instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422) [15:32:10] (03CR) 10Arturo 
Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:32:46] (03CR) 10Arnaudb: [C: 03+1] instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui) [15:32:51] (03CR) 10Arnaudb: [C: 03+2] instances.yaml: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1008881 (https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui) [15:34:45] (03PS1) 10Filippo Giunchedi: wikimedia.org: failover icinga to alert2001 too [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615) [15:35:12] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:35:45] (03CR) 10Filippo Giunchedi: [C: 03+2] wikimedia.org: failover icinga to alert2001 too [dns] - 10https://gerrit.wikimedia.org/r/1008882 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:37:02] (03CR) 10Bking: "We're still getting alert spam, so I'm going to merge this. Happy to follow up on suggestions in a future patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [15:37:12] (03CR) 10Bking: [C: 03+2] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [15:37:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:37:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:38:00] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:38:57] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601519 (10Joe) [15:39:23] <_joe_> !log draining kubernetes2035 T355873 [15:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:27] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [15:39:43] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601524 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host mw1357.eqiad.wmnet with OS bullseye complet... [15:40:14] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9601526 (10Jhancock.wm) @jcrespo I replaced that cable. It was quick enough it didn't even notice. I remember we tried this in the past and it didn't work. But I have a brand new cable, so maybe that will be the difference. 
[15:40:22] 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9601527 (10andrea.denisse) [15:41:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1356.eqiad.wmnet with OS bullseye [15:41:29] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601529 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host mw1356.eqiad.wmnet with OS bullseye complet... [15:43:13] <_joe_> !log draining kubernetes2054 T355873 [15:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:43:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2040.codfw.wmnet with OS bookworm [15:43:29] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2040.codfw.wmnet with OS bookworm completed: - es2040 (**WARN**) -... [15:43:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873 [15:43:40] We're seeing a flood of nagios/icinga "passive check is awol" alerts from alert1002, has nsca or icinga fallen over? 
[15:43:45] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:43:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1003.eqiad.wmnet with OS bullseye [15:43:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873 [15:43:51] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1008872 (owner: 10EoghanGaffney) [15:44:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355873 - depooling db2148 db2163 db2185 db2164 db2189 es2025 es2029 es2030', diff saved to https://phabricator.wikimedia.org/P58489 and previous config saved to /var/cache/conftool/dbconfig/20240305-154400-arnaudb.json [15:44:12] Also, I think the icinga alerts are flapping between warning and recovery. [15:44:17] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601543 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1003.eqiad.wmnet with OS bullseye comp... 
[15:44:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:03] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:45:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1004.eqiad.wmnet with OS bullseye [15:45:10] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601554 (10Jhancock.wm) [15:45:24] Jeff_Green: we did an alert host failover, likely that [15:45:32] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:45:47] godog: oh, huh, I wonder if we're able to report to the new host properly [15:46:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:46:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:46:25] Jeff_Green: could be, one sec [15:46:34] Hi Jeff_Green, can you share where are those alerts going? [15:46:47] I'd like to see them to understand the problem further. [15:46:47] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1004.eqiad.wmnet with OS bullseye comp... [15:46:57] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:47:05] denisse: do you mean the email alerts, or where our hosts post the nsca reports? 
[15:47:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [15:47:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58490 and previous config saved to /var/cache/conftool/dbconfig/20240305-154718-arnaudb.json [15:47:27] the email alerts are going to fr-tech-ops@wikimedia.org [15:47:53] Thank you Jeff_Green, taking a look. [15:48:03] (03PS21) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:48:03] !log bounce ircecho on alert2001 [15:48:04] denisse: great [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:30] <_joe_> !log draining mw2434 T355873 [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:33] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [15:48:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 23:00:00 on db2096.codfw.wmnet with reason: Silence for cloning [15:48:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 23:00:00 on db2096.codfw.wmnet with reason: Silence for cloning [15:49:16] fwiw we have two hosts configured for nsca reporting: 208.80.154.88 and 208.80.153.84 [15:49:17] ok ircecho should be back in some fashion [15:49:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 23:00:00 on db2196.codfw.wmnet with reason: Silence for cloning [15:49:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 23:00:00 on db2196.codfw.wmnet with reason: Silence for cloning [15:49:55] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1020:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:43] <_joe_> !log draining mw2435 T355873 [15:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:51] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601629 (10VRiley-WMF) Thank you @cmooney ! I have also relabeled this unit to match the name. Closing this ticket as per our discussion s... [15:51:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [15:52:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 20 hosts with reason: Silence for cloning [15:52:35] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9601633 (10VRiley-WMF) 05Open→03Resolved [15:52:43] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9601636 (10jcrespo) I didn't ask for a cable change, and so far I haven't observed any problem with the host, TBH, it was @ayounsi who requested it, but I wonder if the metrics are too sensitive- we do the backup as fast a... 
[15:52:52] PROBLEM - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is CRITICAL: connect to address 10.64.16.32 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:53:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 20 hosts with reason: Silence for cloning [15:53:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db[2219-2220].codfw.wmnet with reason: Silence for cloning [15:53:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db[2219-2220].codfw.wmnet with reason: Silence for cloning [15:53:45] icinga-wm: <3 [15:54:18] (03PS22) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:54:21] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,lsw1-b8-codfw.mgmt asw-b-codfw with reason: prepping for server uplink migration codfw rack b8 [15:54:22] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on cr[1-2]-codfw,lsw1-b8-codfw.mgmt asw-b-codfw with reason: prepping for server uplink migration codfw rack b8 [15:54:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2035.codfw.wmnet with OS bookworm [15:54:34] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [15:54:38] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9601658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm [15:54:45] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 
on asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt with reason: prepping for server uplink migration codfw rack b8 [15:54:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:54:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt with reason: prepping for server uplink migration codfw rack b8 [15:54:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:55:06] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601663 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=19e5ce18-f2ba-4d9e-a80a-2c957c2eecad) set by cmoon... [15:55:21] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS bullseye [15:55:37] !log bounce ircecho on alert2001 one last time [15:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B8 to lsw1-b8-codfw [15:55:53] PROBLEM - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:56:00] <_joe_> !log depooled parse2008-10 T355873 [15:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:04] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [15:56:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1005.eqiad.wmnet with OS bullseye [15:56:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B8 to lsw1-b8-codfw [15:56:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1006.eqiad.wmnet with OS bullseye [15:57:04] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1002.eqiad.wmnet with OS bullseye [15:57:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1007.eqiad.wmnet with OS bullseye [15:58:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1008.eqiad.wmnet with OS bullseye [15:58:16] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601682 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f241631d-4830-4ac7-b5c1-29790ccbb916) set by cmoon... 
[15:58:28] <_joe_> !log depooled mw2434-5, T355873 [15:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:40] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601678 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1005.eqiad.wmnet with OS bullseye [15:58:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:59:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:59:32] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye [16:00:05] eoghan, jelto, and arnoldokoth: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600). 
[16:00:24] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1007.eqiad.wmnet with OS bullseye [16:00:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:56] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1009.eqiad.wmnet with OS bullseye [16:01:17] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601700 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1008.eqiad.wmnet with OS bullseye [16:01:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:02:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:03:50] (PS1) Andrew Bogott: eqiad1::pdns: remove check_dns alerts [puppet] - https://gerrit.wikimedia.org/r/1008887 [16:04:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:30] (PS1) Arnaudb: mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) [16:04:52] !log commencing migration of servers in codfw rack b8 to lsw1-b8-codfw T355873 [16:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:56] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9601738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1009.eqiad.wmnet with OS bullseye [16:05:09] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [16:05:49] (CR) CI reject: [V: -1] mariadb: removes backup sources from instances.yaml [puppet] - https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [16:05:57] (CR) Majavah: [C: +1] eqiad1::pdns: remove check_dns alerts [puppet] - https://gerrit.wikimedia.org/r/1008887 (owner: Andrew Bogott) [16:06:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51595 bytes in 0.823 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.870 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:21] (03PS2) 10Arnaudb: mariadb: removes backup sources from instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) [16:06:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:06:32] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:07:05] (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Remove vrts1002 reverences [puppet] - 10https://gerrit.wikimedia.org/r/1008872 (owner: 10EoghanGaffney) [16:07:25] (03CR) 10Marostegui: [C: 03+1] mariadb: removes backup sources from instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [16:07:41] (03CR) 10Arnaudb: [C: 03+2] mariadb: removes backup sources from instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1008906 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [16:08:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [16:08:49] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:49] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:53] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:55] PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage [16:09:31] 
!log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:09:38] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:10:01] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage [16:10:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [16:11:14] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage [16:11:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:11:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:12:19] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [16:12:19] RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [16:12:53] RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:13:21] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [16:13:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage [16:13:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage [16:13:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422 [16:13:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422 [16:14:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422 
[16:14:05] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [16:14:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: provisionning db2203.codfw.wmnet - T355422 [16:15:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2103 in db2203 for T355422', diff saved to https://phabricator.wikimedia.org/P58492 and previous config saved to /var/cache/conftool/dbconfig/20240305-161517-arnaudb.json [16:15:21] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9601818 (10cmooney) All links moved without problem, servers back online and responding to ping now. [16:15:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage [16:15:41] !log Repooling mw2433.codfw.wmnet mw2432.codfw.wmnet parse2008.codfw.wmnet parse2009.codfw.wmnet parse2010.codfw.wmnet [16:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [16:16:02] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1::pdns: remove check_dns alerts [puppet] - 10https://gerrit.wikimedia.org/r/1008887 (owner: 10Andrew Bogott) [16:16:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2035.codfw.wmnet with reason: host reimage [16:16:18] jouncebot: nowandnext [16:16:18] For the next 0 hour(s) and 43 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600) [16:16:19] In 0 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700) [16:16:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will 
create a clone of db2103.codfw.wmnet onto db2203.codfw.wmnet [16:16:42] (03PS1) 10Majavah: hieradata: update test VM without floating IP [puppet] - 10https://gerrit.wikimedia.org/r/1008892 [16:16:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for dns2004.wikimedia.org [16:16:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2004.wikimedia.org [16:17:17] !log uncordon kubernetes2035.codfw.wmnet kubernetes2034.codfw.wmnet mw2434.codfw.wmnet mw2435.codfw.wmnet [16:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage [16:19:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58493 and previous config saved to /var/cache/conftool/dbconfig/20240305-161921-arnaudb.json [16:19:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58494 and previous config saved to /var/cache/conftool/dbconfig/20240305-161932-arnaudb.json [16:19:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58495 and previous config saved to /var/cache/conftool/dbconfig/20240305-161955-arnaudb.json [16:20:00] (ProbeDown) resolved: (4) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58496 and previous config saved to 
/var/cache/conftool/dbconfig/20240305-162011-arnaudb.json [16:20:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58497 and previous config saved to /var/cache/conftool/dbconfig/20240305-162025-arnaudb.json [16:20:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58498 and previous config saved to /var/cache/conftool/dbconfig/20240305-162043-arnaudb.json [16:20:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58499 and previous config saved to /var/cache/conftool/dbconfig/20240305-162056-arnaudb.json [16:22:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage [16:23:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422 [16:23:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422 [16:23:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422 [16:23:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: provisionning db2204.codfw.wmnet - T355422 [16:23:22] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [16:24:04] !log patching oldimage table for commons T359176 [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:18] T359176: Long-titled archived files can get its path metadata truncated due to not having enough 
storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata) - https://phabricator.wikimedia.org/T359176 [16:24:24] (03PS1) 10Muehlenhoff: Point apt discovery records to apt1002/apt2002 (new bookworm hosts) [puppet] - 10https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613) [16:24:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2104 in db2204 for T355422~', diff saved to https://phabricator.wikimedia.org/P58500 and previous config saved to /var/cache/conftool/dbconfig/20240305-162442-arnaudb.json [16:25:13] (03CR) 10Herron: "Yes this is my understanding as well, essentially two settings:" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [16:25:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2104.codfw.wmnet onto db2204.codfw.wmnet [16:27:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2035.codfw.wmnet with reason: host reimage [16:28:41] jouncebot nowandnext [16:28:41] For the next 0 hour(s) and 31 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1600) [16:28:41] In 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700) [16:28:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS bullseye [16:29:19] mutante: just fyi going to do a backport of https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1008476 [16:29:45] ^ cc: Jdlrobson [16:29:47] brennen: alright, thanks [16:30:08] the window is empty [16:31:09] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602016 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1002.eqiad.wmnet with OS bullseye comp... [16:31:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1007.eqiad.wmnet with OS bullseye [16:31:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson) [16:32:51] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1007.eqiad.wmnet with OS bullseye comp... [16:33:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1008.eqiad.wmnet with OS bullseye [16:34:24] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1008.eqiad.wmnet with OS bullseye comp... 
[16:34:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58501 and previous config saved to /var/cache/conftool/dbconfig/20240305-163426-arnaudb.json [16:34:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58502 and previous config saved to /var/cache/conftool/dbconfig/20240305-163437-arnaudb.json [16:35:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58503 and previous config saved to /var/cache/conftool/dbconfig/20240305-163501-arnaudb.json [16:35:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58504 and previous config saved to /var/cache/conftool/dbconfig/20240305-163516-arnaudb.json [16:35:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58505 and previous config saved to /var/cache/conftool/dbconfig/20240305-163530-arnaudb.json [16:35:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58506 and previous config saved to /var/cache/conftool/dbconfig/20240305-163548-arnaudb.json [16:36:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58507 and previous config saved to /var/cache/conftool/dbconfig/20240305-163601-arnaudb.json [16:36:03] (03PS1) 10Ebernhardson: cirrus: Check backfill status prior to reindexing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008895 [16:36:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1009.eqiad.wmnet with OS 
bullseye [16:36:20] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1009.eqiad.wmnet with OS bullseye comp... [16:38:35] (03CR) 10Majavah: [C: 03+1] role::puppetserver::cloud_vps_project: remove firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450) (owner: 10Andrew Bogott) [16:38:54] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1006.eqiad.wmnet with OS bullseye [16:39:07] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602077 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye exec... 
[16:39:21] !log enabling meta-monitoring for the alert* hosts - T333615 [16:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:25] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [16:39:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1006.eqiad.wmnet with OS bullseye [16:39:49] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye [16:40:45] (03PS1) 10Muehlenhoff: Fix airflow firewall setting for an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1008896 [16:40:54] (03PS2) 10Muehlenhoff: Fix airflow firewall setting for an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1008896 [16:41:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1005.eqiad.wmnet with OS bullseye [16:41:19] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1005.eqiad.wmnet with OS bullseye comp... [16:42:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:42:16] (03PS1) 10Daniel Kinzler: Rest: allow Handlers to disable body parsing. 
[core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008758 (https://phabricator.wikimedia.org/T357025) [16:42:45] (03CR) 10Brouberol: [C: 03+1] Fix airflow firewall setting for an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1008896 (owner: 10Muehlenhoff) [16:43:27] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1010.eqiad.wmnet with OS bullseye [16:43:42] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602107 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye [16:43:58] (03CR) 10Muehlenhoff: [C: 03+2] Fix airflow firewall setting for an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1008896 (owner: 10Muehlenhoff) [16:44:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1011.eqiad.wmnet with OS bullseye [16:44:24] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602112 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye [16:44:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1012.eqiad.wmnet with OS bullseye [16:45:05] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602116 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye [16:45:28] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1013.eqiad.wmnet with OS bullseye [16:45:43] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - 
https://phabricator.wikimedia.org/T358752#9602117 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1013.eqiad.wmnet with OS bullseye [16:46:05] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye [16:46:20] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye [16:47:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:47:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2035.codfw.wmnet with OS bookworm [16:47:30] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm completed: - es2035 (**PASS**) -... 
[16:48:41] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602143 (10Jhancock.wm) @Marostegui this is completed [16:49:02] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602140 (10Jhancock.wm) 05Open→03Resolved [16:49:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58508 and previous config saved to /var/cache/conftool/dbconfig/20240305-164931-arnaudb.json [16:49:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58509 and previous config saved to /var/cache/conftool/dbconfig/20240305-164942-arnaudb.json [16:49:58] (03PS1) 10Hnowlan: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) [16:50:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58510 and previous config saved to /var/cache/conftool/dbconfig/20240305-165006-arnaudb.json [16:50:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58511 and previous config saved to /var/cache/conftool/dbconfig/20240305-165022-arnaudb.json [16:50:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58512 and previous config saved to /var/cache/conftool/dbconfig/20240305-165035-arnaudb.json [16:50:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58513 and previous config saved to 
/var/cache/conftool/dbconfig/20240305-165053-arnaudb.json [16:51:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58514 and previous config saved to /var/cache/conftool/dbconfig/20240305-165106-arnaudb.json [16:51:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage [16:51:56] (03Merged) 10jenkins-bot: Partial Revert "Set background/color to inherit for common templates" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008476 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson) [16:52:44] !log brennen@deploy2002 Started scap: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]] [16:52:49] T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164 [16:53:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:53:35] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602175 (10Marostegui) Thank you so much! 
[16:53:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:53:57] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [16:54:04] (03PS1) 10Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/1008759 [16:54:19] (03PS1) 10Andrea Denisse: Revert "wikimedia.org: failover icinga to alert2001 too" [dns] - 10https://gerrit.wikimedia.org/r/1008760 [16:54:32] (03PS1) 10Andrea Denisse: Revert "alert: Failover Icinga and Alertmanager to alert2001" [puppet] - 10https://gerrit.wikimedia.org/r/1008761 [16:55:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage [16:55:12] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9602183 (10Marostegui) [16:56:25] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [16:56:51] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [16:56:53] akosiaris: i am getting some errors for parse* hosts from scap here; guessing this is an indicator i shouldn't be deploying at present? [16:57:29] (03CR) 10Fabfur: [V: 03+1 C: 03+2] cache: start using benthos on single host for haproxy log parsing [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:57:37] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [16:57:41] brennen: which nodes? 
[16:58:12] parse1010, 1013, 1011, 1014, 1012 [16:58:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage [16:58:29] !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:58:31] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602232 (10bdgreenlee) I'm told I'll need `analytics-privatedata-users` too. Can I tack that onto this ticket, or should I file a new one? [16:58:34] T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164 [16:58:45] brennen: gimme a second, i'll fix it [16:59:11] claime: thanks, holding remainder of sync until i hear back. Jdlrobson, if there's testing to do i think you can go ahead and do it now. [16:59:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [16:59:16] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: cluster=parsoid [16:59:56] claime: for what it's worth, it seems like maybe a few things pooled that shouldn't have been? errors were changed keys and a couple of timeouts. [17:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:38] brennen: yeah basically puppet didn't run on deploy host between a.kosiaris removing nodes from prod for reimage and you running your deployment [17:00:40] jhathaway, rzl: apologies for stepping on your window, in the midst of a backport for a could-be train blocker. 
[17:00:46] (nothing to do in the puppet window, feel free to-- haha [17:00:51] you're good, it's all yours :) [17:00:53] rzl: right on. :) [17:01:01] I'm running it now [17:01:08] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602244 (10Dzahn) If you don't mind please file a new one since that's a different tag/board/process. [17:01:09] claime: cool, thx. [17:01:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [17:01:59] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=mw243(2|3).* [17:02:36] (03PS10) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:02:41] (03PS2) 10Hnowlan: mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) [17:02:57] (03CR) 10Hnowlan: mobileapps: add cassandra config in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:03:09] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9602269 (10odimitrijevic) Approved [17:03:27] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197 (10bdgreenlee) [17:04:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [17:04:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58517 and previous config saved to /var/cache/conftool/dbconfig/20240305-170437-arnaudb.json [17:04:49] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58518 and previous config saved to /var/cache/conftool/dbconfig/20240305-170448-arnaudb.json [17:05:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58519 and previous config saved to /var/cache/conftool/dbconfig/20240305-170511-arnaudb.json [17:05:19] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [17:05:21] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9602283 (10odimitrijevic) Approved [17:05:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58520 and previous config saved to /var/cache/conftool/dbconfig/20240305-170527-arnaudb.json [17:05:31] brennen: should be good now, those parse nodes are not in dsh anymore [17:05:32] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on lvs2012.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [17:05:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58521 and previous config saved to /var/cache/conftool/dbconfig/20240305-170540-arnaudb.json [17:05:47] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [17:05:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58522 and previous config saved to 
/var/cache/conftool/dbconfig/20240305-170558-arnaudb.json [17:06:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [17:06:11] brennen: they're all being reimaged as k8s nodes, so sync errors to them are not a problem [17:06:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58523 and previous config saved to /var/cache/conftool/dbconfig/20240305-170611-arnaudb.json [17:06:13] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602285 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6010131f-b756-49c6-8082-62badba41... [17:06:16] claime: thanks, going ahead since this is a revert. [17:06:23] !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync [17:06:30] it's basically a race condition [17:06:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage [17:07:32] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: moving lvs2011 which will disrupt bgp [17:07:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: moving lvs2011 which will disrupt bgp [17:08:16] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602297 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c0fe6035-a553-49f8-8b94-3d7840e51... 
[17:09:18] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198 (10cmooney) p:05Triage→03Medium [17:10:15] !log disabling pybal on lvs2011 (traffic will move to lvs2014) in advance of reimage T352920 [17:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:30] T352920: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920 [17:11:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2096.codfw.wmnet onto db2196.codfw.wmnet [17:11:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1006.eqiad.wmnet with OS bullseye [17:11:52] RECOVERY - MariaDB Replica IO: x1 #page on db2096 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:11:52] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602348 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1006.eqiad.wmnet with OS bullseye comp... [17:12:09] brennen: going all right? [17:12:29] claime: yep, all smooth so far. 
[17:12:33] fantastic [17:16:27] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:1008476|Partial Revert "Set background/color to inherit for common templates" (T358164)]] (duration: 23m 42s) [17:16:33] (03CR) 10Dzahn: [V: 04-1] "This is just a question to Antoine: "Are you going to need a copy of /var/lib/zuul prod data on the test host to test zuul?"" [puppet] - 10https://gerrit.wikimedia.org/r/1007433 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [17:16:37] T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164 [17:17:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1010.eqiad.wmnet with OS bullseye [17:17:20] Jdlrobson: should be good to go. [17:19:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1012.eqiad.wmnet with OS bullseye [17:19:54] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602418 (10andrea.denisse) a:03andrea.denisse [17:21:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1011.eqiad.wmnet with OS bullseye [17:21:45] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye comp... [17:22:50] RECOVERY - MariaDB Replica SQL: x1 #page on db2096 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:23:21] Didn't that already recover like 10 minutes ago? 
[17:23:33] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye comp... [17:23:58] Slave_IO_Running / Slave_SQL_Running but otherwise same host yep [17:24:33] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye comp... [17:24:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1013.eqiad.wmnet with OS bullseye [17:25:34] denisse: I think I can help with that icinga issue and the bfd check [17:25:52] there is this package on alert hosts: snmp-mibs-downloader [17:26:20] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602476 (10dcaro) Affecting also the cloudswitches {F42399814} [17:26:30] I think we have to use that to download the missing MIB file and what looks for this is the "Snimpy" "load" https://snimpy.readthedocs.io/en/latest/usage.html [17:26:52] if that would download the file at the top of https://www.circitor.fr/Mibs/Html/B/BFD-STD-MIB.php [17:26:55] then the check should work again [17:27:20] (03CR) 10RLazarus: [C: 03+2] k8s: Add getter for the Batch API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [17:27:27] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:27:32] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602493 (10dcaro) It's gone now :) [17:28:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58524 and previous config saved to /var/cache/conftool/dbconfig/20240305-172834-arnaudb.json [17:29:07] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:29:14] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602507 (10andrea.denisse) [17:29:22] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602506 (10andrea.denisse) [17:29:43] (03CR) 10Clément Goubert: [C: 03+1] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: 10Hnowlan) [17:29:55] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602503 (10fgiunchedi) I've bandaided the issue on alert2001, we'll need a more proper fix: ` # download-mibs # cd /var/lib/snmp && ln -s ../mibs ` [17:30:44] sukhe: Ah thanks I missed that diff. [17:30:45] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602522 (10Dzahn) There is this package on the alert hosts: ` ii snmp-mibs-downloader 1.2 all install and manage Management Information B... [17:31:03] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2011 - cmooney@cumin1002" [17:31:08] mutante: Thanks for your comments, we were indeed missing those files. 
[17:31:11] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602535 (10fgiunchedi) Something else that didn't work well: the current version of `ircecho` doesn't seem to attempt reopening the files it is supposed to look for in `/var/log/icinga`. I... [17:31:23] I'll send a patch to automate it. :) [17:31:48] denisse: :)) [17:31:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2011 - cmooney@cumin1002" [17:31:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:23] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:32:24] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: 10Hnowlan) [17:32:27] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:32:38] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602540 (10cmooney) >>! In T359198#9602522, @Dzahn wrote: > I guess the snmp-mibs-downloader just has to be automated to download stuff? Yeah on it's own that package installs but doesn't do a... [17:33:21] (03CR) 10Cwhite: [C: 03+2] "Yep, we should definitely stop doing that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1008833 (https://phabricator.wikimedia.org/T359153) (owner: 10Filippo Giunchedi) [17:33:31] (03Merged) 10jenkins-bot: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008900 (https://phabricator.wikimedia.org/T357907) (owner: 10Hnowlan) [17:34:24] (03Merged) 10jenkins-bot: k8s: Add getter for the Batch API [software/spicerack] - 10https://gerrit.wikimedia.org/r/1008582 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [17:35:11] (03Abandoned) 10Daniel Kinzler: Rest: allow Handlers to disable body parsing. [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008758 (https://phabricator.wikimedia.org/T357025) (owner: 10Daniel Kinzler) [17:39:53] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [17:40:06] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [17:40:12] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:40:20] !log bking@prometheus1006 reload prometheus service as part of troubleshooting T358029 [17:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:24] T358029: Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team - https://phabricator.wikimedia.org/T358029 [17:40:25] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2011.codfw.wmnet on all recursors [17:40:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2011.codfw.wmnet on all recursors [17:40:54] (03CR) 10Cathal Mooney: [C: 03+2] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1007703 
(https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney) [17:41:23] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:41:38] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:42:45] (03PS2) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [17:43:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58526 and previous config saved to /var/cache/conftool/dbconfig/20240305-174339-arnaudb.json [17:44:08] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:44:33] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1015.eqiad.wmnet with OS bullseye [17:44:34] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:44:46] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1015.eqiad.wmnet with OS bullseye [17:44:56] (03PS3) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [17:45:04] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS bullseye [17:45:19] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602583 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1016.eqiad.wmnet with OS bullseye [17:46:11] !log akosiaris@cumin1002 
START - Cookbook sre.hosts.reimage for host parse1017.eqiad.wmnet with OS bullseye [17:46:38] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602586 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1017.eqiad.wmnet with OS bullseye [17:46:42] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:46:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1018.eqiad.wmnet with OS bullseye [17:47:22] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1019.eqiad.wmnet with OS bullseye [17:47:27] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:48:03] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602589 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host... [17:48:17] (03CR) 10Scott French: "Apologies in advance for the long commit message - wanted to make sure the tradeoffs w.r.t. replication index key are explicit. 
Happy to r" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:48:25] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1018.eqiad.wmnet with OS bullseye [17:49:02] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1019.eqiad.wmnet with OS bullseye [17:49:48] (03PS1) 10Btullis: Restrict the set of URLS serviced by Archiva [puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) [17:53:02] (03CR) 10Btullis: "Currently testing this manually on archiva1002.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: 10Btullis) [17:53:36] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602613 (10Dzahn) Looks like it's: `man 1 download-mibs` `download-mibs --help` and the config is at `/etc/snmp-mibs-downloader/snmp-mibs-downloader.conf` which has some kind of "AUTOLOAD" c... [17:55:55] 06SRE, 06Infrastructure-Foundations, 10netops: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602620 (10cmooney) >>! In T359198#9602613, @Dzahn wrote: > Looks like it's: > > `man 1 download-mibs` > `download-mibs --help` > > and the config is at `/etc/snmp-mibs-downloader/snmp-mibs-... 
[17:57:19] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage [17:58:01] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage [17:58:32] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage [17:58:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58527 and previous config saved to /var/cache/conftool/dbconfig/20240305-175844-arnaudb.json [17:59:59] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1800) [18:00:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage [18:00:22] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage [18:00:27] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:02:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage [18:04:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage [18:06:28] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1014.eqiad.wmnet with OS bullseye [18:06:59] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage [18:07:06] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602700 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye exec... [18:09:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage [18:10:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [18:11:43] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:11:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:12:30] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:13:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58528 and previous config saved to /var/cache/conftool/dbconfig/20240305-181349-arnaudb.json [18:13:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [18:15:31] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:17:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1015.eqiad.wmnet with OS bullseye 
[18:18:13] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1015.eqiad.wmnet with OS bullseye comp... [18:19:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1016.eqiad.wmnet with OS bullseye [18:19:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:10] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9602719 (10bking) [18:20:16] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1016.eqiad.wmnet with OS bullseye comp... 
[18:22:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1019.eqiad.wmnet with OS bullseye [18:22:27] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:35] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602723 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1019.eqiad.wmnet with OS bullseye comp... [18:24:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1018.eqiad.wmnet with OS bullseye [18:25:06] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:25:07] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1018.eqiad.wmnet with OS bullseye comp... [18:26:30] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:27:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1017.eqiad.wmnet with OS bullseye [18:27:49] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9602731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1017.eqiad.wmnet with OS bullseye comp... 
[18:28:01] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1025 [18:28:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1025 [18:28:36] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [18:30:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye [18:30:56] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602746 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs... [18:31:37] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602756 (10cmooney) Reimage looks good, BGP up and lvs2011 handling traffic again: ` cmooney@cumin1002:~$ sud... 
[18:37:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:37:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:37:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2103.codfw.wmnet onto db2203.codfw.wmnet [18:37:57] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208 (10FBellamy-WMF) [18:40:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:40:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:46:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:46:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:47:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2104.codfw.wmnet onto db2204.codfw.wmnet [18:54:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:54:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:56:19] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [18:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:23] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [18:57:31] RECOVERY - cassandra-a CQL 10.64.16.28:9042 on restbase1038 is OK: TCP OK - 0.039 second response time on 10.64.16.28 port 9042 https://phabricator.wikimedia.org/T93886 [18:59:33] RECOVERY - cassandra-b SSL 10.64.16.32:7000 on restbase1038 is OK: SSL OK - Certificate restbase1038-b valid until 2026-02-20 21:34:07 +0000 (expires in 717 days) 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:00:05] jnuche and dduvall: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T1900). [19:03:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:03:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:06:28] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#9602910 (10cmooney) [19:06:30] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9602909 (10cmooney) 05Open→03Resolved [19:16:27] 06SRE, 10netops, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9602958 (10andrea.denisse) [19:17:54] 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Icinga Log Permission Conflict with Puppet Configuration - https://phabricator.wikimedia.org/T358539#9602963 (10andrea.denisse) 05Open→03Resolved [19:17:57] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9602964 (10andrea.denisse) [19:17:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:18:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:29:27] Do we know what happened yesterday with the late UTC backport window / if today's backport window is good to go? 
Sorry if there is a better place to ask… [19:30:20] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:31:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:41:48] 06SRE, 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9603028 (10Andrew) Notes from today's (unproductive) meeting: We met with several Dell reps including an engineer n... [19:44:07] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [19:46:01] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025'] [19:46:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1025'] [19:47:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:48:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:50:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:53:10] (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:57:29] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: flink-zk reboots [19:57:35] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on 6 hosts with reason: flink-zk reboots [19:58:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: flink-zk reboots T356239 [19:58:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: flink-zk reboots T356239 [20:00:41] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:00:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:04:56] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:05:36] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:06:01] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:07:09] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:07:11] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:08:15] 06SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9603076 (10andrea.denisse) Thanks for your comments @ayounsi and @cmooney. While Janitor looks promising, I believe {icon globe} [[ https://developers.google.com/apps-script | Google Apps Script ]]. would b... 
[20:08:31] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:10:55] (03PS2) 10Scott French: Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 [20:11:43] (03CR) 10Ryan Kemper: [C: 03+1] partman: configure wdqs1025 partioning [puppet] - 10https://gerrit.wikimedia.org/r/1008943 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [20:14:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:19:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:35:55] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: add ssldir_on_srv param for cloud-vps [puppet] - 10https://gerrit.wikimedia.org/r/1008940 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [20:40:43] (03CR) 10Ottomata: [C: 03+1] eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) (owner: 10Gmodena) [20:43:12] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9603154 (10Volans) 05Resolved→03Open a:03Volans Re-opening as AAAA records were erroneously added to the hosts (AAAA records:**N**). 
I'll remove them programmatically. [20:46:07] !log Start rolling out updated fifo-log-demux and configuration to A:cp and A:ncredir - T355905 [20:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:12] T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905 [20:46:27] !log Disable puppet on A:cp and A:ncredir - T355905 [20:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:35] (03CR) 10Bking: [C: 03+2] partman: configure wdqs1025 partioning [puppet] - 10https://gerrit.wikimedia.org/r/1008943 (https://phabricator.wikimedia.org/T359062) (owner: 10Bking) [20:50:05] (03CR) 10BCornwall: [V: 03+1 C: 03+2] fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [20:50:05] !log volans@cumin1002 START - Cookbook sre.dns.netbox [20:52:13] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - volans@cumin1002" [20:52:34] !log upload fifo-log-demux 0.6.5 to bookworm-wikimedia [20:52:35] (03PS6) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [20:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:40] (03PS6) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [20:52:48] (03PS1) 10Andrew Bogott: puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951 [20:53:03] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deleted AAAA records from new DBs - 
volans@cumin1002" [20:53:03] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:53:54] (03PS2) 10Andrew Bogott: puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951 [20:53:56] (03PS7) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [20:54:00] (03PS7) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [20:54:07] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache es2035.codfw.wmnet es2036.codfw.wmnet es2037.codfw.wmnet es2038.codfw.wmnet es2039.codfw.wmnet es2040.codfw.wmnet on all recursors [20:54:10] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es2035.codfw.wmnet es2036.codfw.wmnet es2037.codfw.wmnet es2038.codfw.wmnet es2039.codfw.wmnet es2040.codfw.wmnet on all recursors [20:58:33] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9603243 (10Volans) 05Open→03Resolved Got the list of affected hosts with `nodeset -S '","' -e "es20[35-40]"` on a cumin host, then I run the following code on Netbox: `lang=... [20:58:36] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver::wmcs: pull enc_path out of the enc class [puppet] - 10https://gerrit.wikimedia.org/r/1008951 (owner: 10Andrew Bogott) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240305T2100) [21:00:04] houseblaster, dbrant, MatmaRex, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] hi [21:00:41] o/ [21:00:51] hi! 
[21:01:45] (03PS1) 10Jdlrobson: Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) [21:01:45] i can deploy [21:01:49] good evening everyone! [21:01:57] o/ [21:02:19] (03CR) 10Urbanecm: [C: 03+2] HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810) (owner: 10Bartosz Dziewoński) [21:02:27] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [21:02:29] Jdlrobson: i see you uploaded a backport – do you want to do that in this window? [21:02:47] urbanecm: yep just added to calendar along with the config change [21:03:01] houseblaster: i see your patch is already merged (and supposedly deployed). is there anything else to do? [21:03:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [21:03:21] (03PS3) 10Urbanecm: Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:03:28] dbrant: going with your patch [21:03:30] (03CR) 10Urbanecm: [C: 03+2] Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:04:16] Jdlrobson: ah, thanks for the info. i didn't reload the calendar apparently. just double-checking, on the calendar you say wmf.20, but the patch is for wmf.21. can you confirm which version you want to backport to? [21:04:20] (03Merged) 10jenkins-bot: Move account vanishing contact form to Meta wiki.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:04:39] PROBLEM - Kafka broker TLS certificate validity on kafka-logging2001 is CRITICAL: SSL CRITICAL - Certificate kafka-logging2001.codfw.wmnet valid until 2024-03-12 21:04:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [21:04:46] Sorry 1.42.0-wmf.21 [21:04:59] Huh. Yesterday it was scheduled to be deployed, but was told it failed. Let me try testing it without debug enabled [21:05:01] (corrected) [21:05:21] (03PS3) 10Urbanecm: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson) [21:05:42] (03CR) 10Urbanecm: [C: 03+2] Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson) [21:05:42] Jdlrobson: no worries, just wanted to confirm because i hit the button :) [21:05:50] (03CR) 10Urbanecm: [C: 03+2] Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson) [21:05:57] 🫡 [21:06:04] Working. Nothing further to do, and sorry for the confusion! :) [21:06:19] houseblaster: no worries. thanks for confirming! [21:06:35] (03Merged) 10jenkins-bot: Stop sharing vector and vector-2022 scripts on wikis where no users are impacted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007992 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson) [21:07:20] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. 
(T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]] [21:07:27] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [21:07:28] T331679: Disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679 [21:10:29] !log urbanecm@deploy2002 jdlrobson and urbanecm and dbrant: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. (T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:41] dbrant: Jdlrobson: can you test yours at mwdebug, please? [21:10:45] urbanecm: on it [21:11:08] urbanecm: mine looks good! [21:11:34] ty! [21:11:45] that was tagged to the wrong task btw [21:12:11] urbanecm: LGTM please sync [21:12:16] ty [21:12:17] oh, nvm - it was two unrelated [21:12:17] proceeding [21:12:20] !log urbanecm@deploy2002 jdlrobson and urbanecm and dbrant: Continuing with sync [21:17:15] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [21:17:26] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025'] [21:20:01] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir5001.eqsin.wmnet [21:20:40] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2107'] [21:20:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2107'] [21:21:57] (03CR) 10CI reject: [V: 04-1] Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 
(https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson) [21:22:05] wonderful [21:22:07] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1005161|Move account vanishing contact form to Meta wiki. (T343536)]], [[gerrit:1007992|Stop sharing vector and vector-2022 scripts on wikis where no users are impacted (T331679)]] (duration: 14m 46s) [21:22:11] (03Merged) 10jenkins-bot: HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810) (owner: 10Bartosz Dziewoński) [21:22:12] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [21:22:12] T331679: Disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679 [21:22:24] CI issue seems unrelated urbanecm [21:22:27] 22:02:49 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/InputBox [21:22:29] yeah, appears so [21:22:33] let's see what gate-and-submit will do [21:25:32] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir5001.eqsin.wmnet [21:25:42] (03Merged) 10jenkins-bot: Set background/color to inherit for common templates in dark mode [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008765 (https://phabricator.wikimedia.org/T358164) (owner: 10Jdlrobson) [21:26:31] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]] [21:26:37] T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164 [21:26:37] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=ncredir4001.ulsfo.wmnet [21:26:37] T358810: Having <> in headings leads to errors - https://phabricator.wikimedia.org/T358810 [21:27:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025'] [21:28:01] !log urbanecm@deploy2002 matmarex and jdlrobson and urbanecm: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:31] Jdlrobson: MatmaRex: can you test at mwdebug, please? [21:29:09] looking [21:30:25] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [21:30:28] urbanecm: yep looking [21:30:55] my change looks good [21:31:30] thanks for confirming MatmaRex [21:32:33] urbanecm: LGTM please sync [21:32:38] ty, proceeding [21:32:40] !log urbanecm@deploy2002 matmarex and jdlrobson and urbanecm: Continuing with sync [21:38:07] (03PS5) 10Ahmon Dancy: scap.cfg.erb: Settestservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [21:38:21] (03PS6) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd_* in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [21:41:11] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603410 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/47 Dockerfile.deploy: Add httpbb [21:41:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye [21:41:53] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - 
https://phabricator.wikimedia.org/T358752#9603411 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye [21:42:06] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603412 (10CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/47 Dockerfile.deploy: Add httpbb [21:42:21] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1008765|Set background/color to inherit for common templates in dark mode (T358164)]], [[gerrit:1008472|HandleSectionLinks: Fix handling headings with raw `>` in attributes (T358810)]] (duration: 15m 50s) [21:42:26] T358164: Set color/background to inherit or #333 on common templates/use of HTML4 bgcolor - https://phabricator.wikimedia.org/T358164 [21:42:26] T358810: Having <> in headings leads to errors - https://phabricator.wikimedia.org/T358810 [21:42:28] and deployed [21:42:31] anything else? [21:45:47] thanks urbanecm! [21:47:05] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [21:47:52] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025'] [21:48:10] any time [21:49:17] !log Remove fifo-log-demux from bookworm-wikimedia (dist version needs revision) [21:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:39] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9603420 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/48 exp/files/php/scap.cfg: Set testservers_check_cmd_*... [21:51:23] thanks urbanecm for your help today! 
[22:03:10] !log upload fifo-log-demux 0.6.5+deb12u1 to bookworm-wikimedia [22:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:42] !log upload fifo-log-demux 0.6.5+deb11u1 to bullseye-wikimedia [22:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025'] [22:18:49] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet [22:19:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:43] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir4037.ulsfo.wmnet [22:22:41] q [22:25:12] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:26:11] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir4037.ulsfo.wmnet [22:27:56] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [22:32:48] (PuppetDisabled) firing: (2) Puppet disabled on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-test&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:33:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [22:34:53] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [22:35:15] 
(PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 41.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:37:10] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 60 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: T337013 [22:37:13] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [22:37:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 60 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: T337013 [22:37:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:37:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:01:56] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1014.eqiad.wmnet with OS bullseye [23:02:09] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9603617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye exec... 
[23:03:25] RECOVERY - cassandra-b CQL 10.64.16.32:9042 on restbase1038 is OK: TCP OK - 0.030 second response time on 10.64.16.32 port 9042 https://phabricator.wikimedia.org/T93886 [23:08:16] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:08:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:22:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:22:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:26:21] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:26:27] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:27:31] RECOVERY - cassandra-c SSL 10.64.16.35:7000 on restbase1038 is OK: SSL OK - Certificate restbase1038-c valid until 2026-02-20 21:34:09 +0000 (expires in 716 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [23:30:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:34:36] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873#9603722 (10cmooney) 05Open→03Resolved a:03cmooney [23:34:41] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603724 (10cmooney) [23:35:25] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9603728 (10cmooney) [23:35:29] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: 
Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603729 (10cmooney) [23:35:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:35:49] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9603725 (10cmooney) 05Open→03Resolved a:03cmooney Closing task. Big thanks to all the SRE teams for the help and co-operation getting this o... [23:35:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:42:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:42:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:45:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:45:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 46.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:45:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:46:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 39.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:48:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:48:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:50:59] (03PS1) 10RLazarus: deployment_server: Typo fix in mwscript_k8s.py [puppet] - 10https://gerrit.wikimedia.org/r/1008975 (https://phabricator.wikimedia.org/T341553) [23:53:10] (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye [23:54:58] (03CR) 10RLazarus: [C: 03+2] deployment_server: Typo fix in mwscript_k8s.py [puppet] - 10https://gerrit.wikimedia.org/r/1008975 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus)