[00:01:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 43.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:17:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:17:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:21:44] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:39:09] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910
[00:39:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910 (owner: TrainBranchBot)
[00:40:58] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:41:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:02:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910 (owner: TrainBranchBot)
[01:04:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 46.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:09:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:12:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 40.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:22:02] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[01:22:50] <wikibugs>	 (Merged) jenkins-bot: Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[01:22:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[01:27:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 47.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:28:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 38.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:28:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:34:16] <logmsgbot>	 !log krinkle@deploy2002 Synchronized src/Profiler.php: I101a80a (duration: 10m 48s)
[01:36:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:36:35] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:39:33] <wikibugs>	 (PS1) Stoyofuku-wmf: Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928)
[01:52:30] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:58:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:00:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:02:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:02:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:05:03] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:05:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:06:09] <wikibugs>	 (PS1) BCornwall: cdn: Fix site var for ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1008981
[02:08:22] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:08:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:08:54] <urandom>	 here, and looking at the ATSBackendErrorsHigh/restbase.discovery alerts 
[02:10:10] <urandom>	 seems like something happened ~12 hours ago that resulted in a steady increase in 500s: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus%2Fops&var-cluster=text&var-origin=restbase.discovery.wmnet&from=now-12h&to=now&var-site=eqiad
[02:10:52] <urandom>	 "14:40 a.kosiaris: remove all but 1 host from parsoid@eqiad T358752 " maybe?
[02:10:53] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[02:18:01] <wikibugs>	 (CR) Jdlrobson: [C: +1] Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: Stoyofuku-wmf)
[02:18:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:18:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:19:55] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:25] <urandom>	 based on a sampling of the logstash errors, it seems like it's en wiktionary, and it's the same error we had before: a missing content-language header that's causing restbase to throw an exception
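A minimal sketch of the failure mode described above, assuming a consumer that reads a content-language header from an upstream response and fails when it is absent. This is not RESTBase's code; the function name, the fallback value, and the header mapping are hypothetical and only illustrate how a missing header can be handled instead of raising.

```python
from typing import Mapping, Optional

# Hypothetical fallback used for illustration; the log does not say what
# the correct default would be.
DEFAULT_LANGUAGE = "en"


def content_language(
    headers: Mapping[str, str],
    default: Optional[str] = DEFAULT_LANGUAGE,
) -> str:
    """Return the content-language header value, matched case-insensitively.

    Falls back to `default` when the header is missing or empty. Passing
    default=None reproduces the strict behaviour: a missing header raises.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("content-language")
    if value:
        return value
    if default is None:
        raise KeyError("content-language header missing")
    return default
```

With the fallback in place a response without the header is served with the assumed default; with `default=None` the caller sees the same kind of hard failure the chat line describes.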
[02:25:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:27:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:37:27] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:00:12] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:14:27] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.16.35:9042 on restbase1038 is OK: TCP OK - 0.030 second response time on 10.64.16.35 port 9042 https://phabricator.wikimedia.org/T93886
[03:17:51] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:18:21] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:21:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:07] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51595 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:30:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:31:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:31:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:44:31] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:46:59] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[03:47:35] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:48:10] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:33] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:13] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms
[03:57:49] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[04:00:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:00:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:04:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:04:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:09:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:09:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:14:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:14:59] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:36:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:36:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:59:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:59:30] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:06:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:06:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:08:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:09:00] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:10:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:11:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:13:00] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:13:06] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:20:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:20:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:25:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:25:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:28:23] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:28:30] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:34:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:34:28] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:38:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:38:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:41:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:41:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:43:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:43:20] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:45:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:45:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:47:22] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:47:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:02:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P58532 and previous config saved to /var/cache/conftool/dbconfig/20240306-060239-root.json
[06:06:35] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.04 - 2024.03.24): Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9604574 (Marostegui)
[06:10:37] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9604576 (Marostegui) Open→Resolved a:Marostegui bdgreenlee added to WMF group.
[06:13:39] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9604580 (Marostegui) p:Triage→Medium @FBellamy-WMF we'd need your manager to approve this.
[06:16:04] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9604584 (Marostegui) @bdgreenlee please follow the ticket template at https://phabricator.wikimedia.org/maniphest/task/edit/form/8/
[06:16:57] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9604585 (Marostegui) p:Triage→Medium
[06:17:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P58533 and previous config saved to /var/cache/conftool/dbconfig/20240306-061744-root.json
[06:22:18] <wikibugs>	 (PS1) Marostegui: es1025: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1009152 (https://phabricator.wikimedia.org/T358746)
[06:22:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1025', diff saved to https://phabricator.wikimedia.org/P58534 and previous config saved to /var/cache/conftool/dbconfig/20240306-062221-root.json
[06:23:10] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:20] <wikibugs>	 (CR) Marostegui: [C: +2] es1025: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1009152 (https://phabricator.wikimedia.org/T358746) (owner: Marostegui)
[06:28:21] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2196 [puppet] - https://gerrit.wikimedia.org/r/1009154
[06:29:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58535 and previous config saved to /var/cache/conftool/dbconfig/20240306-062919-root.json
[06:32:34] <wikibugs>	 (CR) Marostegui: [C: +2] installserver: Do not reimage db2196 [puppet] - https://gerrit.wikimedia.org/r/1009154 (owner: Marostegui)
[06:32:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P58536 and previous config saved to /var/cache/conftool/dbconfig/20240306-063249-root.json
[06:44:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58537 and previous config saved to /var/cache/conftool/dbconfig/20240306-064424-root.json
[06:47:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P58538 and previous config saved to /var/cache/conftool/dbconfig/20240306-064754-root.json
[06:59:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58539 and previous config saved to /var/cache/conftool/dbconfig/20240306-065929-root.json
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0700)
[07:00:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:03:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P58540 and previous config saved to /var/cache/conftool/dbconfig/20240306-070259-root.json
[07:14:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58541 and previous config saved to /var/cache/conftool/dbconfig/20240306-071435-root.json
[07:18:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P58542 and previous config saved to /var/cache/conftool/dbconfig/20240306-071804-root.json
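The db1186 commits above step through 5%, 10%, 25%, 50%, 75% and 100%, roughly 15 minutes apart. A minimal sketch of such a staged schedule follows, assuming a fixed interval; it only computes timestamps and percentages and does not call dbctl or any real tooling. The function name and defaults are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def repool_schedule(
    start: datetime,
    steps: Sequence[int] = (5, 10, 25, 50, 75, 100),
    interval: timedelta = timedelta(minutes=15),
) -> List[Tuple[datetime, int]]:
    """Return (time, percentage) pairs for a staged repool.

    The default steps mirror the percentages seen in the log; the
    15-minute interval is an approximation of the observed spacing.
    """
    # Each step must raise the pooled percentage, ending at 100.
    if any(b <= a for a, b in zip(steps, steps[1:])):
        raise ValueError("steps must be strictly increasing")
    if not steps or steps[-1] != 100:
        raise ValueError("final step must be 100")
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    for when, pct in repool_schedule(datetime(2024, 3, 6, 6, 2, 39)):
        print(f"{when:%H:%M:%S} pool at {pct}%")
```

Ramping in small steps gives time to watch error rates and latency at each stage before the host takes full traffic.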
[07:28:01] <wikibugs>	 (PS3) Alexandros Kosiaris: Switch more eqiad parsoid hosts to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392)
[07:28:03] <wikibugs>	 (PS3) Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392)
[07:28:06] <wikibugs>	 (PS3) Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392)
[07:28:14] <wikibugs>	 (PS3) Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392)
[07:28:22] <wikibugs>	 (PS3) Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392)
[07:29:29] <wikibugs>	 (CR) CI reject: [V: -1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[07:29:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58543 and previous config saved to /var/cache/conftool/dbconfig/20240306-072940-root.json
[07:30:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:30:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Switch more eqiad parsoid hosts to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[07:35:06] <wikibugs>	 (CR) Slyngshede: "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[07:37:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Point apt discovery records to apt1002/apt2002 (new bookworm hosts) [puppet] - https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[07:37:25] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9604693 (akosiaris)
[07:41:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye
[07:41:41] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye
[07:44:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58544 and previous config saved to /var/cache/conftool/dbconfig/20240306-074445-root.json
[07:49:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:51:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:51:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[07:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[07:55:27] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1356.eqiad.wmnet|mw1357.eqiad.wmnet|parse1002.eqiad.wmnet|parse1003.eqiad.wmnet|parse1004.eqiad.wmnet|parse1005.eqiad.wmnet|parse1006.eqiad.wmnet|parse1007.eqiad.wmnet|parse1008.eqiad.wmnet|parse1009.eqiad.wmnet|parse1010.eqiad.wmnet|parse1011.eqiad.wmnet|parse1012.eqiad.wmnet|parse1013.eqiad.wmnet|parse1014.eqiad.wmnet|parse1015.eqiad.wmnet|parse1016.eqiad.wmnet|parse1017.eqiad.wmnet|parse1018.eqiad.wmnet|parse1019.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[07:58:17] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[07:59:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58545 and previous config saved to /var/cache/conftool/dbconfig/20240306-075950-root.json
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0800). Please do the needful.
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:08:53] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2217 [puppet] - https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[08:12:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58546 and previous config saved to /var/cache/conftool/dbconfig/20240306-081244-arnaudb.json
[08:17:11] <akosiaris>	 !log depool parse2008.codfw.wmnet,parse2009.codfw.wmnet,parse2010.codfw.wmnet,parse2011.codfw.wmnet,parse2012.codfw.wmnet,parse2013.codfw.wmnet,parse2014.codfw.wmnet,parse2015.codfw.wmnet from parsoid. T358752
[08:27:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58547 and previous config saved to /var/cache/conftool/dbconfig/20240306-082749-arnaudb.json
[08:33:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:37:25] <akosiaris>	 seems like a spike ^, already dropping
[08:37:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1014.eqiad.wmnet with OS bullseye
[08:38:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58548 and previous config saved to /var/cache/conftool/dbconfig/20240306-083804-arnaudb.json
[08:38:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:38:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58549 and previous config saved to /var/cache/conftool/dbconfig/20240306-083822-arnaudb.json
[08:38:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58550 and previous config saved to /var/cache/conftool/dbconfig/20240306-083829-arnaudb.json
[08:39:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:42:44] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2008.codfw.wmnet with OS bullseye
[08:42:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58551 and previous config saved to /var/cache/conftool/dbconfig/20240306-084254-arnaudb.json
[08:43:15] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2009.codfw.wmnet with OS bullseye
[08:43:30] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:43:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2010.codfw.wmnet with OS bullseye
[08:44:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:44:34] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2011.codfw.wmnet with OS bullseye
[08:45:05] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2012.codfw.wmnet with OS bullseye
[08:45:52] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2013.codfw.wmnet with OS bullseye
[08:46:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2014.codfw.wmnet with OS bullseye
[08:47:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2015.codfw.wmnet with OS bullseye
[08:50:59] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: Silence for cloning
[08:51:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Silence for cloning
[08:51:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:51:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2105.codfw.wmnet with reason: Silence for cloning
[08:51:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:51:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2105.codfw.wmnet with reason: Silence for cloning
[08:51:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:51:52] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[08:51:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:51:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:52:00] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:53:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2105 in db2205 for T355422', diff saved to https://phabricator.wikimedia.org/P58552 and previous config saved to /var/cache/conftool/dbconfig/20240306-085318-arnaudb.json
[08:53:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58553 and previous config saved to /var/cache/conftool/dbconfig/20240306-085322-arnaudb.json
[08:53:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58554 and previous config saved to /var/cache/conftool/dbconfig/20240306-085327-arnaudb.json
[08:53:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58555 and previous config saved to /var/cache/conftool/dbconfig/20240306-085334-arnaudb.json
[08:54:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2105.codfw.wmnet onto db2205.codfw.wmnet
[08:56:28] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:56:35] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:57:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422
[08:57:57] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[08:58:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58556 and previous config saved to /var/cache/conftool/dbconfig/20240306-085759-arnaudb.json
[08:58:09] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422
[08:58:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422
[08:58:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422
[08:58:58] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2009.codfw.wmnet with reason: host reimage
[08:59:00] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2008.codfw.wmnet with reason: host reimage
[08:59:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2106 in db2206 for T355422', diff saved to https://phabricator.wikimedia.org/P58557 and previous config saved to /var/cache/conftool/dbconfig/20240306-085924-arnaudb.json
[08:59:54] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2010.codfw.wmnet with reason: host reimage
[09:00:05] <jouncebot>	 jnuche and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0900)
[09:00:06] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2011.codfw.wmnet with reason: host reimage
[09:00:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2106.codfw.wmnet onto db2206.codfw.wmnet
[09:00:32] <jnuche>	 morning, the train is currently blocked by T359290
[09:00:42] <stashbot>	 T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80  - https://phabricator.wikimedia.org/T359290
[09:00:45] <jnuche>	 for the moment I'm going to backport a fix for a different blocker T359229
[09:00:47] <stashbot>	 T359229: Regression: Visited links on mobile appearing as black again - https://phabricator.wikimedia.org/T359229
[09:00:57] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2012.codfw.wmnet with reason: host reimage
[09:01:24] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2013.codfw.wmnet with reason: host reimage
[09:01:44] <jnuche>	 akosiaris: hi there, looks like I should wait for you to finish
[09:01:56] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2009.codfw.wmnet with reason: host reimage
[09:02:28] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2014.codfw.wmnet with reason: host reimage
[09:03:45] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2015.codfw.wmnet with reason: host reimage
[09:03:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422
[09:03:56] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[09:04:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2008.codfw.wmnet with reason: host reimage
[09:04:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422
[09:04:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422
[09:04:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422
[09:05:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2108 in db2208 for T355422', diff saved to https://phabricator.wikimedia.org/P58558 and previous config saved to /var/cache/conftool/dbconfig/20240306-090524-arnaudb.json
[09:06:14] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2013.codfw.wmnet with reason: host reimage
[09:06:26] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2108.codfw.wmnet onto db2208.codfw.wmnet
[09:08:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58559 and previous config saved to /var/cache/conftool/dbconfig/20240306-090827-arnaudb.json
[09:08:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58560 and previous config saved to /var/cache/conftool/dbconfig/20240306-090833-arnaudb.json
[09:08:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58561 and previous config saved to /var/cache/conftool/dbconfig/20240306-090839-arnaudb.json
[09:08:43] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2015.codfw.wmnet with reason: host reimage
[09:11:05] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2012.codfw.wmnet with reason: host reimage
[09:13:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58562 and previous config saved to /var/cache/conftool/dbconfig/20240306-091304-arnaudb.json
[09:13:39] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2014.codfw.wmnet with reason: host reimage
[09:16:44] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2010.codfw.wmnet with reason: host reimage
[09:20:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2011.codfw.wmnet with reason: host reimage
[09:20:56] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2009.codfw.wmnet with OS bullseye
[09:23:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2008.codfw.wmnet with OS bullseye
[09:23:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58563 and previous config saved to /var/cache/conftool/dbconfig/20240306-092332-arnaudb.json
[09:23:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58564 and previous config saved to /var/cache/conftool/dbconfig/20240306-092337-arnaudb.json
[09:23:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58565 and previous config saved to /var/cache/conftool/dbconfig/20240306-092343-arnaudb.json
[09:24:56] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2013.codfw.wmnet with OS bullseye
[09:25:37] <jnuche>	 akosiaris: it looks like the downtiming cookbooks are done? can I go ahead?
[09:27:07] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2015.codfw.wmnet with OS bullseye
[09:27:52] <claime>	 jnuche: you should be fine to go ahead 
[09:28:00] <claime>	 jnuche: the hosts are removed from dsh
[09:28:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58566 and previous config saved to /var/cache/conftool/dbconfig/20240306-092809-arnaudb.json
[09:28:20] <jnuche>	 claime: thx!
[09:29:55] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2012.codfw.wmnet with OS bullseye
[09:32:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2014.codfw.wmnet with OS bullseye
[09:35:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2010.codfw.wmnet with OS bullseye
[09:38:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58567 and previous config saved to /var/cache/conftool/dbconfig/20240306-093837-arnaudb.json
[09:38:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58568 and previous config saved to /var/cache/conftool/dbconfig/20240306-093842-arnaudb.json
[09:38:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58569 and previous config saved to /var/cache/conftool/dbconfig/20240306-093849-arnaudb.json
[09:39:35] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2011.codfw.wmnet with OS bullseye
[09:42:48] <godog>	 wikibugs is stuck? on strike perhaps?
[09:42:58] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 181 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:43:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58570 and previous config saved to /var/cache/conftool/dbconfig/20240306-094314-arnaudb.json
[09:46:24] <wikibugs>	 (PS1) Arnaudb: mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422)
[09:46:39] <claime>	 godog: just a hard morning apparently
[09:46:44] <wikibugs>	 (PS1) Muehlenhoff: Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613)
[09:46:52] <wikibugs>	 (CR) CI reject: [V: -1] Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[09:46:59] <godog>	 claime: understandable
[09:47:00] <wikibugs>	 (PS1) Alexandros Kosiaris: Move parse2008-parse2015 to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1009200 (https://phabricator.wikimedia.org/T358752)
[09:47:08] <wikibugs>	 (PS2) Muehlenhoff: Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613)
[09:47:10] <taavi>	 i restarted the redis->irc listener, seems like it's back
[09:47:24] <wikibugs>	 (CR) Slyngshede: [C: +2] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[09:47:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Move parse2008-parse2015 to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1009200 (https://phabricator.wikimedia.org/T358752) (owner: Alexandros Kosiaris)
[09:47:49] <wikibugs>	 (Merged) jenkins-bot: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[09:47:50] <claime>	 taavi: many thanks
[09:47:57] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[09:47:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:48:05] <wikibugs>	 (PS1) Arnaudb: mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422)
[09:48:13] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[09:48:45] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[09:48:53] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[09:49:17] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604826 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye comp...
[09:50:09] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604842 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2008.codfw.wmnet with OS bullseye
[09:50:29] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2009.codfw.wmnet with OS bullseye
[09:50:57] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2010.codfw.wmnet with OS bullseye
[09:51:17] <wikibugs>	 (PS1) Volans: validators: improve IPs DNS name validation [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009202
[09:51:25] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2011.codfw.wmnet with OS bullseye
[09:51:45] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[09:51:53] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604850 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2012.codfw.wmnet with OS bullseye
[09:52:29] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604852 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2013.codfw.wmnet with OS bullseye
[09:52:39] <logmsgbot>	 !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]]
[09:52:43] <stashbot>	 T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928
[09:53:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58571 and previous config saved to /var/cache/conftool/dbconfig/20240306-095342-arnaudb.json
[09:53:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58572 and previous config saved to /var/cache/conftool/dbconfig/20240306-095347-arnaudb.json
[09:53:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58573 and previous config saved to /var/cache/conftool/dbconfig/20240306-095354-arnaudb.json
[09:54:13] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2014.codfw.wmnet with OS bullseye
[09:54:42] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604860 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2015.codfw.wmnet with OS bullseye
[09:55:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[09:55:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[09:55:34] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[09:56:39] <logmsgbot>	 !log jnuche@deploy2002 jnuche and toyofuku: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:56:50] <wikibugs>	 (PS1) Muehlenhoff: Fix resource header [puppet] - https://gerrit.wikimedia.org/r/1009203
[09:57:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1008981 (owner: BCornwall)
[09:57:05] <logmsgbot>	 !log jnuche@deploy2002 jnuche and toyofuku: Continuing with sync
[09:57:11] <wikibugs>	 (PS2) Slyngshede: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff)
[09:57:19] <wikibugs>	 (PS3) Slyngshede: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff)
[09:57:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix resource header [puppet] - https://gerrit.wikimedia.org/r/1009203 (owner: Muehlenhoff)
[09:57:35] <wikibugs>	 (PS4) Muehlenhoff: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748)
[09:58:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff)
[09:58:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58574 and previous config saved to /var/cache/conftool/dbconfig/20240306-095820-arnaudb.json
[09:59:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you for expanding on the rationale/context!" [puppet] - https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: Herron)
[10:00:15] <wikibugs>	 (PS1) Volans: validators: add field name to fail messages [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009206
[10:02:21] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604968 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2009.codfw.wmnet with OS bullseye comp...
[10:03:25] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2008.codfw.wmnet with OS bullseye comp...
[10:03:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:03:57] <wikibugs>	 SRE, SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292 (fgiunchedi)
[10:04:25] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2013.codfw.wmnet with OS bullseye comp...
[10:04:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:05] <wikibugs>	 sre-alert-triage, SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293 (LSobanski)
[10:06:23] <wikibugs>	 sre-alert-triage, SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#9605030 (LSobanski) Same set of alerts is firing for centrallog2002.
[10:06:33] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605029 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2015.codfw.wmnet with OS bullseye comp...
[10:07:17] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jnuche@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: Stoyofuku-wmf)
[10:07:42] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605037 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2012.codfw.wmnet with OS bullseye comp...
[10:08:15] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]] (duration: 15m 35s)
[10:08:19] <stashbot>	 T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928
[10:08:52] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:08:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Cloning done', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240306-100847-arnaudb.json
[10:08:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Cloning done', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240306-100853-arnaudb.json
[10:09:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58576 and previous config saved to /var/cache/conftool/dbconfig/20240306-100859-arnaudb.json
[10:09:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:09:14] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605060 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2014.codfw.wmnet with OS bullseye comp...
[10:09:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[10:10:07] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2010.codfw.wmnet with OS bullseye comp...
[10:10:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[10:10:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007335 (owner: 10Majavah)
[10:10:35] <wikibugs>	 (03PS1) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109)
[10:10:51] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008933 (owner: 10Hnowlan)
[10:11:19] <wikibugs>	 (03PS2) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109)
[10:11:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[10:11:27] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2011.codfw.wmnet with OS bullseye comp...
[10:11:50] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[10:12:19] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] ldap: fix sssd socket activation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1007335 (owner: 10Majavah)
[10:13:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:15] <wikibugs>	 (03PS3) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109)
[10:13:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:15:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605115 (10KCVelaga_WMF) As @MoritzMuehlenhoff suggested, I have updated my email to kcvelaga+old@wikimedia.org at idm.wikimedia.org, which is now being...
[10:15:18] <wikibugs>	 (03PS4) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109)
[10:15:52] <wikibugs>	 (03PS17) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859
[10:16:20] <wikibugs>	 (03Merged) 10jenkins-bot: Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: 10Stoyofuku-wmf)
[10:17:08] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): "I made the patch much smaller (again) in patchset 17." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE))
[10:18:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:18:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:18:59] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: update test VM without floating IP [puppet] - 10https://gerrit.wikimedia.org/r/1008892 (owner: 10Majavah)
[10:19:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Overall lgtm, I worry we complicate it too much, but it's not too bad so far :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans)
[10:19:55] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah)
[10:20:09] <wikibugs>	 (03Merged) 10jenkins-bot: Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah)
[10:20:41] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1589/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:21:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "TIL :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans)
[10:21:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus)
[10:21:32] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:21:40] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] ferm: Check ferm.service status in ferm_status.py (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[10:21:56] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008933 (owner: 10Hnowlan)
[10:22:20] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008933 (owner: 10Hnowlan)
[10:23:10] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58577 and previous config saved to /var/cache/conftool/dbconfig/20240306-102357-arnaudb.json
[10:24:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58578 and previous config saved to /var/cache/conftool/dbconfig/20240306-102402-arnaudb.json
[10:24:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58579 and previous config saved to /var/cache/conftool/dbconfig/20240306-102404-arnaudb.json
[10:25:25] <wikibugs>	 (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[10:28:18] <wikibugs>	 06SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9605262 (10ayounsi) Thanks for looking into it !  I worry about re-writing an in house library to parse vendor emails, as those emails come in all shapes and forms and change regularly, from attached ICS, to...
[10:28:58] <wikibugs>	 (03PS1) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109)
[10:29:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:30:28] <wikibugs>	 (03CR) 10Volans: "addressing comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans)
[10:30:52] <wikibugs>	 (03PS2) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109)
[10:31:44] <wikibugs>	 (03PS3) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109)
[10:33:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:34:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2105.codfw.wmnet onto db2205.codfw.wmnet
[10:35:59] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:37:32] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[10:37:39] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:37:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:38:37] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(parse2008.codfw.wmnet|parse2009.codfw.wmnet|parse2010.codfw.wmnet|parse2011.codfw.wmnet|parse2012.codfw.wmnet|parse2013.codfw.wmnet|parse2014.codfw.wmnet|parse2015.codfw.wmnet),cluster=kubernetes,service=kubesvc
[10:38:51] <wikibugs>	 (03PS1) 10Marostegui: data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123)
[10:40:22] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[10:41:01] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Use upper case for request methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009220
[10:41:04] <wikibugs>	 (03PS1) 10Fabfur: cache: fix benthos typo [puppet] - 10https://gerrit.wikimedia.org/r/1009221 (https://phabricator.wikimedia.org/T358109)
[10:42:02] <wikibugs>	 (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[10:42:27] <wikibugs>	 (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[10:43:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[10:44:57] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] cache: fix benthos typo [puppet] - 10https://gerrit.wikimedia.org/r/1009221 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[10:46:21] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:46:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:47:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:47:57] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[10:48:04] <wikibugs>	 (03CR) 10Muehlenhoff: data.yaml: Add bdgreenlee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui)
[10:48:54] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:49:23] <wikibugs>	 (03PS2) 10Marostegui: data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123)
[10:49:32] <wikibugs>	 (03CR) 10Marostegui: data.yaml: Add bdgreenlee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui)
[10:49:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: sync
[10:49:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: sync
[10:50:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Host has already been cloned, there was 2 candidate master', diff saved to https://phabricator.wikimedia.org/P58580 and previous config saved to /var/cache/conftool/dbconfig/20240306-105007-arnaudb.json
[10:50:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:50:17] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Use upper case method name for requests to rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306)
[10:51:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: 10Btullis)
[10:52:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] restbase: Start moving mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert)
[10:52:45] <moritzm>	 !log installing gnutls28 security updates on bullseye
[10:58:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] openstack::base::pdns::recursor::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1100)
[11:04:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2108.codfw.wmnet onto db2208.codfw.wmnet
[11:04:43] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392)
[11:05:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:05:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:06:15] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:07:27] <wikibugs>	 (03Merged) 10jenkins-bot: mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:08:33] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[11:08:59] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[11:10:08] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[11:10:52] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[11:10:58] <wikibugs>	 (03Abandoned) 10Jgiannelos: mobileapps: Use upper case for request methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009220 (owner: 10Jgiannelos)
[11:13:00] <wikibugs>	 (03PS1) 10Clément Goubert: Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074)
[11:14:34] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914
[11:14:36] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008915
[11:14:57] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914 (owner: 10PipelineBot)
[11:15:40] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914 (owner: 10PipelineBot)
[11:15:47] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392)
[11:15:49] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392)
[11:15:56] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392)
[11:16:04] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392)
[11:17:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:17:03] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:17:31] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:17:37] <wikibugs>	 (03PS2) 10Clément Goubert: restbase: Start moving mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213)
[11:19:02] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Switch restbase1026-1033, restbase20289-2034 to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392)
[11:19:04] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392)
[11:19:07] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392)
[11:19:15] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392)
[11:19:54] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[11:20:56] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[11:21:04] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:21:28] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2106.codfw.wmnet onto db2206.codfw.wmnet
[11:21:53] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:21:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605592 (10cmooney) @kcvelaga_wmf great news!    I think the next steps would be to move any files you have.  I can do this for the stats boxes or other...
[11:24:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch restbase1026-1033, restbase20289-2034 to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[11:26:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:28:03] <claime>	 jouncebot: nowandnext
[11:28:03] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1100)
[11:28:04] <jouncebot>	 In 2 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400)
[11:28:07] <claime>	 !log Disabling puppet on deployment servers for canary api_appserver move - T351074
[11:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:11] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:30:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:31:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:31:16] <claime>	 !log Disabling puppet on mw2374.codfw.wmnet,mw2376.codfw.wmnet,mw2283.codfw.wmnet,mw2284.codfw.wmnet,mw2371.codfw.wmnet,mw2372.codfw.wmnet,mw2373.codfw.wmnet,mw2375.codfw.wmnet for canary api_appserver move - T351074
[11:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:30] <wikibugs>	 (03PS1) 10Jaime Nuche: Add missing function argument to titleWithoutPrefix call [extensions/Gadgets] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009231 (https://phabricator.wikimedia.org/T359290)
[11:31:49] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[11:31:55] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:32:01] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:33:41] <claime>	 !log Enabling and running puppet on new canaries mw2283.codfw.wmnet,mw2284.codfw.wmnet - T351074
[11:33:44] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] mobileapps: Use upper case method name for requests to rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos)
[11:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:45] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:37:46] <claime>	 !log Enabling and running puppet on deployment servers - T351074
[11:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) otelcol-contrib.service on mw2283:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:37] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605712 (10KCVelaga_WMF) @cmooney I have moved over the files from stat1005:kcv-wikimf to stat1008:kcvelaga, and everything is working fine.  After a co...
[11:40:38] <claime>	 !log pooling new canaries - T351074
[11:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:42] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:41:18] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: cluster=api_appserver,service=canary,dc=codfw
[11:41:27] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=30; selector: cluster=api_appserver,service=canary,dc=codfw
[11:42:36] <claime>	 !log Depooling mw2371.codfw.wmnet,mw2372.codfw.wmnet,mw2373.codfw.wmnet,mw2374.codfw.wmnet,mw2375.codfw.wmnet,mw2376.codfw.wmnet for reimage to kubernetes - T351074
[11:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) otelcol-contrib.service on mw2283:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:20] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[11:43:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:43:27] <claime>	 ^lies
[12:15:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2310 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:17:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[12:17:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[12:17:56] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2375.codfw.wmnet with reason: host reimage
[12:18:00] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2371.codfw.wmnet with reason: host reimage
[12:18:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58581 and previous config saved to /var/cache/conftool/dbconfig/20240306-121800-ladsgroup.json
[12:18:04] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2373.codfw.wmnet with reason: host reimage
[12:18:10] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: host reimage
[12:18:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[12:18:22] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2376.codfw.wmnet with reason: host reimage
[12:18:26] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2372.codfw.wmnet with reason: host reimage
[12:19:41] <Amir1>	 jouncebot: nowandnext
[12:19:41] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 40 minute(s)
[12:19:41] <jouncebot>	 In 1 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400)
[12:19:53] <Amir1>	 claime: okay if I deploy mw?
[12:19:59] <claime>	 Amir1: check with jnuche 
[12:20:12] <Amir1>	 cool thanks
[12:20:16] <claime>	 he's backporting something, then rolling the train forward
[12:20:17] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Use upper case method names in request templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306)
[12:20:26] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2375.codfw.wmnet with reason: host reimage
[12:20:26] <Amir1>	 ah I see
[12:20:30] <claime>	 so maybe you can squeeze your backport in between
[12:20:30] <claime>	 idk
[12:20:43] <Amir1>	 yeah, I'll wait for it to finish
[12:21:33] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inconsistent data in Netbox for some msw device - https://phabricator.wikimedia.org/T359326 (10Volans)
[12:21:47] <Amir1>	 jnuche: please let me know once you're done with your magic.
[12:21:51] <wikibugs>	 (03PS2) 10Volans: validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202
[12:21:53] <wikibugs>	 (03PS2) 10Volans: validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206
[12:21:57] <wikibugs>	 (03PS1) 10Volans: validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240
[12:22:04] <jnuche>	 Amir1: hi there, what's your patch? maybe we can merge it ahead of time to go faster
[12:22:21] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1008503
[12:22:28] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2372.codfw.wmnet with reason: host reimage
[12:22:35] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: reduce cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009238 (owner: 10Effie Mouzeli)
[12:23:00] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[12:23:33] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: reduce cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009238 (owner: 10Effie Mouzeli)
[12:23:47] <jnuche>	 Amir1: looks like it needs a rebase, should I just do it from the UI?
[12:23:58] <Amir1>	 yeah, that's a lie
[12:24:17] <Amir1>	 (any edit on IS.php triggers merge conflict)
[12:24:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans)
[12:24:34] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Move parse2002-parse2007 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009239 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[12:24:53] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2371.codfw.wmnet with reason: host reimage
[12:24:58] <jnuche>	 yeah, oversensitivity of gerrit with file modification
[12:25:03] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans)
[12:25:21] <wikibugs>	 (03PS2) 10Jaime Nuche: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[12:25:24] <wikibugs>	 (03CR) 10Volans: [C: 03+2] validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans)
[12:25:32] <wikibugs>	 (03CR) 10Volans: [C: 03+2] validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans)
[12:25:57] <wikibugs>	 (03Merged) 10jenkins-bot: validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans)
[12:26:02] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[12:26:03] <wikibugs>	 (03Merged) 10jenkins-bot: validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans)
[12:26:30] <jnuche>	 will merge it in a sec, waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Gadgets/+/1009231 to merge to avoid issues
[12:27:15] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2376.codfw.wmnet with reason: host reimage
[12:27:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui)
[12:27:58] <wikibugs>	 (03Merged) 10jenkins-bot: Add missing function argument to titleWithoutPrefix call [extensions/Gadgets] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009231 (https://phabricator.wikimedia.org/T359290) (owner: 10Jaime Nuche)
[12:28:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui)
[12:28:23] <logmsgbot>	 !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]]
[12:28:27] <stashbot>	 T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80  - https://phabricator.wikimedia.org/T359290
[12:29:06] <wikibugs>	 (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[12:29:09] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[12:29:12] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans)
[12:29:50] <wikibugs>	 (03Merged) 10jenkins-bot: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[12:30:00] <logmsgbot>	 !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:30:03] <wikibugs>	 (03CR) 10Volans: [C: 03+2] validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans)
[12:30:13] <logmsgbot>	 !log jnuche@deploy2002 jnuche: Continuing with sync
[12:30:21] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2373.codfw.wmnet with reason: host reimage
[12:30:36] <wikibugs>	 (03Merged) 10jenkins-bot: validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans)
[12:32:39] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "(WMPL team asked me to review)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) (owner: 10Urbanecm)
[12:33:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on mw2310:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:22] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: host reimage
[12:33:29] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[12:33:48] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[12:34:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[12:35:02] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[12:37:10] <wikibugs>	 (03PS3) 10Jgiannelos: mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306)
[12:37:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:37:21] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:37:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Move parse2002-parse2007 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009239 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[12:39:29] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2375.codfw.wmnet with OS bullseye
[12:39:34] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]] (duration: 11m 10s)
[12:39:39] <stashbot>	 T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80  - https://phabricator.wikimedia.org/T359290
[12:40:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2002.codfw.wmnet with OS bullseye
[12:40:44] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2002.codfw.wmnet with OS bullseye
[12:40:48] <jnuche>	 Amir1: backporting your change
[12:41:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2003.codfw.wmnet with OS bullseye
[12:41:22] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2372.codfw.wmnet with OS bullseye
[12:41:32] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2003.codfw.wmnet with OS bullseye
[12:41:34] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[12:41:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2004.codfw.wmnet with OS bullseye
[12:42:04] <logmsgbot>	 !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]]
[12:42:05] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2004.codfw.wmnet with OS bullseye
[12:42:18] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[12:42:26] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2005.codfw.wmnet with OS bullseye
[12:42:36] <Amir1>	 jnuche: thanks!
[12:42:41] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2005.codfw.wmnet with OS bullseye
[12:42:59] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2006.codfw.wmnet with OS bullseye
[12:43:08] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2371.codfw.wmnet with OS bullseye
[12:43:14] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2006.codfw.wmnet with OS bullseye
[12:43:28] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2007.codfw.wmnet with OS bullseye
[12:43:52] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606080 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2007.codfw.wmnet with OS bullseye
[12:45:36] <logmsgbot>	 !log jnuche@deploy2002 jnuche and ladsgroup: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:45:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2310 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:46:04] <logmsgbot>	 !log jnuche@deploy2002 jnuche and ladsgroup: Continuing with sync
[12:46:26] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2376.codfw.wmnet with OS bullseye
[12:46:52] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[12:47:53] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:49:25] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2373.codfw.wmnet with OS bullseye
[12:49:34] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2002 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[12:49:36] <icinga-wm>	 PROBLEM - Routinator process on rpki2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[12:52:18] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2374.codfw.wmnet with OS bullseye
[12:52:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) ferm.service on kubernetes2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:54] <claime>	 !log Running homer 'cr*codfw*' commit 'T351074'
[12:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:10] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:54:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:54:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:54:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance
[12:55:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:55:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance
[12:55:24] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]] (duration: 13m 20s)
[12:55:29] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[12:55:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58582 and previous config saved to /var/cache/conftool/dbconfig/20240306-125529-arnaudb.json
[12:55:33] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:55:34] <icinga-wm>	 RECOVERY - RPKI Validator RTR port on rpki2002 is OK: TCP OK - 0.001 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[12:55:36] <icinga-wm>	 RECOVERY - Routinator process on rpki2002 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[12:56:00] <jnuche>	 Amir1: done!
[12:56:09] <Amir1>	 awesome. Thank you!
[12:56:11] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2002.codfw.wmnet with reason: host reimage
[12:56:21] <jnuche>	 I noticed the blocker (T359290) errors happen consistently at 10 minutes past the top of the hour
[12:56:22] <stashbot>	 T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80  - https://phabricator.wikimedia.org/T359290
[12:56:29] <wikibugs>	 (03PS1) 10Muehlenhoff: routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247
[12:56:34] <jnuche>	 so I'm going to wait a bit until 13:10 UTC to verify the backport fixed the problem before rolling forward the train
[12:57:04] <jnuche>	 jouncebot: next
[12:57:04] <jouncebot>	 In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400)
[12:57:20] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2003.codfw.wmnet with reason: host reimage
[12:57:25] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2004.codfw.wmnet with reason: host reimage
[12:57:44] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2005.codfw.wmnet with reason: host reimage
[12:58:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:24] <logmsgbot>	 !log robh@cumin1002 START - Cookbook sre.dns.netbox
[12:58:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:58:47] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:58:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2006.codfw.wmnet with reason: host reimage
[12:58:55] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:59:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2002.codfw.wmnet with reason: host reimage
[12:59:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:59:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2007.codfw.wmnet with reason: host reimage
[12:59:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:16] <logmsgbot>	 !log robh@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fixing incorrect asset tags - robh@cumin1002"
[13:00:56] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 (owner: 10Muehlenhoff)
[13:01:08] <logmsgbot>	 !log robh@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fixing incorrect asset tags - robh@cumin1002"
[13:01:08] <logmsgbot>	 !log robh@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:01:26] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2005.codfw.wmnet with reason: host reimage
[13:01:43] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[13:01:49] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[13:02:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Sure makes sense, I'd write a comment to the puppet class to highlight this decision though, otherwise it may be missed at first." [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron)
[13:03:10] <jinxer-wm>	 (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2003.codfw.wmnet with reason: host reimage
[13:03:47] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:05:13] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1590/co" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron)
[13:05:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58583 and previous config saved to /var/cache/conftool/dbconfig/20240306-130542-arnaudb.json
[13:05:52] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:06:05] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2006.codfw.wmnet with reason: host reimage
[13:06:41] <claime>	 !log Pooling and uncordoning mw2372.codfw.wmnet mw2373.codfw.wmnet mw2374.codfw.wmnet mw2375.codfw.wmnet mw2376.codfw.wmnet - T351074
[13:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:50] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2371.codfw.wmnet|mw2372.codfw.wmnet|mw2373.codfw.wmnet|mw2374.codfw.wmnet|mw2375.codfw.wmnet|mw2376.codfw.wmnet),cluster=kubernetes,service=kubesvc
[13:06:56] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:07:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9606215 (10MoritzMuehlenhoff)
[13:08:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:08:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2007.codfw.wmnet with reason: host reimage
[13:09:04] <wikibugs>	 (03PS2) 10Muehlenhoff: routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247
[13:11:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2004.codfw.wmnet with reason: host reimage
[13:12:27] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:12:52] <jnuche>	 the blocker error is not reproducing anymore, rolling train forward
[13:13:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:13:07] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:13:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:13:30] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439)
[13:13:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[13:14:16] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot)
[13:16:32] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[13:16:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[13:16:41] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[13:16:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[13:17:41] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2002.codfw.wmnet with OS bullseye
[13:17:56] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2002.codfw.wmnet with OS bullseye comp...
[13:20:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2005.codfw.wmnet with OS bullseye
[13:20:36] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2005.codfw.wmnet with OS bullseye comp...
[13:20:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58585 and previous config saved to /var/cache/conftool/dbconfig/20240306-132048-arnaudb.json
[13:21:17] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#9606310 (10fgiunchedi) Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder
[13:23:31] <jnuche>	 akosiaris, claime: four of the K8s parse hosts failed to pull the latest multiversion image during the train rollout, presumably due to the reimaging:
[13:23:36] <jnuche>	 https://www.irccloud.com/pastebin/SOxQUtd6/
[13:24:00] <jnuche>	 will they pull the latest version of the image once they get put back in the rotation?
[13:24:29] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2003.codfw.wmnet with OS bullseye
[13:24:38] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2006.codfw.wmnet with OS bullseye
[13:24:38] <claime>	 jnuche: they shouldn't have even tried
[13:24:43] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2003.codfw.wmnet with OS bullseye comp...
[13:24:44] <claime>	 they're not parse anymore
[13:24:47] <claime>	 proceed
[13:24:55] <jnuche>	 alright, thank you
[13:24:56] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2006.codfw.wmnet with OS bullseye comp...
[13:25:03] <claime>	 almost everything has been migrated to mw-parsoid now
[13:25:09] <wikibugs>	 06SRE, 10SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292#9606343 (10fgiunchedi) Logs from `ircecho.service`  ` Mar 05 15:14:33 alert2001 ircecho[1136326]: Failed to open file: /var/log/icinga/irc-analytics.log Mar 0...
[13:25:42] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:25:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 (owner: 10Muehlenhoff)
[13:25:49] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:27:06] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2007.codfw.wmnet with OS bullseye
[13:27:22] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2007.codfw.wmnet with OS bullseye comp...
[13:27:37] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.21  refs T354439
[13:27:41] <stashbot>	 T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439
[13:28:10] <wikibugs>	 (03PS1) 10Fabfur: haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109)
[13:30:09] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2004.codfw.wmnet with OS bullseye
[13:30:22] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2004.codfw.wmnet with OS bullseye comp...
[13:35:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58586 and previous config saved to /var/cache/conftool/dbconfig/20240306-133555-arnaudb.json
[13:36:11] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Restrict the set of URLS serviced by Archiva [puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: 10Btullis)
[13:37:27] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:39:17] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1591/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[13:45:13] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:49:12] <wikibugs>	 (03CR) 10Muehlenhoff: Routed Ganeti: use per tap interface dhcrelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[13:49:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292)
[13:50:12] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:50:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[13:50:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[13:51:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58587 and previous config saved to /var/cache/conftool/dbconfig/20240306-135102-arnaudb.json
[13:51:07] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:52:09] <wikibugs>	 (03PS2) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292)
[13:53:21] <wikibugs>	 (03CR) 10Dreamy Jazz: [C: 03+1] throttle: Allow for overriding temp account creation limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan)
[13:54:03] <wikibugs>	 (03CR) 10Jgiannelos: "I suggest we keep only uppercase the references to rest.php so we make sure that while this [1] patch is not deployed in production we don" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos)
[13:55:04] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606600 (10SLopes-WMF) Approved. Please go ahead.
[13:55:23] <wikibugs>	 (03CR) 10Muehlenhoff: icinga: create ircecho log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[13:55:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[13:58:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1008981 (owner: 10BCornwall)
[13:59:06] <wikibugs>	 (03PS2) 10Reedy: CommonSettings: Add $wgSecurePollExcludedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008997 (https://phabricator.wikimedia.org/T303135)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:37] <TheresNoTime>	 agree :D
[14:02:43] <wikibugs>	 (03PS3) 10Herron: profile::kafka::broker: set cert renewal at 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870)
[14:02:57] <jnuche>	 urbanecm: I see you're already working on a fix for T359216. Do you think the user impact is bad enough to rollback while you work on the fix?
[14:02:58] <stashbot>	 T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216
[14:03:41] <urbanecm>	 jnuche: just saw your comment on the task, replied there
[14:03:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: icinga: create ircecho log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[14:03:51] <wikibugs>	 (03PS3) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292)
[14:04:21] <wikibugs>	 (03PS1) 10Marostegui: data.yaml: Add FebinBellamy [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208)
[14:04:35] <jnuche>	 urbanecm: thx 👍
[14:05:03] <urbanecm>	 jnuche: my "fix" reverts a bunch of other things, not sure what exactly those commits change. I pinged Daniel in #engineering-all on Slack, let's see what happens.
[14:05:19] <jnuche>	 ack
[14:05:43] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606622 (10Marostegui) a:03Marostegui
[14:06:49] <wikibugs>	 (03CR) 10Herron: [C: 03+2] "done!" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron)
[14:07:15] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260
[14:08:22] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth: skipping running authdns-update on host if not pooled [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054)
[14:09:22] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260 (owner: 10Effie Mouzeli)
[14:10:17] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260 (owner: 10Effie Mouzeli)
[14:10:19] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1592/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[14:11:01] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[14:11:25] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[14:11:31] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[14:11:49] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[14:12:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:13:10] <wikibugs>	 (03CR) 10Bking: [C: 03+1] "Giving my +1 so we can merge and test this today." [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro)
[14:14:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:17:46] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, if you don't mind, be quite verbose on irc when you deploy this in codfw (in case anyone is doing any tests)." [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[14:18:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[14:20:08] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9606851 (10akosiaris)
[14:20:55] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(parse2002.codfw.wmnet|parse2003.codfw.wmnet|parse2004.codfw.wmnet|parse2005.codfw.wmnet|parse2006.codfw.wmnet|parse2007.codfw.wmnet),cluster=kubernetes,service=kubesvc
[14:21:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] sre.switchdc.mediawiki: update mediawiki services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:22:09] <wikibugs>	 (03PS1) 10David Caro: bullseye-standalone: add logrotate [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1009264 (https://phabricator.wikimedia.org/T357567)
[14:23:30] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606911 (10akosiaris) Almost all parsoid hosts have been reimaged as kubernetes nodes. Scandium, testreduce1002, parse1001 and parse1002 being the exce...
[14:24:32] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9606936 (10akosiaris)
[14:25:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208) (owner: 10Marostegui)
[14:26:15] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606934 (10akosiaris) 05Open→03Resolved
[14:27:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] data.yaml: Add FebinBellamy [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208) (owner: 10Marostegui)
[14:28:22] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli)
[14:28:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks "good" to me." [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[14:30:36] <wikibugs>	 (03Abandoned) 10David Caro: bullseye-standalone: add logrotate [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1009264 (https://phabricator.wikimedia.org/T357567) (owner: 10David Caro)
[14:30:40] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606987 (10Marostegui) 05Open→03Resolved This is all done
[14:31:40] <wikibugs>	 (03PS1) 10Clément Goubert: Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074)
[14:32:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage T358642', diff saved to https://phabricator.wikimedia.org/P58588 and previous config saved to /var/cache/conftool/dbconfig/20240306-143204-arnaudb.json
[14:32:21] <stashbot>	 T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642
[14:33:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2131.codfw.wmnet with reason: Silence for reimaging
[14:33:38] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos)
[14:33:44] <moritzm>	 !log installing nftables bugfix updates from bullseye point release
[14:33:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2131.codfw.wmnet with reason: Silence for reimaging
[14:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:00] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos)
[14:34:18] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[14:34:31] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2131.codfw.wmnet with OS bookworm
[14:34:32] <wikibugs>	 (03PS2) 10Fabfur: haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109)
[14:34:33] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9607078 (10akosiaris) 05In progress→03Resolved
[14:35:09] <wikibugs>	 06SRE, 10MW-on-K8s, 06Traffic, 06serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9607081 (10akosiaris)
[14:35:54] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos)
[14:37:27] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:41] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009267
[14:38:15] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508)
[14:38:29] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009267 (owner: 10Jgiannelos)
[14:39:25] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009267 (owner: 10Jgiannelos)
[14:39:35] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: move 60% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1009269 (https://phabricator.wikimedia.org/T357508)
[14:40:17] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:40:22] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:40:35] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:40:39] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:41:04] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:41:41] <wikibugs>	 (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro)
[14:41:43] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:42:15] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:42:41] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:42:54] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:44:15] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:44:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] Move 6 eqiad appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[14:44:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270
[14:45:41] <moritzm>	 !log installing postgres 13 security updates
[14:45:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:17] <wikibugs>	 (03PS2) 10Clément Goubert: Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074)
[14:46:39] <wikibugs>	 (03CR) 10Clément Goubert: Move 6 eqiad appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[14:47:25] <wikibugs>	 (03CR) 10Hnowlan: Create a shellbox deployment for videoscalers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[14:47:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi)
[14:48:02] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[14:51:00] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw
[14:51:15] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-logging2001 is OK: SSL OK - Certificate kafka-logging2001.codfw.wmnet valid until 2025-03-01 20:58:00 +0000 (expires in 360 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:51:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2131.codfw.wmnet with reason: host reimage
[14:51:44] <claime>	 jouncebot: nowandnext
[14:51:44] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400)
[14:51:44] <jouncebot>	 In 0 hour(s) and 8 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500)
[14:52:36] <claime>	 !log Depooling mw1441.eqiad.wmnet,mw1442.eqiad.wmnet,mw1451.eqiad.wmnet,mw1452.eqiad.wmnet,mw1454.eqiad.wmnet,mw1455.eqiad.wmnet for reimage to kubernetes - T351074
[14:52:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:40] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[14:53:02] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[14:53:45] <wikibugs>	 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9607287 (10fgiunchedi)
[14:54:30] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2131.codfw.wmnet with reason: host reimage
[14:55:06] <wikibugs>	 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292#9607285 (10fgiunchedi) 05Open→03Resolved Calling this done, albeit with a hack
[14:55:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[14:56:33] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] trafficserver: move 60% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1009269 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[14:56:37] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[14:56:39] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[14:57:26] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] mw-web, mw-api-ext: Raise replicas for 60% traffic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[14:57:27] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:02] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[15:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500)
[15:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:02:19] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[15:04:54] <wikibugs>	 (03PS1) 10Clément Goubert: Add missing node definition [puppet] - 10https://gerrit.wikimedia.org/r/1009273
[15:05:00] <wikibugs>	 (03CR) 10Volans: "question inline, I'm happy either way" [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff)
[15:06:52] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Add missing node definition [puppet] - 10https://gerrit.wikimedia.org/r/1009273 (owner: 10Clément Goubert)
[15:08:30] <wikibugs>	 (03PS2) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508)
[15:08:38] <wikibugs>	 (03CR) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[15:09:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:32] <wikibugs>	 (03PS19) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[15:11:10] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270
[15:11:15] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches
[15:11:21] <wikibugs>	 (03CR) 10Muehlenhoff: Add Cumin alias for routed Ganeti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff)
[15:11:39] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99)
[15:12:01] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[15:12:54] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1441.eqiad.wmnet with OS bullseye
[15:12:57] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1442.eqiad.wmnet with OS bullseye
[15:13:00] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1451.eqiad.wmnet with OS bullseye
[15:13:03] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1452.eqiad.wmnet with OS bullseye
[15:13:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff)
[15:13:06] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1454.eqiad.wmnet with OS bullseye
[15:13:08] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1455.eqiad.wmnet with OS bullseye
[15:13:52] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw
[15:14:13] <volans>	 claime: I would suggest increasing the sleep between starts a bit... this will bottleneck on running puppet on the alert host for the downtime
[15:15:19] <claime>	 volans: it's not a sleep, it's me starting them too fast and then cursing myself every time
[15:15:30] <claime>	 (manually I mean)
[15:15:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9607433 (10MoritzMuehlenhoff)
[15:15:46] <volans>	 lol
[15:15:57] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9607425 (10bking) 05Resolved→03Open @VRiley-WMF `wdqs1025` is failing to reimage. I can't see any disks in the DRAC interface, are you...
[15:16:14] <wikibugs>	 (03PS1) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894)
[15:16:14] <claime>	 see this as a stress test of the locking mechanism >:)
[15:17:10] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2131.codfw.wmnet with OS bookworm
[15:17:46] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[15:18:08] <volans>	 :D
[15:19:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:21:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Move the old apt servers to insetup::buster role [puppet] - 10https://gerrit.wikimedia.org/r/1009281 (https://phabricator.wikimedia.org/T331613)
[15:21:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Move nginx/Puppet settings for new apt hosts to the role Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1009282 (https://phabricator.wikimedia.org/T331613)
[15:21:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to clone on db2131 T358642', diff saved to https://phabricator.wikimedia.org/P58589 and previous config saved to /var/cache/conftool/dbconfig/20240306-152130-arnaudb.json
[15:21:43] <stashbot>	 T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642
[15:21:45] <Lucas_WMDE>	 jouncebot: now
[15:21:45] <jouncebot>	 For the next 0 hour(s) and 38 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500)
[15:22:24] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:22:58] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki viwiki --current --all --touched-after=20230613000000 --start '["8661638"]' 2>&1 | tee ~/T315510-viwiki-2 # in tmux
[15:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422
[15:23:55] <stashbot>	 T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[15:24:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422
[15:24:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422
[15:24:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422
[15:24:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:25:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2196.codfw.wmnet onto db2131.codfw.wmnet
[15:27:02] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1441.eqiad.wmnet with reason: host reimage
[15:27:08] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1451.eqiad.wmnet with reason: host reimage
[15:27:25] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1454.eqiad.wmnet with reason: host reimage
[15:27:32] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1442.eqiad.wmnet with reason: host reimage
[15:27:56] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1452.eqiad.wmnet with reason: host reimage
[15:28:08] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks
[15:28:10] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:11] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1455.eqiad.wmnet with reason: host reimage
[15:28:26] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0)
[15:29:35] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1441.eqiad.wmnet with reason: host reimage
[15:31:13] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[15:31:38] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1454.eqiad.wmnet with reason: host reimage
[15:31:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:32:47] <wikibugs>	 (03PS5) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560)
[15:32:49] <wikibugs>	 (03PS5) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560)
[15:32:53] <wikibugs>	 (03PS5) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560)
[15:33:01] <wikibugs>	 (03PS5) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560)
[15:34:04] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1452.eqiad.wmnet with reason: host reimage
[15:34:32] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans)
[15:34:36] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[15:34:47] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0)
[15:36:41] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1442.eqiad.wmnet with reason: host reimage
[15:36:59] <wikibugs>	 (03PS20) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[15:39:29] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1455.eqiad.wmnet with reason: host reimage
[15:42:27] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1451.eqiad.wmnet with reason: host reimage
[15:43:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[15:43:45] <logmsgbot>	 !log jiji@cumin1002 [DRY-RUN] MediaWiki read-only period starts at: 2024-03-06 15:43:44.970687
[15:43:47] <wikibugs>	 (03PS1) 10Btullis: Allow the lilypond packages to be installed on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228)
[15:43:48] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:44:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[15:44:02] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:44:15] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.16.39:9042 on restbase1039 is CRITICAL: connect to address 10.64.16.39 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:44:16] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1039.eqiad.wmnet with reason: Bootstrapping — T354560
[15:44:18] <stashbot>	 eevans@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:44:18] <stashbot>	 T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560
[15:44:31] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1039.eqiad.wmnet with reason: Bootstrapping — T354560
[15:44:33] <stashbot>	 eevans@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:44:37] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[15:45:16] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki
[15:45:17] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:45:30] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0)
[15:45:31] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:46:12] <wikibugs>	 (03CR) 10Bking: "Per IRC conversation with volans, we're going to wait until after the offsite before merging, so we have time to address some of these con" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro)
[15:46:49] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[15:46:50] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:46:52] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[15:46:53] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:47:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff)
[15:47:52] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[15:47:53] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[15:47:56] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1441.eqiad.wmnet with OS bullseye
[15:47:56] <stashbot>	 cgoubert@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[15:48:02] <logmsgbot>	 !log jiji@cumin1002 [DRY-RUN] MediaWiki read-only period ends at: 2024-03-06 15:48:02.718097
[15:48:03] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[15:48:33] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Allow the lilypond packages to be installed on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[15:48:43] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner
[15:48:43] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[15:48:43] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[15:48:44] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] [main] FAIL helmfile.d/services/mw-jobrunner : sync
[15:48:44] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] [canary] FAIL helmfile.d/services/mw-jobrunner : sync
[15:48:45] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=99)
[15:49:17] <wikibugs>	 (03CR) 10Muehlenhoff: "I don't think this is needed/correct? Bullseye should have a recent enough Lilypond version by itself?" [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[15:50:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED
[15:50:04] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1454.eqiad.wmnet with OS bullseye
[15:50:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED
[15:51:23] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1452.eqiad.wmnet with OS bullseye
[15:52:25] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:54:35] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[15:55:31] <wikibugs>	 (03PS1) 10Brouberol: Superset: migrate external services egress to Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009290 (https://phabricator.wikimedia.org/T359411)
[15:55:45] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1442.eqiad.wmnet with OS bullseye
[15:55:47] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED
[15:55:50] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED
[15:56:53] <wikibugs>	 (03PS2) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894)
[15:57:05] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[15:57:29] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1455.eqiad.wmnet with OS bullseye
[15:58:05] <wikibugs>	 (03CR) 10Btullis: "Oh right, yes it has 2.22.0-10 in the bullseye repos but 2.22.1-2~bpo11+1 in bullseye-backports. I had assumed that the backported one wou" [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[15:59:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl
[15:59:59] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0)
[16:00:21] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1451.eqiad.wmnet with OS bullseye
[16:00:44] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
[16:05:07] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:05:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:05:35] <wikibugs>	 (03CR) 10Muehlenhoff: "Either is fine I guess, we can also just keep it as-is." [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[16:14:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58590 and previous config saved to /var/cache/conftool/dbconfig/20240306-161546-ladsgroup.json
[16:15:57] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:18:35] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1009294
[16:19:13] <wikibugs>	 (03PS3) 10Brouberol: global_config: add presto/druid/IDP node IPs to the k8s global config [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[16:19:20] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[16:21:03] <wikibugs>	 (03PS4) 10Brouberol: global_config: add presto/druid/IDP node IPs to the k8s global config [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411)
[16:21:24] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol)
[16:21:28] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387)
[16:21:36] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387)
[16:26:03] <denisse>	 !log Disable meta-monitoring for alert1001 - T333615
[16:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:07] <stashbot>	 T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615
[16:27:26] <wikibugs>	 (03CR) 10Btullis: "Oh it's a bit noisy. puppet is displaying a notice for each package that uses this format, on both buster and bullseye." [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[16:28:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417 (10bdgreenlee)
[16:29:07] <wikibugs>	 (03PS2) 10Fabfur: haproxy: enable log to benthos socket [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109)
[16:29:22] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1009294 (owner: 10Volans)
[16:30:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P58591 and previous config saved to /var/cache/conftool/dbconfig/20240306-163053-ladsgroup.json
[16:31:27] <wikibugs>	 (03PS1) 10Volans: Upstream release v8.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1009297
[16:31:30] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1008981 (owner: 10BCornwall)
[16:31:33] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1009298 (https://phabricator.wikimedia.org/T358559)
[16:32:31] <wikibugs>	 (03PS1) 10Elukey: slo_template: update SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009299
[16:34:39] <wikibugs>	 (03CR) 10Subramanya Sastry: "How does this impact scandium and our use of that server for round-trip testing which we run weekly?" [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris)
[16:34:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:35:01] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[16:35:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:36:08] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host alert1001.wikimedia.org with OS bookworm
[16:36:38] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559)
[16:36:50] <wikibugs>	 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9607819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm
[16:36:54] <claime>	 !log Running homer 'cr*eqiad*' commit 'T351074'
[16:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:58] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[16:37:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney)
[16:38:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:38:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:38:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cumin: fix ganeti-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1009302
[16:39:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'm assuming the current ganeti-all version is what you meant" [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi)
[16:40:32] <wikibugs>	 (03CR) 10Muehlenhoff: cumin: fix ganeti-all alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi)
[16:40:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v8.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1009297 (owner: 10Volans)
[16:41:57] <wikibugs>	 (03PS2) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559)
[16:42:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cumin: fix ganeti-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1009302
[16:42:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: cumin: fix ganeti-all alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi)
[16:43:40] <wikibugs>	 (03PS2) 10Ssingh: P:dns::auth: skipping running authdns-update on host if not pooled [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054)
[16:44:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9607891 (10odimitrijevic) Approved
[16:44:51] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:44:58] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:45:11] <wikibugs>	 (03PS53) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[16:45:13] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:45:13] <wikibugs>	 (03PS1) 10AOkoth: vrts: disable vrts-cache-cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1009303 (https://phabricator.wikimedia.org/T354422)
[16:46:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P58592 and previous config saved to /var/cache/conftool/dbconfig/20240306-164559-ladsgroup.json
[16:46:56] <wikibugs>	 (03PS2) 10AOkoth: vrts: disable vrts-cache-cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1009303 (https://phabricator.wikimedia.org/T354422)
[16:48:54] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert1001.wikimedia.org with reason: host reimage
[16:49:02] <volans>	 !log uploaded spicerack_8.4.1 to apt.wikimedia.org bullseye-wikimedia
[16:49:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:26] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2196.codfw.wmnet onto db2131.codfw.wmnet
[16:50:13] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:50:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi)
[16:51:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] cumin: fix ganeti-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi)
[16:52:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] haproxy: enable log to benthos socket [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur)
[16:52:43] <logmsgbot>	 !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on alert1001.wikimedia.org with reason: host reimage
[16:52:44] <claime>	 !log Pooling and uncordoning mw1441.eqiad.wmnet,mw1442.eqiad.wmnet,mw1451.eqiad.wmnet,mw1452.eqiad.wmnet,mw1454.eqiad.wmnet,mw1455.eqiad.wmnet - T351074
[16:52:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:52] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[16:52:55] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1441.eqiad.wmnet|mw1442.eqiad.wmnet|mw1451.eqiad.wmnet|mw1452.eqiad.wmnet|mw1454.eqiad.wmnet|mw1455.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[16:54:12] <urbanecm>	 jouncebot: nowandnext
[16:54:12] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 5 minute(s)
[16:54:13] <jouncebot>	 In 0 hour(s) and 5 minute(s): Alert hosts failover alert2001 -> alert1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1700)
[16:54:24] <wikibugs>	 (03PS1) 10Btullis: Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013)
[16:54:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58593 and previous config saved to /var/cache/conftool/dbconfig/20240306-165439-arnaudb.json
[16:55:06] <urbanecm>	 not sure what the alert hosts failover is about – I was about to do a MW backport for a train blocker, but I can wait until after the failover if required?
[16:55:13] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:55:17] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1595/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis)
[16:55:32] <urbanecm>	 denisse: maybe you know, as it seems like the host's reimaging right now?
[16:55:39] <wikibugs>	 (03PS2) 10Btullis: Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013)
[16:56:06] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:56:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:57:01] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1596/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis)
[16:57:13] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis)
[16:57:29] <denisse>	 Hi @urbanecm, we're upgrading our alert hosts. We're just reimaging the passive host and plan to do the failover at 17:00 UTC.
[16:58:26] <denisse>	 We're still waiting for the re-image to finish so please proceed with your backport.
[16:58:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:58:42] <denisse>	 Do you have an estimate of how long it is going to take?
[16:59:40] <urbanecm>	 denisse: thanks for the info. since it's a core patch, it might take ~35 mins due to CI. not sure when the reimage might finish.
[16:59:55] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:02] <urbanecm>	 but maybe i can +2 it now, wait for the failover and then finish it?
[17:00:05] <jouncebot>	 Deploy window Alert hosts failover alert2001 -> alert1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1700)
[17:00:27] <denisse>	 urbanecm: Thanks Martin, due to the time it would take we would greatly appreciate it if you could merge it after the failover.
[17:00:37] <denisse>	 I'll let you know ASAP when we finish.
[17:00:41] <urbanecm>	 okay, no problem. will wait for the ping from you then.
[17:01:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58594 and previous config saved to /var/cache/conftool/dbconfig/20240306-170106-ladsgroup.json
[17:01:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[17:01:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[17:01:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58595 and previous config saved to /var/cache/conftool/dbconfig/20240306-170125-ladsgroup.json
[17:01:30] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:02:35] <claime>	 !log restart rsyslog on mw2436
[17:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:06:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:06:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:09:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58596 and previous config saved to /var/cache/conftool/dbconfig/20240306-170944-arnaudb.json
[17:10:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:10:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:13:01] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:13:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9608001 (bdgreenlee) Done: https://phabricator.wikimedia.org/T359417
[17:15:02] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host alert1001.wikimedia.org with OS bookworm
[17:15:20] <wikibugs>	 SRE, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9608025 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm completed: - alert1001 (**WARN**)   - Remo...
[17:17:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:17:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:18:03] <denisse>	 !log failing over from alert2001 to alert1001
[17:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:26] <wikibugs>	 (CR) Andrea Denisse: [C: +2] Revert "alert: Failover Icinga and Alertmanager to alert2001" [puppet] - https://gerrit.wikimedia.org/r/1008761 (owner: Andrea Denisse)
[17:21:03] <wikibugs>	 (PS2) Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - https://gerrit.wikimedia.org/r/1008759
[17:21:19] <wikibugs>	 (CR) Andrea Denisse: [C: +2] Revert "wikimedia.org: failover icinga to alert2001 too" [dns] - https://gerrit.wikimedia.org/r/1008760 (owner: Andrea Denisse)
[17:21:43] <wikibugs>	 (PS3) Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - https://gerrit.wikimedia.org/r/1008759
[17:23:27] <wikibugs>	 (CR) Andrea Denisse: [C: +2] Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - https://gerrit.wikimedia.org/r/1008759 (owner: Andrea Denisse)
[17:23:31] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9608094 (Marostegui)
[17:23:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608096 (Marostegui)
[17:24:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608100 (Marostegui) @odimitrijevic I assume you are also their manager and hence approving for manager and analytics group?
[17:24:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608101 (Marostegui) p:Triage→Medium
[17:24:47] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[17:24:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58597 and previous config saved to /var/cache/conftool/dbconfig/20240306-172449-arnaudb.json
[17:24:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:33:11] <wikibugs>	 (PS1) Hnowlan: kubernetes: migrate 5 eqiad appservers to k8s workers [puppet] - https://gerrit.wikimedia.org/r/1009309 (https://phabricator.wikimedia.org/T351074)
[17:35:17] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608157 (VRiley-WMF) @dr0ptp4kt would you be able to try to reimage this unit again? I have ran it through a power cycle and that can he...
[17:37:09] <denisse>	 @urbanecm : Hi, we've finished with the Alert hosts failover.
[17:37:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:37:17] <urbanecm>	 denisse: ack, thanks!
[17:37:47] <wikibugs>	 (PS1) Urbanecm: JS REST: make POST default to empty object [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1009326 (https://phabricator.wikimedia.org/T359216)
[17:37:53] <wikibugs>	 (CR) Urbanecm: [C: +2] JS REST: make POST default to empty object [core] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1009326 (https://phabricator.wikimedia.org/T359216) (owner: Urbanecm)
[17:39:33] <urbanecm>	 jnuche: fyi i plan to deploy that patch myself (so i can test it). i can ping you once done if that'd be helpful.
[17:39:43] <urbanecm>	 just waiting on CI rn
[17:39:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58598 and previous config saved to /var/cache/conftool/dbconfig/20240306-173954-arnaudb.json
[17:39:58] <jnuche>	 urbanecm: sounds great, thanks a lot
[17:41:18] <urbanecm>	 No problem.
[17:44:39] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2518.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:47:12] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:51:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:53:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1800)
[18:02:12] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608284 (bking) @VRiley-WMF   Unfortunately, I'm still getting errors [[ https://ewr1.vultrobjects.com/work/disk_errors_wdqs1025.png | (...
[18:07:37] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]]
[18:07:53] <stashbot>	 T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216
[18:11:39] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:12:02] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[18:21:56] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]] (duration: 14m 19s)
[18:22:04] <stashbot>	 T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216
[18:22:16] * urbanecm done
[18:32:22] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#9608338 (cmooney) p:Medium→Low
[18:33:05] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608340 (VRiley-WMF) Swapped cable with a new one (same port), shut down the unit and reseated the drives as well. Powered the unit back on
[18:45:48] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@af71f6e] (releasing): (no justification provided)
[18:46:29] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@af71f6e] (releasing): (no justification provided) (duration: 00m 41s)
[18:49:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608395 (odimitrijevic) Yes, that's correct! Approve x 2
[18:57:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608411 (Marostegui)
[18:59:01] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[18:59:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608415 (Marostegui)
[18:59:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025']
[18:59:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025']
[19:00:05] <jouncebot>	 jnuche and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1900).
[19:00:05] <jouncebot>	 jnuche and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1900).
[19:00:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:04:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:04:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:10:46] <wikibugs>	 (CR) RLazarus: [C: +1] slo_template: update SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1009299 (owner: Elukey)
[19:31:55] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2438.codfw.wmnet, mw2336.codfw.wmnet, mw2331.codfw.wmnet, mw2415.codfw.wmnet, mw2276.codfw.wmnet, mw2393.codfw.wmnet, mw2413.codfw.wmnet, mw2329.codfw.wmnet, mw2325.codfw.wmnet, mw2414.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw.wmnet, mw2269.codfw.wmnet, mw2361.codfw.wmnet, mw
[19:36:31] <icinga-wm>	 fw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2337.codfw.wmnet, mw2274.codfw.wmnet, mw2277.codfw.wmnet, mw2272.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2273.codfw.wmnet, mw2333.codfw.wmnet, mw2432.codfw.wmnet, mw2303.codfw.wmnet, mw2439.codfw.wmnet, mw2389.codfw.wmnet, mw2390.codfw.wmnet, mw2412.codfw.wmnet are marked down but pooled: mw-web_4450: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, mw2317.codfw.wmn
[19:36:31] <icinga-wm>	 rnetes2045.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2301.codfw.wmnet, mw2377.codfw.wmnet, mw2447.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2034.codfw.wmnet, mw2422.codfw.wmnet, pa https://wikitech.wikimedia.org/wiki/PyBal
[19:36:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2392.codfw.wmnet, mw2393.codfw.wmnet, mw2338.codfw.wmnet, mw2325.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2408.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw
[19:36:31] <icinga-wm>	 fw.wmnet, mw2277.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2432.codfw.wmnet, mw2303.codfw.wmnet, mw2391.codfw.wmnet, mw2309.codfw.wmnet, mw2439.codfw.wmnet, mw2390.codfw.wmnet, mw2412.codfw.wmnet are marked down but pooled: mw-web_4450: Servers mw2424.codfw.wmnet, mw2292.codfw.wmnet, mw2350.codfw.wmnet, kube
[19:36:32] <icinga-wm>	 60.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2426.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2267.codfw.wmnet, mw2420.codfw.wmnet, parse2010.codfw.wmnet, mw2294.codfw.wmnet, parse2006. https://wikitech.wikimedia.org/wiki/PyBal
[19:36:57] <jinxer-wm>	 (ProbeDown) firing: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:37:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 14.3% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:54] <brett>	 shit
[19:39:31] <bblack>	 any idea what's up here ^ ?
[19:39:35] <urandom>	 o/
[19:39:44] <jinxer-wm>	 (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[19:40:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:41:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw api_appserver GET/200: 0.4919868694431104s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:41:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 14.7s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:41:57] <jinxer-wm>	 (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:42:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:42:29] <hoo>	 Tons of timeouts while accessing the database?!
[19:42:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:43:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/504: 430.3384974393115s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:43:16] <bblack>	 yeah I don't see an outside traffic spike at first glance
[19:43:30] <cdanis>	 hi, I'm here, sorry
[19:45:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:45:26] <bblack>	 I don't even see a 5xx spike from the edge POV
[19:45:33] <bblack>	 just a drop in traffic
[19:46:02] <bblack>	 and only on the codfw side of the world (codfw+ulsfo+eqsin)
[19:46:15] <jinxer-wm>	 (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[19:46:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) p75 latency high: codfw mw-api-ext (k8s) 1.799s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:46:57] <jinxer-wm>	 (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:47:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:49:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[19:50:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:51:15] <jinxer-wm>	 (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[19:51:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) p75 latency high: codfw mw-api-ext (k8s) 1.031s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:51:58] <jinxer-wm>	 (ProbeDown) firing: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:52:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:52:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (3) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:53:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw appserver GET/200: 6.358808350662945s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:53:15] <jinxer-wm>	 (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[19:54:51] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: (14) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[19:55:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:55:30] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:56:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) p75 latency high: codfw mw-parsoid (k8s) 21.46s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:56:58] <jinxer-wm>	 (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:57:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:57:21] <jinxer-wm>	 (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:57:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: (3) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:58:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:58:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) Average latency high: codfw appserver GET/200: 71.20180196214734s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:58:15] <jinxer-wm>	 (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[19:59:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:59:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:59:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:59:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:59:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[19:59:51] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: (15) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[20:00:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:00:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:01:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (2) Average latency high: codfw api_appserver GET/200: 0.23083468048045475s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:01:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (2) p75 latency high: codfw mw-parsoid (k8s) 3.74s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:01:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:02:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 20.18% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:02:21] <jinxer-wm>	 (ProbeDown) resolved: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:03:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (3) Average latency high: codfw appserver GET/200: 71.20180196214734s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:03:59] <wikibugs>	 (PS1) Majavah: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325
[20:04:52] <wikibugs>	 (CR) CI reject: [V: -1] Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah)
[20:04:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:05:00] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:05:21] <wikibugs>	 (PS2) Majavah: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325
[20:05:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah)
[20:06:46] <wikibugs>	 (Merged) jenkins-bot: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah)
[20:07:11] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]]
[20:08:49] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:09:24] <logmsgbot>	 !log taavi@deploy2002 taavi: Continuing with sync
[20:10:55] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:11:02] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:14:05] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.16.39:9042 on restbase1039 is OK: TCP OK - 0.000 second response time on 10.64.16.39 port 9042 https://phabricator.wikimedia.org/T93886
[20:15:27] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1390 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:16:55] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:12] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]] (duration: 12m 01s)
[20:20:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye
[20:25:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:25:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:45:26] <wikibugs>	 (PS1) Majavah: Undeploy Striker from codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1009350
[20:45:27] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1390 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:46:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:50:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:50:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58599 and previous config saved to /var/cache/conftool/dbconfig/20240306-205006-ladsgroup.json
[20:50:25] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T2100).
[21:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:01:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:01:20] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:01:59] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1025
[21:04:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1025
[21:05:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P58600 and previous config saved to /var/cache/conftool/dbconfig/20240306-210512-ladsgroup.json
[21:13:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:18:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:19:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:19:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:20:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P58601 and previous config saved to /var/cache/conftool/dbconfig/20240306-212019-ladsgroup.json
[21:25:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:25:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:27:31] <wikibugs>	 (CR) BryanDavis: [C: +1] "Fine with me, but I would defer to Andrew's opinion. At this point I actually don't remember which things made fully deploying it difficul" [puppet] - https://gerrit.wikimedia.org/r/1009350 (owner: Majavah)
[21:35:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58604 and previous config saved to /var/cache/conftool/dbconfig/20240306-213525-ladsgroup.json
[21:35:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[21:35:41] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:35:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[21:35:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:35:57] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:36:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58605 and previous config saved to /var/cache/conftool/dbconfig/20240306-213603-ladsgroup.json
[21:40:48] <wikibugs>	 (CR) Ladsgroup: [C: +1] Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah)
[21:47:12] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T2200)
[22:25:27] <wikibugs>	 (PS1) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213)
[22:28:01] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490 (Himejijo) NEW
[22:33:22] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9609487 (Jclark-ctr) @bking  was puppet and site.pp updated?  unfortunately me and Valerie do not have access to push updates and has be...
[22:33:47] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/1006974/1597/gitlab2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn)
[22:34:48] <wikibugs>	 (PS2) BBlack: Make auth NSID distinct from recdns on same host [puppet] - https://gerrit.wikimedia.org/r/1009316
[22:34:51] <wikibugs>	 (CR) Dzahn: [V: +1] "Resources only in the new catalog" [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn)
[22:35:10] <wikibugs>	 (CR) BBlack: Make auth NSID distinct from recdns on same host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1009316 (owner: BBlack)
[22:36:30] <wikibugs>	 (PS5) Dzahn: phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572)
[22:38:48] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn)
[22:59:36] <wikibugs>	 (PS1) Bking: site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727)
[23:02:21] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[23:02:21] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[23:04:11] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Thanos
[23:04:11] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.391 second response time https://wikitech.wikimedia.org/wiki/Thanos
[23:08:34] <wikibugs>	 (CR) Ryan Kemper: [C: +1] site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727) (owner: Bking)
[23:09:20] <wikibugs>	 (CR) Bking: [C: +2] site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727) (owner: Bking)
[23:10:21] <wikibugs>	 SRE, MW-on-K8s, Scap, serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9609616 (CodeReviewBot) thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/230  scap sync-world: Add support for testserver checks
[23:16:35] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 3 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9609622 (bking) @Jclark-ctr Thanks for the tip, I've added a patch and will try the reimage again.
[23:16:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye